The Methodist Hospital System

Center for Biostatistics

The Center for Biostatistics (CFB) is a fee-for-service core. It uses an integrative approach that combines classical and contemporary statistical methods and covers all aspects of quantitative research, from initial study design through presentation of conclusions.

Services

The center offers a wide variety of consultation services, including: biostatistical services (hypothesis formulation, study design, power and sample size determination, data analysis and interpretation), bioinformatics services (expression array analysis, SNP genomic classification and copy number variation, motif/homology searching), data mining (pattern recognition, text mining, predictive analytics), image/signal processing (autocorrelation, FFT, feature extraction, segmentation, classification), methodological advice (selecting a questionnaire or sample), instrument development (designing a questionnaire or form), system development (database, expert system, decision support system, etc.), grant applications (statistical plan/write-up/grant section), and publications (analysis, re-analysis, review).

A request form for core services is available online.

Hardware and statistical software

Computational resources in CFB include a networked 64-CPU cluster and 3 PCs. HPC data mining and pattern recognition/signal processing software includes Peltarion Synapse, Predictive Dynamix, StatSoft Data Miner, Golden Helix, and MATLAB. For statistical software, CFB staff use Stata (Version 10, Windows) for the bulk of analyses, and PASS 2008 for power and sample size determination. SAS Base/STAT is also maintained for some procedures not offered in Stata. CFB staff also have experience with SAS and SPSS and can consult on data setup, analysis, and interpretation for these packages. Other software used by CFB staff includes Minitab, SigmaStat, and SigmaPlot.

Data cleaning

For the most part, we insist that data be cleaned to the extent possible before we use them. For example, ejection fraction values stored as text, such as “30-35”, “<55”, or “45+”, contain non-numeric characters and are not amenable to analysis. If values include a range, develop clinically meaningful cutpoints and recode the values into categorical groups such as 1,2,…,k before seeking data analytic consultation. Other data cleaning tasks we avoid include recoding patient-specific strings such as “HTN, CHOL, CAD, FAMHX” into categorical codes, and collapsing character strings that hold multiple surgical treatment codes or pathology findings into categorical codes. The main issue is that database fields with string values (e.g., !@<-#$%aBcD...) are non-numeric, usually take substantial time to recode, and reduce the time available for number crunching. Thus, variables used for analyses must be cleaned and presented in numeric form, either as integers (e.g., -3,-2,-1,0,1,2,3…) or as continuously-scaled real numbers (e.g., -0.0073, 4E-04, 1.23, 288.9).
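As an illustration only, the recoding of text-valued ejection fractions into categorical groups might be sketched as follows in Python. The cutpoints (40 and 55), the function name recode_ef, and the use of 0 and 100 as the limits of a percentage are assumptions made for this sketch; they are not clinical guidance and not a CFB procedure. An entry whose stated range spans more than one group is returned as None, so that a clinician can decide how to code it.

```python
import re
from bisect import bisect_left, bisect_right

# Hypothetical cutpoints in percent; placeholders only, not clinical guidance.
# group 1: below 40, group 2: 40 to below 55, group 3: 55 and above
CUTPOINTS = [40.0, 55.0]

NUM = r"(\d+(?:\.\d+)?)"  # an unsigned integer or decimal number


def recode_ef(raw, cutpoints=CUTPOINTS):
    """Return a group code 1..k, or None if the entry needs manual review."""
    s = raw.strip()
    if m := re.fullmatch(NUM + r"\s*-\s*" + NUM, s):   # range such as "30-35"
        low, high, strict = float(m[1]), float(m[2]), False
    elif m := re.fullmatch(r"<\s*" + NUM, s):          # upper bound such as "<55"
        low, high, strict = 0.0, float(m[1]), True
    elif m := re.fullmatch(NUM + r"\s*\+", s):         # lower bound such as "45+"
        low, high, strict = float(m[1]), 100.0, False
    elif m := re.fullmatch(NUM, s):                    # plain number such as "60"
        low, high, strict = float(m[1]), float(m[1]), False
    else:
        return None                                    # unparseable text

    # Group of a value = 1 + number of cutpoints at or below it.
    g_low = bisect_right(cutpoints, low) + 1
    # For a strict upper bound ("<55"), the value 55 itself is excluded.
    g_high = (bisect_left if strict else bisect_right)(cutpoints, high) + 1
    # Only unambiguous entries receive a code.
    return g_low if g_low == g_high else None


if __name__ == "__main__":
    for entry in ["30-35", "<55", "45+", "60"]:
        print(entry, "->", recode_ef(entry))
```

With these placeholder cutpoints, “30-35” falls entirely in group 1, whereas “<55” and “45+” each span two groups and are flagged for review. This is the reason cutpoints should be chosen with the text values in mind before the data are submitted.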

Causality

It should be clear from the design of your study whether the focus is on association (or independence) or on establishing causality. If association and/or independence is the focus, then we will likely discuss your goals related to correlation (strength of association), regression modeling [explaining the variance of one or more outcome variables with one or more predictor variables], or groupwise hypothesis testing of averages (central tendencies) or count frequencies in unrelated and related samples. However, if you are trying to establish that a factor is causal for a particular outcome, then your design should have included either pre- and post-measurements in the framework of a prevention trial in which an exposure is given after a baseline measurement, or an experiment with treated and untreated groups (perhaps with varying dose). We are not addressing clinical trials here, but rather the design of experiments/studies appropriate to your hypotheses. The primary recommendation is to know precisely whether you are looking for associations or trying to determine the chance or likelihood that a set of factors “cause” or elicit a set of responses or outcomes. Knowing this can save precious time during the analysis.

Correlation

When exploring correlation, it is absolutely essential to make X-Y scatter plots to visually inspect the relationship between the variables used. Be sure not to construct X-Y scatter plots containing data acquired from multiple repeated measurements on the same objects (patients, animals, etc.). The assumption for data used in correlation analysis is that each pair of measurements (x,y), for example (protein A, protein B), is independent of every other pair. That is, the points on an X-Y scatter plot used to reflect correlation are supposed to be independent; in other words, each point is obtained from a different object (patient, animal, etc.). The fundamental issue here is the “independence assumption”: pairs taken from different objects are treated as independent of one another, whereas pairs taken repeatedly from the same object are correlated with one another. Correlation plots are based entirely on independent pairs of measurements from unrelated objects. If your data points are related, then you will need to perform repeated measures analysis, described in the next section. Last, if you draw lines on your correlation plot, you are essentially trying to reveal the slope, or change in y as a function of the change in x. This would indicate that you may be interested in slopes, that is, in regression analysis.
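A minimal sketch of this check in Python is shown below, assuming each record is a dictionary with hypothetical keys id, x, and y. It computes the Pearson correlation coefficient only when every object contributes exactly one (x,y) pair, and otherwise points to repeated measures analysis. The function names and record layout are illustrative assumptions, not part of any CFB tool.

```python
from collections import Counter
from math import sqrt


def repeated_ids(records):
    """Return the ids that appear more than once (repeated measurements)."""
    counts = Counter(r["id"] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)


def pearson_r(records):
    """Pearson correlation of x and y, refusing data with repeated ids."""
    dup = repeated_ids(records)
    if dup:
        raise ValueError(
            f"repeated measurements for ids {dup}; "
            "use repeated measures analysis instead"
        )
    n = len(records)
    mean_x = sum(r["x"] for r in records) / n
    mean_y = sum(r["y"] for r in records) / n
    # Sums of cross-products and squares about the means.
    sxy = sum((r["x"] - mean_x) * (r["y"] - mean_y) for r in records)
    sxx = sum((r["x"] - mean_x) ** 2 for r in records)
    syy = sum((r["y"] - mean_y) ** 2 for r in records)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # One pair per patient: acceptable input for a correlation analysis.
    independent = [
        {"id": "p1", "x": 1.0, "y": 2.0},
        {"id": "p2", "x": 2.0, "y": 4.0},
        {"id": "p3", "x": 3.0, "y": 6.0},
        {"id": "p4", "x": 4.0, "y": 8.0},
    ]
    print(pearson_r(independent))
```

The check only detects repeated identifiers. It cannot detect other forms of relatedness, such as samples from family members or littermates, which still need to be considered when the study is designed.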

Repeated measures analysis

The recommended format for time-dependent modeling of repeated measures is “long” format, with fields such as id, visit#, visitdate, outcome, time (since baseline, in units of days, months, etc.), factor1 (level 1,2,…,k), factor2 (level 1,2,…,k), var1, var2, var3, etc. We will likely use GEE (generalized estimating equations) modeling, which does not require uniform time intervals or discrete categorical factors, and can accommodate continuously-scaled fixed or time-dependent predictors.
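To make the “long” layout concrete, the following Python sketch turns one record per subject into one row per visit, with time computed in days since the first (baseline) visit. The input structure, the function name wide_to_long, and the example values are assumptions made for illustration; the output field names follow the list above, with visit standing in for visit#.

```python
from datetime import date


def wide_to_long(subject):
    """Return one long-format row per visit for a single subject.

    `subject` is a hypothetical record holding an id, a fixed factor, and a
    list of (visit_date, outcome) tuples ordered by date.
    """
    baseline = subject["visits"][0][0]  # date of the first visit
    rows = []
    for visit_no, (visit_date, outcome) in enumerate(subject["visits"], start=1):
        rows.append({
            "id": subject["id"],
            "visit": visit_no,
            "visitdate": visit_date.isoformat(),
            "outcome": outcome,
            "time": (visit_date - baseline).days,  # days since baseline
            "factor1": subject["factor1"],
        })
    return rows


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    subject = {
        "id": 101,
        "factor1": 2,
        "visits": [
            (date(2009, 1, 5), 55.0),
            (date(2009, 2, 4), 50.0),
            (date(2009, 4, 15), 47.5),
        ],
    }
    for row in wide_to_long(subject):
        print(row)
```

Because time is stored as elapsed days rather than as a visit index, unevenly spaced visits are represented directly, which suits the non-uniform intervals mentioned above.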

Not overdoing the analyses for publications

It has been our experience that some clinicians follow numerous journals and want to perform the same analyses and generate the same tables presented in papers with the greatest citation rates. Over time, a sophisticated dream sheet of statistical tests is assembled, waiting to be used at the earliest opportunity. The problem is that statistical tests are like spices in a meal: use too many and you spoil the flavor. A manuscript with a full complement of t-tests, ANOVA, contingency table analyses, linear and logistic regression, survival analysis, morbidity and mortality life tables (rates), and ROC curves can be somewhat overwhelming and would likely be better if partitioned into several papers. Our belief is that a good paper is not overambitious and includes only a demographics table describing group summary statistics, followed by 2-4 analyses resulting in several figures or tables. Anytime you have more than, say, 4 tables or 4 figures, you are at risk of presenting too much information. Using this rule of thumb, once thousands of patients have been studied and it’s time to do the analyses, it would be better to partition the goals/hypotheses and break up large sets of analyses into several papers in order to get better mileage from the resources used. As a relevant aside, in machine learning-based classification analysis there is a method called “divide and conquer,” which breaks up data into smaller parts in order to determine the decision boundaries. It often performs well, and it reflects the general problem-solving idea that large problems can be solved better when they are broken up into small problems. If you work with well-published clinicians who are leaders in the field, be sure to meet with them to identify redundant or unnecessary analyses before seeking statistical consultation.

Help with homework, faculty advisorships, non-supported (frequent) meeting attendance

Our consultation services are directed toward TMH clinicians and investigative groups, and TMHRI principal investigators/co-investigators working on funded grants or grants being developed. Given these commitments, our schedules do not leave sufficient time to solve homework problems, guide students through master's and doctoral programs, analyze thesis data, or spend significant time at research group meetings without being supported. We do serve on TMH/TMHRI committees, but this is a service responsibility for which we are supported. If you feel that you need extended consultation for large projects requiring a long-term commitment, then please contact Gary Lingle, Director of TMHRI Grants & Contracts (E-mail: glingle@tmhs.org), to make the necessary arrangements.