SUMO - Cox Regression

Survival analysis - Cox regression

Cox Regression builds a predictive model for time-to-event data. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time t for given values of the predictor variables. The shape of the survival function and the regression coefficients for the predictors are estimated from observed subjects; the model can then be applied to new cases that have measurements for the predictor variables. Note that information from censored subjects, that is, those that do not experience the event of interest during the time of observation, contributes usefully to the estimation of the model.

Example:
Do men and women have different risks of developing lung cancer based on cigarette smoking? By constructing a Cox Regression model, with cigarette usage (cigarettes smoked per day) and gender entered as covariates, you can test hypotheses regarding the effects of gender and cigarette usage on time-to-onset for lung cancer.

(Source: IBM SPSS manual)
Cox regression analysis is typically applied to survival data, but it may be used for any other "event" type.

The standard Cox model (fitted with the "coxph" function) assumes proportional hazards, i.e. the hazard ratio associated with a covariate stays constant over time:

In case of converging or diverging hazards

the standard model might yield misleading coefficients/p-values.
For such cases a weighted Cox regression (coxphw) might be applied.
For more details about this problem see:

Gene selection in microarray survival studies under possibly non-proportional hazards
Daniela Dunkler, Michael Schemper and Georg Heinze
Bioinformatics, Volume 26, Issue 6, Pp. 784-790.

N.B.: Above Kaplan-Meier pictures were taken from the publication.
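
Whether the proportional hazards assumption is plausible can be checked directly in R before choosing between the two methods. The following is a minimal sketch and not part of SUMO itself: it assumes the survival package is installed and uses the "lung" example data shipped with that package purely for illustration.

```r
library(survival)

# Fit a standard Cox model on the 'lung' example data of the survival package.
# In 'lung', status is coded 1 = censored, 2 = dead; Surv() accepts this coding.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Test the proportional hazards assumption for each covariate and globally.
# Small p-values hint at non-proportional (converging/diverging) hazards.
ph_test <- cox.zph(fit)
print(ph_test)
```

If the test indicates non-proportional hazards for a covariate, the weighted approach described above may be the more appropriate choice.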

SUMO uses the "coxph"/"coxphw" functions from the R statistical environment to perform Cox regression analysis.
Thus, you need R and the standard package "survival" and/or the package "coxphw" installed and correctly configured on your computer.
To install the survival/coxphw R packages, start an R session.
In the R console type, respectively:

install.packages('survival')
install.packages('coxphw')

R may ask you to select an R mirror site. Select the geographically closest one.
R will download and install the respective package.
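
To verify afterwards that both packages can be found by R, a short check along the following lines may help. This is only a sketch; the variable names are chosen for illustration.

```r
# Report which of the packages used by SUMO are available in this R installation.
needed <- c("survival", "coxphw")
available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(available)

# List packages that still need to be installed, if any.
missing_pkgs <- needed[!available]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```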

Data for analysis should contain:

survival information
censoring information
any number of covariates
Covariates may be additional sample annotations, or intensity/expression profiles from genes.

To run a Cox regression analysis within SUMO select Cox regression from the Survival menu:

Select:

Embedded Hyb annotations: Use hybridisation annotation data from the presently loaded data set.
Embedded Hyb annotations + SELECTED genes: Use hybridisation annotation data as well as gene expression data which passed the currently set statistical filter (e.g. t-test).
Embedded Hyb annotations + ALL genes: Use hybridisation annotation data and expression profiles from all genes.
But: data tables with tens of thousands of lines will require lots of RAM and take a while to load. It is better to restrict your data set to a few hundred lines.
Stand alone: Load data from any tab-delimited text file, copy-paste from any spreadsheet, or type it in.

SUMO's Cox regression setup dialog opens up:

The data table shows the data like an "expression matrix":

Rows contain variate and covariates (survival, censoring, ...)
Columns contain sample / hybridisation data

Colors indicate:

light blue : 1st column = variate / covariate identifier for the coxph function
red : non-numeric values / empty cells: any kind of text which cannot be converted into a number.
Such values will interrupt execution of Cox regression.
=> Don't use such data rows for later Cox regression analysis.
light red : negative numbers - might result in meaningless or bizarre Cox regression values.
=> Avoid negative numbers or convert them to more useful values
light gray : selected columns/rows; delete selected columns/rows by clicking the respective tool button
white : normal, usable data cells
yellow / aqua : rows defined as variates / covariates, see below

Data clean-up

Cox requires:

time-to-event data - typically survival data, e.g. the time a patient survives after initial surgery
Time data MUST be numbers >= 0.
censor information
- not censored (=0): the individual experienced the event, typically died after the given time
- censored (=1): up to the given time point, the individual did not experience the event (was alive). NO further information about the individual is available.
covariates - any additional information which shall be correlated with survival:
- dichotomous : smoker? yes (=1) or no (=0)
- categorical : how many cigarette boxes per day? 1,2,3,4,5,6,.....
- continuous : expression of a certain gene

If not yet done, bring your data into such a form.
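
Note that this censor coding (0 = event, 1 = censored) is the opposite of the status coding expected by R's Surv() function (1 = event, 0 = censored). How SUMO handles this internally is not described here; the following sketch only illustrates what the coding implies if the same data were passed to R by hand. The table and all its values are hypothetical.

```r
library(survival)

# Hypothetical toy table in SUMO's layout: rows = variates, columns = samples.
sumo_table <- rbind(
  survival = c(12, 30, 7, 45, 22, 18),  # time to event, must be >= 0
  censor   = c(0, 1, 0, 1, 0, 0),       # SUMO coding: 0 = event, 1 = censored
  smoker   = c(1, 0, 1, 0, 0, 1)        # dichotomous covariate
)
colnames(sumo_table) <- paste0("sample", 1:6)

# R expects one row per sample, so transpose the table.
d <- as.data.frame(t(sumo_table))

# Basic sanity checks matching the requirements listed above.
stopifnot(all(d$survival >= 0), all(d$censor %in% c(0, 1)))

# Surv() expects 1 = event, 0 = censored, so flip the censor flag.
d$event <- 1 - d$censor
surv_obj <- Surv(d$survival, d$event)
print(surv_obj)  # censored observations are printed with a trailing "+"
```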

Either edit data manually, or let SUMO assist you.

Click into the first cell of a data row which you wish to modify systematically
(e.g. click into line 2, col 5 to convert the "group").
Right-click to open the context menu:

n	Number of values analysed
Minimum	Smallest value in the data set
Maximum	Largest value in the data set
Arithmetic mean
Variance
Standard deviation
Skewness	Asymmetry of the data set (3rd distribution moment)
Kurtosis	Peakedness of the data set (4th distribution moment)
Geometric mean	Meaningful for ratio data
Geometric SDev
Median	Robust, outlier-insensitive average
Median Absolute Deviation
1. Quartile	Value at lowest 25% of the data set
2. Quartile	= Median
3. Quartile	Value at highest 25% of the data set. N.b.: The box in a box plot often illustrates the 1.-3. quartile data range
5% Percentile	Value at lowest 5% of the data set
n < 5% Percentile	Number of data values smaller than the 5% percentile
95% Percentile	Value at highest 5% of the data set. N.b.: The whiskers in a box-whisker plot often illustrate the 5%-95% data range
n > 95% Percentile	Number of data values larger than the 95% percentile. N.b.: In a box-whisker plot, values <5% as well as >95% are often displayed as min/max data points.
Trimmed mean (10%)	Robust, outlier-insensitive average
Trimmed mean (25%)	Robust, even less outlier-sensitive average
Winsorized mean (10%)	Robust, outlier-insensitive average
Winsorized mean (25%)	Robust, even less outlier-sensitive average
Harmonic mean
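
For readers who want to reproduce some of the robust measures outside SUMO, the sketch below computes them with base R on hypothetical values. SUMO's exact definitions (e.g. scaling of the median absolute deviation or rounding of the trimmed fraction) are not documented here and may differ; the helper "winsorized_mean" is illustrative only.

```r
# Hypothetical example values with one outlier (100).
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 15, 100)

# Winsorized mean: replace the lowest/highest fraction of the values by the
# nearest remaining value, then take the arithmetic mean.
winsorized_mean <- function(x, fraction = 0.1) {
  x <- sort(x)
  n <- length(x)
  k <- floor(n * fraction)
  if (k > 0) {
    x[1:k] <- x[k + 1]
    x[(n - k + 1):n] <- x[n - k]
  }
  mean(x)
}

stats <- c(
  mean          = mean(x),
  median        = median(x),
  mad           = mad(x, constant = 1),   # unscaled median absolute deviation
  trimmed_10    = mean(x, trim = 0.10),
  trimmed_25    = mean(x, trim = 0.25),
  winsorized_10 = winsorized_mean(x, 0.10),
  geometric     = exp(mean(log(x))),      # requires all values > 0
  harmonic      = 1 / mean(1 / x)         # requires all values > 0
)
print(round(stats, 2))
```

The arithmetic mean is pulled strongly towards the outlier, whereas the median, trimmed and winsorized means stay close to the bulk of the data.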

Column operator

A few basic functions to process values column-wise:

Anonymize:

Convert names (or whatever) into a series of asterisks, keeping the first character.
E.g. 2 columns: "Bilbo", "Boggins" => "B****" "B******".
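
The same transformation can be sketched in R as follows; the function name "anonymize" is chosen for illustration and is not SUMO's implementation.

```r
# Keep the first character and replace every following character by "*".
anonymize <- function(x) {
  paste0(substr(x, 1, 1), strrep("*", pmax(nchar(x) - 1, 0)))
}

print(anonymize(c("Bilbo", "Boggins")))
```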

Shift all positive

Negative values may cause erroneous or bizarre results.
In such cases it might be helpful to shift all values until the smallest value becomes zero.
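
A minimal sketch of such a shift in R, assuming the shift simply subtracts the smallest value (the function name is hypothetical):

```r
# Subtract the smallest value so that the minimum becomes zero.
shift_all_positive <- function(x) x - min(x, na.rm = TRUE)

print(shift_all_positive(c(-3, 0, 2.5)))
```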

Invert

If your categories are numbered from 0 to 5 but anti-correlate with your event (e.g. tumor grading <=> survival),
it may be helpful to invert the categories:
5 => 0
4 => 1
...
0 => 5
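
In R, this inversion could be sketched as mirroring each value within the observed range; this is an assumption about the operation, derived from the mapping above.

```r
# Mirror the categories within their range: largest becomes smallest and vice versa.
invert_categories <- function(x) max(x) + min(x) - x

print(invert_categories(0:5))
```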

Filter data

In some cases you might not want to use all of your samples for Cox regression analysis.
E.g. you want to analyze the influence of smoking and blood pressure on stroke, but only within the subset of male patients.

Click with the right mouse button into the header cell of the respective row.

From the context menu select Filter.

You may filter on numeric values:

Larger than a user-defined threshold
Smaller than a user-defined threshold
Inside range, e.g. all patients older than 30 years but younger than 60 years
Outside range, e.g. all patients whose systolic blood pressure is abnormal, i.e. either below 100 or above 200

You may filter on text values:

Empty cells
Identical to a custom text (not case sensitive)
Contains any, e.g. all patients who received an EGFR-receptor inhibitor: erbitux OR cetuximab
Contains all, e.g. all patients who received temodal AND cetuximab AND initinib
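
The logic of these filters can be sketched in R as follows. The data frame and all its values are hypothetical, and the helper "has_term" is illustrative only; SUMO applies the filters through its dialog, not through this code.

```r
# Hypothetical sample annotations; all names and values are made up.
patients <- data.frame(
  id        = c("P1", "P2", "P3", "P4"),
  age       = c(28, 45, 59, 63),
  systolic  = c(95, 130, 210, 150),
  treatment = c("Temodal + Cetuximab", "erbitux", "", "cisplatin"),
  stringsAsFactors = FALSE
)

# Numeric filters: inside range (30 < age < 60) and outside range (100..200).
inside  <- patients$age > 30 & patients$age < 60
outside <- patients$systolic < 100 | patients$systolic > 200

# Text filters, not case sensitive.
has_term <- function(term, text) grepl(tolower(term), tolower(text), fixed = TRUE)
is_empty     <- trimws(patients$treatment) == ""
contains_any <- Reduce(`|`, lapply(c("erbitux", "cetuximab"), has_term,
                                   text = patients$treatment))
contains_all <- Reduce(`&`, lapply(c("temodal", "cetuximab"), has_term,
                                   text = patients$treatment))

# Corresponds to selecting the matches versus deleting them (keeping the rest).
print(patients[contains_any, ])
print(patients[!contains_any, ])
```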

Define the Action to be done with the filtered objects:

Delete matches. With the Undo option several steps of column/row deletion may be reverted.
Select matches. Review or refine filtering manually.

Undo a previous column/row deletion.

Clear selection: Deselect all columns and rows.

Sort data

For convenience, you may also sort your data based on a single column or row.

Click with right mouse button into a column's / row's header cell and select Sort to sort your data.

KM_Survival-graphs from all categories

As the name says:
- Find all categories in the selected row (e.g. Erbitux, Cisplatin, Temodal, ...)
- Assign the samples (patients) to their respective category group
- Compute and show KM graphs for all groups.
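
For comparison, the equivalent steps in plain R might look like the sketch below. It assumes the survival package and uses its "lung" example data, with the column "sex" standing in for the category row; it is not the code SUMO runs.

```r
library(survival)

# One Kaplan-Meier curve per category found in 'sex' (coded 1 and 2 in 'lung').
km <- survfit(Surv(time, status) ~ sex, data = lung)

# One line per group: number of samples, number of events, median survival.
print(km)

# Draw the curves of all groups into one graph.
plot(km, col = c("blue", "red"), xlab = "Days", ylab = "Survival probability")
legend("topright", legend = names(km$strata), col = c("blue", "red"), lty = 1)
```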

Define variates / covariates

Click one of the three buttons:

Set survival row
The row must contain numbers.
Set censor row (optional)
Defines whether the respective individual was lost from the study at the specified time point without reaching the end point (e.g. death).
The row should only contain:
"0" (=not censored, individual experienced the event) or
"1" (=censored, individual didn't experience the event up to the specified time point, no further knowledge about the outcome).
Set covariate row(s)

Now click into the respective row within the table.
The specified row will be used accordingly in Cox regression analysis.
Survival/Censor row are shown with light "aqua" colored background,
Covariate rows are shown with yellow background.

Obviously, you may have only ONE SURVIVAL and only ONE CENSOR row.

You may define as many rows as covariates as you want.

But:

it is not recommended to use a large number of covariates (>>10) for multivariate regression.
Under such conditions, the multiple regression analysis might not converge or might show only marginal results.
it is recommended to have > 10 samples per covariate (Peduzzi et al.; strictly, their rule counts samples that experienced the event)
E.g. if you want to analyze 13 covariates => you should have > 130 samples.
This does not necessarily mean that 500 samples allow a useful Cox regression with 50 covariates.

For mono-variate Cox regression you may have as many covariates as you want.

Run Cox regression

When the survival / censor / covariate rows are defined and the data in these rows are adjusted accordingly,
check the desired Cox method:
- proportional hazards (coxph function)
- NON-proportional hazards (coxphw function)
Finally, click the Run button to execute Cox regression.

Run - all together
Run a single multivariate Cox regression analysis with all covariates, estimating the influence of each individual covariate within the whole predictor.

Run - all single
Run multiple mono-variate Cox regressions to estimate how each single covariate relates to the outcome.
This may also be used to test a larger number of covariates (>>10) individually.
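
The difference between the two run modes can be sketched in R as follows. This is an illustration under assumptions, not the script SUMO generates: it uses the "lung" example data of the survival package and an arbitrary choice of three covariates.

```r
library(survival)

covars <- c("age", "sex", "ph.ecog")  # illustrative choice of covariates

# "All together": one multivariate model containing all covariates.
f_all <- as.formula(paste("Surv(time, status) ~", paste(covars, collapse = " + ")))
fit_all <- coxph(f_all, data = lung)

# "All single": one mono-variate model per covariate.
fit_single <- lapply(covars, function(cv) {
  coxph(as.formula(paste("Surv(time, status) ~", cv)), data = lung)
})
names(fit_single) <- covars

# Compare the coefficients: adjusted for each other vs. each covariate alone.
print(coef(fit_all))
print(sapply(fit_single, coef))
```

Coefficients from the joint model are adjusted for the other covariates, so they generally differ from the mono-variate ones.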

SUMO will now check the data and try to detect (and possibly fix) problems:

empty cells
non-numeric cells
cells with negative numbers
problematic names.
R's coxph function requires a certain syntax.
Use of special characters might scramble this function call.
Therefore, SUMO converts all special characters, i.e. everything except "a".."z", "A".."Z" and "0".."9", into underscores ("_").

Zero survival times with coxphw

Details of the problematic values are listed in the Results tabsheet.
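
The kind of checks described above can be sketched in R as shown below. The function name, the example names and the cell contents are hypothetical; the sketch only mirrors the rules stated in this section and is not SUMO's actual code.

```r
# Covariate names: replace everything except letters and digits by "_".
sanitize_names <- function(x) gsub("[^A-Za-z0-9]", "_", x)
print(sanitize_names(c("ph.ecog", "age (years)", "TP53-mut")))

# Cell check on hypothetical raw cell contents of one data row.
cells <- c("12", "", "abc", "-3", "0")
num <- suppressWarnings(as.numeric(cells))  # non-numeric text becomes NA

report <- data.frame(
  cell        = cells,
  empty       = cells == "",
  non_numeric = cells != "" & is.na(num),
  negative    = !is.na(num) & num < 0,
  zero        = !is.na(num) & num == 0,  # matters for survival times with coxphw
  stringsAsFactors = FALSE
)
print(report)
```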

Next, SUMO generates a data file as well as the R script.
These files - as well as the result files generated by R - are created in your user account's temp folder.

Now, SUMO launches the R script and tries to read the result files.

Results are read by SUMO and shown in the Results tab:

The first few lines show names and folder of the intermediate data files,
as well as the numbers of samples, events and covariates.

For each covariate cox-regression results are shown:

Covar	Name of covariate, as defined in the input data table / file
coef	Regression coefficient = natural logarithm of the hazard ratio
exp(coef)	Exponentiated regression coefficient = hazard ratio, which tells us whether the probability to experience the "event" is increased (>1) or decreased (<1)
se	Standard error of the regression coefficient (coef)
z	z-score (coef divided by its standard error)
Pr(>|z|)	Probability value which tells us whether the finding is statistically significant (e.g. p<=0.05)
Lower/Upper.95	95% confidence interval for the hazard ratio

Finally, p-values for the whole Cox regression analysis according to different statistics are shown.
For more details about the specific implementation of coxph see R's documentation. For usage of the coxph function see the documentation of R's survival package.
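
The values in this table correspond to what R's summary() of a coxph fit returns. The sketch below shows where they can be found when working in R directly; it uses the "lung" example data of the survival package and is not the script generated by SUMO.

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
s <- summary(fit)

# Per covariate: coef, exp(coef) = hazard ratio, se(coef), z, Pr(>|z|).
print(s$coefficients)

# Hazard ratio with its 95% confidence limits.
print(s$conf.int[, c("exp(coef)", "lower .95", "upper .95")])

# p-values for the whole model according to three different statistics.
overall <- c(
  likelihood_ratio = s$logtest[["pvalue"]],
  wald             = s$waldtest[["pvalue"]],
  score_logrank    = s$sctest[["pvalue"]]
)
print(overall)
```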

Additionally, hazard ratios for all covariates are graphically shown as "Forest plots" using SUMO's Dotchart

The results from the above example data show that neither the model as a whole nor any single covariate reaches significance (p<0.05); only one covariate ("alc") shows a noteworthy anti-correlation.

Troubleshooting

In case of any kind of problems:

setup / functionality of R
problems in input data matrix
problems with the coxph function

SUMO will tell you about an empty or non-existing data file.

To analyze - and possibly fix - these problems, check the View R output field on the Preferences tab-sheet:

and re-run the analysis.

Now you can see the WHOLE output of R and the coxph function in the Results tab-sheet:

Especially the "Log from, R-script-Processor" section should indicate problems with R and the coxph function.

To view all data files generated by SUMO and R click the "View results files" button.
The four most recently created data files are shown in Windows Notepad.

Setup

SUMO needs to know the name and location of R.

Go to Preferences tab-sheet:

In the Path to R executable field, enter name and path to R.
More easily, click the " ... " button and navigate to the R.exe executable.
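
If you are unsure where R is installed, R itself can report the folder of its executables. The following lines, typed into the R console, are a small sketch for this purpose; the variable names are arbitrary.

```r
# Directory that contains the executables of the running R installation.
r_bin <- R.home("bin")
print(r_bin)

# On Windows the executable is called R.exe, on other systems R.
exe <- file.path(r_bin, if (.Platform$OS.type == "windows") "R.exe" else "R")
print(exe)
```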

Data clean-up

Categorize

Binarize

Categorize by k-Means

Categories to binary groups

Row operator

Fill: