Survival analysis - Cox regression

Cox Regression builds a predictive model for time-to-event data. The model produces a survival function that predicts the probability that the event of interest has occurred at a given time t for given values of the predictor variables. The shape of the survival function and the regression coefficients for the predictors are estimated from observed subjects; the model can then be applied to new cases that have measurements for the predictor variables. Note that information from censored subjects, that is, those that do not experience the event of interest during the time of observation, contributes usefully to the estimation of the model.

Do men and women have different risks of developing lung cancer based on cigarette smoking? By constructing a Cox Regression model, with cigarette usage (cigarettes smoked per day) and gender entered as covariates, you can test hypotheses regarding the effects of gender and cigarette usage on time-to-onset for lung cancer.

(from the IBM SPSS manual)
Cox regression analysis is typically applied to survival data, but it may be used for any other "event" type.

The standard Cox model (fitted with the "coxph" function) assumes proportional hazards, i.e. the hazard ratio between groups stays constant over time:

In case of converging or diverging hazards, the standard model might report incorrect estimates/p-values.
For such cases a weighted Cox regression (coxphw) might be applied.
For more details about this problem see:
    Daniela Dunkler, Michael Schemper and Georg Heinze:
    Gene selection in microarray survival studies under possibly non-proportional hazards.
    Bioinformatics, Volume 26, Issue 6, pp. 784-790.

N.B.: Above Kaplan-Meier pictures were taken from the publication.

SUMO uses the "coxph"/"coxphw" functions from the R statistical environment to perform Cox regression analysis.
Thus, you need R and the standard package "survival" and/or the package "coxphw" installed and correctly configured on your computer.
To install the survival/coxphw R packages, start an R session.
In the R console type, respectively:
    install.packages("survival")
    install.packages("coxphw")
R may ask you to select a CRAN mirror site. Select the geographically closest one.
R will download and install the respective package.

Data for analysis should contain:
To run a Cox regression analysis within SUMO, select Cox regression from the Survival menu:

SUMO's Cox regression setup dialog opens up:

The data table shows the data like an "expression matrix". Colors indicate:

Data clean-up

Cox requires:
If not yet done, bring your data into such a form.

Either edit data manually, or let SUMO assist you.

Click into the first cell of a data row which you wish to modify systematically
(e.g. click into line 2, col 5 to convert the "group").
Right-click to open the context menu:


Covariates must be numbers.
In case your data contain text tags, they must be converted to numbers.

In the example, row 2 contains "Group IDs" by which hybridisations are characterized.
Cox regression expects numbers instead of short text tags.
SUMO analyzes data from the selected row and shows a list of "Text categories" (alphabetically sorted) with suggested numerical (increasing) values:

In the example, category "IBC" is found in 15 samples (= Count) and would be replaced by "2" (= Value).
If required, modify the values for the found categories.
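
The replacement of text categories by numbers can be sketched as follows. This is only an illustration in Python, not SUMO's own code: the function name `categories_to_numbers`, the category "DCIS", and the choice to start the numbering at 1 are assumptions made for this example.

```python
def categories_to_numbers(row, overrides=None):
    """Map text tags to increasing integers (categories sorted alphabetically).

    `overrides` lets the user change the suggested value for single categories.
    Numbering from 1 is an assumption of this sketch.
    """
    categories = sorted(set(row))
    mapping = {cat: i + 1 for i, cat in enumerate(categories)}
    if overrides:
        mapping.update(overrides)
    return [mapping[cell] for cell in row], mapping


# Hypothetical "Group ID" row with three text categories.
group_row = ["IBC", "DCIS", "IBC", "--", "DCIS"]
values, mapping = categories_to_numbers(group_row)
print(mapping)  # suggested value per category
print(values)   # the row after replacement
```

Passing `overrides={"IBC": 10}` would mimic editing a suggested value before confirming the dialog.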

In cases where a single category is meaningless (e.g. the "--" in the example), you may remove all respective samples by setting its value to "nothing" (= empty cell) and checking the Del empty columns box.

You may assign the same new value to multiple categories.
E.g. set the value of categories "not applicable", "--", ... to empty cell and let SUMO delete the respective columns.

You may also use the categorization tool to homogenize original annotations:
Imagine you have a data row "Chemo therapy".
There may be different identifiers for the same drug, as well as typographic errors:
Erbitux, Cetuximab, Avastin, Cetuximap, Erbutux, ....
Simply copy the Categories line (first line) from the categorization dialog and paste it into the Value line (second line) using the toolbar buttons. Edit the individual new category names in the Value line, and let SUMO replace all redundant information in one go.
Click the OK button and all text tags will be replaced with the respective numerical values, or the corrected category names.
Click the Cancel button to close the dialog without action.


- empty cells => 0,
- any other cells => 1.
For example, convert a row containing "day to last follow up" into a censoring row:
all empty cells => 0 (not censored), all cells with a time to last follow-up => 1 (censored).
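
A minimal Python sketch of this conversion, assuming cells are held as strings. The name `to_binary` and the decision to treat whitespace-only cells as empty are assumptions of this example, not documented SUMO behaviour.

```python
def to_binary(row):
    """Empty cells become 0, any other cell becomes 1."""
    # Whitespace-only cells are treated as empty (assumption of this sketch).
    return [0 if str(cell).strip() == "" else 1 for cell in row]


# Hypothetical "day to last follow up" row: empty means no follow-up entry.
days_to_last_follow_up = ["", "412", "", "87", " "]
censor_row = to_binary(days_to_last_follow_up)
print(censor_row)  # [0, 1, 0, 1, 0]
```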

Categorize by k-Means

In case you have (quasi-)continuous numerical values, but you would like to group them into a few categories, SUMO can automatically generate the desired number of classes applying a k-means clustering algorithm.
Classes are ordered from low to high values.
After categorization, you may view the generated classes in the Results tab-sheet:
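
How k-means can turn continuous values into a few ordered classes is sketched below in Python. It is an illustration only: the function name `kmeans_1d`, the way the initial centres are chosen, and the numbering of classes from 0 are assumptions, and SUMO's implementation may differ in all of these.

```python
def kmeans_1d(values, k, iterations=100):
    """Group numbers into k classes; class 0 holds the lowest values."""
    ordered = sorted(values)
    n = len(ordered)
    # Initial centres spread evenly over the sorted values (assumption).
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda i: abs(v - centres[i]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            clusters[nearest(v)].append(v)
        # Keep the old centre if a cluster happens to be empty.
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres

    # Renumber the classes so that they are ordered from low to high values.
    order = sorted(range(k), key=lambda i: centres[i])
    rank = {old: new for new, old in enumerate(order)}
    return [rank[nearest(v)] for v in values]


# Hypothetical "age" row grouped into three classes.
ages = [31, 35, 33, 58, 61, 60, 84, 88]
classes = kmeans_1d(ages, 3)
print(classes)  # [0, 0, 0, 1, 1, 1, 2, 2]
```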

Categories to binary groups

In some cases you might want to split a multi-categorical row into multiple single binary rows.
Let's take the example of drug treatment:
Erbitux, Temodal, Cisplatin, Gemcitabin, ... were used to treat one or the other patient.
Instead of assigning Erbitux=>1, Temodal=>2, Cisplatin=>3, Gemcitabin=>4, ... it may be better to generate 4 (for this example) new data rows.
For each category (here: drug) a new data row is created. The patients annotated with the respective category get the value "1", all others "0".

You may apply this to any multi-categorical annotation (anatomic site, site of metastases, ...).
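
The split into binary rows can be sketched in Python as follows. The function name `categories_to_binary_rows` and the sample data are hypothetical and only serve to illustrate the idea.

```python
def categories_to_binary_rows(row):
    """Create one 0/1 row per category found in `row`."""
    categories = sorted(set(row))
    return {cat: [1 if cell == cat else 0 for cell in row]
            for cat in categories}


# Hypothetical "drug treatment" row, one entry per patient.
treatment = ["Erbitux", "Temodal", "Cisplatin", "Erbitux", "Gemcitabin"]
binary_rows = categories_to_binary_rows(treatment)
for drug, flags in binary_rows.items():
    print(drug, flags)
```

Each patient has exactly one "1" across the new rows, so no artificial ordering between the drugs is implied.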

Row operator

A few basic functions to process values row wise:


Prefill a new annotation row with a custom-defined constant value (e.g. "0") and manually edit only the deviating samples.
Or fill the data row with a gradient, starting with a user-defined value and a user-defined increment (e.g. "2,0.2" will generate the data: "2    2.2    2.4    2.6    2.8 ...").

Concatenate rows: Combine two rows by cell-wise appending the values from the defined rows.
Optionally define a divider to be inserted between the contents of the two rows.
An input dialog opens up:

- number of first row
- number of second row
- optional divider (skip if no divider is required)
In the example, rows 5 and 6 are concatenated with a colon as divider: "5,6,:".
Helpful for example to generate "date to event" in case you have "date to death" and "date to last followup" in separate rows of your data file.
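
A small Python sketch of the cell-wise concatenation, with hypothetical row contents. The function name `concatenate_rows` is an assumption of this example.

```python
def concatenate_rows(first, second, divider=""):
    """Append the cells of `second` to the cells of `first`, cell by cell."""
    if len(first) != len(second):
        raise ValueError("both rows must have the same number of cells")
    return [f"{a}{divider}{b}" for a, b in zip(first, second)]


# Each patient has either a date to death or a date to last follow-up.
days_to_death = ["320", "", "105"]
days_to_last_followup = ["", "730", ""]

days_to_event = concatenate_rows(days_to_death, days_to_last_followup)
print(days_to_event)  # ['320', '730', '105']

with_divider = concatenate_rows(days_to_death, days_to_last_followup, ":")
print(with_divider)   # ['320:', ':730', '105:']
```

Without a divider the two sparse rows merge into one complete row, which is why the divider is optional.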

Date difference:
Compute the difference between two date rows.
E.g. you have the date of treatment start and the date of last follow-up => compute time to last follow-up in days.

- First row
- second row
- date format (e.g. dd/mm/yyyy => 24/12/2014)
This will calculate the difference between the date values (Row1 - Row2) in days.
Obviously, you have to define the respective date format.
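
The date difference can be illustrated with Python's standard `datetime` module. The function name `date_difference_days` and the sample dates are hypothetical, and the format string "%d/%m/%Y" is the Python equivalent of "dd/mm/yyyy", not the notation used in SUMO's dialog.

```python
from datetime import datetime


def date_difference_days(first_row, second_row, date_format="%d/%m/%Y"):
    """Return first_row - second_row in days, cell by cell."""
    return [
        (datetime.strptime(a, date_format) - datetime.strptime(b, date_format)).days
        for a, b in zip(first_row, second_row)
    ]


last_follow_up = ["24/12/2014", "01/03/2015"]
treatment_start = ["24/11/2014", "01/03/2014"]
days = date_difference_days(last_follow_up, treatment_start)
print(days)  # [30, 365]
```

Swapping the two rows flips the sign of every result, so the later date should be given as the first row.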

Search and replace a text pattern within a row (case-insensitive).
For example, remove all "not available" values from "days to xxxxx" rows. Next, concatenate "days to death" and "days last follow-up" into one row.
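
A case-insensitive search and replace over one row could look like this in Python. The function name `replace_in_row` is hypothetical, and the pattern is treated as literal text here, which is an assumption about how SUMO interprets it.

```python
import re


def replace_in_row(row, pattern, replacement=""):
    """Replace `pattern` in every cell, ignoring upper/lower case."""
    # re.escape: the pattern is matched literally, not as a regular expression.
    regex = re.compile(re.escape(pattern), re.IGNORECASE)
    return [regex.sub(replacement, cell) for cell in row]


days_to_death = ["Not Available", "120", "not available", "45"]
cleaned = replace_in_row(days_to_death, "not available")
print(cleaned)  # ['', '120', '', '45']
```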

Compute basic descriptive statistics parameters from ALL numerical values in the selected row.
The result is shown in the Results tabsheet:

The parameters:
n: Number of values analysed
Minimum: Smallest value in the data set
Maximum: Largest value in the data set
Arithmetic mean
Standard deviation
Skewness: Asymmetry of the data set (3rd distribution moment)
Kurtosis: Peakedness of the data set (4th distribution moment)
Geometric mean: Meaningful for ratio data
Geometric SDev
Median: Robust, outlier-insensitive average
Median Absolute Deviation
1. Quartile: Value at lowest 25% of the data set
2. Quartile: = Median
3. Quartile: Value at highest 25% of the data set
N.b.: Boxplots often illustrate the 1.-3. quartile data range.
5% Percentile: Value at lowest 5% of the data set
n < 5% Percentile: Number of data values smaller than the 5% percentile
95% Percentile: Value at highest 5% of the data set
N.b.: The whiskers in a Box-Whisker plot often illustrate the 5%-95% data range.
n > 95% Percentile: Number of data values larger than the 95% percentile
N.b.: In a Box-Whisker plot, values <5% as well as >95% are often displayed as min/max data points.
Trimmed mean (10%): Robust, outlier-insensitive average
Trimmed mean (25%): Robust, even less outlier-sensitive average
Winsorized mean (10%): Robust, outlier-insensitive average
Winsorized mean (25%): Robust, even less outlier-sensitive average
Harmonic mean
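
Some of the robust parameters in this list can be reproduced with a few lines of Python. The sketch below is an illustration, not SUMO's implementation: the function names are hypothetical, and it assumes that the percentage of a trimmed or winsorized mean is applied to each end of the sorted data.

```python
import statistics


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def trimmed_mean(values, fraction):
    """Drop `fraction` of the values at each end, then average the rest."""
    ordered = sorted(values)
    cut = int(len(ordered) * fraction)
    return statistics.mean(ordered[cut:len(ordered) - cut])


def winsorized_mean(values, fraction):
    """Replace `fraction` of the values at each end by the nearest kept value."""
    ordered = sorted(values)
    cut = int(len(ordered) * fraction)
    if cut:
        ordered[:cut] = [ordered[cut]] * cut
        ordered[-cut:] = [ordered[-cut - 1]] * cut
    return statistics.mean(ordered)


# Hypothetical row with one obvious outlier (95).
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 95]
print("mean           ", statistics.mean(data))
print("median         ", statistics.median(data))
print("MAD            ", median_absolute_deviation(data))
print("trimmed 10%    ", trimmed_mean(data, 0.10))
print("winsorized 10% ", winsorized_mean(data, 0.10))
```

The arithmetic mean is pulled up to 15 by the single outlier, while the median and the trimmed and winsorized means stay between 6 and 7.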

Column operator

A few basic functions to process values column wise:


Convert names (or whatever) into a series of asterisks, keeping the first character.
E.g. 2 columns: "Bilbo", " Boggins" => "B****" "B******".
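
The masking can be sketched in Python as below. The function name `mask_text` is hypothetical, and removing surrounding blanks before masking is an assumption inferred from the example above.

```python
def mask_text(text):
    """Keep the first character and replace all others with asterisks."""
    text = text.strip()  # surrounding blanks are dropped (assumption)
    return text[:1] + "*" * (len(text) - 1)


masked = [mask_text(name) for name in ["Bilbo", " Boggins"]]
print(masked)  # ['B****', 'B******']
```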

Shift all positive

Negative values may cause erroneous or bizarre results.
In such cases it might be helpful to shift all values until the smallest value becomes zero.
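
In Python the shift amounts to subtracting the smallest value from every cell. The function name `shift_all_positive` is hypothetical, and leaving data without negative values untouched is an assumption of this sketch.

```python
def shift_all_positive(values):
    """Shift the values so that the smallest one becomes zero."""
    smallest = min(values)
    if smallest >= 0:
        return list(values)  # nothing negative: leave untouched (assumption)
    return [v - smallest for v in values]


shifted = shift_all_positive([-3, 0, 2, 5])
print(shifted)  # [0, 3, 5, 8]
```

The distances between the values are preserved; only the origin moves.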


Your categories are numbered from 0 to 5, but they anti-correlate with your event (e.g. tumor grading <=> survival).
In such cases it may be helpful to just invert the categories:
5 => 0
4 => 1
...
0 => 5
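
The inversion is a reflection of each value between the smallest and largest category. A minimal Python sketch with the hypothetical function name `invert_categories`:

```python
def invert_categories(values):
    """Mirror numeric categories: the largest becomes the smallest and vice versa."""
    low, high = min(values), max(values)
    return [high + low - v for v in values]


grades = [0, 1, 4, 5, 3]
inverted = invert_categories(grades)
print(inverted)  # [5, 4, 1, 0, 2]
```

Note that this sketch uses the smallest and largest value actually present in the row; if the extreme categories do not occur in the data, the mapping differs from a fixed 0 to 5 scale.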

Filter data

In some cases you might not want to use all of your samples for Cox regression analysis.
E.g. you want to analyze the influence of smoking and blood pressure on stroke, but only within the subset of male patients.

Click with right mouse button into the header cell of the respective row.

From context menu select Filter.

You may filter on numeric values:

You may filter on text values:

Define the Action to be done with the filtered objects:

Undo a previous column/row deletion.

Clear selection: Deselect all columns and rows.

Sort data

For convenience, you may also sort your data based on a single column or row.

Click with right mouse button into a column's / row's header cell and select Sort to sort your data.

KM_Survival-graphs from all categories

As the name says:
- Find all categories in the selected row (e.g. Erbitux, Cisplatin, Temodal, ...)
- Assign samples (patients) to the respective category's group
- Compute and show KM-graphs for all groups.
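
The three steps above can be sketched in Python with the Kaplan-Meier product-limit estimator. This is an illustration only: the function names and the sample data are hypothetical, and the event coding used here (1 = event observed, 0 = censored) is a convention of this sketch that should be checked against the coding of your own censor row.

```python
from collections import defaultdict


def kaplan_meier(times, events):
    """Return [(time, survival probability)] for each time with an event.

    events: 1 = event observed, 0 = censored (convention of this sketch).
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored samples leave the risk set
    return curve


def km_by_category(categories, times, events):
    """Group samples by category and compute one KM curve per group."""
    groups = defaultdict(lambda: ([], []))
    for cat, t, e in zip(categories, times, events):
        groups[cat][0].append(t)
        groups[cat][1].append(e)
    return {cat: kaplan_meier(ts, es) for cat, (ts, es) in groups.items()}


drug = ["Erbitux", "Erbitux", "Erbitux", "Cisplatin", "Cisplatin", "Cisplatin"]
months = [5, 8, 12, 3, 6, 9]
event = [1, 0, 1, 1, 1, 0]
curves = km_by_category(drug, months, event)
for name, curve in curves.items():
    print(name, curve)
```

Each curve is a step function: the survival probability stays constant between the listed times and drops at each of them.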

Define variates / covariates

Click one of the three buttons:
Now click into the respective row within the table.
The specified row will be used accordingly in Cox regression analysis.
Survival/Censor row are shown with light "aqua" colored background,
Covariate rows are shown with yellow background.

Obviously, you may have only ONE SURVIVAL and only ONE CENSOR row.

You may define as many rows as covariates as you want.

But: For mono-variate Cox regression you may have as many covariates as you want.

Run Cox regression

When the survival / censor / covariate rows are defined and the data in these rows are adjusted accordingly,
check the desired Cox method:
- proportional hazards (coxph function)
- NON-proportional hazards (coxphw function)
Finally, click the Run button to execute Cox regression.

Run - all together
Run a single multivariate Cox regression analysis with all covariates, estimating each covariate's influence on the whole predictor/validator.

Run - all single
Run multiple mono-variate Cox regressions to estimate how a single covariate supports the outcome.
This may also be used to test a larger number of covariates (>>10) individually.

SUMO will now check the data and try to detect (and possibly fix) problems:
Details of the problematic values are listed in the Results tabsheet.

Next, SUMO generates a data file as well as the R script.
These files, as well as the result files generated by R, are created in your user account's temp folder.

Now, SUMO launches the R-script, and tries to read the result files.

Results are read by SUMO and shown in the Results tab:

The first few lines show names and folder of the intermediate data files,
as well as the numbers of samples, events and covariates.

For each covariate, Cox regression results are shown:

Covar: Name of the covariate, as defined in the input data table / file
coef: Regression coefficient of the log model
exp(coef): Exponentiated regression coefficient => hazard ratio,
which tells us whether the probability to experience the "event" is increased (>1) or decreased (<1).
se: Standard error of the regression coefficient
pr(>|z|): Probability value (p-value),
which tells us whether the finding is statistically significant (e.g. p<=0.05).
Lower/Upper.95: 95% confidence interval for the hazard ratio
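
How these columns relate to each other can be shown with a short Python calculation. It is a sketch under stated assumptions: "se" is taken as the standard error of the coefficient, the p-value is the two-sided Wald test, and 1.96 approximates the normal quantile for a 95% interval. The function name `hazard_ratio_summary` and the input numbers are hypothetical.

```python
import math


def hazard_ratio_summary(coef, se, z_crit=1.96):
    """Derive hazard ratio, Wald p-value and 95% interval from coef and se."""
    z = coef / se
    return {
        "exp(coef)": math.exp(coef),
        # Two-sided p-value of a standard normal: 2 * (1 - Phi(|z|)).
        "p": math.erfc(abs(z) / math.sqrt(2)),
        "lower.95": math.exp(coef - z_crit * se),
        "upper.95": math.exp(coef + z_crit * se),
    }


summary = hazard_ratio_summary(coef=0.5, se=0.25)
for name, value in summary.items():
    print(f"{name:10s} {value:.4f}")
```

If the interval between lower.95 and upper.95 does not contain 1, the corresponding p-value is below 0.05.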

Finally, p-values for the whole Cox regression analysis according to different statistics are shown.
For more details about the specific implementation of coxph see R's documentation (or a local copy of this document). For usage of the coxph function see the documentation for R's survival package (or a local copy of this document).

Additionally, hazard ratios for all covariates are shown graphically as "Forest plots" using SUMO's Dotchart.

The results from the above example data show that there is no significant correlation (p<0.05), neither overall nor with any of the covariates, and only one covariate ("alc") shows a noteworthy anti-correlation.


In case of any kind of problems: SUMO will tell you about an empty or non-existing data file.

To analyze, and possibly fix, these problems, check the View R output field on the Preferences tab-sheet:

and re-run the analysis.

Now you can see the WHOLE output of R and the coxph function in the Results tab-sheet:

Especially the "Log from, R-script-Processor" section should indicate problems with R and the coxph function.

To view all data files generated by SUMO and R, click the "View results files" button.
The four most recently created data files are shown in Windows' Notepad.


SUMO needs to know the name and location of R.

Go to Preferences tab-sheet:

In the Path to R executable field, enter name and path to R.
More easily, click the " ... " button and navigate to the R.exe executable.