- Specific algorithms for complex data
- SUMO functions repacked / reorganized
- ...

Grubbs outlier removal | A tool to remove outliers from data vector(s) |

Methylation | Here you can find various functions to preprocess/convert methylation data derived from Illumina 450K methylation microarrays. |

12,13,11,10234,14,16,12,15

Arithmetic mean = 1290.875

This is completely wrong because of the single outlying value (10234).

One method to circumvent this problem could be to use a "robust" mean estimator:

Median, Truncated mean, Winsorized mean, Tukey biweight, ....

But all these methods require you to define the fraction of potentially outlying values.

If this estimate is incorrect, you might remove "good" values, decreasing the power of your analysis. Alternatively, if you remove too few outliers, your means might still be wrong.
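The effect can be checked on the example vector from above. This is a minimal sketch using only the Python standard library; the helper `trimmed_mean` and its `fraction` parameter are illustrative names, not SUMO functions.

```python
import statistics

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end (fraction < 0.5)."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)          # number of values cut from each end
    return statistics.mean(ordered[k:len(ordered) - k])

data = [12, 13, 11, 10234, 14, 16, 12, 15]

plain_mean = statistics.mean(data)            # dominated by the outlier
median = statistics.median(data)              # middle of the sorted values
trimmed = trimmed_mean(data, 0.125)           # drops one value at each end

print(plain_mean, median, trimmed)
```

Note that the trimmed mean only works here because the chosen fraction happens to cover the single outlier; with two outliers at the same end it would still be biased.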

Another method could be to iteratively remove extreme values which might not belong to the main data cloud: Grubbs' test for outliers.

In brief:

Iteratively compute:

- Mean / Standard deviation
- Detect the data value with the largest distance to the mean
- Test with a 2-class t-test: does the outlier belong to the main data cloud at the 5% level?

- if yes: stop.

- if no: remove the value and continue the iteration with the reduced data set.
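The iteration can be sketched as follows. This is not SUMO's code: it assumes the commonly published two-sided Grubbs criterion, where the critical value is derived from a t-distribution quantile, and SUMO's internal test may differ in detail. The helpers `t_cdf`, `t_quantile` and `grubbs_filter` are illustrative names; the t quantile is approximated numerically so that only the standard library is needed.

```python
import math
import statistics

def t_cdf(t, df):
    """Student's t CDF for t >= 0, by Simpson integration of the density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def density(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    steps = 2 * max(100, int(50 * t))          # even number of intervals
    h = t / steps
    total = density(0.0) + density(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Quantile of Student's t for p > 0.5, found by bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:                   # grow the bracket until it contains p
        lo, hi = hi, hi * 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def grubbs_filter(values, alpha=0.05):
    """Remove one extreme value per iteration until the test no longer rejects."""
    data = list(values)
    while len(data) > 2:
        n = len(data)
        mean = statistics.mean(data)
        sd = statistics.stdev(data)
        if sd == 0:
            break                              # all values identical: nothing to test
        idx = max(range(n), key=lambda i: abs(data[i] - mean))
        g = abs(data[idx] - mean) / sd         # distance of the most extreme value
        t = t_quantile(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))
        if g <= g_crit:
            break                              # value belongs to the data cloud: stop
        del data[idx]                          # outlier: remove and iterate again
    return data

print(grubbs_filter([12, 13, 11, 10234, 14, 16, 12, 15]))
```

For the example vector the value 10234 is removed in the first iteration, and the second iteration stops because the most extreme remaining value stays below the critical value.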

In

A spreadsheet-like dialog opens up:

Paste your data vector (or multiple lines) of tab-delimited data into the sheet.

Now you can remove outlying data from rows / columns by clicking the respective button.

Finally you may copy rows/columns, or save the complete table as a tab-delimited text file.

In the result tab-sheet,

- **Extract Quants from IDat**
- **M/U impute missing intensity values**
- **Pairwise color quantile normalization**
- **M/U Intensities => M or Beta**
- **M/U Intensities => Total intensity**
- **Methylation-BMIQ-normalisation**
- **Convert Beta => M**
- **Convert M => Beta**
- **Feature averages**

For analysis outside Illumina GenomeStudio, intensity data (or others) have to be extracted.

Select:

A file selection dialog opens up.

Select all IDat files that shall be processed.

    library(illuminaio)                          # IDat reader package
    idat <- readIDAT("filename.idat")            # read one IDat file
    write.table(idat$Quants, "filename.idat.txt", sep="\t")   # export the Quants table

For each IDat file a text file is created in the folder from which the IDat file was read.

The generated text files contain for each feature:

- Illumina-ID (column header missing)
- Mean - Average intensity from all beads for the respective feature
- SD - Standard deviation of the above average
- NBeads - Number of beads averaged for this particular feature

- Functioning installation of R
- Installed illuminaio package (via R-GUI).

Thus, SUMO can analyse methylated / unmethylated data independently and impute row averages for data cells whose value is below a user-defined threshold.

Only requirement: data columns are organized pairwise: one sample methylated/unmethylated (or inverted pair order).
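A minimal sketch of such an imputation for one data row, treating the two channels (even and odd columns) independently. It assumes that the row average is computed only from the values at or above the threshold; whether SUMO also includes the low values in the average is not stated here. `impute_low_values` is an illustrative name.

```python
import statistics

def impute_low_values(row, threshold):
    """Impute one data row whose columns alternate between the two channels."""
    result = list(row)
    for start in (0, 1):                       # even columns, then odd columns
        idx = range(start, len(row), 2)
        valid = [row[i] for i in idx if row[i] >= threshold]
        if not valid:
            continue                           # nothing to average: leave the channel as is
        average = statistics.mean(valid)       # assumption: average of the valid values only
        for i in idx:
            if row[i] < threshold:
                result[i] = average
    return result

# Three samples, columns organized pairwise (M, U, M, U, M, U)
print(impute_low_values([1200, 300, 50, 310, 1000, 20], threshold=100))
```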

A well known problem may occur from unequal labeling efficiencies with the two different dyes.

One way to circumvent this problem might be to perform quantile normalization for all intensities coming from one single sample (two hybs).

For this procedure we encounter the problem of the two Illumina design types, which have different numbers of probe sets.

Similar to:

Analyze Illumina Infinium methylation microarray data

Pan Du, Gang Feng, Spencer Huang, Warren A. Kibbe, Simon Lin,

October 13, 2014

Averages for the quantiles are now computed from all four feature values, or from only two feature values, where there were no data mapped from the smaller data subsets:

Convert pairwise intensity values into M / Beta (=ratio values).

M = Methylated-intensity / Un-methylated-intensity

Beta = Methylated-intensity / (Methylated-intensity + Un-methylated-intensity)

SUMO asks for column order:

- MU = Methylated, Un-methylated
- UM = Un-methylated, Methylated
- anything else = UM

To circumvent this problem a small user-defined additive constant (alpha) may be set (e.g. "100", => 101/102 or 104/101).

Default value = 100 (~typical background value for Illumina background intensities):

- M = (M+alpha) / (U+alpha)
- Beta = M / (M+U+alpha)
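The two formulas and the column order convention can be sketched for a single intensity pair as follows. The function name `ratio_values` and its parameter names are illustrative only.

```python
def ratio_values(first, second, alpha=100.0, order="UM"):
    """Return (M, Beta) for one sample's pair of intensities.

    order "MU": `first` is the methylated intensity;
    anything else is treated as "UM".
    """
    if order == "MU":
        meth, unmeth = first, second
    else:
        meth, unmeth = second, first
    m_value = (meth + alpha) / (unmeth + alpha)   # M = (M+alpha) / (U+alpha)
    beta = meth / (meth + unmeth + alpha)         # Beta = M / (M+U+alpha)
    return m_value, beta

# Very small intensities no longer produce extreme ratios:
print(ratio_values(1, 2, alpha=100, order="MU"))  # M = 101/102
print(ratio_values(1, 4, alpha=100, order="UM"))  # M = 104/101
```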

Illumina 450k methylation arrays measure the abundance of methylation sites spread across the whole genome at high resolution.

Sum of Methylated and Un-methylated probe intensities represent abundance of the genomic region for the specific site.

Thus one can "misuse" methylation data for Copy Number Variation analysis (CNV).

Just convert Beta values to M (=direct ratios).

Instead of using raw Methylated/Un-methylated intensity values for computing M (log2 ratio), one can convert the values:

$M = \log_2\left(\frac{\mathrm{Beta}}{1 - \mathrm{Beta}}\right)$

Obviously, this conversion cannot reflect a correction for small intensity values ("alpha", see above).

Just convert M values to Beta.

Instead of using raw Methylated/Un-methylated intensity values for computing Beta ( = M/(U+M+alpha) ), one can convert the values:

$\mathrm{Beta} = \frac{2^{M}}{2^{M} + 1}$

Obviously, this conversion cannot reflect a correction for small intensity values ("alpha", see above).
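Both conversions can be sketched in a few lines. The sketch assumes the commonly used logit relation between Beta and log2-scaled M values (M = log2(Beta / (1 - Beta))); the function names are illustrative, and Beta must lie strictly between 0 and 1.

```python
import math

def beta_to_m(beta):
    """Convert a Beta value (0 < beta < 1) into a log2-scaled M value."""
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Convert a log2-scaled M value back into a Beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

for beta in (0.1, 0.5, 0.8):
    m = beta_to_m(beta)
    print(beta, m, m_to_beta(m))              # the round trip returns the input Beta
```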

The two probe types exhibit different DNA methylation distributions and dynamic ranges, which may introduce biases into downstream analysis.

For more details see:

A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data.

Teschendorff AE, Marabita F, Lechner M, Bartlett T, Tegner J, Gomez-Cabrero D, Beck S.

Bioinformatics. 2013 Jan 15;29(2):189-96. doi: 10.1093/bioinformatics/bts680. Epub 2012 Nov 21.

The code

In

- Illumina **beta values** for multiple samples (beta = intensity-methylated / (intensity-methylated + intensity-not-methylated))
- One gene annotation column, containing the probe design type (either "I" or "II")

Remove any non numeric values (e.g. impute missing values).

Ensure data range is from 0 to 1.

In SUMO:

Main menu | Utilities | Methylation | BMIQ-normalisation

A column selection dialog opens up.

Select the "gene" annotation column which contains the probe design type.

The annotation column should contain either "I" for design type 1 or "II" for design type 2 (one entry for each single feature).

A parameter dialog opens up:

Check and adjust parameters accordingly.

For more details about meaning and suggested parameter values see the publication about BMIQ.

When normalization has finished, don't forget to save data under a new name.

BMIQ normalisation is a time-consuming, complex process.

One sample will require around 60s on an Intel I7-2.5GHz CPU.

Press the ESC key at any time to abort the computation.

NB: You then have a mixed data matrix: already processed samples as well as still original data.

In case data seem to look scrambled, check the intermediate data files.

In your local windows temp directory you can find:

SUMO_BMIQ_r_result.txt | Log file from R, showing any potential problems/exceptions during execution of the BMIQ R-script. |

SUMO_BMIQ_r_input.txt | The data set exported by SUMO as input for the R-script. The file should contain two columns: type = Illumina design type (1 or 2), data = beta values (0 to 1). |

SUMO_BMIQ_r_data.txt | Data created by the R-script, re-imported by SUMO into the expression matrix. One data column containing the transformed Beta values (0 to 1). |

SUMO_BMIQ_r_output.txt | Other computational results generated by the R-script, here the "quality" of the iterative Beta fits. |

SUMO_BMIQ_r_script.txt | The R-script generated by SUMO. |

- Functioning installation of R (can be the same as used in any other SUMO R-wrapper)
- Check / install required packages with the R-GUI:

- RPMM [ in R-GUI: install.packages("RPMM") ]

- survival [ in R-GUI: install.packages("survival") ]

- TSS1500

- TSS200

- 5'UTR

- 3'UTR

- 1st Exon

- Body

For further analysis it may be advisable to compute an average methylation for the functional group, or even for the complete gene.

Select

In the parameter dialog define:

- Column number where "gene1:region1;gene2:region2;" information is given

- Row number which contains the sample ID to be shown in the result file

- Averaging method

- Result file name

- Search all unique gene names

- Index all original data rows, to the unique gene names

- for each gene, extract all values for all six (see above) functional regions for each sample

- compute a mean estimate for each region and sample

- save data to a tab-delimited result file:

The file contains:

- First rows: sample annotations just copied from original data file

- First column: Gene name as extracted from gene:region column as defined above

- Second column: Gene region as extracted from gene:region column defined above

- Third column: Number of features for this particular gene:region

- Fifth column: Arithmetic average from all samples for this gene_region

- Following columns: average from all features for the respective gene:region/sample, computed as above

The method may be applied to Beta, M or CNV values.
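The averaging steps can be sketched as follows, using the arithmetic mean as averaging method. This is an illustration only: `region_averages` is a hypothetical name, the input is a simplified list of (annotation, values) rows, and counting a repeated gene:region entry within one row only once is an assumption.

```python
from collections import defaultdict
import statistics

def region_averages(rows):
    """rows: list of (annotation, values), annotation like "g1:r1;g2:r2;".

    Returns {(gene, region): (n_features, [mean per sample])}.
    """
    groups = defaultdict(list)
    for annotation, values in rows:
        pairs = {p for p in annotation.split(";") if p}   # unique gene:region entries
        for pair in pairs:
            gene, _, region = pair.partition(":")
            groups[(gene, region)].append(values)         # index row to gene:region
    result = {}
    for key, members in groups.items():
        per_sample = [statistics.mean(col) for col in zip(*members)]
        result[key] = (len(members), per_sample)
    return result

rows = [
    ("A:Body;B:TSS200;", [0.25, 0.5]),   # one feature, two samples
    ("A:Body;", [0.75, 1.0]),
]
for (gene, region), (n, means) in sorted(region_averages(rows).items()):
    print(gene, region, n, means)
```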

In the resampling test, we keep the assignment of samples to the defined classes, but generate random subsamples of the original assignment.

Thus we mainly test the stability of our hypotheses when using various subsets of the groups.

- Group balanced
- Stratified

Assume a 2-class test with 10 samples each:

- Group-1: a1,a2,a3,a4,a5,a6,a7,a8,a9,a10

- Group-2: b1,b2,b3,b4,b5,b6,b7,b8,b9,b10

Now we run 100 permutations, with 80% of the samples.

Permutation-001: a1,a3,a4,a5,a7,a8,a9,a10 - b2,b3,b4,b5,b6,b7,b8,b10

Permutation-002: a1,a2,a3,a5,a7,a8,a9,a10 - b1,b2,b4,b5,b7,b8,b9,b10

Permutation-003: a2,a3,a4,a5,a6,a7,a9,a10 - b1,b3,b4,b5,b6,b8,b9,b10

...

Permutation-100: a1,a2,a3,a4,a6,a7,a8,a10 - b1,b2,b3,b5,b7,b8,b9,b10
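Such group-balanced subsamples can be generated as in the following sketch. Drawing without replacement corresponds to jackknifing, drawing with replacement to bootstrapping. The function name and parameters are illustrative, and the random draws will of course differ from the example listing above.

```python
import random

def balanced_subsamples(groups, fraction, n_permutations, replace=False, seed=None):
    """Draw the same fraction of samples from every group, n_permutations times."""
    rng = random.Random(seed)
    result = []
    for _ in range(n_permutations):
        draw = []
        for members in groups:
            k = round(len(members) * fraction)
            if replace:
                picked = rng.choices(members, k=k)   # with replacement
            else:
                picked = rng.sample(members, k)      # without replacement
            draw.append(picked)
        result.append(draw)
    return result

group1 = [f"a{i}" for i in range(1, 11)]
group2 = [f"b{i}" for i in range(1, 11)]
permutations = balanced_subsamples([group1, group2], 0.8, 100, seed=1)
print(permutations[0])
```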

Select the

Additionally select the corresponding condition annotation in the

Assume you have a condition annotation which describes tumor

E.g. S1:G1, S2:G1, S3:G0, S4:G45, S5:G0, S6:--, S7:G0, S8:G2, ....

Click

Select the

From the above example we would find 5 strata: G0, G1, G2, G45 and "--".

All permutations are now built in such a way that the fractions of the strata in the permutations are identical to the fractions in the original starting configuration.

In case you want to build a more complicated stratification (e.g. combine gender and smoking => Male:Smoker, Male:NonSmoker, Female:Smoker, Female:NonSmoker), generate an additional condition annotation line with this stratification prior to running the test.

Obviously, stratification may drastically reduce the number of unique permutations to be performed - in particular with jackknifing.
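A stratified draw for one group can be sketched as follows: the samples are first sorted into strata by their condition annotation, then the same fraction is drawn from each stratum. Keeping at least one sample per stratum is an assumption of this sketch, and `stratified_subsample` is an illustrative name.

```python
import random
from collections import defaultdict

def stratified_subsample(samples, strata, fraction, rng):
    """Draw without replacement, keeping each stratum's share of the samples."""
    by_stratum = defaultdict(list)
    for sample, stratum in zip(samples, strata):
        by_stratum[stratum].append(sample)
    picked = []
    for members in by_stratum.values():
        k = max(1, round(len(members) * fraction))   # assumption: keep at least one
        picked.extend(rng.sample(members, k))
    return picked

samples = [f"S{i}" for i in range(1, 31)]
strata = ["Male:Smoker"] * 10 + ["Female:NonSmoker"] * 20
subset = stratified_subsample(samples, strata, 0.8, random.Random(0))
print(len(subset), subset)
```

With small strata the number of distinct draws shrinks quickly, which is the reduction of unique permutations mentioned above.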

For each of the permutations, a p-value with the underlying statistic is computed (here t-test or Mann-Whitney).

Permutations with a p-value ≤ user-defined threshold (alpha) are used for computation of the resampling p-value and mean-p:

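$p_{\mathrm{resampling}} = 1 - \frac{N_{\mathrm{success}}}{N_{\mathrm{permutations}}}$

with: $N_{\mathrm{success}}$ = number of permutations with p ≤ alpha, $N_{\mathrm{permutations}}$ = total number of permutations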

The test adds three additional annotation columns to the data matrix:

- n-success

- p-resampling

- mean p-resampling

p-permutation may range from

- 1 = no single permutation resulted in p ≤ alpha

- 0 = all permutations resulted in p ≤ alpha

Like with any other permutation test, the highest achievable significance is limited by the number of permutations.

With 1000 permutations the best meaningful p-value would be 0.001.
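The counting behind n-success and the resampling p-value can be sketched as follows. The formula used here (1 minus the fraction of successful permutations) is inferred from the stated range, 1 for no successful permutation and 0 for all; it is an assumption, not SUMO's documented code. The mean-p column is left out because its exact definition is not given here.

```python
def resampling_summary(p_values, alpha=0.05):
    """Return (n_success, p_resampling) for the p-values of all permutations."""
    n_success = sum(1 for p in p_values if p <= alpha)
    p_resampling = 1.0 - n_success / len(p_values)   # assumed definition, see text
    return n_success, p_resampling

# A gene that was significant in 40 of 100 permutation cycles:
p_values = [0.01] * 40 + [0.5] * 60
print(resampling_summary(p_values))
```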

From

The group selection dialog opens up.

Define

Go to the

Select

- Distribution: 1-class / 2-class t-test / multiclass ANOVA
- Non-parametric: 1-class / 2-class Mann-Whitney / multiclass Kruskal-Wallis

Select

- Bootstrap - with replacement, i.e. you might have individual samples multiple times in the permutations
- Jackknife - without replacement.

The more, the better. But the more permutations, the more execution time is required.

In any case, the number of permutations should be adapted to the group sizes.

Assume 2 groups, 3 samples each, 66% sample size.

There exist ~16 possible permutations.

Running a thousand permutations would mean: you repeat the 16 possible permutations ~60 times.

=> p-values are not meaningful.

Assume 2 groups, each with 40 samples, 80% sample size.

For each permutation, 32 samples are randomly selected from each of the two groups.

Click

When finished, close the notification message box.

Now you may filter genes by the resampling test results.

Define a critical p-value and see how many features (genes) pass the test:

Click

Click

Click

Additionally, a histogram is shown, indicating how often individual genes were significant in the resampling test.

The example shows non-significant results.

None of the tested ~150 genes was significant more than 40 times in the 100 permutation cycles.

=> p>0.6 - non significant.

An XML key without a value gets the placeholder value "--".

Additionally, a key list file is generated in the folder where the XML files are located.

The key list contains all unique keys extracted from all converted XML files, sorted alphabetically.

In SUMO:

Main menu | Utilities | File utilities | XML=>TSV

A file selection dialog will open up.

Select the XML files you wish to convert.

In

The sample XML-fragment:

would be converted to:

TCGA-2A-A8VL.xml | -- |
xml | -- |
prad:tcga_bcr | -- |
admin:admin | -- |
admin:bcr | Nationwide Children's Hospital |
admin:file_uuid | C2A2058A-37FA-49BE-A806-567A9A332504 |
admin:batch_number | 389.50.0 |
admin:project_code | TCGA |
admin:disease_code | PRAD |
admin:day_of_dcc_upload | 31 |
admin:month_of_dcc_upload | 3 |
admin:year_of_dcc_upload | 2016 |
admin:patient_withdrawal | -- |
admin:withdrawn | false |
prad:patient | -- |
admin:additional_studies | -- |
clin_shared:tumor_tissue_site | Prostate |
shared:other_dx | No |
shared:gender | MALE |
clin_shared:vital_status | Alive |
clin_shared:days_to_birth | -18658 |
clin_shared:days_to_death | -- |
clin_shared:days_to_last_followup | 621 |
...
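The principle of such a conversion can be sketched with the Python standard library. This is not SUMO's converter: the function names are illustrative, and the small XML fragment is a made-up example without namespace prefixes (prefixed tags such as `admin:bcr` would need their namespace declarations to be parsed).

```python
import xml.etree.ElementTree as ET

def xml_to_rows(xml_text):
    """Return (key, value) pairs in document order; empty elements get "--"."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():                 # root element first, then children
        text = (element.text or "").strip()
        rows.append((element.tag, text if text else "--"))
    return rows

def key_list(all_rows):
    """Alphabetically sorted unique keys from several converted files."""
    return sorted({key for rows in all_rows for key, _ in rows})

fragment = "<patient><gender>MALE</gender><days_to_death/></patient>"
rows = xml_to_rows(fragment)
for key, value in rows:
    print(f"{key}\t{value}")                    # one tab-separated line per key
print(key_list([rows]))
```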

An XML key without a value gets the placeholder value "--".

In SUMO:

Main menu | Utilities | File utilities | JSON=>TSV

A file selection dialog will open up.

Select the JSON files you wish to convert.

In

The sample JSON-fragment:

    {
      "store": {
        "book": [
          { "category": "reference", "author": "Nigel Rees", "title": "Sayings of the Century", "price": 8.95 },
          { "category": "fiction", "author": "Evelyn Waugh", "title": "Sword of Honour", "price": 12.99 },
          { "category": "fiction", "author": "Herman Melville", "title": "Moby Dick", "isbn": "0-553-21311-3", "price": 8.99 },
          { "category": "fiction", "author": "J. R. R. Tolkien", "title": "The Lord of the Rings", "isbn": "0-395-19395-8", "price": 22.99 }
        ],
        "bicycle": { "color": "red", "price": 19.95 }
      }
    }

would be converted to:

root::store:bicycle:color | red |
root::store:bicycle:price | 19.95 |
root::store:book:0:title | Sayings of the Century |
root::store:book:0:author | Nigel Rees |
root::store:book:0:category | reference |
root::store:book:0:price | 8.95 |
root::store:book:1:title | Sword of Honour |
root::store:book:1:author | Evelyn Waugh |
root::store:book:1:category | fiction |
root::store:book:1:price | 12.99 |
root::store:book:2:title | Moby Dick |
root::store:book:2:isbn | 0-553-21311-3 |
root::store:book:2:author | Herman Melville |
root::store:book:2:category | fiction |
root::store:book:2:price | 8.99 |
root::store:book:3:title | The Lord of the Rings |
root::store:book:3:isbn | 0-395-19395-8 |
root::store:book:3:author | J. R. R. Tolkien |
root::store:book:3:category | fiction |
root::store:book:3:price | 22.99 |
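The flattening of nested JSON into such key paths can be sketched with a short recursive function. The key format follows the example above (`root::` prefix, path elements joined by `:` and list positions as numbers); the function name is illustrative and the row order of this sketch follows the input file, so it may differ from SUMO's output order.

```python
import json

def flatten(node, path="root:"):
    """Return (key path, value) pairs for all leaf values of a parsed JSON tree."""
    rows = []
    if isinstance(node, dict):
        for key, value in node.items():
            rows.extend(flatten(value, f"{path}:{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            rows.extend(flatten(value, f"{path}:{index}"))
    else:
        rows.append((path, node))               # leaf: string, number, bool or null
    return rows

text = '{"store": {"book": [{"title": "Moby Dick", "price": 8.99}], "bicycle": {"color": "red"}}}'
for key, value in flatten(json.loads(text)):
    print(f"{key}\t{value}")                    # one tab-separated line per leaf
```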

At several places

Depending on the application it may be valuable to generate random numbers that follow a certain distribution.

- Linear uniform
- Gaussian
- Poisson
- Binomial
- Weibull
- Exponential
- Logarithmic series

In SUMO:

Main menu | Utilities | Random numbers

A parameter dialog opens up:

Select

- Distribution type
- Distribution specific parameters.

In case a distribution requires only one (or no) parameter, the non-required parameter is ignored.

Leave parameter fields empty to use default values.
- Select whether to show (yes) or hide (no) graphs of the generated numbers and their histogram.
- Copy the list of random numbers to the clipboard or save the list to a text file.

Exemplary 1000 Gaussian (mean=0, sdev=1) distributed random numbers:

Graph:

Histogram from this random number series:

Obviously, the relatively small random number set cannot generate a "perfect" Gauss distribution.
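A comparable series can be generated outside SUMO for a quick check. The sketch below uses the Python standard library generator, which is not necessarily the generator used by SUMO; the function names and the number of bins are illustrative.

```python
import random
import statistics

def gaussian_series(n, mean=0.0, sdev=1.0, seed=None):
    """Return n Gaussian distributed random numbers."""
    rng = random.Random(seed)
    return [rng.gauss(mean, sdev) for _ in range(n)]

def histogram(values, n_bins=10):
    """Count values in n_bins equally wide bins between minimum and maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)   # maximum goes into the last bin
        counts[i] += 1
    return counts

numbers = gaussian_series(1000, mean=0.0, sdev=1.0, seed=42)
print(statistics.mean(numbers), statistics.stdev(numbers))
print(histogram(numbers))
```

With only 1000 numbers the sample mean and standard deviation come close to, but do not exactly match, the requested parameters, and the bin counts are visibly uneven.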