Data files

Load data as a "single flat-file database" containing gene expression data as well as sample and gene annotations.

SUMO supports several kinds of data files / import:

 

Expression matrices

To load a data file, select Main menu | File | Open Data


 

 

Open data

With this option you can load an expression matrix as a tab- or comma-delimited text file.
In the File-open dialog box, select the corresponding file type from the File-type drop-down list.

SUMO expects:

Such files are easily generated from microarray databases or exported from spreadsheet programs (e.g. MS EXCEL | File save as | Tab delimited text).
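As an illustration of producing such a file from a script, here is a minimal Python sketch. The layout it writes (a header line with sample names, then one line per gene starting with a gene ID) and the names `write_expression_matrix` and `GeneID` are assumptions made for this example, not a layout confirmed by SUMO.

```python
import csv
import io

def write_expression_matrix(handle, sample_names, rows):
    """Write gene rows as tab-delimited text.

    Assumed layout (hypothetical, for illustration only): one header line
    with sample names, then one line per gene starting with its ID.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["GeneID"] + list(sample_names))
    for gene_id, values in rows:
        writer.writerow([gene_id] + list(values))

# Usage: write two genes measured in two samples into an in-memory buffer.
buffer = io.StringIO()
write_expression_matrix(
    buffer,
    ["Sample1", "Sample2"],
    [("GeneA", [7.1, 8.3]), ("GeneB", [5.0, 4.2])],
)
text = buffer.getvalue()
```

Replace the buffer with an opened text file to obtain a file on disk.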

Alternatively, you may drag and drop tab-delimited text files into SUMO.



A file preview window (showing the first few hundred lines of the selected file) opens up:

Double-click the upper-left data cell containing expression data.

The size of the expression matrix is mainly limited by the computer's free RAM.

 

Analysis tree

File name and dimensions of the expression matrix are shown in the analysis tree:

 

Click the Data table node to preview the data table:

The data file is shown in a spreadsheet. For more details see information about data tables.


 

 

SUMO analyses files

Complete analyses generated with SUMO may be saved, including expression data, backup data sets and the statistical tests which have been performed (SAM analyses are not included).

Select

    Main menu | Save analysis

to save an analysis, correspondingly

    Main menu | Load analysis

to load a previously saved analysis.







Amplification data files

SUMO may be used to analyze RT-PCR data.
Data generated with ABI's RQ-Manager software (exported as amplification data files) may be imported into SUMO.
Select:

Main menu | File | Import | ABI rtPCR amplification data

Select one or multiple files.

A file preview window shows up.
Ensure the correct data column (containing the CT values) is selected and load the data files.
SUMO extracts RN-values (which are used as "comments", useful to identify genes with low signal levels generating arbitrary CT-values) and CT-values.

Sometimes the RQ-Manager software reports the CT-value of a very weak signal as "undetermined".

SUMO recognizes such missing values.
It is recommended to replace those values with some meaningful value (e.g. "40", the highest cycle number).
The simplest way is to use Main menu | Adjust data | Data imputation | Row wise | Constant.
Select all samples and define "40" as replacement value.
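For readers who pre-process exported CT values outside SUMO, the same constant replacement can be sketched in Python. The function name `impute_undetermined` is hypothetical, and treating every non-numeric entry as missing is an assumption of this sketch.

```python
def impute_undetermined(ct_values, replacement=40.0):
    """Return CT values as floats, replacing non-numeric entries.

    Assumption: any entry that cannot be parsed as a number (such as
    "Undetermined") counts as a missing value.
    """
    cleaned = []
    for value in ct_values:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            cleaned.append(replacement)
    return cleaned

# Usage: the middle value is replaced by the highest cycle number.
example = impute_undetermined(["23.5", "Undetermined", "31"])
```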

SUMO tries to detect multiplex samples.
If found, SUMO requests a name for multiplex endogenous controls.
In case such controls were used, give the unique name of the controls (or a unique part of the name).
If not, cancel the dialogue.

SUMO now automatically averages replicates, i.e. entries with the same Gene-ID and Sample-ID, even across multiple amplification data files.
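The idea of replicate averaging can be sketched as follows. This is an illustration only, not SUMO's implementation; the record layout `(gene_id, sample_id, ct)` and the function name `average_replicates` are assumptions.

```python
from collections import defaultdict
from statistics import mean

def average_replicates(records):
    """Average CT values sharing the same (gene_id, sample_id) key.

    `records` is an iterable of (gene_id, sample_id, ct) tuples, which may
    have been concatenated from several amplification data files.
    """
    groups = defaultdict(list)
    for gene_id, sample_id, ct in records:
        groups[(gene_id, sample_id)].append(ct)
    return {key: mean(values) for key, values in groups.items()}

# Usage: two replicates of GAPDH in sample S1 collapse into one value.
records = [
    ("GAPDH", "S1", 20.0),
    ("GAPDH", "S1", 21.0),
    ("TP53", "S1", 30.0),
]
averaged = average_replicates(records)
```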

Additionally, SUMO computes averages and standard deviations from both the deltaRN and the CT values and places them into the gene annotations. Such values might be used to filter genes with overall low abundance (i.e. high CT-values, e.g. >35) or low signal (i.e. low deltaRN, e.g. <<1).

A new file containing the averaging information is automatically created
(original filename extended with "_MeanSDevN", e.g. "MyExperiment.sdm-Amplification Data_MeanSDevN.txt").
For each sample it contains three data columns:
- Mean/Median CT-value from all technical replicates
- SDev/MAD
- number of replicates
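The three values per sample can be illustrated with a short Python sketch. Whether SUMO uses the sample or the population standard deviation, and whether its MAD is scaled, is not stated here; this sketch assumes the sample standard deviation and an unscaled median absolute deviation. The function name `replicate_summary` is hypothetical.

```python
from statistics import mean, median, stdev

def replicate_summary(ct_values, robust=False):
    """Return (center, spread, n) for the technical replicates of one sample.

    robust=False: mean and sample standard deviation (assumed).
    robust=True:  median and unscaled median absolute deviation (assumed).
    """
    n = len(ct_values)
    if robust:
        center = median(ct_values)
        spread = median(abs(v - center) for v in ct_values)
    else:
        center = mean(ct_values)
        # A single replicate has no spread; report 0.0 instead of failing.
        spread = stdev(ct_values) if n > 1 else 0.0
    return center, spread, n

# Usage with three technical replicates.
classic = replicate_summary([20.0, 21.0, 22.0])
robust = replicate_summary([20.0, 21.0, 25.0], robust=True)
```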

Now you may use SUMO's functionality to analyse the PCR data.

But keep in mind:
CT-values represent ~log2 values!
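As a worked example of the log2 relationship: assuming roughly 100 % PCR efficiency (the product doubles every cycle), a CT difference of one cycle corresponds to about a two-fold difference in abundance, with the lower CT belonging to the more abundant target. The function name below is hypothetical.

```python
def fold_change_from_ct(ct_a, ct_b):
    """Approximate abundance ratio a/b from two CT values.

    Assumes ~100 % PCR efficiency, i.e. a doubling per cycle, so a lower
    CT means a higher starting amount.
    """
    return 2.0 ** (ct_b - ct_a)

# Usage: target a crosses the threshold two cycles earlier than target b.
ratio = fold_change_from_ct(25.0, 27.0)
```

Here `ratio` is 4.0: two cycles earlier means about four times more starting material.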





Sparse data matrices

In cases where the vast majority of expression values is zero or otherwise non-informative - a sparse matrix - it may be efficient to save the few non-zero matrix cells in coordinate format:


Presently, SUMO supports the MatrixMarket coordinate format, as used by e.g. 10x Genomics' Cell Ranger software.

Both Row (Gene) and Column (Sample) coordinates should be supplied as integer values (in any case, the values read are converted to integer).
Expression values may be integer or floating point, preferably with the English decimal "." and thousands "," separators, although SUMO tries to convert non-English formats.
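To make the coordinate format concrete, here is a minimal Python sketch of a reader. It is not SUMO's parser. It relies on the general MatrixMarket layout: comment lines starting with "%", one size line (rows, columns, number of entries), then one "row column value" line per stored cell with 1-based coordinates. The function name `read_coordinate_matrix` is hypothetical.

```python
def read_coordinate_matrix(lines):
    """Parse MatrixMarket coordinate text into (n_rows, n_cols, cells).

    `cells` maps 0-based (row, col) tuples to float values.
    """
    stripped = (line.strip() for line in lines)
    # Skip blank lines and the "%"-prefixed header/comment lines.
    data_lines = [ln for ln in stripped if ln and not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(tok) for tok in data_lines[0].split())
    cells = {}
    for line in data_lines[1:]:
        row, col, value = line.split()
        # Coordinates are converted to integer and shifted to 0-based.
        cells[(int(float(row)) - 1, int(float(col)) - 1)] = float(value)
    if len(cells) != n_entries:
        raise ValueError("entry count does not match the size line")
    return n_rows, n_cols, cells

# Usage: a 3 x 2 matrix with three stored (non-zero) cells.
example = """%%MatrixMarket matrix coordinate integer general
% genes x samples
3 2 3
1 1 5
2 2 1
3 1 12
"""
shape_and_cells = read_coordinate_matrix(example.splitlines())
```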

SUMO supports "symmetric" (i.e. matrix cell Mi,j = Mj,i) as well as "skew-symmetric" matrices (i.e. matrix cell Mi,j = -Mj,i); these can be used and reconstructed.
Obviously, this is only meaningful with square matrices, and filtering should be skipped (see below).
The "Hermitian" matrix format is not supported, as SUMO is not designed to process complex numbers.
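Reconstructing the full matrix from the stored half can be sketched as below. This is an illustration under the definitions given above, not SUMO's code; `expand_symmetry` and the dictionary representation of cells are assumptions.

```python
def expand_symmetry(cells, symmetry="symmetric"):
    """Mirror stored cells across the diagonal.

    `cells` maps (row, col) to value. For "symmetric" the mirrored cell
    gets the same value, for "skew-symmetric" the negated value.
    """
    sign = {"symmetric": 1.0, "skew-symmetric": -1.0}[symmetry]
    full = dict(cells)
    for (i, j), value in cells.items():
        if i != j:  # diagonal cells have no mirror partner
            full[(j, i)] = sign * value
    return full

# Usage: one off-diagonal cell is mirrored in both variants.
sym = expand_symmetry({(0, 0): 2.0, (1, 0): 3.0})
skew = expand_symmetry({(1, 0): 3.0}, symmetry="skew-symmetric")
```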

The matrix itself contains neither gene nor sample annotations, therefore a second file, the gene list ("features.tsv"), should be supplied and selected for loading together with the sparse-matrix file.

To open a sparse-matrix select

SUMO | Main menu | File | Open data

In the file selection dialog select as file type "Sparse matrix + Feature index":



Select a sparse matrix file (*.mtx) and, optionally, the corresponding feature list.
If either of the two files is GZip compressed, the file should have the extension ".gz".
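Outside SUMO, the same extension-based convention can be used to read such files from a script. A minimal Python sketch follows; the helper name `open_text` and the UTF-8 encoding are assumptions of this example.

```python
import gzip
import os
import tempfile

def open_text(path):
    """Open a plain or GZip-compressed text file for reading.

    The decision is based only on the ".gz" extension.
    """
    if str(path).lower().endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")

# Usage: round-trip a tiny compressed file in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "matrix.mtx.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("1 1 5\n")
    with open_text(path) as handle:
        content = handle.read()
```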

In a first round SUMO builds count distributions for
and shows them as simple "text" histograms in the SUMO Log text box:



A parameter dialog opens up:







Save data as Sparse Matrix

In cases where the vast majority of expression values is zero or otherwise non-informative - a sparse matrix - it may be efficient to save the few non-zero matrix cells in coordinate format:


What does non-informative mean?
In the case of "count data" - e.g. raw gene expression counts from single cell sequencing - a matrix cell with count == 0 is definitely non-informative, anything >=1 may be informative.
But you might want to use a different/higher threshold: an "expression" count of 1 for a gene in a single sample might well be a sequence mapping error or a cellular barcode identification error in single cells.
Thus a higher threshold (e.g. 5) may be selected.

Select from main menu:
SUMO | Main menu | File | Open data

In the file save dialog select "Sparse matrix" as file type.

Add ".gz" to the file names (e.g. "matrix.mtx.gz") to GZip compress the matrix files using GNU GZip.exe.
In case SUMO cannot find "GZip.exe" in the SUMO program folder, it will try to download "GZip.exe" from the SUMO web site.

Next supply a threshold value to filter non-informative data values.
Data cells with abs(Count) below the threshold are skipped, all others will be saved.

In addition to the sparse matrix, a features.tsv (.gz) file is created containing all feature (gene) annotations.
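The threshold filtering described above can be sketched in Python as follows. This is an illustration, not SUMO's writer. It assumes that cells with abs(value) equal to the threshold are kept, and it writes a generic MatrixMarket header; the function name `write_sparse_matrix` is hypothetical.

```python
import io

def write_sparse_matrix(handle, matrix, threshold=1):
    """Write cells with abs(value) >= threshold in coordinate format.

    `matrix` is a list of rows (genes) holding integer counts per sample.
    Returns the number of cells written.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    kept = [
        (i + 1, j + 1, value)  # coordinates are 1-based in the file
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if abs(value) >= threshold
    ]
    handle.write("%%MatrixMarket matrix coordinate integer general\n")
    handle.write(f"{n_rows} {n_cols} {len(kept)}\n")
    for i, j, value in kept:
        handle.write(f"{i} {j} {value}\n")
    return len(kept)

# Usage: with threshold 5 the single count of 1 is dropped as noise.
buffer = io.StringIO()
written = write_sparse_matrix(buffer, [[0, 7, 1], [5, 0, 0]], threshold=5)
output = buffer.getvalue()
```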