SUMO - NGS utilities

Here you may find a few functions to perform basic operation with SAM, BED, ... files

SAM filter	Filter SAM files by Flag, Start and length of reference sequence, Q-value
SAM => Population map	Generate a map of sequence bins hit by sequences from multiple SAM files.
BED tools	A set of tools to work with BED like genomic ranges objects lists	View: View (multiple) BED files as line graphs. Features to Bumps: Combine closely neghbouring BED objects to a wider "Bump" 1/2 class test: find genomic regions hit by multiple BED files, respetively differentially hit by two different groups of BED files. Annotate: Annotate BED-format like files with genomic features.
SAM tools	A set of tools to work with SAM/BAM files	QC: Generate some diagnostic data / plots from SAM/BAM file.

Filter SAM

In certain screening applications (Crisp/Cas transfection, LAM PCR, ...) NGS sequencing may be applied to identify target products.
Quality information as well as soft clipping information from genome mappers (e.g. Bowtie2) may be used to skip unwanted features.

SUMO can filter SAM files by:

SAM flag	Value=0 or 16. Only unique forwared or reverse complemented sequences
Sequence start min/max	Range for Length of 5'-sequence not matching to reference. The SAM's CIGAR is evaluated. Left- as well as right soft-clipped sequences are used to measure 5'-/3'-non reference sequence parts and resulting matching sequence length. The non matching sequencs may correspond to adapters, parts of the transfection vectors, ....
Min match length	Minimum length of sequence matching to reference.
Q-value	The Quality value (p-value) of the sequence alignment
Map resolution	SUMO may generate a low reolution population map / coverage graph for the whole referenve (genome). Define a resolution (binsize) for these graphs. Resolution should be ~1 MegaBP.
Type of output	Define the otuput *SUMO* shall generate: - s : filtered SAM file (default). - b : filtered BED file. More compact but no sequences. - c : coverage plot for each individual SAM input file - h : histograms for size distributon 5'-/3'-soft clippend / length for each individual SAM file - h : low resolution population map define any combination of the respective characters. E.g. "sm" to generate filtered SAM and Population map.

Additionally, SUMO automatically generates integrated Coverage and Size Distribution plots from all samples.

Selece Main menu | Utilities | NGS | Filre SAM

A parameter dialog open up:

Fill in all parameters accordingly.
Drag one - or multiple - SAM files from Windows' explorer into the Name of SAM files Edit field,
double click the edit field or click the "..." button to open a file selection dialog.

SUMO generates an overall report:

For each sample (SAM file) is shown:

total number of analyzed sequences
number of sequences matching to any of the chromosomes
number of sequences with a single unique match to the genome
number of sequences passing the filter

Additonal graphs may show:

Histograms for each samples, showing length distribution of

5'-soft clipped sequence part (i.e. 5'-Linker/adapters not matching to the reference genome (blue).
In the example a maximums appears ~65bp, corresponding to the expected adapter length.
A second maximum around 150bp may indicate constructs with a duplicated adpater.
Length of sequence homologous to the reference genome (orange).
Large number of very short inserts (25-60)
This graph starts at 25bp - conforming to the lower sequence length threshold defined in the parameter dialog.
3'-soft clipped sequence part, not matching to the reference (red).

Coverage plot for each sample.

Orange indicates boundaries between chromomes (1-23,x,y).
Blue indicate how often a certain region was hit by the sequences within the respective BAM file.
In the example a resolution of 1000000bp was chosen to generate a coarse overview.
In the example (detection of DSB-sites) the graph indicates very few detected sites accross the genome in an individual sample.

"Population map"

A heatmap like graph shows the distribution of filtered sequences for all samples accross the human genome.

Data columns are grouped by chromosome, to easily recognize conserved patterns between samples.
The colored boxes on top of the heatmap indicate the chromosomes.
Rows represent 1 Mega-BP bins.

In the example, it is easy to see that on chromosome 3 (first yellow box) there is a higly conserved region around 17 M-BPs.
Especially if you zoom into the region:

SAM to population map

BED-tools

A set of tools to work with BED like genomic ranges object lists.

View: View (multiple) BED files as line graphs.
Features to Bumps: Combine closely neghbouring BED objects to a wider "Bump"
1/2 class test: find genomic regions hit by multiple BED files, respetively differentiall hit by two different groups of BED files.
Annotate: Annotate BED-format like files with genomic features.

After alingment (mapping) of sequences to a reference genome, sequence reads are often summarized into genomc intervals (e.g. ChipSeq binding sites, ...).
BED files are simpe lists of these genomic intervals and additional information.

BED file structure:

ASCII text file with one object per line.
Each line contains several tab delimited data fields
1. field: chrom - name of the chromosome or scaffold.
Any valid seq_region_name can be used,
and chromosome names can be given with or without the 'chr' prefix.
REQUIRED
2. field: chromStart - Start position of the feature in standard chromosomal coordinates
(i.e. first base is 0).
REQUIRED
3. field: chromEnd - End position of the feature in standard chromosomal coordinates
REQUIRED
all other fields are optional:
- 4. name - Label of the feature
- 5. score - A score for the particular feature between 0 and 1000 (e.g. count or p-value
- 6. strand - defined as + (forward) or - (reverse)
- and others

A BED file typically looks like:

chr1  213941196  213942363  Pos1  13  +
chr1  213942363  213943530  Pos2  14  +
chr2  158364697  158365864  Neg1  35  -
chr2  158365864  158367031  Pos1  17  +
chr3  127479365  127480532  Pos1   8  +
chr3  127480532  127481699  Neg2  12  -
...

See a more detailed description of BED files and variants.

At present SUMO supports:

1. Field: only Chromosome ID: chr1..chr22,chX,chrY,1..22,x,y (non-case sensitive)
Empty lines / comment lines (starting with #) are ignored
in some tools, order of fields may be specified (e.g. Start/Stop/Chrom-ID/Score/Strand

BEDPE file Structure

A BED like file containing basic information for both sequecces of Mate pairs form Paired-End sequecing.

ASCII text file with one object per line.
Each line contains several tab delimited data fields
1. field: Name of the chromosome or scaffold.
Any valid seq_region_name can be used,
and chromosome names can be given with or without the 'chr' prefix.
2. field: Start position of the feature in standard chromosomal coordinates for Mate1
3. field: End position for Mate1
4. field: Name of the chromosome or scaffold for Mate2.
5. field: Start position for Mate2
6. field: End position for Mate2
7. field: Name of the Mate pair
8. field: Score
other fields are optional:

A BED file typically looks like:

10          79359547    79359698    10          79359718    79359869    ST-E00210:236:HYJJ2CCXY:6:1101:3833:1555	44	+	-
KI270438.1	  110047	  110196	KI270438.1	  110098	  110249	ST-E00210:236:HYJJ2CCXY:6:1101:3853:1555	 1	+	-
.	                -1	      -1	.                 -1	      -1	ST-E00210:236:HYJJ2CCXY:6:1101:3975:1555	 0	.	.
14        	89566311	89566461	15	        89566427	89566578	ST-E00210:236:HYJJ2CCXY:6:1101:4239:1555	44	+	-
...

BED-tools - View

A basic tool to visualize content of BED,BEDPE,SAM,BAM file(s) on genomic scale as line graphs:

The graph illustrates two BED files (here in 3D projection:

Green-Line : ChipSeq peaks created with Macs2
Red-line : Clusters of closely neighbouring Macs2 peaks, generated with SUMO (see below)

Chromosomes boundaries are indicated by black dividers, centromers by gray dividers.

Depending on the pre-selected resolution ( here: 10000 BP) x-axes will be scaled.

You may customize the data view applying ProfileViewers options (e.g. Point-/Bar-/Line-Series, color withth, Zoom, ...).
You may use apply ProfileViewers DSP functions to transform the data (Smooth, BIN, NOrmalize, Log-transform, Segment, ...).

From SUMO's main menu select Main menu | Utilities | NGS | BED: View

A parameter dialog opens up:

Data files	Supply a single or a semi-colon separated list of data files. Click the "..." button to open a file system browser and select desired files. More easyly: drag one or multiple files from Windows Explorer into the BED viewer dialog. Newly dropped files are added to the list of already selected files. To clear the list, click the "X" button. To preview the (first) file from the selection list click the "binoculars" button. At present *SUMOsupports the file formats: BED, file extension ".bed" GZip compressed BED, file extension ".bed.gz" BEDPE file extension ".bedpe" GZip compressed BEDPE, file extension ".bedpe.gz" SAM, file extension ".sam" GZip compressed SAM, file extension ".sam.gz" BAM, file extension ".bam" (Only for files with SAM header << 65 KBytes (e.g Genomic alignments). *SUMO* tries to parse the repsective filetypes depending on the file extension. Thus, striclty stick to the expected extensions.
Value column ID	The data column within the BED file containing the value which shall be displayed on the Y-axes. Define "0" to just count the coverage.
Resolution	*SUMO* averages neighbouring BED objects into bins with given size. In the example bin size (=Resolution) of 10000 BP was selected, resultng in a line graph with 300,000 data points. You may go down to 100 BP resolution (at least for single BED files) but response time for graphs with ~30 million will be low.
Genome	At present, only Human genome HG38 is supported (overall size/chromosome dividers).
Reads to process	Often one can reacognize te underlying datastructure when analyzing only a small part of the data (e.g. 10Milion reads). Define "all" to process the complete data file.
Action	Waht to do with the result data: Display - Show only th graph. Save - Save a "Coverage file". (A tab delimited text file: Start TAB Stop TAB Count) Failed - A BEDPE file for mates not passing the defined filter (only for BEDPE) Or any combination of the three.
Chromosome-filter	Entries with "invalid" chromomes id (i.e. something else thean "1".."22","X","y",MT) are always skipped (and entered to the failed BEDPE).
Pair-filter	BEDPE entries where Mates are located to different chromosomes are always skipped (and entered to the failed BEDPE).
Size filter	Filter BEDPE entries an the target sequence region they span (i.e. Start-Mate1 - Stop-Mate2) Define "Min,Max". BEDPE entries < Min or > Max will be skipped (and entered to the failed BEDPE).
Score filter	Filter BEDPE entries an the score (Column8). Define "Thresol. BEDPE entries < Thresjold will be skipped (and entered to the failed BEDPE).

Progrss is indicated in the Stauts bar.
You may interrupt the process at any time by pressing ESC-Key or clicking the Break button.

BED-tools - Features to Bumps

Sometimes it may be helpful to agglomerate neighbouring BED objects into larger "bumps".

In the example we are analyzing regions where DNA Double Strand Breaks have occured.
In a region of 1-2 MEGA-BP around the DSB, histon H2AX gets phosphorylated at Serin 139 (γH2AX).
To analyze localization / distribution of DSB sites on the genome, you might use CHIP (Chromatin Immune Precipitation) applying anti-γH2AX-antibodies to extract short genomic regions where γH2AX occurs.
But with e.g. ChipSeq you are identiying only the small binding area for the protein (~100 PB) compared to the huge acivation area.

Thus it may be better to search gemonic regons where many strong ChipSeq peaks (e.g. generated by MACS2) are colocalized and use these "bumps" for further analysis.

From SUMO's main menu select Main menu | Utilities | NGS | BED: Features to bumps

A parameter dialog opens up:

Data files	Supply a single or a semi-colon separated list of BED files. Click the "..." button to open a file system browser and select desired files. More easyly: drag one or multiple files from Windows Explorer into the BED viewer dialog. Newly dropped files are added to the list of already selected files. To clear the list, click the "X" button. To preview the (first) file from the selection list click the "binoculars" button.
Min bumps size	Minimum length for a bump of neighbouring BED objects.
Max bumps size	Maximum length for a single bump of neighbouring BED objects. If you have huge enriched regions you may get multiple neighbouring / overlapping bumps.
Min feature count	A single bump must have at least this minimum number of features.

All selected files are analyzed independantly.

For each input file a new result BED file is created in the same location where the input file was found.

The resultung BED file contains for each bump:

"#CID" Chromosome
"Start" Start of bump (BP) within chromosome
"Stop" End of bump (BP) within chromosome
"Name" Dummy name (filename)
"Peaks" Number of original BED objects in the derived bump
(in the example: number of MACS2 γH2AX ChipSeq Peaks)
"P/MB" number of original peaks per MegaBase
"Mean coverage" arithmetic mean of "score" from all contained original BED files.
In the example (MACS2 analysis) the "score" column contains the coverage af a MACS2 peak.
"SDev" Standard deviation of scores from all original peaks.
"Median" Median of original scores
"MAD" Median Absolute Deviation (~SDEv for Median)
"length" of the resulting bump (BP)

An example result file may look like:

track name="Bumps" description="SUMO Bump finder"											
#CID	Start	Stop	Name	Peaks	P/MB	Mean-Coverage	SDev	Median	MAD	Length	
1	142964908	143464903	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	11	22.0002200022	22.6363639831543	10.7534351348877	18	5	499995	1
1	143164997	143485917	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	12	37.3924965723545	21.5	11.0412454605103	17	4.5	320920	1
4	9218283	9369432	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	12	79.3918583649247	21.9166660308838	7.55234241485596	19	5	151149	4
4	49131786	49571310	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	19	43.2285836495846	22.2105255126953	8.34910774230957	18	5	439524	4
4	49272653	49650909	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	16	42.2993951186498	23.5	8.10760974884033	22.5	6.5	378256	4
5	49406943	49682821	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	15	54.3718600250836	62.2000007629395	73.2863235473633	27	12	275878	5
7	51964259	52432007	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	10	21.3790331546046	16.5	4.71993398666382	15	2.5	467748	7
...

The image shows a zoom-in to a ~8 MB region:

Green indicates the original MACS2 peaks.
Red the resulting "bumps" (applying the above parameters).

BED-tools - 1/2 class test

Lets assume you performed an analysis resulting in a set of chromosmal regions saved in BED format.

Obvious questions would be:

are there regions conserved between multiple similar samples (replicates) ?
can i find differences between two (or multiple) groups in different sample groups

The BED-tools class-test allows to perform this analysis using BED files as input.

Select the BED files for the groups
- Supply files for both groups => 2-class test differential analysis
- Suuply files for group 1 only => 1 class test
Define the data column used as "intensity" measure (e.g. average coverage from MACS2 bumps).
Define the resolution.
SUMO divides the genome into small bins with RESOLTUTION number of bases (e.g. 1000 BP).
For each bin (in the eample ~3 Millon 1000 BP bins for the human genome), average / standard deviation from the groups' members are computed.
t-test/ANOVA are applied to estimates the individual bin's importance / significance.
t-test are computed with Welsh/Satterthwaite approximation.
A resultng BED file is created contaiing the union/difference of all regions and their statistical result values.
Output may be filtered based on the statistical reults.

From SUMO's main menu select Main menu | Utilities | NGS | BED: 1/2 class-test

A parameter dialog opens up:

Data files group1	Supply a single or a semi-colon separated list of BED files.
Data files group2	Supply a single or a semi-colon separated list of BED files - ir required.
Resolution	Size of bins in BP. Resolution downto 100 BP should work.
Score column	The column in all BED files from which the "intensity" information is extracted and used in the calss test. To simply count the coverage (coverage of individual samples for a specific bin) define "0"
Log2 transform	Define "yes" to log transform the "intensity values". But don't forget: for log2 transformation ALL VALUES HAVE TO BE >0
Offset	A contstant to add to the "intensity" values. Useful to avoid negative number with log2 transformation
Min difference	You may filter bins by - average "Intensity" (1-class test) or - differential regulation (2-class test) Only bins >Min-difference are passed. Min difference is sign independent. To bypass this filter supply "0"
p-value	You may filter bins by p-value. Only bins with p≤p-value are passed. To bypass this filter define p-value = "1"

a result file is created:

with extension "_union" for 1-class tests
with extension "_difference" for 2-class tests

The result file has BED-4 format and additional columns containing:

Count,Mean,SDev for the groups
difference for 2-class test
t-,p-value, Degrees of freedom

A (unfiltered) result file may look like:

CID	Start	  Stop  	    Name	  Cnt1	Mean1	SDev1	Cnt2	Mean2	Sdev2	C1-C2	M1-M2	t	p	DF
1	2239000  	 2414000  Segment	0	0	0	1	14.12	9.98	-1	-14.12	-1.41	0.126	3
1	2414000  	 2588000  Segment	0	0	0	2	16.94	3.99	-2	-16.94	-6.00	0.004	3
1	2588000 	 2723000  Segment	1	23.71	16.76	2	17.09	3.78	-1	6.62	0.390	3
1	2723000 	 2901000  Segment	0	0	0	2	18.03	2.44	-2	-18.03	-10.42	0.001	3
1	2901000 	 3506000  Segment	0	0	0	1	15	10.60	-1	-15	-1.41	0.12	3
1	10132000	11563000  Segment	0	0	0	1	16.42	11.61	-1	-16.42	-1.41	0.126	3
1	16054000	17324000  Segment	0	0	0	1	16.29	11.52	-1	-16.29	-1.41	0.126	3
1	19411000	19813000  Segment	0	0	0	1	15.30	10.81	-1	-15.30	-1.41	0.126	3
...

BED-tools - Annotate

Certain NGS applications generate results as "genomic ranges", e.g. LAM-PCR applications, ChipSeq peak detection ...
as e.g. SAM or BED files, or variants of those.

Often it will be reqired to annotate individual objects to their genomic context - i.e. annotate these data files.

SUMO expects:

Annotation file.
Any file with genomic ranges.
E.g. Gene/transcript coordinates, Promoter regions CpG islands,....
One - or multiple - Data files.
Any file. E.g. SAM, BED, ... All selected files must have same structure.
All Tab delimited text files
One object per line
Genomic coodinates as individual tab separated columns:
- Chromosome (e.g. "chr1", "1")
  SUMO will extract the chromosome number ("1".."23","X","Y")<
- Start position on the respective chromosome in BP
- End position on the respective chromosome in BP
- Reading orientation on the cromosome
  "+" forward (10000->20000)
  "-" reverse (10000<-20000)
- Name of the object
- Order/position of columns can be customized
Data may be supplied as plain ASCII text or gzip compressed files (coming soon).

SUMO scans each data file against the annotation file and adds annotation informaton to each feature of the data file (if possibley) in a newly created file.
Depending of the operation mode additonal other files are generated.

Annotation modes

Several annotation models may be used:
Assume we have an annotation file with coordinates for all transcript of a specie.

Closest	Search the genes with smallest distance to each individual data object. Independant of orientation of a transcipt, Within. upstream or donwstream of the transcript Start of end of the feature.
Closest-5'	Only up-stream positon is checked (depending on the transcripts orentation
Closest-3'	Only donw-stream positon is checked (depending on the transcripts orentation
Regions around Start	Find the transcript where a feature falls into a region around the start of transcript (e.g. promotor region). Region is defined by Range. Search region will (be StartTranscript+Range1 .. StartTranscript+Range2. Define one number (e.g. "2000"). Searh range will be Range1=-2000, Range 2=2000 Or 2 numbers (e.g. "-4000,-2000", thus range1=-4000, range2=-2000).
Regions around Stop	Same as above, but only searching for transcripts at the and.
Within	Similar as above. Search region will cover whole transcript (plus defined ranges, (e.g. Start-4000..Stop-4000)

For Start/Stop/Within search, additonal data files are generated:

List of targeted objects (transcripts).
Subset of targeted annotations, additional data columns contain, number and genomic ranges of matching data features.
Data matrix
indicating distribution of targeted features withing serarch ranges for indivudual annotation object.
You may display the matrix with a heatmap viewer.

Select SUMO | Main menu | Utilities | NGS | Annotate.

A parameter dialog open up:

Annotation file

Drag a file from Windows explorer into respective text field, click the "..." button or double click the edit field to open a file selection dialog.

In the demo we created an annotation file containing gene coordinates from data avialabla from Encode project.
The file has a BED6 file format:

##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90)	Start	Stop	GeneSymbol	Score	Dir	NN1	NN2	NN2	NN2
chr1	11869	14409	DDX11L1		+	HAVANA	gene	gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; 	; level 2; havana_gene "OTTHUMG00000000961.2";
chr1	14404	29570	WASH7P		-	HAVANA	gene	gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; 	; level 2; havana_gene "OTTHUMG00000000958.1";
chr1	17369	17436	MIR6859		-	ENSEMBL	gene	gene_id "ENSG00000278267.1"; gene_type "miRNA"; 	; level 3;
chr1	29554	31109	MIR1302		+	HAVANA	gene	gene_id "ENSG00000243485.5"; gene_type "lincRNA"; 	; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
...

Anno columns

A comma separated list of column-IDs containing the required genomic coordinates:
Chromosome, Start,End,Direction,Name,Score

In the example we would specify "1,2,3,4,6".

Feature file

A single (or muliple) feature files.
Drag files from Windows explorer, or click "..." button or double-click the respective edit field to selecect files.
All simultaneously selected file MUST have same structure.

Data files could be SAM files:

M01688:41:000000000-BGR6Y:1:1101:10374:6013	0	chr11	59327808	22	66S38M136S	*	0	0	GTGTGAC...	FGGGGGGG...	AS:i:69	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:13A24	YT:Z:UU	
M01688:41:000000000-BGR6Y:1:1101:12830:7500	0	chr14	77774120	22	63S30M147S	*	0	0	GTGTGAC...	FGGGGGGG...	AS:i:60	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:30	YT:Z:UU	
M01688:41:000000000-BGR6Y:1:1101:21160:9931	0	chr19	41748142	22	66S38M136S	*	0	0	GTGTGAC...0	FGGGGGFG...	AS:i:55	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:5A5A22G3	YT:Z:UU	
...

(The example shows a few records from a bar-coded LAM-PCR sequencing project. Thus first ~60 bp of the sequences are identical (bar-code, LTR), as well as the quality scores.
For this example we would define the FT-columns as "3,4,4,5".
As the SAM file does not contain explicit end of the feature (implicitely available from CIGAR string), annotion might not be perfect.
(NB: Above SAM filter can create BED formated output)

BED like formats:

Chr	Start	Stop	TotCnt	#Samples	F1	F2	F3  F4...
1	1756351	1756500	31	1	0.000	0.000	0.000	31.001...
1	1781251	1781400	3	1	0.000	0.000	0.000	0.000...
1	4048801	4048950	6	1	0.000	0.000	6.001	0.000...

Here we would correctly define the FT-columns as "1,2,3,4".

Distance range

Define the search range for Start/Stop/Within modes.

Define a single values: Search symmetrically around Eg. Start.
Features falling into the range: Start-Range .. Start+Range will be annotated with this annotation object (e.g. gene).

Define two values: Search in the specified range relativ to e.g. Star.
E.g. try to fnd gene's promoter regions with ChipSeq peaks, you would like to search 2kb regions upstream a gene' (transcript's) start.
Thus you might define distance range "-2000,0".

Search mode

See above.

Result files

For each input file a new file is generated. (Query_FileName_NEW).
Additional data columns are appended to the original file containing matched annnotations.

The eample shows the "annotated" matrix from a LAM_PCR expiment with ~30 samples, "Closest" mode:

Chr	Start	Stop	TotCnt	#Smpl	F1	F2	...	F29	F30    Name  CID  Start  Stop  Dir  Pos  Dist
1	5267701	5267850	281	1	0.000	0.000	...	0.000	0.000	AL139823.1	1  5301928	5301928	-	up	-34078
1	5354701	5354850	79	2	0.000	78.001	...	0.000	0.000	AL139823.1	1  5301928	5301928	-	up	47307
1	5887651	5887800	16	1	16.001	0.000	...	0.000	0.000	NPHP4	5862811	2  5862811	-	up	24840
...

Appended columns contain informaton about "closest" object (in this case gene) from the annotaion file:

Name - here gene name
CID - Chromosome number 1..24 (23=x,24=y)
Start - in BP on respective chromosome
Stop - in BP on respective chromosome
Dir - "+" = forward orientation of gene on chromosome, "-" reverse orientation
Up - feature is found around/upstream start of gene
Down - feature is found around/downstream end of gene
Dist - distance in BP relative to
- Start (upstrema) or
- stop (dwonstream) of the closest gene.

Objects file

For each input query file a new feature file is generated (QueryFileName_objects):

CID	Start	Stop	Name	Dir	Cnt	Locs
1	175123	1780457	NADK	+	2	1756351,1756500;1781251,1781400;
1	4012921	4019508	AL805961.2	-	1	4048801,4048950;
1	5301928	5307394	AL139823.1	+	2	5267701,5267850;5354701,5354850;
1	5862811	5992473	NPHP4	+	1	5887651,5887800;6037351,6037500;
1	6834333	6834618	AL590128.1	-	1	6834601,6834750;
...

The file contains a list of annotation "objects" (genes) wich were hit by the features from your data files.
One object per line.
For each object tab-delimited

CID - Id of chromosome (1..24)
Start - in BP
Stop - in BP
Name
Dir
Cnt - Number of features mapped to this object (gene)
Locs - list of genomic coordiantes of mapped features

Coverage matrix

For each input query file the coverage by the mapped features of targeted objects (genes) is generated (QueryFileName_covmat):

The matrix - a tab-delimited text - file accumulates for each targeted annotation object (gene), the distribution of mapped features accross the search range.
Obviously, it only makes sense to generate such a matrix for objects with same size (i.e. Start/Stop search modes.

The matrix may be displayed like a gene expression heatmap:

SAM tools

A set of tools to work with SAM/BAM files

SAM-QC

Generate some diagnostic data / plots from SAM/BAM file.

Selet SUMO main menu | Utilities | NGS | SAM-tools | SAM/BAM-QC.

A parameter dialog opens up:

Source files	Supply a list of files (separated by semicolon ";") to be processed. Click the "..."-button to open a file system browser. Select one or multiple files. Alternatively drag one or multiple files from Windows explorer into Source files text field. Presently you may use the file types: - SAM files (file extension ".sam") or gzipped sam files (".sam.gz") - BAM file ".bam" - FastQ (".fastq") or gzipped FastQ (".fastrq.gz") Click the binocular button to preview SAM/FastQ files (not for BAM files). I.e. just view the first view thousand lines from the file.
Single/Paired
Max reads to process	Often it is enough to analyses a part of the reads (e.g. a few million reads). This may already give indication for: - base compostion biases - not properly removed adapter sequneces - overrepresentation of chromosomes, MT, not assigned contigs - cloning / ampification artifacts - correctly mapped reads - genomc rearrangements, chimeric reads, low qulity mapping - ... Specify the number of reads to process from each file (e.g. "1000000"). Select "all" or leave the field empty to process the whole file.

For each of the selected files a report file (extension Aold_Fileanme_SSAM-QC.htm") is generated in the same folder where the source file is located.

See a demo file.

The analysis sections

Basics: Basic statistics about your data file.
Sequence length distribution
Base composition
Mapping Quality
Target length
Left clipped
Right clipped
Reads to Chromosomes
Performance

Basics

Basic statistics about you data file, e.g.:

Filename="D:\Demo_R1.fastq.gz_Aligned.out.sam.bam"
Total number of sequences: 7.600566 [MegaSeq]
Total number of bases: 0.943 [GigaBP]
Number of UNIQUE mapped sequences [Mega]:	3.304229	(43.47%)
Number of UNIQUE qualified mapped sequences [Mega]:	3.304229	( 43.47%)
Number of bases [Giga]:	0.410	( 43.47%)
AT-Content (%): 47.96
CG-Content (%): 52.03

Sequence length distribution

Depending on data preparation (e.g. adapter trimming) and mapping (softclipping of not aligning sequence parts) length of individual sequences may vary.

A strong variation of sequences from the expected length after the sequencing run may indicate problems during cloning / amplification / sequencing / mapping.

The graph shows the normalized density distribution as well as the cumulative distribution:

The example was sequenced with 125 bp.
The length variation mainly originates from soft clipping by BOWTIE2 mapper.

Base composition

Here we analyze the base composition per sequence position across all reads in the SAM/BAM file.

The "X"-series contains all ambiguity characters (e.g. "N", ...).

A priori, one would expect a more or less constant distribution of bases across all sequence positions.
(At least with whole genome / transcriptome sequncing).

Local inhomogenities may indicate contaminations with not properly removed adapters / bar-code sequences ....

The sample shows:

A strong base specific bias in the range BP 1-13 and BP 113-125
Overrepresentation of C in the range BP 14-30
Overrepresentation of G in the range BP 75-112

Removal of adapter sequences removes these biases:

Additionally, colormaps visualize distribution of Di-/Tri-/Tetra-nulceotides across the sequence positions:

Di-Nucleotides:

Tri-Nucleotides:

Tetra-Nucleotides:

The colored Columns at left side of the map encode the base sequence: ACGT.

The map colors indicate: ------- =over-represented, ------- =under-represented,

The gray ticks below at the bottom of the maps indicate base position, starting with one, 10 BP increments.

Distribution of Mapping Quality values

Most sequence mapper tools (e.g. BOWTIE, BWA) generate mapping quality values, indicating the significance of a mapped sequence.

A priori, one would hope that the vast majority of the newly generated sequences map correctly (e.g. MapQ>30) to the reference.

Only a smaller fraction of sequences should contain many discrepancies to the reference or should onyl partially match to the reference (e.g. chimeric sequences).
Structural variants like SNPs should not drop the MapQ values dramatically, chromosomal aberrations or rearrangements should only affect a very small fraction of the new reads.

The graph show density as well as cumulative distribution of Mapping Quality values:

Target length

In paired end reads, SAM/BAM files contain information about length of the region spanned by correctly mapped mate (read) pairs.

The extracted target length distribution should match the sequencing library size distribution.

The graph shows normalized density- as well as cumulative distribution of target lengths from all read pairs:

All target sequneces larger 1000 BP are aggregated in te the 1000 BP channel of the histogram.

For single-end sequencing this analysis is meaningless - and therefore not shown.

Distribution of left clippped bases

The number of soft clipped bases is read from the CIGAR string contained in the SAM / BAM files.

A priori one would expect thit clipping of seqences should hardly occur in case all adapter sequneces have been completely removed from the sequences before mapping.

The distribution of left, soft-clipped bases is shown in the graph, both as density as well as cumulative distribution:

The demo graph shows soft clipping from a not correctly prepared sample.
More then 10% of the sequences contain clipped sequences up to 80 PB in length. These conatmination origin from not correctly removed adapter sequences.

Distribution of right clippped bases

Similar to above, the distriubtion of sequneces clipped from the right end of th sequences is shown.

Again, a mentionable proportion of long clipped sequence regiosn may indicate not properly prepared sequences (adapters, ...).

Like above, the graph visualizes both, density as well as cumulative distribution of right soft-clipped bases.

Reads to chromosomes

The graph shows the distribution of reads mapped to chromosomes.

Channels "0" contains all sequences not mapped to any chromosme.
"1" to "22" represent the autosomes.
"23","24" the X and Y chromosome.
"25" the Mitochondrial genome.
Alternative mappings are assigned to the "0" channel.

A priori one would expect a rough correlation between chromosome size and fraction of mapped sequences (especially with genome- and exome-sequneces).

The sample shows a high overrepresentation of chromosome 21 (~25% of allreads).
A more detailed view uncovers the overrepresented region Chr21:8,200,000-8,400,000.
This region contains ribosomal s28 gene, indicating that ribosomal RNA depletion during sample preparation failed.

In case the reads where nt mapped against a (human) reference genome, this grpah is meanngless.

Performance

Data processing may take a while:

~150000 reads/s from fastq /SAM files
~ 70000 reads/s from GZ-compressed / BAM files

(On an I7-3930, 3.2GHz, 1 thread, data on local spinning disk.)
Typical file size is < 2 MByte.

Each report file is a single html document containg all graphs as in-line jpg-images.