SUMO - NGS utilities

Here you may find a few functions to perform basic operation with SAM, BED, ... files
SAM filterFilter SAM files by Flag, Start and length of reference sequence, Q-value
SAM => Population mapGenerate a map of sequence bins hit by sequences from multiple SAM files.
BED toolsA set of tools to work with BED like genomic ranges objects lists
  • View: View (multiple) BED files as line graphs.
  • Features to Bumps: Combine closely neghbouring BED objects to a wider "Bump"
  • 1/2 class test: find genomic regions hit by multiple BED files, respetively differentially hit by two different groups of BED files.
  • Annotate: Annotate BED-format like files with genomic features.
SAM tools
A set of tools to work with SAM/BAM files

  • QC: Generate some diagnostic data / plots from SAM/BAM file.







Filter SAM

In certain screening applications (Crisp/Cas transfection, LAM PCR, ...) NGS sequencing may be applied to identify target products.
Quality information as well as soft clipping information from genome mappers (e.g. Bowtie2) may be used to skip unwanted features.

SUMO can filter SAM files by:
SAM flagValue=0 or 16. Only unique forwared or reverse complemented sequences
Sequence start min/maxRange for Length of 5'-sequence not matching to reference.
The SAM's CIGAR is evaluated. Left- as well as right soft-clipped sequences are used to measure 5'-/3'-non reference sequence parts and resulting matching sequence length.
The non matching sequencs may correspond to adapters, parts of the transfection vectors, ....
Min match lengthMinimum length of sequence matching to reference.
Q-valueThe Quality value (p-value) of the sequence alignment
Map resolutionSUMO may generate a low reolution population map / coverage graph for the whole referenve (genome).
Define a resolution (binsize) for these graphs.
Resolution should be ~1 MegaBP.
Type of outputDefine the otuput SUMO shall generate:
- s : filtered SAM file (default).
- b : filtered BED file. More compact but no sequences.
- c : coverage plot for each individual SAM input file
- h : histograms for size distributon 5'-/3'-soft clippend / length for each individual SAM file
- h : low resolution population map
define any combination of the respective characters.
E.g. "sm" to generate filtered SAM and Population map.

Additionally, SUMO automatically generates integrated Coverage and Size Distribution plots from all samples.


Selece Main menu | Utilities | NGS | Filre SAM

A parameter dialog open up:


Fill in all parameters accordingly.
Drag one - or multiple - SAM files from Windows' explorer into the Name of SAM files Edit field,
double click the edit field or click the "..." button to open a file selection dialog.

SUMO generates an overall report:

For each sample (SAM file) is shown:

Additonal graphs may show:

Histograms for each samples, showing length distribution of





Coverage plot for each sample.

Orange indicates boundaries between chromomes (1-23,x,y).
Blue indicate how often a certain region was hit by the sequences within the respective BAM file.
In the example a resolution of 1000000bp was chosen to generate a coarse overview.
In the example (detection of DSB-sites) the graph indicates very few detected sites accross the genome in an individual sample.






"Population map"

A heatmap like graph shows the distribution of filtered sequences for all samples accross the human genome.


Data columns are grouped by chromosome, to easily recognize conserved patterns between samples.
The colored boxes on top of the heatmap indicate the chromosomes.
Rows represent 1 Mega-BP bins.

In the example, it is easy to see that on chromosome 3 (first yellow box) there is a higly conserved region around 17 M-BPs.
Especially if you zoom into the region:












SAM to population map












BED-tools

A set of tools to work with BED like genomic ranges object lists.

After alingment (mapping) of sequences to a reference genome, sequence reads are often summarized into genomc intervals (e.g. ChipSeq binding sites, ...).
BED files are simpe lists of these genomic intervals and additional information.

BED file structure:
A BED file typically looks like:
chr1  213941196  213942363  Pos1  13  +
chr1  213942363  213943530  Pos2  14  +
chr2  158364697  158365864  Neg1  35  -
chr2  158365864  158367031  Pos1  17  +
chr3  127479365  127480532  Pos1   8  +
chr3  127480532  127481699  Neg2  12  -
...

See a more detailed description of BED files and variants.

At present SUMO supports:



BEDPE file Structure

A BED like file containing basic information for both sequecces of Mate pairs form Paired-End sequecing.

A BED file typically looks like:
10          79359547    79359698    10          79359718    79359869    ST-E00210:236:HYJJ2CCXY:6:1101:3833:1555	44	+	-
KI270438.1	  110047	  110196	KI270438.1	  110098	  110249	ST-E00210:236:HYJJ2CCXY:6:1101:3853:1555	 1	+	-
.	                -1	      -1	.                 -1	      -1	ST-E00210:236:HYJJ2CCXY:6:1101:3975:1555	 0	.	.
14        	89566311	89566461	15	        89566427	89566578	ST-E00210:236:HYJJ2CCXY:6:1101:4239:1555	44	+	-
...












BED-tools - View

A basic tool to visualize content of BED,BEDPE,SAM,BAM file(s) on genomic scale as line graphs:

The graph illustrates two BED files (here in 3D projection:
Chromosomes boundaries are indicated by black dividers, centromers by gray dividers.

Depending on the pre-selected resolution ( here: 10000 BP) x-axes will be scaled.

You may customize the data view applying ProfileViewers options (e.g. Point-/Bar-/Line-Series, color withth, Zoom, ...).
You may use apply ProfileViewers DSP functions to transform the data (Smooth, BIN, NOrmalize, Log-transform, Segment, ...).

From SUMO's main menu select Main menu | Utilities | NGS | BED: View

A parameter dialog opens up:


Data files Supply a single or a semi-colon separated list of data files.
Click the "..." button to open a file system browser and select desired files.
More easyly: drag one or multiple files from Windows Explorer into the BED viewer dialog.
Newly dropped files are added to the list of already selected files.
To clear the list, click the "X" button.
To preview the (first) file from the selection list click the "binoculars" button.
At present SUMOsupports the file formats:
  • BED, file extension "*.bed"
  • GZip compressed BED, file extension "*.bed.gz"
  • BEDPE file extension "*.bedpe"
  • GZip compressed BEDPE, file extension "*.bedpe.gz"
  • SAM, file extension "*.sam"
  • GZip compressed SAM, file extension "*.sam.gz"
  • BAM, file extension "*.bam"
    (Only for files with SAM header << 65 KBytes (e.g Genomic alignments).
SUMO tries to parse the repsective filetypes depending on the file extension.
Thus, striclty stick to the expected extensions.

Value column ID    The data column within the BED file containing the value which shall be displayed on the Y-axes.
Define "0" to just count the coverage.

ResolutionSUMO averages neighbouring BED objects into bins with given size.
In the example bin size (=Resolution) of 10000 BP was selected, resultng in a line graph with 300,000 data points.
You may go down to 100 BP resolution (at least for single BED files) but response time for graphs with ~30 million will be low.

GenomeAt present, only Human genome HG38 is supported (overall size/chromosome dividers).

Reads to processOften one can reacognize te underlying datastructure when analyzing only a small part of the data (e.g. 10Milion reads).
Define "all" to process the complete data file.

ActionWaht to do with the result data:
Display
- Show only th graph.
Save - Save a "Coverage file".
     (A tab delimited text file: Start TAB Stop TAB Count)
Failed - A BEDPE file for mates not passing the defined filter (only for BEDPE)
Or any combination of the three.

Chromosome-filter Entries with "invalid" chromomes id (i.e. something else thean "1".."22","X","y",MT) are always skipped (and entered to the failed BEDPE).
Pair-filter BEDPE entries where Mates are located to different chromosomes are always skipped (and entered to the failed BEDPE).
Size filter Filter BEDPE entries an the target sequence region they span (i.e. Start-Mate1 - Stop-Mate2)
Define "Min,Max". BEDPE entries < Min or > Max will be skipped (and entered to the failed BEDPE).
Score filter Filter BEDPE entries an the score (Column8).
Define "Thresol. BEDPE entries < Thresjold will be skipped (and entered to the failed BEDPE).

Progrss is indicated in the Stauts bar.
You may interrupt the process at any time by pressing ESC-Key or clicking the Break button.











BED-tools - Features to Bumps

Sometimes it may be helpful to agglomerate neighbouring BED objects into larger "bumps".

In the example we are analyzing regions where DNA Double Strand Breaks have occured.
In a region of 1-2 MEGA-BP around the DSB, histon H2AX gets phosphorylated at Serin 139 (γH2AX).
To analyze localization / distribution of DSB sites on the genome, you might use CHIP (Chromatin Immune Precipitation) applying anti-γH2AX-antibodies to extract short genomic regions where γH2AX occurs.
But with e.g. ChipSeq you are identiying only the small binding area for the protein (~100 PB) compared to the huge acivation area.

Thus it may be better to search gemonic regons where many strong ChipSeq peaks (e.g. generated by MACS2) are colocalized and use these "bumps" for further analysis.


From SUMO's main menu select Main menu | Utilities | NGS | BED: Features to bumps

A parameter dialog opens up:



Data files Supply a single or a semi-colon separated list of BED files.
Click the "..." button to open a file system browser and select desired files.
More easyly: drag one or multiple files from Windows Explorer into the BED viewer dialog.
Newly dropped files are added to the list of already selected files.
To clear the list, click the "X" button.
To preview the (first) file from the selection list click the "binoculars" button.

Min bumps sizeMinimum length for a bump of neighbouring BED objects.

Max bumps sizeMaximum length for a single bump of neighbouring BED objects.
If you have huge enriched regions you may get multiple neighbouring / overlapping bumps.

Min feature count   A single bump must have at least this minimum number of features.


All selected files are analyzed independantly.

For each input file a new result BED file is created in the same location where the input file was found.

The resultung BED file contains for each bump:

An example result file may look like:
track name="Bumps" description="SUMO Bump finder"											
#CID	Start	Stop	Name	Peaks	P/MB	Mean-Coverage	SDev	Median	MAD	Length	
1	142964908	143464903	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	11	22.0002200022	22.6363639831543	10.7534351348877	18	5	499995	1
1	143164997	143485917	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	12	37.3924965723545	21.5	11.0412454605103	17	4.5	320920	1
4	9218283	9369432	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	12	79.3918583649247	21.9166660308838	7.55234241485596	19	5	151149	4
4	49131786	49571310	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	19	43.2285836495846	22.2105255126953	8.34910774230957	18	5	439524	4
4	49272653	49650909	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	16	42.2993951186498	23.5	8.10760974884033	22.5	6.5	378256	4
5	49406943	49682821	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	15	54.3718600250836	62.2000007629395	73.2863235473633	27	12	275878	5
7	51964259	52432007	NSCLC_1_cell_line-H2aX_broad_peaks.gappedPeak	10	21.3790331546046	16.5	4.71993398666382	15	2.5	467748	7
...

The image shows a zoom-in to a ~8 MB region:

Green indicates the original MACS2 peaks.
Red the resulting "bumps" (applying the above parameters).











BED-tools - 1/2 class test

Lets assume you performed an analysis resulting in a set of chromosmal regions saved in BED format.

Obvious questions would be:

The BED-tools class-test allows to perform this analysis using BED files as input.

From SUMO's main menu select Main menu | Utilities | NGS | BED: 1/2 class-test

A parameter dialog opens up:


Data files group1 Supply a single or a semi-colon separated list of BED files.
Data files group2 Supply a single or a semi-colon separated list of BED files - ir required.
ResolutionSize of bins in BP.
Resolution downto 100 BP should work.

Score columnThe column in all BED files from which the "intensity" information is extracted and used in the calss test.
To simply count the coverage (coverage of individual samples for a specific bin) define "0"

Log2 transformDefine "yes" to log transform the "intensity values".
But don't forget: for log2 transformation ALL VALUES HAVE TO BE >0
OffsetA contstant to add to the "intensity" values.
Useful to avoid negative number with log2 transformation
Min difference   You may filter bins by
- average "Intensity" (1-class test) or
- differential regulation (2-class test)
Only bins >Min-difference are passed.
Min difference is sign independent.
To bypass this filter supply "0"
p-value   You may filter bins by p-value.
Only bins with p≤p-value are passed.
To bypass this filter define p-value = "1"

a result file is created:

The result file has BED-4 format and additional columns containing:

A (unfiltered) result file may look like:

CID	Start	  Stop  	    Name	  Cnt1	Mean1	SDev1	Cnt2	Mean2	Sdev2	C1-C2	M1-M2	t	p	DF
1	2239000  	 2414000  Segment	0	0	0	1	14.12	9.98	-1	-14.12	-1.41	0.126	3
1	2414000  	 2588000  Segment	0	0	0	2	16.94	3.99	-2	-16.94	-6.00	0.004	3
1	2588000 	 2723000  Segment	1	23.71	16.76	2	17.09	3.78	-1	6.62	0.390	3
1	2723000 	 2901000  Segment	0	0	0	2	18.03	2.44	-2	-18.03	-10.42	0.001	3
1	2901000 	 3506000  Segment	0	0	0	1	15	10.60	-1	-15	-1.41	0.12	3
1	10132000	11563000  Segment	0	0	0	1	16.42	11.61	-1	-16.42	-1.41	0.126	3
1	16054000	17324000  Segment	0	0	0	1	16.29	11.52	-1	-16.29	-1.41	0.126	3
1	19411000	19813000  Segment	0	0	0	1	15.30	10.81	-1	-15.30	-1.41	0.126	3
...











BED-tools - Annotate

Certain NGS applications generate results as "genomic ranges", e.g. LAM-PCR applications, ChipSeq peak detection ...
as e.g. SAM or BED files, or variants of those.

Often it will be reqired to annotate individual objects to their genomic context - i.e. annotate these data files.


SUMO expects:
SUMO scans each data file against the annotation file and adds annotation informaton to each feature of the data file (if possibley) in a newly created file.
Depending of the operation mode additonal other files are generated.


Annotation modes


Several annotation models may be used:
Assume we have an annotation file with coordinates for all transcript of a specie.
ClosestSearch the genes with smallest distance to each individual data object.
Independant of orientation of a transcipt, Within. upstream or donwstream of the transcript
Start of end of the feature.
Closest-5'Only up-stream positon is checked (depending on the transcripts orentation
Closest-3'Only donw-stream positon is checked (depending on the transcripts orentation
Regions around StartFind the transcript where a feature falls into a region around the start of transcript (e.g. promotor region).
Region is defined by Range.
Search region will (be StartTranscript+Range1 .. StartTranscript+Range2.
Define one number (e.g. "2000").
Searh range will be Range1=-2000, Range 2=2000
Or 2 numbers (e.g. "-4000,-2000", thus range1=-4000, range2=-2000).
Regions around StopSame as above, but only searching for transcripts at the and.
WithinSimilar as above.
Search region will cover whole transcript (plus defined ranges, (e.g. Start-4000..Stop-4000)

For Start/Stop/Within search, additonal data files are generated:


Select SUMO | Main menu | Utilities | NGS | Annotate.

A parameter dialog open up:



Annotation file


Drag a file from Windows explorer into respective text field, click the "..." button or double click the edit field to open a file selection dialog.

In the demo we created an annotation file containing gene coordinates from data avialabla from
Encode project.
The file has a BED6 file format:
##description: evidence-based annotation of the human genome (GRCh38), version 27 (Ensembl 90)	Start	Stop	GeneSymbol	Score	Dir	NN1	NN2	NN2	NN2
chr1	11869	14409	DDX11L1		+	HAVANA	gene	gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; 	; level 2; havana_gene "OTTHUMG00000000961.2";
chr1	14404	29570	WASH7P		-	HAVANA	gene	gene_id "ENSG00000227232.5"; gene_type "unprocessed_pseudogene"; 	; level 2; havana_gene "OTTHUMG00000000958.1";
chr1	17369	17436	MIR6859		-	ENSEMBL	gene	gene_id "ENSG00000278267.1"; gene_type "miRNA"; 	; level 3;
chr1	29554	31109	MIR1302		+	HAVANA	gene	gene_id "ENSG00000243485.5"; gene_type "lincRNA"; 	; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
...

Anno columns


A comma separated list of column-IDs containing the required genomic coordinates:
Chromosome, Start,End,Direction,Name,Score

In the example we would specify "1,2,3,4,6".

Feature file


A single (or muliple) feature files.
Drag files from Windows explorer, or click "..." button or double-click the respective edit field to selecect files.
All simultaneously selected file MUST have same structure.

Data files could be SAM files:
M01688:41:000000000-BGR6Y:1:1101:10374:6013	0	chr11	59327808	22	66S38M136S	*	0	0	GTGTGAC...	FGGGGGGG...	AS:i:69	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:13A24	YT:Z:UU	
M01688:41:000000000-BGR6Y:1:1101:12830:7500	0	chr14	77774120	22	63S30M147S	*	0	0	GTGTGAC...	FGGGGGGG...	AS:i:60	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:30	YT:Z:UU	
M01688:41:000000000-BGR6Y:1:1101:21160:9931	0	chr19	41748142	22	66S38M136S	*	0	0	GTGTGAC...0	FGGGGGFG...	AS:i:55	XN:i:0	XM:i:3	XO:i:0	XG:i:0	NM:i:3	MD:Z:5A5A22G3	YT:Z:UU	
...
(The example shows a few records from a bar-coded LAM-PCR sequencing project. Thus first ~60 bp of the sequences are identical (bar-code, LTR), as well as the quality scores.
For this example we would define the FT-columns as "3,4,4,5".
As the SAM file does not contain explicit end of the feature (implicitely available from CIGAR string), annotion might not be perfect.
(NB: Above SAM filter can create BED formated output)

BED like formats:
Chr	Start	Stop	TotCnt	#Samples	F1	F2	F3  F4...
1	1756351	1756500	31	1	0.000	0.000	0.000	31.001...
1	1781251	1781400	3	1	0.000	0.000	0.000	0.000...
1	4048801	4048950	6	1	0.000	0.000	6.001	0.000...																					   
Here we would correctly define the FT-columns as "1,2,3,4".


Distance range



Define the search range for Start/Stop/Within modes.

Define a single values: Search symmetrically around Eg. Start.
Features falling into the range: Start-Range .. Start+Range will be annotated with this annotation object (e.g. gene).

Define two values: Search in the specified range relativ to e.g. Star.
E.g. try to fnd gene's promoter regions with ChipSeq peaks, you would like to search 2kb regions upstream a gene' (transcript's) start.
Thus you might define distance range "-2000,0".


Search mode

See above.

Result files

For each input file a new file is generated. (Query_FileName_NEW).
Additional data columns are appended to the original file containing matched annnotations.

The eample shows the "annotated" matrix from a LAM_PCR expiment with ~30 samples, "Closest" mode:

Chr	Start	Stop	TotCnt	#Smpl	F1	F2	...	F29	F30    Name  CID  Start  Stop  Dir  Pos  Dist
1	5267701	5267850	281	1	0.000	0.000	...	0.000	0.000	AL139823.1	1  5301928	5301928	-	up	-34078
1	5354701	5354850	79	2	0.000	78.001	...	0.000	0.000	AL139823.1	1  5301928	5301928	-	up	47307
1	5887651	5887800	16	1	16.001	0.000	...	0.000	0.000	NPHP4	5862811	2  5862811	-	up	24840
...
Appended columns contain informaton about "closest" object (in this case gene) from the annotaion file:


Objects file

For each input query file a new feature file is generated (QueryFileName_objects):

CIDStartStopNameDirCntLocs
11751231780457NADK+2 1756351,1756500;1781251,1781400;
140129214019508AL805961.2-14048801,4048950;
153019285307394AL139823.1+25267701,5267850;5354701,5354850;
158628115992473NPHP4+15887651,5887800;6037351,6037500;
168343336834618AL590128.1-16834601,6834750;
...
  • The file contains a list of annotation "objects" (genes) wich were hit by the features from your data files.
    One object per line.
    For each object tab-delimited


    Coverage matrix

    For each input query file the coverage by the mapped features of targeted objects (genes) is generated (QueryFileName_covmat):

    The matrix - a tab-delimited text - file accumulates for each targeted annotation object (gene), the distribution of mapped features accross the search range.
    Obviously, it only makes sense to generate such a matrix for objects with same size (i.e. Start/Stop search modes.

    The matrix may be displayed like a gene expression heatmap:












    SAM tools

    A set of tools to work with SAM/BAM files



    SAM-QC

    Generate some diagnostic data / plots from SAM/BAM file.

  • Selet SUMO main menu | Utilities | NGS | SAM-tools | SAM/BAM-QC.

    A parameter dialog opens up:


    Source files Supply a list of files (separated by semicolon ";") to be processed.
    Click the "..."-button to open a file system browser.
    Select one or multiple files.
    Alternatively drag one or multiple files from Windows explorer into Source files text field.
    Presently you may use the file types:
    - SAM files (file extension ".sam") or gzipped sam files (".sam.gz")
    - BAM file ".bam"
    - FastQ (".fastq") or gzipped FastQ (".fastrq.gz")
    Click the binocular button to preview SAM/FastQ files (not for BAM files).
    I.e. just view the first view thousand lines from the file.


    Single/Paired

    Max reads to process      Often it is enough to analyses a part of the reads (e.g. a few million reads).
    This may already give indication for:
    - base compostion biases - not properly removed adapter sequneces
    - overrepresentation of chromosomes, MT, not assigned contigs - cloning / ampification artifacts
    - correctly mapped reads - genomc rearrangements, chimeric reads, low qulity mapping
    - ...
    Specify the number of reads to process from each file (e.g. "1000000").
    Select "all" or leave the field empty to process the whole file.
    For each of the selected files a report file (extension Aold_Fileanme_SSAM-QC.htm") is generated in the same folder where the source file is located.

    See a demo file.

    The analysis sections

    :


    Basics

    Basic statistics about you data file, e.g.:
    Filename="D:\Demo_R1.fastq.gz_Aligned.out.sam.bam"
    Total number of sequences: 7.600566 [MegaSeq]
    Total number of bases: 0.943 [GigaBP]
    Number of UNIQUE mapped sequences [Mega]:	3.304229	(43.47%)
    Number of UNIQUE qualified mapped sequences [Mega]:	3.304229	( 43.47%)
    Number of bases [Giga]:	0.410	( 43.47%)
    AT-Content (%): 47.96
    CG-Content (%): 52.03
    



    Sequence length distribution

    Depending on data preparation (e.g. adapter trimming) and mapping (softclipping of not aligning sequence parts) length of individual sequences may vary.

    A strong variation of sequences from the expected length after the sequencing run may indicate problems during cloning / amplification / sequencing / mapping.

    The graph shows the normalized density distribution as well as the cumulative distribution:



    The example was sequenced with 125 bp.
    The length variation mainly originates from soft clipping by BOWTIE2 mapper.



    Base composition

    Here we analyze the base composition per sequence position across all reads in the SAM/BAM file.

    The "X"-series contains all ambiguity characters (e.g. "N", ...).

    A priori, one would expect a more or less constant distribution of bases across all sequence positions.
    (At least with whole genome / transcriptome sequncing).

    Local inhomogenities may indicate contaminations with not properly removed adapters / bar-code sequences ....

    The sample shows:

    Removal of adapter sequences removes these biases:



    Additionally, colormaps visualize distribution of Di-/Tri-/Tetra-nulceotides across the sequence positions:

    Di-Nucleotides:


    Tri-Nucleotides:


    Tetra-Nucleotides:


    The colored Columns at left side of the map encode the base sequence: ACGT.

    The map colors indicate: ------- =over-represented,      ------- =under-represented,

    The gray ticks below at the bottom of the maps indicate base position, starting with one, 10 BP increments.



    Distribution of Mapping Quality values

    Most sequence mapper tools (e.g. BOWTIE, BWA) generate mapping quality values, indicating the significance of a mapped sequence.

    A priori, one would hope that the vast majority of the newly generated sequences map correctly (e.g. MapQ>30) to the reference.

    Only a smaller fraction of sequences should contain many discrepancies to the reference or should onyl partially match to the reference (e.g. chimeric sequences).
    Structural variants like SNPs should not drop the MapQ values dramatically, chromosomal aberrations or rearrangements should only affect a very small fraction of the new reads.

    The graph show density as well as cumulative distribution of Mapping Quality values:




    Target length

    In paired end reads, SAM/BAM files contain information about length of the region spanned by correctly mapped mate (read) pairs.

    The extracted target length distribution should match the sequencing library size distribution.

    The graph shows normalized density- as well as cumulative distribution of target lengths from all read pairs:


    All target sequneces larger 1000 BP are aggregated in te the 1000 BP channel of the histogram.

    For single-end sequencing this analysis is meaningless - and therefore not shown.



    Distribution of left clippped bases

    The number of soft clipped bases is read from the CIGAR string contained in the SAM / BAM files.

    A priori one would expect thit clipping of seqences should hardly occur in case all adapter sequneces have been completely removed from the sequences before mapping.

    The distribution of left, soft-clipped bases is shown in the graph, both as density as well as cumulative distribution:


    The demo graph shows soft clipping from a not correctly prepared sample.
    More then 10% of the sequences contain clipped sequences up to 80 PB in length. These conatmination origin from not correctly removed adapter sequences.



    Distribution of right clippped bases

    Similar to above, the distriubtion of sequneces clipped from the right end of th sequences is shown.

    Again, a mentionable proportion of long clipped sequence regiosn may indicate not properly prepared sequences (adapters, ...).

    Like above, the graph visualizes both, density as well as cumulative distribution of right soft-clipped bases.



    Reads to chromosomes

    The graph shows the distribution of reads mapped to chromosomes.

    Channels "0" contains all sequences not mapped to any chromosme.
    "1" to "22" represent the autosomes.
    "23","24" the X and Y chromosome.
    "25" the Mitochondrial genome.
    Alternative mappings are assigned to the "0" channel.


    A priori one would expect a rough correlation between chromosome size and fraction of mapped sequences (especially with genome- and exome-sequneces).

    The sample shows a high overrepresentation of chromosome 21 (~25% of allreads).
    A more detailed view uncovers the overrepresented region Chr21:8,200,000-8,400,000.
    This region contains ribosomal s28 gene, indicating that ribosomal RNA depletion during sample preparation failed.

    In case the reads where nt mapped against a (human) reference genome, this grpah is meanngless.



    Performance

    Data processing may take a while:
    (On an I7-3930, 3.2GHz, 1 thread, data on local spinning disk.)
    Typical file size is < 2 MByte.

    Each report file is a single html document containg all graphs as in-line jpg-images.