SeedScan - Diagnostic plots

Diagnostic plots let you assess the consistency of an analysis.
After only a few million sequences (i.e. a few minutes of analysis), the plots reveal unexpected count distributions between barcodes, pairs and identified seeds.

Such problems are in most cases due to:

barcode/pairs/seed sequences that were not correctly adapted
incorrect definition of the barcode position/length in the construct/barcode file
incorrect definition of the seed position/length in the construct/seed file

From SeedScan's Main menu | View select:

Overall counts

While TagScan is running and a pair of NGS sequence files is being evaluated, the overall count numbers are displayed graphically and updated continuously:

The bar chart visualizes:

Purple left bar: Total number of sequence pairs analyzed
Aqua: Number of identified Barcode-1 Sequences:
- bright: total BC1 matches
- dimmed: number of perfect matches
- dark: number of 1-error matches
Red - Failed BC1: Number of sequence pairs where no Barcode1 sequence
(as defined in the barcode list file) was identified in the first (forward) sequence FastQ file.
Teal: Number of identified Barcode-2 Sequences:
- bright: total BC2 matches
- dimmed: number of perfect matches
- dark: number of 1-error matches
Red - Failed BC2: Number of sequence pairs where no Barcode2 sequence
(as defined in the barcode list file) was identified in the second (reverse) sequence FastQ file.
Number of sequence pairs with valid BC1<=>BC2 pairing
Red - Failed-BC-Pairs: Number of sequence pairs where the identified pairing
Barcode1<=>Barcode2 was not valid according to the definition in the Barcode-Pairs file.
Green: Number of identified Seed Sequences:
- bright: total Seed matches
- dimmed: number of perfect matches
- dark: number of 1-error matches
Red - Failed-seed: Number of sequence pairs where the construct sequence extracted from the forward
read could not be matched to any of the seeds defined in the seed sequence list.
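
The split into perfect matches, 1-error matches and failures can be illustrated with a minimal Python sketch. It assumes a plain mismatch count (Hamming distance) at a fixed read position; SeedScan's actual matching algorithm is not described here, and the function names and barcode sequences below are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify_barcode(read, barcodes, start=0):
    """Return (barcode_id, category); category is 'perfect', '1-error' or 'failed'.

    Assumption: the barcode sits at a fixed position `start` in the read and
    at most one mismatch is tolerated. Ties between barcodes are not resolved;
    the first best hit wins.
    """
    best_id, best_dist = None, None
    for bc_id, bc_seq in barcodes.items():
        window = read[start:start + len(bc_seq)]
        if len(window) < len(bc_seq):
            continue  # read too short for this barcode
        dist = hamming(window, bc_seq)
        if best_dist is None or dist < best_dist:
            best_id, best_dist = bc_id, dist
    if best_dist == 0:
        return best_id, "perfect"
    if best_dist == 1:
        return best_id, "1-error"
    return None, "failed"


# Hypothetical barcode list, not taken from a real SeedScan barcode file.
barcodes = {"BC1": "ACGTACGT", "BC2": "TTGGCCAA"}
print(classify_barcode("ACGTACGTTCTTGTGG", barcodes))  # exact hit on BC1
print(classify_barcode("ANGTACGTTCTTGTGG", barcodes))  # one mismatch (N) to BC1
print(classify_barcode("GGGGGGGGTCTTGTGG", barcodes))  # no barcode within one error
```

Reads with an uncalled base (N) near the start, as in the sample reads stored by SeedScan, would fall into the 1-error category under this assumption.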

The numbers on top of the bars indicate:

Top line: Total count per category
Middle line: Fraction of this category relative to all sequence pairs in %
Bottom line: Fraction of this category relative to its respective parent in %
E.g. Perfect match BC1: 97.74% of all BC1 matches are perfect matches.
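
The three label lines differ only in their denominator. The following sketch shows the arithmetic; the counts are hypothetical and chosen so that the parent fraction reproduces the 97.74% of the example, and `bar_label` is not a SeedScan function.

```python
def bar_label(count, total, parent):
    """Return (count, % of all sequence pairs, % of parent category)."""
    pct_total = 100.0 * count / total if total else 0.0
    pct_parent = 100.0 * count / parent if parent else 0.0
    return count, round(pct_total, 2), round(pct_parent, 2)


# Hypothetical counts for illustration only.
total_pairs = 9_000_000   # all sequence pairs analyzed so far
bc1_matches = 4_500_000   # parent category: all BC1 matches
bc1_perfect = 4_398_300   # child category: perfect BC1 matches

print(bar_label(bc1_perfect, total_pairs, bc1_matches))
```

With these numbers the perfect BC1 matches make up about 48.87% of all sequence pairs but 97.74% of their parent category, which is why the middle and bottom lines of a label can differ strongly.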

The example shows a preliminary counting snapshot after ~9 million sequences were analyzed (~2:30 min execution time).
This snapshot indicates a particularly bad analysis result, which is due to several problems/experimental settings:

~50% of sequences: no Barcode1 recognized.
That's a feature, not a bug.
To improve sequence calling on Illumina HTS systems, especially when analyzing highly similar sequences, it is recommended to mix non-specific random DNA sequences with your constructs of interest, typically at a rate of 20%-50% nonsense DNA.
~2% Failed-BC1/BC2: not brilliant but acceptable.
~2% Failed-BC-pairs: acceptable.
~25% Failed-Seeds. This corresponds to a failure of ~50% of the correctly identified barcodes/pairs.
Either a serious cloning problem or, much more probably, an error in the definition of the seed libraries.
In fact, in this example the majority of the samples (barcode pairs) were assigned to the wrong seed library.
=> Stop the analysis, correct the barcode-pairs file, restart the analysis and get ~95% matching seeds.
~70% of the identified seeds are perfect matches; the rest contain one error.
This may be due to an increased sequencing error rate around read bases 35-55 compared with the barcodes (bases 1-20).
An alignment view of a few (~180) randomly picked sequences may illustrate this finding:

The ~25-base linker sequence between barcode and seed shows much higher sequence homogeneity than the vector sequence following the seed.
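
Sequence homogeneity of this kind can be quantified per read position. The sketch below is not part of SeedScan; it assumes the reads are already aligned to the same start position and uses toy sequences. It computes, for each position, the fraction of reads that carry the most common base.

```python
from collections import Counter


def column_conservation(reads):
    """Fraction of reads carrying the most common base at each position.

    Assumption: reads are aligned at position 0; only the length shared by
    all reads is evaluated.
    """
    if not reads:
        return []
    length = min(len(r) for r in reads)
    result = []
    for i in range(length):
        column = Counter(r[i] for r in reads)
        result.append(column.most_common(1)[0][1] / len(reads))
    return result


# Toy reads: the first three positions are constant, the rest vary.
reads = ["ACGTTT", "ACGTAT", "ACGTCA", "ACGAGC"]
print(column_conservation(reads))
```

Positions inside a constant region such as the linker should give values close to 1.0, while positions with many sequencing errors or variable content give lower values.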

Barcodes / pairs

From SeedScan's main menu select:
Main menu | View | BC_Count sums

The graph indicates the count numbers for each

Blue line: Barcodes as defined in Barcode-list file.
Orange lines: Barcode-pair as defined in barcode-pairs-file.

The graph indicates:

A total of 28 barcodes were defined in barcode-list file.
Six of them were not identified at all (BC-11/12/13/25/26/27).
Either: cloning/PCR amplification failed, or samples were not added to multiplex sequencing.
Or, most probably: barcode/pairs files were not correctly adapted for this particular experiment.
The above "empty" barcodes/pairs were simply not used.
Not nice, but no problem.
All abundant barcodes have comparable total counts: the concentrations of the 10 multiplex samples are very similar - good.
All ten barcode pairs (orange line) have comparable counts for the Barcode1/2 pair.
~5% lower counts for the pair compared to its respective Barcode1 - ok.
The pair with Barcode1=BC10-.... has only ~50% of the BC10 counts - strange, no obvious explanation!

Change the scaling of the y-axis (counts) to logarithmic.
Click the Tool button to open the tool box and check the Y-log field:

Now you can see that the barcodes BC11..BC14 and BC28, BC29 have small numbers of counts (a few hundred).
Probably contamination or matches to the random sequences.
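
The same distinction between abundant, low-count and empty barcodes can be made numerically if the counts are available as a table. The sketch below is only an illustration: the 1% cutoff relative to the largest count is an arbitrary assumption, and the counts are hypothetical.

```python
def split_by_abundance(counts, factor=0.01):
    """Split barcode IDs into (abundant, low, empty) lists.

    Assumption: a barcode is 'low' if its count is non-zero but below
    `factor` times the largest count; the cutoff is arbitrary.
    """
    largest = max(counts.values(), default=0)
    threshold = factor * largest
    abundant, low, empty = [], [], []
    for bc_id, n in counts.items():
        if n == 0:
            empty.append(bc_id)
        elif n < threshold:
            low.append(bc_id)
        else:
            abundant.append(bc_id)
    return abundant, low, empty


# Hypothetical counts, not the values of the example graph.
counts = {"BC1": 410_000, "BC2": 395_000, "BC11": 0, "BC14": 300, "BC28": 250}
print(split_by_abundance(counts))
```

Barcodes in the low group are the ones that only become visible on the logarithmic scale.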

Pairs / Seeds

Barcode pairings

In a perfect run, you would only see barcode pairs as defined,
in the example pairs like BC1<=>BC15, BC2<=>BC16, ...
There should not be any other combinations.
For the first barcode of each defined pair, SeedScan counts its pairings with all other barcodes.

Each line represents the first barcode from a defined pair, showing the counts for all possible pairs.

Click 3D to show a 3-D projection of the data lines:

If you change to log scale on the y-axis (click the Tool button, then check Y-log), low-abundance "not allowed" pairs become much easier to recognize:

The pair BC5<=>BC4 must come from sequencing errors.
The very small count numbers for BC12-13 (10-100 counts) may come from contamination or sequencing errors.
We see a constant background (~1%) of all (not allowed) forward-reverse barcode combinations.
This may come from artefacts generated by colony PCR ("barcode hopping").
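
The background of not-allowed pairings can be estimated by counting every observed BC1/BC2 combination and comparing it with the list of defined pairs. The following sketch assumes the pairs are available as Python tuples; the data and the name `pair_summary` are hypothetical.

```python
from collections import Counter


def pair_summary(observed, allowed):
    """Count all observed (BC1, BC2) pairs and report the not-allowed ones.

    Returns (all counts, counts of not-allowed pairs, fraction not allowed).
    """
    counts = Counter(observed)
    total = sum(counts.values())
    invalid = {pair: n for pair, n in counts.items() if pair not in allowed}
    invalid_fraction = sum(invalid.values()) / total if total else 0.0
    return counts, invalid, invalid_fraction


# Hypothetical pair definitions and observations.
allowed = {("BC1", "BC15"), ("BC2", "BC16")}
observed = (
    [("BC1", "BC15")] * 98
    + [("BC2", "BC16")] * 100
    + [("BC1", "BC16")] * 2  # a not-allowed combination
)
counts, invalid, fraction = pair_summary(observed, allowed)
print(invalid, fraction)
```

In this toy data set 2 of 200 pairs are not allowed, i.e. a background of 1%.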

Seed shift

This graph visualizes the abundance of identified seeds that are shifted relative to their expected position within the sequenced constructs.

In the example a seed shift +/- 3bp was enabled.

The graph indicates:

~1% of the seeds had a 1-base deletion in front of the seed sequence (either sequencing error or error in construct/PCR)
~0.5% of the seeds had a 1-base insertion in front of the seed sequence (either sequencing error or error in construct/PCR)
Seeds with larger shifts contribute less than ~0.1% of the identified seeds.
=> You might reduce the seed shift to +/- 1 bp to save processing time.
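
A seed search with a shift window can be sketched as follows. This is a simplified illustration, not SeedScan's implementation: it uses exact matching only (no 1-error matches), and it assumes that a negative shift means the seed starts earlier than expected (e.g. a deletion in front of the seed) and a positive shift means it starts later (an insertion). All names and sequences are hypothetical.

```python
def find_seed(read, seeds, expected_start, seed_len, max_shift=3):
    """Look for a known seed at the expected position, then at growing shifts.

    `seeds` maps seed sequence -> seed ID. Returns (seed_id, shift) or None.
    """
    # Try the expected position first, then -1, +1, -2, +2, ...
    shifts = [0]
    for s in range(1, max_shift + 1):
        shifts.extend([-s, s])
    for shift in shifts:
        start = expected_start + shift
        if start < 0:
            continue
        window = read[start:start + seed_len]
        if len(window) == seed_len and window in seeds:
            return seeds[window], shift
    return None


# Hypothetical seed and flanking sequences.
SEED = "ACGGTCATGC"
seeds = {SEED: "seed-1"}
print(find_seed("TTTT" + SEED + "GGGG", seeds, 4, 10))   # seed at expected position
print(find_seed("TTT" + SEED + "GGGG", seeds, 4, 10))    # one base missing in front
print(find_seed("TTTTT" + SEED + "GGGG", seeds, 4, 10))  # one extra base in front
```

Each additional shift adds two more lookups per read in this sketch, which illustrates why a smaller shift window saves processing time.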

Sample / Seed profiles

The graph shows the abundance of all seeds (x-axis) in the analyzed barcode-pairs (samples).

The example shows:

14 samples transfected with the Gecko-A library (~65000 seeds on the x-axis)
Profiles were captured after ~10 million sequences had been analyzed
Sample "*-14" is the transfection library.
=> all seeds show up at similar (low) abundance.
Samples 3,4,5 are transfected cells with no selection pressure.
Similar distribution as in transfection library.
Samples 1,5,7 are transfected cells under strong selection pressure.
Only very few highly abundant seeds show up.
See Y-scale up to 1500 in track 1 vs. ~50 in tracks 3,4,5 and 14.
Shutdown of these "genes" is highly beneficial for cell survival under the applied selection pressure.
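
The difference between a flat profile and a profile dominated by a few seeds can be expressed as the share of all counts held by the most abundant seeds. The sketch below uses hypothetical toy profiles and is not a SeedScan feature.

```python
def top_share(profile, n=10):
    """Fraction of all counts that belongs to the n most abundant seeds."""
    total = sum(profile)
    if total == 0:
        return 0.0
    return sum(sorted(profile, reverse=True)[:n]) / total


# Hypothetical profiles with 1000 seeds each.
library = [5] * 1000                 # flat: all seeds at similar low abundance
selected = [1500, 900] + [1] * 998   # two seeds dominate under selection

print(top_share(library, n=2))
print(top_share(selected, n=2))
```

A value close to n divided by the number of seeds indicates a flat distribution as in the transfection library; a value close to 1 indicates strong selection for a few seeds.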

Sample reads

For deeper problem analysis, SeedScan stores a small set of reads (the first 1000) as plain text files.
These files are generated in the location specified for the result file.
The different file names are composed of the base file name, as defined by the user, and an extension specific to the data type:

Failed-BC1: A text file with the first ~1000 sequence pairs where no valid Barcode-1 was detected.
("basefilename_Failed_BC1.txt")
Odd lines contain the sequence number in the fastq files (here reads 6,7,8,11).
Even lines contain the forward and reverse read for this sequence pair.

>S-6
GNTAGTGATAATGTCTTGTGGAAAGGACGAAACACCGCCCTCGGGTAGGGATACACAGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTACGTT          GATTCGACTCACTTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTTTCTCAAGATAT
>S-7
TNAGAAACGGCAGAAGTGCCAGCCTGCAACGTACCTTCAAGAAGTCCTTTACCAGCTTTAGCCATAGCACCAGAAACAAAACTAGGGACGGCCTCATCAGG          TTACCGCTACTAAATGCCGCGGATTGGTTTCGCTGAATCAGGTTATTAAAGAGATTATTTGTCTCCAGCCACTTAAGTGAGGTGATTTATGTTTGGTGCTA
>S-8
TNCTGATGAGTTCTTGTGGAAAGGACGAAACACCGACTCAATCCGTGAGGATTGGTGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTA          TACAGAGATCGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAGATCTAG
>S-11
TNGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTAA          CAACGCCGCTAATCAGGTTGTTTCTGTTGGTGCTGATATTGCTTTTGATGCCGACCCTAAATTTTTTGCCTGTTTGGTTCGCTTTGAGTCTTCTTCGGTTC

Failed-BC2: A text file with the sequence pairs where a valid Barcode-1 was detected, but no listed Barcode-2.
("basefilename_Failed_BC2.txt")
Odd lines contain the sequence number in the fastq files (here reads 129,216,231,445)
and the identified barcode ID and sequence. Even lines contain the forward and reverse read for this sequence pair.

>S-129; BC1=GNTCGATATCTGTAGAC
GNTCGATATCTGTAGACTCTTGTGGAAAGGACGAAACACCGACCAGGCAAATATGTTGGAGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTC          AAACAAACACTCACACAAAAAAAATAATTACCCATCCAATTTACCCCCCAAACCCCCCTTTAATTTTAAAATTGTGAATAATAAATCCCATTTTAAAAAAA
>S-216; BC1=GNTCGATTCTGACGTCA
GNTCGATTCTGACGTCATCTTGTGGAAAGGACGAAACACCGAGAACATGAAGTGCGCCTCGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTC          ATCATACTGCTCTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAGATCTA
>S-231; BC1=ANTATGCTAGTA
ANTATGCTAGTATCTTGTGGAAAGGACGAAACACCGAGCTATCGGAAAGTCAAGAGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTA          ATACGACTCAGTAATACTATTCTTTCCCATCCACTTTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGTATGAATAATGCAATTTTTATCAAAAAATA
>S-445; BC1=TNTACTGCACT
TNTACTGCACTTCTTGTGGAAAGGACGAAACACCGCTGCAGCAGTCGGTGACTCTGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTAT          TACAATCATCTACTAATAATAATTACCCTTCAATTTAACCCCCAAACCCCCCATTTCATTTAAAAATTTGGAATAAAACATCCATTTGTCTCAAGAAAAAG

Failed-BCPairs: A text file with sequence pairs where both Barcode-1 and Barcode-2 were within the defined list, but the pairing is not "allowed", i.e. not in the Barcode-Pairs list file.
("basefilename_Failed_BCPairs.txt")
Odd lines contain the sequence number in the fastq files (here reads 32,110,117,211)
and the identified Barcode 1/2 IDs. Even lines contain the forward and reverse read for this sequence pair.

>S-32; BC1=19;BC2=49
CNATTAGTGTAGATTCTTGTGGAAAGGACGAAACACCGTGCAAAGAACTCATATGAGGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGT          ATCGATTACACGTGATTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAATA
>S-110; BC1=28;BC2=63
CNATACGCGAGTATTCTTGTGGAAAGGACGAAACACCGTTTGTTGCTAAACGGTATTGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCAT          TCGATTCGCACTAGTTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAGAT
>S-117; BC1=17;BC2=40
ANTATGCTAGTATCTTGTGGAAAGGACGAAACACCGTCGGCTGGGCCAAAGGAACGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTA          ATCGATCGTCTATGTCTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTATTTTAAAATTGTGGATGAATACTGCCATTTATCTAAAGA
>S-211; BC1=30;BC2=56
ANCGATAGACGCACTCTCTTGTGGAAAGGACGAAACACCGAGAACATTAAGTGCGCCTCGGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCC          GAGCGATTCTATACTATTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAG

Matched seeds: A text file with sequence pairs where everything matched:
Barcode 1/2, valid pair, correct seed sequence identified.
For Clone tracer: Index primer found, seed sequences do not exist.
("basefilename_Matched_Reads.txt")
Odd lines contain the sequence number in the fastq files (here reads 1,2,3,4)
and the identified Barcode 1/2 ID, Pair-ID, ID of the matched seed, ID of the library.
Even lines contain the forward and reverse read for this sequence pair.

>S-1; BC1=15;BC2=49;SeedID=32571;LibID=1
ANCGATTACGTCATCATCTTGTGGAAAGGACGAAACACCGACTCCACCCAAACATCTGGTGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCC          ATCGATTACACGTGATTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAGA
>S-2; BC1=14;BC2=48;SeedID=47980;LibID=1
TNGATACTGTACAGTTCTTGTGGAAAGGACGAAACACCGACCTAGCCAGTGATGGACCAGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCG          TCGATTCTAGCGACTTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAGAT
>S-3; BC1=25;BC2=59;SeedID=71798;LibID=1
TNGAGTCAGTATCTTGTGGAAAGGACGAAACACCGGCTCGTCATATTGCATAAGAGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTCCGTTAT          TCGTGTCTCTATCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAGATCTAG
>S-4; BC1=7;BC2=41;SeedID=51296;LibID=1
GNTCGATATCTGTAGACTCTTGTGGAAAGGACGAAACACCGCTGCAGTGGAACTATTGGAAGTTTTAGAGCTAGAAATAGCAAGTTAAAATAAGGCTAGTC          GATCGATCACGCACAGATCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAATTGTGGATGAATACTGCCATTTGTCTCAAG
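
The sample-read files shown above can be processed with a few lines of Python. The parser below is a minimal sketch based only on the layout visible in the examples (a header line starting with ">", followed by one line with the forward and reverse read separated by whitespace); the function name is hypothetical and the reads in the usage example are shortened for readability.

```python
def parse_sample_reads(lines):
    """Parse header/read line pairs into a list of dictionaries.

    Header fields such as 'BC1=15' become dictionary entries; the read
    number (e.g. 'S-1') is stored under the key 'read'.
    """
    records = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
        elif header is not None:
            fields = [f.strip() for f in header.split(";")]
            info = {"read": fields[0]}
            for field in fields[1:]:
                if "=" in field:
                    key, value = field.split("=", 1)
                    info[key.strip()] = value.strip()
            parts = line.split()
            info["forward"] = parts[0]
            info["reverse"] = parts[1] if len(parts) > 1 else ""
            records.append(info)
            header = None
    return records


# Shortened example in the layout of "basefilename_Matched_Reads.txt".
example = [
    ">S-1; BC1=15;BC2=49;SeedID=32571;LibID=1",
    "ANCGATTACG          ATCGATTACA",
    ">S-6",
    "GNTAGTGATA          GATTCGACTC",
]
for record in parse_sample_reads(example):
    print(record)
```

To read a real file, the lines of the open file object can be passed directly, e.g. `parse_sample_reads(open(path))`, where `path` points to one of the text files listed above.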