SUMO - BLAST wrapper

To search DNA sequences against DNA sequence databases, BLAST is the method of choice.

NCBI offers a free web service: NCBI BLAST.

Response times from the NCBI BLAST web service are sometimes slow, for example due to high server load or slow network connections.

Thus, it may be convenient to perform a local BLAST search on your own computer, using the original NCBI BLAST program and NCBI databases.

SUMO contains a wrapper that lets you conveniently launch BLAST searches and review the results.
As a prerequisite, BLAST and the databases must be installed and set up on your system (see Preferences below).
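
For readers who want to see what such a wrapper does behind the scenes, here is a minimal Python sketch of a local blastn call. It is not SUMO's own code. The options used (-query, -db, -out, -outfmt, -evalue, -num_threads) are standard blastn options; the function names and all file locations are hypothetical and only serve as an illustration.

```python
import subprocess
from pathlib import Path


def build_blastn_command(blastn_exe, query_fasta, database, out_file,
                         evalue=10.0, threads=1):
    """Assemble a blastn command line as a list of arguments."""
    return [
        str(blastn_exe),
        "-query", str(query_fasta),      # FASTA file with the search sequence(s)
        "-db", str(database),            # database prefix (without file extension)
        "-out", str(out_file),           # where blastn writes its report
        "-outfmt", "6",                  # tab-separated table, one line per hit
        "-evalue", str(evalue),          # report only hits up to this E-value
        "-num_threads", str(threads),
    ]


def run_blastn(command):
    """Run blastn and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical locations; adjust them to your own installation.
    cmd = build_blastn_command(
        Path(r"d:\programme\blast\bin\blastn.exe"),
        "query.fas",
        r"BlastDB\Human_RefSeq_mRNA",
        "result.tsv",
        threads=4,
    )
    print(" ".join(cmd))
```

The command is built as a list rather than a single string, so paths containing blanks do not need extra quoting.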

From SUMO's main menu select:

Main Menu | Utilities | BLAST wrapper



Just paste your sequence into the input text field.

Or supply a (multiple) FASTA sequence file in the Query FASTA file text field.
Click the   ...   button to open a file selection dialog, or drag a FASTA file from Windows Explorer and drop it into the Query FASTA file text field.
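
To illustrate what a (multiple) FASTA input looks like to a program, the following Python sketch splits FASTA text into header/sequence pairs. It is an illustration only, not SUMO's parser; the function name and the fallback header "query" for a pasted sequence without a header line are assumptions.

```python
def parse_fasta(text):
    """Split (multiple) FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            # a new header line closes the previous record
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                header = "query"  # raw pasted sequence without a header line
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Sequence lines belonging to one record are joined, so line breaks inside a sequence do not matter.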

Set:
Click the   BLAST   button to execute the search.

Now SUMO

While BLAST is working, SUMO shows the elapsed time in the status bar.

The color of the input text field is set to red.
You may use the   !! KILL !!   button to terminate a long-running BLAST search.
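
The combination of elapsed-time display and kill button can be sketched in Python as a polling loop around a child process. This is an illustration under assumptions, not SUMO's implementation: a time limit (max_seconds) stands in for the user pressing the kill button, and the function and parameter names are hypothetical.

```python
import subprocess
import sys
import time


def run_with_timer(command, max_seconds=None, poll_interval=0.2, report=None):
    """Run command, report elapsed time while it runs, kill it if it takes too long.

    Returns (return_code, elapsed_seconds, was_killed).
    """
    start = time.monotonic()
    process = subprocess.Popen(command)
    killed = False
    while process.poll() is None:          # None means: still running
        elapsed = time.monotonic() - start
        if report is not None:
            report(elapsed)                # e.g. update a status bar
        if max_seconds is not None and elapsed > max_seconds:
            process.kill()                 # stand-in for the kill button
            killed = True
            break
        time.sleep(poll_interval)
    process.wait()                         # collect the exit code
    return process.returncode, time.monotonic() - start, killed


if __name__ == "__main__":
    # Stand-in for a BLAST run: a child process that sleeps for one second.
    demo = [sys.executable, "-c", "import time; time.sleep(1)"]
    code, seconds, killed = run_with_timer(
        demo, report=lambda s: print(f"elapsed: {s:.1f} s"))
    print(code, killed)
```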


When the BLAST search has finished, SUMO loads, parses and displays the result:


The table summarizes some key elements for all hits:

Subject name     Name of the sequence, as extracted from the FASTA header line.
Subject length   Length of the search sequence.
DB-Ref len       Length of the matched sequence in the reference database.
E-value          Expect value, i.e. the number of alignments of at least this quality expected to occur by chance.
                 E-values close to or larger than 1 indicate a non-specific alignment that is expected just by chance.
                 Very small E-values indicate specific, non-random alignments.
                 The E-value obviously depends on the length and quality of the alignment between search and database sequence,
                 but database size and search sequence length also influence it.
Bit score
Overlap length   Contiguous part of the search sequence which matches the respective database sequence.
                 Ideally, Overlap length should be equal to Subject length.
Identity(%)      Fraction of the search sequence which matches the database sequence.
                 Ideally, Identity should be 100, i.e. search and database sequence are identical.
Coverage         Fraction of the search sequence which matches the database.
                 Ideally, Coverage should be 100, i.e. the complete search sequence matches the database.
Ori              Orientation of the search sequence with respect to the database sequence.
Symbol           Gene symbol, extracted from the database sequence.
DB-RefName       Name of the matched sequence in the reference database.
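
How values like those in the table can be derived from raw BLAST output is sketched below in Python. The sketch assumes the default tab-separated BLAST format (-outfmt 6) with its twelve standard columns. How SUMO itself computes its columns is not documented here, so the class, the function names and the coverage formula are assumptions. The last function uses the standard relation E = m * n * 2^(-S') between bit score S', search sequence length m and database length n; BLAST itself uses slightly corrected ("effective") lengths, so the result is only an approximation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    db_id: str
    identity: float      # percent identical positions within the alignment
    overlap: int         # alignment length
    evalue: float
    bit_score: float
    coverage: float      # percent of the search sequence covered by the alignment
    orientation: str     # "+" or "-"


def parse_hit(line, query_length):
    """Parse one line of default BLAST tabular output (-outfmt 6)."""
    f = line.rstrip("\n").split("\t")
    qstart, qend = int(f[6]), int(f[7])
    sstart, send = int(f[8]), int(f[9])
    covered = abs(qend - qstart) + 1
    return Hit(
        query_id=f[0],
        db_id=f[1],
        identity=float(f[2]),
        overlap=int(f[3]),
        evalue=float(f[10]),
        bit_score=float(f[11]),
        coverage=100.0 * covered / query_length,
        # database coordinates run backwards for hits on the minus strand
        orientation="+" if sstart <= send else "-",
    )


def expected_chance_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S'), using uncorrected lengths."""
    return query_length * database_length * 2.0 ** (-bit_score)
```

The formula makes the influence of the database size visible: doubling the database length doubles the E-value of an otherwise unchanged hit.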

Use the tool buttons above the Query sequence text field to:

Get a few demo sequences from Main menu | Demo data.


A BLAST search may be time consuming.
A few timings for a 1600 bp cDNA search against human genome 38 (~3 GB) on an i7-8700K, 4.3 GHz, 64 GB RAM:
Thread number  Time(s)
 1             65
 2             34
 4             19
 6             15
 8             13
10             12
12             11

Memory consumption of blastn.exe stays below 20 MB RAM.
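
To judge how well the search scales with the thread number, the timings can be converted into speedup and efficiency. The Python sketch below takes the rows of the timing table as 1 thread = 65 s up to 12 threads = 11 s; the function name is illustrative.

```python
# (thread number, time in seconds), read from the timing table
TIMINGS = [(1, 65), (2, 34), (4, 19), (6, 15), (8, 13), (10, 12), (12, 11)]


def scaling(timings):
    """Return (threads, seconds, speedup, efficiency) for each timing row.

    speedup    = single-thread time / time with n threads
    efficiency = speedup / n  (1.0 would be perfect scaling)
    """
    base = timings[0][1]  # time of the single-thread run
    rows = []
    for threads, seconds in timings:
        speedup = base / seconds
        rows.append((threads, seconds, speedup, speedup / threads))
    return rows


if __name__ == "__main__":
    for threads, seconds, speedup, efficiency in scaling(TIMINGS):
        print(f"{threads:2d} threads  {seconds:3d} s  "
              f"speedup {speedup:4.1f}  efficiency {efficiency:4.2f}")
```

The numbers show diminishing returns: most of the gain is reached with about 6 threads, which matches the 6 physical cores of the CPU used.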






Preferences

On the Preferences page, define the locations of the BLAST executable and databases:



Click the   ...   buttons to open a file selection dialog for the respective file.
Alternatively, drag a file from Windows Explorer into the respective text field.
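
If a search fails to start, it can help to verify the two locations outside of SUMO. The following Python sketch is an assumption-laden illustration, not part of SUMO: it checks that the executable exists and that a nucleotide database with the given prefix is present, recognized by its sequence file (.nsq) or, for multi-volume databases, its alias file (.nal). The function name is hypothetical.

```python
from pathlib import Path


def check_blast_setup(blastn_exe, db_prefix):
    """Return a list of problems found; an empty list means the setup looks usable."""
    problems = []
    if not Path(blastn_exe).is_file():
        problems.append(f"BLAST executable not found: {blastn_exe}")
    # A database is addressed by its prefix, i.e. the file name without extension.
    # The extension is appended as plain text so that dots in the prefix survive.
    candidates = [Path(str(db_prefix) + ext) for ext in (".nsq", ".nal")]
    if not any(path.is_file() for path in candidates):
        problems.append(f"No nucleotide BLAST database found for prefix: {db_prefix}")
    return problems
```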





Build BLAST-DB

It is recommended to generate a new BLAST reference database, or update the existing one, from time to time.

One way is to download reference sequence files (multiple FASTA format), e.g. from NCBI, and build the database on your system.

If you have a multiple FASTA (mFASTA) file containing your reference sequences, just select this mFASTA file in the Build BLAST-DB text field.

Now click the   Build-BLAST-DB   button.
SUMO will launch the "makeblastdb.exe" utility from the installed BLAST package to generate a BLAST-DB.
The newly generated BLAST-DB is saved into the folder where the selected mFASTA file is located.
Success and error messages from processing are shown in the text box on the Preferences tab sheet.
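
The database build can also be reproduced outside of SUMO. The Python sketch below assembles a makeblastdb call with the same options as the Windows script further down (-input_type, -dbtype, -in, -out). The exact command SUMO issues is not documented here; naming the database after the mFASTA file without its extension is an assumption, as is the function name.

```python
import subprocess
from pathlib import Path


def build_makeblastdb_command(makeblastdb_exe, mfasta_file, dbtype="nucl"):
    """Assemble a makeblastdb command that writes the database next to the mFASTA file."""
    mfasta = Path(mfasta_file)
    # Assumption: database prefix = mFASTA file name without extension, same folder.
    db_prefix = mfasta.with_suffix("")
    return [
        str(makeblastdb_exe),
        "-input_type", "fasta",
        "-dbtype", dbtype,          # "nucl" for DNA, "prot" for protein sequences
        "-in", str(mfasta),
        "-out", str(db_prefix),
    ]


def run_makeblastdb(command):
    """Run makeblastdb and return (success, messages) for display."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr
```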




If you want to update your BLAST-DBs regularly, you might create a Windows script and run it (e.g. once per month) with the Windows scheduler.

Such a Windows script file might look like:
rem
rem get RefSEQ mRNA fasta files from NCBI ftp-site and convert into BlastDB
rem V1.00a, from 31.03.2017, c.schwager[at]dkfz.de
rem
rem get files from ncbi via ftp, using ftp command file ftp.txt
ftp -s:ftp.txt
rem
rem unpack all just downloaded archives
c:\programme\7-zip\7z.exe -y e *.gz
rem
rem append all just unpacked fasta files into one (binary mode avoids a trailing end-of-file character)
copy /b *.fna Human_RefSeq_mRNA.fas
rem
rem build blast database
d:\programme\blast\bin\makeblastdb.exe -input_type fasta -dbtype nucl -in Human_RefSeq_mRNA.fas -out BlastDB\Human_RefSeq_mRNA
rem
rem that's it
pause

Obviously, you have to modify all file and program locations according to your system.
Also, references to locations at NCBI may change over time.


The ftp script might look like:
open ftp.ncbi.nih.gov
anonymous
yourname@yourinstitute.org
cd refseq
cd H_sapiens
cd mRNA_Prot
bin
hash
prompt no
mget human*rna.fna.gz
close
bye
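
The download and unpack steps could also be written as one Python script instead of a Windows script plus ftp command file. The following is a minimal sketch, assuming the server, directory and file name pattern from the ftp script are still valid; the function names are illustrative. It uses only Python's standard library (ftplib, gzip).

```python
import fnmatch
import gzip
import shutil
from ftplib import FTP
from pathlib import Path


def download_matching(host, remote_dir, pattern, target_dir, email):
    """Download all files in remote_dir whose names match pattern (anonymous FTP)."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    downloaded = []
    with FTP(host) as ftp:
        ftp.login("anonymous", email)    # e-mail address serves as the password
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            if fnmatch.fnmatch(name, pattern):
                local = target / name
                with open(local, "wb") as handle:
                    ftp.retrbinary(f"RETR {name}", handle.write)
                downloaded.append(local)
    return downloaded


def concatenate_gzipped(gz_files, combined_file):
    """Decompress each .gz file and append its content to combined_file."""
    with open(combined_file, "wb") as out:
        for gz_file in sorted(gz_files):
            with gzip.open(gz_file, "rb") as src:
                shutil.copyfileobj(src, out)
```

A monthly update would call download_matching with "ftp.ncbi.nih.gov", "refseq/H_sapiens/mRNA_Prot" and "human*rna.fna.gz", pass the returned files to concatenate_gzipped, and finally run makeblastdb on the combined file.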