BALSA Database Query User Manual

1          E-mail

An e-mail address is required.  All results from the database comparison software are returned by e-mail.

2          Sequences

2.1         Query Sequence

This is the sequence the user wishes to align against a chosen database.  It is referred to as R1.

2.2         Database File

The user may choose one of four databases (PDB40DB, PDB90DB, SCOP_ASTRAL40, SCOP_ASTRAL90) or upload a database of their own.  PDB40DB and PDB90DB are the two databases originally used by Brenner et al. (1998) to compare FASTA, BLAST, and SSEARCH, and later used by the authors of BALSA to evaluate the algorithm (Webb et al., 2002).  PDB40DB and PDB90DB include only domains that are less than 40% and 90% identical, respectively, to any other domain in the database, and contain 1323 and 2079 sequences, respectively.  Additionally, they have been updated to fill in amino acid information that was missing from the original databases.  SCOP_ASTRAL40 and SCOP_ASTRAL90 likewise include only domains that are less than 40% and 90% identical to one another, but are taken from the newest version (1.57) of the ASTRAL database (Brenner et al., 2000; Chandonia et al., 2002).  The database sequence is referred to as R2.

 

NOTE: When uploading your own database all sequences must be in FASTA format.
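Before uploading, it can help to check locally that a file follows the basic FASTA layout (a ">" header line followed by one or more sequence lines).  The sketch below is only an illustration of such a check, not the validation BALSA itself performs; the function name parse_fasta and the example records are hypothetical placeholders.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples.

    Raises ValueError if the text does not follow the basic FASTA layout.
    This is a hypothetical local check, not part of BALSA.
    """
    records = []
    header = None
    chunks = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                if not chunks:
                    raise ValueError(f"record '{header}' has no sequence")
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError(
                    f"line {line_number}: sequence data before any '>' header"
                )
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA records found")
    if not chunks:
        raise ValueError(f"record '{header}' has no sequence")
    records.append((header, "".join(chunks)))
    return records


# Invented placeholder records; these are not real database entries.
example = """>example_1
MKVLSEGEWQ
LVLHVWAKVE
>example_2
SLEAAQKSNV
"""
records = parse_fasta(example)
```

Sequence lines belonging to one record are joined, so a sequence wrapped over several lines is read back as a single string.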

3          Uniform Prior Term

The Bayesian p-value is calculated using the probability ratio of a sequence being a homolog versus not being a homolog.  In a database of sequences of known structure, such as PDB40DB, this probability ratio is known.  If the database contains sequences of unknown structure, so that the number of homologs is not known, a uniform prior of 1 over the number of sequences in the database is used.

 

NOTE: This field only needs to be filled in if the user is uploading their own database and the number of homologs in the database is known.  If the number of homologs in the database being uploaded is not known, then a general uniform prior will be used. 

 

NOTE: When counting homologs, count each homologous pair in only one direction (e.g., 3sdha vs 1flp and 1flp vs 3sdha are considered the same homolog).
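The two quantities described in this section can be sketched as follows: the general uniform prior of 1 over the number of database sequences, and a homolog count that treats each pair as the same regardless of direction.  The function names are hypothetical, and how BALSA combines the homolog count into its probability ratio is not shown here because this manual does not specify it.

```python
def uniform_prior(num_sequences):
    """General uniform prior: 1 over the number of sequences in the database."""
    if num_sequences <= 0:
        raise ValueError("the database must contain at least one sequence")
    return 1.0 / num_sequences


def count_homolog_pairs(pairs):
    """Count homologous pairs once, regardless of the order they are listed in."""
    unique = set()
    for a, b in pairs:
        if a == b:
            continue  # assumption: a sequence paired with itself is not counted
        unique.add(tuple(sorted((a, b))))
    return len(unique)


# PDB40DB contains 1323 sequences (figure taken from the manual).
prior = uniform_prior(1323)

# Both directions of the same pair are counted as one homolog.
n_homologs = count_homolog_pairs([("3sdha", "1flp"), ("1flp", "3sdha")])
```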

4          Level of Significance

Sequences in the database whose Bayesian p-value is less than this level of significance will be returned to the user.
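The selection rule amounts to a strict "less than" filter on the p-values.  The sketch below illustrates it with invented p-values and a hypothetical function name; the default of 0.01 is the value quoted in the Output section of this manual.

```python
def significant_hits(p_values, level=0.01):
    """Return (name, p_value) pairs with p_value < level, smallest p-value first."""
    if not 0.0 < level <= 1.0:
        raise ValueError("level of significance must be in (0, 1]")
    hits = [(name, p) for name, p in p_values.items() if p < level]
    return sorted(hits, key=lambda item: item[1])


# Invented p-values for illustration only.
example_p_values = {
    "db_seq_a": 0.0004,
    "db_seq_b": 0.2,
    "db_seq_c": 0.009,
    "db_seq_d": 0.01,  # equal to the level, so not returned
}
hits = significant_hits(example_p_values)
```

Because the comparison is strict, a sequence whose p-value equals the chosen level is not returned.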

5          Scoring Matrices and Gap Penalties

The algorithm treats each scoring matrix and its gap penalties as a pair.  Additionally, the software limits the number of scoring matrix/gap penalty pairs to 4, because little gain in sensitivity is observed beyond this number and the limit reduces the overall computation time.
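One way to picture a selection is as a list of (matrix, gap opening penalty, gap extension penalty) entries capped at four.  The sketch below is an illustration only: the ScoringPair type and validate_pairs function are hypothetical, and the four entries are the combinations shown in the example output in the Output section.

```python
from typing import NamedTuple

MAX_PAIRS = 4  # limit stated in the manual


class ScoringPair(NamedTuple):
    """A scoring matrix together with its gap penalties (hypothetical type)."""
    matrix: str
    gap_open: int
    gap_extend: int


def validate_pairs(pairs):
    """Check a selection of scoring matrix/gap penalty pairs against the limit."""
    pairs = list(pairs)
    if not pairs:
        # Assumption: at least one pair must be selected.
        raise ValueError("at least one scoring matrix/gap penalty pair is required")
    if len(pairs) > MAX_PAIRS:
        raise ValueError(f"at most {MAX_PAIRS} pairs are allowed, got {len(pairs)}")
    if len(set(pairs)) != len(pairs):
        raise ValueError("duplicate scoring matrix/gap penalty pair")
    return pairs


selection = validate_pairs([
    ScoringPair("BLOSUM45", -12, -1),
    ScoringPair("BLOSUM50", -12, -2),
    ScoringPair("BLOSUM62", -10, -1),
    ScoringPair("BLOSUM62", -12, -1),
])
```

The same matrix may appear more than once, as BLOSUM62 does here, provided the gap penalties differ.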

 

6          3D Alignment

The algorithm returns the posterior alignment distribution as a 3-dimensional graph.  Positions in the query sequence lie along the x-axis and positions in the database sequence along the y-axis.  The z-axis gives P(Residue i, Residue j), the probability that residue i of the query sequence aligns with residue j of the database sequence.
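The data behind such a graph can be thought of as a table with one row per query residue i and one column per database residue j.  The sketch below shows one possible way to read that table, picking the most probable partner for each query residue.  It does not compute the distribution itself; the toy values, the function name, and the 0.5 cutoff are invented for illustration, and indices are 0-based.

```python
def best_partners(posterior, threshold=0.5):
    """For each query residue i, report the database residue j with the highest
    P(Residue i, Residue j), keeping only pairs at or above the threshold.

    posterior[i][j] is the probability that query residue i aligns with
    database residue j.
    """
    result = []
    for i, row in enumerate(posterior):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            result.append((i, j, row[j]))
    return result


# Invented toy table: 3 query residues (rows) by 3 database residues (columns).
toy_posterior = [
    [0.90, 0.05, 0.00],
    [0.10, 0.70, 0.10],
    [0.00, 0.20, 0.30],
]
pairs = best_partners(toy_posterior)
```

In this toy table the third query residue has no partner above the cutoff, so it is left out of the result.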

 

7          Output

The user will receive by e-mail the posterior probability for each scoring matrix/gap penalty combination, the posterior alignment distribution, and the Bayesian p-value for each sequence in the chosen database whose p-value against the query sequence is below 0.01 (or the user-specified level of significance).

 

Example:

> seq 1

> seq 2

P(BLOSUM 45, Gap Opening Penalty = -12, Gap Extension Penalty = -1 | R1, R2) =

P(BLOSUM 50, Gap Opening Penalty = -12, Gap Extension Penalty = -2 | R1, R2) =

P(BLOSUM 62, Gap Opening Penalty = -10, Gap Extension Penalty = -1 | R1, R2) =

P(BLOSUM 62, Gap Opening Penalty = -12, Gap Extension Penalty = -1 | R1, R2) =

 

Bayesian p-value =

 

NOTE: The user can go back and use the BALSA Pairwise Sequence Alignment to obtain the 3-dimensional histograms of the posterior alignment distributions for the database sequences of interest.

 

8          References

  1. Brenner S, Chothia C, and Hubbard TJP. (1998) Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl Acad. Sci. USA, 95, 6073-6078.

  2. Webb BM, Liu JS, and Lawrence CE. (2002) BALSA: Bayesian Algorithm for Local Sequence Alignment. Nucleic Acids Research, 30:5, 1268-1277.

  3. Chandonia JM, Walker NS, Lo Conte L, Koehl P, Levitt M, and Brenner SE. (2002) ASTRAL compendium enhancements. Nucleic Acids Research, 30:1, 260-263.

  4. Brenner SE, Koehl P, and Levitt M. (2000) The ASTRAL compendium for sequence and structure analysis. Nucleic Acids Research, 28:1, 254-256.