Normally, dscan takes the aligned segments submitted by the user and creates a frequency model which is used to scan the database. When a palindromic model is suspected, use palindrome_dscan. palindrome_dscan creates a palindromic model from the submitted segments and uses it to scan the data. palindrome_dscan should only be used with nucleotide data and reverse complement searching should be turned off.
Full sets of E. coli and R. palustris intergenic regions are available as databases for searching. Clicking the E. coli or R. palustris button will load the database file into the Sequences to be searched textbox. Depending on the load on the server, it may take several minutes to load the file. The E. coli intergenic file consists of 2417 sequences. The FASTA headers contain the gene name, the genomic coordinates of the gene and its upstream neighbor, the length of the intergenic region and its genomic coordinates. The E. coli intergenic regions were derived from the E. coli K12 genome entry in RefSeq, downloaded on Feb 28, 2003.
The R. palustris intergenic file consists of 2633 sequences. The FASTA headers contain the gene name, the genomic coordinates of the gene and its upstream neighbor, the length of the intergenic region and its genomic coordinates. The R. palustris intergenic regions were derived from the Rhodopseudomonas palustris CGA009 genome entry in RefSeq (NC_005296), downloaded on Mar 8, 2004.
For scanning a DNA database, dscan allows a choice of either an Identity matrix or a PAM1 DNA scoring matrix. For proteins, a BLOSUM62 matrix is used for scoring. In either case, a product multinomial model may be used instead.
Aligned segments are essentially the alignment produced by Gibbs. A separate list of the aligned segments can be produced using the Create Scan Output option on the Advanced Options page for Gibbs.
The segments are preceded by a mask row describing the fragmentation of the sites: a * for each conserved position and a . for each fragmented column. The first character in the mask must be a '*' and not a '.', i.e., the first position specified must be an ON position and not an OFF position.
For example
**.*****.*..*.*****.**
TTTTTTGATCGTTTTCACAAAA
TTATTTGCACGGCGTCACACTT
AACTGTGAGCATGGTCATATTT
GTATGCAAAGGACGTCACATTA
AGGTGTTAAATTGATCACGTTT
TTATTTGAACCAGATCGCATTA
AATTGTGATGTGTATCGAAGTG
TTGTGTAAACGATTCCACTAAT
TTATCTGCAATTCAGTACAAAA
TAATGTGAGTTAGCTCACTCAT
TTCTGTAACAGAGATCACACAA
TTTCGTGATGTTGCTTGCAAAA
AATTGTGACACAGTGCAAATTC
ATGCCTGACGGAGTTCACACTT
GATTGTGATTCGATTCACATTT
TGTTGTGATGTGGTTAACCCAA
CGGTGTGAAATACCGCACAGAT
ATTTGTGAGTGGTCGCACATAT
dscan will create a frequency model from the conserved columns and use it to search the database sequences for similar sites.
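As a rough illustration of what "a frequency model from the conserved columns" can mean, the sketch below counts letters at each ON position of the mask. It is not dscan's implementation: the function names are hypothetical, the A T C G column order is taken from the count matrix description in this document, and any pseudocounts or background correction that dscan may apply are left out.

```python
from collections import Counter

ALPHABET = "ATCG"  # nucleotide column order described in this document


def parse_segments(text):
    """Split the input into the mask line and the aligned segments."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    mask, segments = lines[0], [s.upper() for s in lines[1:]]
    if mask[0] != "*" or set(mask) - set("*."):
        raise ValueError("mask must start with '*' and contain only '*' and '.'")
    for seg in segments:
        if len(seg) != len(mask):
            raise ValueError(f"segment {seg!r} does not match mask length {len(mask)}")
    return mask, segments


def count_matrix(mask, segments):
    """Return one row of counts per mask position; OFF positions stay all zero."""
    rows = []
    for i, flag in enumerate(mask):
        if flag == "*":
            counts = Counter(seg[i] for seg in segments)
            rows.append([counts.get(letter, 0) for letter in ALPHABET])
        else:
            rows.append([0] * len(ALPHABET))
    return rows


# The mask and the first three segments from the example above.
example = """\
**.*****.*..*.*****.**
TTTTTTGATCGTTTTCACAAAA
TTATTTGCACGGCGTCACACTT
AACTGTGAGCATGGTCATATTT
"""
mask, segments = parse_segments(example)
matrix = count_matrix(mask, segments)
for flag, row in zip(mask, matrix):
    print(flag, *row)
```

Dividing each ON row by its row total would turn the counts into frequencies.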
A count matrix is similar to the frequency matrix output by Gibbs. It contains a list of the counts of each nucleotide or amino acid for each position in the matrix.
The matrix begins with a mask row describing the fragmentation of the sites: a * for each conserved position and a . for each fragmented column. The first character in the mask must be a '*' and not a '.', i.e., the first position specified must be an ON position and not an OFF position.
Rows following the model mask specify model positions. There must be one row for each position in the model, including OFF positions. For example, if the model mask is 18 characters long, with 14 ON positions and 4 OFF positions, 18 rows of data must follow the mask in the frequency matrix. Data in the OFF positions of the matrix is ignored, but it must be numeric, i.e., not alphabetic characters.
Columns specify counts of each possible alphabet letter.
- nucleotide data: column order is A T C G. This order is compatible with Gibbs.
- protein data: alphabetical order of the single-letter amino acid codes, i.e.,
A C D E F G H I K L M N P Q R S T V W Y
Values in a frequency matrix may be integers or floats. This means that both probability matrices (with each position summing to 1.0) and count matrices are allowed.
Note: count matrices only work with dscan, not palindrome_dscan.
For example
**.*****.*..*.*****.**
7 7 1 2
6 9 0 2
0 0 0 0
0 14 3 0
0 3 1 13
0 16 1 0
3 1 0 13
16 0 1 0
0 0 0 0
4 3 6 4
0 0 0 0
0 0 0 0
1 4 4 8
0 0 0 0
0 12 2 3
0 1 16 0
13 1 0 3
2 1 14 0
14 2 0 1
0 0 0 0
7 10 0 0
5 10 1 1
dscan will create a frequency model from the conserved columns and use it to search the database sequences for similar sites.
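The format rules for a count matrix (mask starting with '*', one numeric row per mask position, one column per alphabet letter) can be checked before submitting. The following is a minimal sketch of such a check, not part of dscan; the function names and the alphabet_size parameter are hypothetical.

```python
def parse_count_matrix(text, alphabet_size=4):
    """Parse a mask line followed by one numeric row per mask position."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")
    mask = lines[0]
    if mask[0] != "*" or set(mask) - set("*."):
        raise ValueError("mask must start with '*' and contain only '*' and '.'")
    rows = []
    for ln in lines[1:]:
        # float() raises ValueError on alphabetic data, including in OFF rows.
        values = [float(tok) for tok in ln.split()]
        if len(values) != alphabet_size:
            raise ValueError(f"expected {alphabet_size} columns, got {len(values)}")
        rows.append(values)
    if len(rows) != len(mask):
        raise ValueError(f"mask has {len(mask)} positions but {len(rows)} rows follow")
    return mask, rows


def on_rows(mask, rows):
    """Keep only the rows at ON ('*') positions; OFF rows are ignored."""
    return [row for flag, row in zip(mask, rows) if flag == "*"]


# A small made-up matrix: integer counts and probabilities are both accepted.
small = """\
*.*
7 7 1 2
0 0 0 0
0.1 0.7 0.1 0.1
"""
small_mask, small_rows = parse_count_matrix(small)
print(on_rows(small_mask, small_rows))
```

For protein data, alphabet_size would be 20 under the column order given above.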
The program normally scans the data in the forward direction only. For nucleotide data, it is common to search both the forward and reverse complement strands. Checking this option will cause the program to scan both strands. Sites found in the reverse complement direction will be marked with an R. Note: when scanning for repeats, the sequences are scanned in the forward direction and then in the reverse direction for all repeats. Thus, dscan may miss cases where a site model appears in the same sequence in both the forward and reverse directions.
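To make the reverse complement relationship concrete, here is a small sketch (the helper name is hypothetical and this is not dscan code). It uses the two cole1 hits from the sample output later in this document, where the forward site at positions 61 to 82 and the site marked R at positions 82 to 61 cover the same stretch of sequence.

```python
# Translation table for the standard DNA complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


forward_site = "TTTTTTGATCGTTTTCACAAAA"  # cole1, forward strand, 61..82
print(reverse_complement(forward_site))   # TTTTGTGAAAACGATCAAAAAA
```

The printed string matches the site reported for cole1 R.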
The E-value is the number of sites with the same score or better that we would expect to find in a random database of the same size.
The p-value is the probability of finding a profile score of at least the value of the highest scoring segment in the sequence in a random sequence of the same length. A Bonferroni adjustment is made to adjust for the number of possible segments. If searching for multiple sites or multiple motifs, a different Bonferroni adjustment is made. In all cases, a second Bonferroni adjustment is made for the size of the database, either by multiplying by the number of sequences or by the effective size of the database. See Neuwald, Liu and Lawrence, Gibbs motif sampling, Protein Science (1995) 4:1618-1632 for details.
By default, dscan prints all sites found with an E-value less than the cutoff and the top p-value with a -log10(p-value) greater than the cutoff or in the case of repeats, all p-values above the cutoff. It is possible, instead, to print the top N values regardless of the E-value cutoff.
Normally, the size of the database is adjusted for by multiplying the adjusted p-value for a sequence by the number of sequences in the database. When there is a large variation in the length of the sequences, this can underestimate some p-values and overestimate others. When searching for one model without repeats, an alternate adjustment is available which multiplies the p-value calculated for a single sequence by the effective length of the database.
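The two database-size adjustments can be sketched as follows. This is an illustration under stated assumptions, not dscan's code: the function names are hypothetical, the effective length is assumed to be the number of possible site start positions summed over all sequences, and capping the result at 1.0 is a choice made here for readability.

```python
def effective_size(seq_lengths, site_width):
    """Assumed definition: possible start positions summed over all sequences."""
    return sum(max(length - site_width + 1, 0) for length in seq_lengths)


def adjust_by_sequence_count(p_sequence, n_sequences):
    """Default adjustment: per-sequence p-value times the number of sequences."""
    return min(1.0, p_sequence * n_sequences)


def adjust_by_effective_size(p_single, seq_lengths, site_width):
    """Alternate adjustment: p-value times the effective length of the database."""
    return min(1.0, p_single * effective_size(seq_lengths, site_width))


# 36 sequences of length 105 and a 22-column model, which is consistent with
# the total length (3780) and average length (105.0) in the sample output.
lengths = [105] * 36
print(effective_size(lengths, 22))
print(adjust_by_effective_size(1e-7, lengths, 22))
```

Under this assumed definition the example gives 36 x 84 = 3024, which agrees with the "Effective size for model 1" line in the sample output in this document. When sequence lengths vary widely, the per-position count weights long sequences more heavily than the per-sequence count does.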
The output below is the result of searching a small database of E. coli intergenic sequences with the segments listed above.
/tmp/dscan21491/dscan21491 /tmp/dscan21491/dbfile.txt /tmp/dscan21491/snfile.txt -P -n -R
C: 740 (0.195767)
G: 740 (0.195767)
A: 1150 (0.304233)
T: 1150 (0.304233)
Total database length: 3780
Effective size for model 1: 3024
average length = 105.0
Distribution of -log10(E-values):
'=' is 1 count.
-4.00 : 5 |=====
-3.00 : 0 |
-2.00 : 3 |===
-1.00 : 2 |==
0.00 : 9 |=========
1.00 : 9 |=========
2.00 : 6 |======
3.00 : 2 |==
4.00 : 0 |
total : 36
mean = 0.55919
stdev = 1.99404
range = -3.48058 .. 3.45876
[3.46] ecomale
5.0 (1.150e-07): 14 TTACCGCCAA TTCTGTAACAGAGATCACACAA AGCGACGGTG 35
[3.12] cole1 R
4.7 (2.498e-07): 82 GGACTTCCAT TTTTGTGAAAACGATCAAAAAA ACAGTCTTTC 61
[2.93] cole1
4.5 (3.865e-07): 61 GAAAGACTGT TTTTTTGATCGTTTTCACAAAA ATGGAAGTCC 82
[2.66] (tdr)
4.2 (7.304e-07): 78 TTGAAAGTTA ATTTGTGAGTGGTCGCACATAT CCTGTT 99
[2.59] ecomale R
4.1 (8.482e-07): 35 CACCGTCGCT TTGTGTGATCTCTGTTACAGAA TTGGCGGTAA 14
[2.54] ecolac
4.1 (9.509e-07): 9 AACGCAAT TAATGTGAGTTAGCTCACTCAT TAGGCACCCC 30
[2.44] ecodaop
4.0 (1.206e-06): 7 AGTGAA TTATTTGAACCAGATCGCATTA CAGTGATGCA 28
[2.16] ecobgirl
3.7 (2.280e-06): 76 CAAAGTTAAT AACTGTGAGCATGGTCATATTT TTATCAAT 97
time: 0 seconds (0.00 minutes)
The first part of the output lists the nucleotide or amino acid content of the database along with the total database size. Next is a histogram of the distribution of -log10(E-values). The list of sites matching the model follows. The number in square brackets is the -log10(expectation value). The higher this number is, the less likely it would be to find a site with a score equal to or greater than the one found in a random database of the same size. This value is followed by the FASTA header of the sequence. Sites found in reverse complement have an R after the header.
The next line lists the -log10(adjusted p-value) followed by the raw p-value in parentheses. Immediately following are the starting position of the site found, some flanking sequence and the site itself. After this come the remaining flanking sequence and the site ending position.
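The two-line hit records described above can be read back with a short script. The sketch below is based only on the sample output shown in this document; the regular expressions, the Hit class and the parse_hits function are hypothetical, and the parser assumes that a trailing " R" on the header line is the reverse complement marker and that both flanks are present when the site is extracted.

```python
import re
from dataclasses import dataclass

# "[3.12] cole1 R" -> -log10(E-value), FASTA header, optional R marker
HEADER_RE = re.compile(r"^\[\s*(-?\d+(?:\.\d+)?)\]\s+(.*?)(\s+R)?\s*$")
# "4.7 (2.498e-07): 82 <flank> <site> <flank> 61"
SITE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s+\(([^)]+)\):\s+(\d+)\s+(.*?)\s+(\d+)\s*$")


@dataclass
class Hit:
    neg_log10_e: float
    header: str
    reverse: bool
    neg_log10_adj_p: float
    raw_p: float
    start: int
    site: str
    end: int


def parse_hits(text):
    """Collect one Hit per header line followed by a site line."""
    hits, pending = [], None
    for line in text.splitlines():
        h = HEADER_RE.match(line)
        if h:
            pending = (float(h.group(1)), h.group(2), h.group(3) is not None)
            continue
        s = SITE_RE.match(line)
        if s and pending:
            tokens = s.group(4).split()
            # With both flanks present the site is the middle token; otherwise
            # the flanks cannot be told apart, so keep everything.
            site = tokens[1] if len(tokens) == 3 else " ".join(tokens)
            hits.append(Hit(*pending, float(s.group(1)), float(s.group(2)),
                            int(s.group(3)), site, int(s.group(5))))
            pending = None
    return hits


sample = """\
[3.46] ecomale
5.0 (1.150e-07): 14 TTACCGCCAA TTCTGTAACAGAGATCACACAA AGCGACGGTG 35
[3.12] cole1 R
4.7 (2.498e-07): 82 GGACTTCCAT TTTTGTGAAAACGATCAAAAAA ACAGTCTTTC 61
"""
for hit in parse_hits(sample):
    print(hit.header, "R" if hit.reverse else "F", hit.start, hit.end, hit.site)
```

A FASTA header that itself ends in " R" would be misread as a reverse complement hit, so the result should be checked against the original output.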