Phyloscan: software for locating sequence motifs in intergenic regions

More information

For citing purposes or to find more information:

C. Steven Carmack, Lee Ann McCue, Lee A. Newberg, and Charles E. Lawrence (2007) PhyloScan: Identification of transcription factor binding sites using cross-species evidence. Algorithms for Molecular Biology, 2(1), article 1. PubMed: 17244358, doi: 10.1186/1748-7188-2-1

Detailed description of Phyloscan arguments

Phylogenetic Tree

The phylogenetic tree expresses the evolutionary relationships among the species whose sequences are to be scanned. It should be in Newick format. For example, suppose human and chimp are believed to be more closely related to each other than either is to baboon. If the most recent common ancestor of human and chimp is called chump, the most recent common ancestor of chump and baboon is called chaboon, and the evolutionary distances are:

  • human to chump : 0.005
  • chimp to chump : 0.006
  • chump to chaboon: 0.019
  • baboon to chaboon: 0.032

then you can enter

((human:0.005,chimp:0.006):0.019,baboon:0.032);

If there is only one species, your tree will look like this:

human;

Each distance value is the average number of mutation events expected to occur per neutral/junk sequence position.

Note that if sequences from two species are ever supplied to Phyloscan as aligned, then an attempt should be made to have the phylogenetic tree accurately provide the distance between the two species. (The distance in the phylogenetic tree is the sum of the edge lengths along the unique path through the phylogenetic tree that connects the two species.) Other edge lengths in the phylogenetic tree are immaterial, but must be supplied anyway; please use a value of 10.0 for these edges.
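As a way to check the distances a tree implies before submitting it, the path sum described above can be sketched in Python. This is a minimal illustration, not part of Phyloscan: the helper names parse_newick, leaf_paths, and tree_distance are hypothetical, and the parser assumes a simple tree with unquoted names and optional branch lengths.

```python
def parse_newick(text):
    """Parse a simple Newick string (names and branch lengths only) into nested dicts."""
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1                  # consume ","
                children.append(parse_node())
            pos += 1                      # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = 0.0                      # the root has no incoming edge
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def leaf_paths(tree, prefix=()):
    """Map each leaf name to its root-to-leaf path of (node id, edge length) pairs."""
    step = prefix + ((id(tree), tree["length"]),)
    if not tree["children"]:
        return {tree["name"]: step}
    paths = {}
    for child in tree["children"]:
        paths.update(leaf_paths(child, step))
    return paths


def tree_distance(tree, species_a, species_b):
    """Sum the edge lengths on the unique path connecting two leaves."""
    paths = leaf_paths(tree)
    path_a, path_b = paths[species_a], paths[species_b]
    shared = 0
    while (shared < min(len(path_a), len(path_b))
           and path_a[shared][0] == path_b[shared][0]):
        shared += 1                       # skip the edges common to both paths
    return (sum(length for _, length in path_a[shared:])
            + sum(length for _, length in path_b[shared:]))


tree = parse_newick("((human:0.005,chimp:0.006):0.019,baboon:0.032);")
print(tree_distance(tree, "human", "chimp"))
print(tree_distance(tree, "human", "baboon"))
```

For the example tree, the human to chimp distance is 0.005 + 0.006, and the human to baboon distance is 0.005 + 0.019 + 0.032.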

Sequences to be scanned

Phyloscan will scan a collection of aligned promoter sequences, supplied in MAF format. For instance, consider the region upstream of the human gene abc. If the sequence from human and the orthologous sequence in chimp are alignable, but no reliable alignment exists with the orthologous sequence in baboon, then the baboon sequence is put in its own alignment block, and the input file might appear as:

##maf version=1
a
s human.chrom4.abc     45 32 + 1049595 acgtacgtacgtacgtacgtacgtacgtacgt
s chimp.abc        485868 28 - 1234567 acgtacgt----ACGTACGTACGtacgtacgt

a
s baboon.chrom10.abc  23456 25 +  345678 tctctcttctctctctctgggaaaa

Note that for the first entry on any "s" line (e.g., "human.chrom4.abc"), the text before the first "." should match the name of a species in the phylogenetic tree (e.g., human, chimp, baboon), and the text after the last "." should match with those sequences that are orthologous to it (e.g., the sequence upstream of the human abc gene and its orthologous counterparts are all labeled as "abc"). The text between the first and last "." is ignored by Phyloscan.
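The naming convention above can be checked with a short Python sketch before a file is submitted. It is only an illustration under stated assumptions: SeqLine, parse_s_line, and species_by_label are hypothetical names, not part of Phyloscan, and the size check relies on the MAF rule that the size field counts the non-gap characters of the sequence.

```python
from collections import namedtuple

SeqLine = namedtuple("SeqLine", "species label start size strand src_size text")


def parse_s_line(line):
    """Split one MAF "s" line and extract the species and ortholog label."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "s":
        raise ValueError(f"not a MAF 's' line: {line!r}")
    src, text = fields[1], fields[6]
    species = src.split(".", 1)[0]        # text before the first "."
    label = src.rsplit(".", 1)[-1]        # text after the last "."
    size = int(fields[3])
    if size != len(text.replace("-", "")):
        raise ValueError(f"size field does not match sequence: {line!r}")
    return SeqLine(species, label, int(fields[2]), size,
                   fields[4], int(fields[5]), text)


def species_by_label(maf_text):
    """Group the species names found in a MAF file by ortholog label."""
    groups = {}
    for line in maf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "s":
            record = parse_s_line(line)
            groups.setdefault(record.label, set()).add(record.species)
    return groups


example = """##maf version=1
a
s human.chrom4.abc     45 32 + 1049595 acgtacgtacgtacgtacgtacgtacgtacgt
s chimp.abc        485868 28 - 1234567 acgtacgt----ACGTACGTACGtacgtacgt

a
s baboon.chrom10.abc  23456 25 +  345678 tctctcttctctctctctgggaaaa
"""
print(species_by_label(example))
```

Every species name reported this way should also appear as a leaf in the phylogenetic tree.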

All promoters from all genes to be scanned should be included in this input file. Just because human is aligned with chimp for one gene doesn't mean that the same must be true for another gene, and so on.

Motif model

Like the sequences to be scanned argument to Phyloscan, the motif model is supplied in MAF format. Currently, Phyloscan does not support aligned binding sites; each known binding site for the transcription factor should be supplied in its own alignment block.

For instance, three sites could be supplied with:

##maf version=1
a
s ECOL.crp.abc 0 22 + 22 TTGCGTGATCTGTCGCCCAAAT

a
s ECOL.crp.def 0 22 + 22 TTTGTTGCTGACCTTCAAAAAT

a
s ECOL.crp.ghi 0 22 + 22 TTTTGTGAATCAGATCAGAAAA

If the foreground model is palindromic then give each site in either, but not both, orientations. Otherwise, all the binding sites supplied for the motif model must have the same binding orientation.

Much as with the MAF file that supplies the sequences to be scanned, for the first entry on any "s" line in this file (e.g., "ECOL.crp.abc"), the text before the first "." should match the name of a species in the phylogenetic tree (e.g., ECOL, HINF, VCHO), and the text after the last "." should match with those sequences that are orthologous to it (e.g., the sequence upstream of the ECOL abc gene and its orthologous counterparts are all labeled as "abc"). The text between the first and last "." is ignored by Phyloscan.

Palindromic motif model

If the binding pattern is believed to be palindromic, i.e., a site is believed to be as good as its reverse complement, then check the palindromic motif model box. In this case, each binding site supplied as part of the motif model can be supplied in either orientation, but not both orientations.

When the motif model is not palindromic, all the binding sites supplied for the motif model must have the same orientation.

When the motif is palindromic, Phyloscan skips the reverse scan of the sequence data, yielding better p-values and q-values.
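To make the orientation rule concrete, the following Python sketch computes reverse complements and flags sites that were supplied in both orientations. The helper names reverse_complement and supplied_in_both_orientations are hypothetical, and the check is only a convenience for preparing input; it says nothing about how Phyloscan itself handles orientation.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(site):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return site.translate(_COMPLEMENT)[::-1]


def supplied_in_both_orientations(sites):
    """List sites whose reverse complement also appears as a different entry."""
    seen = {site.upper() for site in sites}
    return sorted(site for site in seen
                  if reverse_complement(site) in seen
                  and reverse_complement(site) != site)


sites = ["TTGCGTGATCTGTCGCCCAAAT", "TTTGTTGCTGACCTTCAAAAAT"]
print(reverse_complement(sites[0]))
print(supplied_in_both_orientations(sites))
```

A site that equals its own reverse complement, such as GAATTC, is not reported, because it has only one distinct orientation.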

Fragmentation mask

The fragmentation mask indicates which positions of a binding site are significant with a "*" and the remaining positions with a ".". Only those positions with a "*" will be used in Phyloscan calculations. For instance if the middle 6 positions of a 22-nucleotide wide binding site are not significant for binding, the fragmentation mask would be

     ********......********
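The effect of a mask on a site can be sketched in a few lines of Python. The function name apply_mask is hypothetical, and the sketch only shows which positions would be kept; it does not reproduce Phyloscan's internal calculations.

```python
def apply_mask(site, mask):
    """Keep only the positions of a site that the mask marks with "*"."""
    if len(site) != len(mask):
        raise ValueError("site and mask must have the same width")
    if set(mask) - {"*", "."}:
        raise ValueError('mask may contain only "*" and "."')
    return "".join(base for base, flag in zip(site, mask) if flag == "*")


print(apply_mask("TTGCGTGATCTGTCGCCCAAAT", "********......********"))
# prints TTGCGTGAGCCCAAAT
```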

P-value cutoff for primary species

Phyloscan asks you to pick a primary species from among those in the phylogenetic tree. Phyloscan will report an intergenic region as likely to have one or more binding sites if and only if there is sufficient evidence of the binding sites both in the primary species considered in isolation and in the primary species considered in the context of the remaining orthologous sequences. The p-value cutoff field sets the threshold for the primary species considered in isolation; for instance, a cutoff value of 0.05 instructs Phyloscan to consider only those intergenic regions with a p-value of 0.05 or better in the primary species. With this cutoff, approximately one in twenty intergenic regions that do not have binding sites will pass this stage as false positives, and Phyloscan will proceed to analyze each of them in the context of its orthologous sequences. (This "high" level of false positives is acceptable because of the further processing; see q-value cutoff below.)

Setting a low value for the p-value cutoff, e.g., 0.001, will cause Phyloscan to reject intergenic regions that do not look good enough in the primary species, even if they might otherwise be "rescued" by the existence of high-quality binding sites in the orthologous sequences. Note that an intergenic region that passes such a strict cutoff is of high quality, and frequently this high quality will cause it to pass the subsequent q-value test as well, unless this second test is even more strict.

On the other hand, a high value for the p-value cutoff will instruct Phyloscan to not be too concerned with the quality of the binding sites in the primary species; Phyloscan will consider an intergenic region to be of high quality if consideration of the primary species and orthologous sequences together so indicates.

The default value of 0.05 is chosen so that Phyloscan will identify those intergenic regions that have one or more high quality binding sites in the primary species and those intergenic regions that have only low quality sites in the primary species but for which the conservation of those sites across the remaining species is significant evidence of the functionality of those sites.
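The two-stage decision described in this section can be summarized with a small Python sketch. The function passes_both_cutoffs and the example numbers are hypothetical; the defaults are the p-value cutoff of 0.05 given here and the q-value cutoff of 0.001 described in the next section, and "0.05 or better" is taken to mean less than or equal to the cutoff.

```python
def passes_both_cutoffs(primary_p, combined_q, p_cutoff=0.05, q_cutoff=0.001):
    """Report a region only if it passes the primary-species p-value test
    and the combined-evidence q-value test."""
    return primary_p <= p_cutoff and combined_q <= q_cutoff


# Hypothetical regions: (primary-species p-value, combined q-value)
regions = {
    "strong_in_primary": (0.0004, 0.0002),
    "rescued_by_orthologs": (0.03, 0.0008),
    "weak_in_primary": (0.20, 0.0005),
}
for name, (p_value, q_value) in regions.items():
    print(name, passes_both_cutoffs(p_value, q_value))
```

With a strict p-value cutoff such as 0.001, the second region would be rejected at the first stage even though its combined evidence is strong.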

Q-value cutoff for combined evidence

The q-value cutoff is the mechanism by which Phyloscan controls the trade-off between the number and quality of the intergenic regions it identifies. The q-value (also termed False Discovery Rate) is the expected fraction of false positives in the output data set. For example, for a set of 40 intergenic regions reported as significant hits by Phyloscan, a q-value of 0.05 would indicate that, on average, two of those forty will be false positives, under the assumption that the statistical models employed perfectly model the underlying biology. This cutoff defaults to a very strict value of 0.001 to account for the fact that the biology is more complicated than the statistical models we use to analyze it.

Note that q-value differs from p-value. The latter is the fraction of negative cases expected to test falsely positive.
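The difference between the two quantities is easiest to see as arithmetic. In the Python sketch below, both function names are hypothetical, and the count of 1000 regions without binding sites is an invented number used only for illustration.

```python
def expected_false_discoveries(q_value, n_reported):
    """A q-value is a fraction of the reported hits."""
    return q_value * n_reported


def expected_false_positives(p_cutoff, n_without_sites):
    """A p-value cutoff is a fraction of the regions that truly lack sites."""
    return p_cutoff * n_without_sites


# 40 reported regions at a q-value of 0.05: about 2 expected false hits.
print(expected_false_discoveries(0.05, 40))

# 1000 regions without binding sites (hypothetical) at a p-value cutoff of 0.05:
# about 50 are expected to pass the test by chance.
print(expected_false_positives(0.05, 1000))
```

The q-value is scaled by the number of regions reported, while the p-value is scaled by the number of regions that have no binding site, which is usually much larger.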

Weights for best sites found in an intergenic region

Much of the strength of Phyloscan arises from its combining of the evidence across multiple binding sites within an intergenic region. The default weight of 0.9 for the best site indicates to Phyloscan that approximately 90% of the time an intergenic region with one or more functional binding sites will have at least one strong binding site. The rank weight of 0.1 for the second best site indicates to Phyloscan that approximately 10% of the time, the best site will not be strong, but the second best site will be strong enough that, together, the best two sites make the intergenic region functional for the transcription factor.

You must supply at least one rank weight. Each supplied rank weight must be non-negative and at least one of the rank weights must be positive. If the supplied rank weights do not sum to 1.0, they will be scaled proportionally.
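The validation and scaling rules above can be sketched as follows. The function name normalize_rank_weights is hypothetical; the sketch simply applies the three stated rules and then divides by the sum.

```python
def normalize_rank_weights(weights):
    """Validate rank weights and scale them proportionally to sum to 1.0."""
    if not weights:
        raise ValueError("at least one rank weight is required")
    if any(weight < 0 for weight in weights):
        raise ValueError("rank weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one rank weight must be positive")
    return [weight / total for weight in weights]


print(normalize_rank_weights([9, 1]))      # same proportions as 0.9 and 0.1
print(normalize_rank_weights([3, 2, 0]))   # a zero weight is allowed
```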

Phyloscan model assumptions

The background nucleotide equilibrium distribution is built from the GC-content of the sequences to be scanned. Additionally, 5 pseudocounts are added to the count of each nucleotide.

The foreground nucleotide equilibrium distribution for each column of the motif is built from the count of the relevant nucleotides in the motif sequence file. Additionally, 0.28 pseudocounts are added to the count of each nucleotide, thus modeling the expectation that a typical motif position will have approximately 1.0 bits of information.
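One plausible reading of the foreground estimate is sketched below in Python: for each motif column, add the pseudocount to each nucleotide's count and divide by the total. This reading, the function name column_frequencies, and the use of the three example sites from the motif model section are assumptions made for illustration; the exact procedure used by Phyloscan may differ.

```python
from collections import Counter


def column_frequencies(sites, pseudocount=0.28, alphabet="ACGT"):
    """Estimate per-column nucleotide frequencies with added pseudocounts."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    total = len(sites) + pseudocount * len(alphabet)
    frequencies = []
    for column in range(width):
        counts = Counter(site[column].upper() for site in sites)
        frequencies.append({base: (counts[base] + pseudocount) / total
                            for base in alphabet})
    return frequencies


sites = [
    "TTGCGTGATCTGTCGCCCAAAT",
    "TTTGTTGCTGACCTTCAAAAAT",
    "TTTTGTGAATCAGATCAGAAAA",
]
model = column_frequencies(sites)
print(model[0])   # first column: T observed in all three sites
print(model[2])   # third column: G once, T twice
```

With three sites, a nucleotide seen in every site gets frequency (3 + 0.28) / (3 + 4 × 0.28), and an unseen nucleotide gets 0.28 / (3 + 4 × 0.28), so no frequency is ever zero.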

The nucleotide substitution model is that of Hasegawa, Kishino, and Yano (1985) J Mol Evol 22(2):160-174. PubMed: 3934395, doi: 10.1007/BF02101694.

Description of Phyloscan outputs

Combined q-value

Combined q-value is a measure of an intergenic region and its orthologous sequences, whether aligned to it or not, when the evidence of all the sequences and all the potential binding sites is considered together. The combined q-value is the fraction of groups of orthologous intergenic regions in the Phyloscan output of this quality or better that is expected to be false positives. Because the statistical model only approximately models the underlying biology, we find a value of 0.001 or less to be significant in many circumstances.

Combined p-value

Similar to combined q-value, combined p-value is a measure of an intergenic region and its orthologous sequences, whether aligned to it or not, when the evidence of all the sequences and all the potential binding sites is considered together. The combined p-value is the probability that random data of this size would accidentally look this good.

Intergenic p-value

Intergenic p-value is a measure of an intergenic region for a single alignment block, when the evidence of all the sequences within it and all the potential binding sites within it is considered together. It is the probability that random data of this size would accidentally look this good.

Binding site E-value

The E-value for a binding site is the number of binding sites expected to look this good if a data set of random aligned sequences were searched.


Phyloscan 2.0 - Wadsworth Bioinformatics Center