DNA sequencing technology has outrun all other molecular biology methods leading to a vast data base of unexplored sequence data. The human genome project and other genome projects have greatly accelerated the growth of this data base. Since genes code for proteins, the action molecules of life, the bulk of scientific effort has focused on the genes and their products. Yet only about 3% of the human genome codes for genes. The remainder, sometimes called "junk DNA" has received far less attention. Recent findings suggest that this junk may be far more important than previously believed. In particular this "junk" has been shown to contain important regulatory signals.
Gene regulation is the fundamental process behind cellular alteration. It is used by single cell organisms to respond to changes in their environment and by multicellular organisms for cell differentiation. The most studied gene regulation involves the binding of regulator proteins to "regulatory elements" which are signal sequences that normally occur in an upstream fragment of the genome called the promoter. Far less studied are regulator elements that are more distant from the genes they regulate and regulation due to the binding of RNA gene regulators. While there are experimental methods to identify regulator elements, such as ChIP-chip and ChIP-PET they are time consuming and difficult. Thus there is a need for the development of computational and statistical methods to explore these databases to identify novel regulatory signals that may be hidden within the genomic "junk".
Gene regulation has the ability to cause short-term and long-term effects. Short-term regulation generally leads to the production of induction or repression enzymes that allow cells to respond to changes in their environment. Long-term regulation is vital in determining the ontogeny of an individual. If for some reason regulatory sequences differ from the norm, birth defects can result which may prove deleterious to the organism.
Oncogenes are genes that cause a healthy cell to become a cancer cell. Often oncogenes encode enzymes that phosphorylate many proteins, some of which are in the cell's membrane. Such modification can cause the cell to divide. Gene regulation normally would suppress the continuous division found in these neoplasia (cancer cells).
If the site of gene regulation can be determined, then the complex changes taking place in the expected motifs when things go wrong can begin to be analyzed. Hopefully, in the future, gene therapy can be explored to help eliminate or lessen the effects of certain birth defects and cancer
The goal is take a given set of amino acid or nucleotide sequences and determine common motif elements within them. It is possible that more than one characteristic motif type exists in which case multiple gene regulation sites are probable. One approach known as site sampling assumes that each sequence contains exactly one motif element for each motif type. The alternatives motif, recursive, and centroid sampling assume that each sequence can contain zero or more motif elements of each motif type.
The Gibbs Motif Sampler
The web version of the Gibbs Motif Sampler operates in
4 basic modes: site sampling, motif sampling, recursive sampling, and centroid sampling. In site
sampling (1), it is assumed that each sequence contains one site of each
motif type. In motif sampling (3,7), the user provides an estimate of the
the total number of sites of each motif type in all of the sequences. There
may be 0 or more sites in any particular sequence. With the recursive sampler
the user sets an upper limit on the number of sites in any sequence. The
recursive sampler is based on the propagation model described by Liu et al
(11) but unlike that model does not require that the order of the motif sites
remain constant throughout the sequences. The centroid sampler
(14,18) finds the
ensemble centroid motif estimator, i.e., the solution that is the set of sites having the
minimum total distance to the set of sites sampled from the posterior weighted ensemble of sites.
There are 3 different pages for entering
data to the Gibbs Sampler. The entry page allows the user to access the minimum
amount of information to get started. Clicking on the Show Advanced Options link will
bring up a page with more options.
One of 4 possible sampling modes is chosen from the series of radio buttons:
The user specifies at a minimum the data
file in FASTA format containing the sequences
and the motif width(s) (in number of residues). A width for each motif must
be entered. For multiple motifs, the widths are separated by commas; For example, entering 16,20.
indicates that there are 2 motif models. The first has a width of 16 bases and the second has width 20. The
motif width is the number of conserved position in the site. Sites may be
wider than the number of conserved bases due to fragmentation.
If the motif sampling strategy is
chosen, an additional parameter must be added indicating an initial "guesstimate"
of the total number of motif elements in all of the sequences for each motif
type. As with the widths, estimates for multiple motifs are separated by
commas. If this field is left blank, the site sampling mode will be used.
This mode provides a more advanced sampling algorithm and also allows use of a background composition model and a spacing model.
For this mode, the user supplies a
value for the number of different motifs and an upper limit on the number
of sites in each sequence in addition to the data file and motif widths.
Clicking the Eukaryotic
Defaults Button or the Prokrayotic Defaults Button
will chose the recursive sampler and take you to a new page with a seriers of default values for
a number of paramters.
The centroid sampler reports a centroid alignment, i.e., an alignment that has the minimum total distance to the set of Gibbs samples chosen from the a posteriori probability distribution of motif site alignments. By focusing on the region of solution space containing the most posterior probability rather than on the single solution that is most probable, this option significantly enhances the statistical specificity and positive predictive value of the algorithm and is now recommended for most situations.
The Centroid Sampler is descibed in (14,18).
The Centroid Sampler is the basis for the Phylogenetic Sampler.
The Centroid Sampler is the basis for the Phylogenetic Sampler.
This page allows the user to enter a number of parameters that influence the biological characteristics of the type of sites to searched for.
The centroid sampler calculates a mean or centroid solution. To accomplish this, it samples motif sites for a fixed number of iterations, updating the motif models at each iteration. This initial sampling period is called a burn-in. It allows the motif models to converge. The default burn-in period is 2000 iterations. The burn-in period value is ignored for sampling modes other than the centroid sampler
For the centroid sampler once the burn-in is completed, the sampler proceeds, again through a fixed number of iterations (the default is 8,000), tracking each sampled position. A centroid alignment solution is obtained from these samples through identification of the alignment that possesses the minimum total distance to the other alignments in the set. The sampling period value is ignored for sampling modes other than the centroid sampler.
The centroid sampler calculates a centroid solution. The solution reported by the centroid sampler is a collection of sites that constitute an alignment that possesses the minimum total distance to the other alignments in the sampled set. Because of fragmentation, the lengths of the centroid sites may vary, making it difficult to see which positions are more highly conserved. As an option, the collection of sites making up the centroid amy be aligned to produce a more tradtional alignment model. It is important to note that this new alignment does not represent either an optmal or centroid model.
Optional alternative models can be used with the nucleic acid alphabet. Palindromic models are entered as t,i,j.... where t is the motif number and i,j are the beginning and ending positions of the palindromic region. For example, 1,1,8 indicated that motif number is palindromic in positions 1 through 8 and the corresponding reverse complement positions at the opposite end. Note that for each position that is marked as palindromic, there is automatically a corresponding position the same distance away from the opposite end of the motif that is also palindromic. Multiple motif models are enterted in a similar manner. 1,1,8,2,1,6 indicates that positions 1 through 8 in motif model 1 and the corresponding positions at the opposite end and palindromic in model 1 and positions 1 through 6 and the corresponding positions at the end are palindromic for model 2.
Motif regions using a collapsed alphabet are entered in a similar manner. Concentrated positions are entered as t,i,j,p.... As before, t is the motif number and i,j are the beginning and ending positions of the concentrated region region. p allows the user to indicate whether the concentrated region should be palindromic as well. A value of 0 indicates no palindromes and 1 indicates palindromes. Concentrated positions indicate that a particular residue pair (A- T or C-G) is concentrated in a particular region. Physical models suggest this can occur due to the flexibility or rigidness that concentrated regions provide.
Fragmentation allows the sampler to look at the current alignment and the surrounding positions to find which columns should be used in the motif rather than just allowing for motifs that are continuous. The width of the field that can be checked is by default equal to 1.5 times the width of the current motif model. However, the field is generally greatly narrowed due to the limited flexibility of individual motif elements caused by overlapping the ends of sequences or other motifs. The fragmentation algorithm calculates the probability of each of the possible ways to fragment the model and samples a model. The process of fragmentation is examined every fifth iteration. The fragmentation process can be invoked more or less often by changing the Adjustment period field on the third entry page.
Fragmentation can shift the position of the aligned sites by removing the first column replacing it with the last column or vice-versa. It will also deal with modeling more realistic regulatory sites where binding may or may not occur at contiguous positions. Fragmentation is covered in greater detail in (3). A maximum fragmentation width can be specified by entering a value in the Max Frag Width field. Width are specified in the formm t,w where t is the motif number and w is the maximum fragmentation width. For example, 1,24 specifies that motif 1 will be allowed to fragment to a width of 24 bases. The default is 1.5 * the motif length.
Typically transcription factors (TFs) that bind as homodimers or homo-multimeric protein complexes have DNA binding motifs that are palindromes (2). Other regulatory TFs such as PhoB and certain zinc fingers bind in directly repeating multimers and therefore have directly repeating binding patterns (13). Often, the spacing between palindromic repeat units is unknown. In cases where this is suspected, fragmentation can be restricted to the center position by choosing the Fragment from center option.
If direct repeating units are suspected, the repeating positions can be entered in the Direct repeat pos field in the same manner as the Palindromic Pos field.
The program normally samples nucleotide data in both forward and reverse complement. It is possible to turn off reverse complement sampling by unchecking the Reverse complement checkbox. This is typically done when palindromic models are used.
For amino acid sequences, the data is normally XNU'd using a blossum matrix to remove low complexity regions. This option is selected by the Xnu checkbox.
In order to assess the significance of a particular alignment, the program can perform a Wilcoxon signed-rank test. Each sequence is shuffled and the collection of shuffled sequences is appended to the input sequences. Motif sampling is performed on the combined data set and a Wilcoxon signed rank test is performed treating the aligned motifs as coming from two paired samples. See (3) for details.
During sampling, a limited local search is periodically performed. Rather than having the sampler search for motifs of fixed width, it is possible to allow the sampler to adjust the width of the motifs. Width values are entered in the Motif Widths field as t,min,max where t is the motif number and min and max are the minimum and maximum values for the motif width. If a value is entered in this field, the fragmentation checkbox is ignored. If the fragmentation checkbox is deselected and no width values are entered, the motif columns are shifted right or left slightly during the local search to try and improve the current MAP value. Width sampling is deprecated. Fragmentation is the preferred method of adjusting site widths and positions.
By default, for the site, motif, and recursive sampler modes, the algorithm returns the single best solution that it has found as measured by the MAP value. However, no prediction method is perfect, and not all of the predictions of this or other algorithms are equally compelling. For examination of the creditability of each prediction, a Bayesian sampling solution is also available as an option. For this solution, after the MAP solution has been determined, the algorithm continues to sample, in order to explore variations in the models. The frequencies with which sites are sampled are recorded and reported to the user. Variations in models may alter estimates of site frequencies from the values obtained with the single MAP solution. Our experience shows that some sites are strong and are sampled nearly 100% of the time, while others are weaker and are sampled with a lower frequency. These sampling frequencies are an estimate of the probability that the cognate TF binds at each predicted sites. By default, only sites that have a frequency of at least 50% are reported. Clicking the Frequency Solution checkbox controls whether the frequency solution will be displayed in the output.
In addition, an optional set of prior information or a spacing model may be entered. The program may also be requested to calculate a background composition model. Background models are calculated by default.
Checking the Output for dscan: option will produce a series of segments from the alignment. These segments can be used as input to the dscan program for scanning a database.
Pseudo site weights can be updated by changing the Pseudo Site Weight field and the Pseudo count weights can be changed via the field Pseudo count weight. These fields are entered as percentages. Pseudo count weights affect the strength of the priors.
Pseudosites are used in calculating the posterior probability of an arbitrary position being a motif site. This probability is used is sampling. It multiplies the probability that a sites belongs to a particular motif type. See (3, 7) for details. The pseudosites are priors and play the same role as psuedocounts do in the motif models.
For MAP solutions, the Plateau Period is the number of iterations between successive local maxima. For each seed, the program samples until Plateau period iterations are performed without an increase in the MAP value. Every Adjustment Period iterations, a limited local search is performed to see if it is possible to improve the current alignment. This search is performed either by shifting all motif columns, fragmentation or by allowing the motif width to vary. The plateau period is ignored by the centroid sampler
The maximum number of iterations can be entered in Max Iterations. This parameter specifies the maximum number of allowed iterations for each seed as well as the total iterations for near optimal sampling. Max Iterations is ignored by the centroid sampler. Centroid sampler iterations are set by the burn-in and the samples settings.
In order to sample over a larger (or smaller) number of separate initial models, the value of Number of Seeds can be changed. This value is the number of time the sampler is restarted.
The Near Optimal Cutoff percent specifies the threshold value for selecting the "good" sites for near optimal sampling. This value is also the threshold for reporting the most frequently sampled sites during near optimal sampling.
If it is desired to reproduce results, an initial seed value can be entered into the Random Number Seed field. This value is used to seed the random number generator. This setting is typically only used for debugging
A set of default parameters are provided
for prokaryote and another for eukaryotes. The prokaryotic defaults are those
from McCue et. al (6). The eukaryotic defaults are a set that we have found
useful when studying transcription regulation in genes expressed in human and mouse tissue.
Note: The eukaryotic defaults specify 5 different motif types and up to 5 sites per sequence. On large data sets, this may take a long time to process on this server. If you plan to run large data sets frequently using these defaults, we encourage you to obtain a copy of Gibbs and run it locally.
See http://bayesweb.wadsworth.org/GIBBS-SAMPLER-ACADEMIC.htm and http://bayesweb.wadsworth.org/GIBBS-SAMPLER-COMMERCIAL.htm for details.
Amino acid sequences can have low complexity regions of residues closely related or occurring in tandem. Such regions can lead to alignments that are statistically correct but are obviously experimentally false. Unless the user turns off the Xnu option, the sequences are processed or XNU'd using a blossum matrix to remove such low complexity regions. It is also possible that the sequences can be pre-processed using BLAST scoring (15) or other software packages available such as DUST. BLAST scoring can also be used on nucleotide sequences to help remove from a data set sequences that are too similar to one another. Once again, this helps eliminate false motif alignments.
When using the Bayesian statistical formula, there needs to be a distinction between what is in the model and what is not. Those elements in the model will be referred to as motif, while those outside the model are background.
Initially there are no motif elements for any of the motif types. Observed counts for each of the residues are calculated for the background based on the number of each residue occurring in the sequences. Since there are no motif elements, observed motif counts are set to zero for each residue. The observed counts are represented by the variable q0,1, ..., qi,n where q0,kare the background counts, i = number of motif types, and n= alphabet length (4 for nucleotides, 20 for amino acids.)
The posterior Dirichlet distribution for site and background models requires pseudocount parameters. Default background pseudocounts are calculated as a percentage of the background observed counts. The user can determine how much weight the pseudocounts hold by specifying a pseudocount weight. The weight is a percentage in the range between 0 and 100 with a default of 10 (10% of the observed counts.)
Motif pseudocounts are set in one of two ways. In the first method, there is no bias in the composition of the motif. Motif residue pseudocounts are set to the background pseudocounts. Such an approach implements "uninformed priors." When the composition of a motif is known to some degree, "informed priors" can be used. In this method, motif pseudocounts are based on the values entered by the user. Depending upon the confidence level of the occurrence of particular residues in specific locations, the pseudocounts are multiplied by a user supplied factor.
After the counts are calculated for the model without any motif elements, the "null model" is stored. An initial alignment of motif elements is randomly selected. The site sampler will choose one location in each sequence as a motif element for each motif type. Motif sampling randomly chooses locations within the sequences according to the initial number of motifs that the user specifies. Care is taken in both cases to ensure that motifs do not overlap the ends of sequences or one another. At this point an alignment, albeit random, is given.
Observed background and motif counts are now updated according to the alignment. This is accomplished by adding in the residues occurring in each motif element to the motif counts and deleting them from the background counts. The various sampling modes now sample motif elements according to their own strategies.
The site sampler begins with an alignment of motifs, one for each motif type per sequence. It then proceeds to follow the two steps of the Gibbs site sampler (1). The first step is the predictive update step in which one of the N sequences is chosen, beginning with the first sequence and proceeding to the last sequence. The motif element for each motif type in the current sequence is deleted from the motif model anbd added to the background. The counts are updated accordingly.
The second step of the Gibbs site sampler is the sampling step. Each possible motif starting position is assigned a probability of being a motif starting position. See equations 1 - 4. Once a probability is assigned to each possible starting position, the probabilities are normalized so they sum to 1. A single motif element is then sampled in for each motif type based on the probability weights. This is a random process, so the best fit may not be chosen, although the probabilities supply a weight to help ensure that good elements are nearly always sampled when present.
The process is repeated with each sequence. Once a correct pattern begins to appear in the first step, it will be improved upon by the weights associated with the sampling step until a local maximum alignment has not been obtained for a number of iterations through the site sampler (defined in the Plateau Period field). Such a lack of improvement suggests that the alignment is in an "energy well" and will not improve. It is possible that this well is indeed the overall and not a local maximum.
The equations used to calculate the probabilities are as follows:
where t is the current motif type;Jt is the motif width, ct,j,ris the number of residues of typerat position j for motif typet;bt,j,r is the number of pseudocounts of typerat position j for motif type t; btis the total number of motif residue pseudocounts for motif type t;qris the number of background pseudocounts of type r, and q is the total number of background pseudocounts. P(t)is the posterior probability that an arbitrary site belongs to motif typet and P(Bg)is the posterior probability that an arbitrary site belongs to the background.
The Motif sampler begins with an alignment of motifs randomly spread throughout the sequences. The algorithm starts at the very first position in the long, concatenated sequence consiting of all sequences and checks to see if the position is a possible motif start site, i.e. a motif placed there will not overlap two sequences or a previously set motif element of a different type. If it is determined to be a possible start site, checking is done to see if it overlaps a motif element of the current type or a motif type that has not yet been set. When this happens, the motif element it overlaps is taken out of the alignment and the counts are updated accordingly. Note that the element can get sampled back in later. The current position is then sampled in either as a motif element of the current type or as background based on the probabilities that are calculated via equations 1, 2, 3 and 4.
In the search for binding sites, the exact number of sites and the
number of sites corresponding to each motif within each sequence are not
usually known. The recursive sampler (16,17) uses
recursive sums over all possible
alignments from 0 to a maximum in a sequence, to obtain Bayesian inferences
on the number of sites for each motif and the total number of sites in each
sequence. The recursion examines the placements of sites for p sites,
as represented by p different motifs, in each sequence. Each sequence
may contain multiple sites for each of the motifs in any order, and some
sequences may contain no sites for some or all of the motifs. The algorithm
sums over all combinations of sites in each sequence, and infers the alignments
and ordering of sites for each sequence via a backward sampling step.
In this way, the algorithm infers for each sequence the total number of sites,
the number of each of the motifs, and the alignments of the sites in
Once initialized with a random alignment, the Gibbs sampling procedure proceeds through the following steps: (1) the probability of each possible number of sites for a sequence is calculated based on the current model; (2) the number of sites for the sequence is sampled; (3) the predicted sites of each of the current motif models are sampled; and (4) the motif models are updated from the sampled sites. In previous versions, this process repeated until the MAP failed to increase for a fixed number of iterations. To obtain a sampling solution, we allow the algorithm to sample through a burn-in period, typically 2,000-3,000 iterations. Next, the sampler proceeds, again through a fixed number of sampling iterations (typically 8,000-10,000), tracking each sampled position. A centroid alignment solution is obtained from these samples through identification of the alignment that possesses the minimum total distance to the other alignments in the set. Thus, the centroid is defined in terms of a distance measure between pairs of proposed alignments.
In the computation of the distance between two alignments, each site in each of the two alignments is considered separately, and the distance between the two alignments is the sum of the values, computed as follows. If the site exactly overlaps a site in the other alignment, then this is a "perfect overlap," and the contribution to the distance sum is zero. If the site overlaps a site in the other alignment by more than half of the larger of the widths of each of the two sites, then this is "sufficient overlap," and the contribution to the distance sum is an infinitesimal epsilon. Otherwise, there is "insufficient overlap," and the contribution to the distance sum is one. Epsilon is an infinitesimal in that it is effectively zero except when a non-zero value is needed to serve as a tie breaker between two candidate centroids. We construct the centroid solution via a dynamic programming algorithm (14,18).
For the site, nmotif, and recursive samplers, once an alignment of motif elements has been found for all of the motif types (i.e. the motif elements for the last sequence have been set in the site sampler or the last position of the long sequence has been sampled in or out in the motif sampler), the current alignment is tested to see how well it describes the data. The value that is calculated is the posterior probability of the alignment given the data. The maximum value of this probability referred to as the MAP(maximum a posteriori) value (3). The final alignment and model reported by the program is the alignment corresponding to the maximal value of this probability. Rather than calculate the probability as shown in equation 10 of (3), the natural log of the probability is used. There are four parts to this probability calculation: the motif portion, the background portion, the fragmentation portion and the distribution of the number of motif sites. The probability for the site sampler and motif sampler is calculated in the same fashion with the exception that the distribution of the number of sites is dropped when using the site sampler since it is not informative when the number of motifs is constant. The MAP value is measured relative to an empty or "null" alignment by taking the difference of the logs of the probability of the alignment and the probability an empty alignment. A value greater than 0 indicates that the alignment is more likely than unaligned background.
The MAP is not calculated by the centroid sampler.
It is possible that an alignment of motifs results in a map value that is not quite the maximal value. For instance, it could be possible, especially in the case of the motif sampler, that a good but not great alignment is found that is a shifted form of a better alignment.
Fragmentation can alleviate the problem of shifting by picking the first column as the worst and replacing it with the last column or vice-versa. It will also deal with modeling more realistic regulatory sites where binding may or may not occur at contiguous sites. Fragmentation is covered in greater detail in (3).
After the sampling steps have been iterated through a number of times (default is 20), each time running until a maximal map has not been found for the current run in the last plateau perioditerations or until the maximal number of iterations has been reached, the overall maximal map is stored. However, sampling of individual elements within the sequence is a random process, so it is possible to have some motif elements that have a low probability of being motif elements that are sampled in while others with a high probability of being motif elements are sampled out. Since such an alignment is suboptimal, a near optimal sampling routine is used to converge closer to the desired results.
The process proceeds is two stages. The first step is a preprocessing stage. all of the positions that are motif elements in the maximal alignment are marked as "good" positions. All of the other possible motif starting positions are analyzed to see if their probability of being a motif element (according to equations 1, 2, 3 and 4) is above a certain threshold. If so, then they are marked as "good" positions, otherwise they are marked as "bad".
The sampling stage of near optimal sampling works exactly the same as described above with one small difference. The exception is that only those positions which are marked "good" are possible starting positions. As a result, near optimal sampling is much faster than suboptimal sampling.
Near optimal sampling will run for the maximum number of iterations (default is 500). Sites which are sampled greater than a fixed percent of the total number of iterations are displayed in the frequency solution.
Once near optimal sampling has completed, two different results that are available: the near optimal alignment and the alignment of motif elements that occur over a threshold percentage of the time in near optimal sampling. If the Map Miximization field is selected, the alignment with the higer MAP will be refined.
The Maximal map calculation will first set the current alignment to the maximal alignment. Then, in a single pass, beginning with the first position of the first sequence and ending with the last position of the last sequence, individual motif elements will be added in or taken out depending upon whether or not their inclusion improves the map value. This may be compuationally expensive because the map must be calculated at each position in each sequence. The alignment with the maximum a probability (MAP) from this process is reported.
The results produced by the Bernoulli sampler were compared with known motif positions and results obtained using other samplers. The three main data sets used were a bacterial porins data set with known motif positions (7), dinucleotide- binding proteins whose motifs have been detected by a previous implementation of a Gibbs sampling algorithm (3) and a CRP binding data set with CRP binding sites that have been found using an implementation of the expectation maximization algorithm (2).
For more examples see the descriptions of prokayotic phylogenetic footprinting and the analysis of prokaryotic co-expression data.
The porin data consists of a set of 16 distantly related bacterial integral outer membrane proteins. The location of 16 known beta strands for 3 of these proteins is given in Table 1.
S, M BEG pre MOTIF post END PROB T DESCRIPTION STRAND ____________________________________________________________________________ 1, 1 149 ffglv DGLNFAVQYLG knerd 159 0.28 F OMPF_ECOLI OUTER B7 1, 2 170 tarrs NGDGVGGSISY eyegf 180 0.96 F OMPF_ECOLI OUTER B8 1, 3 212 ngkka EQWATGLKYDA nniyl 222 1.00 F OMPF_ECOLI OUTER B10 1, 4 255 fankt QDVLLVAQYQF dfglr 265 0.99 F OMPF_ECOLI OUTER B12 1, 5 293 dvdlv NYFEVGATYYF nknms 303 0.98 F OMPF_ECOLI OUTER B14 1, 6 328 klgvg SDDTVAVGIVY qf 338 0.98 F OMPF_ECOLI OUTER B16 2, 1 37 sgttd SGLEFGASFKA hesvg 47 0.86 F PORI_RHOCA PORIN. B3 2, 2 57 gaetg EDGTVFLSGAF gkiem 67 0.99 F PORI_RHOCA PORIN. B4 2, 3 99 lddrg GNDIPYLTGDE rltae 109 0.12 F PORI_RHOCA PORIN. -- 2, 4 159 aaytf GNYTVGLGYEK idspd 169 0.98 F PORI_RHOCA PORIN. B9 2, 5 182 lmadm EQLELAAIAKF gatnv 192 0.97 F PORI_RHOCA PORIN. B10 2, 6 261 tiddv TYYGLGASYDL gggas 271 0.18 F PORI_RHOCA PORIN. B14 3, 1 6 eisln GYGRFGLQYVE drgvg 16 0.31 F porin - R. blastica B1 3, 2 43 ttetd QGVTFGAKLRM qwddg 53 0.99 F porin - R. blastica B3 3, 3 88 gnvdt AFDSVALTYDS emgye 98 0.90 F porin - R. blastica -- 3, 4 171 iaadw SNDMISLAAAY ttdag 181 0.99 F porin - R. blastica B9 3, 5 222 lstag DQVTLYGNYAF gattv 232 1.00 F porin - R. blastica B12 3, 6 263 dyqfa EGVKVSGSVQS gfane 273 0.97 F porin - R. blastica B15 3, 7 277 qsgfa NETVADVGVRF df 287 0.84 F porin - R. blastica B16
The first column represents the sequence number (S) followed by the motif number (M). The second and sixth columns are the motif beginning (BEG) and ending (END) start sites. The third (pre) and fifth (post) columns are the residues that flank the motifs (MOTIF) in the fourth column. The seventh column represents the probability of the motif belonging to the alignment (PROB). The eighth column indicates whether it is a forward (F) or reverse complement (R) element. Column nine contains a description of the sequence and column 10 indicates which strand the element is contained within.
Of all the motifs found for the first three proteins, only two do not fit into one of the beta strands (see Table 1). While only 17 of 48 motif elements have been found, the results are impressive considering that the known motif alignment has elements of varying length. Thus, it would be best to consider multiple motifs and run this data set again to sample more motif elements.
The dinucleotide data is a set of 91 distantly related proteins that bind to one of 3 cofactors used by several different enzymes. The maximal alignment has been identified by an implementation of the Gibbs sampler using fragmentation (7) see Table 2).
S, M BEG pre MOTIF post END PROB T DESCRIPTION _____________________________________________________________________________ 1, 1 195 tqgst CAVFGLGGVGLSVIMGCKAAGAARII gvdin 220 1.00 F ADHE_HORSE 2, 1 23 rsynk ITVVGVGAVGMACAISILMKDLADEV alvdv 48 1.00 F LDHM_SQUAC 4, 1 6 agwsc LVTGGGGFLGQRIICLLVEEKDLQEI rvldk 31 1.00 F 3BHS_BOVIN 5, 1 6 malqq FGLIGLAVMGENLALNIERNGFSLTV ynrta 31 1.00 F 6PGD_SYNP7 6, 1 8 ankni IFVAGLGGIGFDTSREIVKSGPKNLV ildri 33 1.00 F ADH1_DRONA 8, 1 41 ektpq ICVVGSGPAGFYTAQHLLKHPQAHVD iyekq 66 1.00 F ADRO_HUMAN 8, 2 179 lscdt AVILGQGNVALDVARILLTPPEHLEA lllcq 204 1.00 F ADRO_HUMAN 11, 1 468 dqvie VFVIGVGGVGGALIEQIYRQQPWLKQ khidl 493 1.00 F AK1H_SERMA 14, 1 5 mqfd YIIIGAGSAGNVLATRLTEDPNTSVL lleag 30 1.00 F BETA_ECOLI 21, 1 348 lvqrl IGIEARGRIGRAVAVALHRASRALDV adhrq 373 1.00 F CRTI_APHSP 22, 1 11 egmgr AVVIGAGLGGLAAAMRLGAKGYKVTV vdrld 36 1.00 F CRTI_RHOCA 22, 2 222 kfgvh YAIGGVQAIADAMAKVITDQGGEMRL ntevd 247 1.00 F CRTI_RHOCA 24, 1 6 mtnir VAIVGYGNLGRSVEKLIAKQPDMDLV gifsr 31 1.00 F DDH_CORGL 27, 1 215 mkkak IAVQGIGNVGSYTVLNCEKLGGTVVA maewc 240 1.00 F DHE3_CLODI 28, 1 24 plplt VGVLGSGHAGTALAAWFASRHVPTAL wapad 49 1.00 F DHNO_AGRT7 29, 1 148 fkgkp ALIVGGGGTARTAIYVLRKWLGVSKI yivnr 173 0.99 F DHQA_ASPNI 41, 1 162 fglna VVIGASNIVGRPMSMELLLAGCTTTV thrft 187 1.00 F FOLD_ECOLI 41, 2 204 lenad LLIVAVGKPGFIPGDWIKEGAIVIDV ginrl 229 0.98 F FOLD_ECOLI 42, 1 7 tfqad LAIVGAGGAGLRAAIAAAQANPNAKI alisk 32 1.00 F FRDA_ECOLI 44, 1 6 msrak VGINGFGRIGRLVLRAAFLKNTVDVV svndp 31 1.00 F G3P_SCHMA 46, 1 6 mqpir LGLVGYGKIAQDQHVPAINANPAFTL vsvat 31 0.99 F GAL_PSEFL 49, 1 6 mnqva VVIGGGQTLGAFLCHGLAAEGYRVAV vdiqs 31 1.00 F GUTD_ECOLI 50, 1 260 svtvc VGHLGGLDIAERDIARLRGLGRTVSD siavr 285 1.00 F HDNO_ARTOX 51, 1 184 lssqk TVIIGAGKMACLLVKHLLAKGATDIT ivnrs 209 1.00 F HEM1_SYNY3 55, 1 297 dvqsd IVAQGFGSLGLMTSILVTPDGKTFES eaahg 322 0.99 F IDHP_YEAST 56, 1 5 mslr IGVIGTGAIGKEHINRITNKLSGAEI vavtd 30 1.00 F IDH_BACSU 57, 1 80 fkndt FALIGYGSQGYGQGLNLRDNGLNVII gvrkd 105 1.00 F ILV5_YEAST 63, 1 286 lsdht VLFQGAGEAALGIANLIVMAMEKEGV skeaa 311 1.00 F MAOX_ANAPL 70, 1 4 mtd IAFLGLGNMGGPMAANLLKAGHRVNV fdlqp 29 1.00 F MMSB_PSEAE 71, 1 3 mk ALHFGAGNIGRGFIGKLLADAGIQLT fadvn 28 1.00 F MTLD_ECOLI 76, 1 5 msnt IVVVGAGVIGLTSALLLSKNKGNKIT vvakh 30 1.00 F OXDA_FUSSO 77, 1 119 lydrt VGIVGVGNVGRRLQARLEALGIKTLL cdppr 144 1.00 F PDXB_ECOLI 78, 1 5 mktq VAIIGAGPSGLLLGQLLHKAGIDNVI lerqt 30 1.00 F PHHY_PSEFL 79, 1 167 vppak VMVIGAGVAGLAAIGAANSLGAIVRA fdtrp 192 1.00 F PNTA_ECOLI 79, 2 463 aisgi IVVGALLQIGQGGWVSFLSFIAVLIA sinif 488 0.96 F PNTA_ECOLI 82, 1 13 aesyt LGFIGAGKMAESIARGAVRSGVLPPS rirta 38 1.00 F PROC_SOYBN 85, 1 9 lkeyv IVSGASGFIGKHLLEALKKSGISVVA itrdv 34 1.00 F RFBJ_SALSP 86, 1 218 lagkv AVVAGYGDVGKGSAASLKAFGSRVIV teidp 243 1.00 F SAHH_CAEEL 89, 1 16 katkv IGIIGLGDMGLLYANKFTDAGWSVIC cdree 41 1.00 F TYR1_YEAST ******* ** * * * * Seed = 838822581 Difference of Logs of Maps = 100.53212
Fragmentation was used to obtain the maximal alignment. The columns that have an '*' beneath them are the columns chosen for the motif.
The CRP data contains a set of 18 proteins that have a CRP binding site. CRP is a dimeric DNA binding protein. Footprint sites have been reported by Stormo and Hartzell (2,10) The results using the site sampler with a width of 22, and no fragmentation are produced as follows:
S, M BEG pre MOTIF post END PROB T DESCRIPTION ___________________________________________________________________________ 1, 1 38 gcgct ATTCTCGCCCGATGCCACAAAA accag 17 0.31 R cole1 1, 2 61 actgt TTTTTTGATCGTTTTCACAAAA atgga 82 0.34 F cole1 2, 1 38 ttttc TGCCGTGATTATAGACACTTTT gttac 17 0.05 R ecoarabop 2, 2 55 attga TTATTTGCACGGCGTCACACTT tgcta 76 0.45 F ecoarabop 3, 1 76 ttaat AACTGTGAGCATGGTCATATTT ttatc 97 0.49 F ecobgirl 4, 1 63 tgcat GTATGCAAAGGACGTCACATTA ccgtg 84 0.46 F ecocrp 5, 1 71 gtcta AAACGTGATCAATTTAACACCT tgctg 50 0.49 R ecocya 6, 1 28 cactg TAATGCGATCTGGTTCAAATAA ttcac 7 0.26 R ecodaop 6, 2 81 caaca CACTTCGATACACATCACAATT aagga 60 0.25 R ecodaop *7, 1 24 aattc TTGTGTAAACGATTCCACTAAT ttatt 45 0.33 F ecogale +8, 1 15 gggtt TTTTGTTATCTGCAATTCAGTA caaaa 36 0.10 F ecoilvbpr 8, 2 39 gtaca AAACGTGATCAACCCCTCAATT ttccc 60 0.37 F ecoilvbpr 9, 1 30 gccta ATGAGTGAGCTAACTCACATTA attgc 9 0.36 R ecolac *9, 2 94 ttgtt ATCCGCTCACAATTCCACACAA catac 73 0.28 R ecolac 10, 1 35 tcgct TTGTGTGATCTCTGTTACAGAA ttggc 14 0.50 R ecomale 11, 1 29 acggc TTCTGTGAACTAAACCGAGGTC atgta 50 0.08 F ecomalk 11, 2 82 acgat TTTTGCAAGCAACATCACGAAA ttcct 61 0.47 R ecomalk 12, 1 62 tgtct GAATTTGCACTGTGTCACAATT ccaaa 41 0.35 R ecomalt 13, 1 48 ttcat ATGCCTGACGGAGTTCACACTT gtaag 69 0.49 F ecoompa 14, 1 71 cgaac GATTGTGATTCGATTCACATTT aaaca 92 0.48 F ecotnaa 15, 1 38 tctaa TTGGGTTAACCACATCACAACA atttc 17 0.49 R ecouxu1 16, 1 53 atatg CGGTGTGAAATACCGCACAGAT gcgta 74 0.48 F pbr322 17, 1 105 CGTGCCGATCAACGTCTCATTT tcgcc 84 0.99 R trn9cat *18, 1 97 aggat ATGTGCGACCACTCACAAATTA acttt 76 0.26 R (tdr) ********************** Seed = 838740876
Comparing these results to the footprint sites in (2,10), it can be seen that 20 of 24 footprint sites are found. Three of the sites (those marked with an *) are sites that have been previously found using an expectation maximization approach (2). The site marked by + is the only site that does not appear in the footprint sites or those sites previously identified.
The first portion of the results consists of the command line used to run the program, a list of the options used for the current run and a list of the Fasta headers of the input sequences. The results ection begins with a table with a column for each of the residues (4 for when using DNA nucleotide sequences and 20 when using amino acid sequences -- see Table 4 for the one letter abbreviation for the amino acids) and a row for each of the positions in the motif. The numbers in the table indicate the frequency of occurrence of each residue in each position within the motif throughout the alignment. The last column is an information parameter that indicates how much the column adds to the model. Using this table, it can be deduced which positions are most conserved and what the pattern of the motif is. Following this is a probability model calculated from the frequency table and the priors. Informative priors can be extracted from the information in this table.
The next portion of the results is the alignment of the motifs found. The first column indicates which sequence the motif element occurs in followed by the motif number for that sequence. The second column indicates where the motif element starts within the sequence (in number of residues) while the sixth column indicates the ending location of the motif element. The third and fifth columns indicate residues that occur on either side of the motif element located in the fourth column. The probability that this motif element belongs to the alignment is given in the seventh column. The eighth column indicates whether the element is found in the forward direction (F) or as a reverse complementary element (R). The ninth column provides a description of the sequence. Following the alignment, the random number seed used and map probability are given.
An example of the complete results is given below. It contains the results of searching a set of CRP intergenic sequences for a motif of width 16 using the recursive sampler allowing up to 2 sites in each sequence with an estimated total number of motifs of 20, fragmentation, a palindromic model, and a heterogeneous background model. In this example, the near optimal, frequency and maximal MAP solutions were requested. The default for the recursive, site and motif samplers is to only report the maximal MAP solution
/tmp/gibbs19212/Gibbs19212 -PBernoulli /tmp/gibbs19212/data.txt 16 20 -E 2 -K -m -n -r -W 0.8 -w 0.1 -p 50 -j 5 -i 500 -S 20 -C 0.5 -R 1,1,8 -B /tmp/gibbs19212/back.txt -Y -o /tmp/gibbs19212/outfile.txt Gibbs 2.04.001 Mar 24 2003 Data file: /tmp/gibbs19212/data.txt Output file: /tmp/gibbs19212/outfile.txt Background Composition Model file: /tmp/gibbs19212/back.txt The following options are set: Concentrated Region False Sequence type False Collapsed Alphabet False Pseudocount weight True Use Expectation/Maximization False Don't Xnu sequence False Help flag False Near optimal cutoff True Number of iterations True Don't fragment False Don't use map maximization True Repeat regions False Output file True Informed priors file False Plateau periods True palindromic sequence True Don't Reverse complement True Number of seeds True Seed Value False Pseudosite weight True Suboptimal sampler output False Overlap False Allow width to vary False Wilcoxon signed rank False Sample along length False Output Scan File False Output prior file False Modular Sampler True Ignore Spacing Model False Sample Background False Bkgnd Comp Model True Init from prior False Homologous Seq pairs False Parallel Tempering False Group Sampler False No progress info False Fragment from middle False Verify Mode False Alternate sample on k True No freq. soln. False Calc. def. pseudo wt. True site_samp = 0 nMotifLen = 16 nAlphaLen = 4 nNumMotifs = 20 dPseudoCntWt = 0.1 dPseudoSiteWt = 0.8 nMaxIterations = 500 lSeedVal = 1049467533 nPlateauPeriods = 50 nSeeds = 20 nNumMotifTypes = 1 dCutoff = 0.5 RevComplement = 0 glOverlapParam = 0 Max Sites/seq = 2 Min Sites/Seq = 0 Rcutoff factor = 0.001 Post Plateau Samples = 0 Frag/Shft Per. = 5 Frag width = 24 Sequences to be Searched: _________________________ #1 cole1 #2 ecoarabop #3 ecobgirl #4 ecocrp #5 ecocya #6 ecodaop #7 ecogale #8 ecoilvbpr #9 ecolac #10 ecomale #11 ecomalk #12 ecomalt #13 ecoompa #14 ecotnaa #15 ecouxu1 #16 pbr322 #17 trn9cat #18 (tdr) Processed Sequence Length: 1890 Total sequence length: 1890 ======================== NEAR OPTIMAL RESULTS ======================== ====================================================================== MAP = 119 maybe = 121 discard = 1769 ------------------------------------------------------------------------- MOTIF a Motif model (residue frequency x 100) ____________________________________________ Pos. # a t c g Info _____________________________ 1 | 29 52 5 11 0.2 2 | 29 52 . 17 0.3 3 | 35 35 11 17 0.0 4 | . 88 11 . 0.9 5 | . 23 11 64 0.7 6 | . 94 5 . 1.1 7 | 17 5 . 76 0.9 8 | 82 5 11 . 0.7 15 | . 76 . 23 0.7 16 | 5 11 82 . 1.2 17 | 88 . . 11 0.9 18 | 5 5 88 . 1.4 19 | 82 5 5 5 0.7 20 | 17 41 35 5 0.2 21 | 47 52 . . 0.5 22 | 41 52 5 . 0.4 nonsite 30 30 18 20 site 30 37 17 14 Motif probability model ____________________________________________ Pos. # a t c g ____________________________________________ 1 | 0.295 0.506 0.072 0.127 2 | 0.295 0.506 0.019 0.180 3 | 0.348 0.348 0.124 0.180 4 | 0.032 0.822 0.124 0.022 5 | 0.032 0.243 0.124 0.601 6 | 0.032 0.874 0.072 0.022 7 | 0.190 0.085 0.019 0.706 8 | 0.769 0.085 0.124 0.022 15 | 0.032 0.716 0.019 0.233 16 | 0.084 0.137 0.756 0.022 17 | 0.821 0.032 0.019 0.127 18 | 0.084 0.085 0.809 0.022 19 | 0.769 0.085 0.072 0.075 20 | 0.190 0.401 0.335 0.075 21 | 0.453 0.506 0.019 0.022 22 | 0.400 0.506 0.072 0.022 Background probability model 0.303 0.294 0.184 0.219 16 columns Num Motifs: 17 1, 1 61 actgt TTTTTTGATCGTTTTCACAAAA atgga 82 0.56 F cole1 2, 1 55 attga TTATTTGCACGGCGTCACACTT tgcta 76 1.00 F ecoarabop 3, 1 76 ttaat AACTGTGAGCATGGTCATATTT ttatc 97 0.96 F ecobgirl 4, 1 63 tgcat GTATGCAAAGGACGTCACATTA ccgtg 84 0.98 F ecocrp 5, 1 50 cagca AGGTGTTAAATTGATCACGTTT tagac 71 0.91 F ecocya 6, 1 7 gtgaa TTATTTGAACCAGATCGCATTA cagtg 28 0.99 F ecodaop 7, 1 42 tccac TAATTTATTCCATGTCACACTT ttcgc 63 0.58 F ecogale 8, 1 20 ttttg TTATCTGCAATTCAGTACAAAA cgtga 41 0.59 F ecoilvbpr 9, 1 9 gcaat TAATGTGAGTTAGCTCACTCAT taggc 30 0.99 F ecolac 10, 1 14 gccaa TTCTGTAACAGAGATCACACAA agcga 35 0.99 F ecomale 11, 1 61 aggaa TTTCGTGATGTTGCTTGCAAAA atcgt 82 0.96 F ecomalk 12, 1 41 tttgg AATTGTGACACAGTGCAAATTC agaca 62 0.90 F ecomalt 13, 1 48 ttcat ATGCCTGACGGAGTTCACACTT gtaag 69 0.98 F ecoompa 14, 1 71 cgaac GATTGTGATTCGATTCACATTT aaaca 92 1.00 F ecotnaa 15, 1 17 gaaat TGTTGTGATGTGGTTAACCCAA ttaga 38 0.54 F ecouxu1 16, 1 53 atatg CGGTGTGAAATACCGCACAGAT gcgta 74 0.93 F pbr322 18, 1 78 agtta ATTTGTGAGTGGTCGCACATAT cctgt 99 1.00 F (tdr) ******** ******** Column 1 : Sequence Number, Site Number Column 2 : Left End Location Column 4 : Motif Element Column 5 : Right End Location Column 6 : Probability of Element Column 7 : Forward Motif (F) or Reverse Complement (R) Column 8 : Sequence Description from Fast A input Log Motif portion of MAP for motif a = -249.09778 Log Fragmentation portion of MAP for motif a = -4.78749 Seed = 1049467533 Difference of Logs of Maps = 37.69802 ============================================================= ====== ELEMENTS OCCURRING GREATER THAN 50% OF THE TIME ===== ====== Motif a ===== ============================================================= Listing of those elements occurring greater than 50% of the time in near optimal sampling using 500 iterations Motif model (residue frequency x 100) ____________________________________________ Pos. # a t c g Info _____________________________ 1 | 29 52 5 11 0.2 2 | 29 52 . 17 0.3 3 | 35 35 11 17 0.0 4 | . 88 11 . 0.9 5 | . 23 11 64 0.7 6 | . 94 5 . 1.1 7 | 17 5 . 76 0.9 8 | 82 5 11 . 0.7 15 | . 76 . 23 0.7 16 | 5 11 82 . 1.2 17 | 88 . . 11 0.9 18 | 5 5 88 . 1.4 19 | 82 5 5 5 0.7 20 | 17 41 35 5 0.2 21 | 47 52 . . 0.5 22 | 41 52 5 . 0.4 nonsite 30 30 18 20 site 30 37 17 14 Motif probability model ____________________________________________ Pos. # a t c g ____________________________________________ 1 | 0.295 0.506 0.072 0.127 2 | 0.295 0.506 0.019 0.180 3 | 0.348 0.348 0.124 0.180 4 | 0.032 0.822 0.124 0.022 5 | 0.032 0.243 0.124 0.601 6 | 0.032 0.874 0.072 0.022 7 | 0.190 0.085 0.019 0.706 8 | 0.769 0.085 0.124 0.022 15 | 0.032 0.716 0.019 0.233 16 | 0.084 0.137 0.756 0.022 17 | 0.821 0.032 0.019 0.127 18 | 0.084 0.085 0.809 0.022 19 | 0.769 0.085 0.072 0.075 20 | 0.190 0.401 0.335 0.075 21 | 0.453 0.506 0.019 0.022 22 | 0.400 0.506 0.072 0.022 Background probability model 0.303 0.294 0.184 0.219 16 columns Num Motifs: 17 1, 1 61 actgt TTTTTTGATCGTTTTCACAAAA atgga 82 1.00 F cole1 2, 1 55 attga TTATTTGCACGGCGTCACACTT tgcta 76 1.00 F ecoarabop 3, 1 76 ttaat AACTGTGAGCATGGTCATATTT ttatc 97 1.00 F ecobgirl 4, 1 63 tgcat GTATGCAAAGGACGTCACATTA ccgtg 84 1.00 F ecocrp 5, 1 50 cagca AGGTGTTAAATTGATCACGTTT tagac 71 1.00 F ecocya 6, 1 7 gtgaa TTATTTGAACCAGATCGCATTA cagtg 28 1.00 F ecodaop 7, 1 42 tccac TAATTTATTCCATGTCACACTT ttcgc 63 1.00 F ecogale 8, 1 20 ttttg TTATCTGCAATTCAGTACAAAA cgtga 41 1.00 F ecoilvbpr 9, 1 9 gcaat TAATGTGAGTTAGCTCACTCAT taggc 30 1.00 F ecolac 10, 1 14 gccaa TTCTGTAACAGAGATCACACAA agcga 35 1.00 F ecomale 11, 1 61 aggaa TTTCGTGATGTTGCTTGCAAAA atcgt 82 1.00 F ecomalk 12, 1 41 tttgg AATTGTGACACAGTGCAAATTC agaca 62 1.00 F ecomalt 13, 1 48 ttcat ATGCCTGACGGAGTTCACACTT gtaag 69 1.00 F ecoompa 14, 1 71 cgaac GATTGTGATTCGATTCACATTT aaaca 92 1.00 F ecotnaa 15, 1 17 gaaat TGTTGTGATGTGGTTAACCCAA ttaga 38 0.99 F ecouxu1 16, 1 53 atatg CGGTGTGAAATACCGCACAGAT gcgta 74 0.99 F pbr322 18, 1 78 agtta ATTTGTGAGTGGTCGCACATAT cctgt 99 1.00 F (tdr) ******** ******** Difference of Logs of Maps = 37.69802 Log Background portion of Map = -2221.57441 Log Alignment portion of Map = -75.32389 Log Site/seq portion of Map = -7.36221 Log Null Map = -2595.84380 Non Palindromic Map = 20.07571 Max subopt MAP found on seed 2 Frequency Map = 37.698023 Nearopt Map = 37.698023 log MAP = sum of motif and fragmentation parts of MAP + background + alignment + sites/seq - Null Elapsed time: 62.100000 secs
An example of the output from the centroid sampler is given below. It contains the results of searching the set of CRP intergenic sequences described above for a motif of width 16 using the centroid sampler allowing up to 2 sites in each sequence with an estimated total number of motifs of 20, fragmentation, a palindromic model, and a heterogeneous background model. In this example, an aligned centroid model was chosen
/users/thompson/gibbs/bin/Gibbs.opteron /users/thompson/gibbs/data/crp.fa 16 20 -n -E 2 -r -R 1,1,8 -bayes -align_centroid -o /users/thompson/gibbs/output/crp.082707.out Gibbs 3.04.001 Aug 21 2007 Data file: /users/thompson/gibbs/data/crp.fa Output file: /users/thompson/gibbs/output/crp.082707.out Current directory: /users/thompson/gibbs/tmp The following options are set: Concentrated Region False Sequence type False Collapsed Alphabet False Pseudocount weight False Use Expectation/Maximization False Don't Xnu sequence False Help flag False Near optimal cutoff False Number of iterations False Don't fragment False Don't use map maximization True Repeat regions False Output file True Informed priors file False Plateau periods False palindromic sequence True Don't Reverse complement True Number of seeds False Seed Value False Pseudosite weight False Suboptimal sampler output False Overlap False Allow width to vary False Wilcoxon signed rank False Sample along length False Output Scan File False Output prior file False Modular Sampler True Ignore Spacing Model False Sample Background False Bkgnd Comp Model False Init from prior False Homologous Seq pairs False Parallel Tempering False Group Sampler False No progress info False Fragment from middle False Verify Mode False Alternate sample on k False No freq. soln. True Calc. def. pseudo wt. False Motif/Recur smpl False Phylogenetic Sampling False Supress Near Opt. True Nearopt display cutoff False Sample model True Hierarchical Model False Centroid model True Print Bayesian Counts False Align Centroid True Calculate Credibility Limits True site_samp = 0 nMotifLen = 16 nAlphaLen = 4 nNumMotifs = 20 dPseudoCntWt = 0.1 dPseudoSiteWt = 0.8 lSeedVal = 1188244180 nPlateauPeriods = 20 nSeeds = 10 nNumMotifTypes = 1 dCutoff = 0.01 dNearOptDispCutoff = 0.5 RevComplement = 0 glOverlapParam = 0 Max Sites/seq = 2 Min Sites/Seq = 0 Rcutoff factor = 0 Post Plateau Samples = 0 Frag/Shft Per. = 5 Frag width = 24 Burn-in = 2000 Sample Period = 8000 Sequences to be Searched: _________________________ #1 cole1 #2 ecoarabop #3 ecobgirl #4 ecocrp #5 ecocya #6 ecodaop #7 ecogale #8 ecoilvbpr #9 ecolac #10 ecomale #11 ecomalk #12 ecomalt #13 ecoompa #14 ecotnaa #15 ecouxu1 #16 pbr322 #17 trn9cat #18 (tdr) Processed Sequence Length: 1890 Total sequence length: 1890 ====================================================================== ========================== CENTROID RESULTS ========================== ====================================================================== 1, 1 12 ttgtg CTGGTTTTTGTGGCATCGGGC gagaa 32 0.66 cole1 1, 2 54 gtgaa AGACTGTTTTTTTGATCGTTTTC acaaa 76 0.98 cole1 2, 1 56 ttgat TATTTGCACGGCGTCACAC tttgc 74 0.98 ecoarabop 3, 1 79 ataac TGTGAGCATGGTCATATTTTTA tcaat 100 0.91 ecobgirl 4, 1 55 tgatg TACTGCATGTATGCAAAGGACGT cacat 77 0.84 ecocrp 5, 1 48 atcag CAAGGTGTTAAATTGATCACGTT ttaga 70 0.76 ecocya 6, 1 5 agtg AATTATTTGAACCAGATCGCATTA cagtg 28 0.97 ecodaop 6, 2 67 ttgtg ATGTGTATCGAAGTGTGTTGCGG agtag 89 0.86 ecodaop 7, 1 31 gtgta AACGATTCCACTAATTTATTCCA tgtca 53 0.90 ecogale 8, 1 29 ctgca ATTCAGTACAAAACGTGATCAAC ccctc 51 0.88 ecoilvbpr 9, 1 8 cgcaa TTAATGTGAGTTAGCTCACTC attag 28 0.97 ecolac 9, 2 79 tgtgt GGAATTGTGAGCGGATAACAATT tcac 101 0.66 ecolac 10, 1 8 attac CGCCAATTCTGTAACAGAGATCA cacaa 30 0.97 ecomale 11, 1 31 ggctt CTGTGAACTAAACCGAGGTCATG taagg 53 0.57 ecomalk 11, 2 56 atgta AGGAATTTCGTGATGTTGCTT gcaaa 76 0.83 ecomalk 12, 1 41 tttgg AATTGTGACACAGTGCAAATTCA gacac 63 0.94 ecomalt 13, 1 48 ttcat ATGCCTGACGGAGTTCACACTTG taagt 70 0.83 ecoompa 14, 1 78 ttgtg ATTCGATTCACATTTAAACAA tttca 98 0.92 ecotnaa 15, 1 15 gtgaa ATTGTTGTGATGTGGTTAACCCA attag 37 0.55 ecouxu1 16, 1 53 atatg CGGTGTGAAATACCGCACAGATG cgtaa 75 0.85 pbr322 18, 1 78 agtta ATTTGTGAGTGGTCGCACATAT cctgt 99 0.99 (tdr) Num Sites: 21 Column 1 : Sequence Number, Site Number Column 2 : Left End Location Column 4 : Motif Element Column 6 : Right End Location Column 7 : Probability of Element Column 8 : Sequence Description from FastA input Credibility limits % samples within specified number of mismatched sites from the sequence centroid. Avg. dist. Seq. 95% 90% 85% to samples ------------------------------- 1 1 1 1 0.39 cole1 2 1 1 1 0.47 ecoarabop 3 2 1 1 0.70 ecobgirl 4 1 1 1 0.39 ecocrp 5 2 1 1 0.62 ecocya 6 1 1 1 0.20 ecodaop 7 2 1 1 0.73 ecogale 8 1 1 1 0.59 ecoilvbpr 9 1 1 1 0.39 ecolac 10 1 1 0 0.10 ecomale 11 2 2 2 0.64 ecomalk 12 1 1 1 0.33 ecomalt 13 1 1 1 0.40 ecoompa 14 1 1 1 0.61 ecotnaa 15 2 1 1 0.73 ecouxu1 16 1 1 1 0.21 pbr322 17 1 0 0 0.07 trn9cat 18 1 1 1 0.50 (tdr) Avg. 1.28 1.00 0.94 0.45 ====================================================================== ======================== Aligned Centroid Sites ====================== ====================================================================== ------------------------------------------------------------------------- MOTIF a Motif model (residue frequency x 100) ____________________________________________ Pos. # a t c g Info _____________________________ 1 | 19 38 19 23 0.0 2 | . 90 9 . 1.0 3 | . 14 4 80 1.0 4 | . 90 9 . 1.0 5 | 19 4 . 76 0.9 6 | 85 . 4 9 0.9 8 | 28 19 28 23 0.1 9 | 4 42 19 33 0.2 10 | 47 23 4 23 0.2 11 | 9 19 14 57 0.4 13 | 4 61 9 23 0.3 14 | 9 4 85 . 1.4 15 | 71 . . 28 0.8 16 | 23 4 71 . 1.0 17 | 66 9 4 19 0.4 18 | 23 33 23 19 0.0 nonsite 28 32 16 22 site 25 28 19 26 Motif probability model ____________________________________________ Pos. # a t c g ____________________________________________ 1 | 0.199 0.376 0.188 0.237 2 | 0.026 0.852 0.102 0.020 3 | 0.026 0.159 0.059 0.756 4 | 0.026 0.852 0.102 0.020 5 | 0.199 0.073 0.015 0.713 6 | 0.805 0.029 0.059 0.107 8 | 0.285 0.203 0.275 0.237 9 | 0.069 0.419 0.188 0.323 10 | 0.459 0.246 0.059 0.237 11 | 0.112 0.203 0.145 0.540 13 | 0.069 0.592 0.102 0.237 14 | 0.112 0.073 0.795 0.020 15 | 0.675 0.029 0.015 0.280 16 | 0.242 0.073 0.665 0.020 17 | 0.632 0.116 0.059 0.194 18 | 0.242 0.333 0.232 0.194 Background probability model 0.307 0.362 0.143 0.187 16 columns Num Motifs: 21 1, 1 19 ggttt TTGTGGCATCGGGCGAGA atagc 36 1.00 F cole1 1, 2 63 tgttt TTTTGATCGTTTTCACAA aaatg 80 1.00 F cole1 2, 1 57 tgatt ATTTGCACGGCGTCACAC tttgc 74 1.00 F ecoarabop 3, 1 78 aataa CTGTGAGCATGGTCATAT tttta 95 1.00 F ecobgirl 4, 1 65 catgt ATGCAAAGGACGTCACAT taccg 82 1.00 F ecocrp 5, 1 52 gcaag GTGTTAAATTGATCACGT tttag 69 1.00 F ecocya 6, 1 9 gaatt ATTTGAACCAGATCGCAT tacag 26 1.00 F ecodaop 6, 2 62 cttaa TTGTGATGTGTATCGAAG tgtgt 79 1.00 F ecodaop 7, 1 26 ttctt GTGTAAACGATTCCACTA attta 43 1.00 F ecogale 8, 1 24 gttat CTGCAATTCAGTACAAAA cgtga 41 1.00 F ecoilvbpr 9, 1 11 aatta ATGTGAGTTAGCTCACTC attag 28 1.00 F ecolac 9, 2 75 atgtt GTGTGGAATTGTGAGCGG ataac 92 1.00 F ecolac 10, 1 16 caatt CTGTAACAGAGATCACAC aaagc 33 1.00 F ecomale 11, 1 31 ggctt CTGTGAACTAAACCGAGG tcatg 48 1.00 F ecomalk 11, 2 63 gaatt TCGTGATGTTGCTTGCAA aaatc 80 1.00 F ecomalk 12, 1 43 tggaa TTGTGACACAGTGCAAAT tcaga 60 1.00 F ecomalt 13, 1 50 catat GCCTGACGGAGTTCACAC ttgta 67 1.00 F ecoompa 14, 1 73 aacga TTGTGATTCGATTCACAT ttaaa 90 1.00 F ecotnaa 15, 1 19 aattg TTGTGATGTGGTTAACCC aatta 36 1.00 F ecouxu1 16, 1 55 atgcg GTGTGAAATACCGCACAG atgcg 72 1.00 F pbr322 18, 1 80 ttaat TTGTGAGTGGTCGCACAT atcct 97 1.00 F (tdr) ****** **** ****** Column 1 : Sequence Number, Site Number Column 2 : Left End Location Column 4 : Motif Element Column 6 : Right End Location Column 7 : Probability of Element Column 8 : Forward Motif (F) or Reverse Complement (R) Column 9 : Sequence Description from Fast A input Elapsed time: 698.280000 secs
Errors will be generated and execution terminated when any of the input parameters have values that are not within their range. If the input file, prior file or output file cannot be opened, then an error message will be generated and execution terminated.
The most common error is that the input file contains a letter which does not code for an amino acid or nucleotide, depending on which model is begin used. then an error message is displayed and execution halts. To fix this problem, the input file should be examined to make sure that it is in fasta format and all of the sequences are protein or amino acids codes.
When randomly selecting motif positions, an error can occur. This is possible if there are not enough positions in the sequence to set random motif positions according to the user specified motif width and expected number of motifs. The solution is to decrease either the motif width or the expected number of motifs.
|Adjust period must be greater than 0||A value less than 0 was entered for the number of iterations between fragmentation steps.|
|cannot allocate random motifs||shorten the number of motifs or motif length Either there are too many expected sites specified for the amount of sequence data or when using the -E option, Gibbs is unable to allocate the default number of expected sites. Try decreasing the number of expected sites and re-run.|
|Cannot open background composition file filename||The filename entered after -B could not be found or you do not have proper permission to read it. Contact the system administrator.|
|Cannot open description file for reading||A file system error has occurred. Contact the system administrator.|
|Cannot open output file filename||An invalid file name was entered. Gibbs was unable to open the specified file because the path did not exist or a file with the same name exists and is read-only. Contact the system administrator.|
|Cannot open prior file filename||Filename could not be found or you do not have proper permission to read it.|
|Cannot open spacing probability file filename||>SPACING command: The filename containing the position probability values could not be found or you do not have proper permission to read it. Contact the system administrator.|
|Cannot open Wilcoxon Control file filename||The filename entered after -l could not be found or you do not have proper permission to read it. Contact the system administrator.|
|Can't initialize sequence nnn with more than ddd sites||When using the -E option, Gibbs is unable to allocate the default number of expected sites. Try entering an expected number of sites and re-run. nnn is the sequence and ddd is the default number of sites for this sequence.|
|ccc (ascii ddd) is not a valid amino acid abbreviation in sequence nnn||An invalid character has been encountered in the data file. ccc is the character, ddd is the ASCII code for the character and nnn is the sequence containing the invalid character.|
|ccc (ascii ddd) is not a valid Nucleic Acid in sequence nnn||An invalid character has been encountered in the data file. ccc is the character, ddd is the ASCII code for the character and nnn is the sequence containing the invalid character.|
|Collapsed begin pos is greater than end pos||The specified collapsed range position is greater than the ending position.|
|Collapsed subscript overflow||The collapsed range beginning or ending position is greater than half the motif length.|
|Collapsed subscript underflow||The collapsed range beginning or ending position is less than 1.|
|Concentrated region begin pos is greater than end pos||The specified concentrated alphabet position is greater than the ending position.|
|Concentrated region subscript overflow||The concentrated alphabet beginning or ending position is greater than half the motif length.|
|Concentrated region subscript underflow||The concentrated alphabet beginning or ending position is less than 1.|
|Error in priors file: line||An invalid command was encountered in a prior file. The offending line is printed.|
|Error in Pseudo mmm - should be nnn lines||>PSEUDO command: The number of rows entered for pseudocounts for motif nnn does not match the motif width.|
|Error in Pseudo: line||>PSEUDO command: A line containing invalid data was encountered. The offending line is printed. Pseudocount lines should only contain numerical data.|
|get_strings: Sequence nnn has length = 0||An empty sequence has been encountered in the data file.|
|Invalid data - probably a missing FASTA header||An invalid data file was read. The most likely cause is a missing FASTA header for the first sequence.|
|Invalid Nucleotide||An invalid character was detected in a DNA sequence. DNA sequences may only contain A,T,C,G, or N.|
|Iteration count must be at least 1||A value of 0 was entered for the -i option.|
|-M: Fragmentation widths must be greater than or equal to motif lengths||The value entered after -M was less than the number of conserved positions for the motif.|
|-M: Mismatched fragmentation begin and end ranges||A motif number was entered but no maximum fragmentation width.|
|-M: Missing fragmentation width arguments||No value was entered for the maximum fragmentation width.|
|-M: Negative fragmentation width||The value entered after -M was either 0 or not a valid number.|
|Mismatched Palindrome begin and end ranges||The end range of the palindrome is missing.|
|Mismatched repeat begin and end ranges||The end range of the direct repeat is missing.|
|MISSING ARGUMENTS||This is a general error for an ill-formed command line. Probable causes are a missing data file name or a missing motif width.|
|Missing collapsed range arguments||An invalid collapsed range specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.|
|Missing concentrated region arguments||An invalid concentrated alphabet specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.|
|Missing palindromic range arguments||An invalid palindrome specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.|
|Missing repeat range arguments||An invalid direct repeat specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.|
|Missing sequence data - sequence nnn||A FASTA header was encountered, but the sequence was empty. nnn is the sequence number.|
|Must try at least one seed||A seed value of 0 was entered for the -S option.|
|Near Opt cutoff must be between 0 and 1||An invalid probability value was entered for -C option.|
|-o: Missing Argument||No name was entered for the output file.|
|-P: Missing Argument||No name was entered for the prior information file.|
|Palindrome begin pos is greater than end pos||The specified palindrome starting position is greater than the ending position.|
|Palindrome subscript overflow||The palindrome beginning or ending position is greater than half the motif length.|
|Palindrome subscript underflow||The palindrome beginning or ending position is less than 1.|
|Plateau periods must be >= 5||The minimum plateau period is 5 iterations.|
|Positions are defined as both palindromes and repeats||You have specified the same motif positions as both palindromes and direct repeat positions.|
|Pseudocount weight must be between 0 and 1||An invalid probability value was entered for -w option.|
|Pseudocounts sum to 0||>PSEUDO command: A pseudocount row either contains no numerical data or contains all zeros.|
|Pseudosite weight must be between 0 and 1||An invalid probability value was entered for -W option.|
|ReadBeginSite: Error in starting site probability||>BEGIN command: A line containing invalid data was encountered. The offending line is printed. Beginning probability lines should only contain numerical data.|
|ReadBeginSite:Sum of probabilities = 0||>BEGIN command: The row either contains no numerical data or contains all zeros.|
|ReadBkgndComposition: Invalid position in background file: seq: sss pos: ppp||The format of the background composition file is invalid. Check that the file has not been corrupted.|
|ReadBkgndComposition: Invalid sequence number in background file: nnn||The format of the background composition file is invalid. Check that the file has not been corrupted.|
|ReadBkgndComposition: Seq: sss Position: ppp probabilities do not sum to 1||The format of the background composition file is invalid. Check that the file has not been corrupted.|
|ReadBlockProb: Probability values must be >= 0||>BLOCKS command: A number less than 0 was entered for a site per sequence probability.|
|ReadEndSite: Error in end site probability||>END command: A line containing invalid data was encountered. The offending line is printed. Ending probability lines should only contain numerical data.|
|ReadEndSite:Sum of probabilities = 0||>END command: The row either contains no numerical data or contains all zeros.|
|ReadMask: Invalid mask length. Should be nnn||>MASK command: The number of entries in the fragmentation mask does not match the fragmentation width.|
|ReadMask: Mask codes must be >= 0 and <= 3||>MASK command: A value which is not 0,1, or 3 was read as a mask character.|
|ReadMask: Remove -F option if specifying a fragmentation mask||>MASK command: You have specified a fragmentation mask while suppressing fragmentation with the -F option.|
|ReadTransProbability: Error in transition probability table||>TRANS command: A line containing invalid data was encountered. The offending line is printed. Transition matrix lines should only contain numerical data.|
|ReadTransProbability: Trans weight must be between 0 and 1||>TRANS command: An invalid weight parameter was entered.|
|Remove -F option when using -J option||You have attempted to suppress fragmentation while requesting fragmentation from the center positions.|
|Repeat begin pos is greater than end pos||The specified direct repeat starting position is greater than the ending position.|
|Repeat subscript overflow||The direct repeat beginning or ending position is greater than half the motif length.|
|Repeat subscript underflow||The direct repeat beginning or ending position is less than 1.|
|-S: Missing Arguments||No value was found after the -S option.|
|strip_maxblocks: -E must have at least one motif per sequence||The value entered after -E was either 0 or not a valid number.|
|stripargs - Getting Motif Lengths: motif lengths must be > 0||An invalid number of conserved positions was entered for the motif.|
|The maximum temperature must be positive.||The value entered for the maximum temperature after the -X option was 0 or negative.|
|The minimum temperature must be positive.||The value entered for the minimum temperature after the -X option was 0 or negative.|
|The number of exchange iterations must be >= 1.||The value entered for the number of iterations between temperature exchanges must be greater than 0 for the -X option.|
|The Wilcoxon control file must contain the same number of sequences as the data file||The Wilcoxon file containing negative control sequences does not contain the same number of sequences as the input data file.|
|Unable to get temporary file name||A file system error has occurred. Contact the system administrator.|
|Wilcoxon control file error||A file system error has occurred. Contact the system administrator.|
|Wilcoxon Control file: Sequence nnn has a different length than the sequence in the data file||The negative control sequences must have the same length and be in the same order as the sequences as the input data file.|
Amino Acid Three Letter One Letter Abbreviation Abbreviation ___________________________________________ Alanine Ala A Arginine Arg R Asparagine Asn N Aspartic Acid Asp D Cysteine Cys C Glutamine Gln Q Glutamic Acid Glu E Glycine Gly G Histidine His H Isoleucine Ile I Leucine Leu L Lysine Lys K Methionine Met M Phenylalanine Phe F Proline Pro P Serine Ser S Threonine Thr T Tryptophan Trp W Tyrosine Tyr Y Valine Val V
1. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262(5131):208-214. PubMed: 8211139. doi: 10.1126/science.8211139.
2. Lawrence CE, and Reilly AA. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7(1):41-51. PubMed: 2184437. doi: 10.1002/prot.340070105.
3. Liu JS, Neuwald AF, and Lawrence CE. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J Am Stat Assoc 90(432):1156-1170. doi: 10.2307/2291508
4. Liu JS and Lawrence CE. (1999) Bayesian inference on biopolymer models. Bioinformatics 15(1):38-52. PubMed: 10068691. doi: 10.1093/bioinformatics/15.1.38.
5. Martin DI and Orkin SH. (1990) Transcriptional activation and DNA binding by the erythroid factor GF-1/NF-E1/Eryf 1. Genes Dev 4(11):1886-1898. PubMed: 2276623. doi: 10.1101/gad.4.11.1886.
6. McCue LA, Thompson W, Carmack CS, Ryan MP, Liu JS, Derbyshire V, and Lawrence CE. (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res 29(3):774-782. PubMed: 11160901. doi: 10.1093/nar/29.3.774.
7. Neuwald AF, Liu JS, and Lawrence CE. (1995) Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci 4(8):1618-1632. PubMed: 8520488. (link)
8. Omichinski JG, Clore GM, Schaad O, Felsenfeld G, Trainor C, Appella E, Stahl SJ, and Gronenborn AM. (1993) NMR structure of a specific DNA complex of Zn-containing DNA binding domain of GATA-1. Science 261(5120):438-446. PubMed: 8332909. doi: 10.1126/science.8332909.
9. Orkin SH. (1995) Regulation of globin gene expression in erythroid cells. Eur J Biochem 231(2):271-281. PubMed: 7635138.
10. Stormo GD and Hartzell GW 3rd. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 86(4):1183-1187. PubMed: 2919167. doi: 10.1073/pnas.86.4.1183.
11. Liu JS, Neuwald AF, and Lawrence CE. (1999) Markovian Structures in Biological Sequence Alignments. J Am Stat Assoc 94(445):1-15. (link)
12. Wasserman WW, Palumbo M, Thompson W, Fickett JW, and Lawrence CE. (2000) Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26(2):225-228. PubMed: 11017083. doi: 10.1038/79965.
13. Wanner BL. (1996) Phosphorus assimilation and control of the phosphate regulon. In Neidhardt FC, Curtiss R 3rd, Ingraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, and Umbarger HE (ed.). Escherichia coli and Salmonella: Cellular and Molecular Biology 1357-1381. American Society for Microbiology, Washington, DC.
14. Newberg LA, Thompson WA, Conlan S, Smith TM, McCue LA, and Lawrence CE. (2007) A phylogenetic Gibbs sampler that yields centroid solutions for cis regulatory site prediction. Bioinformatics . PubMed: 17488758. doi: 10.1093/bioinformatics/btm241.
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402. PubMed: 9254694. doi: 10.1093/nar/25.17.3389.
16. Thompson W, Rouchka EC, and Lawrence CE. (2003) Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res 31(13):3580-5. PubMed: 12824370. doi: 10.1093/nar/gkg608.
17. Thompson W, Palumbo MJ, Wasserman WW, Liu JS, and Lawrence CE. (2004) Decoding human regulatory circuits. Genome Res 14(10A):1967-74. PubMed: 15466295. doi: 10.1101/gr.2589004.
18. Thompson WA, Newberg LA, Conlan S, McCue LA, and Lawrence CE. (2007) The Gibbs Centroid Sampler. Nucleic Acids Res . PubMed: 17483517. doi: 10.1093/nar/gkm265.