Gibbs Motif Sampling Algorithms


Gene Regulation

DNA sequencing technology has outrun all other molecular biology methods leading to a vast data base of unexplored sequence data. The human genome project and other genome projects have greatly accelerated the growth of this data base. Since genes code for proteins, the action molecules of life, the bulk of scientific effort has focused on the genes and their products. Yet only about 3% of the human genome codes for genes. The remainder, sometimes called "junk DNA" has received far less attention. Recent findings suggest that this junk may be far more important than previously believed. In particular this "junk" has been shown to contain important regulatory signals.

Gene regulation is the fundamental process behind cellular alteration. It is used by single cell organisms to respond to changes in their environment and by multicellular organisms for cell differentiation. The most studied gene regulation involves the binding of regulator proteins to "regulatory elements" which are signal sequences that normally occur in an upstream fragment of the genome called the promoter. Far less studied are regulator elements that are more distant from the genes they regulate and regulation due to the binding of RNA gene regulators. While there are experimental methods to identify regulator elements, such as ChIP-chip and ChIP-PET they are time consuming and difficult. Thus there is a need for the development of computational and statistical methods to explore these databases to identify novel regulatory signals that may be hidden within the genomic "junk".

Gene regulation has the ability to cause short-term and long-term effects. Short-term regulation generally leads to the production of induction or repression enzymes that allow cells to respond to changes in their environment. Long-term regulation is vital in determining the ontogeny of an individual. If for some reason regulatory sequences differ from the norm, birth defects can result which may prove deleterious to the organism.

Oncogenes are genes that cause a healthy cell to become a cancer cell. Often oncogenes encode enzymes that phosphorylate many proteins, some of which are in the cell's membrane. Such modification can cause the cell to divide. Gene regulation normally would suppress the continuous division found in these neoplasia (cancer cells).

If the site of gene regulation can be determined, then the complex changes taking place in the expected motifs when things go wrong can begin to be analyzed. Hopefully, in the future, gene therapy can be explored to help eliminate or lessen the effects of certain birth defects and cancer

The goal is take a given set of amino acid or nucleotide sequences and determine common motif elements within them. It is possible that more than one characteristic motif type exists in which case multiple gene regulation sites are probable. One approach known as site sampling assumes that each sequence contains exactly one motif element for each motif type. The alternatives  motif, recursive, and centroid sampling assume that each sequence can contain zero or more motif elements of each motif type.

The Gibbs Motif Sampler

The web version of the Gibbs Motif Sampler operates in 4 basic modes: site sampling, motif sampling, recursive sampling, and centroid sampling. In site sampling (1), it is assumed that each sequence contains one site of each motif type. In motif sampling (3,7), the user provides an estimate of the the total number of sites of each motif type in all of the sequences. There may be 0 or more sites in any particular sequence. With the recursive sampler the user sets an upper limit on the number of sites in any sequence. The recursive sampler is based on the propagation model described by Liu et al (11) but unlike that model does not require that the order of the motif sites remain constant throughout the sequences. The centroid sampler (14,18) finds the ensemble centroid motif estimator, i.e., the solution that is the set of sites having the minimum total distance to the set of sites sampled from the posterior weighted ensemble of sites.

Entering the Data

There are 3 different pages for entering data to the Gibbs Sampler. The entry page allows the user to access the minimum amount of information to get started. Clicking on the Show Advanced Options link will bring up a page with more options.

One of 4 possible sampling modes is chosen from the series of radio buttons:

Site Sampler

The user specifies at a minimum the data file in FASTA format containing the sequences and the motif width(s) (in number of residues). A width for each motif must be entered. For multiple motifs, the widths are separated by commas; For example, entering 16,20. indicates that there are 2 motif models. The first has a width of 16 bases and the second has width 20. The motif width is the number of conserved position in the site. Sites may be wider than the number of conserved bases due to fragmentation.

Motif Sampler

If the  motif sampling strategy is chosen, an additional parameter must be added indicating an initial "guesstimate" of the total number of motif elements in all of the sequences for each motif type. As with the widths, estimates for multiple motifs are separated by commas. If this field is left blank, the site sampling mode will be used.

Recursive Sampler

This mode provides a more advanced sampling algorithm and also allows use of a background composition model and a spacing model.

For this mode, the user supplies a value for the number of different motifs and an upper limit on the number of sites in each sequence in addition to the data file and motif widths.

Clicking the Eukaryotic Defaults Button or the Prokrayotic Defaults Button will chose the recursive sampler and take you to a new page with a seriers of default values for a number of paramters.

Centroid Sampler

The centroid sampler reports a centroid alignment, i.e., an alignment that has the minimum total distance to the set of Gibbs samples chosen from the a posteriori probability distribution of motif site alignments. By focusing on the region of solution space containing the most posterior probability rather than on the single solution that is most probable, this option significantly enhances the statistical specificity and positive predictive value of the algorithm and is now recommended for most situations.

The Centroid Sampler is descibed in (14,18).

The Centroid Sampler is the basis for the Phylogenetic Sampler.

Advanced Options

This page allows the user to enter a number of parameters that influence the biological characteristics of the type of sites to searched for. 

Burn-in Period

The centroid sampler calculates a mean or centroid solution. To accomplish this, it samples motif sites for a fixed number of iterations, updating the motif models at each iteration. This initial sampling period is called a burn-in. It allows the motif models to converge. The default burn-in period is 2000 iterations. The burn-in period value is ignored for sampling modes other than the centroid sampler

Sampling Period

For the centroid sampler once the burn-in is completed, the sampler proceeds, again through a fixed number of iterations (the default is 8,000), tracking each sampled position. A centroid alignment solution is obtained from these samples through identification of the alignment that possesses the minimum total distance to the other alignments in the set. The sampling period value is ignored for sampling modes other than the centroid sampler.

Aligning the Centroid Model

The centroid sampler calculates a centroid solution. The solution reported by the centroid sampler is a collection of sites that constitute an alignment that possesses the minimum total distance to the other alignments in the sampled set. Because of fragmentation, the lengths of the centroid sites may vary, making it difficult to see which positions are more highly conserved. As an option, the collection of sites making up the centroid amy be aligned to produce a more tradtional alignment model. It is important to note that this new alignment does not represent either an optmal or centroid model.

Palindromes

Optional alternative models can be used with the nucleic acid alphabet. Palindromic models are entered as t,i,j.... where t is the motif number and i,j are the beginning and ending positions of the palindromic region. For example, 1,1,8 indicated that motif number is palindromic in positions 1 through 8 and the corresponding reverse complement positions at the opposite end. Note that for each position that is marked as palindromic, there is automatically a corresponding position the same distance away from the opposite end of the motif that is also palindromic. Multiple motif models are enterted in a similar manner. 1,1,8,2,1,6 indicates that positions 1 through 8 in motif model 1 and the corresponding positions at the opposite end and palindromic in model 1 and positions 1 through 6 and the corresponding positions at the end are palindromic for model 2.

Reduced Alphabet

Motif regions using a collapsed alphabet are entered in a similar manner. Concentrated positions are entered as t,i,j,p.... As before, t is the motif number and i,j are the beginning and ending positions of the concentrated region region. p allows the user to indicate whether the concentrated region should be palindromic as well. A value of 0 indicates no palindromes and 1 indicates palindromes. Concentrated positions indicate that a particular residue pair (A- T or C-G) is concentrated in a particular region. Physical models suggest this can occur due to the flexibility or rigidness that concentrated regions provide.

Fragmentation

Fragmentation allows the sampler to look at the current alignment and the surrounding positions to find which columns should be used in the motif rather than just allowing for motifs that are continuous. The width of the field that can be checked is by default equal to 1.5 times the width of the current motif model. However, the field is generally greatly narrowed due to the limited flexibility of individual motif elements caused by overlapping the ends of sequences or other motifs. The fragmentation algorithm calculates the probability of each of the possible ways to fragment the model and samples a model. The process of fragmentation is examined every fifth iteration. The fragmentation process can be invoked more or less often by changing the Adjustment period field on the third entry page.

Fragmentation can shift the position of the aligned sites by removing the first column replacing it with the last column or vice-versa. It will also deal with modeling more realistic regulatory sites where binding may or may not occur at contiguous positions. Fragmentation is covered in greater detail in (3). A maximum fragmentation width can be specified by entering a value in the Max Frag Width field. Width are specified in the formm t,w where t is the motif number and w is the maximum fragmentation width. For example, 1,24 specifies that motif 1 will be allowed to fragment to a width of 24 bases. The default is 1.5 * the motif length.

Fragment from Center Positions

Typically transcription factors (TFs) that bind as homodimers or homo-multimeric protein complexes have DNA binding motifs that are palindromes (2).  Other regulatory TFs such as PhoB and certain zinc fingers bind in directly repeating multimers and therefore have directly repeating binding patterns (13).  Often, the spacing between palindromic repeat units is unknown.  In cases where this is suspected, fragmentation can be restricted to the center position by choosing the Fragment from center option.

Direct Repeats

If direct repeating units are suspected, the repeating positions can be entered in the Direct repeat pos field in the same manner as the Palindromic Pos field.

Reverse Complement

The program normally samples nucleotide data in both forward and reverse complement. It is possible to turn off reverse complement sampling by unchecking the Reverse complement checkbox. This is typically done when palindromic models are used.

Xnu

For amino acid sequences, the data is normally XNU'd using a blossum matrix to remove low complexity regions. This option is selected by the Xnu checkbox.

Wilcoxon Signed-Rank Test

In order to assess the significance of a particular alignment, the program can perform a Wilcoxon signed-rank test. Each sequence is shuffled and the collection of shuffled sequences is appended to the input sequences. Motif sampling is performed on the combined data set and a Wilcoxon signed rank test is performed treating the aligned motifs as coming from two paired samples. See (3) for details.

Width Sampling

During sampling, a limited local search is periodically performed. Rather than having the sampler search for motifs of fixed width, it is possible to allow the sampler to adjust the width of the motifs. Width values are entered in the Motif Widths field as t,min,max where t is the motif number and min and max are the minimum and maximum values for the motif width. If a value is entered in this field, the fragmentation checkbox is ignored. If the fragmentation checkbox is deselected and no width values are entered, the motif columns are shifted right or left slightly during the local search to try and improve the current MAP value. Width sampling is deprecated. Fragmentation is the preferred method of adjusting site widths and positions.

Frequency Solution

By default, for the site, motif, and recursive sampler modes, the algorithm returns the single best solution that it has found as measured by the MAP value.  However, no prediction method is perfect, and not all of the predictions of this or other algorithms are equally compelling.  For examination of the creditability of each prediction, a Bayesian sampling solution is also available as an option.  For this solution, after the MAP solution has been determined, the algorithm continues to sample, in order to explore variations in the models.  The frequencies with which sites are sampled are recorded and reported to the user.  Variations in models may alter estimates of site frequencies from the values obtained with the single MAP solution. Our experience shows that some sites are strong and are sampled nearly 100% of the time, while others are weaker and are sampled with a lower frequency.  These sampling frequencies are an estimate of the probability that the cognate TF binds at each predicted sites.  By default, only sites that have a frequency of at least 50% are reported.  Clicking the Frequency Solution checkbox controls whether the frequency solution will be displayed in the output.

Prior Information

In addition, an optional set of prior information or a spacing model may be entered. The program may also be requested to calculate a background composition model. Background models are calculated by default.

Dscan

Checking the Output for dscan: option will produce a series of segments from the alignment. These segments can be used as input to the dscan program for scanning a database.

More Advanced Options

Additional options on this page allow control over program runtime parameters.

Pseudocounts

Pseudo site weights can be updated by changing the Pseudo Site Weight field and the Pseudo count weights can be changed via the field Pseudo count weight. These fields are entered as percentages. Pseudo count weights affect the strength of the priors.

Pseudosites are used in calculating the posterior probability of an arbitrary position being a motif site. This probability is used is sampling. It multiplies the probability that a sites belongs to a particular motif type. See (3, 7) for details. The pseudosites are priors and play the same role as psuedocounts do in the motif models.

Plateau Period

For MAP solutions, the Plateau Period is the number of iterations between successive local maxima. For each seed, the program samples until Plateau period iterations are performed without an increase in the MAP value. Every Adjustment Period iterations, a limited local search is performed to see if it is possible to improve the current alignment. This search is performed either by shifting all motif columns, fragmentation or by allowing the motif width to vary. The plateau period is ignored by the centroid sampler

Iterations

The maximum number of iterations can be entered in Max Iterations. This parameter specifies the maximum number of allowed iterations for each seed as well as the total iterations for near optimal sampling. Max Iterations is ignored by the centroid sampler. Centroid sampler iterations are set by the burn-in and the samples settings.

Seeds

In order to sample over a larger (or smaller) number of separate initial models, the value of Number of Seeds can be changed. This value is the number of time the sampler is restarted.

Near Optimal Cutoff percent

The Near Optimal Cutoff percent specifies the threshold value for selecting the "good" sites for near optimal sampling. This value is also the threshold for reporting the most frequently sampled sites during near optimal sampling.

Random Number Seed

If it is desired to reproduce results, an initial seed value can be entered into the Random Number Seed field. This value is used to seed the random number generator. This setting is typically only used for debugging

Prokaryotic and Eukaryotic Defaults

A set of default parameters are provided for prokaryote and another for eukaryotes. The prokaryotic defaults are those from McCue et. al (6). The eukaryotic defaults are a set that we have found useful when studying transcription regulation in genes expressed in human and mouse tissue.
Note: The eukaryotic defaults specify 5 different motif types and up to 5 sites per sequence. On large data sets, this may take a long time to process on this server. If you plan to run large data sets frequently using these defaults, we encourage you to obtain a copy of Gibbs and run it locally.
See http://bayesweb.wadsworth.org/GIBBS-SAMPLER-ACADEMIC.htm and http://bayesweb.wadsworth.org/GIBBS-SAMPLER-COMMERCIAL.htm for details.

Processing Sequences

Amino acid sequences can have low complexity regions of residues closely related or occurring in tandem. Such regions can lead to alignments that are statistically correct but are obviously experimentally false. Unless the user turns off the Xnu option, the sequences are processed or XNU'd using a blossum matrix to remove such low complexity regions. It is also possible that the sequences can be pre-processed using BLAST scoring (15) or other software packages available such as DUST. BLAST scoring can also be used on nucleotide sequences to help remove from a data set sequences that are too similar to one another. Once again, this helps eliminate false motif alignments.

Calculating Counts

When using the Bayesian statistical formula, there needs to be a distinction between what is in the model and what is not. Those elements in the model will be referred to as motif, while those outside the model are background.

Initially there are no motif elements for any of the motif types. Observed counts for each of the residues are calculated for the background based on the number of each residue occurring in the sequences. Since there are no motif elements, observed motif counts are set to zero for each residue. The observed counts are represented by the variable q0,1, ..., qi,n where q0,kare the background counts, i = number of motif types, and n= alphabet length (4 for nucleotides, 20 for amino acids.)

The posterior Dirichlet distribution for site and background models requires pseudocount parameters. Default background pseudocounts are calculated as a percentage of the background observed counts. The user can determine how much weight the pseudocounts hold by specifying a pseudocount weight. The weight is a percentage in the range between 0 and 100 with a default of 10 (10% of the observed counts.)

Motif pseudocounts are set in one of two ways. In the first method, there is no bias in the composition of the motif. Motif residue pseudocounts are set to the background pseudocounts. Such an approach implements "uninformed priors." When the composition of a motif is known to some degree, "informed priors" can be used. In this method, motif pseudocounts are based on the values entered by the user. Depending upon the confidence level of the occurrence of particular residues in specific locations, the pseudocounts are multiplied by a user supplied factor.

After the counts are calculated for the model without any motif elements, the "null model" is stored. An initial alignment of motif elements is randomly selected. The site sampler will choose one location in each sequence as a motif element for each motif type. Motif sampling randomly chooses locations within the sequences according to the initial number of motifs that the user specifies. Care is taken in both cases to ensure that motifs do not overlap the ends of sequences or one another. At this point an alignment, albeit random, is given.

Observed background and motif counts are now updated according to the alignment. This is accomplished by adding in the residues occurring in each motif element to the motif counts and deleting them from the background counts. The various sampling modes now sample motif elements according to their own strategies.

Site Sampler

The site sampler begins with an alignment of motifs, one for each motif type per sequence. It then proceeds to follow the two steps of the Gibbs site sampler (1). The first step is the predictive update step in which one of the N sequences is chosen, beginning with the first sequence and proceeding to the last sequence. The motif element for each motif type in the current sequence is deleted from the motif model anbd added to the background. The counts are updated accordingly.

The second step of the Gibbs site sampler is the sampling step. Each possible motif starting position is assigned a probability of being a motif starting position. See equations 1 - 4. Once a probability is assigned to each possible starting position, the probabilities are normalized so they sum to 1. A single motif element is then sampled in for each motif type based on the probability weights. This is a random process, so the best fit may not be chosen, although the probabilities supply a weight to help ensure that good elements are nearly always sampled when present.

The process is repeated with each sequence. Once a correct pattern begins to appear in the first step, it will be improved upon by the weights associated with the sampling step until a local maximum alignment has not been obtained for a number of iterations through the site sampler (defined in the Plateau Period field). Such a lack of improvement suggests that the alignment is in an "energy well" and will not improve. It is possible that this well is indeed the overall and not a local maximum.

The equations used to calculate the probabilities are as follows:

Equation 1 Equation 1

Equation 2 Equation 2

Equation 3Equation 3

Equation 4Equation 4

where t is the current motif type;Jt is the motif width, ct,j,ris the number of residues of typerat position j for motif typet;bt,j,r is the number of pseudocounts of typerat position j for motif type t; btis the total number of motif residue pseudocounts for motif type t;qris the number of background pseudocounts of type r, and q is the total number of background pseudocounts. P(t)is the posterior probability that an arbitrary site belongs to motif typet and P(Bg)is the posterior probability that an arbitrary site belongs to the background.

Motif Sampler

The Motif sampler begins with an alignment of motifs randomly spread throughout the sequences. The algorithm starts at the very first position in the long, concatenated sequence consiting of all sequences and checks to see if the position is a possible motif start site, i.e. a motif placed there will not overlap two sequences or a previously set motif element of a different type. If it is determined to be a possible start site, checking is done to see if it overlaps a motif element of the current type or a motif type that has not yet been set. When this happens, the motif element it overlaps is taken out of the alignment and the counts are updated accordingly. Note that the element can get sampled back in later. The current position is then sampled in either as a motif element of the current type or as background based on the probabilities that are calculated via equations 1, 2, 3 and 4.

Recursive Sampler

In the search for binding sites, the exact number of sites and the number of sites corresponding to each motif within each sequence are not usually known.  The recursive sampler (16,17) uses recursive sums over all possible alignments from 0 to a maximum in a sequence, to obtain Bayesian inferences on the number of sites for each motif and the total number of sites in each sequence.  The recursion examines the placements of sites for p sites, as represented by p different motifs, in each sequence.  Each sequence may contain multiple sites for each of the motifs in any order, and some sequences may contain no sites for some or all of the motifs. The algorithm sums over all combinations of sites in each sequence, and infers the alignments and ordering of sites for each sequence via a backward sampling step.  In this way, the algorithm infers for each sequence the total number of sites, the number of each of the motifs, and the alignments  of the sites in the sequence

Centroid Sampler

Once initialized with a random alignment, the Gibbs sampling procedure proceeds through the following steps: (1) the probability of each possible number of sites for a sequence is calculated based on the current model; (2) the number of sites for the sequence is sampled; (3) the predicted sites of each of the current motif models are sampled; and (4) the motif models are updated from the sampled sites. In previous versions, this process repeated until the MAP failed to increase for a fixed number of iterations. To obtain a sampling solution, we allow the algorithm to sample through a burn-in period, typically 2,000-3,000 iterations. Next, the sampler proceeds, again through a fixed number of sampling iterations (typically 8,000-10,000), tracking each sampled position. A centroid alignment solution is obtained from these samples through identification of the alignment that possesses the minimum total distance to the other alignments in the set. Thus, the centroid is defined in terms of a distance measure between pairs of proposed alignments.

In the computation of the distance between two alignments, each site in each of the two alignments is considered separately, and the distance between the two alignments is the sum of the values, computed as follows. If the site exactly overlaps a site in the other alignment, then this is a "perfect overlap," and the contribution to the distance sum is zero. If the site overlaps a site in the other alignment by more than half of the larger of the widths of each of the two sites, then this is "sufficient overlap," and the contribution to the distance sum is an infinitesimal epsilon. Otherwise, there is "insufficient overlap," and the contribution to the distance sum is one. Epsilon is an infinitesimal in that it is effectively zero except when a non-zero value is needed to serve as a tie breaker between two candidate centroids. We construct the centroid solution via a dynamic programming algorithm (14,18).

Calculating the MAP

For the site, nmotif, and recursive samplers, once an alignment of motif elements has been found for all of the motif types (i.e. the motif elements for the last sequence have been set in the site sampler or the last position of the long sequence has been sampled in or out in the motif sampler), the current alignment is tested to see how well it describes the data. The value that is calculated is the posterior probability of the alignment given the data. The maximum value of this probability referred to as the MAP(maximum a posteriori) value (3). The final alignment and model reported by the program is the alignment corresponding to the maximal value of this probability. Rather than calculate the probability as shown in equation 10 of (3), the natural log of the probability is used. There are four parts to this probability calculation: the motif portion, the background portion, the fragmentation portion and the distribution of the number of motif sites. The probability for the site sampler and motif sampler is calculated in the same fashion with the exception that the distribution of the number of sites is dropped when using the site sampler since it is not informative when the number of motifs is constant. The MAP value is measured relative to an empty or "null" alignment by taking the difference of the logs of the probability of the alignment and the probability an empty alignment. A value greater than 0 indicates that the alignment is more likely than unaligned background.

The MAP is not calculated by the centroid sampler.

It is possible that an alignment of motifs results in a map value that is not quite the maximal value. For instance, it could be possible, especially in the case of the motif sampler, that a good but not great alignment is found that is a shifted form of a better alignment.

Fragmentation can alleviate the problem of shifting by picking the first column as the worst and replacing it with the last column or vice-versa. It will also deal with modeling more realistic regulatory sites where binding may or may not occur at contiguous sites. Fragmentation is covered in greater detail in (3).

Near optimal Sampling

After the sampling steps have been iterated through a number of times (default is 20), each time running until a maximal map has not been found for the current run in the last plateau perioditerations or until the maximal number of iterations has been reached, the overall maximal map is stored. However, sampling of individual elements within the sequence is a random process, so it is possible to have some motif elements that have a low probability of being motif elements that are sampled in while others with a high probability of being motif elements are sampled out. Since such an alignment is suboptimal, a near optimal sampling routine is used to converge closer to the desired results.

The process proceeds is two stages. The first step is a preprocessing stage. all of the positions that are motif elements in the maximal alignment are marked as "good" positions. All of the other possible motif starting positions are analyzed to see if their probability of being a motif element (according to equations 1, 2, 3 and 4) is above a certain threshold. If so, then they are marked as "good" positions, otherwise they are marked as "bad".

The sampling stage of near optimal sampling works exactly the same as described above with one small difference. The exception is that only those positions which are marked "good" are possible starting positions. As a result, near optimal sampling is much faster than suboptimal sampling.

Near optimal sampling will run for the maximum number of iterations (default is 500). Sites which are sampled greater than a fixed percent of the total number of iterations are displayed in the frequency solution.

Map Maximization

Once near optimal sampling has completed, two different results that are available: the near optimal alignment and the alignment of motif elements that occur over a threshold percentage of the time in near optimal sampling. If the Map Miximization field is selected, the alignment with the higer MAP will be refined.

The Maximal map calculation will first set the current alignment to the maximal alignment. Then, in a single pass, beginning with the first position of the first sequence and ending with the last position of the last sequence, individual motif elements will be added in or taken out depending upon whether or not their inclusion improves the map value. This may be compuationally expensive because the map must be calculated at each position in each sequence. The alignment with the maximum a probability (MAP) from this process is reported.

Sample Results

The results produced by the Bernoulli sampler were compared with known motif positions and results obtained using other samplers. The three main data sets used were a bacterial porins data set with known motif positions (7), dinucleotide- binding proteins whose motifs have been detected by a previous implementation of a Gibbs sampling algorithm (3) and a CRP binding data set with CRP binding sites that have been found using an implementation of the expectation maximization algorithm (2).

For more examples see the descriptions of prokayotic phylogenetic footprinting and the analysis of prokaryotic co-expression data.

Example 1: Porin Data

The porin data consists of a set of 16 distantly related bacterial integral outer membrane proteins. The location of 16 known beta strands for 3 of these proteins is given in Table 1.



S,  M  BEG  pre     MOTIF    post    END   PROB T DESCRIPTION         STRAND
____________________________________________________________________________
1,  1  149 ffglv DGLNFAVQYLG knerd   159   0.28 F OMPF_ECOLI OUTER     B7
1,  2  170 tarrs NGDGVGGSISY eyegf   180   0.96 F OMPF_ECOLI OUTER     B8
1,  3  212 ngkka EQWATGLKYDA nniyl   222   1.00 F OMPF_ECOLI OUTER     B10
1,  4  255 fankt QDVLLVAQYQF dfglr   265   0.99 F OMPF_ECOLI OUTER     B12
1,  5  293 dvdlv NYFEVGATYYF nknms   303   0.98 F OMPF_ECOLI OUTER     B14
1,  6  328 klgvg SDDTVAVGIVY qf      338   0.98 F OMPF_ECOLI OUTER     B16
2,  1   37 sgttd SGLEFGASFKA hesvg    47   0.86 F PORI_RHOCA PORIN.    B3
2,  2   57 gaetg EDGTVFLSGAF gkiem    67   0.99 F PORI_RHOCA PORIN.    B4
2,  3   99 lddrg GNDIPYLTGDE rltae   109   0.12 F PORI_RHOCA PORIN.    --
2,  4  159 aaytf GNYTVGLGYEK idspd   169   0.98 F PORI_RHOCA PORIN.    B9  
2,  5  182 lmadm EQLELAAIAKF gatnv   192   0.97 F PORI_RHOCA PORIN.    B10
2,  6  261 tiddv TYYGLGASYDL gggas   271   0.18 F PORI_RHOCA PORIN.      B14
3,  1    6 eisln GYGRFGLQYVE drgvg    16   0.31 F porin - R. blastica  B1 
3,  2   43 ttetd QGVTFGAKLRM qwddg    53   0.99 F porin - R. blastica  B3 
3,  3   88 gnvdt AFDSVALTYDS emgye    98   0.90 F porin - R. blastica  --
3,  4  171 iaadw SNDMISLAAAY ttdag   181   0.99 F porin - R. blastica  B9
3,  5  222 lstag DQVTLYGNYAF gattv   232   1.00 F porin - R. blastica  B12 
3,  6  263 dyqfa EGVKVSGSVQS gfane   273   0.97 F porin - R. blastica  B15
3,  7  277 qsgfa NETVADVGVRF df      287   0.84 F porin - R. blastica  B16

The first column represents the sequence number (S) followed by the motif number (M). The second and sixth columns are the motif beginning (BEG) and ending (END) start sites. The third (pre) and fifth (post) columns are the residues that flank the motifs (MOTIF) in the fourth column. The seventh column represents the probability of the motif belonging to the alignment (PROB). The eighth column indicates whether it is a forward (F) or reverse complement (R) element. Column nine contains a description of the sequence and column 10 indicates which strand the element is contained within.

Of all the motifs found for the first three proteins, only two do not fit into one of the beta strands (see Table 1). While only 17 of 48 motif elements have been found, the results are impressive considering that the known motif alignment has elements of varying length. Thus, it would be best to consider multiple motifs and run this data set again to sample more motif elements.

Example 2: Dinucleotide Data

The dinucleotide data is a set of 91 distantly related proteins that bind to one of 3 cofactors used by several different enzymes. The maximal alignment has been identified by an implementation of the Gibbs sampler using fragmentation (7) see Table 2).


 S,  M  BEG  pre           MOTIF             post    END   PROB T DESCRIPTION
_____________________________________________________________________________
 1,  1  195 tqgst CAVFGLGGVGLSVIMGCKAAGAARII gvdin   220   1.00 F ADHE_HORSE
 2,  1   23 rsynk ITVVGVGAVGMACAISILMKDLADEV alvdv    48   1.00 F LDHM_SQUAC
 4,  1    6 agwsc LVTGGGGFLGQRIICLLVEEKDLQEI rvldk    31   1.00 F 3BHS_BOVIN
 5,  1    6 malqq FGLIGLAVMGENLALNIERNGFSLTV ynrta    31   1.00 F 6PGD_SYNP7
 6,  1    8 ankni IFVAGLGGIGFDTSREIVKSGPKNLV ildri    33   1.00 F ADH1_DRONA
 8,  1   41 ektpq ICVVGSGPAGFYTAQHLLKHPQAHVD iyekq    66   1.00 F ADRO_HUMAN
 8,  2  179 lscdt AVILGQGNVALDVARILLTPPEHLEA lllcq   204   1.00 F ADRO_HUMAN
11,  1  468 dqvie VFVIGVGGVGGALIEQIYRQQPWLKQ khidl   493   1.00 F AK1H_SERMA
14,  1    5  mqfd YIIIGAGSAGNVLATRLTEDPNTSVL lleag    30   1.00 F BETA_ECOLI
21,  1  348 lvqrl IGIEARGRIGRAVAVALHRASRALDV adhrq   373   1.00 F CRTI_APHSP
22,  1   11 egmgr AVVIGAGLGGLAAAMRLGAKGYKVTV vdrld    36   1.00 F CRTI_RHOCA
22,  2  222 kfgvh YAIGGVQAIADAMAKVITDQGGEMRL ntevd   247   1.00 F CRTI_RHOCA
24,  1    6 mtnir VAIVGYGNLGRSVEKLIAKQPDMDLV gifsr    31   1.00 F DDH_CORGL 
27,  1  215 mkkak IAVQGIGNVGSYTVLNCEKLGGTVVA maewc   240   1.00 F DHE3_CLODI
28,  1   24 plplt VGVLGSGHAGTALAAWFASRHVPTAL wapad    49   1.00 F DHNO_AGRT7
29,  1  148 fkgkp ALIVGGGGTARTAIYVLRKWLGVSKI yivnr   173   0.99 F DHQA_ASPNI
41,  1  162 fglna VVIGASNIVGRPMSMELLLAGCTTTV thrft   187   1.00 F FOLD_ECOLI
41,  2  204 lenad LLIVAVGKPGFIPGDWIKEGAIVIDV ginrl   229   0.98 F FOLD_ECOLI
42,  1    7 tfqad LAIVGAGGAGLRAAIAAAQANPNAKI alisk    32   1.00 F FRDA_ECOLI
44,  1    6 msrak VGINGFGRIGRLVLRAAFLKNTVDVV svndp    31   1.00 F G3P_SCHMA 
46,  1    6 mqpir LGLVGYGKIAQDQHVPAINANPAFTL vsvat    31   0.99 F GAL_PSEFL 
49,  1    6 mnqva VVIGGGQTLGAFLCHGLAAEGYRVAV vdiqs    31   1.00 F GUTD_ECOLI
50,  1  260 svtvc VGHLGGLDIAERDIARLRGLGRTVSD siavr   285   1.00 F HDNO_ARTOX
51,  1  184 lssqk TVIIGAGKMACLLVKHLLAKGATDIT ivnrs   209   1.00 F HEM1_SYNY3
55,  1  297 dvqsd IVAQGFGSLGLMTSILVTPDGKTFES eaahg   322   0.99 F IDHP_YEAST
56,  1    5  mslr IGVIGTGAIGKEHINRITNKLSGAEI vavtd    30   1.00 F IDH_BACSU 
57,  1   80 fkndt FALIGYGSQGYGQGLNLRDNGLNVII gvrkd   105   1.00 F ILV5_YEAST
63,  1  286 lsdht VLFQGAGEAALGIANLIVMAMEKEGV skeaa   311   1.00 F MAOX_ANAPL
70,  1    4   mtd IAFLGLGNMGGPMAANLLKAGHRVNV fdlqp    29   1.00 F MMSB_PSEAE
71,  1    3    mk ALHFGAGNIGRGFIGKLLADAGIQLT fadvn    28   1.00 F MTLD_ECOLI
76,  1    5  msnt IVVVGAGVIGLTSALLLSKNKGNKIT vvakh    30   1.00 F OXDA_FUSSO
77,  1  119 lydrt VGIVGVGNVGRRLQARLEALGIKTLL cdppr   144   1.00 F PDXB_ECOLI
78,  1    5  mktq VAIIGAGPSGLLLGQLLHKAGIDNVI lerqt    30   1.00 F PHHY_PSEFL
79,  1  167 vppak VMVIGAGVAGLAAIGAANSLGAIVRA fdtrp   192   1.00 F PNTA_ECOLI
79,  2  463 aisgi IVVGALLQIGQGGWVSFLSFIAVLIA sinif   488   0.96 F PNTA_ECOLI
82,  1   13 aesyt LGFIGAGKMAESIARGAVRSGVLPPS rirta    38   1.00 F PROC_SOYBN
85,  1    9 lkeyv IVSGASGFIGKHLLEALKKSGISVVA itrdv    34   1.00 F RFBJ_SALSP
86,  1  218 lagkv AVVAGYGDVGKGSAASLKAFGSRVIV teidp   243   1.00 F SAHH_CAEEL
89,  1   16 katkv IGIIGLGDMGLLYANKFTDAGWSVIC cdree    41   1.00 F TYR1_YEAST
                  ******* **   *  *   *    *


Seed = 838822581 Difference of Logs of Maps = 100.53212

Fragmentation was used to obtain the maximal alignment. The columns that have an '*' beneath them are the columns chosen for the motif.

Example 3: CRP Data

The CRP data contains a set of 18 proteins that have a CRP binding site. CRP is a dimeric DNA binding protein. Footprint sites have been reported by Stormo and Hartzell (2,10) The results using the site sampler with a width of 22, and no fragmentation are produced as follows:


  S,  M   BEG  pre           MOTIF         post    END   PROB T DESCRIPTION
___________________________________________________________________________
  1,  1    38 gcgct ATTCTCGCCCGATGCCACAAAA accag    17   0.31 R cole1
  1,  2    61 actgt TTTTTTGATCGTTTTCACAAAA atgga    82   0.34 F cole1
  2,  1    38 ttttc TGCCGTGATTATAGACACTTTT gttac    17   0.05 R ecoarabop
  2,  2    55 attga TTATTTGCACGGCGTCACACTT tgcta    76   0.45 F ecoarabop
  3,  1    76 ttaat AACTGTGAGCATGGTCATATTT ttatc    97   0.49 F ecobgirl
  4,  1    63 tgcat GTATGCAAAGGACGTCACATTA ccgtg    84   0.46 F ecocrp
  5,  1    71 gtcta AAACGTGATCAATTTAACACCT tgctg    50   0.49 R ecocya
  6,  1    28 cactg TAATGCGATCTGGTTCAAATAA ttcac     7   0.26 R ecodaop
  6,  2    81 caaca CACTTCGATACACATCACAATT aagga    60   0.25 R ecodaop
 *7,  1    24 aattc TTGTGTAAACGATTCCACTAAT ttatt    45   0.33 F ecogale      
 +8,  1    15 gggtt TTTTGTTATCTGCAATTCAGTA caaaa    36   0.10 F ecoilvbpr
  8,  2    39 gtaca AAACGTGATCAACCCCTCAATT ttccc    60   0.37 F ecoilvbpr
  9,  1    30 gccta ATGAGTGAGCTAACTCACATTA attgc     9   0.36 R ecolac
 *9,  2    94 ttgtt ATCCGCTCACAATTCCACACAA catac    73   0.28 R ecolac
 10,  1    35 tcgct TTGTGTGATCTCTGTTACAGAA ttggc    14   0.50 R ecomale
 11,  1    29 acggc TTCTGTGAACTAAACCGAGGTC atgta    50   0.08 F ecomalk
 11,  2    82 acgat TTTTGCAAGCAACATCACGAAA ttcct    61   0.47 R ecomalk
 12,  1    62 tgtct GAATTTGCACTGTGTCACAATT ccaaa    41   0.35 R ecomalt
 13,  1    48 ttcat ATGCCTGACGGAGTTCACACTT gtaag    69   0.49 F ecoompa
 14,  1    71 cgaac GATTGTGATTCGATTCACATTT aaaca    92   0.48 F ecotnaa
 15,  1    38 tctaa TTGGGTTAACCACATCACAACA atttc    17   0.49 R ecouxu1
 16,  1    53 atatg CGGTGTGAAATACCGCACAGAT gcgta    74   0.48 F pbr322
 17,  1   105       CGTGCCGATCAACGTCTCATTT tcgcc    84   0.99 R trn9cat
*18,  1    97 aggat ATGTGCGACCACTCACAAATTA acttt    76   0.26 R (tdr)
                    **********************

Seed = 838740876

Comparing these results to the footprint sites in (2,10), it can be seen that 20 of 24 footprint sites are found. Three of the sites (those marked with an *) are sites that have been previously found using an expectation maximization approach (2). The site marked by + is the only site that does not appear in the footprint sites or those sites previously identified.

Full CRP Example

The first portion of the results consists of the command line used to run the program, a list of the options used for the current run and a list of the Fasta headers of the input sequences. The results ection begins with a table with a column for each of the residues (4 for when using DNA nucleotide sequences and 20 when using amino acid sequences -- see Table 4 for the one letter abbreviation for the amino acids) and a row for each of the positions in the motif. The numbers in the table indicate the frequency of occurrence of each residue in each position within the motif throughout the alignment. The last column is an information parameter that indicates how much the column adds to the model. Using this table, it can be deduced which positions are most conserved and what the pattern of the motif is. Following this is a probability model calculated from the frequency table and the priors. Informative priors can be extracted from the information in this table.

The next portion of the results is the alignment of the motifs found. The first column indicates which sequence the motif element occurs in followed by the motif number for that sequence. The second column indicates where the motif element starts within the sequence (in number of residues) while the sixth column indicates the ending location of the motif element. The third and fifth columns indicate residues that occur on either side of the motif element located in the fourth column. The probability that this motif element belongs to the alignment is given in the seventh column. The eighth column indicates whether the element is found in the forward direction (F) or as a reverse complementary element (R). The ninth column provides a description of the sequence. Following the alignment, the random number seed used and map probability are given.

An example of the complete results is given below. It contains the results of searching a set of CRP intergenic sequences for a motif of width 16 using the recursive sampler allowing up to 2 sites in each sequence with an estimated total number of motifs of 20, fragmentation, a palindromic model, and a heterogeneous background model. In this example, the near optimal, frequency and maximal MAP solutions were requested. The default for the recursive, site and motif samplers is to only report the maximal MAP solution


/tmp/gibbs19212/Gibbs19212 -PBernoulli /tmp/gibbs19212/data.txt 16 20 -E 2 -K -m -n -r -W 0.8 -w 0.1 -p 50 -j 5 -i 500 -S 20 -C 0.5 -R 1,1,8 -B /tmp/gibbs19212/back.txt -Y -o /tmp/gibbs19212/outfile.txt 


Gibbs 2.04.001  Mar 24 2003
Data file: /tmp/gibbs19212/data.txt
Output file: /tmp/gibbs19212/outfile.txt
Background Composition Model file: /tmp/gibbs19212/back.txt
The following options are set:
Concentrated Region          False    Sequence type        False
Collapsed Alphabet           False    Pseudocount weight   True 
Use Expectation/Maximization False    Don't Xnu sequence   False
Help flag                    False    Near optimal cutoff  True 
Number of iterations         True     Don't fragment       False
Don't use map maximization   True     Repeat regions       False
Output file                  True     Informed priors file False
Plateau periods              True     palindromic sequence True 
Don't Reverse complement     True     Number of seeds      True 
Seed Value                   False    Pseudosite weight    True 
Suboptimal sampler output    False    Overlap              False
Allow width to vary          False    Wilcoxon signed rank False
Sample along length          False    Output Scan File     False
Output prior file            False    Modular Sampler      True 
Ignore Spacing Model         False    Sample Background    False
Bkgnd Comp Model             True     Init from prior      False
Homologous Seq pairs         False    Parallel Tempering   False
Group Sampler                False    No progress info     False
Fragment from middle         False    Verify Mode          False
Alternate sample on k        True     No freq. soln.       False
Calc. def. pseudo wt.        True 

site_samp            =            0
nMotifLen            =           16
nAlphaLen            =            4
nNumMotifs           =           20
dPseudoCntWt         =          0.1
dPseudoSiteWt        =          0.8
nMaxIterations       =          500
lSeedVal             =   1049467533
nPlateauPeriods      =           50
nSeeds               =           20
nNumMotifTypes       =            1
dCutoff              =          0.5
RevComplement        =            0
glOverlapParam       =            0
Max Sites/seq        =            2
Min Sites/Seq        =            0
Rcutoff factor       =        0.001
Post Plateau Samples =            0
Frag/Shft Per.       =            5
Frag width           =           24


Sequences to be Searched:
_________________________
#1   cole1
#2   ecoarabop
#3   ecobgirl
#4   ecocrp
#5   ecocya
#6   ecodaop
#7   ecogale
#8   ecoilvbpr
#9   ecolac
#10  ecomale
#11  ecomalk
#12  ecomalt
#13  ecoompa
#14  ecotnaa
#15  ecouxu1
#16  pbr322
#17  trn9cat
#18  (tdr)
Processed Sequence Length: 1890 Total sequence length: 1890










======================== NEAR OPTIMAL RESULTS ========================
======================================================================
MAP = 119 maybe = 121 discard = 1769
-------------------------------------------------------------------------
                          MOTIF a





Motif model (residue frequency x 100)
____________________________________________
Pos. #     a   t   c   g  Info
_____________________________
   1 |    29  52   5  11  0.2
   2 |    29  52  .   17  0.3
   3 |    35  35  11  17  0.0
   4 |    .   88  11  .   0.9
   5 |    .   23  11  64  0.7
   6 |    .   94   5  .   1.1
   7 |    17   5  .   76  0.9
   8 |    82   5  11  .   0.7

  15 |    .   76  .   23  0.7
  16 |     5  11  82  .   1.2
  17 |    88  .   .   11  0.9
  18 |     5   5  88  .   1.4
  19 |    82   5   5   5  0.7
  20 |    17  41  35   5  0.2
  21 |    47  52  .   .   0.5
  22 |    41  52   5  .   0.4

nonsite   30  30  18  20
site      30  37  17  14

Motif probability model
____________________________________________
Pos. #    a     t     c     g   
____________________________________________
   1 |  0.295 0.506 0.072 0.127 
   2 |  0.295 0.506 0.019 0.180 
   3 |  0.348 0.348 0.124 0.180 
   4 |  0.032 0.822 0.124 0.022 
   5 |  0.032 0.243 0.124 0.601 
   6 |  0.032 0.874 0.072 0.022 
   7 |  0.190 0.085 0.019 0.706 
   8 |  0.769 0.085 0.124 0.022 

  15 |  0.032 0.716 0.019 0.233 
  16 |  0.084 0.137 0.756 0.022 
  17 |  0.821 0.032 0.019 0.127 
  18 |  0.084 0.085 0.809 0.022 
  19 |  0.769 0.085 0.072 0.075 
  20 |  0.190 0.401 0.335 0.075 
  21 |  0.453 0.506 0.019 0.022 
  22 |  0.400 0.506 0.072 0.022 



Background probability model
        0.303 0.294 0.184 0.219 



16 columns
Num Motifs: 17
   1,  1      61 actgt TTTTTTGATCGTTTTCACAAAA atgga      82   0.56 F cole1
   2,  1      55 attga TTATTTGCACGGCGTCACACTT tgcta      76   1.00 F ecoarabop
   3,  1      76 ttaat AACTGTGAGCATGGTCATATTT ttatc      97   0.96 F ecobgirl
   4,  1      63 tgcat GTATGCAAAGGACGTCACATTA ccgtg      84   0.98 F ecocrp
   5,  1      50 cagca AGGTGTTAAATTGATCACGTTT tagac      71   0.91 F ecocya
   6,  1       7 gtgaa TTATTTGAACCAGATCGCATTA cagtg      28   0.99 F ecodaop
   7,  1      42 tccac TAATTTATTCCATGTCACACTT ttcgc      63   0.58 F ecogale
   8,  1      20 ttttg TTATCTGCAATTCAGTACAAAA cgtga      41   0.59 F ecoilvbpr
   9,  1       9 gcaat TAATGTGAGTTAGCTCACTCAT taggc      30   0.99 F ecolac
  10,  1      14 gccaa TTCTGTAACAGAGATCACACAA agcga      35   0.99 F ecomale
  11,  1      61 aggaa TTTCGTGATGTTGCTTGCAAAA atcgt      82   0.96 F ecomalk
  12,  1      41 tttgg AATTGTGACACAGTGCAAATTC agaca      62   0.90 F ecomalt
  13,  1      48 ttcat ATGCCTGACGGAGTTCACACTT gtaag      69   0.98 F ecoompa
  14,  1      71 cgaac GATTGTGATTCGATTCACATTT aaaca      92   1.00 F ecotnaa
  15,  1      17 gaaat TGTTGTGATGTGGTTAACCCAA ttaga      38   0.54 F ecouxu1
  16,  1      53 atatg CGGTGTGAAATACCGCACAGAT gcgta      74   0.93 F pbr322
  18,  1      78 agtta ATTTGTGAGTGGTCGCACATAT cctgt      99   1.00 F (tdr)
                       ********      ********


Column 1 :  Sequence Number, Site Number
Column 2 :  Left End Location
Column 4 :  Motif Element
Column 5 :  Right End Location
Column 6 :  Probability of Element
Column 7 :  Forward Motif (F) or Reverse Complement (R) 
Column 8 :  Sequence Description from Fast A input

Log Motif portion of MAP for motif a = -249.09778
Log Fragmentation portion of MAP for motif a = -4.78749
Seed = 1049467533	Difference of Logs of Maps = 37.69802


=============================================================
====== ELEMENTS OCCURRING GREATER THAN  50% OF THE TIME =====
======                    Motif a                       =====
=============================================================


Listing of those elements occurring greater than 50% of the time
in near optimal sampling using 500 iterations



Motif model (residue frequency x 100)
____________________________________________
Pos. #     a   t   c   g  Info
_____________________________
   1 |    29  52   5  11  0.2
   2 |    29  52  .   17  0.3
   3 |    35  35  11  17  0.0
   4 |    .   88  11  .   0.9
   5 |    .   23  11  64  0.7
   6 |    .   94   5  .   1.1
   7 |    17   5  .   76  0.9
   8 |    82   5  11  .   0.7

  15 |    .   76  .   23  0.7
  16 |     5  11  82  .   1.2
  17 |    88  .   .   11  0.9
  18 |     5   5  88  .   1.4
  19 |    82   5   5   5  0.7
  20 |    17  41  35   5  0.2
  21 |    47  52  .   .   0.5
  22 |    41  52   5  .   0.4

nonsite   30  30  18  20
site      30  37  17  14

Motif probability model
____________________________________________
Pos. #    a     t     c     g   
____________________________________________
   1 |  0.295 0.506 0.072 0.127 
   2 |  0.295 0.506 0.019 0.180 
   3 |  0.348 0.348 0.124 0.180 
   4 |  0.032 0.822 0.124 0.022 
   5 |  0.032 0.243 0.124 0.601 
   6 |  0.032 0.874 0.072 0.022 
   7 |  0.190 0.085 0.019 0.706 
   8 |  0.769 0.085 0.124 0.022 

  15 |  0.032 0.716 0.019 0.233 
  16 |  0.084 0.137 0.756 0.022 
  17 |  0.821 0.032 0.019 0.127 
  18 |  0.084 0.085 0.809 0.022 
  19 |  0.769 0.085 0.072 0.075 
  20 |  0.190 0.401 0.335 0.075 
  21 |  0.453 0.506 0.019 0.022 
  22 |  0.400 0.506 0.072 0.022 



Background probability model
        0.303 0.294 0.184 0.219 



16 columns
Num Motifs: 17
   1,  1      61 actgt TTTTTTGATCGTTTTCACAAAA atgga      82   1.00 F cole1
   2,  1      55 attga TTATTTGCACGGCGTCACACTT tgcta      76   1.00 F ecoarabop
   3,  1      76 ttaat AACTGTGAGCATGGTCATATTT ttatc      97   1.00 F ecobgirl
   4,  1      63 tgcat GTATGCAAAGGACGTCACATTA ccgtg      84   1.00 F ecocrp
   5,  1      50 cagca AGGTGTTAAATTGATCACGTTT tagac      71   1.00 F ecocya
   6,  1       7 gtgaa TTATTTGAACCAGATCGCATTA cagtg      28   1.00 F ecodaop
   7,  1      42 tccac TAATTTATTCCATGTCACACTT ttcgc      63   1.00 F ecogale
   8,  1      20 ttttg TTATCTGCAATTCAGTACAAAA cgtga      41   1.00 F ecoilvbpr
   9,  1       9 gcaat TAATGTGAGTTAGCTCACTCAT taggc      30   1.00 F ecolac
  10,  1      14 gccaa TTCTGTAACAGAGATCACACAA agcga      35   1.00 F ecomale
  11,  1      61 aggaa TTTCGTGATGTTGCTTGCAAAA atcgt      82   1.00 F ecomalk
  12,  1      41 tttgg AATTGTGACACAGTGCAAATTC agaca      62   1.00 F ecomalt
  13,  1      48 ttcat ATGCCTGACGGAGTTCACACTT gtaag      69   1.00 F ecoompa
  14,  1      71 cgaac GATTGTGATTCGATTCACATTT aaaca      92   1.00 F ecotnaa
  15,  1      17 gaaat TGTTGTGATGTGGTTAACCCAA ttaga      38   0.99 F ecouxu1
  16,  1      53 atatg CGGTGTGAAATACCGCACAGAT gcgta      74   0.99 F pbr322
  18,  1      78 agtta ATTTGTGAGTGGTCGCACATAT cctgt      99   1.00 F (tdr)
                       ********      ********

Difference of Logs of Maps = 37.69802

Log Background portion of Map = -2221.57441
Log Alignment portion of Map = -75.32389
Log Site/seq portion of Map = -7.36221
Log Null Map = -2595.84380
Non Palindromic Map = 20.07571
Max subopt MAP found on seed 2

Frequency Map = 37.698023
Nearopt Map   = 37.698023



log MAP = sum of motif and fragmentation parts of MAP + background + alignment + sites/seq - Null

Elapsed time: 62.100000 secs

Output from the Centroid Sampler

An example of the output from the centroid sampler is given below. It contains the results of searching the set of CRP intergenic sequences described above for a motif of width 16 using the centroid sampler allowing up to 2 sites in each sequence with an estimated total number of motifs of 20, fragmentation, a palindromic model, and a heterogeneous background model. In this example, an aligned centroid model was chosen


/users/thompson/gibbs/bin/Gibbs.opteron /users/thompson/gibbs/data/crp.fa 16 20 -n -E 2 -r -R 1,1,8 -bayes -align_centroid -o /users/thompson/gibbs/output/crp.082707.out 


Gibbs 3.04.001  Aug 21 2007
Data file: /users/thompson/gibbs/data/crp.fa
Output file: /users/thompson/gibbs/output/crp.082707.out
Current directory: /users/thompson/gibbs/tmp
The following options are set:
Concentrated Region          False    Sequence type        False
Collapsed Alphabet           False    Pseudocount weight   False
Use Expectation/Maximization False    Don't Xnu sequence   False
Help flag                    False    Near optimal cutoff  False
Number of iterations         False    Don't fragment       False
Don't use map maximization   True     Repeat regions       False
Output file                  True     Informed priors file False
Plateau periods              False    palindromic sequence True 
Don't Reverse complement     True     Number of seeds      False
Seed Value                   False    Pseudosite weight    False
Suboptimal sampler output    False    Overlap              False
Allow width to vary          False    Wilcoxon signed rank False
Sample along length          False    Output Scan File     False
Output prior file            False    Modular Sampler      True 
Ignore Spacing Model         False    Sample Background    False
Bkgnd Comp Model             False    Init from prior      False
Homologous Seq pairs         False    Parallel Tempering   False
Group Sampler                False    No progress info     False
Fragment from middle         False    Verify Mode          False
Alternate sample on k        False    No freq. soln.       True 
Calc. def. pseudo wt.        False    Motif/Recur smpl     False
Phylogenetic Sampling        False    Supress Near Opt.    True 
Nearopt display cutoff       False    Sample model         True 
Hierarchical Model           False    Centroid model       True 
Print Bayesian Counts        False    Align Centroid       True 
Calculate Credibility Limits True 

site_samp            =            0
nMotifLen            =           16
nAlphaLen            =            4
nNumMotifs           =           20
dPseudoCntWt         =          0.1
dPseudoSiteWt        =          0.8
lSeedVal             =   1188244180
nPlateauPeriods      =           20
nSeeds               =           10
nNumMotifTypes       =            1
dCutoff              =         0.01
dNearOptDispCutoff   =          0.5
RevComplement        =            0
glOverlapParam       =            0
Max Sites/seq        =            2
Min Sites/Seq        =            0
Rcutoff factor       =            0
Post Plateau Samples =            0
Frag/Shft Per.       =            5
Frag width           =           24
Burn-in              =         2000
Sample Period        =         8000


Sequences to be Searched:
_________________________
#1   cole1
#2   ecoarabop
#3   ecobgirl
#4   ecocrp
#5   ecocya
#6   ecodaop
#7   ecogale
#8   ecoilvbpr
#9   ecolac
#10  ecomale
#11  ecomalk
#12  ecomalt
#13  ecoompa
#14  ecotnaa
#15  ecouxu1
#16  pbr322
#17  trn9cat
#18  (tdr)
Processed Sequence Length: 1890 Total sequence length: 1890










======================================================================
========================== CENTROID RESULTS ==========================
======================================================================

    1, 1      12 ttgtg CTGGTTTTTGTGGCATCGGGC    gagaa      32  0.66 cole1
    1, 2      54 gtgaa AGACTGTTTTTTTGATCGTTTTC  acaaa      76  0.98 cole1
    2, 1      56 ttgat TATTTGCACGGCGTCACAC      tttgc      74  0.98 ecoarabop
    3, 1      79 ataac TGTGAGCATGGTCATATTTTTA   tcaat     100  0.91 ecobgirl
    4, 1      55 tgatg TACTGCATGTATGCAAAGGACGT  cacat      77  0.84 ecocrp
    5, 1      48 atcag CAAGGTGTTAAATTGATCACGTT  ttaga      70  0.76 ecocya
    6, 1       5  agtg AATTATTTGAACCAGATCGCATTA cagtg      28  0.97 ecodaop
    6, 2      67 ttgtg ATGTGTATCGAAGTGTGTTGCGG  agtag      89  0.86 ecodaop
    7, 1      31 gtgta AACGATTCCACTAATTTATTCCA  tgtca      53  0.90 ecogale
    8, 1      29 ctgca ATTCAGTACAAAACGTGATCAAC  ccctc      51  0.88 ecoilvbpr
    9, 1       8 cgcaa TTAATGTGAGTTAGCTCACTC    attag      28  0.97 ecolac
    9, 2      79 tgtgt GGAATTGTGAGCGGATAACAATT   tcac     101  0.66 ecolac
   10, 1       8 attac CGCCAATTCTGTAACAGAGATCA  cacaa      30  0.97 ecomale
   11, 1      31 ggctt CTGTGAACTAAACCGAGGTCATG  taagg      53  0.57 ecomalk
   11, 2      56 atgta AGGAATTTCGTGATGTTGCTT    gcaaa      76  0.83 ecomalk
   12, 1      41 tttgg AATTGTGACACAGTGCAAATTCA  gacac      63  0.94 ecomalt
   13, 1      48 ttcat ATGCCTGACGGAGTTCACACTTG  taagt      70  0.83 ecoompa
   14, 1      78 ttgtg ATTCGATTCACATTTAAACAA    tttca      98  0.92 ecotnaa
   15, 1      15 gtgaa ATTGTTGTGATGTGGTTAACCCA  attag      37  0.55 ecouxu1
   16, 1      53 atatg CGGTGTGAAATACCGCACAGATG  cgtaa      75  0.85 pbr322
   18, 1      78 agtta ATTTGTGAGTGGTCGCACATAT   cctgt      99  0.99 (tdr)
Num Sites: 21

Column 1 :  Sequence Number, Site Number
Column 2 :  Left End Location
Column 4 :  Motif Element
Column 6 :  Right End Location
Column 7 :  Probability of Element
Column 8 :  Sequence Description from FastA input


Credibility limits
% samples within specified number of mismatched sites
from the sequence centroid.
                                 Avg. dist.
Seq.      95%     90%     85%    to samples
-------------------------------
1    	    1	    1	    1	 0.39	cole1
2    	    1	    1	    1	 0.47	ecoarabop
3    	    2	    1	    1	 0.70	ecobgirl
4    	    1	    1	    1	 0.39	ecocrp
5    	    2	    1	    1	 0.62	ecocya
6    	    1	    1	    1	 0.20	ecodaop
7    	    2	    1	    1	 0.73	ecogale
8    	    1	    1	    1	 0.59	ecoilvbpr
9    	    1	    1	    1	 0.39	ecolac
10   	    1	    1	    0	 0.10	ecomale
11   	    2	    2	    2	 0.64	ecomalk
12   	    1	    1	    1	 0.33	ecomalt
13   	    1	    1	    1	 0.40	ecoompa
14   	    1	    1	    1	 0.61	ecotnaa
15   	    2	    1	    1	 0.73	ecouxu1
16   	    1	    1	    1	 0.21	pbr322
17   	    1	    0	    0	 0.07	trn9cat
18   	    1	    1	    1	 0.50	(tdr)
Avg.	 1.28	 1.00	 0.94	 0.45






======================================================================
======================== Aligned Centroid Sites ======================
======================================================================

-------------------------------------------------------------------------
                          MOTIF a





Motif model (residue frequency x 100)
____________________________________________
Pos. #     a   t   c   g  Info
_____________________________
   1 |    19  38  19  23  0.0
   2 |    .   90   9  .   1.0
   3 |    .   14   4  80  1.0
   4 |    .   90   9  .   1.0
   5 |    19   4  .   76  0.9
   6 |    85  .    4   9  0.9

   8 |    28  19  28  23  0.1
   9 |     4  42  19  33  0.2
  10 |    47  23   4  23  0.2
  11 |     9  19  14  57  0.4

  13 |     4  61   9  23  0.3
  14 |     9   4  85  .   1.4
  15 |    71  .   .   28  0.8
  16 |    23   4  71  .   1.0
  17 |    66   9   4  19  0.4
  18 |    23  33  23  19  0.0

nonsite   28  32  16  22
site      25  28  19  26

Motif probability model
____________________________________________
Pos. #    a     t     c     g   
____________________________________________
   1 |  0.199 0.376 0.188 0.237 
   2 |  0.026 0.852 0.102 0.020 
   3 |  0.026 0.159 0.059 0.756 
   4 |  0.026 0.852 0.102 0.020 
   5 |  0.199 0.073 0.015 0.713 
   6 |  0.805 0.029 0.059 0.107 

   8 |  0.285 0.203 0.275 0.237 
   9 |  0.069 0.419 0.188 0.323 
  10 |  0.459 0.246 0.059 0.237 
  11 |  0.112 0.203 0.145 0.540 

  13 |  0.069 0.592 0.102 0.237 
  14 |  0.112 0.073 0.795 0.020 
  15 |  0.675 0.029 0.015 0.280 
  16 |  0.242 0.073 0.665 0.020 
  17 |  0.632 0.116 0.059 0.194 
  18 |  0.242 0.333 0.232 0.194 



Background probability model
        0.307 0.362 0.143 0.187 



16 columns
Num Motifs: 21
   1,  1      19 ggttt TTGTGGCATCGGGCGAGA atagc     36 1.00 F cole1
   1,  2      63 tgttt TTTTGATCGTTTTCACAA aaatg     80 1.00 F cole1
   2,  1      57 tgatt ATTTGCACGGCGTCACAC tttgc     74 1.00 F ecoarabop
   3,  1      78 aataa CTGTGAGCATGGTCATAT tttta     95 1.00 F ecobgirl
   4,  1      65 catgt ATGCAAAGGACGTCACAT taccg     82 1.00 F ecocrp
   5,  1      52 gcaag GTGTTAAATTGATCACGT tttag     69 1.00 F ecocya
   6,  1       9 gaatt ATTTGAACCAGATCGCAT tacag     26 1.00 F ecodaop
   6,  2      62 cttaa TTGTGATGTGTATCGAAG tgtgt     79 1.00 F ecodaop
   7,  1      26 ttctt GTGTAAACGATTCCACTA attta     43 1.00 F ecogale
   8,  1      24 gttat CTGCAATTCAGTACAAAA cgtga     41 1.00 F ecoilvbpr
   9,  1      11 aatta ATGTGAGTTAGCTCACTC attag     28 1.00 F ecolac
   9,  2      75 atgtt GTGTGGAATTGTGAGCGG ataac     92 1.00 F ecolac
  10,  1      16 caatt CTGTAACAGAGATCACAC aaagc     33 1.00 F ecomale
  11,  1      31 ggctt CTGTGAACTAAACCGAGG tcatg     48 1.00 F ecomalk
  11,  2      63 gaatt TCGTGATGTTGCTTGCAA aaatc     80 1.00 F ecomalk
  12,  1      43 tggaa TTGTGACACAGTGCAAAT tcaga     60 1.00 F ecomalt
  13,  1      50 catat GCCTGACGGAGTTCACAC ttgta     67 1.00 F ecoompa
  14,  1      73 aacga TTGTGATTCGATTCACAT ttaaa     90 1.00 F ecotnaa
  15,  1      19 aattg TTGTGATGTGGTTAACCC aatta     36 1.00 F ecouxu1
  16,  1      55 atgcg GTGTGAAATACCGCACAG atgcg     72 1.00 F pbr322
  18,  1      80 ttaat TTGTGAGTGGTCGCACAT atcct     97 1.00 F (tdr)
                       ****** **** ******


Column 1 :  Sequence Number, Site Number
Column 2 :  Left End Location
Column 4 :  Motif Element
Column 6 :  Right End Location
Column 7 :  Probability of Element
Column 8 :  Forward Motif (F) or Reverse Complement (R) 
Column 9 :  Sequence Description from Fast A input

Elapsed time: 698.280000 secs


Appendix A

Possible Errors

Errors will be generated and execution terminated when any of the input parameters have values that are not within their range. If the input file, prior file or output file cannot be opened, then an error message will be generated and execution terminated. 

The most common error is that the input file contains a letter which does not code for an amino acid or nucleotide, depending on which model is begin used. then an error message is displayed and execution halts. To fix this problem, the input file should be examined to make sure that it is in fasta format and all of the sequences are protein or amino acids codes.

When randomly selecting motif positions, an error can occur. This is possible if there are not enough positions in the sequence to set random motif positions according to the user specified motif width and expected number of motifs. The solution is to decrease either the motif width or the expected number of motifs. 

Possible Error Messages
MessageDescription
Adjust period must be greater than 0A value less than 0 was entered for the number of iterations between fragmentation steps.
cannot allocate random motifsshorten the number of motifs or motif length Either there are too many expected sites specified for the amount of sequence data or when using the -E option, Gibbs is unable to allocate the default number of expected sites. Try decreasing the number of expected sites and re-run.
Cannot open background composition file filenameThe filename entered after -B could not be found or you do not have proper permission to read it. Contact the system administrator.
Cannot open description file for readingA file system error has occurred. Contact the system administrator.
Cannot open output file filenameAn invalid file name was entered. Gibbs was unable to open the specified file because the path did not exist or a file with the same name exists and is read-only. Contact the system administrator.
Cannot open prior file filenameFilename could not be found or you do not have proper permission to read it.
Cannot open spacing probability file filename>SPACING command: The filename containing the position probability values could not be found or you do not have proper permission to read it. Contact the system administrator.
Cannot open Wilcoxon Control file filenameThe filename entered after -l could not be found or you do not have proper permission to read it. Contact the system administrator.
Can't initialize sequence nnn with more than ddd sitesWhen using the -E option, Gibbs is unable to allocate the default number of expected sites. Try entering an expected number of sites and re-run. nnn is the sequence and ddd is the default number of sites for this sequence.
ccc (ascii ddd) is not a valid amino acid abbreviation in sequence nnnAn invalid character has been encountered in the data file. ccc is the character, ddd is the ASCII code for the character and nnn is the sequence containing the invalid character.
ccc (ascii ddd) is not a valid Nucleic Acid in sequence nnnAn invalid character has been encountered in the data file. ccc is the character, ddd is the ASCII code for the character and nnn is the sequence containing the invalid character.
Collapsed begin pos is greater than end posThe specified collapsed range position is greater than the ending position.
Collapsed subscript overflowThe collapsed range beginning or ending position is greater than half the motif length.
Collapsed subscript underflowThe collapsed range beginning or ending position is less than 1.
Concentrated region begin pos is greater than end posThe specified concentrated alphabet position is greater than the ending position.
Concentrated region subscript overflowThe concentrated alphabet beginning or ending position is greater than half the motif length.
Concentrated region subscript underflowThe concentrated alphabet beginning or ending position is less than 1.
Error in priors file: lineAn invalid command was encountered in a prior file. The offending line is printed.
Error in Pseudo mmm - should be nnn lines>PSEUDO command: The number of rows entered for pseudocounts for motif nnn does not match the motif width.
Error in Pseudo: line>PSEUDO command: A line containing invalid data was encountered. The offending line is printed. Pseudocount lines should only contain numerical data.
get_strings: Sequence nnn has length = 0An empty sequence has been encountered in the data file.
Invalid data - probably a missing FASTA headerAn invalid data file was read. The most likely cause is a missing FASTA header for the first sequence.
Invalid NucleotideAn invalid character was detected in a DNA sequence. DNA sequences may only contain A,T,C,G, or N.
Iteration count must be at least 1A value of 0 was entered for the -i option.
-M: Fragmentation widths must be greater than or equal to motif lengthsThe value entered after -M was less than the number of conserved positions for the motif.
-M: Mismatched fragmentation begin and end rangesA motif number was entered but no maximum fragmentation width.
-M: Missing fragmentation width argumentsNo value was entered for the maximum fragmentation width.
-M: Negative fragmentation widthThe value entered after -M was either 0 or not a valid number.
Mismatched Palindrome begin and end rangesThe end range of the palindrome is missing.
Mismatched repeat begin and end rangesThe end range of the direct repeat is missing.
MISSING ARGUMENTSThis is a general error for an ill-formed command line. Probable causes are a missing data file name or a missing motif width.
Missing collapsed range argumentsAn invalid collapsed range specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.
Missing concentrated region argumentsAn invalid concentrated alphabet specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.
Missing palindromic range argumentsAn invalid palindrome specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.
Missing repeat range argumentsAn invalid direct repeat specification was encountered on the command line. Be sure that the specified range has a beginning and ending position.
Missing sequence data - sequence nnnA FASTA header was encountered, but the sequence was empty. nnn is the sequence number.
Must try at least one seedA seed value of 0 was entered for the -S option.
Near Opt cutoff must be between 0 and 1An invalid probability value was entered for -C option.
-o: Missing ArgumentNo name was entered for the output file.
-P: Missing ArgumentNo name was entered for the prior information file.
Palindrome begin pos is greater than end posThe specified palindrome starting position is greater than the ending position.
Palindrome subscript overflowThe palindrome beginning or ending position is greater than half the motif length.
Palindrome subscript underflowThe palindrome beginning or ending position is less than 1.
Plateau periods must be >= 5The minimum plateau period is 5 iterations.
Positions are defined as both palindromes and repeatsYou have specified the same motif positions as both palindromes and direct repeat positions.
Pseudocount weight must be between 0 and 1An invalid probability value was entered for -w option.
Pseudocounts sum to 0>PSEUDO command: A pseudocount row either contains no numerical data or contains all zeros.
Pseudosite weight must be between 0 and 1An invalid probability value was entered for -W option.
ReadBeginSite: Error in starting site probability>BEGIN command: A line containing invalid data was encountered. The offending line is printed. Beginning probability lines should only contain numerical data.
ReadBeginSite:Sum of probabilities = 0>BEGIN command: The row either contains no numerical data or contains all zeros.
ReadBkgndComposition: Invalid position in background file: seq: sss pos: pppThe format of the background composition file is invalid. Check that the file has not been corrupted.
ReadBkgndComposition: Invalid sequence number in background file: nnnThe format of the background composition file is invalid. Check that the file has not been corrupted.
ReadBkgndComposition: Seq: sss Position: ppp probabilities do not sum to 1The format of the background composition file is invalid. Check that the file has not been corrupted.
ReadBlockProb: Probability values must be >= 0>BLOCKS command: A number less than 0 was entered for a site per sequence probability.
ReadEndSite: Error in end site probability>END command: A line containing invalid data was encountered. The offending line is printed. Ending probability lines should only contain numerical data.
ReadEndSite:Sum of probabilities = 0>END command: The row either contains no numerical data or contains all zeros.
ReadMask: Invalid mask length. Should be nnn>MASK command: The number of entries in the fragmentation mask does not match the fragmentation width.
ReadMask: Mask codes must be >= 0 and <= 3>MASK command: A value which is not 0,1, or 3 was read as a mask character.
ReadMask: Remove -F option if specifying a fragmentation mask>MASK command: You have specified a fragmentation mask while suppressing fragmentation with the -F option.
ReadTransProbability: Error in transition probability table>TRANS command: A line containing invalid data was encountered. The offending line is printed. Transition matrix lines should only contain numerical data.
ReadTransProbability: Trans weight must be between 0 and 1>TRANS command: An invalid weight parameter was entered.
Remove -F option when using -J optionYou have attempted to suppress fragmentation while requesting fragmentation from the center positions.
Repeat begin pos is greater than end posThe specified direct repeat starting position is greater than the ending position.
Repeat subscript overflowThe direct repeat beginning or ending position is greater than half the motif length.
Repeat subscript underflowThe direct repeat beginning or ending position is less than 1.
-S: Missing ArgumentsNo value was found after the -S option.
strip_maxblocks: -E must have at least one motif per sequenceThe value entered after -E was either 0 or not a valid number.
stripargs - Getting Motif Lengths: motif lengths must be > 0An invalid number of conserved positions was entered for the motif.
The maximum temperature must be positive.The value entered for the maximum temperature after the -X option was 0 or negative.
The minimum temperature must be positive.The value entered for the minimum temperature after the -X option was 0 or negative.
The number of exchange iterations must be >= 1.The value entered for the number of iterations between temperature exchanges must be greater than 0 for the -X option.
The Wilcoxon control file must contain the same number of sequences as the data fileThe Wilcoxon file containing negative control sequences does not contain the same number of sequences as the input data file.
Unable to get temporary file nameA file system error has occurred. Contact the system administrator.
Wilcoxon control file errorA file system error has occurred. Contact the system administrator.
Wilcoxon Control file: Sequence nnn has a different length than the sequence in the data fileThe negative control sequences must have the same length and be in the same order as the sequences as the input data file.

Names and Abbreviations of the Common Amino Acids

         Amino Acid	  Three Letter 	  One Letter
                          Abbreviation   Abbreviation
          ___________________________________________
          Alanine	      Ala	      A
          Arginine            Arg	      R
          Asparagine	      Asn	      N
          Aspartic Acid	      Asp	      D
          Cysteine	      Cys	      C
          Glutamine	      Gln	      Q
          Glutamic Acid	      Glu	      E
          Glycine	      Gly	      G
          Histidine	      His	      H
          Isoleucine	      Ile	      I
          Leucine	      Leu	      L
          Lysine	      Lys	      K
          Methionine	      Met	      M
          Phenylalanine	      Phe	      F
          Proline	      Pro	      P
          Serine	      Ser	      S
          Threonine	      Thr	      T
          Tryptophan	      Trp	      W
          Tyrosine	      Tyr	      Y
          Valine	      Val	      V


References

1. Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, and Wootton JC. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262(5131):208-214. PubMed: 8211139. doi: 10.1126/science.8211139.

2. Lawrence CE, and Reilly AA. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins 7(1):41-51. PubMed: 2184437. doi: 10.1002/prot.340070105.

3. Liu JS, Neuwald AF, and Lawrence CE. (1995) Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J Am Stat Assoc 90(432):1156-1170. doi: 10.2307/2291508

4. Liu JS and Lawrence CE. (1999) Bayesian inference on biopolymer models. Bioinformatics 15(1):38-52. PubMed: 10068691. doi: 10.1093/bioinformatics/15.1.38.

5. Martin DI and Orkin SH. (1990) Transcriptional activation and DNA binding by the erythroid factor GF-1/NF-E1/Eryf 1. Genes Dev 4(11):1886-1898. PubMed: 2276623. doi: 10.1101/gad.4.11.1886.

6. McCue LA, Thompson W, Carmack CS, Ryan MP, Liu JS, Derbyshire V, and Lawrence CE. (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res 29(3):774-782. PubMed: 11160901. doi: 10.1093/nar/29.3.774.

7. Neuwald AF, Liu JS, and Lawrence CE. (1995) Gibbs motif sampling: detection of bacterial outer membrane protein repeats. Protein Sci 4(8):1618-1632. PubMed: 8520488. (link)

8. Omichinski JG, Clore GM, Schaad O, Felsenfeld G, Trainor C, Appella E, Stahl SJ, and Gronenborn AM. (1993) NMR structure of a specific DNA complex of Zn-containing DNA binding domain of GATA-1. Science 261(5120):438-446. PubMed: 8332909. doi: 10.1126/science.8332909.

9. Orkin SH. (1995) Regulation of globin gene expression in erythroid cells. Eur J Biochem 231(2):271-281. PubMed: 7635138.

10. Stormo GD and Hartzell GW 3rd. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 86(4):1183-1187. PubMed: 2919167. doi: 10.1073/pnas.86.4.1183.

11. Liu JS, Neuwald AF, and Lawrence CE. (1999) Markovian Structures in Biological Sequence Alignments. J Am Stat Assoc 94(445):1-15. (link)

12. Wasserman WW, Palumbo M, Thompson W, Fickett JW, and Lawrence CE. (2000) Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26(2):225-228. PubMed: 11017083. doi: 10.1038/79965.

13. Wanner BL. (1996) Phosphorus assimilation and control of the phosphate regulon. In Neidhardt FC, Curtiss R 3rd, Ingraham JL, Lin ECC, Low KB, Magasanik B, Reznikoff WS, Riley M, Schaechter M, and Umbarger HE (ed.). Escherichia coli and Salmonella: Cellular and Molecular Biology 1357-1381. American Society for Microbiology, Washington, DC.

14. Newberg LA, Thompson WA, Conlan S, Smith TM, McCue LA, and Lawrence CE. (2007) A phylogenetic Gibbs sampler that yields centroid solutions for cis regulatory site prediction. Bioinformatics . PubMed: 17488758. doi: 10.1093/bioinformatics/btm241.

15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389-3402. PubMed: 9254694. doi: 10.1093/nar/25.17.3389.

16. Thompson W, Rouchka EC, and Lawrence CE. (2003) Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res 31(13):3580-5. PubMed: 12824370. doi: 10.1093/nar/gkg608.

17. Thompson W, Palumbo MJ, Wasserman WW, Liu JS, and Lawrence CE. (2004) Decoding human regulatory circuits. Genome Res 14(10A):1967-74. PubMed: 15466295. doi: 10.1101/gr.2589004.

18. Thompson WA, Newberg LA, Conlan S, McCue LA, and Lawrence CE. (2007) The Gibbs Centroid Sampler. Nucleic Acids Res . PubMed: 17483517. doi: 10.1093/nar/gkm265.


back Gibbs home pageemail