The paper is available from Genome Research (abstract).
Data Files:
All sequence files are in FASTA format. .gz files have been compressed with gzip.
2710.aligned.fa.gz - the set of 2710 human-mouse 10 kb sequence pairs used for sequence mining, aligned with BLASTZ and masked with RepeatMasker
24.aligned.pos.train.fa - the original 24 pairs of human-rodent sequences used as the positive training set. The file reported.sites contains the positions of the reported sites in this file.
100.aligned.neg.train.fa - 100 pairs of human-mouse sequences used as the negative training set
13.aligned.pos.valid.fa - 13 pairs of human-rodent sequences used as the positive validation set
100.aligned.neg.valid.fa - 100 pairs of human-mouse sequences used as the negative validation set
13.nonaligned.pos.valid.fa - the 13 pairs of human-rodent sequences used as the positive validation set, unaligned
2910.nonaligned.human.fa.gz - the full set of unaligned human sequences. These sequences have been processed with RepeatMasker. The unaligned sequences from 2710.aligned.fa.gz, 100.aligned.neg.train.fa, and 100.aligned.neg.valid.fa are contained in this file.
2910.nonaligned.mouse.fa.gz - the full set of unaligned mouse sequences.
24.nonaligned.pos.train.fa.tar.gz - the 24 pairs of positive training files before alignment with BLASTZ
20.aligned.liver.fa - 20 aligned pairs of liver specific sequences
crp.fa - a sample FASTA file for testing Gibbs. It contains 18 E. coli sequences containing known CRP TFBS.
crp.sites.dat - a sample collection of sites for testing dscan
blastz.to.fasta.pl - a perl program for converting the output from BLASTZ to aligned fasta sequences.
Files labeled as .comp.gz are gzipped background composition files for use with Gibbs.
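The sequence files above can be read with a few lines of Python. The following is a minimal sketch, not part of the distributed tools: the function names are hypothetical, and read_pairs assumes that each aligned pair is stored as two consecutive FASTA records, which should be checked against the actual files.

```python
import gzip
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def open_fasta(path: str):
    """Open a FASTA file as text, decompressing when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) for each record in an iterable of FASTA lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif line and header is not None:
            # Sequence lines may contain alignment gaps ("-") and masked bases.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[Record, Record]]:
    """Group records two at a time.

    Assumption: the two members of a pair are adjacent records in the file.
    """
    records = read_fasta(lines)
    for first in records:
        second = next(records, None)
        if second is None:
            raise ValueError("odd number of FASTA records; cannot form pairs")
        yield first, second
```

For example, `with open_fasta("2710.aligned.fa.gz") as handle: pairs = list(read_pairs(handle))` would load the pairs into memory, provided the pairing assumption holds.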
Reported Sites
=============
reported.sites - contains a list of reported sites and positions for 24.aligned.pos.train.fa
Bayes factor ratios
===================
Each file contains columns for the human RefSeq Id, mouse RefSeq Id, Bayes ratio and number of modules found in 10 sampling steps.
ratio.2710.061804 - Bayes ratio for 2710 human-mouse sequence pairs
predicted.sites.2710 - contains all sites predicted during data mining in the 2710 human-mouse sequence pairs. It lists the site type, the predicted site and its position in the aligned sequences.
pos.validation.13 - ratios for the 13 positive validation pairs
neg.validation.100 - ratios for 100 negative validation pairs
pos.xvalid.21 - ratios from cross-validation for 21 positive training sequence pairs with predicted modules
neg.xvalid.24 - ratios from cross-validation for 24 negative training sequence pairs with predicted modules
hum.mouse.pr - an annotated prior file for use with the modular sampler. This file contains the parameters we used with Gibbs to analyze 24.aligned.pos.train.fa.
ortho.110206.pr - prior file for the CRP data
stb5.tar.gz - simulated yeast sequences
phylo.101706.1.pr - prior file for yeast sequences
studyset.tar.gz - prokaryotic data set
regulon.tar.gz - regulon data set
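The Bayes ratio files described at the top of this section (human RefSeq Id, mouse RefSeq Id, Bayes ratio, number of modules) can be loaded with a short parser. This is a hedged sketch: it assumes the columns are whitespace-delimited and appear in the order listed, and it silently skips any line that does not fit that shape, such as a header. The names RatioRecord and parse_ratio_lines are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class RatioRecord(NamedTuple):
    human_refseq: str
    mouse_refseq: str
    bayes_ratio: float
    n_modules: int


def parse_ratio_lines(lines: Iterable[str]) -> Iterator[RatioRecord]:
    """Parse lines of a ratio file into records.

    Assumption: four whitespace-delimited columns in the order
    human id, mouse id, Bayes ratio, module count.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        try:
            ratio = float(fields[2])
            n_modules = int(fields[3])
        except ValueError:
            continue  # e.g. a header line with non-numeric columns
        yield RatioRecord(fields[0], fields[1], ratio, n_modules)


def top_pairs(records: Iterable[RatioRecord], n: int = 10):
    """Return the n records with the largest Bayes ratio."""
    return sorted(records, key=lambda r: r.bayes_ratio, reverse=True)[:n]
```

With a real file, `with open("ratio.2710.061804") as fh: best = top_pairs(parse_ratio_lines(fh))` would list the ten highest-ratio pairs, if the column assumptions above are correct.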
Information on obtaining the Gibbs sampler may be found at http://bayesweb.wadsworth.org/gibbs/gibbs.html.
If you have comments or questions about these files or Gibbs, please contact Bill Thompson.