COLLECT/ANALYZE.IDENTITY.pl README revised 1/8/2006 ###################### # Contents of package: ###################### README - this file LICENSE.txt - the academic licensing agreement, please read collect.identity.pl - automates pairwise alignment and make identity.txt table analyze.identity.pl - analyzes identity.txt shuffle.fasta.pl - randomly permutes sequences boxplot.R - a source file for making a boxplot from identity.txt in R toyalign.cpp - C++ source code for a simple aligner (testing only) random/ - A set of 1000 randomized sequence pairs for testing ###################### # TODO ###################### In a future version of the code: - Improve model used to convert % identity to phylogenetic distance ###################### # System requirements: ###################### These scripts were developed on a Solaris workstation but should be directly portable to any UNIX-like operating system running Perl. The collect.identity.pl script does call the "grep" command so if you are running this on a Windows system using a Windows version of Perl, you will need grep or something like grep. An external aligner is called and must be installed. It is assumed that the aligner is in the current user path. If this is not the case, you need to add a hard path to the command that calls the aligner in the collect.identity script (see below). ########### # Aligners: ########### Three commonly used pairwise aligners are supported: 1) Gap, which is part of the Wisconsin GCG Package and available from Accelrys Software Inc. It is implemented with the -gap option. or 2) Needle and 3) Water which are pairwise aligners distributed as part of the EMBOSS Molecular Biology Package. They are implemented with the -needle and -water options. also 4) A fourth option, provided solely for testing, is the toyalign.cpp code distributed with the package. It is a C++ implementation of the Basic Global Alignment algorithm described in "Introduction to Computational Molecular Biology", Setubal and Meidanis, PWS Publishing Company, Boston, Chapter 3. It will produce a pairwise alignment and can be used to test the scripts before downloading an aligner. It is NOT to be used for doing actual alignments as there are much better algorithms out there and this implementation has not been extensively tested. It is implemented with the -user option. Additional aligners can be added. See the code for the user specified aligner in the collect.identity.pl script. You can modify this code to suit your needs or copy it to make another option. If you copy the code (i.e., make a 5th aligner option) you will also need to edit the section at the top of the code where the flag is read off the command line and the $aligner variable is set. It is important to note that the script assumes that most aligners will be in the user's path. The exception is the toy aligner which is assumed to be in the working directory. If you wish to hardcode a path to an aligner you would modify the code as follows (using water as an example): ORIGINAL: `water $file1 $file2 -gapopen=10.0 -gapextend=0.5 -outfile=water.aln 2>/dev/null`; NEW (for instance, water is in /emboss/): `/emboss/water $file1 $file2 -gapopen=10.0 -gapextend=0.5 -outfile=water.aln 2>/dev/null`; As you can see, this is also where you might want to change alignment parameters. NOTE on Multiple alignment algorithms While we focus on pairwise aligners, support for ClustalW is also built in via the -clustalw alignment option. This has not been tested as thoroughly as the pairwise implementations but appears to work. You can also invoke ClustalW as a pairwise aligner using pwclustalw. ############## # Data Format: ############## The collect.identity script expects the input file to be in Fasta format. While it is sometimes frowned upon to impose constraints on the information in the Fasta header (some would argue it is free text only), the collect.identity script requires that each bacterial species have a unique four letter code associated with it. This is how the script sets up the comparisons and passes the individual sequences to the aligners. Here is an example Fasta file: >ECOL_1 E_score=1e-139 :yshA: range=4062319-4062388 type=c:Len=70 TAAATGTCAATAGCCGCTATTTCCATCCTGGTGGGTGGCGGCCTCCCTACGTTTAAAAAA TGGACTTATT >O157_1 E_score=1e-130 :yshA: range=4860095-4860163 type=c:Len=69 TAAATGTCAATAGCCGCTATTTCCATCCTGGTGGGTGGCGGCCTCCCTACGTTTAAAAAT GGACTTATT >SENT_1 E_score=1e-123 :yshA: range=3719549-3719624 type=n:Len=76 AATAATTAAACGACGCTTGCCGCTTTATTCATCCCGATGGATGGCGGCTTCCCTATATTT GAAAAATGGATTTATT >SFLE_1 E_score=3e-57 :yshA: range=4076952-4077021 type=c:Len=70 TAAATGTCAATAGCCGCTATTTCCATCCTGGTGGGTGGCGGCCTCCCTACGTTTAAAAAA TGGACTTATT In this example ECOL is Escherichia coli K12, O157 is Escherichia coli O157:H7 and so on. For most people, setting up the fasta files will be the most time consuming part of the whole process. In the end you will have a set of files, one per gene or orthologous sequence, containing sequences from each species. The script is smart enough to handle cases of missing sequences and will output an NA in place of any missing comparisons. There should, however, be at least two sequences in each file. ################## # Making Datasets: ################## Our intergenic sequence data was compiled using a BLAST based method which has been described previously (McCue et al., NAR 29(3):774-782,2001). The scripts used to implement that method are designed to interface with our in-house formats and methods and are not appropriate for general distribution. For the purposes of ortholog identification, I suggest looking at the reciprocal BLAST algorithm INPARANOID (http://inparanoid.cgb.ki.se/). In addition, web databases like the Microbial Genome Database for Comparative Analysis (http://mbgd.genome.ad.jp/) or NCBI's COGS (http://www.ncbi.nlm.nih.gov/COG/) house precompiled lists of orthologous genes for many sequenced genomes. Sequences can be extracted for: genes, upstream regions and a variety of other sequence types, using the genome viewer Artemis (http://www.sanger.ac.uk/). For instance, the 200 bases upstream of every CDS can be extracted by selecting all CDS in the Select menu and then selecting upstream bases in the Write menu. It is important to realize that simply extracting the regions upstream of a CDS can result in overlap with upstream genes or RNAs. In addition, sequences between divergently transcribed genes will be included twice in some cases. ################# # An example run: ################# A subdirectory of random sequences is included with the package. The following commands are a guideline for running the scripts on this dataset. Commands you type follow a > in the text below: To extract the package, make a subdirectory and decompress the files: >mkdir identity >cd identity Put the compressed package in the identity subdirectory and inflate. The filename may differ slightly in version number, replace 1x with your version: identity > gunzip idpkg1x.tar.gz identity > tar -xvof idpkg1x.tar This operation may differ depending on what decompression software you have. If you do not have one of the three aligners, you can use the provided toy aligner. This will compile it on a system with g++: identity > g++ -x c++ toyalign.cpp -o toyalign The random data is in the random directory, go there and list the files: identity > cd random identity/random > ls You should see 1000 files with names like (random67.fa). If you plan to use the toy aligner compiled above, copy it to the current directory: identity/random > cp ../toyalign ./ Now you can run the collect identity script, just to see if it works: identity/random > perl -w ../collect.identity.pl usage project2.pl [-gap|-needle|-water|-user] -names=ONE,TWO... files.fa That is the usage statement. To actually process the files type: identity/random > perl -w ../collect.identity.pl -user -names=RAN0,RAN1 *.fa "-user" specifies the toyalign aligner but can be changed to the aligner of your choice within the code "-names" is a comma separated list of the 4 letter names for each species The file list should be last The resulting identity.txt file can be analyzed with the command: identity/random > perl -w ../analyze.identity.pl identity.txt or imported into a spreadsheet program or R (www.r-project.org). The lengths.txt file contains the corresponding sequence lengths for each file and can be used to check for length dependencies. ################# # analyze.identity.pl The analyze.identity.pl script will die if fewer than three lines are in the identity.txt file. This is because the statistics calculated don't make sense for fewer than three values. Even with three files, if any are missing sequences the calculations will not make sense. A warning is displayed if fewer than ten lines (i.e., files) are detected in the identity.txt file. This is to warn you that the calculations and resulting distribution information are based on very few observations. It is recommended that at least 100 datasets be used to build the identity.txt file. I typically use intergenic regions from the whole genome, and therefore have thousands of files. analyze.identity.pl takes the following optional arguments: -s turn the summary data OFF (default on) -h[binwitdth] turn histogram output ON (default off) binwidth specifies the width of a bin (default 5) -m turns on the mean weighbor matrix output (default off) -d turns on the median weighbor matrix output (default off) NOTE: The -m and -d options produce matrices suitable for use with the Weighbor Neighbor Joining program for building trees. At this time, however, care should be taken using these options. The Jukes-Cantor model used to convert the percent identities to phylogenetic distances is simple and makes a number of assumptions including: equal base frequencies and equal mutation rates. Also, sequence lengths are not accounted for in the percent identity distributions and this will effect the accuracy of these measures. --- EOF