First off, I was up late trying to finish off one of my many projects, so I didn't get a lot of sleep. Thus, if my writing is more incoherent than usual, that's probably why. And now on with the show.
I'm jumping gears a bit, today. I haven't finished my discussion about Aligners, of which I still want to talk about Exonerate in detail, and then discuss some of the other aligners in overview. (For instance, the one that I found out about today, called
GMAP, a Genomic Mapping and Alignment Program for mRNA and EST sequences.) Anyhow, the point is that part of the purpose of using an aligner is to align to something in particular, such as a genome or a contig, but selecting what you align your sequences back to is a big issues.
When you're re-sequencing a genome, you map back to the genome template. Yes, you're probably sequencing a different individual, so you'll find occasional sections that don't match, but most humans are ~99% identical, and you can look into SNP (single nucleotide polymorphism) databases to find out if the differences you're seeing are already commonly known changes. Of course, if you're re-sequencing
Craig Venter, you don't need to worry about SNPs as much. Fortunately, most of us are sequencing more exciting genomes and so forth.
When you're sequencing a genome you've never sequenced before, you can't do mapping at all. There are various assemblers (i.e.,
Velvet (written by Daniel Zerbino, who is a lot of fun to hang out with at conferences, I might add... ),
SSake (written by Rene Warren, who incidentally also works at the GSC, although several floors above me.), and
Euler (which I'd never heard of till I googled the web page for velvet...). The key point: you don't need to worry about what you're mapping back to when you do
de novo assembly, since you're creating your own map. I'll digress further for one brief comment: assembly from Solexa/Illumina sequences is a bad idea, because they're so short!
Moving right along, we come to the third thing people are sequencing these days: Transcriptomes. (Yes, I'm ignoring cloned libraries... they're so 90's!) Transriptomes are essentially a snapshot of the mRNA in a set of cells at a given point in time. Of course, mRNA is rather unstable, so protocols have been developed to convert mRNA to cDNA (complementary DNA), which is essentially a copy of the mRNA in DNA form. (Yes, I'm ignoring some complexity here, because it makes for a better story.) But I'm getting ahead of myself. Lets talk about the mRNA, but be aware that the sequencing is actually done on cDNA.
mRNA is an interesting beast. Unlike Genomic DNA, it's a more refined creature. For Eukaryotes, the mRNA is processed by the cell, removing some segments that are non-coding. Once the cell removes these segments (labeled introns), and leaves other segments (labeled exons), we have a sequence of bases that no longer matches the genomic DNA sequence from which it came. Sure, there are short segments that map back very well (i.e. the exons), but if you take a small random snippet from the mRNA, there's a small chance that it might overlap the boundaries between two exons, which means the bases you have won't map back nicely to the genomic point of origin. That can be a serious problem.
Sure, you say, we can do a gapped alignment, and try to find two points where this sequence originated, with one big gap. If you're sophisticated, you'll even know that introns often have signals that indicate their presence most of the time. And yes, you'd be right, we can do that. Unfortunately, for most solexa runs, you get 20,000,000+ sequences. At 10 seconds a sequence (which doesn't seem like much, really), how long would it take to do that alignment?
Too long.
So most of the time, we don't do gapped alignments. Instead, we have two choices:
- Align against the genome, and throw away reads that we can't align (i.e. those that over lap intron/exon boundaries.)
- Align against a collection of known coding DNA sequences
Number two isn't a bad option: it already has all the introns spliced out, so you don't need to worry about missing those alignments. Unfortunately, there are several issues with this approach:
- Do we really know all of the coding DNA sequences? For most species, probably not, but this is a great idea for animals like Drosophila. (I tried this yesterday on a fruit fly Illumina run and it worked VERY well.
- Many transcripts are very similar. This isn't a problem with the data, but with your alignment program. If your aligner doesn't handle multi-matches (like Eland), this will never work.
- Many transcripts are very similar. Did I say that already? Actually, it causes more problems. How do you know which transcript was really the source of the sequence? I have several ways to get around it, but nothing foolproof yet.
- Aligning to a transcript is hard to visualize. This is going to be one of my next projects... but with all the fantastic genomes out there, I'm still not aware of a good transcriptome visualization tool.
And that brings us to the conclusion. Aligning a transcriptome run against a genome or against a transcriptome both have serious problems, and there really are no good solutions for this yet.
For now, all I can do is run both: they tell you very different things, but both have fantastic potential. I haven't released my code for either one, yet, but they both work, and if you contact my supervisor, he'd probably be interested in collaborating, or possibly sharing the code.
Labels: Aligners, Bioinformatics, DNA, Grad School, Sequencing, Solexa/Illumina