Indel Calling
William left an interesting comment yesterday, which I figured was worthy of it's own post, albeit a short one. (Are any of my posts really short, tho?) His point was that everyone in the genomics field is currently paying a lot of attention to SNPs, and very little attention to INDELs (Insertions and Deletions). And, I have to admit - this is true. I'm not aware of anyone really trying to do indels with Solexa reads, yet. But there are very good reasons for this.
The first is a lack of tools - most aligners aren't doing gapped alignments. Well, there's Exonerate and ZOOM, and possibly blast, but all of those have their problems. (Exonerate gapped mode is very slow, poor support, and we had a very difficult time making some versions work at all, while Blast is completely infeasible for short reads in a reasonable time scale and not guaranteed to give the right answer, while Zoom just hasn't been released yet. If there are others, feel free to let me know.) Nothing currently available will really do a good gapped short read alignment. And, if you've noticed they key words: "short read", you're on to the real reason why no one is currently working with indels.
Yep - the reads are just too short. If you have a 36bp read, or even, say a 42bp read, you're looking at (best case) having your indel right in the middle of the sequence, giving you two 21-base sequences, one on each side of the gap. Think about that for a moment and let that settle in. How much of the genome is unique for 21bp reads, which may have 2 or more SNPs or sequencing errors? I'd venture to say it's 60% or so. With the 36 base pair read, you're looking at two 18-bp reads, which is more like 40-50% of the genome. (Please don't quote me on those numbers - they're just estimates.) And that's best case.
If your gap is closer to the end, you'll get something more like a 32bp read and a 10bp read.... and I wouldn't trust a 10bp seed to give the correct match against the genome no matter what aligner you've got - especially if it comes from the "poor" end of an Illumina sequence.
So that leaves you with two options: use a paired end tag, or use a longer read.
Paired end tags (PET) have been around for a couple months, now. We're still trying to figure out the best way of using the technology, but it's coming. People are mostly interested in using them for other applications - gross structural abnormalities, inversions, duplications, etc, but indels will be in there. It should be a few more months before we really see some good work done with PETs in the literature. I know of a couple of neat applications already, but a lot of the difficulty was just getting a good PET aligner going. Maq is there now, and it does an excellent job, albeit post processing the .map files for PET is not a lot of fun. (I do have software for it, tho, so it's definitely a tractable problem.)
Longer reads are also good. Once you get gaps with enough bases on either side to do double-seed searches, we'll get fast Indel capable aligners - and I'm sure it's coming. There are long reads being attempted this week at the GSC. I don't know anything about them, or the quality, but if they work, I'd expect to see a LOT more sequences being generated like this in the future, and a lot more attention being paid to indels.
So, I can only agree with William: we need to pay more attention to indels, but we need the technology to catch up first.
P.S. For 454 fans out there, yes, you do get longer reads, but I think you also need a lot of redundancy to show that the reads aren't artifacts. As 454 ramps up its throughput, we'll see both the Solexa and 454 platforms converge towards better data for indel studies.
The first is a lack of tools - most aligners aren't doing gapped alignments. Well, there's Exonerate and ZOOM, and possibly blast, but all of those have their problems. (Exonerate gapped mode is very slow, poor support, and we had a very difficult time making some versions work at all, while Blast is completely infeasible for short reads in a reasonable time scale and not guaranteed to give the right answer, while Zoom just hasn't been released yet. If there are others, feel free to let me know.) Nothing currently available will really do a good gapped short read alignment. And, if you've noticed they key words: "short read", you're on to the real reason why no one is currently working with indels.
Yep - the reads are just too short. If you have a 36bp read, or even, say a 42bp read, you're looking at (best case) having your indel right in the middle of the sequence, giving you two 21-base sequences, one on each side of the gap. Think about that for a moment and let that settle in. How much of the genome is unique for 21bp reads, which may have 2 or more SNPs or sequencing errors? I'd venture to say it's 60% or so. With the 36 base pair read, you're looking at two 18-bp reads, which is more like 40-50% of the genome. (Please don't quote me on those numbers - they're just estimates.) And that's best case.
If your gap is closer to the end, you'll get something more like a 32bp read and a 10bp read.... and I wouldn't trust a 10bp seed to give the correct match against the genome no matter what aligner you've got - especially if it comes from the "poor" end of an Illumina sequence.
So that leaves you with two options: use a paired end tag, or use a longer read.
Paired end tags (PET) have been around for a couple months, now. We're still trying to figure out the best way of using the technology, but it's coming. People are mostly interested in using them for other applications - gross structural abnormalities, inversions, duplications, etc, but indels will be in there. It should be a few more months before we really see some good work done with PETs in the literature. I know of a couple of neat applications already, but a lot of the difficulty was just getting a good PET aligner going. Maq is there now, and it does an excellent job, albeit post processing the .map files for PET is not a lot of fun. (I do have software for it, tho, so it's definitely a tractable problem.)
Longer reads are also good. Once you get gaps with enough bases on either side to do double-seed searches, we'll get fast Indel capable aligners - and I'm sure it's coming. There are long reads being attempted this week at the GSC. I don't know anything about them, or the quality, but if they work, I'd expect to see a LOT more sequences being generated like this in the future, and a lot more attention being paid to indels.
So, I can only agree with William: we need to pay more attention to indels, but we need the technology to catch up first.
P.S. For 454 fans out there, yes, you do get longer reads, but I think you also need a lot of redundancy to show that the reads aren't artifacts. As 454 ramps up its throughput, we'll see both the Solexa and 454 platforms converge towards better data for indel studies.
Labels: Aligners, indels, short-reads
9 Comments:
SNPs occur ~1000x more commonly than INDELs, so they are simply lower hanging fruit. Retain your sequence data and when we've gotten the SNPs pretty well covered, we'll do a second pass for INDELS. My guess is methylation and microRNA analyses would follow shortly.
Since NCBI assign INDELs such as rs34815109 and rs28360071 dbSNP identifiers, some of the distinctions between common usage of the SNP and INDEL seem to be fading. Personally I've come to think of it as a 'small nucleotide polymorphism' instead of a 'single'.
The process you described seems to suggest that you will be checking only a single read at a time. At a bare minimum you would need two to recognize the population variation, but I think you really want to be working with all N available reads. Establish scaffolds and contigs, preferably using a 454 for longjump read length, then use a short read system for deep coverage. Depth of coverage is the surest way to know you're finding real features, and not sequencing artifacts.
Anthony, have you tried soap?
Hi cariaso,
I agree that the line between SNPs and indels are slowly blurring, but I don't think that was the intent of my post. I wanted to draw attention to the difficulty of identifying indels within sequencing reads.
(Yes, we can all identify large deletions where no reads are found, but we always lose the reads which border them because of the reasons I cited in my post.)
In order to deal with all of the reads at once, you must still individually align each read to the genome. Analyzing the reads after aligning them is mainly a trivial business, but doing the alignments with the gaps/insertions is difficult.
Hi Ola,
I haven't actually tried SOAP yet, though I hear good things about it. Does it do gapped alignments well?
Thanks for bring this issue to the light. Indel is not that uncommon. Just the dbSNP does not contain enough indels. Various studies imply that for human 1 in 5~10 short variants is an indel. And the real data tell me this is probably true.
I agree with anthony that finding indel with single ended reads is very tricky. I know a group tried to do so with SOAP, but ended up with using paired end reads only in detecting short indels. Futhermore, even with paired end reads, we will miss nearly half of the short indels, simply because they are undetectable with short reads (in a long homopolymer run, for example). But fortunately, indels not in homopolyer runs might be of more biological interest?
To find indels in humans, I would suggest using paired end data whenever possible. MAQ seems a good candidate (PS: doesn't it have indelpe to 'assemble' indels?). If we only have single ended reads, novoalign may be a good choice. It is faster and more light-weighted and much more stable than SOAP. Finding indels on small genomes would be relatively easy. There are good tools and the alignment would not be a big problem.
At last, maybe we do not need to care too much about detecting indels with short reads. When proper tools are out there in future, the read length would probably go as long as 100bp!
Ask and we shall receive? pubmed 18697769
This post is excellent, and I agree with a lot of the major points, but I want to disagree with the final conclusion that the technology isn't there yet. I think (hope) that with the proper statistical treatment we could call indels in sequencing data that has somehow been enriched to query particular regions, particularly if the location of these regions are known a priori. I am thinking ChIP or other methods here.
Here's why: First of all you significantly reduce the size of the genome that you are looking at. Admittedly you might have stray reads coming from non-enriched areas, but these should be drowned out by sufficient sequencing with reads from enriched locations. Secondly the amount of sequencing required to observe the redundancy you would want to actually call an indel would be less. Finally: any indels called would be excellent candidates for biological relevance (given their location in a ChIP or other method's enriched region), and there might be some straightforward assays to assess the biological effects of the indel.
Lastly I think you might be too statistically stringent with your methods for assigning an indel. True a 42 bp read will only have, at most, 21bp on either side (not at all enough for unique location assignment), but let's think about this with a Bayesian eye. Suppose we observe a set of reads (I'm assuming more than one read per indel) showing 21 bp or so of sequence on either side of what appears to be a 1-3 bp deletion, and no reads showing sequence across the same location without the deletion, and given a set of prior about the frequency of such indels vs. the frequency of larger translocations and deletions, what is the posterior probability of an indel at that location?
I would suppose it's pretty high, given sufficient read depth. Thoughts? I'm back in medical school so I have no time to do these things but I would love to see work on this move forward. I have some scripts that use SOAP that sort of take a first stab at this that I would be happy to make available. Much statistical interpretation is still required.
With Illumina moving on up to 50 and then 75 bp reads very soon, I think my above comment may already be irrelevant.
Just a few thoughts on indel calling that might be worth considering.
One is that looking at Craig Venters genome we can see that short indels are fairly frequent. See Table 6. http://biology.plosjournals.org/perlserv/?request=slideshow&type=table&doi=10.1371/journal.pbio.0050254&id=12379
Second is that from an information content point of view a single base indel isn't much harder to align than a single base mismatch. Consider a 32bp read with one mismatch. The mismatch can be at any of 32 position in the read and take any of the other 3 bases so there are 3*32 = 96 (6.6 bits of information consumed) possible sequences that match with one mismatch. Now consider an insert of one base. It could be any of 32 positions and take any of 4 bases so there are 4*32 = 128 (7 bits of information consumed) possible sequences that match with a one base insert. Not much difference.
With short reads on human size genome you should be able to detect indels and snps at least in high complexity sequence (and easily on smaller genomes) Obviously it won't work in repeats but the alignment quality in maq and novo... should cover that.
And a little plug for novoalign and novopaired. Heng Li's maq can now do a novo2maq conversion making it possible to do all the snp/indel calling with the excellent maq tools. You can even use maq indelpe with single end reads aligned by novoalign.
Post a Comment
<< Home