With that title, you're probably expecting a discussion on how MAQ calls snps, but you're not going to get it. Instead, I'm going to rant a bit, but bear with me.
Rather than just use the MAQ snp caller, I decided to write my own. Why, you might ask? Because I already had all of the code for it, my snp caller has several added functionalities that I wanted to use, and *of course*, I thought it would be easy. Was it, you might also ask? No - but not for the reasons you might expect.
I spent the last 4 days doing nothing but working on this. I thought it would be simple to just tie the elements together: I have a working .map file parser (don't get me started on platform dependent binary files!), I have a working snp caller, I even have all the code to link them together. What I was missing was all of the little tricks, particularly the ones for intron-spanning reads in transcriptome data sets, and the code that links together the "kludges" with the method I didn't know about when I started. After hacking away at it, bit by bit things began to work. Somewhere north of 150 code commits later, it all came together.
If you're wondering why it took so long, the reasons are threefold:
1. I started off depending on someone else's method, since they came up with it. As is often the case, that person was working quickly to get results, and I don't think they had the goal of writing production quality code. Since I didn't have their code (though, honestly, I didn't ask for it either since it was in Perl, which is another rant for another day) it took a long time to settle all of the off-by-one, off-by-two and otherwise unexpected bugs. They had given me all of the clues, but there's a world of difference between being pointed in the general direction of your goal and having a GPS to navigate you there.
2. I was trying to write code that would be re-usable. That's something I'm very proud of, as most of my code is modular and easy to re-purpose in my next project. Halfway through this, I gave up: the code for this snp calling is not going to be re-usable. Though, truth be told, I think I'll have to redo the whole experiment from the start at some point because I'm not fully satisfied with the method, and we won't be doing it exactly this way in the future. I just hope the change doesn't happen in the next 3 weeks.
3. Namespace errors. For some reason, every single person has a different way of addressing the 24-ish chromosomes in the human genome. (Should we include the mitochondrial genome in our own?) I find myself building functions that strip and rename chromosomes all the time, using similar rules. Is the mitochondrial genome "MT" or just "M"? What case do we use for "X" and "Y" (or is it "x" and "y"?) in our files? Should we prepend "chr" to our chromosome names? And what on earth is "chr5_random" doing as a chromosome? This is even worse when you need to compare two active indexes, plus the strings in each read... bleh.
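To make that third point concrete, here is roughly the kind of strip-and-rename function I keep rewriting. It's only a sketch: the function name, the choice of "MT" as the canonical mitochondrial name, and the decision to reject things like "chr5_random" outright are all assumptions made for illustration, not the rules from my actual code.

```python
import re

def normalize_chromosome(name):
    """Map common spellings of a human chromosome name to one canonical form.

    Hypothetical helper: the canonical forms ("1".."22", "X", "Y", "MT")
    are an arbitrary choice made for this example.
    """
    s = name.strip()
    # Drop an optional "chr" prefix, whatever its case.
    if s.lower().startswith("chr"):
        s = s[3:]
    s = s.upper()
    # Treat "M" and "MT" as the same mitochondrial sequence.
    if s in ("M", "MT"):
        return "MT"
    if s in ("X", "Y"):
        return s
    # Autosomes 1..22, tolerating leading zeros such as "05".
    if re.fullmatch(r"[0-9]+", s) and 1 <= int(s) <= 22:
        return str(int(s))
    # Anything else (e.g. "5_RANDOM") is not a primary chromosome here.
    return None
```

With something like this, "chr5", "Chr05" and "5" all collapse to the same key, so two indexes and the strings in each read can at least be compared after passing through one function.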
Anyhow, I fully admit that SNP calling isn't hard to do. Once you've read all of your sequences in, determined which bases are worth keeping (prb scores), and set the minimum level of coverage and the minimum number of bases needed to call a snp, there's not much left to do. I check it all against the Ensembl database to determine which ones are non-synonymous, and then: tada, you have all your snps.
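For the curious, the filtering steps above boil down to something like the following for a single reference position. Again, this is a minimal sketch and not my actual caller: the function name, the input shape (a list of base and quality pairs) and all three default thresholds are made up for the example, and the Ensembl lookup for non-synonymous changes is left out entirely.

```python
from collections import Counter

def call_snp(ref_base, observations, min_quality=20, min_coverage=8,
             min_variant_bases=3):
    """Return the variant base called at one position, or None.

    observations: list of (base, quality) pairs from reads covering the
    position. All thresholds are illustrative assumptions.
    """
    ref = ref_base.upper()
    # Step 1: keep only the bases whose quality score is good enough.
    kept = [base.upper() for base, quality in observations
            if quality >= min_quality]
    # Step 2: require a minimum depth of usable coverage.
    if len(kept) < min_coverage:
        return None
    # Step 3: count the non-reference bases and take the most common one.
    counts = Counter(b for b in kept if b != ref)
    if not counts:
        return None
    variant, support = counts.most_common(1)[0]
    # Step 4: require enough supporting bases before calling a snp.
    return variant if support >= min_variant_bases else None
```

Six good "A" bases and four good "G" bases over an "A" reference would give a "G" call with these defaults; any number of low-quality "G" bases on their own would not.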
However, once you're done with all of this, you realize that the big issue is that there are now too many snp callers, and everyone and their pet dog is working on one. There are several now in use at the GSC: mine, at least one custom one that I'm aware of, one built into an aligner (Bad Idea(tm)) under development here, and the one tacked on to the Swiss Army knife of aligners and tools: MAQ. Do they all give different results, or is one better than another? Who knows. I look forward to finding someone who has the time to compare, but I really doubt there's much difference beyond the alignment quality.
Unfortunately, because the aligner field is still immature, there is no single file output format that's common to all aligners, so the comparison is a pain to do - which means it's probably a long way off. That, in itself, might be a good topic for an article, one day.
Labels: Aligners, MAQ, rant, SNP calling, Software