SNP/SNV callers minimum useful information
Ok, I sent a tweet about it, but it didn't solve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you'd think they grow on trees. (Actually, they grow on arrays...) I've written one, myself, and I know there are at least 3 others written at the GSC.
Anyhow, At first sight, what pisses me off is that there's no standard format. Frankly, that's not even the big problem, however. What's really underlying that problem is that there's no standard "minimum information" content being produced by the SNP/SNV callers. Many of them give a bare minimum information, but lack the details needed to really evaluate the information.
So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields:
Ok, now I've got that off my chest! Phew.
Anyhow, At first sight, what pisses me off is that there's no standard format. Frankly, that's not even the big problem, however. What's really underlying that problem is that there's no standard "minimum information" content being produced by the SNP/SNV callers. Many of them give a bare minimum information, but lack the details needed to really evaluate the information.
So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields:
- chromosome: obviously the coordinate to find the location
- position: the base position on the chromo
- genome: the version of the genome against which the snp was called (eg. hg18 vs. hg19)
- canonical: what you expect to see at that position. (Invaluable for error checking!)
- observed: what you did see at that position
- coverage: the depth at that position (filtered or otherwise)
- canonical_obs: how many times you saw the canonical base (key to evaluating what's at that position
- variation_obs: how many times you saw the variation
- quality: give me something to work with here - a confidence value between 0 and 1 would be ideal... but lets pick something we compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
Ok, now I've got that off my chest! Phew.
Labels: application development, Bioinformatics, biology, SNPS, Vancouver Short Read Analysis Package
9 Comments:
I'd argue that it should probably also allow multiples of most of those fields, e.g., a list of variation observations and a quality for each.
And using a Phred score for the quality might not be what the system was meant for originally, but given how widely it's used now might as well use it as a confidence indicator for each SNP call?
Genome/chromosome/position: the SQ field of the SAM format? Might be we're just calling SNPs in pathogen genes, for example.
Finally.. isn't that what the Variant Call Format is supposed to address?
Hi Setar,
Your first point is correct - but that can be accomplished by multiple records pointing to the same location. You don't need to cram two SNPs on the same base into one record.
The second point, using phred scores, is somewhat misleading. A phred score tells you what is happening in one read at a given location, rather than summing up the properties of multiple reads at that location, so it's not quite suitable for next-gen seq. Maq comes close to doing this, however, giving you the highest score observed at that SNP location.
I don't understand your third point, however. Can you clarify?
Finally, I'm not familiar with Variant Call Format. I'll look into it - thanks!
Anthony
Good point if you allow multiples; figured it might be easier to keep track of coverage that way but could be confusing.
As for genome-chromosome-position.. just wondering what to do with reference sequences that do not fit this tuple, say 'HXB2', 'AJ249239' (pol protein), 200 (relative to the gene's sequence start).
I think dealing with genome-chr-position isn't actually all that bad. But, you're right - it may not be ideal for things like VDJ regions. (There are intresting "extra chromosomes" in the human genome chromosomes to deal with these issues already, however, so that shouldn't be a problem.
In any case, the examples you've picked are interesting, however. If I'm not mistaken, I think those are viral proteins - so shouldn't they just be marked on the virus genome, rather than the host genome? Or have I missed something?
Anthony,
yes, they are viral. Guess you could always set the chromosome to n/a or 0, but I can think of many situations where it would be useful to state the position with regards to the reference sequence rather than the full genome.
That complicates things, though. Maybe it's best to simply stick to a single reference genome.
Hi Setar,
Actually, the database I'm working on was going to be restricted to a single species - but the point of putting Genome-Chromosome-Position was to allow for this exact case, in addition to changing versions of a single genome.
Essentially, all you'd need to do is mark genome as "HIV-verX", and you'd be set.
Anthony
Just a quick addition - the Variant Call Format does look like a good solution - unfortunately, there's no freely available information on the format. Hopefully the information will be released soon, so other groups can benefit from it, or contribute to it's improvement.
Anthony
I'm very curious to learn more about VCF, but my netbrain has failed me. Please provide more info.
Identifying sequence features based on a coordinate is only useful if we use the same coordinates, or have a way to translate between them. I support the idea of defining standard datasets and comparing the tools against them. That greatly simplifies doing comparisons. But for real world samples, there will be some indels which throw off the absolute coordinates of all downstream features.
dbSNP identifiers such as rs1234 and ss1234 are defined not by position, but instead by flanking sequence which eliminates the problem with indels ruining your coordinates. Unfortunately it requires that you wait for an official name, which doesn't help as you're finding the very first occurrence.
Even if you find a good looking SNP, its possible that it is actually part of a larger distinct feature. An example might be 2 nearly neighboring variations which could each be considered a SNP, but on further review might be realized as an single inversion encompassing both. Neither is interpretation is necessarily correct, but SNP callers should be sensitive to this possibility.
What I'd like to see is snp identification treated as a parsing problem, with a goal of creating something like a SAX parser, which announces the recognition of a particular feature. The goal should not be to report a SNP, but merely to report a diff from the consensus. A later second step should consider these diffs in the context of their neighbors, and assign a type of variation.
Lastly I am interested in the position insensitive dbSNP ids. Tools such as traitomatic may eventually solve this last step.
http://wiki.github.com/xwu/trait-o-matic/download-installation
Hi Cariaso,
Unfortunately, I don't have much to add on the VCF beyond what was shown to me privately, and I don't believe I have permission to disclose that information. We'll all have to wait patiently for it to be released.
As for SNP co-ordinates, I don't think there's ever going to be a "position-free" system that is efficient, however, the coordinate system I use is explicit in storing Genome (or reference scaffold name) in addition to the chromosome and position. This more or less takes care of the problem, although still requires conversion between reference genomes (which I'll argue shouldn't be done blindly anyhow.)
in the meantime, thanks for the link to trait-o-matic. I'll take a look.
Post a Comment
<< Home