SNP callers.
I thought I'd switch gears a bit this morning. I keep hearing people say that the next project their company/institute/lab is going to tackle is a SNP calling application, which strikes me as odd. I've written at least three over the last several months, and they're all trivial. They seem to perform as well as anyone else's SNP calls, and if they take up more memory, I don't think that's too big of a problem. We have machines with lots of RAM these days, and RAM is relatively cheap.
What really strikes me as odd is that people think there's money in this. I just can't see it. The barrier to creating a new SNP calling program is incredibly low. I'd suggest it's even lower than creating an aligner - and there are already 20 or so of those out there. There's even an aligner being developed at the GSC (which I don't care for in the slightest, I might add) that works reasonably well.
I think the big thing that everyone is missing is that it's not the SNPs being called that are important - it's SNP management. In order to do SNP filtering, I have a huge PostgreSQL database with SNPs from a variety of sources, in several large tables, which have to be compared against the SNPs and gene calls from my data set. Even then, I would have a very difficult time handing off my database to someone else - my database is scalable, but completely un-automated, and has nothing but the psql interface, which is clearly not the most user-friendly. If I were going to hire a grad student and allocate money to software development, I wouldn't spend the money on a SNP caller and have the grad student write the database - I'd put the grad student to work on his own SNP caller and buy a SNP management tool. Unfortunately, it's a big project, and I don't think there's a single tool out there that would begin to meet the needs of people managing output from massively-parallel sequencing efforts.
Anyhow, just some food for thought, while I write tools that manage SNPs this morning.
Cheers.
Labels: application development, Bioinformatics, future, SNPS
7 Comments:
I'm curious to hear more about your needs for a SNP database. I'm partially to blame for SNPedia, but that is only focused on the subset of SNPs with associated literature. For a recently started plant SNP project I'll be using existing tools, but I still need to design my own storage/query infrastructure which tries to partially mirror dbSNP. I'm considering a traditional relational db for the huge numbers of unannotated plant SNPs, but now is a good time to consider alternatives.
Hi Cariaso,
I'd never really looked at SNPedia before - that's a pretty cool resource. Thanks for bringing it to my attention!
Your question is pretty open-ended, so I'm not really sure where to start.
My goal is to filter out all of the known SNPs from new samples in order to identify novel SNPs, so my database focuses on individual observed SNPs located across many different assembled genomes. For instance, I have an expanded table which lists all SNPs from James Watson's DNA based on chromosome, position and observed base. (I also have dbSNP in there, as well as several other resources.) These can be crossed with my sample database tables in order to remove any positions where the observed SNP is already recorded elsewhere. (And, of course, many other analyses, which I also do, but filtering is a good use case that's easily explained.)
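To make that concrete, the core of the filtering step is essentially an anti-join. A minimal sketch, with table and column names made up for illustration (my real schema is messier):

-- sample_snps, watson_snps and dbsnp_snps are hypothetical tables,
-- each keyed on chromosome, position and observed base.
SELECT s.chromosome, s.position, s.observed_base
FROM sample_snps s
WHERE NOT EXISTS (
    SELECT 1 FROM watson_snps w
    WHERE w.chromosome = s.chromosome
      AND w.position = s.position
      AND w.observed_base = s.observed_base
)
AND NOT EXISTS (
    SELECT 1 FROM dbsnp_snps d
    WHERE d.chromosome = s.chromosome
      AND d.position = s.position
);

Anything that survives both NOT EXISTS clauses is a candidate novel SNP.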
Filtering out the known SNPs from several cell lines simultaneously, while identifying overlapping SNPs, often leaves me writing page-long SQL queries. That's the part that's really unmanageable in the long run. I have no problem doing this work, but there's no way anyone else can jump in and take over.
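As a rough illustration of why the queries balloon - finding positions shared by two cell lines but absent from the known-SNP tables looks something like this (again, the table names are invented for the example):

SELECT a.chromosome, a.position, a.observed_base
FROM cell_line_a_snps a
JOIN cell_line_b_snps b
  ON b.chromosome = a.chromosome
 AND b.position = a.position
 AND b.observed_base = a.observed_base
LEFT JOIN dbsnp_snps d
  ON d.chromosome = a.chromosome
 AND d.position = a.position
WHERE d.chromosome IS NULL;

Multiply that by half a dozen cell lines and several known-SNP sources and you can see where the page-long queries come from.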
The real need, therefore, is just an interface to all the work I've already done - help with the imports, help with the table crosses, help with the filtering, etc. Even a simple Java GUI to this would be good. I just don't have the time to even think about implementing it.
Anyhow, as you put it, the back end will probably never be more than a traditional relational db for the huge numbers of SNPs of any organism, possibly with an interface for annotating them, as information becomes available. I'm sure there are better alternatives out there, but I haven't seen any yet.
Just a thought - how feasible would it be to dispense with a standard relational db altogether and instead use the tools of the semantic web? Instead of records, fields and cells, you would have RDF nodes, RDF propertyTypes and values. Such data could reside in XML flatfiles (OK, big XML flatfiles), which would make exchange easy. Visualization/querying could be done by semantic web tools. To make it really universal it would be nice to have a standard SNP ontology (but as SNP data is fairly simple this should not be difficult).
If you're looking at Watson but new to SNPedia, you should definitely check out
http://www.snpedia.com/files/promethease/outputs/promethease-watson.html which is a report about his snps.
You've triggered one of my pet issues when you write "all SNPs from James Watson's DNA based on chromosome, position and observed base".
Chromosome and position end up being horrible landmarks, as we continually improve the resolution, duplications, and ordering of the reads. Identifying via named features (such as SNPs), which are defined by fixed upstream and downstream sequence, is (in my experience) a far more sustainable approach.
If the above is unclear, consider the specific case of an extra copy of a CNV being placed into the next version of your reference genome. All the snps (and other features) downstream of this now have broken coordinates. This happens far too frequently as we get into the universe of 1000s of genomes instead of a single reference.
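One way to sketch the difference (table names here are purely illustrative): keep the flanking sequence as the stable identity of the SNP, and treat coordinates as a per-assembly annotation that can be recomputed whenever the reference changes.

-- Stable identity: the SNP is defined by its flanks and allele.
CREATE TABLE snp_features (
    snp_id         serial PRIMARY KEY,
    upstream_seq   varchar(50),
    downstream_seq varchar(50),
    observed_base  char(1)
);

-- Coordinates live separately, one row per reference build, so a
-- new assembly only means reloading this table.
CREATE TABLE snp_positions (
    snp_id     integer REFERENCES snp_features (snp_id),
    assembly   varchar(20),
    chromosome varchar(10),
    position   integer
);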
In two parts:
Part 1:
Hi cariaso,
The URL you've provided doesn't seem to give much information on Watson's files - though I may be interpreting the page incorrectly. If you could explain what I'm seeing there, it might help.
I also agree that chromosome and position are horrible landmarks, but until the alignment tools allow us to do much more "fancy" analysis, it's all we have to go on. One of the few unfortunate times that I have to refer the problem upstream. I don't have a better way of using reads without assembling or aligning them - one doesn't work well for short reads, and the other doesn't allow the type of analysis you're suggesting.
As for your example, as long as the scaffolding is identical, the coordinates won't change between the two samples, so I don't think that's the problem. The issue would only arise if they'd built Watson's genome up on a new scaffold - in which case, a whole new coordinate system would need to be used. That's clearly something we should put some thought into, though it won't start affecting us till much better assemblies can be done with pyrosequencing... Pac Bio, anyone?
Part 2:
Anonymous, thanks for your ideas. At the level that SNPs are currently used, I don't see that this would gain anything - and it would probably be much slower for actual data access when doing the analyses I'm currently working on. (It's hard to beat a relational db when you're crossing 10 tables of 10 million+ reads.) That said, semantic web tools are a neat idea - I'll spend some time researching them when I'm back from vacation.
Thanks!
Yes, chromosome and position are fine as an intermediate, and usually quite necessary. But since our maps continue to evolve they can perform poorly over the longer term.
Watson: Clicking on the various '...more...' links drills down into his data. The text in white is specific to his observed genotypes, while blue text is about that location generally.
Obviously the report is still not quite friendly enough, but every word of it is populated from the SNPedia wiki content, so it continues to improve.
Anonymous: I'm a big believer in the semantic web (SNPedia is full of RDF), but it seems we're still in a day where the ratio of compute power to data volume necessitates something more compact during upstream processing. Downstream results which want to be shared widely should be as semantic-web-friendly as possible.
Just a quick note on a similar field, but one that I believe is getting considerably less attention: INDEL calling. With the proliferation of short-read sequencing interrogating the genome in various assays, what kind of information can be gained about small (maybe less than 3-base) indels? I would argue that these are far more likely to have important biological consequences on their own, but, compared to SNPs, they are relatively unexplored.