Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, February 5, 2010

GFF3 undocumented feature...

Earlier today, I tweeted:
Does anyone know how to decypher a diBase GFF3 file? They don't identify the "most abundant" nucleotide uniquely. seems useless to me.
Apparently, there is a solution, albeit undocumented:

The attribute "genotype" contains an IUB code that is limited to using either a single base or a double base annotation (eg, it should not contain, H, B, V, D or N - but may contain R, Y, W, S, M or K ), which then allows you to subtract the "reference" attribute (that must be canonical) from the "genotype" attribute IUB code to obtain the new SNP - but only when the "genotype" attribute is not a canonical base.

If only that were documented somewhere...

UPDATE: Actually, this turns out not to be the case at all -- there are still positions for which the "genotype" attribute is an IUB code, and the reference is not one of the called bases. DOH!

Labels: , , ,

Wednesday, October 14, 2009

Useful error messages.... and another format rant.

I'll start with the error message, since it had me laughing, while everything else seems to have the opposite reaction.

I sent a query to Biomart the other day, as I often do. Most of the time, I get back my results quickly, and have no problems whatsoever. It's one of my "go-to" sites for useful genomic data. Unfortunately, every time I tried to download the results of my query, I'd get 2-3Mb into the file before the download would die. (It was a LONG list of snps, and the file size was supposed to be in the 10Mb ballpark.)

Anyhow, in frustration, I tried the "email results to you" option, whereupon I got the following email message:


Your results file FAILED.
Here is the reason why:
Error during query execution: Server shutdown in progress


That has to be the first time I've ever had a server shutdown cause a result failure. Ok, it's not that funny, but I am left wondering if that was the cause of the other 10 or so aborted downloads. Anyone know if Biomart runs on Microsoft products? (-;

The other thing on my mind this afternoon is that I am still looking to see my first Variant Call Format file for SNPs. A while back, I was optimistic about seeing the VCF files in the real world. Not that I can complain, but I thought adoption would be a little faster. A uniform SNP format would make my life much more enjoyable - I now have 7 different SNP format iterators to maintain, and would love to drop most of them.

What surprised me, upon further investigation, is that I'm also unable to find a utility that actually creates VCF files from .map, SAM/BAM, eland, bowtie or even pileup files. I know of only one SNP caller that creates VCF compatible files, and unfortunately, it's not freely available, which is somewhat un-helpful. (I don't know when or if it will be available, although I've heard rumours about it being put into our pipeline...)

That's kind of a sad state of affairs - although I really shouldn't complain. I have more than enough work on my plate, and I'm sure the same can be said for those who are actively maintaining SNP callers.

In the meantime, I'll just have to sit here and be patient... and maybe write an 8th snp format iterator.

Labels: , , , , , ,

Friday, September 4, 2009

Risk Factors #2 and thanks to Dr. Steve.

Gotta love drive-by comments from Dr. Steve:

I don't have time to go into all of the many errors in this post, but for a start, the odds ratio associated with the celiac snp is about 5-10X *per allele* (about 50X for a homozygote). This HLA allele accounts for about 90% of celiac disease and its mode of action is well understood.

I understand this is just a blog and you are not supposed to be an expert, but you should do some basic reading on genetics before posting misinformation. Or better yet, leave this stuff to Daniel MacArthur.


While even the 6th grade bullies I knew could give Dr Steve a lesson in making friends, I may as well at least clarify my point.

To start with, Dr. Steve made one good point. I didn't know know what the risk factor for a single given Celiac SNP is - and thanks to Dr. Steve's incredibly educational message - I still don't. I simply made up a risk factor, which was probably the source of Dr. Steve's confusion. (I did say I made it up in the post, but apparently hypothetical situations aren't within Dr Steve's repertoire.)

But lets revisit how Risk factors work, as I understand it. If someone would like to tell me how I'm wrong, I'll accept that criticism. Telling me I'm wrong without saying why is useless, so don't bother.

A risk factor is a multiplicative factor that indicates your risk of expressing a given phenotype versus the general population. If you have a risk factor of 5, and the general population has a 10% phenotype penetration, that means you have a 50% chance of expressing the phenotype. (.10 x 5 = 0.5, which is 50%).

In more complex cases, say with two independent SNPs, each with an independent risk factor, you multiply the set of risk factors by the probability of the phenotype appearing in the general population. (You don't need me to do the math for you, do you?)

My rather long-winded point was that discussing risk factors without discussing the phenotypic background disease rate in the population is pointless, unless you know that the risk ratio leads you to a diagnostic test and predicts in a demonstrated statistically significant manner that the information is actionable.

Where I went out on a limb was in discussing other unknowns: error rates in the test, and possible other factors. Perhaps Dr. Steve knows he has a 0% error rate in his DTC SNP calls, or assumes as much - I am just not ready to make that assumption. Another place Dr. Steve may have objected to was my point about extraneous environmental factors, which may be included in the risk factor, although I just passed over it in my previous post without much discussion.

(I would love to hear how a SNP risk factor for something like Parkinson's disease would be modulated by Vitamin D levels depending on your latitude. It can't possibly be built into a DTC report. Oh, wait, this is hypothetical again - Sorry Dr. Steve!)

My main point from the previous post was that I have a difficult time accepting is that genomics consultants consider a "risk factor" as a useful piece of genomic information in the absence of an accompanying "expected background phenotypic risk." A risk factor is simply a modulator of the risk, and if you talk about a risk factor you absolutely need to know what the background risk is.

Ok, I'm done rehashing my point from the previous point, and that takes me to my point for today:

Dr. Steve, telling people who have an interest in DTC genomics to stay out of the conversation in favor of the experts is shooting yourself in the foot. Whether it's me or someone else, we're going to ask the questions, and telling us to shut up isn't going to get the questions answered. If I'm asking these questions, and contrary to your condescending comment I do have a genomics background, people without a genomics background will be asking them as well.

So I'd like to conclude with a piece of advice for you: Maybe you should leave the discussion to Daniel McArthur too - he's doing a much better job of spreading information than you are, and he does it without gratuitously insulting people.

And I thought Doctors were taught to have a good bedside manner.

Labels: ,

Variant Call Format

After earlier discussions, there is now some information available on the Variant Call Format available to the public.

If you're intrested in working with snps, this may be required reading:

http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

Labels: , , ,

Thursday, September 3, 2009

DTC Snps... no more risk factors!

I've been reading Daniel's blog again. Whenever I end up commenting on things I don't understand well, that's usually why. Still, it's always food for thought.

First of all, has anyone quantified the actual error rate on these tests? We know they have all sorts of mistakes going on. (This one was recently in the news, and yes, unlike Wikipedia, Daniel is a valid reference source for anything genomics related.) I'll come back to this point in a minute.

As I understand it, the risk factor is an adjustment made to the likelihood of the general population in characterizing the risk of an individual suffering from a particular disease.

So, as I interpret it, you take whatever your likelihood of having that disease was multiplied by the risk factor. For instance with a disease like Jervell and Lange-Nielsen Syndrome, 6 of every 1 Million people suffer from it's effects (although this is a bad example since you would have discovered it in childhood, but ignoring that for the moment we can assume another rare disease with a similar rate.) If our DTC test shows we have a 1.17 risk factor because we have a SNP, we would multiply that by 1.17.

6/1,000,000 x 1.17 = 7/1,000,000

if I've understood it all correctly, that means you've gone from knowing you have a 0.000,6% chance to being certain you have a 0.000,7% chance of suffering from your selected disease. (What a great way to spend your money!)

But lets not stop there. Lets ask about the the error rate on actually calling that snp is. From my own experience in SNP validation, I'd make a guess that the validation rate is close to 80-90%. Lets even be generous and take the high end. Thus:

You've gone from 100% knowing you've got a 0.000,6% chance of having a disease to being 90% sure you have a 0.000,7% chance of having a disease and a 10% sure you've still got a 0.000,6% of having the disease.

Wow, I'm feeling enlightened.

Lets do the same for something like Celiacs disease, which is estimated to strike 1/250 people, but is only diagnosed 1/4,700 people in the U.S.A. - and lets be generous and assume that the SNP in your DTC test has a 1.1 risk factor. (Celiacs is far from a rare disease, I might add.)

As a member of the average U.S. population, you had a 0.4% chance of having the disease, but a 0.02% chance of being diagnosed with it. That's a pretty big disparity, so maybe there's a good reason to have this test done. As a Canadian it's somewhat different odds, but lets carry on with the calculations anyhow.

lets say you do the test and find out you have a 1.1 times risk factor of having the disease. omg scary!

Wait, lets not freak out yet. That sounds bad, but we haven't finished the calculations.

Your test has the SNP.... 1.1 x 1/250 = 0.44% likelihood you have the disease. Because Celiacs disease requires a biopsy to definitively diagnose it (and treatment does not start till you've done the diagnosis), would you run out and submit yourself to a biopsy on a 0.44% chance you have a disease? Probably not unless you have some other knowledge that you're likely to have this disease already.

Then, we factor in the 90% likelyhood of getting the SNP call correct: You have a 90% likelihood of having a 0.44% chance of having the disease, and a 10% likelihood of having a 0.4% chance of having the disease.

Ok, I'd be done panic-ing about now. And we've only considered two simple things here. Lets add one more just for fun.

lets pretend that an unknown environmental stressor is actually involved in triggering the condition, which would explain why the odds are somewhat different in Canada. Since we know nothing about that environmental trigger, we can't even project odds of coming in contact with it. Who knows what effect that plays with the SNP you know about.

By now, I can't help thinking that all of this is just a wild goose chase.

So, when people start talking about how you have to take your DTC results to a Genetic Counsellor or to your MD I really have to wonder. I can't help but to think that unless you have a very good reason to suspect a disease or if you have some form of a priori knowledge, this whole thing is generally a waste. Your Genetic Counsellor will probably just laugh at you, and your MD will order a lot of unnecessary tests - which of those sounds productive?

Let me make a proposal (and I'm happy to hear dissent): Risk factors are great - but are absolutlely useless when it comes to discussing how genetic factors affect you. Lets leave the risk factors to the people writing the studies and ask the DTC companies to make a statement: what are your odds of being affected by a given condition? And, if you can't make a helpful prediction (aka, a diagnostic test), maybe you shouldn't be selling it as a test.

Labels: , , ,

Friday, August 21, 2009

SNP Database v0.2

My SNP database is now up and running, with the first imports of data working well. That's a huge improvement over the v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks, better indexes, and I've even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1Gb RAM.)

So, what's most interesting to me? Some of the early stats:

11,545,499 snps in total, made from:
  • 870549 snp calls from the 1000 genome project
  • 11361676 snps from dbsnp
So, some quick math:
11,361,676 + 870,549 - 11,545,499 = 686,726 Snps that overlapped between the 1000 genome project (34 data sets) and the dbSNP calls.

That is a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.

I suppose that's not a bad thing, since those samples were all "normals", and it's good to get some sense as to how big dbSNP really is.

Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets....

This is starting to become fun.

Labels: , ,

Monday, August 17, 2009

SNP Datatabase v0.1

Good news, my snp database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc...).

At this point, there are no documents on how to use the software, since that's the plan for this afternoon, and I'm assuming everyone who uses this already has access to a postgresql database (aka, a simple ubuntu + psql setup.)

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

Anyone who's interested in joining onto this project, I'm only a few hours away from having some neat toys to play with!

Labels: , , , , , , , , , ,

Tuesday, August 11, 2009

SNP/SNV callers minimum useful information

Ok, I sent a tweet about it, but it didn't solve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you'd think they grow on trees. (Actually, they grow on arrays...) I've written one, myself, and I know there are at least 3 others written at the GSC.

Anyhow, At first sight, what pisses me off is that there's no standard format. Frankly, that's not even the big problem, however. What's really underlying that problem is that there's no standard "minimum information" content being produced by the SNP/SNV callers. Many of them give a bare minimum information, but lack the details needed to really evaluate the information.

So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields:
  • chromosome: obviously the coordinate to find the location
  • position: the base position on the chromo
  • genome: the version of the genome against which the snp was called (eg. hg18 vs. hg19)
  • canonical: what you expect to see at that position. (Invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what's at that position
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here - a confidence value between 0 and 1 would be ideal... but lets pick something we compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
Really, most of the callers out there give you most, if not all of it - but I have yet to see the final "quality" being given. The MAQ SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That's fine for a first generation, but maybe I can convince people that we can do better in the second gen snp callers.

Ok, now I've got that off my chest! Phew.

Labels: , , , ,

Thursday, August 6, 2009

New Project Time... variation database

I don't know if anyone out there is interested in joining in - I'm starting to work on a database that will allow me to store all of the snps/variations that arise in any data set collected at the institution. (Or the subset to which I have the right to harvest snps, anyhow.) This will be part of the Vancouver Short Read Analysis Package, and, of course, will be available to anyone allowed to look at GPL code.

I'm currently on my first pass - consider it version 0.1 - but already have some basic functionality assembled. Currently, it uses a built in snp caller to identify locations with variations and to directly send them into a postgresql database, but I will shortly be building tools to allow SNPs from any snp caller to be migrated into the db.

Anyhow, just putting it out there - this could be a useful resource for people who are interested in meta analysis, and particularly those who might be interested in collaborating to build a better mousetrap. (=

Labels: , , , , , ,

Monday, June 1, 2009

Science Cartoons - 5 (RNA-Seq)

This is the last science cartoon I did for my poster. I was pretty happy with the pictures, although if I were to do it over again, I've learned a few more tricks that I'd have used instead.

Anyhow, my favorite effect on this picture is the "text to path", where you can make any string follow any line - who knew graphic design could be so much fun. It definitely makes for some interesting graphics. I'd definitely use this effect in an RNA folding paper, if I ever got the chance to do another one. (-;

Labels: , ,

Friday, May 15, 2009

UBC Seminar - Dr. Dawn Bowdish, McMaster University

[This talk was given by a good friend, Dr. Dawn Bowdish - and is WAY outside of the topics that I normally cover. However, there is some interesting work with SNPs which is worth mentioning, if you can get past the non-genomic part at the start - which I suggest. As always, mistakes are my misunderstanding of the topic - not the speakers!]

Talk title: The class A scavenger receptors are associated with host defense towards Mycobacteium tuberculosis.

Lab URL: http://www.bowdish.ca/lab

Post-doc was done at oxford, where most of the work that will be presented today was done.
1.The role of the scavenger receptors in Mycobacterium tuberculosis in infection
2.Polymorphisms in scavenger receptors and susceptibility to M. Tuberculosis infection
3.The role of the cytoplasmic tail in scavenger receptor signalling
4.Evolution of scavenger receptor domains.

Macrophages are beautiful cells. They don't have a single form – you know it when you see it. Paraphrased: 'phooi to t-cells.'

[at this point, the projector died. Dawn offers to tell macrophage stories... Someone wants to know all about Oxford. “It was very Harry Potter.” AV people mess around in the back....]

Macrophages are central to everything she studies. They are an integral part of mammalian biology:
  • Embryonic development, organ structure
  • chronic disease
  • genetic diseases
  • infectious disease
  • autoimmunity
  • cancer
Macrophages receptors are indicators of phenotype, function and biomarkers for disease phenotype

Scavenger receptors: several classes of them exist. The only conserved feature is that they bind modified lipids (acLDL) with varying efficiency.

Class A scavengers: includes 2 that Dawn studies specifically: MARCO and SRA (I and II). Found in all organisms from plants to humans, yeast.. etc. They are involved in cell-cell interactions, and have been adapted to many other cell-interactions.

Marco (Macrophage receptor with collagenous structure) and SRA (scavenger receptor class A)have similar ligands, which is very broad. “Molecular fly paper.” In general, restricted to expression in macrophages)

They only bind some bacterial well, but not all.

SRA plays a role in homeostasis and infectious disease, septic shock.
Marco plays a role in infectious disease. (Redundancy in vitro – requires double knock out.)

The binding domains, however are very different. In Marco, binding is at the end of the receptor. In SRA, it's the 2nd last.

MARCO is not expressed in any cell line, and is not in bone marrow macrophage. Thus, it's often overlooked.

Three types of activation: Classical, alternative, innate (hypothesized). Marco seems to be innate activation, and the properties/phenotype are not well understood. Possibly a phenotype for immuno-modulation, but when it's used is not known. Fills a niche, which doesn't quite fit with the known models in mouse.

So, how does it work in TB? (Not something Dr. Bowdish intended to study, but ended up falling into it in oxford.)

There are many types of uptake – and many new ones have been discovered. There's still room for more receptors, however, and it's possible that the scavenger receptors might be involved in TB.

SRA is up-regulated in response to IFN-gamma and BCG, knockouts are susceptible to BG induced shock. But MARCO? No clear connection. There is still no human anti-MARCO antibody, so these experiments can't be repeated for human cells.

Collaboration with Dr. Russell and Sakamoto from Cornell, and ended up getting involved. They had a ligand (Trehalose dimycholate) that no one had ever found a receptor for – and that turned out to be MARCO. Using TDM coated beads, you could see if it was picked up.

Use a cell line with MARCO receptor – and the beads. MARCO showed that it picked up the beads, SRA did not pick up beads. Could knock it down with a specific inhibitor for MARCO. (shown with fluorescence microscopy.)

Previous work had shown that TDM induces cytokine production in a MyD88 dependent fashion. There was a TLR2 &4 response – so did a knock out, and showed that it could use either of them.

Minimum signal complex required is Marco + TLR (2 or 4). This recreates the pro-inflammatory response. Could never recreate this with SLA.

Is MARCO the missing factor in TDM signalling? Yes. So, it's not that they've lost the pathway or ability – just lacking the particular co-receptor to interact with TDM.

How MARCO works in cytoplasm, however, is another story – it has a very small cytoplasmic tail... which includes a predicted myristolation site. Made constructs with different part of the tail – which didn't change the signalling much. The model proposed, however, is that MARCO is a tethering receptor, which binds or transports the TDM beads to TLRs via CD14. (Similar to the LPS signalling complex.) This was tested with a NF-kb reporter system.

More experiments were done using the knockouts without MARCO or DKO, and were able to continue along to find that MARCO appears to be involved in response to M. Tuberculosis.

Up till now, this was in vitro and mouse. A switch was made to human models.

Started looking for groups looking at SNPs in humans. Did a study interested in whether these SNPs are related to human disease. (Adrian Hill?)

It works well because TB has been around for a long time – 40,000 years.

The Hill group has samples from Gambia, to study TB. Screened 3,500 individuals (HIV free), do controls for the usual (age, sex, etc), and then screened 25SNPs in MARCO and 22 in MSR1.

[Presents a fancy map, showing coverage.]

Much to surprise: there were no SNPs what so ever in SRA – found 4 in MARCO with association to susceptibility and resistance. However, they were all in introns. They were found in introns, and discovered that it was in a putative splice site. (There were no splice variants known in mice, at the time – and there are still none known.) Using assays, Dr. Bowdish found there were indeed splice variants, caused by the SNP.

Oddly enough, this splice variant seems to knock out the binding domain of MARCO. (And the SNP seems to be predominant in african populations - and is very uncommon in caucasians.)

Tentative model: TDM induces MARCO expression. MARCO is regulated at transcriptional and post-translational modification levels. Thus, splice variants may induce differences in response to TB bacteria.

Goals for the future:
  • Understand role of macrophage receptors in infectious disease
  • Attribute functional significance of genetic variability in macrophage genes
  • Characterize phenotype of innate activation & determine if this can be manipulated by immunomodulation
  • Collaborating with people studying other receptors.
Open day on October 26th, 2009 : Institute of infectious disease research opening day.

Labels: ,

Wednesday, March 25, 2009

Searching for SNPs... a disaster waiting to happen.

Well, I'm postponing my planned article, because I just don't feel in the mood to work on that tonight. Instead, I figured I'd touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they're not SNPs, they're variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They're only polymorphisms if they're common to a portion of the population.

In this case, they're from cancer cell lines, so after I filter out all the real SNPs, what's left are SNVs... and they're bloody annoying. This is the second major project I've done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs - a simple trick, easily accomplished by visualizing the data.

We moved onto Illumina, after that. Actually, it was still Solexa at the time. Yes, this is older data - nearly a year old. It wasn't particularly reliable, and I've now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked... and lo and behold. 0/2. Neither of them are real. So, now comes the post-mortem: Why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious... do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest for answering the question "why?" Why do we get unexpected results?

Fortunately, I'm a scientist, so that question is really something I like. I don't begrudge the last year's worth of work - which apparently is now more or less down the toilet - but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I'm hooked, otherwise I'd have tossed in the towel long ago.

Labels: , , , , , ,

Tuesday, January 6, 2009

My Geneticist dot com

A while back, I received an email from a company called mygeneticist.com that is doing genetic testing to help patients identify adverse drug reactions. I'm not sure what the relationship is, but they seem to be a part of something called DiscoverMe technologies. I bring mygeneticist up, because I had an "interview" with one of their partners, to determine if I am a good subject for their genetic testing program. It seems I'm too healthy to be included, unless they later decide to include me as a control. Nuts-it! (I'm still trying to figure out how to get my genome sequenced here at the GSC too, but I don't think anyone wants to fund that...)

At any rate, I spoke with the representative of their clinical side of operations this morning and had an interesting conversation about my background. In typical fashion, I also took the time to ask a few specific questions about their operations. I'm pretty sure they didn't tell me much more than was available on their various web pages, but I think there was some interesting information that came out of it.

When I originally read their email, I had assumed that they were going to be doing WTSS on each of their patients. At about $8000 per patient, it's expensive, but a relatively cheap form of discovery - if you can get around some of the challenges involved in tissue selection, etc. Instead, it seems that they're doing specific gene interrogation, although I wasn't able to get the type of platform their using. This leads me to believe that they're probably doing some form of literature check for genes related to the drugs of interest, followed by a PCR or Array based validation across their patient group. Considering the challenges of associating drug reactions with SNPs and genomic variation, I would be very curious to see what they have planned for "value-added" resources. Any drug company can find out (and probably does already know) what's in the literature, and any genetic testing done without approval from the FDA will probaby be sued/litigated/regulated out of existance... which doesn't leave a lot of wiggle room for them.

And that lead me to thinking about a lot of other questions, which went un-asked. (I'll probably email the Genomics expert there to ask some questions, though I'm mostly interested in the business side of it, which they probably won't answer.) What makes them think that people will pay for their services? How can they charge a low-enough fee to make the service attractive while getting making a profit? And, from the scientific side, assuming they're not just a diagnostic application company, I'm not sure how they'll get a large enough cohort to make sense of the data they receive through their recruitment strategy.

Anyhow, I'll be keeping my eyes on this company - if they're still around in a year or two, I'd be very interested in talking to them again about their plans in the next-generation sequencing field.

Labels: , , ,

Tuesday, August 12, 2008

SNP callers.

I thought I'd switch gears a bit this morning. I keep hearing people say that the next project their company/institute/lab is going to tackle is a SNP calling application, which strikes me as odd. I've written at least 3 over the last several months, and they're all trivial. They seem to perform as well as any one else's SNP calls, and, if they take up more memory, I didn't think that was too big of a problem. We have machines with lots of RAM these days, and it's relatively cheap, these days.

What really strikes me as odd is that people think there's money in this. I just can't see it. The barrier to creating a new SNP calling program is incredibly low. I'd suggest it's even lower than creating an aligner - and there are already 20 or so of those out there. There's even an aligner being developed at the GSC (which I don't care for in the slightest, I might add) that works reasonably well.

I think the big thing that everyone is missing is that it's not the SNPs being called that important - it's SNP management. In order to do SNP filtering, I have a huge postgresql database with SNPs from a variety of sources, in several large tables, which have to be compared against the SNPs and gene calls from my data set. Even then, I would have a very difficult time handing off my database to someone else - my database is scalable, but completely un-automated, and has nothing but the psql interface, which is clearly not the most user friendly. If I were going to hire a grad student and allocate money to software development, I wouldn't spend the money on a SNP caller and have the grad student write the database - I'd put the grad student to work on his own SNP caller and buy a SNP management tool. Unfortunately, it's a big project, and I don't think there's a single tool out there that would begin to meet the needs of people managing output from massively-parallel sequencing efforts.

Anyhow, just some food for thought, while I write tools that manage SNPs this morning.

Cheers.

Labels: , , ,