SNP Database v0.2
My SNP database is now up and running, with the first imports of data working well. That's a huge improvement over v0.1, where the data had to be entered under pretty tightly controlled circumstances. The API now uses locks and better indexes, and I've even tuned the database a little. (I also cheated a little and boosted the P4 running it to 1 GB of RAM.)
So, what's most interesting to me? Some of the early stats:
11,545,499 SNPs in total, made from:
- 870,549 SNP calls from the 1000 Genomes Project
- 11,361,676 SNPs from dbSNP
11,361,676 + 870,549 - 11,545,499 = 686,726 SNPs overlapped between the 1000 Genomes Project (34 data sets) and the dbSNP calls.
That means a whopping 1.6% of the SNPs in my database were not previously annotated in dbSNP.
I suppose that's not a bad thing, since those samples were all "normals", and it's good to get some sense as to how big dbSNP really is.
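For anyone who wants to check the numbers, the overlap and the "not in dbSNP" percentage both follow from inclusion-exclusion on the three counts quoted above. This is just a quick sketch of the arithmetic; the variable and function names are made up for illustration and are not part of the database API.

```python
# Counts quoted in the post.
DBSNP = 11_361_676       # SNPs imported from dbSNP
THOUSAND_G = 870_549     # SNP calls from the 1000 Genomes data sets
TOTAL = 11_545_499       # distinct SNPs in the database after both imports


def overlap(a, b, union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return a + b - union


def percent_novel(union, known):
    """Percentage of all SNPs in the database that are absent from the known set."""
    return 100.0 * (union - known) / union


shared = overlap(DBSNP, THOUSAND_G, TOTAL)
novel = TOTAL - DBSNP

print(shared)                                  # 686726
print(novel)                                   # 183823
print(round(percent_novel(TOTAL, DBSNP), 2))   # 1.59
```

The 183,823 novel SNPs are exactly the 1000 Genomes calls left over once the 686,726 shared ones are removed, which is a handy consistency check on the import.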
Anyhow, now the fun with the database begins. A bit of documentation, a few scripts to start extracting data, and then time to put in all of the cancer datasets....
This is starting to become fun.
Labels: SNP calling, SNP Database, SNPS
6 Comments:
Re: your tweet about "non-lethal, non-detrimental" (i.e. neutral) mutations saturating the genome.
In a Wright-Fisher ideal population (no selection, random mating, no migration, non-overlapping generations), the average number of generations it takes for a new mutation to become fixed, given that it fixes at all, is 4 times the effective population size.
Different human populations have different effective population sizes, due to population bottlenecks in the past. One estimate is ~3,100 for CEU, and ~7,500 for YRI.
So a good guess would be about 20,000 generations (500,000 years).
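To make the guess above concrete, here is a rough sketch of the 4 x Ne rule applied to the two estimates quoted. The 25-year generation time is an assumption, inferred from 20,000 generations being equated with 500,000 years; the names are illustrative only.

```python
# Assumption: 25 years per generation (500,000 years / 20,000 generations).
GENERATION_YEARS = 25

# Effective population size estimates quoted in the comment above.
NE_ESTIMATES = {"CEU": 3_100, "YRI": 7_500}


def fixation_generations(ne):
    """Mean generations to fixation for a neutral mutation that does fix: 4 * Ne."""
    return 4 * ne


for population, ne in NE_ESTIMATES.items():
    generations = fixation_generations(ne)
    print(population, generations, generations * GENERATION_YEARS)
# CEU 12400 310000
# YRI 30000 750000
```

The 20,000-generation figure sits between these two values, so it reads as a ballpark rather than a number tied to one population.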
That's cool - That's not at all what I meant, but a very neat calculation. (=
I was thinking in terms of "how many genomes do I need to sequence before I stop finding new neutral SNPs."
Although, I rather guess the two are related - if I knew when the SNP was first introduced, I should be able to guess how far it's been spread - or vice versa, I suppose.
Does that work? If I know the percent of the carrying population, can I figure out when it happened?
You can make some hand-waving guesses, but the allele frequency alone will not tell you how old the allele is.
Again, if you're looking at a neutral change (not selected for or against) then I think that the number of people who have the new allele cannot greatly exceed the number of generations since its origin (I'm badly paraphrasing from a theory of pop. genetics text I read recently but don't have nearby).
The process is sort of a random walk, so most new mutations will just die out by chance, and a few will become fixed much later than 500,000 years. Also, in real populations there are migration, wars, population bottlenecks, non-random mating, socio-economic factors... etc.
But the genome is a dynamic thing, and new mutations are being made every minute. Even when we have 1000 or 1M genomes, it is still a snapshot of this point in human history, and only for the populations that have been sequenced.
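The random-walk point above can be illustrated with a toy Wright-Fisher simulation. This is only a sketch under idealized assumptions (neutral allele, constant population size, non-overlapping generations); the population size, trial count, seed, and function names are arbitrary choices for illustration, not taken from any real data set.

```python
import random


def simulate_new_mutation(n_diploid, rng, max_generations=100_000):
    """Follow one new neutral mutation until it is lost or fixed.

    Returns (fate, generations), where fate is "lost", "fixed", or
    "segregating" if neither happened within max_generations.
    """
    copies_total = 2 * n_diploid  # gene copies in a diploid population
    count = 1                     # the mutation starts as a single copy
    for generation in range(1, max_generations + 1):
        freq = count / copies_total
        # Each copy in the next generation is drawn independently from the
        # current allele frequency (binomial sampling = genetic drift).
        count = sum(rng.random() < freq for _ in range(copies_total))
        if count == 0:
            return "lost", generation
        if count == copies_total:
            return "fixed", generation
    return "segregating", max_generations


def run_trials(n_diploid=50, trials=2000, seed=1):
    """Repeat the simulation and summarise how often the mutation dies out."""
    rng = random.Random(seed)
    outcomes = [simulate_new_mutation(n_diploid, rng) for _ in range(trials)]
    lost = sum(1 for fate, _ in outcomes if fate == "lost")
    fixed_times = [gens for fate, gens in outcomes if fate == "fixed"]
    return lost / trials, fixed_times


lost_fraction, fixed_times = run_trials()
print(f"lost by chance: {lost_fraction:.1%}")
if fixed_times:
    mean_fix = sum(fixed_times) / len(fixed_times)
    print(f"fixed in {len(fixed_times)} trials, mean {mean_fix:.0f} generations")
```

In this idealized model a new neutral mutation fixes with probability 1/(2N), so with 50 diploid individuals roughly 99% of runs should end in loss, usually within a handful of generations.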
Hi Dan,
Thanks - that makes sense, and more or less reflects what I recall learning in high school population genetics.
Your point about it being a snapshot is important, however. I think a lot of people working on genomics tend to forget that it's something that will change over time (myself included.) Although, I suppose it'll be a while before we need a new reference genome.
I've been wondering about the same thing for a while. Thanks for making my life easier by having done the analysis ~
"That is a whopping 0.15% of the SNPs in my database were not previously annotated in dbSNP."
Just did some calculation based on the numbers provided in the post.
>>> (1-11361676/11545499.)*100
1.5921615860864935
>>> (870549-686726)/11545499.*100
1.5921615860864915
Should it be 1.6%? Or am I missing some information...
Ooops... well, as I said, quick calculations. Thanks for catching that.