Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, March 11, 2010

Wolfram Alpha recreates ensembl?

Ok, this might not be my most coherent post - I'm finally getting better after being sick for a whole week, which has left my brain feeling somewhat... spongy. Several of us AGBT-ers have come down with something after getting back, and I have a theory that it was something in the food we were given.... Maybe someone slipped something into the food to slow down research at the GSC??? (-; [insert conspiracy theory here.]

Anyhow, I just received a link [aka spam] from Wolfram Alpha, via a posting on LinkedIn, letting me know all about their great new product: Wolfram Alpha now has genome information!

Somehow, looking at their quick demo, I'm somewhat less than impressed. Here's the link, if you'd like to check it out yourself: Wolfram Alpha Blog Post (Genome)

I'm unimpressed for two reasons: the first is that there are TONS of other resources that do this - and apparently do it better, from the little I've seen on the blog. For the moment, they have 11 genomes in there, which they hope to expand in the future. I'm going to have to look more closely, if I find the motivation, as I might be missing something, but I really don't see much that I can't do in the UCSC genome browser or the Ensembl web page. The second thing is that I'm still unimpressed by Wolfram Alpha's insistence that it's more than just a search engine, and that if you use it to answer a question, you need to cite it.

I'm all in favour of using really cool algorithms, and searches are no exception. [I don't think I've mentioned this to anyone yet, but if you get a chance, check out Unlimited Detail's use of search engine optimization to do unbelievable 3D graphics in real time.] However, if you're going to send links boasting about what you can do with your technology, do something other people can't do - and be clear about what it is. From what I can tell, this is just a mash-up meta-analysis of a few small publicly available resources. It's not like we don't have other engines that do the same thing, so I'm wondering what it is that they think they do that makes it worth going there for... anyone?

Worst of all, I'm not sure where they get their information from... where do they get their SNP calls from? How can you trust that, when you can't even trust dbSNP?

Anyhow, for the moment, I'll keep using resources that I can cite specifically, instead of just citing Wolfram Alpha... I don't know how reviewers would take it if I cured cancer... and cited Wolfram as my source.

Happy searching, people!


Thursday, September 3, 2009

DTC SNPs... no more risk factors!

I've been reading Daniel's blog again. Whenever I end up commenting on things I don't understand well, that's usually why. Still, it's always food for thought.

First of all, has anyone quantified the actual error rate on these tests? We know they have all sorts of mistakes going on. (This one was recently in the news, and yes, unlike Wikipedia, Daniel is a valid reference source for anything genomics related.) I'll come back to this point in a minute.

As I understand it, the risk factor is a multiplier applied to the general population's likelihood of a disease, used to characterize the risk that a particular individual will suffer from it.

So, as I interpret it, you take whatever your likelihood of having that disease was and multiply it by the risk factor. For instance, with a disease like Jervell and Lange-Nielsen Syndrome, 6 of every 1 million people suffer from its effects (although this is a bad example, since you would have discovered it in childhood - but ignoring that for the moment, we can assume another rare disease with a similar rate). If our DTC test finds a SNP with a 1.17 risk factor, we multiply the baseline by 1.17.

6/1,000,000 x 1.17 ≈ 7/1,000,000

If I've understood it all correctly, that means you've gone from knowing you have a 0.0006% chance to being certain you have a 0.0007% chance of suffering from your selected disease. (What a great way to spend your money!)

But let's not stop there. Let's ask about the error rate on actually calling that SNP. From my own experience in SNP validation, I'd guess that the validation rate is close to 80-90%. Let's even be generous and take the high end. Thus:

You've gone from knowing with certainty that you have a 0.0006% chance of having a disease to being 90% sure you have a 0.0007% chance, and 10% sure you've still got the original 0.0006% chance.

Wow, I'm feeling enlightened.

Let's do the same for something like celiac disease, which is estimated to strike 1 in 250 people, but is only diagnosed in 1 in 4,700 people in the U.S.A. - and let's be generous and assume that the SNP in your DTC test has a 1.1 risk factor. (Celiac disease is far from a rare disease, I might add.)

As a member of the average U.S. population, you had a 0.4% chance of having the disease, but a 0.02% chance of being diagnosed with it. That's a pretty big disparity, so maybe there's a good reason to have this test done. As a Canadian, the odds are somewhat different, but let's carry on with the calculations anyhow.

Let's say you do the test and find out you have a 1.1-times risk factor for the disease. OMG, scary!

Wait, let's not freak out yet. That sounds bad, but we haven't finished the calculations.

Your test has the SNP.... 1.1 x 1/250 = 0.44% likelihood you have the disease. Because celiac disease requires a biopsy to definitively diagnose it (and treatment does not start until you've had the diagnosis), would you run out and submit yourself to a biopsy on a 0.44% chance you have a disease? Probably not, unless you have some other knowledge suggesting you're likely to have this disease already.

Then, we factor in the 90% likelihood of getting the SNP call correct: you have a 90% likelihood of having a 0.44% chance of having the disease, and a 10% likelihood of having a 0.4% chance of having the disease.
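
If it helps to see all of that arithmetic in one place, here's a minimal sketch of the calculation - the numbers are just the assumptions used in this post, and the function name is mine:

    # Minimal sketch of the DTC risk arithmetic above. All of the numbers
    # are assumptions from this post, not real data.
    def adjusted_risk(baseline, risk_factor, call_accuracy):
        """Return the disease probability if the SNP call is correct, the
        probability if it's wrong, and the weight on each scenario."""
        if_correct = baseline * risk_factor  # the SNP call is right
        if_wrong = baseline                  # the SNP call is wrong: back to baseline
        return if_correct, if_wrong, (call_accuracy, 1.0 - call_accuracy)

    # Rare disease: 6 in 1,000,000 baseline, 1.17 risk factor, 90% call accuracy
    print(adjusted_risk(6 / 1_000_000, 1.17, 0.90))   # ~0.0007% vs 0.0006%

    # Celiac disease: 1 in 250 baseline, 1.1 risk factor
    print(adjusted_risk(1 / 250, 1.1, 0.90))          # 0.44% vs 0.4%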

Ok, I'd be done panicking about now. And we've only considered two simple things here. Let's add one more, just for fun.

Let's pretend that an unknown environmental stressor is actually involved in triggering the condition, which would explain why the odds are somewhat different in Canada. Since we know nothing about that environmental trigger, we can't even project the odds of coming in contact with it. Who knows what effect it has in combination with the SNP you know about.

By now, I can't help thinking that all of this is just a wild goose chase.

So, when people start talking about how you have to take your DTC results to a Genetic Counsellor or to your MD, I really have to wonder. I can't help but think that unless you have a very good reason to suspect a disease, or some form of a priori knowledge, this whole thing is generally a waste. Your Genetic Counsellor will probably just laugh at you, and your MD will order a lot of unnecessary tests - which of those sounds productive?

Let me make a proposal (and I'm happy to hear dissent): risk factors are great - but they are absolutely useless when it comes to discussing how genetic factors affect you. Let's leave the risk factors to the people writing the studies, and ask the DTC companies to make a statement: what are your odds of being affected by a given condition? And, if you can't make a helpful prediction (aka, a diagnostic test), maybe you shouldn't be selling it as a test.


Thursday, August 13, 2009

Ridiculous Bioinformatics

I think I've finally figured out why bioinformatics is so ridiculous. It took me a while to figure this one out, and I'm still not sure if I believe it, but let me explain to you and see what you think.

The major problem is that bioinformatics isn't a single field; rather, it's the combination of (on a good day) biology and computer science. Each field on its own is a complete subject that can take years to master. You have to respect the biologist who can rattle off biochemical pathway charts and then extrapolate that to the annotations of a genome to find interesting features of a new organism. Likewise, there's some serious respect due to the programmer who can optimize code down at the assembly level to give you incredible speed while still using half the amount of memory you initially expected to use. It's pretty rare to find someone capable of both, although I know a few who can pull it off.

Of course, each field on its own has some "fudge factors" working against you in your quest for simplicity.

Biologists don't actually know the mechanisms and chemistry of all the enzymes they deal with - they are usually putting forward their best guesses, which lead them to new discoveries. Biology can effectively be summed up as "reverse engineering the living part of the universe", and we're far from having all the details worked out.

Computer Science, on the other hand, has an astounding amount of complexity layered over every task, with a plethora of languages and systems, each with their own "gotchas" (are your arrays zero-based or one-based? how does your operating system handle wild cards at the command line? what does your text editor do to gene names like "Sep9"?), leading to absolute confusion for the novice programmer.

In a similar manner, we can also think about the probabilities of encountering these pitfalls. If you have two independent events, each of which has a distinct probability attached, you can multiply the probabilities to determine the likelihood of both events occurring simultaneously.

So, after all that, I'd like to propose "Fejes' law of interdisciplinary research"

The likelihood of achieving flawless work in an interdisciplinary research project is the product of the likelihood of achieving flawless work in each independent area.


That is to say, if your biology experiments (on average) are free of mistakes 85% of the time, and your programming is free of bugs 90% of the time (i.e., you get the right answers), your likelihood of getting the right answer in a bioinformatics project is:
Fp = Flawless work in Programming
Fb = Flawless work in Biology
Fbp = Flawless work in Bioinformatics

Thus, according to Fejes' law:
Fb x Fp = Fbp

and the example given:
0.90 x 0.85 = 0.765

Thus, even an outstanding programmer and bioinformatician will struggle to get an extremely high rate of flawless results.
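
Since it's just a product, the arithmetic is trivial, but here's a two-line sketch anyway - mostly to show how fast the number drops if you bolt on a third discipline (the 0.95 below is an invented rate, purely for illustration):

    def flawless(*rates):
        """Fejes' law: the chance of flawless interdisciplinary work is the
        product of the flawless rates of each independent field."""
        result = 1.0
        for r in rates:
            result *= r
        return result

    print(flawless(0.85, 0.90))        # biology x programming = 0.765
    print(flawless(0.85, 0.90, 0.95))  # add a third field: ~0.727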

Fortunately, there's one saving grace to all of this: the magnitude of the errors is not taken into account. If the bug in the code is tiny and has no impact on the conclusion, that's hardly earth-shattering; likewise, if the biology measurements have just a small margin of error, it's not going to change the interpretation.

So there you have it, bioinformaticians. If I haven't just scared you off of ever publishing anything again, you now know what you need to do...

Unit tests, anyone?


Tuesday, August 11, 2009

SNP/SNV callers minimum useful information

Ok, I sent a tweet about it, but it didn't solve the frustration I feel on the subject of SNP/SNV callers. There are so many of them out there that you'd think they grow on trees. (Actually, they grow on arrays...) I've written one, myself, and I know there are at least 3 others written at the GSC.

Anyhow, at first glance, what pisses me off is that there's no standard format. Frankly, though, that's not even the big problem. What's really underlying it is that there's no standard "minimum information" content being produced by the SNP/SNV callers. Many of them give a bare minimum of information, but lack the details needed to really evaluate the calls.

So, here's what I propose. If you're going to write a SNP or SNV caller, make sure your called variations contain the following fields:
  • chromosome: obviously, the coordinate needed to find the location
  • position: the base position on the chromosome
  • genome: the version of the genome against which the SNP was called (eg. hg18 vs. hg19)
  • canonical: what you expect to see at that position. (Invaluable for error checking!)
  • observed: what you did see at that position
  • coverage: the depth at that position (filtered or otherwise)
  • canonical_obs: how many times you saw the canonical base (key to evaluating what's at that position)
  • variation_obs: how many times you saw the variation
  • quality: give me something to work with here - a confidence value between 0 and 1 would be ideal... but let's pick something we can compare across data sets. Giving me 9 values and asking me to figure something out is cheating. Sheesh!
Really, most of the callers out there give you most, if not all, of it - but I have yet to see the final "quality" being given. The MAQ SNP caller (which is pretty good) asks you to look at several different fields and make up your own mind. That's fine for a first generation, but maybe I can convince people that we can do better in second-generation SNP callers.
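
Just to make the proposal concrete, here's a rough sketch of what a single record with those fields might look like - the field names are the ones from the list above, the example values are made up, and the single quality score is exactly the piece I have yet to see anyone provide:

    from dataclasses import dataclass

    @dataclass
    class SnvCall:
        chromosome: str     # where to find the location
        position: int       # base position on the chromosome
        genome: str         # reference build the call was made against (e.g. hg18)
        canonical: str      # base you expect to see at that position
        observed: str       # base you actually saw
        coverage: int       # depth at that position (filtered or otherwise)
        canonical_obs: int  # times the canonical base was observed
        variation_obs: int  # times the variant base was observed
        quality: float      # single confidence value between 0 and 1

    # one entirely made-up record
    call = SnvCall("chr12", 25398284, "hg18", "C", "T", 42, 18, 24, 0.97)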

Ok, now I've got that off my chest! Phew.


Friday, August 7, 2009

DNA replication video

This happens to be one of the coolest videos I've ever seen of a molecular simulation. I knew the theory of how DNA replication happens, but I'd never actually seen how all of the pieces fit together. If you have a minute, check out the video.


Tuesday, August 4, 2009

10 minutes in a room with microsoft

As the title suggests, I spent 10 minutes in a room with reps from Microsoft. It counts as probably the 2nd least productive time span in my life - second only to the hour I spent at lunch while the Microsoft reps told us why they were visiting.

So, you'd think this would be educational, but in reality, it was rather insulting.

Wisdom presented by Microsoft during the first hour included the fact that Silverlight is cross-platform, that Microsoft is a major supporter of interoperability, and that bioinformaticians need a better platform in .NET to replace bio{java|perl|python|etc}.

My brain was actively leaking out of my ear.

My supervisor told me to be nice and courteous - and I was, but sometimes it can be hard.

The 30 minute meeting was supposed to be an opportunity for Microsoft to learn what my code does, and to help them plan out their future bioinformatics tool kit. Instead, they showed up with 8 minutes remaining in the half hour, during which another grad student and I were expected to explain our theses and still allow for 4 minutes of questions. (Have you ever tried to explain two thesis projects in 4 minutes?)

The Microsoft reps were all kind and listened to our spiel, and then engaged in a round-table question and discussion. What I learned during the process was interesting:
  • Microsoft people aren't even allowed to look at GPL software - legally, they're forbidden.
  • Microsoft developers also have no qualms about telling other developers "we'll just read your paper and re-implement the whole thing."
And finally,
  • Microsoft reps just don't get biology development: the questions they asked all skirted around the idea that they already knew what was best for developers doing bioinformatics work.
Either they know something I don't know, or they assumed they did. I can live with that part, though - they probably know lots of things I don't know. Particularly, I'm sure they know lots about doing coding for biology applications that require no new code development work.

So, in conclusion, all I have to say is that I'm very glad I only published a bioinformatics note instead of a full description of my algorithms (They're available for EVERYONE - except Microsoft - to read in the source code anyhow) and that I produce my work under the GPL. While I never expected to have to defend my code from Microsoft, today's meeting really made me feel good about the path I've charted for developing code towards my PhD.

Microsoft, if you're listening, any one of us here at the GSC could tell you why the biology application development you're doing is ridiculous. It's not that I think you should stop working on it - but you should really get to know the users (not customers) and developers out there doing the real work. And yes, the ones that are doing the innovative and ground-breaking code are mainly working with the GPL. You can't keep your head in the sand forever.


Monday, July 27, 2009

how recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it's been driving me nuts, and I thought I might be able to work around it... but clearly I can't.

With all the work I've put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself - and I think I am. Unfortunately, what I didn't count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple of weeks ago, and I've been working with it for various analyses (SNPs and the like - it's a WTSS data set)... and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. In looking at the data sets, the only thing I can figure is that the year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries - it looks partly sigmoidal.
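
The check itself is simple enough, for what it's worth - something like the sketch below, comparing per-gene (or per-bin) read counts between the two libraries. The gene names and counts here are invented:

    from math import log

    def library_correlation(counts_a, counts_b):
        """Pearson correlation of log-scaled per-gene read counts between
        two libraries (e.g. a sample and its control)."""
        genes = [g for g in counts_a if g in counts_b]
        xs = [log(counts_a[g] + 1) for g in genes]
        ys = [log(counts_b[g] + 1) for g in genes]
        n = len(genes)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    sample  = {"TP53": 120, "KRAS": 300, "MYC": 45}   # toy numbers
    control = {"TP53": 110, "KRAS": 350, "MYC": 60}
    print(library_correlation(sample, control))

A scatter plot tells you more than the single number, of course - that's how the sigmoidal shape showed up in the first place.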

Unfortunately, there's nothing to do except re-sequence the sample. But really, I guess that makes sense. If you're doing a comparison between two data sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: how much time elapsed between the sequencing of your sample and its control?


Friday, May 15, 2009

On the necessity of controls

I guess I've had this rant building up for a while, and it's finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event, and determine its effects on a particular closed system. The scientific method actually demands that we do it - hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics has kind of dropped that idea entirely. At the GSC, we just didn't bother with controls for ChIP-Seq for a long time. I can't say I've even seen too many matched WTSS (RNA-SEQ) experiments for cancer/normals. And that scares me, to some extent.

With all the statistics work I've put into the latest version of FindPeaks, I'm finally getting a good grasp of the importance of using controls well. With the other software I've seen, they do a scaled comparison to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets... and to determining which peaks are truly valid. Without that, you may as well not be using a control.
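
To spell out what I mean by a "scaled comparison" (this is the generic version, not what FindPeaks does): you scale for library size, then test whether the sample count at a peak is higher than the scaling alone would predict. A rough sketch, with a binomial test standing in for whatever statistic a given tool actually uses, and assuming scipy is available:

    from scipy.stats import binomtest

    def peak_pvalue(sample_reads, control_reads, sample_total, control_total):
        """Given the reads in a peak region for the sample and the control,
        plus the total aligned reads in each library, test whether the
        sample is enriched beyond what library-size scaling predicts."""
        n = sample_reads + control_reads
        # fraction of reads expected to land in the sample if nothing is enriched
        expected = sample_total / (sample_total + control_total)
        return binomtest(sample_reads, n, expected, alternative="greater").pvalue

    # toy numbers: 85 reads in the sample peak vs. 30 in the control,
    # from libraries of 12M and 10M aligned reads respectively
    print(peak_pvalue(85, 30, 12_000_000, 10_000_000))

And that really is only half of the story - it says nothing about which peaks appear in both sets, or which ones are artifacts of the control itself.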

Anyhow, that's what prompted me to write this. As I look over the results from the new FindPeaks (3.3.3.1), both for ChIP-Seq and WTSS, I'm amazed at how much clearer my answers are, and how much better they validate compared to the non-control based runs. Of course, the tests are still not all in - but what a huge difference it makes. Real control handling (not just normalization or whatever everyone else is doing) vs. Monte Carlo show results that aren't in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls that exist for my own data sets. I've been pushing for a decent control for my WTSS lanes - and there is matched normal for one cell line - but it's still two months away from having the reads land on my desk... and I'm getting impatient.

Now that I'm able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it's so much more of a challenge without an appropriate control. Besides, who'd believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I'm going to make a suggestion: if you're still doing experiments without a control, and you want to get them published, it's going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.


Thursday, April 16, 2009

Multi-match reads in ChIP-Seq

I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:
"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations - and if the point in question (point 3) applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-match reads, I suppose this no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much - so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie specifically because it was able to do multi-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I'd use multi-match reads in a ChIP-Seq analysis. And, off hand, my answer has to be no. But let me explain my reasoning, and the conditions in which I would change that answer.

First, let's assume that we have Single End Tags, so the multi-match information is not resolvable. That means any time we have a read that maps to more than one location, we either map it to its source - or we map it incorrectly. With two candidate locations, that's a 50% chance of "getting it right." The greater the number of multi-match locations, the smaller the chance that we're actually finding its correct origin. So, at best we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

In contrast, there are things we could do to make them usable. The most widely used method from FindPeaks is the weighted fragment distribution type. Thus, we could expand the principle to weight the fragments according to the number of sites. That would be... bearable. But would it significantly add to the quality of the alignment?
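
By "weight the fragments," I mean something along these lines - a minimal sketch with invented data structures (this is not FindPeaks code):

    def add_weighted_fragments(coverage, read_alignments):
        """Spread a total weight of 1.0 per read across all of the positions
        it aligned to, so a read with four candidate locations contributes
        0.25 at each of them.

        coverage:        dict mapping position -> accumulated weight
        read_alignments: one list of candidate positions per read
        """
        for positions in read_alignments:
            weight = 1.0 / len(positions)
            for pos in positions:
                coverage[pos] = coverage.get(pos, 0.0) + weight
        return coverage

    # toy example: one uniquely mapped read and one read with three candidate sites
    cov = add_weighted_fragments({}, [[1050], [1050, 87200, 240310]])
    print(cov)   # {1050: 1.33, 87200: 0.33, 240310: 0.33}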

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions in which the transcription factor (or other sites) bind. Thus, even if we were concerned that a particular transcription factor binds primarily to the similar motif regions at two sites, there should be more than enough (unique) sequence around that site (which is usually <30-40bp in length) to which you'll still see fragments aligning. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important.

The same argument can be extended for paired end tags: Just as read lengths improve and reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to be within the same repeat region, thus reducing the number of reads that become unresolvable multi-matches. Proportionally, one would then expect that leaving out these reads become a smaller and smaller segment of the population, and would have to worry less and less about their contribution.

So, then, when would I want them?

Well, on the odd chance that you're working with very short reads, you can pull off the weighting properly, you have single end tags, and the multi-match reads make up a significant proportion of the data - then it's worth exploring.

You'd need to start asking the tough questions: did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension, the way some Eland aligners work)? Does the aligner use quality scores to identify mis-alignments? How reliable are the alignments (what's their error rate)? What was your sample, and how divergent is it from the reference? (e.g., cancer samples have a high variation rate, and so encourage many false alignments, making the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging into the multi-match files. That's not to say that you won't find anything good in there - you probably would, if you knew where to look - but the signal-to-noise ratio is going to be pretty poor, just by definition of the fact that they're multi-match reads. You'll just have to ask if it's worth your time.

For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low hanging fruit, when it comes to ChIP-Seq.


Wednesday, March 25, 2009

Searching for SNPs... a disaster waiting to happen.

Well, I'm postponing my planned article, because I just don't feel in the mood to work on that tonight. Instead, I figured I'd touch on something a little more important to me this evening: WTSS SNP calls. Well, as my committee members would say, they're not SNPs, they're variations or putative mutations. Technically, that makes them Single Nucleotide Variations, or SNVs. (They're only polymorphisms if they're common to a portion of the population.)

In this case, they're from cancer cell lines, so after I filter out all the real SNPs, what's left are SNVs... and they're bloody annoying. This is the second major project I've done where SNP calling has played a central role. The first was based on very early 454 data, where homopolymers were frequent, and thus finding SNVs was pretty easy: they were all over the place! After much work, it turned out that pretty much all of them were fake (false positives), and I learned to check for homopolymer runs - a simple trick, easily accomplished by visualizing the data.
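
You don't even need the visualization, really - a filter along these lines does the job (the run length of 4 and the window of 5 bases are arbitrary choices for the example):

    def near_homopolymer(reference, pos, min_run=4, window=5):
        """Return True if the reference sequence has a homopolymer run of at
        least `min_run` bases within `window` bases of a candidate SNV.
        `reference` is a plain string, `pos` is 0-based."""
        start = max(0, pos - window)
        end = min(len(reference), pos + window + 1)
        region = reference[start:end]
        run, longest = 1, 1
        for prev, cur in zip(region, region[1:]):
            run = run + 1 if cur == prev else 1
            longest = max(longest, run)
        return longest >= min_run

    print(near_homopolymer("ACGTAAAAACGTGC", 3))   # True - the A-run is adjacent
    print(near_homopolymer("ACGTACGTACGTGC", 3))   # False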

We moved on to Illumina after that. Actually, it was still Solexa at the time. Yes, this is older data - nearly a year old. It wasn't particularly reliable, and I've now used several different aligners, references and otherwise, each time (I thought) improving the data. We came down to a couple of very intriguing variations, and decided to sequence them. After several rounds of primer design, we finally got one that worked... and lo and behold: 0/2. Neither of them is real. So, now comes the post-mortem: why did we get the false positives this time? Is it bias from the platform? Bad alignments? Or something even more suspicious... do we have evidence of edited RNA? Who knows. The game begins all over again, in the quest to answer the question "why?" Why do we get unexpected results?

Fortunately, I'm a scientist, so that question is really something I like. I don't begrudge the last year's worth of work - which apparently is now more or less down the toilet - but I hope that the why leads to something more interesting this time. (Thank goodness I have other projects on the go, as well!)

Ah, science. Good thing I'm hooked, otherwise I'd have tossed in the towel long ago.


Tuesday, January 6, 2009

My Geneticist dot com

A while back, I received an email from a company called mygeneticist.com that is doing genetic testing to help patients identify adverse drug reactions. I'm not sure what the relationship is, but they seem to be a part of something called DiscoverMe technologies. I bring mygeneticist up, because I had an "interview" with one of their partners, to determine if I am a good subject for their genetic testing program. It seems I'm too healthy to be included, unless they later decide to include me as a control. Nuts-it! (I'm still trying to figure out how to get my genome sequenced here at the GSC too, but I don't think anyone wants to fund that...)

At any rate, I spoke with the representative of their clinical side of operations this morning and had an interesting conversation about my background. In typical fashion, I also took the time to ask a few specific questions about their operations. I'm pretty sure they didn't tell me much more than was available on their various web pages, but I think there was some interesting information that came out of it.

When I originally read their email, I had assumed that they were going to be doing WTSS on each of their patients. At about $8,000 per patient, it's expensive, but a relatively cheap form of discovery - if you can get around some of the challenges involved in tissue selection, etc. Instead, it seems that they're doing specific gene interrogation, although I wasn't able to find out what type of platform they're using. This leads me to believe that they're probably doing some form of literature check for genes related to the drugs of interest, followed by a PCR- or array-based validation across their patient group. Considering the challenges of associating drug reactions with SNPs and genomic variation, I would be very curious to see what they have planned for "value-added" resources. Any drug company can find out (and probably does already know) what's in the literature, and any genetic testing done without approval from the FDA will probably be sued/litigated/regulated out of existence... which doesn't leave a lot of wiggle room for them.

And that led me to thinking about a lot of other questions, which went un-asked. (I'll probably email the genomics expert there to ask some questions, though I'm mostly interested in the business side of it, which they probably won't answer.) What makes them think that people will pay for their services? How can they charge a low enough fee to make the service attractive while still making a profit? And, from the scientific side, assuming they're not just a diagnostic application company, I'm not sure how they'll get a large enough cohort to make sense of the data they receive through their recruitment strategy.

Anyhow, I'll be keeping my eyes on this company - if they're still around in a year or two, I'd be very interested in talking to them again about their plans in the next-generation sequencing field.


Saturday, December 6, 2008

Nothing like reading to stimulate ideas

Well, this week has been exciting. The house sale completed last night, with only a few hiccups. Both we and the seller of the house we were buying got low-ball offers during the week, which gave the real estate agents lots to talk about, but never really made an impact. We had a few sleepless nights waiting to find out if the seller would drop our offer and take the competing one that came in, but in the end it all worked out.

On the more science-related side, despite the fact I'm not doing any real work, I've learned a lot, and had the chance to talk about a lot of ideas.

There's been a huge ongoing discussion about the qcal values, or calibrated base call scores, that are appearing in Illumina runs these days. It's my understanding that in some cases, these scores are calibrated by looking at the number of perfect alignments, 1-off alignments, and so on, and using the SNP rate to derive a metric that can be applied to estimate an expected rate of mismatched base calls. Now, that's fine if you're sequencing an organism that has a genome identical to, or nearly identical to, the reference genome. When you're working on cancer genomes, however, that approach may seriously bias your results, for very obvious reasons. I've had this debate with three people this week, and I'm sure the conversation will continue on for a few more weeks.
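
My reading of the scheme, in code form - bin the base calls by their reported quality, count how often they disagree with the reference, and convert that observed error rate back into a phred-style score (the inputs here are made up):

    from collections import defaultdict
    from math import log10

    def recalibrate(calls):
        """calls: iterable of (reported_quality, matches_reference) pairs.
        Returns a table of reported quality -> empirically observed quality."""
        totals, errors = defaultdict(int), defaultdict(int)
        for q, matches in calls:
            totals[q] += 1
            if not matches:
                errors[q] += 1
        return {q: -10 * log10((errors[q] + 1) / (totals[q] + 1))  # crude pseudocount
                for q in totals}

    # toy input: 1,000 bases reported at Q30, 5 of which mismatch the reference
    toy = [(30, True)] * 995 + [(30, False)] * 5
    print(recalibrate(toy))   # Q30 comes back as roughly Q22 on this toy set

Which is exactly where the cancer genome problem comes in: every real variant counts as a "mismatch" against the reference, so the more divergent the sample, the further the calibrated scores get dragged down.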

In terms of studying for my comprehensive exam, I'm now done with the first 12 chapters of Weinberg's "Biology of Cancer" textbook, and I seem to be retaining it fairly well. My girlfriend quizzed me on a few things last night, and I did reasonably well answering the questions. 6 more days, 4 more chapters to go.

The most interesting part of the studying was Thursday's seminar day. In preparation for the Genome Sciences Centre's bi-annual retreat, there was an all-day seminar series, in which many of the PIs spoke about their research. Incidentally, 3 of my committee members were speaking, so I figured it would be a good investment of my time to attend. (Co-incidentally, the 4th committee member was also speaking that day, but on campus, so I missed his talk.)

Indeed - having read so many chapters of the textbook on cancer biology, I was FAR better equipped to understand what I was hearing - and many of the research topics presented picked up exactly where the textbook left off. I also have a pretty good idea what questions they will be asking now: I can see where the questions during my committee meetings have come from; it's never far from the research they're most interested in. Finally, the big picture is coming together!

Anyhow, two specific things this week have stood out enough that I wanted to mention them here.

The first was the keynote speaker's talk on Thursday. Dr. Morag Park spoke about the environment of tumours, and how it has a major impact on the prognosis of the cancer patient. One thing that wasn't settled was why the environment is responding to the tumour at all. Is the reaction of the environment dictated by the tumour, making this just another element of the cancer biology, or does the environment have its own mechanism to detect growths, which is different in each person? This is definitely an area I hadn't put much thought into until seeing Dr. Park speak. (She was a very good speaker, I might add.)

The second item was something that came out of the textbook. They have a single paragraph at the end of chapter 12, which was bothering me. After discussing cancer stem cells, DNA damage and repair, and the whole works (500 pages of cancer research into the book...), they mention progeria. In progeria, children age dramatically quickly, such that a 12-14 year old has roughly the appearance of an 80-90 year old. It's a devastating disease. However, the textbook mentions it in the context of DNA damage, suggesting that the progression of this disease may be caused by general DNA damage sustained by the majority of cells in the body over the short course of the life of a progeria patient. This leaves me of two minds: 1) the DNA damage to the somatic cells of a patient would cause them to lose tissues more rapidly, which would have to be regenerated more quickly, causing more rapid degradation of tissues - shortening telomeres would take care of that. This could cause a more rapid aging process. However, 2) the textbook just finished describing how stem cells and rapidly reproducing progenitor cells - the precursors involved in tissue repair - are dramatically more sensitive to DNA damage. Wouldn't it be more likely, then, that people suffering from this disease are actually drawing down their supply of stem cells more quickly than people without DNA repair defects? All of their tissues may also suffer more rapid degradation than normal, but it's the stem cells which are clearly required for long-term tissue maintenance. An interesting experiment could be done on these patients requiring no more than a few milliliters of blood - has their ratio of CD34+ cells dropped compared to non-sufferers of the disease? Alas, that's well outside of what I can do in the next couple of years, so I hope someone else gives this a whirl.

Anyhow, just some random thoughts. 6 days left till the exam!


Saturday, July 26, 2008

How many biologists does it take to fix a radio?

I love google analytics. You can get all sorts of information about traffic to your web page, including the google searches people use to get there. Admittedly, I really enjoy seeing when people link to my web page, but the google searches are a close second.

This morning, though, I looked through the search terms, and discovered that someone had found my page by googling for "How many biologists does it take to fix a radio?" And that had me hooked. I've been toying with the idea all morning, and figured I had to try to blog an answer to it. (I've already touched on the subject once, with less humour, but it's worth revisiting.)

Now, bear in mind that I'm actually a biochemist and possibly a bioinformatician - and by some stretch of imagination, a microbiologist - so I enjoy poking fun at biologists, but it's all in good humour. Biology is infinitely more complicated than radios, but it makes for a fun analogy.

Ahem.

This is how I see it going.

  • A Nobel prize winner makes a keynote speech, expounding on the subject that biologists have completely ignored the topic of radios. They deserve to be studied, and are a long-neglected topic that is key to understanding the universe. The Nobel prize winner further suggests his/her own type of broken radio, which he/she has been tinkering with in his/her garage for several months, as the model organism.

  • After the speech, several prominent biologists go to the bar, drink a lot, and then decide that the general consensus is that they should look at fixing broken radios.

  • Several opinion papers and notes appear on the subject, and a couple grad student written reviews pop up in the literature.

  • A legion of taxonomists appear, naming broken radios according to some principle that makes perfect sense to them. (eg. Monoantenna smithii, Nullamperage robertsoniaii). High school students are forced to learn the correct spellings of several prominent strains.

  • A Nature paper appears, describing the glossy casing of the Radio, the interaction of the broken radio with an electrical socket and the failed attempt to sequence the genome. Researchers around the world have been scooped by this first publication, and all subsequent attempts to publish descriptions of broken radios are not sufficiently novel to warrant publication in a big name journal.

  • Biologists begin to specialize in radio parts. Journal articles appear on components such as "purple red purple gold" (PrpG), which is shown to differ dramatically from a similar-appearing component, "blue green purple gold" (BgpG), and both are promptly given new names by ex-drosophila researchers: "Brothers for the Preservation of Tequila Based Drinks 12" and "Trombone."

  • Someone tries to patent a capacitor, just in case it's ever useful. This spawns three biotech companies, two of which spend $120 million in less than 3 years and fold.

  • Someone does a knock out on a working radio and promptly discovers and names the component "Signal Silencing Subcomponent 1" or "Sss1". 25 more are discovered in a high-throughput screen.

  • X-ray studies are done on Sss22, resulting in a widely acclaimed paper which will later result in a Nobel prize nomination. No one has the faintest idea how Sss22 works or what it does.

  • Science fiction writers publish several fantastic novels suggesting that one day we might be able to fix radios by replacing individual parts.

  • The religious right declares biologists are playing god, and that fixing radios is beyond the capacity of humans. The moral dilemmas are too complex. Ethicists get involved. The US president tries to cut funding for biologists doing research on broken radios.

  • A researcher invents a method of doing in-situ component complementation, which allows a single element to be bypassed and replaced with a new one. All new components have green flags attached to them to make studying them easier.

  • Someone else invents a method of replacing a frayed power cord, producing a working phenotype from a broken radio. The resulting media storm declares the discovery of the cure for broken radios.

  • The technique for fixing power cords begins the long process of getting FDA approval. 10 years later (and with a $1bn investment showing that the technique also works on lamps and doesn't cause side effects in electric toothbrushes), the fix is allowed to go to market.

  • Marketing is conducted, telling people (with working and broken radios alike) that maybe they should try the cure, just in case they might have a frayed power cord some day too. They should talk to their doctor about if it's right for them.

  • Advertisements appear on tv showing silent smiling people holding on to power cords.

  • Long term studies after the fact show that the new part wasn't as good as it could have been. Sucking on it may cause liver damage.

  • The religious right takes the recall as a sign that science has failed again. Holistic fixes for frayed power cords appear, as well as organic electricity and antenna adjustment therapies, which work for some people. Products appear on the shopping channel.

  • Technology moves on, the radio becomes obsolete. Several biotech companies acquire each other in blockbuster mergers and begin working on new target components for computer sound cards.

Have a good weekend, everyone. (=
