Yesterday I got to dish out some criticism, so it's only fair that I take some myself today. It came in the form of an article called "Modeling ChIP Sequencing In Silico with Applications" by Zhengdong D. Zhang et al., PLoS Computational Biology, August 2008: 4(8).
This article is actually very cool. They've settled several points that have been hotly debated here at the Genome Sciences Centre, made the case for some of the stuff I've been working on - and then shown me a few places where I was dead wrong.
The article takes direct aim at the work done in Robertson et al., using the STAT1 transcription factor data produced in that study. Their key point is that the "FDR" used in that study was far from ideal, and that it could be significantly improved. (Something I strongly believe as well.)
For those who aren't aware, Robertson et al. is sort of the ancestral origin of the FindPeaks software, so this particular paper is more or less aiming at the FindPeaks thresholding method. (Though I should mention that they're comparing their results to the peaks in the publication, which used the unreleased FindPeaks 1.0 software - not the FindPeaks 2+ versions, of which I'm the author.) Despite the comparison to a not-quite-current version of the software, their points are still valid and need to be taken seriously.
Mainly, I think there are two points that stand out:
1. The null model isn't really appropriate.
2. The even distribution model isn't really appropriate.
The first, the null model, is relatively obvious - everyone has been pretty clear from the start that the null model doesn't really work well. This model, pretty much consistent across ChIP-Seq platforms, can be paraphrased as "if my reads were all noise, what would the data look like?" This assumption is destined to fail every time - the reads we obtain aren't all noise, and thus assuming they are as a control is really a "bad thing"(tm).
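To make that concrete, here's a minimal sketch of what the null model amounts to - all the numbers here are invented for illustration, and this isn't anyone's actual implementation: scatter your reads uniformly across a toy genome and see how tall the "peaks" get by chance alone.

```python
import numpy as np

rng = np.random.default_rng(42)
G, N, READ_LEN = 1_000_000, 50_000, 200  # toy genome size, read count, fragment length

# Difference-array pileup: +1 at each read start, -1 one past its end,
# then a cumulative sum gives per-base coverage.
starts = rng.integers(0, G - READ_LEN, size=N)
diff = np.zeros(G + 1, dtype=np.int32)
np.add.at(diff, starts, 1)
np.add.at(diff, starts + READ_LEN, -1)
coverage = np.cumsum(diff[:G])

# The tail of this height distribution is the null model's yardstick:
# the threshold goes wherever "noise" peaks of that height become rare.
for h in range(int(coverage.max()) + 1):
    print(f"bases at height >= {h}: {(coverage >= h).sum()}")
```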
The second, the even distribution model, is equally disastrous. This can be paraphrased as "if all of my noise were evenly distributed across some portion of the chromosome, what would the data look like?" Alas, noise doesn't distribute evenly for these experiments, so it should be fairly obvious why this is also a "bad thing"(tm).
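To see why that matters, here's a toy comparison along the same lines - again, the "sticky" regions and all the parameters are made up: when noise clusters, chance peaks grow much taller than the even model predicts, so any threshold derived under an even distribution is too permissive.

```python
import numpy as np

rng = np.random.default_rng(7)
G, N, READ_LEN = 1_000_000, 50_000, 200

def max_pileup(starts):
    # Same difference-array pileup as above.
    diff = np.zeros(G + 1, dtype=np.int32)
    np.add.at(diff, starts, 1)
    np.add.at(diff, starts + READ_LEN, -1)
    return np.cumsum(diff[:G]).max()

# Even distribution: every position equally likely.
uniform_starts = rng.integers(0, G - READ_LEN, size=N)

# Non-uniform: 20 invented "sticky" 5 kb regions attract 10x the
# background rate - a stand-in for the biases real noise shows.
weights = np.ones(G - READ_LEN)
for s in rng.integers(0, G - READ_LEN - 5_000, size=20):
    weights[s:s + 5_000] *= 10.0
weights /= weights.sum()
clustered_starts = rng.choice(G - READ_LEN, size=N, p=weights)

print("tallest chance peak, even noise:      ", max_pileup(uniform_starts))
print("tallest chance peak, clustered noise: ", max_pileup(clustered_starts))
```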
The solution presented in the paper is fairly obvious: create a full simulation of your ChIP-Seq data. Their version, however, is a much more rigorous process. They simulate a genome-space, remove areas that would be gaps or repeats in the real chromosome, then begin tweaking the genome simulation to replicate their experiment, using weighted statistics collected in the ChIP-Seq experiment.
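As I read it, the procedure has roughly the following shape - and I stress this is my paraphrase with placeholder numbers and an invented weighting scheme, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
G, N = 1_000_000, 50_000  # toy genome size and read count

# 1. Start from a fully mappable genome-space, then remove regions that
#    would be gaps or repeats in the real chromosome (placeholders here).
mappable = np.ones(G, dtype=bool)
for s in rng.integers(0, G - 10_000, size=30):
    mappable[s:s + 10_000] = False

# 2. Weight the remaining positions using statistics collected from the
#    ChIP-Seq experiment itself. The gamma draw below is a made-up
#    stand-in for whatever per-base weights the real procedure derives.
experiment_weights = rng.gamma(shape=2.0, scale=1.0, size=G)
weights = np.where(mappable, experiment_weights, 0.0)
weights /= weights.sum()

# 3. Sample simulated read starts from the weighted, masked genome; the
#    pileup of these reads becomes the background for judging real peaks.
sim_starts = rng.choice(G, size=N, p=weights)
print("simulated reads placed:", len(sim_starts), "of", N)
```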
On the one hand, I really like this method, as it should give a good version of a control; on the other, I don't like that you need to know a lot about the genome of interest (i.e., mappability, repeat-masking, etc.) before you can analyze your ChIP-Seq data. Of course, if you're going to simulate your genome, simulate it well - I agree with that.
I don't want to belabor the point, but this paper provides a very nice method for simulating ChIP-Seq noise in the absence of a control, as in Robertson et al. However, I think there are two things that have changed since this paper was submitted (January 2008) that should be mentioned:
1. FDR calculations haven't stood still. Even at the GSC, we've been working on two separate FDR models that no longer use the null model; however, both still make even-distribution assumptions, which is also not ideal. (I've sketched the general shape of this kind of calculation after this list.)
2. I believe everyone has now acknowledged that there are several biases that can't be accounted for in any simulation technique, and that controls are the way forward. (They're used very successfully in QuEST, which I discussed yesterday.)
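For reference, here's the general shape of an empirical FDR calculation of the sort being debated - my own simplified sketch, not the GSC models or the paper's method, and the toy height data are invented:

```python
import numpy as np

def empirical_fdr(observed_heights, background_heights):
    """For each candidate threshold h, FDR(h) = (background peaks >= h)
    / (observed peaks >= h), assuming (for simplicity) the background
    simulation produced the same number of peaks as the experiment."""
    observed = np.asarray(observed_heights)
    background = np.asarray(background_heights)
    fdr = {}
    for h in np.unique(observed):
        false_calls = int((background >= h).sum())
        called = int((observed >= h).sum())
        fdr[int(h)] = min(1.0, false_calls / called) if called else 0.0
    return fdr

# Toy usage: in practice the background heights would come from a
# simulation like the ones sketched above.
rng = np.random.default_rng(1)
observed = rng.poisson(5, size=2_000) + rng.binomial(1, 0.05, size=2_000) * 20
background = rng.poisson(5, size=2_000)
for h, rate in sorted(empirical_fdr(observed, background).items()):
    print(f"threshold {h:>2}: FDR ~ {rate:.3f}")
```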
Anyhow, to summarize this paper: Zhang et al. provide a fantastic critique of the thresholding and FDR used in early ChIP-Seq papers (which is still in use today, in one form or another), and demonstrate a viable and clearly superior method for refining ChIP-Seq results without a matched control. This paper should be read by anyone working on FDRs for next-gen sequencing and ChIP-Seq software.
(Post-script: In preparation for my comprehensive exam, I'm trying to prepare critical evaluations of papers in the area of my research. I'll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and - coming from a graduate student - shouldn't be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism - I'm simply trying to evaluate the science in the context of the field as it stands today.)
Labels: article review, Chip-Seq, false discovery rate