QuEST
(Pre-script: In preparation for my comprehensive exam, I'm trying to prepare critical evaluations of papers in the area of my research. I'll provide comments, analysis and references (where appropriate), and try to make the posts somewhat interesting. However, these posts are simply comments and - coming from a graduate student - shouldn't be taken too seriously. If you disagree with my points, please feel free to comment on the article and start a discussion. Nothing I say should be taken as personal or professional criticism - I'm simply trying to evaluate the science in the context of the field as it stands today.)
(UPDATE: A response to this article was kindly provided by Anton Valouev, and can be read here.)
I once wrote a piece of software called WINQ, which was the predecessor of a piece of software called Quest. Not that I'm going to talk about that particular piece of Quest software for long, but bear with me a moment - it makes a nice lead-in.
The software I wrote wasn't started before the University of Waterloo's version of Quest, but it was released first. Waterloo was implementing a multi-million-dollar set of software for managing student records, built on Oracle databases, PeopleSoft software, and tons of custom extensions to web interfaces and reporting. Unfortunately, the project was months behind, and the Quest system was nowhere near being deployed. (Vendor problems and the like.) That's when I became involved - in two months of long days, I used Cognos tools (several of them, involving 5 separate scripting and markup languages) to build the WINQ system, which gave the faculty a way to query the Oracle database through a secure web frontend and get all of the information they needed. It was supposed to be in use for about 4-6 months, until Quest took over... but I heard it was used for more than two years. (There are many good stories there, but I'll save them for another day.)
Back to ChIP-Seq's QuEST: this application was the subject of a recently published article. In a parallel timeline to the Waterloo story, QuEST was probably started before I got involved in ChIP-Seq, and was definitely released after I released my software - but this time I don't think it will replace my software.
The paper in question (Valouev et al., Nature Methods, Advance Online Publication) is called "Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data". I suspect it was published with the intent of being the first article on ChIP-Seq software, which, unfortunately, it wasn't. What's most strange to me is that it seems to be simply a reiteration of the methods used by Johnson et al. in their earlier ChIP-Seq paper. I don't see anything novel in this paper, though maybe someone else has seen something I've missed.
The one thing that surprises me about this paper, however, is their use of a "kernel density bandwidth", which appears to be a sliding window of pre-set length. This flies in the face of the major advantage of ChIP-Seq, which is the ability to get very strong signals at high resolution. By forcing a "window" over their data, they are likely losing a lot of the resolution they could have found by investigating the reads directly. (Admittedly, with a window of 21bp, as used in the article, they're not losing much, so it's not a very heavy criticism.) I suppose it could be used to provide a quick way of doing subpeaks (finding individual peaks in areas of contiguous read coverage) at a cost of losing some resolving power, but I don't see that discussed as an advantage.
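To make the resolution argument concrete, here is a minimal Python sketch comparing a flat sliding window with a Gaussian kernel density over read positions. It is not QuEST's code: the function names, the toy read positions, and the choice to treat the 21 bp figure as a Gaussian standard deviation are all assumptions made purely for illustration.

```python
import math

def window_profile(read_positions, start, end, window=21):
    """Count reads inside a flat window centred on each base in [start, end)."""
    half = window // 2
    return [sum(1 for r in read_positions if abs(r - pos) <= half)
            for pos in range(start, end)]

def gaussian_profile(read_positions, start, end, bandwidth=21.0):
    """Sum a Gaussian kernel over all reads at each base in [start, end).

    Assumption: 'bandwidth' is used as the kernel's standard deviation.
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return [sum(norm * math.exp(-0.5 * ((pos - r) / bandwidth) ** 2)
                for r in read_positions)
            for pos in range(start, end)]

def peak_position(profile, start):
    """Genomic coordinate of the first maximum in a profile."""
    best = max(range(len(profile)), key=profile.__getitem__)
    return start + best

# Hypothetical read start positions: a cluster near 100 plus one stray read.
reads = [95, 98, 100, 100, 101, 103, 106, 140]

flat = window_profile(reads, 50, 200)
smooth = gaussian_profile(reads, 50, 200)

plateau_width = flat.count(max(flat))   # bases sharing the top window count
smooth_peak = peak_position(smooth, 50)  # single maximum of the kernel estimate
```

With these toy reads the flat window gives a plateau of identical counts several bases wide, so the peak position is ambiguous within that stretch, while the kernel estimate has a single maximum near the centre of the cluster. Either way the profile is smoothed relative to the raw reads, which is the resolution cost discussed above.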
The second thing they've done is provide a directional component to peak finding. Admittedly, I tried to do the same thing, but found it didn't really add much value. Both the QuEST publication and my application note on FindPeaks 3.1 mention the ability to do this - and then fail to show any data that demonstrates the value of using this mechanism versus identifying peak maxima. (In my case, I wasn't expected to provide data in the application note.)
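For readers who haven't seen the idea, a directional call can be sketched roughly as follows. This is a toy illustration only, not the method implemented in QuEST or FindPeaks: it assumes forward-strand read starts pile up upstream of the binding site and reverse-strand read starts pile up downstream, and it places the site halfway between the two strand-specific modes. The function names and read positions are hypothetical.

```python
from collections import Counter

def strand_mode(positions):
    """Most common read start on one strand (ties resolved to the smallest)."""
    counts = Counter(positions)
    top = max(counts.values())
    return min(p for p, c in counts.items() if c == top)

def directional_call(forward_starts, reverse_starts):
    """Estimate a binding site from strand-specific read pileups.

    Returns (site, shift): the site is the midpoint between the forward
    and reverse modes, and the shift is half the distance between them.
    """
    fwd = strand_mode(forward_starts)
    rev = strand_mode(reverse_starts)
    shift = (rev - fwd) / 2.0
    return fwd + shift, shift

# Hypothetical read starts flanking a site at position 1000.
forward = [940, 950, 950, 955, 960]
reverse = [1040, 1050, 1050, 1045, 1060]

site, shift = directional_call(forward, reverse)
```

Whether a call like this beats simply taking the maximum of the combined coverage is exactly the comparison that neither publication shows.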
Anyhow, that was the downside. There are two very good aspects to this paper. The first is that they do use controls. Even now, the Genome Sciences Centre is struggling with ChIP-Seq controls, while it seems everyone else is using them to great effect. I really enjoyed this aspect of it. In fact, I was rather curious how they'd done it, so I took a look through the source code of the application. I found the code well organized but somewhat difficult to wade through, as the coding style was very different from my own. Unfortunately, I couldn't find any code for dealing with controls, which leads me to think this is an unreleased feature, and that controls were handled by post-processing the results of their application. Too bad.
The second thing I really appreciated was the motif finding work, which isn't strictly ChIP-Seq, but is one of the uses to which the data can be applied. Unfortunately, this is also not new, as I'm aware of many earlier experiments (published and unpublished) that did this as well, but it does make a nice little story. There's good science behind this paper - and the data collected on the chosen transcription factors will undoubtedly be exploited by other researchers in the future.
So, here's my summary of this paper: As a presentation of a new algorithm, it fails to produce anything novel, and no experiments are provided to show the value of its algorithms versus any other algorithm. On the other hand, as a paper on growth-associated binding protein and serum response factor (GABP and SRF, respectively), it presents a nice compact story.
Labels: article review, Chip-Seq
4 Comments:
Hi Anthony,
nice to read your comments about QuEST. I was also surprised to see that they managed to get this into NM with virtually no comparison to other methods or comments on how it works with other people's datasets etc. What is remarkable though is the number of reads they get in the peaks - often several thousands. Then of course you can look at the distributions and peak shifts in individual peaks but I doubt the method is very useful if you have only 10 or so reads...
What was it about the motif finding you found interesting? Perhaps the "remarkable" finding that the average distance was close to 0 from peak centers... :)
Would be nice to hear your comments about SISSRs also.
Hi Chipper,
Thanks for the comment - Yes, I did enjoy the distance for motifs to peak calls being 0.1bp to 2.55bp. In terms of molecular distances, that's about 0.26 to 6.7 Angstroms. I'd never considered converting to an actual measurement, but it's really quite mind blowing to think the technology has come that far.
The motif finding isn't much different from everyone else's, but I enjoyed the way it was integrated into the paper - I thought it was well done.
Anyhow, Yes, the SISSR paper is next on my list. (=
Anthony
Hi,
perhaps I was not clear - my point was that the 0.1 to 2.55 bp only tells you that your peaks are on average centered on the motifs, that is you could have one group at +100 bp and one at -100 bp, so that value does not say much about the resolution. Perhaps I misunderstood, but it did not look _that_ good from the histograms.
Hi Chipper,
I did understand what you were talking about, though the paper does give standard deviations for those distributions as 13.4bp - 21.8 bp, which addresses your point - I think. (If I'm still off, let me know.)
On the other hand, I thought it was fascinating that people can talk about a fraction of a base pair. In terms of the distribution, I understand what the article was discussing - but when you start to think about what those numbers mean as a physical distance, I'm somewhat mystified. The median distance of the motif to the peak call is a fraction of the length of a hydrogen bond!