Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, March 11, 2010

Wolfram Alpha recreates Ensembl?

Ok, this might not be my most coherent post - I'm finally getting better after being sick for a whole week, which has left my brain feeling somewhat... spongy. Several of us AGBT-ers have come down with something after getting back, and I have a theory that it was something in the food we were given.... Maybe someone slipped something into the food to slow down research at the GSC??? (-; [insert conspiracy theory here.]

Anyhow, I just received a link [aka spam] from Wolfram Alpha, via a posting on LinkedIn, letting me know all about their great new product: Wolfram Alpha now has genome information!

Looking at their quick demo, I'm somewhat less than impressed. Here's the link, if you'd like to check it out yourself: Wolfram Alpha Blog Post (Genome)

I'm unimpressed for two reasons. The first is that there are TONS of other resources that do this - and apparently do it better, from the little I've seen on the blog. For the moment, they have 11 genomes in there, which they hope to expand in the future. I'll have to look more closely, if I find the motivation, as I might be missing something, but I really don't see much that I can't do in the UCSC genome browser or on the Ensembl web page. The second is that I'm still unimpressed by Wolfram Alpha's insistence that it's more than just a search engine, and that if you use it to answer a question, you need to cite it.

I'm all in favour of using really cool algorithms, and searches are no exception. [I don't think I've mentioned this to anyone yet, but if you get a chance, check out Unlimited Detail's use of search algorithms to do unbelievable 3D graphics in real time.] However, if you're going to send links boasting about what you can do with your technology, do something other people can't do - and be clear about what it is. From what I can tell, this is just a mash-up meta-analysis of a few small publicly available resources. It's not like we don't have other engines that do the same thing, so I'm left wondering what they think makes it worth going there... anyone?

Worst of all, I'm not sure where they get their information from... where do they get their SNP calls from? How can you trust that, when you can't even trust dbSNP?

Anyhow, for the moment, I'll keep using resources that I can cite specifically, instead of just citing Wolfram Alpha... I don't know how reviewers would take it if I cured cancer... and cited Wolfram as my source.

Happy searching, people!

Wednesday, November 4, 2009

New ChIP-seq control

Ok, so I've finally implemented and debugged a second type of control in FindPeaks... It's different, and it seems to be more sensitive, requiring fewer assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won't promise it's flawless, but hey, it's open source, and I'm willing to bug fix anything people find.)

If you do, let me know, and I'll tell you how to activate it. Let the testing begin!

Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - masking the triviality of the situation.

In reality, the three most frustrating things about peak calling are:
  1. Maintaining the software

  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that can be tossed aside when they're done with it - but code re-use and recycling are a skill. (And an important one.) Writing your peak finder to be modular is a lot of work, and a huge investment of time is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format and now on to SAM and BAM (not excluding BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added quite rapidly by Tim, with only a few hours of investment. That wouldn't have been possible without the massive upfront investment in good modularity.
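
To make that concrete, here's a minimal sketch of the iterator-per-format idea. The class and interface names are invented for illustration - this isn't FindPeaks' actual internals, just the pattern:

    import java.io.*;

    // One common currency for the rest of the pipeline...
    class AlignedRead {
        String chromosome;
        int start;
        boolean reverseStrand;
    }

    // ...and one small interface that every file format has to satisfy.
    interface ReadIterator extends Closeable {
        AlignedRead next() throws IOException; // returns null at end of input
    }

    // Supporting a new format then means writing one new class, with no
    // changes to the peak calling code. A bare-bones SAM reader:
    class SamReadIterator implements ReadIterator {
        private final BufferedReader in;
        SamReadIterator(Reader source) { in = new BufferedReader(source); }
        public AlignedRead next() throws IOException {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("@")) continue; // skip SAM header lines
                String[] fields = line.split("\t");
                AlignedRead read = new AlignedRead();
                read.reverseStrand = (Integer.parseInt(fields[1]) & 0x10) != 0;
                read.chromosome = fields[2];
                read.start = Integer.parseInt(fields[3]);
                return read;
            }
            return null;
        }
        public void close() throws IOException { in.close(); }
    }

The peak caller only ever sees AlignedRead objects, so the next format that comes along costs one new iterator rather than a rewrite.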

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a couple of choices: you either sort the reads and then move along them one at a time, determining where they land - OR - you pre-load all the reads, then move along the chromosome. The first model takes very little memory, but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less CPU time - but is intensely memory-hungry.

If you want to visualize this, the first method organizes all of your reads by position, then walks down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second builds a model of the chromosome, much like a "pileup" file, which can then be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you do, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the second is pretty much determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use).
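
A minimal sketch of the first method, assuming the read start positions are already sorted (illustrative only - FindPeaks' real windowing logic is more involved):

    import java.util.*;

    class WindowWalk {
        // Slide a fixed-width window along position-sorted read starts;
        // memory use is bounded by the window's occupancy, not the genome.
        static void walk(int[] sortedStarts, int windowSize) {
            Deque<Integer> window = new ArrayDeque<>();
            for (int start : sortedStarts) {
                window.addLast(start);
                // evict reads that have fallen out the back of the window
                while (window.peekFirst() <= start - windowSize)
                    window.removeFirst();
                System.out.println("position " + start + ": "
                        + window.size() + " reads in window");
            }
        }

        public static void main(String[] args) {
            walk(new int[]{100, 120, 125, 500, 505}, 50);
        }
    }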

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments with transcriptomes (or indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads and process them individually - which works well when correctly sorted. Unfortunately, new formats no longer allow that: using a "pre-sorted" BAM/SAM file, you can now find multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure that the next one returned is in the correct order. (You can get odd bugs otherwise - and yes, there are many other potential solutions.)
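
A sketch of what that buffering layer amounts to (an illustration of the idea, not FindPeaks' actual code): fragments of gapped reads go into a priority queue keyed by start position, and a fragment is only released once the sorted input has moved past it, so nothing arriving later can sort ahead of it.

    import java.util.*;

    class FragmentBuffer {
        // {start, length} fragments of gapped reads, ordered by start
        private final PriorityQueue<int[]> pending =
                new PriorityQueue<>(Comparator.comparingInt((int[] f) -> f[0]));

        // queue up the aligned fragments of one (possibly gapped) read
        void push(int[][] fragments) {
            for (int[] fragment : fragments)
                pending.add(fragment);
        }

        // emit every fragment starting before the current input position;
        // since the input is sorted, no later read can sort ahead of these
        List<int[]> popReady(int currentInputPosition) {
            List<int[]> ready = new ArrayList<>();
            while (!pending.isEmpty()
                    && pending.peek()[0] < currentInputPosition)
                ready.add(pending.poll());
            return ready;
        }
    }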

Moving along to the last pitfall: the one thing that people want out of a peak finder is that it's able to do the latest and greatest methods - and do it ahead of everyone else. That on its own is a near-impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, it can be a challenge - particularly since every user wants something subtly different from the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;

Monday, September 28, 2009

Recursive MC solution to a simple problem...

I'm trying to find balance between writing and experiments/coding. You can't do both at the same time without going nuts, in my humble opinion, so I've come up with the plan of alternating days. One day of FindPeaks work, one day on my project. At that rate, I may not give the fastest responses (yes, I have a few emails waiting), but it should keep me sane and help me graduate in a reasonable amount of time. (For those of you waiting, tomorrow is FindPeaks day.)

That left today to work on the paper I'm putting together. Unfortunately, working on the paper doesn't mean I don't have any coding to do. I had a nice simulation that I needed to run: given the data sets I have, what are the likely overlaps I would expect?

Of course, I hate solving a problem once - I'd rather solve the general case and then plug in the particulars.

Today's problem can be summed up as: "Given n data sets, each with i_n genes, what is the expected number of genes common to each possible overlap of 2 or more datasets?"

My solution, after thinking about the problem for a while, was to use recursion. Not surprisingly, I haven't written recursive code in years, so I was a little hesitant to give it a shot. Still, I whipped up the code, gave it a try - and it worked the first time. (That's sometimes a rarity with my code - I'm a really good debugger, but can often be sloppy when writing code quickly the first time.) Best of all, the code is extensible: if I have more data sets later, I can just add them in and re-run. No code modification needed beyond changing the data. (Yes, I was sloppy and hard-coded the data, though it would be trivial to read it from a file, if someone wants to re-use this code.)
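
For the curious, the logic boils down to something like the sketch below. The genome size, trial count and set sizes are hypothetical placeholders (the real code hard-codes its own data), but the recursion is the same idea: at each data set, branch on "skip it" or "intersect it in", and tally the intersection size for every combination of two or more sets.

    import java.util.*;

    public class OverlapSimulation {
        static final int GENOME_SIZE = 20000;  // hypothetical gene universe
        static final int TRIALS = 10000;
        static final int[] SET_SIZES = {1200, 800, 450}; // placeholder sizes
        static final Random RNG = new Random();

        public static void main(String[] args) {
            Map<String, Long> totals = new TreeMap<>();
            for (int trial = 0; trial < TRIALS; trial++) {
                List<Set<Integer>> sets = new ArrayList<>();
                for (int size : SET_SIZES)
                    sets.add(randomSubset(size));
                recurse(sets, 0, null, "", totals);
            }
            for (Map.Entry<String, Long> entry : totals.entrySet())
                System.out.printf("sets %s: expected overlap %.2f%n",
                        entry.getKey(), (double) entry.getValue() / TRIALS);
        }

        // draw a uniform random subset of the gene universe
        static Set<Integer> randomSubset(int size) {
            Set<Integer> subset = new HashSet<>();
            while (subset.size() < size)
                subset.add(RNG.nextInt(GENOME_SIZE));
            return subset;
        }

        // depth-first over all combinations of data sets, carrying the
        // running intersection; tally sizes for combinations of 2 or more
        static void recurse(List<Set<Integer>> sets, int index,
                            Set<Integer> intersection, String label,
                            Map<String, Long> totals) {
            if (index == sets.size()) return;
            recurse(sets, index + 1, intersection, label, totals); // skip it
            Set<Integer> next = new HashSet<>(
                    intersection == null ? sets.get(index) : intersection);
            if (intersection != null)
                next.retainAll(sets.get(index));
            String nextLabel = label.isEmpty()
                    ? String.valueOf(index) : label + "+" + index;
            if (nextLabel.contains("+")) // two or more sets involved
                totals.merge(nextLabel, (long) next.size(), Long::sum);
            recurse(sets, index + 1, next, nextLabel, totals); // include it
        }
    }

Adding another data set later is then just one more entry in SET_SIZES - which is exactly the extensibility I was after.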

Anyhow, it turned out to be an elegant solution to a rather complex problem - and I was happy to see that the results from the real experiment stick out like a sore thumb: the observed overlap is far greater than random chance would predict.

If anyone is interested in seeing the code, it was uploaded into the Vancouver Short Read Analysis Package svn repository: here. (I'm doubting the number of page views that'll get, but what the heck, it's open source anyhow.)

I love it when code works properly - and I love it even more when it works properly the first time.

All in all, I'd say it's been a good day, not even counting the 2 hours I spent at the fencing club. En garde! (-;

Tuesday, August 18, 2009

new repository of second generation software

I finally have a good resource for locating second-gen (next-gen) sequencing analysis software. For a long time, people have just been collecting it in a single thread in the bioinformatics section of the SeqAnswers.com forum; however, the brilliant people at SeqAnswers have now spawned off a wiki for it, with an easy-to-use form. I highly recommend you check it out, and possibly even add your own package.

http://seqanswers.com/wiki/SEQanswers

Monday, August 17, 2009

SNP Database v0.1

Good news: my SNP database seems to be in good form, and is ready for importing SNPs. For people who are interested, you can download the Vancouver Short Read Package from SVN, and find the relevant information in
/trunk/src/transcript_analysis/SNP_Database/

There's a schema for setting up the tables and indexes, as well as applications for running imports from maq SNP calls and for running a SNP caller on any form of alignment supported by FindPeaks (maq, eland, etc.).
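
As a rough illustration of what one of those import applications boils down to - with table and column names invented for the example (check the schema in SVN for the real ones), and assuming tab-delimited maq SNP output with chromosome, position, reference base, variant call and quality in the first five columns - a simple JDBC loop does the job:

    import java.io.*;
    import java.sql.*;

    public class SnpImport {
        public static void main(String[] args) throws Exception {
            // hypothetical table:
            // snp(chromosome, position, ref_base, var_base, quality)
            Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/snp_db", "user", "password");
            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO snp (chromosome, position, ref_base, "
                    + "var_base, quality) VALUES (?, ?, ?, ?, ?)");
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split("\t");
                insert.setString(1, fields[0]);                // chromosome
                insert.setInt(2, Integer.parseInt(fields[1])); // position
                insert.setString(3, fields[2]);                // reference base
                insert.setString(4, fields[3]);                // variant call
                insert.setInt(5, Integer.parseInt(fields[4])); // quality
                insert.executeUpdate();
            }
            in.close();
            db.close();
        }
    }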

At this point, there are no documents on how to use the software - that's the plan for this afternoon - and I'm assuming everyone who uses this already has access to a PostgreSQL database (i.e., a simple Ubuntu + psql setup).

But, I'm ready to start getting feature requests, requests for new SNP formats and schema changes.

For anyone who's interested in joining this project: I'm only a few hours away from having some neat toys to play with!

Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published this one in the FindPeaks 4.0 manual today, I may as well put it here too and kill two birds with one stone.

Just to clarify: under copyright law, you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish them, take credit for them, or profit from them without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!

Monday, May 25, 2009

Can't we use ChIP-chip controls on *-Seq?

Thanks to Nicholas, who left this comment on my web page this morning, in reference to my post on controls in Second-Gen Sequencing:
Hi Anthony,

Don't you think that controls used for microarray (expression
and ChIP-chip) are well established and that we could use
these controls with NGS?

Cheers!

I think this is a valid question, and one that should be addressed. My committee asked me the same thing during my comprehensive exam, so I've had a chance to think about it. Unfortunately, I'm not a statistics expert or a ChIP-chip expert, so I would really value other people's opinions on the matter.

Anyhow, I think the answer has to be put in perspective: Yes, we can learn from ChIP-chip and Arrays for the statistics that are being used, but no, they're not directly applicable.

Both ChIP-chip and array experiments are based on hybridization to a probe - which makes them cheap and reasonably reliable. Unfortunately, it also leads to a much lower dynamic range, since they saturate at the high end and can be undetectable at the low end of the spectrum. This alone should be a key difference. What signal would be detected from a single hybridization event on a microarray?

Additionally, the resolution of a ChIP-chip probe is vastly different from that of a sequencing reaction. In ChIP-Seq or RNA-Seq, we can get unique signals for sequences whose start locations differ by only one base, which should then be interpreted differently. With ChIP-chip, the resolution is closer to 400bp windows, and the statistics take that into account.

Another reason I think the statistics are vastly different is the way we handle the data itself when setting up an experiment. With arrays, you repeat the same experiment several times, and then use that data as several repeats of the same experiment, in order to quantify the variability (deviation and error) between the repeats. With second-generation sequencing, we pool the results from several different lanes, meaning we always have N=1 in our statistical analysis.
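
To put that in concrete terms: with N replicates you can estimate a probe's variability directly,

    \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
    s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2

and s^2 is simply undefined at N = 1. With a single pooled sequencing sample, there are no repeats from which to measure experiment-to-experiment variability, so that noise has to be modelled rather than measured.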

So, yes, I think we can learn from other methods of statistical analysis, but we can't blindly apply the statistics from ChIP-chip and assume they'll correctly interpret our results. The more statistics I learn, the more I realize how many assumptions go into each method - and how much more work it is to get the statistics right for each type of experiment.

At any rate, these are the three most compelling reasons I have, but they certainly aren't the only ones. If anyone would like to add more reasons, or tell me why I'm wrong, please feel free to add a comment!

Thursday, January 8, 2009

The Future of FindPeaks

At the end of my committee meeting, last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been spending some time thinking about how I want to proceed forward - and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first; I duplicated it by re-writing its code for the second version; then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering changes that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spat out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we're coming to the next generation of biology problems - integrating FindPeaks into the wider toolset and generating real knowledge about what's going on in the genome - and I think it's time for FindPeaks to evolve to fill that role, growing to make better use of the information available in the sorted aligner results.

Ever since the end of my exam, I haven't been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit - so, even if I end up focussing on the cancer biology that I've got in front of me, I'm still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn't currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects.... I think I see a few more late nights in the near future. (-;

Sunday, October 5, 2008

Field Programmable Gate Arrays

Yes, I'm procrastinating again. I have two papers, two big chunks of code and a thesis proposal to write, a paper to review (the reading's done, but I have yet to type out my comments...), several major experiments to do and at least one poster looming on the horizon - not to mention squeezing in a couple of manuals for the Vancouver Package Software. And yet, I keep finding other stuff to work on, because it's the weekend.

So, I figured this would be a good time to touch on the topic of Field Programmable Gate Arrays, or FPGAs. I've done very little research on this topic, since it's so far removed from my own core expertise, but it's a hot topic in bioinformatics, so I'd be doing a big disservice by not touching on the subject at all. However, I hope people will correct me if they spot errors.

So what is an FPGA? I'd suggest you read the wikipedia article linked above, but I'd sum it up as a chip that can be added to a computer, which has the ability to optimize the way in which information is processed, so as to accelerate a given algorithm. It's a pretty cool concept - move a particular part of an algorithm into the hardware itself to speed it up. Of course, there are disadvantages as well. Reprogramming is (was? - this may have changed) a few orders of magnitude slower than processing information, so you can't change the programming on the fly while processing data and still hope to get a speed-up. Some chips can change the programming of unused sub-sections while other algorithms are running... but now we're getting really technical.

(For a very good technical discussion, I suggest this book, of which I've read a few useful paragraphs.)

Rather than discuss FPGAs, which are a cool subject on their own, I'd rather discuss their applications in bioinformatics. As far as I know, they're not widely used for most applications at the moment. The most processor-intensive bioinformatics applications, molecular modeling and drug docking, are mainly vector-based calculations, so vector chips (e.g., Graphics Processing Units - GPUs) are more applicable for them. As for the rest, CPUs have traditionally been "good enough". However, the following two things seem to have recently accelerated this potential marriage of technologies:
  1. The makers of FPGAs have been looking for applications for their products for years and have targeted bioinformatics because of its intense computer use. Heavy computer use is always considered a sign that more efficient processing speed is an industry need - and FPGAs appear to meet that need - on the surface.
  2. Bioinformatics was doing well with the available computers, but suddenly found itself behind the processing curve with the advent of Second Generation Sequencing (SGS). Suddenly, the amount of information being processed spiked by an order of magnitude (or more), causing the bioinformaticians to scream for more processing power and resources.
So, it was inevitable that FPGA producers would hear about the demand for more power in the field, and believe it's the ideal market into which they should plunge. To the casual observer, bioinformatics needs more efficiency and power, and FPGA producers are looking for a market where efficiency and power are needed! Is this a match made in heaven or what?

Actually, I contend that FPGAs are the wrong solution for several reasons.

While Second Generation Sequencing produces tons more data, the algorithms being employed haven't yet settled down. Every 4 months we pick a different aligner. Every 3 months we add a new database. Every month we produce a more efficient version of our algorithms for interpreting the data. The overhead of translating an algorithm into the hardware description needed to use the FPGA (which seems large to me, but may not be to people more fluent in HDL) means you'd spend a disproportionate amount of time getting the chips set up to process your data - which you're only going to use for a short period of time before moving on. And the gain in efficiency would probably be wiped out by the amount of effort introduced.

Furthermore, even when we do know that the algorithms being used are going to stay around, a lot of our processing isn't necessarily CPU bound - rather, it's I/O or memory bound. When you're trawling through 16 GB of memory, it's not obvious that adding more speed to the CPU will help. Pre-fetching and pre-caching are probably doing more to help you out than anything else bound to your CPU.

In the age of multi-core CPUs, using multi-threaded programs already reduces many of the pains that plague bioinformaticians. Most of my Java code is thrilled to pull in 2, 3, or more processors to work faster - without a lot of explicit multi-threading. (My record so far is 1496% CPU usage - nearly 15 processors.) I would expect that buying 16-way processors is probably more cost-efficient than buying 16 FPGAs for processing data with many of the current algorithms in use.
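
For anyone curious, the pattern really is that simple: farm independent units of work (chromosomes, say) out to a pool sized to the machine. A minimal sketch, with callPeaks as a placeholder for the real per-chromosome work rather than anything from FindPeaks:

    import java.util.*;
    import java.util.concurrent.*;

    public class ParallelChromosomes {
        public static void main(String[] args) throws Exception {
            List<String> chromosomes = Arrays.asList("chr1", "chr2", "chr3");
            // one worker thread per available core
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            List<Future<Integer>> results = new ArrayList<>();
            for (String chromosome : chromosomes)
                results.add(pool.submit(() -> callPeaks(chromosome)));
            for (Future<Integer> result : results)
                System.out.println(result.get() + " peaks found");
            pool.shutdown();
        }

        // placeholder for the real per-chromosome peak calling work
        static int callPeaks(String chromosome) {
            return 0;
        }
    }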

Buying more conventional resources will probably alleviate the sudden bottleneck in compute power sooner than innovating new solutions around it would. It's likely that many groups getting into second generation genomics technologies failed to understand the processing demands of the data, and thus didn't plan adequately for the resources. This means that much of the demand for data processing is just temporary, and may even be alleviated by more efficient algorithms in the future.

So where does the FPGA fit in?

I'd contend that there are very few products out there that would benefit from FPGAs in Bioinformatics... but there are a few. Clearly, all bioinformaticians know that aligning short reads is one of those areas. Considering that a full Maq run for a flow cell from an Illumina GAII takes 14+ hours on a small cluster, that would be one area in which they'd clearly benefit.

Of course, no bioinformatician wants to have to reprogram an FPGA on the fly to make use of it. Were I to pick a model, it would probably be to team up with an aligner group to produce a stand-alone, multi-FPGA/CPU hybrid box with 32 GB of RAM and a 3-4 year upgrade path. Every 4 months you produce a new aligner algorithm and HDL template; users pick up the aligner and HDL upgrade, and "flash" their computer to use the new software/hardware. This would follow the Google Appliance model: an automated box that does one task, and does it well - with the exception that hardware "upgrades" come along with the software patches. That would certainly turn a few heads.

At any rate, only time will tell. If the algorithms settle down, FPGAs may become more useful. If the FPGAs become easier to program for bioinformaticians, they may find a willing audience. If the FPGAs begin to understand the constraints of the bioinformatics groups, they may find niche applications that will truly benefit from this technology. I look forward to seeing where this goes.

Ok... now that I've gone WAY out on a limb, I think it's time to tackle a few of those tasks on my list.

Friday, September 12, 2008

SISSR

One more day, one more piece of ChIP-Seq software to cover. I've not talked much about FindPeaks, the software descended from Robertson et al., for obvious reasons: the paper was just an application note, and, well, I'm really familiar with how it works, so I'm not going to review it. I have talked about QuEST, however, which was presumably descended from Johnson et al. And those of you who have been following ChIP-Seq papers since the early days will realize that there's still something missing: the peak caller descended from Barski et al., which is the subject of today's blog: SISSR. Those were the first three published ChIP-Seq papers, so it's no surprise that each of them was followed up with a paper (or application note!) on its software.

So, today, I'll take a look at SISSR, to complete the series.

From the start, the Barski paper was discussing both histone modifications and transcription factors, so the context of the peak finder is a little different. Where FindPeaks (and possibly QuEST as well) was originally conceived for identifying single peaks and then expanded to handle multiple peaks, I would imagine that SISSR was conceived with the idea of working on "complex" areas of overlapping peaks. That's only relevant in terms of their analysis, though - I'll come back to it.

The most striking thing you'll notice about this paper is that the datasets look familiar. They are, in fact, the sets from Robertson, Barski and Johnson: STAT1, CTCF and NRSF, respectively. This is the first of the ChIP-Seq application papers to actually perform a comparison between the available peak finders - and, of course, to claim that theirs is the best. Again, I'll come back to that.

The method used by SISSR is almost identical to the method used by FindPeaks, with the directional information built into the base algorithm, whereas FindPeaks provides it as an optional module (the -directional flag, which uses a slightly different method). They provide an excellent visual on the 4th page of the article demonstrating their concept, which will explain the method better than I can, but I'll try anyhow.

In ChIP-Seq, a binding site is expected to have many real tags pointing at it: tags upstream should be on the sense strand, and tags downstream should be on the anti-sense strand. Thus, a real binding site should exist at transition points, where the majority of tags switch from the sense to the anti-sense strand. By identifying these transition points, they can identify the locations of real binding sites. More or less, that describes the algorithm employed, with the following modifications: a window (20bp by default) is used instead of doing it on a base-by-base basis, and parameter estimation is employed to guess the length of the fragments.
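
To illustrate the transition-point idea (my own minimal sketch of the concept, not SISSR's actual implementation): tally sense and anti-sense tag counts per window along the chromosome, and flag the windows where the net strand balance flips.

    import java.util.*;

    class TransitionFinder {
        // senseCounts[w] and antisenseCounts[w] hold the number of tag
        // 5' ends landing in window w on each strand
        static List<Integer> findTransitions(int[] senseCounts,
                                             int[] antisenseCounts) {
            List<Integer> sites = new ArrayList<>();
            for (int w = 1; w < senseCounts.length; w++) {
                int previous = senseCounts[w - 1] - antisenseCounts[w - 1];
                int current = senseCounts[w] - antisenseCounts[w];
                // a putative binding site sits where the balance flips
                // from sense-dominated to anti-sense-dominated
                if (previous > 0 && current < 0)
                    sites.add(w);
            }
            return sites;
        }
    }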

In my review of QuEST, I complained that windows are a bad idea(tm) for ChIP-Seq, only to be corrected that QuEST wasn't using a window. This time, the window is explicitly described - and again, I'm puzzled. FindPeaks uses an identical operation without windows, and it runs blazingly fast. Why throw away resolution when you don't need to?

On the subject of length estimation, I'm again less than impressed. I realize this is probably an early attempt at it - and FindPeaks has gone through its fair share of bad length estimators, so it's not a major criticism, but it is a weakness. To quote a couple of lines from the paper: "For every tag i in the sense strand, the nearest tag k in the anti-sense strand is identified. Let J be the tag in the sense strand immediately upstream of k." Then follows a formula based upon the distances between (i,j) and (j,k). I completely fail to understand how this provides an accurate assessment of the real fragment length. I'm sure I'm missing something. As a function that describes the width of peaks it may be a good method - which is really what the experiment is aiming for anyhow - so it's possible that this is just poorly named.

In fairness, they also provide an option for manual length estimation (or XSET, as it was referred to at the time), which overrides the estimator. I didn't see a comparison in the paper of which one provided the better answers, but having options is always a good thing.

Moving along, my real complaint about this article is the analysis of their results compared to past results, which comes in two parts. (I told you I'd come back to it.)

The first complaint is what they were comparing against. The article was submitted for publication in May 2008, but they compared results to those published in the June 2007 Robertson article for STAT1. By August, our count of peaks had changed; by January 2008, several upgraded versions of FindPeaks were available, and many bugs had been ironed out. It's hardly fair to compare the June 2007 FindPeaks results to the May 2008 version of SISSR and then declare SISSR the clear winner. Still, that's not a huge problem - albeit somewhat misleading.

More vexing is their quality metric. In the motif analysis, they clearly state that, because of the large amount of computing power required, only the top X% of reads were used in their analysis. For the comparison with FindPeaks, the top 5% of peaks were used - and they were able to observe the same motifs. Meanwhile, their claim of finding 74% more peaks than FindPeaks is never discussed in terms of the quality of those additional sites. (FindPeaks was also modified to identify sub-peaks after the original data set was published, so this is really comparing apples to oranges - a fact glossed over in the discussion.)

Anyhow, complaints aside, it's good to see a paper finally compare the various peak finders out there. They provide some excellent graphics, and a nice overview on how their ChIP-Seq application works, while contrasting it to published data available. Again, I enjoyed the motif work, particularly that of figure 5, which correlates four motif variants to tag density - which I feel is a fantastic bit of information, buried deeper in the paper than it should be.

So, in summary, this paper presents a rather unfair competition by using metrics guaranteed to make SISSR stand out, but still provides a good read with background on ChIP-Seq, excellent illustrations and the occasional moment of deep insight.
