Thanks for visiting my blog - I have now moved to a new location at Nature Networks. URL: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, November 6, 2009

ChIP-Seq normalization.

I've spent a lot of time working on ChIP-Seq controls recently, and wanted to raise an interesting point that I haven't seen addressed much: how to normalize well. (I don't claim to have read ALL of the ChIP-Seq literature, and someone may have already beaten me to the punch... but I'm not aware of anything published on this yet.)

The question of normalization occurs as soon as you raise the issue of controls or comparing any two samples. You have to take it into account when doing any type of comparison, really, so it's somewhat important as the backbone to any good second-gen work.

The most common thing I've heard to date is to simply normalize by the number of tags in each data set. As far as I'm concerned, that really will only work when your data sets come from the same library, or two highly correlated samples - when nearly all of your tags come from the same locations.

However, this method fails as soon as you move into doing a null control.

Imagine you have two samples. One is your null control, containing only the "background" sequences: when you sequence it, you get ~6M tags, all of which represent noise. The other is the ChIP-Seq sample, so some background plus an enriched signal: when you sequence it, hopefully 90% of the tags are signal and 10% are background, giving ~8M tags, of which ~0.8M are noise. When you do a compare, the total number of tags doesn't do justice to the relationship between the two samples.

So what's the real answer? Actually, I'm not sure - but I've come up with two different methods of doing controls in FindPeaks: one normalizes by fitting a (symmetrical) linear regression through the points found in both samples; the other identifies the points that appear in both samples and sums up their peak heights. Oddly enough, both work well, but in different scenarios. And clearly, both appear (so far) to work better than just assuming the number of tags is a good normalization ratio.
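
To make the second method concrete, here's a minimal sketch of the shared-peak idea - find the peaks present in both samples, sum their heights, and use the ratio of the sums as the scaling factor. The class and field names are hypothetical, not the actual FindPeaks code:

    import java.util.List;

    // Sketch only: normalize by the summed heights of peaks found in both
    // samples. Names and structures are illustrative, not FindPeaks itself.
    class Peak {
        String chromosome;
        int start, end;
        double height;

        boolean overlaps(Peak other) {
            return chromosome.equals(other.chromosome)
                    && start <= other.end && other.start <= end;
        }
    }

    class SharedPeakNormalizer {
        static double normalizationRatio(List<Peak> sample, List<Peak> control) {
            double sampleSum = 0, controlSum = 0;
            for (Peak s : sample) {
                for (Peak c : control) {
                    if (s.overlaps(c)) {        // the peak appears in both sets
                        sampleSum += s.height;
                        controlSum += c.height;
                        break;
                    }
                }
            }
            // Scale the control by this ratio before comparing the two samples.
            return controlSum == 0 ? 1.0 : sampleSum / controlSum;
        }
    }

The nice property is that tags sitting in sample-only peaks (the enrichment) never influence the ratio, which is exactly what the total-tag-count method gets wrong.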

More interesting yet, the normalization seems to change dramatically between chromosomes (as does the number of mapped reads), which leads you to ask why that might be. Unfortunately, I'm really not sure why. Why should one chromosome be over-represented in an "input DNA" control?

Either way, I don't think any of us are getting to the bottom of the rabbit hole of doing comparisons or good controls yet. On the bright side, however, we've come a LONG way from just assuming peak heights should fall into a nice Poisson distribution!


Wednesday, November 4, 2009

New ChIP-seq control

Ok, so I've finally implemented and debugged a second type of control in FindPeaks... It's different, and it seems to be more sensitive, requiring fewer assumptions to be made about the data set itself.

What it needs, now, is some testing. Is anyone out there willing to try a novel form of control on a dataset that they have? (I won't promise it's flawless, but hey, it's open source, and I'm willing to bug fix anything people find.)

If you do, let me know, and I'll tell you how to activate it. Let the testing begin!


Monday, October 5, 2009

Why peak calling is painful.

In discussing my work, I'm often asked how hard it is to write a peak calling algorithm. The answer usually surprises people: It's trivial. Peak calling itself isn't hard. However, there are plenty of pitfalls that can surprise the unwary. (I've found myself in a few holes along the way, which have been somewhat challenging to get out of.)

The pitfalls, when they do show up, can be very painful - masking how trivial the core task really is.

In reality, the three most frustrating things about peak calling are:
  1. Maintaining the software

  2. Peak calling without unlimited resources (e.g., 64 GB of RAM)

  3. Keeping on the cutting edge

On the whole, each of these things is a separate software design issue worthy of a couple of seconds of discussion.

When it comes to building software, it's really easy to fire up a "one-off" script. Anyone can write something that can be tossed aside when they're done with it - but code re-use and recycling is a skill (and an important one). Writing your peak finder to be modular is a lot of work, and a huge amount of time investment is required to keep the modules in good shape as the code grows. A good example of why this is important can be illustrated with file formats. Since the first version of FindPeaks, we've transitioned through two versions of Eland output, Maq's .map format, and now on to SAM and BAM (not to mention BED, GFF, and several other more or less obscure formats). In each case, we've been able to simply write a new iterator and plug it into the existing modular infrastructure. In fact, SAM support was added quite rapidly by Tim, with only a few hours of investment. That wouldn't have been possible without the massive upfront investment in good modularity.
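
To give a flavour of what that modularity buys you, here's a hypothetical sketch of the kind of iterator contract that makes a new format a few hours of work. The interface and class names are mine, not the real FindPeaks API:

    import java.io.BufferedReader;
    import java.io.IOException;

    // Every format implements the same minimal contract, so the peak caller
    // never knows (or cares) what it's reading. Illustrative names only.
    interface AlignedRead {
        String chromosome();
        int start();
        int end();
    }

    interface AlignedReadSource {
        AlignedRead nextRead() throws IOException;   // null at end of input
    }

    // Supporting a new format means writing one class like this. The parse
    // here is deliberately naive (no CIGAR handling, no flag checks).
    class SamReadSource implements AlignedReadSource {
        private final BufferedReader in;

        SamReadSource(BufferedReader in) { this.in = in; }

        public AlignedRead nextRead() throws IOException {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("@")) continue;       // header lines
                String[] f = line.split("\t");
                if (f[2].equals("*")) continue;           // unaligned read
                final String chrom = f[2];
                final int start = Integer.parseInt(f[3]);
                final int end = start + f[9].length();    // ignores indels
                return new AlignedRead() {
                    public String chromosome() { return chrom; }
                    public int start() { return start; }
                    public int end() { return end; }
                };
            }
            return null;
        }
    }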

The second pitfall is memory consumption - and this is somewhat more technical. When dealing with sequencing reads, you're faced with a choice: you either sort the reads and then move along them one at a time, determining where they land - OR - you pre-load all the reads, then move along the chromosome. The first model takes very little memory but requires a significant amount of pre-processing, which I'll come back to in a moment. The second requires much less CPU time - but is intensely memory-hungry.

If you want to visualize this, the first method organizes all of your reads by position, then walks down the length of the chromosome with a moving window, only caring about the reads that fall into the window at any given point in time. This is how FindPeaks works now. The second builds a model of the chromosome, much like a "pileup" file, which can then be processed however you like. (This is how I do SNP calling.) In theory, it shouldn't matter which one you use, as long as all your reads can be sorted correctly. The first can usually be run with a limited amount of memory, depending on the memory structures you use, whereas the second's footprint is pretty much determined by the size of the chromosomes you're using (multiplied by a constant that also depends on the structures you use).
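
As a toy illustration of the first (moving window) model - this assumes the reads are already sorted by start position, and is nothing like the actual FindPeaks implementation:

    import java.util.List;
    import java.util.PriorityQueue;

    // Walk reads (sorted by start) down the chromosome, holding only those
    // that overlap the current position. Memory use is bounded by local
    // coverage, not by chromosome size. Each read is just {start, end} here.
    class WindowWalker {
        static void walk(List<int[]> sortedReads) {
            PriorityQueue<int[]> open =
                    new PriorityQueue<int[]>((a, b) -> Integer.compare(a[1], b[1]));
            for (int[] read : sortedReads) {
                // Retire reads that end before this read starts.
                while (!open.isEmpty() && open.peek()[1] < read[0]) {
                    open.poll();
                }
                open.add(read);
                // open.size() is now the pile-up height at position read[0].
                System.out.println(read[0] + "\t" + open.size());
            }
        }
    }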

Unfortunately, using the first method isn't always as easy as you might expect. For instance, when doing alignments against transcriptomes (or with indels), you often have gapped reads. An early solution to this in FindPeaks was to break each portion of the read into separate aligned reads and process them individually - which works well when they're correctly sorted. Unfortunately, newer formats no longer allow that: with a "pre-sorted" SAM/BAM file, you can now encounter multi-part reads, but there's no real option of pre-fragmenting those reads and re-sorting. Thus, FindPeaks now has an additional layer that must read ahead and buffer SAM reads in order to make sure that the next one returned is in the correct order. (You can get odd bugs otherwise, and yes, there are many other potential solutions.)
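
A minimal sketch of what that buffering layer has to do: hold a look-ahead window in a priority queue, and only release a read once nothing earlier can still appear. The bound and the names are my assumptions, not the actual FindPeaks code:

    import java.util.PriorityQueue;

    // Fragments of multi-part reads can arrive slightly out of order, so we
    // buffer a bounded look-ahead window and emit reads in sorted order.
    class ReorderBuffer {
        private final PriorityQueue<int[]> heap =
                new PriorityQueue<int[]>((a, b) -> Integer.compare(a[0], b[0]));
        private final int lookahead;   // max displacement we expect to see
        private int latestStart = 0;

        ReorderBuffer(int lookahead) { this.lookahead = lookahead; }

        // Feed reads in file order; returns the next read in sorted order,
        // or null if we still need to buffer more input.
        int[] offer(int[] read) {      // read = {start, end}
            latestStart = Math.max(latestStart, read[0]);
            heap.add(read);
            return (latestStart - heap.peek()[0] > lookahead) ? heap.poll() : null;
        }
    }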

Moving along to the last pitfall: the one thing that people want out of a peak finder is that it's able to do the latest and greatest methods - and do them ahead of everyone else. That on its own is a near-impossible task. To keep a peak finder relevant, you not only need to implement what everyone else is doing, but also do things that they're not. For a group of 30 people, that's probably not too hard, but for academic peak callers, it can be a challenge - particularly since every user wants something subtly different than the next.

So, when people ask how hard it is to write their own peak caller, that's the answer I give: It's trivial - but a lot of hard work. It's rewarding, educational and cool, but it's a lot of work.

Ok, so is everyone ready to write their own peak caller now? (-;


Monday, July 27, 2009

how recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it's been driving me nuts, and I thought I might be able to work around it... but clearly I can't.

With all the work I've put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself - and I think I am. Unfortunately, what I didn't count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple weeks ago, and I've been working with it for various analyses (SNPs and the like - it's a WTSS data set)... and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. In looking at the data sets, the only thing I can figure is that the year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries - it looks partly sigmoidal.

Unfortunately, there's nothing to do except re-sequence the sample. But really, I guess that makes sense. If you're doing a comparison between two data sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: how much time elapsed between the sequencing of your sample and its control?


Friday, July 17, 2009

Community

This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community-related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community," as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7Mb.)


Tuesday, June 23, 2009

FindPeaks 4.0

Well, I've finally gotten to it: the tag for FindPeaks 4.0. At this point, I'm more or less satisfied with what made it into this release: saturation, controls, compares and a whole lot of changes to the underlying machinery. The documentation is still going through some changes (I have another two flags to add in), and there's a lot more clarification to do on what some of the parameters actually accomplish, but it's now in a reasonable state.

Despite the milestone, this project is really a constant evolution. I'm already thinking about what should be in the next version (4.1?): Support for SAM/BAM, "peakless peak calling" for regions instead of peaks, a vastly upgraded FindFeatures code and a host of small changes that I had thought weren't worth the effort for this particular release. I'm even considering a GUI, if I can squeeze it in. (If anyone would like to help out on that project, I'd be thrilled to add them to the project!)

At this point, I'm happy to say I'm not aware of any outstanding coding bugs - although I do take seriously the open bug remarking that the documentation is insufficient. I've been working on improving it and reorganizing the manual, which should be done in the next couple days. Once that's done, I'll jump back into using my code to do some analysis of my own. There are a few really neat things, based on work on my poster, that I'd like to play with. I guess that's what they say about coders: when you write software for yourself, you never lack the motivation to add in one more feature. (=


Monday, June 8, 2009

Poster - reprise

In an earlier post, I said I'd eventually get around to putting up a thumbnail of the poster that I presented at the Canadian Institutes of Health Research National Research Poster Competition. (Yes, the word "research" appears twice in that sentence.) After a couple days of being busy with other stuff, I've finally gotten around to it.

I'm also happy to say that the poster was well received, despite the unconventional appearance. It was awarded an Award of Excellence (Silver category) from the judges.

thumbnail of poster


Friday, May 29, 2009

Science Cartoons - 3

I wasn't going to do more than one comic a day, but since I just published it into the FindPeaks 4.0 manual today, I may as well put it here too, and kill two birds with one stone.

Just to clarify, under copyright law you can certainly re-use my images for teaching purposes or your own private use (that's generally called "fair use" in the US, and copyright laws in most countries have similar exceptions), but you can't publish them, take credit for them, or profit from them without discussing it with me first. However, since people browse through my page all the time, I figure I should mention that I do hold copyright on the pictures, so don't steal them, ok?

Anyhow, Comic #3 is a brief description of how the compare in FindPeaks 4.0 works. Enjoy!


Thursday, May 28, 2009

Science Cartoons - 1

I've been busy putting together a poster for a student conference I'm attending next week in Winnipeg, which has distracted me from just about everything else I had planned to accomplish. Not to worry, though, I'll be back on it in a day or two.

Anyhow, part of the fun this time around is that the poster competition includes criteria for judging that involve how pretty the poster is - so I decided to go all out and learn to draw with inkscape. Instead of the traditional background section on a poster, I went with a comic theme. It's a great way to use text and figures to get across complex topics very quickly. So, for the next couple days, I thought I'd post some of them, one at a time.

Here's the first, called "Second Generation Sequencing". It's my least favorite of the bunch, since the images feel somewhat sloppy, but it's not a bad try for a first pass. Yes, I do own the copyrights - and you do need to ask permission to re-use them.


Thursday, May 21, 2009

Rumours..

Apparently there's a rumour going around that I think FindPeaks isn't a good software program, and that I've said so on my blog. I don't know who started this, or where it came from, but I would like to set the record straight.

I think FindPeaks is one of the best ChIP-Seq packages currently available. Yes, I would like to have better documentation, yes, there are occasionally bugs in it, and yes, the fact that I'm the only full-time developer on the project is a detriment.... But those are only complaints, and they don't diminish the fact that I think FindPeaks can out-peak-call, out-control, out-compare.... and will outlast, outwit and outplay all of its opponents. (=

FindPeaks is actually a VERY well written piece of code, and has the flexibility to do SO much more than just ChIP-Seq. If I had my way, I'd have another developer or two working with the code to expand many of the really cool features that it already has.

If people are considering using it, I'm always happy to help get them started, and I'm thrilled to add more people to the project if they would like to contribute.

So, just to make things clear - I think FindPeaks (and the rest of the Vancouver Short Read Analysis Package) are great pieces of software, and are truly worth using. If you don't believe me, send me an email, and I'll help make you a believer too. (=


Monday, May 18, 2009

Questions about controls.

After my post on controls the other day, I've discovered I'd left a few details out of my blog entry. It was especially easy to come to that conclusion, since Aaron pointed them out in a comment below.... Anyhow, since his questions were right on, I figured they deserved a post of their own, not just a quick reply.

There were two parts to the questions, so I'll break up the reply into two parts. First:

One question - what exactly are you referring to as 'control' data? For chip-seq, sequencing the "input" ie non-enriched chromatin of your IPs makes sense, but does doubling the amount of sequencing you perform (therefore cost & analysis time) give enough of a return? eg in my first analysis we are using a normal and cancer cell line, and looking for differences between the two.

Control data, in the context I was using it, should be experiment specific. Non-enriched chromatin is probably going to be the most common for experiments with transcription factors, or possibly even histone modifications. Since you're most interested in the enrichment, a non-enriched control makes sense. At the GSC, we've had many conversations about what the appropriate controls are for any given experiment, and our answers have ranged from genomic DNA or non-enriched DNA all the way to stimulated/unstimulated sample pairs. The underlying premise is that the control is always the one that makes sense for the given experiment.

On that note, I suspect it's probably worth rehashing the Robertson 2007 paper on stimulated vs. unstimulated IFN-gamma STAT1 transcription markers... hrm..

Anyhow, the second half of that question boils down to: "is it really worth it to spend the money on controls?"

YES! Unequivocally YES!

Without a control, you'll never be able to get a good handle on the actual statistics behind your experiment, and you'll never be able to remove the biases in your samples (which come from the sequencing itself - and yes, they're very much present). In the tests I've been running on the latest FP code (3.3.3.1), I've been seeing what I think are incredibly promising results. If it cost 5 times as much to do controls, I'd still be pushing for them. (Of course, it's actually not my money, so take that for what it's worth.)

In any case, there's a good light at the end of the cost tunnel on this. Although it does appear that the better the control, the better the results will be, we've also been able to "re-use" controls. So, in fact, having one good set of controls per species already seems to make a big contribution. Thus, controls should be amortizable over the life of the sequencing data, and the cost should not be anywhere near double for large sequencing centres. (I don't have hard data to back this up yet, but it is generally the experience I've had so far.)

To get to the second half of the question:

Also, what controls can you use for WTSS? Or do you just mean comparing two groups eg cancer vs normal?

That question is similar to the one above, and so is my answer: it depends on the question you're asking. If you want to use it to compare against another sample (eg, using FP 3.3.3.1's compare function), then you'll want to compare with another sample, which will be your "control". I've been working with this pipeline for a few weeks now and have been incredibly impressed with how well this works over my own (VERY old 32bp) data sets.

On the other hand, if you're actually interested in discovering new exons and new genes using this approach, you'd want to use a whole genome shotgun as your control.

As with all science, there's no single right answer for which protocol you need to use until you've decided what question you want to ask.

And finally, there's one last important part of this equation - have you sequenced either your control or your sample to enough depth? Even if you have controls working well, it's absolutely key to make sure your data can answer your question well. So many questions, so little time!


Friday, May 15, 2009

On the necessity of controls

I guess I've had this rant building up for a while, and it's finally time to write it up.

One of the fundamental pillars of science is the ability to isolate a specific action or event, and determine its effects on a particular closed system. The scientific method actually demands that we do it - hypothesize, isolate, test and report in an unbiased manner.

Unfortunately, for some reason, the field of genomics has kind of dropped that idea entirely. At the GSC, we just didn't bother with controls for ChIP-Seq for a long time. I can't say I've even seen too many matched WTSS (RNA-Seq) experiments for cancer/normal pairs. And that scares me, to some extent.

With all the statistics work I've put in to the latest version of FindPeaks, I'm finally getting a good grasp of the importance of using controls well. With the other software I've seen, they do a scaled comparison to calculate a P-value. That is really only half of the story. It also comes down to normalization, to comparing peaks that are present in both sets... and to determining which peaks are truly valid. Without that, you may as well not be using a control.
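
For reference, that bare scaled comparison looks something like this toy version (mine, not any particular package's): scale the control height by the ratio of library sizes, then treat it as a Poisson mean for the sample height. Note that nothing here checks whether the peak is real in both sets - that's exactly the missing half of the story:

    // Toy version of a scaled comparison. The Poisson tail underflows for
    // large means; this is meant to illustrate the approach, not for real use.
    class ScaledCompare {
        // P(X >= observed) for X ~ Poisson(mean), by direct summation.
        static double poissonUpperTail(int observed, double mean) {
            double term = Math.exp(-mean);   // P(X = 0)
            double cdf = 0.0;
            for (int k = 0; k < observed; k++) {
                cdf += term;
                term *= mean / (k + 1);      // P(X = k+1) from P(X = k)
            }
            return 1.0 - cdf;
        }

        static double peakPValue(int sampleHeight, int controlHeight,
                                 long sampleTags, long controlTags) {
            double scaled = controlHeight * ((double) sampleTags / controlTags);
            // Floor the mean to avoid degenerate zero-control positions.
            return poissonUpperTail(sampleHeight, Math.max(scaled, 1.0));
        }
    }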

Anyhow, that's what prompted me to write this. As I look over the results from the new FindPeaks (3.3.3.1), both for ChIP-Seq and WTSS, I'm amazed at how much clearer my answers are, and how much better they validate compared to the non-control-based runs. Of course, the tests are still not all in - but what a huge difference it makes. Real control handling (not just normalization or whatever everyone else is doing) and Monte Carlo produce results that aren't in the same league. The cutoffs are different, the false peak estimates are different, and the filtering is incredibly more accurate.

So, this week, as I look for insight in old transcription factor runs and old WTSS runs, I keep having to curse the lack of controls that exist for my own data sets. I've been pushing for a decent control for my WTSS lanes - and there is matched normal for one cell line - but it's still two months away from having the reads land on my desk... and I'm getting impatient.

Now that I'm able to find all of the interesting differences with statistical significance between two samples, I want to get on with it and find them, but it's so much more of a challenge without an appropriate control. Besides, who'd believe it when I write it up with all of the results relative to each other?

Anyhow, just to wrap this up, I'm going to make a suggestion: if you're still doing experiments without a control, and you want to get them published, it's going to get a LOT harder in the near future. After all, the scientific method has been pretty well accepted for a few hundred years, and genomics (despite some protests to the contrary) should never have felt exempt from it.


Monday, May 11, 2009

FindPeaks 4.0

After much delay, and more statistics than I'd ever hoped to see in my entire life, I'm just about ready to tag FindPeaks 4.0.

There are some really cool things going on - better handling of control data, better compares, and even saturation analysis made it in. Overall, I'm really impressed with how well things are working on that front.

The one thing that I'm disappointed to have not been able to include was support for SAM/BAM files. No one is really pushing for it yet, here at the GSC, but it's only a matter of time. Unfortunately, the integration is not trivial, and I made the executive decision to work on other things first.

Still, I can't say I'm unhappy at all - there's room for a fresh publication, if we decide to go down that route. If not, I have at least 2 possible publications in the novel applications of the code for the biology. That can't be a bad thing.

Anyhow, I'm off to put a few final updates on the code before tagging the final 3.3.x release.


Friday, May 1, 2009

2 weeks of neglect on my blog = great thesis progress.

I wonder if my blogging output is inversely proportional to my progress on my thesis. I stopped writing two weeks ago for a little break, and ended up making big steps forward. The vast amount of my work went into FindPeaks, which included the following:

  • A complete threaded Saturation analysis for next-gen libraries.

  • A method of comparing next-gen libraries to identify peaks that are statistically significant outliers. (It's also symmetric, unlike linear regression based methods.)

  • A better control method

  • A whole new way of analysing WTSS data, which gives statistically valid expression differences


And, of course, many many other changes. Not everything is bug-free yet, but it's getting there. All that's left on my task list is debugging a couple of things in the compare mode, relating to peaks present in only one of the two libraries, and an upgrade to my FDR cutoff prediction methods. Once those are done, I think I'll be ready to push out FindPeaks 4.0. YAY!

Actually, what was surprising to me was the sheer amount of work that I've done on this since January. I compiled the change list since my last "quarterly report" for a project that used FindPeaks (but doesn't support it, ironically.... why am I doing reports for them again?) and came up with 24 pages of commit messages - over 575 commits. Considering the amount of work I've done on my actual projects, away from FindPeaks, I think I've been pretty darn productive.

Yes, I'm going to take this opportunity to pat myself on the back in public for a whole 2 seconds... ok, done.

So, overall, blogging may not have been distracting me from my work - even at the height of my blogging (around AGBT), I was still getting lots done - but the past two weeks have really been a help. I'll be back to blogging all the good stuff on Monday. And I'm looking forward to doing some writing now, on some of the cool things in FP4.0 that haven't made it into the manual... yet.

Anyone want some fresh ChIP-Seq results? (-;


Thursday, April 16, 2009

Multi-match reads in ChIP-Seq

I had an interesting comment left on my blog today, which is worth taking a few minutes to write a response to:
"Hi Anthony, I just discovered your blog and it looks very interesting to me!
Since this article on Eland is now more than one year old, I was wondering
if the description at point 3 about multi matching locations is still
applicable to the Eland program in the Illumina pipeline 1.3. More in general,
would you trust the multi matching locations extracted from the multi_eland
output files to perform a repeat enrichment analysis over an experiment of
ChIP-seq? If no, why? Thank you in advance for your attention."

The first question asks about multi-matching locations - and whether the point in question (point 3) applies to the Illumina Pipeline 1.3. Since point 3 was just that the older pipeline didn't provide the locations of the multi-match reads, I suppose this no longer really applies: I understand the new version of Eland does provide multi-match alignment information, as do other aligners such as Bowtie. However, I should also mention that since I adopted Maq as my preferred aligner, I haven't used Eland much - so it's hard for me to give an informed opinion on the quality of the matches. I simply don't know if they're any good, and I won't belabour that point. I have used Bowtie specifically because it was able to do multi-matches, but we didn't use it for ChIP-Seq, and the multi-matches had other uses in that experiment.

So, the more interesting question is whether I'd use multi-match reads in a ChIP-Seq analysis. And, off hand, my answer has to be no. But let me explain my reasoning, and the conditions in which I would change that answer.

First, let's assume that we have single end tags, so the multi-match information is not resolvable. That means any time we have a read that maps to more than one location, we either map it to its source - or we map it incorrectly. With two candidate locations, that's a 50% chance of "getting it right," and the greater the number of multi-match locations, the smaller the chance we're actually finding the read's correct origin. So, at best we've got a 50-50 chance that we're not adversely affecting the outcome of the experiment. That's not great.

In contrast, there are things we could do to make them usable. The most widely used method in FindPeaks is the weighted fragment distribution type. Thus, we could expand the principle to weight the fragments according to the number of sites. That would be... bearable. But would it significantly add to the quality of the alignment?
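
Expanding the principle might look something like this sketch - a read aligning to n locations contributes 1/n of a read to each, so unique reads keep full weight and heavily repeated ones contribute almost nothing. This is a thought experiment, not an actual FindPeaks feature:

    import java.util.HashMap;
    import java.util.Map;

    // Weighted coverage for multi-match reads: split each read evenly
    // across all the locations it matched.
    class WeightedCoverage {
        private final Map<Integer, Double> heightByPosition =
                new HashMap<Integer, Double>();

        void addRead(int[] matchPositions) {
            double w = 1.0 / matchPositions.length;   // 1.0 for unique reads
            for (int pos : matchPositions) {
                heightByPosition.merge(pos, w, Double::sum);
            }
        }

        double heightAt(int pos) {
            return heightByPosition.getOrDefault(pos, 0.0);
        }
    }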

I'm still going to say no. Fragments we see in ChIP-Seq experiments tend to fall within 200-300bp of the regions to which the transcription factor (or other element) binds. Thus, even if we were concerned that a particular transcription factor binds primarily to the similar motif regions at two sites, there should be more than enough (unique) sequence around each site (the binding site itself is usually <30-40bp in length) to which you'll still see fragments aligning. That should compensate for the loss of the multi-match fragments.

Even more importantly, as read lengths increase, the amount of non-unique sequence decreases rapidly, making the shrinking number of multi-match reads less important.

The same argument can be extended to paired end tags: just as longer reads reduce the number of multi-match sites, more of the multi-match reads will be resolved by pairing them with a second read, which is unlikely to be within the same repeat region, thus reducing the number of reads that become unresolvable multi-matches. Proportionally, one would then expect the reads left out to become a smaller and smaller segment of the population, so we'd have to worry less and less about their contribution.

So, then, when would I want them?

Well, on the odd chance you're working with very short reads, you can pull off the weighting properly, you have single end tags, and the multi-match reads make up a significant proportion of the total - then it's worth exploring.

You'd need to start asking the tough questions: did the aligner simply find that a small k-mer of the read aligned to multiple locations (and was then unable to resolve the tie by extension, the way some Eland aligners work)? Does the aligner use quality scores to identify mis-alignments? How reliable are the alignments (what's their error rate)? What was your sample, and how divergent is it from the reference? (e.g., cancer samples have a high variation rate, and so encourage many false alignments, making the alignments less reliable.)

Overall, I really don't see too many cases where you're going to gain a lot by digging in the multi-match files. That's not to say that you won't find anything good in there - you probably would, if you knew where to look - but the signal-to-noise ratio is going to be pretty poor, just by definition of the fact that they're multi-match reads. You'll just have to ask if it's worth your time.

For the moment, I don't think my time (even at grad student wages) is worth it. It's just not low hanging fruit, when it comes to ChIP-Seq.


Friday, March 20, 2009

Universal format converter for aligned reads

Last night, I was working on FindPeaks when I realized what an interesting treasure trove of libraries I was sitting on. I have readers and writers for many of the most common aligned read formats, and I have several programs that perform useful functions. So, that raised the distinctly interesting point that all of them could be applied together in one shot... and so I did exactly that.

I now have an interesting set of utilities that can be used to convert from one file format to another: bed, gff, eland, extended eland, MAQ .map (read only), mapview, bowtie.... and several other more obscure formats.

For the moment, the "conversion utility" forces the output to bed file format (since that's the file type with the least information, and I don't have to worry about unexpected file information loss), which can then be viewed with the UCSC browser, or interpreted by FindPeaks to generate wig files. (BED files are really the lowest common denominator of aligned information.) But why stop there?

Why not add a very simple functionality that lets one format be converted to another? Actually, there's no good reason not to, but it does involve some heavy caveats. Conversion from one format type to another is relatively trivial until you hit the quality strings. Since these aren't being scaled or altered, you could end up with some rather bizarre conversions unless they're handled cleanly. Unfortunately, doing this scaling is such a moving target that it's just not possible to keep up with it and do all the other development work I have on my plate. (I think I'll be asking for a co-op student for the summer to help out.)
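
To make the caveat concrete with one example I do trust: Illumina-era files encode Phred scores as ASCII with a +64 offset, while the Sanger convention uses +33. Re-basing the offset is the easy case; the genuinely messy ones (like the old Solexa odds-based scores, which need a log transform as well) are what make this a moving target. A sketch of the easy case only:

    // Shift quality strings from the Illumina (+64) to the Sanger (+33)
    // ASCII offset. Does NOT handle old Solexa odds-based scores.
    class QualityRescale {
        static String illuminaToSanger(String quals) {
            StringBuilder out = new StringBuilder(quals.length());
            for (int i = 0; i < quals.length(); i++) {
                out.append((char) (quals.charAt(i) - 64 + 33));
            }
            return out.toString();
        }
    }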

Anyhow, I'll be including this nifty utility in my new tags. Hopefully people will find the upgraded conversion utility to be helpful to them. (=


Thursday, March 19, 2009

Findpeaks 3.3... continued

Patch, compile, read bug, search code, compile, remember to patch, compile, test, find bug, realized it's the wrong bug, test, compile, test....

Although I really enjoy working on my apps, sometimes a whole day goes by where tons of changes are made, and I don't really feel like I've gotten much done. I suppose it's more the scale of things left to do, rather than the number of tasks. I've managed to solve a few mysteries and make an impact for some people using the software, but I haven't gotten around to testing the big changes I've been working on for a few days: using different compare mechanisms for FindPeaks.

(One might then ask why I'm blogging instead of doing that testing... and that would be a very good question.)

Some quick ChIP-Seq things on my mind:
  • Samtools: there is a very complete Java/Samtools/Bamtools API that I could be integrating, but after staring at it for a while, I've realized that the complete lack of documentation on how to integrate it is really slowing the effort down. I will probably return to it next week.
  • Compare and Control: It seems people are switching to this paradigm on several other projects - I just need to get the new compare mechanism in, and then integrate it in with the control at the same time. That will provide a really nice method for doing both at once, which is really key for moving forward.
  • Eland "extended" format: I ended up reworking all of the Eland Export file functions today. All of the original files I worked with were pre-sorted and pre-formatted. Unfortunately, that's not how they exist in the real world. I now have updated the sort and separate chromosome functions for eland ext. I haven't done much testing on them, unfortunately, but that's coming up too.
  • Documentation: I'm so far behind - writing one small piece of manual a day seems like a good target - I'll try to hold myself to it. I might catch up by the end of the month, at that pace.
Anyhow, lots of really fun things coming up in this version of FindPeaks... I just have to keep plugging away.


Thursday, January 29, 2009

FindPeaks 3.3.0 and AGBT proposal

Well, I finally have a version of FindPeaks 3.3.0 that runs without known bugs. Tracking down that last bug was tricky - it took me 3 days to find and squash. It's hard to find bugs that only happen when they're near a fragment that is duplicated. (-;

Anyhow, now that that's working better, it's time to add in the new functionality. The most pressing parts are the controls (in two parts - one of which is a top secret collaboration, while the other is just too boring to really talk about), and implementing the SAM/BAM tools interface. Whenever the "new MAQ" is ready, I'd like to be prepared for it.

Incidentally, I think controls will be the easier of the two, and I think I'll be able to finish the boring parts off this week. At the rate things are going, it might be another 2 days of debugging after that, but that's what makes software writing fun. (Just imagine a tall thin guy hunched over a computer keyboard cackling insanely while staring deep into the monitor displaying green matrix-style characters drifting downward...)

At any rate, I'm also working towards my poster for AGBT, which reminds me of what else I wanted to suggest. If anyone who reads my blog is going to be at AGBT and is happy to meet up to talk some ChIP-Seq or SNP finding (or anything remotely related), let me know. I'm thinking it would be neat to gather people together who are working on the same topic and talk for a bit. (I'm even willing to miss formal talks for it, as long as they're not directly related to my work.)

So, to that effect, I'll point to this page on SeqAnswers, and suggest if anyone is interested they let me know. (= It would definitely be an efficient way to network.

Oh, and (still) for those of you who've already registered for AGBT, check out the nifty package Illumina is sending out to people. I'm HIGHLY impressed with the creative idea and timeliness. (If you don't know what it is, the suspense is killing you, and you care enough to ask, I'll put the answer in the comments.) (-:


Thursday, January 8, 2009

The Future of FindPeaks

At the end of my committee meeting, last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been spending some time thinking about how I want to proceed forward - and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first; I duplicated it by re-writing its code for the second version; then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering things that could be done to make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners have changed. When I first got involved, we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spat out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.

At this point, I think we're coming to the next generation of biology problems - integrating FindPeaks into the wider toolset - and generating real knowledge about what's going on in the genome, and I think it's time for FindPeaks to evolve to fill that role, growing out to better use the information available in the sorted aligner results.

Ever since the end of my exam, I haven't been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit - so, even if I end up focussing on the cancer biology that I've got in front of me, I'm still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn't currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects.... I think I see a few more late nights in the near future. (-;


Monday, November 3, 2008

Updates

It has been a VERY busy week since I last wrote. Mainly, that was due to my committee meeting on Friday, where I had to present my thesis proposal. I admit, there were a few things left hanging going into the presentation, but none of them will be hard to correct. As far as topics go for my comprehensive exams, it sounds like the majority of the work I need to do is to shore up my understanding of cancer. With a field that big, though, I have a lot of work to do.

Still, it was encouraging. There's a very good chance I could be wrapping up my PhD in 18-24 months. (=

Things have also been busy at home - we're still working on selling a condo in Vancouver, and had two showings and two open houses over the weekend. Considering the open houses were well attended, that is an encouraging sign.

FindPeaks has also had a busy weekend, even though I wasn't doing any coding myself. A system upgrade took FindPeaks 3.1.9.2 off the web for a while and required a manual intervention to bring it back up. (Yay for the Systems Dept!) A bug was also found in all versions of 3.1 and 3.2, which could be fairly significant - and I'm still investigating. At this point, I've confirmed the bug, but haven't had a chance to identify if it's just for this one file, or for all files...

Several other FindPeaks projects are also going to be coming to the forefront this week: controls and automated file builds. Despite the bug I mentioned, FindPeaks would do well with an automated trunk build. More users would help me get more feedback, which would help me figure out what things people are using, so I can focus more on them. At least that's the idea. It might also recruit more developers, which would be a good thing.

And, of course, several new things have appeared that I would like to get going on. Bowtie is the first one: if it does multiple alignments (as it claims to), I'll be giving it a whirl as the new basis of some of my work on transcriptomes. At a rough glance, the predicted 35x speedup compared to Maq is a nifty enticement for me. Then there's the opportunity to do some cleanup on the whole Vancouver package for command line parameter processing. A little work there could unify and clean up several thousand lines of code, and make new development much easier.

First things first, though: I need to figure out the source and the effects of that bug in FindPeaks!


Wednesday, September 24, 2008

The best laid schemes...

"The best laid schemes o’ mice an’ men / Gang aft a-gley.” - Robert Burns

Well, ok, I don't speak 18th-century Scots, but I fully understand the sentiment. I have several major things in the works, and yet I'm suffering from a bad cold and sore throat. So, the FindPeaks manual is going to be delayed a few days. I guess I'm not the only one working on it, but I definitely have plans for it.

Anyhow, between the FindPeaks manual, some new FindPeaks features, a poster abstract and two papers I'm working on, I have enough on my plate - but just don't have the energy to get it all done. Damn colds! They always get you just as the stress level starts to rise.

Anyhow, thanks to tcezard (Tim), who's taking on an ever-increasing role in the FindPeaks work, FindPeaks 3.2.1 was tagged today. Development is so much faster when people work together!

I thought I'd also go one further today, and add a plug for a very cool technology that hasn't (to my knowledge) gotten much coverage - ODF@WWW. As a scientist who uses wikis and blogs, I think this has to be one of the best applications I've seen for extending the wiki concept. It has the ability to change the way scientists communicate data, and to allow wikis to be incorporated into environments where they previously were "too complex to be useful." Of course, more people have to adopt ODF before this will really take off, but with applications like this, it's only a matter of time. (Yay open formats for making this possible!)


Monday, September 22, 2008

Two milestones reached

Two of my three projects have met significant milestones this weekend, which I thought were worth sharing.

The first one is that FindPeaks had its first official patch submitted to me from developers outside the Genome Science Centre, on Sunday night. Thank you very much to the submitters! I'm very glad to see someone using (and playing with!) the code. While the patch was minor, it tells me that there is definitely value in having released it to the wild.

The second milestone is for my blog, which, while not officially a "project" in the academic sense of the word, has consumed enough of my time that it starts to feel like one. The milestone? There are now as many people coming here directly as there are finding it through Google. It's not a huge number of hits, but it's nice to know there's an audience who's enjoying my random rants. (-:


Thursday, September 18, 2008

FindPeaks 3.2.0

Obviously, I didn't write an article review this morning. Instead, I finished work on an "unstable" FindPeaks 3.2.0 release. It's now been tagged, for anyone who'd like to download the source code and build it for themselves. The documentation has some catching up to do (which I plan to work on toward the end of the week), but the majority of the features are in place and appear to be functional.

I was debating posting the changelog to SEQanswers, but I think I'll hold off until I have a more stable version. In the meantime, I'll just post a summary from the tag commit here. Welcome to the FindPeaks 3.2 series!

(Update: I should also mention two important details. This release includes code contributions and fixes from Timothee Cezard, another developer at the GSC, as well as a code donation (Lander-Waterman FDR algorithm) from Mikhail Bilenky, also at the GSC.)



Changes since FindPeaks 3.1.9.2
  • New FindPeaks Core module
  • Significant reorganization/modularity/OOP of code
  • Significant cleanup of code
  • Ability to read MAQ .map files
  • Ability to read BED files (both reduced and extended)
  • Log_Buffer to provide buffered and logged output to file
  • Several bug fixes to reported issues
  • Better error handling
  • Better error messages and verbose output
  • New coverage-based FDR (thresholding) methods incorporated
  • Native mode (mode 3) for MAQ SET/PET or BED files, or generating wigs
  • Conforms to most Enerjy best practice recommendations (over 2000 code changes)
  • More controlled exits to handle system crashes gracefully
  • Better handling of constants
  • Smoothing can be turned on or off for adaptive fragment size estimation
  • Utilities for working with PETs
  • FindPeaks can now be used as a simple wig generator
  • Hundreds more fixes and updates!


Tuesday, September 16, 2008

Processing Paired End Reads

Ok, I'm taking a break from journals. I didn't like the overly negative tone I was taking in those reviews, so I'm rethinking how I write about articles. Admittedly, I think criticism is in order for some papers, but I can definitely focus on papers that are worth reviewing. Unfortunately, I'm rarely a fan of papers discussing bioinformatics applications, as I always feel like there's a hidden agenda behind them. Whether it's simply proving their application is the best, or just getting it out first, computer application papers are rarely impartial.

Anyhow, I have three issues to cover today:
  • FindPeaks is now on sourceforge
  • Put the "number of reads under a peak" metric to rest - permanently, I hope.
  • Bed files for different data sources.

The first one is pretty obvious. FindPeaks is now available under the GPL on SourceForge, and I hope people will participate in using and improving the software. We're aiming for our first tagged release on Friday, with frequent tags thereafter. Since I'm no longer the only developer on this project, it should continue moving forward quickly, even while I'm busy studying for my comps.

The second point is this silly notion that keeps coming up: "How many reads were found under each peak?" I'm quite sick of that question, because it really makes no sense. This was a metric produced for Robertson et al.'s STAT1 data set, and I think other people have included it or copied it since. Unfortunately, it's rubbish.

The reason it worked for STAT1 was that they used a fixed fragment length (or XSET) value on their data set. This allowed them to fix the exact length of each read, which allowed them to figure out how many reads were "contiguously linked in each peak." Readers who are paying attention will also realize what the second problem is... they didn't use subpeaks either. Once you start identifying subpeaks, you can no longer decide to which peak a read spanning two of them belongs. Beyond that, what do you do with reads in a long tail? Are they part of the peak or not?

Anyhow, the best measure for a peak, at the moment at least, is the height of the peak. This can also include weighted reads, so that reads which are unlikely to contribute to a peak actually contribute less, bringing in a scaled value. After all, unless you have paired end tags, you really don't know how long the original DNA fragment was, which means you don't know where it ended.
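
One simple weighting scheme along those lines (my own illustration, not the exact FindPeaks distribution): give each tag full weight over the bases actually sequenced, and a linearly decaying weight over the inferred extension, since we don't know where the fragment really ended:

    // Weight a tag's contribution at a given position: 1.0 over the
    // sequenced bases, decaying linearly to 0 over the assumed maximum
    // extension. Forward-strand reads only, for brevity; illustrative only.
    class WeightedHeight {
        static double weightAt(int pos, int readStart, int readEnd, int maxExtension) {
            if (pos >= readStart && pos < readEnd) return 1.0;  // sequenced
            int past = pos - readEnd;            // distance into the extension
            if (past < 0 || past >= maxExtension) return 0.0;
            return 1.0 - (double) past / maxExtension;
        }
    }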

That also makes a nice segue to my last point. There are several ways of processing paired end tags. When it comes to peak calling, it's pretty simple: you use the default method - you know where the two ends are, and together they span the full fragment. For other applications, however, there are complexities.

If the data source is a transcriptome, your read didn't cover the intron, so you need to process the transcript to include breaks, when mapping it back to the genome. That's really a pain, but it is clearly the best way to visualize transcriptome PETs.

If the data source is unclear, or you don't know where the introns are (which is a distinct possibility), then you have to be more conservative, and not assume the extension of each tag. Thus, you end up with a "tag and bridge" format. I've included an illustration to make it clear.




So why bring it up? Because I've had several requests for the tag-and-bridge format, and my code only works on the default mode. Time to make a few adjustments.
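
In (hypothetical) code, for a pair with leftmost tag [s1, e1] and rightmost tag [s2, e2], the two interpretations look like this - the representation is mine, and the real formats carry far more information:

    // Two interpretations of a paired-end tag. Default mode assumes the
    // fragment covers everything between the outer ends; tag-and-bridge
    // keeps the sequenced tags solid and marks the unsequenced middle as a
    // separate "bridge" interval instead of assuming coverage across it.
    class PetIntervals {
        static int[][] defaultMode(int s1, int e1, int s2, int e2) {
            return new int[][] { { s1, e2 } };   // one solid span
        }

        static int[][] tagAndBridge(int s1, int e1, int s2, int e2) {
            return new int[][] {
                { s1, e1 },   // left tag (solid)
                { e1, s2 },   // bridge (unsequenced)
                { s2, e2 }    // right tag (solid)
            };
        }
    }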


Thursday, July 3, 2008

FindPeaks News

Some quick news:

First of all, my FindPeaks Application note has finally been accepted into Bioinformatics, in their Advanced Access Section. It's an open access article, so anyone can read it.

Here's the link.

Second, several big decisions have been made, lately. One of them, concerning FindPeaks, is that the software will be going open source (GPL) pretty soon. I think there's still room for at least one more article out of it, so it's not up there yet... but I have a spot on SourceForge for it.

Third, FindPeaks 4.0 is finally building again. After 27,000+ line changes, it's now running and being tested, and is working about as well as I expected. (Minor programming errors were introduced, but things are going pretty smoothly.) This includes several new file formats, a touch of threading, and a lot more "clean" Java code.

And now, back to work.


Thursday, June 26, 2008

Chip-Seq revisited

In the world of ChIP-Seq, things don't seem to slow down. A collaborator of mine pointed out a new application called MACS, which is yet another peak finder, written in Python as an open source project. That makes 2 open source peak finders that I'm aware of: Useq and now MACS.

The interesting thing, to me, is the maturity of the code (in terms of features implemented). In neither case is it all that great: both are mostly lacking features I consider to be relatively basic, and both are relatively naive in terms of the algorithms used for peak detection. Though, I suppose I've been working with FindPeaks long enough that nearly everything else will seem relatively basic in comparison.

However, I'll come back to a few more FP related things in a moment. I wanted to jump to another ChIP-Seq related item that I'd noticed this week. The Wold lab merged their Peak Finder software into a larger development package for Genomic and Transcriptome work, which I think they're calling ERANGE. I've long argued that the Peak Finding tools are really just a subset of the whole Illumina tool-set required, and it's nice to see other people doing this.

This is the development model I've been using, though I don't know if the Wold lab does exactly the same thing. The high-level organization uses a core library set and core object set, and then FindPeaks and other projects just sit on top, using those shared layers. It's a reasonably efficient model. And, in a blog post a while ago, I mentioned that I'd made a huge number of changes to my code after coming across the tool called "Enerjy". I sat down to figure out how many lines were changed in the last two weeks: 26,000+ lines of code, comments and javadoc. That's a startling figure, since my entire code base (grep -r " " * | wc -l) is only 22,884 lines, of which 15,022 contain semicolons.

Anyhow, I have several plans for the next couple of days:
  1. Try to get my SVN repository somewhere other people can work on it as well, not just GSC developers.
  2. Improve the threading I've got going
  3. Clean up the documentation, where possible
  4. and work on the Adaptive mode code.

Hopefully, that'll clean things up a bit.

Back to FindPeaks itself: the latest news is that my application note in Bioinformatics has been accepted. Actually, it was accepted about a week ago, but I'm still waiting to see it in the advanced access section - hopefully it won't be much longer. I also have a textbook chapter on ChIP-Seq coming out relatively soon (I'm absolutely honoured to have been given that opportunity!), assuming I can get my changes done by Monday.

I don't think that'll be a problem.


Thursday, June 19, 2008

Downloads and aligners

At the Genome Science Centre, we have a platform Scientific Advisory Board meeting coming up in the near future, so a lot of people are taking the time to put together some information on what's been going on since the last meeting. As part of that, I was asked to provide some information on the FindPeaks application, since it's been relatively successful. One of those things was how many times it's been downloaded.

Surprisingly, it's not an insignificant number: FindPeaks 2.x has been downloaded 280 times, while FindPeaks 3.1.x has been downloaded 325 times. I was amazed. Admittedly, I filtered out Google and anything calling itself "spider", and didn't filter on unique IPs, but it's still more than I expected.

The other surprise to me was the institutions from which it was downloaded, which are scattered across the world. Very cool to see it's not just Vancouver people who took an interest in FindPeaks, though I somewhat suspected as much, already.

Thinking of FindPeaks, I put in one last marathon cleaning session for all my software. I'm sure I'm somewhere north of 15,000 lines of code modified in the past week. Even when you consider that some of those changes were massive search-and-replace jobs, (very few, really), or refactoring and renaming variables (many many more), it's still an impressive number for two weeks. With that done, I'm looking to see some significant gains in development speed, and in developer productivity. (I learned a LOT doing this.)

The last 122 warnings will just have to get cleaned up when I get around to it - although they're really only 4 warning types repeated many times. (The majority of them are just that the branched logic goes deeper than 5 levels, or that my objects have public variables - you can only write so many accessor functions in a week.)

Anyhow, I'm looking forward to testing out the FindPeaks 4.0 alphas starting tomorrow, and to putting some documents together on it. (And catching up on the flood of emails I received in the last 2 days.)

Finally, I'm working on a MAQ file interpreter for another project I'm working on. If I ever manage to figure out how to interpret the (nearly undocumented) files, I'll post that here. If anyone's done this already (though I didn't see anything publicly available on the web), I'd love to hear from them.

Cheers!


Monday, June 16, 2008

Random Update on FP/Coding/etc.

I had promised to update my blog more often, but then failed miserably to follow through last week. I guess I have to chalk it up to unforeseen circumstances. On the bright side, it gave me the opportunity to come up with several things to discuss here.

1. Enerjy: I learned about this tool on Slashdot, last week while doing some of my usual lunch hour "open source news" perusal. I can unequivocally say that installing the Enerjy tool in Eclipse has improved my coding by unimaginable leaps and bounds. I tested it out on my Java codebase that has my FindPeaks application and the Transcriptome/Genome analysis tools, and was appalled by the number of suggestions it gave. Admittedly, I'm self taught in Java, but I thought I had grasped the "Zen" of Java by now, though the 2000+ warnings it gave disagreed. I've since been cleaning up the code like there's no tomorrow, and have brought it down to 533 warnings. The best part is that it pointed out several places where bugs were likely to have occurred, which have now all been cleaned up.

2. Threading has also come up this past week. Although I didn't "need" it, there was no way to get around it - learning threads was the appropriate solution to one problem that came up, so my development version is now beginning to include some thread management, which is likely to spread into the core algorithms. Who knew??

3. Random politics: If you're a grad student in a mixed academic/commercial environment, I have a word of warning for you: Not everyone there is looking out for your best interests. In fact, some people are probably looking out for their own interests, and they're definitely not the same as yours.

4. I read Michael Smith's biography this week. I was given a free copy by the Michael Smith Foundation for Health Research, who were kind enough to provide my funding for the next few years. It's fantastic to understand a lot of the history behind the British Columbia Biotechnology scene. I wish I'd read that before having worked at Zymeworks. That would have provided me with a lot more insight into the organizations and people I met along the way. Hindsight is 20/20.

5. FindPeaks 4.0: Yes, I'm skipping plans for a FindPeaks 3.3. I've changed well over 12,000 lines of code, according to the automated scripts that report such things, including a major refactoring and the start I made at threading. If that doesn't warrant a major version number change, I don't know what does.

Well, on that note, back to coding... I'm going to be competing with people here, in the near future, so I had best be productive!


Friday, March 28, 2008

Chip-Seq on a laptop.

Every once in a while, I get an interesting email or comment from someone, which is worthy of a blog post. I woke up this morning to a comment left on my last post:

What kind of computing infrastructure do you need to be able to handle chipSEQ datasets? I'm guessing my standard IBM T60 laptop is not going to do the trick - but what does?


So yes, I do check my email when I get up... but that's not the point. Moving along...

The simple answer is: your laptop probably can't run a ChIP-Seq data set with the currently available programs... but it's probably not far off. My laptop has 2Gb of RAM (after tossing two new 1Gb sticks in last week), and a dual-core AMD running at 2GHz. Ironically, that's not far from what we do use to run FindPeaks: a quad-core 2.2GHz Opteron with 8Gb of RAM.

Of course, FindPeaks doesn't take up the full resources of the machine. (Though I've seen it take up to 12 CPUs on one of our fancier servers.) Generally, I only allocate 4Gb of RAM to the process, which should be more than enough for several million or tens of millions of reads. The reason it takes so much RAM? Because I don't drop any of the information contained in the Eland file, and because I need to retain all of the reads in memory to do a sort.

What does it need? Generally, FindPeaks only needs the start, end and direction of each read to do the basics; however, I don't throw away any of the other information collected, in case we want to use it later. If I did that, or if I pre-sorted the reads, I could probably drop the memory requirements by an order of magnitude or more. (No one's asked me to do this yet, however.) Is there anyone out there who needs a laptop-runnable version of FindPeaks?
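
As a back-of-envelope sketch (my numbers, not measured): if all the basics need is start, end and direction, each read is about nine bytes of raw fields plus JVM object overhead, so even ten million reads lands in the low hundreds of megabytes - laptop territory:

    // The minimalist read record: roughly 9 bytes of fields per read (plus
    // JVM object overhead). Ten million of these is on the order of a few
    // hundred megabytes - a far cry from holding full Eland records.
    class MinimalRead {
        int start;         // 4 bytes
        int end;           // 4 bytes
        boolean forward;   // 1 byte
    }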

This is somewhat timely, though. For the last few days I've been playing with a companion piece of software for FindPeaks, which generates UCSC-compatible BED tracks from Eland data and where I've adopted a minimalist approach. It takes about the same amount of time, and runs in under 300 MB of RAM. That, clearly, should run on anyone's laptop.
