Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Monday, April 28, 2008

I've spent the last week madly putting together a poster for the "Reasons for Hope 2008" conference this past weekend, which focuses on breast cancer science, treatment and quality of life research. So, you'll notice (shortly), a new poster in my poster section. It was an educational experience, and I must admit I learned a lot. Not so much in the areas that I need to learn for my own research, but about physiology, psychology and general health research. And that's even considering how few talks I went to!

Still, I highly recommend dropping into talks that aren't in your field, on occasion. I try to make a habit of it, which included a pathology lecture just before xmas, last year, and this time, I learned a lot about mammography, and new techniques for mammography that are up and coming. Neither is really a practical skill for a bioinformatician, but both give me a good idea of where the samples I'll be dealing with come from. Nifty.

Anyhow, I had a few minutes to revisit my ChIP-Seq code, FindPeaks, and do a few things I'd been hoping to do for a while. I got around to reducing the memory requirement - going from about 4 GB of RAM for a 12M+ read run down to under 1 GB. (I'd discussed this before in another posting.) The other thing I did was to rewrite the core peak-finding algorithm. It was something I'd known was sub-optimal for a while, but re-implementing a core routine isn't something you do without a lot of thought. The good news: it runs about 2x as fast, scales better on multiple cores and is guaranteed not to produce the type of bugs that have been relatively common in early versions of FindPeaks.

Having invested the 2 hours to do it, I'm very glad to see it provide some return. Since my next project is to clean up the Transcripter code (for whole transcriptome shotgun sequencing), this was a nice lesson in coding: if you find a problem, don't patch the problem: solve it. I think I have a lot of "solving" to do. (-;

For those of you who are interested, the next version of FindPeaks will be released once I can include support for the SRF files - hopefully the end of the week.


Monday, April 21, 2008

Eland file Format

I haven't written much over the past couple of days. I have a few things piled up that all need doing urgently... and it never fails, that's when you get sick. I spent today in bed, fighting off a cold, sore throat and fever. Wonderful combination.

Anyhow, since I'm starting to feel better, I thought I'd write a few lines before going to bed, and wanted to mention that I've finally seen a file produced by the new Eland. It's a little different, and the documentation provided (ahem, that I was able to obtain from a colleague who uses the pipeline) was pretty scarce.

In fact, much of what you see in the file is pretty obvious, with the same general concept as the previous Eland files, except this has a few caveats:

1) The library name and 4-coordinate position of the sequence are all separated by tabs in one of the files I saw, but concatenated with a separating ":" in another. I'm not sure which is the real format, but there are at least 2 formats for line identification.

2) There's a string that seems to encode the base quality scores from the prb files, but it's in a format for which I can't find any information.

3) There's a new format for mismatches within the alignment. Instead of telling you the location of the mismatch, you now get a summary of the alignment itself. If Eland could do insertions, it would work well for those too. According to the document, the string gives the number of matching bases, with letters interspersed to mark the mismatches. (e.g. if you had a 32 base alignment, with a mismatch A at character 10, you'd get the string "9A22".) I also understand that upper and lower case mismatches mean something different, though I haven't probed the format too much.
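
For the curious, here is a minimal Python sketch of how a string like "9A22" could be decoded. It assumes only what is described above (runs of digits count matching bases, and each single letter marks one mismatched base); the function name is my own invention, and it deliberately ignores whatever the upper/lower case distinction means.

```python
import re

def parse_match_descriptor(descriptor):
    """Decode a match string such as "9A22" (hypothetical helper).

    Returns (aligned_length, mismatches), where mismatches is a list of
    (1-based position, letter) tuples. Case is preserved but not interpreted.
    """
    position = 0
    mismatches = []
    # Each token is either a run of digits (matching bases) or one letter (a mismatch).
    for token in re.findall(r"\d+|[A-Za-z]", descriptor):
        if token.isdigit():
            position += int(token)
        else:
            position += 1
            mismatches.append((position, token))
    return position, mismatches

print(parse_match_descriptor("9A22"))
```

For "9A22" this gives an aligned length of 32 with a single mismatch A at position 10, which matches the example from the documentation.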

So, in the discussion of formats, I understand there's some community effort around using a so-called "short read format" or SRF format. It's been adopted by Helicos, GEO, as well as several other groups.

Maybe it's time I start converting Eland formats to this as well. Wouldn't it be nice if we only had to work with one format? (If only Microsoft understood that too! - ps, don't let the community name fool you, it's well known Microsoft sponsored that site.)


Sunday, April 13, 2008

Genomics Forum 2008

You can probably guess what this post is about from the title - which means I still haven't gotten around to writing an entry on thresholding for ChIP-Seq. Actually, it's probably a good thing I haven't, as we've been learning a lot about thresholding in the past week. It seems many things we took for granted aren't really the case. Anyhow, I'm not going to say too much about that, as I plan to collect my thoughts and discuss it in a later entry.

Instead, I'd like to discuss the 2008 Genomics Forum, sponsored by Genome BC, which took place on Friday - though, in particular, I'm going to focus on one talk, near to my own research. Dr. Barbara Wold from Caltech gave the first of the science talks, and focussed heavily on ChIP-Seq and Whole Transcriptome Shotgun Sequencing (WTSS). But before I get to that, I wanted to mention a few other things.

The first is that Genome BC took a few minutes to announce a really neat funding competition, which really impressed me, the Genome BC Science Opportunities Fund. (There's nothing up on the web page yet, but if you google for it, you'll come across the agenda for Friday's forum in which it's mentioned - I'm sure more will appear soon.) Its whole premise revolves around the question: "Are there experiments that we need to be doing, that are of strategic importance to the BC life science community?" I take that to mean, are there projects that we can't afford not to undertake, that we wouldn't have the funding to do otherwise? I find that to be very flexible, and very non-academic in nature - but quite neat. I hope the funding competition goes well, and I'm looking forward to seeing what they think falls into the "must do" category.

The second was the surprising demand for Bioinformaticians. I'm aware of several jobs for bioinformaticians with experience in next-gen sequencing, but the surprise to me was the number of times (5) I heard people mention that they were actively recruiting. If anyone with next-gen experience is out there looking for a job (post-doc, full time or grad student), drop me a note, and I can probably point you in the right direction.

The third was one of the afternoon talks, on journalism in science, from the perspective of traditional newspaper/TV journalists. It seems so foreign to me, yet the talk touched on several interesting points, including the fact that journalists are struggling to come to terms with "new media." (... which doesn't seem particularly new to those of us who have been using the net since the 90's, but I digress.) That gave me several ideas about things I can do with my blog, to bring it out of the simple text format I use now. I guess even those of us who live/breathe/sleep internet don't do a great job of harnessing its power for communicating effectively. Food for thought.

Ok... so on to the main topic of tonight's blog: Dr. Wold's talk.

Dr. Wold spoke at length on two topics, ChIP-Seq and Whole Transcriptome Shotgun Sequencing. Since these are the two subjects I'm actively working on, I was obviously very interested in hearing what she had to say, though I'll comment more on the ChIP-Seq side of things.

One of the great open questions at the Genome Sciences Centre has been how to do an effective control for a ChIP-Seq experiment. It's not something we've done much of, in the past, but the Wold lab demonstrated why controls are necessary, and how to do them well. It seems that ChIP-Seq experiments tend to yield fragments in several genomic regions that have nothing to do with the antibody or experiment itself. The educated guess is that these are caused by hypersensitive sites in the genome that tend to fragment in repeatable patterns, giving rise to peaks that appear in all samples. Indeed, I spent a good portion of this past week talking about observations of peaks exactly like that, and how to "filter" them out of the ChIP-Seq results. I wasn't able to get a good idea of how the Wold lab does this, other than by eye (which isn't very high throughput), but knowing what needs to be done now, it shouldn't be particularly difficult to incorporate into our next release of the FindPeaks code.
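
To make the "filter" idea concrete, here is a rough Python sketch of one way it could be done: drop any sample peak that overlaps a peak found in a control lane. To be clear, this is not the Wold lab's method (I don't know what that is), and it's not what FindPeaks does; the function names, the (chromosome, start, end) tuples with inclusive coordinates, and the toy data are all assumptions for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base.

    Coordinates are assumed to be inclusive.
    """
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def filter_against_control(sample_peaks, control_peaks):
    """Keep only the sample peaks that do not overlap any control peak."""
    return [
        peak for peak in sample_peaks
        if not any(overlaps(peak, ctrl) for ctrl in control_peaks)
    ]

# Made-up coordinates, purely for illustration.
sample = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
control = [("chr1", 150, 250)]
print(filter_against_control(sample, control))
```

The all-against-all comparison is fine for a toy example, but with tens of thousands of peaks you would want to sort both lists or use an interval tree instead.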

Another smart thing that the Wold lab has done is to separate the interactions of ChIP-Seq into two different types: Type 1 and Type 2, where Type 1 refers to single molecule-DNA binding events, which give rise to sharp peaks, and very clean profiles. These tend to be transcription factors like NRSF, or STAT1, upon which the first generation of ChIP-Seq papers were published. Type 2 interactomes tend to be less clear, as they are transcription factors that recruit other elements, or form complexes that bind to the DNA at specific sites, and require other proteins to bind to encourage transcription. My own interpretation is that the number of identifiable binding sites should indicate the type, and thus, if there were three identifiable transcription factor consensus sites lined up, it should be considered a Type 3 interactome, though that may be simplifying the case tremendously, as there are, undoubtedly, many other proteins that must be recruited before any transcription will take place.

In terms of applications, the members of the Wold lab have been using their identified peaks to locate novel binding site motifs. I think this is the first thing everyone thinks of when they hear of ChIP-Seq for the first time, but it's pretty cool to see it in action. (We do it at the GSC too, I might add.) The neatest thing, however, was that they were able to identify a rather strange binding site, with two halves of a motif, split by a variable distance. I haven't quite figured out how that works, in terms of DNA/Protein structure, but it's conceptually quite neat. They were able to show that the distance between the two halves of the structure varies by 10-20 bases, making it a challenge to identify, for most traditional motif scanners. Nifty.
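
As a toy illustration of why a split motif is awkward for a fixed-width scanner, here is a small Python sketch that looks for two half-sites separated by a variable gap using a regular expression. The half-site sequences are invented placeholders (not the motif from the talk), and I'm reading "10-20 bases" as the allowed gap length, which is an assumption on my part.

```python
import re

def find_split_motif(sequence, left_half, right_half, min_gap=10, max_gap=20):
    """Find left_half and right_half separated by min_gap..max_gap bases.

    Returns a list of (0-based start of left half, gap length) tuples.
    For each start position, the shortest acceptable gap is reported.
    """
    pattern = re.compile(
        "(?=(%s)([ACGTN]{%d,%d}?)(%s))"
        % (re.escape(left_half), min_gap, max_gap, re.escape(right_half))
    )
    # The lookahead lets matches that overlap each other all be reported.
    return [(m.start(1), len(m.group(2))) for m in pattern.finditer(sequence)]

# Hypothetical half-sites and sequence, for illustration only.
seq = "GG" + "ACGTA" + "C" * 12 + "TTGCA" + "GG"
print(find_split_motif(seq, "ACGTA", "TTGCA"))
```

A real motif scanner would use position weight matrices rather than exact strings, so treat this as nothing more than a sketch of the variable-gap idea.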

Another neat thing, which I think everyone knows but which was cool to hear has actually been shown, is that the binding sites often line up on areas of high conservation across species. I use that as a test for my own work, but it was good to have it confirmed.

Finally, one of the things Dr. Wold mentioned was that they were interested in using the information in the directionality of reads in their analysis. Oddly enough, this was one of the first problems I worked on in ChIP-Seq, months ago, and discovered several ways to handle it. I enjoyed knowing that there's at least one thing my own ChIP-Seq code does that is unique, and possibly better than the competition. (-;

As for transcriptome work, there were only a couple things that are worth mentioning. The Wold lab seems to be using MAQ and a list of splice junctions assembled from annotated exons to map the transcriptome sequences. I've heard that before, actually, from someone at the GSC who is doing exactly the same thing. It's a small world. I'm not really a fan of the technique, however. Yes, you'll get a lot of the exon junction reads, but you'll only find the ones you're looking for, which is exactly the criticism all the next-gen people throw at the use of micro-arrays. There has got to be a better solution... but I don't yet know what it is. (We thought it was Exonerate, but we can't seem to get it to work well, due to several bugs in the software. It's clearly a work in progress.)

Anyhow, I think I'm going to stop here. I'll just sum it all up by saying it was a pretty good talk, and it's given me lots of things to think about. I'm looking forward to getting back to coding tomorrow.


Friday, April 4, 2008

Dr. Henk Stunnenberg's lecture

I saw an interesting seminar today, which I thought I'd like to comment on. Unfortunately, I didn't bring my notes home with me, so I can only report on the details I recall - and my apologies in advance if I make any errors - as always, any mistakes are obviously with my recall, and not the fault of the presenter.

Ironically, I almost skipped the talk - it was billed as discussing Epigenetics using "ChIP-on-Chip", which I wrote off several months ago as being a "poor man's ChIP-Seq." I try not to say that too loud, usually, since there are still people out there who put a lot of faith in it, and I have no evidence to say it's bad. Or, at least, I didn't until today.

The presenter was Dr. Stunnenberg, from Nijmegen Center for Molecular Sciences, whose web page doesn't do him justice in any respect. To begin with, Dr. Stunnenberg gave a big apology for the change in date of his talk - I gather the originally scheduled talk had to be postponed because someone had stolen his bags while he was on the way to the airport. That has got to suck, but I digress...

Right away, we were told that the talk would focus not on "ChIP-on-Chip", but on ChIP-Seq, instead, which cheered me up tremendously. We were also told that the poor graduate student (Mark?) who had spent a full year generating the first data set based on the ChIP-on-Chip method had had to throw away all of his data and start over again once the ChIP-Seq data had become available. Yes, it's THAT much better. To paraphrase Dr. Stunnenberg, it wasn't worth anyone's time to work with the ChIP-on-Chip data set when compared to the accuracy, speed and precision of the ChIP-Seq technology. Ah, music to my ears.

I'm not going to go over what data was presented, as it would mostly be of interest only to cancer researchers, other than to mention it was based on estrogen receptor mediated binding. However, I do want to raise two interesting points that Dr. Stunnenberg touched upon: the minimum height threshold they applied to their data, and the use of Polymerase occupancy.

With respect to their experiment, they performed several lanes of sequencing on their ChIP-Seq sample, and used the standard peak finding to identify areas of enrichment. This yielded a large number of sites, which I seem to recall was in the range of 60-100k peaks, with a "statistically derived" cutoff around 8-10. No surprise, this is a typical result for a complex interaction with a relatively promiscuous transcription factor; a lot of peaks! The surprise to me was that they decided that this was too many peaks, and so applied an arbitrary threshold of a minimum peak height of 30, which reduced the number of peaks to roughly 6,400. Unfortunately, I can't come up with a single justification for this threshold at 30. In fact, I don't know that anyone could, including Dr. Stunnenberg, who admitted it was rather arbitrary, chosen because they thought the first number, in the tens of thousands of peaks, was too many.

I'll be puzzling over this for a while, but it seems like a lot of good data was rejected for no particularly good reason. Yes, it made the data set more tractable, but considering the number of peaks we work on regularly at the GSC, I'm not really sure this is a defensible reason. I'm personally convinced that there is a lot of biological relevance for the peaks with low peak heights, even if we aren't aware of what that is yet, and arbitrarily raising the minimum height threshold 3-fold over the statistically justifiable cut off is a difficult pill to swallow.
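
To show how much a cutoff like that can throw away, here is a tiny Python sketch that counts how many peaks survive at a given minimum height. The heights are made-up numbers, not data from the talk, and the function name is hypothetical; the point is only that the count is very sensitive to where you draw the line.

```python
def count_peaks_at_threshold(peak_heights, min_height):
    """Count the peaks whose height is at least min_height."""
    return sum(1 for height in peak_heights if height >= min_height)

# Invented peak heights, for illustration only.
heights = [8, 9, 12, 15, 31, 45, 10, 30, 120, 11]

for cutoff in (10, 30):
    kept = count_peaks_at_threshold(heights, cutoff)
    print("min height %d: %d of %d peaks kept" % (cutoff, kept, len(heights)))
```

Running the same count over a range of cutoffs on a real peak list would at least show what is being discarded before anyone commits to a number like 30.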

Moving along, the part that did impress me a lot (one of many impressive parts, really) was the use of Polymerase occupancy ChIP-Seq tracks. Whereas the GSC tends to do a lot of transcriptome work to identify the expression of genes, Dr. Stunnenberg demonstrated that polymerase ChIP can be used to gain the same information, but with much less sequencing. (I believe he said 2-3 lanes of Solexa data were all that were needed, whereas our transcriptomes have been done up to a full 8 lanes.) Admittedly, I'd rather have both transcriptome and polymerase occupancy, since it's not clear where each one has weaknesses, but I can see obvious advantages to both methods, particularly the benefits of having direct DNA evidence, rather than mapping cDNA back to genomic locations for the same information. I think this is something I'll definitely be following up on.

In summary, this was clearly a well thought through talk, delivered by a very animated and entertaining speaker. (I don't think Greg even thought about napping through this one.) There's clearly some good work being done at the Nijmegen Center for Molecular Sciences, and I'll start following their papers more closely. In the meantime, I'm kicking myself for not going to the lunch to talk with Dr. Stunnenberg afterwards, but alas, the ChIP-on-Chip poster sent out in advance had me fooled, and I had booked myself into a conflicting meeting earlier this week. Hopefully I'll have another opportunity in the future.

By the way, Dr. Stunnenberg made a point of mentioning they're hiring bioinformaticians, so interested parties may want to check out his web page.


Wednesday, April 2, 2008

New ChIP-Seq tool from Illumina

Ok, I had to blog this. Someone on the SeqAnswers forum brought it to my attention that Illumina has a new tool for ChIP-Seq experiments. That in itself doesn't bother me - the more people in this space, the faster we learn about what makes us tick.

What surprises me, though, is the tool itself (BeadStudio data analysis software - ChIP sequencing module). It's implemented only for Windows, for one. (Don't most self-respecting scientists use Macs or Linux these days? Or at least use and develop tools that can be used cross-platform?) Second, the feature set appears to be a re-implementation of the UCSC Genome Browser. Given the choice between the two, I don't see any reason to buy the Illumina version. (Yes, you have to pay for it, whereas UCSC is free and flexible.) I can't tell if it loads bed files or wig files, but the screen shots show a rather inflexible tool that looks like a graphical version of Gap4 or Consed. I'm not particularly impressed.

Worse still, I can't see this being implemented in a pipeline. If you're processing 100's of ChIP-Seq experiments in a year, or 1000's once this technique really starts to hit its stride, why would you want to force it all through a GUI? I just don't get it.

Well, what do I know? Maybe there's a big market for people out there who don't want free cross-platform tools, and would rather pay for a brand name science application than use something that works. Come to think of it, I'm willing to bet there are a few pharma companies out there who do think like that, and Illumina is likely to conquer that market with their tool. Happy clicking, Vista users.
