Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, January 30, 2009

A Change of Pace...

This post was born out of frustration - one of the biggest problems with bioinformatics is how quickly things change. It is also a huge strength, but it can be a major problem for people in the field.

The idea came out of a simple annoyance: someone renamed all of our reference genome fasta files last night, and clearly forgot to let people know. At least, those people I spoke to didn't know anything about it, so it wasn't just me missing a meeting. I can see the advantages of doing it, and I fully support it - but it should have been done a year ago, and when they finally got around to it, they should have sent a major email. Instead, I queued up a bunch of jobs and fired them off only to watch as they all started crashing.

Wonderful use of resources.
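
For what it's worth, a cheap pre-flight check before submitting anything to the queue would catch this kind of rename. Here is a minimal sketch in Python; the function name and the idea of passing in a plain list of paths are assumptions for illustration, not how our cluster scripts actually work.

```python
from pathlib import Path


def missing_references(paths):
    """Return the reference files that are absent or empty.

    `paths` is any iterable of file names the queued jobs will read.
    (Hypothetical helper: adapt it to however your jobs locate their FASTA files.)
    """
    missing = []
    for name in paths:
        path = Path(name)
        # A renamed file shows up as "not a file"; a truncated copy as size 0.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(str(name))
    return missing
```

If the returned list is non-empty, refuse to queue the jobs and print the offending names; that costs a second or two, rather than a cluster's worth of crashed jobs.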

At any rate, that got me thinking about the change of pace of next-generation sequencing. I've seen several threads asking questions about getting set up to use the MAQ aligner, obviously written by people who are just getting started with aligners. These threads, unfortunately, are being written after the author of the software has already abandoned that project and moved on to a new aligner. So much goes on without people realizing they're working on the last wave of the technology.

That's far from an isolated case - I found a program called NestedMICA for doing motif scanning, which would be a cool side project. I won't link to it, though, because it's clearly also been abandoned. Only two years ago it was a very promising application with a decent publication. Now, it won't compile and the author isn't responding to emails. I've spoken to motif people and they all change motif scanners and tools about as often as they change their socks. (Well, no, the socks get changed a little more often.)

Keeping up with the latest and greatest tools is a huge burden for people in this field, and it's practically impossible to do if your interests are at all diversified beyond one of the major subjects.

I suppose that bioinformatics is far from the only field in which these things happen, but I just can't think of another example where the ante to get in the game is so low (being able to program), the subject is so accessible (internet access gets you access to the data), and the questions are so fundamental (how does the cell work?)

All this churn and people jumping head first into the field leads to a plethora of unmaintainable Perl programs, abandoned code and half-baked packages without documentation. (In the worst case, of course!) In industry, any field with this kind of bandwagon would be ripe for consolidation, but in academia, it's just a Darwinian process where many, many failures seem to be required for each success. And somehow, I have no idea how to pick the winner.

All of this leaves me wondering where the field will be in 6 months or a year or two. I guess that's why scientists go to conferences: to see if we can get a glimpse into the crystal ball.

Compare this with biology. Can you imagine if, every 4 months, there were a completely different way of doing cloning, or the PCR technique you used became obsolete?

How about chemists? Need a new way to determine your compound's melting point every 4 months?

Or physicists... your model of gravity changes every 4 months?

I dunno. The pace is exhilarating... but sometimes exhausting.

Excuse me while I go make a few more changes to the way I process ChIP-Seq samples.... again.

Thursday, January 29, 2009

Canada's Government is pulling the carpet from Genomics.

Canada's Conservative Party, which currently controls Parliament, has decided that it won't fund genomics research any more. They seem to have decided that the $140 Million that it takes to power Canada's genome centres is too much, although they don't have a problem pouring over a billion and a half into building upgrades. Wow. Just... wow.

I have never been as impressed with the shortsightedness of the Conservatives. Yes, it's great to have a nice shiny genomics building, but when there's no money to operate the machinery, that's just sad. Considering the work that's being done in Canada with the new genomics technology, that's like deciding that all electronics after the invention of the lightbulb are superfluous. Great job guys. Genome Canada, through Genome BC, has played a large part in the work I do on breast cancer, on ChIP-Seq work... etc.

Anyhow, this sudden disbelief in science from the government has me wondering what the future has in store for science research in Canada. In the next two years I'll be leaving the happy confines of the Genome Sciences Centre with a nifty doctorate, and my hope was that I'd be able to stay in Vancouver (ideally), or at least in Canada, to do a post-doc or something genomics related. Unfortunately, the biggest agency helping to get this type of research off the ground was Genome Canada and its affiliates. Now that they've had a $140 Million/year budget pulled out from under them, I'm guessing it's pretty darn unlikely.

This means we'll be dropping funding for age-related diseases, cancers, promising new pharmaceutical technologies.... ok, I'm not going to list it out. I sure hope that the Government knows what it's doing, because all I see in the future is another move... and this time it'll probably be south of the border.

Well, either that or I go back to school for another two years and learn carpentry to help build the empty buildings on campus.

FindPeaks 3.3.0 and AGBT proposal

Well, I finally have a version of FindPeaks 3.3.0 that runs without known bugs. Tracking down that last bug was tricky, and it took me 3 days to find and squash. It's hard to find bugs that only happen when they're near a fragment that is duplicated. (-;

Anyhow, now that that's working better, it's time to add in the new functionality. The two most pressing pieces are the controls (in two parts - one of which is a top secret collaboration, while the other is just too boring to really talk about) and a SAM/BAMtools interface. Whenever the "new MAQ" is ready, I'd like to be prepared for it.

Incidentally, I think controls will be the easier of the two, and I think I'll be able to finish the boring parts off this week. At the rate things are going, it might be another 2 days of debugging after that, but that's what makes software writing fun. (Just imagine a tall thin guy hunched over a computer keyboard cackling insanely while staring deep into the monitor displaying green matrix-style characters drifting downward...)

At any rate, I'm also working towards my poster for AGBT, which reminds me of what else I wanted to suggest. If anyone who reads my blog is going to be at AGBT and is happy to meet up to talk some ChIP-Seq or SNP finding (or anything remotely related), let me know. I'm thinking it would be neat to gather people together who are working on the same topic and talk for a bit. (I'm even willing to miss formal talks for it, as long as they're not directly related to my work.)

So, to that effect, I'll point to this page on SeqAnswers, and suggest if anyone is interested they let me know. (= It would definitely be an efficient way to network.

Oh, and (still) for those of you who've already registered for AGBT, check out the nifty package Illumina is sending out to people. I'm HIGHLY impressed with the creative idea and timeliness. (If you don't know what it is, the suspense is killing you and you care enough to ask, I'll put the answer in the comments.) (-:

Wednesday, January 28, 2009

Pre-AGBT stuff: teaching at a high school

In addition to getting ready for AGBT 2009, which starts next week (and which I'm really not ready for yet), I've spent time this week volunteering with the Morgen project's community outreach effort. My involvement was confined to helping teach three grade 11 classes to kids in the I.B. program at a local high school.

Last year, I helped out with a single class, where the lesson was planned and all I ended up doing was helping the kids with their computer problems and walking them through some of the database issues they faced while poking around the ensembl.org web site. I think the students got something out of it, but it didn't really engage or challenge them.

This year, in contrast, I was involved in the planning, the preparation and the teaching, so I was able to make a much greater contribution. Unlike last year, instead of a large group of researchers descending on a 60-minute class, this year it was two grad students (a colleague and myself) and a single program coordinator who went over for an 80-minute class, so we had a lot of freedom to try out new things and change the focus of the lesson a little - and expand on the cool stuff.

The first thing we modified was the overall lesson plan. Last year featured a half-hour walk-through and lesson on Ensembl - which would have been totally unusable a year later, considering all the changes that have happened on the ensembl.org web page over the past year - followed by a half hour for the students to work on a work sheet. Instead, we broke the lesson into four parts: a 15-minute "participatory" vocabulary exercise, a 15-minute walk-through of the exercise sheet using different examples, a half hour for the students to work on their exercise sheets (with our help as required), and a 15-minute Q&A session.

Overall, the response was phenomenal. The students were asked to fill in evaluation sheets, and the vast majority of the students (>80%) said that it had really helped them get a better grasp of the topic. (We received only one negative response out of 84 kids who filled in the sheet!) The questions we got during the Q&A session were neat - and some of them were off the wall, but really helped make the science more relevant to the students. I think the kids all walked away with a better appreciation of what scientists do, and why we do it.

At any rate, I've been debating whether I should try to write out a summary of the events and activities for other people involved in this type of program. It might make a nice resource - but it would be nice to know if anyone has any interest in it before I start working on it. Any takers?

Just to summarize, it was a tremendously rewarding activity, and I'm hoping that the program continues on next year. It might even make me want to include teaching in my future activities somewhere. (-;

Friday, January 9, 2009

No More Maq?

Another grad student at the GSC forwarded an email to our mailing list the other day, which was in turn from the maq-help mailing list. Unfortunately, the link on the maq-help mailing list takes you to another page, which incidentally (and erroneously) complains that FindPeaks doesn't work with Maq .map files - which it does. Instead, I suggest checking out this post on SeqAnswers from Li Heng, the creator of Maq, which has a very similar message.

The main gist of it is that the .map file format will be deprecated, and there will be no new versions of the Maq software package in the future. Instead, they will be working on two other projects (from the forwarded email):
  1. Samtools: replaces maq's (reference-based) "assembly"
  2. bwa: replaces maq's "mapping" for whole human genome alignment.
I suppose it means that eventually FindPeaks should support the Samtools formats, which I'll have to look into at some point. For those of you who are still using Maq, you may need to start following those projects as well, simply because it raises the question of long-term Maq support. As with many early-generation bioinformatics tools, we'll just have to be patient and watch how the software landscape evolves.
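
To give a feel for what supporting the Samtools text format might involve, here is a rough Python sketch of reading one alignment line. It assumes the tab-delimited layout with 11 mandatory columns described in the published SAM specification (the format was still a draft when the announcement went out, so details may differ). The `SamRead` class and `parse_sam_line` function are hypothetical names for illustration; this is not FindPeaks code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SamRead:
    name: str    # QNAME
    flag: int    # FLAG (bitwise)
    chrom: str   # RNAME
    pos: int     # POS, 1-based leftmost mapped position
    mapq: int    # MAPQ
    cigar: str   # CIGAR
    seq: str     # SEQ

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> Optional[SamRead]:
    """Parse one alignment line; return None for header ('@') or blank lines."""
    if not line.strip() or line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRead(fields[0], int(fields[1]), fields[2], int(fields[3]),
                   int(fields[4]), fields[5], fields[9])
```

A peak finder would mostly care about `chrom`, `pos` and `is_reverse`, since those are enough to place a fragment on the genome; everything else here is just carried along.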

It probably also means that I'll have to start watching the Samtools development more carefully for use with my thesis project - many of the tools they are planning seem to replace the ones I've already developed in the Vancouver Short Read Alignment Package. Eventually, I'll have to evaluate both sets against each other. (That could also be an interesting project.)

While this was news to me, it's probably no more than the expected churn of a young technology field. I'm sure it's not going to be long until even the 2nd generation sequencing machines themselves evolve into something else.

Thursday, January 8, 2009

ELISA for Obesity Proteins.

Ok, I can't resist. I occasionally get emails from random biotech companies promoting products that are invariably useless to me. This one amused me enough that I thought I should share it.

The title of the email is "ELISA Strip for Profiling 8 Obesity Proteins." While I'm sure there are people who have a good use for that, I have no clue why I'd want it. I'm not sure I'd want to go to a doctor who needs to use it to tell if their patients are overweight either.

Whatever happened to looking at yourself in the mirror or standing on the bathroom scale and saying, "Oh man, I need to lose some weight!?" Now you're supposed to kit yourself out and do an ELISA to tell if you've got to diet?

Oh well, if you do find you have a use for it, Signosis will be more than happy to sell you one.

The Future of FindPeaks

At the end of my committee meeting last month, my advisors suggested I spend less time on engineering questions, and more time on the biology of the research I'm working on. Since that means spending more time on the cancer biology project, and less on FindPeaks, I've been spending some time thinking about how I want to proceed - and I think the answer is to work smarter on FindPeaks. (No, I'm not dropping FindPeaks development. It's just too much fun.)

For me, the amusing part of it is that FindPeaks is already on its 4th major structural iteration. Matthew Bainbridge wrote the first, I duplicated it by re-writing its code for the second version, then came the first round of major upgrades in version 3.1, and then I did the massive cleanup that resulted in the 3.2 branch. After all that, why would I want to write another version?

Somewhere along the line, I've realized that there are several major engineering things that could be done that would make FindPeaks faster, more versatile and able to provide more insight into the biology of ChIP-Seq and similar experiments. Most of the changes are a reflection of the fact that the underlying aligners that are being used have changed. When I first got involved we were using Eland 0.3 (?), which was simple compared to the tools we now have available. It just aligned each fragment individually and spit out the results, which left the filtering and sorting up to FindPeaks. Thus, early versions of FindPeaks were centred on those basic operations. As we moved to sorted formats like .map and _sorted.txt files, those issues have mostly disappeared, allowing more emphasis to be placed on the statistics and functionality.
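
The "basic operations" in question are easy to picture. The Python sketch below shows the general shape of a filter-then-sort pass over unsorted aligner output. The dictionary keys (`chrom`, `start`, `quality`) and the function name are assumptions made up for this illustration; they are not Eland's real output format, and FindPeaks does not do it this way.

```python
from operator import itemgetter


def filter_and_sort(reads, min_quality=0):
    """Drop unaligned or low-quality reads, then order the rest by position.

    Each read is a dict with hypothetical keys: 'chrom' (None if the read
    did not align), 'start' (int) and 'quality' (int).
    """
    kept = [
        r for r in reads
        if r["chrom"] is not None and r["quality"] >= min_quality
    ]
    # Chromosome names sort as plain strings here, so "chr10" comes before "chr2".
    return sorted(kept, key=itemgetter("chrom", "start"))
```

Once the aligner hands over sorted output, this whole step drops away and a peak finder can stream through the reads chromosome by chromosome.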

At this point, I think we're coming to the next generation of biology problems - integrating FindPeaks into the wider toolset - and generating real knowledge about what's going on in the genome, and I think it's time for FindPeaks to evolve to fill that role, growing out to better use the information available in the sorted aligner results.

Ever since the end of my exam, I haven't been able to stop thinking of neat applications for FindPeaks and the rest of my tool kit - so, even if I end up focussing on the cancer biology that I've got in front of me, I'm still going to find the time to work on FindPeaks, to better take advantage of the information that FindPeaks isn't currently using.

I guess that desire to do things well, and to get at the answers that are hidden in the data is what drives us all to do science. And probably what drives grad students to work late into the night on their projects.... I think I see a few more late nights in the near future. (-;

Wednesday, January 7, 2009

Howto on applying for jobs.

This post is way off topic for my usual blogging subjects, but it's something I've wanted to do for a long time. Ever since my time at the start-up company, where I read several thousand resumes, I've had the urge to write out a Howto document for job applicants. It comes out of the frustration of reading hundreds of terribly done resumes, and tens of badly written cover letters. After a while, you figure out what you don't want to see, which eventually turns into what you do want to see.

Of course, it's only in first draft mode - I haven't done much editing on it, but I figure it can only get better from here. My warning is that it's already 24 pages long without illustrations. Hopefully those will show up in later editions. Comments and criticisms accepted. (-;

JobHowTo.pdf

Tuesday, January 6, 2009

My Geneticist dot com

A while back, I received an email from a company called mygeneticist.com that is doing genetic testing to help patients identify adverse drug reactions. I'm not sure what the relationship is, but they seem to be a part of something called DiscoverMe technologies. I bring mygeneticist up, because I had an "interview" with one of their partners, to determine if I am a good subject for their genetic testing program. It seems I'm too healthy to be included, unless they later decide to include me as a control. Nuts-it! (I'm still trying to figure out how to get my genome sequenced here at the GSC too, but I don't think anyone wants to fund that...)

At any rate, I spoke with the representative of their clinical side of operations this morning and had an interesting conversation about my background. In typical fashion, I also took the time to ask a few specific questions about their operations. I'm pretty sure they didn't tell me much more than was available on their various web pages, but I think there was some interesting information that came out of it.

When I originally read their email, I had assumed that they were going to be doing WTSS on each of their patients. At about $8000 per patient, it's expensive, but a relatively cheap form of discovery - if you can get around some of the challenges involved in tissue selection, etc. Instead, it seems that they're doing specific gene interrogation, although I wasn't able to find out the type of platform they're using. This leads me to believe that they're probably doing some form of literature check for genes related to the drugs of interest, followed by a PCR or array-based validation across their patient group. Considering the challenges of associating drug reactions with SNPs and genomic variation, I would be very curious to see what they have planned for "value-added" resources. Any drug company can find out (and probably does already know) what's in the literature, and any genetic testing done without approval from the FDA will probably be sued/litigated/regulated out of existence... which doesn't leave a lot of wiggle room for them.

And that led me to thinking about a lot of other questions, which went unasked. (I'll probably email the Genomics expert there to ask some questions, though I'm mostly interested in the business side of it, which they probably won't answer.) What makes them think that people will pay for their services? How can they charge a low enough fee to make the service attractive while still making a profit? And, from the scientific side, assuming they're not just a diagnostic application company, I'm not sure how they'll get a large enough cohort to make sense of the data they receive through their recruitment strategy.

Anyhow, I'll be keeping my eyes on this company - if they're still around in a year or two, I'd be very interested in talking to them again about their plans in the next-generation sequencing field.
