Thanks for visiting my blog - I have now moved to a new location at Nature Networks. URL: http://blogs.nature.com/fejes - please come visit my blog there.

Wednesday, October 22, 2008

Ubuntu Intrepid Beta

I just upgraded my laptop from Ubuntu 8.04 to Ubuntu 8.10, and I have to say it was - by far - the easiest OS upgrade I have ever done. I had been waiting for the new fglrx driver for my video card, as I use compiz heavily on this notebook for application switching... but now that it's here, I just gave the upgrade command ("update-manager -d") and off it went.

Going from 7.04 to 7.10 and then to 8.04 gave me all sorts of weird but entertaining problems - wireless died, or the monitor changed resolutions, or whatever. This time - nothing! I didn't even see the usual batch of weird errors (such as incompatible fonts and locale settings) in the upgrade terminal.

It asked a few relevant questions - KDE vs. GNOME (GNOME), whether I wanted to install the proprietary firmware (via fwcutter) for my wireless card (yes), and a password for MySQL 5, which it later uninstalled (strange...). And off it went.

It took about an hour and a half, most of which was entirely unsupervised. (A couple of config file changes were offered for my approval, where I had changed system settings.) Otherwise, it took care of all the things I would normally do myself, such as uninstalling old kernel versions and clearing out unused packages. All in all, a very pleasant experience. (I even had my IM programs running throughout the process, and kept talking with people the whole time.)

On reboot, I instantly noticed a faster start-up - there were fewer pauses, and the whole process seemed somewhat quicker because of it.

Once back into Gnome, you'll notice not much has changed - nothing broke during the install! That's a nice change.

When things do change unexpectedly, menus pop up to warn you (e.g. converting my exit-Ubuntu icon into a combined user switcher/MSN status icon). Information was offered right away via the lightbulb icon, and the option to keep the old icon was provided as well. Very considerate of a new OS to give you a choice!

Finally, one of the things I thought would break was my previous hack to get the screen brightness keys working, which I deliberately overwrote during the upgrade. Instead, the screen brightness buttons now work even better than before. Where my previous solution gave uneven increments of brightness that were hard to adjust, it now goes from light to dark in pretty even steps. Very nice!

All this in the first 5 minutes after the install... Anyhow, now it's time to get back to work and actually enjoy Ubuntu 8.10!

Friday, October 17, 2008

I went to see a talk by Dr. Irmtraud Meyer yesterday afternoon, over on the UBC campus. I haven't been down that way for a long time - and it was just as gloomy in the rain as it was when I did my master's there. (The bright red trees do make a nice contrast in the fall, however.)

The title of the talk was "Investigating the Transcriptome of Higher Eukaryotes", which had me fooled into thinking it would be directly relevant to the work I'm doing on human transcriptomes. Alas, I was wrong. However, it was related to some work in which I was involved during my master's degree. Oddly enough, that was a course project that just turned out well: it provided the group with a publication on inverse RNA folding, and sent one of the brighter grad students down the path of RNA work.

As I said, Dr. Meyer's work was quite interesting, and - in a strange way - turned out to be relevant after all. As someone working on transcriptomes, I tend to view RNAs as linear bits of sequence, which cells produce as part of the pathway of producing proteins.
(Image: the central paradigm of molecular biology - transcription and translation.)

That is the classical view of mRNA - and we tend not to stop and re-think it. However, that's exactly what I ended up doing yesterday.

Two interesting bits of information came up that I knew in general, but hadn't really processed:
  • 80% of transcribed sequence corresponds to unannotated regions (Science, 2005, 308:1149-1154)
  • 40-65% of known mammalian genes are alternately spliced (Science, 2005, 309:1559-1563)
And then there's ample evidence that RNA folding is involved in alternative splicing... well. Suddenly it's hard to think of those little RNA sequences as linear strings - it's hard even to think of them outside the normal context of transcription and translation. Yet we have tRNA, mRNA and even miRNA! Clearly transcriptomes aren't the simple model we perceive them to be in genomics.

While it's nice to have a clearer picture of what's going on at the molecular level, I don't really know how to apply this information. I can't use it to analyze the transcriptomes I work with, and I can't use it to deal with the alternative splicing I see. I can't even figure out what all those splice sites are yet - but eventually, this information will have to be integrated into our annotations. miRNA and "junk" RNA probably all have meaning that we just don't understand at that level yet.

Just a few more things to work on in the future, I suppose.

Thursday, October 16, 2008

Manuscript review.

Several weeks ago, I was really flattered to be asked to review a manuscript, even if it was for a journal I'd never heard of. As a grad student, it's cool that people have even heard of me enough to put me down as a potential reviewer. (I've since had other requests as well, though I don't think I'll have the spare hours to tackle more reviews - but I'm still flattered!)

Unfortunately, I'm disappointed in the quality of the very first manuscript I've ever reviewed. I won't go into details, however.

On a completely unrelated topic, and for future reference, here are a few links that may be useful to people who want to submit manuscripts in the future:
  • Plagiarism (Wikipedia). Covers what plagiarism is, and the possible consequences of it as a student.
  • How to avoid Plagiarism (Purdue). General tips on when and how to give credit to the originators of the idea and source of the material.
  • A good discussion of the different forms of plagiarism (Andrew Roberts at Middlesex University). I highly recommend this if you're unsure where the grey line between copying and paraphrasing begins.
  • If things are still unclear, this reference (Irving Hexham at the University of Calgary) provides examples and demonstrates the correct way to quote and reference other people's work.

Wednesday, October 15, 2008

Not first on the subject...

One of the challenges of writing a blog is trying to provide new content. I've always thought there was no point in covering information someone else has already covered - but it's hard to be the first to break every story. So I figured I'd do something else useful: provide a few links to things that someone else found first.

Today's breaking news was something I found at SeqAnswers.com: the new release of information on Pacific Biosciences' SMRT technology. The open access article is here. Thanks to ECO for updating with that link! (I was trying to get it all morning, to no avail.)

For humour, I wanted to discuss the Ig Nobel award-winning paper You Bastard: A Narrative Exploration of the Experience of Indignation within Organizations. However, another blogger on my reading list beat me to it.

I was also going to comment on the lawsuit launched by the makers of EndNote against Zotero, but the same blogger beat me to that too. (Though there's certainly a lot to be said about that case in terms of open standards, and people with antiquated business models fighting technological advances by any means possible...) Since I'm still considering using Zotero for my own research, it's probably something I'll save for another day, when there's more information about the progress of the case.

Otherwise, there's still the new Apple MacBook, milled out of a single piece of aluminum, and the new Dell Mini 9, which I've been debating buying. Since it now comes with Ubuntu pre-installed, maybe I can forgive Dell for not refunding me for the XP licence they made me buy last Christmas with my current laptop.

Oh well. Now you know what I've been looking at while I procrastinate on my thesis proposal, and the mound of code changes I need to make ASAP... and now, back to work!

Tuesday, October 14, 2008

Election Day in Canada

Another non-sciency post. I'll try to get back on topic with a few more Second Generation Sequencing posts later today, but for now, I figure a bit of politics is in order.

We Canadians are always a step ahead of our American neighbours. We just had our Thanksgiving long weekend, and today is our federal election, while Americans have to wait till November for their Thanksgiving and their election day. Of course, just like the American election, the Canadian election is a complete gong show. All mudslinging, all the time!

Actually, for once, one party didn't bother with the attack ads, so they get my vote. They stayed on the issues, and even if I don't agree with everything they say, they have a clear enough vision and share most of my ideals. How much better can it get?

Unfortunately, unlike our American neighbours with their two-party system, Canada has a lot of viable parties and no clear-cut winner, which means we're in for another minority government. (No one party controls the government, so power sharing becomes important.) Because the party system in Canada is somewhat confusing, and gets a lot less press time in the States than the US election, I figured I'd do my part to share a bit of Canadian culture with the world - particularly for those Americans who are considering coming to Canada if McCain wins...

Thus, I present the fejes.ca guide to Canadian political parties:
  • The Conservatives: roughly analogous to the American Republicans. Basically, whatever George Bush says is good for the States, they believe is also good for Canada. More oil drilling, less arts funding, more tax breaks for big companies, less transparency in government.
  • The New Democratic Party (NDP): imagine if unions ran for office. Yep - that says it all. They have a very personable leader who makes it sound like a good idea to shift the entire tax burden of the country onto big businesses. (Won't they all leave?)
  • The Liberals: roughly analogous to the Democrats, but without a charismatic leader. They've tried to out-bully the Conservatives (but failed), tried to out-environmental the Greens (but failed), tried to out-baby-kiss the NDP (but failed), and never once tried to portray themselves as the most centrist party in Canada. They've come down far in the world since they last held power.
  • The Greens: a much more intelligent version of the US branch. They claim they can help reposition the country's economy to support ecologically sound businesses. Otherwise, their platform is relatively centrist (education, health care, etc.). On the down side, the party has a tradition of running a lot of nutjobs as candidates. Oops.
  • The Bloc Québécois: imagine if Texas wanted to separate, and elected a bunch of loudmouth cowboys to Congress to sit around and heckle the other congressmen. Except, of course, they claim that only people who speak French should be allowed into their "country." (They don't seem to care if you're black, red, green or white, just that you speak French - they're generally just language snobs who believe the French won against the British in North America.) Fortunately, you can only vote for them in Quebec.
Faced with that choice, Canada will have another government by tomorrow - and a long 4 years ahead of us.

Thursday, October 9, 2008

Maq .map to .bed format converter

I've finally "finished" the manual for one section of the Vancouver Short Read Analysis Package - though it's not FindPeaks. (I'm still working on that one - it's a big application.) It could still use pictures, graphs and so on... but it's now functional.

One down... about 7 more manuals to write. Documentation is never the fun stuff.

What slowed this down, ironically, was my inability to read the Maq documentation. I completely missed the fact that unmapped reads are now included in PET-aligned .map files, but with a different paired flag status. Previously, unmapped ends were thrown away, and I had to handle the unpaired ends myself. With the new version, those unmapped reads are included but given a status of 192, so they can be paired again - albeit with not much information in the pairing. In fact, I can even handle the other ends as I find them, because they're given a status of 64. (Do these numbers seem arbitrary to anyone other than me?)
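
(For my own notes, and for anyone else writing a parser: the branching ends up looking something like the sketch below. The MapRecord class and its fields are hypothetical stand-ins for whatever your own .map reader produces - this isn't Maq's code.)

    // A minimal sketch of branching on Maq's PET paired-flag values, as I
    // understand them: 192 = unmapped end retained for pairing, 64 = the
    // mapped mate of an unmapped end. MapRecord is a hypothetical stand-in.
    public class PairedFlagDemo {

        static class MapRecord {
            String name;
            int pairedFlag; // Maq's paired flag status
        }

        static void handle(MapRecord r) {
            switch (r.pairedFlag) {
                case 192: // unmapped end - kept in the file, but carries no position
                    System.out.println(r.name + ": unmapped end, skip for peaks/SNPs");
                    break;
                case 64:  // mapped end whose mate was unmapped - usable on its own
                    System.out.println(r.name + ": treat as a single-end read");
                    break;
                default:  // an ordinarily paired read
                    System.out.println(r.name + ": process normally");
            }
        }

        public static void main(String[] args) {
            MapRecord r = new MapRecord();
            r.name = "read_001";
            r.pairedFlag = 192;
            handle(r);
        }
    }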

Anyhow, the .map to .bed converter finally works - and there's a manual to go with it.
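
For the curious, the heart of the conversion is just reformatting each aligned read into BED's tab-delimited, 0-based, half-open coordinates. A minimal sketch, with invented field names rather than Maq's actual record layout:

    // Core of a .map to .bed conversion: each aligned read becomes one
    // tab-delimited BED line. BED intervals are 0-based and half-open, so a
    // read at 1-based position p with length n spans [p-1, p-1+n).
    public class MapToBedLine {

        static String toBed(String chrom, int oneBasedPos, int readLength,
                            String name, int mapQuality, boolean reverseStrand) {
            int start = oneBasedPos - 1;  // BED is 0-based
            int end = start + readLength; // half-open end
            char strand = reverseStrand ? '-' : '+';
            return chrom + "\t" + start + "\t" + end + "\t"
                    + name + "\t" + mapQuality + "\t" + strand;
        }

        public static void main(String[] args) {
            // prints: chr1  12344  12380  read_001  37  +  (tab-separated)
            System.out.println(toBed("chr1", 12345, 36, "read_001", 37, false));
        }
    }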

Cheers.

Wednesday, October 8, 2008

My ChIP-Seq chapter is now available...

One more post for today... I'm on a roll.

The textbook chapter I worked on last November is finally available. The book is called "Next Generation Genome Sequencing: Towards Personalized Medicine." I was a little worried it would be out of date by the time it got to press, but it seems to have held up pretty well.

Anyhow, it's pretty cool that you can buy it at Barnes and Noble.

CERN photo gallery.

Ok... going for the record for most posts in 20 minutes on my blog.

If you have a chance, go check out this gallery of photographs of CERN. This is basically the setup we'll be needing for genomics in the near future.

(Image: CERN computer space.)

Credit for the image goes to CERN, of course, and it was found in an article on the cnet.com news site.

Well, you can't claim CERN wasn't thinking big when they did this design. Look at all that empty rack space for future expansion!

Maq Bug

I came across an interesting bug today while trying to work with reads aligned by Smith-Waterman (flag = 130) in PET alignments. Indeed, I even filed a bug report on it.

The low-down on the "bug" or "feature" (I'm not sure which it is yet) is that sequences containing deletions don't actually show the deletions - they show the straight genomic sequence at that location. The reason I figure this is a bug rather than by design is that sequences with insertions do show the insertion. So why the discrepancy?

Anyhow, the upshot is that I'm only able to use half of the Smith-Waterman alignments Maq produces when doing SNP calls in my software. (I can't trust that the per-base quality scores align to the correct bases with this bug.) Since other people are doing SNP calls using Maq alignments... what are they doing?
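
My workaround, for now, is simply to drop the suspect half - Smith-Waterman alignments containing a deletion - before calling SNPs. A rough sketch of the filter; the MapRecord fields, including the indel-length convention, are assumptions for illustration, not Maq's actual encoding:

    // Keep Smith-Waterman (flag 130) alignments only when they don't contain
    // a deletion, since the reported sequence (and thus the per-base
    // qualities) can't be trusted otherwise. The indelLength convention
    // (>0 insertion, <0 deletion, 0 none) is assumed, not Maq's real format.
    import java.util.ArrayList;
    import java.util.List;

    public class SmithWatermanFilter {

        static class MapRecord {
            int pairedFlag;  // 130 = aligned by Smith-Waterman
            int indelLength; // assumed: >0 insertion, <0 deletion, 0 none
        }

        static List<MapRecord> usableForSnpCalls(List<MapRecord> records) {
            List<MapRecord> keep = new ArrayList<MapRecord>();
            for (MapRecord r : records) {
                boolean swWithDeletion = (r.pairedFlag == 130 && r.indelLength < 0);
                if (!swWithDeletion) {
                    keep.add(r); // insertions display correctly, so those stay
                }
            }
            return keep;
        }
    }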

Tuesday, October 7, 2008

How not to negotiate the price on a house.

Among all the tons of things going on in my life - as if it weren't busy enough - my girlfriend and I made an offer on a new house yesterday. With the market the way it is, houses just aren't going anywhere - and this one has been on and off the market for at least six months now. The only offers they've had in that period were from us, so clearly this isn't a hot property. Even for us, it isn't a perfect place - but it fits what we're looking for. After a stunning lack of interest in the house, the developer took it off the market for August and September, and re-listed it about a week ago.

Indeed, given the housing market's poor state, the developer has also been forced to drop the asking price by about 10% from the original listing. Of course, in the way of real estate, it subsequently turned out that the developer had been claiming the house was 1,500 sq. ft. when it was really closer to 1,366 sq. ft., so I don't know that they had much choice about dropping the price - but the listing was made with the larger square footage, so I don't think that was even taken into account.

With all that going on, we made an offer about 5% below asking price - which isn't unreasonable for the price per (real) square foot in that neighbourhood. (It's actually slightly higher, so it wasn't a low-ball offer by any means.)

In the game of negotiations, you'd expect the developer to counter with a price closer to what he originally had asked for. That way, both parties can converge on something agreeable to everyone. Apparently, that's not what the developer had in mind.

He counter offered at $15,000 above his listed asking price!

Needless to say, after a few rounds of discussions, this deal isn't happening. Fortunately, house prices will only continue to drop, so when we offer again in a month or two, it'll be even lower. (=

I can only shake my head and wonder what's running through the developer's mind.

Sunday, October 5, 2008

Field Programmable Gate Arrays

Yes, I'm procrastinating again. I have two papers, two big chunks of code and a thesis proposal to write, a paper to review (the reading is done, but I have yet to type out my comments...), several major experiments to do, and at least one poster looming on the horizon - not to mention squeezing in a couple of manuals for the Vancouver Package software. And yet I keep finding other stuff to work on, because it's the weekend.

So, I figured this would be a good time to touch on the topic of Field Programmable Gate Arrays, or FPGAs. I've done very little research here, since it's so far removed from my own core expertise, but it's a hot topic in bioinformatics, so I'd be doing a big disservice by not touching on it at all. However, I hope people will correct me if they spot errors.

So what is an FPGA? I'd suggest you read the Wikipedia article linked above, but I'd sum it up as a chip that can be added to a computer and configured to optimize the way information is processed, so as to accelerate a given algorithm. It's a pretty cool concept - move a particular part of an algorithm into the hardware itself to speed it up. Of course, there are disadvantages as well. Reprogramming is (was? - this may have changed) a few orders of magnitude slower than processing information, so you can't change the programming on the fly while processing data and still hope to get a speedup. Some chips can reprogram unused sub-sections while other algorithms are running... but now we're getting really technical.

(For a very good technical discussion, I suggest this book, of which I've read a few useful paragraphs.)

Rather than discuss FPGAs themselves, which are a cool subject on their own, I'd rather discuss their applications in bioinformatics. As far as I know, they're not widely used for most applications at the moment. The most processor-intensive bioinformatics applications, molecular modelling and drug docking, are mainly vector-based calculations, so vector chips (e.g. Graphics Processing Units - GPUs) are more applicable to them. As for the rest, CPUs have traditionally been "good enough". Recently, however, the following two things seem to have accelerated this potential marriage of technologies:
  1. The makers of FPGAs have been looking for applications for their products for years, and have targeted bioinformatics because of its intense computer use. Heavy computer use is always considered a sign that more efficient processing is an industry need - and FPGAs appear to meet that need, on the surface.
  2. Bioinformatics was doing well with the available computers, but suddenly found itself behind the processing curve with the advent of Second Generation Sequencing (SGS). The amount of information being processed spiked by an order of magnitude (or more), leaving bioinformaticians screaming for more processing power and resources.
So it was inevitable that FPGA producers would hear about the demand for more power in the field, and believe it's the ideal market into which they should plunge. To the casual observer, bioinformatics needs more efficiency and power, and FPGA producers are looking for a market where efficiency and power are needed! Is this a match made in heaven or what?

Actually, I contend that FPGAs are the wrong solution for several reasons.

While Second Generation Sequencing produces tons more data, the algorithms being employed haven't yet settled down. Every 4 months we pick a different aligner. Every 3 months we add a new database. Every month we produce a more efficient version of our algorithms for interpreting the data. The overhead of translating an algorithm into the hardware description necessary to use an FPGA (which seems large to me, but may not be to people more fluent in HDL) means you'd spend a disproportionate amount of time getting the chips set up to process your data - a configuration you'd only use for a short period before moving on. Any gain in efficiency would probably be wiped out by the extra effort involved.

Furthermore, even when we do know the algorithms in use are going to stick around, a lot of our processing isn't necessarily CPU bound - it's I/O or memory bound. When you're trawling through 16 GB of memory, it's not obvious that adding more speed to the CPU will help. Pre-fetching and pre-caching are probably doing more to help you out than anything else bound to your CPU.

In the age of multi-core machines, multi-threaded programs already relieve many of the pains that plague bioinformaticians. Most of my Java code is thrilled to pull 2, 3, or more processors in to work faster - without a lot of explicit multi-threading. (My record so far is 1496% CPU usage - nearly 15 processors.) I would expect that buying 16-way processors is more cost-efficient than buying 16 FPGAs for many of the current algorithms in use.
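
(For what it's worth, there's nothing exotic behind those numbers - just handing independent slices of the data to a fixed pool of worker threads. A bare-bones sketch of the pattern, with a placeholder task standing in for the real work:)

    // Bare-bones threading pattern: split the input into independent chunks
    // and let a fixed pool of worker threads chew through them. processChunk
    // is a placeholder for real work (e.g. per-chromosome read processing).
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelChunks {

        public static void main(String[] args) throws InterruptedException {
            int cpus = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cpus);

            List<String> chunks = Arrays.asList("chr1", "chr2", "chr3", "chrX");
            for (final String chunk : chunks) {
                pool.submit(new Runnable() {
                    public void run() { processChunk(chunk); }
                });
            }

            pool.shutdown();                          // accept no new tasks
            pool.awaitTermination(1, TimeUnit.HOURS); // wait for the workers
        }

        static void processChunk(String name) {
            System.out.println(Thread.currentThread().getName() + " processing " + name);
        }
    }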

Buying more conventional resources will probably alleviate the sudden bottleneck in compute power more cheaply than innovating new solutions. It's likely that many groups getting into second generation genomics technologies failed to understand the processing demands of the data, and thus didn't plan adequately for the resources. That means much of the demand for data processing is just temporary, and may even be alleviated by more efficient algorithms in the future.

So where does the FPGA fit in?

I'd contend that there are very few products out there that would benefit from FPGAs in bioinformatics... but there are a few. Clearly, all bioinformaticians know that aligning short reads is one of those areas. Considering that a full Maq run for a flow cell from an Illumina GAII takes 14+ hours on a small cluster, that's one area that would clearly benefit.

Of course, no bioinformatician wants to reprogram an FPGA on the fly just to use it. Were I to pick a model, it would be to team up with an aligner group to produce a stand-alone, multi-FPGA/CPU hybrid box with 32 GB of RAM and a 3-4 year upgrade path. Every 4 months you produce a new aligner algorithm and HDL template; users pick up the aligner and HDL upgrade, and "flash" their computer to use the new software/hardware. This would follow the Google Appliance model - an automated box that does one task, and does it well - except that hardware "upgrades" come along with the software patches. That would certainly turn a few heads.

At any rate, only time will tell. If the algorithms settle down, FPGAs may become more useful. If FPGAs become easier for bioinformaticians to program, they may find a willing audience. If the FPGA makers come to understand the constraints bioinformatics groups work under, they may find niche applications that will truly benefit from this technology. I look forward to seeing where this goes.

Ok... now that I've gone WAY out on a limb, I think it's time to tackle a few of those tasks on my list.

Thursday, October 2, 2008

Paired End ChIP-Seq

Ok, time for another editorial on ChIP-Seq and a very related topic: Paired-End Tags.

When I first heard about Paired End Tags (PETs), I thought they'd be the solution to all of our ChIP-Seq worries. You'd be able to pinpoint the ends of each fragment more accurately, and thus bracket binding sites more easily.

Having played with them for a bit, I'm not really sure that's the case. I don't think they do any better than the adaptive or triangle distributions I've been using for a while, and I don't think it really matters where that other end is, in the grand scheme of things. The peaks don't seem to shift anywhere they wouldn't be otherwise, and the resolution doesn't change dramatically.
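
(For anyone unfamiliar with the term: by a "triangle distribution" I mean weighting each tag's extended fragment so its contribution peaks mid-fragment and tapers linearly toward the ends, instead of adding a flat block. The toy sketch below illustrates the idea - it is not FindPeaks' actual implementation:)

    // Toy illustration of a triangle-distribution extension for one tag:
    // instead of adding a flat block of height 1 over the presumed fragment,
    // add a weight that rises linearly from ~0 at the ends to 1 mid-fragment.
    public class TriangleTag {

        // Add one tag's triangle-weighted contribution to a coverage array.
        static void addTag(double[] coverage, int start, int fragmentLength) {
            double mid = fragmentLength / 2.0;
            for (int i = 0; i < fragmentLength; i++) {
                int pos = start + i;
                if (pos < 0 || pos >= coverage.length) continue;
                double weight = 1.0 - Math.abs(i - mid) / mid;
                coverage[pos] += weight;
            }
        }

        public static void main(String[] args) {
            double[] cov = new double[50];
            addTag(cov, 10, 20); // one tag at position 10, ~20 bp fragment
            for (int i = 10; i < 30; i++) {
                System.out.printf("%d\t%.2f%n", i, cov[i]);
            }
        }
    }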

I guess I just don't see much use in using them.

Of course, what I haven't had the chance to do is run a direct PET against a SET library to see how they compete head to head. (I've only played with the stats for each individually, so take my comments with a grain of salt.)

That said, people will start using PET for ChIP-Seq, despite the increased cost and slightly smaller number of tags. The theory goes that the number of mappable tags will increase slightly, compensating for the smaller total.

That increase in mappable tags may, in fact, be the one redeeming factor here. If the background noise turns out to be full of tags that are better mapped with their paired end mate, then noise decreases, and signal increases - and that WOULD be an advantage.

Anyhow, if anyone really has the data (SET + PET for one ChIP-Seq sample) to do this study, FindPeaks 3.2 now has PET support and all the tools for doing PET work. All that's left is to get the manual together - a work in progress, which I'd be willing to accelerate if I knew someone was using it.

Wednesday, October 1, 2008

MAQ 0.7.1 binary map files

I took a look at Maq 0.7.1 today, with the intent of getting the new .map files into the Vancouver Short Read Analysis Package (both FindPeaks and the SNP caller). It turned out to be a very quick job. The only difference between version 0.6 and 0.7.1 is that the sequence-length constant at the core of the .map file can now be 128 as well as 64, whereas the older version only allowed 64.

Unfortunately, the Maq authors didn't include a flag in the header to specify which constant was used when a given .map file was created. In fact, there is no way to tell (as far as I know) from the binary file, except to try opening it with one constant or the other - and see whether you get garbage reads.
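
If I were to sketch that trial-and-error in code, it would look something like this. The binary parser itself is left as an interface, and the sanity checks and record fields are simplified stand-ins for the real Maq format:

    // Sketch of detecting which max-read-length constant a .map file used:
    // parse a few records assuming 64, sanity-check them, and fall back to
    // 128 if they look like garbage. The parser and checks are simplified
    // stand-ins - the real binary format has many more fields than this.
    import java.util.List;

    public class MapLenDetector {

        // A single parsed read; length and position are enough to sanity-check.
        static class Record { int length; int position; }

        // The actual gzipped binary parser is out of scope; plug in your own.
        interface MapParser {
            List<Record> parse(String file, int maxReadLen, int howMany) throws Exception;
        }

        static int detectMaxReadLen(String file, MapParser parser) throws Exception {
            for (int maxLen : new int[] {64, 128}) {
                List<Record> sample = parser.parse(file, maxLen, 100);
                if (looksSane(sample, maxLen)) return maxLen;
            }
            throw new Exception("neither 64 nor 128 yields sane records: " + file);
        }

        // Garbage parses show up as impossible lengths or negative positions.
        static boolean looksSane(List<Record> records, int maxLen) {
            if (records.isEmpty()) return false;
            for (Record r : records) {
                if (r.length <= 0 || r.length > maxLen || r.position < 0) return false;
            }
            return true;
        }
    }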

Anyhow, I figured I'd share that, in case anyone else is looking to use Maq 0.7.1.

Otherwise, the only other map-file-related difference between the versions (it was pointed out to me, but I had the opportunity to observe it myself as well) is that Maq 0.7.1 no longer writes out the .map files as the reads are processed - everything is now held in memory until the complete set of alignments is done, and then dumped to disk at once. I'm not sure why that is, but it's an interesting difference, nonetheless.