Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Friday, August 15, 2008

Vacation time...

So, today's blog is a little bit different. First, I'm not going to do any serious blogging - I have lots to add to yesterday's conversation about indels, but I'm going to have to leave it for a while, and come back to it in a few weeks. After all, I have a few other things on my mind. I've been planning a vacation with my girlfriend for the past... oh, say... 6 months - and it's finally arrived.

Second, I'm going to experiment with un-moderated comments - with some luck, there won't be too much spam while I'm away. If it works out, I'll leave it that way once I get back as well. Hopefully, that will help encourage more open conversations, as comments should appear on the blog right away.

Third, I'm just going to provide a little bit of information on how to contact me - since I won't have email, you'll just have to do it the good old-fashioned way and mount an expedition. To track me down, you'll find several clues in the photos below, from someone who's already returned from the trip I'm about to take. Actually, the pictures are well worth looking through, if you need a reason to procrastinate - the photographer is fantastic:

Link to photos

Anyhow, I promise to reply to emails, tidy up comments and all the rest of that stuff when I get back. But for the next two weeks, this blog will be quiet. Happy end-of-August, everyone, and see you in September.


Thursday, August 14, 2008

Indel Calling

William left an interesting comment yesterday, which I figured was worthy of its own post, albeit a short one. (Are any of my posts really short, tho?) His point was that everyone in the genomics field is currently paying a lot of attention to SNPs, and very little attention to indels (insertions and deletions). And, I have to admit - this is true. I'm not aware of anyone really trying to do indels with Solexa reads, yet. But there are very good reasons for this.

The first is a lack of tools - most aligners aren't doing gapped alignments. Well, there's Exonerate and ZOOM, and possibly BLAST, but all of those have their problems. (Exonerate's gapped mode is very slow and poorly supported, and we had a very difficult time making some versions work at all. BLAST is completely infeasible for short reads on a reasonable time scale, and isn't guaranteed to give the right answer. ZOOM just hasn't been released yet. If there are others, feel free to let me know.) Nothing currently available will really do a good gapped short read alignment. And, if you've noticed the key words, "short read", you're on to the real reason why no one is currently working with indels.

Yep - the reads are just too short. If you have a 36bp read, or even, say, a 42bp read, the best case is having your indel right in the middle of the sequence. For the 42bp read, that gives you two 21-base sequences, one on each side of the gap. Think about that for a moment and let that settle in. How much of the genome is unique for 21bp reads, which may have 2 or more SNPs or sequencing errors? I'd venture to say it's 60% or so. With the 36bp read, you're looking at two 18bp reads, which is more like 40-50% of the genome. (Please don't quote me on those numbers - they're just estimates.) And that's best case.
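For anyone who wants to check numbers like these rather than take my guesses, here's a minimal Python sketch of how one might measure uniqueness on a sequence. The function name is hypothetical, and it makes simplifying assumptions: exact matches only, forward strand only, no allowance for SNPs or sequencing errors. Real figures for a genome would come out lower once mismatches are allowed.

```python
from collections import Counter


def unique_kmer_fraction(sequence: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once.

    Assumption: exact matching on the forward strand only.
    """
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and len(sequence)")
    # Every window of length k along the sequence.
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    # A position counts as unique if its k-mer is seen nowhere else.
    unique_positions = sum(1 for kmer in kmers if counts[kmer] == 1)
    return unique_positions / len(kmers)
```

Running it with k=18 and k=21 over the same sequence gives a rough feel for how much is lost when a read is split in half by a gap.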

If your gap is closer to the end, you'll get something more like a 32bp read and a 10bp read.... and I wouldn't trust a 10bp seed to give the correct match against the genome no matter what aligner you've got - especially if it comes from the "poor" end of an Illumina sequence.
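The arithmetic above is simple enough to sketch in a few lines of Python. The function names and the 18-base minimum seed length are my own assumptions for illustration, not a property of any particular aligner, and the model only considers a gap that splits the read into two flanks.

```python
def flank_lengths(read_length: int, gap_offset: int) -> tuple[int, int]:
    """Return (left, right) flank lengths for a gap gap_offset bases into the read."""
    if not 0 <= gap_offset <= read_length:
        raise ValueError("gap_offset must lie within the read")
    return gap_offset, read_length - gap_offset


def both_flanks_usable(read_length: int, gap_offset: int, min_seed: int = 18) -> bool:
    """True if both flanks are at least min_seed bases long.

    Assumption: min_seed=18 is an illustrative threshold only.
    """
    left, right = flank_lengths(read_length, gap_offset)
    return min(left, right) >= min_seed
```

With a 42bp read, a gap at offset 21 leaves two usable 21-base flanks, while a gap at offset 32 leaves a 10-base flank that fails the check.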

So that leaves you with two options: use a paired end tag, or use a longer read.

Paired end tags (PET) have been around for a couple months, now. We're still trying to figure out the best way of using the technology, but it's coming. People are mostly interested in using them for other applications - gross structural abnormalities, inversions, duplications, etc. - but indels will be in there. It should be a few more months before we really see some good work done with PETs in the literature. I know of a couple of neat applications already, but a lot of the difficulty was just getting a good PET aligner going. Maq is there now, and it does an excellent job, although post-processing the .map files for PET is not a lot of fun. (I do have software for it, tho, so it's definitely a tractable problem.)

Longer reads are also good. Once we get gaps with enough bases on either side to do double-seed searches, we'll get fast indel-capable aligners - and I'm sure they're coming. There are long reads being attempted this week at the GSC. I don't know anything about them, or the quality, but if they work, I'd expect to see a LOT more sequences being generated like this in the future, and a lot more attention being paid to indels.
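To make the double-seed idea concrete, here's a toy Python sketch. It is not how any existing aligner works, just one way the idea could look: take a seed from each end of the read, look both up in a k-mer index, and flag placements where the seeds sit closer together or further apart in the reference than they do in the read. All names and the max_gap default are hypothetical, and the seeds must match exactly.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def double_seed_hits(read: str, reference: str, k: int, max_gap: int = 10) -> list:
    """Return (left_pos, right_pos, kind, size) for seed pairs implying an indel.

    Assumptions: exact seed matches, forward strand, one indel between the seeds.
    """
    if len(read) < 2 * k:
        raise ValueError("read must be at least 2*k bases long")
    index = build_index(reference, k)
    left_seed, right_seed = read[:k], read[-k:]
    expected = len(read) - k  # distance between seed starts with no indel
    hits = []
    for left_pos in index.get(left_seed, []):
        for right_pos in index.get(right_seed, []):
            if right_pos - left_pos < k:
                continue  # seeds overlap or are in the wrong order
            shift = (right_pos - left_pos) - expected
            if shift != 0 and abs(shift) <= max_gap:
                # Seeds further apart in the reference: read is missing bases.
                kind = "deletion" if shift > 0 else "insertion"
                hits.append((left_pos, right_pos, kind, abs(shift)))
    return hits
```

The catch is exactly the one described above: each seed has to be long enough to land somewhere unique, which is why this only becomes practical with longer reads.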

So, I can only agree with William: we need to pay more attention to indels, but we need the technology to catch up first.

P.S. For 454 fans out there, yes, you do get longer reads, but I think you also need a lot of redundancy to show that the reads aren't artifacts. As 454 ramps up its throughput, we'll see both the Solexa and 454 platforms converge towards better data for indel studies.


Tuesday, August 12, 2008

SNP callers.

I thought I'd switch gears a bit this morning. I keep hearing people say that the next project their company/institute/lab is going to tackle is a SNP calling application, which strikes me as odd. I've written at least 3 over the last several months, and they're all trivial. They seem to perform as well as anyone else's SNP calls, and, if they take up more memory, I didn't think that was too big of a problem. We have machines with lots of RAM these days, and it's relatively cheap.

What really strikes me as odd is that people think there's money in this. I just can't see it. The barrier to creating a new SNP calling program is incredibly low. I'd suggest it's even lower than creating an aligner - and there are already 20 or so of those out there. There's even an aligner being developed at the GSC (which I don't care for in the slightest, I might add) that works reasonably well.
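To give a sense of how low that barrier is, here's a bare-bones SNP caller sketched in Python. It is not one of the callers mentioned above, just an illustration: it assumes ungapped alignments supplied as (start, sequence) pairs, ignores base qualities, and uses coverage and allele-fraction thresholds that are arbitrary defaults I picked for the example.

```python
from collections import Counter


def call_snps(reference: str, aligned_reads, min_coverage: int = 4,
              min_fraction: float = 0.8) -> list:
    """Return (position, ref_base, alt_base, coverage) for each called SNP.

    aligned_reads: iterable of (start, sequence) ungapped alignments.
    Assumption: thresholds are illustrative, and base qualities are ignored.
    """
    # One base counter per reference position.
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = []
    for pos, counts in enumerate(pileup):
        coverage = sum(counts.values())
        if coverage < min_coverage:
            continue
        base, count = counts.most_common(1)[0]
        # Call a SNP when the dominant base disagrees with the reference.
        if base != reference[pos] and count / coverage >= min_fraction:
            snps.append((pos, reference[pos], base, coverage))
    return snps
```

Everything that makes a caller good (quality scores, strand bias, heterozygotes) is left out, but the core really is a pileup and a threshold.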

I think the big thing that everyone is missing is that it's not the SNPs being called that's important - it's SNP management. In order to do SNP filtering, I have a huge PostgreSQL database with SNPs from a variety of sources, in several large tables, which have to be compared against the SNPs and gene calls from my data set. Even then, I would have a very difficult time handing off my database to someone else - my database is scalable, but completely un-automated, and has nothing but the psql interface, which is clearly not the most user-friendly. If I were going to hire a grad student and allocate money to software development, I wouldn't spend the money on a SNP caller and have the grad student write the database - I'd put the grad student to work on his own SNP caller and buy a SNP management tool. Unfortunately, it's a big project, and I don't think there's a single tool out there that would begin to meet the needs of people managing output from massively-parallel sequencing efforts.
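The kind of comparison I mean can be sketched as follows. This is not my actual schema: the table and column names are hypothetical, and the example uses Python's built-in sqlite3 module with an in-memory database instead of PostgreSQL, purely so it runs anywhere. The query itself is a plain left join that keeps only the called SNPs with no match among the known ones.

```python
import sqlite3


def novel_snps(called, known) -> list:
    """Return called SNPs (chrom, pos, ref, alt) absent from the known-SNP table.

    called: iterable of (chrom, pos, ref, alt)
    known:  iterable of (chrom, pos, ref, alt, source)
    Assumption: table layout is illustrative only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE known_snps "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, source TEXT)"
    )
    conn.execute(
        "CREATE TABLE called_snps (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    conn.executemany("INSERT INTO known_snps VALUES (?, ?, ?, ?, ?)", known)
    conn.executemany("INSERT INTO called_snps VALUES (?, ?, ?, ?)", called)
    rows = conn.execute(
        """
        SELECT c.chrom, c.pos, c.ref, c.alt
        FROM called_snps AS c
        LEFT JOIN known_snps AS k
          ON k.chrom = c.chrom AND k.pos = c.pos AND k.alt = c.alt
        WHERE k.pos IS NULL
        ORDER BY c.chrom, c.pos
        """
    ).fetchall()
    conn.close()
    return rows
```

One query like this is easy. The hard part is everything around it: loading each source, keeping coordinates consistent, and giving someone other than the author a way to run it.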

Anyhow, just some food for thought, while I write tools that manage SNPs this morning.

Cheers.


Monday, August 11, 2008

ZOOM-ing along.

I've recently been having a public conversation with the people at Bioinformatics Solutions Inc about their latest project - a short read aligner called ZOOM. While I joked about them only accepting resumes in Microsoft Word format (Update: Apparently that's already been taken off their web page), they do seem to have their act together in other respects. Of course, I haven't actually seen their application yet, and I have only their word to go by. However, what I am hearing seems promising. I suggest people go read that thread on SeqAnswers to get an idea of what they're offering.

Unfortunately (- or maybe not - ) the timing is a bit off for me. I'm in the process of clearing things off my desk for my two week vacation, starting Friday, and I believe they'll be releasing the demo version of their software next week sometime. Still, I suspect I won't let that bother me too much. Between coconuts, hammocks and snorkeling, I don't think I'll have much time to think about missing a new aligner. (Just a hunch.)

Disclosure: I suppose, since people might perceive this as a conflict of interest, I should point out that I do have several ties to the company. I believe Ming Li is one of their founders, and I did (briefly) work with him and Paul Kearney for a thesis project at the University of Waterloo close to a decade ago. A relation of mine works there, as well, though I don't have any insider information, so don't ask. I don't own stock or have any interest in the company otherwise. Bioinformatics is just too small of a community in Canada not to have some connections to anyone else doing bioinformatics here. Six degrees of separation, and all.


Sunday, August 10, 2008

National Geographic prices (way off topic.)

Sometimes I have to wonder whether Americans believe that the world ends at their border. Nothing against Americans, but why discriminate against Canadians? I wanted to subscribe to the National Geographic Traveler magazine, after seeing a copy of their magazine today.

Then I came across their pricing scheme:

Subscription for anywhere in the US: $10.
Subscription for places overseas: $36

The not so smart part: Subscription for Canadians: $34

I sent them the following email.



Hi,

I'm not sure who the correct person is to contact about pricing for the National Geographic Traveler magazine, but I really wanted to express my displeasure at discovering the horribly skewed subscription prices for Canadians.

I came across your magazine today, and thought it had fantastic articles and photography, and was sufficiently interested that I went online to subscribe. However, once I discovered the dramatic difference between the US and the Canadian prices, I decided to send this email instead.

After all, I live within 40km of the US border and use a currency currently trading at near par with the US Dollar... and yet, I'm being asked to pay 3x the price? Hardly credible! Even worse, we're closer than either Alaska or Hawaii, and are being asked to pay more than either of those states. The Canadian postal system can't possibly cost *that* much more than the US for magazines.

I hope National Geographic seriously reconsiders its pricing policy, and stops discriminating against a market of 36 million people.

Signature.


Thursday, August 7, 2008

MS Word formatted resumes?

I just took a look at Bioinformatics Solutions Inc's web page. They're the makers of the upcoming ZOOM (Zillions of Oligos Mapped) aligner software. Rumour says it's supposed to perform well. If I understood correctly, they're applying some neat pattern matching algorithms from their PatternHunter software to do ultra-fast short read gapped alignments.

Anyhow, I saw their careers page (no, I'm not currently looking for a job) and thought that was a great example of mis-purposed document formats. Somehow, I'd expect a bioinformatics company to be a bit more tuned into things like that. I also came to the realization that any software company that wants MS format docs instead of PDFs can't be a great place for a Linux tool-chain-based coder. (MS documents seem so 1990s... what does Google ask for?)

Considering I haven't even done my comprehensive exam yet, I guess I won't have to worry about that for a while.

And now, back to work.

[Update: Google is much more clued in: "PDF, HTML, or Microsoft Word documents or text formats are acceptable or you can submit using plain text format"]

programming with a loaded gun.

I had an anonymous comment the other day that started off like this:

Don't blame sloppy code practices on Perl; it would be just as easy to obfuscate the line you mention in another language.


That got me thinking about it, and I rapidly came to the conclusion that I don't think you can obfuscate code as much in another language as you can in Perl. Perl is like a loaded gun. Used wisely, you could win a biathlon, catch crooks or... um... defend your country. (I'm not actually a big fan of guns, so this metaphor is stretching things a bit for me.) Used irresponsibly, you could shoot yourself in the foot, rob a bank or invade Kuwait. The gun-lovers tell you it's not the gun that's responsible for the bad things that are done with it. The famous quote is that "Guns don't kill people, people kill people." Well, I don't want to get into a debate on the merits of guns, but as a tool, they need to be used wisely. I don't think anyone would debate that.

So, I argue, does Perl.

Giving a novice programmer Perl is like giving a teenager a loaded shotgun without a safety. The consequences are less dramatic, but equally irresponsible. You need training to use both. Without training, you write code that can never be understood, can't be trusted, and is unmaintainable. How do you correct a bug in code you can't wrap your head around?

Anyhow, yes, other languages can be used to obfuscate code, but Perl, in my humble opinion, is designed to give you a flexibility that just isn't found in other languages. How many other modern languages can be used to read in a variable, then use the content of that variable as the name of another variable? Certainly not C, Java or VB. And that's just my favorite example at the moment; I'm sure there are others.
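For readers who haven't run into this trick, here's a rough analogue sketched in Python rather than Perl (Python being one of the few other languages that will let you do it). The names are made up for illustration. The same one-line function is harmless when pointed at a private dictionary and dangerous when pointed at the program's real variables.

```python
def assign_by_name(namespace: dict, name: str, value) -> None:
    """Bind value to whatever variable name the string 'name' spells out."""
    namespace[name] = value


# Imagine this string was read from a file or typed in by a user.
user_supplied = "counter"

# Risky use: the namespace is the module's real variables, so a string
# decides which variable gets created or silently overwritten.
counter = 100
assign_by_name(globals(), user_supplied, 0)

# Safer use: dynamic names live in their own dictionary, so no real
# variable can be clobbered by unexpected input.
settings = {}
assign_by_name(settings, user_supplied, 0)
```

Nothing in the risky call looks wrong at a glance, which is the whole problem: the variable being changed never appears by name on the line that changes it.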

Where this is leading me is that all languages have problems, and all languages can be abused. A good program in a given language has certain traits, just as a bad one does. I'm working on my next blog post, in which I figure I should try to answer the question: what does good code look like in a given language, and - more entertaining for me - what does bad code look like? I've seen a lot of examples of each, lately, so I may as well share what I know.


Tuesday, August 5, 2008

Linux and you...

Over the past couple years, I've slowly been moving everyone I know towards Linux. I personally started playing with Linux back in 1997, when I bought my first computer - a 166MHz Pentium. I moved over to it full time (Slackware) in 2002, when I was at grad school, and deleted Windows 2000 from my computer about a year and a half later, in 2004. That was a pretty big milestone for me, but I haven't looked back.

In 2005, after the n-th round of virus deletions from my father's computer, I moved him to Fedora, and a year later to Ubuntu - and he loves it. We had a few problems in the beginning, mainly with poor network printing support and some strange modem configurations, but subsequent dist-upgrades have all solved those problems.

I converted my girlfriend's computer in 2006, when I thought Ubuntu was finally ready for her, with a clean interface, programs that met all her needs, and NTFS read/write built into the kernel.

The only computer left in the family is my step-mother's computer, which won't be a hard sell after this week. (40+ viruses in Windows, and she's been stuck using my father's computer since last Monday, so she's learning her way around the Linux desktop, now.)

Still, the greatest shock for me was my girlfriend's comments this afternoon. She went back to her Windows XP partition so that she could convert some movies to an iPod-compatible format and discovered that she hates Windows now! Windows is too slow, takes too long to boot up, has too many windows pop up (and that's just XP!)... the list goes on. I think I've infected someone else with the Linux bug. Better yet, she told me I should convince my step-mother to switch to Linux. I wish I could quote her exactly, because she said it quite eloquently:

"At first, I was skeptical, because I didn't want to have to learn anything complicated, but once I found out how easy it was, it was easy to switch."

Linux is truly ready for the desktop. Now, if only I could convince the IT people at work to upgrade from Red Hat Enterprise Linux 4 (circa 2005).
