Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Thursday, July 30, 2009

Science Online 2009 London.

I just saw the most awesome conference in a tweet from Daniel MacArthur: Science Online 2009 London. Unfortunately, a) it's in London, UK, which is a little too far for me to walk and b) they're already filled up with registrants. Fortunately, they will be streaming the whole conference on the web, which I'm highly tempted to buy into. (It costs 10 GBP... that's ~$20 CDN, which is vastly more reasonable than flying to London.)

The program has awesome events, including Blogging for impact (Speakers: Dave Munger, Daniel MacArthur), Author identity – Creating a new kind of reputation online (Speakers: Duncan Hull, Geoffrey Bilder, Michael Habib, Reynold Guida - I have to admit I don't know any of them... but I'll go look them up later), and Legal and Ethical Aspects of Science Blogging (Speakers: Petra Boynton, David Allen Green).

In fact, pretty much every session sounds like it will be worth the 10 pounds... If only I were in London.


I hate Facebook - part 2

I wasn't going to elaborate on yesterday's rant about hating Facebook, but several people made comments, which got me thinking even more.

My main point yesterday was that I hate Facebook because its protocols aren't open, and it consequently takes a "Walled Garden" approach to social networking. (Here's another great rant on the subject.) That's not to say that you can't work with it - there are plugins for Pidgin that let you chat on the Facebook protocol, and there are clients (as was pointed out to me) that will integrate your IMs with the Facebook chat for Windows. But that wasn't my point anyways.

My point is that it's creating its own separate protocols, which are each independent of the ones before it. In contrast to a service like Twitter, which exposes its data as plain XML that is easily manipulated, using Facebook requires you to work within their universe of standards. (I'm not the first person to come up with this - Google will find you lots of examples of other people blogging the same thing.)

On the whole, that's not necessarily a bad thing, but common, reusable standards are what drive progress.

For instance, without a common HTML standard, the web would not have flourished - we'd have many independent webs. If AOL had their way, they'd still have you dialing up into their own proprietary Internet.

Without a common electricity format, we'd have to pick the appropriate set of appliances for our homes with independent plugs - buying a hair dryer would be infinitely more painful than it needs to be.

Without a common word processing format, we'd suffer every time we tried to send a document to someone who's not using the same word processor that we do. (Oh wait, that's actually Microsoft's game - they refuse to properly support the one common document format everyone else uses.)

So, when it comes to Facebook, my hate is this - if they used a simple RSS feed for the wall, I could have used that instead of twitter on my site. If they used a simple Jabber format for their chat, I could have merged that with my google chat account. And then there's their private message system... well, that's just email, but not accessible by IMAP or POP.

What they've done is try to resurrect a business model that the web-unsavvy keep trying. In the short term, it's pure money. You drive people into it because everyone is using it. The innovative concept makes its adoption rapid and ubiquitous - but then you fall into the trap. The second generation of sites use open standards, and that allows FAR more cool things to be accomplished.

Examples of companies trying the walled garden approach on the net:

AOL and their independent internet, accessible only to AOL subscribers. Current Status: Laughable

Microsoft's Hotmail, where hotmail users can't export their email to migrate away. Current Status: GMail fodder.

Yahoo's communities. Current Status: irrelevant.

Wall Street Journal's new site. Current Status: ridiculed by people younger than 45.

Apple's i(phone/pod/tunes/etc). Current Status: Frequently hacked, forced to accept the de facto .mp3 format. (No Ogg yet...)

Ok, that's enough examples. All I have to say is that when Google (or anyone else) gets around to building a social networking site that's open and easy to play with, it won't be long before Facebook collapses.

The moral of the story? Don't invest too much in your Facebook profile - it'll be obsolete in a few years.


Wednesday, July 29, 2009

I hate facebook

I have a short rant to end the day, brought on by my ever increasing tie-in between the web and my desktop (now KDE 4.3):

I hate facebook.

It's not that I hate it the way I hate Myspace, which I hate because it's so easy to make horribly annoying web pages. It's not even that I hate it the way I hate Microsoft, which I hate because their business engages in unethical practices.

I hate it because it's a walled garden. Not that I have a problem with walled gardens in principle, but it's just so inaccessible - which is exactly what the Facebook owners want. If you can only get at Facebook through the Facebook interface, you have to see their ads, which makes them money if you ever get sucked into them. (You now have to manually opt out of having your picture used in ads for your friends... it's a new option for your profile in your security settings, if you don't believe me.)

Seriously, the whole facebook wall can be recreated with twitter, the photo albums with flickr, the private messages with gmail.... and all of it can be tied together in one place. Frankly, I suspect that's what Google's "Wave" will be.

If I could integrate my Twitter account with my wall on Facebook, that would be seriously useful - but why should I invest the energy to update my status twice? Why should I have to maintain my own web page AND the profile on Facebook?

Yes, it's a minor rant, but I just wanted to put that out there. Facebook is a great idea and a leader of its genre, but in the end, it's going to die if its community starts drifting towards equivalent services that are more easily integrated into the desktop. I can now update Twitter using an applet on my desktop - but Facebook still requires a login so that I can see their ads.

Anyhow, if you don't believe me about where this is all going, wait to see what Google Wave and Chrome do for you. I'm willing to bet desktop publishing will have a whole new meaning, and on-line communities will be a part of your computer experience even before you open your browser window.

For a taste of what's now on my desktop, check out the OpenDesktop, Remember the Milk and microblog (or even Choqok) plasmoids.


Aligner tests

You know what I'd kill for? A simple set of tests for each aligner available. I have no idea why we didn't do this ages ago. I'm sick of off-by-one errors caused by all sorts of slightly different formats available - and I can't do unit tests without a good simple demonstration file for each aligner type.

I know the SAM format should help with this - assuming everyone adopts it - but even for SAM I don't have a good control file.

I've asked someone here to set up this test using a known sequence - and if it works, I'll bundle the results into the Vancouver Package so everyone can use it.

Here's the 50-mer I picked to do the test. For those of you with some knowledge of cancer, it comes from TP53. It appears to BLAST uniquely to this location only.
>forward - chr17:7,519,148-7,519,197
CATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGG

>reverse - chr17:7,519,148-7,519,197
ccggcacccgcgtccgcgccatggccatctacaagcagtcacagcacatg
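
Before handing a control like this to any aligner, it is worth checking the control itself. The sketch below confirms that the reverse read really is the reverse complement of the forward read, and that the coordinate range spans exactly as many bases as the read is long. It assumes the coordinates in the headers are 1-based and inclusive; the helper names are made up for illustration and are not part of the Vancouver Package.

```python
# Sanity checks for the two control reads shown above.
FORWARD = "CATGTGCTGTGACTGCTTGTAGATGGCCATGGCGCGGACGCGGGTGCCGG"
REVERSE = "ccggcacccgcgtccgcgccatggccatctacaagcagtcacagcacatg"

# Coordinates as written in the FASTA headers
# (assumption: 1-based, inclusive on both ends).
START, END = 7_519_148, 7_519_197

# Translation table that swaps each base for its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def to_zero_based_half_open(start: int, end: int) -> tuple:
    """Convert a 1-based inclusive range to 0-based, end-exclusive."""
    return start - 1, end


if __name__ == "__main__":
    print(reverse_complement(FORWARD).lower() == REVERSE)
    print(END - START + 1 == len(FORWARD))
    print(to_zero_based_half_open(START, END))
```

The coordinate conversion is where off-by-one errors usually creep in: an aligner that reports 0-based, end-exclusive positions should place this read at the converted range rather than at the range in the header.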


Tuesday, July 28, 2009

From a report

I was trying to sum up some of the development work done on FindPeaks in April-June this year for a quarterly report and ended up writing the following text. Maybe someone will be inspired by it to give the package a shot. (=


FindPeaks now includes Control and Compare modes that identify features that differ in statistically significant ways between a sample and a control, or between two samples. In the Control mode, only those locations which differ significantly with greater enrichment in the sample are preserved, whereas Compare mode identifies areas of differing enrichment in both the sample and the control. This method uses peak pairing and linear regression methods that are symmetrical (resulting in identical peak pairing and statistics regardless of the order of the data sets presented). These methods can be used in a wide variety of situations including ChIP-Seq, RNA-Seq and even in copy number variation analysis for whole genome comparative analysis.
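
To illustrate what "symmetrical" means for a regression, here is a minimal sketch using orthogonal (total least squares) regression. The report does not say which regression FindPeaks uses, so this is an assumption chosen only because it has the order-independence property described above; the peak heights are made-up numbers.

```python
import math


def orthogonal_slope(xs, ys):
    """Slope of the total-least-squares line through paired values.

    Distances are measured perpendicular to the line, so neither
    data set is treated as the "independent" one.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("no covariance between the two data sets")
    diff = syy - sxx
    return (diff + math.sqrt(diff * diff + 4 * sxy * sxy)) / (2 * sxy)


# Hypothetical paired peak heights (not real FindPeaks output).
sample_heights = [12.0, 30.0, 45.0, 61.0, 80.0]
control_heights = [10.0, 28.0, 50.0, 58.0, 85.0]

slope_sample_vs_control = orthogonal_slope(sample_heights, control_heights)
slope_control_vs_sample = orthogonal_slope(control_heights, sample_heights)

if __name__ == "__main__":
    # Swapping the inputs gives the reciprocal slope, i.e. the same line.
    print(slope_sample_vs_control * slope_control_vs_sample)
```

With ordinary least squares the two slopes would multiply to the squared correlation coefficient rather than to 1, which is why the order of the data sets would matter there.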

FindFeatures is a new application in the FindPeaks/Vancouver Short Read Analysis Package that allows peaks identified by the FindPeaks application to be mapped to annotated features on the genome of interest contained in the Ensembl database. This tool set uses the peaks files produced by the FindPeaks application to convert the relevant locations to a generic, BED-like format, which can then be used to identify any genes (introns or exons) to which they map. This may also be used to identify areas upstream of genes, or in close proximity to other features of interest.
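
The mapping step can be pictured as a simple interval-overlap test. The sketch below is not FindFeatures code: the `Interval` type, the function names, and every peak and feature listed are hypothetical, and it assumes BED-style coordinates (0-based start, exclusive end).

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive (BED convention)
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def map_peaks_to_features(peaks, features):
    """Return {peak name: [names of the features it overlaps]}."""
    return {
        peak.name: [f.name for f in features if overlaps(peak, f)]
        for peak in peaks
    }


# Made-up peaks and annotations, for illustration only.
peaks = [
    Interval("chr1", 100, 200, "peak1"),
    Interval("chr1", 500, 550, "peak2"),
    Interval("chr2", 100, 200, "peak3"),
]
features = [
    Interval("chr1", 150, 400, "geneA_exon1"),
    Interval("chr1", 550, 700, "geneB_exon1"),
    Interval("chr2", 50, 120, "geneC_intron1"),
]

hits = map_peaks_to_features(peaks, features)

if __name__ == "__main__":
    for peak_name, feature_names in hits.items():
        print(peak_name, feature_names)
```

Note that `peak2` ends exactly where `geneB_exon1` starts, so with end-exclusive coordinates the two do not overlap. Finding regions upstream of a gene would amount to widening the feature interval before running the same test.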

Dates and misleading messages.

Here's an entertaining debugging challenge for people.

I was trying to get the history of code changes between April and June, so that I could write up a quick report for a working group at the GSC. I used the following command:
svn log -r {2009-06-31}:{2009-04-01}

and got the following error:
svn: Syntax error in revision argument '{2009-06-31}:{2009-04-01}'

After scratching my head for a while to figure out what the correct syntax was, and going through a ton of different threads on-line to figure out what the correct format should be, I finally figured out the error...

Are you ready for it?

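June doesn't have 31 days, and svn reports the impossible date as a syntax error rather than as a bad date. Replacing {2009-06-31} with a valid date such as {2009-06-30} solved the error. Oops!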
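
One way to avoid this kind of head-scratching is to validate the dates before building the revision argument. The following is a minimal sketch; the helper names are hypothetical and it only builds the `-r` argument string, it does not call svn.

```python
from datetime import date


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Return True only for dates that exist on the calendar."""
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def svn_date_range(newest, oldest) -> str:
    """Build an svn -r argument from two (year, month, day) tuples.

    date() raises ValueError for impossible dates, so a typo such as
    June 31 fails here with a clear message instead of inside svn.
    """
    newest_date = date(*newest)
    oldest_date = date(*oldest)
    return "{%s}:{%s}" % (newest_date.isoformat(), oldest_date.isoformat())


if __name__ == "__main__":
    print(is_valid_date(2009, 6, 31))
    print(svn_date_range((2009, 6, 30), (2009, 4, 1)))
```

The resulting string can be passed to `svn log -r` in place of the hand-typed range.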


Monday, July 27, 2009

How recently was your sample sequenced?

One more blog for the day. I was postponing writing this one because it's been driving me nuts, and I thought I might be able to work around it... but clearly I can't.

With all the work I've put into the controls and compares in FindPeaks, I thought I was finally clear of the bugs and pains of working on the software itself - and I think I am. Unfortunately, what I didn't count on was that the data sets themselves may not be amenable to this analysis.

My control finally came off the sequencer a couple weeks ago, and I've been working with it for various analyses (SNPs and the like - it's a WTSS data set)... and I finally plugged it into my FindPeaks/FindFeatures pipeline. Unfortunately, while the analysis is good, the sample itself is looking pretty bad. In looking at the data sets, the only thing I can figure is that the year and a half of sequencing chemistry changes has made a big impact on the number of aligning reads and the quality of the reads obtained. I no longer get a linear correlation between the two libraries - it looks partly sigmoidal.

Unfortunately, there's nothing to do except re-sequence the sample. But really, I guess that makes sense. If you're doing a comparison between two data sets, you need them to have as few differences as possible.

I just never realized that the time between samples also needed to be controlled. Now I have a new question when I review papers: How much time elapsed between the sequencing of your sample and its control?


Picard code contribution

Update 2: I should point out that the subject of this post has been resolved. I'll mark it down to a misunderstanding. The patches I submitted were accepted several days after being sent and initially rejected, once the purpose of the patch was clarified with the developers. I will leave the rest of the post here, for posterity's sake, and because I think that there is some merit to the points I made, even if they were misguided in their target.


Today is going to be a very blog-ful day. I just seem to have a lot to rant about. I'll be blaming it on the spider and a lack of sleep.

One of the things that thrills me about Open Source software is the ability for anyone to make contributions (above and beyond the ability to share and understand the source code) - and I was ecstatic when I discovered the Java-based Picard project, an open source set of libraries for working with SAM/BAM files. I've been slowly reading through the code, as I'd like to use it in my project for reading/writing SAM format files - which nearly all of the aligners available are moving towards.

One of those wonderful tools that I use for my own development is called Enerjy. It's an Eclipse plug-in designed to help you write better Java code by making suggestions about things that can be improved. A lot of its suggestions are simple: re-order imports to make them alphabetical (and more readable), fill in missing Javadoc tags, etc. They're not key pieces, but they are important to maintain your code's good health. It also points the way to things that will likely cause bugs (such as doing string comparisons with the "==" operator).

While reading through the Picard libraries and code, Enerjy threw more than 1600 warnings. It's not in bad shape, but it's got a lot of little "problems" that could easily be fixed. Mainly a lot of missing Javadoc, un-cast generic types, arrays being passed between classes and the like. As part of my efforts to read through and understand the code, which I want to do before using it, I figured I'd fix these details. Before ramping up into the more complex warnings, I wanted to start small while still making a contribution. Open source at its best, right?

The sad part of the tale is that open source only works when the community's contributions are welcome. Apparently, with Picard, code cleaning and maintenance isn't. My first set of patches (dealing mainly with the trivial warnings) were rejected. With that reception, I'm not going to waste my time submitting the second set of changes I made. That's kind of sad, in my opinion. I expressly told them that these patches were just a small start and that I'd begin making larger code contributions as my familiarity with the code improves - and at this rate, my familiarity with the code is definitely not going to mature as quickly, since I have much less motivation to clean up their warnings if they themselves aren't interested in fixing them.

At any rate, perhaps I should have known. Open source in science usually means people have agendas about what they'd like to accomplish with the software - and including contributions may mean including someone on a publication downstream if and when it does become published. I don't know if that was the case here: it was well within the project leader's rights to reject my patches on any grounds they like, but I can't say it makes me happy. I still don't enjoy staring at 1600+ warnings every time I open Eclipse.

The only lesson I take away from this is that next time I see "Open Source" software, I'll remember that just because it's open source, it doesn't mean all contributions are welcome - I should have confirmed with the developers before touching the code that they are open to small changes, and not just bug fixes. In the future, I suppose I'll be tempering my excitement for open source science software projects.

Update: A friend of mine pointed me to a link that's highly related. Anyone with an open source project (or interested in getting started in one) should check out this blog post titled Teaching people to fish.


Giant Spider...

Ok, way big diversion from my usual set of topics.

I came downstairs for a snack in the evening, slapped some cheese and tomatoes on a slice of bread, and then looked down at the floor when some movement caught my eye - and then ran for a glass. I'm not terrified of spiders, but this bugger was BIG.

After catching the spider, I looked online - I'm not used to finding spiders this size in Canada, and figured it might be something nasty. Indeed, my best classification for it is probably a Hobo Spider, which is actually a venomous spider. (So much for naively thinking there are no poisonous spiders in Canada!) It lacks the banded pattern on the legs - which I carefully investigated in the pictures I took before figuring out how to handle it.



At any rate, the spider was "ejected" from the house, and I spent some time making sure it hadn't invited any friends over for the party. And, I'm happy to report, there were no bites at the end of the exercise.


Sunday, July 26, 2009

PHP script for latest twitter tweet in HTML

One of my (many) projects this weekend was to sign up for twitter and then use it as a means of making micro updates to my web page. Obviously, it shouldn't be hard, but I had a lot of details to work out, and several tickets to have my hosting service upgrade to PHP5, and install the curl library (both of which were necessary for this hack to work).

Since it's all working now, I thought I'd share the source. This can obviously be modified, but for now, here's the script that's doing the job. Yes, bits of it were pulled from all over the web, and some of it was cobbled together by me. Obviously, you'll need to put the correct source for the feed, which is marked below as "http://twitter.com/..####.rss"

Enjoy!



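<?php
# INSTANTIATE CURL.
$curl = curl_init();

# CURL SETTINGS.
# Replace the placeholder URL with the RSS feed for your own account.
curl_setopt($curl, CURLOPT_URL, "http://twitter.com/..####.rss");
# Return the response as a string instead of printing it directly.
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
# A value of 0 means wait indefinitely for the connection.
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0);

# GRAB THE XML FILE.
$xmlTwitter = curl_exec($curl);
if ($xmlTwitter === false) {
    die("Could not fetch the feed: " . curl_error($curl));
}
curl_close($curl);

# SET UP XML OBJECT.
$xmlObjTwitter = simplexml_load_string( $xmlTwitter );
# The first <item> in the channel is the most recent tweet.
$item = $xmlObjTwitter -> channel -> item;
# Drop the first 8 characters of the title (presumably the account-name prefix).
$title = substr_replace($item -> title,'',0,8);
# Link for the feed's channel, available if you want to wrap the tweet in a link.
$url = $xmlObjTwitter -> channel -> link;
echo "Anthony tweets: {$title}";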
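
For anyone whose host lacks PHP5 or curl, the parsing half of the script can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a drop-in replacement: the feed below is a made-up sample in the general shape of an RSS 2.0 document, the account name `example` is hypothetical, and fetching the feed over HTTP is left out so the snippet runs offline.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real feed would be downloaded from the RSS URL.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Twitter / example</title>
    <link>http://twitter.com/example</link>
    <item>
      <title>example: Testing the feed parser</title>
      <link>http://twitter.com/example/statuses/1</link>
    </item>
    <item>
      <title>example: An older tweet</title>
    </item>
  </channel>
</rss>"""


def latest_tweet(feed_xml: str, username: str):
    """Return (tweet text, channel link) for the first item in the feed."""
    channel = ET.fromstring(feed_xml).find("channel")
    first_item = channel.find("item")  # first <item> = most recent entry
    title = first_item.findtext("title", default="")
    # Strip the "username: " prefix by name instead of by a fixed length.
    prefix = username + ": "
    if title.startswith(prefix):
        title = title[len(prefix):]
    return title, channel.findtext("link", default="")


if __name__ == "__main__":
    text, link = latest_tweet(SAMPLE_FEED, "example")
    print("Latest tweet: {} ({})".format(text, link))
```

Stripping the prefix by name rather than by a fixed character count means the snippet does not need editing if the account name changes length.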


Thursday, July 23, 2009

What I wish people had told me when I started graduate studies

I suggested to my friend (who finished his master's degree last year) that he should know everything there is to know about grad school and should write a book on it - and he declined.

So, I spent some time thinking this morning about what advice I'd pass along to grad students who are just starting out, which resulted in this post: 10 things I'd like to tell new grad students. If I do a second post on the topic, it might include advice specifically targeted for bioinformatics students, but that's a long way off. Either way, please feel free to add comments on this post. I'm sure I have missed things.

At any rate, this advice is for people who are already in grad school or are about to start. None of this will convince you to stay here if you're doubting yourself (we all did), or if you're trying to weather the ups and downs (we all have them too...). There are already excellent resources out there for that, to which I can't hope to add anything. So, for those of you who are ready to stick it out, here's my advice:

  1. Keep good notes! Believe it or not, just about everything you ever do as a graduate student will some day be useful - you'll be writing a presentation and realize you just need that one number you collected in your first 8 months in the lab, or you'll remember that you did some long drawn out set of scripts at one point and running them again would be useful - if only you'd written them down... I guarantee that good and complete notes will knock at least a month or two off the time you spend in school.

  2. Keep yourself motivated. I won't beat around the bush here. There will be days you don't want to work. There might even be weeks or months. Sometimes it'll be burnout, sometimes it'll just be depression (experiments don't always work) and sometimes it'll just be plain laziness. Whatever it is, you're going to have to find a way to power through it and keep your momentum. One of the best pieces of advice I've had is to start each day or week by setting goals and then holding myself to them. You can seriously improve your work ethic this way.

  3. Keep sight of the big picture. Remember that you're working on something innovative and creative, and the little details like software bugs and optimizing protocols are just stepping stones along the way. It's easy to get sucked into the mindset that your job is only those details, but if you keep sight of the big picture, you'll have an easier time evaluating what it is you really need to be doing. (Will that bug help you get a paper? No? Maybe you should spend more time writing out a more efficient algorithm anyway...) Your goal is to graduate with some cool research - and you shouldn't be working on the details unless they get you there - or accomplish a goal that is clearly useful to you.

  4. Learn everything you can. One of those magic things about grad school is that you're given free rein to learn. Use it! Learn new software applications, learn new subjects, learn new hobbies. You never know what will be a help to you in the future or will help you get a job down the road. I learned to snowboard while doing my master's degree, and it turned out to be the only way to get sunshine in January in Vancouver, which is probably the only reason I don't get S.A.D. every year (and spend a month or two without the ability to get things done). Seriously - just about every skill you pick up will help you out somehow... one day.

  5. Build your community. Grad school is also one of those few times in your life when everyone is willing to talk to you. You've already established you're not a fool by finishing your undergrad and getting into graduate school, so nearly everyone will be willing to give you the time of day and most of them will even be willing to give you advice on how to make your life easier. You never know who might be able to help you later in life, and you never know who might turn into a great friend. Go forth and meet your peers! They'll provide support when (not if) you need it, and can help you work through your problems - and remember, your peers aren't all in the same lab or university as you.

  6. Don't be afraid to take/ignore advice. Ok, so you asked the Nobel prize winner that hangs out in the cafeteria for some advice on your project. First of all, remember that it's just advice. If you're doing science, no one actually knows the answer (unless they beat you to the experiment), so take all the advice you get with a grain of salt. Remember to judge the advice upon its merits, rather than upon the status of the person who gives it. Nobel prize winners can give advice that's every bit as bad as the next person, and even a "lowly undergrad" can point you in the right direction now and then. Even if the Nobel prize winning scientist tells you it's the dumbest thing they've ever heard, don't be afraid to test it out now and then. Of course, don't forget that sometimes dumb ideas really are dumb, and sometimes it is worth listening to the advice. A good scientist will eventually figure out the difference and know which leads to test out. In the meantime, be courageous and try to be insightful when evaluating what people tell you.

  7. Don't fall prey to false economy reasoning. I've gotten used to hearing comments like "I don't have time to clean up my code - I need to put in new features!" or "I can't learn how to use a new tool, I just need the results." This is called false economy: when you forget that investments pay dividends in the long term, and only take the short term goals into account. For one good example, I learned how to use professional desktop publishing software, which forced me to spend a whole week making a single poster. Ever since then, I've been able to reuse the template and rely on that learned skill set to bang out posters in about 2 days each - and I've done 7 or 8 since. By investing the time, I've become much more productive in the long term. (And yes, spending a week on cleaning up your code will save you weeks or months on debugging and ease of future coding.)

  8. Be assertive about what you want. Many beginning grad students are intimidated by the people around them just because everyone else is more senior, and that can be devastating to your own ambitions. You can easily get sucked into other people's projects, goals or even politics if you're afraid to strike out on your own. Remember that you are an individual and you have your own tasks to accomplish. Be assertive about what you think will help you achieve those goals and get you through the project. On the other hand, don't forget to respect your peers - you probably won't make it through if you piss them all off.

  9. Nothing will be perfect. Several years in, this one still gets me. No publication is ever perfect, your results don't need to be bulletproof before you publish them and posters will never tell the whole story. Do the best you can, and try to do as much as you can, and revisions will help you fix everything else up as you go. That is to say, pay attention to the details (don't be sloppy), but don't expect to have every last detail in place before you start writing it up.

  10. Write lots. As a grad student, your productivity is measured by your ability to publish what you've done. However, that's not the only measure you should take of your time as a grad student. Write as much as you can, and practice your communication skills. The more you write, the more you learn about how to get your point across. Write emails to build up your network, write a blog to practice getting your point across, write publications to build your resume and write notes to cultivate it as a habit. The more you write, and it doesn't really matter what you write, the better off you'll be.



So, there you have it. 10 things you can do that will enhance your grad student experience. If you need more advice, try this site, which has more collected advice than any other place I've ever seen.

Good luck and good publishing!


Friday, July 17, 2009

Community

This week has been a tremendous confluence of concepts and ideas around community. Not that I'd expect anyone else to notice, but it really kept building towards a common theme.

The first was just a community of co-workers. Last week, my lab went out to celebrate a lab-mate's successful defense of her thesis (Congrats, Dr. Sleumer!). During the second round of drinks (Undrinkable dirty martinis), several of us had a half hour conversation on the best way to desalinate an over-salty martini. As weird as it sounds, it was an interesting and fun conversation, which I just can't imagine having with too many people. (By the way, I think Obi's suggestion wins: distillation.) This is not a group of people you want to take for granted!

The second community related event was an invitation to move my blog over to a larger community of bloggers. While I've temporarily declined, it raised the question of what kind of community I have while I keep my blog on my own server. In some ways, it leaves me isolated, although it does provide a "distinct" source of information, easily distinguishable from other people's blogs. (One of the reasons for not moving to the larger community is the lack of distinguishing marks - I don't want to sink into a "borg" experience with other bloggers and just become assimilated entirely.) Is it worth moving over to reduce the isolation and become part of a bigger community, even if it means losing some of my identity?

The third event was a talk I gave this morning. I spent a lot of time trying to put together a coherent presentation - and ended up talking about my experiences without discussing the actual focus of my research. Instead, it was on the topic of "successes and failures in developing an open source community" as applied to the Vancouver Short Read Analysis Package. Yes, I'm happy there is a (small) community around it, but there is definitely room for improvement.

Anyhow, at the risk of babbling on too much, what I really wanted to say is that communities are all around us, and we have to seriously consider our impact on them, and the impact they have on us - not to mention how we integrate into them, both in our work and outside. If you can't maximize your ability to motivate them (or their ability to motivate you), then you're at a serious disadvantage. How we balance all of that is an open question, and one I'm still working hard at answering.

I've attached my presentation from this morning, just in case anyone is interested. (I've decorated it with pictures from the South Pacific, in case all of the plain text is too boring to keep you awake.)

Here it is (it's about 7 MB.)


Thursday, July 9, 2009

New Tool: KeepNote

Obviously I haven't updated much here lately - I've been pretty busy and inspiration hasn't struck me much in the last few days to get anything written. However, I started using some new software this morning, and I'm enjoying it so much I figured I have to share.

One of the big problems I have, as a bioinformatician, is keeping track of all the notes and one-off scripts I write. I don't want to use SVN, because it's just a repository with no organization. I don't want to use a wiki, because it's a huge hassle to maintain for small projects, and I hate using text files.

The compromise, it seems, is to use standards-compliant files with a hell of a wrapper around them that does the organization for you, and the one I found is called KeepNote. The project page and downloads can be found at http://rasm.ods.org/keepnote/. The software is available for all major OSes (Linux, Mac and even Windows), and can be installed relatively quickly and (for the most part) painlessly. (Linux builds are missing a library in the dependencies, but that can be figured out pretty quickly - just apt-get the missing lib and re-install if you hit this problem.)

While it may not fit everyone's workflow, my few hours of using it have already helped me get my tools organized and assembled in a logical manner, and it's allowed me to remove a load of files from my desktop. It still has rough edges: I had to manually configure the web browser, text editor and such before I could get started, but so far I haven't hit any actual bugs.

It also claims to help you organize notes - which I can clearly see. Next time I go to a conference, I'll be using this for recording and organizing the usual 30-40 pages of notes I take.

For me, this falls under the heading of required tools for bioinformaticians and students alike and I look forward to seeing the project evolve and grow.
