Thanks for visiting my blog - I have now moved to a new location at Nature Networks: http://blogs.nature.com/fejes. Please come visit my blog there.

Wednesday, February 25, 2009

Microsoft Sues TomTom over patents

I saw a link to Microsoft suing a Linux-based GPS maker, TomTom, which made me wonder what Microsoft is up to. Some people were saying that this is Microsoft's way of attacking Linux, but I thought not - I figured Microsoft probably had something more sly up its sleeve.

Actually, I was disappointed.

I went into the legal document (the complaint) to find out what patents Microsoft is suing over... and was astounded by how bad the patents are. Given the recent Bilski ruling, I think this is really Microsoft looking for a soft target against which it can test the waters and see how valid its patents are in the post-Bilski court environment... Of course, I think these are probably some of Microsoft's softest patents. I have a hard time seeing how any of them will stand up in court. (That is, pass the obviousness test and, simultaneously, the transformation test proposed in Bilski.)

If Microsoft wins this case, it'll be back to claiming Linux violates 200+ patents. If it loses the case, I'm willing to bet we won't hear that particular line of FUD again. I can't imagine any of the 200+ patents it says Linux violates are any better than the crap it's enforcing here.

Anyhow, for your perusal, if you'd like to see what Microsoft engineers have been patenting in the last decade, here are the 8 that Microsoft is trying to enforce. Happy reading:

6,175,789

Summary: Attaching any form of a computer to a car.

7,045,745

Summary: Giving driving instructions from the perspective of the driver.

6,704,032

Summary: Having an interface that lets you scroll and pan around, changing the focus of the scroll.

7,117,286

Summary: A computer that interacts or docks with a car stereo.

6,202,008

Summary: A computer in your car... with Internet access!

5,579,517

Summary: File names that aren't all the same length - in one operating system.

5,578,352

Summary: File names that aren't all the same length - in one operating system... again.

6,256,642

Summary: A file system for flash-erasable, programmable, read-only memory (FEProm).

Overwhelmed by the brilliance at Microsoft yet? (-;


Friday, February 20, 2009

Gairdner Symposium

Somehow, this year's Gairdner symposium completely managed to escape my notice until today, when a co-worker forwarded it along to me. For those of you who don't know the Gairdner awards, I believe they're roughly the Canadian equivalent of the Swedish Nobel Prize, although only for medicine and medical sciences. Since I wasn't aware of them until just a few years ago, I don't think they have quite the same level of recognition, but the more I look into it, the more I discover that the award carries just about as much weight: 78 of the 298 award winners have gone on to win Nobel Prizes.

At any rate, this year is the 50th anniversary of the foundation (although there have only been 48 years of prizes, apparently), so they're putting on one heck of a show. The Vancouver symposium will be a three-part event at the Chan auditorium at UBC, in which 4 Nobel Laureates will take part in discussions ranging from personal medicine to the future of the field of health care.

Anyhow, if you're in Vancouver on March 11th, I highly recommend you get yourself a set of tickets. They're available for free from Ticketmaster. (Yes, they will charge you for free tickets that you still have to print yourself. Ticketmaster blows chickens.) Here are the links:

Session 1: Gairdner Symposium - The Future of Medicine (morning session)
Session 2: Gairdner Symposium - The Future of Medicine (afternoon session)
Session 3: 2009 Michael Smith Forum - Personal Medicine (evening session)

I'm already excited about it!


Wednesday, February 18, 2009

Three lines of Java code you probably don't want to write

Ah, debugging. Ironically, it's one of the programming "skills" I'm good at. In fact, I'm usually able to debug code faster than I can write it - which leads to some interesting workflows for me. Anyhow, today I managed to really mess myself up, and it took several hours of rewriting code and debugging to figure out why. In the end, it all came down to three lines, which I hadn't looked at carefully enough - in any of the 8-10 times I went over that subroutine.

The point was to transfer all of the reads in the local buffer back into the buffer_ahead, preserving the order - they need to be at the front of the queue. The key word here was "all".
In any case, I thought I'd share it as an example of what you shouldn't do. (Does anyone else remember the Berenstain bears books? "Son, this is what you should not do, now let this be a lesson to you.")

Enjoy:
for (int r = 0; r < buffer_ahead.size(); r++) {
    buffer_ahead.add(r, local_buffer.remove(0));
}
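
For what it's worth, here's roughly what the loop was supposed to do. (This is my guess at the fix, assuming both buffers are java.util.List objects holding reads; it's a sketch of the intent, not the code that actually went into FindPeaks.)

// Intended behaviour: move every read from local_buffer to the front of
// buffer_ahead, preserving the order. The original loop bounds itself on
// buffer_ahead.size(), which grows by one on every insertion, so it either
// does nothing (if buffer_ahead starts out empty) or keeps going until
// local_buffer runs dry and remove(0) throws.
buffer_ahead.addAll(0, local_buffer); // insert all reads at the head, in order
local_buffer.clear();                 // and empty the local buffer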


Tuesday, February 17, 2009

FindPeaks 3.3

I have to admit, I'm feeling a little shy about writing on my blog, since the readership jumped in the wake of AGBT. It's one thing to write when you've got 30 people reading your blog... and yet another thing when there are 300 people reading it. I suppose if I can't keep up the high standards I've been trying to maintain, people will stop reading it, and then I won't have anything to feel shy about... Either way, I'll keep doing the blog because I'm enjoying the opportunity to write in a less-than-formal format. I do have a few other projects on the go as well, which include a few more essays on personal health and next-gen sequencing... I think I'll aim for one "well-thought-through essay" a week, possibly on Fridays. We'll see if I can manage to squeeze that in as a regular feature from now on.

In addition to blogging, the other thing I'm enjoying these days is the programming I'm doing in Java for FindPeaks 3.3 (which is the unstable version of FindPeaks 4.0). It's taking a lot longer to get going than I thought it would, but the effort is starting to pay off. At this point, a full ChIP-seq experiment (4 lanes of Illumina data + 3 lanes of control data) can be fully processed in about 4-5 minutes. That's a huge difference from the 40 minutes it would have taken with previous versions - and those handled the sample data only.

Of course, the ChIP-seq field hasn't stood still, so a lot of this is "catch-up" to the other applications in the field, but I think I've finally gotten it right with the stats. With some luck, this will be much more than just a catch-up release, though. It will probably be a few more days before I produce a 4.0 alpha, but it shouldn't be long, now. Just a couple more bugs to squash. (-;

At any rate, in addition to the above subjects, there are certainly some interesting things going on in the lab, so I'll need to put more time into those projects as well. As a colleague of mine said to me recently, you know you're doing good work when you feel like you're always in panic mode. I guess this is infinitely better than being underworked! In case anyone is looking for me, I'm the guy with his nose pressed to the monitor, fingers flying on the keyboard and the hunched shoulders. (That might not narrow it down that much, I suppose, but it's a start...)


Monday, February 16, 2009

EMBL Advanced Course in Analysis of Short Read Sequencing Data

This email showed up in my mailbox today, and I figured I could pass it along. I don't know anything about it other than what was shown below, but I thought people who read my blog for ChIP-Seq information might find it... well, informative.

I'm not sure where they got my name from. Hopefully it wasn't someone who reads my blog and thought I needed a 1.5 day long course in ChIP-Seq! (-;

At any rate, even if I were interested, putting the workshop in Heidelberg is definitely enough to keep me from going. The flight alone would be about as long as the workshop. Anyhow, here's the email:




Dear colleagues,

We would like to announce a new course that we will be having in June 2009 addressed to bioinformaticians with basic understanding of next generation sequencing data.

The course will cover all processing steps in the analysis of ChIP-Seq and RNA-Seq experiments: base calling, alignment, region identification, integrative bioinformatic and statistical analysis.

It will be a mix of lectures and practical computer exercises (ca. 1.5 days and 1 day, respectively).

Course name: EMBL Advanced Course in Analysis of Short Read Sequencing Data
Location: Heidelberg, Germany
Period: 08 - 10 June 2009
Website: link
Registration link: link
Registration deadline: 31 March 2009

Best wishes.

Looking forward for your participation to the course,
Adela Valceanu
Conference Officer
European Molecular Biology Laboratory
Meyerhofstr. 1
D-69117 Heidelberg


Saturday, February 14, 2009

Art for sale... totally un-related to science

When I started this blog, I had planned to use it to post my photographs and other art projects. Clearly that didn't work out how I expected it to. Anyhow, I thought I'd jump back into that mode for a brief (Weekend) post, and put up a picture of a painting I did a few years ago, and am now trying to sell.


I'm reasonably sure that no one will want it, but I think it'll be an interesting experiment to post it on craigslist and see if anyone is willing to pay for it. Hey, if Genome Canada doesn't get funding for a few more years, I might have another career to fall back on. (-;


Collection of all of the AGBT 2009 notes

I've had several requests for a link to all of my notes from AGBT 2009, so - after some tweaking and relabeling - I've managed to come up with a single link to all of the AGBT postings. (There are a few very sparse postings from AGBT 2008, but they don't contain much information that's really useful.)

Anyhow, if you'd like the link to all of my notes, you can find them here: http://www.fejes.ca/labels/AGBT%202009.html


Friday, February 13, 2009

Time for a new look

If you read my blog on my web page, rather than through feeds, you might notice that the page looks different today. I had some feedback from other bloggers at AGBT (in particular, Daniel MacArthur of Genetic Future), who made some suggestions that should have been simple to implement. Unfortunately, my template is so customized that it's nearly impossible to make the changes without breaking the layout.

So, I figured the best thing to do was to clean up and start from scratch. If you notice odd changes in the template, that would be why. (=

In the meantime, there's lots of good FindPeaks news - including controls, which are starting to work, new file formats, and a HUGE speed increase. (Whole genome runs in 6 minutes with a control... wow.)

Anyhow, I've got lots to do - and don't mind the blog template tinkering, 6 minutes at a time.


Wednesday, February 11, 2009

Epidemiology and next-generation(s) sequencing.

I had a very interesting conversation this morning with a co-worker, which turned into a full-fledged discussion of how next-generation sequencing will end up spilling out of the research labs and into the physician's office. My co-worker originally suggested it will take 20 years or so for that to happen, which seems off to me. While most inventions take a long time to get going, I think next-gen sequencing will cascade over into general use a lot more quickly than people appreciate. Let me explain why.

The first thing we have to acknowledge is that pharmaceutical companies have a HUGE interest in making next-gen sequencing work for them. In the past, pharma companies might spend millions of dollars getting a drug candidate to phase 2 trials, and it's in their best interest to push every drug as far as they can. Thus, any drug that can be "rescued" from failing at this stage will decrease the cost of getting drugs to market and increase revenues significantly for the company. With the price of genome sequencing falling to $5000/person, it wouldn't be unreasonable for a company to sequence 5,000-10,000 genomes from the phase 3 trial candidates, as insurance. If the drug seems to work well for a population associated with a particular set of traits, and not well for another group, it's a huge bonus for the company in getting the drug approved. If the drug causes adverse reactions in a small population of people associated with a second set of traits, even better - they'll be able to screen out adverse responders.

When it comes to getting FDA approval, any company that can clearly specify who the drug will work for - who it won't work for - and who shouldn't take it will be miles ahead of the game, and able to fast track its application through the approval process. That's another major savings for the company.

(If you're paying attention, you'll also notice at least one new business model here: retesting old drugs that failed trials to see if you can find responsive sub-populations. Someone is going to make a fortune on this.)

Where does this meet epidemiology? Give it 5-7 years, and you'll start to see drugs appear on the shelf with warnings like "This drug is contraindicated for patients with CYP450 variant XXXX." Once that starts to happen, physicians will have very little choice but to start sending their patients for routine genetic testing. We already have PCR screens in the labs for some diseases and tests, but it won't be long before a whole series of drugs appear with labels like this, and insurance companies will start insisting that patients have their genomes sequenced once for $5000, rather than pay for 40-50 individual test kits that each cost $100.

Really, though, what choice will physicians have? When drugs begin to show up that will help 99% of the patients for whom they should be prescribed, but are contraindicated for certain genomic variants, no physician will be willing to accept the risk of prescribing without the accompanying test. (Malpractice insurance is good... but it only gets you so far!) And as the tests get more complex, and our understanding of the underlying cause and effect of various SNPs increases, this is going to quickly go beyond the treatment of single conditions.

I can only see one conclusion: every physician will have to start working closely with a genetic counselor of some sort, who can advise on the relative risks and rewards of various drugs and treatment regimens. To do otherwise would be utterly reckless.

So, how long will it be until we see the effects of this transformation on our medical system? Well, give it 5 years to see the first genetic contraindications, but it won't take long after that for our medical systems (on both sides of the border in North America) to feel the full effects of the revolution. Just wait till we start sequencing the genomes of the flu bugs we've caught to figure out which antiviral to use.

Gone are the days when the physician will be able to eye up his or her patient and prescribe whatever drug comes to mind off the top of their head. Of course, the hospitals aren't yet aware of this tsunami of information and change that's coming at them. Somehow, we need to get the message to them that they'll have to start re-thinking the way they treat individual people, instead of populations of people.


Monday, February 9, 2009

AGBT 2009 – Thoughts and reflections

Note: I wrote this last night on the flight home, and wasn't able to post it till now. Since then, I've gotten some corrections and feedback, which I'll go through, making corrections to my blog posts as needed. In the meantime, here's what I wrote last night.

****

This was my second year at AGBT, and I have to admit that I enjoyed this year a little more than the last. Admittedly, it's probably because I knew more people and was more familiar with the topics being presented than I was last year. Of course, comprehensive exams and last year's AGBT meeting were very good motivators to come up to speed on those topics.

Still, there were many things this year that made the meeting stand out, for which the organizing committee deserves a round of applause.

One of the things that worked really well this year was the mix of people. There were a lot of industry people there, but they didn't take over or monopolize the meeting. The industry people did a good job of using their numbers to host open houses, parties and sessions without seeming "short-staffed". Indeed, there were enough of them that it was fairly easy to find them to ask questions and learn more about the “tools of the trade.”

On the other hand, the seminars were mainly hosted by academics – so it didn't feel like you were sitting through half-hour infomercials. In fact, the sessions that I attended were all pretty decent, with a high level of novelty and entertainment. The speakers were nearly all excellent, with only a few that felt "average" in presentation quality. (I managed to take notes all the way through, so clearly I didn't fall asleep during anyone's talk, even if I had the occasional momentary zone-out caused by the relentless 9am-9pm talk schedule.)

At the end of last year's conference, I returned to Vancouver – and all I could talk about was Pacific Biosciences' SMRT technology, which dominated the "major announcement" factor for me for the past year. At this year's conference, there were several major announcements that really caught my attention. I'm not sure if it's because I have a better grasp of the field, or if there really was more in the "big announcement" category this year, but either way, it's worth doing a quick review of some of the major highlights.

Having flown in late on the first day, I missed the Illumina workshop, where they announced the extension of their read length to 250 bp, which brings them up to the same range as the 454 technology platform. Of course, technology doesn't stand still, so I'm sure 454 will have a few tricks up its sleeve as well. At any rate, when I started talking with people on Thursday morning, it was definitely the hot topic of debate.

The second topic that was getting a lot of discussion was the presentation by Complete Genomics, which I've blogged about – and I'm sure several of the other bloggers will be doing in the next few days. I'm still not sure if their business model is viable, or if the technology is ideal... or even if they'll find a willing audience, but it sure is an interesting concept. The era of the $5000 genome is clearly here, and as long as you only want to study human beings, they might be a good partner for your research. (Several groups announced they'll do pilot studies, and I'll be in touch with at least one of them to find out how it goes.)

And then, of course, there was the talk by Marco Marra. I'm still in awe about what they've accomplished – having been involved in the project officially (in a small way) and through many many games of ping-pong with some of the grad students involved in the project more heavily, it was amazing to watch it all unfold, and now equally amazing to find out that they had achieved success in treating a cancer of indeterminate origin. I'm eagerly awaiting the publication of this research.

In addition to the breaking news, there were other highlights for me at the conference. The first of many was talking to the other bloggers who were in attendance. I've added all of their blogs to the links on my page, and I highly suggest giving their blogs a look. I was impressed with their focus and professionalism, and learned a lot from them. (Tracking statistics, web layout, ethics, and content were among a few of the topics upon which I received excellent advice.) I would really suggest that this be made an unofficial session in the future. (you can find the links to their blogs as the top three in my "blogs I follow" category.)

The Thursday night parties were also a lot of fun – and a great chance to meet people. I had long talks with people from all over the industry, whom I might not otherwise have had a chance to ask questions. (Not that I talked science all evening, although I did apologize several times to the kind Pacific Biosciences guy I cornered for an hour and grilled with questions about the company and the technology.) And, of course, the ABI party where Olena got the picture in which Richard Gibbs has his arm around me is definitely another highlight. (Maybe next year I'll introduce myself before I get the hug, so he knows who I am...)

One last highlight was the panel session sponsored by Pacific Biosciences, in which Charlie Rose (I hope I got his name right) moderated a discussion on a range of topics. I've asked a guest writer to contribute a piece based on that session, so I won't talk too much about it. (I also don't have any notes, so I probably shouldn't talk about it too much anyhow.) It was very well done, with several controversial topics being raised and lots of good stones turned over. One point is worth mentioning, however: one of the panel guests was Eric Lander, who has recently come to fame in the public's eye for co-chairing a science advisory committee requested by the new U.S. President, Obama. This was really the first time I'd seen him in a public setting, and I have to admit I was impressed. He was able to clearly articulate his points, draw people into the discussion and dominate the discussion while he had the floor, but without stifling anyone else's point of view. It's a rare scientist who can accomplish all of that - I am now truly a fan.

To sum up, I'm very happy I had the opportunity to attend this conference, and I'm looking forward to seeing what the next few years bring. I'm going back to Vancouver with an added passion to get my work finished and published, to get my code into shape, and to keep blogging about a field going through so many changes.

And finally, thanks to all of you who read my blog and said hi. I'm amazed there are so many of you, and thrilled that you take the time to stop by my humble little corner of the web.


Saturday, February 7, 2009

Stephan Schuster, Penn State University - “Genomics of Extinct and Endangered Species”

Last year, he introduced nanosequencing of complete extinct species. What are the implications of extinct genomes for endangered species?

Mammoth: went extinct 3 times... 45,000 ya, 10,000 ya, and 3,500 ya. Woolly rhino: 10,000 years ago. Moa: 500 years ago (they were eaten). Thylacine: 73 years ago. And Tasmanian devils, which are expected to last only another 10 years.

Makes you wonder about dinosaurs.. maybe dinosaurs just tasted like chicken.

Looking at population structure and biological diversity from a genomic perspective. (Review of genotyping biodiversity.) The mitochondrial genome is generally higher copy number, and thus was traditionally the one used, but now with better sequencing, we can target nuclear DNA.

The mammoth mitochondrial genome has been done: ~16,500 bp, including ribosomal, coding and noncoding regions. In 2008, you can get 1000x coverage on the mitochondrial genome. You need the extra coverage to correct for damaged DNA.

This has now allowed 18 mammoth mitochondrial genome sequences: 20-30 SNPs between members of the same group, and 200-300 between groups. WAY more sequencing than is available for African elephants!

Have now switched to using hair instead of bone, and can use the hair shaft (not just the follicle).

Ancient DNA = highly fragmented. 300,000 sequences, 45% was nuclear DNA.

Now: Sequenced bases: 4.17Gb. Genome size is 4.7Gb. 77 Runs, got 32.6 million bases.

Can visit mammoth.psu.edu for more info.

Sequenced mammoth orthologs of human genes. Compared to Watson/Venter... rate of predicted genes per chromosome ("No inferences here"). Complete representation of the genome available. SAP = Single Amino acid Polymorphism.

(Discussion of divergence for mammoth.) Coalescence time for human and Neanderthal: 600,000 years. The same thing happens for mammoth, but it's not really well accepted because the biological evidence doesn't show it.

Did the same thing for the Tasmanian Tiger. Two complete genomes – only 5 differences between them.

Hair for one sample was taken from what fell off when preserved in a jar of ethanol!

Moa: did it from egg shell!

Woolly rhino: did the woolly rhino from hair – and did other rhinos too. (The woolly is the only extinct one.) Rhinos radiated only a million years ago, so they couldn't resolve the phylogenetic tree. Tried hair, horn, hoof, and bone... bone was by far the worst.

Now, to jump to the living: the Tasmanian devil. Highly endangered. In 1996 an infectious cancer was discovered (not figured out till 2004). Devils have been protected since 1941. Isolation with fences, islands, the mainland, an insurance population. Culling and vaccination are also possible.

Genome markers will be very useful. The problem is probably that there is nearly no diversity in the population. Sequenced museum sample devils, and showed mitochondrial DNA had more diversity in the non-living population.

The project for a full genome is now underway – two animals. (More information on plans for what to do with this data and how to save them.) SNP info for genotyping to direct the captive breeding program ("Project Arc"). Trying to breed resistant animals.


Len Pennacchio, Lawrence Berkeley National Laboratory - “ChIP-Seq Accurately Predicts Tissue Specific Enhancers in Vivo”

The noncoding challenge: 50% of GWAS hits are falling into non-coding regions. CAD and diabetes hits fall in gene deserts, so how do they work? Regulatory regions. Build a catalogue of distal enhancers.

Talk about Sonic Hedgehog, involved in limb formation. Regulation of expression is a million bases away from the gene. There are very few examples. We don't know if we'll find lots, or if this is just the tip of the iceberg. How do we find more?

First part: work going on in the lab for the past 3 years. Using conservation to identify regions that are likely involved. Using ChIP-Seq to do this.

Extreme conservation: either things conserved over huge spans (human to fish) or within a smaller group (human, mouse, chimp).

Clone the regions into vectors, put them in mouse eggs, and then stain for Beta-galactosidase. Tested 1000 constructs, 250,000 eggs, 6000 blue mice. About 50% of them work as reproducible enhancers. Do everything at whole mouse level. Each one has a specific pattern. [Hey, I've seen this project before a year or two ago... nifty! I love to see updates a few years later.]

Bin by anatomical pattern. Forebrain enhancers is one of the big “bins”. Working on forebrain atlas.

All data is at enhancer.lbl.gov. Also in the Genome Browser. There is also an Enhancer Browser. Comparative genomics works great at finding enhancers in vivo. No shortage of candidates to test.

While this works, it's not a perfect system. Half of the things don't express, and the system is slow and expensive. The comparative genomics also tells you nothing about where it expresses, so this is ok for wide scans, but not great if you want something random.

Enter ChIP-Seq. (Brief model of how expression works.) Collaboration with Bing Ren. (Brief explanation of ChIP-Seq.) Using Illumina to sequence. Looking at bits of mouse embryo. Did ChIP-Seq, got peaks. What's the accuracy?

Took 90 of the predictions and used the same assay. When p300 was used, now up to 9/10 of conserved sequences work as enhancers. Also tissue specific.

Summarize: using comparative gives you 5-16% active things in one tissue. Using ChIP-Seq, you get 75-80%.

How good or bad is comparative genomics at ranking targets? 5% of exons are constrained; almost all the rest are moderately constrained. [I don't follow this slide. Showing better conservation in forebrain and other tissues.]

P300 peaks are enriched near genes that are expressed in the same tissues.

Conclusion: p300 is a better way of predicting enhancers.
P300 occupancy circumvents the DNA-conservation-only approach.

What about negatives? For the ones that don't work, it's even better - the mouse orthologs bind, while the human sequence no longer binds in mice.

Conclusion II: Identified 500 more enhancers with the first method, and now a few runs done 9 months ago have yielded 5000 new elements using ChIP-Seq.

Many new things can be done with this system, including integrating it with GWAS.


Bruce Budowle, Federal Bureau of Investigation - “Detection by SOLiD Short-Read Sequencing of Bacillus Anthracis and Yersinia Pestis SNPs for Strain Id

We live in a world with a “heightened sense of concern.” The ultimate goal is to reduce risk, whether it's helping people with flooding, or otherwise. Mainly, they work on stopping crime and identifying threat.

Why do we do this? We've only had one anthrax incident since 2001... but bioterrorism has been used for 2000 years. (Several examples given.)

Microbial forensics. We don't just want knee-jerk responses. Essentially the same as any other forensic discipline - again, to reduce risk. This is a very difficult problem. Over 1000 agents are known to infect humans: 217 viruses, 538 bacterial species, 307 fungi, 66 parasitic protozoa. Not all are effective, but there are different diagnostic challenges. Laid out on the tree of life... it's pretty much the whole thing.

Biosynthetic technology. New risks are accruing due to advances in DNA synthesis. The risks are vastly outweighed by the benefits of synthesis... bioengineering also plays a role.

Forensic genetic questions:
What is the source?
Is it endemic?
What is the significance?
How confident can you be in the results?
Are there alternative explanations?

So, a bit of history on the “Amerithrax” case. A VERY complex case - it changed the way the government works on this type of case. Different preparations were in different envelopes.

Goals and Objectives:
could they reverse engineer the process? To figure out how it was done? No, too complex, didn't happen.

First sequencing – did a 15-locus, 4-colour genotyping system. It was not a validated process – but it helped identify the strain, which helped narrow down its origin. Some came from Texas, but it was more likely to have come from a lab than from the woods.

Identifying informative SNPs. You don't need to know the evolution – just the signature. That can then be used for diagnostics. Whole genome sequencing for genotyping was a great use, but back in 2001, most of this WGS wasn't possible. They got a great deal from TIGR – only $125,000 to sequence the genome from the Florida isolate: it took 2 weeks, and they found out interesting details about copy number of the plasmids. The major cost was then to validate and understand what was happening.

Florida was compared to Ames and to one from the UK, which gave only 11 SNPs. Many evolutionary challenges came up. The strain they used was “cured” of its plasmid, so it had evolved to have other SNPs... a very poor reference genome.

The key to identification: one of the microbiologists discovered that some cultures had different morphology. That was then used as another signature for identifying the source.

Limited Strategy: it didn't give the whole answer – only allows them to rule out some colonies. It would be more useful to sequence full genomes... so entered into deal with ABI SOLiD for genome sequencing.

Some features were very appealing. One of them is the emulsion PCR, which helped to improve the quality and reliability of the assay. The beads were useful too.

Multiplex value was very useful. Could test 8 samples simultaneously using barcoding, including the reference Ames strain. Coverage was 16x-80x, depending on DNA concentration. Multiple starting points gives more confidence, and to find better SNPs.

Compare to the reference: found 12 SNPs in the resequenced reference. When you look at the SNP data, you see that there's a lot of confidence if a variant shows up in both directions... whereas a false positive tends to turn up on only one strand. That becomes a major way to remove false positive results, and it was really only possible by using higher coverage.

Not going to talk about pestis... (almost out of time). Similar points: 130-180X coverage. Found a multidrug transporter in the strain, which has been a lab strain for 50 years. Plasmids were also at higher coverage. There were fewer SNPs in the North American strains, etc.
An interesting point: if you go to the reference in GenBank, there are known errors in the sequence. Several have been corrected, and the higher coverage was helpful in figuring out the real sequence past the errors.

$1000 /strain using multiplex, using equipment that is not yet available. This type of data really changes the game, and can now screen samples VERY quickly (a week).

Conclusions:
Every project is a large scale sequencing project.
Depth is good.
Multiplexing is good.
Keep moving to higher accuracy sequencing.


Andy Fire, Stanford University - “Understanding and Clinical Monitoring of Immune-Related Illnesses Using Massively-Parallel IgH and TcR Sequencing”

The story starts off with a lab that works on small RNAs, which they believe form a small layer of immunity. [Did I get that right?] They work in response to foreign DNA.

Joke Slide: by 2018, we'll have an iSequencer.

Question: can you sequence the immunome? [New word for me.] Showing a picture of lymphoma cells, which to me looks like a test to see if you're colour blind. There are patches of slightly different shades...

Brief intro to immunology. “I got an F in immunology as a grad student.” [There's hope for me, then!]
Overview of VDJ recombination, controlled by B-cell differentiation. This is really critical – responsible for our health. One model: if something recognizes both a virus and self, then you can end up with an autoimmune response.

There is a continuum based on this. It's not necessarily an either/or relationship.

There is a PCR/454 test for VDJ distribution. In some cases, you get a single dominating size class, and that is usually a sign of disease, such as lymphoma. You can also use 454 for this, since you need longer reads, and read the V, D and J units in the amplified fragment. Similar to email, you can get “spam”, and you can use similar technologies to drop the “spam” out of the run.

To show the results of the tests for B-cell recombination species, you put V on one axis, J on the other. D is dropped to make it more viewable. In lymphoma, a single species dominates the chart.

An interesting experiment – dilute with regular blood to see detection limit – it's about 1:100. For some lymphomas, you can't use these primers, and they don't show up. There are other primers for the other diseases.

So what happens in normal distributions? Did the same thing with VDJ (D included, so there are way more spots). Neat image... Do this experiment with two aliquots of blood from the same person and look for concordance. You find lots of spots fail to correspond well at the different time points, but many do.

On another project, Bone Marrow transplant. Recipient has a funny pattern, mostly caused by “spam” because the recipient really has very little immune system left. The patient eventually gets the donor VDJ types, which is a completely donor response. You can also do something like this for autoimmune disorders.

Malfunctioning Lymphoid cells cause many human diseases and medical side-effects. (several examples given.)


Keynote Speaker: Rick Wilson, Washington University School of Medicine - “Sequencing the Cancer Genome”

Interested in:
1. Somatic mutations in protein coding genes, including indels.
2. Also like to find: non-coding mutations, miRNAs and lincRNAs.
3. Like to learn about germ line variations.
4. Differential transcription and splicing.
5. CNV.
6. Structural variation.
7. Epigenetic changes.
8. Big problem: integrate all of this data... and make sense of it.

Paradigm for years: exon focus for large collections of samples. Example: EGFR mutations in lung cancer. A large number of patients (in some sample) had EGFR mutations. Further studies carry on this legacy in lung cancer using new technology. However, when you look at pathways, you find that the pathways are more important than individual genes.

Description of “The Cancer Genome Atlas”

Initial lists of genes mutated in cancer. Mutations were found, many of which were new. (TCGA Research Network, Nature, 2008)

Treatment-related hypermutation. Another example of TCGA's work: glioblastoma. Although they didn't want treated samples, in the end they took a look and saw that treated samples have interesting changes in methylation sites when MMR genes and MGMT were mutated. If you know the status of the patient's genome, you can better select the drug (e.g., not use an alkylation-based drug).

Pathways analysis can be done... looking for interesting accumulations of mutations. Network view of the Genome... (just looks like a mess, but a complex problem we need to work on.)

What are we missing? What are we missing by focusing on exons? There should be mutations in cancer cells that are outside exons.

Revisit the first slide: now we do “Everything” from the sample of patients, not just the list given earlier.

(Discussion of AML cancer example.) (Ley et al., Nature 2008)
Found 8 heterozygous somatic mutations and 2 somatic insertion mutations. Are they cause or effect?
The verdict is not yet in. Ultimately, functional experiments will be required.

There are things we're not doing with the technology: Digital gene expression counts. Can PCR up gene of interest from tumour, sequence and do a count: how many cells have the genotype of interest?
Did the same thing for several genes, and generally got a ratio around 50%.

Started looking at GBM1: 3,520,407 tumour variants passing the SNP filter. Broke down into coding non-synonymous/splice sites, coding silent, conserved regions, regulatory regions including miRNA targets, non-repetitive regions, and everything else (~15,000). Many of the first class were validated.

CNV analysis also done. Add coverage to sequence variants, and the information becomes more interesting. Can then use read pairs to find breakpoints/deletions/insertions.

What's next for cancer genomics? More AML (Doing more structural variations, non-coding information, more genomes), more WGS for other tumours, and more lung cancer, neuroblastoma... etc.

“If the goal is to understand the pathogenesis of cancer, there will never be a substitute for understanding the sequence of the entire cancer genome” – Renato Dulbecco, 1986

Need ~25X coverage of WGS tumour and normal – also transcriptome and other data. Fortunately, costs are dropping rapidly.


Peter Park, Harvard Medical School - “Statistical Issues in ChIP-Seq and its Application to Dosage Compensation in Drosophila”

(brief overview of ChIP-Seq, epigenomics again)

ChIP-Seq not always cost-competitive yet. (can't do it at the same cost as chip-chip)

Issues in analysis: generate tags, align, remove anomalous tags, assemble, subtract background, determine binding position, check sequencing depth.

Map tags in a strand-specific manner (like the directional flag in FindPeaks). Score tags accounting for that profile; this can be incorporated into the peak caller.

Do something called cross-correlation analysis (look at the peaks on both strands); use this to rescue more tags. Peaks get better if you add good data, and worse if you add bad data. Use it to learn something about histone modification marks. (Tolstorukov et al., Genome Research.)
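
For readers who haven't seen strand cross-correlation before, here's a rough sketch of the idea in Java. (This is my own toy illustration, not Peter Park's code; the variable names and the use of a simple unnormalized dot product are my assumptions.)

// Toy sketch: slide the minus-strand tag-start profile against the plus-strand
// profile and report the shift with the highest (unnormalized) cross-correlation.
// In ChIP-Seq, that shift roughly corresponds to the average fragment length.
public class StrandCrossCorrelation {
    // plusCounts[i] / minusCounts[i] = number of tags starting at position i on each strand
    public static int bestShift(int[] plusCounts, int[] minusCounts, int maxShift) {
        double bestScore = Double.NEGATIVE_INFINITY;
        int best = 0;
        for (int shift = 0; shift <= maxShift; shift++) {
            double score = 0;
            for (int i = 0; i < plusCounts.length && i + shift < minusCounts.length; i++) {
                score += plusCounts[i] * minusCounts[i + shift];
            }
            if (score > bestScore) {
                bestScore = score;
                best = shift;
            }
        }
        return best; // the estimated offset between plus- and minus-strand peaks
    }

    public static void main(String[] args) {
        int[] plus  = {0, 5, 9, 4, 0, 0, 0, 0, 0, 0};
        int[] minus = {0, 0, 0, 0, 0, 0, 4, 9, 5, 0};
        System.out.println("Best shift: " + bestShift(plus, minus, 8)); // prints 5 for this toy data
    }
}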

How deep to sequence? 10-12M reads is current - that's one lane on Illumina, but is it enough? What quality metric is important? Clearly this depends on the marks you're seeing (narrow vs. broad, noise, etc.), which brings you to saturation analysis. Shows no saturation for STAT1, CTCF, NRSF. [Not a surprise - we knew that a year ago... We're already using this analysis method. However, as you add new reads, you add new sites, so you have to threshold to make sure you don't keep adding new peaks that are insignificant. Oh, he just said that. Ok, then.]

Talking about using “fold enrichment” to show saturation. This allows you to estimate how many tags you need to get a certain tag enrichment ratio.

See paper they published last year.

Next topic: Dosage compensation.

(Background on what dosage compensation is.)

In Drosophila, the X chromosome is up-regulated in XY, unlike in humans, where the 2nd copy of the X is quashed in the XX genotype. Several models are available. Some evidence that there's something specific and sequence related. Can't find anything easily with ChIP-based methods – just too much information. Comparing the two, with ChIP-Seq you get sharp enrichment, whereas on ChIP-chip you don't see it. Seems to be a saturation issue (dynamic range) on ChIP-chip, and the sharp enrichments are important. You get specific motifs.

Deletion and mutation analysis. The motif is necessary and sufficient.

Some issues: the motif on X is enriched, but only by 2-fold. Why is X so much upregulated, then? Histone H3 signals seem depleted over the entry sites on the X chromosome. There may also be other things going on which aren't known.

Refs: Alekseyenko et al., Cell, 2008 and Sural et al., Nat Struct Mol Biol, 2008


Alex Meissner, Harvard University - “From reference genome to reference epigenome(s)”

Background on ChIP-Seq.

High-throughput Bisulfite Sequencing. At 72 bp, you can still map these regions back without much loss of mapping ability. You get 10% loss at 36bp, 4% at 47bp and less at 72bp.

This was done with a m-CpG cutting enzyme, so you know all fragments come with at least a single Methylation. Some update on technology recently, including drops in cost and longer reads, and lower amounts of starting material.

About half of the CpG is found outside of CpG islands.

“Epigenomic space”: look at all marks you can find, and then external differences. Again, many are in gene deserts, but appear to be important in disease association. Also remarkable is the degree of conservation of epigenetic patterns as well as genes.

Questions:
Where are the functional elements?
When are they active?
When are they available?

Also interested in Epigenetic Reprogramming (Stem cell to somatic cell).

Recap: Takahashi and Yamanaka: induced pluripotent stem cells with 4 transcription factors: Oct4, Sox2, c-Myc and Klf4. The general efficiency is VERY low (0.0001% - 5%). Why don't all cells reprogram?

To address this: ChIP-Seq before and after induction with the 4 transcription factors. Strong correlation between chromatin state and iPS. You clearly see that genes in open chromatin are responsive. Chromatin state in MEFs correlates with reactivation.

Is loss of DNA methylation at pluripotency genes the critical step to fully reprogram? Test the hypothesis that by demethylation, you could cause more cells to become pluripotent. Loss of DNA methylation was indeed shown to allow the transition to pluripotency. [Lots of figures, which I can't copy down.]

Finally: loss of differentiation potential in culture. Embryonic stem cell to neural progenitor, but eventually can not differentiate to neurons, just astrocytes. (Figure from Jaenisch and Young, Cell 2008)

Human ES cell differentiation: often fine in morphology, correct markers... etc etc, but specific markers are not consistent. Lose methylation and histone marks, which cause significant changes in pluripotency.

Can't yet make predictions, but we're on the way towards a future where you can assess cell type quality using this information.


Marco Marra's Talk

That was clearly the coolest thing we've seen so far. From genome to cancer treatment, which seems to have worked to reduce the tumour size... Wow. I was aware of the work, having been involved in a small way, but I wasn't aware of the outcome until just last night.

Mind Blowing. Personalized medicine is here.

**BREAKING NEWS** Marco Marra, BC Cancer Agency - “Sequencing Cancer Genomes and Transcriptomes: From New Pathology to Cancer Treatment.”

Why sequence the cancer-ome? Most important: treatment-response differences - to match treatments to patients. Going to focus on that last one.

Two anecdotes: neuroblastoma (Olena Morozova and David Kaplan), and papillary adenocarcinoma of the tongue, primary unknown. There's a 70-year difference in age. They have nothing in common except for the question: “can sequence analysis present new treatment options?”

Background on neuroblastoma. The most common cancer in infants, but not very common: 75 cases per year in Canada. Patients often have relapse and remission cycles after chemotherapy. There was little improvement until recently, when Kaplan was able to show the ability to enrich for tumour initiating cells (TICs). This gave a great model for more work.

Decided to have a look at Kaplan's cells, and did transcriptome libraries (RNA-Seq) using PET, and sequenced a flow cell's worth: giving 5 Gb of raw sequence from one sample, 4 from the other. Aligned to the reference genome using a custom database. (Probably Ryan Morin's?) Junctions, etc.

Variants found that are B-cell related. Olena found markers, worked out the lineage, and showed it was closer to a B-cell malignancy than to a brain cancer sample. These cells also form neuroblastomas when reintroduced into mice. So, is neuroblastoma like B-cell in expression? Yes, they seem to have a lot of traits in common. It appears as though the neuroblastoma is expressing early markers.

Thus, if you can target B-Cell markers, you'd have a clue.

David Kaplan verified and made sure that this was not contamination (several markers). He showed that yes, the neuroblastoma cells are expressing B-cell markers, and that these are not B-cells. Thus, it seems that a drug that targets B-cell markers could be used (Rituximab and Milatuzumab). So we now have an insight that we wouldn't have had before. (Very small sample, but lots of promise.)

Anecdote 2: an 80-year-old male with adenoma of the tongue. Possibly salivary gland origin? He has had surgery and radiation, and a CAT scan revealed lung nodules (no local recurrence). There is no known standard chemotherapy... so several guesses were made, and an EGFR inhibitor was tried... Nothing changed. Thus, the BC Cancer Agency was approached: what can genome studies do? They didn't know, but were willing to try. Genome from a formalin-fixed sample (which is normally not done), and a handful of WTSS libraries from fine-needle aspirates (nanograms, which required amplification). 134 Gb of aligned sequence across all libraries – about 110 Gb to the genome. (22X genome, 4X transcriptome.)

Data analysis, compared across many other in-house tumours, and looked for evidence of mutation. CNV was done from Genome. Integration with drug bank, to then provide appropriate candidates for treatment.

Comment on CNV: histograms shown. Showed that as many bases are found in single-allele regions as diploid, and then again just as many in triploid, and then some places at 4 and 5 copies. Was selective pressure involved in picking some places for gain, whereas much of the genome was involved in loss?

Investigated a few interesting high-CNV regions, one of which contains RET. Some amplifications are highly specific, containing only a single gene, even though they are surrounded by regions of copy number loss.

Looking at Expression level, you see a few interesting things. There is a lack of absolute correlation between changes in CNV and the expression of the gene.

When looking for intersection, ended up with some interesting features:
30 amplified genes in cancer pathways (kegg)
76 deleted genes in cancer pathways
~400 upregulated, ~400 downregulated genes
303 candidate non-synonymous SNPs
233 candidate novel coding SNPs
... more.

Went back to drugbank.ca (Yvonne and Jianghong?) When you merge that with target genes, you can find drugs specific to those targets. One of the key items on the list was RET.

Back to the patient: the patient was using an EGFR-targeting drug. Why weren't they responsive? It turns out that PTEN and RB1 are lost in this patient... (see literature... didn't catch the paper).

Pathway diagram made by Yvonne Li. It shows where mutations occur in pathways; gains and losses of expression are shown as well. Notice lots of expression from RET, and no expression from PTEN. PTEN negatively regulates the RET pathway. Also increases in Mek and Ras. This suggests that in this tumour, activation of RET could be driving things.

Thus, they came up with a short list of drugs. The favourite was Sunitinib. It's fairly non-specific, used for renal cell carcinoma, currently in clinical trials and being tested for other cancers. There are implications that RET is involved in some of those diseases (MEN2A, MEN2B and thyroid cancers). The RET sequence itself was not likely to be mutated in this patient.

CAT scans: response to Sunitinib and Erlotinib. When on the EGFR-targeting drug, the nodule grew. On Sunitinib, the cancer retreated!

Lots of Unanswered questions: Is RET really driving this tumour? Is drug really acting on RET? Is PTEN/RB1 loss responsible for erlotinib resistance in this tumour?

We don't think we know everything, but can we use genome analysis to suggest treatment: YES!

First question: how did this work with the ethics boards? How did they let you pass that test? Answer: this is not a path to treatment; it is a path towards making a suggestion. In some cancers there is something called hair analysis, which can be considered or ignored. Same thing here: we didn't administer anything... we just proposed a treatment.


Keynote Speaker: Rick Myers, Hudson-Alpha Institute - “Global Analysis of Transcriptional Control in Human Cells”

Talking about gene regulation – it has been well studied for a long time, but only recently on a genomic scale. The field still wants comprehensive, accurate, unbiased, quantitative measurements (DNA methylation, DNA binding proteins, mRNA), and they want them cheap, fast and easy to get.

Next gen has revolutionized the field: ChIP-Seq, mRNA-Seq and Methyl-Seq are just three of them. Also need to integrate them with genome-wide genetic analysis.

There are many versions of each of those technologies.

RNA-Seq: 20M reads give 40 reads per 1kb-long mRNA present at as low as 1-2 mRNA copies per cell. Thus, 2-4 lanes are needed for deep transcriptome measurement. PET + long reads is excellent for phasing and junctions.

ChIP-Seq: transcription factors and histones... but it should also be used for any DNA binding protein. (Explanation of how ChIP-Seq works.) Using a no-antibody control generally gives you no background [?]. ChIP without a control gets you into trouble.

Methylation: Methyl-seq. Cutting at unmethylated sites, then ligate to adaptors and fragment. Size select and run. (Many examples of how it works.)

Studying human embryonic stem cells. (The cell lines are old and very different... hopefully there will be new ones available soon.) Using it for gene expression versus methylation status: when you cluster by gene expression, they cluster by pathways. The DNA methylation patterns did not correlate well - clustering more along the lines of individual cell lines than pathways. Thus, they believe methylation isn't controlling the pathways... but that could be an artifact of the cell lines.

26,956 methylation sites. Many of them (7,572) are in non CpG regions.

Another study: cortisol, a steroid hormone made by the adrenal gland. It controls 2/3rds of all biology, helps restore homeostasis and affects a LOT of pathways: blood pressure, blood sugar, immune suppression, etc. It fluctuates throughout the day, and levels are also tied to mood. Pharma is very interested in this.

The glucocorticoid receptor binds the hormone in the cytoplasm, then translocates to the nucleus. It activates and represses transcription of thousands of genes.

ChIP-Seq in A549: GR (- hormone): 579 peaks. GR (+ hormone): 3,608 peaks. Low levels of endogenous cortisol in the cell probably account for the background. (Of the peaks, ~60% are repressive, ~40% are inducing.) When investigating the motifs, the top 500 hits really change the binding site motif! It's no longer as fixed as originally thought – and this led to the discovery of new genes controlled by the GRE. Also showed that there's co-occupancy with AP1.

[Method for expression quantization: Use windows over exons.]

Finally: a few more little stories. Mono-allelic transcription factor binding turns out to occur frequently, where only one allele is bound in the ChIP, and the other is not bound at all. (In the case shown, it turns out the SNP creates a methylation site, which changes binding.) The same type of event also happens to methylation sites.

Still has time: just raise the point of Copy Number Variation. Interpretation is very important, and can be skewed by CNVs. Cell lines are particularly bad for this. If you don't model this, it will be a significant problem. Just on the verge of incorporating this.

They are going to 40-80M reads for RNA-Seq. Their version of RNA-Seq is good, and doesn't give background. The deeper you go, the more you learn. Not so much with ChIP-Seq, where you saturate sooner.


Friday, February 6, 2009

Kevin McKernan, Applied Biosystems - "The whole Methylome: Sequencing a Bisulfite Converted Genome using SOLiD"

Background on methylation. It's not rare, but it is clustered. This is begging for enrichment. You can use restriction enzymes; uses mate pairs to set this up. People can also use MeDIP, and there's a new third method: a methyl binding protein from Invitrogen. (Seems to be more sensitive.)

MeDIP doesn't grab CpG, though... it just leaves single-stranded DNA, which is a pain for making libraries. Using only 90 ng. There is a slight bias on the adaptors, though - not yet optimized. If they're bisulfite converting, it has issues (protecting adaptors, requires densely methylated DNA, etc.). They get poor alignment because methylated areas tend to be repetitive. Stay tuned, though, for more developments.

MethylMiner workflow: shear genomic DNA, put adaptors on it, and then bind the methylated fragments with biotin? You can titrate methyl fractions off the solid support, so you can then sequence and know how many methyls you'll have. Thus, mapping becomes much easier, and sensitivity is better.

When you start getting up to 10-methyls in a 50mer, bisulfite treating + mapping is a problem. It's also worth mentioning that methylation is not binary when you have a large cell population.

The MethylMiner system was tested on A. thaliana; SOLiD fragments were generated... good results obtained, the salt titration seems to have worked well, and mapping the reads shows that you get (approximately) the right number of methyl Cs - and mapping is easy, since you don't need to bisulfite convert.

Showed examples where some of genes are missed by MeDIP, but found by MethylMiner.

(Interesting note, even though they only have 3 bases after conversion (generally), it's still 4 colour.)

Do you still get the same number of reads on both strands? Yes...

Apparently methylation is easier to align in colourspace. [Not sure I caught why.] Doing 50mers with 2 mismatches. (Seems to keep the percentage align-able in colourspace, but bisulfite-treated base-space libraries can only be aligned about 2/3rds as well.)

When bisulfite converted, 5mC will appear as a SNP in the alignment. To approach that, you can do fractionation with the MethylMiner kit, which gives you a more rational approach to alignments.

You can also make an LMP library, and then treat with 5mCTP when extending, so you get two tags, then separate the tags (they keep a barcode), and then pass it over the MethylMiner kit... etc.... barcoded mapping to detect methyl Cs better.

Also have a method in which you do something the same way, but ligate hairpins on the ends... then put on adaptors, and then sequence the ends, to get mirror imaged mate pairs. (Stay tuned for this too.)

There are many tools to do Methylation mapping: colourspace, lab kits and techniques.


Stephen Kingsmore, National Center for Genome Resources - “Digital Gene Expression (DGE) and Measurement of Alternative Splice Isoforms, eQTLs and cSN

[Starts with apologizing for chewing out a guy from Duke... I have no idea what the back story is on that.]

They developed their own pipeline, with a web interface called Alpheus, which is remotely accessible. They have an ag biotech focus, which is their niche. They would like to get into personal genome sequencing.

Application 1: Schizophrenia DGE.
Pipeline: ends with ANOVA analysis. Alignment to several references: transcripts and genome. 7% of reads span exon junctions. mRNA-Seq coverage. Read-count-based gene expression analysis is as good as or better than arrays or similar tech. Using principal component analysis. Using mRNA-Seq, you can clearly separate their controls and cases, which they couldn't do with arrays. It improves the diagnosis component of variance.
Showing “volcano plots”.
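
As an aside for readers new to digital gene expression: the read-count idea is simply to count aligned reads per gene and normalize by gene length and library size. Here's a minimal sketch in Java (my own illustration, not the Alpheus pipeline; the Gene class, the single-chromosome assumption and the RPKM-style normalization are all my simplifications).

import java.util.*;

// Toy digital gene expression: count aligned read starts inside each gene and
// report reads per kilobase of gene per million mapped reads (an RPKM-like value).
public class ReadCountExpression {
    static class Gene {
        final String name;
        final int start, end; // 0-based, half-open coordinates on a single chromosome
        Gene(String name, int start, int end) { this.name = name; this.start = start; this.end = end; }
    }

    static Map<String, Double> expression(List<Gene> genes, int[] readStarts) {
        Map<String, Double> result = new LinkedHashMap<>();
        double millionsOfReads = readStarts.length / 1e6;
        for (Gene g : genes) {
            long count = 0;
            for (int pos : readStarts) {
                if (pos >= g.start && pos < g.end) count++; // read falls inside the gene
            }
            double kilobases = (g.end - g.start) / 1000.0;
            result.put(g.name, count / (kilobases * millionsOfReads));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Gene> genes = Arrays.asList(new Gene("geneA", 0, 2000), new Gene("geneB", 5000, 6000));
        int[] readStarts = {10, 500, 1500, 5200, 5400, 5999};
        System.out.println(expression(genes, readStarts));
    }
}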

Many of genes found for schizophrenia converged on a single pathway, known to be involved in neurochemistry.

They have a visualization tool, and showed that you can see junctions and retained introns; then they wanted to do it in a more high-throughput way. Started a collaboration to focus on junctions, to quantify alternative transcript isoforms. Working on the first map of splicing of the transcriptome in human tissues. 94% of human genes have multiple exons. Every one had alternative splicing in at least one of the tissues examined.

92% have biochemically relevant splicing. (minimum 15%?)

8 types of alternative splicing... 63% of alternative splicing is tissue regulated. 30% of splicing occurs between individuals. (So tissue splicing trumps individuals)

[Brief discussion of 454 based experiment... similar results, I think.]

Thus:
1.cost effective,
2.timely
3.biologically relevant
4.identified stuff missed by genome sequencing

Finally, also compared genotypes from individuals looking at cSNPs. Cis-acting SNPs causing allelic imbalance. Used it to find eSNPs (171 found). Finally, you can also fine-map the eQTN within an eQTL.

Labels:

Jesse Gray, Harvard Medical School - “Neuronal Activity-Induced Changes in Gene Expression as Detected by ChIP-Seq and RNA-Seq”

Now “widespread overlapping sense/antisense transcription surrounding mRNA transcriptional start sites.”

Thousands of promoters exhibit divergent transcriptional initiation. Annotated TSS come from NCBI. There are 25,000 genes. There is an additional anti-sense TSS (TSSa) 200 bp upstream. [Nifty, I hadn't heard about that.]

Do RNA-Seq and ChIP-Seq. Using SOLiD. SOLiD or Ambion [not sure which] plans to sell the method as a kit for WTSS/WT-Seq.

Using RNA Pol II ChIP-Seq.

Anti-sense transcription peaks about ~400 bases upstream of the TSS. When looking at the genome browser, you see overlapping TSS-associated transcription. (You see negative-strand reads in the other direction, upstream from the TSS, and forward-strand reads at the TSS, with a small overlap in the middle.)

It is a small amount of RNA being produced.

Did a binomial statistical test, fit to 4 models:
1.sense only initiation
2.divergent initiation (overlap)
3.anti-sense only initiation
4.divergent (no overlap)

The vast majority are TSSs with divergent overlap; 380 are divergent (no overlap), 900 sense only, 140 anti-sense only. Many other sites were discarded because it was unclear what was happening. This is apparently a widespread phenomenon.
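
I didn't catch the details of the test, so here's just my own rough reconstruction of how you might sort a TSS into those four bins: compare the plus- and minus-strand read counts against a 50/50 null with a binomial test, and if both strands are real, check whether the sense and antisense reads physically overlap. The intervals below are invented.

```python
# A rough sketch (my reconstruction, not the speaker's code) of classifying a
# TSS from the reads around it: sense-only, antisense-only, or divergent, and
# if divergent, whether the sense and antisense reads overlap.

from math import comb

def binom_cdf(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def classify_tss(sense_reads, antisense_reads, alpha=0.05):
    """sense_reads / antisense_reads: (start, end) intervals of reads on each
    strand near one annotated TSS."""
    n_s, n_a = len(sense_reads), len(antisense_reads)
    n = n_s + n_a
    if n == 0:
        return "no initiation"
    # If the weaker strand's count is improbably low under a 50/50 null,
    # call the site single-stranded.
    if binom_cdf(min(n_s, n_a), n) < alpha:
        return "sense only" if n_s > n_a else "antisense only"
    # Otherwise it's divergent: do the sense and antisense reads overlap?
    sense_min = min(start for start, _ in sense_reads)
    antisense_max = max(end for _, end in antisense_reads)
    return "divergent (overlap)" if antisense_max > sense_min else "divergent (no overlap)"

# Hypothetical intervals around one TSS at ~position 100:
sense = [(100, 135), (104, 139), (110, 145)]
antisense = [(55, 90), (60, 95)]
print(classify_tss(sense, antisense))
```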

Might this be important? Went back to ChIP-Seq to classify the peaks into these categories from RNA Pol II expt. (Same categories.) Is this a meaningful way to classify sites, and what does it tell us?

How many of those peaks have a solid PhastCons score? That should tell you something about conservation. No initiation has the lowest scores... the ones with the antisense models have the highest conservation at the location of antisense initiation.

Where do the peaks fall when they have anti-sense? Anti-sense peaks are bimodal; sense-only and bi-directional peaks sit just before the TSS, and are not bimodal.

Tentatively, yes, it seems like this anti-sense is functionally important.

Does TSSo change efficiency of initiation?

Break into two categories: non-overlap TSSs and overlap TSSs. It appears that overlap TSSs produce more than twice the RNA of non-overlap. This could be a bias... could be selecting for highly expressed genes. Plot the RNA Pol II occupancy at the start sites: there is a big difference at the overlapping TSSs. Non-overlap TSSs have higher occupancy right at the TSS, but lower occupancy up- or downstream than overlap TSSs. Thus the transition to elongation may be less efficient.

Does TSSo change efficiency of initiation? Tentatively, yes.

Comment from audience: this was discovered a year ago in a paper by Kaplan (Kaparov?). Apparently it was recently described that these are cleaved into 31nt capped reads. Thus, the fate of the small RNA should be of interest. 50% of genes had this phenomenon.

Question from audience: what aligner was used, and how were repetitive sections handled? Only uniquely mapping reads, using the SOLiD pipeline. (Audience member thinks that you can't do this analysis with that data set.) Apparently, someone else claims it doesn't matter.

My Comment: This is pretty cool. I wasn't aware of the anti-sense transcription in the reverse direction from the TSS. It will be interesting to see where this goes.

Labels:

Terrence Furey, Duke University - “A Genome-Wide Open Chromatin Map in Human Cell Types in the ENCODE Project”

2003: initial focus on 1% of the genome. Where are all the DNA elements?
2007: Scale up from 1% to 100%.

Where are all of the regulatory elements in the genome: a parts list of all functional elements.

We now know: 53% unique, 45% repetitive, 2% are genes. Somehow, the 98% controls the other 2%.

Focussed on regions of open chromatin. Open chromatin is not bound to nucleosomes.

Contains:
1.promoters
2.enhancers
3.silencers
4.insulators
5.locus control regions
6.meiotic recombination hotspots.

Use two assays: DNase hypersensitivity, used at single sites in the past and now used for high-throughput, genome-wide assays. The second method is FAIRE: formaldehyde-assisted identification of regulatory elements. It's a ChIP-Seq. [I don't know why they call it FAIRE... it's exactly a ChIP experiment – I must be missing something.]

Also explaining what ChIP-Seq/ChIP-chip is. They now do ChIP-Seq. Align sequences with MAQ. Filter on the number of aligned locations (keep up to 4 alignments). Use F-Seq. Then call peaks with a threshold. Uses a continuous-value signal.

The program is F-Seq, created by Alan Boyle. Outputs in Bed and Wig format. Also deals with alignability “ploidy”. (Boyle et al, Bioinformatics 2008). They use Mappability to calculate smoothing.
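
I haven't read the F-Seq paper yet, so the snippet below is only my cartoon of the general idea, not F-Seq itself: smooth the read start positions into a continuous per-base signal with a Gaussian kernel, then call peaks as runs of positions above a threshold. All numbers are made up.

```python
# Not F-Seq itself -- just a sketch of the idea behind it: turn aligned read
# start positions into a continuous signal with a Gaussian kernel density,
# then call "peaks" as runs of positions above a threshold.

import math

def kernel_density(read_starts, chrom_length, bandwidth=200):
    """Continuous per-base signal from read start positions (Gaussian kernel)."""
    signal = [0.0] * chrom_length
    for start in read_starts:
        lo = max(0, start - 4 * bandwidth)
        hi = min(chrom_length, start + 4 * bandwidth)
        for pos in range(lo, hi):
            z = (pos - start) / bandwidth
            signal[pos] += math.exp(-0.5 * z * z)
    return signal

def call_peaks(signal, threshold):
    """Merge consecutive above-threshold positions into (start, end) regions."""
    peaks, start = [], None
    for pos, value in enumerate(signal):
        if value >= threshold and start is None:
            start = pos
        elif value < threshold and start is not None:
            peaks.append((start, pos))
            start = None
    if start is not None:
        peaks.append((start, len(signal)))
    return peaks

reads = [1000, 1020, 1050, 1055, 1100, 5000]   # made-up read starts
sig = kernel_density(reads, chrom_length=6000)
print(call_peaks(sig, threshold=2.0))          # the clustered reads form one peak
```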

[This all sounds familiar, somehow... yet I've never heard of F-Seq. I'm going to have to look this up!]

Claim you need normalization to do proper calling. Normalization can also be applied if you know regions of duplications.

[as I think about it, continuous read signals must create MASSIVE wig files. I would think that would be an issue.]

Peak calling validation: ROC analysis. False positives along the bottom axis, true positives on the vertical axis. Shows ChIP-seq and ChIP-array have very high concordance.

DNase I HS – 72 million sequences, 149,000 regions, 58.5Mb – 2.0%
FAIRE – 70 million sequences, 147,000 regions, 53Mb – 1.8%

Compare them – and you see the peaks in one correspond with the peaks in the other. Not exact, but similar. Very good coverage by FAIRE of the DNase peaks. Not as good the other way, but close.

The goal is for the project to be done on a huge list of cells (92 types?? - 20 cell lines now, adding 50 to 60 more, including different locations in the body, disease states, cells exposed to different agents... etc. etc.) RNA is tissue specific, so that changes what you'll see.

Summary:
Using DNase and FAIRE assays to define an open chromatin map
exploring many cell types,
discovery of ubiquitous and cell-specific elements.

Note: Data is available as quickly as possible - next month or two, but may not be used for publication for the first 9 months.

Labels:

Kai Lao, Applied Biosystems - “Deep Sequencing-Based Whole Transcriptome Analysis of Single Early Embryos”

I think all sequencing was done with ABI SOLiD.

To get answers about early life stages, you need to do single cells – early life is in single cells, or close to it. When you separate a two-cell embryo, miRNAs are symmetrically distributed (measured by array). T1 and T2 have similar profiles. When you separate at the 4-cell stage – it's still the same....

Can you do the same thing with next gen sequencing to do whole transcriptome? (Yes, apparently, but the slide is too dark to see what the method is.) Quantified cDNA libraries on gel, then started looking at results.

If you do everything perfectly, concordance between the forward and reverse strands should be the same. However, if you do the concordance between two blastomeres, you see different results. [not sure what the difference is, but things aren't concordant between the two samples....]

First, showed that libraries have very high concordance – the same oocyte gives excellent concordance. However, between the dicer knock-out and wt, you get several genes that do not have the same expression in both. Many genes are co-up-regulated or co-down-regulated.
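
For my own benefit, here's roughly what that kind of concordance check looks like computationally (my sketch, not the speaker's analysis, and the read counts are invented): rank-correlate per-gene counts between two libraries, and flag the genes whose fold change stands out, the way Dppa5 did between wt and Dicer-KO.

```python
# A back-of-the-envelope version of checking concordance between two
# expression libraries: Spearman-correlate the per-gene read counts, and flag
# genes with an outlying fold change. Counts below are hypothetical.

import math

def spearman(x, y):
    """Spearman rank correlation for two equal-length lists (no tie handling)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def outliers(counts_a, counts_b, genes, min_log2_fc=2.0):
    """Genes whose (pseudocounted) log2 fold change exceeds the cutoff."""
    hits = []
    for gene, a, b in zip(genes, counts_a, counts_b):
        fc = math.log2((a + 1) / (b + 1))
        if abs(fc) >= min_log2_fc:
            hits.append((gene, round(fc, 2)))
    return hits

genes = ["Dppa5", "GeneB", "GeneC", "GeneD", "GeneE"]
wt = [12, 500, 40, 3000, 75]
dicer_ko = [310, 520, 35, 2800, 90]
print(spearman(wt, dicer_ko), outliers(wt, dicer_ko, genes))
```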

One gene was Dppa5. In wt, it had low expression, in Dicer-KO and ago2-KO, they were upregulated.

After Dicer genes were knocked out at day 5, only 2% of maternal miRNAs survived in a mature Dicer-KO oocyte (30 days). Dicer-KO embryos cannot form viable organisms (beyond the first few cell stages.)

Deeper sequencing is better. With 20M reads, you get array level data. You get excellent data beyond 100M reads.

No one ever proved that multiple isoforms are expressed at the same time in a cell – used this data to map junctions, and showed they do exist. 15% of genes expressed in a single cell as different isoforms.

Labels:

Matthew Bainbridge, Baylor College of Medicine - “Human Variant Discovery Using DNA Capture Sequencing”

overview: technology + pipeline, then genome pilot 3, snp calling, verification.

Use solid-phase capture – Nimblegen array + 454 sequencing.
map with BLAT and cross_match. SNP calling (ATLAS-SNP).

All manner of snp filtering.
1.Remove duplicates with same location
2.Then filter on p value.
3.More.. [missed it]

226 samples of 400.

Rebalanced arrays... Some exons pull down too much, and others grab less. You can change the concentrations, then, and use the rebalanced array.

Average coverage came down, but overall coverage went up... Much less skew with the rebalanced array. 3% of the target region just can't get sequence. 90% of the sequence ends up covered 10x or better.

Started looking at SNPs – frequency across individuals.

Interested in ataxia, a hereditary neurological disorder. Did 2 runs in the first pilot test on 2 patients. Now do 4. Found 18,000 variants. Found one in the gene named for that disease – it turned out to be novel, and non-synonymous. Followed up on it, and it looked good: sequenced it in the rest of the family, but it didn't actually exist outside that patient.

So that brings us to validation: concordance to HapMap, etc. etc., but those only tell you about false negatives, not false positives. You have to go learn about false positives with other methods, but the traditional ones can't do high throughput. So, to verify, they suggest using other platforms: 454 + SOLiD.

When they're done, you get good concordance, and the false positives drop out. The interesting thing is “do you need high quality in both techniques?” The answer seems to be no. You just need high quality in one... but do you need even that? Apparently not: you can do this with two low-quality runs from different platforms. Call everything a SNP (errors, whatever... call it all a SNP). When you do that and then build your concordance, you can do a very good job of SNP calling! (60% are found in dbSNP.)
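
The "two sloppy call sets beat one careful one" logic is simple enough to sketch (this is just my illustration of the idea, not their pipeline, and the sites are invented): call permissively on both platforms and keep only the positions where both agree on the same allele, since the errors are mostly platform-specific.

```python
# A sketch of intersecting permissive call sets from two platforms: errors
# tend to be platform-specific, so they mostly fall out of the intersection.
# Call sets are dicts of (chrom, pos) -> alt allele; all sites are made up.

def intersect_calls(calls_454, calls_solid):
    """Keep positions called on both platforms with the same alternate allele."""
    shared = {}
    for site, alt in calls_454.items():
        if calls_solid.get(site) == alt:
            shared[site] = alt
    return shared

calls_454 = {
    ("chr12", 1_004_512): "A",   # real SNP
    ("chr12", 1_007_890): "T",   # 454 homopolymer artifact
    ("chr3", 22_118_004): "G",   # real SNP
}
calls_solid = {
    ("chr12", 1_004_512): "A",
    ("chr3", 22_118_004): "G",
    ("chr7", 5_660_131): "C",    # SOLiD colourspace artifact
}

print(intersect_calls(calls_454, calls_solid))
# -> only the two concordant sites survive
```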

My Comments: Nifty.

Labels:

Complete Genomics, part 2

Ok, I couldn't resist - I visited the Complete genomics "open house" today... twice. As a big fan of start up companies, and an avid follower of the 2nd gen (and possibly now 3rd gen) sequencing, it's not every day that I get the chance to talk to the people who are working on the bleeding edge of the field.

After yesterday's talk, where I missed the first half of the technology that Complete Genomics is working on, I had a LOT of questions, and a significant amount of doubt about how things would play out with their business model. In fact, I would say I didn't understand either particularly well.

The technology itself is interesting, mainly because of the completely different approach to generating long reads... which also explains the business model, in some respects. Instead of developing a better way to "skin the cat", as they say, they went with a strategy where the idea is to tag and assemble short reads. That is to say, their read size for an individual read is in the range of a 36-mer, but it's really irrelevant, because they can figure out which sequences are contiguous. (At least, as I understood the technology.) Ok, so high reliability short reads with an ability to align using various clues is a neat concept.

If you're wondering why that explains their business model, it's because I think that the technique is a much more difficult pipeline to implement than any of the other sequencing suppliers demand. Of course, I'm sure that's not the only reason - the reason why they'll be competitive is the low cost of the technology, which only happens when they do all the sequencing for you. If they had to box reagents and ship them out, I can't imagine that it would be significantly cheaper than any of the other setups, and it would probably be much more difficult to work with.

That said, I imagine that in their hands, the technology can do some pretty amazing things. I'm very impressed with the concept of phasing whole chromosomes (they're not there yet, but eventually they will be, I'm sure), and the nifty way they're using a hybridization based technique to do their sequencing. Unlike the SOLiD, it's based on longer fragments, which answers some of the (really geeky, but probably uninformed) thermal questions that I had always wondered about with the SOLiD platform. (Have you ever calculated the binding energy of a 2-mer? It's less than room temperature). Of course the cell manages to incorporate single bases (as does Pacific Biosciences), but that uses a different mechanism.

Just to wrap up the technology, someone left an anonymous comment the other day that they need a good ligase, and I checked into that. Actually, they really don't need one. They don't use an extension based method, which is really the advantage (and Achilles heel) of the method, which means they get highly reliable reads (and VERY short fragments, which they have to then process back to their 36- to 40-ish-mers).

Alright, so just to touch on the last point of their business model, I was extremely skeptical when I heard they were going to only sequence human genomes, which is a byproduct of their scale/cost model approach. To me, this meant that any of the large sequencing centres would probably not become customers - they'll be forced to do their own sequencing anyhow for other species, so why would they treat humans any differently? What about cell lines, are they human enough?...

Which left, in my mind, hospitals. Hospitals, I could see buying into this - whoever supplies the best and least expensive medical diagnostics kit will obviously win this game and get their services, but that wouldn't be enough to make this a google-sized or even Microsoft-sized company. But, it would probably be enough to make them a respected company like MDS metro or other medical service providers. Will their investors be happy with that... I have no idea.

On the other hand, I forgot pharma. If drug companies start moving this way, it could be a very large segment of their business. (Again, if it's inexpensive enough.) Think of all the medical trials, disease discovery and drug discovery programs... and then I can start seeing this taking off.

Will researchers ever buy in? That, I don't know. I certainly don't see a genome science centre relinquishing control over their in house technology, much like asking Microsoft to outsource its IT division. Plausible... but I wouldn't count on it.

So, in the end, I'm looking forward to seeing where this is going... All I can say is that I don't see this concept disappearing any time soon, and that, as it stands, there's room for more competition in the sequencing field. The next round of consolidation isn't due for another two years or so.

So... Good luck! May the "best" sequencer win.

Labels: ,

Keynote: Richard Gibbs, Baylor College of Medicine - “Genome Sequencing to Health and Biological Insight”

Repetitive things coming up in genomics, and comments about the knowledge pipeline. Picture of snake that ate two lightbulbs.... [random, no explanation]

“cyclic” meeting history: used to be GSAC, then stopped when it became too industrial. Then switched to AMS, and then transitioned to AGBT. We're coming back to the same position, but it's much more healthy this time.

We should be more honest about our conflicts.

The pressing promise in front of us – making genomics accessible. Get yourself genotyped... (he did); the information presented is just “completely useless!”

We know it can be really fruitful to find variants. So how do we go do that operationally? Targeted sequencing versus whole genome. What platform (compared to coke vs. Pepsi.)

They use much less Solexa, historically. They just had good experiences with the other two platforms.

16% of Watson SNPs are novel, 15% of Venter SNPs are novel. ~10,500 novel variants.(?) [not clear on slide]

Mutations in the Human Gene Mutation Database. We already know the databases just aren't ready yet... not for functional use.

Switch to talking about SOLiD platform:

SNP detection and validation. Validation is difficult – but having two platforms do the same thing, it's MUCH easier to knock out false positives. Same thing on indels. You get much higher confidence data. Two platforms is better than one.

Another cyclic event: Sanger, then next-gen then base-error modelling. We used to say “just do both strands”, and now it's coming back to “just sequence it twice”. (calls it “just do it twice” sequencing.)

Knowledge chain value: sequencing was the problem, then it became the data management, and soon, it'll shift back to sequence again.

Capture: it's finally “getting there”. Exon capture and Nimblegen work very well in their hands. Coverage is looking very good.

Candidate mutation for ataxia. In one week they got to a list. Of course, they're still working on the list itself.

How to make genotyping useful?
1.develop physicians and genetics connection
2.retain faith in genotypic effects
3.need to develop knowledge of *every* base.
4.Example, function, orthology...and...

Other issues that have to do with the history of each base. HapMap3/ENCODE. Sanger-based methods, about 1Mb per patient. Bottom line: found a lot of singletons. They found a few sites that were mutated independently, not heritable.

The other is MiCorTex. 15,200 people (2 loci). Looking for atherosclerosis. Bottom line: they find a lot of low-frequency variants. Having sequenced so many people, you can make predictions (“the coalescent”). The sample size is now a significant fraction of the population, so the statistics change. All done with Sanger!

Change error modeling – went back to original sequencing and got more information on nature of calls. Decoupling of Ne and Mu in a large sample data.

In the works: represent SNP error rate estimates with genotype likelihoods.
1000 genomes pilot 3 project. If high penetrance variants are out there, wouldn't it be nice to know what they're doing and how. 250 samples accumulated so far.

Some early data: propensity for non-sense mutations.
Methods have evolved considerably
whole exome
variants will be converted to assays
data merged with other functional variants.

Whole genome and capture are both doing well.
Focus is now back on rare variants
platform comparison also good
Db's still need work
site specific info is growing
major challenge of variants understanding can be achieved by ongoing functional studies and improve context.

Labels:

John Todd, University of Cambridge - “The Identification of Susceptibly Genes in Common Diseases Using Ultra-Deep Sequencing”

Type 1 diabetes: a common multifactorial disease. One of many immune-mediated disease that in total affect ~5% of the population. Distinct epidemiological & clinical features. Genome wide association success... but.. What's next?

There is a pandemic increase in type 1 diabetes. Since 1950's, there's an abrupt 3% increase each year. Age at diagnosis has been decreasing. Now 10-15% are diagnosed under 5 years old.

There is a strong north-south and seasonality bias to it. Something about this disease tracks with seasons.. vitamin D? Viruses?

Pathology: massive infiltration of beta cell islets.

In 1986: 1000 genotypes. In 1996: multiplexing allowed 1,000,000 genotypes, now allows full genome association.

Crohn's and diabetes are “the big winners” from the Wellcome Trust – the most heritable and easily diagnosed of the seven diseases originally selected.

Why do people get type 1 diabetes? Large effect at HLA class II = immune recognition of beta cells. 100's of other genes in common and rare alleles of SNPs and SVs in immune homeostasis.

Disease = a threshold of susceptibility alleles and a permissive environment.

What will the next 20 years look like: national registers of diseases (linkage to records and samples where available), mobile phone text health, identification of causal genes and their pathways (mechanisms), natural history of disease susceptibility, and newborn susceptibility by their T1D genetic profile. What dietary, infectious, gut flora-host interactions modify these, and which can we affect?

Can we slow the disease spread down?

There are 42 chromosome regions in type 1 diabetes, with 96 genes. Which are causal? What are the pathways? What are the rare variants? Genome-wide gene-isoform expression. Genotype to protein information.

Ultra-deep sequencing study: 480 patients and 480 controls, PCR of exons and did 454. 95% probability of detecting an allele at 0.3% frequency.
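
That "95% at 0.3%" figure is really a statement about read depth, and the arithmetic is easy to back-of-the-envelope with a binomial model. The depth and the minimum number of supporting reads below are my own assumptions for illustration, not numbers from the talk.

```python
# Rough binomial arithmetic (my own numbers, not the speaker's) for a claim
# like "95% probability of seeing an allele at 0.3%": at a given read depth,
# what's the chance of observing at least k reads carrying an allele present
# in 0.3% of the pooled chromosomes?

from math import comb

def prob_detect(depth, allele_freq, min_supporting_reads):
    """P(at least min_supporting_reads successes) under Binomial(depth, freq)."""
    p_miss = sum(
        comb(depth, i) * allele_freq**i * (1 - allele_freq)**(depth - i)
        for i in range(min_supporting_reads)
    )
    return 1 - p_miss

# Assumed: 3000x pooled depth per amplicon, require >= 5 supporting reads.
print(round(prob_detect(depth=3000, allele_freq=0.003, min_supporting_reads=5), 3))
# -> roughly 0.95 under these assumptions
```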

Found one hit: IFIH1. Followed up in 8000+ patients – found this gene was not associated with disease, but with protection from disease! Knock it out, and you become susceptible!

It's possible that this is associated with protection of viral infections. The 1000 genome project may also help give us better information for this type of study.

The major prevention trial to prevent type 1 diabetes is ingestion of insulin to restore immune tolerance to insulin.

Do we know enough about type 1 diabetes?

Maybe one of the pathways in type1 diabetes is a defect in oral tolerance?

Type 1 diabetes co-segregates with things like coeliac disease (wheat intolerance), one of the rare autoimmune diseases for which we know the environmental factor (gluten). Failure of the gut immune system to be tolerant of gluten.

The majority of loci are shared between type 1 diabetes and coeliac disease. (sister diseases)

Compared genes in Type 1 and Type 2 diabetes – they are not overlapping. No molecular basis for the grouping of these two diseases.

Common genotypes are ok for predicting type 1. ROC curve presented. Can identify population that is likely to develop T1D, but.... how do you treat?

Going from genome to treatment is not obvious, though.

Healthy volunteers – recallable by genotype, age, etc (Near Cambridge).

Most susceptibility variants affect gene regulation & splicing. Genome wide expression analysis of mRNA and isoforms in pure cell population. Need to get down to lower volume of input material and lower costs.

Using high-throughput sequencing with allele-specific expression (ASE). Looking for eQTLs for disease and biomarkers. Doing work on other susceptibility genes. (Using volunteers recallable by genotype).

Looking for new recruits: Chair of biomedical stats, head of informatics, chair of genomics clinical....

Labels:

Kathy Hudson, The Johns Hopkins University - “Public Policy Challenges in Genomics”

Challenges: getting enough evidence is difficult: Analytic validity, clinical validity.. etc etc

Personal value is there theoretically – but will it work?

Two different approaches: who offers them, and then who makes the tests?

Types: either performed with or without consent. Results returned.. or not. There are now a large number of people offering tests for a wide number of conditions.

Are the companies medical miracles, or just marketing scam? Are the predictions really medically relevant. FTC is supposed to stop companies that lie... but for genetic testing they just put out a warning.

Role of states in regulating: states dictate who can authorize a test. However, in some states anyone can order it, not just medical personnel.

How they're made:
Two types of tests: lab tests (homebrews) and test “kits”. The level of regulatory oversight is disparate. The difference is not apparent to people ordering them, but they have different types of oversight.

[flow charts on who regulates what] Lab tests are not under the FDA (they're done through CMS)... and it makes no sense for them to be there. You can't get access to basic science information through CMS, whereas in the FDA, that's a key part of the mandate(?)

Example about proficiency testing – which was poorly implemented in law, and is still not well done. The list is now out of date – and none of the listed diseases being tested have a genetic basis. CMS can't give information on what the numbers in the reported values mean (labs get 0's for multi-year tests, but CMS can't explain it.)

FDA regulation of test kits is much more rigorous.

Genentech started arguing that the two path system should not be there. Should be regulated based on risk, not manufacturer. Obama-Burr introduced genetic medicine bill in 2007, and something more recently by Kennedy. (Also biobanking?)

Steps to effective testing:
1.level of oversight based on risk
2.tests should give answer nearly all the time
3.data linking genotype to phenotype should be publicly accessible
4.high risk tests should be subject to independent review before entering market
5.pharmacogenetics should be on label
6.[missed this point]

Privacy: should it be public? Who perceives it as what?

More people are concerned about financial privacy than medical privacy. 1/3 think that their medical record should be “super secret”; and when asked what part of it should be most private, most people said their social security number! Genetic tests and family history are way down the list of what needs to be protected.

People trust doctors and researchers, but not employers. The Genetic Information Nondiscrimination Act is a consequence of that trust level. (not a direct result?)

The new Privacy Problem? DNA snooping. Who is testing your DNA? (Something about a half-eaten waffle left by Obama that ended up on ebay... claiming it had his DNA on it.)

Many actions: testing, implementing laws, modernizing laws, transparency, better testing.

My comments: It was a really engaging talk, with great insight into US law in genetics. I'd love to see a more global view, but still, quite interesting.

Labels:

Howard McLeod, University of North Carolina, Chapel Hill - “Using the Genome to Optimize Drug Therapy”

“A surgeon who uses the wrong side of the scalpel cuts her own fingers and not the patient.
If the same applied to drugs, they would have been investigated very carefully a long time ago.”
Rudolph Buchheim (1849)

The clinical problem: multiple active regimens for the treatment of most diseases. Variation in response to therapy, unprecedented toxicity and cost issues! With choice comes decision. How do you know which drug to provide?

“We only pick the right medicine for a disease 50% of the time”. Eventually we find the right drug, but it may take 4-5 tries. Especially in cancer.

“toxicity happens to the patient, not the prescriber”

[Discussion of importance of genetics. - very self-deprecating humour... Keeps the talk very amusing. Much Nature vs. Nurture again]

“Many Ways To Skin a Genome”. Tidbit: up to half of the DNA being measured can come from the lab personnel handling the sample. [Wha?] DNA testing is being done in massive amounts: newborns, transplants..

“you can get DNA from anything but OJ's glove.”

We also see applications of genetics in drug metabolism. E.g., warfarin. Too much: bleeding; too little: clotting. One of only two drugs that has its own clinic. [yikes.] Apparently methadone is the other. Why does it have its own clinic? “That's because this drug sucks.” Still the best thing out there, though. Discussion of CYP-based mechanisms and the vitamin K reductase target. Showed a family tree – too much crossing of left and right hands...

Some discussion of results – showing that there are difference in genetics that strongly influences metabolism of warfarin.

Genetics has now become part of litigation – warfarin is one of the most litigated drugs.

We need tools that translate genetics into lay-speak. It doesn't help to tell people they have a CYP2C*8... they need a way to understand and interpret that.

If we used genetics, we'd be able to go from 11% to 57% of patients on the “proper dose” of warfarin the first time.

Pharmacogenomics have really started to take off and there are now at least 10 examples.

What is becoming important is pathways... but there are MANY holes. We know what we know, We don't know what we don't know.

We can do much of the phenotyping in cell lines – we can ask “is this an inheritable trait?” This should focus our research efforts in some areas.

Better systematic approach to sampling patients.

What do we do after biomarker validation? Really, we do nothing – we assume someone else will pick it up (through osmosis... that's faith-based medicine!). We need to talk to the right people and then hand it off – we need to do biomarker-driven studies with the goal of knowing who to hand it off to.

Take home message:

Pharmacogenetic analysis of patient DNA is ready for prime time.

My Comment: Very amusing speaker! The message is very good, and it was engaging. The science was well presented and easily understandable, and the result is clear: there's lots more room for improvement, but we're making a decent start and there is promise for good pharmacogenomics.

Labels:

Keynote Speaker: Kari Stefansson, deCODE Genetics - “Common/Complex traits with emphasis on disease”

Sounds like Sean Connery!

Basic assumption is that information is the basic unit of life – and the genome is the carrier. Creating database where we can start decoding that information – and have had some success, including to find the genes for the love of complex crossword puzzles. (-:

Traits range in complexity from simple mendelian all the way to really complex genotypes and phenotypes, which are often involved in diseases. One thing to keep in mind is that they also have geographical traits.

First example: melanoma. Very different genes (for light hair and skin) occur in the population, varying by location - people in Iceland don't have problems carrying this gene, but those in Spain would!

Second example: Genetic risk of atrial fibrillation is genetic risk of cardiogenic stroke. About 30% of stroke is indeterminate origin, but a significant proportion is associated with several genetic traits. [insert much statistics here!]

Third example: thyroid cancer – (published today?). Incidence has been increasing of late. If it's caught early, it has a very good prognosis. It has a very large familial component. Did a genome association in Iceland, identified 1100 individuals, and had genotypes for 580 of them. Pulled out 2 significant loci (independent), and they associated with two forms of thyroid cancer. [more statistics, too fast to make notes...] Individuals with both genes have a 5.7X increase in risk. (Multiplicative model.) The two loci also have differences in clinical presentation. The candidate for the first is the FOXE (TTF2) transcription factor. The second is NKX2-1 (TTF1). Apparently these gene(s?) regulate thyroid stimulating hormone... so there may be an interesting mechanism.
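
For the record, "multiplicative model" just means the per-locus relative risks multiply for carriers of both variants. I didn't catch the per-locus estimates, so the numbers below are placeholders chosen only to show the order of magnitude of the quoted figure.

```python
# Placeholder arithmetic (my numbers, not deCODE's) for a multiplicative risk
# model: per-locus relative risks multiply for carriers of both variants.

def combined_risk(per_locus_risks):
    total = 1.0
    for risk in per_locus_risks:
        total *= risk
    return total

# e.g. two hypothetical loci with relative risks of 2.2 and 2.6:
print(combined_risk([2.2, 2.6]))   # ~5.7x, the order of the figure quoted
```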

Where are we now when it comes to discovery of the sequence variants that code for genetic components of complex disease? There seems to be a significant amount still undiscovered. Most of the variants that have been discovered have risk factors over 5%.... [not sure if that's right] The bottom line is that the detection limits are such that we can't find the really low-frequency variants with lower risk factors.

There may be a large contribution from rare variants with large effects
There may be a large contribution from rare variants with small or modest effects.
[one more.. not fast enough]

Started deCODE based on family based methods, and now have returned to it. Concept of Surrogate Parenthood – surrogates work as well as natural parents for phasing of proband. To get down to all traits with 2% of variants, they would only need to sequence ~2000 people. [Daniel says there are only 400,000 people in Iceland].

Have also noted genes where the risk factor is different between maternally and paternally passed genes.

Prostate cancer: have shown that there are 8 genes that have a cumulative risk factor. Important for treatment and preventative care.

End by pointing out that in all of the common disease, it is a disease where there are both environmental and genetic components. How do they interact? How do they fit into our debate (nature vs. Nurture).

Published on nicotine dependence and lung cancer last year. In Iceland, it's purely environmental – only smokers in Iceland develop it (14%). Discovered a sequence variant that makes you more likely to smoke more because you're more likely to crave the nicotine... where is the line between nature and nurture, then? To solve this problem, you need to understand the brain – to understand the behaviours that make us susceptible to environmental diseases.

Labels:

Site Feed changes

I'm never one to shy away from changes, even when I should probably know better. One of the things that came up yesterday while talking to the other (read: much more professional) science bloggers was that I should be monitoring my rss feeds, and they all unanimously suggested feedburner.

Of course, I've never set it up before, so I'm still in the process of trying to figure out how it works - but hang in, I've still got another half hour before the first session starts. That should be more than enough time, right?

Labels:

Thursday, February 5, 2009

Complete Genomics

[Missed start of talk]

Inexpensive. Non-sequential bases? No ligase required.

Long Fragment Reads. Start with high molecular weight DNA – 70-100kb – do a sample prep that barcodes the fragments, sequence, and then informatically map reads back to their fragment. Assembly then gives you 100kb lengths. Chromosome phasing begins to become possible. [spiffy!]

Thus, you can do this over a genome, as well, it allows the maternal and paternal dna to be worked out.

Not planning to sell instruments: only going to be a sequencing centre doing it as a service. 20,000 genomes by the end of 2010. The big challenge is actually assembly. 60K cores in the cluster, 30PB of disk space.

Will partner with Genome centres. Yesterday signed an agreement to try a pilot with Broad. Will build genome centres around the world.

Trying to make sequencing ubiquitous. Send them a sample, then click on a link to get your results.

Saves you capital on sequencing and then on compute infrastructure.

Will only do Humans! [I cracked up at this point.]

My comments: Ok, I missed the beginning, but the end was interesting. I totally don't understand the business model. By doing only human, they'll only find hospital customers... and which hospital will pay for them to build a data centre? I'll elaborate more on that in another post.

Labels:

Erin Pleasance, Wellcome Trust Sanger Institute - “Whole Cancer Genome Sequencing and Identification of Somatic Mutations”

Goals of cancer genome sequencing: WGSS read sequencing. Detection of substitutions, indels, rearrangements, CNVs. Detection in coding and non-coding regions. Catalogue of somatic mutations, functional impact and mutational patterns. Drivers vs. passengers.

Talk tonight about one cancer and one matched normal genome.

NCI-H209 small cell lung cancer cell line.

Cancer cell line and non-cancer cell line derived from the same individual.
Prior sequencing by PCR and capillary.
Somatic mutations: 6/Mb, or 18,000 in the genome.
Other data also available: Affy SNP6 and expression arrays.

Show karyotype. Kinda funky, but mostly sane. (-:

Used the AB SOLiD machine... the strategy is pretty obvious: sequence the cancer and the matched normal. All PET, aligned with MAQ, Corona for substitutions.

How much sequence do you need to do? Turns out, you need equal amounts of both – and it's about 30X coverage. There is a GC effect on coverage.

Compare with dbSNP. About 80% are there.

Look at the tumour only, with simple filtered reads: about 50% are not in dbSNP. Many are probably germline variants. Mutation vs. SNP rate: need to call SNPs and mutations using the control at the same time to get the best results. As well, if you have greater-than-diploid chromosomes, you need to worry about that too.

Also: CNV changes and ploidy, normal cell contamination, base qualities, and it's important to do indel detection first.
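
The core of that tumour/normal logic is simple to sketch (this toy is my own, not the Sanger pipeline, and it ignores all the complications they just listed: ploidy, contamination, base qualities): take candidate substitutions in the tumour and subtract anything that is also called in the matched normal or already sits in dbSNP.

```python
# A toy version of the somatic filtering logic described above (not the
# Sanger pipeline): start from candidate substitutions in the tumour, then
# drop anything also called in the matched normal or present in dbSNP.
# Real callers also model ploidy, normal contamination and base qualities.

def somatic_candidates(tumour_calls, normal_calls, dbsnp_sites):
    """tumour_calls / normal_calls: dict of (chrom, pos) -> alt allele.
    dbsnp_sites: set of (chrom, pos) known polymorphic positions."""
    somatic = {}
    for site, alt in tumour_calls.items():
        if site in dbsnp_sites:
            continue                      # likely germline polymorphism
        if normal_calls.get(site) == alt:
            continue                      # present in matched normal: germline
        somatic[site] = alt
    return somatic

# Hypothetical call sets:
tumour = {("chr3", 178_936_091): "A", ("chr17", 7_578_406): "T",
          ("chr1", 1_234_567): "G"}
normal = {("chr1", 1_234_567): "G"}       # germline variant
dbsnp = {("chr3", 178_936_091)}           # known SNP site
print(somatic_candidates(tumour, normal, dbsnp))
# -> only the chr17 substitution remains as a somatic candidate
```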

CNV: easy to obtain, and cleaner than array data.

Structural variants from paired reads, done genome-wide. 50 of the mutations interrupt genes (of 125 in the tumour only).

Rearrangements: can also look at that. (Saw many rearrangement events).

Structural variants at basepair resolution. (Using Velvet... good job Daniel).

Last thing of interest: Small indels (less than 10bp.) Paired end reads, anchor with one end.

Medium indels can be found by identifying a deviation in insert size (Heather Peckham). You can see a shift in size... not an actual significant change. [interesting method] Can be seen in a comparison between normal and tumour.
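
My rough reconstruction of that insert-size idea (not the actual method): read pairs spanning a medium deletion map further apart than the library's nominal insert size, so the mean observed insert over a window shifts, and you can compare tumour and normal pairs over the same window. All numbers below are invented.

```python
# Sketch of detecting a medium indel as a shift in observed insert sizes over
# a window, compared against the library's nominal insert distribution.

from statistics import mean

def insert_shift(observed_inserts, expected_insert, expected_sd):
    """Z-score of the mean observed insert size against the library expectation."""
    n = len(observed_inserts)
    if n < 2:
        return 0.0
    return (mean(observed_inserts) - expected_insert) / (expected_sd / n ** 0.5)

# Hypothetical pairs spanning one window (library nominally 400 +/- 40 bp):
tumour_inserts = [455, 470, 462, 448, 475, 460]   # ~60 bp deletion signature
normal_inserts = [395, 410, 402, 388, 405, 399]

print(round(insert_shift(tumour_inserts, 400, 40), 1),   # large positive z
      round(insert_shift(normal_inserts, 400, 40), 1))   # ~0
```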

To summarize: somatic variants throughout the genome. Circos plots (=
Somatic mutations, functional impact? Recurrence? Pathways?

Labels:

Christopher Maher, University of Michigan - “Integrative Transcriptome Sequencing to Discover Gene Fusion in Cancer”

80% of all known gene fusions are associated with 10% of human cancers. Epithelial cancers account for 80% of cancer deaths, but have only 10% of known fusions.

Mined publicly available datasets and looked for genes with outlier expression.

Will use next-gen sequencing to get direct sequence evidence of chimeric events. Decided to use both 454 and Illumina. Categorized reads: mapping, partially aligned, non-mapping. Used the same samples. [whoa... the classification just got extensive... moving on.]

Chimera discovery using long read technology. Sequenced: VcaP, LNCaP, RWPE. Found 428, 247, 83 chimeras respectively.

Then added illumina. First checked that they could find the fusion that they know. 21 reads mapped there.

Found both intra- and inter-chromosomal candidates, and then validated 73% of them.

So, to recap: taking only candidates found by both 454 and Illumina was MUCH more selective – they found they were throwing out false positives, but keeping all the known targets.

Confirmed results with FISH.

Next expt: identification of novel chimeras in prostate tumour samples. Found candidate sequences from non-mapping reads, then worked to validate. How does it work, and what's its frequency? Found it in 7 metastatic prostate tissues and it is androgen inducible. In a meta study, found the fusion of interest in about 50% of prostate cancers.

Came up with a chimeric classification system: 5 classes: inter-chromosomal translocations, inter-chromosomal complex rearrangements, intra-chromosomal complex deletions, intra-chromosomal complex rearrangements...

Summary: validated 14 novel chimeras
Demonstrated cell line can harbor non-specific fusions...
[too slow to catch last point]

Answer to question: 100bp reads would have been long enough to nominate fusions.

Labels:

Anna Kiialainen, Uppsala University - “Identification of Regulatory Genetic Variation That Affects Drug Response in Childhood Acute Lymphoblastic Leukemia”

Review:
1.most common in children
2.20% do not respond to treatment
3.multi-factorial disease

Allele-specific expression is important. Normally you have two copies of each gene; however, you can get different ratios of expression from each allele, which leads to very different proportions of each allele in the sample.

Advantages: they can serve as internal standards for each other.

Causes: SNPs that affect transcription or stability. Or, allele-specific promoter (regulation or methylation).
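
To make ASE detection concrete (my sketch, not the Uppsala pipeline): at a heterozygous coding SNP, count the RNA reads supporting each allele and test against the 50/50 split you'd expect under balanced expression; monoallelic expression shows up as an extreme ratio. The counts below are hypothetical.

```python
# A minimal sketch of calling allele-specific expression at a heterozygous
# coding SNP from RNA read counts for the two alleles.

from math import comb

def two_sided_binom_p(k, n, p=0.5):
    """Two-sided binomial p-value by doubling the smaller tail (capped at 1)."""
    tail = min(k, n - k)
    lower = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(tail + 1))
    return min(1.0, 2 * lower)

def classify_ase(ref_reads, alt_reads, alpha=0.01, mono_fraction=0.95):
    n = ref_reads + alt_reads
    if n == 0:
        return "no coverage"
    if max(ref_reads, alt_reads) / n >= mono_fraction:
        return "monoallelic"
    if two_sided_binom_p(ref_reads, n) < alpha:
        return "allele-specific expression"
    return "balanced"

# Hypothetical read counts at three heterozygous SNPs:
for ref, alt in [(48, 52), (80, 20), (99, 1)]:
    print(ref, alt, classify_ase(ref, alt))
```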

Samples: 700 children with acute lymphoblastic leukemia. Yearly follow-up data and drug response, in vitro drug sensitivity, immunophenotype, cytogenetic data. RNA available for 1/3 of them.

Genotyped over 3531 SNPs in 2529 genes. ASE was detected in 400 (16%) of the informative genes, 67 of which displayed monoallelic gene expression. (Milani L, et al, Genome Research 2009)

Methylation analysis: Selection of 1536 CpG sites from >50,000 CpG sites in genes displaying ASE. Custom GoldenGate methylation panel. (ibid)

SNP discovery: 56 genes displaying ASE selected for sequencing in 90 ALL samples. Template preparation with Nimblegen sequence capture. Illumina sequencing.

To date: 16 samples hybridized. 5 samples sequenced with the GA I. 81-97% align to the genome (Eland), 28-67% align to the target region (MAQ).

Overview of sample sequencing coverage.

Initial SNP discovery – 2063-4283 SNPs found per sample with MAQ. 3422 in at least two samples. 818/3422 are novel (not in dbSNP).

My comments: This is an interesting talk, from the big picture view. Dr. Kiialainen is spending a lot of time talking about metrics that haven't really been used much this year (percent alignment, etc.) and explaining figures that are relatively simple. There wasn't much data presented – essentially it's no more than an outline and statistics of the sequences gathered. Not my least favourite talk, but had very little content, unfortunately. Knowing you can do allelic studies is neat, however, which is clearly the best part of the talk. My advisor is chairing the session... and asking the same questions he asks me. Nice to know I'm not the only who answers with “No, I haven't done that yet!”

Labels:

David Dooling, Washington University School of Medicine, “Next-Generation Informatics”

[Had to change rooms, missed some of the start of the talk]

The rate of change is far outstripping Moore's law.

So: Framing the problem - Viewpoints:
Lims: [Picture of Richard Stallman.. Nice!] how do we process and track information?
Analysis: [picture of Freud.. also Nice... same beard?] How do we process and extract information?
Project Leads: In, and Out... what's the answer?

Pipelines: Always changing! Buffers, software, tools, etc etc, etc!!!!

Analysis: Changing Pipeline: Proliferation of Data has led to a proliferation of tools.

So how do we do things on a massive scale, but deal with the constant change.

“We've always been pushing the envelope...” using the past as a guide to how to deal with the change.

As developers, put it in terms of flow charts, databases, pathways.. etc. Get a handle on the problem

How we deal with it: Regular entities to event entities to processing directives

The problem comes when the processing directives change... and that's a big change – frequently. So, to deal with it, entities were classified. To apply this, things were abstracted to big units, which can each be modular. By making things modular, they can be substituted.

1.Created an object-relational mapping (ORM) layer.
2.Object Context
3.Dynamic command-line interface
4.Integrated Documentation System.

The ORM was created from scratch because none of the others were able to cope with the workload being demanded of it. Everything works in XML, so you can verify flow, and it makes it easier to do parallelization.
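
The "abstract to modular units so they can be substituted" point is easier to see with a toy example. This isn't the WashU Genome Model code (theirs is Perl; names here are invented), just a generic illustration: if every processing directive is a module with the same tiny interface, swapping an aligner or adding a filter is a substitution, not a pipeline rewrite.

```python
# Not the WashU "Genome Model" code -- just a generic illustration of the
# modularity point. Every step shares one tiny interface, so steps can be
# swapped or inserted without touching the rest of the pipeline.

from typing import Callable, Dict, List

# A "step" is anything that maps an input payload to an output payload.
Step = Callable[[Dict], Dict]

def run_pipeline(steps: List[Step], payload: Dict) -> Dict:
    for step in steps:
        payload = step(payload)
    return payload

def filter_short_reads(payload: Dict) -> Dict:
    payload["reads"] = [r for r in payload["reads"] if len(r) >= 4]
    return payload

def align_with_toy_aligner(payload: Dict) -> Dict:
    payload["alignments"] = [f"aligned:{r}" for r in payload["reads"]]
    return payload

# Swapping the aligner, or inserting a new filter, is just editing this list:
pipeline = [filter_short_reads, align_with_toy_aligner]
print(run_pipeline(pipeline, {"reads": ["ACGT", "AC", "GGATC"]}))
```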

All of these things together become “Genome Model”, which is a thin wrapper around all their tools, which give you massively parallel system with excellent data management and reporting.

Yikes... has an easy PERL API. [Everyone likes perl? Count me out.]

working model for employees: Pairing: analysts are paired with programmers so that better software is written.
Challenges:
Still much more to do.
Sequencing is demolishing Moore's Law
The cult of traces – the desire to have raw information at our fingertips. (Venn diagrams don't scale well, but things like Circos do!)

Labels:

Pacific Biosciences - Steven Turner " Applying Single Molecule Real Time DNA Sequencing"

Realizing the power of polymerase SMRT.
Each nucleotide is labeled uniquely; the fluorophores are cleaved off, leaving behind just the DNA. Using a zero-mode waveguide, only the nucleotide being incorporated is seen. [cool videos].

At end of signal, it just moves on to the next base...

Every day, they're working on SMRT – showing a demo run... 3000 reactions in parallel. Multi-kb genomic fragment. Just one polymerase. Similar to electrophoresis... keeps going.... and going and going. Real time – several bases per second. Put it at the bottom of the screen, and it just keeps going on and on and on. [it IS transfixing.]

Start with genomic DNA, shear it by any method you want, and now, ligate with HAIRPINS! It's now circular.... so you can keep going around and around. You get both sense and antisense DNA. Can close any size... call it a “SMRTbell prep” (eg, not a dumbbell... heh... not really that funny.) They also use a strand-displacing enzyme, so it just displaces what was already there.

First project was a human BAC, last November: 107kb of chromosome 17. Production read length: 446bp. Max read: 2,047bp. Aligned to NCBI, and validated by Sanger.

In non-repetitive regions: 99.996% accuracy. Missed 3 SNPs that were false negatives. Repetitive: 99.96%, missed 7 bases. Have made significant progress since then...

Sequenced E. coli to 38-fold... 99.3% (last January), max read length at 2800 bases. 99.9999992% [I hope I got that right!]

4 errors + 1 variant on whole genome (Q54!)

Heh... they had issues from an artifact caused by more DNA closer to the ORI in E. coli, from stopping cultures in mid-phase. They now have such incredible accuracy that they can measure it.

Accuracy does not vary more than 5% over 1200 bases. Heads for Q60 around 20-fold coverage.
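
Quick aside for anyone (like me) who has to keep re-deriving it: the Q values being thrown around here are just Phred-scaled error rates, Q = -10 log10(P_error). A tiny converter for sanity-checking that kind of figure; the exact numbers in my notes above may well be garbled.

```python
# Phred-scale sanity check: Q = -10 * log10(P_error).

import math

def phred(p_error):
    return -10 * math.log10(p_error)

def error_rate(n_errors, n_bases):
    return n_errors / n_bases

# e.g. a handful of consensus errors over a ~4.6 Mb E. coli genome:
print(round(phred(error_rate(4, 4_600_000)), 1))   # ~Q60 territory
print(round(phred(1e-5), 1))                       # Q50
```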

8-molecule coverage. (8 individual DNA strands have contributed.) Dependent on the fluorophores... they each show different brightness profiles. So, some channels are still weak, but they have new ones in development to replace them.

One example: First time you can bridge a single 3200bp region. 3bp/sec. (2.6kb duplication region in the middle.)

Development: average of 946bp read length... and up to 1600 at the high end. You trade throughput for read length... at one end, fewer SMRT waveguides complete, but with long reads; at the other, more complete at shorter read lengths.

Consensus on a single molecule. You can also do heterogeneity. If you put in mixes, you get out a mix, with a linear relationship to the fraction recovered. (eg, snps will be very clean.)

Flexibility: you can do long OR short reads. Redundancy is high, so you can get 1ppm sensitivity. 12 prototype instruments in operation. Expect delivery in Q3 2010.

Labels:

A quick note...

I don't know if anyone is following along with my (terribly disorganized) notes from AGBT today as I haven't had any comments, but I figured I should just mention that there are a few things I've left out, and haven't had the chance to blog. For instance, I didn't mention some of the neat people I've met today, the old friends I've caught up with, the bloggers I had lunch with, or the dinner panel hosted by Pacific Biosciences... or even the three random people who recognized my name and said hi. (Apparently there are people who read my blog out there!)

So, just so you don't think I'm skimping out on those things, too, I will get around to talking about them at some point - when I get to plug in my laptop and type at my leisure... anyhow, more things are starting. Later.

Adam Siepel, Biological Statistics & Computational Biology – Comparative Analysis of 2x Genomes: Progress, challenges and opportunities.

Working on the newly released mammalian genomes at 2x coverage. We're rapidly filling out the phylogeny, so there's a lot of progress going on. We can learn a lot by comparing genomes more than we can by looking to a single genome.

Placental mammals (Eutherian) are well sequenced. The last of the 2x assemblies were released just last week. There are 22 genomes being focussed on: most are 2x, a couple are 7x, and some are in progress of being ramped up.

One of the main obstacles is error (sequencing or otherwise). Miscalled bases and indels from erroneous sequences have a big impact. Thus, the goal is to clean up the 2x sets. In 120 bases, 5 spurious indels and 7 miscalled bases. [Wow, that's a lot of error.] Nearly 1/3rd of all 1-2 base indels are spurious.

Thus, comparative genomics often gets hit hard by these errors.

A solution: error-correcting codes: use redundancy to systematically reduce error. In some sense, there is a version built in – we can use comparative genomics to “decode” the error-correcting code. This can be done because the changes between species tend to vary in predictable ways.

The core idea: indel imputation: “Lineage-specific indels in low-quality sequence are likely to be spurious.”

Do an “automatic reconstruction” using parsimony... if a lineage-specific indel is in low-quality sequence, then assume it's an error. More computationally intense methods are actually not much more effective.

There is also base masking – don't try to guess what the bases should be, but just change them to N's. Doing these things may change reading frames, however.
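
Here's my own cartoon of that idea (definitely not their actual method, which uses real parsimony over the tree): in a column-wise view of a multiple alignment, a gap that appears in exactly one species, and only where that species' quality is low, gets treated as a spurious indel and masked rather than as a real lineage-specific event. The alignment and quality scores below are invented.

```python
# A cartoon of the indel-imputation idea: mask a gap that is unique to one
# (low-quality) lineage instead of treating it as a real indel. Simplified:
# real methods reconstruct ancestral states with parsimony over the tree.

def mask_spurious_indels(alignment, qualities, min_quality=20):
    """alignment: dict species -> aligned sequence (same length, '-' for gaps).
    qualities: dict species -> per-column quality scores (same length)."""
    species = list(alignment)
    length = len(next(iter(alignment.values())))
    cleaned = {s: list(alignment[s]) for s in species}
    for col in range(length):
        gapped = [s for s in species if alignment[s][col] == "-"]
        if len(gapped) == 1 and qualities[gapped[0]][col] < min_quality:
            cleaned[gapped[0]][col] = "N"   # mask the suspect position
    return {s: "".join(seq) for s, seq in cleaned.items()}

aln = {"human": "ACGTACGT",
       "dog":   "ACGTACGT",
       "shrew": "ACG-ACGT"}            # 2x genome with a lone gap
qual = {"human": [60] * 8,
        "dog":   [55] * 8,
        "shrew": [40, 40, 40, 7, 40, 40, 40, 40]}
print(mask_spurious_indels(aln, qual))
```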

After doing the error correction, the error appears to drop dramatically. (I'm not sure what the metric was, however.)

Summary: good dataset with some error. Correction method used here is a “blunt instrument”, many or most errors can be masked or corrected if some over-correction is allowed. There is a trade off, of course.

Conservation has its own problems as well. Thus, they have been working on new programs for this type of work: PhyloP. It has multiple algorithms for scaling phylogeny, and the like. Extensive evaluations of the power of these methods were undertaken. However, the problem is that people are at the limit of what they can get out of conservation, depending on what's there. Power is pretty reasonable when selection is strong, or when the elements are longer (e.g. 3bp.)

Discussing uses of conservation.... moving towards single base pair resolution.

Labels:

Jeff Rogers, Baylor Human Genome Sequencing Center - “Linking the fossil record with comparative primate genomics”

Recently moved, so he doesn't have a lot of results to discuss. Thus, this will be from the perspective of a primatologist. What are the implications of next-gen sequencing for interpreting genomic comparisons? Obviously it's huge.

First non human primate = chimpanzee (our closest species “relative”). Second was Rhesus Macaque (Single most important for biomedical research.)

Just about all of the species that are representative on the chart, (Raaum et al, 2005, J. Hum evolution 38, 237-257), are now done or slated to be sequenced soon.

Talk about baboons for a while – there are a lot of old world monkeys related to baboons, and all of them will be sequenced. There are desert baboons, plains and grassland baboons and rainforest baboons. [Ok, this is a lot of baboon information. And I've never seen so many baboon pictures in a presentation.] There are high mountain primates too, and they apparently eat grass – and are closely related to baboons. [Who knew.]

And now macaques – there are 22 species.. I think only two of them are being sequenced.

Dr. Rogers is suggesting we re-think how we decide which species to sequence, to best use the next-gen sequencing technology. Pick a few “focal species”, and then do all of the other species too. This would help with closely related genomes. It would help us get more information about genome evolution and dynamics, more information about ancestral genomes, and finally, more benefits to biomedical research. Discovery of novel animal models for studies of disease and normal functional variation.

This is, apparently, not a new proposal. They did a similar concept for Rhesus monkeys and orangutans. [So if they're already doing it, what is he proposing that's new?] What he's suggesting is to expand the breadth and depth of the sequencing. [ok. Fair enough.]

So, back to baboons. The olive baboon is the main reference baboon, but the others should all be done.. they're different in habitat and behaviour.

At Baylor: do the reference baboon to 3x Sanger, 4x 454. Lots more statistics presented. Take-away message: better coverage = better sequence information. You can also get good results by mixing Sanger with 454.

[I lost focus here... I didn't see much worth writing down.]

Moving on to evolution and species... When you sequence another primate, you don't get a look at the ancestor. [dude... evolution 101.] Talking about some of the other species that have died out – fossil species.

Now we're on to locomotion. Some monkeys walk on all fours, putting weight on the palms of their hands. Others (apes) walk mainly on the feet? [I'm not so sure what this is about.] I think the point is that there are some ancestral monkeys that walked differently than any of the other living apes and monkeys. [Interesting, but highly unrelated to anything else at this conference.]

It's controversial: human ancestors were never knuckle walkers, unlike chimps, gorillas, etc... thus, knuckle walking evolved twice, and humans evolved bipedal motion separately, without a knuckle-walking intermediary.

Conclusion: there is a lot of diversity among primate, both living and extinct. We should study it to help understand our own biomedical applications.

My Comments: The wrong talk for this conference, but a nice diversion. The content was provided slowly enough that everyone could follow, which made for easy note taking. Kinda like an undergrad lecture in primatology. Still, I'm really at a loss to explain why anything presented here really is a consequence of the advent of new sequencing technology.

Phil Stephens, Structural Somatic Genomics of Cancer

Andy Futreal could not show up – he got snowed in in Philadelphia.
All work is from the Illumina platform.

Providing an overview of multistep model of cancer...

Precancer to in situ cancer, to invasive cancer, to metastatic cancer.

They believe cancers have 50-100 driver mutations plus ~1000 passenger mutations. Some carry 10s to 100s of thousands of passengers.

The big question is how to identify the driver mutations. Today, the focus is on structural variations, 200bp to 10s of Mb. These can be seen as copy number variations or as copy-number neutral (balanced translocations, etc.)

For instance, the upregulation of ERBB2, by causing multi copies.

For the most part, we have no idea what the genomic instability does to the copy number at local points, and what they accomplish.

Balanced translocations are interesting because they tend to create fusion genes. There are at least 367 genes known to be implicated in human oncogenesis; 281 are known to be translocated. 90% are in leukemias, lymphomas... [missed the last one]

Use 2nd gen to study these phenomena. For structural variation, it's always 400bp fragments, with PET sequencing. Align using MAQ. Basically, look for locations where the fragments align wider than 400bp. Need high enough coverage to then check that these are real. You then need to verify using PCR – and check whether the germline had the mutation as well. Futreal's group is only interested in somatic changes.
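
The "wider than 400bp" logic is worth spelling out for myself. The sketch below is my bare-bones version (not the Sanger pipeline): flag read pairs whose mapped span is far larger than the library allows, cluster nearby flagged pairs, and keep clusters supported by several independent pairs as candidate breakpoints, which would then go to PCR and germline comparison. Coordinates are made up.

```python
# A bare-bones sketch of paired-end structural variant detection: discordant
# (too-wide) pairs are clustered, and clusters with enough independent
# support become candidate rearrangement breakpoints.

def candidate_breakpoints(pairs, expected_span=400, tolerance=100,
                          cluster_window=500, min_support=2):
    """pairs: list of (chrom, left_read_pos, right_read_pos) for
    same-chromosome read pairs. Returns clusters of discordant pairs."""
    discordant = [p for p in pairs
                  if (p[2] - p[1]) > expected_span + tolerance]
    discordant.sort(key=lambda p: (p[0], p[1]))
    clusters, current = [], []
    for p in discordant:
        if current and (p[0] != current[-1][0]
                        or p[1] - current[-1][1] > cluster_window):
            clusters.append(current)
            current = []
        current.append(p)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_support]

pairs = [("chr8", 1000, 1380),        # concordant
         ("chr8", 50_000, 92_100),    # spans a putative deletion...
         ("chr8", 50_220, 92_300),    # ...supported by a second pair
         ("chr8", 200_000, 200_390)]
print(candidate_breakpoints(pairs))
```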

Now, that's the principle, so does it work? Yes, they published it last year.

NCI-H2171. Has 6 previously known structural changes as controls. Very simple copy number variation identification. Solexa copy number data is at least as good as the Affy chip. They suggest Solexa has the ability to find the true copy number, whereas the Affy chip tends to saturate.

For control, they found intra-chromosomal reads, and then verified with PCR. Two reads mapped to the breakpoint, and were able to work out the consequences of the break. Used a circos diagram to show most translocations are intra-chromosome, and only a small number of them are inter-chromosome.

Since publication, they've now worked on the same project to update the data. They're better at doing what they did the first time around. They redid it on 9 matched breast cancer cell lines, and got ~9x coverage.

HCC38 – no highly amplified regions. Found 289 somatic chromosomal translocations. Most of the changes are due to tandem duplications; however, this was not replicated in another cell line. So, structural variation is highly variable.

Distinct patterns emerged: one line has lots of tandem duplications, one has very little structural variation, and one has a more lymphoma-like pattern.


“Sawtooth” pattern to CNV graph: lots of different things going on. Some are simple, some are difficult.

What are the Structural Variations doing?

Looked at examples of fusion proteins. In one cell line, HCC38, found 4 SVs. Found smaller SVs as well.

Duplications of exons 14 and 15 in one particular gene: receptor tyrosine kinase, which seems to be in the ligand binding domain. Also evidence from many other observations of SNPs in the same domain.

What they didn't know was if it would reflect what's going on in breast cancer. 15 primary breast cancers were then sequenced. (65Gb total).

Huge diversity was found. Anywhere from 8-230+ structural variants per tumour. The same patterns as in the cell lines were found. 11 potential promoter fusions...

[The numbers are flying fast and furious, and I can't get close to keeping up with them.]

151 genes are found in 2 samples. 12 in 3, 5 in 4....

How do you assay for variants? FISH, cDNA PCR! Other mutations in rearranged genes. Whole exome sequencing, transcriptome sequencing, and epigenetic changes are down the road.

Also can look at the relationship between somatic break point positions in the genome.

Conclusions: PET sequencing is useful for structural variation.
The average breast cancer has ~100 somatic mutations.
The average cancer has ~3.2 fusion genes.

Question: Genome vs Transcriptome? Answer: Both!
Question: how many of the hits are false? Answer: at first it was 95%, now it's down to 10%.

My comments: very nice talk. Since this is basically similar to what I'm working on, it's very cool. It's nice to know that PET makes such a huge difference. The paper referenced in the talk was a good read, but I'm going to have to go back and reread it.

Labels:

Tom Hudson – Ontario Institute for Cancer Research, “Genome Variation and Cancer”

Talking about two cancers: Colorectal tumours
1200 cases and 1200 controls
looking for predictors of disease

1536 SNPs from candidate genes, in 10K coding non-synonymous SNPs, Affy 100K and 500K arrays.

Eventually found a hit in a gene desert (a long intergenic non-coding RNA... learned the name this morning (= ). Close to MYC, but hasn't been correlated to anything.

In the last year, 10 validated loci in 10,000 individuals, with very small odds ratios (1.10 to 1.25). One of them is a gene: SMAD7. 5 loci are also near genes that are involved in relevant processes... but are not actually in the genes.

Since there are 10 risk alleles, you'd expect a distribution of counts; however, most people carry 9 (27%)! There is also a linear relationship between the number of alleles and the risk of developing cancer. However, this still doesn't seem to be the causative allele.

Enrichment of Target Regions. Using a specific chip with 3.14Mb colon cancer specific regions. Those regions didn't take all of the space, so they added other colon cancer gene sequences as well.

Protocol: 6ng, fragmentation (300-500bp)... [I'm too slow]

Exon capture arrays are being used. Preliminary results: 40 DNAs: 65 Gb.

Use MAQ to do alignments. Coverage 75% at 10X, 95.6% at 1X.
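Incidentally, numbers like "75% at 10X" are just the fraction of targeted bases whose depth meets a threshold. A tiny sketch of the calculation, assuming you already have a per-base depth array (which the talk obviously didn't show):

    def fraction_at_depth(depths, threshold):
        """depths: per-base read depth across the captured target regions."""
        if not depths:
            return 0.0
        return sum(1 for d in depths if d >= threshold) / len(depths)

    # e.g. fraction_at_depth(depths, 10) -> ~0.75 and fraction_at_depth(depths, 1) -> ~0.956
    # would reproduce the figures quoted above.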

“More than 99% of gDNA has % GC that allows effective capture”

Analyzable Target Regions: 39175, 232 coding exons
Average coverage: 70.3

40 individuals yield 8,706 SNPs
Known: 59.6%
New SNPs: 2,397
Total number in coding exons: 77

Sequencing data compared to Affy data, very high concordance.

Rare alleles may be driving risk in several sporadic cases. Stop codons were found in 6 individuals with sporadic CRC.

Follow up genotyping is required to validate new SNPs and correlate with phenotype.

Second topic: International Cancer Genome Consortium.

“Every cancer patient is different, every tumour is different.” Lessons learned: Huge amount of heterogeneity within and across tumour types. High rate of abnormality, and sample quality matters!

50 tumour types x 500 tumours = 50,000 genomes.

Major issues: Specimens, consent, quality measures, goals, datasets, technologies, data releases.

[Mostly discussion of the mechanics of the project management, who's involved and where it's happening, as well as which tumours, which I'm sure can be found on the OICR's web page. OICR is committed to 500 tumours, using Illumina and SOLiD. They are also creating cell lines and the like, so there will be a good resource available.]

Pancreatic data sets should be available on the OICR web page by June 2009.

Question: why Illumina and SOLiD? Answer: they didn't know which would mature faster. By doing both, they have more confidence in SNPs. They never know which will win in the end, either.

My Comments: Not a lot of science content in the second half, but quite neat to know they've had success with their CRC work. It seems like a huge amount of work for a very small amount of information, but still quite neat.

Labels:

Keynote Speaker: Eddy Rubin, Joint Genome Institute - “Genomics of Cellulosic Biofuels”

They're funded by the DoE, so they have a very different focus. After all the work they did, they realized that the E stood for Energy, so they've started working on that. (-;

More than 98% of energy in transportation is from petroleum, for which there are environmental and political consequences. They've known about it for 30 years, but haven't worked on it much yet.
Churchill quote: You can count on Americans to do the right thing, as long as they've exhausted all other options.

The focus is on things like biofuels, mostly cellulosic biofuels. For those who don't know, that basically means using biomass – mostly cellulose. Many current technologies just use the sugar (edible) part of the plant – but cellulosic energy would use the non-edible parts of the plant, e.g. the cellulose.

Every gallon of cellulosic biofuel produces 12x less CO2, and 8x less than corn biofuel.

How do the genomes of bioenergy plant feedstocks help?

10k-fold increase in energy derived from domesticated grasses and wheats as compared to wild grains. So, domestication is a big deal. Can we domesticate Poplar?

If we could choose, we'd like short, stubby trees with compact root systems. There are groups that are systematically manipulating Auxins to try to cause this to occur. They've had some success. Can create shorter, stubbier trees, or trees with thicker trunks. So, it's working reasonably well.

Poplar is niche, though. The real thing is grasses. They can be harvested, and they don't need to be replanted. (Something about them squirting their nutrients into the soil at the end of the year...)

Anyhow, there are already organisms that do cellulosic breakdown, so those should also be sequenced.

(One of them is a “stink bird”, which belches and smells... odd. Another is a ship-boring mollusk, which digests ships' bottoms.)

Can we replicate cellulosic degradation like that found in intestinal environments?

To dissect termites, you chill them on ice, and then pull their heads from their tails, and eventually the guts are displayed. Ok, then.

Once you have the guts, you can sequence the microbiome. Doing so, they found more than 500 Cellulose and Hemicellulose degrading enzymes.

They also work with cow guts (fistulated cows). The volume obtained from 200 termites: 500 µl. The amount from one cow: 100 ml.

(for the record, pictures of wood chips after 72 hours in a cow stomach – not appealing.)

One experiment that can be done is to feed the cow various types of feed to see what enzymes are being used. The enzymes being used are very different, but the microbial community is the same. This is a new source of enzymes for degradation of energy crops.

The final step in this process: conversion of biofuels to liquid fuel. The easiest route is fermentation. More than 20% of the sugars you get from degrading wood are xylose... and it's not being fermented. So, organisms that use xylose and convert it to ethanol have been found and are being used.

Ethanol has problems, though – transportation and efficiency of production. Ethanol kills the organisms that produce it.

“Ethanol is for drinking, not for driving” Jay... [missed the last name]

Pathway engineering is going to become an important part of the field, so that organisms will do the things we want them to. [Sort of seems like a shortcut around diversity... I wonder if people will be saying that in 10 more years.]

My Comments: This was a pretty standard talk about cellulosic biofuels/ethanol. I saw similar talks in 2006, so I don't think much has changed since then, but the work goes on. I don't know why it was a keynote, in terms of subject, but it definitely was a well-done talk!

Labels:

Oyster genome...

Note to my boss: Shenzhen is sequencing the oyster genome.  See, you should have sent me back to Tahiti to work on the pearl oyster genome! (-;

Labels: ,

Jun Wang, Beijing Genomics Institute at Shenzhen - “Sequencing, Sequencing and Sequencing”

Shenzhen is one of the biggest sequencing centres in the world – both in sequencing throughput and in computing capacity.

With >500Gb per month, what would you do?

The obvious choice is to do whole genomes: from the giant panda to the tree of life. (Is the panda really a bear?) Formal reason: they eat bamboo, are cute and nice... and they're cute! Ok, the real reasons: they selected an animal “without competition” for sequencing, it has a significant “Chinese element”, and it's a proof of concept that short read lengths are good for assembling a large genome.

Why do we need longer reads? 10 years ago, the question was: can you sequence by shotgun sequencing? Yes... now, can we do it with short reads? Yes, but there are questions:
  • Read length: the longer the better.
  • Insert sizes: for finishing, this becomes important.
  • Depth: determines quality.

Why short reads work: most of the genome is really unique anyhow. Insert size is probably the most important matter.

( Started with a pilot project: cucumber. )

Panda: has 20 chr + X/X. Did inserts from 150 to 10,000 bp. 50X sequence coverage, 600X physical coverage.

Genome coverage is 80%. Gene coverage 95%, Single base error rate is Q50, less than 1/100kb.
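(My own aside, for reference: on the standard Phred scale Q = -10 * log10(p), so Q50 corresponds to an error probability of 10^-5 – one error per 100,000 bases – which is consistent with the "less than 1/100kb" figure above.)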

Gene stats: 27.8k homologous to dog genes.

Of the sequenced genomes, the panda is evolutionarily closest to dog, and next closest to cat. (But the panda is a bear.) Its evolutionary rate is slightly higher than dog's. Would like to add significant species to the tree of life.

One of the original questions on what to sequence: “Tastes good, sequence it!” Now, it's close to 50% of the major dinner table! [yikes]

Instead, now proposes cute things: Penguins!

Aiming to sequence “big genomes”, 100Gb+ genomes.

The first Asian genome was sequenced last year.

Is one genome enough? No, probably not... Need 100's to study population genetics. Now taking part in a 100-genome project. Committed to 3 Tb (about 500 individuals).

De Novo assembly is the only solution for a complete structure variation (SV) map. Still too expensive, though.

Started a new project sequencing Asian cancer patients. The cost is about $4000-5000 per sample. [I missed how many per person]

Top 10 causes of death for Asians... start to rank them, and decide which to attack.

4P healthcare (personalized medical care) is coming, all based on personal genomics. Picture of FAR too many people on a beach in China.

Already sequenced all major rice cultivars. Found many selective sweeps – lots of new variation?

Also working on Silkworm study... [this is just rapidly turning into a list of projects they've started. Interesting, but nothing much to gain from it.]

DNA methylome: just finished the first Asian version.

Also working on methylation that changes as you climb mountains. [Ok, I just don't really get this one.] High altitude adaptation... [but why is this a priority?]

[At the bottom of the slide it said “Work? Fun? Science?” I'm not really sure if that was any of the above.... strange.]

Also doing Whole Transcriptome. Several species, plants, insects, etc.

You need huge depths (400x) to get all transcripts expressed at a level greater than or equal to 1, but the required depth decreases from there.

Also started a 1000-plant collaboration. Genomics has barely scratched the vast biodiversity on the planet. They are going to start working on this, from algae to flowering plants.

1Gb of transcript sequencing per sample would be equivalent to 2M EST.
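(My own back-of-the-envelope check: assuming a typical EST read length of roughly 500 bp, 1 Gb / 500 bp ≈ 2 million reads, which is presumably where the 2M-EST equivalence comes from.)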

Now doing 75 bp PET reads.

Another project: Metagenomics of the Human Intestinal Tract.

“Sequencing is Basic” [eh?]

Question asked: How many people are there? Answer: 1000, over 3 campuses, mostly young university dropouts who work hard and sleep in the lab!

My Comments: It's interesting to know what these guys are doing, but it just seems really random. They may be the biggest, but I wonder where they're going with the technology... It appears to be a technology in search of a project, unlike the way the rest of the world is working towards projects and then applying the technology. Maybe someone else can figure out what their underlying goal is and explain it to me. :/

Labels:

Les Biesecker, NHGRI/NIH - “ClinSeq: Piloting Large-Scale Medical Sequencing for Translational Genomics Research”

Clinicians are more conservative than biologists. Change will be hard, and will take time to see which of them work, and how to use them. [Interesting observation.. but not new]

First figure. Three main traits: genome breadth, # of subjects, clinical data. Each displayed on a different axis. [Uh oh, I can already tell this talk is going to be way outside of the realm of my interest, but I'll try to take decent notes.]

We basically want all of these: lots of genomes, lots of people, and lots of clinical data – that gives you the ideal study.

Genetic architecture of disease: it's a spectrum, with rare to common alleles, and with high and low penetrance, with lots of admixture of diseases and phenotypes. (Using a Yin-Yang variant diagram to explain it.)

What we need to do is develop the clinical infrastructure to allow this type of data to be produced. We need to get to the point where we have a clinician with a patient in one room with full access to the patient's genome. [odd, that.. do you really think that's the way to go?]

Initial approach to one study (atherosclerosis):
  • 1000 subjects
  • Initial phenotype
  • Sequence 400 candidate genes (“Completely wigs [clinicians] out.”)
  • Associate variants with phenotypes
  • Return results.

They told people not to sign up for the project unless they were willing to have their whole genome sequenced and used for the study. Apparently, sequencing genomes for clinical purposes is a “radical” idea. However, they have really been overwhelmed by applicants. Currently sequenced 300 patients, have ~600 people recruited. (Using the current PCR pipeline.) The idea is to use an older technology, and then once the pipeline is in place, do the substitution, so that everything is in place. The key bottleneck is the bioinformatics, not the generation of data.

Something about “CLIA”, the process a sample can flow through so that results can be returned to patients... I've never heard of it, but it's apparently part of clinical studies.

Coverage: 140 genes, 402 kbp.
Variants: oops, too slow. ~3000?

985 overlapped with HapMap.

Uncommon alleles chart – seems to have a lot of very uncommon SNPs, so you're still finding lots of new SNPs.

Back to CLIA. They have a data flow pipeline, which brings the patient back to the clinic, so that they can review the results.

List of subprojects: positive controls, validating recent associations of rare variants & phenotypes, sequencing genes under GWAS peaks for rare, high-penetrance variants, testing associations, control cohort for other sample sets, search for miRNA variants, cDNA sequencing to measure expression, capture method refinement, patient motivations and preferences for results of medical sequencing, testing automated vs manual pedigree acquisition.

One positive example: By doing genomic sequencing (and several other tests), they identified a patient who had a mutation in LDL, which changed the way that that whole family is being treated. [Neat.]

Of interest: compared their results to another study and showed that they had a completely different result. [I missed what that other study was.... different genotyping variants found, I think.]

Penetrance: we currently know how to work with high penetrance variants, and so maybe that's where we should start, and then wander down the penetrance curve till we get to the low end. The ones at the VERY low end, Dr. Biesecker claims, are not clinical tests... they are just “noise”, if I understand him correctly.

Classic paradigm: hypothesis, phenotype, apply assay, correlate.
New paradigm: apply assay, sort genotype, generate hypothesis, sort phenotype, correlate.

There are no conclusions to the talk or the study, because they're just getting underway. Many patients are interested in this research, and don't shy away from genome sequencing. We can use this pipeline to look for variants... and it will accept new sequencing technologies as they are developed. When exon sequencing is ready, they'll do it... and one day they'll move to whole genome.

My comments: Actually, not a bad talk, but really so far outside of the realm of what I'm used to working on that I'm not sure what to make of it. Doing whole genome association is never easy, and the assertion that we need to get there eventually is good. He acknowledged that we don't know how to get there – and that's not really a surprise for the clinical setting.

Labels:

Mike Kozal, Yale University, “Ultra Deep Sequencing and Other Genotyping Technologies to Detect Low-Abundance Drug-Resistant Viral Variants”

Lots of political jokes. “Following Eric Lander, I feel like McCain following an Obama speech...”

Talking about three specific viral pathogens: HIV, HCV, HBV.

The percentage of people surviving longer with AIDS is increasing. We're getting better at taking care of these patients. The proportion of patients dying with AIDS is actually approaching that of non-infected people. A 25-year-old with AIDS can now be expected to live another 40 years. With all the medications available, they can halt replication of the virus. The disease burden is still a problem, however.

There are a few patients, however, who have resistant strains. Clearly, this comes from people who have the disease and are getting treatment but are still transmitting. About 10% of the population have resistant strains.

Therefore, in the clinic, HIV genotyping is standard. 1000's are being ordered daily in the US.

In addition to the “sloppy” polymerase, ~10 billion viruses are produced a day in a single person. (Wow... no wonder it mutates rapidly!)

Dr. Kozal is now covering some of the tools used for genotyping HIV, as well as how resistance forms... Highly useful for clinicians, though maybe not so much for those who are already familiar with resistance development. (Re-emergence model.) Clinics still use Sanger-based sequencing – so re-emergence is a problem when sequencing can't detect sub-populations.

One major problem in sequencing viruses is linkage: how do you distinguish three separate variants from one variant with three mutations? This could be very important in how the drug treatment is applied.
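To make the linkage point concrete, here's a toy sketch (my own illustration, with a made-up data structure): if individual reads span all of the variant positions, you can group reads by the alleles they carry and see whether the mutations sit together on one haplotype or are spread across separate variants.

    from collections import Counter

    def haplotype_counts(reads, variant_positions):
        """reads: list of dicts mapping reference position -> base observed in that read."""
        counts = Counter()
        for read in reads:
            # only reads covering every variant position are informative about linkage
            if all(pos in read for pos in variant_positions):
                counts[tuple(read[pos] for pos in variant_positions)] += 1
        return counts

    # One dominant triple-mutant haplotype vs. three separate single-mutant haplotypes
    # would show up as very different count patterns here.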

Note: Oddly enough, at this point, somehow Dr. Kozal has switched to discussing amplicons as a sequencing strategy. I'm a little confused as to what process he used, since this seems to be in the context of 454 sequencing... I hadn't realized that 454 sequencing required cloning. Or I could be very confused about what he's describing.

Talking about a study done, now, in which some patients failed out quickly – presumably because they harboured a mutation that allowed the virus to dodge the drug being tested. Most of the variations were found at VERY low abundance (<5%) and could not be detected by Sanger seq. Depending on type and level of variant, the patient's time to failure could be predicted.

Switch to Hep B.

Describing Sanger approach to screen, where early studies showed that variants have clinical implications. Main limiting factor is the ability to extract sufficient viral genomes for the assay.

Jumping back to other viruses: apparently 34% of HIV patients who were classified as “wildtype” (by Sanger sequencing) actually have drug-resistant strains when the same study is redone with real-time PCR.

Other technologies: PASS technology (parallel-analysis sequencing?). Can get down to 1%, and able to re-analyze the same cluster. Neat – I missed the journal reference though (font was too small – I'll have to move up for the next session...)

Needs: New diagnostic tools – need to move into clinical settings. How sensitive do you need to be for good outcomes? How do you treat linked mutations? Can linkage be used for better outcome prediction? The current floor of detection seems to be 0.2% (mainly caused by PCR problems).

My comment: This is neat, and is certainly an interesting application where Second-Gen sequencing could have a huge impact, yet the talk is mainly about Sanger based sequencing, and how it should be replaced with new technology, with the ultimate question of “how deep do we need to go?” in the study of viral genomics for clinical use.

Edit: Answering questions with "If I could get access to the president's ear..." while Eric Lander is sitting in the audience.  Nice.

(I think I'm going to have to revisit these notes and clean them up at some point... as I look over my notes, I can see they're pretty messy.)

Labels:

Eric Lander, Broad Institute of Harvard and MIT – The New World of Genome Sequencing

Dramatic increases in data production – we now expect exponential increases in productivity, and exponential decreases in cost. Next Gen sequencing is where all of this comes from – thanks to the major players in the field. 2Gb a day is where we're going now – but by the next meeting it will have changed well beyond that.

We should now consider sequencing as a general purpose tool, the same way that we used to consider computers as specific problem-solving tools but now consider them broadly applicable. We should now use general purpose sequencing devices.

Overview of the talk: epigenomics, variations in known genomes, mutations in cancer, transcriptional profiling, microbes, de novo sequencing.

Alex Meissner gave a good review of epigenomics yesterday. Two major components: histones & DNA methylation. Yesterday's experiment was ChIP-on-chip; however, we now do ChIP-Seq. Compared to ChIP-chip, ChIP-Seq is easier and vastly cheaper... and reproducible. It's now taking over the whole field. Chromatin state maps will shortly be the standard, giving us a complete catalog of the epigenomic signals in all cell states.

Variations in known genomes: our old challenge was to go beyond single-gene diseases, but now we're in a position to take a comprehensive look at the genome. The number of SNPs has “skyrocketed” in the past decade, which allows us to now do 10^6 SNPs on a chip at one time (2007). However, in the past year, there are now 200+ genes associated with diseases. (He no longer makes the slides that show them.)

“We have barely scratched the surface of genes”.

We do well with high frequency SNPs. In contrast, alleles at 0.5-5% frequency are poorly studied. We need to do more sequencing to get at these variants. We're now seeing individuals being resequenced, but we'll start seeing much much more than that in the future – 100's and 1000's of people being resequenced. (Woo... Neanderthal and woolly mammoth shout out.)

A neat example of this is the sequencing of people who interact differently with TB and TB drugs. Apparently there are only about 40 differences that seem to be involved. Another example is stickleback sequencing – where a single lane of illumina is enough to genotype an individual fish.

Cancer genomes: the Cancer Genome Atlas project was formed, and much handwringing has followed about how much we do know and how much we don't. What starts to appear when sequencing begins to shed light on it, however, is that very clear signals show what is involved. New genes, clear breakpoints... and all of this is leading to pathways in cancer. Cool. We now phrase our work in terms of what pathways are hit in which cancer, not single genes being mutated. “Dissecting cancer will require sequencing of 1000's of individuals.” However, we still need to worry about error rates, which haven't yet come down to the sensitivity of the tests we need to do.

WTSS: microRNA, ab initio construction of transcriptomes.... not much said here.

Microbes: we're now sequencing microbiomes for use in energy harvest... again just a quick acknowledgement of the field.

De Novo Genome Assembly: we're still working on it.

This was a quick “whirlwind” tour of what's going on in the field. In this new world of sequencing, will we find completely new phenomenon?

Long intergenic non-coding RNAs (lincRNAs) – the paper just came out this weekend. Extensive transcription in mammals. We now have a better idea of what's being transcribed using the new technology. “Are most non-coding RNA transcripts functional?” (Reviewing various perspectives on it.) Apparently, only about a dozen functional large ncRNAs are known. They are now using epigenomics to figure out what's going on – use the ones that have the expressed-gene marks... there are now 1600 novel signatures that were not known as protein-coding genes (characterizing intergenic K4-K36 domains). So what do these things do biologically? They catalog expression patterns, and can associate them with pathway profiles... etc. Profiling and correlation is the key to solving this mystery, and they all clearly suggest a biological role in the cell.

E.g., some of them are clearly regulated by p53. They seem to be potential repressors of other genes, acting when p53 wants to down regulate some genes or upregulate others. How? Possibly through the Polycomb repressor complex? They're anti-transcription factors! Nifty.

“> 50% of lincRNAs expressed in various cell types bind Polycomb or other factors.” “Suggests whole world of gene regulators!”

My Comments: Wow, that was a pretty decent opening talk. The overview was well done, focusing on the challenges without dwelling on the problems. The final part of the talk focussed on the recent paper on lincRNA, which sounds really interesting. I'm quite interested in following up on that paper. Good timing on having it out before AGBT. (-;

The first question is about Eric Lander's selection to be part of the Obama "team" on science.  Spiffy!  (Quick pump for the hope that Obama stays in power for 8 years...  hehe.)  Apparently, his first question with the science group is "what has happened since the sequencing of the human genome project, and is progress going as fast as expected"?  (paraphrased, of course.)

Labels:

10 years of AGBT

Welcome to AGBT... though, if anyone else is blogging this event, you'll know that talks began yesterday. I ran into Daniel Zerbino this morning, who filled me in a little on the talks from yesterday. Apparently, the evening talks (of which Daniel's was one) were all very good overviews of several fields, from aligners to WTSS to... well, you can check out the agenda at agbt.org/agenda.html

In any case, back to what's going on right now.

The first talk is an overview of the last 10 years – what's happened since the AGBT meeting started in 1999. Apparently – and it's no surprise to anyone who's here – this year is the best attended. There were so many applications for attendance that 400 people on the wait list didn't get in.

I'm learning very quickly that this format is going to be very difficult...

Ok, this talk is quickly degenerating into a roast of some of the top players in the field. 10-year-old pictures of Marco Marra, Craig Venter... ok, pretty much everyone. Good lord. 10-year-old pictures from AGBT are somewhat scary.

Well, as the day goes on, I'll try to adapt the format of these posts to fit things a bit better onto the blog. Bear with my lousy spelling as I try to take notes quickly. (=

Tuesday, February 3, 2009

Countdown to AGBT

Ok, I really don't have much intelligent to add tonight. I'll be flying to Marco Island tomorrow morning, and the blogging will start with Thursday's seminars. I'm just setting up the macbook for the trip, and then I'm going to go upstairs to pack. The poster is already safely in its tube.

Yes, I should really have started packing earlier... but I had to watch the Canucks game. (Their first win in 8 games, if they can just survive the last 10 seconds. Yep, they did!)

See you in a day or two....

Monday, February 2, 2009

two steps forward to realize you haven't gone anywhere...

I'm still trying to finish off my AGBT poster, which is a scary thought. AGBT starts Wednesday, and I haven't sent off anything to the printers yet. In fact, until about 5 minutes ago, I was still processing the raw data. At any rate, it only takes a few minutes to re-generate all of my figures (yay for automating processes), so I know what all of them (except for the two most complex) will look like.

After spending the whole weekend working on this, as well as a significant investment in time over the past week, most of the figures have barely changed. Woo! Adding 5 more cell lines (samples, really) hasn't done much to change things. I suppose that's a direct consequence of the law of diminishing returns, and clearly the returns are diminishing. Of course, that could also be a consequence of the complete lack of saturation of the sequencing of the 5 added samples, but I don't think people want to deal with that yet...

In fact, that's probably one of the biggest issues in genomics: we all want to get the best results with the least investment, and with genomes, the investments are big. To really ensure that the results are the best they could be, I'd have to get several more flow cells of data from each of these new samples.

And, that brings us right back to the question that was asked at several panels at last year's AGBT: how much sequencing is enough? My favorite answer, last year, was "it won't be enough until we've sequenced every person on the planet." Unfortunately for my poster, we're a LONG way from that. But again, how much will we really learn from sequencing the 6 billionth person? I can't imagine it's worth the time or investment by that point. (By the way, if we wanted to do the same thing for some species, such as bowhead whales, you'd have to stop around the 8000th individual – since there aren't any more than that. There are 750,000 people on the planet for every bowhead whale left! But I digress...)

In light of my results, I am rethinking the "how much is enough?" question. Given what I saw today, probably a few 1000 is more than enough, but then again, you have to ask "enough for what?"

For personalized medicine, the answer is clearly going to be to sequence everyone who gets health care (which we hope is everyone.) Unfortunately, the technology required to do that is a long way from where we are now. (Although, we did see some very promising technologies at AGBT last year.) For my poster, though, I wonder if I already have enough to do what I need to. There's my two steps forward... and the realization that taking those two steps probably didn't add much.

You'll notice I also mentioned AGBT a few times in today's post - and some of the reasons why I'm looking forward to another year of intense genomics discussions. New technologies, new methods and discussions with other scientists on where the field is going... and, of course, a lot of opinions on how we're going to get there.

And one thing I didn't mention. In the course of realizing I hadn't gotten any further forwards with my new results, I discovered a few interesting steps sidewise. Isn't that just like science? If you keep your eyes open, you'll find things you weren't looking for. (=

I'm sure there will be lots of surprises at this year's AGBT too.