Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Please come visit my blog there.

Monday, February 9, 2009

AGBT 2009 – Thoughts and reflections

Note: I wrote this last night on the flight home, and wasn't able to post it till now. I've since gotten some corrections and feedback, and I'll go through my blog posts and make corrections as needed. In the meantime, here's what I wrote last night.


This was my second year at AGBT, and I have to admit that I enjoyed this year a little more than the last. Admittedly, it's probably because I knew more people and was more familiar with the topics being presented than I was last year. Of course, comprehensive exams and last year's AGBT meeting were very good motivators to come up to speed on those topics.

Still, there were many things this year that made the meeting stand out, for which the organizing committee deserves a round of applause.

One of the things that worked really well this year was the mix of people. There were a lot of industry people there, but they didn't take over or monopolize the meeting. The industry people did a good job of using their numbers to host open houses, parties and sessions without seeming "short-staffed". Indeed, there were enough of them that it was fairly easy to find them to ask questions and learn more about the “tools of the trade.”

On the other hand, the seminars were mainly hosted by academics – so it didn't feel like you were sitting through half-hour infomercials. In fact, the sessions I attended were all pretty decent, with a high level of novelty and entertainment. The speakers were nearly all excellent, with only a few whose presentations felt merely “average.” (I managed to take notes all the way through, so clearly I didn't fall asleep during anyone's talk, even if the relentless 9am-9pm schedule caused the occasional momentary zone-out.)

At the end of last year's conference, I returned to Vancouver – and all I could talk about was Pacific Biosciences SMRT technology, which dominated the “major announcement” factor for me for the past year. At this year's conference, there were several major announcements that really caught my attention. I'm not sure if it's because I have a better grasp of the field, or if there really was more of the “big announcement” category this year, but either way, it's worth doing a quick review of some of the major highlights.

Having flown in late on the first day, I missed the Illumina workshop, where they announced the extension of their read length to 250 bp, which brings them into the same range as the 454 technology platform. Of course, technology doesn't stand still, so I'm sure 454 will have a few tricks up their sleeves. At any rate, when I started talking with people on Thursday morning, it was definitely the hot topic of debate.

The second topic that was getting a lot of discussion was the presentation by Complete Genomics, which I've blogged about – and I'm sure several of the other bloggers will be as well in the next few days. I'm still not sure if their business model is viable, or if the technology is ideal... or even if they'll find a willing audience, but it sure is an interesting concept. The era of the $5000 genome is clearly here, and as long as you only want to study human beings, they might be a good partner for your research. (Several groups announced they'll do pilot studies, and I'll be in touch with at least one of them to find out how it goes.)

And then, of course, there was the talk by Marco Marra. I'm still in awe of what they've accomplished – having been involved in the project officially (in a small way), and through many, many games of ping-pong with some of the grad students involved more heavily, it was amazing to watch it all unfold, and now equally amazing to find out that they had achieved success in treating a cancer of indeterminate origin. I'm eagerly awaiting the publication of this research.

In addition to the breaking news, there were other highlights for me at the conference. The first of many was talking to the other bloggers who were in attendance. I've added all of their blogs to the links on my page, and I highly suggest giving them a look. I was impressed with their focus and professionalism, and learned a lot from them. (Tracking statistics, web layout, ethics, and content were among the topics on which I received excellent advice.) I would really suggest that this be made an unofficial session in the future. (You can find the links to their blogs as the top three in my "blogs I follow" category.)

The Thursday night parties were also a lot of fun – and a great chance to meet people. I had long talks with people all over the industry, where I might not otherwise have had a chance to ask questions. (Not that I talked science all evening, although I did apologize several times to the kind Pacific Biosciences guy I cornered for an hour and grilled with questions about the company and the technology.) And, of course, the ABI party, where Olena got the picture in which Richard Gibbs has his arm around me, is definitely another highlight. (Maybe next year I'll introduce myself before I get the hug, so he knows who I am...)

One last highlight was the panel session sponsored by Pacific Biosciences, in which Charlie Rose (I hope I got his name right) moderated a discussion on a range of topics. I've asked a guest writer to contribute a piece based on that session, so I won't talk too much about it. (I also don't have any notes, so I probably shouldn't say too much anyhow.) It was very well done, with several controversial topics raised and lots of good stones turned over. One point is worth mentioning, however: one of the panel guests was Eric Lander, who has recently come to fame in the public eye for co-chairing a science committee requested by the new U.S. President, Obama. This was really the first time I'd seen him in a public setting, and I have to admit I was impressed. He was able to clearly articulate his points, draw people in, and dominate the discussion while he had the floor, but without stifling anyone else's point of view. It's a rare scientist who can accomplish all of that - I am now truly a fan.

To sum up, I'm very happy I had the opportunity to attend this conference, and I'm looking forward to seeing what the next few years bring. I'm going back to Vancouver with an added passion to get my work finished and published, to get my code into shape, and to keep blogging about a field going through so many changes.

And finally, thanks to all of you who read my blog and said hi. I'm amazed there are so many of you, and thrilled that you take the time to stop by my humble little corner of the web.


Saturday, February 7, 2009

Stephan Schuster, Penn State University - “Genomics of Extinct and Endangered Species”

Last year, he introduced nanosequencing of complete extinct species. What are the implications of extinct genomes for endangered species?

Mammoth: went extinct 3 times: 45,000 ya, 10,000 ya, and 3,500 ya. Wooly rhino: 10,000 years ago. Moa: 500 years ago (they were eaten). Thylacine: 73 years ago. And Tasmanian devils, which are expected to last only another 10 years.

Makes you wonder about dinosaurs.. maybe dinosaurs just tasted like chicken.

Looking at population structure and biological diversity from a genomic perspective. (Review of genotyping biodiversity.) The mitochondrial genome is generally higher-copy, and thus was traditionally the one used, but now, with better sequencing, we can target nuclear DNA.

The mammoth mitochondrial genome has been done: ~16,500 bp, including ribosomal, coding and noncoding regions. In 2008, you can get 1000x coverage on the mitochondrial genome. You need the extra coverage to correct for damaged DNA.
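[A toy sketch of why that extra coverage helps: with deep coverage, a simple per-position majority vote outvotes sporadic damage-induced miscalls. This is my own illustration of the idea, not the group's actual pipeline.]

```python
from collections import Counter

def consensus(pileup):
    """Per-position majority vote across aligned reads.

    pileup: list of strings, one per reference position, each string
    holding the bases observed at that position across all reads.
    Sporadic damage-induced miscalls (e.g. C->T deamination in
    ancient DNA) are outvoted when coverage is high enough.
    """
    return "".join(Counter(bases).most_common(1)[0][0] for bases in pileup)

# Toy pileup: true sequence ACGT; the third read carries a deaminated C.
reads = ["ACGT", "ACGT", "ATGT"]
pileup = ["".join(r[i] for r in reads) for i in range(4)]
print(consensus(pileup))  # -> ACGT
```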

This has now allowed 18 mammoth mitochondrial genome sequences. 20-30 SNPs between members of the same group, and 200-300 between groups. WAY more sequencing than is available for African elephants!

Have now switched to using hair instead of bone, and can use the hair shaft (not just the follicle).

Ancient DNA = highly fragmented. Of 300,000 sequences, 45% were nuclear DNA.

Now: Sequenced bases: 4.17Gb. Genome size is 4.7Gb. 77 Runs, got 32.6 million bases.

Can visit for more info.

Sequenced mammoth orthologs of human genes. Compared to Watson/Venter... rates of predicted genes per chromosome (“No inferences here”). A complete representation of the genome is available. SAP = Single Amino acid Polymorphism.

(Discussion of divergence for mammoth.) Coalescence time for human and Neanderthal: 600,000 years. The same thing happens for mammoth, but it's not really well accepted because the biological evidence doesn't show it.

Did the same thing for the Tasmanian Tiger. Two complete genomes – only 5 differences between them.

Hair for one sample was taken from what fell off when preserved in a jar of ethanol!

Moa: did it from egg shell!

Wooly rhino: did the wooly rhino from hair – also did other rhinos (wooly is the only extinct one). Rhinos radiated only a million years ago, so the phylogenetic tree couldn't be resolved. Tried hair, horn, hoof, and bone... bone was by far the worst.

Now, to jump to the living: the Tasmanian devil. Highly endangered. Infectious cancer discovered in 1996 (not figured out till 2004). Devils protected since 1941. Isolation with fences, islands, the mainland, insurance populations. Culling and vaccination are also possible.

Genome markers will be very useful. The problem is probably that there is nearly no diversity in the population. Sequenced museum-sample devils, and showed mitochondrial DNA had more diversity in the non-living population.

Project for full genome is now underway – two animals. (More information on plans on what to do with this data and how to save them.) SNP info for genotyping to direct captive breeding program.
(“Project Arc”) Trying to breed resistant animals.


Len Pennacchio, Lawrence Berkeley National Laboratory - “ChIP-Seq Accurately Predicts Tissue Specific Enhancers in Vivo”

The Noncoding Challenge: 50% of GWAS hits are falling into non-coding regions. CAD and diabetes hits fall in gene deserts, so how do they work? Regulatory regions. Build a catalogue of distal enhancers.

Talk about Sonic Hedgehog, involved in limb formation. Regulation of expression is a million bases away from the gene. There are very few examples. We don't know if we'll find lots, or if this is just the tip of the iceberg. How do we find more?

First part: work going on in the lab for the past 3 years. Using conservation to identify regions that are likely involved. Using ChIP-Seq to do this.

Extreme conservation: either things conserved over huge spans (human to fish) or within a smaller group (human, mouse, chimp).

Clone the regions into vectors, put them in mouse eggs, and then stain for Beta-galactosidase. Tested 1000 constructs, 250,000 eggs, 6000 blue mice. About 50% of them work as reproducible enhancers. Do everything at whole mouse level. Each one has a specific pattern. [Hey, I've seen this project before a year or two ago... nifty! I love to see updates a few years later.]

Bin by anatomical pattern. Forebrain enhancers are one of the big “bins”. Working on a forebrain atlas.

All data is at Also in Genome Browser. There is also an Enhancer Browser. Comparative genomics works great at finding enhancers in vivo. No shortage of candidates to test.

While this works, it's not a perfect system. Half of the things don't express, and the system is slow and expensive. Comparative genomics also tells you nothing about where an element expresses, so this is OK for wide scans, but not great if you want something random.

Enter ChIP-Seq. (Brief model of how expression works.) Collaboration with Bing Ren. (Brief explanation of ChIP-Seq.) Using Illumina to sequence. Looking at bits of mouse embryo. Did ChIP-Seq, got peaks. What's the accuracy?

Took 90 of the predictions and used the same assay. When p300 was used, up to 9/10 of conserved sequences now work as enhancers. Also tissue specific.

Summarize: using comparative gives you 5-16% active things in one tissue. Using ChIP-Seq, you get 75-80%.

How good or bad is comparative genomics at ranking targets? 5% of exons are constrained, almost the rest are moderately constrained. [I don't follow this slide. Showing better conserved in forebrain and other tissues].

P300 peaks are enriched near genes that are expressed in the same tissues.

Conclusion: p300 is a better way of predicting enhancers.
P300 occupancy circumvents the DNA-conservation-only approach.

What about the negatives? For the ones that don't work, it's even better: the mouse orthologs bind, while the human sequences no longer bind in mice.

Conclusion II: Identified 500 more enhancers with first method, and now a few reads done 9 months ago have 5000 new elements using ChIP-Seq.

Many new things can be done with this system, including integrating it with GWAS.


Bruce Budowle, Federal Bureau of Investigation - “Detection by SOLiD Short-Read Sequencing of Bacillus anthracis and Yersinia pestis SNPs for Strain ID”

We live in a world with a “heightened sense of concern.” The ultimate goal is to reduce risk, whether it's helping people with flooding, or otherwise. Mainly, they work on stopping crime and identifying threat.

Why do we do this? We've only had one anthrax incident since 2001... but bioterrorism has been with us for 2000 years. (Several examples given.)

Microbial Forensics. We don't just want knee-jerk responses. It's essentially the same as any other forensic discipline – again, to reduce risk. This is a very difficult problem. Over 1000 agents are known to infect humans: 217 viruses, 538 bacterial species, 307 fungi, 66 parasitic protozoa. Not all are effective, but there are different diagnostic challenges. Laid out on the tree of life... it's pretty much the whole thing.

Biosynthetic technology. New risks are accruing due to advances in DNA synthesis. The risks are vastly outweighed by the benefits of synthesis... bioengineering also plays a role.

Forensic genetic questions:
What is the source?
Is it endemic?
What is the significance?
How confident can you be in the results?
Are there alternative explanations?

So, a bit of history on the “Amerithrax” case. A VERY complex case, which changed the way the government works on this type of case. Different preparations in different envelopes.

Goals and Objectives:
could they reverse engineer the process? To figure out how it was done? No, too complex, didn't happen.

First sequencing – did a 15-locus, 4-colour genotyping system. It was not a validated process, but it helped identify the strain, which helped narrow down its origin. Some came from Texas, but it was more likely to have come from a lab than from the woods.

Identifying informative SNPs. You don't need to know the evolution – just the signature. That can then be used for diagnostics. Whole genome sequencing for genotyping was a great use. Back in 2001, most of this WGS wasn't possible. They had a great deal from TIGR – only $125,000 to sequence the genome. The Florida isolate took 2 weeks, and turned up interesting details about the copy number of the plasmids. The major cost was then to validate and understand what was happening.

Florida was compared to Ames and to one from the UK, which gave only 11 SNPs. Many evolution challenges came up. The strain they used was “cured” of its plasmid, so it had evolved other SNPs... a very poor reference genome.

The key to identification: one of the microbiologists discovered that some cultures had different morphology. That was then used as another signature for identifying the source.

Limited strategy: it didn't give the whole answer – it only allowed them to rule out some colonies. It would be more useful to sequence full genomes... so they entered into a deal with ABI SOLiD for genome sequencing.

Some features were very appealing. One of them is the emulsion PCR, which helped improve the quality and reliability of the assay. And the beads were useful too.

Multiplexing was very valuable: they could test 8 samples simultaneously using barcoding, including the reference Ames strain. Coverage was 16x-80x, depending on DNA concentration. Multiple starting points give more confidence, and help find better SNPs.

Compared to the reference: found 12 SNPs in the resequenced reference. When you look at the SNP data, you have a lot of confidence if a SNP shows up in both directions; false positives tend to turn up on only one strand. That became a major way to remove false positive results, and was really only possible with the higher coverage.
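[A toy Python sketch of that strand-based filter – the function name, counts and threshold are mine, purely for illustration, not the actual FBI pipeline.]

```python
def passes_strand_filter(fwd_alt, rev_alt, min_per_strand=2):
    """Keep a variant call only if the alternate allele is seen on both
    strands. Artifacts tend to pile up on a single strand, so one-sided
    support flags a likely false positive. (Threshold is illustrative.)"""
    return fwd_alt >= min_per_strand and rev_alt >= min_per_strand

# Toy calls: alt-allele read counts as (forward, reverse).
calls = {"snp_a": (14, 11), "snp_b": (9, 0)}
kept = {name for name, (f, r) in calls.items() if passes_strand_filter(f, r)}
print(kept)  # -> {'snp_a'}
```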

Not going to talk about pestis (almost out of time). Similar points: 130-180X coverage. Found a multidrug transporter in the strain, which has been a lab strain for 50 years. Plasmids were also at higher coverage. SNPs were fewer in the North American strains, etc.
An interesting point: if you go to the reference in GenBank, there are known errors in the sequence. Several have been corrected, and the higher coverage was helpful in figuring out the real sequence past the errors.

$1000/strain using multiplexing, on equipment that is not yet available. This type of data really changes the game; samples can now be screened VERY quickly (a week).

Every project is a large-scale sequencing project.
Depth is good.
Multiplexing is good.
Keep moving to higher-accuracy sequencing.


Andy Fire, Stanford University - “Understanding and Clinical Monitoring of Immune-Related Illnesses Using Massively-Parallel IgH and TcR Sequencing”

The story starts off with a lab that works on small RNAs, which they believe form a small layer of immunity. [Did I get that right?] They work in response to foreign DNA.

Joke Slide: by 2018, we'll have an iSequencer.

Question: can you sequence the immunome? [New word for me.] Showing a picture of lymphoma cells, which to me looks like a test to see if you're colour blind – there are patches of slightly different shades...

Brief intro to immunology. “I got an F in immunology as a grad student.” [There's hope for me, then!]
Overview of VDJ recombination, controlled by B-cell differentiation. This is really critical – responsible for our health. One model: if something recognizes both a virus and self, you can end up with an autoimmune response.

There is a continuum based on this. It's not necessarily an either /or relationship.

There is a PCR/454 test for VDJ distribution. In some cases, you get a single dominating size class, and that is usually a sign of disease, such as lymphoma. 454 is used because you need longer reads to read the V, D and J units in the amplified fragment. As with email, you can get “spam”, and you can use similar technologies to drop the “spam” out of the run.

To show the results of the tests for B-cell recombination species, you put V on one axis, J on the other. D is dropped to make it more viewable. In lymphoma, a single species dominates the chart.
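[To make the V-by-J plot concrete, here's a toy sketch of building that count matrix from per-read (V, J) calls. The indices and counts are invented; a real analysis would work from thousands of annotated reads.]

```python
from collections import Counter

def vj_matrix(calls, n_v, n_j):
    """Tabulate VDJ reads into a V-by-J grid (D collapsed, as in the talk).
    calls: iterable of (v_index, j_index) pairs, one per read."""
    counts = Counter(calls)
    return [[counts.get((v, j), 0) for j in range(n_j)] for v in range(n_v)]

# Toy data: a dominant clone at (V2, J1) over a polyclonal background,
# the lymphoma-like pattern described above.
reads = [(2, 1)] * 8 + [(0, 0), (1, 2), (3, 1), (0, 2)]
grid = vj_matrix(reads, n_v=4, n_j=3)
dominant = max((c, (v, j)) for v, row in enumerate(grid) for j, c in enumerate(row))
print(dominant)  # -> (8, (2, 1)): one species dominates the chart
```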

An interesting experiment – dilute with regular blood to see detection limit – it's about 1:100. For some lymphomas, you can't use these primers, and they don't show up. There are other primers for the other diseases.

So what happens in normal distributions? Did the same thing with VDJ (D included, so there are way more spots). Neat image. Did this experiment with two aliquots of blood from the same person and looked for concordance. Lots of spots fail to correspond well between the two samples, but many do.

On another project: bone marrow transplant. The recipient has a funny pattern, mostly “spam”, because the recipient really has very little immune system left. The patient eventually gets the donor VDJ types – a completely donor-derived response. You can also do something like this for autoimmune disorders.

Malfunctioning Lymphoid cells cause many human diseases and medical side-effects. (several examples given.)


Keynote Speaker: Rick Wilson, Washington University School of Medicine - “Sequencing the Cancer Genome”

Interested in:
1. Somatic mutations in protein-coding genes, including indels.
2. Non-coding mutations, miRNAs and lincRNAs.
3. Germ-line variations.
4. Differential transcription and splicing.
5. Structural variation.
6. Epigenetic changes.
7. The big problem: integrating all of this data... and making sense of it.

The paradigm for years: exon focus for a large collection of samples. Example: EGFR mutations in lung cancer. A large number of patients (some sample) had EGFR mutations. Further studies carry on this legacy in lung cancer using new technology. However, when you look at pathways, you find that the pathways are more important than the individual genes.

Description of “The Cancer Genome Atlas”

Initial lists of genes mutated in cancer. Mutations were found, many of which were new. (TCGA Research Network, Nature, 2008)

Treatment-related hypermutation. Another example of TCGA's work: glioblastoma. Although they didn't want treated samples, in the end they took a look and saw that treated samples have interesting changes in methylation sites, when MMR genes and MGMT were mutated. If you know the status of the patient's genome, you can better select the drug (e.g., not use an alkylation-based drug).

Pathways analysis can be done... looking for interesting accumulations of mutations. Network view of the Genome... (just looks like a mess, but a complex problem we need to work on.)

What are we missing? What are we missing by focusing on exons? There should be mutations in cancer cells that are outside exons.

Revisit the first slide: now we do “Everything” from the sample of patients, not just the list given earlier.

(Discussion of AML cancer example.) (Ley et al., Nature 2008)
Found 8 heterozygous somatic mutations, 2 somatic insertion mutations. Are they cause or effect?
The verdict is not yet in. Ultimately, functional experiments will be required.

There are things we're not doing with the technology: Digital gene expression counts. Can PCR up gene of interest from tumour, sequence and do a count: how many cells have the genotype of interest?
Did the same thing for several genes, and generally got a ratio around 50%.

Started looking at GBM1: 3,520,407 tumour variants passing the SNP filter. Broke them down into coding non-synonymous/splice sites, coding silent, conserved regions, regulatory regions including miRNA targets, non-repetitive regions, and everything else (~15,000). Many of the first class were validated.
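[A toy sketch of that kind of tiered variant triage – each variant lands in the first annotation bin it matches. The bin names and predicates here are my own stand-ins, not the actual filters used.]

```python
def triage(variant, categories):
    """Assign a variant to the first annotation bin it matches, mirroring
    the tiered breakdown described above. `categories` is an ordered
    list of (label, predicate) pairs; labels are illustrative."""
    for label, pred in categories:
        if pred(variant):
            return label
    return "everything_else"

tiers = [
    ("coding_nonsyn", lambda v: v.get("coding") and not v.get("silent")),
    ("coding_silent", lambda v: v.get("coding") and v.get("silent")),
    ("conserved",     lambda v: v.get("conserved", False)),
    ("regulatory",    lambda v: v.get("regulatory", False)),
]
variants = [
    {"coding": True, "silent": False},
    {"coding": False, "conserved": True},
    {"coding": False},
]
print([triage(v, tiers) for v in variants])
# -> ['coding_nonsyn', 'conserved', 'everything_else']
```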

CNV analysis also done. Add coverage to sequence variants, and the information becomes more interesting. Can then use read pairs to find breakpoints/deletions/insertions.

What's next for cancer genomics? More AML (Doing more structural variations, non-coding information, more genomes), more WGS for other tumours, and more lung cancer, neuroblastoma... etc.

“If the goal is to understand the pathogenesis of cancer, there will never be a substitute for understanding the sequence of the entire cancer genome” – Renato Dulbecco, 1986

Need ~25X coverage of WGS tumour and normal – also transcriptome and other data. Fortunately, costs are dropping rapidly.


Peter Park, Harvard Medical School - “Statistical Issues in ChIP-Seq and its Application to Dosage Compensation in Drosophila”

(brief overview of ChIP-Seq, epigenomics again)

ChIP-Seq is not always cost-competitive yet. (Can't do it at the same cost as ChIP-chip.)

Issues in analysis: generate tags, align, remove anomalous reads, assemble, subtract background, determine binding position, check sequencing depth.

Map tags in a strand-specific manner (like the directional flag in FindPeaks). Score tags accounting for that profile; this can be incorporated into a peak caller.
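[A minimal sketch of strand-aware tag handling, assuming each tag is shifted half a fragment length toward the presumed binding site before building the coverage track. Positions and shift are invented for illustration.]

```python
def shifted_coverage(tags, shift, length):
    """Build a per-base coverage track after shifting each tag toward
    the binding site: forward tags move right, reverse tags move left
    by half the fragment size (`shift`).
    tags: list of (position, strand) with strand '+' or '-'."""
    cov = [0] * length
    for pos, strand in tags:
        p = pos + shift if strand == "+" else pos - shift
        if 0 <= p < length:
            cov[p] += 1
    return cov

# Forward tags sit upstream of the site, reverse tags downstream;
# after shifting, both strands pile up on the site itself.
tags = [(40, "+"), (42, "+"), (160, "-"), (158, "-")]
cov = shifted_coverage(tags, shift=60, length=300)
print(cov.index(max(cov)))  # -> 100
```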

Do something called cross-correlation analysis (look at the peaks in both directions) and use it to rescue more tags. Peaks get better if you add good data, and worse if you add bad data. Use it to learn something about histone modification marks. (Tolstorukov et al., Genome Research.)
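[A toy version of the cross-correlation idea: slide the reverse-strand tag profile against the forward one and take the best shift as the fragment-size estimate, since the two strand profiles flank the true binding site. This uses a raw dot product rather than a proper normalized correlation, just to show the shape of the computation.]

```python
def cross_correlation(fwd, rev, max_shift):
    """Return the shift that best aligns forward- and reverse-strand
    tag-start counts; that shift estimates the fragment size."""
    return max(
        range(max_shift + 1),
        key=lambda s: sum(f * r for f, r in zip(fwd, rev[s:])),
    )

# Toy site: forward tags start near position 10, reverse tags near 60,
# so the profiles line up best at a shift of 50.
fwd = [0] * 100
rev = [0] * 100
fwd[10] = 5
rev[60] = 5
print(cross_correlation(fwd, rev, 80))  # -> 50
```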

How deep to sequence? 10-12M reads is current. That's one lane on Illumina, but is it enough? What quality metric is important? Clearly this depends on the marks you're seeing (narrow vs. broad, noise, etc.), which brings you to saturation analysis. Showed no saturation for STAT1, CTCF, NRSF. [Not a surprise – we knew that a year ago... We're already using this analysis method. However, as you add new reads, you add new sites, so you have to threshold to make sure you don't keep adding insignificant new peaks. Oh, he just said that. Ok, then.]

Talking about using “fold enrichment” to show saturation. This allows you to estimate how many tags you need to get a certain tag enrichment ratio.
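[A toy sketch of a thresholded saturation analysis: subsample reads at increasing depth and count only the bins that clear a fixed enrichment cutoff. All numbers, bin sizes and thresholds here are invented for illustration.]

```python
import random

def saturation_curve(reads, depths, genome_size, bin_size, threshold, seed=0):
    """Subsample the read set at increasing depths and count bins whose
    read count clears a fixed enrichment threshold. A flattening curve
    suggests saturation; without a threshold, new reads keep creating
    marginal 'peaks' indefinitely (the point made above)."""
    rng = random.Random(seed)
    n_bins = genome_size // bin_size
    curve = []
    for d in depths:
        sample = rng.sample(reads, d)
        counts = [0] * n_bins
        for pos in sample:
            counts[pos // bin_size] += 1
        curve.append(sum(1 for c in counts if c >= threshold))
    return curve

# Toy genome: two enriched sites plus uniform background reads.
rng = random.Random(1)
reads = [100 + rng.randrange(50) for _ in range(300)]    # site A
reads += [5000 + rng.randrange(50) for _ in range(300)]  # site B
reads += [rng.randrange(10000) for _ in range(400)]      # background
curve = saturation_curve(reads, [200, 500, 1000], 10000, 100, 20)
print(curve)  # plateaus once both sites clear the threshold
```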

See paper they published last year.

Next topic: Dosage compensation.

(Background on what dosage compensation is.)

In Drosophila, the X chromosome is up-regulated in XY individuals, unlike in humans, where the second copy of the X is quashed in the XX genotype. Several models available. Some evidence that there's something specific and sequence-related. Can't find anything easily in ChIP-based methods – just too much information. Comparing, ChIP-seq gives sharp enrichment, whereas on ChIP-chip you don't see it. Seems to be a saturation (dynamic range) issue on ChIP-chip, and the sharp enrichments are important.
You get specific motifs.

Deletion and mutation analysis. The motif is necessary and sufficient.

Some issues: the motif on X is enriched, but only 2-fold. Why then is X so strongly upregulated? Histone H3 signals seem depleted over the entry sites on the X chromosome. There may also be other things going on which aren't known.

Refs: Alekseyenko et al., Cell, 2008 and Sural et al., Nat Str Mol Bio, 2008


Alex Meissner, Harvard University- “From reference genome to reference epigenome(s)”

Background on Chip-Seq.

High-throughput bisulfite sequencing. At 72 bp, you can still map these regions back without much loss of mapping ability. You get a 10% loss at 36 bp, 4% at 47 bp, and less at 72 bp.
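[For anyone unfamiliar with how bisulfite reads get mapped at all: the usual trick is to align in a reduced alphabet, since unmethylated Cs read out as Ts. Here's a toy sketch – a real aligner would index the genome and tolerate mismatches.]

```python
def bs_convert(seq):
    """Collapse C to T, the standard trick for aligning bisulfite reads:
    unmethylated Cs read out as Ts, so mapping is done in a reduced
    three-letter alphabet and methylation is called afterwards."""
    return seq.replace("C", "T")

def map_read(read, ref):
    """Naive exact-match placement of a converted read on a converted
    reference (illustration only)."""
    return bs_convert(ref).find(bs_convert(read))

ref = "ACGGCGTTACGA"
read = "GGTGTTA"  # bisulfite-converted copy of ref[2:9] ('GGCGTTA')
print(map_read(read, ref))  # -> 2
```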

This was done with an m-CpG-cutting enzyme, so you know all fragments come with at least a single methylation. Some updates on the technology recently, including drops in cost, longer reads, and lower amounts of starting material.

About half of the CpG is found outside of CpG islands.

“Epigenomic space”: look at all marks you can find, and then external differences. Again, many are in gene deserts, but appear to be important in disease association. Also remarkable is the degree of conservation of epigenetic patterns as well as genes.

where are the functional elements?
when are they active?
when are they available?

Also interested in Epigenetic Reprogramming (Stem cell to somatic cell).

Recap: Takahashi and Yamanaka: induce pluripotent stem cells with 4 transcription factors: Oct4, Sox2, c-Myc & Klf4. General efficiency is VERY low (0.0001% - 5%). Why are not all cells reprogramming?

To address this: ChIP-Seq before and after induction with the 4 transcription factors. Strong correlation between chromatin state and iPS. Clearly see that genes in open chromatin are responsive. Chromatin state in MEFs correlates with reactivation.

Is loss of DNA methylation at pluripotency genes the critical step to fully reprogram? Tested the hypothesis that by demethylation, you could cause more cells to become pluripotent. Loss of DNA methylation does indeed allow the transition to pluripotency, as shown. [Lots of figures, which I can't copy down.]

Finally: loss of differentiation potential in culture. Embryonic stem cell to neural progenitor, but eventually can not differentiate to neurons, just astrocytes. (Figure from Jaenisch and Young, Cell 2008)

Human ES cell differentiation: often fine in morphology, correct markers... etc., but specific markers are not consistent. They lose methylation and histone marks, which causes significant changes in pluripotency.

Can't yet make predictions, but we're on the way towards a future where you can assess cell-type quality using this information.


**BREAKING NEWS** Marco Marra, BC Cancer Agency - “Sequencing Cancer Genomes and Transcriptomes: From New Pathology to Cancer Treatment.”

Why sequence the cancer-ome? Most important: treatment-response differences – to match treatments to patients. Going to focus on that.

Two anecdotes: neuroblastoma (Olena Morozova and David Kaplan), and papillary adenocarcinoma of the tongue, primary unknown. A 70-year difference in age. They have nothing in common except the question: “can sequence analysis present new treatment options?”

Background on neuroblastoma. The most common cancer in infants, but not very common: 75 cases per year in Canada. Patients often have relapse and remission cycles after chemotherapy. Little improvement until recently, when Kaplan was able to show the ability to enrich for tumour-initiating cells (TICs). This gave a great model for more work.

Decided to have a look at Kaplan's cells, and made transcriptome libraries (RNA-Seq) using PET, sequencing a flow cell's worth: 5Gb of raw sequence from one sample, 4Gb from the other. Aligned to the reference genome using a custom database. (Probably Ryan Morin's?) Junctions, etc.

Variants found that are B-cell related. Olena found markers, worked out the lineage, and showed it was closer to a B-cell malignancy than to a brain cancer sample. These cells also produce neuroblastomas when reintroduced into mice. So, is neuroblastoma like B-cells in expression? Yes, they seem to have a lot of traits in common. It appears the neuroblastoma is expressing early markers.

Thus, if you can target B-Cell markers, you'd have a clue.

David Kaplan verified that this was not contamination (several markers), showing that yes, the neuroblastoma cells are expressing B-cell markers, and that these are not B-cells. Thus, it seems a drug that targets B-cell markers could be used (Rituximab and Milatuzumab). We now have an insight we wouldn't have had before. (Very small sample, but lots of promise.)

Anecdote 2: an 80-year-old male with adenocarcinoma of the tongue – salivary gland origin, possibly? He has had surgery and radiation, and a CAT scan revealed lung nodules (no local recurrence). No known standard chemotherapy exists... so several guesses were made, and an EGFR inhibitor was tried. Nothing changed. Thus, BC Cancer was approached: what can genome studies do? They didn't know, but were willing to try. A genome from a formalin-fixed sample (which is normally not done), and a handful of WTSS libraries from fine-needle aspirates (nanograms, which required amplification). 134Gb of aligned sequence across all libraries – about 110Gb to the genome (22X genome, 4X transcriptome).

Data analysis, compared across many other in-house tumours, and looked for evidence of mutation. CNV was done from Genome. Integration with drug bank, to then provide appropriate candidates for treatment.

Comment on CNV (histograms shown): as many bases are found at single-allele as at diploid, and just as many again at triploid, with some places at 4 and 5 copies. Was selective pressure involved in picking some places for gain, while much of the genome was involved in loss?
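[A toy sketch of how such a copy-number histogram can be derived from depth ratios: estimate an integer copy state per window from tumour-vs-normal coverage, then count windows per state. Windows, depths and ploidy handling are all invented for illustration.]

```python
def copy_number_profile(tumour_depth, normal_depth, ploidy=2):
    """Estimate an integer copy number per window from the ratio of
    tumour to normal read depth, then histogram how many windows sit
    at each copy state (single-allele / diploid / triploid bars)."""
    states = [
        round(ploidy * t / n) for t, n in zip(tumour_depth, normal_depth) if n > 0
    ]
    hist = {}
    for s in states:
        hist[s] = hist.get(s, 0) + 1
    return states, hist

# Toy per-window read depths: a loss, a gain, and an amplification.
tumour = [15, 31, 30, 46, 61, 29]
normal = [30, 30, 31, 30, 30, 30]
states, hist = copy_number_profile(tumour, normal)
print(states)  # -> [1, 2, 2, 3, 4, 2]
```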

Investigated a few interesting high-CNV regions, one of which contains RET. Some amplifications are highly specific, containing only a single gene, while surrounded by regions of CNV loss.

Looking at Expression level, you see a few interesting things. There is a lack of absolute correlation between changes in CNV and the expression of the gene.

When looking for intersection, ended up with some interesting features:
30 amplified genes in cancer pathways (kegg)
76 deleted genes in cancer pathways
~400 upregulated, ~400 downregulated genes
303 candidate non-synonymous SNPs
233 candidate novel coding SNPs
... more.

Went back to the drug-target work (Yvonne and Jianghong?): when you merge that with target genes, you can find drugs specific to those targets. One of the key items on the list was RET.

Back to the patient: the patient was on an EGFR-targeting drug. Why wasn't he responsive? It turns out that PTEN and RB1 are lost in this patient... (see literature... didn't catch the paper).

Pathway diagram made by Yvonne Li shows where mutations occur in pathways; gains and losses of expression are shown as well. Notice lots of expression from RET, and no expression from PTEN. PTEN negatively regulates the RET pathway. Also increases in Mek and Ras. Suggests that in this tumour, activation of RET could be driving things.

Thus, came up with a short list of drugs. The favourite was Sunitinib. It's fairly non-specific, used for renal cell carcinoma, and currently in clinical trials being tested for other cancers. Implications that RET is involved in some of those diseases (MEN2A, MEN2B and thyroid cancers). The RET sequence itself was not likely to be mutated in this patient.

CT scans: response to Sunitinib and Erlotinib. When on the EGFR-targeting drug, the nodule grew. On Sunitinib, the cancer retreated!

Lots of unanswered questions: Is RET really driving this tumour? Is the drug really acting on RET? Is PTEN/RB1 loss responsible for erlotinib resistance in this tumour?

We don't think we know everything, but can we use genome analysis to suggest treatment: YES!

First question: how did this work with ethics boards? How did they let you pass that test? Answer: this is not a path to treatment, it is a path towards making a suggestion. In some cancers there is something called hair analysis. It can be considered or ignored. Same thing here: we didn't administer... we just proposed a treatment.


Keynote Speaker: Rick Myers, Hudson-Alpha Institute - “Global Analysis of Transcriptional Control in Human Cells”

Talking about gene regulation – it has been well studied for a long time, but only recently on a genomic scale. The field still wants comprehensive, accurate, unbiased, quantitative measurements (DNA methylation, DNA-binding proteins, mRNA), and wants them cheap, fast and easy to get.

Next gen has revolutionized the field: ChIP-Seq, mRNA-Seq and Methyl-Seq are just three of them. Also need to integrate them with genome-wide genetic analysis.

There are many versions of each of those technologies.

RNA-Seq: 20M reads give ~40 reads per 1kb-long mRNA present at levels as low as 1-2 copies per cell. Thus, 2-4 lanes are needed for a deep transcriptome measurement. PET + long reads is excellent for phasing and junctions.
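As a rough sanity check of the depth arithmetic above (my own back-of-envelope sketch, not the speaker's: the assumption of ~300,000 mRNA molecules per cell at ~1.5kb average length is mine, as is uniform sampling by nucleotide):

```python
def expected_reads(total_reads, copies_per_cell, tx_len_nt,
                   cell_mrna_count=300_000, mean_tx_len_nt=1_500):
    """Expected reads landing on one transcript, assuming reads sample
    the transcriptome uniformly by nucleotide (assumed model)."""
    tx_fraction = (copies_per_cell * tx_len_nt) / (cell_mrna_count * mean_tx_len_nt)
    return total_reads * tx_fraction

# A 1kb transcript at 1 copy/cell, with 20M total reads:
print(round(expected_reads(20_000_000, 1, 1_000)))  # prints 44
```

That lands in the same ballpark as the ~40 reads quoted, so the 2-4 lanes for low-abundance transcripts seems plausible.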

ChIP-Seq: transcription factors and histones... but should also be usable for any DNA-binding protein. (Explanation of how ChIP-Seq works.) Using a no-antibody control generally gives you no background [?]. ChIP without a control gets you into trouble.

Methylation: Methyl-seq. Cutting at unmethylated sites, then ligate to adaptors and fragment. Size select and run. (Many examples of how it works.)

Studying human embryonic stem cells. (The cell lines are old and very different... hopefully there will be new ones available soon.) Using it for gene expression versus methylation status: when you cluster by gene expression, they cluster by pathways. The DNA methylation patterns did not correlate well – more along the lines of individual cell lines than pathways. Thus, they believe it's not controlling the pathways... but that could be an artifact of the cell lines.

26,956 methylation sites. Many of them (7,572) are in non-CpG regions.

Another study: studying cortisol, a steroid hormone made by the adrenal gland. Controls two-thirds of all biology, helps restore homeostasis and affects a LOT of pathways: blood pressure, blood sugar, suppressing the immune system, etc. Fluctuates throughout the day. Pharma is very interested in this.
Levels are also tied to mood, etc.

Glucocorticoid receptor binds hormone in cytoplasm, translocates to nucleus. Activates and represses transcription of thousands of genes.

ChIP-Seq in A549: GR (- hormone): 579 peaks. GR (+ hormone): 3,608 peaks. Low levels of endogenous cortisol in the cell probably account for the background. (Of the peaks, ~60% are repressive, ~40% are inducing.) When investigating the motifs, the top 500 hits really change the binding site motif! No longer as fixed as originally thought – and this led to the discovery of new genes controlled by GRE. Also showed that there's co-occupancy with AP1.

[Method for expression quantitation: use windows over exons.]

Finally: a few more little stories. Mono-allelic transcription factor binding: turns out to occur frequently, where only one allele is bound in the ChIP and the other is not bound at all. (In the case shown, the SNP creates a methylation site, which changes binding.) The same type of event also happens at methylation sites.

Still has time: just raised the point of copy number variation. Interpretation is very important, and can be skewed by CNVs. Cell lines are particularly bad for this. If you don't model it, it will be a significant problem. Just on the verge of incorporating this.

They are going to 40-80M reads for RNA-Seq. Their version of RNA-Seq is good, and doesn't give background. The deeper you go, the more you learn. Not so much with ChIP-Seq, where you saturate sooner.


Friday, February 6, 2009

Kevin McKernan, Applied Biosystems - "The whole Methylome: Sequencing a Bisulfite Converted Genome using SOLiD"

Background on methylation. It's not rare, but it is clustered – which is begging for enrichment. You can use restriction enzymes; uses mate pairs to set this up. People can also use MeDIP, and there's a new third method: a methyl-binding protein from Invitrogen. (Seems to be more sensitive.)

MeDIP doesn't grab CpG, though... it just leaves single-stranded DNA, which is a pain for making libraries. Using only 90ng. There is a slight bias on adaptors, though – not yet optimized. If they're bisulfite converting, it has issues (protecting adaptors, requires densely methylated DNA, etc.). They get poor alignment because methylated areas tend to be repetitive. Stay tuned, though, for more developments.

MethylMiner workflow: shear genomic DNA, put adaptors on it, and then biotin-bind the methyls? You can titrate methyl fractions off the solid support, so you can then sequence and know how many methyls you'll have. Thus, mapping becomes much easier, and sensitivity is better.

When you start getting up to 10 methyls in a 50mer, bisulfite treatment + mapping is a problem. It's also worth mentioning that methylation is not binary when you have a large cell population.

The MethylMiner system was tested on A. thaliana, with SOLiD fragment libraries generated... good results obtained, the salt titration seems to have worked well, and mapped reads show that you get (approximately) the right number of methyl Cs – but mapping is easy, since you don't need to bisulfite convert.

Showed examples where some of genes are missed by MeDIP, but found by MethylMiner.

(Interesting note, even though they only have 3 bases after conversion (generally), it's still 4 colour.)

Do you still get the same number of reads on both strands? Yes...

Apparently methylation is easier to align in colourspace. [Not sure I caught why.] Doing 50mers with 2MM. (Seems to keep the % align-able in colourspace, but bisulfite-treated base-space libraries can only be aligned about 2/3rds as well.)

When bisulfite converted, 5mCTP will appear as a SNP in alignment. To approach that, you can do fractionation in MethylMiner kit, which gives you a more rational approach to alignments.

You can also make an LMP library, and then treat with 5mCTP when extending, so you get two tags; then separate the tags (they keep a barcode) and pass them over the MethylMiner kit... etc., etc... barcoded mapping to detect methyl Cs better.

Also have a method in which you do something the same way, but ligate hairpins on the ends... then put on adaptors, and then sequence the ends, to get mirror imaged mate pairs. (Stay tuned for this too.)

There are many tools to do Methylation mapping: colourspace, lab kits and techniques.


Stephen Kingsmore, National Centre for Genome Resources - “Digital Gene Expression (DGE) and Measurement of Alternative Splice Isoforms, eQTLs and cSNPs”

[Starts with apologizing for chewing out a guy from Duke... I have no idea what the back story is on that.]

Developed their own pipeline, with a web interface called Alpheus, which is remotely accessible. They have an Ag biotech focus, which is their niche. Would like to get into personal genome sequencing.

Application 1: Schizophrenia DGE.
Pipeline: ends with ANOVA analysis. Alignment to several references: transcripts and genome. 7% span exon junctions. mRNA-Seq coverage. Read-count-based gene expression analysis is as good as or better than arrays or similar tech. Using principal component analysis. Using mRNA-Seq, you can clearly separate their controls and cases, which they couldn't do with arrays. It improves the diagnosis component of variance.
Showing “volcano plots”.

Many of genes found for schizophrenia converged on a single pathway, known to be involved in neurochemistry.

Have a visualization tool, and showed that you can see junctions and retained introns; then wanted to do it more high-throughput. Started a collaboration to focus on junctions, to quantify alternative transcript isoforms. Working on the first map of transcriptome splicing in human tissues. 94% of human genes have multiple exons. Every one showed alternative splicing in at least one of the tissues examined.

92% have biochemically relevant splicing. (minimum 15%?)

8 types of alternative splicing... 63% of alternative splicing is tissue-regulated. 30% of splicing variation occurs between individuals. (So tissue splicing trumps individual variation.)

[Brief discussion of 454 based experiment... similar results, I think.]

1. cost effective
3. biologically relevant
4. identified stuff missed by genome sequencing

Finally, also compared genotypes from individuals looking at cSNPs. Cis-acting SNPs causing allelic imbalance. Used it to find eSNPs (171 found). Finally, you can also fine-map the eQTN within an eQTL.


Jesse Gray, Harvard Medical School - “Neuronal Activity-Induced Changes in Gene Expression as Detected by ChIP-Seq and RNA-Seq”

Now “widespread overlapping sense/antisense transcription surrounding mRNA transcriptional start sites.”

Thousands of promoters exhibit divergent transcriptional initiation. Annotated TSSs come from NCBI; there are 25,000 genes. There is an additional anti-sense TSS (TSSa) 200 bp upstream. [Nifty, I hadn't heard about that.]

Do RNA-Seq and ChIP-Seq. Using SOLiD. SOLiD or Ambion [not sure which] plans to sell the method as a kit for WTSS/WT-Seq.

Using RNA Pol II ChIP-Seq.

Anti-sense transcription peaks about ~400 bases upstream of the TSS. When looking at the genome browser, you see overlapping TSS-associated transcription. (You see negative-strand reads in the other direction, upstream from the TSS, and forward-strand reads at the TSS, with a small overlap in the middle.)

It is a small amount of RNA being produced.

Did a binomial statistical test, fit to 4 models:
1. sense-only initiation
2. divergent initiation (overlap)
3. anti-sense-only initiation
4. divergent (no overlap)

The vast majority are TSSs with divergent overlap; 380 are divergent (no overlap), 900 sense-only, 140 anti-sense-only. Many other sites were discarded because it was unclear what was happening. This is apparently a widespread phenomenon.
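As a toy illustration of that model-fitting step (my own sketch, not the speakers' actual test; the background rate, window counts and cutoff below are all invented), each strand can be called "initiating" when its read count is improbable under a background binomial, and the TSS labelled accordingly:

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): upper-tail binomial probability."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def classify_tss(sense_reads, antisense_reads, window_reads,
                 background=0.05, alpha=0.01):
    """Label a TSS by which strands show significant initiation.
    'background' is the fraction of window reads expected by noise
    (an invented parameter for this sketch)."""
    sense = binom_sf(sense_reads, window_reads, background) < alpha
    anti = binom_sf(antisense_reads, window_reads, background) < alpha
    if sense and anti:
        return "divergent"
    if sense:
        return "sense only"
    if anti:
        return "anti-sense only"
    return "unclear"

# A TSS with strong signal on both strands is called divergent:
print(classify_tss(80, 60, 200))
```

A real analysis would additionally separate the two divergent sub-models by testing whether the sense and anti-sense read clusters overlap positionally.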

Might this be important? Went back to ChIP-Seq to classify the peaks into these categories from RNA Pol II expt. (Same categories.) Is this a meaningful way to classify sites, and what does it tell us?

How many of those peaks have a solid PhastCons score, which should tell you something about the region? "No initiation" has the lowest scores... the ones with the anti-sense models have the highest conservation at the location of anti-sense initiation.

Where do the peaks fall when they have anti-sense transcription? Anti-sense peaks are bimodal; sense-only and bi-directional peaks sit just before the TSS, and are not bimodal.

Tentatively, yes, it seems like this anti-sense is functionally important.

Does TSSo change efficiency of initiation?

Break into two categories: non-overlap TSSs, and overlap TSSs. It appears that overlap TSSs produce more than twice the RNA of non-overlap TSSs. This could be a bias... it could be selecting for highly expressed genes. Plotting the RNA Pol II occupancy at the start sites, there is a big difference at the overlapping TSSs. Non-overlap TSSs have higher occupancy at the TSS itself, but lower occupancy up- and downstream than overlap TSSs. Thus the transition to elongation may be less efficient.

Does TSSo change efficiency of initiation? Tentatively, yes.

Comment from audience: this was described a year ago in a paper by Kaplan (Kaparov?). Apparently it was lately described that these are cleaved into 31nt capped reads. Thus, the fate of the small RNA should be of interest. 50% of genes had this phenomenon.

Question from audience: what aligner was used, and how were repetitive sections handled? Only uniquely mapping reads, using the SOLiD pipeline. (The audience member thinks that you can't do this analysis with that data set.) Apparently, someone else claims it doesn't matter.

My Comment: This is pretty cool. I wasn't aware of the anti-sense transcription in the reverse direction from the TSS. It will be interesting to see where this goes.


Terrence Furey, Duke University - “A Genome-Wide Open Chromatin Map in Human Cell Types in the ENCODE Project”

2003: initial focus on 1% of genome. Where are all the DNA elements.
2007: Scale up from 1% to 100%

Where are all of the regulatory elements in the genome: a parts list of all functional elements.

We now know: 53% unique, 45% repetitive, 2% are genes. Somehow, the 98% controls the other 2%.

Focussed on regions of open chromatin. Open chromatin is not bound to nucleosomes.

4. insulator control regions
6. meiotic recombination hotspots

Use two assays. First, DNase hypersensitivity: used at single sites in the past, now used for high-throughput genome-wide assays. The second method is FAIRE: formaldehyde-assisted identification of regulatory elements, sequenced like a ChIP-Seq. [I don't know why they call it FAIRE... it's exactly a ChIP experiment – I must be missing something.]

Also explained what ChIP-Seq/ChIP-chip is. They now do ChIP-Seq. Align sequences with MAQ. Filter on the number of aligned locations (keep up to 4 alignments). Use F-Seq, then call peaks with a threshold. Use a continuous-valued signal.

The program is F-Seq, created by Alan Boyle. Outputs in BED and WIG format. Also deals with alignability and "ploidy". (Boyle et al, Bioinformatics 2008.) They use mappability to calculate smoothing.
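I haven't read the F-Seq paper yet, but the general idea of smoothing read starts into a continuous-valued signal and thresholding it can be sketched with a Gaussian kernel density (a minimal sketch of my own; the bandwidth, truncation and threshold are all made up):

```python
import math

def kernel_density_signal(read_starts, chrom_len, bandwidth=300):
    """Smooth read-start positions into a continuous per-base signal
    with a Gaussian kernel (the general idea, not F-Seq's exact math)."""
    signal = [0.0] * chrom_len
    half = 3 * bandwidth  # truncate each kernel at 3 sigma
    norm = 1.0 / (bandwidth * math.sqrt(2 * math.pi))
    for start in read_starts:
        for pos in range(max(0, start - half), min(chrom_len, start + half + 1)):
            signal[pos] += norm * math.exp(-0.5 * ((pos - start) / bandwidth) ** 2)
    return signal

def call_peaks(signal, threshold):
    """Report (start, end) intervals where the signal exceeds a threshold."""
    peaks, start = [], None
    for i, v in enumerate(signal):
        if v >= threshold and start is None:
            start = i
        elif v < threshold and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(signal)))
    return peaks

# 30 reads piled at position 5000 produce a single peak spanning it:
sig = kernel_density_signal([5000] * 30, 10_000, bandwidth=100)
print(call_peaks(sig, 0.05))
```

A per-base continuous signal like this is also exactly what gets written out as the WIG track mentioned above.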

[This all sounds familiar, somehow... yet I've never heard of F-Seq. I'm going to have to look this up!]

Claim you need normalization to do proper calling. Normalization can also be applied if you know regions of duplications.

[as I think about it, continuous read signals must create MASSIVE wig files. I would think that would be an issue.]

Peak calling validation: ROC analysis, with false positives along the bottom axis and true positives on the vertical axis. Showed ChIP-Seq and ChIP-chip have very high concordance.
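For reference, the ROC points themselves are easy to compute from scored calls plus a truth set; a minimal sketch (my own, with invented scores and labels):

```python
def roc_points(scores, labels):
    """ROC curve as (FPR, TPR) points: sweep a threshold down through
    the scores; labels are True for calls confirmed by the truth set."""
    pairs = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_true in pairs:
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# Toy example: two true peaks scored high, two false ones scored low.
print(roc_points([0.9, 0.8, 0.3, 0.2], [True, True, False, False]))
```

High concordance between two methods shows up as a curve that hugs the top-left corner of this plot.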

DNase I HS – 72 million sequences, 149,000 regions, 58.5Mb – 2.0%
FAIRE – 70 million sequences, 147,000 regions, 53Mb – 1.8%

Compare them – and you see the peaks in one correspond with the peaks in the other. Not exact, but similar. Very good coverage by FAIRE of the DNase peaks. Not as good the other way, but close.

The goal is for the project to be done on a huge list of cells (92 types?? – 20 cell lines now, with 50 to 60 more to add, including different locations in the body, disease states, cells exposed to different agents... etc.). RNA is tissue-specific, so that changes what you'll see.

Using DNase and FAIRE assays to define an open chromatin map
exploring many cell types
discovery of ubiquitous and cell-specific elements

Note: Data is available as quickly as possible - next month or two, but may not be used for publication for the first 9 months.


Kai Lao, Applied Biosystems - “Deep Sequencing-Based Whole Transcriptome Analysis of Single Early Embryos”

I think all sequencing was done with ABI SOLiD.

To get answers about early life stages, you need to do single cells – early life happens in single cells, or close to it. When you separate a two-cell embryo, miRNAs are symmetrically distributed (measured by array). T1 and T2 have similar profiles. When you separate at the four-cell stage – it's still the same...

Can you do the same thing with next gen sequencing to do whole transcriptome? (Yes, apparently, but the slide is too dark to see what the method is.) Quantified cDNA libraries on gel, then started looking at results.

If you do everything perfectly, concordance between forward and reverse strands should be the same. However, if you do the concordance between two blastomeres, you see different results. [Not sure what the difference is, but things aren't concordant between the two samples...]

First, showed that libraries have very high concordance – the same oocyte gives excellent concordance. However, between the Dicer knockout and WT, you get several genes that do not have the same expression in both. Many genes are co-up-regulated or co-down-regulated.

One gene was Dppa5. In WT, it had low expression; in Dicer-KO and Ago2-KO, it was upregulated.

After Dicer genes were knocked out at day 5, only 2% of maternal miRNAs survived in a mature Dicer-KO oocyte (30 days). Dicer-KO embryos cannot form viable organisms (beyond the first few cell stages).

Deeper sequencing is better. With 20M reads, you get array level data. You get excellent data beyond 100M reads.

No one had ever proved that multiple isoforms are expressed at the same time in a single cell – used this data to map junctions, and showed they do exist. 15% of genes expressed in a single cell are expressed as different isoforms.


Matthew Bainbridge, Baylor College of Medicine - “Human Variant Discovery Using DNA Capture Sequencing”

Overview: technology + pipeline, then genome pilot 3, SNP calling, verification.

Use solid-phase capture – NimbleGen array + 454 sequencing.
Map with BLAT and cross_match. SNP calling with ATLAS-SNP.

All manner of SNP filtering:
1. remove duplicates with the same location
2. then filter on p-value
3. more... [missed it]

226 samples of 400.

Rebalanced arrays: some exons pull down too much, and others grab less. You can change the concentrations, then, and use the rebalanced array.

Average coverage came down, but overall coverage went up. Much less skew with the rebalanced array. 3% of the target region just can't be sequenced. 90% of the sequence ends up covered 10x or better.

Started looking at SNPs – frequency across individuals.

Interested in ataxia, a hereditary neurological disorder. Did 2 runs in the first pilot test on 2 patients; now do 4. Found 18,000 variants. Found one in the gene named for that disease – it turned out to be novel, and non-synonymous. Followed up on it, and it looked good: sequenced it in the rest of the family, but it didn't actually exist outside that patient.

So that brings us to validation: concordance to HapMap, etc., but those only tell you about false negatives, not false positives. You have to learn about false positives with other methods, and the traditional ones can't do high throughput. So, to verify, they suggest using other platforms: 454 + SOLiD.

When they're done, you get good concordance, and the false positives drop out. The interesting question is "do you need high quality in both techniques?" The answer seems to be no – you just need high quality in one... but do you need even that? Apparently not: you can do this with two low-quality runs from different platforms. Call everything a SNP (errors, whatever... call it all a SNP). When you do that and then build your concordance, you can do a very good job of SNP calling! (60% are found in dbSNP.)
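The "call everything, then intersect" trick boils down to a set intersection, relying on the two platforms having different error modes. A minimal sketch (the positions and alleles below are invented):

```python
def concordant_snps(calls_a, calls_b):
    """Keep only positions where both platforms report the same allele.
    Each input maps position -> called allele, including low-confidence calls."""
    return {pos: allele for pos, allele in calls_a.items()
            if calls_b.get(pos) == allele}

solid_calls = {101: "A", 250: "T", 390: "G"}   # hypothetical SOLiD calls
ls454_calls = {101: "A", 250: "C", 512: "G"}   # hypothetical 454 calls
print(concordant_snps(solid_calls, ls454_calls))  # {101: 'A'}
```

Since each platform's errors are roughly independent of the other's, the chance that both make the same wrong call at the same position is small, which is why even two low-quality call sets intersect to a usable one.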

My Comments: Nifty.


Keynote: Richard Gibbs, Baylor College of Medicine - “Genome Sequencing to Health and Biological Insight”

Repetitive things coming up in genomics, and comments about the knowledge pipeline. Picture of snake that ate two lightbulbs.... [random, no explanation]

“Cyclic” meeting history: it used to be GSAC, then stopped when that became too industrial. Then switched to AMS, and then transitioned to AGBT. We're coming back to the same position, but it's much healthier this time.

We should be more honest about our conflicts.

The pressing promise in front of us – making genomics accessible. Get yourself genotyped... (he did); the information presented is just “completely useless!”

We know it can be really fruitful to find variants. So how do we do that operationally? Targeted sequencing versus whole genome. Which platform? (Compared it to Coke vs. Pepsi.)

They use much less Solexa, historically. They just had good experiences with the other two platforms.

16% of Watson SNPs are novel, 15% of Venter SNPs are novel. ~10,500 novel variants.(?) [not clear on the slide]

Mutations in the Human Gene Mutation Database. We already know the databases just aren't ready yet... not for functional use.

Switch to talking about SOLiD platform:

SNP detection and validation. Validation is difficult – but having two platforms do the same thing makes it MUCH easier to knock out false positives. Same thing for indels: you get much higher confidence data. Two platforms are better than one.

Another cyclic event: Sanger, then next-gen, then base-error modelling. We used to say “just do both strands”, and now it's coming back as “just sequence it twice”. (He calls it “just do it twice” sequencing.)

Knowledge chain value: sequencing was the problem, then it became data management, and soon it'll shift back to sequencing again.

Capture: it's finally “getting there”. Exon capture and NimbleGen work very well in their hands. Coverage is looking very good.

Candidate mutation for ataxia: in one week they got to a list. Of course, they're still working on the list itself.

How to make genotyping useful?
1. develop the physician-genetics connection
2. retain faith in genotypic effects
3. need to develop knowledge of *every* base
4. example: function, orthology... and...

Other issues have to do with the history of each base. HapMap3/ENCODE: Sanger-based methods, about 1Mb per patient. Bottom line: found a lot of singletons. They found a few sites that were mutated independently, not heritable.

The other is MiCorTex: 15,200 people (2 loci), looking for atherosclerosis. Bottom line: we find a lot of low-frequency variants. Sequenced so many people that you can make predictions (“the coalescent”). The sample size is now a significant fraction of the population, so the statistics change. All done with Sanger!

Change error modeling – went back to original sequencing and got more information on nature of calls. Decoupling of Ne and Mu in a large sample data.

In the works: represent SNP error rate estimates with genotype likelihoods.
1000 Genomes pilot 3 project. If high-penetrance variants are out there, wouldn't it be nice to know what they're doing and how? 250 samples accumulated so far.

Some early data: propensity for non-sense mutations.
Methods have evolved considerably
whole exome
variants will be converted to assays
data merged with other functional variants.

Whole genome and capture are both doing well.
Focus is now back on rare variants
platform comparison also good
Db's still need work
site specific info is growing
major challenge of variants understanding can be achieved by ongoing functional studies and improve context.


John Todd, University of Cambridge - “The Identification of Susceptibility Genes in Common Diseases Using Ultra-Deep Sequencing”

Type 1 diabetes: a common multifactorial disease. One of many immune-mediated diseases that in total affect ~5% of the population. Distinct epidemiological & clinical features. Genome-wide association success... but what's next?

There is a pandemic increase in type 1 diabetes. Since the 1950s, there's been an abrupt 3% increase each year. Age at diagnosis has been decreasing; now 10-15% are diagnosed under 5 years old.

There is a strong north-south and seasonality bias to it. Something about this disease tracks with seasons.. vitamin D? Viruses?

Pathology: massive infiltration of beta cell islets.

In 1986: 1000 genotypes. In 1996: multiplexing allowed 1,000,000 genotypes, now allows full genome association.

Crohn's and diabetes are “the big winners” from the Wellcome Trust – the most heritable and easily diagnosed of the seven diseases originally selected.

Why do people get type 1 diabetes? A large effect at the HLA class II = immune recognition of beta cells. Hundreds of other genes, in common and rare alleles of SNPs and SVs involved in immune homeostasis.

Disease = a threshold of susceptibility alleles and a permissive environment.

What will the next 20 years look like? National registers of diseases (linkage to records and samples where available). Mobile phone text health. Identification of causal genes and their pathways (mechanisms), the natural history of disease susceptibility, and newborn susceptibility by their T1D genetic profile. What dietary, infectious, gut flora-host interactions modify these, and which can we affect?

Can we slow the disease spread down?

There are 42 chromosome regions in type 1 diabetes, with 96 genes. Which are causal? What are the pathways? What are the rare variants? Genome-wide gene-isoform expression. Genotype-to-protein information.

Ultra-deep sequencing study: 480 patients and 480 controls; PCR of exons, then 454 sequencing. 95% probability of detecting an allele at 0.3% frequency.
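That 95% figure roughly checks out with a simple binomial argument, assuming the allele only needs to appear on one of the sampled chromosomes (my own back-of-envelope; it ignores sequencing error and coverage dropout):

```python
def detection_probability(n_people, allele_freq):
    """Chance that at least one of the 2*n_people sampled chromosomes
    carries the allele."""
    return 1 - (1 - allele_freq) ** (2 * n_people)

# 480 patients, allele at 0.3% frequency:
print(round(detection_probability(480, 0.003), 3))  # prints 0.944
```

With the controls included as well (960 people), the same calculation puts detection above 99%.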

Found one hit: IFIH1. Followed up in 8000+ patients – found this gene was not associated with disease, but with protection from disease! Knock it out, and you become susceptible!

It's possible that this is associated with protection of viral infections. The 1000 genome project may also help give us better information for this type of study.

The major prevention trial to prevent type 1 diabetes is ingestion of insulin to restore immune tolerance to insulin.

Do we know enough about type 1 diabetes?

Maybe one of the pathways in type1 diabetes is a defect in oral tolerance?

Type 1 diabetes co-segregates with things like celiac disease (wheat intolerance) – one of the rare autoimmune diseases for which we know the environmental factor (gluten): a failure of the gut immune system to be tolerant of gluten.

The majority of loci are shared between type 1 diabetes and celiac disease. (Sister diseases.)

Compared genes in type 1 and type 2 diabetes – they do not overlap. There is no molecular basis for the grouping of these two diseases.

Common genotypes are ok for predicting type 1. ROC curve presented. Can identify population that is likely to develop T1D, but.... how do you treat?

Going from genome to treatment is not obvious, though.

Healthy volunteers – recallable by genotype, age, etc (Near Cambridge).

Most susceptibility variants affect gene regulation & splicing. Genome wide expression analysis of mRNA and isoforms in pure cell population. Need to get down to lower volume of input material and lower costs.

Using high-throughput sequencing with allele-specific expression (ASE). Looking for eQTLs for disease and biomarkers. Doing work on other susceptibility genes. (Using volunteers recallable by genotype.)

Looking for new recruits: Chair of biomedical stats, head of informatics, chair of genomics clinical....


Kathy Hudson, The Johns Hopkins University - “Public Policy Challenges in Genomics”

Challenges: getting enough evidence is difficult: Analytic validity, clinical validity.. etc etc

Personal value is there theoretically – but will it work?

Two different approaches: who offers them, and then who makes the tests?

Types: either performed with or without consent. Results returned.. or not. There are now a large number of people offering tests for a wide number of conditions.

Are the companies medical miracles, or just marketing scams? Are the predictions really medically relevant? The FTC is supposed to stop companies that lie... but for genetic testing they just put out a warning.

Role of states in regulating: states dictate who can authorize a test. However, in some states anyone can order one, not just medical personnel.

How they're made:
There are two types of tests: lab tests (“homebrews”) and test “kits”. The level of regulatory oversight is disparate. The difference is not apparent to the people ordering them, but they have different types of oversight.

[Flow charts on who regulates what.] Lab tests are not under the FDA (they're overseen through CMS)... and it makes no sense for them to be there. You can't get access to basic science information through CMS, whereas in the FDA, that's a key part of the mandate(?)

Example about proficiency testing – which was poorly implemented in law, and is still not well done. The list is now out of date – and none of the listed diseases being tested have a genetic basis. CMS can't explain what the numbers in the reported values mean (labs get 0's for multi-year tests, but CMS can't explain it).

FDA regulation of test kits is much more rigorous.

Genentech has started arguing that the two-path system should not exist: tests should be regulated based on risk, not manufacturer. Obama-Burr introduced a genetic medicine bill in 2007, and something more recently by Kennedy. (Also biobanking?)

Steps to effective testing:
1. level of oversight based on risk
2. tests should give an answer nearly all the time
3. data linking genotype to phenotype should be publicly accessible
4. high-risk tests should be subject to independent review before entering the market
5. pharmacogenetics should be on the label
6. [missed this point]

Privacy: should it be public? Who perceives it as what?

More people are concerned about financial privacy than medical privacy. 1/3 think their medical record should be “super secret”; and when asked what part of it should be most private, most people said their social security number! Genetic tests and family history are way down the list of what needs to be protected.

People trust doctors and researchers, but not employers. The Genetic Information Nondiscrimination Act is a consequence of that trust level. (Not a direct result?)

The new Privacy Problem? DNA snooping. Who is testing your DNA? (Something about a half-eaten waffle left by Obama that ended up on ebay... claiming it had his DNA on it.)

Many actions: testing, implementing laws, modernizing laws, transparency, better testing.

My comments: It was a really engaging talk, with great insight into US law in genetics. I'd love to see a more global view, but still, quite interesting.


Howard McLeod, University of North Carolina, Chapel Hill - “Using the Genome to Optimize Drug Therapy”

“A surgeon who uses the wrong side of the scalpel cuts her own fingers and not the patient. If the same applied to drugs, they would have been investigated very carefully a long time ago.”
Rudolf Buchheim (1849)

The clinical problem: multiple active regimens for the treatment of most diseases. Variation in response to therapy, unprecedented toxicity and cost issues! With choice comes decision. How do you know which drug to provide?

“We only pick the right medicine for a disease 50% of the time”. Eventually we find the right drug, but it may take 4-5 tries. Especially in cancer.

“toxicity happens to the patient, not the prescriber”

[Discussion of importance of genetics. - very self-deprecating humour... Keeps the talk very amusing. Much Nature vs. Nurture again]

“Many Ways To Skin a Genome”. Tidbit: up to half of the DNA being measured can come from the lab personnel handling the sample. [Wha?] DNA testing is being done in massive amounts: newborns, transplants...

“you can get DNA from anything but OJ's glove.”

We also see applications of genetics in drug metabolism, e.g. warfarin. Too much: bleeding; too little: clotting. One of only two drugs that has its own clinic. [Yikes.] Apparently methadone is the other. Why does it have its own clinic? “That's because this drug sucks.” Still the best thing out there, though. Discussion of CYP-based mechanisms and the vitamin K reductase target. Showed a family tree – too much crossing of left and right hands...

Some discussion of results – showing that there are difference in genetics that strongly influences metabolism of warfarin.

Genetics has now become part of litigation – warfarin is one of the most litigated drugs.

We need tools that translate genetics into lay-speak. It doesn't help to tell people they have a CYP2C*8... they need a way to understand and interpret that.

If we used genetics, we'd be able to go from 11% to 57% of “proper dose” on the first try with warfarin.

Pharmacogenomics have really started to take off and there are now at least 10 examples.

What is becoming important is pathways... but there are MANY holes. We know what we know, We don't know what we don't know.

We can do much of the phenotyping in cell lines – we can ask “is this an inheritable trait?” This should focus our research efforts in some areas.

Better systematic approach to sampling patients.

What do we do after biomarker validation? Really, we do nothing – we assume someone else will pick it up (through osmosis... that's faith-based medicine!) We need to talk to the right people and then hand it off – we need to do biomarker-driven studies with the goal of knowing who to hand it off to.

Take home message:

Pharmacogenetic analysis of patient DNA is ready for prime time.

My Comment: Very amusing speaker! The message is very good, and it was engaging. The science was well presented and easily understandable, and the result is clear: there's lots more room for improvement, but we're making a decent start and there is promise for good pharmacogenomics.


Keynote Speaker: Kari Stefansson, deCODE Genetics - “Common/Complex traits with emphasis on disease”

Sounds like Sean Connery!

Basic assumption is that information is the basic unit of life – and the genome is the carrier. They are creating a database where they can start decoding that information – and have had some success, including finding the genes for the love of complex crossword puzzles. (-:

Traits range in complexity from simple mendelian all the way to really complex genotypes and phenotypes, which are often involved in diseases. One thing to keep in mind is that they also have geographical traits.

First example: melanoma. Very different genes (for light hair and skin) occur in the population, varying by location - people in Iceland don't have problems carrying this gene, but those in Spain would!

Second example: Genetic risk of atrial fibrillation is genetic risk of cardiogenic stroke. About 30% of stroke is indeterminate origin, but a significant proportion is associated with several genetic traits. [insert much statistics here!]

Third example: Thyroid Cancer – (published today?). Incidence is increasing of late. If it's caught early, it has a very good prognosis. It has a very large familial component. Did a genome association in Iceland, identified 1100 individuals, and genotyped 580 of them. Pulled out 2 significant (independent) loci, which associated with two forms of thyroid cancer. [more statistics too fast to make notes...] Individuals with both genes have a 5.7X increase in risk. (Multiplicative model.) The two loci also have differences in clinical presentations. Candidate for the first is the FOXE (TTF2) transcription factor. The second is NKX2-1 (TTF1). Apparently these gene(s?) regulate Thyroid Stimulating Hormone... so there may be an interesting mechanism.
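As a quick aside on what a multiplicative model means here: the combined relative risk for someone carrying several independent risk variants is just the product of the per-locus risks. A minimal sketch (the example values are made up for illustration, not the actual per-locus figures from the talk):

```python
def combined_risk(per_locus_risks):
    """Combine per-locus relative risks under a multiplicative model:
    the overall risk is the product of the individual risks.
    Input values are illustrative, not deCODE's actual figures."""
    total = 1.0
    for r in per_locus_risks:
        total *= r
    return total

# Hypothetical example: two loci with relative risks of 2.0 and 1.5
# combine to an overall relative risk of 3.0 under this model.
```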

Where are we now when it comes to discovery of sequence variants that code for genetic components of complex disease? There seems to be a significant amount still undiscovered. Most of the variants that have been discovered have risk factors over 5%... [not sure if that's right] Bottom line is that the detection limits are such that we can't find the really low-frequency variants with lower risk factors.

There may be a large contribution from rare variants with large effects
There may be a large contribution from rare variants with small or modest effects.
[one more.. not fast enough]

Started deCODE based on family-based methods, and have now returned to them. Concept of Surrogate Parenthood – surrogates work as well as natural parents for phasing of the proband. To get down to all traits with 2% of variants, they would only need to sequence ~2000 people. [Daniel says there are only 400,000 people in Iceland].

Have also noted genes where the risk factor is different between maternally and paternally passed genes.

Prostate cancer: have shown that there are 8 genes that have a cumulative risk factor. Important for treatment and preventative care.

End by pointing out that all of the common diseases are diseases where there are both environmental and genetic components. How do they interact? How do they fit into our debate (nature vs. nurture)?

Published on nicotine dependence and lung cancer last year. In Iceland, it's purely environmental – only smokers in Iceland develop cancer (14%). Discovered a sequence variant that makes you more likely to smoke more because you're more likely to crave the nicotine... where is the line between nature and nurture, then? To solve this problem, you need to understand the brain – to understand the behaviours that make us susceptible to environmental diseases.


Thursday, February 5, 2009

Complete Genomics

[Missed start of talk]

Inexpensive. Non-sequential bases? No ligase required.

Long Fragment Reads. Start with high mol wt DNA – 70-100k bases, sample prep that barcodes the fragments, sequence them, and then informatically map the reads back to the fragment. Assembly then gives you 100kb fragment lengths. Chromosome phasing begins to become possible. [spiffy!]

Thus, you can do this over a whole genome as well; it allows the maternal and paternal DNA to be worked out.

Not planning to sell instruments: only going to be a sequencing centre doing it as a service. 20,000 genomes by end of 2010. Big challenge is actually assembly. 60K cores in the cluster, 30Pb of disk space.

Will partner with Genome centres. Yesterday signed an agreement to try a pilot with Broad. Will build genome centres around the world.

Trying to make sequencing ubiquitous. Send them a sample, then click on a link to get your data.

Saves you capital on sequencing and then on compute infrastructure.

Will only do Humans! [I cracked up at this point.]

My comments: Ok, I missed the beginning, but the end was interesting. I totally don't understand the business model. By doing only humans, they'll only find hospital customers... and which hospital will pay for them to build a data centre? I'll elaborate more on that in another post.


Erin Pleasance, Wellcome Trust Sanger Institute - “Whole Cancer Genome Sequencing and Identification of Somatic Mutations”

Goals of cancer genome sequencing: WGSS read sequencing. Detection of substitutions, indels, rearrangements, CNVs. Detection of coding and non-coding variants. Catalogue of somatic mutations, functional impact and mutational patterns. Drivers vs. passengers.

Talk tonight about one cancer and one matched normal genome.

NCI-H209 small cell lung cancer cell line.

Cancer cell line and non-cancer cell line derived from the same individual.
Prior sequencing by PCR and capillary.
Somatic mutations: 6/Mb, or 18,000 in the genome.
Other data also available: Affy SNP6 and expression arrays.

Show karyotype. Kinda funky, but mostly sane. (-:

Used the AB SOLiD machine... strategy is pretty obvious: sequence cancer and matched normal. All PET, and aligned with MAQ, Corona for substitutions.

How much sequence do you need to do? Turns out, you need equal amounts of both – and it's about 30X coverage. There is a GC effect on coverage.
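For context on that 30X figure: under the idealized assumption that reads land uniformly at random (the Lander-Waterman model, which ignores the GC effect mentioned above), per-base coverage follows a Poisson distribution. A rough sketch of what that implies:

```python
import math

def expected_covered_fraction(mean_coverage):
    """Fraction of bases covered at least once, assuming reads land
    uniformly at random (Poisson model; real data has GC bias)."""
    return 1.0 - math.exp(-mean_coverage)

def fraction_at_least(mean_coverage, k):
    """Fraction of bases covered by at least k reads (Poisson tail)."""
    below = sum(math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
                for i in range(k))
    return 1.0 - below

# Under this model, at 30X essentially every base is hit at least once,
# and the vast majority are hit 10 or more times.
```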

Compare with dbSNP. About 80% are there.

Look at tumour only, with simple filtered reads: about 50% are not in dbSNP. Many are probably germline variants. Mutation vs SNP rate: you need to call SNPs and mutations using the control at the same time to get the best results. As well, if you have greater-than-diploid chromosomes, you need to worry about that too.

Also: CNV changes and ploidy, normal cell contamination, base qualities, and it's important to do indel detection first.
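To illustrate why the tumour and control need to be called together, here's a toy somatic caller over base pileups at a single position. This is my own simplification of the idea, not the Sanger pipeline: it ignores base qualities, ploidy and contamination, all of which were flagged as important above:

```python
from collections import Counter

def call_somatic(tumour_bases, normal_bases, ref,
                 min_alt_frac=0.2, max_normal_frac=0.03):
    """Toy somatic SNV call at one position from pileup base lists.
    A variant is called 'somatic' only if well supported in the tumour
    AND essentially absent from the matched normal. Thresholds are
    illustrative, not the published method's parameters."""
    tumour = Counter(tumour_bases)
    normal = Counter(normal_bases)
    # Most frequent non-reference base in the tumour pileup.
    alt, alt_count = max(((b, c) for b, c in tumour.items() if b != ref),
                         default=(None, 0), key=lambda x: x[1])
    if alt is None:
        return None
    tumour_frac = alt_count / len(tumour_bases)
    normal_frac = normal.get(alt, 0) / max(len(normal_bases), 1)
    if tumour_frac >= min_alt_frac and normal_frac <= max_normal_frac:
        return alt  # somatic candidate
    return None     # reference, germline, or noise
```

Calling the tumour alone would report the germline variants too; the matched normal is what separates them out.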

CNV: easy to obtain, and cleaner than array data.

Structural variants from paired reads, done genome-wide. 50 of the mutations interrupt genes (of 125 in tumour only.)

Rearrangements: can also look at that. (Saw many rearrangement events).

Structural variants at basepair resolution. (Using Velvet... good job Daniel).

Last thing of interest: Small indels (less than 10bp.) Paired end reads, anchor with one end.

Medium indels can be found by identifying deviation in insert size (Heather Peckham). You can see a shift in the size distribution... not an actual significant change. [interesting method] Can be seen in comparison between normal and tumour.
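The insert-size idea is simple enough to sketch: if read pairs spanning a locus map consistently further apart in the tumour than in the normal, the difference in mean insert size estimates the size of a deletion (a negative shift suggests an insertion). A toy version, not the actual method from the talk:

```python
from statistics import mean

def insert_size_shift(normal_inserts, tumour_inserts):
    """Shift in mean insert size between matched normal and tumour
    read pairs spanning a locus. Positive suggests a deletion in the
    tumour (pairs map further apart), negative an insertion."""
    return mean(tumour_inserts) - mean(normal_inserts)

def classify_shift(shift, min_size=20):
    """Call an indel only if the shift exceeds an arbitrary minimum;
    a real caller would test the whole insert-size distribution."""
    if shift > min_size:
        return "deletion"
    if shift < -min_size:
        return "insertion"
    return None
```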

To summarize: somatic variants throughout the genome. Circos plots (=
Somatic mutations, functional impact? Recurrence? Pathways?


Christopher Maher, University of Michigan - “Integrative Transcriptome Sequencing to Discover Gene Fusion in Cancer”

80% of all known gene fusions are associated with 10% of human cancers. Epithelial cancers account for 80% of cancer deaths, but have only 10% of known fusions.

Mined publicly available datasets and looked for genes with outlier expression.

Will use next-gen sequencing to get direct sequence evidence of chimeric events. Decided to use both 454 and Illumina. Categorized reads: mapping reads, partially aligned reads, non-mapping reads. Used the same samples [whoa... the classification just got extensive... moving on.]

Chimera discovery using long-read technology. Sequenced: VCaP, LNCaP, RWPE. Found 428, 247, and 83 chimeras respectively.

Then added Illumina. First checked that they could find the fusion that they already knew; 21 reads mapped there.

Found both intra- and inter-chromosomal candidates, and then validated 73% of them.

So, to recap: candidates found by both 454 and Illumina were MUCH more selective – they found they were throwing out false positives while keeping all the known targets.

Confirmed results with FISH.

Next expt: identification of novel chimeras in prostate tumour samples. Found candidate sequences from non-mapping reads, then worked to validate. How does it work, and what's its frequency? Found it in 7 metastatic prostate tissues, and it is androgen-inducible. In a meta-study, found the fusion of interest in about 50% of prostate cancers.

Came up with a chimeric classification system of 5 classes: inter-chromosomal translocations, inter-chromosomal complex rearrangements, intra-chromosomal complex deletions, intra-chromosomal complex rearrangements...

Summary: validated 14 novel chimeras
Demonstrated that cell lines can harbor non-specific fusions...
[too slow to catch last point]

Answer to question: 100bp reads would have been long enough to nominate fusions.


Anna Kiialainen, Uppsala University - “Identification of Regulatory Genetic Variation That Affects Drug Response in Childhood Acute Lymphoblastic Leukemia”

1. Most common in children
2. 20% do not respond to treatment
3. Multi-factorial disease

Allele-specific expression is important. You have two copies of each gene, but you can get different ratios of expression from each allele, which leads to very different proportions of each allele in the sample.

Advantages: the two alleles can serve as internal standards for each other.

Causes: SNPs that affect transcription or stability. Or, allele-specific promoter (regulation or methylation).
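Since the two alleles act as internal standards, detecting ASE amounts to asking whether allelic read counts at a heterozygous SNP deviate from the 50:50 expected under equal expression. A minimal exact binomial-test sketch (the significance cutoff is my own arbitrary choice, not the one used in the study):

```python
from math import comb

def binom_two_sided_p(ref_count, alt_count):
    """Two-sided exact binomial p-value for allelic counts against the
    50:50 ratio expected when both alleles are expressed equally."""
    n = ref_count + alt_count
    k = min(ref_count, alt_count)
    # P(X <= k) under Binomial(n, 0.5), doubled for two-sidedness.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def shows_ase(ref_count, alt_count, alpha=0.001):
    """Flag allele-specific expression; alpha is an illustrative cutoff."""
    return binom_two_sided_p(ref_count, alt_count) < alpha
```

Monoallelic expression (as in 67 of the genes mentioned below) is just the extreme case where one count is near zero.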

Samples: 700 children with acute lymphoblastic leukemia. Yearly follow-up data and drug response, in vitro drug sensitivity, immunophenotype, cytogenetic data. RNA available for 1/3 of them.

Genotyped over 3531 SNPs in 2529 genes. ASE was detected in 400 (16%) of the informative genes, 67 of which displayed monoallelic gene expression. (Milani L, et al, Genome Research 2009)

Methylation analysis: Selection of 1536 CpG sites from >50,000 CpG sites in genes displaying ASE. Custom GoldenGate methylation panel. (ibid)

SNP discovery: 56 genes displaying ASE selected for sequencing in 90 ALL samples. Template preparation with NimbleGen sequence capture. Illumina sequencing.

To date: 16 samples hybridized. 5 samples sequenced with GA I. 81-97% align to the genome (Eland), 28-67% align to the target region (MAQ).

Overview of sample sequencing coverage.

Initial SNP discovery – 2063-4283 SNPs found with MAQ. 3422 in at least two samples. 818/3422 are novel (not in dbSNP.)

My comments: This is an interesting talk, from the big picture view. Dr. Kiialainen spent a lot of time talking about metrics that haven't really been used much this year (percent alignment, etc.) and explaining figures that are relatively simple. There wasn't much data presented – essentially it's no more than an outline and statistics of the sequences gathered. Not my least favourite talk, but it had very little content, unfortunately. Knowing you can do allelic studies is neat, however, which is clearly the best part of the talk. My advisor is chairing the session... and asking the same questions he asks me. Nice to know I'm not the only one who answers with “No, I haven't done that yet!”


David Dooling, Washington University School of Medicine, “Next-Generation Informatics”

[Had to change rooms, missed some of the start of the talk]

The rate of change is far outstripping Moore's law.

So: Framing the problem - Viewpoints:
LIMS: [Picture of Richard Stallman... Nice!] How do we process and track information?
Analysis: [Picture of Freud... also nice... same beard?] How do we process and extract information?
Project Leads: In and out... what's the answer?

Pipelines: Always changing! Buffers, software, tools, etc etc, etc!!!!

Analysis: Changing Pipeline: Proliferation of Data has led to a proliferation of tools.

So how do we do things on a massive scale, but deal with the constant change.

“We've always been pushing the envelope...” using the past as a guide to how to deal with the change.

As developers, put it in terms of flow charts, databases, pathways.. etc. Get a handle on the problem

How we deal with it: Regular entities to event entities to processing directives

The problem comes when the processing directives change... and that's a big change – frequently. So, to deal with it, entities were classified. To apply this, things were abstracted into big units, which can each be modular. By making things modular, they can be substituted.

1. Created an object-relational mapping (ORM) layer.
2. Object Context
3. Dynamic command-line interface
4. Integrated Documentation System.

The ORM was created from scratch because none of the others were able to cope with the workload that was being demanded of it. Everything works in XML, so you can verify flow, and it makes it easier to do parallelization.

All of these things together become the “Genome Model”, a thin wrapper around all their tools, which gives you a massively parallel system with excellent data management and reporting.

Yikes... has an easy Perl API. [Everyone likes Perl? Count me out.]

Working model for employees: pairing – analysts are paired with programmers so that better software is written.
Still much more to do.
Sequencing is demolishing Moore's Law.
The cult of traces – desire to have raw information at our fingertips. (Venn diagrams don't scale well, but things like Circos do!)


Steven Turner, Pacific Biosciences - “Applying Single Molecule Real Time DNA Sequencing”

Realizing the power of polymerase SMRT.
Each nucleotide is labeled uniquely, and the fluorophores are cleaved off, leaving behind just the DNA. Using a Zero Mode Waveguide, only the base being read is shown. [cool videos].

At end of signal, it just moves on to the next base...

Every day, they're working on SMRT – showing a demo run... 3,000 reactions in parallel. Multi-kb genomic fragment. Just one polymerase. Similar to electrophoresis... keeps going... and going and going. Real time – several bases per second. Put it at the bottom of the screen, and it just keeps going on and on and on. [it IS transfixing.]

Start with genomic DNA, shear by any method you want, and now, ligate with HAIRPINS! It's now circular... so you can keep going around and around. You get both sense and antisense DNA. Can close any size... they call it a “SMRTbell prep”, (eg, not a dumbbell... heh... not really that funny.) They also use a strand-displacement enzyme, so it just displaces what was already there.

First project was a human BAC, last November: 107kb of chromosome 17. Production read length: 446bp. Max read: 2,047bp. Aligned to NCBI, and validated by Sanger.

In non-repetitive regions: 99.996% accuracy, missed 3 SNPs that were false negatives. Repetitive: 99.96%, missed 7 bases. Have made significant progress since then...

Sequenced E. coli to 38-fold... 99.3% (last January), max read length at 2800 bases. 99.9999992% [I hope I got that right!]

4 errors + 1 variant on whole genome (Q54!)
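For readers not used to Phred scores: Q = -10·log10(error rate), so Q50 is one error in 100,000 bases and Q60 is one in a million. A tiny helper to convert back and forth:

```python
import math

def phred(err):
    """Phred quality score from a per-base error rate: Q = -10 * log10(e)."""
    return -10.0 * math.log10(err)

def error_rate(q):
    """Error rate implied by a Phred score: e = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

# error_rate(60) is one error per million bases; phred(0.001) is Q30.
```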

Heh... they had issues from an artifact caused by more DNA closer to the ORI in E. coli, from stopping cultures in mid-phase. They now have such incredible accuracy that they can measure it.

Accuracy does not vary more than 5% over 1200 bases. Heads for Q60 around 20-fold coverage.

8-molecule coverage. (8 individual DNA strands have contributed.) Dependent on the fluorophores... they each show brightness profiles. So, some channels are still weak, but they have new ones in development to replace them.

One example: First time you can bridge a single 3200bp region. 3bp/sec. (2.6kb duplication region in the middle.)

Development: average of 946bp read length... and up to 1600 at the high end. You trade throughput for read length... at one end, fewer SMRT waveguides complete, but with long reads; at the other, more complete with shorter reads.

Consensus on a single molecule. You can also do heterogeneity. If you put in mixes, you get out a mix, with a linear relationship to the fraction recovered. (eg, SNPs will be very clean.)

Flexibility: you can do long OR short reads. Redundancy is high, so you can get 1ppm sensitivity. 12 prototype instruments in operation. Expect delivery in Q3 2010.


Adam Siepel, Biological Statistics & Computational Biology - “Comparative Analysis of 2x Genomes: Progress, Challenges and Opportunities”

Working on the newly released mammalian genomes at 2x coverage. We're rapidly filling out the phylogeny, so there's a lot of progress going on. We can learn a lot more by comparing genomes than we can by looking at a single genome.

Placental mammals (Eutherians) are well sequenced. The last of the 2x assemblies were released just last week. There are 22 genomes being focused on: most are 2x, a couple are 7x, and some are in the process of being ramped up.

One of the main obstacles is error (sequencing or otherwise). Miscalled bases and indels from erroneous sequences have a big impact. Thus, the goal is to clean up the 2x sets. In 120 bases: 5 spurious indels and 7 miscalled bases. [Wow, that's a lot of error.] Nearly 1/3rd of all 1-2 base indels are spurious.

Thus, comparative genomics often gets hit hard by these errors.

A solution: error correction. Use redundancy to systematically reduce error. In some sense, there is a version built in – we can use comparative genomics to “decode” the error-correcting code. This can be done because the changes between species tend to vary in predictable ways.

The core idea: Indel Imputation: “Lineage-specific indels in low-quality sequence are likely to be spurious.”

Do an “automatic reconstruction” using parsimony... If a lineage-specific indel is in low-quality sequence, then assume it's an error. More computationally intense methods are actually not much more effective.

There is also base masking – don't try to guess what the bases should be, but just change them to N's. Doing these things may change reading frames, however.
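The imputation and masking heuristics are simple enough to sketch. This is my own toy rendering of the idea, with an arbitrary quality threshold: an indel seen in exactly one lineage and supported only by low-quality sequence is presumed spurious, and low-quality bases are masked to N rather than guessed:

```python
def indel_is_spurious(species_with_indel, quality, q_threshold=20):
    """Sketch of the indel-imputation heuristic: an indel observed in
    exactly one lineage (lineage-specific under a parsimony
    reconstruction) whose supporting sequence is low quality is
    assumed to be a sequencing error. Threshold is illustrative."""
    lineage_specific = len(species_with_indel) == 1
    return lineage_specific and quality < q_threshold

def mask_low_quality(seq, quals, q_threshold=20):
    """Base masking: replace low-quality calls with 'N' rather than
    guessing the true base."""
    return "".join(base if q >= q_threshold else "N"
                   for base, q in zip(seq, quals))
```

The reading-frame caveat above applies: masking or removing a base can shift downstream codons, so any correction has to be weighed against over-correction.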

After doing the error correction, the error appears to drop dramatically. (I'm not sure what the metric was, however.)

Summary: good dataset with some error. Correction method used here is a “blunt instrument”, many or most errors can be masked or corrected if some over-correction is allowed. There is a trade off, of course.

Conservation analysis has its own problems as well. Thus, they have been working on new programs for this type of work: PhyloP. It has multiple algorithms for scaling phylogeny, and the like. Extensive evaluations of power for these methods were undertaken. However, the problem is that people are at the limit of what they can get out of conservation, depending on what's there. Power is pretty reasonable when selection is strong, or when the elements are longer (eg 3bp.)

Discussing uses of conservation.... moving towards single base pair resolution.


Phil Stephens - “Structural Somatic Genomics of Cancer”

Andy Futreal could not show up – he got snowed in in Philadelphia.
All work is from the Illumina platform.

Providing an overview of multistep model of cancer...

Precancer to in situ cancer, to invasive cancer, to metastatic cancer.

They believe cancers have 50-100 driver mutations plus 1000 passenger mutations. Some carry 10s to 100s of thousands of passengers.

The big question is how to identify the driver mutations. Today, the focus is on structural variations: 200bp to 10s of Mb. They can be seen as copy number variations or can be copy-number neutral (balanced translocations, etc.)

For instance, the upregulation of ERBB2, caused by multiple copies.

For the most part, we have no idea what the genomic instability does to the copy number at local points, and what they accomplish.

Balanced translocations are interesting because they tend to create fusion genes. There are at least 367 genes known to be implicated in human oncogenesis; 281 are known to be translocated. 90% are in leukemias, lymphomas... [missed the last one]

Use 2nd-gen sequencing to study these phenomena. For structural variation, it's always 400bp fragments, with PET sequencing. Align using MAQ. Basically, look for locations where the paired reads align wider apart than 400bp. Need high enough coverage to then check that these are real. You then need to verify using PCR – check if the germline had the mutation as well. Futreal's group is only interested in somatic mutations.
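The discordant-pair principle can be sketched in a few lines: flag pairs whose mapped span is much wider than the ~400bp fragment size, then require multiple supporting pairs before believing a breakpoint. A toy version with made-up parameters, not the published pipeline:

```python
from collections import defaultdict

def discordant_clusters(pairs, expected=400, tolerance=100,
                        min_support=2, bin_size=1000):
    """Toy discordant-read-pair SV screen. `pairs` is a list of
    (chrom, left_pos, right_pos) mappings for read pairs. Pairs whose
    span greatly exceeds the expected ~400 bp fragment size are
    counted per genomic bin; bins with too few supporting pairs are
    dropped as likely mapping artifacts. All thresholds are
    illustrative."""
    support = defaultdict(int)
    for chrom, left, right in pairs:
        if right - left > expected + tolerance:
            support[(chrom, left // bin_size)] += 1
    return {key: n for key, n in support.items() if n >= min_support}
```

Candidates that survive this kind of filter would then still need the PCR check against the germline described above.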

Now, that's the principle, so does it work? Yes, they published it last year.

NCI-H2171. Has 6 previously known structural changes for control. Very simple copy number variation identification. Solexa copy number data is at least as good as the Affy chip. They suggest Solexa has the ability to find the true copy number, whereas the Affy chip tends to saturate.

For control, they found intra-chromosomal reads, and then verified with PCR. Two reads mapped to the breakpoint, and they were able to work out the consequences of the break. Used a Circos diagram to show most translocations are intra-chromosomal, and only a small number are inter-chromosomal.

Since publication, they've now worked on the same project to update the data. They're better at doing what they did the first time around. They redid it on 9 matched breast cancer cell lines, and got ~9x coverage.

HCC38 – no highly amplified regions. Found 289 somatic chromosomal translocations. Most of the changes are due to tandem duplications; however, this was not replicated in another cell line. So, structural variation is highly variable.

Distinct patterns emerged: one line has lots of tandem duplications, one has very little structural variation, and one has a more lymphoma-like pattern.

“Sawtooth” pattern to CNV graph: lots of different things going on. Some are simple, some are difficult.

What are the Structural Variations doing?

Looked at examples of fusion proteins. In one cell line, HCC38, found 4 SVs. Found smaller SVs as well.

Duplications of exons 14 and 15 in one particular gene: a receptor tyrosine kinase, and they seem to be in the ligand-binding domain. Also evidence from many other observations of SNPs in the same domain.

What they didn't know was if it would reflect what's going on in breast cancer. 15 primary breast cancers were then sequenced. (65Gb total).

Huge diversity was found: anywhere from 8 to 230+ structural variants per tumour. The same patterns as in the cell lines were found. 11 potential promoter fusions...

[The numbers are flying fast and furious, and I can't get close to keeping up with them.]

151 genes are found in 2 samples. 12 in 3, 5 in 4....

How do you assay for variants? FISH, cDNA PCR! Other mutations in rearranged genes. Whole exome sequencing, transcriptome sequencing, and epigenetic changes are down the road.

Also can look at the relationship between somatic breakpoint positions in the genome.

Conclude: PET sequencing is useful for structural variation.
The average breast cancer has ~100 somatic mutations.
The average cancer has ~3.2 fusion genes.

Question: Genome vs Transcriptome? Answer: Both!
Question: how many of the hits are false? Answer: at first it was 95%, now it's down to 10%.

My comments: very nice talk. Since this is basically similar to what I'm working on, it's very cool. It's nice to know that PET makes such a huge difference. The paper referenced in the talk was a good read, but I'm going to have to go back and reread it.


Tom Hudson – Ontario Institute of Cancer Research “Genome Variation and Cancer”

Talking about two cancers: Colorectal tumours
1200 cases and 1200 controls
looking for predictors of disease

1536 SNPs from candidate genes, in 10K coding non-synonymous SNPs, Affy 100K and 500K arrays.

Eventually found a hit in a gene desert (a long intergenic non-coding RNA... learned the name this morning). (= Close to myc, but it hasn't been correlated to anything.

In the last year, 10 validated loci in 10,000 individuals, with very small odds ratios (1.10 to 1.25). One of them is a gene: SMAD7. 5 loci are also near genes that are involved in things... but are not actually in the gene.

Since there are 10 alleles, you'd expect a distribution; however, most people carry 9 (27%)! There is also a linear relationship between the number of alleles and the risk of developing cancer. However, this still doesn't seem to be the causative allele.

Enrichment of target regions, using a specific chip with 3.14Mb of colon cancer specific regions. Those regions didn't take all of the space, so they added other colon cancer gene sequences as well.

Protocol: 6ng, fragmentation (300-500bp)... [I'm too slow]

Exon capture arrays are being used, and preliminary results: 40 DNAs: 65Gb.

Use MAQ to do alignments. Coverage 75% at 10X, 95.6% at 1X.

“More than 99% of gDNA has % GC that allows effective capture”

Analyzable Target Regions: 39175, 232 coding exons
Average coverage: 70.3

40 individuals yield 8,706 SNPs
Known 59.6%
new snps, 2,397
Total number in coding exons: 77

Sequencing data compared to Affy data, very high concordance.

Rare alleles may be driving risk in several sporadic cases. Stop codons were found in 6 individuals with sporadic CRC.

Follow up genotyping is required to validate new SNPs and correlate with phenotype.

Second topic: International Cancer Genome Consortium.

“Every cancer patient is different, every tumour is different.” Lessons learned: Huge amount of heterogeneity within and across tumour types. High rate of abnormality, and sample quality matters!

50 tumour types x 500 tumours = 50,000 genomes.

Major issues: Specimens, consent, quality measures, goals, datasets, technologies, data releases.

[Mostly discussion of the mechanics of the project management, who's involved and where it's happening, as well as tumours, which I'm sure can be found on the OICR's web page. OICR is committed to 500 tumours, using Illumina and SOLiD. They are also creating cell lines and the like, so there will be a good resource available.]

Pancreatic data sets should be available on the OICR web page by June 2009.

Question: why Illumina and SOLiD? Answer: they didn't know which would mature faster. By doing both, they have more confidence in SNPs. They never know which will win in the end, either.

My Comments: Not a lot of science content in the second half, but quite neat to know they've had success with their CRC work. It seems like a huge amount of work for a very small amount of information, but still quite neat.


Keynote Speaker: Eddy Rubin, Joint Genome Institute - “Genomics of Cellulosic Biofuels”

They're funded by the DoE, so they have a very different focus. So, after all the work they did, they realized that the E stood for Energy, so they've started working on that. (-;

More than 98% of energy in transportation is from petroleum, for which there are environmental and political consequences. They've known about it for 30 years, but haven't worked on it much yet.
Churchill quote: You can count on Americans to do the right thing, as long as they've exhausted all other options.

Focuses on things like biofuels, where most of the focus will be on cellulosic biofuels. For those who don't know, this basically means using biomass – mostly cellulose. Many current technologies just use the sugar (edible) part of the plant – but cellulosic energy would use the non-edible parts of the plant, eg, the cellulose.

Every gallon of cellulosic biofuel produces 12x less CO2, and 8x less than corn biofuel.

How does the genome of bioenergy plant feedstocks help?

10k-fold increase in energy derived from domesticated grasses and wheats as compared to wild grains. So, domestication is a big deal. Can we domesticate Poplar?

If we could choose, we'd like short, stubby trees with compact root systems. There are groups that are systematically manipulating Auxins to try to cause this to occur. They've had some success. Can create shorter, stubbier trees, or trees with thicker trunks. So, it's working reasonably well.

Poplar is a niche, though. The real thing is grasses. They can be harvested, and they don't need to be replanted. (Something about them squirting their nutrients into the soil at the end of the year...)

Anyhow, there are already organisms that do cellulosic breakdown, so those should also be sequenced.

(One of them is a “stink bird”, which belches and smells... odd. Another is a ship-boring mollusk, which digests ships' bottoms.)

Can we replicate cellulosic degradation like that found in intestinal environments?

To dissect termites, you chill them on ice, then pull their heads from their tails, and eventually the guts are displayed. Ok, then.

Once you have the guts, you can sequence the microbiome. Doing so, they found more than 500 cellulose- and hemicellulose-degrading enzymes.

They also work with cow guts (fistulated cows.) The amount of volume obtained from 200 termites: 500 ul. The amount from one cow: 100ml.

(for the record, pictures of wood chips after 72 hours in a cow stomach – not appealing.)

One experiment that can be done is to feed the cow various types of feed to see which enzymes are being used. The enzymes being used are very different, but the microbial community is the same. This is a new source of enzymes for the degradation of energy crops.

The final step in this process: conversion of biomass to liquid fuel. The easiest method is fermentation. More than 20% of the sugars you get from degrading wood is xylose... and it's not being fermented. So, organisms that use xylose and convert it to ethanol have been found and are being used.

Ethanol has problems, though – transportation and efficiency of production. Ethanol kills the organisms that produce it.

“Ethanol is for drinking, not for driving” Jay... [missed the last name]

Pathway engineering is going to become an important part of the field, so that organisms will do the things we want them to. [Sort of seems like a shortcut around diversity... I wonder if people will be saying that in 10 more years.]

My Comments: This was a pretty standard talk about cellulosic biofuel/ethanols. I saw similar talks in 2006, so I don't think much has changed since then, but the work goes on. I don't know why it was a keynote, in terms of subject, but definitely was a well done talk!


Oyster genome...

Note to my boss: Shenzhen is sequencing the oyster genome.  See, you should have sent me back to Tahiti to work on the pearl oyster genome! (-;


Jun Wang, Beijing Genomics Institute at Shenzhen - “Sequencing, Sequencing and Sequencing”

Shenzhen is one of the biggest sequencing centres in the world – both in sequencing throughput and in quantity of computing, and such.

With >500Gb per month, what would you do?

The obvious choice is to do whole genomes: from Giant Panda to the tree of life. (Is the panda really a bear?) Formal reason: they eat bamboo, are cute and nice.... and they're cute! OK, the real reasons: they selected an animal "without competition" for sequencing, with a significant "Chinese element", and as a proof of concept that short reads are good enough to assemble a large genome.

Why do we need longer reads? Ten years ago, the question was: can you sequence a genome by shotgun sequencing? Yes... now, can we do it with short reads? Yes, but there are questions:
Read length: the longer the better.
Insert sizes: for finishing, this becomes important.
Depth: determines quality.

Why short reads work: most of the genome is really unique anyhow. Insert size is probably the most important matter.

( Started with a pilot project: cucumber. )

Panda: has 20 chr + X/X. Did inserts from 150 bp to 10,000 bp. 50X sequence coverage, 600X physical coverage.

Genome coverage is 80%; gene coverage 95%. Single-base error rate is Q50 – less than 1 error per 100 kb.
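As an aside, the difference between sequence and physical coverage is easy to sketch: sequence coverage counts bases actually read, while physical coverage counts bases spanned by read pairs (insert and all). The numbers below are my own made-up illustration – the real library mix wasn't given – but they show how 50X sequence coverage can coexist with 600X physical coverage when inserts are long:

```python
# Back-of-envelope: sequence vs. physical coverage.
# All numbers are illustrative assumptions, not the actual panda figures.

genome_size = 2.4e9        # assumed genome size, bp
read_len = 50              # bp per read (2009-era short reads)
avg_insert = 1_200         # assumed average insert across libraries, bp
n_pairs = 1.2e9            # hypothetical number of read pairs

seq_cov = n_pairs * 2 * read_len / genome_size   # bases actually sequenced
phys_cov = n_pairs * avg_insert / genome_size    # bases spanned by pairs

print(round(seq_cov), round(phys_cov))           # 50 600
```

The point is just that long-insert pairs buy you spanning information (for scaffolding and SV detection) far in excess of the raw bases sequenced.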

Gene stats: 27.8k homologous to dog genes.

Evolutionarily closest to dog among sequenced genomes; next closest to cat. (But the panda is a bear.) Its evolutionary rate is slightly higher than dog's. They would like to add significant species to the tree of life.

One of the original questions on what to sequence: “Tastes good, sequence it!” Now, it's close to 50% of the major dinner table! [yikes]

Instead, now proposes cute things: Penguins!

Aiming to sequence “big genomes”, 100Gb+ genomes.

First Asian was sequenced last year.

Is one genome enough? No, probably not... You need 100's to study population genetics. Now taking part in the 1000 Genomes Project. Committed for 3Tb (about 500 individuals).

De Novo assembly is the only solution for a complete structure variation (SV) map. Still too expensive, though.

Started a new project sequencing Asian cancer patients. The cost is about $4000-5000 per sample. [I missed how many per person]

Top 10 causes of death for Asians... start to rank, and decide which to attack.

4P healthcare (personalized medical care) is coming (all based on personal genomics). Picture of FAR too many people on a beach in China.

Already sequenced all major rice cultivars. Found many selective sweeps – lots of new variation?

Also working on Silkworm study... [this is just rapidly turning into a list of projects they've started. Interesting, but nothing much to gain from it.]

DNA methylome: just finished the first Asian version.

Also working on methylation that changes as you climb mountains. [Ok, I just don't really get this one.] High altitude adaptation... [but why is this a priority?]

[At the bottom of the slide it said “Work? Fun? Science?” I'm not really sure if that was any of the above.... strange.]

Also doing Whole Transcriptome. Several species, plants, insects, etc.

You need huge depths (400x) to capture all transcripts expressed at a level of 1 or more, but the required depth decreases from there.
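For the curious, here's a back-of-envelope (mine, not from the talk) on why the rarest transcripts dominate the depth requirement: under simple Poisson sampling, the chance of seeing a transcript at least once saturates only at very large read counts. The abundance figure below is an assumption for illustration:

```python
import math

# If a transcript makes up fraction f of the library, the chance of
# sampling it at least once in N reads is ~ 1 - exp(-N*f) (Poisson).
def p_detect(n_reads, fraction):
    return 1 - math.exp(-n_reads * fraction)

rare = 1e-7   # assumed: a transcript at 1 in 10 million molecules
for n in (10e6, 100e6, 400e6):
    print(f"{n/1e6:.0f}M reads -> P(seen) = {p_detect(n, rare):.2f}")
```

At 10M reads such a transcript is seen only ~63% of the time; it takes hundreds of millions of reads before detection is essentially guaranteed, while abundant transcripts are covered almost immediately.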

Also started a 1000-plant collaboration. Genomics has barely scratched the vast biodiversity on the planet. They are going to start working on this, from algae to flowering plants.

1 Gb of transcript sequencing per sample would be equivalent to 2M ESTs.
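That equivalence checks out if you assume a traditional Sanger EST runs about 500 bp – my assumption, since the talk didn't give the read length:

```python
# Sanity check of the "1 Gb = 2M ESTs" equivalence.
# est_length is an assumed typical Sanger EST read length.
gb_of_sequence = 1e9      # bases of transcriptome sequence
est_length = 500          # assumed bp per traditional EST
equivalent_ests = gb_of_sequence / est_length
print(int(equivalent_ests))   # 2000000
```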

Now doing 75 bp paired-end tag (PET) reads.

Another project: Metagenomics of the Human Intestinal Tract.

“Sequencing is Basic” [eh?]

Question asked: How many people are there? Answer: 1000, over 3 campuses, mostly young university dropouts who work hard and sleep in the lab!

My Comments: It's interesting to know what these guys are doing, but it just seems really random. They may be the biggest, but I wonder where they're going with the technology... It appears to be a technology in search of a project, unlike the rest of the world, which works towards projects and then applies the technology. Maybe someone else can figure out what their underlying goal is and explain it to me. :/