Thanks for visiting my blog - I have now moved to a new location at Nature Networks. Url: http://blogs.nature.com/fejes - Please come visit my blog there.

Monday, March 1, 2010

Link to all my AGBT 2010 Notes.

One last blog for today.

If you're looking for the complete list of my AGBT 2010 notes, look no further. The link below has the full list of talks and workshops I attended. I haven't indexed it, but if you search for "AGBT 2010" within the page, it should take you to the next header/footer in reverse chronological order of the notes I took. Cheers!

AGBT 2010 notes.

AGBT wrap up.

So, everyone else has weighed in with their reviews of AGBT 2010 already, and as usual, I'm probably one of the last to write anything down. Perhaps the extreme carpal tunnel syndrome I've exposed myself to by typing out my notes should suffice as an excuse...

Anyhow, I wanted to put down a few thoughts on what I saw, heard and discussed before I forget what I wanted to say.

First off, I know everyone has commented on the new technologies already. I'm very disappointed that I wasn't able to see the Ion Torrent presentation, and that I missed the presentation from Life Technologies. Those were two of the biggest hits, and I didn't see either of them. While I did get a quick introduction on the Life Technologies platform from a rep in the Life Tech suite, it's not quite the same.

However, I was there for several of the other workshops and launches, and in particular, the Pacific Biosciences workshop. In general, I think Pac Bio has been served up a lot of criticism for failing to disclose the exact error rate of their Single Molecule Real-Time (SMRT) sequencing platform, as well as for some of the problems they face. Personally, I'm not inclined to think of any of that as a failure - simply as engineering problems. When I worked on early 454 data, I saw flaws that were just as disastrous as the challenges that Pac Bio now faces. Much of the criticism is simply directed at the fact that this is measuring single molecules of DNA, and not clusters. Clearly, there will be challenges for them to overcome. The most obvious are that PacBio will have to lower the wattage of their light source, and they'll likely have to do some directed evolution (or even rational design) to lower the frequency at which bases are incorporated too quickly to be read, or possibly come up with a chemistry solution. (More viscous solutions? Who knows.) All of the 2nd generation platforms were launched with problems - and Pac Bio certainly isn't the exception. Each one gets better over time, and I'm certain PacBio will continue to improve. For the moment, they've suggested protocols, like sequencing circular DNA, that dramatically reduce the error rate, so these issues aren't nearly as big as the hype makes them out to be.

Just to finish off on the subject of SMRT sequencing, I don't think Elaine Mardis' presentation on the results obtained with PacBio was outstanding. Normally, I get really jealous about PacBio results, and wish that I could get my hands on some of them - but this time, I was left a little flat. While there are really neat applications for single molecule sequencing, human SNPs really aren't one of them. Why they chose to present that particular problem is somewhat beyond me. Not that the presentation was bad, but it failed to really showcase what the platform can be used for, IMHO. Their other presentations (SMRT Biology, for example) were pretty damn cool.

There has also been much talk about Complete Genomics, and how they're not going to make it, which I've already written up in the previous post. I see that as a failure to understand their business model and to understand who they're competing with (i.e., not the other sequencing companies). I expect that they'll be the microarrays of the future - cheap diagnostic tools, with even better repeatability than your average microarray. I don't think they should be written off quite yet.

Finally, there has been much ado about the HiSeq 2000(tm), released by Illumina. While I have nothing against it (and am even looking forward to it), I don't see it as much more than an upgraded version of their last machine, the GAIIx. They've changed the form factor and the shape of the flow cell, and then enabled some things that were previously disabled (such as two sided tile scanning). It's really just an evolutionary change in a new box, which will allow them more room to grow the platform. Fair enough, really - I don't know how many more upgrades you could put into one of their original boxes, but there's nothing really new here that would have me running after them to get one. I should mention, however, that increased throughput and lower cost ARE significant and a good thing - they just don't appeal to my geeky fascination for new technology.

Another criticism I heard was that these companies shouldn't be calling their tech "3rd generation." Frankly, I've been advocating since last year that they SHOULD be called 3rd generation, so that criticism seems silly, to say the least. Pyrosequencing is clearly synonymous with the 2nd generation of sequencing technologies, while Sanger sequencing is clearly first generation, and hybridization is kind of zero-th generation (although you could make a case for SOLiD being 2nd generation, which would also drag Complete Genomics into that group as well, then). However, the defining characteristic of 3rd generation, to me, is the move away from sequencing ensembles of molecules. An auxiliary definition is that it's also the application of enzymes to do the sequencing itself. So, I'm just going to have to laugh at those who claim that 2nd and 3rd generation are all generically "next-generation" sequencing. There is a clear boundary between the two sets of technologies.

A topic I also wanted to mention was the use of technology at AGBT this year. Frankly, I was blown away by the coverage of all of the events through Twitter. I enjoyed at least one talk where I left Twitter open beside my text editor, and tried to keep notes while listening to what the speaker had to say, while watching the audience's comments. If I hadn't been blogging, I think that would be the best way to engage. Insightful comments and questions were plentiful, and having people I respect discuss the topic was akin to having other scientists leave comments in the margins of a paper you're reading. [Somewhat like reading Sun Tzu's Art of War, where there are more annotations than original material, at some points.] Alas, it was too distracting to compile notes while reading comments, but it was really cool. Unfortunately, Internet coverage was spotty at best, and in some rooms, I wasn't able to get any signal at all. The venue is great, but just not equipped for the 21st century scientist. Had I been there at the end of the conference, I would have suggested that perhaps it's time to identify an alternate venue that can handle the larger crowds, as well as the technological demands of an audience that has 300+ laptop computers going at once. (Don't get me started on electrical outlets.)

I'd like to end on a few good points.

The poster session was excellent - too short, as always, but the quality of the posters was outstanding, and I had fantastic conversations with a lot of scientists. I won't mention them by name, but I'm sure they know who they are. I saw several tools I'll try to follow up on. (By the way, if anyone was looking for me, I spent less than 20 minutes by my poster throughout the conference. There just wasn't enough time to read all of them and still answer questions and absorb everything out there. Sorry about that - feel free to email me if you have questions.)

I should also mention that the vendors were all very hospitable. One of my enduring memories of this year will be Life Technologies allowing the Canadians to crash their suite and use one of their demo TVs to watch the semi-final Olympic hockey game. (Canada vs. Slovakia.) We were desperately outnumbered by non-Canadians, but they tolerated our screaming pretty well. (A few of them even seemed curious about this weird sport played on ice...) And, of course, anyone who saw my tweets knows about PacBio and the Hawaiian shirt, just to name a few examples (-;

So, again, I think AGBT was a great success and I enjoyed it tremendously. Rarely in my life do I get to pack so many talks, discussions and networking into such a short period of time. It may have left me looking somewhat like a deer caught in the headlights, but unquestionably I'm already looking forward to what will be revealed next year.

Sunday, February 28, 2010

Complete Genomics, Revisited (Feb 2010)

While I'm writing up my notes on my way back to Vancouver, I thought I'd include one more set of notes - the ones I took while talking to the Complete Genomics team.

Before launching into my notes (which won't really be in note form), I should give the backstory on how this came to be. Normally, I don't do interviews, and I was very hesitant about doing one this time. In fact, the format came out more like a chat, so I don't mind discussing it - with Complete Genomics' permission.

Going back about a week or so, I received an email from someone working on PR for Complete Genomics, inviting me to come talk with them at AGBT. They were aware of my blog post from last year, written after discussing some aspects of their company with several members of the Complete Genomics team.

I suppose in the world of marketing, any publicity is good publicity, and perhaps they were looking for an update for the blog entry. Either way, I was excited to have an opportunity to speak with them again, and I'm definitely happy to write what I learned. I won't have much to contribute beyond what they've discussed elsewhere, but hey, not everything has to be new, right?

In the world of sequencing, who is Complete Genomics? They're clearly not 2nd generation technology. Frankly, their technology is the dinosaur in the room. While everyone else is working on single molecule sequencing, Complete Genomics is using technology from the stone age of sequencing - and making it work.

Their technology doesn't have any bells and whistles - and in fact, the first time I saw their ideas, I was fairly convinced that it wouldn't be able to compete in the world of the Illuminas and Pac Bios... and all the rest. Actually, I think I was right. What I didn't know at the time was that they don't need to compete. They're clearly in their own niche - and they have the potential to become the 300 pound gorilla.

While they're never going to be the nimble or agile technology developers, they do have a shot at dominating the market they've picked: Low cost, standardized genomics. As long as they stick with this plan - and manage to keep their cost lower than everyone else, they've got a shot... Only time will tell.

A lot of my conversation with Complete Genomics revolved around the status of their technology - what it is that they're offering to their customers. That's old hat, though. You can look through their web page and get all of the information - you'll probably even get more up to date information - so go check it out.

What is important is that their company is based on well developed technology. Nothing that they're doing is bleeding edge, and nothing is going to be a surprise show stopper: of all of the companies doing genomics, they're the only one that can accurately chart the path ahead with clear vision. Pac Bio may never solve their missing base problem, Illumina may never get their reads past 100bp, Life Tech may never solve their dark base problem, and Ion Torrent may never have a viable product. You never know... but Complete Genomics is the least likely to hit a snag in their plans.

That's really the key to their future fate - there are no bottlenecks to scaling up their technology. We'll all watch as they bring down the distance between the spots on their chips, lower the amount of reagent required, and continue to automate their technology. It's not rocket science - it's just engineering. Each time they drop the scale of their technology down, they also drop the cost of the genome. That's clearly the point - low cost.

The other interesting thing about their company is that they've really put an emphasis on automation and value-added services. Their process is one of the more hands-off processes out there. It's an intriguing concept. You FedEx the DNA to them, and you get back a report. Done.

Of course, I have to say that while this may be their strength, it's probably also one of their weaknesses. As a scientist, I don't know that the bioinformatics of the field are well enough developed yet that I trust someone to do everything from alignment to analysis on a sample for me. I've seen aligners come and go so many times in the last 3 years that I really believe that there is value in having the latest modifications.
What you're getting from Complete Genomics is a snapshot of where their technology is at the moment you (figuratively) click the "go" button. Researchers like to play with their data, revisit it, optimize it and squeeze every last drop out of it - something that is not going to be easy with a Complete Genomics dataset. (They aren't sharing their tools.) However, as I said earlier, they're not in the business of competing with the other sequencing companies - so really, they may be able to side step this weakness entirely by just not targeting those people who feel this way about genomic data.

And that also brings me to their second weakness - they are fixated on doing one thing, and doing it well. That's often the sign of a good start-up company: a dogged pursuit of a single goal of excellence in one endeavour. However, in this one case, I disagree with Dr. Drmanac. Providing complete genomes is only part of the picture. In the long run, genomic information will have to be placed in the context of epigenetics, and so I wonder if this is an avenue that they'll be forced to travel in the future. For the moment, Dr. Drmanac insists that this is not something they'll do. If they haven't put any thought into it, when it does become necessary, it's something that will drive customers towards a company that can provide that information. Not all research questions can be solved by gazing into genomic sequences, and that's a reality that could bite them hard.

For the moment, at least, Complete Genomics is well positioned to work well with researchers who don't want to do the lab and bioinformatics tweaking themselves. You can't ask a microbiology lab to give up their PCR machine, and sequencing centres will never drop the 2nd (and now 3rd) generation technology lab to jump on board the 1st generation sequencing provided by Complete Genomics. Despite the few centres that have ordered a few genomes (wow.. I just can't believe I said "a few genomes"), I don't see any of them committing to it in the long run for all of the reasons I've pointed out above.

However, Complete Genomics could take over genomic testing for pharma or hospital diagnostics. Whoever is best able to identify variations (structural or otherwise) in genomes for the lowest cost will be the best bet to do cohort studies for patient stratification - and hey, maybe they'll be the back end for the next 23andMe.

So, to conclude, Complete Genomics has impressed me with their business model, and they have come to know themselves well. I'll never understand why they think AGBT is the right conference to showcase their company, when it's not likely to yield that many customers in the long run. But, I'm glad I've had the chance to watch them grow. Although they may be a dinosaur in the technology race, the T-Rex is still a fearsome beast, and I'd hate to meet one in a dark alley.

Saturday, February 27, 2010

AGBT 2010 - Illumina Workshop

[I took these notes on a scrap of paper, when my laptop was starting to run low on batteries. They're less complete than most of the other talks I've taken notes on, but should still give the gist of the talks. Besides, now that I'm at the airport, it's nice to be able to lose a few pieces of scrap paper.]


Introducing the HiSeq 2000(tm)
- redefining the trajectory of sequencing

First presentation:
- Jared from Marketing


Overview of machine.
- real data of Genome and transcriptome
- more than 2 billion base pairs per run
- more than 25Gb per day
- uses line scanning (scan in rows, like a photocopier, instead of a whole picture at once, like a camera)
- now uses "dual surface engineering": image both the top and bottom surface, which means you have twice as much area to form clusters
- Machine holds two individual flow cells
- flow cells are held in by a vacuum
- simple insertion - just toggle a switch through three positions - an LED lights up when you've turned it on.
- preconfigured reagents - bottles all stacked together: just push in the rack
- touch screen user interface
- "wizard" like set up for runs
- realtime metrics available on interface - even an iPod app (available for iPad too..)
- multimedia help will walk you through things you may not understand.
- major focus on ease of use
- it has the "simplest workflow" of any of the sequencing machines available
- tile size reduced [that's what I wrote but I seem to recall him saying that the number of tiles is smaller, but the tiles themselves are larger?]
- 1 run can now do a 30x coverage for a cancer and a normal (one in each flow cell.)
- 2 methylomes can be done in a week
- you could do 20 RNA-Seq experiments in 4 days.
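
As a rough sanity check on the "30x for a cancer and a normal in one run" claim, here is a small sketch of the arithmetic. The 3.1 Gb genome size is an assumption (it is not a number from the talk), and the function name is purely illustrative.

```python
# Back-of-the-envelope yield needed for whole-genome coverage.
# ASSUMPTION: haploid human genome of ~3.1 Gb (not a figure from the talk).
ASSUMED_GENOME_SIZE_GB = 3.1


def yield_needed_gb(coverage, genomes=1, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Gigabases of aligned sequence needed to cover `genomes` genomes at `coverage`x."""
    return coverage * genome_size_gb * genomes


one_genome = yield_needed_gb(30)
tumour_normal_pair = yield_needed_gb(30, genomes=2)
print(f"30x of one genome:      ~{one_genome:.0f} Gb")
print(f"30x of cancer + normal: ~{tumour_normal_pair:.0f} Gb")
```

Under that assumption, the pair needs somewhere around 186 Gb of aligned sequence, which sits inside the 300Gb/run figure quoted later in the workshop. Real runs lose some yield to filtering and unmapped reads, which this sketch ignores.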

Next up:
David Bentley

Major points:
- error rates and feel of data are similar if not identical to the GAIIx.
- from a small sampling of experiments shown it looks like error rate is very slightly higher
- Demonstrated 300Gb/run, more than 25Gb per day at release
- PET 2x100 supported.
- Software is same for GAII [Although somewhere in the presentation, I heard that they are working on a new version of the pipeline (v 1.6?)... no details on it, tho.]

Next up:
Elliott Margulies, NHGRI/NIH Sequencing
- talking about projects today for the undiagnosed disease program

work flow
- basically same as in his earlier talk [notes are already posted.]
- use cross match to do realignment of reads that don't map first time
- use MPG scores

[In a technology talk, I didn't want to take notes on the experiment itself... the main points are on the HiSeq data.]

Data set: concordance with SNP Chips was in the range of 98% for each flow cell, 99% when both are combined (72x coverage)

Impressions:
- Speed: Increased throughput
- more focus on biology rather than on tweaking pipelines and bioinformatic processing. (eg, biological analysis takes front seat.)


Next Up:
Gary Schroth

Working on a project for Body Map 2.0 : Total human transcriptome
- 16 tissues, each PET 2x50bp, 1x75bp
Cost:
- $8,900 for 1x50bp
- multiplexing will reduce cost further.
- if you only need 7M reads, you could multiplex 192 samples (on both cells, I assume), and the cost would be $46. (including sequencing, not sample prep.)
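
The $46 figure appears to be the run cost divided evenly across the multiplexed samples. A minimal sketch of that arithmetic, assuming the $8,900 quoted above is the cost of the whole run (the notes don't state this explicitly) and using hypothetical function names:

```python
# Per-sample sequencing cost when multiplexing, excluding sample prep.
# ASSUMPTION: the $8,900 figure is the cost of a full run shared by all samples.

def cost_per_sample(run_cost_usd, n_samples):
    """Split the run cost evenly across multiplexed samples."""
    return run_cost_usd / n_samples


def total_reads_needed(n_samples, reads_per_sample):
    """Total reads the run must deliver to give every sample its share."""
    return n_samples * reads_per_sample


print(f"${cost_per_sample(8900, 192):.2f} per sample")
print(f"{total_reads_needed(192, 7_000_000):,} reads needed in total")
```

This only covers sequencing; as noted, sample prep and everything else in the process would come on top.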

[which just makes the whole cost equation that much more vague in my mind... Wouldn't it be nice to know how much it costs to do the whole process?]

[Many examples of how RNA-seq looks on HiSeq 2000 (tm)]

Summary:
- output has 5 billion reads, 300Gb of data.

Next up:
David Bentley

Present a graph
- amount of sequence per run.
- looks like a "hockey stick graph"

[Shouldn't it be sequence per machine per day? It'd still look good - and wouldn't totally shortchange the work done on the human genome project. This is really a bad graph.... at least put it on a log scale.]

In the past 5 years:
- 10^4 scale in throughput
- 10^7 scale up in parallelizations
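
To put the 10^4 figure in terms that would sit nicely on a log-scale plot, it can be converted to a doubling time. This is only a sketch: it assumes smooth exponential growth over the five years, which the talk did not claim, and the function name is made up for illustration.

```python
import math

# Convert an overall fold-increase over a period into a doubling time.
# ASSUMPTION: growth was smoothly exponential across the whole period.

def doubling_time_months(fold_increase, years):
    """Months needed for throughput to double, given total fold increase over `years`."""
    doublings = math.log2(fold_increase)
    return years * 12 / doublings


print(f"{doubling_time_months(1e4, 5):.1f} months per doubling")
```

On a log scale, that kind of growth is a straight line rather than a hockey stick, which is why the choice of axis matters so much for this graph.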

Buzzwords about the future of the technology:
- "Democratizing sequencing"
- "putting it to work"

AGBT 2010 - Complete Genomics Workshop

Complete Genomics CEO:

Mission:
- sequence only human genomes - 1 Million genomes in the next 5 years
- build out tools to gain a good understanding of the human genome
- done 50 genomes last year
- Recent Science publication
- expect to do 500 genomes/month

Lots of Customers.
- Deep projects

Technology
- don't waste pixels
- use ligases to read
- very high quality reads - low cost reagents
- provide all bioinformatics to customers

Business
- don't sell technology, just results.
- just return all the processed calls (snps, snv, sv, etc)
- more efficient to outsource the "engineering" for groups who just want to do biology
- fedex sample, get back results.
- high throughput "on demand" sequencing
- 10 centres around the world
- Sequence 1 Million genomes to "break the back" of the research problem

Value add
- they do the bioinformatics

Waves:
- first wave: understand functional genomics
- second wave: pharmaceutical - patient stratification
- third wave: personal genomics - use that for treatment

Focus on research community

Two customers to present results:
First Customer:

Jared Roach, Senior Research Scientist, Institute for Systems Biology (Rare Genetic disease study)

Miller Syndrome
- studied coverage in four genomes
- 85-92% of genome
- 96% coverage in at least one individual
- Excellent coverage in unique regions.

Breakpoint resolution
- within 25bp, and some places down to 10bp
- identified 125 breakpoints
- 90/125 occur at hotspots
- can reconstruct breakpoints in the family

Since they have twins, they can do some nice tests
- infer error rate: 1x10^-5
- excluded regions with compression blocks (error goes up to 1.1x10^-5)
- Homozygous only: 8.0x10^-6 (greater than 90% of genome)
- Heterozygous only: 1.7x10^-4

[Discussion of genes found - no names, so there's no point in taking notes. They claim they get results that make sense.]

[Time's up - on to next speaker.]

Second Customer:
Zemin Zhang, Senior Scientist, Genentech/Roche (Lung Cancer Study)

Cancer and Mutations
[Skipping overview of what cancer is.... I think that's been well covered elsewhere.]

Objective:
- lung cancer is the leading cause of cancer related mortality worldwide...
- significant unmet need for treatment

Start with one patient
- non small cell lung adenocarcinoma.
- 25 cigarettes/day
- tumour: 95% cancer cells

Genomic characterization on Affy and Agilent arrays
- lots of CNV and LOH
- circos diagrams!


- 131Gb mapped sequence in normal, 171Gb mapped seq in tumour
- 46x coverage normal, 60x tumour
[Skipping some info on coverage...]
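
The coverage numbers follow from dividing mapped sequence by the size of the genome being covered. A quick sketch, assuming an effective (mappable) genome size of about 2.85 Gb; that value is back-calculated from the figures above rather than taken from the talk:

```python
# Mean coverage = mapped bases / genome size.
# ASSUMPTION: ~2.85 Gb effective genome size, inferred from 131 Gb -> 46x.
ASSUMED_EFFECTIVE_GENOME_GB = 2.85


def mean_coverage(mapped_gb, genome_gb=ASSUMED_EFFECTIVE_GENOME_GB):
    """Average fold-coverage for a given amount of mapped sequence."""
    return mapped_gb / genome_gb


for label, mapped in (("normal", 131), ("tumour", 171)):
    print(f"{label}: {mean_coverage(mapped):.0f}x")
```

Both samples land on the reported values with the same assumed genome size, so the two pairs of numbers are at least internally consistent.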

KRAS G12C mutation

what about rest of 2.7M SNVs?
- SomaticScore predicts SNV validation rates
- 67% are somatic by prediction
- more than 50,000 somatic SNV are projected

Selection and bias observed in the lung cancer genome by comparing somatic and germline mutations

GC to TA changes: Tobacco-associated DNA damage signature

Protection against mutations in coding and promoter regions.
- look at coding regions only - mutations are dramatically less than expected - there is probably strong selection pressure and/or repair

Fewer mutations in expressed genes.
- expressed genes have fewer mutations, and even fewer on the transcribed strand
- non-expressed genes have mutation rate similar to non-genic regions

Positive selection in subsets of genes
- KRAS is the only previously known mutation
- Genes also mutated in other lung cancers...
- etc

Finding structural variation by paired end reads
- median dist between pairs 300bp.
- distance almost never goes beyond 1kb.

Look for clusters of sequence reads where one arm is on a different chromosome or more than 1kb away
- small number of reads
- 23 inter-chr
- 56 intra-chr
- use FISH + PCR
- validate results
- 43/65 test cases are found to be somatic and have nucleotide level breakpoint junctions
- chr 4 to 9 translocation
- 50% of cells showed this fusion (FISH)
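
The first step of this kind of analysis, flagging read pairs whose two arms land on different chromosomes or more than 1kb apart, can be sketched as below. This is an illustration only, not the pipeline used in the study: the `ReadPair` record and function names are hypothetical, and the clustering of discordant pairs (and all validation) is left out.

```python
from collections import Counter, namedtuple

# Hypothetical minimal record for a mapped read pair (one position per arm).
ReadPair = namedtuple("ReadPair", "chrom1 pos1 chrom2 pos2")

MAX_SPAN_BP = 1000  # pairs "almost never" span more than 1kb, per the notes


def classify_pair(pair, max_span=MAX_SPAN_BP):
    """Label a pair as inter-chr, intra-chr (too far apart) or concordant."""
    if pair.chrom1 != pair.chrom2:
        return "inter-chr"
    if abs(pair.pos2 - pair.pos1) > max_span:
        return "intra-chr"
    return "concordant"


def count_classes(pairs):
    """Tally how many pairs fall into each class."""
    return Counter(classify_pair(p) for p in pairs)


pairs = [
    ReadPair("chr4", 1_000, "chr9", 50_000),    # arms on different chromosomes
    ReadPair("chr15", 2_000, "chr15", 90_000),  # same chromosome, far apart
    ReadPair("chr1", 5_000, "chr1", 5_300),     # normal ~300bp spacing
]
print(count_classes(pairs))
```

A real caller would then look for clusters of discordant pairs supporting the same junction before sending anything for FISH or PCR validation.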

Possible scenario of Chr15 inversion and deletion investigated.
[got distracted, missed point.. oops.]

Genomic landscape:
- very nice Circos diagram
- > 1 mutation for every 3 cigarettes
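
The cigarette figure can be roughly reproduced from numbers elsewhere in these notes (25 cigarettes/day, more than 50,000 projected somatic SNVs). The smoking duration is not recorded here, so the 15 years below is purely an assumed placeholder, and the function names are illustrative.

```python
# Rough "cigarettes per mutation" estimate.
# ASSUMPTION: 15 years of smoking; the duration is NOT in these notes.
ASSUMED_YEARS_SMOKING = 15


def cigarettes_smoked(per_day, years):
    """Total cigarettes over the smoking history (ignoring leap days)."""
    return per_day * 365 * years


def cigarettes_per_mutation(per_day, years, somatic_mutations):
    """How many cigarettes were smoked for each projected somatic mutation."""
    return cigarettes_smoked(per_day, years) / somatic_mutations


ratio = cigarettes_per_mutation(25, ASSUMED_YEARS_SMOKING, 50_000)
print(f"~{ratio:.1f} cigarettes per somatic mutation")
```

With that placeholder the ratio comes out a little under 3 cigarettes per mutation, in line with the "more than 1 mutation for every 3 cigarettes" summary; a different smoking history would shift it proportionally.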

In the process of doing more work with Complete Genomics

AGBT 2010 - Yardena Samuels - NHGRI

Mutational Analysis of the Melanoma Genome

Histological progression of Melanocyte Transformation
- too much detail to copy down

Goals:
- mutational analysis of signal transduction gene families in genome
- evaluate most highly mutated gene family members
- translational

Somatic mutation analysis.
- matched tumor normal
- make cell lines

Tumor Bank establishment
- 100 tumor normal samples
- also have original OCT blocks
- have clinical information
- do SNP detection for matching normal/tumor
- 75% of cells are cancer
- look for highly mutated oncogenes

Start looking for somatic mutations
- looking at TK family (kinome)
- known to be frequently mutated by cancer

Sanger did this in the past, but only did 6 melanomas
- two phases: discovery, validation
- started with 29 samples - all kinase domains
- looked for somatic mutations
- move on to sequence all domains...

- 99 NS mutations
- 19 genes

[She's talking fast, and running through the slides fast! I can't keep up no matter how fast I type.]

Somatic mutations in ERBB4 - 19% in total
- one alteration was known in lung cancer

[Pathway diagram - running through the members VERY quickly] (Hynes and Lane, Nature Reviews)

Which mutation to investigate? Able to use crystal structure to identify location of mutations. Select for the ones that were previously found in EGFR1 and (something else?)

Picked 7 mutations, cloned and over-expressed - basic biochemistry followed.

[Insert westerns here - Prickett et al., Nature Genetics 41, 2009]

ERBB4 mutations have increased basal activity - also seen in melanoma cells

Mutant ERBB4 promotes NIH3T3 Transformation

Expression of Mutant ERBB4 Provides an Essential cell Survival Signal in Melanoma
- oncogene addiction

Is this a good target in the clinic?
- used lapatinib.
- showed that it also works here in melanoma. Mutant ERBB4 sensitizes cells to lapatinib
- mechanism is apoptosis
- it does not kill 100% of cells - may be necessary to combine it with other drugs.

conclusions
- ERBB4 is mutated in 19% of melanomas
- reiterate points
- new oncogene in melanoma
- can use lapatinib
[only got 4 of the 8 or 9]

Future studies
- maybe use in clinics - trying a clinical trial.
- will isolate tumor DNA w ICM
... test several hypotheses.
- sensitivity to lapatinib

What else should be sequenced? not taking into account whole genome sequencing.
- look at crosstalk to get good targets
- List of targets. (mainly transduction genes)

Want to look at other cancers, where whole exome was done.
- revealed: few gene alterations in majority of cancers. Limited number of signalling pathways. Pathway oriented models will work better than gene oriented models.

[chart that looks like the London subway system... have no idea what it was.]

Personalized Medicine
- their next goal.

[great talk - way too fast, and is cool, but no NGS tie in. Seems odd that she's picking targets this way - WGSS would make sense, and narrow things down faster.]

AGBT 2010 - Joseph Puglisi - Stanford University School of Medicine

The Molecular Choreography of Translation

Questions have remained the same, despite recent advances - we still want to understand how the molecular machines work. We always have snapshots that capture the element of motion, but we want animation, not snapshots.

Translation
- Converting nucleotides to amino acids.
- ribosome 1-20 aa/s
- 1/10^4 errors
- very complex process (tons of protein factors, etc, required for the process)
- requires micro-molar concentrations of each component
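
To get a feel for what those rates mean for a single protein, here is a small sketch. The 300 amino acid protein length is a made-up example value, errors are assumed to be independent at each position, and the function names are illustrative only.

```python
# What 1-20 aa/s and a 1-in-10^4 error rate mean for one protein.
# ASSUMPTIONS: a hypothetical 300 aa protein; independent per-residue errors.
ERROR_RATE = 1e-4
EXAMPLE_LENGTH_AA = 300  # illustrative value, not from the talk


def error_free_fraction(length_aa, error_rate=ERROR_RATE):
    """Probability that every residue in the chain is correct."""
    return (1 - error_rate) ** length_aa


def synthesis_time_s(length_aa, rate_aa_per_s):
    """Seconds to translate the chain at a constant elongation rate."""
    return length_aa / rate_aa_per_s


print(f"error-free chains: {error_free_fraction(EXAMPLE_LENGTH_AA):.1%}")
print(f"time at 1 aa/s:  {synthesis_time_s(EXAMPLE_LENGTH_AA, 1):.0f} s")
print(f"time at 20 aa/s: {synthesis_time_s(EXAMPLE_LENGTH_AA, 20):.0f} s")
```

Seconds-per-residue timescales like these are part of why translation is a plausible target for real-time observation.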

Ribosome
- we now know the structure of the ribosome
- nobel prize given for it.
- 2 subunits. (50S & 30S)
- 3 sites, E, P & A
- image 3 trna's to a ribosome - in the 3 sites...
- all our shots are static - not animated
- The Ribosome selects tRNA for Catalysis - must be correct, and incorrect must be rapidly rejected
- EFTu involved in rejection

[Walking us through how ribosomes work - there are better sources for this on the web, so I'm not going to copy it.]

Basic questions:
- timing of factor
- initiation pathway
- origins of translational fidelity
- mechanisms

Look at it as a highly dynamic process
- flux of tRNAs
- movements of the ribosome (internal and external)
- much slower than photosynthesis, so easier to observe.

Can we track this process in real time?
- Try: Label the ligand involved in translation.
- Problem: solution averaging destroys signal (many copies of ribosome get out of sync FAST.) would require single molecule monitoring
- Solution: immobilization of single molecule - also allows us to watch for a long time

Single molecule real time translation
- Functional fluorescent labeling of tRNAs ribosomes and factors
- surface immobilization retains function.
- observation of translation at micromolar conc. fluorescent components
- instrumentation required to resolve multiple colors
- yes, it does work.
- you can tether with biotin-streptavidin, instead of fixing to surface
- immobilization does not modify kinetics

Tried this before talking to Pac Bio - It was a disaster. Worst experiments they'd ever tried.

Solution:
- use PacBio ZMW to do this experiment.
- has multiple colour resolution required
- 10ms time resolution

Can you put a 20nm ribosome into a 120nm hole? Use biotin tethering - Yes

Can consecutive tRNA binding be observed in real time? Yes

Fluorescence doesn't leave after... they overlap because the labeled tRNA must transit through the ribosome.
- at low nanomolar concentrations, you can see the signals move through individual ribosomes
- works at higher conc.
- if you leave EF-G out, you get binding, but no transit - then photobleaching.
- demonstrate Lys-tRNA
- 3 labeled dyes (M, F, K)... you can see it work.
- timing isn't always the same (pulse length)
- missing stop codon - so you see really long stall with labeled dye... and then sampling, as other tRNAs try to fit.
- you can also sequence as you code. [neat]

Decreased tRNA transit time at higher EF-G concentrations
- if you translocate faster, pulses are faster
- you can titrate to get the speed you'd like.
- translation is slowest for first couple of codons, but then speeds up. This may have to do with settling the reading frame? Much work to do here.

Ribosome is a target for antibiotics
- eg. erythromycin
- peptides exit through a channel in the 50S subunit.
- macrolide antibiotics block this channel by binding inside at narrowest point.
- They kill peptide chains at 6 bases. Are able to demonstrate this using the system.

Which model of tRNA dissociation during translation is correct
- tRNA arrival dependent model
- Translocation dependent model

Post synchronization of number of tRNA occupancy
- "remix our data"
- data can then be set up to synchronize an activity - eg, the 2nd binding.
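
The idea of post-synchronization, as I understand it from the bullets above, is to re-zero every single-molecule trace on a chosen event so the traces can be compared or averaged. A minimal sketch, assuming each trace has already been reduced to a sorted list of event times; the data layout and function name are hypothetical and not from the talk:

```python
# Post-synchronize single-molecule traces on a chosen event.
# ASSUMPTION: each trace is a sorted list of event times in seconds
# (e.g. successive tRNA arrivals on one ribosome).

def post_synchronize(traces, event_index):
    """Shift each trace so its `event_index`-th event (0-based) sits at t = 0.

    Traces with too few events to contain that event are skipped.
    """
    aligned = []
    for times in traces:
        if len(times) <= event_index:
            continue  # this molecule never reached the chosen event
        t0 = times[event_index]
        aligned.append([t - t0 for t in times])
    return aligned


traces = [
    [1.0, 3.0, 4.5],  # molecule A: three binding events
    [2.0, 2.5],       # molecule B: two binding events
    [0.5],            # molecule C: never reached a 2nd binding
]
# Synchronize everything on the 2nd binding (index 1).
print(post_synchronize(traces, 1))
```

Once the traces share a common zero, unsynchronized molecules can be overlaid to ask what typically happens just before or after that event.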

Fusidic acid allows the translocation but blocks arrival of subsequent tRNA to A site.
- has no effect on departure rate of tRNA.

only ever 2 trnas at once on Ribosome. - it can happen, but not normally

Translocation dependent model is correct.

Correlating ribosome and tRNA dynamics
- towards true molecular movies
- label tRNAs... monitor fluctuation and movement

Translational processes are highly regulated
- regulation of initiation (5' and 3' UTR)
- endpoint in signalling pathways (mTOR, PKR)
- programmed changes in reading frames (frameshifts)
- control of translation mode (IRES, normal)
- target of therapeutics (PTC124 [ribosome doesn't respect stop codons] and antibiotics)

Summary:
- directly track in real time
- tRNAs dissociate from the E site post translocation and no correlation...

Paper is in Nature today.



Flourescence doesn't leave after... they overlap because the labeled tRNA must transit through the ribosome.
- at low nanomolar sigals, you can see the signals move through individual
- works at higher conc.
- if you leave EF-G out, you get binding, but no transit - then photobleaching.
- demonstrate Lys-tRNA
- 3 three labeled dyes (M, F, K)... you can see it work.
- timing isn't always the same (pulse length)
-missing stop coding - so you see really long stall with labeled dye... and then sampling, as other tRNAs try to fit.
- you can also sequence as you code. [neat]

Decreased tRNA transit time at higher EF-G concentrations
- if you translocate faster, pulses are faster
- you can titrate to get the speed you'd like.
- translation is slowest for first couple of codons, but then speeds up. This may have to do with settling the reading frame? Much work to do here.

Ribosome is a target for antibiotics
- eg. erythromycin
- peptides exit through a channel in the 50S subunit.
- macrolide antibiotics block this channel by binding inside at narrowest point.
- They kill peptide chains at 6 bases. Are able to demonstrate this using the system.

Which model of tRNA dissociation during translation is correct
- tRNA arrival dependent model
- Translocate dependent model

Post syncrhonization of number of tRNA occupancy
- "remix our data"
- data can then be set up to synchronize an activity - eg, the 2nd binding.

Fusidic acid allows the translocation but blocks arrival of subsequent tRNA to A site.
- has no effect on departure rate of tRNA.

only ever 2 trnas at once on Ribosome. - it can happen, but not normally

Translocation dependent model is correct.

Correlating ribosome and tRNA dynamics
- towards true molecular movies
- label tRNAs... monitor fluctuation and movement

Translational processes are highly regulated
- regulation of initiation (51 and 3` UTR)
- endpoint in signallig pathways (mTOR, PKR)
- programmed changes in reading frames (frameshifts)
- control of translation mode (IRES, nromal)
- target of therapeutics (PTC124 [ribosome doesn't respect stop codons] and antibiotics)

Summary:
- directly track in real time
- tRNAs dissociate from the E site post translocation and no correlation...

Paper is in Nature today.

Labels:

AGBT 2010 - Bing Ren - UCSD

Epigenomic Landscapes of Pluripotent and Lineage-Committed Human Cells

Sequencing of the human genome has led to
* identification of disease causing genes
* Personalized medicine
* advanced sequencing technologies
* Foundation for understanding the construction of human beings

But DNA is only half the story
* variations in DNA alone do not account for all variations in phenotypic traits
* organisms with identical DNA often exhibit distinct phenotypes (eg plants, insects, mammals)
* Epigenetic changes contribute to human diseases, phenotypes, etc

We know about the mechanisms
* DNA is wrapped around histone proteins which can be modified
* DNA is itself modified (methylation)

[paraphrased] DNA is hardware, epigenome is the software (Duke university quote... missed author's name)

Challenges
* very complex
* varies among different cell types
* generally reprogrammed during the life cycle of an organism
* Epigenome is also affected by environmental cues

How do we decipher the "epigenetic code"?
* systematic approach
* large scale profiling of chromatin modification
* finding common modifications
* validation

Profiling:
* ChIP-Seq based. (started with Tiling arrays)
* use antibodies that recognize chromatin modification.

Vignette:
[beautiful pictures]
* Chromatin signature for the promoter and gene body
* H3K4me3 marks active promoters
* H3K36me3 marks gene body of active genes
* Signature has led to identification of thousands of long non-coding RNA genes.

Chromatin signatures of enhancers
* Can use information about modifications to model patterns
* predict enhancers in the human genome.
* 36,589 enhancer predictions were made
* 56% found in intergenic regions
* test a few with reporter assays - show that 80% of predicted enhancers do drive reporter genes. (Far fewer of the control sequences do - missed number)

Finding chromatin modification patterns in the genome de novo
(Hon et al, PLoS Comp Bio 2009)
* 16 different patterns of chromatin modification
* some are enhancers,
* others have no associations
* one has pattern highly enriched for exons.. regulates alt splicing.

Summary
* chromatin modification patterns could be used to annotate ...
* Epigenome Roadmap project (Generate reference epigenome maps for a large number of primary human cells and tissues)

Datasets are available at GEO. (NCBI)

Mapping of DNA methylation and 53 histone modifications in human cells
* Human embryonic stem cells (H1)
* Fetal fibroblast cell line

Method for mapping DNA methylation
* Ryan Lister and Joe Ecker (Salk)
* sodium bisulfite (C to U), if not methylated
* Must do deep sequencing. If using HiSeq - could do it in 10 days. Used to take 20 runs
* Methylation status for more than 94% of cytosines determined.
* 75.5% in H1, 99.98% in Fibroblast
* DNA methylation is depleted from functional sequences
* non-CpG methylation is enriched in gene body of transcribed genes suggesting link to the transcription process
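
The bisulfite idea above can be sketched in a few lines. This is only a toy illustration, not the Lister/Ecker pipeline: it assumes a single strand, a read already aligned base-for-base to the reference, and no sequencing errors. The function names and the example sequence are made up for illustration.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment: unmethylated C becomes U, which reads as T."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """For every C in the reference, report True if the read still shows a C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base == "C":
            calls[i] = (read_base == "C")
    return calls


# Hypothetical example: only the C at index 1 is methylated.
reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
print(read)
print(calls)
```

In real data each cytosine is covered by many reads, so the call becomes a fraction of reads showing C rather than a single True/False, which is why the deep sequencing mentioned above is needed.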

11 chromatin modification marks
* comparing cells: different results
* K9me3 and K27me3 become dramatically extended (7% in ES to more than 30% in fibroblast.)
* genes with above marks are highly enriched in developmental genes.

Reduction of repressive chromatins in induced pluripotent cells

Repressive chromatin domains occupy small fraction of genome which is maintained as open structure in stem cells

Repressive chromatin domains occupy large fraction of genome, keeping genes involved in development silenced in differentiated cells.

Summary:
* widespread difference in epigenomes of ES and fibroblasts
* stem cells are characterized by abundant non-CpG methylation
* Expansion of repressive domains may be a key characteristic of cellular differentiation
* [Missed 2]

AGBT 2010 - Jesse Gray - Harvard Medical School

Widespread RNA Polymerase II Recruitment and Transcription at Enhancers During Stimulus-Dependent Gene Expression

Mammalian brain is [paraphrased] Awesome technology
* Sensory experience shapes brain wiring via neuronal activation
* Whiskers compete for real estate in somatosensory cortex.
* Brain can re-wire to adapt to environment
* Transcriptional changes in nucleus as brain cells reprogram
* (Discussion in terms of real-estate for rat whisker areas of brain.)

Neuronal activation affects circuit function by altering gene expression
* Activity dependent gene expression

Cascade
* Ca++ influx
* kinases & phosphatases
* CREB + SRF TFs
* recruit Creb binding protein
* Induce about 50-100x expression in genes (eg, fos)
* Can we do genome wide approaches to understand what's being expressed?

An experimental system for genome-wide analysis of activity-regulated gene expression
* grow in dish
* depolarize with KCl
* do ChIP-seq and RNA-seq

CBP and transcription factor binding at fos locus
* see CBP binding at conserved region upstream, as well as promoter for fos gene
* also see NPAS4 CREB and SRF with similar (but not identical) binding sites

Is the activity-dependent binding of CBP restricted to the locus or genome wide?
* compare CBP peaks in both conditions
* binding appears limited to KCl stimulated only.

Are CBP-bound sites enhancers or promoters or both?
* Promoters don't necessarily drive transcription
* Promoters have H3K4Me3 histone modifications (enhancers don't)
* 3d configuration to bring enhancers together with promoters.

Most CBP peaks are not at TSSs and do not show H3K4Me3
* 5,079 at TSSs
* 36,069 not at TSSs

Align all seq that are enhancers
* there is much H3K4Me1 (clear pattern)
* there is not much H3K4Me3

Use known site
* upstream from Arc - used to build a construct

CBP and H3K4Me1-marked loci function as activity-dependent transcriptional enhancers.
* Found 8 enhancers

Summarize:
* about 20,000 CBP sites that are activity-regulated enhancers
* do not correspond to annotated start sites
* H3K4Me1 modified
* lack H3K4Me3 mark
* do not initiate long RNAs
* confer activity-regulation on the Arc promoter

Questions about activity-regulated enhancers
* do they play a role in binding RNA Polymerase II?
* Evidence is tending towards saying that most enhancers do not seem to have RNAPII binding.

fos enhancers bind RNAPII
* use chip for RNAPII and CBP
* 10-20% of sites have RNAPII at enhancer
* potential artifact - crosslinking conditions may exaggerate this by tying promoters and enhancers.

Does RNAPII at enhancers synthesize RNA?
* Enhancers at the fos locus produce enhancer RNAs
* non-polyadenylated RNA? Yes.
* you do get some transcription at enhancers... [doesn't this start to describe lincRNA?]

Enhancer transcription is correlated with promoter transcription.

The Arc enhancer can be activated without the presence of the Arc promoter
* increases in polymerase binding at enhancer even when promoter is gone.
* preliminary - but may not be transcription when the promoter is gone.
* what is the function of eRNA transcription? (don't know the answer yet)
* Could be that it helps to lay down epigenetic marks.

AGBT 2010 - Keynote: Henry Erlich - Roche Molecular Systems

Applications of Next Generation Sequencing: HLA Typing With the GSFLX System

High Throughput HLA typing
* the allelic diversity is enormous
* Focussing on HLA class I and II genes (germ-line)

Challenging because it's the most polymorphic region in the genome
* HLA-B has well over 1000 alleles
* only 68 different serological types can be distinguished
* 3,529 alleles at 12 loci as of April 2009
* chromosome 6
* Can't be typed using existing conventional techniques [I assume in high throughput]
* DR-DQ region - involved in type I diabetes
[Much detail here, which I can't get down fast enough with any hope at accuracy.]

Polymorphism is highly localized.
* virtually all of the polymorphic amino acid residues are localized to a groove.
* most allelic differences are protein coding.
* critical to distinguish known alleles

Nomenclature
* eg HLA-A * 24020101
* only the first 4 numbers are the ones that distinguish the protein.
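
To make the nomenclature point concrete, here is a minimal sketch that splits an allele name of the eight-digit form shown above and compares two alleles at the protein level. It relies only on what the talk stated (the first 4 digits distinguish the protein). The function names and the second and third allele names in the example are hypothetical.

```python
def split_hla_allele(name):
    """Split a name like 'HLA-A * 24020101' into locus and digit fields."""
    locus, digits = name.replace(" ", "").split("*")
    return {
        "locus": locus,
        "protein_fields": digits[:4],  # the digits that distinguish the protein
        "other_fields": digits[4:],    # remaining digits, not protein-distinguishing
    }


def same_protein(allele_a, allele_b):
    """True if two allele names agree on the locus and the first 4 digits."""
    a, b = split_hla_allele(allele_a), split_hla_allele(allele_b)
    return a["locus"] == b["locus"] and a["protein_fields"] == b["protein_fields"]


parsed = split_hla_allele("HLA-A * 24020101")
print(parsed)
# The two comparison names below are made up for illustration.
print(same_protein("HLA-A*24020101", "HLA-A*24020102"))
print(same_protein("HLA-A*24020101", "HLA-A*24030101"))
```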

Survival curve for bone marrow transplant
* even with 8/8 allele matches, there are WAY more things that need to be matched - and so you need the best possible match.
* a single coding mismatch can cause graft vs host disease.
* Bone Marrow matching requires high precision

[List of disease applications - 22 different diseases including Narcolepsy, cancers, drug allergic reactions..]

GWAS in Type 1 diabetes.
* identified disease related genes - HLA SNPs are significant
* Dr-DQ haplotypes are associated strongly with Odds ratio for diabetes
* looking at genomic risk factors increase up to 40x

[something about a particular combination of DR-DQ giving VERY high risk, and consequently is never seen in humans...]

Forensics
* Dot blots... evolved into Probe Array Typing System.
* Even if you have hundreds of probes, you still have "HLA Genotype Ambiguity"
* "Fail to distinguish alleles" without NGS (with or without phasing..)

[Explanation of how 454 works - protocol]

Approach
* amplify exons with MID primers/emPCR/sequence
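
The MID (multiplex identifier) step is what lets many samples share one run: each sample's amplicons carry a short tag, and reads are sorted back to samples afterwards. Below is a rough sketch of that sorting step, assuming every tag has the same length, sits at the very start of the read, and is matched exactly. The tag sequences and sample names are invented, and this is not the vendor software.

```python
def demultiplex(reads, mid_to_sample):
    """Assign reads to samples by their leading MID tag and strip the tag."""
    mid_len = len(next(iter(mid_to_sample)))  # assumes all tags share one length
    bins = {sample: [] for sample in mid_to_sample.values()}
    unassigned = []
    for read in reads:
        sample = mid_to_sample.get(read[:mid_len])
        if sample is None:
            unassigned.append(read)  # no exact tag match
        else:
            bins[sample].append(read[mid_len:])
    return bins, unassigned


# Hypothetical tags and reads.
tags = {"ACGT": "sample1", "TGCA": "sample2"}
bins, unassigned = demultiplex(["ACGTGGGG", "TGCACCCC", "NNNNAAAA"], tags)
print(bins)
print(unassigned)
```

A production tool would also tolerate a sequencing error or two in the tag; exact matching is used here only to keep the sketch short.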

Benefits of clonal sequencing
* set phase to reduce ambiguity
* allow amplification and sequencing of multiple members of multi-gene family with generic primers
* allow sorting /separation of co-amplified sequences from target sequence (signal)

Parallel clonal sequencing of 8 loci x 24 samples

[More protocol... ]

Graph of read length : around 250bp

Connexio Assignment of DRB1 Genotype
* image reassuring to an HLA researcher.
* like the interface (plug for the company)
* aligns sequence, consensus sequence, does genotype assignment
* [Must admit, the information on this interface is rather mysterious to me...]
* [Several more slides of Connexio data and immunology types that mean nothing to me.]
* get a genotype report...

Analysis...

Testing on SCIDS patient
* patients are potentially chimeric
* look for presence of non-transmitted maternal allele
* can find stuff in "fail layer" because software assumes only two alleles possible.

[Wow... I know I don't know much immunology, but I'm not getting much out of this. This is a lot of software for immunologists, and I really don't understand the terminology, making it challenging to get coherent notes.]

Takes about 4 days - [says 5-7 on the slide]
* amplicon prep
* emulsion
* DNA bead process
* loading wells
* sequencing on GSFLX
* Data analysis

[Missed slide on how much data they were getting - 1M reads?]

Multiplex - 500 samples in one run
* Got good results [not copying down seemingly random DRB numbers...]

Friday, February 26, 2010

AGBT 2010 - Christopher Mason - Weill Cornell Medical College

Developmental Changes in Human Neocortical Transcriptome Revealed by RNA-Seq

How do we go from sequence to organism?

Example of disease where they were able to find a change in an exon.. but that's not the norm. Brain transcriptome is especially bad.

Complexity of transcriptome is vast.

NGS transformed the amount of data we're getting

Compared microarrays vs RNA-seq
* RNA-seq gives you much more information on DE.
* Metric for RNA-seq expression (Reads per kb per million reads)
* Controls: spike in synthetic w poly-A tails [next slide: control worked]
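
For reference, the "reads per kb per million reads" metric mentioned above (usually abbreviated RPKM) is simple arithmetic. A minimal sketch, with made-up numbers in the example:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    per_kb = read_count / (gene_length_bp / 1_000)
    return per_kb / (total_mapped_reads / 1_000_000)


# Hypothetical gene: 500 reads on a 2 kb transcript in a 10M-read library.
value = rpkm(500, 2_000, 10_000_000)
print(value)
```

Dividing by length and library size is what makes values comparable between long and short genes and between lanes of different depth.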

Looking at brain
* validate existing gene boundaries.
* longer isoforms
* find other genes
* 70-90% of genes expressed in the brain with strong neuro-developmental correlation
* Ensembl genes categories expressed: many types of RNAs found
* ~18% of splice forms are unique to each individual - splicing levels similar across development
* at high expression, 80-90% of genes have alt isoforms

[Lists of genes that were DE in fetal/adult brain - "things that make sense"]

What is different is Transcription Factors - especially Zinc Finger TFs.
* Shift towards fetal expression

Zinc Finger
* most rapidly expanding class of genes

Look at UTRs
* fetal brain exhibits myriad extensions of gene models and variable UTRs.
* TARs found. (Transcriptionally active regions) - confirmed with PCR

No visible end of gene discovery.
* the deeper you go, the more new things you see.

ROC plot
* sensitivity (TP / (TP + FN)) and specificity
* looks incredible - nearly straight to 1.
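
As a reminder of what goes into a ROC plot, here is a small sketch of sensitivity, specificity, and the points on the curve. The scores and labels are invented, and this is a generic textbook construction, not the speaker's analysis.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical perfectly separated calls: the curve goes straight up to 1.
points = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
print(points)
```

A curve that climbs "nearly straight to 1" means almost all true positives are recovered before any false positives appear.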

Source of "wiggles" in RNA-seq.
* it's everything, really
* biggest problem: annotation is one source.

Human genome is not just 33Mb.... it's only 1/2 to 1/5th of the exome capture.
* 165 Mb have been validated on multiple SeQC platforms!

There aren't just 20,000 genes - it's closer to 45,000!

Begat: every bp of the genome is a locus for testing, each remaining sequence is a variable.

Don't forget, we also have to filter out viruses/bacteria/other
* Code for Begat is available. (Email given - forgot to copy it down.)

AGBT 2010 - Manuel Garber - Broad

Annotating LincRNA Transcripts Using Targeted Sequencing

Goal: Identify functional large ncRNAs in the mammalian genome
* look like mRNA, but non-coding
* Use Chip-Seq to separate genome into regions
* use Tiling arrays, hybridize RNA...
* Tiling arrays - no information about connectivity, limited resolution

* studying the functions of lincRNAs requires precise sequences for both experimental and computational analyses.

Use RNA-Seq protocol to build transcriptome

what RNA-seq gives you:
* RNA, map to genome
* introns... junction reads.
* use reads with mate in poly-A to find end.

Used Tophat to align

Junction reads:
* Longer reads provide junction evidence
* first, use only reads that align with a gap. (Build connectivity map)
* topology map
* use map with ChIP-Seq data to build "paths"
* use paths to call transcripts
* clean up with Paired End Data -> join or kill unlikely isoforms.
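
The junction-read steps above (gapped reads, then a connectivity map, then paths) can be illustrated with a toy graph. This is my own simplified sketch, not the Broad method: exons are given as pre-named nodes, every gapped read is reduced to an (upstream exon, downstream exon) pair, and the graph is assumed to have no cycles. The ChIP-Seq and paired-end filtering steps are left out.

```python
from collections import defaultdict


def build_connectivity(junctions):
    """Build an exon graph from (upstream_exon, downstream_exon) junction pairs."""
    graph = defaultdict(set)
    for upstream, downstream in junctions:
        graph[upstream].add(downstream)
    return graph


def enumerate_paths(graph):
    """List every path from an exon with no incoming junction to a terminal exon."""
    targets = {d for downstream in graph.values() for d in downstream}
    starts = sorted(node for node in graph if node not in targets)
    paths = []

    def walk(node, path):
        next_nodes = sorted(graph.get(node, ()))
        if not next_nodes:
            paths.append(path)  # terminal exon: one candidate transcript
            return
        for nxt in next_nodes:
            walk(nxt, path + [nxt])

    for start in starts:
        walk(start, [start])
    return paths


# Hypothetical locus: E2 is sometimes skipped.
graph = build_connectivity([("E1", "E2"), ("E2", "E3"), ("E1", "E3")])
paths = enumerate_paths(graph)
print(paths)
```

Each path is a candidate isoform; the paired-end clean-up described above is what would then prune the unlikely ones.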

Example:
* Mouse ES
* Illumina sequence (156M - 76bp reads)
* 75% exonic alignment
* correctly reconstruct most expressed known genes at single nucleotide resolution.
* works even on overlapping genes.
* 81% genes fully-reconstructed
* Good recovery of genes at all expression levels.

Novel Transcripts discovered:
* 800 loci between genes
** 250 out of 317 ES lincRNAs are reconstructed
* 200 loci overlapping genes
** 131 overlap coding exons. (making them antisense for visual purpose.)

Are they protein coding genes?
* LincRNAs are probably too small to produce proteins [Strange assumption, IMHO... maybe I'm missing something.]
* 650 of 800 lincRNAs have no coding potential
* have lower expression level than coding regions.
* intergenic transcript conservations.. (similar conservation to old lincRNAs)
* Antisense transcripts? - no antisense coding potential
* antisense expression - very low antisense expression
* Antisense conservation - a little more conserved than sense lincRNA because of overlap with exons of genes
* antisense exons are not conserved.

What do overlapping transcripts do?
* expression is low,
* little or no conservation
* correlation with overlapping transcripts
* Thus: artifacts, noise, fine tuners? other ideas?

Conclusion
* novel statistical method takes advantage of longer reads
* mouse ES coding gene novelties
* intergenic non coding RNA (lincRNA)
* new family of antisense non coding RNA
* validation of 18/20.

AGBT 2010 - Brian Haas - Broad

Genome annotation using mRNA-Seq: A case study of Schizosaccharomyces pombe

Leverage evidence for genome annotation
* eg, 3 ab initio gene predictions

Major challenge:
* lack of high quality evidence
* this is changing with NGS.
* we now have evidence - but we need to standardize and develop algorithms
* reconstructing transcripts is difficult

Approach 1: de novo assembly
* treat them like EST
* align to genome

Approach 2: align reads to genome
* reconstruct based on alignments

Sequencing genomes from Schizosaccharomyces
* pombe is model organism - sequenced in 2002
* 12.5Mb, 5k genes, avg gene 1,489 bp
* genome should be well annotated, good quality annotations

Seq:
* 44M reads, 65% aligned (Maq)
* align to genome - look good
* challenge is to bring it to high quality automated state

Align: Use TopHat for short read alignment + Cufflinks
Assemble: Velvet/Ananas + GMAP

ELT structures transferred into PASA, which does refinement, alt splicing and validate existing annotations

This is all exploration - This is NOT a tool Bake off.

Elts: Velvet (21167), Cufflinks (4158), Ananas (8309)
Almost all alignments to genome were perfect.

Then, test how many assembled to reconstruct full length gene support: Ananas did best, Cufflinks 2nd best, Velvet only 1/3 of those done by Ananas.
* Velvet did very well with supporting introns

Problems:
* readthrough and encroachment
* again, Ananas did best, Velvet 2nd best, Cufflinks worst (by a long shot.)

Examples given.
* Velvet seems to give fractionated transcripts.. breaks where coverage is high. [Probably seq errors are causing it to break?]
* some annotations needed to be extended
* corrected genes - merging two genes that are really one.

Compare:
* none of these methods are great - they're all missing some that others caught.

Challenges:
* some well covered genomic loci not fully reconstructed (paralogs?)
* intron readthrough/encroachment
* incorrectly merged genes/transcripts
* UTR structures and alt splicing.

For well covered genomic loci not fully reconstructed
* identify disjoint regions
* collect reads and assemble independently
* genome directed to avoid misassembly
* very fast to do this
* This helps, but still have a long way to go.
* more tuning needed (expect to get up to 90%)

Dissecting merged transcripts.
* use coverage based assembly clipping - break up transcripts
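
One simple way to picture coverage-based clipping: walk along the per-base read depth of a merged transcript and cut wherever the depth falls below a threshold. The sketch below is only my guess at the general idea, with an invented threshold and coverage vector. The actual PASA/Broad procedure was not described in enough detail to reproduce.

```python
def clip_by_coverage(coverage, min_cov):
    """Split a transcript into half-open (start, end) segments with depth >= min_cov."""
    segments = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= min_cov and start is None:
            start = pos                      # a covered segment begins
        elif depth < min_cov and start is not None:
            segments.append((start, pos))    # coverage dipped: close the segment
            start = None
    if start is not None:
        segments.append((start, len(coverage)))
    return segments


# Hypothetical merged transcript with a coverage dip in the middle.
segments = clip_by_coverage([9, 8, 9, 1, 0, 7, 8], min_cov=3)
print(segments)
```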

Technology will greatly facilitate efforts
* Use stranded mRNA-seq

Summary:
* the information from mRNA-seq is needed for high throughput annotation
* current tools show progress
* still much more to be done in optimization
* need for optimized methods for ALL types of genomes.

AGBT 2010 - Shurjo Sen - NHGRI

Transcriptome Profiling of ClinSeq Participants by Massively Parallel Short-Read DNA Sequencing

[No Microphone - I may not get much from this talk. Mostly I will be pulling from Slides, I think]

ClinSeq:
* cohort of 1,000 individuals
* initial focus on Cardiovascular disease
* Consent for follow up
* transcriptome, exome + few genomes
* application of large-scale medical sequencing in a clinical research setting.
* concurrent "Omes" from same individual
* move on to other diseases in the long term

* started with sanger
* now moved to Illumina

* published marker paper on this topic last Sept in Genome Research

ExpressSeq
* transcriptome component of ClinSeq
* demonstrate use of RNA-seq in clinical research
* better than SAGE or Microarray

Transcriptome + Exome
* gene expression
* splicing
* gene fusions
* etc

Atherosclerosis
* hardening of arteries
* Looking for biomarkers for calcification
* can look for it by CT scan (in example, arteries look like bone.. [Ouch!])

Study:
* 4 people with high calcification, 4 with low calcification
* two RNA sources: LCLs and whole blood
* emphasis on uniform cell culture conditions
* repeated EBV transformation from same individual (see noise)
* RNA Fragmentation (Covaris S2)
* PCR amplification 12 cycles
* two PE 51bp lanes Illumina

Differential gene expression
* Expression vs Statistical Significance.
* "upside down volcano plot"
* found about 100 genes that were differently expressed and significant
* Looking at those 100 in detail
* Many of these genes are noise.
* more sequencing reads to improve statistical depth

Discussing his best hits - but not giving names of genes.

[Kind of silly to take notes on random unnamed genes. Take home message is that some of the genes found were already known to be involved in the process - but obviously not all of them. TFs, TKs and something associated with rheumatoid arthritis. This might be a good time for me to rant about how picking any random list of proteins will give you things that you think are promising. All gene hit sets are "interesting" at first, and useless when not validated... but that's obvious, no?]

Coming up
* analysis of next 8 subjects
* follow up
* sequence more subjects for rare variants
* integrated analysis of genome and transcriptome data to uncover SNV loci underlying differential expression. ("integrating multiple omes")

AGBT 2010 - Nicole Cloonan - The University of Queensland

Translation-State RNAseq of Human Embryonic Stem Cells using Paired-End Sequencing.

Intro to Stem Cells
* hot topic - potential for cell generating therapies
* Self renewable
* pluripotent
* directable
* tractable

Looking at Extracellular space network.
* molecules that control cell-cell interactions (among others)

The "Plurinet"
* defines the pluripotent status of the cell
* protein-protein interactions
(Muller et al, Nature 455:401-405)

Transcriptional complexity
* 6 transcripts per gene on average
* so how does this affect the plurinet

SOLiD RNA-Seq
* have a pipeline... [too fast]
* done SET and PET.
* 80% of tags map, 194M 50mers, 114M 25mers

Tags that don't map:
* LincRNA, intergene, etc...

PET.
* alternate splicing
* works well if you know what the annotations are.
* with PET, you can build transcript models if you don't have them already - learn more about alt. splice
* can be used for novel exon discovery

Chip-Seq from Ku et al
* Extended Exons. 3' exon extensions can be very long.

[Why is this Chip-Seq?]

Do Virtual Northerns
* Size fractionations
* What you find is that most annotated genes have the right refseq predicted lengths.
* however, some are shorter, some are longer
* Frequency at which tags from a particular library match predicted (based on refseq) vs from RNA data... You do see that some have very different results.

RNA are translated...
* if no signal peptide, cytoplasmic (on free ribosomes)
* if has signal, then it's translated by ribosomes bound to membranes
* use sucrose gradient to separate the two populations
* do PET, (35/75bp reads)
* compare signals in both fractions - they come up well in the predicted fraction.

Novel transcription
* membrane associated RNA have very different proportions of extension (mainly long 3' UTRs) than those in the cytoplasmic fraction

MiRNA biogenesis and mRNA interactions.
* use fractionation to test
* RISC associated with polysomes (which works with fractionation)
* complexes stay together through fractionation
* Long UTRs are enriched for miRNA binding sites

Back to Plurinet
* Complexity is incredibly increased with the extra products and miRNA

Summary:
* PET allows you to reconstruct loci level complexity from RNAseq data
* Size fractionation is useful
* translation state RNAseq allows the capture of mRNA and miRNA data from polyribosomes
* Transcriptional complexity impacts greatly on interactions.

AGBT 2010 - Jonas Korlach - Pacific Biosciences

Direct Single Molecule, Real Time RNA Sequencing.

Opportunity to work further with this platform: replace the enzyme in the ZMW with other enzymes of interest - can observe new functionality.

"Single Molecule Realtime Biology" [SMRB? How do you say that acronym?]

Of interest: Reverse Transcription
* replace DNA polymerase with an RNA-dependent DNA polymerase (reverse transcriptase)
* have done this - simple extension tests.
* done kinetic analysis, and the phospho-labeled dNTPs are incorporated well, but MUCH slower (one order of magnitude) than unlabeled nucleotides

Tested the system out anyhow.
* Seems to work in principle - albeit slowly. One dNTP in the enzyme is not yet one nucleotide insertion.

Ribosomal RNA Sequencing.
* Can withhold catalytic metal, which allows binding, but not ligation. Thus, you can just watch the fluorescence - and in this case, binding only happens with correct nucleotide.
* can also detect modified RNA bases - eg, Pseudouridine. Can measure binding time - takes longer.

Detection of Modified RNA bases
* pauses indicate kinetic changes

For viruses, you can get a single enzyme to process the entire genome of a virus - very long read lengths at the tail end of the distribution.

HIV reverse transcriptase translocation dynamics.
* use terminating bases and AIDS drugs - and monitor incorporation and pulses.
* Show graphs of kinetic analysis of P-site and N-site
* Can then study binding in the presence of the terminators/drugs.
* Can calculate binding energy from pulses.

Summary:
* Demonstrated SMRT RNA sequencing - still room to grow.
* Demonstrated SMRT Biology - Translation (shown tomorrow) and reverse transcriptase.

AGBT 2010 - Pacific Biosciences Workshop

"The debut of the 3rd Generation"

Intro:
* Came from the basement of a building at Cornell. [what is it with basements on campus?]
* technology detects 500 photons per base
* Raised $266M in company history

History
* show slide with first results that launched company - detecting 3 labeled C's, barely

"yes, it is big, yes, it is heavy, and yes, it does work"
* smallest: $50,000 desktop version
* Largest: full human genome in 15 minutes.

Already have manufacturing for reagents - and building a facility to construct machines.

Steve Turner: Founder, CSO, Board Member

Overview
1. brief overview of technology
2. Update on Collaborations
3. Instrument debut
4. Applications
5. Scalability

* Video of polymerase - same one from web.

Collaborations:
* influenza
* cancer transcript
* long read progress
* strobe sequencing for structural variation
* Palustris systems biology [Go palustris!!!]
* circular consensus sequencing
* survey of coverage bias
* direct detection of methylation and DNA modification

Influenza:
* serotyping doesn't give the full picture - immunologically distinct viruses.
* Fast Time to result: 9 hours from sample Extraction to sequencing analysis completion.
* did not look at consensus call - used single molecule reads.
* match single molecules with sequenced reference genomes of similar influenza.
* Turned out that the strain was misidentified - phylogeny was incorrect.
* side benefit: in every case, each segment was covered in single reads. Potential for quasi-species studies of viruses.

Sequenced MCF-7
* known alt. splice forms implicated in tumorigenesis.
* Can map entire transcripts (2400bases) in single read.
[neat stuff]

10,351 base read scrolling... goes on and on.
* they see up to 20kb reads.

Strobe sequencing
* answer to Mate Pairs?
* Polymerase is damaged by laser, so reads will continue until damaged
* Turn off the light, and the polymerase is unharmed... will continue till you turn the lights back on.
* Who needs mate pairs when you can just sequence 10kb at a time?
* show repeat lengths - at 20kb, you can sequence most of your repeat regions. - Strobe it as well...
* Very useful for assembly.

Insertion AC223433 fosmid
* can use time as a way to look at insert size.

Palustris
* 58 contigs from palustris
* Hybrid assembly - now have a single contig. (Used Strobe, straight and other tech..)

Read Length.
* Expect that you can expand reads to 50-70kb.
* demonstrate by hairpin ligation to lambda genome (linear)

circular consensus sequencing
* make something circular, then go 'round and 'round till you get consensus.
* Q40 on single molecules by going over it many times
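
The "round and round" idea can be sketched as a per-position majority vote over repeated passes of the same molecule. This is a deliberately naive illustration: it assumes the passes are already aligned and all the same length (so substitutions only, no insertions or deletions), which real single-molecule reads are not. The pass sequences are invented. The Phred conversion is the standard one, under which Q40 means an error probability of 1 in 10,000.

```python
from collections import Counter


def consensus(passes):
    """Majority vote at each position across equal-length, pre-aligned passes."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


# Hypothetical passes over one circular template, each with at most one error.
passes = ["ACGTA", "ACCTA", "ACGTA", "AGGTA", "ACGTA"]
called = consensus(passes)
print(called)
print(phred_to_error(40))
```

Because the errors in each pass are largely independent, they rarely hit the same position twice, so the vote converges on the true base as the number of passes grows.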

Prep:
* results in Low bias for GC content
* tested on many organisms

modified nuclear bases
* look at kinetics of base incorporation
* modified nuclear base Methylated Adenosine causes kinetic differences
** 6-10x kinetic changes.
* Methylated Cytosine - still get a signal
* Hydroxymethylcytosine: can also see that - also different from other traces
* duration and spacing are different for the three bases.
* Single base resolution, less than 1% FP, methylation detection on single molecules
* also looked at other modifications - can always tell that it's different.
* Polymerase stalls at T-dimers.

[Summarized it all]

[Insert CEO talk here - wonderful company, wonderful people, "state of the art", hard work.]

Unveil world's first 3rd generation sequencer
* Movie time!
* 8 Cells per package - $100 per cell.
* SMRT Cell - 96 / tray.
* reagent plate (96 well)
* each cell works independently - in any protocol
* Uses CSV files
* API to LIMS with designs.
* System looks pretty child-proof (though probably not idiot-proof)

Monitoring a run:
1. monitor at instrument or remotely
2. View real time base incorporations
3. remaining runtime
4. status of each cell from cell prep to run.

Signal to noise ratio is dramatically improved from last year

Alignment?

Portal:
* web based interface
* accessible from any computer
* automated secondary analysis

Reports:
* full complement of reports automatically generated
* quality files
* ....

Browser integrated into viewer.

Supports:
* BAM/SAM
* FastQ
* SRA
* etc...

All in one day.
* sample prep to analysis.

* methylation sequencing will be released in an update
* direct rRNA sequencing.

Working towards SMRT Translation
* replace Transcription (Polymerase) with translation (ribosome & labeled tRNA....)

[ok, didn't see that coming]

Scaling of performance over instrument life
* current yield 30% improved to 90%
* Multiplex: 80k improves to 160,000
* speed 1-3bps improving to 15bps
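
[Back-of-the-envelope arithmetic on the three numbers above. This assumes the improvements are independent and simply multiply, which is my assumption and not something stated in the talk.]

```python
def fold_change(before, after):
    """Fold improvement going from `before` to `after`."""
    return after / before

yield_gain = fold_change(0.30, 0.90)            # fraction of productive wells
multiplex_gain = fold_change(80_000, 160_000)   # wells observed at once
speed_gain_low = fold_change(3, 15)             # starting from 3 bases per second
speed_gain_high = fold_change(1, 15)            # starting from 1 base per second

# Assumed to be multiplicative; real-world gains may interact.
combined_low = yield_gain * multiplex_gain * speed_gain_low
combined_high = yield_gain * multiplex_gain * speed_gain_high

print(f"yield x{yield_gain:.0f}, multiplex x{multiplex_gain:.0f}, "
      f"speed x{speed_gain_low:.0f} to x{speed_gain_high:.0f}")
print(f"combined throughput gain: x{combined_low:.0f} to x{combined_high:.0f}")
```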

Throughput should pass 2nd generation with this instrument. Expect new instrument in 3 years to blow all of this away.

Interpretation of Genomics will require epigenetics, etc etc etc. and much data processing. [Oddly, That's what I tried to convince Complete Genomics people of this morning, without success.]

Questions:
* Dark Bases? They are not dark bases - they are missed bases. They now have better bases, that bind better than the natural bases. Missed bases are a problem - the nucleotide docks, and if it happens too fast, you don't get enough photons...

* Something about algorithms for de novo assembly - check out the posters, and we'll have more information for you.

* What is your error rate? [Very aggressive question] Single pass error rate is greater than ensemble sequencing. You don't get systematic error in Pac Bio - Approach towards consensus is linear. You know when you see systematic errors - you can catch and repair. Expect Q90 with this technology.

* Exponential decay on read lengths.


AGBT 2010 - Elaine Mardis - Washington University School of Medicine

Single Molecule Sequencing to Detect and Characterize Somatic Mutations in Cancer Genomes

[Disclaimer Statement - she is a Pac Bio board member]

Why Sequence Whole Genomes?
* [same as always - nothing new]

Focus on talk today is on point mutations

How Current NGS (eg, Illumina) works:
* Sequence tumour & normal to 30x,
* Compare to reference, then compare tumour to normal, and remove known dbSNP sites, etc etc...
* Validate SNVs.

4 Tier levels.
* focus validation on Tier 1 results.

Why Validate?
* Pipeline is tuned to have a slightly elevated false positive mutation rate so things aren't missed.
* Orthogonal validation is important.
* Validation is expensive and time consuming, however.

Why check for prevalence of mutations?
* Each tumour gDNA sample consists of the contributions of many tumour cells
* digital nature of NGS data allows an estimation of how common each validated mutation is in the tumor cell population
* more prevalent mutations are likely "older" - happen earlier in progression.

Recurrent SNVs
* why? Adding evidence. The ones that happen more often are likely to be earlier in progression and are thus more likely to be drivers. [Not sure I buy that logic, however.]

Limitations:
* Faster Sequence data generation (analysis is not getting cheaper)
* Increased validation/prevalence data demand (need to decrease cost)
* Recurrent mutation screening (site specific vs whole gene)

Medical impact:
* always want our results to be useful. [Kind of ignoring this part... selling us on the use of sequencing for medical use.]

Discussion of AML project, as discussed in last talk.
* prognostic IDH1 mutations.

[Dr. Mardis' talks always remind me of an infomercial... It has the feel of a commercial presentation, but with data to back it up. It's glossy, the slides are clean, and the presentations feel well rehearsed - something we just don't get much of in science talks.]

Insert sales pitch for Pac Bio systems here.

[5 slides later... ]

three experiments:
* first for accuracy
* second for sensitivity
* third for detection of mutational prevalence

Accuracy:
* 32 directed PCR products from glioblastoma tumor normal pair
* 77% neoplastic cellularity
* SMRT sequencing (alpha prototype detector)
* Wrote software for SNP detection
* 94% of 86 known sites were found
* 6 FP and 6FN results

* 5 LOH sites were detected properly
* All mutations were detected at different confidence levels

Sensitivity
* used AML genome
* 95% population purity
* All variants detected at each cellularity...

Detection of Mutational Prevalence:
* Concordance with Illumina is good - but not great in tier 3 mutations. C to T mutations were slightly biased against.

Conclusions
* Platform is Ramping up quickly


AGBT 2010 - Keynote speaker: James Downing - St. Jude Children's Hospital

The Molecular Pathology of Acute Leukemia

Was head of pathology at St. Jude for many years - doing cancer genomics before it was called cancer genomics.

First time at AGBT
No methodology or technology - focus on biology and clinical relevance. Not going to present NGS data! Using completely outdated technology - and all of it was published in the last 12 months.

The cancer he's focused on is the best characterized of all the cancers.

What leukemia really is: Proliferating B-cells that rapidly take over the whole body. Highest tumor load of all the cancers. In his generation, 95% of children died within 12 months of diagnosis. Now have 80-85% cure... but relapses happen in 30%.

[Classical diagram of immune system lineage]

Mutations in early progenitors generate leukemias. Two types: ALL and AML. They are not homogeneous diseases, however. Distinct biological subtypes are characterized by translocations. - They contribute to the leukemia: Necessary, but not sufficient.

What are the biologic processes that need to be altered to generate leukemia:
1. Alteration in self-renewal capacity - need to become "immortal" (unlimited self-renewal) (eg AML1-ETO)
2. Need to have an altered response to growth signals - continued growth (eg. BCR-ABL1)
3. Block in apoptosis (eg PML-RAR alpha)
4. Block in differentiation

Doing "routine molecular diagnosis":
* CNV, expression, etc
* Use Affy Chips

What have they found? (using 242 diagnostic ALLs with matched germ line DNA.)
* there are a small number of copy number changes per case... vary markedly across the different subtypes. (eg, MLL: ~1, other has ~11)
* more Deletions than Amplifications
* 60% of B-lineage ALL have a genetic lesion in a gene regulating B-cell differentiation (PAX5, Ikaros, EVF, LDF1, BNK)

PAX5 deletions most common.
* 10 exons...
* Half of deletions deleted half of the genes
* Others delete required domains
* some were homozygous, but not all.
* Lots of fusions with this gene occurs as well.
* Point mutations were also seen in binding domains...

[Ok, so this gene can be deleted in many ways... got it. The cells find ways to kill off this gene.]

Haploinsufficiency in PAX5 deficient mice
* Was not sufficient to cause lymphoma.
* cooperates with BCR-ABL1 to cause lymphoma. (Mouse Model)
* strong driving pressure for disabling the B-cell differentiation genes in Leukemia.

60% of B-progenitors ALL have Mutations in B-cell regulatory Genes

Look at Ikaros
* entire literature about altered isoforms.
* saw a high frequency of mutations in BCR-ABL1 ALL,
* 85% of BCR-ABL ALL have deletions of Ikaros: Almost never see the deletions in Ikaros.
* mapping deletions of Ikaros: Some are complete, but there is a subset of deletions that commonly knock out all 4 zinc fingers (exons 3-6).
* Never see Ikaros "isoforms" without these deletions. There probably are no isoforms - it's always genetic lesions.
* Deletions typically happen within a few bases of each other - result from aberrant RAG-mediated recombinations.

Start putting the lesions together. [Nice lists of genes for each of the 3 pathways]

Clinical relevance:
* looking for markers in a new cohort. Remove two types of ALL (BCR-ABL1 + infant), look at 221 samples: Are there new markers?
* Yes, it was Ikaros: 75% of relapse if you have Ikaros deletions.

Compare BCR-ABL1- and Ikaros- (Bad outcome) with BCR-ABL1+ ALL (Also has Ikaros deletions)
* Significant expression similarity
* Look at the Kinases: JAK family, which have a high rate of mutations in ALLs.

JAK mutations:
* not seen in other types of cancers - unique to JH2 domain, clustering in a single spot. (R683)
* Turns out that high risk ALL have JAK deletions.

CRLF2 = TSLPR, IL-7/IL07R
* Over expression of this receptor (compensating for Jak Mutations and lack of signaling), combine to cause a proliferative signal. [I didn't get everything here.]

Looking at high risk again:
* Ikaros deletions
* Jak Mutations
* CRLF2 (cytokine receptor mutations)

What other kinases are activated in this subset of patients?
* Work in progress
* quick review of other genes they're now finding... [too fast to get that down.]

Genetic Alterations Acquired at Relapse
* Relapsing is only 20% blast population.
* Need to Flow sort.
* CDKN2A/B mutations
* [list of genes, including ikaros... ]
* No common mechanism of relapse - variety of pathways
* Varieties do not include drug target mutations. It's always in signalling, etc.
* 7% of relapse is "unrelated" (secondary leukemia)
* 8% same as diagnosis
* 34% clonal evolution from diagnosis
* 51% clonal evolution from pre-leukemic clone

Summary:
* small number of variation
* Ikaros mutations
* Aberrant RAG-mediated recombination
* JAK mutations
* ...

This disease "begs for NGS" - Get a complete picture of what's going on.
* Collaborating with WashU. (Mardis, Wilson, Ley)
* Doing the "Bad" leukemias (infant, high risk, CBF)
* also doing brain and solid tumours (neuroblastoma, osteosarcoma, retinoblastoma)
* Started Feb 1st - already have 5 genomes and matched normals.
* over $50M invested in this project


Thursday, February 25, 2010

AGBT 2010 - Ogan Abaan - NIH/NCI

Identification of novel cancer mutations in sarcomas

Sarcomas: two categories
* simple genetic changes (eg. Ewings)
* Complex genetic changes (eg, osteosarcomas)
Soft tissue sarcomas in general:
* rare
* high metastasis.
* connective tissue origin.
* 50 subgroups - most have unknown biology
Tumour samples from 24 soft tissue sarcoma patients
* matched normals will be sequenced when available at some point in the future.

Target:
* 15k exons from 1334 genes
* used "in-solution" capture method.
* 33.5k -150mers
* no repeat masking.
* biotinylated baits

Used Eland - and used GAII or GAIIx, as available - mixed read lengths

Custom python scripts - wrote them himself. Still a work in progress.

Variant Calling is VERY simple. Uses Phred score based approach, adjusted by error rate at that position.

Did the standard: filter on dbsnp130, annotate on UCSC refGene and Visual confirmation (IGV Browser)

Shows stats - they don't look great, but they seem similar to those published in Tewhey et al (Genome Biol 2009). [Shown to justify low rates?]

Some optimization could be done to get more coverage.
* gets 23-46% at greater than or equal to 10x, paper gets 88% or more at 7x

6 of the variants are known in the COSMIC db.

KEGG pathway: Many mismatch repair... [actually, this is the usual set you'd see with any cancer sample. Nothing sticks out.]

Conclusion:
* 305 variants, no common variants.

Future:
* increase sample size.
* pathway analysis
* Understand biology

[Not the most impressive talk - I could give the same talk on my cell lines, and would have roughly the same results.... nothing particularly interesting.]



AGBT 2010 - Ian Bosdet - BC Cancer Agency

Mutational Profiling of Pre and Post-Treatment Lung Tumors Using Whole-Transcriptome Sequencing and Targeted Sequence Capture

EGF receptor is often mutated (non-small cell)
* some tyrosine kinase inhibitors exist, but response is variable.
* clinical characteristics known to be associated with response were used as primary criteria for recruitment

Identifying patients that are likely to benefit from TKI therapy can have a significant impact on overall survival.
* cells become addicted to the rampant signalling from TK. Cutting it off can kill them
* often a mutation that can dampen or negate result of drug.

All cancers used were first line.
* non-smoker
* female & asian
* stage IIIb
* NSCLC 1st line.

Majority of patients have now progressed - and encouraged to donate 2nd biopsy.

65 patients over 2 years,
goal: non progression over 8 weeks.
80% did not progress in 8 weeks.
* 23 partial response,
* 24 stable disease

30 tumours selected for RNA sequencing
* 13 responders, 14 non-responders
* 3 progression tumours
* gene expression analysis and mutation discovery
* some correlation to clinical characteristics.

One gene correlated with EGFR sensitivity mutations.
Another seemed to correlate with smokers who did not respond: IER5L

Excess unaligned reads were aligned to virus transcripts - Highly enriched for Epstein-Barr Virus. Tumour ended up being re-classified.

3 patients then sequenced with Capture:
* Used Agilent (47,558 baits)
* Normal, pre-treatment and post-treatment tumour samples
* can be used to identify small deletions
* Putative somatic mutations resulting in significant amino-acid alterations were identified using SNVMix
* Mutations similar between patients were not observed, but pre-treatment tumour pairs show significant overlap.

[Talking about putative somatic mutations.... I got ripped into for doing the exact same analysis and calling the same mutations "most likely" somatic 2 weeks ago... DOH.]

Summary:
* clinical selection of patients can greatly enhance incidence of EGFR mutations and response to erlotinib at 8 weeks
* EGFR mutation status is a good but imperfect predictor of patient response
* mutation discovery in treatment naive lung tumours has identified a relatively small number of mutations (need validation)
* more progressions will be analyzed.


AGBT 2010 - Daniel MacArthur - Wellcome Trust Sanger Institute

Loss-of-Function Mutations in Healthy Human Genomes: Implications for Clinical Genome Sequencing

[Missed the first couple of minutes?]
Analysis of 1000 genomes data.

Loss of Function sub-group
Aim: create a catalogue of variants predicted to result in severe disruption of gene function
What is a LOF variant: [annotation based on GENCODE v3lb]
1. stop codon SNPs
2. splice disruption SNPs
3. frame shift indels
4. disruptive structural variants. (eg. loss of exons, loss of start codons...)

LOF variants:
* enriched for:
** severe recessive mutations
** other variants with functional effects
** neutral variants in redundant genes/pseudogenes
** Sequencing and annotation artefacts

Many of these will be neutral.

3 pilots.
* total of 1,6556 unique genes affected.
* that is to say that a substantial portion of the genome has LOF variants
* acknowledging that there are errors, that's still a lot. (=

Disrupted genes per individual. Visible difference between European vs. Yoruba. (Africans have higher variability)

Structural variants seem relatively constant, splicing seems constant, stops seem to vary most. (CEU, CHB, JPT, YRI) [I'm eyeballing]

Expect to see some carriers for recessive disease mutations
* Several likely carrier mutations identified. [didn't catch them]

Derived allele frequency spectra.
* stop and splice are heavily shifted to the low end (0.05+)
LOF sites are enriched for artefacts
* Conserved region have less polymorphisms, but equal amount of error.
* Non-conserved have more polymorphisms, and equal error:
** thus tends to increase artefact rate in conserved regions.

LOF clustering points to mapping and annotation artefacts
* 91% of LOF carrying genes contain only one LOF variant.
* there are some genes that are enriched for multiple independent LOF variants.
** many of them are CNV, seg dup, close paralogues.... which means that they're artefacts too.
* other annotation artefacts exist too... LOFs are making them stand out.

Beyond cataloging:
* large scale sequencing studies tend to produce many potential LOF candidates
* discriminate between disease causing and benign variations.
* is there a functional profile distinguishing recessive and LOF-tolerant genes?

Compare LOF-tolerant genes (& non-OR) to 725 recessive disease genes from OMIM. (Early results)
* use it to do classification
* linear discriminant analysis

[Kind of feels like a fast drive-by-blogging... my notes really didn't do justice to Daniel's explanations - i just managed to get down some of the points.]


AGBT 2010 - Timothy Triche - Children's Hospital Los Angeles

Unraveling the Complexity of Primary and Metastatic Ewing's Sarcoma Using Helicos Single Molecule Sequencing

Came out of ongoing studies of high risk childhood cancers.
Ewing Sarcoma
* had no survivors, now has 50% survival rate.
* If it metastasizes, there is no survival (poor outcome)
Started with 16 year old female
* metastasized 6 months later
* had a lot of DNA from stock piled bone marrow

Goal:
* use RNA/DNA/Epigenomics to understand cancer
Interested in just about all types of sequencing, and integrating it all. List pretty much every type of Next gen sequencing technique. [Not much they aren't interested in.]

Using Helicos to do sequencing
* Identify two p53 mutations - both previously known.
* Chimeric genes in sarcomas usually mean the rest of the genome is less rearranged. However, there were fairly significant rearrangements in the metastasis. (eg. Entire chr 7 & 8 duplicated.)

Metastasis is not just an explant.
CNV vs RNA:
* Statistically, there is a strong association between CNV duplications and RNA up expression of genes.

[something about cell adhesion molecules?]

Mechanisms of double strand breaks... uniform at nearly single base resolution.
* 11q24.3
* in middle of FLI1 gene.
* approximately 1Mb deletion in tumour, in metastasis, this completely disappears.

Breakpoint @ 22q12.2
* less defined... again CNV changes disappear.
18qter DEL & LOH in CHLA9 disappears in CHLA10: Is the metastasis derived from the primary?
* Deletions "disappear", so the dominant clone in the primary is not likely the one that metastasized.

Comparing primary to metastasis

[Whoa... colours... orange fused to blue, turned to light blue.. something diluted... much too fast to take notes on this without pictures.]
* Dosage effects are seen.
* 22 chromosomes show LOH and profound Homozygosity in the Metastasis that is not seen in the primary. 16/20 chromosomes.
* This shows a major simplification in the genome.

Used RNA-Seq... some filtering on RNA.
* random primers, poly A tail addition, Hybridize and ...
* look for fusion - get EWS-FLI1 fusion

overall RNA expression.
* far more complex, especially intronic, 5` and 3` of exons.
* This is regulated under controls that have yet to be discovered.
* more than 40% of transcription in the primary tumour and metastasis is non-exonic.
* Genes are up regulated in metastases
* Some times you see lots of intron expression, some times you see LOTS.

[ I'm going to go see another talk - Have to stop notes here.]


AGBT 2010 - Kristian Cibulskis - Broad Institute

MuTector: Accurate Somatic Mutation Detection in Whole Genome and Exome Capture

Mutation Detection - the goal
* Somatic Point Mutations: SNV in the tumour DNA that are not present in the normal

One challenge: Sensitivity
* tumour purity : Normal tissue gets into the sample - you may be testing normal tissue in high quantities
* ploidy: often there are multiple copies of the DNA

60% tumour is common - with 3x ploidy and min allele fraction: 0.23
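
[The 0.23 figure can be reproduced with a quick calculation. This is my own sketch, assuming the mutation sits on a single copy in the tumour cells and that the contaminating normal cells are diploid and carry no mutant copies.]

```python
def expected_allele_fraction(purity, tumour_ploidy, mutated_copies=1, normal_ploidy=2):
    """Expected fraction of reads carrying a somatic mutation.

    purity         : fraction of cells in the sample that are tumour
    tumour_ploidy  : copies of the locus per tumour cell
    mutated_copies : how many of those copies carry the mutation (assumed 1)
    normal_ploidy  : copies of the locus per normal cell (assumed 2, none mutated)
    """
    mutant_alleles = purity * mutated_copies
    total_alleles = purity * tumour_ploidy + (1 - purity) * normal_ploidy
    return mutant_alleles / total_alleles

fraction = expected_allele_fraction(purity=0.60, tumour_ploidy=3)
print(f"expected mutant allele fraction: {fraction:.2f}")
```

With 60% purity and 3 copies in the tumour, the mutant allele makes up 0.6 / (1.8 + 0.8) of the reads, so less than a quarter of the reads at that position support the mutation.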

Challenge 2: Specificity
* Signal: 1 somatic mutation per Mb
* Noise: 1000 common germline variants per Mb (in dbsnp)
* Mutations are not recurrent. (Constant discovery mode)
* 1000s mutations per sample, 100s of samples
* Too expensive to validate every mutation - would cost more than to discover.

MuTector:
* Core detection algorithm and practical artifact filters
* Under dev since Nov 2008
* Built upon GATK

Some artifacts can be cleaned up globally
* Remove molecular Duplicates
* Recalibrate Quality Scores (make Q values match)
* Locally Realign [Gapped - uses SW - I saw the poster]

Core Statistical Test
* Prior genotype probabilities enforce variant expectation rate.
* first calculate score for non-reference (for tumor)
* then calculate score for it being reference (for normal)
* Controlling sequencing error
* Controlling missing a germline ref in the normal.

Running: you get more somatic mutations
* expected 30 somatic mutations, ended up with 133 in 30mb of coding sequence
* Error processes not captured by the core statistic produce high confidence mistakes
* Information about reference alleles and mutant alleles should come from similar distributions
* linked mutations, library errors... etc

Filters:
* Sequence context causes base hallucinations
* Fisher's exact test to check distribution of strand of reads containing reference allele versus alternate allele
* Bigger effect in capture than whole genome
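
[For reference, a minimal version of the strand test mentioned above, using only the standard library. This is a generic two-sided Fisher's exact test on a 2x2 table of read counts, not the actual filter code from the talk. The read counts in the example are invented.]

```python
import math

def fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table.

    Rows are alleles (reference, alternate), columns are strands
    (forward, reverse). Sums the probabilities of all tables with the
    same margins that are no more likely than the observed one.
    """
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt

    def table_prob(x):
        # Hypergeometric probability of x reference reads on the forward strand.
        return (math.comb(row_ref, x) * math.comb(row_alt, col_fwd - x)
                / math.comb(total, col_fwd))

    observed = table_prob(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    # Small tolerance so floating point ties count as "as extreme".
    return sum(p for p in (table_prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))

# Invented example: alternate allele seen only on the forward strand.
p_biased = fisher_exact_two_sided(ref_fwd=20, ref_rev=20, alt_fwd=12, alt_rev=0)
# Invented example: alternate allele split evenly across strands.
p_balanced = fisher_exact_two_sided(ref_fwd=10, ref_rev=10, alt_fwd=5, alt_rev=5)
print(f"strand-biased site p = {p_biased:.4f}, balanced site p = {p_balanced:.4f}")
```

A small p-value says the alternate allele's strand distribution differs from the reference allele's, which is the hallmark of the context-driven artifacts described above.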

Misalignment:
* Sequencers/Aligners tend to make reproducible errors, which then show up in alignments

Small changes to filters have big effects
* Very sensitive!

Filtering goes from 133 to 35.

Validation:
* 26/29, 30/35, 31/36, 92/100
* Around 95%

How Sensitive?
* use core statistics.
* depends on coverage! [of course]
* use theoretical prediction data and ultra deep coverage as "control"
* Both seem to give the same/similar results
* Average 60-80% power to detect
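
[To see why power depends so heavily on coverage, here is a simple binomial sketch. It is not the theoretical prediction used in the talk: it assumes a fixed read depth, ignores sequencing error, and uses a hypothetical rule that a site is callable once at least 3 reads carry the mutant allele.]

```python
import math

def detection_power(depth, allele_fraction, min_alt_reads=3):
    """Probability of seeing at least `min_alt_reads` mutant reads at a site.

    Models mutant read count as Binomial(depth, allele_fraction).
    `min_alt_reads` is a hypothetical calling threshold, not a value from the talk.
    """
    p_too_few = sum(
        math.comb(depth, k)
        * allele_fraction ** k
        * (1 - allele_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1 - p_too_few

# 0.23 is the minimum allele fraction quoted earlier for 60% purity and 3x ploidy.
for depth in (5, 10, 20, 30, 60):
    print(f"depth {depth:2d}x: power {detection_power(depth, 0.23):.2f}")
```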

Beta Testing going on
* Release of the software will be soon!


AGBT 2010 - Elliot Margulies - NHGRI/NIH

Sequencing and analysis of matched tumor and normal genomes from a melanoma patient

Experimental Design:
* melanoma tumor sample - sequence it
* matched normal blood sample - sequence it
* seems simple, but takes new tools.
* unique advertisement strategies. (-;

Saved 10 runs of Images alone - more than 100 Tb of storage

Compare Illumina 1.6 v 1.4
* Uniquely aligning read and next_phred
* Didn't explain the results of the graphs shown... missed the point.

Used Eland, partition into bins
* realign with xmatch. (well characterized and scales well.)

In the end, 2 whole genome datasets
* 2 x 100 bp read
* 33 tumour and 24 normal (lanes)
* total runs (5 and 3)
* total alignable reads 1billion/1.2billion

Coverage statistics:
* Greater than 99% covered 1x
* 5x-10x range for variants covered by 94-95%

Method for variant detection
* Most Probable Genotype
* bayesian statistic approach, prior probability of observing a non-ref allele (expected mutation rate)
* Equation given - not going to copy that for html.
* Confidence is the difference between the best call and the next most probable call.
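
[Since I didn't copy the equation, here is a generic sketch of the idea instead. It is NOT the formula from the slide: it is a textbook three-genotype Bayesian caller with a binomial read model. The error rate, the prior, and the way the prior is split between heterozygous and homozygous non-reference calls are all my assumptions, and the score is on a natural-log scale, so it is not directly comparable to the cutoff of 10 mentioned in the talk.]

```python
import math

def genotype_log_posteriors(ref_reads, alt_reads, error_rate=0.01, prior_nonref=0.001):
    """Unnormalized log posterior for each diploid genotype at one site.

    error_rate   : assumed chance a read shows the wrong allele
    prior_nonref : assumed prior probability of a non-reference genotype
    """
    # Chance that a single read shows the alternate allele under each genotype.
    p_alt = {"ref/ref": error_rate, "ref/alt": 0.5, "alt/alt": 1 - error_rate}
    # Assumed split of the non-reference prior: 2/3 heterozygous, 1/3 homozygous.
    prior = {
        "ref/ref": 1 - prior_nonref,
        "ref/alt": prior_nonref * 2 / 3,
        "alt/alt": prior_nonref / 3,
    }
    return {
        g: math.log(prior[g])
        + alt_reads * math.log(p_alt[g])
        + ref_reads * math.log(1 - p_alt[g])
        for g in p_alt
    }

def most_probable_genotype(ref_reads, alt_reads, **kwargs):
    """Return (best genotype, confidence), where confidence is the log-posterior
    gap between the best call and the next most probable call."""
    scores = genotype_log_posteriors(ref_reads, alt_reads, **kwargs)
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return best, scores[best] - scores[runner_up]

for counts in ((30, 0), (15, 15), (5, 5), (0, 30)):
    call, confidence = most_probable_genotype(*counts)
    print(f"ref={counts[0]:2d} alt={counts[1]:2d} -> {call} (confidence {confidence:.1f})")
```

The useful property is the one noted above: the confidence grows with coverage, so a heterozygous call at 15+15 reads scores much higher than the same call at 5+5.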

[This looks VERY much like SNVMix2...]

Graph concordance with percentage called. If you use a cutoff of 10, you get 95% in the normal genome, 90% in the tumor.

Moved from MPG to Most Probable Variant (MPV)
* Compare between the best call and the probability of the reference data.
* improves the quality of the call.

Settings:
* Using MPV greater than 10 (4Million variants)
* Subtract out evidence for germ line or low coverage
** take out high confidence germline variants
** subtract sites where MPG is less than 10, but looks like a variant.
** throw out low confidence somatic variants.
* leaves 189,000 somatic variants (tumour variants)
* also filtering dbsnp
* break into coding/non-coding
* synonymous/non-synonymous
* verify SNVs by sanger sequencing. (75/84 verify) It may be that some of them are there, but not detectable by sanger.

Summary table of SNV pipeline.
* 174,000 non coding variants.

Paper: Local DNA Topography correlates with functional noncoding regions of the human genome.

Impact of SNPs on Local DNA Structure - sometimes this can change the structure a lot.

Use "Chai" to do structure informed evolutionary information
* only about 10,000 overlap "chai" regions
* 2,176 appear to dramatically change DNA shape.

"Chai" spots are "mutation cold spots"
Future plans, look at more tumor normal pairs, and investigate it further.
