AGBT 2010 - Brian Haas - Broad
Genome annotation using mRNA-Seq: A case study of Schizosaccharomyces pombe
Leverage evidence for genome annotation
* eg, 3 ab initio gene predictions
Major challenge:
* lack of high quality evidence
* this is changing with NGS.
* we now have evidence - but we need to standardize and develop algorithms
* reconstructing transcripts is difficult
Approach 1: de novo assembly
* treat the assembled contigs like ESTs
* align to genome
Approach 2: align reads to genome
* reconstruct based on alignments
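[Illustration, not from the talk: a minimal sketch of one ingredient of approach 2, pulling intron coordinates out of spliced read alignments. It assumes SAM-style CIGAR strings, where an `N` operation marks a skipped reference region (an intron, for a spliced aligner). The function name `introns_from_cigar` and the 0-based, half-open coordinates are my own choices.]

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(ref_start, cigar):
    """Return 0-based, half-open (start, end) reference intervals for each
    N (skipped region) operation in a SAM CIGAR string."""
    introns = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((pos, pos + length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            pos += length
    return introns


if __name__ == "__main__":
    # A 36 bp read split across a 50 bp intron.
    print(introns_from_cigar(100, "20M50N16M"))
```

Counting how many reads support each intron interval would then give a simple measure of intron support per annotated gene.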
Sequencing genomes from Schizosaccharomyces
* S. pombe is a model organism - sequenced in 2002
* 12.5Mb, 5k genes, avg gene 1,489 bp
* genome should be well annotated, good quality annotations
Seq:
* 44M reads, 65% aligned (Maq)
* align to genome - look good
* challenge is to bring it to high quality automated state
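[Illustration, not from the talk: back-of-envelope arithmetic on the numbers above. It treats "5k genes" as exactly 5,000 and assumes, unrealistically, that every aligned read falls in a gene and that reads are spread evenly, so the per-gene figure is only an order-of-magnitude guide.]

```python
# Figures as quoted in the notes; "5k genes" is taken as exactly 5,000.
genome_bp = 12_500_000
n_genes = 5_000
avg_gene_bp = 1_489
total_reads = 44_000_000
aligned_fraction = 0.65

# Share of the genome covered by genes, ignoring overlaps.
genic_bp = n_genes * avg_gene_bp
genic_fraction = genic_bp / genome_bp

# Reads that aligned, and a crude even-spread average per gene.
aligned_reads = round(total_reads * aligned_fraction)
reads_per_gene = aligned_reads / n_genes

if __name__ == "__main__":
    print(f"genic fraction: {genic_fraction:.1%}")
    print(f"aligned reads: {aligned_reads:,}")
    print(f"aligned reads per gene (even spread): {reads_per_gene:,.0f}")
```

That works out to roughly 60% of the genome being genic, about 28.6M aligned reads, and on the order of 5,700 aligned reads per gene on average. Real coverage will be far from even, since expression levels vary widely.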
Align: Use TopHat for short read alignment + Cufflinks
Assemble: Velvet/Ananas + GMAP
ELT structures transferred into PASA, which does refinement, finds alt splicing and validates existing annotations
This is all exploration - This is NOT a tool Bake off.
Elts: Velvet (21167), Cufflinks (4158), Ananas (8309)
Almost all alignments to genome were perfect.
Then, test how many assemblies reconstruct full-length genes: Ananas did best, Cufflinks 2nd best, Velvet only 1/3 of those done by Ananas.
* Velvet did very well with supporting introns
Problems:
* readthrough and encroachment
* again, Ananas did best, Velvet 2nd best, Cufflinks worst (by a long shot).
Examples given.
* Velvet seems to give fractionated transcripts: it breaks where coverage is high. [Probably seq errors are causing it to break?]
* some annotations needed to be extended
* corrected genes - merging two genes that are really one.
Compare:
* none of these methods are great - they're all missing some that others caught.
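[Illustration, not from the talk: a minimal sketch of how "each tool misses some that others caught" could be tabulated with set operations. The gene IDs are made up and the function name `compare_recovery` is my own; no per-gene results were shown in a form I could copy down.]

```python
def compare_recovery(recovered):
    """For each tool, list genes it missed that some tool recovered,
    and genes that only this tool recovered.

    `recovered` maps a tool name to the set of gene IDs it reconstructed.
    """
    all_genes = set().union(*recovered.values())
    report = {}
    for tool, genes in recovered.items():
        # Union of everything the other tools recovered.
        others = set().union(
            *(g for t, g in recovered.items() if t != tool)
        )
        report[tool] = {
            "missed": sorted(all_genes - genes),
            "unique": sorted(genes - others),
        }
    return report


if __name__ == "__main__":
    # Hypothetical gene IDs, for illustration only.
    toy = {
        "Velvet": {"g1", "g2"},
        "Cufflinks": {"g2", "g3"},
        "Ananas": {"g1", "g3", "g4"},
    }
    for tool, result in compare_recovery(toy).items():
        print(tool, result)
```

A non-empty "missed" list for every tool is the situation described above, and it is also the argument for combining the outputs rather than picking a single winner.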
Challenges:
* some well covered genomic loci not fully reconstructed (paralogs?)
* intron readthrough/encroachment
* incorrectly merged genes/transcripts
* UTR structures and alt splicing.
For well covered genomic loci not fully reconstructed
* identify disjoint regions
* collect reads and assemble independently
* genome directed to avoid misassembly
* very fast to do this
* This helps, but still have a long way to go.
* more tuning needed (expect to get up to 90%)
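[Illustration, not from the talk: a minimal sketch of the "identify disjoint regions, collect reads" step, assuming the reads are already aligned to one chromosome and given as (read_id, start, end) with 0-based, half-open coordinates. The function name `partition_reads` and the `max_gap` parameter are my own; the actual implementation was not described. Each resulting bin of reads would then be handed to an assembler separately, which is not shown here.]

```python
def partition_reads(alignments, max_gap=0):
    """Group aligned reads into disjoint covered regions.

    alignments: iterable of (read_id, start, end), 0-based half-open.
    Reads that overlap, touch, or lie within `max_gap` bases of the
    current region are merged into it.
    Returns a list of dicts with keys "start", "end", "reads".
    """
    regions = []
    for read_id, start, end in sorted(alignments, key=lambda a: (a[1], a[2])):
        if regions and start <= regions[-1]["end"] + max_gap:
            # Extend the current region and add the read to its bin.
            regions[-1]["end"] = max(regions[-1]["end"], end)
            regions[-1]["reads"].append(read_id)
        else:
            # Gap in coverage: start a new, independent region.
            regions.append({"start": start, "end": end, "reads": [read_id]})
    return regions


if __name__ == "__main__":
    reads = [
        ("r1", 0, 36), ("r2", 20, 56), ("r5", 50, 86),
        ("r3", 200, 236), ("r4", 230, 266),
    ]
    for region in partition_reads(reads):
        print(region)
```

Because it is one sort plus one linear pass, a partition like this is cheap, which fits the remark that the step is very fast. Keeping each bin separate is also what keeps reads from unrelated loci out of the same assembly.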
Dissecting merged transcripts.
* use coverage based assembly clipping - break up transcripts
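[Illustration, not from the talk: a minimal sketch of coverage-based clipping, assuming per-base read depth along a merged transcript and a fixed depth threshold. The real clipping criterion was not given; `clip_by_coverage`, `min_cov` and `min_segment` are my own names.]

```python
def clip_by_coverage(coverage, min_cov, min_segment=1):
    """Split a transcript wherever per-base depth drops below `min_cov`.

    coverage: list of read depths, one per transcript base.
    Returns 0-based, half-open (start, end) segments whose depth stays
    at or above `min_cov` and whose length is at least `min_segment`.
    """
    segments = []
    seg_start = None
    for i, depth in enumerate(coverage):
        if depth >= min_cov:
            if seg_start is None:
                seg_start = i
        elif seg_start is not None:
            # Coverage dipped: close the current segment here.
            if i - seg_start >= min_segment:
                segments.append((seg_start, i))
            seg_start = None
    # Close a segment that runs to the end of the transcript.
    if seg_start is not None and len(coverage) - seg_start >= min_segment:
        segments.append((seg_start, len(coverage)))
    return segments


if __name__ == "__main__":
    # Two well-covered stretches joined by a thin bridge of reads.
    depth = [30, 32, 31, 2, 1, 28, 27, 29]
    print(clip_by_coverage(depth, min_cov=5))
```

A fixed threshold is the simplest choice. Two neighbouring genes expressed at very different levels would probably need a threshold relative to the flanking depth instead.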
Technology will greatly facilitate efforts
* Use stranded mRNA-seq
Summary:
* the information from mRNA-seq is needed for high throughput annotation
* current tools show progress
* still much more to be done in optimization
* need for optimized methods for ALL types of genomes.
Labels: AGBT 2010
4 Comments:
Hi... not familiar with Ananas. Do you have a reference or link to this software? Thanks!
Sorry, I don't have much information on Ananas. I believe it is an in-house development project at the Broad. The only thing I can suggest is to contact Brian Haas about it.
I wonder how Oases (by Zerbino et al., in the making) compares to the other tools. I guess that it should do better, as it will still handle indels (like Velvet) but have fewer issues with partitioned transcripts.
Dr. Haas was very clear that this wasn't intended to be a tool bake-off - just a test to see how well annotation could be done from RNA-Seq. I'm sure there's room for someone to do the tool comparison - must be a good paper in there somewhere.