Dear arabidopsis genomics community,
De novo reconstruction of genes for this model plant, from 3 RNA sources
with EvidentialGene methods, without use of chromosomes or other species' genes,
is accurate. These At_Evigene genes and comparisons are available as sequences,
chromosome locations, genome map views and BLAST searches at
Comparisons to Arabidopsis gene sets from other methods, including Pac-Bio RNA
sequencing, Trinity-Illumina RNA assembly, and genome gene models, find that
the Evigene methods surpass these in accuracy of matching the
At_Araport official gene set, related plant genes, and expressed introns.
For the 90% of genes expressed in the RNA samples used, 90-95% have
essentially the same coding sequences as At_Araport genes, and 93% of introns
are common to both gene sets, plus 3% of introns found only in At_Evigene. A
larger set of alternate transcripts is recovered in At_Evigene than in
At_Araport. There are also 100 additional loci not in the Araport set that look
like good genes to me, and another 80 putative loci that do not map well to
At_TAIR10 chromosomes, but map well to At_Ler chromosomes. Any of you interested
in more on At_Evigene genes that differ from At_Araport, contact me for details.
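
For readers who want to tally intron agreement between two gene sets
themselves, here is a minimal Python sketch. It is not the Evigene code. It
assumes each intron is identified by chromosome, start, end and strand, and
that percentages are taken over the union of introns from both sets; the
announcement does not state either detail. The `Intron` type, the
`intron_sharing` function and the toy coordinates are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class Intron(NamedTuple):
    """One intron location (hypothetical record layout)."""
    chrom: str
    start: int
    end: int
    strand: str


def intron_sharing(set_a: Iterable[Intron], set_b: Iterable[Intron]) -> Dict[str, float]:
    """Percent of introns shared, only in A, and only in B.

    Assumption: percentages are relative to the union of both sets,
    and introns match only on identical chrom/start/end/strand.
    """
    a, b = set(set_a), set(set_b)
    union = a | b
    if not union:
        return {"shared": 0.0, "only_a": 0.0, "only_b": 0.0}
    n = len(union)
    return {
        "shared": 100.0 * len(a & b) / n,
        "only_a": 100.0 * len(a - b) / n,
        "only_b": 100.0 * len(b - a) / n,
    }


# Toy data, invented for illustration only.
evigene = [
    Intron("Chr1", 100, 200, "+"),
    Intron("Chr1", 300, 400, "+"),
    Intron("Chr2", 50, 150, "-"),
]
araport = [
    Intron("Chr1", 100, 200, "+"),
    Intron("Chr1", 300, 400, "+"),
    Intron("Chr3", 10, 90, "+"),
]

result = intron_sharing(evigene, araport)
print(result)
```

With real data, the intron lists would come from the mapped gene annotations
of each set; exact-coordinate matching is the strictest choice and could be
relaxed if small boundary differences should count as agreement.
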
There are now a few public Pac-Bio RNA gene sets, and publications suggesting
that genes from single-molecule sequencing may be more accurate than genes from
Illumina short reads. My objective comparison for 3 plants (Arabidopsis, Zea
mays corn, and pine trees) gives a different result: fully
assembled Illumina RNA produces the more accurate gene sets, including for loci
where both methods recover some transcripts, with better alternate and paralog
transcript reconstruction. Evigene's RNA-only constructions also often surpass
the accuracy of genome-modeled gene sets.
Who should consider EvidentialGene for gene reconstruction?
* genomicists who want accurate, complete and objectively reconstructed genes,
including those of you who may not believe my claims, but will look at
objective results on this.
* model and well-supported genome projects, where curators can use these
to improve precision of high value gene information.
* new species genomes, for use as a primary gene set with alternate transcripts,
and/or to assess gene predictions and chromosome assemblies for accuracy.
* gene/genome improvement projects, to add alternate transcripts and
undiscovered genes, and to repair fragmented gene models.
* transcriptome and expression projects for more accurate genes.
Reconstruction from RNA only provides independent gene evidence, free of
errors and biases from chromosome assemblies and other species' gene sets. Not
only are the easy, well-known ortholog genes reconstructed well, but harder
cases such as alternate transcripts, paralogs, and genes with complex structure
are usually more complete with Evigene methods.
One of my goals with this work is to reconstruct many high-value (model,
otherwise) animal and plant gene sets in the coming years. I welcome
collaborations, especially from groups with genomics + informatics
expertise. This methodology is highly automatable (think BIG DATA), but still
needs improvement. Species genes built with Evigene by independent
authors include a range of plants and animals, and several of these papers
provide independent reviews of Evigene versus other methods.
-- Don Gilbert
gilbertd @ indiana.edu
-- gilbertd from indiana.edu--http://marmot.bio.indiana.edu/