Software to read whole genomes?

mathog at mathog at
Tue Sep 30 12:48:12 EST 1997

In article <u9lo0fyb7a.fsf at>, eddy at (Sean Eddy) writes:
>  >  - the orientation of the read is annotated redundantly in parsable form
>  >    in the DEFINITION field:
>  >        zv37h04.s1 Soares ovary tumor NbHOT Homo sapiens clone 755863 3'
>  >    i.e.:
>  >        <clone plate location>.[s,r]1 <library> <clone ID> [5,3]' 
>  >    where an s1 is a 5' read; r1 is a 3' read.
>oops. The last line is correct; but the previous line is
>wrong/misleading. The .r1 or .s1 indicates the direction of the read.

The EST database is not consistent with respect to clone orientation.

To illustrate this point, I picked a single clone at random from near the
beginning of GB_EST1, with reasonable confidence that it would not conform
to the format you describe above, and fair confidence that it would not
contain orientation information at all. (This is based on prior personal
experience.)  Sure enough, the randomly selected entry U21463 is:

LOCUS       HSU21463      623 bp    mRNA            EST       31-MAR-1995
DEFINITION  Human partial cDNA sequence with CCA repeat region, T3 end of clone
NID         g732488
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 623)
  AUTHORS   Longshore,J.W.
  TITLE     Isolation and Characterization of Human Brain Genes with CCA
            trinucleotide repeats
  JOURNAL   Am. J. Hum. Genet. 55, A264 (1994)
REFERENCE   2  (bases 1 to 623)
  AUTHORS   Han,J., Hsu,C., Zhu,Z., Longshore,J.W. and Finley,W.H.
  TITLE     Over-representation of the disease associated (CAG) and (CGG)
            repeats in the human genome
  JOURNAL   Nucleic Acids Res. 22 (9), 1735-1740 (1994)
  MEDLINE   94261446
REFERENCE   3  (bases 1 to 623)
  AUTHORS   Longshore,J.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-FEB-1995) John W. Longshore, Laboratory of 
            Genetics, University of Alabama, 1720 7th Ave. S., Sparks 442,
            Birmingham, AL 35294-0017, USA
FEATURES             Location/Qualifiers
     source          1..623
                     /organism="Homo sapiens"
                     /note="T3 end of clone"
                     /clone_lib="Stratagene catalog #936205"
                     /dev_stage="2 year old"

Here, deducing the orientation will require a trip to the library. The only
hint is "T3 end of clone", but which end that is depends on the vector, and
the vector is not named except via "Stratagene catalog #936205".

Going back to the example that you cite, which admittedly does contain the
direction information, the .r1/.s1 notation is nice, but it is in an
unparseable format coded into the definition field.  By unparseable, I mean
just that a generalized program that reads Genbank data fields will not be
able to trivially determine forward/reverse, since this information is
contained in a nonstandard format *for the database as a whole* within
another field. 
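
To make the parsing problem concrete, here is a minimal sketch in Python of
what a program would have to do to pull the read suffix and the stated end out
of a DEFINITION line. The regular expression is my own guess at the
convention quoted above (<clone plate location>.[s,r]1 ... [5,3]'), and the
name parse_orientation is hypothetical; neither is part of any GenBank
specification or existing parser.

```python
import re

# Assumed pattern for the convention quoted in this thread:
#   <clone plate location>.[s,r]1 <library> <clone ID> [5,3]'
# This is an illustration, not an official GenBank grammar.
DEFINITION_RE = re.compile(
    r"^(?P<plate>\S+)\.(?P<read>[sr])1\s+(?P<rest>.+?)\s+(?P<end>[53])'\s*$"
)


def parse_orientation(definition):
    """Return the read suffix and stated end, or None if the line does not conform."""
    match = DEFINITION_RE.match(definition.strip())
    if match is None:
        return None
    return {
        "plate": match.group("plate"),
        "read": match.group("read") + "1",  # "s1" or "r1"
        "end": match.group("end") + "'",    # "5'" or "3'"
    }


# The conforming example from the quoted message.
conforming = "zv37h04.s1 Soares ovary tumor NbHOT Homo sapiens clone 755863 3'"
# The DEFINITION line of U21463, which does not follow the convention.
nonconforming = "Human partial cDNA sequence with CCA repeat region, T3 end of clone"

print(parse_orientation(conforming))
print(parse_orientation(nonconforming))
```

The first call yields the suffix and the stated end; the second yields None,
which is all a general-purpose reader can say about U21463 without outside
knowledge of the vector.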

In any case, the biologically relevant direction information is found on
the mRNA line, which says (as it should): 

     mRNA            complement(<1..>413)

It is good that you picked this example to make your point, because here we
have a clone that is pretty clearly inserted in the reverse direction, as
judged by the direction of similarity found to another sequence.
Specifically, the homologies found are (GCG BESTFIT) 

  .s1 (len 413)   CAMK (len 1793)
  413->4     <==> 1385 -> 1793        Percent Similarity: 96.822  

  .r1 (len 617)
  391->617   <==> 1346 -> 1572        Percent Similarity: 98.238

This is why the mRNA line is reversed from the .s1/.r1 lines.  Can you name 
the piece of software that could have read this entry, noted the s1/r1 and
mRNA reversal, and accounted for it?  I doubt such a beast exists, since
the explanation is buried in yet another unparseable field, COMMENT, as:

            Possible reversed clone: similarity on wrong strand
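
For what it is worth, the best a program could do here is a string match on
free text. The sketch below, again in Python, checks whether the mRNA feature
location is a complement and whether the phrase "Possible reversed clone"
appears anywhere in the entry. The function name orientation_flags is
hypothetical, the fragment is assembled only from the two lines quoted in
this post (it is not a complete entry), and the approach assumes the phrase
is spelled exactly this way and is not wrapped across lines.

```python
import re

# Matches a feature-table line such as:
#      mRNA            complement(<1..>413)
# Only leading whitespace may precede "mRNA", so the LOCUS line is not matched.
MRNA_RE = re.compile(r"^\s+mRNA\s+(?P<loc>\S+)", re.MULTILINE)


def orientation_flags(entry_text):
    """Report whether the mRNA feature is a complement and whether the
    free-text reversed-clone warning is present anywhere in the entry."""
    match = MRNA_RE.search(entry_text)
    if match is None:
        mrna_complement = None  # no mRNA feature found at all
    else:
        mrna_complement = match.group("loc").startswith("complement(")
    # Fragile by design: this depends on exact free-text wording.
    flagged = "possible reversed clone" in entry_text.lower()
    return {"mrna_complement": mrna_complement, "flagged_reversed": flagged}


# Fragment built from the two lines quoted in this post, not a full entry.
fragment = (
    "COMMENT     Possible reversed clone: similarity on wrong strand\n"
    "FEATURES             Location/Qualifiers\n"
    "     mRNA            complement(<1..>413)\n"
)

print(orientation_flags(fragment))
```

An entry whose submitter worded the warning differently, or left it out, would
pass through such a check silently.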

The take-home lesson is that not all EST entries contain direction
information, or contain that information in the same format, and even if
they have that information, the orientation may be questionable, and there
is no indication of the reliability of the information presented.  None of 
this matters much if you are working with 10-20 ESTs by hand, but if you 
are trying to process thousands of them, well, have fun.


David Mathog
mathog at
Manager, sequence analysis facility, biology division, Caltech 
