[Genbank-bb] Rel150.fsa_aa missing sequence data
Cavanaugh, Mark (NIH/NLM/NCBI)
cavanaug at ncbi.nlm.nih.gov
Wed Nov 23 17:09:19 EST 2005
With GenBank 150.0, we transitioned from the asn2fast application
to a re-implementation called asn2fsa for FASTA generation.
asn2fsa was tested in a wide variety of contexts before its use.
There is one class of record, dating back to journal-scanning efforts
in the early 1990s, that was not included in any of those contexts.
That class consists of incompletely-sequenced proteins which are
*not* associated with a DNA sequence.
In such cases, the sequenced fragments of the protein are interspersed
with "virtual" Bioseqs, with a known length but no actual protein
sequence data. Here's a link for a sample record:
In the ASN.1, if you search for AAB24990, you will find that it lies
between AAB24989 and AAB24991. But AAB24990 is not a "normal" peptide
fragment like its neighbors:
repr virtual ,
mol aa ,
length 24 ,
This "virtual" bioseq has no actual sequence data. Yet it has a length,
to represent the gap that exists between AAB24989 and AAB24991.
Unfortunately, the newer asn2fsa application is not aware that virtual
Bioseqs should be suppressed for GenBank Release FASTA products.
As a result, the FASTA file contains some entries consisting of just
sequence identifiers and a defline (such as it is), with no residues.
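To gauge how many records are affected, one could scan the FASTA file for
deflines that are not followed by any residue lines. The following is only a
minimal sketch, not an NCBI tool: the function name header_only_entries is
made up for illustration, and it assumes every record starts with a line
beginning with ">".

```python
from typing import Iterable, List


def header_only_entries(lines: Iterable[str]) -> List[str]:
    """Return the deflines of FASTA records that have no residue lines.

    Hypothetical helper for illustration; assumes each record begins
    with a line starting with '>'.
    """
    empty: List[str] = []
    current = None        # defline of the record currently being read
    has_residues = False  # True once a non-blank sequence line is seen
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new record starts: report the previous one if it was empty.
            if current is not None and not has_residues:
                empty.append(current)
            current = line
            has_residues = False
        elif line:
            has_residues = True
    # The last record in the file needs the same check.
    if current is not None and not has_residues:
        empty.append(current)
    return empty
```

For the release file, the lines could come from
gzip.open("rel150.fsa_aa.gz", "rt"), and len() of the returned list gives
the number of header-only entries.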
There are very few "stand-alone" fragmentary proteins like this in
GenBank, and the journal-scanning effort ended many years ago.
We will try to update asn2fsa so that it will ignore virtual protein
bioseqs for GenBank 151.0.
Thanks for reporting the problem. We truly appreciate the scrutiny
of the database that our users provide.
BTW: Other GCG users might find your pre-processing step useful, so
feel free to share it if you wish.
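For readers who need a stopgap before the fix, a pre-processing step of this
kind might look roughly like the sketch below. It is not Garry's actual
script, which was not posted; the function name drop_empty_records is
hypothetical, and the approach (hold each defline back until a residue line
follows it) is just one simple way to do it.

```python
from typing import Iterable, Iterator


def drop_empty_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, omitting records that have a defline but no residues.

    Hypothetical helper for illustration. Blank lines are dropped as well.
    """
    pending_header = None  # defline held back until residues are seen
    for line in lines:
        if line.startswith(">"):
            # Overwriting a still-pending defline discards an empty record.
            pending_header = line
        elif line.strip():
            if pending_header is not None:
                yield pending_header
                pending_header = None
            yield line
    # A defline still pending at end of input had no residues: drop it.
```

Because it is a generator, the file can be streamed: read lines from
gzip.open("rel150.fsa_aa.gz", "rt") and pass the result to writelines() on
an output file, then hand that output to the downstream FASTA tools.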
>From: Garry W. Martin [mailto:gmartin at MendelBio.COM]
>Sent: Wednesday, November 16, 2005 7:32 PM
>To: genbankb at magpie.bio.indiana.edu
>Subject: [Genbank-bb] Rel150.fsa_aa missing sequence data
>The protein sequence file, rel150.fsa_aa.gz, dated Oct 14, 2005,
>for the current GenBank release 150 contains several hundred
>fasta entries that have fasta headers but no peptide residue data.
>For examples, see any of these entries in the rel150.fsa_aa.gz file:
> >gi|263833|gb|AAB24990.1| No definition line found
> >gi|263835|gb|AAB24992.1| No definition line found
> >gi|263837|gb|AAB24994.1| No definition line found
> >gi|263839|gb|AAB24996.1| No definition line found
> >gi|263841|gb|AAB24998.1| No definition line found
>When we attempted to process the file with existing GCG v10.3 fasta
>file handling utilities (e.g., fastatogcg), those programs became
>confused because they assume that there will be at least one line of
>sequence data following each sequence header. We had to remove the
>null sequence entries with a preprocessing step in order to complete
>the installation of release 150.
>We have been processing each GenBank protein release in this way
>for about seven years and this is the first time we have seen this.
>Mendel Biotechnology, Inc.