On the usefulness of sequences in the databases

Jim Ostell ostell at object.nlm.nih.gov
Fri Jun 25 07:58:02 EST 1993

Patents in the sequence databases:

There has been some confusion about the purpose, meaning, and quality of the
patent sequences appearing in the sequence databases. To give a little
background, the European, U.S., and Japanese Patent Offices have made an
agreement to exchange patent-related sequences with each other and to make
them available to the public. Each office has made its own arrangements for
capturing its patent sequence data. In the U.S., the NCBI is working with
the U.S. Patent and Trademark Office (USPTO) to enter the backlog of patents,
gather the new patents, and distribute them in GenBank.

Speaking now only for the U.S. effort, most of the backlog has been entered.
In the last two releases of GenBank, NCBI has included patent sequences in a
separate division of the database, "PAT". The sequences come jointly from the
data capture efforts of EMBL for European patents and the NCBI for the U.S.
patents.  The agreement with the USPTO was to take only DNA sequences at
least 10 residues long. Because a patent is a legal, not a scientific,
document, it is often difficult or impossible to reliably capture all the
information of biological interest such as features or even the organism
name. Further, it is often difficult or impossible, except by careful legal
scrutiny, to determine what sequence (if any) is claimed to be patented and
what is just an "exhibit" or associated information. Finally, the sequences
themselves may be presented in a sufficiently complex manner that it can be
hard to determine what the sequence itself is in some regions.

In consultation with the USPTO, then, it became clear that patent sequences could
not be guaranteed to be a source of new biological data. We took the approach
instead that the purpose of the patent sequences was to serve as a search
key back to the original patent.  To achieve this, we entered all sequences
in the patent which met the length limits and associated them with the
patent document. No attempt was made to biologically annotate them or
determine what was claimed by the patent. Their most valuable use is to
search by sequence similarity to discover patents that may overlap with a
possible claim made on a new sequence.  The other use would be to find the
sequences relevant to a given patent.
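
As a rough illustration of the "search key" idea described above, here is a
minimal Python sketch. It is not how NCBI or the USPTO actually store or
search these entries. The function names, the patent identifiers, and the
input layout (a dict from patent id to a list of DNA strings) are all
hypothetical, and exact substring matching stands in for a real
sequence-similarity search. Only the 10-residue minimum comes from the
agreement mentioned earlier.

```python
from collections import defaultdict

# Minimum length taken from the agreement described in the text.
MIN_LENGTH = 10


def build_patent_index(patents):
    """Map each qualifying sequence to the set of patents it appears in.

    `patents` is a hypothetical dict: patent id -> list of DNA strings.
    No biological annotation is attached; the sequence is only a key.
    """
    index = defaultdict(set)
    for patent_id, sequences in patents.items():
        for seq in sequences:
            seq = seq.upper()
            if len(seq) >= MIN_LENGTH:  # shorter sequences are not entered
                index[seq].add(patent_id)
    return index


def find_patents(index, query):
    """Return sorted ids of patents with a sequence overlapping the query.

    Assumption: substring containment (in either direction) is used here
    as a crude stand-in for a proper similarity search.
    """
    query = query.upper()
    hits = set()
    for seq, patent_ids in index.items():
        if query in seq or seq in query:
            hits |= patent_ids
    return sorted(hits)


def sequences_for_patent(index, patent_id):
    """Return the sorted sequences associated with one patent."""
    return sorted(seq for seq, ids in index.items() if patent_id in ids)
```

The two lookup functions correspond to the two uses named above: going from
a new sequence to the patents it may overlap, and going from a patent to
its sequences.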

Since these uses are so specialized, we felt it was important to keep these
sequences in a separate division. Note that it is typical for a patent to
refer to published sequences or for sequence data to be published after a
patent is awarded. So most of these sequences have cognates in the usual
divisions of the database.

As the sequences directly submitted electronically to the USPTO as part of
the patent application process become available, it is possible that they
will be more fully annotated biologically by the submitter. However, since
the immediate submitter is usually a lawyer's office and since such annotation
is not required for the patent to be considered, do not expect too much.

We hope this clarifies the attributes of patented sequence entries.

  Jim Ostell
