[Genbank-bb] GenBank : INSDSeq XML semantic changes for October 15 2006

Mark Cavanaugh cavanaug at ncbi.nlm.nih.gov
Mon Jul 10 16:23:26 EST 2006


Dear GenBank Users,

Although this group isn't specifically intended for discussions about
an XML representation known as INSDSeq, enough GenBank users are
using INSDSeq XML that we feel some recent changes should be
announced here.

(INSD == International Nucleotide Sequence Database == the collaboration
 among DDBJ, EMBL, and GenBank.)

INSDSeq is a collaborative XML DTD for sequence records that all three
members of the INSD support. The current version of the DTD (INSDSeq 1.4)
is still quite reminiscent of the GenBank, EMBL, and DDBJ flatfile
representations. However, additional structure is gradually being
introduced for various data elements, which we hope will prove useful for
XML users. The current DTD can be found at:

	http://www.ncbi.nlm.nih.gov/data_specs/dtd/INSD_INSDSeq.dtd
	http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd

Here is GenBank record M10101 in INSDSeq format:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&list_uids=146274&dopt=gbc

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Several semantic changes in NCBI's generation of INSDSeq 1.4 have been
(or will soon be) made:

1.) The basepair abbreviations for nucleotide sequences in the
    INSDSeq_sequence element have been switched from upper-case
    to lower-case letters.

    This change has already been implemented.

2.) The INSDReference_reference element now contains *only* the serial
    number of a reference.

    Using M10101 as an example, the first reference is:

    <INSDReference>
      <INSDReference_reference>1</INSDReference_reference>
      <INSDReference_position>1768..3531</INSDReference_position>
      <INSDReference_authors>
        <INSDAuthor>Tiedeman,A.A.</INSDAuthor>
        <INSDAuthor>Smith,J.M.</INSDAuthor>
        <INSDAuthor>Zalkin,H.</INSDAuthor>
      </INSDReference_authors>
	....
    </INSDReference>

    Previously, the basepair position was redundantly presented in
    both the INSDReference_reference *and* INSDReference_position
    elements.

    This change has already been implemented.

3.) A tilde character ( ~ ) within INSDSeq_comment #PCDATA values will
    soon be used to indicate a linebreak.

    Doubled-tilde characters ( ~~ ) should be interpreted as a literal,
    single tilde character.

    The need for such a convention can be seen by examining the format
    of the COMMENT section in the GenBank flatfile representation of
    GenBank record AC183761:

	http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=95147495

    The semi-structured, paragraph-oriented nature of the COMMENT can
    be reproduced by XML rendering software with the adoption of
    tilde as a linebreak character.

    This convention for tilde has been in use for ASN.1 data provided
    by NCBI for many years, so its use in INSDSeq seems warranted.

    We expect that this change will be implemented by October 15 2006
    or earlier.

4.) INSDSeq_strandedness will soon be populated for all sequences.

    Currently, double-stranded DNA and single-stranded RNA sequences
    are presented without any INSDSeq_strandedness element. Only when
    the strandedness is something *other* than the defaults that are
    appropriate for DNA/RNA is INSDSeq_strandedness provided.

    This practice will change by October 15 2006, such that a strandedness
    value is always presented, for all sequences.
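
For tool developers, the four changes above could be handled roughly as in
the following Python sketch. It is only an illustration under stated
assumptions, not NCBI code: the function names decode_comment and
read_record are hypothetical, the INSDSeq_moltype element name and the
fallback strings "double" and "single" are assumptions not taken from this
announcement, and the left-to-right reading of three or more consecutive
tildes is one possible interpretation of the convention in item 3.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fallbacks for records generated before the change in item 4.
# The value strings "double" and "single" are assumptions.
DEFAULT_STRANDEDNESS = {"DNA": "double", "RNA": "single"}


def decode_comment(text):
    """Item 3: '~~' becomes a literal '~'; a lone '~' becomes a linebreak."""
    # Scanned left to right, so '~~~' reads as a literal tilde followed by
    # a linebreak (an assumption; the announcement does not cover this case).
    return re.sub(r"~~|~", lambda m: "~" if m.group() == "~~" else "\n", text)


def read_record(xml_text):
    """Return the fields affected by items 1 to 4 for one <INSDSeq> element."""
    seq = ET.fromstring(xml_text)

    # Item 4: use the element when present, else fall back on the molecule type.
    strandedness = seq.findtext("INSDSeq_strandedness")
    if strandedness is None:
        moltype = seq.findtext("INSDSeq_moltype", default="")
        strandedness = DEFAULT_STRANDEDNESS.get(moltype)

    references = []
    for ref in seq.iter("INSDReference"):
        # Item 2: the element now holds only the serial number. Reading just
        # the leading integer also tolerates older output with trailing text.
        raw = ref.findtext("INSDReference_reference", default="")
        match = re.match(r"\s*(\d+)", raw)
        references.append({
            "serial": int(match.group(1)) if match else None,
            "position": ref.findtext("INSDReference_position"),
        })

    return {
        # Item 1: normalize case so older upper-case output compares equal.
        "sequence": seq.findtext("INSDSeq_sequence", default="").lower(),
        "comment": decode_comment(seq.findtext("INSDSeq_comment", default="")),
        "strandedness": strandedness,
        "references": references,
    }
```

Calling read_record on the text of a single INSDSeq element returns a small
dictionary, so the same downstream code can be used for records generated
before and after the changes take effect.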

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

BTW: If you are a developer of software tools, and if the INSDSeq XML
representation is of interest to you (as opposed to the full-blown XML
equivalents of NCBI's ASN.1 specifications), we would like to hear from
you! Please send your suggestions for INSDSeq changes to the NCBI Service
Desk:

	info at ncbi.nlm.nih.gov

Mark Cavanaugh
GenBank
NCBI/NLM/NIH/HHS



