retrieving a list of proteins together with their DNA CDS

Gill i at read.the.group
Wed May 16 04:44:52 EST 2001



Hi,

Given a (long) list of swissprot accession numbers, I want to
retrieve by script:
a. all their sequences in fasta format.
b. all their matching DNA coding sequences in fasta format.

I know how to perform a.
(at http://www.expasy.ch/sprot/sprot-retrieve-list.html 
 or through the swissprot flat file )

I also see that each _individual_ swissprot protein entry is linked to
EMBL / GenBank / DDBJ for what they refer to as its "NOT_ANNOTATED_CDS".
(the first field under "Cross-references")

But how can I automate this DNA CDS retrieval?
I can imagine two ways (but don't know how to perform either):

a. find a swissprot flat file containing both side by side.
b. 1. transform swissprot accession into genbank accession
      (eg P02906 -> X13380).
   2. retrieve a list of "NOT_ANNOTATED_CDS" by their genbank accession
     numbers.

While b. seems more realistic:
- I don't know how to perform step 1.
  for a large group of swissprot accession numbers.
  (the swissprot flatfile does not seem to contain the genbank cross ref)
- The best I could find, given a list of genbank accession numbers, is how
  to retrieve <= 50 at a time ( http://www.ebi.ac.uk/cgi-bin/emblfetch ),
  but my lists are much longer...
  (and I'm not even sure that web page returns only the "NOT_ANNOTATED_CDS")
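
For the <= 50 limit, the obvious workaround would be to split the long list
into batches and submit one request per batch. A minimal sketch in Python,
assuming the limit really is 50 accessions per request. The function name
`batches` and all sample accessions except X13380 are made up for
illustration, and the actual request to emblfetch is left out because I
don't know its query parameters:

```python
from typing import Iterator, List

# Assumption: the retrieval page accepts at most 50 accessions per request.
BATCH_SIZE = 50


def batches(accessions: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices holding at most `size` accessions each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(accessions), size):
        yield accessions[start:start + size]


if __name__ == "__main__":
    # Hypothetical input: X13380 is from the example above, the rest are
    # placeholder strings standing in for a real accession list.
    accessions = ["X13380"] + ["PLACEHOLDER%04d" % i for i in range(119)]
    for number, batch in enumerate(batches(accessions), start=1):
        # Each batch would be submitted as one request here.
        print(number, len(batch), batch[0], batch[-1])
```

This only solves the list-length part; it says nothing about whether what
comes back is restricted to the CDS.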

Help?

Thanks,
-Gill




