Extraction of portions of GenBank flatfile entries

Peter Rice nospam at theebi.ac.uk
Fri Jan 31 08:10:07 EST 2003

Constantinos G. Crambis wrote:
>  Is anyone aware of any software (not commercial), that can be used
>  for extraction of portions of the
>  flatfiles ? (I think that's how they call the GenBank entries you get
>  after a search) . For example if in a database flatfile entry,
>  you have a reference to coding sequence as " CDS : 235 .. 1500 bp" ,
>  is there a software that can find the keyword "CDS"
>  in the flatfile, and then read and return the string composed of the
>  letters a c g t, that is between the numbers 235..1500
>  in the sequence at the end of the file ?

Traditionally, the answer to this is Thure Etzold's SRS (free to
academics) ... You can run your query on the EBI's SRS website
srs.ebi.ac.uk (see below for details).

I am currently adding exactly this functionality to the EMBOSS suite.

If you would be interested in sequence analysis applications that do
these kinds of queries, EMBOSS (which is free software) could be useful
for you.

>  I am particularly
>  interested, to extract promoter regions from whole gene entries of
>  GenBank.

Ah ... this is a little different ... you need sequences 5' of the CDS
start. EMBOSS will offer this... but in a future release. For now, you
need to extract the start and end positions yourself and put them into
the sequence query, but a simple Perl script would be all you need.
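
As a rough illustration of that kind of script, here is a minimal sketch
(in Python rather than Perl). It assumes a single GenBank flatfile entry,
a CDS with a simple forward location such as 235..1500 (no join(),
complement() or partial "<" / ">" markers, which are silently skipped),
and a sequence listed after the ORIGIN line. The function names and the
demo entry are made up for the example.

```python
import re

def read_flatfile(text):
    """Return (list of (start, end) CDS locations, sequence) for one entry."""
    cds = []
    seq_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # end of the entry
        if in_origin:
            # Sequence lines carry a position number and spaces; keep letters only.
            seq_parts.append(re.sub(r"[^a-zA-Z]", "", line))
        else:
            # Assumption: only simple "start..end" CDS locations are recognised.
            m = re.match(r"\s+CDS\s+(\d+)\.\.(\d+)\s*$", line)
            if m:
                cds.append((int(m.group(1)), int(m.group(2))))
    return cds, "".join(seq_parts).lower()

def extract(seq, start, end):
    """Slice 1-based, inclusive flatfile coordinates out of the sequence."""
    return seq[start - 1:end]

def upstream(seq, start, length):
    """Return up to `length` bases immediately 5' of position `start`."""
    return seq[max(0, start - 1 - length):start - 1]

# Hypothetical toy entry, just to show the calls.
demo_entry = """LOCUS       TEST
FEATURES             Location/Qualifiers
     CDS             5..10
ORIGIN
        1 aaaaccgggt tt
//
"""

locations, sequence = read_flatfile(demo_entry)
for start, end in locations:
    print("CDS", start, end, extract(sequence, start, end))
    print("upstream", upstream(sequence, start, 3))
```

For a promoter region you would call upstream() with however many bases
5' of the CDS start you want; the slice is simply cut short if the entry
does not extend that far.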

Earlier versions of SRS offered it, but that was in the days when each
exon was a separate feature (the old feature table format, as we used to
call it). In SRS it was used mainly for splice site extraction.

In SRS ... you query for CDS features in the organism of interest. If
you click on the CDS link you get the sequence, even if it is spliced,
for example

Quick guide to SRS:
Start (a new session)
Select EMBL (it's the same as GenBank for content, of course :-) by
clicking the box.
Select "Extended Query Form" so you can see what you are doing

Organism: arabidopsis thaliana (no quotes)
FtKey select "cds"
Gene or Product : put in a gene name if you think it will help you to
find a gene of interest (and if EMBL/GenBank have it annotated that way)
FtLength - could help to narrow a search - if your case was real and in
Arabidopsis, you could say >= 1266 and <= 1266 (but I didn't find it)
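
The 1266 above is just the length of 235..1500 counted inclusively
(1500 - 235 + 1). A small sketch of that arithmetic in Python, assuming
the location is written as one or more start..end ranges; single-base
positions and anything more exotic are not handled, and the function
name is only illustrative:

```python
import re

def feature_length(location):
    """Total bases covered by a location such as '235..1500' or 'join(1..10,21..30)'."""
    total = 0
    # Each start..end range is inclusive at both ends, hence the + 1.
    for start, end in re.findall(r"(\d+)\.\.(\d+)", location):
        total += int(end) - int(start) + 1
    return total

print(feature_length("235..1500"))
```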

In the results, click on a hit to see the feature with the DNA sequence,
and to link to the protein in SwissProt or SpTrembl.

There are many ways to play with the sequence and entry formatting in
SRS and in EMBOSS, but, like Fermat, I do not have the space to write
them here!!!

Hope this helps

Peter Rice
pmr at ebi.ac.uk


More information about the Arab-gen mailing list