Databank updates

Peter Rice pmr at sanger.ac.uk
Tue Apr 11 03:34:11 EST 1995


In article <3mdab2$s77 at zaphod.crihan.fr> risler at cgmvax.cgm.cnrs-gif.fr (J-L Risler) writes:
>   As most of you are aware of, the increase in the EBI or Genbank sizes is 
>   not a minor problem...
>   Here I maintain a GCG-formatted version of the EBI databank (I exclude 
>   the EST division, because ESTs are mostly used in BLAST searches at 
>   remote sites).
>   It appears that today, the cumulated weekly updates since the last CD-ROM 
>   release of EBI is as large as the last release itself....
>   .....
>   My question is: do you know of an efficient program which, starting from 
>   the EBI flat file and the weekly updates flat files, will remove the 
>   redundancies and keep the last updated one, and possibly remove the ESTs 
>   from the updates?

Since you obviously have GCG installed locally, you could simply modify
their embltogcg program to exclude the new ESTs. They all have the
intended division on the first (ID) line where GCG picks up the entryname,
so you just need to look for " EST;" there and skip those entries.
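
If you would rather not touch the GCG source, the same test can be run as a filter over the flat file before formatting. What follows is only a rough sketch in Python, not GCG's embltogcg code. It assumes each entry starts with an ID line and ends with a "//" line; the function names (read_entries, is_est, drop_ests) are made up for illustration.

```python
from typing import Iterable, Iterator, List


def read_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one entry at a time as a list of lines, '//' terminator included."""
    entry: List[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []
    if entry:  # trailing lines with no terminator are passed through as-is
        yield entry


def is_est(entry: List[str]) -> bool:
    """True if the entry's ID line names the EST division (assumed ' EST;')."""
    for line in entry:
        if line.startswith("ID"):
            return " EST;" in line
    return False


def drop_ests(lines: Iterable[str]) -> Iterator[str]:
    """Copy the flat file through, skipping every entry flagged as an EST."""
    for entry in read_entries(lines):
        if not is_est(entry):
            yield from entry
```

The output of drop_ests can be written to a new file and handed to the usual formatting step unchanged.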

As for excluding duplicates, that is trickier. Do you really want to
remove the old entries from the "latest release" you have GCG formatted?
Or do you want to exclude updates from the new data? You could take only
entries with "(Rel. 43, Created)" on the first DT line, for example. Those
should not be in the last release, though there could be merged entries
around.
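
As a rough illustration of that DT-line test, here is a Python sketch. It assumes entries end with "//" and that the creation date is on the first line starting with "DT"; the function names and the release argument are hypothetical, and merged entries are not handled.

```python
from typing import Iterable, Iterator, List


def read_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one entry at a time as a list of lines, '//' terminator included."""
    entry: List[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []
    if entry:
        yield entry


def created_in_release(entry: List[str], release: int) -> bool:
    """True if the first DT line says the entry was created in this release."""
    for line in entry:
        if line.startswith("DT"):
            # Only the first DT line is checked; later ones record updates.
            return f"(Rel. {release}, Created)" in line
    return False


def new_entries_only(lines: Iterable[str], release: int) -> Iterator[str]:
    """Keep only entries created in the given release, dropping updated ones."""
    for entry in read_entries(lines):
        if created_in_release(entry, release):
            yield from entry
```

Calling new_entries_only(open_file, 43) would then pass through only the entries that cannot already be in the formatted release.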

Or perhaps you just want to merge updates. I have a perl script to merge
the weekly update files and only take the newest versions, but I stopped
working on it because I now just pick up the cumulative update file each
time (well, I am on the same campus as the EBI so it doesn't strain the
network :-)
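
For anyone who wants to try the merge themselves, the idea can be sketched as below. This is not the perl script mentioned above, just a minimal Python illustration under two assumptions: the weekly files are read oldest first, and the entry name (second field of the ID line) is enough to recognise the same entry in two files. All function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def read_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one entry at a time as a list of lines, '//' terminator included."""
    entry: List[str] = []
    for line in lines:
        entry.append(line)
        if line.startswith("//"):
            yield entry
            entry = []
    if entry:
        yield entry


def entry_name(entry: List[str]) -> Optional[str]:
    """Return the entry name from the ID line, or None if there is no ID line."""
    for line in entry:
        if line.startswith("ID"):
            fields = line.split()
            if len(fields) > 1:
                return fields[1]
    return None


def merge_updates(files_oldest_first: Iterable[Iterable[str]]) -> Dict[str, List[str]]:
    """Map each entry name to its most recent version across all update files."""
    newest: Dict[str, List[str]] = {}
    for lines in files_oldest_first:
        for entry in read_entries(lines):
            name = entry_name(entry)
            if name is not None:
                newest[name] = entry  # a later file overwrites an earlier one
    return newest
```

Writing out the values of the returned dictionary gives one merged update file with a single, newest copy of each entry. Holding everything in memory is a simplification that may not suit a full set of updates.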

--
------------------------------------------------------------------------
Peter Rice                           | Informatics Division
E-mail: pmr at sanger.ac.uk             | The Sanger Centre
Tel: (44) 1223 494967                | Hinxton Hall, Hinxton,
Fax: (44) 1223 494919                | Cambs, CB10 1RQ
URL: http://www.sanger.ac.uk/~pmr/   | England



