matching the complete Oxford English Dictionary against SwissProt

Gaston Gonnet gonnet at inf.ethz.ch
Mon Nov 30 03:21:03 EST 1992


This research finds/breaks two records:

  (a) The longest word which appears in the OED and as a protein sequence

  (b) The most useless piece of information (simultaneously in lexicography
	and computational biochemistry!)

The complete Oxford English Dictionary, second edition (20 volumes in
paper form) has 572728830 characters and is remarkably close to the human
genome in amount of information.  The longest matches, of any word in
the entire dictionary, against SwissProt version 23 are 9 characters
long (disappointingly short) and are the words:

ENSILISTS: (1 occurrence in SP, 1 in OED)

   OED:
   ensilist. [f. ensile + -ist.] One who preserves his crops by ensilage.
   1883 Hibernia July 103/2 Concrete has been adopted by many ensilists.

   SP:
   <E><AC>P17222;</AC>
   <DE>PRRB PROTEIN.</DE>
   <OS>ESCHERICHIA COLI.</OS>
   <SEQ>MSELSYLEKLMDGVEVEWLPLSKVFNLRNGYTPSKTKKEFWANGDIPWFRMDDIRENGRILGSSLQKISSC
	AVKGGKLFPENSILISTSATIGEHALITVPHLANQRFTCLALKESYADCFDIKFLFYYCFSLAEWCRKNTT
		 ^^^^^^^^^
	MSSFASVDMDGFKKFLIPRPCPDNPEKSLAIQSEIVRILDKFSALTAELTAELTAELSMRKKQYNYYRDQL
	LSFKEDEVEGKRKTLGEIMKMRAGQHISAHNIIERKEESYIYPCFGGNGIRGYVKEKSHDGEHLLIGRQGA
	LCGNVQRMKGQFYATEHAVVVSVMPGINIDWAFHMLTAMNLNQYASKSAQPGLAVGKLQELKLFVPSIERQ
	IYIAAILDKFDTLTNSITEVSRVKSSCARNSTNIIEICYLVSRSRK</SEQ></E>


HIDALGISM: (1 occurrence in SP, 4 in OED)

   OED:
   . . . Hence hidalgoish a., resembling or characteristic of a hidalgo.
   hidalgoism (hidalgism), the practice or manners of a hidalgo.

   SP:
   <E><AC>P03700;</AC>
   <DE>INTEGRASE.</DE>
   <OS>BACTERIOPHAGE LAMBDA.</OS>
   <KW>DNA RECOMBINATION; DNA INTEGRATION.</KW>
   <SEQ>MGRRRSHERRDLPPNLYIRNNGYYCYRDPRTGKEFGLGRDRRIAITEAIQANIELFSGHKHKPLTARINSD
	NSVTLHSWLDRYEKILASRGIKQKTLINYMSKIKAIRRGLPDAPLEDITTKEIAAMLNGYIDEGKAASAKL
	IRSTLSDAFREAIAEGHITTNHVAATRAAKSEVRRSRLTADEYLKIYQAAESSPCWLRLAMELAVVTGQRV
	GDLCEMKWSDIVDGYLYVEQSKTGVKIAIPTALHIDALGISMKETLDKCKEILGGETIIASTRREPLSSGT
					 ^^^^^^^^^
	VSRYFMRARKASGLSFEGDPPTFHELRSLSARLYEKQISDKFAQHLLGHKSDTMASQYRDDRGREWDKIEI
	K</SEQ></E>

This was done as an exercise in data structures and searching.  The
data structures used were Pat arrays.  The algorithm is similar to, but
simpler than, all-against-all matching.  It took 23 minutes on my
workstation, just about as much time as it takes to read the entire
572 MB of dictionary.
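
For readers who want to try the idea on a small scale, here is a minimal
sketch in Python.  It is not the program used for the run above.  It
assumes a Pat array can be stood in for by a plain suffix array (a sorted
list of suffix start positions), builds it naively, and looks up each
dictionary word by binary search.  The function names and the tiny test
strings are made up for illustration; the two sequence fragments are cut
from the SwissProt entries quoted earlier.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction (sorts whole suffixes); fine for a toy example,
    far too slow and memory-hungry for a real database.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, word):
    """Return the sorted positions where word occurs in text."""
    k = len(word)
    lo, hi = 0, len(sa)
    # Binary search for the first suffix whose first k characters are >= word.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < word:
            lo = mid + 1
        else:
            hi = mid
    # All matching suffixes are adjacent in the array; collect them.
    hits = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + k] == word:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def longest_matches(words, text, sa):
    """Return (length, words) for the longest words found in text."""
    best_len, best = 0, []
    for w in words:
        w = w.upper()  # sequences are stored in upper case
        if len(w) < best_len or not find_occurrences(text, sa, w):
            continue
        if len(w) > best_len:
            best_len, best = len(w), []
        if w not in best:
            best.append(w)
    return best_len, sorted(best)


# Two fragments of the sequences shown above, joined by a newline so that
# no match can run across the boundary between them.
proteins = "GGKLFPENSILISTSATIGEH\nTALHIDALGISMKETLD"
index = build_suffix_array(proteins)
words = ["ensilists", "hidalgism", "list", "dictionary"]
print(longest_matches(words, proteins, index))
```

Looking up the words one at a time like this costs one binary search per
word.  A real run over the whole dictionary would want a faster index
construction and would probably walk both sorted structures together
rather than searching word by word.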

Enjoy it!


