matching the complete Oxford English Dictionary against SwissProt
gonnet at inf.ethz.ch
Mon Nov 30 03:21:03 EST 1992
This research finds/breaks two records:
(a) The longest word which appears in the OED and as a protein sequence
(b) The most useless piece of information (simultaneously in lexicography
and computational biochemistry!)
The complete Oxford English Dictionary, second edition (20 volumes in
paper form) has 572728830 characters and is remarkably close to the human
genome in amount of information. The longest matches, of any word in
the entire dictionary, against SwissProt version 23 are 9 characters
long (disappointingly short) and are the words:
ENSILISTS: (1 occurrence in SP, 1 in OED)
ensilist. [f. ensile + -ist.] One who preserves his crops by ensilage.
1883 Hibernia July 103/2 Concrete has been adopted by many ensilists.
HIDALGISM: (1 occurrence in SP, 4 in OED)
. . . Hence hidalgoish a., resembling or characteristic of a hidalgo.
hidalgoism (hidalgism), the practice or manners of a hidalgo.
<KW>DNA RECOMBINATION; DNA INTEGRATION.</KW>
This was done as an exercise in data structures and searching. The
data structures used were Pat arrays. The algorithm is similar to, but
simpler than, all-against-all matching. It took 23 minutes on my
workstation, just about as much time as it takes to read the entire
572 MB of dictionary.
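
For anyone who wants to play with the idea, here is a minimal sketch of the
kind of search described above. It is not the program used for the run: it
builds a Pat array (a sorted array of suffix positions) over a toy protein
text with a naive sort, then looks each word up with two binary searches.
The function names and the toy sequences are made up for illustration, and
the toy sequences are not SwissProt entries.

```python
def build_pat_array(text):
    """Return all suffix start positions, sorted by the suffix they begin."""
    # Naive construction: fine for a toy text, far too slow for real data.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, pat, word):
    """Count occurrences of word in text with two binary searches on pat."""
    n = len(word)

    def bound(strict):
        # strict=False: first suffix whose n-character prefix is >= word
        # strict=True:  first suffix whose n-character prefix is >  word
        lo, hi = 0, len(pat)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[pat[mid]:pat[mid] + n]
            if prefix < word or (strict and prefix == word):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(True) - bound(False)


def longest_matches(words, sequences):
    """Return (length, {word: count}) for the longest words found."""
    # Join with newlines so no match can span two sequences.
    text = "\n".join(sequences)
    pat = build_pat_array(text)
    best_len, best = 0, {}
    for word in {w.upper() for w in words}:
        if len(word) < best_len or not word.isalpha():
            continue
        hits = count_occurrences(text, pat, word)
        if hits == 0:
            continue
        if len(word) > best_len:
            best_len, best = len(word), {}
        best[word] = hits
    return best_len, best


# Toy data (invented): three short strings standing in for protein sequences.
toy_sequences = ["MKTENSILISTSAA", "GGHIDALGISMKK", "ACATWHEAT"]
toy_words = ["cat", "wheat", "ensilists", "hidalgism", "zebra"]
result = longest_matches(toy_words, toy_sequences)
```

On the toy data, result holds a longest length of 9, with ENSILISTS and
HIDALGISM each counted once. The real run would need a proper Pat array
construction and a streaming pass over the dictionary text; the sketch only
shows the lookup step.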