GenBank Typography
Keith Robison
robison at golgi.harvard.edu
Thu Oct 24 20:47:59 EST 1991
After my recent posting, I wondered how many typos can be found amongst
GenBank keywords. To investigate, I performed the following operations:
1) Made a list of all words found in the DESCRIPTION or KEYWORDS fields
2) Extracted lists of all words appearing 1, 2, 3, or 4 times
(on the hypothesis that any particular typo is rare, so it should
appear only a few times).
Steps 1 and 2 were done with the UNIX utilities uniq, comm, and awk.
3) Ran the resulting lists through the UNIX spell utility to eliminate
correctly spelled ordinary English words.
4) Examined the results visually for words which are definitely or probably
typos. Words were included in the list if:
A. They were obviously wrong.
B. They looked wrong.
C. They really should be hyphenated or split
(e.g. wildtype --> wild type; minicircle --> mini-circle).
D. Two similar spellings were found and I don't know which
is preferred.
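
Steps 1 to 3 above can be sketched roughly as follows. This is an illustration in Python, not the C program and UNIX utilities actually used. It assumes each field starts a line with its name and that continuation lines are indented; the function names are made up for this sketch, and a plain set of known words stands in for the spell utility.

```python
import re
from collections import Counter

# Field names as given in the post; assumed to appear at the start of a line.
FIELDS = ("DESCRIPTION", "KEYWORDS")

def field_words(lines, fields=FIELDS):
    """Yield lowercase words from the chosen fields and their continuation lines."""
    active = False
    for line in lines:
        if line[:1].isspace():
            # Indented line: continues whichever field came last.
            if not active:
                continue
            text = line
        else:
            head, _, text = line.partition(" ")
            active = head in fields
            if not active:
                continue
        for word in re.findall(r"[a-z]+", text.lower()):
            yield word

def rare_words(lines, max_count=4):
    """Step 2: keep only words seen at most max_count times."""
    counts = Counter(field_words(lines))
    return {w: n for w, n in counts.items() if n <= max_count}

def suspects(lines, known_words, max_count=4):
    """Step 3: drop rare words that the (stand-in) dictionary recognizes."""
    return sorted(w for w in rare_words(lines, max_count) if w not in known_words)
```

What comes out of suspects() still needs step 4, the inspection by eye, since many rare words are legitimate technical terms rather than typos.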
A summary of the results follows (the complete list of dubious words appears
at the end of this posting):
1. The genetic databases have typos (no news here).
2. No conventions are enforced as to the spelling of Latin names,
especially ones ending in 'i' (one 'i' or two?).
3. If a G1 who is experienced with neither UNIX nor C can
generate a list of typos in about half a day using
a one-page C program and standard UNIX utilities, why can't the
databases do this on a regular basis?
4. Biologists should be glad that so much is done in E. coli, since
many variant spellings were found for other species names.
(Look at the list and you'll understand what I mean.)
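
Points 2 and 4 both concern near-identical spellings of the same name. The variants reported here were spotted by eye; the sketch below only suggests how such pairs might be flagged automatically. It uses Python's difflib, and the function name and the 0.85 threshold are arbitrary choices for illustration.

```python
from difflib import SequenceMatcher

def group_variants(words, threshold=0.85):
    """Greedily group words whose similarity to a group's first word meets the threshold."""
    groups = []
    for word in sorted(words):
        for group in groups:
            if SequenceMatcher(None, group[0], word).ratio() >= threshold:
                group.append(word)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([word])
    # Only groups with two or more spellings are of interest.
    return [g for g in groups if len(g) > 1]
```

For example, group_variants(["baarsi", "baarsii", "israeli", "israelii", "murein"]) pairs up the one-'i' and two-'i' spellings and leaves murein out. Deciding which spelling in each group is correct still takes a person.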
Keith Robison
Harvard University
Program in Biochemistry, Molecular, Cellular, and Developmental Biology
The list:
actii acylneuraminate adpglucose aerugenes
agribacterium alphatic aminoimidaxole amonabactin
amyloliqufaciens anabena aprorepressor baarsi
baarsii bacteriodes balactosidase biliprotein
billiprotein biodayb borellia brevibacteruim
burdorferei caldophilus campesris carbaxybutanamido
carboxyphosphonoenolpyruvate cateachol
cellobiase cellobiosidase
chlorosomal colelicolor collagenoltyic collicin
comlete confering cryiiic crystaline
cyanobacteriium cystathionase cystathione cystathionine
cystein cysteinyl cytadhesin degredation
desulfobulbus dihydropholate ditrogenase diydrofolate
elelment elelment enatiomer endcoding
enlongation enodonuclease entertoxin enzymeiii
erdman erdmann erwina faaecalis
fagilis fimbrilin fimbrillin flexeri
flexineri flexner frameshift fructosovarans
fructosovorans gadium galaktokinase galctosidase
gammma genorrhoeae giganteus glycerolphosphate
glycerophosphate glyphosphate hackeliae haemohpilus
heatshock histidin histon hydroxybuterase
hydroxybuterate incompat innocuum insectidal
insetion instertion intracellularel isocitritase
isopenicllin israeli israelii kanamy
lamgda lantibiotic lysozume maltohexaose
mambrane mannanase membane meningiditis
menmbrane mensenteroides mesenteroides methanogene
methylotransferase minicircle miniplasmid minireplicon
minumum monphosphatem murein mycoplasm
nitogen ozaenae ozeanae paprtial
phosphoglucoisomerase phosphoglucoseisomerase
photosystyem phycobiliprotein
phycobilisome plamid plamsid pneomoniae
pneumionia pneumonie posphate prabable
precurser promotor propilin proteease
protins protocatechuate pseudoanabaena pseudoanabaene
psudomonas pyruvoyl ribosmoal ribulosephosphate
selenocysteyl sensivity sequenc sequense
signalling snechocystis solfatarious spaeroides
srain starin strai straim
strepyomyces subitilis subtitlis subtlis
suceptible sulfer sulfolbus sulfurylase
sutbtilis symbiosus synecchococcus synechoccus
syntase sythetase thermofilum thermogata
thermophila thermophile thermophillic thermotoga
thibautii thiebautii thphimurium thurigiensis
thuringensis thymdiylate thymidilate tmefaciens
tranposition transferrna transformtion transolcationg
trasnfer tryptophane tularensis tumef
tumefacians tumefacinens tumescens tymphimurium
typhemurium typhimuriun typhirmurium typhiumurium
typimurium uropathagenic uxact vanelli
whie wildtype woesi