GenBank Typography

Keith Robison robison at
Thu Oct 24 20:47:59 EST 1991

After my recent posting, I wondered how many typos can be found amongst 
GenBank keywords.  To investigate, I performed the following operations:

1)  Made a list of all words found in the DESCRIPTION or KEYWORDS fields
2)  Extracted lists of all words appearing 1, 2, 3, or 4 times
    (on the hypothesis that any particular typo is rare, and so should
     appear only a few times).
    This was done with the UNIX utilities uniq, comm, and awk.

3)  Ran the resulting lists through the UNIX spell utility to eliminate
    correctly spelled ordinary English words.

4)  Examined the results visually for words which are definitely or probably
    typos.  Words were included in the list if:

	A.  They were obviously wrong.
	B.  They looked wrong.
	C.  They really should be hyphenated or split
		(e.g. wildtype --> wild type; minicircle --> mini-circle).
	D.  Two similar spellings were found and I don't know which 
	    is preferred.
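
For readers who want to try something similar, here is a minimal sketch of
steps 1-3 in Python. It is not the original C program or the actual
uniq/comm/awk/spell commands, which are not reproduced in this posting.
The function name, the sample records, and the small in-memory dictionary
(a stand-in for the spell utility's word list) are all hypothetical.

```python
import re
from collections import Counter

def rare_unknown_words(records, dictionary, max_count=4):
    """Return words seen 1..max_count times that are not in `dictionary`."""
    counts = Counter()
    for text in records:
        # Lowercase, then treat any run of letters as one word.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        word for word, n in counts.items()
        if n <= max_count and word not in dictionary
    )

# Made-up DESCRIPTION/KEYWORDS text, not taken from GenBank.
records = [
    "E.coli plasmid DNA",
    "B.subtilis plamid insertion element",
    "plasmid membrane protein",
    "membane protein, plasmid encoded",
    "plasmid replication protein",
    "plasmid copy number",
]

# Toy word list. "plasmid" is deliberately left out: it occurs five
# times, so the frequency cut-off alone removes it.
dictionary = {
    "e", "b", "coli", "dna", "insertion", "element", "membrane",
    "protein", "encoded", "replication", "copy", "number",
}

candidates = rare_unknown_words(records, dictionary)
print(candidates)
```

On this sample the candidates are membane, plamid, and subtilis. The last
one is spelled correctly but is unknown to the toy dictionary, which is why
the visual check in step 4 is still needed.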

A summary of the results follows (the complete list of dubious words appears
at the end of this posting):

1.  The genetic databases have typos (no news here).

2.  No conventions are enforced as to the spelling of Latin names,
    especially ones ending in 'i' (one 'i' or two?).

3.  If a G1 (first-year graduate student) who is experienced with neither
    UNIX nor C can generate a list of typos in about half a day using
    a one-page C program and standard UNIX utilities, why can't the
    databases do this on a regular basis?

4.  Biologists should be glad that so much is done in E. coli, since
    many variant spellings were found for other species names.
    (Look at the list and you'll understand what I mean.)
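
Near-duplicate spellings such as those in points 2 and 4 (and criterion D
above) could also be flagged automatically rather than by eye. The sketch
below is only an illustration and was not part of the procedure described
in this posting. It uses Python's standard difflib module; the function
name and the 0.85 cutoff are arbitrary assumptions.

```python
import difflib

def similar_pairs(words, cutoff=0.85):
    """Return (a, b) pairs whose difflib similarity ratio is >= cutoff."""
    words = sorted(set(words))
    pairs = []
    for i, a in enumerate(words):
        # Compare each word only with the words after it, so every
        # pair is checked once.
        for b in words[i + 1:]:
            if difflib.SequenceMatcher(None, a, b).ratio() >= cutoff:
                pairs.append((a, b))
    return pairs

# A few entries taken from the list at the end of this posting.
sample = ["baarsi", "baarsii", "israeli", "israelii", "thermotoga", "murein"]
pairs = similar_pairs(sample)
print(pairs)
```

On this sample it pairs baarsi with baarsii and israeli with israelii.
Comparing every pair is quadratic in the number of words, which is fine
for a list of a few hundred rare words but not for a whole vocabulary.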

Keith Robison
Harvard University
Program in Biochemistry, Molecular, Cellular, and Developmental Biology

The list:

actii              acylneuraminate    adpglucose         aerugenes
agribacterium      alphatic           aminoimidaxole     amonabactin        
amyloliqufaciens   anabena            aprorepressor      baarsi             
baarsii            bacteriodes        balactosidase      biliprotein        
billiprotein       biodayb            borellia           brevibacteruim     
burdorferei        caldophilus        campesris          carbaxybutanamido  
carboxyphosphonoenolpyruvate          cateachol
cellobiase         cellobiosidase     
chlorosomal        colelicolor        collagenoltyic     collicin           
comlete            confering          cryiiic            crystaline         
cyanobacteriium    cystathionase      cystathione        cystathionine      
cystein            cysteinyl          cytadhesin         degredation        
desulfobulbus      dihydropholate     ditrogenase        diydrofolate       
elelment           elelment           enatiomer          endcoding          
enlongation        enodonuclease      entertoxin         enzymeiii          
erdman             erdmann            erwina             faaecalis          
fagilis            fimbrilin          fimbrillin         flexeri            
flexineri          flexner            frameshift         fructosovarans     
fructosovorans     gadium             galaktokinase      galctosidase       
gammma             genorrhoeae        giganteus          glycerolphosphate  
glycerophosphate   glyphosphate       hackeliae          haemohpilus        
heatshock          histidin           histon             hydroxybuterase    
hydroxybuterate    incompat           innocuum           insectidal         
insetion           instertion         intracellularel    isocitritase       
isopenicllin       israeli            israelii           kanamy             
lamgda             lantibiotic        lysozume           maltohexaose       
mambrane           mannanase          membane            meningiditis       
menmbrane          mensenteroides     mesenteroides      methanogene        
methylotransferase minicircle         miniplasmid        minireplicon       
minumum            monphosphatem      murein             mycoplasm          
nitogen            ozaenae            ozeanae            paprtial           
phosphoglucoisomerase		      phosphoglucoseisomerase
photosystyem       phycobiliprotein   
phycobilisome      plamid             plamsid            pneomoniae         
pneumionia         pneumonie          posphate           prabable           
precurser          promotor           propilin           proteease          
protins            protocatechuate    pseudoanabaena     pseudoanabaene     
psudomonas         pyruvoyl           ribosmoal          ribulosephosphate  
selenocysteyl      sensivity          sequenc            sequense           
signalling         snechocystis       solfatarious       spaeroides         
srain              starin             strai              straim             
strepyomyces       subitilis          subtitlis          subtlis            
suceptible         sulfer             sulfolbus          sulfurylase        
sutbtilis          symbiosus          synecchococcus     synechoccus        
syntase            sythetase          thermofilum        thermogata         
thermophila        thermophile        thermophillic      thermotoga         
thibautii          thiebautii         thphimurium        thurigiensis       
thuringensis       thymdiylate        thymidilate        tmefaciens         
tranposition       transferrna        transformtion      transolcationg     
trasnfer           tryptophane        tularensis         tumef              
tumefacians        tumefacinens       tumescens          tymphimurium       
typhemurium        typhimuriun        typhirmurium       typhiumurium       
typimurium         uropathagenic      uxact              vanelli            
whie               wildtype           woesi              

