AWK Scripts for BIO-JOURNALS to BibTeX

Thu Feb 27 17:04:00 EST 1992


Several weeks ago I posted an AWK script for converting the tables of contents
posted on the bio-journals bulletin board to SCRIBE/BibTeX format.  There were
several problems with that script: it did not work for the journals of the
American Society for Microbiology, it depended on a "gsub" function not
available in all versions of AWK, and it did not produce bibliography
reference tags.

Two revised scripts are being posted with this announcement.  The script
ASM2BIB.AWK performs the conversion for the journals
  J. Bacteriol.
  J. Virol.
  Mol. Cell. Biol.
The script ENT2BIB.AWK performs the conversion for the journals
  J. Biol. Chem.
  Mol. Microbiol.
These scripts should be usable with all versions of AWK, NAWK, GNU AWK
(or GAWK) and BAWK.  AWK and NAWK are UNIX utility programs; GAWK and BAWK are
programs that emulate (and extend) AWK on UNIX and other operating systems.
SCRIBE and BibTeX are bibliographic database programs; BibTeX is used with the
document preparation programs TeX and LaTeX.

To execute these scripts, issue the appropriate AWK command with the "-f"
switch immediately followed by the name of the script file, a space, and the
name of the table of contents file:
  awk -f{script_name} {contents_name}
The results are directed to the standard output, usually the terminal screen.
To capture the results, for example to produce the file JBACT.BIB, the output
must be redirected to a file in the manner appropriate for your operating
system; the syntax differs between UNIX and VMS.  Replace the AWK command
itself with NAWK, GAWK or BAWK as appropriate for your system.
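
As a sketch of the UNIX case, the commands below run a script and then capture
its output in JBACT.BIB.  The file names demo.awk and jbact.toc are stand-ins
invented here so that the commands run on their own; substitute ASM2BIB.AWK or
ENT2BIB.AWK and your own table of contents file.  VMS redirection is not shown.

```shell
# Stand-in files so the sketch is self-contained; they are NOT the posted
# scripts or a real bio-journals table of contents.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '{ print "line " NR ": " $0 }\n' > demo.awk    # hypothetical stand-in script
printf 'first entry\nsecond entry\n' > jbact.toc      # hypothetical contents file

# Results go to the standard output (the terminal).
awk -f demo.awk jbact.toc

# Results captured in a file with UNIX shell redirection.
awk -f demo.awk jbact.toc > JBACT.BIB
```

With NAWK or GAWK, only the command name at the start of each line changes.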

These scripts tolerate many failures of particular tables of contents to
conform to the expected format.  However, not every deviation from the
format can be detected and corrected.  Also, some secondary editing of the
output will usually be required for BibTeX to produce correct bibliographic
entries.  Here are some of the features and problems to expect.

(1) The BibTeX entry tags are constructed from the first two authors' last
names and the year, all separated by periods.  The tags may not be unique and
should be changed as necessary.  If there is a problem parsing a name, the tag
may be malformed.  If the periods are not desired, they can be eliminated
in the scripts by removing ' "." ' from the lines beginning "tag = ".

(2) Some of the tables of contents have author and journal names in all upper-
case characters.  AWK does not provide a simple method for correcting this
capitalization problem.

(3) Author names containing "Jr.", "III", etc., may not be parsed correctly by
BibTeX and will need to be protected or corrected in secondary editing.  Author
names and words in titles are not protected against capitalization changes,
that is, they are not enclosed in "{" and "}"; this must be done in secondary
editing as necessary.  No special character or format conversions are performed
in the titles, so those must also be done in secondary editing.

(4) The author and title output fields are each put on a single line.  These
long lines may or may not be wrapped when they appear on a terminal.  Such long
lines are not a problem for BibTeX.

(5) The spaces appearing before the BibTeX field labels, as in "  author = ",
may be eliminated by removing them from the corresponding "printf" statements
in the script files.
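
Items (1) and (5) can be illustrated with a short sketch.  It is not taken from
ASM2BIB.AWK or ENT2BIB.AWK: the "|"-separated input record, the field order and
the "@article" entry type are assumptions made for the example.  Only the
period-joined tag and the leading spaces inside the "printf" format string
follow the description above.

```shell
# Hypothetical input record: first author, second author, year, author list.
entry=$(printf 'Smith|Jones|1992|J. Smith and K. Jones\n' |
awk -F'|' '{
    # Item (1): join the two last names and the year with periods.
    # Leaving out the "." pieces would give SmithJones1992 instead.
    tag = $1 "." $2 "." $3
    printf "@article{%s,\n", tag
    # Item (5): the two leading spaces are part of the printf format string.
    printf "  author = \"%s\",\n", $4
    printf "}\n"
}')
echo "$entry"
```

Two papers by the same pair of authors in the same year would receive the same
tag, which is why the tags may need to be changed by hand.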

I will accept bug reports and, within the restrictions inherent in working
with AWK and the human production of the tables of contents, will attempt to
correct the problems reported.  I will not be so receptive to requests for
additional features unaccompanied by the text of suggested script improvements.

I wish to acknowledge the help of Dr. Tom D. Schneider, National Cancer
Institute, Laboratory of Mathematical Biology, Frederick, Maryland  21702-1201,
in field testing these revised scripts.
                                 Dr. John S. Garavelli
                                 Database Coordinator
                                 Protein Identification Resource
                                 National Biomedical Research Foundation
                                 Washington, DC  20007
                                 POSTMASTER at GUNBRF.BITNET
