software for reading sequence from PDF file

Don Gilbert gilbertd at bio.indiana.edu
Fri Mar 19 17:54:47 EST 1999


Certainly what you ask can be done (extracting specific text from PDF),
as long as the PDF documents have not been encrypted or secured by
their creators.  PDF stores text as text, not as bitmap images, so you
can pull the text out with the right tool.  The exception is a PDF
that was created from a bitmap image, such as a scanned page: there
the page is only a picture, and recovering the text would take OCR.
The PDF format is well documented by Adobe.

Here are some PDF links
http://www.adobe.com/prodindex/acrobat/adobepdf.html
http://www.ep.cs.nott.ac.uk/pdfcorner/
http://www.pdfzone.com/
See especially this page for extraction tools:
http://www.pdfzone.com/products/software/toolinfo_extract.asp

I've written software to create PDF from various graphics and text,
and it wasn't too hard.  If you need to write your own, software to
extract text should be a straightforward programming project for a
software engineer.  Java is a great match for PDF, since Java's
standard ZIP library (java.util.zip) can decompress PDF's
Flate-compressed data.
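
As a rough sketch of that last point, assuming a content stream that
was compressed with PDF's FlateDecode filter (zlib/deflate), something
like the following would inflate it with java.util.zip.Inflater and
pick out simple literal strings shown with the Tj operator.  The class
name FlateStreamSketch, its helper methods, and the sample stream are
made up for illustration.  Locating stream objects in a real file, and
handling font encodings, string escapes and other filters, is not
covered here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper class, for illustration only.
class FlateStreamSketch {

    // Inflate zlib/deflate data, the format used by PDF's FlateDecode filter.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read: the data is incomplete.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Compress bytes the same way, only so the demo has something to inflate.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Naive scan for "(text) Tj"; ignores escapes, nested parentheses,
    // hex strings, TJ arrays and font encodings.
    static List<String> literalStrings(String content) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("\\(([^()\\\\]*)\\)\\s*Tj").matcher(content);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up content stream standing in for one read from a PDF file.
        String content = "BT /F1 12 Tf 72 720 Td (ATGCCGTA) Tj ET";
        byte[] stream = deflate(content.getBytes(StandardCharsets.ISO_8859_1));
        String text = new String(inflate(stream), StandardCharsets.ISO_8859_1);
        System.out.println(literalStrings(text));
    }
}
```

In a real extractor the compressed bytes would come from the stream
objects in the PDF file rather than from deflate(), which is here
only to keep the example self-contained.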

-- Don

In article <717801BBC100D211B89500805F6FAD93047D56 at snap01.synapticcorp.com>,
 <Tvenkatesh at synapticcorp.com> wrote:
>I would like to know if there is software that can convert PDF files into
>text files.
>Specifically, we want to extract sequences from patent documents which are
>stored as images in PDF format. We tried Acrobat Reader; it did not help.
>I appreciate your help.
>Thanks
>Venky
>___________________________
>T. V. (Venky) Venkatesh, Ph D
>Senior Scientist (Bioinformatics and Molecular Biology)
>Synaptic Pharmaceutical Corporation
>215 College Road
>Paramus NJ 07652 - 1431
>201-261-1331x720 (Phone)
>201-261-0623(Fax)
>Tvenkatesh at synapticcorp.com
>
>


--
-- d.gilbert--biocomputing--indiana-u--bloomington-in-47405
-- gilbertd at bio.indiana.edu



