[Lazarus] Building help files: the nitty-gritty

Mon Jul 23 12:30:33 CEST 2012

On 23-7-2012 12:16, Mark Morgan Lloyd wrote:
> Marco van de Voort wrote:
> Do you ever have a situation where you need to index into a PDF, and if
> so how do you cope?
> 
> I'm looking at something where it would be beneficial to have a
> collection of electronics info, and I'd want immediate access to a given
> section. Allowing for the dominance of PDF in that industry, the best I
> can think of so far is to burst each PDF into pages, and then convert
> each page back to PDF.

Are you talking about PDFs with digital text or images with hidden text
("searchable PDFs" as they are sometimes called) as opposed to images only?

If so, it may help to extract the text with e.g. pdftotext (page by
page if needed), search that text for the section you want, and then
instruct a PDF reader to open at the matching page, if that reader
supports it.
E.g. for the Sumatra PDF reader (IIRC, only available on Windows):
sumatrapdf -page <pageno> <file.pdf>

Depending on the reader, you could do more, see e.g.:
https://code.google.com/p/sumatrapdf/wiki/CommandLineArguments
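
A rough sketch of that idea in Python, assuming pdftotext (from poppler or
xpdf) is on the PATH and that it separates pages with a form feed character
in its output, which is what it normally does. The function names are made
up for illustration, and the viewer command simply reuses the Sumatra
syntax above; adjust it for whatever reader you use.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on the whole file and return its text output."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def pages_containing(text, query):
    """Return the 1-based page numbers whose text contains query.

    Assumes pages are separated by form feeds, as in pdftotext output.
    The match is a plain case-insensitive substring test.
    """
    needle = query.lower()
    pages = text.split("\f")
    return [n for n, page in enumerate(pages, start=1) if needle in page.lower()]


def viewer_command(pdf_path, page):
    """Build the command line that opens the PDF at the given page."""
    return ["sumatrapdf", "-page", str(page), pdf_path]
```

Typical use would be pages_containing(extract_text("datasheet.pdf"), "pinout")
and then passing the first hit to viewer_command and subprocess.run. Text
that spans a page break will not be found this way.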

If the PDFs are images only, the only feasible way I can see would be to
extract the images, OCR them and work from the result; e.g. rebuild the
PDFs, adding the recognized text as a hidden text layer, e.g. with the
Linux hocr2pdf tool provided by the exactimage package:
http://www.exactcode.com/site/open_source/exactimage/
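
A minimal sketch of the OCR-and-rebuild step in Python, assuming tesseract
and hocr2pdf are on the PATH and the page is already available as an image
file. The helper names and file names are hypothetical. The extension of
the hOCR file that tesseract writes differs between versions (.hocr in
newer releases, .html in older ones), so it is a parameter here; check what
your version produces.

```python
import subprocess


def tesseract_command(image_path, output_base):
    """Command that OCRs one image; "hocr" selects hOCR output."""
    return ["tesseract", image_path, output_base, "hocr"]


def hocr2pdf_command(image_path, pdf_path):
    """Command that combines an image with hOCR text read from stdin."""
    return ["hocr2pdf", "-i", image_path, "-o", pdf_path]


def make_searchable_pdf(image_path, output_base, hocr_ext=".hocr"):
    """OCR one page image and wrap image plus text into a PDF."""
    subprocess.run(tesseract_command(image_path, output_base), check=True)
    pdf_path = output_base + ".pdf"
    # hocr2pdf expects the hOCR document on standard input.
    with open(output_base + hocr_ext, "rb") as hocr:
        subprocess.run(
            hocr2pdf_command(image_path, pdf_path), stdin=hocr, check=True
        )
    return pdf_path
```

For example, make_searchable_pdf("page1.tif", "page1") would leave a
page1.pdf next to the image, provided both tools run without errors.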

FYI, I'm actually in the process of building a fairly simple document
scanning solution (for now on Linux) that:
- scans the document using SANE, outputting TIFF files
- performs optional cleanup, e.g. with the scantailor package
- recognizes text with tesseract, writing layout/text info in hOCR format
- combines the image and text into "searchable PDFs" using hocr2pdf
- adds metadata to the PDF
- adds the text to either a database or more probably some kind of full
text search package
- registers the scan into a database
- provides a viewer to search for documents, print them etc; in future
perhaps also to be used for correcting OCR results
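
To illustrate the last few steps (storing the text and registering the
scan), here is a small Python sketch that uses SQLite with a plain LIKE
search. This is only a stand-in: the table layout and function names are
invented for the example, and a real full text search package would
replace the LIKE query.

```python
import sqlite3


def open_index(db_path=":memory:"):
    """Open (or create) the scan database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scans ("
        " id INTEGER PRIMARY KEY,"
        " pdf_path TEXT NOT NULL,"
        " scanned_at TEXT NOT NULL,"
        " ocr_text TEXT NOT NULL)"
    )
    return conn


def register_scan(conn, pdf_path, scanned_at, ocr_text):
    """Store one scanned document with its OCR text; return its row id."""
    cur = conn.execute(
        "INSERT INTO scans (pdf_path, scanned_at, ocr_text) VALUES (?, ?, ?)",
        (pdf_path, scanned_at, ocr_text),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, query):
    """Return the PDF paths whose OCR text contains query.

    LIKE is case-insensitive for ASCII in SQLite; % and _ in the query
    act as wildcards, which is acceptable for a sketch.
    """
    rows = conn.execute(
        "SELECT pdf_path FROM scans WHERE ocr_text LIKE ? ORDER BY id",
        ("%" + query + "%",),
    )
    return [row[0] for row in rows]
```

The viewer would then call search() and hand the resulting path to a PDF
reader.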

It has not got very far yet, but:
https://bitbucket.org/reiniero/papertiger/overview

For Dutch speaking readers: yes, the project name is intentional ;)