[Lazarus] Building help files: the nitty-gritty

Mon Jul 23 13:40:08 CEST 2012

Reinier Olislagers wrote:
> On 23-7-2012 12:16, Mark Morgan Lloyd wrote:

>> Do you ever have a situation where you need to index into a PDF, and if
>> so how do you cope?
>>
>> I'm looking at something where it would be beneficial to have a
>> collection of electronics info, and I'd want immediate access to a given
>> section. Allowing for the dominance of PDF in that industry, the best I
>> can think of so far is to burst each PDF into pages, and then convert
>> each page back to PDF.
> 
> Are you talking about PDFs with digital text or images with hidden text
> ("searchable PDFs" as they are sometimes called) as opposed to images only?

Any sort of PDF, including murky scans from Bitsavers.

> If so, extracting the text using e.g. pdftotext (e.g. page by page),
> finding the text etc and instructing a pdf reader to open at a certain
> page - if that reader supports it - may be helpful??
> E.g. for the Sumatra PDF reader (IIRC, only available on Windows):
> sumatrapdf -page <pageno>
> 
> Depending on the reader, you could do more, see e.g.:
> https://code.google.com/p/sumatrapdf/wiki/CommandLineArguments

That is highly non-portable, though. My thought was to keep the
documentation on a server, available to anybody using a specialist
Lazarus app. The whole thing would be spoilt if, as well as loading the
app (tentatively, Borg-UI) and possibly lhelp and associated CHMs, I
required users to find a specific PDF reader and possibly integrate it
with their browser.
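
To illustrate the page-by-page pdftotext approach quoted above, here is a
minimal Python sketch. It assumes pdftotext (from poppler or xpdf) is on
the PATH and relies on it separating pages with a form feed character.
The function names are made up for this example, and the viewer command
is simply the "sumatrapdf -page <pageno>" form from the quoted message,
so it inherits the portability problem.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext and return the whole document as one string.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_matching(text, term):
    """Return 1-based page numbers whose text contains term.

    The match is case-insensitive. Pages are split on the form feed
    character (ASCII 12) that pdftotext emits after each page.
    """
    needle = term.lower()
    return [
        number
        for number, page in enumerate(text.split("\f"), start=1)
        if needle in page.lower()
    ]


def viewer_command(pdf_path, page):
    """Build the reader invocation quoted above (Sumatra, Windows only)."""
    return ["sumatrapdf", "-page", str(page), pdf_path]
```

Typical use would be pages_matching(extract_text("manual.pdf"), "control
register"), then passing the first hit to viewer_command. For image-only
scans pdftotext returns essentially nothing, so this only helps once an
OCR text layer exists.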

> If the PDFs are images only, the only feasible way I'd see would be to
> extract the images, OCR them and deal with it then; e.g. rebuild the
> PDFs, adding the resulting text as hidden text, e.g. with the Linux
> hocr2pdf tool provided by the exactimage package:
> http://www.exactcode.com/site/open_source/exactimage/

Although even if the original wasn't a bitmap from Bitsavers, 
electronics stuff with lots of tabular material (control registers and 
the like) is notoriously difficult to reformat. I /hope/ to be able to 
get the docs into a database (PostgreSQL handles binaries fairly well, 
which would simplify some of the replication/dissemination issues) even 
if they were saved as files local to each HTTP daemon.
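
A rough sketch of the "one small PDF per page, stored as a binary" idea
follows. It uses Python's built-in sqlite3 module purely as a stand-in so
that the example runs without a server. With PostgreSQL the pdf column
would be a bytea, and the table and function names here are invented for
illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the store; one row per document page."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_page ("
        " doc TEXT NOT NULL,"
        " page INTEGER NOT NULL,"
        " pdf BLOB NOT NULL,"
        " PRIMARY KEY (doc, page))"
    )
    return conn


def put_page(conn, doc, page, pdf_bytes):
    """Insert or overwrite the single-page PDF for (doc, page)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO doc_page (doc, page, pdf)"
            " VALUES (?, ?, ?)",
            (doc, page, pdf_bytes),
        )


def get_page(conn, doc, page):
    """Return the stored PDF bytes, or None if the page is unknown."""
    row = conn.execute(
        "SELECT pdf FROM doc_page WHERE doc = ? AND page = ?",
        (doc, page),
    ).fetchone()
    return None if row is None else bytes(row[0])
```

An HTTP daemon could then serve get_page(conn, "some-datasheet", 17)
with a Content-Type of application/pdf, which gives immediate access to
a section without requiring any particular reader on the client.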

> FYI, I'm actually in the process of building a fairly simple document
> scanning solution (for now on Linux) that:
> - scans the document using sane, outputting tiff files
> - performs optional cleanup, e.g. with the scantailor package
> - recognizes text with tesseract, output layout/text info into hocr format
> - combines the image and text into "searchable PDFs" using hocr2pdf
> - adds metadata to the PDF
> - adds the text to either a database or more probably some kind of full
> text search package
> - registers the scan into a database
> - provide a viewer to search for documents, print them etc; in future
> perhaps to be used for correcting OCR results
> 
> Not very far, but:
> https://bitbucket.org/reiniero/papertiger/overview
> 
> For Dutch speaking readers: yes, the project name is intentional ;)

I'm afraid I don't speak Dutch, although my mother used to quote "Bûter, 
brea, en griene tsiis" out of context every few months. But I like the 
palindrome :-)
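
For readers who want to try the scan, OCR and combine steps from the
quoted list, the core of it can be sketched in Python as below. This is
not how papertiger itself is implemented; it only chains the three
command-line tools named above. It assumes scanimage (sane), tesseract
and hocr2pdf are installed, and that tesseract writes <base>.hocr (older
releases wrote <base>.html instead). Cleanup, metadata and the database
steps are left out, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_steps(base):
    """Describe the three commands for one scanned page.

    Each step holds the argument list plus optional files to connect
    to the command's stdin and stdout.
    """
    base = Path(base)
    tiff = base.with_suffix(".tiff")
    hocr = base.with_suffix(".hocr")
    pdf = base.with_suffix(".pdf")
    return [
        # scanimage writes the image to stdout
        {"argv": ["scanimage", "--format=tiff"],
         "stdin": None, "stdout": tiff},
        # tesseract <image> <output base> hocr
        {"argv": ["tesseract", str(tiff), str(base), "hocr"],
         "stdin": None, "stdout": None},
        # hocr2pdf reads the hOCR from stdin and embeds the image
        {"argv": ["hocr2pdf", "-i", str(tiff), "-o", str(pdf)],
         "stdin": hocr, "stdout": None},
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for step in steps:
        stdin = open(step["stdin"], "rb") if step["stdin"] else None
        stdout = open(step["stdout"], "wb") if step["stdout"] else None
        try:
            subprocess.run(step["argv"], stdin=stdin, stdout=stdout,
                           check=True)
        finally:
            for handle in (stdin, stdout):
                if handle is not None:
                    handle.close()
```

Calling run_steps(build_steps("scan0001")) should leave scan0001.tiff,
scan0001.hocr and a searchable scan0001.pdf in the current directory,
provided a scanner is attached and the tools behave as assumed.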

-- 
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]