[Lazarus] Building help files: the nitty-gritty
Mark Morgan Lloyd
markMLl.lazarus at telemetry.co.uk
Mon Jul 23 13:40:08 CEST 2012
Reinier Olislagers wrote:
> On 23-7-2012 12:16, Mark Morgan Lloyd wrote:
>> Do you ever have a situation where you need to index into a PDF, and if
>> so how do you cope?
>> I'm looking at something where it would be beneficial to have a
>> collection of electronics info, and I'd want immediate access to a given
>> section. Allowing for the dominance of PDF in that industry, the best I
>> can think of so far is to burst each PDF into pages, and then convert
>> each page back to PDF.
> Are you talking about PDFs with digital text or images with hidden text
> ("searchable PDFs" as they are sometimes called) as opposed to images only?
Any sort of PDF, including murky scans from Bitsavers.
> If so, extracting the text using e.g. pdftotext (e.g. page by page),
> finding the text etc and instructing a pdf reader to open at a certain
> page - if that reader supports it - may be helpful??
> E.g. for the Sumatra PDF reader (IIRC, only available on Windows):
> sumatrapdf -page <pageno>
> Depending on the reader, you could do more, see e.g.:
Although that's highly non-portable. My thinking was to keep the
documentation on a server, available to anybody using a specialist
Lazarus app. The whole thing would be spoilt if, as well as loading the
app (tentatively, Borg-UI) and possibly lhelp and associated CHMs, I
required users to find a specific PDF reader and possibly integrate it
with their browser.
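For reference, the page-lookup idea quoted above might be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it assumes poppler's pdftotext is on the PATH and relies on pdftotext ending each page with a form feed. The helper names (extract_pages, split_pages, find_pages, viewer_command) are made up for this sketch, and the viewer invocation simply follows the "sumatrapdf -page" form quoted above with the file name appended.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into per-page strings.

    pdftotext ends each page with a form feed, so the final
    element after splitting is normally empty and is dropped.
    """
    pages = text.split("\f")
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def extract_pages(pdf_path):
    """Run pdftotext once and return a list of page texts."""
    # "-" as the output name sends the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return split_pages(result.stdout)


def find_pages(pages, term):
    """Return the 1-based numbers of pages containing term (case-insensitive)."""
    needle = term.casefold()
    return [number for number, body in enumerate(pages, start=1)
            if needle in body.casefold()]


def viewer_command(pdf_path, page):
    """Build (not run) a viewer command line for the given page.

    Assumption: the viewer accepts "-page <n> <file>", as SumatraPDF does.
    """
    return ["sumatrapdf", "-page", str(page), pdf_path]
```

An image-only scan has no text layer, so extract_pages would return empty pages for it until it has been through OCR. The non-portable part is confined to viewer_command; the search itself only needs pdftotext.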
> If the PDFs are images only, the only feasible way I'd see would be to
> extract the images, OCR them and deal with it then; e.g. rebuild the
> PDFs, adding the resulting text as hidden text, e.g. with the Linux
> hocr2pdf tool provided by the exactimage package:
Although even if the original wasn't a bitmap from Bitsavers,
electronics stuff with lots of tabular material (control registers and
the like) is notoriously difficult to reformat. I /hope/ to be able to
get the docs into a database (PostgreSQL handles binaries fairly well,
which would simplify some of the replication/dissemination issues) even
if they were saved as files local to each HTTP daemon.
> FYI, I'm actually in the process of building a fairly simple document
> scanning solution (for now on Linux) that:
> - scans the document using sane, outputting tiff files
> - performs optional cleanup, e.g. with the scantailor package
> - recognizes text with tesseract, outputting layout/text info in hocr format
> - combines the image and text into "searchable PDFs" using hocr2pdf
> - adds metadata to the PDF
> - adds the text to either a database or more probably some kind of full
> text search package
> - registers the scan into a database
> - provides a viewer to search for documents, print them etc; in future
> perhaps to be used for correcting OCR results
> Not very far, but:
> For Dutch speaking readers: yes, the project name is intentional ;)
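The scan, OCR and combine steps in the list quoted above could be chained along these lines. This is a rough Python sketch, not the quoted project's code: the function names and the file naming scheme are invented for illustration, the optional scantailor cleanup is left out, and it assumes a tesseract release that writes <base>.hocr (older releases wrote <base>.html instead).

```python
import subprocess


def scan(tiff_path):
    """Scan one page with SANE's scanimage, which writes the image to stdout."""
    with open(tiff_path, "wb") as out:
        subprocess.run(["scanimage", "--format=tiff"], stdout=out, check=True)


def ocr_command(tiff_path, base):
    """Build the tesseract call; the trailing "hocr" config selects hOCR output."""
    return ["tesseract", tiff_path, base, "hocr"]


def pdf_command(tiff_path, pdf_path):
    """Build the hocr2pdf call; hocr2pdf reads the hOCR data on stdin."""
    return ["hocr2pdf", "-i", tiff_path, "-o", pdf_path]


def make_searchable_pdf(stem):
    """Scan, OCR and combine into <stem>.pdf (hypothetical naming scheme)."""
    tiff = stem + ".tiff"
    scan(tiff)
    subprocess.run(ocr_command(tiff, stem), check=True)
    # Assumption: tesseract wrote <stem>.hocr; adjust for older releases.
    with open(stem + ".hocr", "rb") as hocr:
        subprocess.run(pdf_command(tiff, stem + ".pdf"), stdin=hocr, check=True)
    return stem + ".pdf"
```

Keeping the command lines in small builder functions means they can be checked or logged without a scanner attached.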
I'm afraid I don't speak Dutch, although my mother used to quote "Bûter,
brea, en griene tsiis" out of context every few months. But I like the
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk
[Opinions above are the author's, not those of his employers or colleagues]