[Lazarus] Building help files: the nitty-gritty

Mon Jul 23 14:13:41 CEST 2012

On 23-7-2012 13:40, Mark Morgan Lloyd wrote:
> Reinier Olislagers wrote:
>> On 23-7-2012 12:16, Mark Morgan Lloyd wrote:
>> Are you talking about PDFs with digital text or images with hidden text
>> ("searchable PDFs" as they are sometimes called) as opposed to images
>> only?
> 
> Any sort of PDF, including murky scans from Bitsavers.
> 
>> If so, extracting the text using e.g. pdftotext (e.g. page by page),
>> finding the text etc and instructing a pdf reader to open at a certain
>> page - if that reader supports it - may be helpful?
>> E.g. for the Sumatra PDF reader (IIRC, only available on Windows):
>> sumatrapdf -page <pageno>
>>
>> Depending on the reader, you could do more, see e.g.:
>> https://code.google.com/p/sumatrapdf/wiki/CommandLineArguments
> 
> Although highly non-portable. 
Agreed. Any takers for adding PDF functionality to lhelp or docview? ;)
Note: viewers like SumatraPDF can be used without installation, so it
becomes a matter of deployment with your application.

You'd still face issues with Linux/OSX of course.
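
To make the extract/search/open idea quoted above a bit more concrete,
here is a rough Python sketch (not tested against real Bitsavers scans).
It assumes pdftotext is on the PATH and relies on pdftotext ending each
page with a form feed character. The function names and the file name
manual.pdf are made up for illustration; the viewer is assumed to accept
"-page <pageno>" as SumatraPDF does.

```python
import subprocess


def split_pages(text):
    """Split pdftotext output into per-page strings (index 0 = page 1)."""
    pages = text.split("\f")
    # pdftotext terminates every page with a form feed, so the final
    # element is normally empty and can be dropped.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def extract_pages(pdf_path):
    """Run pdftotext on the whole file; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return split_pages(result.stdout)


def find_pages(pages, term):
    """Return the 1-based numbers of pages containing term (case-insensitive)."""
    needle = term.lower()
    return [n for n, page in enumerate(pages, start=1) if needle in page.lower()]


def viewer_command(pdf_path, pageno, viewer="sumatrapdf"):
    """Build the command line that opens pdf_path at the given page."""
    return [viewer, "-page", str(pageno), pdf_path]
```

Usage would be something like find_pages(extract_pages("manual.pdf"),
"control register") followed by subprocess.Popen(viewer_command(...)) on
the first hit. On Linux/OSX you would have to swap in whatever viewer
and page option is available there.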

> My thoughts were to have documentation on
> a server to be available to anybody using a specialist Lazarus app, the
> whole thing would be spoilt if as well as loading the app (tentatively,
> Borg-UI) and possibly lhelp and associated CHMs I required users to find
> a specific PDF reader and possibly integrate it with their browser.

Your idea of extracting the single relevant PDF page into a separate PDF
and presenting that to the user could work, especially if you provide
forward/back navigation buttons (in a Lazarus app or web browser+PDF
plugin).
On Windows you might be able to embed some ActiveX PDF control, but once
again that is not portable...
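
A rough sketch of the single-page extraction with back/forward
navigation, again in Python. It assumes the pdfseparate tool from
poppler-utils is installed (pdftk or similar would do just as well); the
function names, the output pattern and manual.pdf are hypothetical.

```python
import subprocess


def single_page_command(pdf_path, pageno, pattern="page-%d.pdf"):
    """Build a pdfseparate call that writes only page pageno.

    pdfseparate replaces %d in the pattern with the page number.
    """
    return ["pdfseparate", "-f", str(pageno), "-l", str(pageno),
            pdf_path, pattern]


def extract_single_page(pdf_path, pageno, pattern="page-%d.pdf"):
    """Write the page to its own PDF and return the resulting file name."""
    subprocess.run(single_page_command(pdf_path, pageno, pattern), check=True)
    return pattern % pageno


def neighbours(pageno, page_count):
    """Return (previous, next) page numbers for the navigation buttons.

    None means the corresponding button should be disabled.
    """
    prev_page = pageno - 1 if pageno > 1 else None
    next_page = pageno + 1 if pageno < page_count else None
    return prev_page, next_page
```

The back/forward buttons would then simply call extract_single_page()
with whatever neighbours() returns and show the new file.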

Captain Obvious: I'd balance the ease of use of this solution against
distributing a portable PDF viewer and telling it to open at a certain page.

>> If the PDFs are images only, the only feasible way I'd see would be to
>> extract the images, OCR them and deal with it then; e.g. rebuild the
>> PDFs, adding the resulting text as hidden text, e.g. with the Linux
>> hocr2pdf tool provided by the exactimage package:
>> http://www.exactcode.com/site/open_source/exactimage/
> 
> Although even if the original wasn't a bitmap from Bitsavers,
> electronics stuff with lots of tabular material (control registers and
> the likes) is notoriously difficult to reformat. 
Yes, but I wasn't talking about reformatting, just pasting the invisible
text "behind" the visible graphics (in another layer) in the PDF.
Alternatively, you could store the retrieved text (and page numbers)
somewhere else.

Of course, OCR itself may be crummy with these kinds of PDFs... which
will lead to search misses/false positives.

If you have graphics-only PDFs and want to do text search you'd have to
perform this step if you want to go to a certain spot based on a user
search... unless you retype the text...
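
For the "store the retrieved text (and page numbers) somewhere else"
option, a minimal sketch could look like the following. It uses
Python's built-in SQLite module purely as a stand-in for whatever
database you end up with; the table layout, function names and
manual.pdf are my own invention.

```python
import sqlite3


def build_index(conn, doc_name, pages):
    """Store one row per page: document name, 1-based page number, text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_text "
        "(doc TEXT, pageno INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO page_text VALUES (?, ?, ?)",
        [(doc_name, n, body) for n, body in enumerate(pages, start=1)],
    )
    conn.commit()


def search(conn, term):
    """Return (doc, pageno) pairs whose page text contains term.

    SQLite's LIKE is case-insensitive for ASCII; note that % and _ in
    term act as wildcards in this simple version.
    """
    rows = conn.execute(
        "SELECT doc, pageno FROM page_text WHERE body LIKE ? "
        "ORDER BY doc, pageno",
        ("%" + term + "%",),
    )
    return rows.fetchall()
```

The page texts can come from pdftotext or from the OCR step; each hit
gives you the document and page to open. Crummy OCR will of course
still produce misses here.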

> I /hope/ to be able to
> get the docs into a database (PostgreSQL handles binaries fairly well,
> which would simplify some of the replication/dissemination issues) even
> if they were saved as files local to each HTTP daemon.
Sounds like a large system?
You'd be having multiple HTTP servers that serve up PDFs coming from the
db? And viewing those with a web browser?
That does confuse me: what do you use the "specialist Lazarus app" for
you talked about earlier?

>> For Dutch speaking readers: yes, the project name is intentional ;)
> 
> I'm afraid I don't speak Dutch, although my mother used to quote "Bûter,
> brea, en griene tsiis" out of context every few months. But I like the
> palindrome :-)
Butter, bread and green cheese in Frisian? (Note: as the Frisians will
insist on telling you, it's a different language from Dutch... the only
reason I seem to understand the words is context and because it's close
to both English and Dutch ;)