[Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich DrDiettrich1 at aol.com
Mon Feb 14 21:50:41 CET 2011


Michael Schnell wrote:

> AFAIK, the decision to use UTF8 is due to Linux using this encoding and 
> so no conversion is done in the LCL system API.

IMO more important: no new string and char type (Wide...) is required, 
no duplicate set of stringhandling procedures. This may be essential for 
databases and communication as well.

> This of course is bad 
> with Windows, as here the API uses UTF16 and everything needs to be 
> recoded in the LCL system API on entry and exit.

The overhead may be negligible in direct API calls, when these do real 
work. Strings in (visual) components can be converted once, into the 
internally used (OS display conforming) representation, and again the 
conversion overhead can be low to undetectable in the GUI.

> Supposedly doing 
> different string types - UTF8String vs (a reference counting version of 
> UTF-16-encoded) WideString - for Linux and Windows at the LCL-user-Code 
> interface is too confusing.

A *portable* UTF string implementation should be restricted, eliminating 
direct and indexed access to chars (which become substrings). A 
dedicated UTF16 class/type can be added at any time, as an optional package.
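
To illustrate why indexed access is the problem, here is a minimal 
sketch in Python, where a bytes object stands in for a byte-indexed 
UTF-8 string. The helper names (byte_substr, char_substr, 
is_valid_utf8) are made up for this example and are not LCL or RTL 
routines.

```python
# Illustration only: Python bytes stand in for a byte-indexed UTF-8 string.
text = "Grüße"
utf8 = text.encode("utf-8")  # 7 bytes for 5 characters


def byte_substr(data: bytes, start: int, count: int) -> bytes:
    """Naive substring by byte offset, as a byte-indexed copy would do."""
    return data[start:start + count]


def char_substr(data: bytes, start: int, count: int) -> bytes:
    """Substring by code point: decode, slice, re-encode."""
    return data.decode("utf-8")[start:start + count].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form a complete UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


broken = byte_substr(utf8, 0, 3)  # cuts the two-byte sequence for 'ü' in half
intact = char_substr(utf8, 0, 3)  # keeps 'ü' whole, so it is 4 bytes long
```

The byte-based version returns an invalid fragment as soon as the cut 
falls inside a multi-byte sequence, which is why a "char" effectively 
has to be treated as a substring.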

OTOH I agree that the weak (non-existing) distinction between Ansi and 
UTF8 strings is not pleasing. But here I'd establish a strong boundary 
between general (Unicode=UTF8) strings, and application specific strings 
of a single (immutable) codepage. Remember that "Ansi" is not a single 
specific encoding; it's a collection of encodings with one-byte Char 
units, including UTF-8. Then the user can choose a specific codepage 
(or UTF-16) for use inside his application, with e.g. an AppString type. 
Then it's clear where conversions are required and have to be inserted 
automatically by the compiler.
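
A rough sketch of such a boundary, written in Python only to show the 
idea. AppString and APP_CODEPAGE are hypothetical names, cp1252 is an 
arbitrary choice of codepage, and the conversions that a compiler would 
insert automatically are written out by hand here.

```python
# Hypothetical names throughout: AppString and APP_CODEPAGE are not
# Lazarus or FPC types, they only illustrate the proposed boundary.
APP_CODEPAGE = "cp1252"  # assumed application codepage, fixed program-wide


class AppString:
    """A string stored in one fixed single-byte codepage."""

    def __init__(self, raw: bytes):
        self.raw = raw

    @classmethod
    def from_utf8(cls, data: bytes) -> "AppString":
        # Conversion point 1: general UTF-8 string -> application codepage.
        # Raises UnicodeEncodeError if a character has no cp1252 equivalent.
        return cls(data.decode("utf-8").encode(APP_CODEPAGE))

    def to_utf8(self) -> bytes:
        # Conversion point 2: application codepage -> general UTF-8 string.
        return self.raw.decode(APP_CODEPAGE).encode("utf-8")

    def char_at(self, index: int) -> bytes:
        # One byte is one character here, so indexed access is safe.
        return self.raw[index:index + 1]


general = "Grüße".encode("utf-8")    # general string, 7 bytes
app = AppString.from_utf8(general)   # application string, 5 bytes
```

All conversions happen in the two marked places, and inside the 
application the string can be indexed per character again.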

The Delphi model, with differently encoded strings in the same string 
type, can result in much uncontrollable conversion overhead, easily 
outweighing the few possible optimizations with current AnsiStrings 
(assuming SBCS[1] only). The new ABI is also incompatible with existing 
DLLs of earlier Delphi/BCB versions, causing trouble with third-party 
components that are not available in the new ABI. Okay, no such problems 
exist with open source components, but not all Lazarus add-ons or apps 
are necessarily open source.

[1] With MBCS charsets the same rules apply as to UTF-8, so that UTF-8 
can immediately replace all MBCS encodings. So the decision about new 
string types affects *only* current SBCS Ansi users; even ASCII users 
are not affected.
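
A small check of the footnote, again in Python and again only a sketch. 
cp1252 and Shift-JIS are picked here as arbitrary examples of an SBCS 
and an MBCS codepage.

```python
# ASCII text has identical bytes in ASCII, an SBCS codepage and UTF-8.
ascii_text = "substr"
same_for_ascii = (
    ascii_text.encode("ascii")
    == ascii_text.encode("cp1252")
    == ascii_text.encode("utf-8")
)

# A non-ASCII SBCS character is one byte in its codepage, two in UTF-8.
accented = "é"
sbcs_bytes = accented.encode("cp1252")
utf8_bytes = accented.encode("utf-8")

# An MBCS codepage already has multi-byte characters, just like UTF-8.
kana = "あ"
mbcs_bytes = kana.encode("shift_jis")
kana_utf8 = kana.encode("utf-8")
```

Only the SBCS case changes the relation "one byte = one character"; the 
ASCII and MBCS cases behave under UTF-8 as they did before.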

DoDi




