[Lazarus] String vs WideString

Marcos Douglas B. Santos md at delfire.net
Mon Aug 14 15:11:27 CEST 2017


On Mon, Aug 14, 2017 at 6:53 AM, Tony Whyman via Lazarus
<lazarus at lists.lazarus-ide.org> wrote:
>
> On 13/08/17 12:18, Juha Manninen via Lazarus wrote:
>>
>> Unicode was designed to solve exactly the problems caused by locale
>> differences.
>> Why don't you use it?
>
> I believe you effectively answer your own question in your preceding post:
>
>> Actually using the Windows system codepage is not safe any more.
>> The current Unicode system in Lazarus maps AnsiString to use UTF-8.
>> Text with Windows codepage must be converted explicitly.
>> This is a breaking change compared to the old Unicode support in
>> Lazarus 1.4.x + FPC 2.6.x.
>
> If you are processing strings as "text" then you probably do not care how
> they are encoded and can live with "breaking changes". However, if for some
> reason you are, or need to be, aware of how the text is encoded, or are
> using string types as a useful container for binary data, then types that
> sneak up on you with implicit type conversions, or which have semantics that
> change between compilers or versions, are just another source of bugs.
>
> PChar used to be a safe means to access binary data - but not anymore,
> especially if you move between FPC and Delphi. (One of my gripes is that the
> FCL still makes too much use of PChar instead of PByte with the resulting
> Delphi incompatibility). The "string" type also used to be a safe container
> for any sort of binary data, but when its definition can change between
> compilers and versions, it is now something to be avoided.
>
> As a general rule, I now always use PByte for any sort of string that is
> binary, untyped or encoding to be determined. It works across compilers (FPC
> and Delphi) with consistent semantics and is safe for such use.
>
> I also really like AnsiString from FPC 3.0 onwards. By making the encoding a
> dynamic attribute of the type, it means that I know what is in the container
> and can keep control.
>
> I am sorry, but I would only even consider using UnicodeString as a type
> (or the default string type) when I am just processing text for which the
> encoding is a don't care, such as a window caption, or for intensive text
> analysis. If I am reading/writing text from a file or database where the
> encoding is often implicit and may vary from the Unicode standard then my
> preference is for AnsiString. I can then read the text (e.g. from the file)
> into a (RawByteString) buffer, set the encoding and then process it safely
> while often avoiding the overhead from any transliteration. PByte comes into
> its own when the file contains a mixture of binary data and text.
>
> Text files and databases tend to use UTF-8 or are encoded using legacy
> Windows Code pages. The Chinese also have GB18030. With a database, the
> encoding is usually known and AnsiString is a good way to read/write data
> and to convey the encoding, especially as databases usually use a variable
> length multi-byte encoding natively and not UTF-16/Unicode. With files, the
> text encoding is usually implicit and AnsiString is ideal for this as it
> lets you read in the text and then assign the (implicit) encoding to the
> string, or ensure the correct encoding when writing.
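
To make the workflow described above concrete, here is a minimal sketch for
FPC 3.0 or later: read the raw bytes into a RawByteString, tag them with the
encoding you know they have, and convert only when you need to. The file name
'legacy.txt', the code page 1252 and the helper LoadRawText are assumptions
made up for this illustration; they are not from the thread.

```pascal
program ReadWithEncoding;
{$mode objfpc}{$H+}

uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager for code page conversion on Unix
  Classes, SysUtils;

// Hypothetical helper: load a file's bytes and label them with CodePage.
// No transliteration happens here; the bytes are left exactly as read.
function LoadRawText(const FileName: string;
  CodePage: TSystemCodePage): RawByteString;
var
  Stream: TFileStream;
begin
  Result := '';
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, Stream.Size);
    if Length(Result) > 0 then
      Stream.ReadBuffer(Result[1], Length(Result));
  finally
    Stream.Free;
  end;
  // Convert = False: only the string's code page tag changes.
  SetCodePage(Result, CodePage, False);
end;

var
  Buf: RawByteString;
begin
  // Assumption: the file is known to be Windows-1252 text.
  Buf := LoadRawText('legacy.txt', 1252);
  WriteLn('Bytes read: ', Length(Buf), ', code page: ', StringCodePage(Buf));

  // Convert = True: the bytes are actually re-encoded, here to UTF-8.
  SetCodePage(Buf, CP_UTF8, True);
  WriteLn('Bytes as UTF-8: ', Length(Buf));
end.
```

The last argument of SetCodePage is what separates "assign the (implicit)
encoding" from "convert": False only relabels the buffer, True re-encodes it.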

Unicode is everywhere, yet you are using AnsiString for everything...
Now I'm confused.

> And anyway, I do most of my work in Linux, so why would I even want to
> bother myself with arrays of widechars when the default environment is UTF8?

Maybe you do not have problems because you don't use Windows.

> We do need some stability and consistency in strings which, as someone else
> noted, have been confused by Embarcadero. I would like to see that focused on
> AnsiString with UnicodeString being only for specialist use on Windows or
> when intensive text analysis makes a two byte encoding more efficient than a
> variable length multi-byte encoding.

FPC and Lazarus claim to be cross-platform (and they are), and
because of that, IMHO, both should be used in the same way on every
system, don't you think?

Best regards,
Marcos Douglas

