[Lazarus] String vs WideString
Tony Whyman
tony.whyman at mccallumwhyman.com
Mon Aug 14 11:53:44 CEST 2017
On 13/08/17 12:18, Juha Manninen via Lazarus wrote:
> Unicode was designed to solve exactly the problems caused by locale differences.
> Why don't you use it?
I believe you effectively answer your own question in your preceding post:
> Actually using the Windows system codepage is not safe any more.
> The current Unicode system in Lazarus maps AnsiString to use UTF-8.
> Text with Windows codepage must be converted explicitly.
> This is a breaking change compared to the old Unicode support in
> Lazarus 1.4.x + FPC 2.6.x.
If you are processing strings as "text" then you probably do not care
how the text is encoded and can live with "breaking changes". However,
if for some reason you are, or need to be, aware of how the text is
encoded, or you are using string types as a useful container for binary
data, then types that sneak up on you with implicit type conversions,
or whose semantics change between compilers or versions, are just
another source of bugs.
PChar used to be a safe means of accessing binary data - but not any
more, especially if you move between FPC and Delphi. (One of my gripes
is that the FCL still makes too much use of PChar instead of PByte,
with the resulting Delphi incompatibility.) The "string" type also used
to be a safe container for any sort of binary data, but now that its
definition can change between compilers and versions, it is something
to be avoided.
As a general rule, I now always use PByte for any sort of string data
that is binary, untyped, or whose encoding is yet to be determined. It
works across compilers (FPC and Delphi) with consistent semantics and
is safe for such use.
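For example, something along these lines - a minimal sketch only, with
LoadBinaryFile being a made-up helper name - keeps binary file contents
well away from any string type and so away from any implicit codepage
conversion (SizeInt is FPC's native size type; substitute NativeInt for
Delphi):

uses
  SysUtils, Classes;

{ Read a whole file into a heap buffer addressed through PByte.
  The caller releases the buffer with FreeMem when done. }
function LoadBinaryFile(const AFileName: string; out ASize: SizeInt): PByte;
var
  FS: TFileStream;
begin
  Result := nil;
  FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    ASize := FS.Size;
    if ASize > 0 then
    begin
      GetMem(Result, ASize);
      FS.ReadBuffer(Result^, ASize);
    end;
  finally
    FS.Free;
  end;
end;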
I also really like AnsiString from FPC 3.0 onwards. Because the
encoding is a dynamic attribute of the type, I know what is in the
container and can keep control.
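A minimal sketch of what I mean (FPC 3.0+, plain ASCII content assumed
so the relabelling below is harmless):

program CodePageDemo;

var
  S: RawByteString;
begin
  // Bytes arrive from somewhere with no declared encoding.
  S := 'raw bytes from a legacy source';
  // Relabel only: Convert = False changes the code page attribute
  // of the string without touching the bytes themselves.
  SetCodePage(S, 1252, False);
  WriteLn('Code page: ', StringCodePage(S));
  // Any change of representation is then an explicit, visible step:
  // Convert = True transliterates the bytes into UTF-8.
  SetCodePage(S, CP_UTF8, True);
  WriteLn('Code page: ', StringCodePage(S));
end.

SetCodePage and StringCodePage live in the System unit, so nothing
extra is needed in the uses clause.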
I am sorry, but I would only even consider using UnicodeString as a
type (or as the default string type) when I am just processing text for
which the encoding is a don't-care, such as a window caption, or for
intensive text analysis. If I am reading/writing text from a file or
database, where the encoding is often implicit and may vary from the
Unicode standard, then my preference is for AnsiString. I can then read
the text (e.g. from the file) into a (RawByteString) buffer, set the
encoding and then process it safely, often avoiding the overhead of any
transliteration. PByte comes into its own when the file contains a
mixture of binary data and text.
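As a concrete sketch of that read-and-label workflow (LoadTextAs is
just a hypothetical helper; the code page parameter is whatever the
file is known, or assumed, to use):

uses
  SysUtils, Classes;

{ Read the raw bytes of a text file into a RawByteString and then
  declare what encoding they are already in. Convert = False means no
  transliteration takes place - the bytes are only labelled. }
function LoadTextAs(const AFileName: string;
  ACodePage: TSystemCodePage): RawByteString;
var
  FS: TFileStream;
begin
  FS := TFileStream.Create(AFileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, FS.Size);
    if Length(Result) > 0 then
      FS.ReadBuffer(Result[1], Length(Result));
  finally
    FS.Free;
  end;
  SetCodePage(Result, ACodePage, False);
end;

Called as, say, LoadTextAs('legacy.txt', 1252), the string then carries
its encoding with it and any later conversion is explicit.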
Text files and databases tend to use UTF-8 or are encoded using legacy
Windows code pages; the Chinese also have GB18030. With a database, the
encoding is usually known, and AnsiString is a good way to read/write
data and to convey the encoding, especially as databases usually use a
variable-length multi-byte encoding natively and not UTF-16/Unicode.
With files, the text encoding is usually implicit, and AnsiString is
ideal for this as it lets you read in the text and then assign the
(implicit) encoding to the string, or ensure the correct encoding when
writing.
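The writing side, again only as a sketch with a made-up helper name:

uses
  SysUtils, Classes;

{ Make sure the string carries the encoding the target file expects
  before its bytes are written. Convert = True transliterates where
  necessary. }
procedure SaveTextAs(const AFileName: string; const AText: RawByteString;
  ACodePage: TSystemCodePage);
var
  S: RawByteString;
  FS: TFileStream;
begin
  S := AText;
  SetCodePage(S, ACodePage, True);   // explicit, visible conversion
  FS := TFileStream.Create(AFileName, fmCreate);
  try
    if Length(S) > 0 then
      FS.WriteBuffer(S[1], Length(S));
  finally
    FS.Free;
  end;
end;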
And anyway, I do most of my work on Linux, so why would I even want to
bother myself with arrays of widechars when the default environment is UTF-8?
We do need some stability and consistency in strings, which, as someone
else noted, have been confused by Embarcadero. I would like to see that
focused on AnsiString, with UnicodeString being only for specialist use
on Windows or when intensive text analysis makes a two-byte encoding
more efficient than a variable-length multi-byte encoding.
Tony Whyman
MWA