[Lazarus] String vs WideString

Tony Whyman tony.whyman at mccallumwhyman.com
Mon Aug 14 11:53:44 CEST 2017


On 13/08/17 12:18, Juha Manninen via Lazarus wrote:
> Unicode was designed to solve exactly the problems caused by locale differences.
> Why don't you use it?
I believe you effectively answer your own question in your preceding post:

> Actually using the Windows system codepage is not safe any more.
> The current Unicode system in Lazarus maps AnsiString to use UTF-8.
> Text with Windows codepage must be converted explicitly.
> This is a breaking change compared to the old Unicode support in
> Lazarus 1.4.x + FPC 2.6.x.
If you are processing strings as "text" then you probably do not care 
how it is encoded and can live with "breaking changes". However, if 
for some reason you are, or need to be, aware of how the text is 
encoded - or are using string types as a convenient container for 
binary data - then types that sneak up on you with implicit type 
conversions, or whose semantics change between compilers or versions, 
are just another source of bugs.
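
To show the kind of thing I mean by a type that sneaks up on you, here 
is a minimal sketch (FPC 3.x); the type name and the sample bytes are 
invented for illustration:

program ImplicitConv;
{$mode objfpc}{$H+}

type
  TWin1252String = type AnsiString(1252);  // string type bound to CP1252

var
  Raw: RawByteString;
  Win: TWin1252String;
begin
  // Pretend Raw holds UTF-8 bytes read from somewhere ($C3 $BC is 'ü').
  Raw := #$C3#$BC;
  SetCodePage(Raw, CP_UTF8, False);  // label the bytes, no conversion
  Win := Raw;                        // implicit transliteration UTF-8 -> CP1252
  WriteLn('before: ', Length(Raw), ' byte(s), after: ', Length(Win), ' byte(s)');
end.

Harmless for genuine text, but the same silent conversion will happily 
mangle binary data that merely happens to be parked in a string 
variable.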

PChar used to be a safe means of accessing binary data - but not any 
more, especially if you move between FPC and Delphi. (One of my gripes 
is that the FCL still makes too much use of PChar instead of PByte, 
with the resulting Delphi incompatibility.) The "string" type also 
used to be a safe container for any sort of binary data, but now that 
its definition can change between compilers and versions, it is 
something to be avoided.

As a general rule, I now always use PByte for any sort of string that 
is binary, untyped, or whose encoding is yet to be determined. It 
works across compilers (FPC and Delphi) with consistent semantics and 
is safe for such use.
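
Something like this is what I have in mind - only a sketch, with the 
function and the buffer invented:

program PByteScan;
{$mode objfpc}{$H+}

// Count zero bytes in a raw buffer. With PChar this code would change
// meaning on Delphi 2009+, where Char is two bytes; PByte is unambiguous.
function CountZeroBytes(Data: PByte; Len: NativeInt): NativeInt;
var
  i: NativeInt;
begin
  Result := 0;
  for i := 0 to Len - 1 do
  begin
    if Data^ = 0 then
      Inc(Result);
    Inc(Data);  // always advances by exactly one byte
  end;
end;

var
  Buf: array[0..7] of Byte = (1, 0, 2, 0, 0, 3, 4, 0);
begin
  WriteLn(CountZeroBytes(@Buf[0], Length(Buf)));
end.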

I also really like AnsiString from FPC 3.0 onwards. Because the 
encoding is a dynamic attribute of the string, I know what is in the 
container and can keep control.
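
A rough illustration of what I mean (FPC 3.x; the codepage values are 
just examples):

program DynamicEncoding;
{$mode objfpc}{$H+}

var
  S: RawByteString;
begin
  S := 'hello';
  WriteLn('initial codepage:  ', StringCodePage(S));

  // Label the existing bytes as UTF-8 without touching them ...
  SetCodePage(S, CP_UTF8, False);
  WriteLn('after labelling:   ', StringCodePage(S));

  // ... or ask the RTL to transliterate to another codepage.
  SetCodePage(S, 1252, True);
  WriteLn('after converting:  ', StringCodePage(S));
end.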

I am sorry, but I would only even consider using UnicodeString as a 
type (or the default string type) when I am just processing text for 
which the encoding does not matter, such as a window caption, or for 
intensive text analysis. If I am reading/writing text from a file or 
database, where the encoding is often implicit and may vary from the 
Unicode standard, then my preference is for AnsiString. I can then 
read the text (e.g. from the file) into a (RawByteString) buffer, set 
the encoding and then process it safely, often avoiding the overhead 
of any transliteration. PByte comes into its own when the file 
contains a mixture of binary data and text.
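
In practice that looks roughly like this - a sketch only, with the 
helper and the file name invented, assuming FPC 3.x:

program LoadTextSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Load a file's bytes and label them with the encoding we know (or
// assume) the file uses, without transliterating anything.
function LoadTextAs(const FileName: string;
  CodePage: TSystemCodePage): RawByteString;
var
  FS: TFileStream;
  Len: SizeInt;
begin
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Len := FS.Size;
    SetLength(Result, Len);
    if Len > 0 then
      FS.ReadBuffer(Result[1], Len);
  finally
    FS.Free;
  end;
  SetCodePage(Result, CodePage, False);  // label only, no conversion
end;

var
  S: RawByteString;
begin
  S := LoadTextAs('example.txt', CP_UTF8);  // hypothetical file
  WriteLn('codepage: ', StringCodePage(S), ', bytes: ', Length(S));
end.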

Text files and databases tend to use UTF-8 or are encoded using legacy 
Windows code pages; the Chinese also have GB18030. With a database, 
the encoding is usually known and AnsiString is a good way to 
read/write data and to convey the encoding, especially as databases 
usually use a variable-length multi-byte encoding natively rather than 
UTF-16/Unicode. With files, the text encoding is usually implicit and 
AnsiString is ideal here as it lets you read in the text and then 
assign the (implicit) encoding to the string, or ensure the correct 
encoding when writing.
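
The writing direction, again only as a sketch (the target codepage and 
the output file name are assumptions of mine):

program SaveTextSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Transliterate to the codepage the file (or database column) expects,
// then write out the raw bytes of the result.
procedure SaveTextAs(const FileName: string; const Text: RawByteString;
  CodePage: TSystemCodePage);
var
  S: RawByteString;
  FS: TFileStream;
begin
  S := Text;
  SetCodePage(S, CodePage, True);  // converts only if the codepages differ
  FS := TFileStream.Create(FileName, fmCreate);
  try
    if Length(S) > 0 then
      FS.WriteBuffer(S[1], Length(S));
  finally
    FS.Free;
  end;
end;

begin
  SaveTextAs('out.txt', 'some text', CP_UTF8);  // hypothetical output
end.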

And anyway, I do most of my work on Linux, so why would I even want to 
bother myself with arrays of widechars when the default environment is 
UTF-8?

We do need some stability and consistency in strings, which, as 
someone else noted, have been confused by Embarcadero. I would like to 
see that focused on AnsiString, with UnicodeString reserved for 
specialist use on Windows or for when intensive text analysis makes a 
two-byte encoding more efficient than a variable-length multi-byte 
encoding.

Tony Whyman
MWA


