[Lazarus] String vs WideString
Tony Whyman
tony.whyman at mccallumwhyman.com
Thu Aug 17 13:09:08 CEST 2017
On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
> 2. Clean up the char type.
>> ...
>> Why shouldn't there be a single char type that intuitively represents
>> a single character regardless of how many bytes are used to represent it.
> What do you mean by "a single character"?
> A "character" in Unicode can mean about 7 different things. Which one
> is your pick?
> This question is for everybody in this thread who used the word "character".
Are you making my points for me? If such a basic term as "character"
can mean seven different things then something is badly amiss. It should
be fairly obvious that in this context, character = printable symbol -
whilst for practical reasons also allowing for format control characters
such as an "end of line" and "end of string".
I believe that you need to go back to the idea of an abstract
representation of a character with a constant semantic, separate from
the actual encoding, for which there may be many different and valid
encodings. For example, using a somewhat dated comparison, a lower case
Latin letter 'a' should always have a constant semantic, but in ASCII it
is encoded as decimal 97, while in EBCDIC it is encoded as decimal 129.
Even though they have different binary values, they represent the same
abstract character.
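The ASCII/EBCDIC comparison can be checked directly. A minimal Python sketch (Python here only as a neutral illustration; cp500 is one of the standard EBCDIC code pages):

```python
# The abstract character 'a' has one meaning but many encodings.
ascii_byte = 'a'.encode('ascii')    # ASCII
ebcdic_byte = 'a'.encode('cp500')   # EBCDIC (IBM code page 500)

print(ascii_byte[0])   # 97  - decimal value in ASCII
print(ebcdic_byte[0])  # 129 - decimal value in EBCDIC

# Different binary values, same abstract character:
assert ascii_byte.decode('ascii') == ebcdic_byte.decode('cp500') == 'a'
```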
I want a 'char' type in Pascal to represent a character such as a lower
case 'a' regardless of the encoding used. Indeed, for a program to be
properly portable, the programmer should not have to care about the
actual encoding - only that it is a lower case 'a'.
Hence my proposal that a character type should include an implicit or
explicit attribute that records the encoding scheme used - which could
vary from ASCII to UTF-32.
You can then go on to define a text string as an array of characters
with the same encoding scheme.
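A sketch of what such an encoding-tagged character type might look like. This is purely illustrative (the name EncodedChar and its methods are hypothetical, not any existing FPC/Lazarus type), and it uses Python only to keep the example self-contained:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncodedChar:
    """A character value tagged with the encoding scheme used to store it.
    Two values are 'the same character' whenever they decode identically,
    regardless of the underlying bytes."""
    data: bytes      # the raw encoded bytes
    encoding: str    # e.g. 'ascii', 'cp500', 'utf-8', 'utf-32-le'

    def abstract(self) -> str:
        # The encoding-independent character this value represents.
        return self.data.decode(self.encoding)

    def same_char(self, other: 'EncodedChar') -> bool:
        # Equality of the abstract character, not of the bytes.
        return self.abstract() == other.abstract()

# The same lower case 'a' under three different encodings:
a_ascii = EncodedChar(b'\x61', 'ascii')
a_ebcdic = EncodedChar(b'\x81', 'cp500')
a_utf32 = EncodedChar('a'.encode('utf-32-le'), 'utf-32-le')
assert a_ascii.same_char(a_ebcdic) and a_ascii.same_char(a_utf32)
```

A string type with the same property would then be a sequence of such values sharing one encoding tag.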
>
>> Yes, in a world where we have to live with UTF8, UTF16, UTF32, legacy code
>> pages and Chinese variations on UTF8, that means that dynamic attributes
>> have to be included in the type. But isn't that the only way to have
>> consistent and intuitive character handling?
> What do you mean? Chinese don't have a variation of UTF8.
> UTF8 is global unambiguous encoding standard, part of Unicode.
I was referring to GB 18030, which encodes code points as one-, two-
and four-byte sequences.
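GB 18030's variable-width behaviour is easy to observe with Python's standard gb18030 codec (again, just an illustration of the point):

```python
# GB 18030 uses 1, 2 or 4 bytes depending on the code point:
samples = ['a', '\u4e2d', '\U0001D11E']  # ASCII letter, CJK ideograph, musical symbol
for ch in samples:
    encoded = ch.encode('gb18030')
    print(f'U+{ord(ch):06X} -> {len(encoded)} byte(s)')
# 'a' -> 1 byte, U+4E2D -> 2 bytes, U+1D11E -> 4 bytes
```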
>
> The fundamental problem is that you want to hide the complexity of
> Unicode by some magic String type of a compiler.
> It is not possible. Unicode remains complex but the complexity is NOT
> in encodings!
> No, a codepoint's encoding is the easy part. For example I was easily
> able to create a unit to support encoding agnostic code. See unit
> LazUnicode in package LazUtils.
> The complexity is elsewhere:
> - "Character" composed of codepoints in precomposed and decomposed
> (normalized) forms.
> - Compare and sort text based on locale.
> - Uppercase / Lowercase rules based on locale.
> - Glyphs
> - Graphemes
> - etc.
>
> I must admit I don't understand well those complex parts.
> I do understand codeunits and codepoints, and I understand they are
> the easy part.
>
> Juha
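[As an aside, the precomposed/decomposed point quoted above is easy to demonstrate with Python's unicodedata module: the same visible character 'é' can be one code point or two, and only normalisation makes the two forms compare equal.]

```python
import unicodedata

precomposed = '\u00E9'   # 'é' as a single code point (NFC form)
decomposed = 'e\u0301'   # 'e' + COMBINING ACUTE ACCENT (NFD form)

# They render identically but are different code point sequences:
assert precomposed != decomposed
assert len(precomposed) == 1 and len(decomposed) == 2

# Normalising both to a common form makes them compare equal:
assert unicodedata.normalize('NFC', decomposed) == precomposed
assert unicodedata.normalize('NFD', precomposed) == decomposed
```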
The point I believe that you are missing is to consider that a character
is an abstract symbol with a semantic independent of how it is encoded.
Collation sequences are independent of encoding and should remain the
same regardless of how a character set is encoded.
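This can be seen concretely: raw byte order differs between encodings (ASCII puts 'A' before 'a'; EBCDIC does the opposite), so any sort that compares encoded bytes is encoding-dependent, whereas a collation defined on the abstract characters is not. A small Python illustration, with cp500 again standing in for EBCDIC:

```python
# Byte values of 'A' and 'a' under two encodings:
ascii_A, ascii_a = 'A'.encode('ascii')[0], 'a'.encode('ascii')[0]
ebcdic_A, ebcdic_a = 'A'.encode('cp500')[0], 'a'.encode('cp500')[0]

# Sorting raw bytes gives different orders in different encodings:
assert ascii_A < ascii_a      # ASCII: 65 < 97, upper case sorts first
assert ebcdic_a < ebcdic_A    # EBCDIC: 129 < 193, lower case sorts first

# A collation defined on abstract characters (here, a simple
# case-insensitive-then-case rule) is the same whatever the encoding:
def collate_key(ch: str):
    return (ch.lower(), ch)

assert sorted(['a', 'A'], key=collate_key) == sorted(['A', 'a'], key=collate_key)
```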