[Lazarus] String vs WideString

Juha Manninen juha.manninen62 at gmail.com
Wed Aug 16 12:05:43 CEST 2017


On Mon, Aug 14, 2017 at 4:21 PM, Tony Whyman via Lazarus
<lazarus at lists.lazarus-ide.org> wrote:
> UTF-16/Unicode can only store 65,536 characters while the Unicode standard
> (that covers UTF8 as well) defines 136,755 characters.
> UTF-16/Unicode's main advantage seems to be for rapid indexing of large
> strings.

That shows complete ignorance about Unicode on your side.
You are treating UTF-16 as a fixed-width encoding, which it is not:
codepoints outside the Basic Multilingual Plane need two 16-bit code
units, a surrogate pair.  :(
Unfortunately many other programmers have had the same wrong idea, or
they were just lazy. Either way, the result is a lot of broken UTF-16
code out there.
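To make the point concrete (illustrated here in Python for brevity; the same holds in any language): a codepoint above U+FFFF occupies two UTF-16 code units, so code that assumes one code unit per codepoint breaks on such text.

```python
# U+1F600 ("grinning face") lies outside the Basic Multilingual Plane,
# so UTF-16 must encode it as a surrogate pair -- two 16-bit code units.
s = "\U0001F600"
utf16 = s.encode("utf-16-be")

print(len(s))             # 1 codepoint
print(len(utf16) // 2)    # 2 UTF-16 code units
print(utf16.hex())        # d83dde00: high surrogate D83D + low surrogate DE00
```

Any UTF-16 code that indexes or truncates at an arbitrary code unit can split such a pair and corrupt the text.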


On Tue, Aug 15, 2017 at 12:15 PM, Tony Whyman via Lazarus
<lazarus at lists.lazarus-ide.org> wrote:
> If a topic keeps on being discussed after 10+ years of argument, the reason
> is usually either (a) the problem and its solution have not been documented
> properly, or (b) the outcome is an unsatisfactory compromise.

Or (c) the people discussing it are ignorant about the topic.

> I went back and read the wiki article you mentioned and was no more the
> wiser as to why the current mess exists. Is it really no more than because
> Delphi continues to screw up in this area, so must FPC? The body of the
> article appears to be a set of notes - not necessarily wrong in themselves
> but lacking the background and context needed to explain why it is like it is.

Hmmm...
Originally the page was a mess because it had lots of irrelevant
background info about the old, obsolete LCL Unicode support. Text had
been added by many people but none was ever removed.
Finally I cleaned the page up. It now has the most relevant info at
the top, with special cases and technical details later.
I am rather happy with the page now; it explains how to use Unicode
with Lazarus as clearly as possible.
However, I am willing to improve it. What kind of background and
context would you need?

> 1. Stop using the term "Unicode".

You can stop using it. No problem.
For others, however, it is a well-defined international standard. See:
  https://en.wikipedia.org/wiki/Unicode

> 2. Clean up the char type.
> ...
> Why shouldn't there be a single char type that intuitively represents
> a single character regardless of how many bytes are used to represent it.

What do you mean by "a single character"?
In Unicode, "character" can mean about seven different things: a
codeunit, a codepoint, a grapheme cluster, a glyph, and so on. Which
one is your pick?
That question goes to everybody in this thread who has used the word
"character".
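The ambiguity is easy to demonstrate (a small Python sketch; the counts themselves are encoding facts, not specific to Python). The same piece of user-visible text gives a different "length" depending on which definition of "character" you count:

```python
# "e" followed by U+0301 (combining acute accent) renders as one
# user-perceived character, yet it counts differently at every level:
s = "e\u0301"

print(len(s.encode("utf-8")))           # 3 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
print(len(s))                           # 2 codepoints
# ...and 1 grapheme cluster on screen -- one visible "character".
```

So "how many characters is this string?" has no single answer until you say which of these units you mean.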

> Yes, in a world where we have to live with UTF8, UTF16, UTF32, legacy code
> pages and Chinese variations on UTF8, that means that dynamic attributes
> have to be included in the type. But isn't that the only way to have
> consistent and intuitive character handling?

What do you mean? The Chinese don't have a variation of UTF8.
UTF8 is a global, unambiguous encoding standard, part of Unicode.

The fundamental problem is that you want to hide the complexity of
Unicode behind some magic String type in the compiler.
That is not possible. Unicode remains complex, but the complexity is
NOT in the encodings!
No, a codepoint's encoding is the easy part. For example, I was easily
able to create a unit that supports encoding-agnostic code: see unit
LazUnicode in package LazUtils.
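LazUnicode itself is a Pascal unit; the sketch below only mirrors the encoding-agnostic idea in Python. Once you iterate over codepoints, the loop body never needs to know which encoding the bytes arrived in:

```python
# The same text, round-tripped through two different encodings,
# yields the identical sequence of codepoints.
text = "αβ€"

from_utf8  = [ord(c) for c in text.encode("utf-8").decode("utf-8")]
from_utf16 = [ord(c) for c in text.encode("utf-16-le").decode("utf-16-le")]

print(from_utf8 == from_utf16)        # True: same codepoints either way
print([hex(cp) for cp in from_utf8])  # ['0x3b1', '0x3b2', '0x20ac']
```

This is why codepoint-level code can be written once and stay correct across encodings: the encoding only matters at the boundary where bytes are decoded.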
The complexity is elsewhere:
- A "character" composed of codepoints, in precomposed and decomposed
(normalized) forms.
- Comparing and sorting text based on locale.
- Uppercase/lowercase rules based on locale.
- Glyphs.
- Graphemes.
- etc.
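Two of the complex parts listed above can be sketched briefly (in Python, whose standard library exposes them; the underlying rules come from the Unicode standard, not the language):

```python
import unicodedata

# 1) Precomposed vs. decomposed forms compare unequal as raw codepoint
#    sequences, even though both render as the same "character":
precomposed = "\u00E9"   # é as one codepoint
decomposed  = "e\u0301"  # e + combining acute accent

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# 2) Case mapping is not one-to-one: German sharp s uppercases to "SS",
#    so uppercasing can change the length of a string.
print("stra\u00DFe".upper())   # STRASSE
```

And these examples don't even touch the locale-dependent parts, such as collation, where the correct answer differs from language to language.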

I must admit I don't understand those complex parts well.
I do understand codeunits and codepoints, and I understand that they
are the easy part.

Juha
