[Lazarus] Converting all code to use UnicodeString

Tue Sep 26 00:04:29 CEST 2017

On Mon, Sep 25, 2017 at 6:23 PM, Sven Barth via Lazarus
<lazarus at lists.lazarus-ide.org> wrote:
> On 25.09.2017 23:11, Marcos Douglas B. Santos via Lazarus wrote:
>>>> [...]
>>> The flags are -MDelphiUnicode, -MDelphi or -MObjFPC.
>>> But they only compile the units with sources in the unit path, which
>>> excludes all FPC units. Also keep in mind that the system unit and the
>>> RTL require a lot of low level functions, which require separate
>>> versions.
>>
>> Which makes these flags useless for that. All code (mine, the RTL,
>> Lazarus, etc.) would have to be compiled this way to make everything
>> work with one type of string.
>
> No, because especially the RTL and FCL are usually provided precompiled.
> Thus you can't change the string type anymore afterwards without
> recompiling all the code.

That's what I am talking about.
I use FPC and Lazarus from sources. I compile both; I have never used
an installer.
Maybe you already answered this in another way, but: in that case, can I
compile FPC and Lazarus with these flags (all strings = UnicodeString)
and will everything work like that?

>> I can help at a high level (classes, components, etc.), not at the
>> compiler level.
>> But how can I find out about these tasks? May I just pick any one I
>> want from the bug tracker? How do I know who is working on each task,
>> and which is more important?
>
> Currently no one is working on it.

:-O

> A first step would be to add modeswitch headers to all units that must
> not use a specific mode (e.g. the System, ObjPas and some more language
> support units) like this:
>
> === code begin ===
>
> {$ifdef FPC_UNICODE_RTL}
> {$modeswitch unicodestrings}
> {$endif}
>
> === code end ===
>
> Once this is done one can test to compile the RTL, FCL and packages with
> FPC_UNICODE_RTL defined and see what blows and fix that step by step...
>
> Alternatively a constant in the System unit might be better so that one
> can check like this:
>
> === code begin ===
>
> // System unit
> {$ifdef FPC_UNICODE_RTL}
> FpcRtlIsUnicode = true;
> {$else}
> FpcRtlIsUnicode = false;
> {$endif}
>
> // some other unit
> {$if FpcRtlIsUnicode}
> {$modeswitch unicodestrings}
> {$endif}
>
> === code end ===

I've put {$modeswitch unicodestrings} in two simple programs (CLI and
GUI) and... CRASH!
Imagine working with this on the compiler level...  :)

My first thought about this is:
Every string argument of all classes and functions should be a raw
string, i.e. RawByteString.
There may be some other types (UTF8String, UTF16String, etc.), but only
for users to use at the high level. For example: if the user knows that
a file is encoded in UTF-8, he/she will use UTF8String only to receive
that buffer.
Then every single RTL class/function that works with strings should
check which encoding was used (and we already have this today). These
functions will receive a "string", check which encoding it has (e.g.
UTF8String, following the example), and pass it to another built-in
private function to do the job.

"UnicodeString", as we know it today, shouldn't exist. It does not
make any sense.
There should be only RawByteString (which should simply be "string")
and the other types that define the encoding, as I said above, used
only once to receive the buffer.
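
To make the idea above more concrete, here is a minimal sketch of what I
mean. It is only an illustration, not a proposal for the real RTL:
StringCodePage, SetCodePage and CP_UTF8 come from the System unit, while
CharCount, Utf8CodePointCount, SingleByteCharCount and Tagged are names
I made up for this example. It assumes that every non-UTF-8 buffer is a
single-byte encoding, which is of course a simplification.

```pascal
program RawDispatchSketch;

{$mode objfpc}{$H+}

// Hypothetical private worker: counts UTF-8 code points by skipping
// continuation bytes (bit pattern 10xxxxxx).
function Utf8CodePointCount(const S: RawByteString): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

// Hypothetical private worker: assumes one byte per character.
function SingleByteCharCount(const S: RawByteString): SizeInt;
begin
  Result := Length(S);
end;

// Hypothetical public entry point: accepts any byte string, checks the
// code page stored with it and dispatches to the matching worker.
function CharCount(const S: RawByteString): SizeInt;
begin
  case StringCodePage(S) of
    CP_UTF8: Result := Utf8CodePointCount(S);
  else
    Result := SingleByteCharCount(S);
  end;
end;

// Hypothetical helper for the demo: returns a copy of the bytes tagged
// with the given code page, without converting the data.
function Tagged(const Bytes: RawByteString; CP: TSystemCodePage): RawByteString;
begin
  Result := Bytes;
  SetCodePage(Result, CP, False);
end;

begin
  // The same five bytes, once tagged as UTF-8 and once as code page 1252.
  WriteLn(CharCount(Tagged('caf'#$C3#$A9, CP_UTF8)));
  WriteLn(CharCount(Tagged('caf'#$C3#$A9, 1252)));
end.
```

The caller only ever passes "a string"; the encoding is attached to the
buffer once, when it is received, and the public function decides which
private routine does the job.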

> Or if one wants to compile with -Municodestrings then instead the core
> units need to be protected with
>
> === code begin ===
>
> {$modeswitch unicodestrings-}
>
> === code end ===
>
> I'm currently not sure what would be the better approach in the long
> term... :/

Just guessing:
The default is not Unicode, so there shouldn't be any need to use
{$modeswitch unicodestrings-}.

Regards,
Marcos Douglas