[Lazarus] Does Lazarus support a complete Unicode Component Library?

Sven Barth pascaldragon at googlemail.com
Sun Jan 2 20:40:02 CET 2011


On 02.01.2011 19:04, Bo Berglund wrote:
> On Sat, 01 Jan 2011 19:13:26 +0100, Sven Barth
> <pascaldragon at googlemail.com>  wrote:
>
>>> Is it converted somehow?
>>> The native widget's encoding is either UTF-8 or UTF-16.
>>> Is the string actually a Utf8String or Utf16String then?
>>> When do I need to pay attention to it?
>>
>> Currently there is no automatic conversion (it's planned in one of the
>> branches of FPC). For now a String (=AnsiString) can be seen as an
>> "array of byte". You as a developer are responsible that the string
>> contains the correct encoding.
>>
>> So in your above example the string that is stored in "s" will be UTF8
>> encoded, because it comes from the GUI. But if that string contains
>> multibyte characters, each of those characters will show up as several
>> separate "one byte" characters if you access the string using [], Pos,
>> Copy, etc.
>>
>> Example (note: this is not accurate UTF8 encoding, I'm just making that
>> up here)
>>
>> TMemo.Lines[0] contains: 'hä?!' ( h a-umlaut ? ! )
>> I now assume that an a-umlaut is encoded as "ae" (which isn't really the
>> case, but it's for the sake of an example ^^)
>> s now contains: 'h a e ? !'
>>
>> If you now want to access the second character of s you'd expect that
>> you'd get the a-umlaut, but if you do s[2] you'll get an "a". And if you
>> access the third one (s[3]) you'll get the "e" instead of "?".
>>
>> You need to convert the UTF8 string to a different one, e.g. UTF16:
>>
>> var
>>    us: UnicodeString;
>> begin
>>    us := UTF8Decode(s);
>> end;
>>
>> Now us[2] will return the a-umlaut.
>>
>> I hope this example clears that up a bit, if not: just ask more questions ;)
>>
>
> I just stumbled across this thread and it worries me a little since
> the way Delphi introduced unicode is by ambush....
>
> What they did was to redefine the type string from AnsiString to
> something else unicode-ish in Delphi 2009. So all applications doing
> some string manipulation on data of type string broke severely. I hope
> I will not see the same here in FPC/Lazarus?
>
> My concern is that I am communicating using RS232 and I use string
> variables to hold my messages and commands. The protocol used is
> defined on a byte by byte level and it will not accept some
> "automatic" conversion being forced on the variables.
>
> So, will FPC stay with the current definition of string and let the
> developers decide what to handle as unicode strings by using a
> different type for these strings? For example "widestring" or
> "unicodestring" or the like?

There is currently a branch of FPC in which a "codepage aware" string type
is being developed. I don't know how far the RTL and compiler will be
modified once that work is finished, but I believe that the developers will
pay enough attention to backwards compatibility (as they already do now) and
that they'll listen to community input (and "fears") regarding this as well.
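
To make the byte-versus-character point from the quoted example concrete,
here is a minimal sketch with the real UTF8 bytes for the a-umlaut ($C3 $A4)
instead of the made-up "ae". The program name and the variable names are
only assumptions for illustration, and the bytes are spelled out so that the
encoding of the source file doesn't matter. It also shows why a byte-oriented
protocol is not affected as long as you never ask for a conversion: the
AnsiString only changes when you explicitly call something like UTF8Decode.

```pascal
program Utf8Demo;  // hypothetical name, for illustration only

{$mode objfpc}{$H+}

var
  s: AnsiString;      // treated as an "array of byte", assumed to hold UTF8
  us: UnicodeString;  // UTF16 code units
begin
  // 'h', a-umlaut, '?', '!' with the a-umlaut written as its two UTF8 bytes
  s := 'h' + Chr($C3) + Chr($A4) + '?!';

  WriteLn(Length(s));   // 5: Length counts bytes, not characters
  WriteLn(Ord(s[2]));   // 195 ($C3): only the first byte of the a-umlaut

  // Explicit conversion from UTF8 to UTF16; nothing happens automatically
  us := UTF8Decode(s);

  WriteLn(Length(us));  // 4: one code unit per character in this string
  WriteLn(Ord(us[2]));  // 228 ($E4): the a-umlaut as a single element
end.
```

Note that even with a UnicodeString, indexing gives you UTF16 code units,
so characters outside the Basic Multilingual Plane still take up two
elements.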

Regards,
Sven



