[Lazarus] Debugging fixed strings in UTF8 encoding

Martin lazarus at mfriebe.de
Mon Apr 1 12:30:59 CEST 2013


On 01/04/2013 11:18, Mattias Gaertner wrote:
> On Mon, 01 Apr 2013 11:03:31 +0100
> Martin <lazarus at mfriebe.de> wrote:
>
>> On 01/04/2013 10:53, Mattias Gaertner wrote:
>>> On Mon, 01 Apr 2013 10:13:22 +0100
>>> Martin <lazarus at mfriebe.de> wrote:
>>>
>>>> [...]
>>>> It could use a heuristic: check whether the result contains such
>>>> invalid chars, and if there is one, render everything as #123. But an
>>>> ANSI string may sometimes also be a valid UTF-8 string, yet the UTF-8
>>>> reading would map to entirely different chars. In that case the
>>>> heuristic would show UTF-8 without any warning that the content is
>>>> wrong (well, it already does/would do that).
>>> How likely is this case?
>>>
>> That wasn't an argument against it, more a general observation... And
>> it is already happening.
> Do you mean there were already real cases where a string had bytes
> #191..#254 and was valid UTF-8, but was actually in a Windows codepage?
>

How could I know that? I don't know every string anyone has debugged
until now.

All I am saying is that it is possible for a string in ANSI (with bytes
> 128) to also be a valid UTF-8 string. It is probably not readable text
in both ANSI and UTF-8, but it could be, say, the output of a random
password generator.
Anyway, it does not matter much.
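
To illustrate (bytes chosen purely for illustration): the same two bytes
form a legal string under both readings, with entirely different content.

const
  Bytes: string = #$C3#$A9;
  // read as Windows-1252: two chars, 'Ã' + '©'
  // read as UTF-8:        one char,  'é' (U+00E9)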


What matters more:
- verify what else may be affected, and ensure it does not break
- decide what to do with non-UTF-8 strings:
* assume the current codepage, and convert (probably not a good idea)
* display all chars >= 128 as #nnn
* assume it is still UTF-8, and only display the broken chars as #nnn
(see the sketch below)
- add tests (so it can be verified against a wide range of GDB versions)
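
A minimal sketch of that last option in Free Pascal, assuming a
simplified validity check (it does not reject overlong encodings or
surrogates); the names Utf8SeqLen and DisplayMaybeUtf8 are made up for
illustration, this is not the actual debugger code:

program Utf8Display;
{$mode objfpc}{$H+}
uses
  SysUtils;

function Utf8SeqLen(b: Byte): Integer;
begin
  // Sequence length implied by the lead byte; 0 for continuation
  // bytes and invalid leads ($C0, $C1, $F5..$FF).
  case b of
    $00..$7F: Result := 1;
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 0;
  end;
end;

function DisplayMaybeUtf8(const S: string): string;
var
  i, len, j: Integer;
  Ok: Boolean;
begin
  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    len := Utf8SeqLen(Byte(S[i]));
    Ok := (len > 0) and (i + len - 1 <= Length(S));
    if Ok then
      for j := i + 1 to i + len - 1 do
        if (Byte(S[j]) and $C0) <> $80 then // must be 10xxxxxx
          Ok := False;
    if Ok then
    begin
      Result := Result + Copy(S, i, len); // keep the valid sequence
      Inc(i, len);
    end
    else
    begin
      Result := Result + '#' + IntToStr(Byte(S[i])); // broken byte
      Inc(i);
    end;
  end;
end;

begin
  // #195#169 is valid UTF-8 ('é'); #200 followed by 'x' is broken.
  WriteLn(DisplayMaybeUtf8('ok '#195#169' bad '#200'x'));
  // prints: ok é bad #200x
end.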





