[Lazarus] Debugging the unicode RTL

Sun Jan 22 12:17:58 CET 2023

I may have to double check some details, just going from memory....

On 22/01/2023 11:46, Michael Van Canneyt wrote:
> On Sun, 22 Jan 2023, Martin Frb via lazarus wrote:
>> On 12/01/2023 10:42, Michael Van Canneyt via lazarus wrote:
>>>
>>> - Debugging programs works in general quite OK, except for 
>>> displaying of
>>>   some strings. Most notable, the exception message is affected:
>>>   only the first character of the exception message is shown.
>>
>> This is hardcoded, and may be changeable, if debug info is available.
>>
>> With a look at the next paragraph below....
>> If
>>     Exception = class .... end
>>  is available (ideally within the correct unit, so the debugger can 
>> distinguish it from some user declared type), then the debugger could 
>> use this.
>>
>> Ideally it will be in unit system/sysutils.
>
> Sorry, I don't understand at all what you are saying here.

1)
If you look at the current code that gets the message, it is just hardcoded: 
get the pointer to the instance, add a hardcoded offset to reach the 
FMessage field, assume it is an AnsiString, and read it.

But if it could find (the correct) type "Exception" describing the 
class, then it could evaluate:
      Exception( instance_addr_from_reg ).FMessage

And thus get the dwarf info for FMessage which includes the size-of-char.

2)
However, a user can define their own class named "Exception" that has 
nothing in common with the above.
So if LazDebuggerFp could rely on finding this class in unit system or 
sysutils => then that would make it more foolproof.

If the class is not found, the current behaviour can be used as fallback.

3)
Looking up "char" is possible. But there can be a "char" definition in 
each unit.... (On Windows there almost certainly is)
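
The hardcoded read described in 1) can be reproduced from inside a program. This is only a sketch, not the debugger's code: it assumes (as the current debugger code does) that FMessage is the first field after the VMT pointer, and that the program is built against a non-unicode RTL so that FMessage is an AnsiString. The program name and the variable Raw are made up for the illustration.

```pascal
program ExceptionMessageOffset;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  E: Exception;
  Raw: PAnsiString;
begin
  E := Exception.Create('Hello');
  try
    // Assumption: the instance starts with the VMT pointer and FMessage
    // is the first field right after it (offset = pointer size).
    Raw := PAnsiString(PByte(E) + SizeOf(Pointer));
    // Assumption: FMessage is an AnsiString. With a unicode RTL the field
    // holds UTF-16 data and this read would misinterpret it.
    Writeln(Raw^);
  finally
    E.Free;
  end;
end.
```

Evaluating Exception(instance_addr_from_reg).FMessage through the dwarf info would avoid both assumptions, since the field offset and the size-of-char would come from the debug info.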

>
>> NOTE:
>> On Linux, such symbols are referenced cross units. But on Windows 
>> they are repeated in every unit using them. That again makes it 
>> harder to know if it is the correct type (since the user may have its 
>> own type by the name).
>
> ?
If I use TFooList from unit Foo, then a dwarf definition of TFooList is 
expected in unit Foo.

IIRC, on Linux that is the case.
And on Linux any other unit using the type has a reference to that unit.

On Windows: There may be a definition in unit Foo, if the type is 
actually used in unit Foo.
On Windows, every other unit in which the type is used will repeat the 
entire dwarf definition.

This may depend on the linker..... ??

>>>
>>> But for debugging this is not enough: I think the IDE debugger needs 
>>> another mechanism to determine whether a program uses a 
>>> unicode RTL or not. The way this happens in code is {$IF 
>>> SIZEOF(CHAR)=2}
>>
>> I don't think the debugger should have or need such a "global" flag.
>>
>> Currently if it looks at a string (and there is dwarf debug info), 
>> then the dwarf contains the size-of-char: 1 or 2.
>> And the debugger does utf8 or utf16 (the latter may be buggy).
>
> I have meanwhile established that it has some problems: I have the 
> impression that PChar is hardcoded in some places to mean UTF8.
Example? I did not find one further below.

Anyway, what you might have seen is not that the name "PChar" maps to 
utf8, but that the dwarf description of:
       pointer > basic-dwarf-type(char with size =1)
is interpreted as utf8.

For such a type it is not known what the encoding is. (Eventually the 
user should be able to choose it in the watch-properties.)
But in LCL apps, the above is most likely a PAnsiChar to a (sub-)string 
that is utf8 encoded.
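
As a rough sketch of that decision (not the actual FpDebug code): the only input is the size-of-char from the dwarf info, and for size 1 the bytes are simply assumed to be utf8. DecodeTargetChars is a hypothetical helper invented for this example.

```pascal
program CharSizeDecode;
{$mode objfpc}{$H+}

// Hypothetical helper: turn raw character data into a UTF-8 string,
// using only the size-of-char (1 or 2) to pick the interpretation.
function DecodeTargetChars(Data: Pointer; ByteLen: SizeInt;
  CharSize: Integer): UTF8String;
var
  W: UnicodeString;
begin
  Result := '';
  if (Data = nil) or (ByteLen < CharSize) then
    Exit;
  case CharSize of
    1: begin
         // Encoding is unknown for a plain char of size 1; assume utf8.
         SetLength(Result, ByteLen);
         Move(Data^, Result[1], ByteLen);
       end;
    2: begin
         // Size 2 is treated as utf16 and converted for display.
         SetLength(W, ByteLen div 2);
         Move(Data^, W[1], Length(W) * 2);
         Result := UTF8Encode(W);
       end;
  end;
end;

var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'abc';
  U := 'abc';
  Writeln(DecodeTargetChars(Pointer(A), Length(A), 1));
  Writeln(DecodeTargetChars(Pointer(U), Length(U) * 2, 2));
end.
```

For size 1 the result is only correct if the utf8 assumption holds, which is why a user override in the watch-properties would be useful.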

>>
>> Joost even added code to read the hidden encoding field of strings.
>>
>> Though this relies on being able to detect
>>    PChar  <>  String  <>  array of char
>> which is a PITA.
>>
>
....
> But for such low-level things as a debugger, the use of "char" and 
> "pchar" is IMO absolutely forbidden.

For the debugger the name of the type is irrelevant. It looks up the 
dwarf definition.

If the user's code contains a type name FooChar, then the debugger has 
no influence on that.
Equally if the user enters a watch: ^char(xy)^

>>> 2/ Where can I find the code that extracts the message from an 
>>> exception
>>> object ? So I can have a shot at trying to fix the display.
>>
>> unit FpDebugDebugger;
>> procedure TFpDebugDebugger.HandleSoftwareException
>>
>> // Get address of object
>>
>>   if not 
>> FDbgController.DefaultContext.ReadUnsignedInt(FDbgController.CurrentProcess.CallParamDefaultLocation(0),
>>     SizeVal(SizeOf(AnExceptionObjectLocation)), 
>> AnExceptionObjectLocation)
>>   then
>>
>> ....
>>     ExceptionMessage := 
>> ReadAnsiString(AnExceptionObjectLocation+DBGPTRSIZE[FDbgController.CurrentProcess.Mode]);
>>
>>
>> That currently does not yet have the read of the "encoding" field.
>
> From what I know, a unicode string does not have an encoding field. So 
> this
> code will need some changes.

This code (the reading of the encoding field) IIRC only acts if it finds:
     basic-dwarf-type(char with size =1)

So that should be ok.

It should be in the
   unit FpDbgDwarfFreePascal;
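
For reference, the "encoding" field mentioned above is the code page stored with an AnsiString. From inside a program it can be inspected with StringCodePage from the system unit. This small sketch only shows what the field is; it says nothing about how the debugger reads it from the target's memory.

```pascal
program ShowCodePage;
{$mode objfpc}{$H+}
var
  A: AnsiString;
  U8: UTF8String;
begin
  A := 'abc';
  U8 := 'abc';
  // The values printed depend on the platform and the default code page.
  Writeln('AnsiString code page: ', StringCodePage(A));
  Writeln('UTF8String code page: ', StringCodePage(U8));
end.
```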