[Lazarus] String vs WideString

Wed Aug 16 19:29:17 CEST 2017

On 15.08.2017 10:34, Tony Whyman via Lazarus wrote:
> On 14/08/17 17:47, Sven Barth via Lazarus wrote:
>> The main problem of such a dynamic type would be the inability to do
>> fast indexing as the compiler would need to insert runtime checks for
>> the size of a character. I had already thought the same, but then had
>> to discard the idea due to this.
> 
> Is this really a big problem? It is not as if it would be necessary to
> do a table lookup every time you index a string as the indexing method
> could be an attribute of the string and updated with the character
> encoding attribute. Is it really that complicated for the compiler to
> generate code that jumps to an indexing method depending upon a data
> attribute?

In a tight loop where one accesses the string character by character
(take Pos() for example) this will lead to a significant slowdown, as the
compiler (without optimizations) will have to insert a call to the
lookup function for each access. While I generally don't consider
performance degradation a backwards compatibility issue, I do in this
case, due to the significant decrease in performance.

Take this evaluation example:

=== code begin ===

program tperf;

{$mode objfpc}{$H+}

uses
  SysUtils;

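// Stands in for the per-access call the compiler would have to insert
// if the character size were only known at runtime.
function lookup(const aStr: String; aIndex: SizeInt): Char;
begin
  Result := aStr[aIndex];
end;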

var
  str: String;
  starttime, endtime: TDateTime;
  i, j: LongInt;
begin
  SetLength(str, 10000);

  starttime := Now;
  for i := 0 to 10000 do
    for j := 1 to Length(str) do
      if str[j] <> '' then ;
  endtime := Now;

  Writeln('Direct: ', FormatDateTime('hh:nn:ss.zzz', endtime - starttime));

  starttime := Now;
  for i := 0 to 10000 do
    for j := 1 to Length(str) do
      if lookup(str, j) <> '' then ;
  endtime := Now;

  Writeln('Lookup: ', FormatDateTime('hh:nn:ss.zzz', endtime - starttime));
end.

=== code end ===

=== output begin ===

Direct: 00:00:01.766
Lookup: 00:00:02.061

=== output end ===

While this example is of course artificial, it nevertheless shows the
slowdown: the lookup variant takes roughly 17% longer here.
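
To make the cost more concrete, here is a rough sketch of what indexing a
string with a runtime character size could look like. This is only an
illustration under my own assumptions: TDynString, its fields and CharAt
are hypothetical names and not part of FPC or any proposal. The point is
merely that every single access has to branch on the size attribute
before it can compute the address.

=== code begin ===

program tdispatch;

{$mode objfpc}{$H+}

type
  // Hypothetical string type with a runtime character size (not an FPC type)
  TDynString = record
    Data: array of Byte;  // raw code units
    CharSize: SizeInt;    // bytes per code unit: 1, 2 or 4
  end;

// Returns the code unit at the 1-based index aIndex.
// The case statement is the runtime check paid on every access.
function CharAt(const aStr: TDynString; aIndex: SizeInt): LongWord;
begin
  case aStr.CharSize of
    1: Result := aStr.Data[aIndex - 1];
    2: Result := PWord(@aStr.Data[(aIndex - 1) * 2])^;
    4: Result := PLongWord(@aStr.Data[(aIndex - 1) * 4])^;
  else
    Result := 0;  // unknown size; a real design would need error handling
  end;
end;

var
  s: TDynString;
  i: SizeInt;
begin
  // Build a three character string with 2-byte code units
  s.CharSize := 2;
  SetLength(s.Data, 3 * s.CharSize);
  for i := 0 to 2 do
    PWord(@s.Data[i * 2])^ := Word(Ord('a') + i);

  // Every iteration goes through the size check in CharAt
  for i := 1 to 3 do
    Write(Chr(Byte(CharAt(s, i))));
  Writeln;
end.

=== code end ===

With a fixed character size the compiler can turn str[j] into a plain
address calculation; with the sketch above it cannot.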

> Is your problem really more about the result type as, depending on the
> character width, the result could be an AnsiChar or WideChar or a UTF8
> character for which I don't believe there is a defined char type (other
> than an arguable mis-use of UCS4Char)?

That is indeed also a problem. I might not have had that one in mind
with my mail above, but I did back then when I had brainstormed this.

> I can accept that a clear up of this area would also have to extend to
> the char types as well - but I would also argue that that is well
> overdue. On a quick count, I found 7 different char types in the system
> unit.

And most important of all: any solution that is developed *MUST* be
backwards compatible, which means that at the very least the type
aliases would remain anyway.

Regards,
Sven