[Lazarus] Losing data when saving Database fields with "Size" defined and UTF8 chars

Hans-Peter Diettrich DrDiettrich1 at aol.com
Tue Jul 16 17:41:20 CEST 2013


Graeme Geldenhuys wrote:
> On 2013-07-15 17:43, Hans-Peter Diettrich wrote:
>> Another workaround: use the appropriate codepage for storing strings in 
>> the database, so that all characters are single bytes.
> 
> You should know by now that not all characters can be represented in a
> single byte.

I know that; the question is whether the user and the DB understand it, too.

> Also Unicode was developed to overcome the many code-page
> issues and standardise text storage. So I think it would be silly using
> anything other than one of the Unicode encodings (I would opt for UTF-8
> or UTF-16) in this day and age.

That depends on the DB/SQL operations required. Sizing, searching and 
sorting of strings are fastest with an SBCS of a specific encoding; 
Unicode requires much more code and computing power. Even then I wonder 
how strings from different languages will be sorted together; most 
probably a "raw" sort (by code points) is the only solution.

> The missing link is fixing SqlDB or Zeos etc, as it seems that the
> database servers themselves already support UTF-8 and UTF-16 text
> fields. [at least Firebird does]
> 
> My solution (work-around) for now is to not specify a charset for
> Firebird and always store text as UTF-8. My fields are defined as
> follows [size in bytes]: <desired size in characters> * 1.5
> My business objects are coded to notify the user interface to only allow
> <desired size in characters> text input. 6 years on, and I haven't had a
> single client complain that text was truncated [maybe a little bit of
> luck has something to do with it too]. I guess I must also add that our
> products are mainly geared towards English, Afrikaans and Portuguese. So
> large quantities of multi-byte characters are at a minimum.

You see that Unicode introduces new problems. Even in UTF-32 the element 
count does not always equal the character count. Your calculation is 
obviously based on Latin characters, while Unicode supports many more 
character sets and code pages. In your case I'd prefer an SBCS covering 
all of your languages, which is certainly feasible, even if it may not be 
a registered ISO/ANSI codepage.
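
To make the distinction between bytes, code points and user-perceived
characters concrete, here is a minimal sketch (plain FPC, source saved as
UTF-8; the helper is a throwaway that simply counts UTF-8 lead bytes, much
like UTF8Length from the LazUTF8 unit does):

program countchars;
{$mode objfpc}{$H+}

{ Count UTF-8 code points by skipping continuation bytes (10xxxxxx). }
function Utf8CodePointCount(const s: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: string;
begin
  s := 'Olá señor';
  WriteLn('bytes:       ', Length(s));              // 11
  WriteLn('code points: ', Utf8CodePointCount(s));  // 9
  { 'e' followed by U+0301 (combining acute): three bytes, two code points,
    two elements even in UTF-32, but only one visible character. }
  s := 'e' + #$CC#$81;
  WriteLn('bytes:       ', Length(s));              // 3
  WriteLn('code points: ', Utf8CodePointCount(s));  // 2
end.

With both counts at hand, a value can be checked against the byte size of
the DB column and against the character limit shown to the user
independently.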

DoDi




