[Lazarus] Losing data when saving Database fields with "Size" defined and UTF8 chars
Hans-Peter Diettrich
DrDiettrich1 at aol.com
Tue Jul 16 17:41:20 CEST 2013
Graeme Geldenhuys wrote:
> On 2013-07-15 17:43, Hans-Peter Diettrich wrote:
>> Another workaround: use the appropriate codepage for storing strings in
>> the database, so that all characters are single bytes.
>
> You should know by now that not all characters can be represented in a
> single byte.
I know that; the question is whether the user and the DB understand that, too.
> Also Unicode was developed to overcome the many code-page
> issues and standardise text storage. So I think it would be silly using
> anything other than one of the Unicode encodings (I would opt for UTF-8
> or UTF-16) in this day and age.
That depends on the requested DB/SQL operations. Sizing, searching and
sorting of strings are fastest with an SBCS (single-byte character set)
of a specific encoding; Unicode requires much more code and computation
power. Even then I wonder how strings of different languages would be
sorted together; most probably a "raw" sort (by code points) is the
only solution.
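
To illustrate what a "raw" sort by code points looks like, here is a minimal Python sketch. The word list is invented for illustration and does not come from the thread; Python's default string comparison happens to compare code point by code point, which is the assumption this example relies on.

```python
# Illustrative words only (not from the thread).
# "\u00c4pfel" is "Apfel" with A-umlaut, "\u00e9clair" starts with e-acute.
words = ["zebra", "Zulu", "apple", "\u00c4pfel", "\u00e9clair", "Eagle"]

# Python compares str values code point by code point, so sorted()
# without a key behaves like the "raw" sort described above.
raw = sorted(words)

# Print the first code point of each word to make the ordering visible.
for w in raw:
    print(f"U+{ord(w[0]):04X}  {w}")
```

In this ordering all ASCII uppercase letters come before all ASCII lowercase letters, and the accented words land after both, which is rarely what a reader of any single language would expect.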
> The missing link is fixing SqlDB or Zeos etc, as it seems that the
> database servers themselves already support UTF-8 and UTF-16 text
> fields. [at least Firebird does]
>
> My solution (work-around) for now is to not specify a charset for
> Firebird and always store text as UTF-8. My fields are defined as
> follows [in bytes size]: <desired size in characters> * 1.5
> My business objects are coded to notify the user interface to only allow
> <desired size in characters> text input. 6 years on, and I haven't had a
> single client complain that text was truncated [maybe a little bit of
> luck has something to do with it too]. I guess I must also add that our
> products are mainly geared towards English, Afrikaans and Portuguese. So
> large quantities of multi-byte characters are at a minimum.
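
The sizing rule quoted above (field size in bytes = desired characters * 1.5) can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `utf8_fits`, the 10-character field and the sample strings are hypothetical and are not taken from SqlDB, Zeos or Firebird.

```python
def utf8_fits(text: str, max_chars: int, factor: float = 1.5) -> bool:
    """Return True if `text` respects the character limit AND its UTF-8
    encoding fits a field sized as max_chars * factor bytes.
    (Hypothetical helper; the 1.5 factor is the rule quoted above.)"""
    byte_budget = int(max_chars * factor)
    return len(text) <= max_chars and len(text.encode("utf-8")) <= byte_budget


FIELD_CHARS = 10  # assumed field size in characters, for illustration

samples = [
    "Pretoria",                                         # plain ASCII
    "S\u00e3o Paulo",                                   # one 2-byte character
    "\u041c\u043e\u0441\u043a\u0432\u0430, \u0420\u0424",  # mostly Cyrillic
]

for s in samples:
    n_bytes = len(s.encode("utf-8"))
    print(s, len(s), n_bytes, utf8_fits(s, FIELD_CHARS))
```

The mostly-Latin samples stay within the byte budget, while the Cyrillic sample has exactly 10 characters yet needs more bytes than 10 * 1.5 allows, so it would be truncated.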
You see that Unicode introduces new problems. Even in UTF-32 the element
count does not always equal the character count. Your calculation
obviously is based on Latin characters, while Unicode supports many more
character sets or codepages. In your case I'd prefer an SBCS supporting
all your languages, which is certainly feasible, even if it may not be a
registered ISO/ANSI codepage.
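
Both points in the paragraph above can be checked with a short Python sketch. The first part shows a combining sequence where the UTF-32 element count is 2 although a reader sees one character. The second part tests whether a string survives a single-byte codepage; `cp1252` and the helper name `fits_codepage` are assumptions chosen for illustration, and nothing here establishes that cp1252 covers every character a given application needs.

```python
import unicodedata

composed = "\u00e9"     # e-acute as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Both render as one character, but the element counts differ.
print(len(composed), len(decomposed))
# UTF-32 uses 4 bytes per code point, so this is the UTF-32 element count.
print(len(decomposed.encode("utf-32-le")) // 4)
# NFC normalization merges this particular pair into one code point.
print(unicodedata.normalize("NFC", decomposed) == composed)


def fits_codepage(text: str, codepage: str = "cp1252") -> bool:
    """Return True if every character of `text` exists in `codepage`.
    (Hypothetical helper; cp1252 is only an example codepage.)"""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


for s in ["S\u00e3o Jo\u00e3o", "m\u00f4re", "\u041c\u043e\u0441\u043a\u0432\u0430"]:
    print(s, fits_codepage(s))
```

A check like this could run before an insert, so that text outside the chosen codepage is rejected up front instead of being stored with lost characters.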
DoDi