[Lazarus] Splitting a sentence into words??!
Reinier Olislagers
reinierolislagers at gmail.com
Wed Jun 27 16:26:34 CEST 2012
Splitting a string with a sentence into words shouldn't be hard, should it?
I've adapted the UTF8 examples on the wiki article to this: [1]
I'm assuming this must have been done countless times before.
1) Does anybody know of a good existing solution that I can use? There is
probably one somewhere, e.g. in the syntax highlighter code, but I'm a bit
lazy/afraid to start looking ;)
2) There are probably things I've missed or done incorrectly below. For
my edification, hints and tips are more than welcome.
Thanks,
Reinier
[1]
uses
...
LazUTF8
...
procedure Sentence2Words(const Sentence: UTF8String; Words: TStringList);
// Splits a sentence into words on spaces and other low ASCII characters
// (#1..#32, including #10 and #13); scanning stops at the first #0.
// Runs of delimiters are collapsed, so no empty words are added.
// Expects UTF8 sentences; generates UTF8 words
var
  p: PChar;
  CharacterIndex: integer; // 1-based index, in characters, of the character at p
  CharLen: integer;        // size in bytes of the character at p
  FirstCharacter: integer; // 1-based index of the first character of the current word
  InWord: boolean;         // true while we are inside a word
begin
  Words.Clear;
  // Indicate we start on character 1:
  CharacterIndex:=1;
  FirstCharacter:=1;
  InWord:=false;
  p:=PChar(Sentence);
  // An empty sentence points straight at the terminating #0, so the loop is skipped
  while p^<>#0 do
  begin
    CharLen:=UTF8CharacterLength(p);
    if CharLen=0 then break;
    //todo: find out other UTF8 word delimiters!
    case Ord(p^) of
      0..32:
        begin
          // Delimiter: add the word collected so far, if any
          if InWord then
            Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
          InWord:=false;
        end;
      else
        // First character after one or more delimiters starts a new word
        if not InWord then
        begin
          FirstCharacter:=CharacterIndex;
          InWord:=true;
        end;
    end;
    inc(p,CharLen); //Take # bytes into account
    inc(CharacterIndex);
  end;
  // Last word if the sentence didn't end with a delimiter:
  if InWord then
    Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
end;