[Lazarus] Splitting a sentence into words??!

Reinier Olislagers reinierolislagers at gmail.com
Wed Jun 27 16:26:34 CEST 2012


Splitting a string with a sentence into words shouldn't be hard, should it?

I've adapted the UTF8 examples on the wiki article to this: [1]

I'm assuming this must have been done countless times before.
1) Does anybody know of a good existing solution that I can use? There is
probably one somewhere, e.g. in the syntax highlighter, but I'm a bit
lazy/afraid to start looking ;)
2) There are probably things I've missed or done incorrectly below. For
my edification, hints and tips are more than welcome.

Thanks,
Reinier

[1]
uses
...
LazUTF8
...
procedure Sentence2Words(const Sentence: UTF8String; Words: TStringList);
// Splits a sentence into words on spaces and lower ASCII characters
// (#0..#32, including #10 and #13)
// It will collapse/ignore multiple spaces
// Expects UTF8 sentences; generates UTF8 words
var
  p: PChar;
  CharacterIndex: integer;
  CharLen: integer;
  FirstCharacter: integer;
  FoundSpace: boolean;
begin
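  Words.Clear;
  // An empty sentence has no words; exiting here also keeps the loop
  // below from stepping past the terminating #0
  if Sentence='' then exit;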
  // Indicate we start on character 1:
  CharacterIndex:=1;
  FirstCharacter:=1;

  FoundSpace:=false;
  p:=PChar(Sentence);
  repeat
    CharLen := UTF8CharacterLength(p);
    //todo: find out other UTF8 word delimiters!
    case Ord(P[0]) of
      0..32:
        // Skip double spaces...
        if not(FoundSpace) then
        begin
          // Add what we have up to now, unless it is empty
          // (the sentence started with a delimiter)
          FoundSpace:=true;
          if CharacterIndex>FirstCharacter then
            Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
          FirstCharacter:=CharacterIndex+1; //skip this space character
        end;
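      else
      begin
        // The first character after a run of delimiters starts a new word;
        // without this, "a  b" would yield " b" as the second word
        if FoundSpace then FirstCharacter:=CharacterIndex;
        FoundSpace:=false;
      end;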
    end;
    inc(p,CharLen); //Take # bytes into account
    CharacterIndex:=CharacterIndex+1;
  until (CharLen=0) or (p^ = #0);
  // Last word if it didn't end with a space:
  if not(FoundSpace) then
    Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
end;
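
For comparison, here is a minimal sketch of a byte-wise variant. It rests on one assumption: the only delimiters wanted are the ASCII bytes #1..#32. Every byte of a multi-byte UTF-8 sequence is #128 or higher, so in that case the string can be scanned byte by byte without decoding characters. SplitOnAsciiWhitespace and SplitDemo are hypothetical names used for illustration; they are not part of LazUTF8 or the LCL. The sketch does not handle other Unicode delimiters such as the no-break space.

```pascal
program SplitDemo;
{$mode objfpc}{$H+}

uses
  Classes;

// Hypothetical helper (not from LazUTF8): splits on bytes #1..#32.
// Assumes UTF-8 input; bytes of multi-byte sequences are all >= #128,
// so they can never be mistaken for one of these delimiters.
procedure SplitOnAsciiWhitespace(const Sentence: string; Words: TStrings);
var
  i, WordStart: integer;
begin
  Words.Clear;
  WordStart := 0; // 0 means "currently not inside a word"
  for i := 1 to Length(Sentence) do
    if Sentence[i] <= #32 then
    begin
      // Delimiter: close the current word, if there is one
      if WordStart > 0 then
        Words.Add(Copy(Sentence, WordStart, i - WordStart));
      WordStart := 0;
    end
    else if WordStart = 0 then
      WordStart := i; // first byte of a new word
  // Last word if the sentence did not end with a delimiter
  if WordStart > 0 then
    Words.Add(Copy(Sentence, WordStart, Length(Sentence) - WordStart + 1));
end;

var
  Words: TStringList;
  i: integer;
begin
  Words := TStringList.Create;
  try
    // Leading, repeated and trailing delimiters, plus a CR/LF pair
    SplitOnAsciiWhitespace('  Splitting   a sentence'#13#10'into wörds ', Words);
    for i := 0 to Words.Count - 1 do
      WriteLn(Words[i]);
  finally
    Words.Free;
  end;
end.
```

Run on the sample sentence, this should print five words, one per line, with no empty entries for the leading, repeated or trailing delimiters. Because it works on byte offsets, it avoids the repeated character counting that UTF8Copy has to do for each word.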



