[Lazarus] Splitting a sentence into words??!

Lukasz Sokol el.es.cr at gmail.com
Wed Jun 27 16:51:43 CEST 2012


On 27/06/2012 15:26, Reinier Olislagers wrote:
> Splitting a string with a sentence into words shouldn't be hard, should it?
> 
> I've adapted the UTF8 examples on the wiki article to this: [1]
> 
> I'm assuming this must have been done countless times before.
> 1) Would somebody know a good solution which I can use? E.g. in the
> syntaxhighlighter etc, but I'm a bit lazy/afraid to start looking ;)
> 2) There are probably things I've missed or done incorrectly below. For
> my edification, hints and tips are more than welcome.
> 
> Thanks,
> Reinier
> 
> [1]
> uses
> ...
> LazUTF8
> ...
> procedure Sentence2Words(const Sentence: UTF8String; Words: TStringList);
> // Splits words on spaces and lower ASCII characters including #10,#13 into strings
> // It will collapse/ignore multiple spaces
> // Expects UTF8 sentences; generates UTF8 words
> var
>   p: PChar;
>   CharacterIndex: integer;
>   CharLen: integer;
>   FirstCharacter: integer;
>   FoundSpace: boolean;
> begin
>   Words.Clear;
>   // Indicate we start on character 1:
>   CharacterIndex:=1;
>   FirstCharacter:=1;
> 
>   FoundSpace:=false;
>   p:=PChar(Sentence);
>   repeat
>     CharLen := UTF8CharacterLength(p);
>     //todo: find out other UTF8 word delimiters!
>     case Ord(P[0]) of
>       0..32:
>         // Skip double spaces...
>         if not(FoundSpace) then
>         begin
>           // Add what we have up to now
>           FoundSpace:=true;
>           Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
>           FirstCharacter:=CharacterIndex+1; //skip this space character
>         end;
>       else FoundSpace:=false;
>     end;
>     inc(p,CharLen); //Take # bytes into account
>     CharacterIndex:=CharacterIndex+1;
>   until (CharLen=0) or (p^ = #0);
>   // Last word if it didn't end with a space:
>   if not(FoundSpace) then
>     Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex));
> end;
> 

TStringList.Delimiter (with careful use of .StrictDelimiter) is your friend :)
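
A minimal sketch of that idea, assuming Free Pascal in objfpc mode. The
helper name SplitOnSpaces is hypothetical and not part of any library. It
only splits on the space character (tabs and line breaks inside the
sentence are left alone), and it filters out the empty entries that runs of
spaces produce. Setting QuoteChar to #0 is an assumption meant to stop a
word that begins with a double quote from being parsed as a quoted field.
Splitting bytewise on a space is safe for UTF-8 because the byte $20 never
occurs inside a multi-byte sequence.

```pascal
program SplitWordsDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper (not from the original post): splits Sentence on
// spaces and stores the non-empty pieces in Words.
procedure SplitOnSpaces(const Sentence: string; Words: TStrings);
var
  Parts: TStringList;
  i: Integer;
begin
  Words.Clear;
  Parts := TStringList.Create;
  try
    Parts.StrictDelimiter := True; // split on Delimiter only
    Parts.Delimiter := ' ';
    Parts.QuoteChar := #0;         // assumption: avoids special handling of '"'
    // Trim first so the text never starts or ends with a delimiter
    Parts.DelimitedText := Trim(Sentence);
    // Consecutive spaces yield empty entries; skip those
    for i := 0 to Parts.Count - 1 do
      if Parts[i] <> '' then
        Words.Add(Parts[i]);
  finally
    Parts.Free;
  end;
end;

var
  Words: TStringList;
  i: Integer;
begin
  Words := TStringList.Create;
  try
    SplitOnSpaces('  Splitting a   sentence into words ', Words);
    for i := 0 to Words.Count - 1 do
      WriteLn(Words[i]);
  finally
    Words.Free;
  end;
end.
```

If other whitespace should also separate words, one option is to replace
those characters with spaces before assigning DelimitedText.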

L.