[Lazarus] Splitting a sentence into words??!
Lukasz Sokol
el.es.cr at gmail.com
Wed Jun 27 16:51:43 CEST 2012
On 27/06/2012 15:26, Reinier Olislagers wrote:
> Splitting a string with a sentence into words shouldn't be hard, should it?
>
> I've adapted the UTF8 examples on the wiki article to this: [1]
>
> I'm assuming this must have been done countless times before.
> 1) Would somebody know a good existing solution I can use? There may be
> one, e.g. in the syntax highlighter, but I'm a bit lazy/afraid to start looking ;)
> 2) There are probably things I've missed or done incorrectly below. For
> my edification, hints and tips are more than welcome.
>
> Thanks,
> Reinier
>
> [1]
> uses
> ...
> LazUTF8
> ...
> procedure Sentence2Words(const Sentence: UTF8String; Words: TStringList);
> // Splits words on spaces and lower ASCII characters including #10,#13
> // into strings
> // It will collapse/ignore multiple spaces
> // Expects UTF8 sentences; generates UTF8 words
> var
> p: PChar;
> CharacterIndex: integer;
> CharLen: integer;
> FirstCharacter: integer;
> FoundSpace: boolean;
> begin
> Words.Clear;
> // Indicate we start on character 1:
> CharacterIndex:=1;
> FirstCharacter:=1;
>
> FoundSpace:=false;
> p:=PChar(Sentence);
> repeat
> CharLen := UTF8CharacterLength(p);
> //todo: find out other UTF8 word delimiters!
> case Ord(P[0]) of
> 0..32:
> // Skip double spaces...
> if not(FoundSpace) then
> begin
> // Add what we have up to now
> FoundSpace:=true;
> Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex-FirstCharacter));
> FirstCharacter:=CharacterIndex+1; //skip this space character
> end;
> else FoundSpace:=false;
> end;
> inc(p,CharLen); //Take # bytes into account
> CharacterIndex:=CharacterIndex+1;
> until (CharLen=0) or (p^ = #0);
> // Last word if it didn't end with a space:
> if not(FoundSpace) then
> Words.Add(UTF8Copy(Sentence,FirstCharacter,CharacterIndex));
> end;
>
TStringList.Delimiter (with careful use of .StrictDelimiter) is your friend :)
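A minimal sketch of what that could look like, assuming Free Pascal in objfpc mode. The helper name SplitOnSpaces is made up for illustration and is not from the original post. It splits on the space character only, so #10/#13 and other control characters would need separate handling (for example, replacing them with spaces first). Splitting on the byte #32 should be safe for UTF-8 text, since that byte never occurs inside a multi-byte sequence.

```pascal
program SplitSentenceDemo;
{$mode objfpc}{$H+}

uses
  Classes;

// Hypothetical helper (not from the original post): splits Sentence on
// space characters and drops the empty entries that leading, trailing or
// repeated spaces produce.
procedure SplitOnSpaces(const Sentence: string; Words: TStrings);
var
  Raw: TStringList;
  i: Integer;
begin
  Words.Clear;
  Raw := TStringList.Create;
  try
    Raw.Delimiter := ' ';
    Raw.StrictDelimiter := True; // split on Delimiter only
    // Assumption: the text contains no #0 characters, so using #0 as the
    // quote character effectively switches quote handling off.
    Raw.QuoteChar := #0;
    Raw.DelimitedText := Sentence;
    // Runs of spaces yield empty strings in strict mode; skip them.
    for i := 0 to Raw.Count - 1 do
      if Raw[i] <> '' then
        Words.Add(Raw[i]);
  finally
    Raw.Free;
  end;
end;

var
  Words: TStringList;
  i: Integer;
begin
  Words := TStringList.Create;
  try
    SplitOnSpaces('  Splitting a   sentence into words ', Words);
    for i := 0 to Words.Count - 1 do
      WriteLn(Words[i]);
  finally
    Words.Free;
  end;
end.
```

The "careful use" part is the reason for the extra filtering loop: with StrictDelimiter set, every single space counts as a separator, so double spaces give empty items that have to be dropped by hand.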
L.