Bug in token display for complex scripts

#5
by santhosh - opened

The color-based display of tokens has a bug for complex scripts (Indic scripts, Arabic, etc.).
For example,

I analysed the text "സാങ്കേതികമായി നന്ദി വിജയിച്ചെങ്കിലും സാമ്പത്തിക കാരണങ്ങളാൽ പദ്ധതി നിർത്തിവെച്ചു." using the Gemma model. The content is in Malayalam, an Indic language.

image.png

As you can see, the number of tokens rendered does not match the actual number of tokens. Here is a simpler example that makes this easier to understand.

The input text is "സേ"

image.png

The two tokens are "സ" and "േ", but together they combine into a single rendered unit, "സേ". Probably because of a bug in the color highlighting tool, only one color is used, so both tokens are shown in green.
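To make the mismatch concrete, here is a minimal Python sketch using only the standard library. It shows that "സേ" is two code points, the second of which is a combining mark that attaches to the first. The helper `group_marks` is a hypothetical name introduced for illustration, and it is a simplification: it only attaches combining marks to the preceding character and does not implement full Unicode grapheme segmentation (UAX #29). It says nothing about how the highlighting tool itself works.

```python
import unicodedata


def group_marks(text):
    """Attach each combining mark (categories Mn, Mc, Me) to the character before it.

    Simplified illustration only; not full UAX #29 grapheme segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # mark joins the previous base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


text = "\u0d38\u0d47"  # "സേ": MALAYALAM LETTER SA + MALAYALAM VOWEL SIGN EE

code_points = [f"U+{ord(ch):04X}" for ch in text]
categories = [unicodedata.category(ch) for ch in text]
clusters = group_marks(text)

print(code_points)    # two separate code points
print(categories)     # letter, then spacing combining mark
print(len(clusters))  # number of units after grouping marks
```

Under these assumptions, the tokenizer can split between the two code points, while a renderer draws them as one unit, so per-token coloring has nothing separate to paint for the second token.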

Owner

Thanks for the report! This should be fixed by this commit. PR should be merged soon.

image.png

image.png

Owner

Sorry for the delay - I forgot to respond with the update! It's fixed now :)
image.png

Xenova changed discussion status to closed
