Latest tokenization_rwkv5.py fails with a UTF-8 decode error

#3
by Cloud-Strife - opened

I get an error when using RWKV with the wikitext dataset on the latest version.
The error occurs in the latest tokenization_rwkv5.py; the previous version does not have this issue.
I debugged the issue and found the position where it fails.
E.g., wikitext contains the following sentence:
Du Fu is best known for his lǜshi
" lǜshi" (with its leading space) is encoded as the bytes b' l\xc7\x9cshi', i.e. [32, 108, 199, 156, 115, 104, 105].
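
The byte values above can be checked directly in Python. This is only a small illustration of the encoding step, not code from tokenization_rwkv5.py; the variable names are my own.

```python
# Encode the example word (with its leading space) as UTF-8.
text = " lǜshi"
encoded = text.encode("utf-8")

# Each element is the integer value of one byte.
byte_values = list(encoded)

print(encoded)
print(byte_values)
```

Note that ǜ (U+01DC) occupies two bytes, 0xC7 (199) and 0x9C (156), while every other character in the word is a single byte.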

Execution then reaches the following code in tokenization_rwkv5.py:
def tokenize(self, text):
    ...
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        while start < end:
            substr = bytes(chars[start:end])
            if substr in self.vocab:
                cur_substr = substr
                break
        sub_tokens.append(cur_substr.decode())

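When the loop reaches the byte \xc7, the entry it finds in the vocab is the single byte b'\xc7', so cur_substr is set to b'\xc7' and cur_substr.decode() is called.
But \xc7 on its own is not valid UTF-8 (it is only the lead byte of a two-byte sequence), so decode() raises an error immediately.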

ǜ is valid UTF-8 as b'\xc7\x9c', but the vocab lookup returns only b'\xc7' when it finds a match.
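
The decode failure can be reproduced in isolation, without the tokenizer. The snippet below is a minimal sketch using only the standard library; the variable names are illustrative.

```python
# b"\xc7" is only the first byte of the two-byte UTF-8 sequence for "ǜ".
lead_byte = b"\xc7"

try:
    lead_byte.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed:", exc)

# The complete two-byte sequence decodes without any problem.
full_char = b"\xc7\x9c".decode("utf-8")
print(full_char)
```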

Can we add a try/except, or is there another way to fix the UTF-8 decode issue?
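
To make the suggestion concrete, here is a rough sketch of what a try/except fallback could look like. It is not the library's code: decode_token, greedy_tokenize, and toy_vocab are hypothetical names, the matching loop is a simplified stand-in for the one quoted above, and falling back to the repr of the bytes is just one possible choice.

```python
def decode_token(token_bytes):
    """Decode a vocab entry; fall back to its bytes repr if it is not valid UTF-8."""
    try:
        return token_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: keep a printable, lossless form such as "b'\\xc7'".
        return repr(token_bytes)


def greedy_tokenize(data, vocab):
    """Greedy longest-match over raw bytes (simplified stand-in for the real loop)."""
    tokens = []
    start = 0
    while start < len(data):
        end = len(data)
        match = None
        while start < end:
            candidate = data[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # shrink the window and try a shorter candidate
        if match is None:
            raise ValueError(f"no vocab entry matches at byte offset {start}")
        tokens.append(decode_token(match))
        start = end
    return tokens


# Toy vocab (an assumption): every single byte plus two multi-byte entries.
toy_vocab = {bytes([b]) for b in range(256)} | {b" l", b"shi"}
tokens = greedy_tokenize(" lǜshi".encode("utf-8"), toy_vocab)
print(tokens)
```

With this toy vocab, the two bytes of ǜ are matched separately and each falls back to its repr instead of raising. Whether a repr string, errors="replace", or keeping the tokens as bytes is the right fallback depends on how the rest of the tokenizer maps tokens back to ids.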

Thanks,
