Model doesn't seem to tokenize new lines in chat template?

#84
by bartowski - opened

Noticed when using transformers to extract a chat template that it prints without any newlines, replacing them with spaces instead. Any idea why?

Only seems to apply when using the built-in tags, so it seems to tokenize <|system|>\n as just '<|system|>'.

If I misspell the tag and make it <|systemA|>\n, it tokenizes as '<|systemA|>\n' properly.

bartowski changed discussion title from Model doesn't seem to tokenize new lines in system prompt? to Model doesn't seem to tokenize new lines in chat template?

Actually, I just noticed why: <|user|>, <|end|>, <|system|>, and <|assistant|> all have rstrip = true in tokenizer_config.json. If I take that out, the newlines come through properly. Which is correct? The chat template seems to imply there should be newlines, as do the examples on your card.
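To make the effect concrete, here is a rough pure-Python sketch of what an rstrip flag on a special token does to the text around it. This is only a simulation under my own assumptions, not the actual transformers/tokenizers code: `split_with_rstrip` and `SPECIAL_TOKENS` are hypothetical names made up for illustration, and the real tokenizer may differ in details.

```python
import re

# Hypothetical stand-in for the special tokens named in this thread.
SPECIAL_TOKENS = ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>"]


def split_with_rstrip(text, special_tokens, rstrip=True):
    """Split text into special-token pieces and plain-text pieces.

    With rstrip=True, whitespace immediately to the right of a special
    token is absorbed into the match and dropped. This mimics the effect
    described in the thread; it is not the library's implementation.
    """
    alternation = "|".join(re.escape(tok) for tok in special_tokens)
    suffix = r"\s*" if rstrip else ""
    pattern = re.compile(f"({alternation}){suffix}")

    pieces = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])  # plain text before the token
        pieces.append(match.group(1))  # the token itself, without trailing whitespace
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])  # trailing plain text
    return pieces


prompt = "<|system|>\nYou are a helpful assistant.<|end|>\n<|user|>\nHi<|end|>\n"

# Newlines after the registered tags disappear when rstrip is on.
stripped = "".join(split_with_rstrip(prompt, SPECIAL_TOKENS, rstrip=True))

# With rstrip off, the text round-trips unchanged.
kept = "".join(split_with_rstrip(prompt, SPECIAL_TOKENS, rstrip=False))

# A misspelled tag is not a registered token, so its newline survives.
typo = "".join(split_with_rstrip("<|systemA|>\nHi", SPECIAL_TOKENS, rstrip=True))
```

In this sketch, `stripped` loses every newline that directly follows a registered tag, while `typo` keeps its newline because `<|systemA|>` never matches, which lines up with the behavior I saw with the misspelled tag.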

Microsoft org

I agree that rstrip=true causes many odd issues with tokenization, and directly conflicts with the chat_template/examples. Would love to see one or the other changed for consistency!
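For anyone who wants to check which tokens carry the flag, a minimal sketch follows. It assumes the per-token flags sit under an `added_tokens_decoder` key in tokenizer_config.json, which is the layout I have seen in recent tokenizer configs but may not hold for every version. The helper name `tokens_with_rstrip` and the sample excerpt (including its IDs) are made up for illustration and are not copied from this repo.

```python
import json


def tokens_with_rstrip(config):
    """Return the contents of added tokens whose rstrip flag is true."""
    # Assumption: flags live under "added_tokens_decoder", keyed by token ID.
    decoder = config.get("added_tokens_decoder", {})
    return [entry["content"] for entry in decoder.values() if entry.get("rstrip")]


# Made-up excerpt in the assumed layout; IDs and flag values are placeholders.
sample = json.loads("""
{
  "added_tokens_decoder": {
    "1": {"content": "<|system|>", "rstrip": true},
    "2": {"content": "<|end|>", "rstrip": true},
    "3": {"content": "<unk>", "rstrip": false}
  }
}
""")

flagged = tokens_with_rstrip(sample)
```

To run it against a real file, replace the inline sample with `json.load(open("tokenizer_config.json"))` from a local copy of the model files.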

Is there any way this can be escalated, @hanori?

I have a similar issue.

When I feed a text block that contains newlines into the Phi-3 tokenizer, the newlines are removed after decoding. Here is an example of the text I am working with:
Input text to the tokenizer:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|> 

After tokenizer.decode I got this:

<|system|>
You are a helpful assistant.<|end|><|user|>
How to explain Internet for a medieval knight?<|end|><|assistant|> 

Can you help me with this issue? Will it affect the model's performance if I proceed like this?

Can you help us with this, please, @hanori @gugarosa?
