add_special_tokens=True doesn't add eos token at the end of the sequence

#4
by Andriy - opened

full_text = "This is a test sequence"
full_text_encoded = self.tokenizer.tokenize(full_text, add_special_tokens=True, return_tensors="pt")
print(full_text_encoded)

['This', 'Ġis', 'Ġa', 'Ġtest', 'Ġsequence']

Is the eos token supposed to be added manually?

Qwen org

For Qwen2Tokenizer, add_special_tokens indeed does nothing, because there is no single consistent way to implement it.

In pretraining, the sequence is "...<|endoftext|>...<|endoftext|>...." and eos is set to <|endoftext|>.
In finetuning, the sequence follows the ChatML template "<|im_start|>...\n...<|im_end|>\n<|im_start|>...\n...<|im_end|>\n..." and eos is set to <|im_end|>, as many frameworks rely on that to stop generation correctly.

Please use the correct format for your use case.
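
As a rough illustration of the two formats above, here is a minimal sketch that builds the text by hand before tokenizing. The helper names `build_pretraining_text` and `build_chatml_text` are hypothetical, and the `user` role is only an assumed example of what goes in the first "..." slot of the template; the special-token strings are the ones quoted above.

```python
# Special-token strings as quoted in the reply above.
ENDOFTEXT = "<|endoftext|>"
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_pretraining_text(documents):
    """Hypothetical helper: end every raw document with <|endoftext|>."""
    return "".join(doc + ENDOFTEXT for doc in documents)


def build_chatml_text(messages):
    """Hypothetical helper: render (role, content) pairs in the ChatML layout.

    Each turn becomes "<|im_start|>{role}\n{content}<|im_end|>\n".
    """
    return "".join(
        f"{IM_START}{role}\n{content}{IM_END}\n" for role, content in messages
    )


# Pretraining-style sequence: eos is <|endoftext|>.
pretraining_text = build_pretraining_text(["This is a test sequence"])

# Finetuning-style sequence: eos is <|im_end|>. The "user" role is an assumption.
chat_text = build_chatml_text([("user", "This is a test sequence")])

print(pretraining_text)
print(chat_text)
```

The resulting string can then be passed to the tokenizer as usual. This assumes the tokenizer maps the marker strings to their special-token IDs rather than splitting them into ordinary text, which is worth checking on a short input first.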
