Jaward posted an update 4 days ago
nanoGPT with Sigmoid Self-Attention
I couldn’t resist, had to give it a try :)

Some observations from training on an M2:
Compared to softmax attention, SSA trained ~5-10% faster with similar final loss values and lower memory usage, but produced slightly less coherent text and marginally higher perplexity.

Code: https://github.com/Jaykef/ai-algorithms/blob/main/sigmoid_attn.ipynb
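For anyone curious what the swap looks like, here's a minimal sketch of a causal self-attention head with the softmax replaced by an elementwise sigmoid (PyTorch; the module name `SigmoidSelfAttention` and the `-log(T)` bias are illustrative, following the sigmoid-attention literature, and may not match the notebook exactly):

```python
# Minimal sketch: causal self-attention with an elementwise sigmoid
# in place of softmax. Hypothetical module/parameter names; the -log(T)
# bias follows the sigmoid-attention papers and may differ from the notebook.
import math
import torch
import torch.nn as nn

class SigmoidSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        # causal mask so each position only attends to itself and earlier positions
        mask = torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        # reshape to (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        # sigmoid instead of softmax: each weight is computed independently,
        # with a -log(T) bias so the total row "mass" stays roughly O(1)
        att = torch.sigmoid(att - math.log(T))
        # zero out future positions (no row renormalization to redo afterwards)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, 0.0)

        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```

The key design difference is that sigmoid weights aren't normalized across the row the way softmax weights are, which is why the causal mask is applied by zeroing entries and why a bias like -log(T) helps keep the output scale comparable to softmax attention.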