The point about attention not being a bottleneck is true for training compute, but for inference, the memory bandwidth required to stream the KV cache is the main constraint. That seems to be what actually drives the unit economics and API pricing for serving these models.
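A rough back-of-the-envelope sketch of why, with illustrative numbers (roughly a Llama-2-7B-shaped model, not any particular provider's setup):

```python
# KV cache size per token (assumed, illustrative parameters).
num_layers = 32        # transformer blocks
num_kv_heads = 32      # KV heads (no GQA/MQA in this example)
head_dim = 128         # dimension per head
bytes_per_elem = 2     # fp16/bf16

# Each decoded token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")  # ~512 KiB

# At a 4096-token context, generating each new token means reading the
# entire cache from HBM every step -- that's a bandwidth bill, not FLOPs.
context_len = 4096
cache_gb = kv_bytes_per_token * context_len / 1e9
print(f"KV cache at {context_len} tokens: {cache_gb:.1f} GB")  # ~2.1 GB
```

So every decode step re-reads gigabytes per sequence from memory, which is why things like GQA, MQA, and paged/quantized KV caches exist: they shrink that per-token footprint rather than reduce FLOPs.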