GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

(twitter.com)

28 points | by laxmena 2 hours ago

3 comments

cadamsdotcom 1 hour ago

Transformers scale poorly vs. context window size and parameter count.
Which means the numbers look really impressive when those N’s are small!
I’m but a pundit in this area so don’t know much. But one wonders if there’s a future in burning larger models to FPGAs - whether big enough FPGAs exist (or can be built), and whether locating specialized compute right with the memory it needs can speed things up.
Likely would need a lot of algorithm parallelism work that’d translate back to CPUs/GPUs.
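
A rough sketch of the scaling claim above, assuming a decoder-only transformer with a KV cache and the common rule of thumb of about 2 FLOPs per parameter per generated token. The function names and the example sizes are illustrative assumptions, not taken from GateGPT.

```python
def weight_flops_per_token(n_params: int) -> int:
    """Approximate FLOPs spent on weight matmuls per token (about 2 per parameter)."""
    return 2 * n_params


def attention_flops_per_token(context_len: int, d_model: int, n_layers: int) -> int:
    """Approximate attention FLOPs for one new token when a KV cache is used.

    Each layer scores the new query against every cached key and then takes a
    weighted sum over every cached value: about 2 * context_len * d_model
    FLOPs for each of those two steps.
    """
    return n_layers * 2 * (2 * context_len * d_model)


def attention_flops_total(context_len: int, d_model: int, n_layers: int) -> int:
    """Sum the per-token attention cost over positions 1..context_len."""
    return sum(
        attention_flops_per_token(t, d_model, n_layers)
        for t in range(1, context_len + 1)
    )


if __name__ == "__main__":
    # Hypothetical tiny model: d_model=8, one layer.
    for n in (16, 32, 64):
        print(n, attention_flops_per_token(n, 8, 1), attention_flops_total(n, 8, 1))
```

Under these assumptions, doubling the context doubles the attention work per new token and roughly quadruples the work for a full window, while the weight cost per token grows linearly with parameter count.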

- T-A 49 minutes ago
  
  Related: https://www.spheron.network/blog/etched-ai-sohu-vs-nvidia-tr...
genxy 1 hour ago

The context window is 16 characters. Talking about tokens per second is meaningless.

- dominotw 1 hour ago
  
  It's not meaningless. There could be use cases like spell correction.
  
  - genxy 37 minutes ago
    
    It is only interesting as an academic exercise in EDA design, just like microGPT. Advertising perf for something with n^2 complexity is clickbait.
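
For scale, the headline figures can be turned into a per-token cycle budget. This is a back-of-the-envelope sketch: the clock and throughput come from the post title, and treating the 16-character context as 16 token positions is an assumption.

```python
CLOCK_HZ = 80_000_000     # 80 MHz, from the post title
TOKENS_PER_SEC = 56_000   # "56k tokens per second", from the post title
CONTEXT_LEN = 16          # context window cited in the thread (assumed 1 token per character)

# Clock cycles available to produce each token.
cycles_per_token = CLOCK_HZ / TOKENS_PER_SEC

# Time to generate one full context window at the advertised rate.
seconds_per_window = CONTEXT_LEN / TOKENS_PER_SEC

# With a KV cache, the token at position t attends to t positions,
# so a full window needs 1 + 2 + ... + CONTEXT_LEN query-key pairs per head.
attention_pairs = CONTEXT_LEN * (CONTEXT_LEN + 1) // 2

if __name__ == "__main__":
    print(f"{cycles_per_token:.1f} cycles/token")
    print(f"{seconds_per_window * 1e6:.1f} microseconds per window")
    print(f"{attention_pairs} query-key pairs per window")
```

That is roughly 1,430 cycles per token, and at a 16-position window the quadratic term is only 136 query-key pairs, so the throughput figure says little about how the design would behave at longer contexts.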
amelius 2 hours ago

See also:
https://rits.shanghai.nyu.edu/ai/karpathys-microgpt-on-fpga-...
TL;DR: The CPU implementation was 71x faster than the FPGA.
Note: model has only 4192 parameters.

- hedgehog 1 hour ago
  
  That post is uninteresting both because it misses the point and because it's not clear a human was even involved to perceive a point to miss. Sure, with an unlimited transistor budget, an unlimited power budget, and a design clocked at 4GHz and fabbed on 5nm, one of the best CPU design teams in the world can make a thing that is straight-line faster than a one-person project running at 80MHz on a 20-year-old 65nm FPGA. Any other answer would be extremely surprising.
  Now, there are a bunch of interesting things about this project. Seeing a tiny transformer running on an FPGA is informative, as is the fact that it was apparently a pretty quick project for one person + robot assistance. There are probably some transferable lessons for anyone else doing robo-FPGA development.
  https://github.com/fguzman82/gateGPT/tree/main/
- cyanydeez 1 hour ago
  
  Yeah, then there's prompt loading too.
  But anyone who can fit QWEN-3.6 35B with a sustained ~30 tokens/s and ~100k context with cache could print money as a hardware vendor.
  
  - wmf 1 hour ago
    
    That just sounds like a 3090.