Reduce GVisor Cold Starts with GPU Snapshotting

(cerebrium.ai)

42 points | by jono_irwin 4 hours ago

7 comments

keynha 1 hour ago

The number that jumped out is 9GB restoring in 2.25s from S3 but 9s from local NVMe. I'd have bet on local, so the inversion is surprising.
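
For scale, the two figures above imply very different effective throughputs. A back-of-the-envelope check, assuming "9GB" means 9 decimal gigabytes and that both times cover the whole restore (the helper name is made up for illustration):

```python
def effective_throughput_gb_per_s(size_gb: float, seconds: float) -> float:
    """Average restore throughput in GB/s for a snapshot of size_gb restored in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_gb / seconds


# Figures quoted in the comment above.
s3 = effective_throughput_gb_per_s(9, 2.25)
nvme = effective_throughput_gb_per_s(9, 9)

print(f"S3: {s3:.1f} GB/s, NVMe: {nvme:.1f} GB/s, ratio: {s3 / nvme:.1f}x")
# prints "S3: 4.0 GB/s, NVMe: 1.0 GB/s, ratio: 4.0x"
```

Roughly 1 GB/s is below what a single NVMe drive can typically sustain for sequential reads, so the local path may be limited by something other than the drive itself.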
mountainriver 3 hours ago

How does this compare to the CRIU work? Or does it use that under the hood?

- za_mike157 3 hours ago
  
  No, we don't use it. CRIU is used for normal checkpoint/restore of Linux processes. Since we run gVisor for container isolation, we use its checkpoint/restore support for the sandboxed process state.
  Both approaches still need NVIDIA’s cuda-checkpoint for the GPU side, because CUDA/GPU memory and driver state are not something a normal process checkpointing tool can handle on its own.
eperot 2 hours ago

Wrong headline order, right? Should read "Reduce GPU Cold Starts with gVisor Snapshotting".

- za_mike157 2 hours ago
  
  haha you are right that the title is a bit strange - should just be "Reduce GPU cold starts with snapshotting"
  I can't read good ;)
gpgn_ 4 hours ago

Interesting work. How does NVIDIA Dynamo Snapshot relate?

- za_mike157 3 hours ago
  
  There are a lot of similarities.
  They run their snapshot agent as a Kubernetes DaemonSet, whereas our implementation runs as part of the Cerebrium container runtime path. Under the hood, both approaches rely on cuda-checkpoint, since cuda-checkpoint is currently the main primitive NVIDIA exposes for interacting with GPU memory during checkpoint/restore.
  One difference is how KV cache handling is exposed. NVIDIA’s approach appears to handle KV cache allocation/deallocation automatically, whereas today we expose that choice to users (vLLM and SGLang expose primitives to do this). In some cases, users may want to discard the KV cache to reduce checkpoint size and restore time; in others, preserving it may be useful.
  Their DaemonSet approach is also nice because it can be more portable across Kubernetes environments and clouds. Our approach is more deeply integrated into the node/runtime layer, which gives us tighter control over the serverless startup path, but also means it depends on custom node VM images, which not every provider supports equally.
  The optimizations they mention around parallel memfd restore and Linux native AIO for anonymous memory could also be applied to our architecture if we find them stable and beneficial. That said, our current results are already pretty close. For example, they report restoring Qwen3-8B in 4.7s with those changes, while we currently restore it in 6.49s.
  The biggest thing we are excited for is multi-GPU restore, which is not supported yet. That would unlock a much broader set of workloads.
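
To put "pretty close" in numbers, here is a small calculation using only the two Qwen3-8B timings quoted above (the helper is hypothetical):

```python
def relative_slowdown(ours_s: float, theirs_s: float) -> float:
    """Fraction by which ours_s exceeds theirs_s (0.25 means 25% slower)."""
    if theirs_s <= 0:
        raise ValueError("theirs_s must be positive")
    return (ours_s - theirs_s) / theirs_s


# Restore times quoted in the comment: 6.49s (Cerebrium) vs 4.7s (NVIDIA).
gap = relative_slowdown(6.49, 4.7)
print(f"absolute gap: {6.49 - 4.7:.2f}s, relative: {gap:.0%}")
```

The thread does not say whether the two timings were measured on the same hardware, so this is only a rough comparison.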
Imustaskforhelp 2 hours ago

Are there any open-source solutions, or is Cerebrium open-sourcing the technology behind it? I liked the write-up (side note: you might want to tone down the animations; as others have said, they were a bit dizzying), but I still wish it had more technical detail, as cold starts are something I'm interested in.
Are there any more resources the team could point to, or any plans to open-source it, for deeper dives into the internals? I would love to know more about it!

- harleyjs 1 hour ago
  
  This sort of technology is available on GKE
  https://docs.cloud.google.com/kubernetes-engine/docs/concept...
  
  - za_mike157 1 hour ago
    
    Interesting! I didn't see that they had released this. Do you know what their benchmarks are? I know that for Cloud Run they are pretty slow.
- eperot 2 hours ago
  
  gVisor is open-source, and `cuda-checkpoint` is freely available.
  gVisor's `runsc checkpoint` subcommand supports a `--save-restore-exec-argv` flag, which lets you specify a program to execute before gVisor starts taking the process snapshot.
  You can fill in the blanks from there.
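
As a rough illustration of the two pieces named above, the sketch below only builds and prints command lines; it runs nothing. It assumes `cuda-checkpoint --toggle --pid <PID>` and `runsc checkpoint --image-path=<dir> <container-id>` as the basic invocations. The PID, container ID, and path are placeholders. The first command is the kind of program one might hand to `--save-restore-exec-argv`; how that flag expects its value to be formatted is not covered here, so the two commands are shown separately rather than guessing.

```python
from __future__ import annotations

import shlex


def cuda_toggle_cmd(pid: int) -> list[str]:
    """argv asking cuda-checkpoint to toggle the CUDA state of one process."""
    return ["cuda-checkpoint", "--toggle", "--pid", str(pid)]


def runsc_checkpoint_cmd(container_id: str, image_path: str) -> list[str]:
    """argv for a plain gVisor checkpoint of one container."""
    return ["runsc", "checkpoint", f"--image-path={image_path}", container_id]


def plan(pid: int, container_id: str, image_path: str) -> list[list[str]]:
    """Ordered steps: move GPU state to host memory, then snapshot the sandbox."""
    return [
        cuda_toggle_cmd(pid),
        runsc_checkpoint_cmd(container_id, image_path),
    ]


# Placeholder values, for illustration only.
for argv in plan(4242, "gpu-sandbox", "/tmp/ckpt"):
    print(shlex.join(argv))
```

The ordering is the point: the GPU state has to be checkpointed into host memory first, so that the process snapshot taken afterwards contains it.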
  
  - za_mike157 2 hours ago
    
    We and the team from Modal have been upstreaming changes to the gVisor repo (https://github.com/google/gvisor/pulls) to make it compatible with cuda-checkpoint and other parts of our system. While we are both contributing fixes and performance improvements, we are unfortunately keeping some secret sauce to ourselves, but what is upstream should hopefully get most folks to a successful implementation as is.
htrp 4 hours ago

Isn't this exactly what modal does?

- za_mike157 4 hours ago
  
  Hey! Yes, you are correct! We have both been upstreaming changes to the main gVisor repo. However, to work within our own infrastructure we had to make various changes that we explain throughout the article (open TCP connections, multiprocessing, Unix sockets, etc.).
  Also, in our benchmarks we seem to perform better than Modal by ~20% in 4 of the 6 workloads we tested, with a lower spread of results, meaning you get more consistent results. However, the same fundamentals still apply: how can you move storage into memory as quickly as possible?
nixosbestos 4 hours ago

Started scrolling, immediately closed the page. Something is deeply wrong with a person who chooses to implement this shit on a webpage. Unusable garbage, I'm sorry, literally making me motion sick somehow.