No, we don't use it. CRIU is used for ordinary checkpoint/restore of Linux processes. Since we run gVisor for container isolation, we use its checkpoint/restore support for the sandboxed process state.
Both approaches still need NVIDIA’s cuda-checkpoint for the GPU side, because CUDA/GPU memory and driver state are not something a normal process checkpointing tool can handle on its own.
They run their snapshot agent as a Kubernetes DaemonSet, whereas our implementation runs as part of the Cerebrium container runtime path. Under the hood, both approaches rely on cuda-checkpoint, since cuda-checkpoint is currently the main primitive NVIDIA exposes for interacting with GPU memory during checkpoint/restore.
One difference is how KV cache handling is exposed. NVIDIA’s approach appears to automatically handle KV cache allocation/deallocation, whereas today we expose that choice to users (vLLM and SGLang expose primitives to do this). In some cases, users may want to discard the KV cache to reduce checkpoint size and restore time; in others, preserving it may be useful.
Their DaemonSet approach is also nice because it can be more portable across Kubernetes environments and clouds. Our approach is more deeply integrated into the node/runtime layer, which gives us tighter control over the serverless startup path, but also means it depends on custom node VM images, which not every provider supports equally.
The optimizations they mention around parallel memfd restore and Linux native AIO for anonymous memory could also be applied to our architecture if we find them stable and beneficial. That said, our current results are already pretty close. For example, they report restoring Qwen3-8B in 4.7s with those changes, while we currently restore it in 6.49s.
The biggest thing we are excited for is multi-GPU restore, which is not supported yet. That would unlock a much broader set of workloads.
Are there any open source solutions, or is Cerebrium open-sourcing the technology behind it? I liked the write-up (side note: you might want to tone down the animations; as others have said, they were a bit dizzying). Overall it was a nice read, but I still wish there were more technical details, as cold starts are something I am interested in.
So are there any more resources the team could point to, or any plans to open-source it for deeper internal dives? I would love to know more about it!
gVisor is open-source, and `cuda-checkpoint` is freely available.
gVisor's `runsc checkpoint` subcommand supports a `--save-restore-exec-argv` flag, which lets you specify a program to execute before gVisor starts taking the process snapshot.
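To make the hook idea concrete, here is a rough sketch of how such a command line might be assembled. It only builds the argument list and does not run anything. Apart from `--save-restore-exec-argv`, everything here is an assumption: the `--image-path` flag, the quoting of the hook argv, and the hook script path (a hypothetical script that would toggle GPU state with `cuda-checkpoint`) should all be checked against `runsc checkpoint --help` for your gVisor version.

```python
import shlex
from typing import List


def build_checkpoint_cmd(container_id: str, image_path: str, hook_argv: List[str]) -> List[str]:
    """Assemble a `runsc checkpoint` argv that runs a hook before the snapshot.

    Assumptions (not confirmed by the thread): `--image-path` selects where the
    checkpoint is written, and the hook argv is passed as one shell-style string.
    """
    return [
        "runsc",
        "checkpoint",
        f"--image-path={image_path}",
        # The hook runs before gVisor snapshots the sandboxed process state.
        f"--save-restore-exec-argv={shlex.join(hook_argv)}",
        container_id,
    ]


if __name__ == "__main__":
    # Hypothetical hook script and container ID, for illustration only.
    cmd = build_checkpoint_cmd(
        container_id="my-container",
        image_path="/var/lib/checkpoints/my-container",
        hook_argv=["/opt/hooks/gpu-pre-checkpoint.sh", "--verbose"],
    )
    print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a node where `runsc` is installed.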
We and the team from Modal have been upstreaming changes to the gVisor repo (https://github.com/google/gvisor/pulls) to make it compatible with cuda-checkpoint and other parts of our system. While we are both contributing fixes and performance improvements, we are unfortunately leaving some secret sauce on the side, but hopefully what is upstream should get most folks to a successful implementation as is.
Hey! Yes, you are correct! We have both been upstreaming changes to the main gVisor repo. However, to work within our own infrastructure we had to make various changes that we explain throughout the article (open TCP connections, multiprocessing, Unix sockets, etc.).
Also, in our benchmarks we seem to perform better than Modal by ~20% in 4 of the 6 workloads we tested, with a lower spread of results, meaning you get more consistent results. However, the same fundamentals still apply: how can you move storage into memory as quickly as possible?
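As a toy illustration of the "move storage into memory as quickly as possible" point, the sketch below reads one file into a single preallocated buffer using several threads, each issuing positional reads for its own chunk. This is not how Cerebrium, Modal, or NVIDIA implement restore; the function name, chunk size, and worker count are made up for the example, and real gains depend heavily on the storage device and page cache.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def load_parallel(path: str, chunk_size: int = 4 << 20, workers: int = 8) -> bytearray:
    """Read `path` into one buffer using concurrent positional reads."""
    size = os.path.getsize(path)
    buf = bytearray(size)          # preallocate the destination once
    view = memoryview(buf)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset: int) -> None:
        end = min(offset + chunk_size, size)
        pos = offset
        while pos < end:
            # os.pread takes an explicit offset, so threads share one fd safely.
            data = os.pread(fd, end - pos, pos)
            if not data:
                raise IOError("unexpected end of file")
            view[pos:pos + len(data)] = data
            pos += len(data)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces completion and re-raises any worker exception.
            list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        view.release()
        os.close(fd)
    return buf
```

Positional reads avoid a shared file cursor, which is what lets the chunks proceed independently; the same shape applies whether the destination is a plain buffer or a memory-backed file.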
Started scrolling, immediately closed the page. Something is deeply wrong with a person who chooses to implement this shit on a webpage. Unusable garbage, I'm sorry, literally making me motion sick somehow.
I can't read good ;)
https://docs.cloud.google.com/kubernetes-engine/docs/concept...
You can fill in the blanks from there.