## Container vs Storage Volume for Model Loading
Two main options exist for storing model weights:

**Inside the container**: package model weights directly into your container image.

- Pros:
  - Faster initial startup, since weights are already in the container
  - No need to download or transfer weights from external storage
- Cons:
  - Much larger container images, leading to longer deployment times
  - Model weights cannot be updated without rebuilding the container

**Storage volume**: store weights in a persistent storage volume.

- Pros:
  - Smaller container images and faster deployments
  - Model weights can be updated without rebuilding the container
- Cons:
  - Initial cold start includes time to load weights from storage
  - Requires managing separate storage infrastructure
Increasing core counts can parallelize downloads, improving pull-through times
for large images. This benefit becomes particularly notable when handling
large files from the storage layer, as multiple cores process different parts
simultaneously, reducing overall download time.
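The overlap described above can be sketched with HTTP Range requests. This is an illustrative example (the URL, part count, and use of threads rather than raw cores are assumptions), showing how splitting a large file into byte ranges lets multiple workers download different parts simultaneously:

```python
# Sketch: parallel ranged download of a large file.
# Assumes the server supports HTTP Range requests; URL and part count
# are illustrative, not specific to any provider.
import concurrent.futures
import urllib.request


def byte_ranges(total_size: int, parts: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into `parts` contiguous (start, end) byte ranges."""
    chunk = total_size // parts
    return [
        (i * chunk, total_size - 1 if i == parts - 1 else (i + 1) * chunk - 1)
        for i in range(parts)
    ]


def fetch_range(url: str, start: int, end: int) -> bytes:
    """Download one byte range of the file."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def parallel_download(url: str, total_size: int, parts: int = 8) -> bytes:
    # One worker per range; more available cores let more ranges make
    # progress at once, which is where the pull-through speedup comes from.
    with concurrent.futures.ThreadPoolExecutor(max_workers=parts) as pool:
        chunks = pool.map(lambda r: fetch_range(url, *r), byte_ranges(total_size, parts))
    return b"".join(chunks)
```

The final `b"".join` reassembles the ranges in order, so the result is byte-identical to a single sequential download.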
## Loading Models from a Storage Volume Faster
One of the biggest factors in model startup time is loading the model from storage into GPU memory. For larger models (20B+ parameters), a standard Hugging Face load can take over 40 seconds, even with 2 GB/s transfer speeds from persistent storage. The underlying hardware is optimized for fast model loading, but several additional techniques can further reduce cold-start times.

### Tensorizer (recommended)
Tensorizer is a library that loads models from storage into GPU memory in a single step. Initially built for S3, it also works with Cerebrium's persistent storage (nearly 2 GB/s read speed). For large models (20B+ parameters), loading time decreases by 30–50%, with even greater improvements for larger models. See the GitHub page for details on the underlying methods. The following section covers using Tensorizer to load a model from storage directly into GPU memory.

### Installation
Add the following to the `[cerebrium.dependencies.pip]` section of your `cerebrium.toml` file to install Tensorizer in your deployment:
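A minimal sketch of the dependency entry; the `latest` version pin is an assumption, and you may prefer an explicit version:

```toml
[cerebrium.dependencies.pip]
tensorizer = "latest"
```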
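Once Tensorizer is installed, the serialize-once / load-fast pattern can be sketched as follows. This is an illustrative example, not Cerebrium's prescribed setup: the model ID, the storage path, and the optional `transformers` dependency are all assumptions.

```python
# Sketch: write a model to persistent storage once, then stream it back
# with Tensorizer on each cold start.
# Assumptions: `torch`, `transformers`, and `tensorizer` are installed;
# MODEL_ID and TENSOR_PATH below are placeholders.
import torch
from transformers import AutoConfig, AutoModelForCausalLM
from tensorizer import TensorDeserializer, TensorSerializer
from tensorizer.utils import no_init_or_tensor

MODEL_ID = "gpt2"                                 # placeholder model
TENSOR_PATH = "/persistent-storage/gpt2.tensors"  # hypothetical volume path


def serialize_once() -> None:
    # One-off step: download the weights normally, then write them to the
    # storage volume in Tensorizer's format.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    serializer = TensorSerializer(TENSOR_PATH)
    serializer.write_module(model)
    serializer.close()


def fast_load() -> torch.nn.Module:
    # Cold-start path: build an uninitialized model skeleton (no weight
    # allocation), then stream the weights from storage directly into
    # GPU memory in a single pass.
    config = AutoConfig.from_pretrained(MODEL_ID)
    with no_init_or_tensor():
        model = AutoModelForCausalLM.from_config(config)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    deserializer = TensorDeserializer(TENSOR_PATH, device=device)
    deserializer.load_into_module(model)
    deserializer.close()
    return model
```

Skipping weight initialization in `fast_load` matters: the usual `from_config` path allocates and randomly initializes every tensor before they are overwritten, which wastes exactly the time Tensorizer is trying to save.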