Overview
This guide covers deploying vLLM on SaladCloud's community cloud of consumer GPUs (such as the RTX 4090) to serve large language models efficiently. vLLM is a high-throughput, open-source inference engine for LLMs. It is widely adopted in production because it provides:
- Continuous batching to maximize GPU utilization.
- PagedAttention for optimized memory management.
- Streaming outputs and an OpenAI-compatible API.
- Quantization support (FP8, AWQ, GPTQ) to reduce memory usage.
Example Models for SaladCloud Community
You can deploy any Hugging Face model that vLLM supports. Popular examples:
- Llama 3.1 8B Instruct — 8B-parameter model tuned for chat.
- Qwen2.5 7B Instruct — General-purpose 7B parameter model.
- Mistral 7B Instruct v0.3 — Compact model optimized for efficiency.
- Gemma 2 9B — Open model for reasoning and instruction following.
- DeepSeek R1 Distill Llama 8B — Llama-based 8B model distilled from DeepSeek R1.
Models are not bundled with the container. They are downloaded at runtime, which can take several minutes depending on size. For production, consider preloading models into the image.
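As a hypothetical sketch of that preloading approach, a custom image could download the weights at build time so replicas skip the runtime download. The base image, model ID, and paths below are examples, not the recipe's actual configuration:

```dockerfile
# Sketch: bake model weights into the image at build time.
# Base image, model ID, and paths are illustrative examples.
FROM vllm/vllm-openai:latest

# Download the model during the image build instead of at container start.
RUN pip install -U "huggingface_hub[cli]" && \
    huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
        --local-dir /models/Qwen2.5-7B-Instruct

# Point the server at the local copy rather than the Hugging Face hub.
ENV MODEL=/models/Qwen2.5-7B-Instruct
```

The trade-off is a much larger image (and slower image pulls), but node startup no longer depends on Hugging Face download speed.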
Configuration Options
When deploying, you can set:
- Model — Hugging Face model ID to load (required).
- Hugging Face Token — Optional; required only for private or gated models.
- Max Model Length — Model context length (default: 4096).
Example Request
Submit chat completion requests to the /v1/chat/completions endpoint:
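For instance, a minimal Python request might look like the following. The base URL, API key, and model ID are placeholders to replace with your own values, and the Salad-Api-Key header is only needed when authentication is enabled:

```python
import json
import urllib.request

# Placeholders: use your container group's access domain name and your own key.
BASE_URL = "https://your-container-group.salad.cloud"
API_KEY = "your-salad-api-key"

# Standard OpenAI-style chat completion payload; the model field should match
# the model ID you deployed.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    "max_tokens": 256,
    "stream": False,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Salad-Api-Key": API_KEY,  # only required if authentication is enabled
    },
)

# Uncomment to send the request against a live deployment:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Setting "stream": True instead returns tokens incrementally as server-sent events, which is usually preferable for interactive chat UIs.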
How To Use This Recipe
Authentication
If authentication is enabled, requests must include your SaladCloud API key in the Salad-Api-Key header. See Sending Requests.
Replica Count
We recommend at least 2 replicas for development and 3–5 replicas for production.
Logging
Logs are available in the SaladCloud Portal. You can also connect an external logging provider such as Axiom.
Deploy & Wait
When you deploy, SaladCloud will provision nodes, pull the container image, and download the model weights. Large models may take 5–10 minutes or more to fully load. Once replicas show a green checkmark in the Ready column, the service is live.
Advanced Settings
All settings are pre-configured, but you can override them via environment variables for performance tuning:
- DTYPE — Compute precision (auto, float16, bfloat16, float32).
- MAX_NUM_BATCH_TOKENS — Max tokens processed per batch.
- MAX_NUM_SEQS — Max concurrent sequences per batch.
- GPU_MEM_UTIL — Fraction of GPU VRAM vLLM can use (default: 0.92).
- QUANTIZATION — Quantization mode (awq, gptq, fp8, etc.).
- KV_CACHE_DTYPE — Precision for the key/value cache (auto, fp8, fp16, bf16).
- DOWNLOAD_DIR — Directory for caching downloaded models.
- TOKENIZER — Custom tokenizer repo or path.
- TRUST_REMOTE_CODE — Enable if the model requires custom code from Hugging Face.
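To illustrate how such environment variables typically drive the server, here is a hypothetical entrypoint helper that maps them onto vLLM's standard CLI flags. The mapping function and MODEL variable are assumptions for this sketch; the flag names are standard vllm serve options, but the recipe's actual entrypoint may differ:

```python
import shlex

# Assumed mapping from this guide's environment variables to vLLM CLI flags.
ENV_TO_FLAG = {
    "DTYPE": "--dtype",
    "MAX_NUM_BATCH_TOKENS": "--max-num-batched-tokens",
    "MAX_NUM_SEQS": "--max-num-seqs",
    "GPU_MEM_UTIL": "--gpu-memory-utilization",
    "QUANTIZATION": "--quantization",
    "KV_CACHE_DTYPE": "--kv-cache-dtype",
    "DOWNLOAD_DIR": "--download-dir",
    "TOKENIZER": "--tokenizer",
}

def build_vllm_args(env: dict) -> list[str]:
    """Build a `vllm serve` command line from environment-style settings."""
    args = ["vllm", "serve", env["MODEL"]]
    for var, flag in ENV_TO_FLAG.items():
        if env.get(var):
            args += [flag, env[var]]
    # TRUST_REMOTE_CODE is a boolean switch rather than a valued flag.
    if env.get("TRUST_REMOTE_CODE", "").lower() in ("1", "true"):
        args.append("--trust-remote-code")
    return args

# Example: override precision and VRAM fraction for an RTX 4090 deployment.
print(shlex.join(build_vllm_args({
    "MODEL": "Qwen/Qwen2.5-7B-Instruct",
    "DTYPE": "bfloat16",
    "GPU_MEM_UTIL": "0.92",
})))
```

Lowering GPU_MEM_UTIL leaves headroom for other processes on the node, while raising MAX_NUM_SEQS trades per-request latency for higher aggregate throughput.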