Overview
This recipe runs llama.cpp using the official `server-cuda` image on SaladCloud GPUs. It exposes OpenAI-compatible endpoints for inference and includes the built-in llama.cpp web UI at your deployment URL.
Ensure your use is permissible under the license for the model you deploy.
Model Source
You can set model options in two places:
- During deployment in the recipe form.
- After deployment by editing container group environment variables.
| Form Label (Portal) | Value / Example | Environment Variable |
|---|---|---|
| Model Source | Hugging Face Repo (GGUF) or Direct Model URL | N/A (selection only) |
| Hugging Face GGUF Repo | `ggml-org/gemma-3-1b-it-GGUF` | `LLAMA_ARG_HF_REPO` |
| Hugging Face File (Optional) | `gemma-3-1b-it-Q4_K_M.gguf` | `LLAMA_ARG_HF_FILE` |
| Model URL | Direct `.gguf` URL | `LLAMA_ARG_MODEL_URL` |
| Hugging Face Token | Hugging Face access token | `HF_TOKEN` |
- Hugging Face Repo (GGUF): set `LLAMA_ARG_HF_REPO`, and optionally `LLAMA_ARG_HF_FILE` to select a specific quantization file.
- Direct Model URL: set `LLAMA_ARG_MODEL_URL` to a direct `.gguf` file URL.
- For private or gated models, also set `HF_TOKEN`.
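As an illustration, the two model-source options map to environment variables like this (the values come from the examples on this page; the dict form is purely illustrative, since in practice these are set as container group environment variables):

```python
# Option A: load a GGUF model from a Hugging Face repo.
hf_repo_env = {
    "LLAMA_ARG_HF_REPO": "ggml-org/gemma-3-1b-it-GGUF",
    "LLAMA_ARG_HF_FILE": "gemma-3-1b-it-Q4_K_M.gguf",  # optional: pick a quantization file
}

# Option B: load from a direct .gguf URL instead.
direct_url_env = {
    "LLAMA_ARG_MODEL_URL": (
        "https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF"
        "/resolve/main/gemma-3-1b-it-Q4_K_M.gguf"
    ),
}
```

Use one option or the other, not both, for a given deployment.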
Example Models
Hugging Face repo examples (`LLAMA_ARG_HF_REPO`):
- `ggml-org/gemma-3-1b-it-GGUF`
- `bartowski/Qwen2.5-7B-Instruct-GGUF`
- `unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL`

Direct model URL examples (`LLAMA_ARG_MODEL_URL`):
- `https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf`
- `https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf`
Runtime Controls
You can also set runtime controls in the deployment form, then adjust them later as environment variables.

| Parameter | Form Label (Portal) | Environment Variable | Default | Notes |
|---|---|---|---|---|
| GPU Layers | GPU Layers | `LLAMA_ARG_N_GPU_LAYERS` | auto | Use `auto`, `all`, or an integer. |
| Context Size | Context Size | `LLAMA_ARG_CTX_SIZE` | 4096 | Larger context uses more VRAM. |
| Parallel Slots | Parallel Slots | `LLAMA_ARG_N_PARALLEL` | 1 | More parallel slots increase concurrent VRAM usage. |
| Model Alias | Model Alias | `LLAMA_ARG_ALIAS` | llama-cpp | Model name returned by `/v1/models` and API requests. |
| Host | (advanced config) | `LLAMA_ARG_HOST` | `::` | Default bind address. |
| Port | (advanced config) | `LLAMA_ARG_PORT` | 8080 | Internal server port. |
Additional `LLAMA_ARG_*` variables can be set in Advanced Configuration.
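To make the table concrete, here is an illustrative set of runtime values (the dict form is just a sketch; in a deployment these are container group environment variables, and the values shown are examples, not recommendations):

```python
# Illustrative runtime settings; tune for your GPU and traffic.
runtime_env = {
    "LLAMA_ARG_N_GPU_LAYERS": "all",  # "auto", "all", or an integer layer count
    "LLAMA_ARG_CTX_SIZE": "8192",     # larger context uses more VRAM
    "LLAMA_ARG_N_PARALLEL": "2",      # each extra slot adds concurrent VRAM usage
    "LLAMA_ARG_ALIAS": "my-model",    # model name returned by /v1/models
}
```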
API Endpoints
- `GET /health` - readiness probe and health check
- `GET /v1/models` - list available model aliases
- `POST /v1/chat/completions` - OpenAI-compatible chat completions
- `POST /v1/completions` - OpenAI-compatible completions
- `POST /v1/embeddings` - embeddings endpoint (model-dependent)
Example Request
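A minimal chat completion request can be sketched with only the Python standard library. The base URL, API key, and prompt below are placeholders; replace them with your deployment's access domain and your SaladCloud API key:

```python
import json
import urllib.request

# Placeholders: replace with your deployment URL and SaladCloud API key.
BASE_URL = "https://your-access-domain.salad.cloud"
API_KEY = "your-salad-api-key"

payload = {
    "model": "llama-cpp",  # must match LLAMA_ARG_ALIAS
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
}

request = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Salad-Api-Key": API_KEY,  # omit if gateway authentication is disabled
    },
    method="POST",
)

# Uncomment to send once the deployment is live:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```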
Omit the `Salad-Api-Key` header if authentication is disabled.

How To Use This Recipe
Step-by-Step Deployment
- Open the SaladCloud Portal.
- Create an organization if you do not have one yet, or open an existing organization and project.
- In your project, click Deploy Container Group.
- Select the llama.cpp recipe.
- Fill in the required fields:
  - Enter a Container Group Name.
  - In Model Source, choose:
    - Hugging Face Repo (GGUF) to load from Hugging Face.
    - Direct Model URL to load from a direct `.gguf` link.
  - If you choose Hugging Face Repo (GGUF), fill in Hugging Face GGUF Repo and, optionally, Hugging Face File for a specific file in that repo.
  - If you choose Direct Model URL, fill in Model URL.
- Fill in optional runtime/model fields as needed:
  - Model Alias controls the model name in API requests (default: `llama-cpp`).
  - Tune performance with GPU Layers, Context Size, and Parallel Slots based on your GPU and traffic needs.
  - Add Hugging Face Token only if your Hugging Face model is private or gated.
- Choose whether to require authentication with Require Container Gateway Authentication:
  - Enabled: requests must include a `Salad-Api-Key` header.
  - Disabled: public unauthenticated access.
- Deploy and wait for readiness checks to pass.
Authentication
Container Gateway authentication is enabled by default. When enabled, include your SaladCloud API key in the `Salad-Api-Key` header. See Sending Requests for details.
Replica Count
The recipe defaults to 3 replicas. Keep at least 3 for testing and consider 5+ for production to absorb interruptions from individual nodes.

Deploy And Wait
The model is downloaded at startup. This can take several minutes per replica depending on model size and node network conditions. Traffic starts routing only after the readiness probe (`GET /health`) passes.
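The startup wait can be sketched as a simple readiness poll against `/health` (the base URL is a placeholder and the timing values are illustrative, not part of this recipe):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 900, interval_s: float = 10) -> bool:
    """Poll GET /health until the server responds 200, or until the timeout expires.

    base_url is a placeholder for your deployment URL.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # model may still be downloading; keep polling
        time.sleep(interval_s)
    return False
```

Once this returns `True`, the gateway is routing traffic to that replica and API requests should succeed.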