Overview
This recipe runs llama.cpp using the official `server-cuda` image on SaladCloud GPUs. It exposes OpenAI-compatible endpoints for inference and includes the built-in llama.cpp web UI at your deployment URL.
Ensure your use is permissible under the license for the model you deploy.
Model Source
You can set model options in two places:
- During deployment in the recipe form.
- After deployment by editing container group environment variables.
| Form Label (Portal) | Value / Example | Environment Variable |
|---|---|---|
| Model Source | Hugging Face Repo (GGUF) or Direct Model URL | N/A (selection only) |
| Hugging Face GGUF Repo | `ggml-org/gemma-3-1b-it-GGUF` | `LLAMA_ARG_HF_REPO` |
| Hugging Face File (Optional) | `gemma-3-1b-it-Q4_K_M.gguf` | `LLAMA_ARG_HF_FILE` |
| Model URL | Direct `.gguf` URL | `LLAMA_ARG_MODEL_URL` |
| Hugging Face Token | Hugging Face access token | `HF_TOKEN` |
- Hugging Face Repo (GGUF): set `LLAMA_ARG_HF_REPO`, and optionally `LLAMA_ARG_HF_FILE` to select a specific quantization file.
- Direct Model URL: set `LLAMA_ARG_MODEL_URL` to a direct `.gguf` file URL.
- For private or gated models, also set `HF_TOKEN`.
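As an illustration, the two model-source options map to environment variables like this (the values come from the examples on this page; the dict form is purely illustrative, since in practice these are set as container group environment variables):

```python
# Option A: load a GGUF model from a Hugging Face repo.
hf_repo_env = {
    "LLAMA_ARG_HF_REPO": "ggml-org/gemma-3-1b-it-GGUF",
    "LLAMA_ARG_HF_FILE": "gemma-3-1b-it-Q4_K_M.gguf",  # optional: pick a quantization file
}

# Option B: load from a direct .gguf URL instead.
direct_url_env = {
    "LLAMA_ARG_MODEL_URL": (
        "https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF"
        "/resolve/main/gemma-3-1b-it-Q4_K_M.gguf"
    ),
}
```

Use one option or the other, not both, for a given deployment.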
Example Models
Hugging Face repo examples (`LLAMA_ARG_HF_REPO`):
- `ggml-org/gemma-3-1b-it-GGUF`
- `bartowski/Qwen2.5-7B-Instruct-GGUF`
- `unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL`

Direct model URL examples (`LLAMA_ARG_MODEL_URL`):
- `https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf`
- `https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf`
Runtime Controls
You can also set runtime controls in the deployment form, then adjust them later as environment variables.

| Parameter | Form Label (Portal) | Environment Variable | Default | Notes |
|---|---|---|---|---|
| GPU Layers | GPU Layers | `LLAMA_ARG_N_GPU_LAYERS` | auto | Use `auto`, `all`, or an integer. |
| Context Size | Context Size | `LLAMA_ARG_CTX_SIZE` | 4096 | Larger context uses more VRAM. |
| Parallel Slots | Parallel Slots | `LLAMA_ARG_N_PARALLEL` | 1 | More parallel slots increase concurrent VRAM usage. |
| Model Alias | Model Alias | `LLAMA_ARG_ALIAS` | llama-cpp | Model name returned by `/v1/models` and API requests. |
| Host | (advanced config) | `LLAMA_ARG_HOST` | `::` | Default bind address. |
| Port | (advanced config) | `LLAMA_ARG_PORT` | 8080 | Internal server port. |
Additional `LLAMA_ARG_*` variables can be set in Advanced Configuration.
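To make the table concrete, here is an illustrative set of runtime values (the dict form is just a sketch; in a deployment these are container group environment variables, and the values shown are examples, not recommendations):

```python
# Illustrative runtime settings; tune for your GPU and traffic.
runtime_env = {
    "LLAMA_ARG_N_GPU_LAYERS": "all",  # "auto", "all", or an integer layer count
    "LLAMA_ARG_CTX_SIZE": "8192",     # larger context uses more VRAM
    "LLAMA_ARG_N_PARALLEL": "2",      # each extra slot adds concurrent VRAM usage
    "LLAMA_ARG_ALIAS": "my-model",    # model name returned by /v1/models
}
```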
API Endpoints
- `GET /health` - readiness probe and health check
- `GET /v1/models` - list available model aliases
- `POST /v1/chat/completions` - OpenAI-compatible chat completions
- `POST /v1/completions` - OpenAI-compatible completions
- `POST /v1/embeddings` - embeddings endpoint (model-dependent)
Example Request
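A minimal chat completion request can be sketched with only the Python standard library. The base URL, API key, and prompt below are placeholders; replace them with your deployment's access domain and your SaladCloud API key:

```python
import json
import urllib.request

# Placeholders: replace with your deployment URL and SaladCloud API key.
BASE_URL = "https://your-access-domain.salad.cloud"
API_KEY = "your-salad-api-key"

payload = {
    "model": "llama-cpp",  # must match LLAMA_ARG_ALIAS
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
}

request = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Salad-Api-Key": API_KEY,  # omit if gateway authentication is disabled
    },
    method="POST",
)

# Uncomment to send once the deployment is live:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```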
Omit the `Salad-Api-Key` header if authentication is disabled.

How To Use This Recipe
Step-by-Step Deployment
- Open the SaladCloud Portal.
- Create an organization if you do not have one yet, or open an existing organization and project.
- In your project, click Deploy Container Group.
- Select the llama.cpp recipe.
- Fill in the required fields:
  - Enter a Container Group Name.
  - In Model Source, choose:
    - Hugging Face Repo (GGUF) to load from Hugging Face.
    - Direct Model URL to load from a direct `.gguf` link.
  - If you choose Hugging Face Repo (GGUF), fill in Hugging Face GGUF Repo and, optionally, Hugging Face File for a specific file in that repo.
  - If you choose Direct Model URL, fill in Model URL.
- Fill in optional runtime/model fields as needed:
  - Model Alias controls the model name in API requests (default: `llama-cpp`).
  - Tune performance with GPU Layers, Context Size, and Parallel Slots based on your GPU and traffic needs.
  - Add Hugging Face Token only if your Hugging Face model is private or gated.
- Choose whether to require authentication with Require Container Gateway Authentication:
  - Enabled: requests must include a `Salad-Api-Key` header.
  - Disabled: public unauthenticated access.
- Deploy and wait for readiness checks to pass.
Authentication
Container Gateway authentication is enabled by default. When enabled, include your SaladCloud API key in the `Salad-Api-Key` header. See Sending Requests for details.
Replica Count
The recipe defaults to 3 replicas. Keep at least 3 for testing and consider 5+ for production to absorb interruptions from individual nodes.

Deploy And Wait
The model is downloaded at startup. This can take several minutes per replica depending on model size and node network conditions. Traffic starts routing only after the readiness probe (`GET /health`) passes.
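The startup wait can be sketched as a simple readiness poll against `/health` (the base URL is a placeholder and the timing values are illustrative, not part of this recipe):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 900, interval_s: float = 10) -> bool:
    """Poll GET /health until the server responds 200, or until the timeout expires.

    base_url is a placeholder for your deployment URL.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # model may still be downloading; keep polling
        time.sleep(interval_s)
    return False
```

Once this returns `True`, the gateway is routing traffic to that replica and API requests should succeed.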