Last Updated: February 17, 2026
Deploy from the SaladCloud Portal.

Overview

This recipe runs llama.cpp using the official server-cuda image on SaladCloud GPUs. It exposes OpenAI-compatible endpoints for inference and includes the built-in llama.cpp web UI at your deployment URL.
Ensure your use is permissible under the license for the model you deploy.

Model Source

You can set model options in two places:
  • During deployment in the recipe form.
  • After deployment by editing container group environment variables.
Portal form labels for model selection:
| Form Label (Portal) | Value / Example | Environment Variable |
| --- | --- | --- |
| Model Source | Hugging Face Repo (GGUF) or Direct Model URL | N/A (selection only) |
| Hugging Face GGUF Repo | ggml-org/gemma-3-1b-it-GGUF | LLAMA_ARG_HF_REPO |
| Hugging Face File (Optional) | gemma-3-1b-it-Q4_K_M.gguf | LLAMA_ARG_HF_FILE |
| Model URL | Direct .gguf URL | LLAMA_ARG_MODEL_URL |
| Hugging Face Token | Hugging Face access token | HF_TOKEN |
Model source behavior:
  • Hugging Face Repo (GGUF): set LLAMA_ARG_HF_REPO, and optionally LLAMA_ARG_HF_FILE to select a specific quantization file.
  • Direct Model URL: set LLAMA_ARG_MODEL_URL to a direct .gguf file URL.
If the model is private or gated on Hugging Face, set HF_TOKEN.
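As a sketch, the two model-source modes map to environment variables like this (the repo, file, and token values are illustrative examples from the tables above, not requirements):

```shell
# Mode 1: load from a Hugging Face GGUF repo, optionally pinning one file
export LLAMA_ARG_HF_REPO="ggml-org/gemma-3-1b-it-GGUF"
export LLAMA_ARG_HF_FILE="gemma-3-1b-it-Q4_K_M.gguf"

# Mode 2: load from a direct .gguf URL instead (use one mode or the other)
# export LLAMA_ARG_MODEL_URL="https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf"

# Only needed when the Hugging Face model is private or gated
# export HF_TOKEN="<your-hf-token>"
```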

Example Models

Hugging Face repo examples (LLAMA_ARG_HF_REPO):
  • ggml-org/gemma-3-1b-it-GGUF
  • bartowski/Qwen2.5-7B-Instruct-GGUF
  • unsloth/Qwen3-Coder-Next-GGUF:UD-Q4_K_XL
Direct URL examples (LLAMA_ARG_MODEL_URL):
  • https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
  • https://huggingface.co/bartowski/Qwen2.5-7B-Instruct-GGUF/resolve/main/Qwen2.5-7B-Instruct-Q4_K_M.gguf

Runtime Controls

You can set runtime controls in the deployment form and adjust them later as environment variables.

| Parameter | Form Label (Portal) | Environment Variable | Default | Notes |
| --- | --- | --- | --- | --- |
| GPU Layers | GPU Layers | LLAMA_ARG_N_GPU_LAYERS | auto | Use auto, all, or an integer. |
| Context Size | Context Size | LLAMA_ARG_CTX_SIZE | 4096 | Larger context uses more VRAM. |
| Parallel Slots | Parallel Slots | LLAMA_ARG_N_PARALLEL | 1 | More parallel slots increase concurrent VRAM usage. |
| Model Alias | Model Alias | LLAMA_ARG_ALIAS | llama-cpp | Model name returned by /v1/models and used in API requests. |
| Host | (advanced config) | LLAMA_ARG_HOST | :: | Default bind address. |
| Port | (advanced config) | LLAMA_ARG_PORT | 8080 | Internal server port. |
After deployment, update these environment variables in the SaladCloud Portal under your container group settings. For advanced tuning, add additional supported LLAMA_ARG_* variables in Advanced Configuration.
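For example, a deployment tuned for longer prompts and modest concurrency might set the following (the values are illustrative, not recommendations; size them to your GPU's VRAM):

```shell
export LLAMA_ARG_N_GPU_LAYERS="all"   # offload every layer to the GPU
export LLAMA_ARG_CTX_SIZE="8192"      # double the default context; uses more VRAM
export LLAMA_ARG_N_PARALLEL="2"       # two concurrent slots; VRAM use scales with slots
export LLAMA_ARG_ALIAS="my-model"     # name reported by /v1/models
```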

API Endpoints

  • GET /health - readiness probe and health check
  • GET /v1/models - list available model aliases
  • POST /v1/chat/completions - OpenAI-compatible chat completions
  • POST /v1/completions - OpenAI-compatible completions
  • POST /v1/embeddings - embeddings endpoint (model-dependent)

Example Request

Omit the Salad-Api-Key header if authentication is disabled.
curl https://<your-dns>.salad.cloud/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -H 'Salad-Api-Key: <api-key>' \
  -d '{
    "model": "llama-cpp",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain how quantization helps GGUF models."}
    ],
    "max_tokens": 128,
    "temperature": 0
  }'
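To pull just the assistant text out of the JSON response, a small helper like the one below works. `extract_reply` is our own name for it, and it assumes the OpenAI-compatible response shape (`choices[0].message.content`):

```shell
# Print the assistant message from a chat-completions JSON response read on stdin.
# Uses python3 for JSON parsing; no third-party tools required.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage (same request as above):
# curl -s https://<your-dns>.salad.cloud/v1/chat/completions ... | extract_reply
```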

How To Use This Recipe

Step-by-Step Deployment

  1. Open the SaladCloud Portal.
  2. Create an organization if you do not have one yet, or open an existing organization and project.
  3. In your project, click Deploy Container Group.
  4. Select the llama.cpp recipe.
  5. Fill in the required fields:
    • Enter a Container Group Name.
    • In Model Source, choose:
      • Hugging Face Repo (GGUF) to load from Hugging Face.
      • Direct Model URL to load from a direct .gguf link.
    • If you choose Hugging Face Repo (GGUF), fill in Hugging Face GGUF Repo, and optionally Hugging Face File for a specific file in that repo.
    • If you choose Direct Model URL, fill in Model URL.
  6. Fill in optional runtime/model fields as needed:
    • Model Alias controls the model name in API requests (default: llama-cpp).
    • Tune performance with GPU Layers, Context Size, and Parallel Slots based on your GPU and traffic needs.
    • Add Hugging Face Token only if your Hugging Face model is private or gated.
  7. Choose whether to require authentication with Require Container Gateway Authentication:
    • Enabled: requests must include a Salad-Api-Key header.
    • Disabled: public unauthenticated access.
  8. Deploy and wait for readiness checks to pass.

Authentication

Container Gateway authentication is enabled by default. When enabled, include your SaladCloud API key in the Salad-Api-Key header. See Sending Requests for details.

Replica Count

The recipe defaults to 3 replicas. Keep at least 3 for testing and consider 5+ for production to absorb interruptions from individual nodes.

Deploy And Wait

The model is downloaded at startup. It can take several minutes per replica depending on model size and node network conditions. Traffic starts routing only after the readiness probe (GET /health) passes.
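A simple way to watch for readiness from your own machine is to poll GET /health until it returns 200. The `wait_for_ready` helper below is a sketch (the name is our own); it takes a probe command as a parameter, so the actual curl line stays a commented-out example with a placeholder URL:

```shell
# wait_for_ready PROBE [TRIES] [DELAY]
# Runs PROBE until it prints "200" (ready) or TRIES attempts are exhausted.
wait_for_ready() {
  probe="$1"; tries="${2:-60}"; delay="${3:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$("$probe")" = "200" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example probe against a deployment (placeholder URL):
# probe() { curl -s -o /dev/null -w '%{http_code}' https://<your-dns>.salad.cloud/health; }
# wait_for_ready probe 60 10 && echo "replica ready"
```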

Source Code