## Overview

Unsloth is an open-source framework that provides an optimized LoRA fine-tuning stack for LLMs, making fine-tuning up to 30× faster while using 60% less memory. It achieves this through custom Triton kernels, Flash Attention, and manual autograd, while maintaining or even improving accuracy. This recipe packages Unsloth with Kelpie so you can queue fine-tuning jobs on SaladCloud, automatically sync checkpoints to S3-compatible storage, and scale workers up or down with the Kelpie autoscaler. Each worker runs `/opt/unsloth-cli.py`, a wrapper around Unsloth's `FastLanguageModel` APIs. You control the training run entirely through Kelpie job arguments: model choice, dataset, LoRA knobs, checkpoint cadence, and save strategy.
## Prerequisites

- S3-compatible storage (AWS S3, Cloudflare R2, etc.) to persist checkpoints and final models. Provide the Access Key ID, Secret Access Key, Region, and (for R2) Endpoint URL when you deploy the recipe.
- A training dataset accessible from the container. By default the script downloads a Hugging Face dataset; you may also bring your own data via `sync.before`, or set `UNSLOTH_USE_MODELSCOPE=true` to read from ModelScope.
- (Optional) A Hugging Face Hub token if you want to push artifacts with `--push_model` or `--push_gguf`.
## Worker Storage Layout

- `/opt/checkpoints` — incremental checkpoints (used for resume).
- `/opt/outputs` — final merged model or GGUF export.

Always align your Kelpie `sync` rules with these paths:

- `sync.before` → download any previous checkpoints into `/opt/checkpoints/` if resuming.
- `sync.during` → regularly upload `/opt/checkpoints/` for safekeeping.
- `sync.after` → upload `/opt/outputs/` when training completes.
## Get Your Container Group ID

Kelpie jobs are tied to a specific container group. After deploying the recipe, fetch the container group from the SaladCloud API and capture the `.id` field from the response; it is required when you enqueue jobs.
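The container group ID is the top-level `id` field of the API response. A minimal sketch of extracting it with Python (the response body below is illustrative, not a real SaladCloud payload):

```python
import json

# Illustrative response body; a real "get container group" response
# contains many more fields alongside the top-level "id".
response_body = json.dumps({
    "id": "2f1e3c1a-9d3b-4b6e-8f2a-123456789abc",
    "name": "unsloth-workers",
    "replicas": 3,
})

container_group_id = json.loads(response_body)["id"]
print(container_group_id)  # use this ID when enqueuing jobs
```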
## Kelpie Job Arguments (`/opt/unsloth-cli.py`)
### Model Options

- `--model_name` (string, default `unsloth/llama-3-8b`) — base checkpoint to fine-tune.
- `--max_seq_length` (int, default `2048`) — context window.
- `--dtype` (string, default `None`) — force a dtype; auto-detected when omitted.
- `--load_in_4bit` (flag) — enable 4-bit loading to save VRAM.
- `--dataset` (string, default `yahma/alpaca-cleaned`) — Hugging Face or local dataset identifier.
### LoRA Options

- `--r` (int, default `16`) — LoRA rank.
- `--lora_alpha` (int, default `16`) — LoRA alpha.
- `--lora_dropout` (float, default `0.0`).
- `--bias` (string, default `none`).
- `--use_gradient_checkpointing` (string, default `unsloth`).
- `--random_state` (int, default `3407`).
- `--use_rslora` (flag) — enable rank-stabilized LoRA.
- `--loftq_config` (string, optional) — LoftQ configuration.
### Training Options

- `--per_device_train_batch_size` (int, default `2`).
- `--gradient_accumulation_steps` (int, default `4`).
- `--warmup_steps` (int, default `5`).
- `--max_steps` (int, default `400`).
- `--learning_rate` (float, default `2e-4`).
- `--optim` (string, default `adamw_8bit`).
- `--weight_decay` (float, default `0.01`).
- `--lr_scheduler_type` (string, default `linear`).
- `--seed` (int, default `3407`).
- `--logging_steps` (int, default `1`).
- `--report_to` (string, default `tensorboard`; set `none` to disable integrations).
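The effective batch size is the per-device batch size multiplied by the gradient accumulation steps, which in turn determines how many training examples the run consumes over `--max_steps`. A quick sanity check using the default values above:

```python
# Defaults from the training options above (single GPU).
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
max_steps = 400

# Examples processed per optimizer step.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 8

# Total examples seen across the whole run.
total_examples = effective_batch_size * max_steps
print(total_examples)  # 3200
```

If you lower the batch size to avoid OOM, raise `--gradient_accumulation_steps` to keep the effective batch size (and training dynamics) roughly unchanged.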
### Checkpoint & Resume

- `--save_strategy` (`no`|`steps`|`epoch`, default `steps`).
- `--save_steps` (int, default `500`).
- `--save_total_limit` (int, optional) — retain the most recent N checkpoints.
- `--resume` (flag) — auto-resume from the newest `checkpoint-*` in `--output_dir`.
- `--resume_from_checkpoint` (string) — explicitly pick a checkpoint directory.
Resume logic searches for `checkpoint-*` folders inside `--output_dir`. Make sure your `sync.before` step pulls those directories down before the job starts.

### Saving & Publishing
- `--output_dir` (string, default `/opt/checkpoints`) — where training checkpoints land.
- `--save_model` (flag) — write the final model after training.
- `--save_method` (`merged_16bit`|`merged_4bit`|`lora`, default `merged_16bit`).
- `--save_gguf` (flag) — additionally export GGUF quantizations.
- `--save_path` (string, default `/opt/outputs`).
- `--quantization` (one or many, default `q8_0`) — GGUF quantization presets.
- `--push_model` / `--push_gguf` (flags) — push to Hugging Face Hub; pair with `--hub_path` and `--hub_token`.
## Submit a Fine-Tuning Job
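A job pairs the container group ID with the script path, its arguments, and the sync rules. The payload below is a hypothetical sketch to POST to your Kelpie jobs endpoint; verify the exact field names (`command`, `arguments`, `container_group_id`, the `sync` bucket/prefix keys) against the Kelpie API documentation for your deployment:

```json
{
  "command": "python",
  "arguments": [
    "/opt/unsloth-cli.py",
    "--model_name", "unsloth/llama-3-8b",
    "--dataset", "yahma/alpaca-cleaned",
    "--load_in_4bit",
    "--max_steps", "400",
    "--save_steps", "100",
    "--resume",
    "--save_model",
    "--save_method", "merged_16bit"
  ],
  "container_group_id": "<your-container-group-id>",
  "sync": {
    "before": [
      { "bucket": "my-bucket", "prefix": "jobs/run-1/checkpoints/", "local_path": "/opt/checkpoints/", "direction": "download" }
    ],
    "during": [
      { "bucket": "my-bucket", "prefix": "jobs/run-1/checkpoints/", "local_path": "/opt/checkpoints/", "direction": "upload" }
    ],
    "after": [
      { "bucket": "my-bucket", "prefix": "jobs/run-1/outputs/", "local_path": "/opt/outputs/", "direction": "upload" }
    ]
  }
}
```

Note that `--save_steps 100` is deliberately smaller than `--max_steps 400` so intermediate checkpoints actually exist for the `sync.during` rule to upload; the default of `500` would produce none before the run ends.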
## Monitor a Job
## Troubleshooting

- Training restarts from step 0 — Ensure `sync.before` pulls checkpoints into `/opt/checkpoints/` and that you pass `--resume`.
- No checkpoints uploaded — Confirm the `sync.during` rule points at `/opt/checkpoints/` and that the bucket prefix is correct.
- OOM errors — Lower `--max_seq_length`, enable `--load_in_4bit`, or reduce batch size/accumulation steps.
- Push to Hub fails — Provide `--hub_token` with repo write access or disable the `--push_*` flags.
Need automatic scaling? Kelpie supports queue-aware autoscaling, including scale-to-zero when the queue is empty. Invite the Kelpie service account to your organization, then create a scaling rule with the Create Scaling Rule endpoint.