Axolotl is an open-source fine-tuning toolkit. You configure a training job in YAML — model, dataset, method — and Axolotl runs it, no custom training code required. It supports 60+ model architectures and multiple training methods, including LoRA (which trains a small set of adapter parameters instead of the full model, significantly reducing GPU memory) and QLoRA (which adds 4-bit quantization on top of LoRA to reduce memory even further).
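To see why LoRA saves so much memory, consider a single linear layer: LoRA of rank r on a d_in × d_out weight matrix trains only r·(d_in + d_out) parameters instead of d_in·d_out. A back-of-the-envelope sketch (the layer size here is hypothetical, not Qwen's actual dimensions; the rank matches the one used later in this guide):

```python
# Hypothetical square linear layer; r = 16 is the rank used in this guide's config
d_in = d_out = 2048
r = 16

full_weights = d_in * d_out        # 4,194,304 frozen weights in the original layer
lora_weights = r * (d_in + d_out)  # 65,536 trainable weights in the A and B matrices

print(f"{lora_weights / full_weights:.2%} of the layer is trainable")  # 1.56%
```

The same ratio holds roughly model-wide, which is why the training log later reports under 1% of parameters as trainable.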
This guide fine-tunes Qwen2.5-3B with LoRA on a Vast.ai GPU. We chose this model because it is ungated (no HuggingFace account needed), small enough to train on a single 24GB GPU, and widely used for fine-tuning. The same workflow applies to any Axolotl-supported model. By the end, you will have a working fine-tuned model.
## Prerequisites

### Hardware Requirements
- GPU VRAM: 24 GB minimum (training peaks at ~14 GB with LoRA and gradient checkpointing)
- Disk: 100 GB (model weights ~6 GB, plus dataset cache and checkpoints)
- CUDA: 12.4+
The 3B model with LoRA uses only ~14 GB of VRAM, so it fits comfortably on GPUs like the RTX 3090, RTX 4090, A5000, or A100. The remaining headroom means you can increase the batch size or sequence length if needed.
## Find and Rent a GPU
Search for a GPU instance with at least 24 GB VRAM and CUDA 12.4+:

```shell
vastai search offers \
  "gpu_ram >= 24 num_gpus = 1 cuda_vers >= 12.4 disk_space >= 100 reliability > 0.98" \
  --order "dph_base" --limit 10
```
Create an instance using the Axolotl template, which includes Axolotl, PyTorch, Flash Attention, and all core dependencies. You can find the template hash by searching for “Axolotl” on the Vast.ai templates page and copying the hash from the template details. Replace `<OFFER_ID>` with an ID from the search results:

```shell
vastai create instance <OFFER_ID> \
  --template_hash 43e16621b7e24ec58a340f33a6afd3ef \
  --disk 100 \
  --ssh --direct
```
You can also skip the CLI and create the instance directly from the Axolotl template page in the web UI.
The command returns a contract ID (e.g., `new_contract: 33402620`). Use this `<CONTRACT_ID>` for all subsequent commands.
The Axolotl Docker image is large (~15 GB). On slower connections, the image pull can take 30+ minutes. To filter for faster instances, add `inet_down >= 5000` to your search query.
Wait for the instance to reach running status. Look for `Status: running` in the output:

```shell
vastai show instance <CONTRACT_ID>
```
Once running, get the SSH connection command:

```shell
vastai ssh-url <CONTRACT_ID>
```

This returns a URL like `ssh://root@<SSH_HOST>:<SSH_PORT>`. Use the host and port for all subsequent SSH and SCP commands.
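If you want to pull the host and port out of that URL programmatically, it parses cleanly with Python's standard library (the address below is a placeholder, not a real instance):

```python
from urllib.parse import urlparse

# Placeholder URL; substitute the one returned by `vastai ssh-url`
url = urlparse("ssh://root@203.0.113.7:41234")

ssh_host, ssh_port = url.hostname, url.port
print(ssh_host, ssh_port)  # 203.0.113.7 41234
```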
## Configure the Training Job

Axolotl uses a single YAML file to configure the entire training job. Save the following as `config.yml` on your local machine:
```yaml
base_model: Qwen/Qwen2.5-3B

# Use the model's built-in chat template for formatting conversations
chat_template: tokenizer_default
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train[:10%]  # 10% = ~10K examples, keeps training fast
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
val_set_size: 0.05
output_dir: ./outputs/qwen25-3b-lora

sequence_len: 2048
sample_packing: true  # Packs multiple examples into each sequence to avoid wasted padding

# LoRA: train small adapter layers instead of the full model
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true  # Apply LoRA to all linear layers

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto  # Use 16-bit precision to halve memory vs 32-bit
tf32: true
gradient_checkpointing: true  # Saves ~30% VRAM at the cost of ~20% slower training
gradient_checkpointing_kwargs:
  use_reentrant: false

logging_steps: 1
flash_attention: true

warmup_ratio: 0.1  # Gradually increase learning rate for first 10% of training
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
```
Copy it to your instance:

```shell
scp -P <SSH_PORT> config.yml root@<SSH_HOST>:/workspace/config.yml
```
You can also create the file directly on the instance using nano or vim if you prefer.
The following table explains the key settings:
| Setting | Purpose |
|---|---|
| `base_model` | The pre-trained model to start from (downloaded automatically from HuggingFace) |
| `adapter: lora` | Trains small adapter layers alongside the frozen base model instead of updating all parameters, reducing VRAM from ~24 GB to ~14 GB |
| `lora_r: 16` | Controls LoRA capacity — higher rank means more trainable parameters but more VRAM |
| `lora_alpha: 32` | Scaling factor for LoRA updates, typically set to 2x the rank |
| `datasets` | FineTome-100k — 100K instruction-response pairs covering coding, writing, and reasoning. We use 10% to keep training fast |
| `sample_packing` | Combines multiple short training examples into a single sequence to maximize GPU utilization |
| `gradient_checkpointing` | Recomputes activations during the backward pass instead of storing them, trading ~20% speed for ~30% less memory |
| `micro_batch_size: 2` | Number of sequences processed per step. Combined with `gradient_accumulation_steps: 4`, each optimization step uses 8 sequences |
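To make the batch arithmetic concrete: the effective batch size per optimizer step is the product of the micro-batch size, the gradient accumulation steps, and the GPU count (one GPU in this guide).

```python
# Values from config.yml; num_gpus = 1 for this guide's single-GPU setup
micro_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8
```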
To train on your own dataset, replace the datasets section. Axolotl supports Alpaca format (instruction/input/output fields), conversation format (OpenAI-style messages), and many others. See the Axolotl dataset docs for all supported formats.
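As a minimal sketch of the Alpaca format (the filename and example rows below are made up for illustration), each JSONL line is one object with `instruction`, `input`, and `output` fields:

```python
import json

# Hypothetical training examples; replace with your own data
examples = [
    {"instruction": "Translate to French.", "input": "Hello, world!", "output": "Bonjour, le monde !"},
    {"instruction": "Name the capital of Japan.", "input": "", "output": "Tokyo."},
]

# Write one JSON object per line (the JSONL convention Axolotl expects)
with open("my_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

You would then point the `datasets` section at this file (e.g. `path: my_dataset.jsonl` with `type: alpaca`); check the Axolotl dataset docs for the exact loader options.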
## Run Training
SSH into your instance and launch the training run:

```shell
ssh -p <SSH_PORT> root@<SSH_HOST>
cd /workspace
WANDB_MODE=disabled axolotl train config.yml
```
Weights & Biases (W&B) is an experiment tracking platform. Setting `WANDB_MODE=disabled` skips it so you are not prompted for a login. To enable tracking, set `wandb_project` in your config and run `wandb login` first.
Axolotl downloads the model weights, preprocesses the dataset, and begins training. You should see output confirming LoRA is active:

```
trainable params: 29,933,568 || all params: 3,115,872,256 || trainable%: 0.9607
```
This means only ~30M parameters are being trained instead of the full 3B.
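The trainable percentage in that log line is simply the ratio of the two parameter counts:

```python
# Parameter counts from the training log above
trainable = 29_933_568
total = 3_115_872_256

pct = 100 * trainable / total
print(f"{pct:.4f}%")  # 0.9607%
```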
Training progress is logged every step. The key metrics are `loss` (how wrong the model’s predictions are — lower is better), `grad_norm` (magnitude of parameter updates), and `epoch` (progress through the dataset, where 1.0 = one full pass):

```
{'loss': '0.82', 'grad_norm': '0.21', 'learning_rate': '0.0', 'epoch': '0.003'}
{'loss': '0.67', 'grad_norm': '0.05', 'learning_rate': '0.000186', 'epoch': '0.254'}
...
{'loss': '0.60', 'grad_norm': '0.05', 'learning_rate': '2.67e-08', 'epoch': '0.994'}
```
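The learning-rate values in these logs follow the warmup-plus-cosine schedule from the config: linear ramp-up over the first 10% of steps, then cosine decay toward zero. A simplified sketch of that shape (not HuggingFace's exact scheduler implementation):

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to ~0."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0, 1000))     # start of warmup: rate is 0
print(lr_at(100, 1000))   # end of warmup: rate peaks at 2e-4
print(lr_at(1000, 1000))  # final step: rate decays to ~0
```

This matches the pattern in the logs: near-zero at the first step, ~1.86e-4 mid-warmdown, and ~2.67e-8 at the end.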
When training completes, you will see:

```
Training completed! Saving trained model to ./outputs/qwen25-3b-lora
```
The LoRA adapter is saved to `./outputs/qwen25-3b-lora/`. The adapter is approximately 80 MB, compared to the 6 GB base model.
## Test the Fine-Tuned Model
Verify the fine-tuned model by running inference. Save the following as `test_inference.py` on your local machine:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model (uses the HuggingFace cache from training — no re-download)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./outputs/qwen25-3b-lora")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, "./outputs/qwen25-3b-lora")

# Generate a response
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Copy it to the instance and run it:

```shell
scp -P <SSH_PORT> test_inference.py root@<SSH_HOST>:/workspace/test_inference.py
ssh -p <SSH_PORT> root@<SSH_HOST> "cd /workspace && python test_inference.py"
```
You should see output similar to the following:

```python
def is_prime(n: int) -> bool:
    """Check if a number is prime."""
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    ...
```
## Download Your Model
Before destroying the instance, download the LoRA adapter to your local machine:

```shell
scp -P <SSH_PORT> -r root@<SSH_HOST>:/workspace/outputs/qwen25-3b-lora ./qwen25-3b-lora
```
This downloads the ~80 MB adapter. To use it later, you also need the base model (Qwen/Qwen2.5-3B), which can be re-downloaded from HuggingFace.
## Cleanup
Destroy the instance to stop billing:

```shell
vastai destroy instance <CONTRACT_ID>
```
## Next Steps
- Train longer: Increase `num_epochs` to 3–4 or use the full 100K dataset (`split: train`) for better results
- Try QLoRA: Add `load_in_4bit: true` and change `adapter: lora` to `adapter: qlora` to reduce VRAM further, which is useful for larger models like Qwen2.5-72B
- Merge the adapter: Run `axolotl merge-lora config.yml` to combine the LoRA weights into the base model for faster inference without the PEFT library
- Use your own data: Replace the dataset with your own JSONL file in Alpaca or conversation format
- Scale to multi-GPU: Add a `deepspeed` or `fsdp` config section for distributed training across multiple GPUs; see the multi-node training guide
## Additional Resources