Axolotl is an open-source fine-tuning toolkit. You configure a training job in YAML — model, dataset, method — and Axolotl runs it, no custom training code required. It supports 60+ model architectures and multiple training methods, including LoRA (which trains a small set of adapter parameters instead of the full model, significantly reducing GPU memory) and QLoRA (which adds 4-bit quantization on top of LoRA to reduce memory even further).
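To see why LoRA saves so much memory, consider a single linear layer: LoRA of rank r on a d_in × d_out weight matrix trains only r·(d_in + d_out) parameters instead of d_in·d_out. A back-of-the-envelope sketch (the layer size here is hypothetical, not Qwen's actual dimensions; the rank matches the one used later in this guide):

```python
# Hypothetical square linear layer; r = 16 is the rank used in this guide's config
d_in = d_out = 2048
r = 16

full_weights = d_in * d_out        # 4,194,304 frozen weights in the original layer
lora_weights = r * (d_in + d_out)  # 65,536 trainable weights in the A and B matrices

print(f"{lora_weights / full_weights:.2%} of the layer is trainable")  # 1.56%
```

The same ratio holds roughly model-wide, which is why the training log later reports under 1% of parameters as trainable.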
This guide fine-tunes Qwen2.5-3B with LoRA on a Vast.ai GPU. We chose this model because it is ungated (no HuggingFace account needed), small enough to train on a single 24GB GPU, and widely used for fine-tuning. The same workflow applies to any Axolotl-supported model. By the end, you will have a working fine-tuned model.
## Prerequisites

### Hardware Requirements
- GPU VRAM: 24 GB minimum (training peaks at ~14 GB with LoRA and gradient checkpointing)
- Disk: 100 GB (model weights ~6 GB, plus dataset cache and checkpoints)
- CUDA: 12.4+
The 3B model with LoRA uses only ~14 GB of VRAM, so it fits comfortably on GPUs like the RTX 3090, RTX 4090, A5000, or A100. The remaining headroom means you can increase the batch size or sequence length if needed.
## Find and Rent a GPU
Search for a GPU instance with at least 24 GB VRAM and CUDA 12.4+:

```shell
vastai search offers \
  "gpu_ram >= 24 num_gpus = 1 cuda_vers >= 12.4 disk_space >= 100 reliability > 0.98" \
  --order "dph_base" --limit 10
```
Create an instance using the Axolotl template, which includes Axolotl, PyTorch, Flash Attention, and all core dependencies. You can find the template hash by searching for “Axolotl” on the Vast.ai templates page and copying the hash from the template details. Replace `<OFFER_ID>` with an ID from the search results:

```shell
vastai create instance <OFFER_ID> \
  --template_hash 43e16621b7e24ec58a340f33a6afd3ef \
  --disk 100 \
  --ssh --direct
```
You can also skip the CLI and create the instance directly from the Axolotl template page in the web UI.
The command returns a contract ID (e.g., `new_contract: 33402620`). Use this `<CONTRACT_ID>` for all subsequent commands.
The Axolotl Docker image is large (~15 GB). On slower connections, the image pull can take 30+ minutes. To filter for faster instances, add `inet_down >= 5000` to your search query.
Wait for the instance to reach running status. Look for `Status: running` in the output:

```shell
vastai show instance <CONTRACT_ID>
```
Once running, get the SSH connection command:

```shell
vastai ssh-url <CONTRACT_ID>
```

This returns a URL like `ssh://root@<SSH_HOST>:<SSH_PORT>`. Use the host and port for all subsequent SSH and SCP commands.
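If you want to pull the host and port out of that URL programmatically, it parses cleanly with Python's standard library (the address below is a placeholder, not a real instance):

```python
from urllib.parse import urlparse

# Placeholder URL; substitute the one returned by `vastai ssh-url`
url = urlparse("ssh://root@203.0.113.7:41234")

ssh_host, ssh_port = url.hostname, url.port
print(ssh_host, ssh_port)  # 203.0.113.7 41234
```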
## Configure the Training Job

Axolotl uses a single YAML file to configure the entire training job. Save the following as `config.yml` on your local machine:
```yaml
base_model: Qwen/Qwen2.5-3B

# Use the model's built-in chat template for formatting conversations
chat_template: tokenizer_default
datasets:
  - path: mlabonne/FineTome-100k
    type: chat_template
    split: train[:10%]  # 10% = ~10K examples, keeps training fast
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
val_set_size: 0.05
output_dir: ./outputs/qwen25-3b-lora

sequence_len: 2048
sample_packing: true  # Packs multiple examples into each sequence to avoid wasted padding

# LoRA: train small adapter layers instead of the full model
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true  # Apply LoRA to all linear layers

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto  # Use 16-bit precision to halve memory vs 32-bit
tf32: true
gradient_checkpointing: true  # Saves ~30% VRAM at the cost of ~20% slower training
gradient_checkpointing_kwargs:
  use_reentrant: false

logging_steps: 1
flash_attention: true

warmup_ratio: 0.1  # Gradually increase learning rate for first 10% of training
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
```
Copy it to your instance:

```shell
scp -P <SSH_PORT> config.yml root@<SSH_HOST>:/workspace/config.yml
```
You can also create the file directly on the instance using nano or vim if you prefer.
The following table explains the key settings:
| Setting | Purpose |
|---|---|
| `base_model` | The pre-trained model to start from (downloaded automatically from HuggingFace) |
| `adapter: lora` | Trains small adapter layers alongside the frozen base model instead of updating all parameters, reducing VRAM from ~24 GB to ~14 GB |
| `lora_r: 16` | Controls LoRA capacity — higher rank means more trainable parameters but more VRAM |
| `lora_alpha: 32` | Scaling factor for LoRA updates, typically set to 2x the rank |
| `datasets` | FineTome-100k — 100K instruction-response pairs covering coding, writing, and reasoning. We use 10% to keep training fast |
| `sample_packing` | Combines multiple short training examples into a single sequence to maximize GPU utilization |
| `gradient_checkpointing` | Recomputes activations during the backward pass instead of storing them, trading ~20% speed for ~30% less memory |
| `micro_batch_size: 2` | Number of sequences processed per step. Combined with `gradient_accumulation_steps: 4`, each optimization step uses 8 sequences |
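To make the batch arithmetic concrete: the effective batch size per optimizer step is the product of the micro-batch size, the gradient accumulation steps, and the GPU count (one GPU in this guide).

```python
# Values from config.yml; num_gpus = 1 for this guide's single-GPU setup
micro_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8
```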
To train on your own dataset, replace the datasets section. Axolotl supports Alpaca format (instruction/input/output fields), conversation format (OpenAI-style messages), and many others. See the Axolotl dataset docs for all supported formats.
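As a minimal sketch of the Alpaca format (the filename and example rows below are made up for illustration), each JSONL line is one object with `instruction`, `input`, and `output` fields:

```python
import json

# Hypothetical training examples; replace with your own data
examples = [
    {"instruction": "Translate to French.", "input": "Hello, world!", "output": "Bonjour, le monde !"},
    {"instruction": "Name the capital of Japan.", "input": "", "output": "Tokyo."},
]

# Write one JSON object per line (the JSONL convention Axolotl expects)
with open("my_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")
```

You would then point the `datasets` section at this file (e.g. `path: my_dataset.jsonl` with `type: alpaca`); check the Axolotl dataset docs for the exact loader options.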
## Run Training
SSH into your instance and launch the training run:

```shell
ssh -p <SSH_PORT> root@<SSH_HOST>
cd /workspace
WANDB_MODE=disabled axolotl train config.yml
```
Weights & Biases (W&B) is an experiment tracking platform. Setting `WANDB_MODE=disabled` skips it so you are not prompted for a login. To enable tracking, set `wandb_project` in your config and run `wandb login` first.
Axolotl downloads the model weights, preprocesses the dataset, and begins training. You should see output confirming LoRA is active:

```
trainable params: 29,933,568 || all params: 3,115,872,256 || trainable%: 0.9607
```
This means only ~30M parameters are being trained instead of the full 3B.
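The trainable percentage in that log line is simply the ratio of the two parameter counts:

```python
# Parameter counts from the training log above
trainable = 29_933_568
total = 3_115_872_256

pct = 100 * trainable / total
print(f"{pct:.4f}%")  # 0.9607%
```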
Training progress is logged every step. The key metrics are `loss` (how wrong the model’s predictions are — lower is better), `grad_norm` (magnitude of parameter updates), and `epoch` (progress through the dataset, where 1.0 = one full pass):

```
{'loss': '0.82', 'grad_norm': '0.21', 'learning_rate': '0.0', 'epoch': '0.003'}
{'loss': '0.67', 'grad_norm': '0.05', 'learning_rate': '0.000186', 'epoch': '0.254'}
...
{'loss': '0.60', 'grad_norm': '0.05', 'learning_rate': '2.67e-08', 'epoch': '0.994'}
```
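The learning-rate values in these logs follow the warmup-plus-cosine schedule from the config: linear ramp-up over the first 10% of steps, then cosine decay toward zero. A simplified sketch of that shape (not HuggingFace's exact scheduler implementation):

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to ~0."""
    warmup = int(total_steps * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0, 1000))     # start of warmup: rate is 0
print(lr_at(100, 1000))   # end of warmup: rate peaks at 2e-4
print(lr_at(1000, 1000))  # final step: rate decays to ~0
```

This matches the pattern in the logs: near-zero at the first step, ~1.86e-4 mid-warmdown, and ~2.67e-8 at the end.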
When training completes, you will see:

```
Training completed! Saving trained model to ./outputs/qwen25-3b-lora
```
The LoRA adapter is saved to `./outputs/qwen25-3b-lora/`. The adapter is approximately 80 MB, compared to the 6 GB base model.
## Test the Fine-Tuned Model
Verify the fine-tuned model by running inference. Save the following as `test_inference.py` on your local machine:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model (uses the HuggingFace cache from training — no re-download)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./outputs/qwen25-3b-lora")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(model, "./outputs/qwen25-3b-lora")

# Generate a response
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Copy it to the instance and run it:

```shell
scp -P <SSH_PORT> test_inference.py root@<SSH_HOST>:/workspace/test_inference.py
ssh -p <SSH_PORT> root@<SSH_HOST> "cd /workspace && python test_inference.py"
```
You should see output similar to the following:

```python
def is_prime(n: int) -> bool:
    """Check if a number is prime."""
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    ...
```
## Download Your Model
Before destroying the instance, download the LoRA adapter to your local machine:

```shell
scp -P <SSH_PORT> -r root@<SSH_HOST>:/workspace/outputs/qwen25-3b-lora ./qwen25-3b-lora
```
This downloads the ~80 MB adapter. To use it later, you also need the base model (Qwen/Qwen2.5-3B), which can be re-downloaded from HuggingFace.
## Cleanup
Destroy the instance to stop billing:

```shell
vastai destroy instance <CONTRACT_ID>
```
## Next Steps
- Train longer: Increase `num_epochs` to 3–4 or use the full 100K dataset (`split: train`) for better results
- Try QLoRA: Add `load_in_4bit: true` and change `adapter: lora` to `adapter: qlora` to reduce VRAM further, which is useful for larger models like Qwen2.5-72B
- Merge the adapter: Run `axolotl merge-lora config.yml` to combine the LoRA weights into the base model for faster inference without the PEFT library
- Use your own data: Replace the dataset with your own JSONL file in Alpaca or conversation format
- Scale to multi-GPU: Add a `deepspeed` or `fsdp` config section for distributed training across multiple GPUs; see the multi-node training guide
## Additional Resources