Why fine-tuning still matters in a world of great base models
Prompting has gotten remarkably good. For many tasks, a well-crafted system prompt and a few examples will get you 80-90% of what you need. But the last 10-20% often matters enormously in production — and that's where fine-tuning earns its cost.
Fine-tuning is worth it when you need:
- Consistent output formatting that prompting can't reliably achieve.
- Domain-specific knowledge not well-represented in the base model.
- Latency requirements that make large context windows impractical.
- Cost constraints that favor a smaller specialized model over a frontier API.
Dataset preparation: where most fine-tuning projects fail
The single biggest predictor of fine-tuning success is dataset quality — not model selection, not hyperparameter tuning, not training compute.
What a good fine-tuning dataset looks like
- Minimum 500 examples for meaningful behavioral change. For complex tasks, 2,000-5,000 is more realistic.
- Consistent formatting across all examples. Mixed formats confuse the model more than bad examples.
- Representative of your actual inputs. Test on the distribution you'll see in production, not the distribution that was easy to collect.
- Cleaned and deduplicated. Run deduplication before training — duplicates cause overfitting.
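Deduplication doesn't require anything exotic. A minimal sketch using hashing over normalized text, assuming examples are dicts with `prompt` and `completion` keys (the field names are illustrative):

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def dedupe(examples: list[dict]) -> list[dict]:
    """Drop exact duplicates based on a hash of the normalized prompt + completion."""
    seen = set()
    unique = []
    for ex in examples:
        key = hashlib.sha256(
            normalize(ex["prompt"] + ex["completion"]).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

examples = [
    {"prompt": "Summarize:", "completion": "A short summary."},
    {"prompt": "summarize:", "completion": "a short  summary."},  # trivial variant
    {"prompt": "Translate:", "completion": "Une phrase."},
]
print(len(dedupe(examples)))  # 2 -- the normalized duplicate is dropped
```

Exact-match hashing only catches literal and near-literal duplicates; for paraphrase-level duplicates you would need embedding similarity, which is worth the extra cost on larger datasets.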
The synthetic data question
Generating training data with a larger model is increasingly common and often effective. The critical caveat: synthetic data captures the style and format of the teacher model, not just the content. This can be what you want (formatting consistency) or a problem (hallucinated facts).
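One cheap guard against teacher artifacts is to validate every synthetic example against your target schema before it enters the training set, so malformed or chatty teacher output never reaches training. A minimal sketch, assuming examples arrive as JSON lines with `prompt`/`completion` keys (the schema is illustrative):

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}

def is_valid_example(line: str) -> bool:
    """Accept only well-formed JSON objects with exactly the expected keys
    and non-empty string values; anything else is dropped."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or set(obj) != REQUIRED_KEYS:
        return False
    return all(isinstance(v, str) and v.strip() for v in obj.values())

raw = [
    '{"prompt": "Q: 2+2?", "completion": "4"}',
    '{"prompt": "Q: 2+2?"}',                     # missing completion
    'Sure! Here is the JSON you asked for:',     # teacher chatter, not JSON
]
kept = [line for line in raw if is_valid_example(line)]
print(len(kept))  # 1
```

Schema validation catches format failures but not hallucinated content; factual filtering still needs a separate check, such as grounding answers against a source document.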
LoRA configuration
For most fine-tuning tasks, LoRA (Low-Rank Adaptation) is the right choice. Full fine-tuning is prohibitively expensive and typically underperforms LoRA on small datasets.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                # rank -- start here, tune if needed
    lora_alpha=32,       # typically 2x rank
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# model = get_peft_model(base_model, lora_config)
```
Rank selection: r=16 is a solid starting point. Higher rank (r=32, r=64) captures more complex adaptations but risks overfitting on small datasets. Lower rank (r=8) is faster and often sufficient for formatting tasks.
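Rank controls adapter size directly: each adapted weight matrix of shape d_out × d_in gets two low-rank factors, A (r × d_in) and B (d_out × r), so trainable parameters scale linearly with r. A quick back-of-the-envelope, assuming a 4096-hidden-size model with square attention projections (the dimensions are illustrative):

```python
def lora_params(r: int, d_in: int, d_out: int, num_matrices: int) -> int:
    """Trainable LoRA parameters: A is (r x d_in), B is (d_out x r) per matrix."""
    return num_matrices * (r * d_in + d_out * r)

# Four target modules (q/k/v/o projections), each 4096 x 4096:
for r in (8, 16, 32):
    print(r, lora_params(r, 4096, 4096, 4))
# 8  ->   262,144 trainable parameters
# 16 ->   524,288
# 32 -> 1,048,576
```

Even at r=64 the adapter is a fraction of a percent of a 7B model's parameters, which is why doubling the rank is cheap to try when the lower rank underfits.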
Training configuration
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",      # required; checkpoints land here
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,
)
```
Watch your validation loss curve. If it diverges from training loss early, you're overfitting — reduce epochs or increase regularization.
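That check can be automated rather than eyeballed. A minimal sketch of an overfitting alarm, assuming you collect per-eval-step train and validation losses as lists (the tolerance value is arbitrary and worth tuning):

```python
def diverging(train_losses, val_losses, tolerance=0.05):
    """Flag overfitting: training loss is still falling while validation loss
    has risen more than `tolerance` above its best value so far."""
    if len(val_losses) < 2 or len(train_losses) < 2:
        return False
    best_val = min(val_losses)
    still_improving = train_losses[-1] < train_losses[-2]
    return still_improving and (val_losses[-1] - best_val) > tolerance

train = [2.1, 1.8, 1.5, 1.2, 0.9]
val = [2.0, 1.7, 1.6, 1.7, 1.9]
print(diverging(train, val))  # True: train still falling, val up 0.3 from its best
```

In practice, combining `load_best_model_at_end=True` with transformers' `EarlyStoppingCallback` gives you similar protection without a hand-rolled check.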
Evaluation: beyond perplexity
Perplexity on a held-out set tells you the model learned something, but not whether it learned the right thing. Build task-specific evaluation before training, not after.
For instruction-following tasks: measure exact format compliance, not just semantic correctness. A model that generates correct answers in the wrong format will fail your downstream parsing.
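Format compliance is cheap to measure separately from correctness. A minimal sketch for a task whose outputs must be bare JSON with a specific key (the `answer` schema is illustrative):

```python
import json

def format_compliant(output: str) -> bool:
    """Strict check: the output must be a JSON object with an 'answer' key
    and no surrounding prose -- partial credit defeats the purpose."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and "answer" in obj

outputs = [
    '{"answer": "42"}',
    'The answer is {"answer": "42"}',  # correct content, wrong format
    '{"result": "42"}',                # valid JSON, wrong schema
]
rate = sum(format_compliant(o) for o in outputs) / len(outputs)
print(f"{rate:.2f}")  # 0.33
```

Tracking this rate over training checkpoints also tells you whether format compliance was learned early (common) or is still improving (keep training).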
The deployment pitfalls
Merged vs. adapter weights: Merging LoRA weights into the base model speeds up inference but makes it harder to roll back. Keep adapter weights separate until you're confident in production.
Context length: Fine-tuned models often degrade on inputs longer than training examples. Test explicitly at the P95 input length from your production distribution.
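Computing that P95 threshold from production logs is straightforward. A sketch using nearest-rank percentiles, assuming token counts per request are already logged (the sample values are illustrative):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(k, 0)]

input_lengths = [120, 340, 95, 410, 2048, 600, 880, 150, 1200, 75]
p95 = percentile(input_lengths, 95)
print(p95)  # 2048 -- evaluate the model on inputs at or above this length
```

Measure lengths in the fine-tuned model's own tokens, not characters, since the degradation threshold is set by token positions seen during training.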
Quantization compatibility: Not all LoRA configurations quantize cleanly. Test GPTQ or AWQ quantization early if you need it — discovering incompatibility after training is expensive.