Exercise 4: Evaluation
In this exercise, you will evaluate your fine-tuned LLM using perplexity and qualitative assessment. You'll compare experiments in MLflow.
Learning Objectives
By the end of this exercise, you will be able to: - Calculate perplexity for language models - Perform qualitative evaluation of model outputs - Compare experiments in MLflow - Understand LLM evaluation challenges
Prerequisites
Before starting this exercise, ensure you have: 1. Completed Exercise 3: LoRA Tuning 2. A fine-tuned LoRA adapter 3. MLflow tracking set up
Step 1: Calculate Perplexity
Perplexity measures how well the model predicts a sequence:
import math
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model and adapter
base_model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = PeftModel.from_pretrained(base_model, "./lora_adapter")
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
def calculate_perplexity(text, model, tokenizer):
"""Calculate perplexity for a given text."""
encodings = tokenizer(text, return_tensors="pt")
input_ids = encodings.input_ids
with torch.no_grad():
outputs = model(input_ids, labels=input_ids)
loss = outputs.loss
perplexity = math.exp(loss)
return perplexity
# Test on validation set
example_text = "MLOps is the practice of deploying and maintaining ML models in production."
ppl = calculate_perplexity(example_text, model, tokenizer)
print(f"Perplexity: {ppl:.2f}")
Step 2: Qualitative Evaluation
Compare base model vs fine-tuned model outputs:
def generate_response(prompt, model, tokenizer):
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(**inputs, max_new_tokens=100, temperature=0.7)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
test_prompt = "Explain what MLOps is in simple terms."
# Generate with base model
base_response = generate_response(test_prompt, base_model, tokenizer)
fine_tuned_response = generate_response(test_prompt, model, tokenizer)
print("Base model:")
print(base_response)
print("\nFine-tuned model:")
print(fine_tuned_response)
Step 3: Compare in MLflow
import mlflow
# List experiments
from mlflow.tracking import MlflowClient
client = MlflowClient()
experiments = client.search_experiments()
for exp in experiments:
print(f"{exp.name}: {exp.experiment_id}")
# Compare runs
runs = client.search_runs(experiment_ids=["<experiment_id>"])
for run in runs:
print(f"Run {run.info.run_id}: Loss = {run.data.metrics.get('train_loss')}")
Summary
You learned how to: - Calculate perplexity for evaluation - Perform qualitative comparisons - Use MLflow for experiment comparison