Exercise 2: Data Preparation
In this exercise, you will prepare the dataset for instruction tuning your LLM. You'll load a dataset, format it for instruction tuning, tokenize it, and create train/validation splits.
Learning Objectives
By the end of this exercise, you will be able to:
- Load and explore instruction tuning datasets
- Format data using instruction templates
- Tokenize data for LLM training
- Create train/validation splits
- Understand data preprocessing best practices for LLMs
Prerequisites
Before starting this exercise, ensure you have:
1. Completed Exercise 1: Setup & Exploration
2. A working environment with transformers, datasets, and other required libraries installed
Step 1: Import Required Libraries
Let's start by importing the necessary libraries:
import os
import json
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
import numpy as np
Step 2: Load the Dataset
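For instruction tuning, we'll use the Dolly dataset (databricks-dolly-15k), which contains instruction-response pairs. Each record has four fields: instruction, context (optional supporting text, often empty), response, and category. The code below loads the full train split: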
# Load Dolly dataset (or any instruction tuning dataset)
raw_datasets = load_dataset("databricks/databricks-dolly-15k", split="train")
# Explore the dataset
print(f"Dataset size: {len(raw_datasets)}")
print(f"Columns: {raw_datasets.column_names}")
print(f"\nExample:")
print(raw_datasets[0])
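Before going further, it can help to confirm that each record carries the fields the later steps rely on. The following is a minimal sketch on hand-written toy records, assuming the Dolly field names (instruction, context, response, category). `find_incomplete_records` and `REQUIRED_FIELDS` are hypothetical names introduced for illustration, not part of the datasets library.

```python
REQUIRED_FIELDS = ("instruction", "response")  # fields the template cannot do without


def find_incomplete_records(records):
    """Return indices of records with a missing or blank required field."""
    incomplete = []
    for index, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                incomplete.append(index)
                break  # one problem is enough to flag the record
    return incomplete


# Toy records (made up for illustration, not taken from Dolly)
toy_records = [
    {"instruction": "Name a primary color.", "context": "", "response": "Red.", "category": "open_qa"},
    {"instruction": "Summarize the text.", "context": "Some text.", "response": "   ", "category": "summarization"},
    {"instruction": "", "context": "", "response": "An answer.", "category": "open_qa"},
]

print(find_incomplete_records(toy_records))
```

The second and third toy records are flagged because one has a blank response and the other has an empty instruction. An empty context is not treated as a problem, since that field is optional.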
Step 3: Explore the Data
Let's understand the structure of our dataset:
# Check the different categories of instructions
categories = [example["category"] for example in raw_datasets]
unique_categories = set(categories)
print(f"Unique categories: {unique_categories}")
# Look at examples from different categories
for category in list(unique_categories)[:3]:
    print(f"\n--- {category} ---")
    example = next(e for e in raw_datasets if e["category"] == category)
    print(f"Instruction: {example['instruction'][:100]}...")
    print(f"Response: {example['response'][:100]}...")
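The set of categories shows which task types exist but not how balanced they are. As a sketch, `collections.Counter` gives per-category counts. The toy list below stands in for the category column of the dataset, and its labels are illustrative assumptions.

```python
from collections import Counter

# Stand-in for the dataset's category column; labels are made up for illustration
toy_categories = ["open_qa", "summarization", "open_qa", "brainstorming", "open_qa", "summarization"]

category_counts = Counter(toy_categories)
total = sum(category_counts.values())

# most_common() sorts from the largest category to the smallest
for category, count in category_counts.most_common():
    print(f"{category:<15} {count:>3}  ({count / total:.0%})")
```

A strongly skewed distribution is worth knowing about before splitting, because a small validation set may then contain very few examples of the rare categories.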
Step 4: Format Data for Instruction Tuning
Now we'll format the data using an instruction template:
# Define instruction template
def format_instruction(example):
    """Format instruction, optional context, and response for training."""
    # Dolly stores optional supporting text under "context" (it has no "input" column)
    return f"""### Instruction:
{example['instruction']}
### Input:
{example.get('context', '')}
### Response:
{example['response']}"""
# Apply formatting to create text field
formatted_dataset = raw_datasets.map(
    lambda example: {"text": format_instruction(example)},
    remove_columns=raw_datasets.column_names
)
print("Formatted example:")
print(formatted_dataset[0]["text"])
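Records without supporting text leave an empty Input section in a fixed template. One possible variant, sketched here as an assumption rather than a required change, drops that section when the record's context field is blank. `format_instruction_compact` is a hypothetical name, and the examples are made up.

```python
def format_instruction_compact(example):
    """Build a prompt, including the Input section only when context is present."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    context = (example.get("context") or "").strip()
    if context:
        parts.append(f"### Input:\n{context}")
    parts.append(f"### Response:\n{example['response']}")
    return "\n".join(parts)


# Toy examples (made up for illustration)
with_context = {"instruction": "Summarize.", "context": "Cats sleep a lot.", "response": "Cats sleep often."}
without_context = {"instruction": "Name a color.", "context": "", "response": "Blue."}

print(format_instruction_compact(with_context))
print("---")
print(format_instruction_compact(without_context))
```

Whichever template you choose, use exactly the same one at inference time, since the model learns to continue text that follows this layout.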
Step 5: Load Tokenizer and Tokenize Data
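Now let's tokenize our data. We load the tokenizer that matches the model, truncate each example to at most 512 tokens, and pad shorter examples up to that length: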
# Load tokenizer
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Set padding token
tokenizer.pad_token = tokenizer.eos_token
# Tokenize function
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors=None
    )
# Tokenize dataset
tokenized_dataset = formatted_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)
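# Check token lengths. Count real tokens via the attention mask:
# len(input_ids) is always 512 here because of padding="max_length".
token_lengths = [sum(mask) for mask in tokenized_dataset["attention_mask"]]
print(f"Average tokens: {np.mean(token_lengths):.1f}")
print(f"Max tokens: {max(token_lengths)}")
print(f"Min tokens: {min(token_lengths)}")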
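Because the tokenizer pads every example to max_length, the attention mask (1 for a real token, 0 for padding) is the reliable way to measure real lengths. It also shows how many examples fill the whole window and may therefore have been truncated. Below is a minimal sketch on toy masks, assuming a max_length of 6 for readability. `length_report` is a hypothetical helper.

```python
def length_report(attention_masks, max_length):
    """Summarize real token counts from 0/1 attention masks."""
    lengths = [sum(mask) for mask in attention_masks]  # 1s mark real tokens
    at_limit = sum(1 for n in lengths if n >= max_length)
    return {
        "mean": sum(lengths) / len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "share_at_limit": at_limit / len(lengths),
    }


# Toy masks padded to max_length=6 (made up for illustration)
toy_masks = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # fills the window, so it may have been truncated
    [1, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0],
]

report = length_report(toy_masks, max_length=6)
print(report)
```

If a large share of examples sits at the limit, responses are probably being cut off, and a larger max_length may be worth its extra memory cost. If almost none do, a smaller value would waste less compute on padding.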
Step 6: Create Train/Validation Split
Now let's hold out 10% of the examples for validation, using a fixed seed so the split is reproducible:
# Split into train and validation
train_val_split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
# Create DatasetDict
dataset_dict = DatasetDict({
    "train": train_val_split["train"],
    "validation": train_val_split["test"]
})
print(f"Training samples: {len(dataset_dict['train'])}")
print(f"Validation samples: {len(dataset_dict['validation'])}")
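To see why the seed matters, here is a plain-Python sketch of a seeded shuffle-and-cut split. It illustrates the idea only and is not how the datasets library implements train_test_split. `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_examples, test_size=0.1, seed=42):
    """Shuffle indices with a fixed seed, then cut off the validation share."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # local RNG, global random state untouched
    num_validation = round(num_examples * test_size)
    return indices[num_validation:], indices[:num_validation]


train_idx, val_idx = split_indices(100)
train_again, val_again = split_indices(100)

print(len(train_idx), len(val_idx))
print(val_idx == val_again)  # same seed, same split
print(set(train_idx).isdisjoint(val_idx))  # no example lands in both sets
```

With a fixed seed, rerunning the preparation produces the same validation set, so evaluation numbers from different training runs stay comparable.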
Step 7: Save the Processed Dataset
Finally, let's save our processed dataset:
# Save to disk
output_dir = "./data/processed"
dataset_dict.save_to_disk(output_dir)
print(f"Dataset saved to {output_dir}")
# Verify by loading back
loaded_dataset = DatasetDict.load_from_disk(output_dir)
print(f"Loaded dataset: {loaded_dataset}")
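The saved dataset holds token ids, which are hard to read by eye. For a quick manual check, a few formatted examples can also be written to a JSON Lines file. This is a sketch under assumptions: the file name `preview.jsonl` and the toy examples are made up, and a temporary directory is used so nothing under ./data/processed is touched.

```python
import json
import os
import tempfile

# Toy formatted examples (made up for illustration)
toy_examples = [
    {"text": "### Instruction:\nName a color.\n### Response:\nBlue."},
    {"text": "### Instruction:\nAdd 2 and 3.\n### Response:\n5"},
]

# A temporary directory keeps the sketch from touching ./data/processed
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preview.jsonl")

    # Write one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for example in toy_examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

    # Read the file back and confirm nothing changed
    with open(path, encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f]

print(reloaded == toy_examples)
```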
Summary
In this exercise, you learned how to:
1. Load instruction tuning datasets
2. Explore and understand data structure
3. Format data using instruction templates
4. Tokenize data for LLM training
5. Create train/validation splits
6. Save processed datasets