Merged
32 changes: 12 additions & 20 deletions units/en/_toctree.yml
@@ -20,37 +20,29 @@
- local: unit1/6
title: Submit your final project!

- title: "2. Model Evaluation"
- title: "2. Preference Alignment"
sections:
- local: unit2/1
title: Introduction to Model Evaluation
title: Introduction to Preference Alignment
- local: unit2/2
title: vLLM Inference with Hugging Face Models
title: Direct Preference Optimization (DPO)
- local: unit2/3
title: Automatic Benchmarks
title: Advanced DPO Techniques
- local: unit2/4
title: Custom Domain Evaluation
- local: unit2/5
title: Submit your evaluation results!
title: DPO Hands-on Implementation

- title: "3. Preference Alignment"
- title: "3. Vision Language Models"
sections:
- local: unit3/1
title: Introduction to Preference Alignment
title: Introduction to Vision Language Models
- local: unit3/2
title: Direct Preference Optimization (DPO)
title: Using Pretrained VLMs
- local: unit3/3
title: Advanced DPO Techniques
title: Fine-Tuning VLMs
- local: unit3/4
title: DPO Hands-on Implementation
title: Hands-On Fine-Tuning VLMs

- title: "4. Vision Language Models"
- title: "4. Model Evaluation"
sections:
- local: unit4/1
title: Introduction to Vision Language Models
- local: unit4/2
title: Using Pretrained VLMs
- local: unit4/3
title: Fine-Tuning VLMs
- local: unit4/4
title: Hands-On Fine-Tuning VLMs
title: Coming soon!
132 changes: 23 additions & 109 deletions units/en/unit2/1.md
@@ -1,133 +1,47 @@
# Introduction to Model Evaluation
# Introduction to Preference Alignment with SmolLM3

Evaluation is a critical step in developing and deploying language models. It helps us understand how well our models perform across different capabilities and identify areas for improvement. This unit focuses on benchmark evaluation approaches to comprehensively assess your smol model.
Welcome to Unit 3 of the smollest course on fine-tuning! This module will guide you through preference alignment using **SmolLM3**, building on the instruction tuning foundation from Unit 1. You'll learn how to align language models with human preferences using Direct Preference Optimization (DPO) to create more helpful, harmless, and honest AI assistants.

<Tip>

We are already using evaluation to submit models to the course leaderboard. In this unit we will explore evaluation in more detail and use what we learn to evaluate our models and submit them to the leaderboard.
By the end of this unit you will be aligning an LLM with human preferences using Direct Preference Optimization (DPO). This course is smol but fast! If you're looking for a smoother gradient, check out [The LLM Course](https://huggingface.co/learn/llm-course/chapter1/1).

After completing this unit (and the assignment), don’t forget to test your knowledge with the [quiz](https://huggingface.co/spaces/smol-course/unit_2_quiz)!
After completing this unit (and the assignment), don’t forget to test your knowledge with the [quiz](https://huggingface.co/spaces/smol-course/unit_3_quiz)!

</Tip>

## Why Evaluation Matters
## What is Preference Alignment?

When we train or fine-tune a language model, we need systematic ways to measure its quality and performance. Evaluation helps us:
While supervised fine-tuning (SFT) teaches models to follow instructions and engage in conversations, preference alignment takes this further by training models to generate responses that match human preferences. It's the process of making AI systems more aligned with what humans actually want, rather than just following instructions literally. In simple terms, it makes language models better for applications in the real world.

- Compare models objectively so that we can understand how different models or training approaches perform. Use standardized metrics to compare different models or training approaches
- Identify strengths and weaknesses so that we can understand where our model excels and where it needs improvement
- Track progress so that we can monitor improvements across training iterations
- Ensure deployment readiness so that we can verify that our model meets performance requirements before production use
- Detect regressions so that we can catch performance degradation when making changes

## Tool of choice: LightEval

We'll use [`lighteval`](https://github.com/huggingface/lighteval), a powerful evaluation library developed by Hugging Face that integrates seamlessly with the Hugging Face ecosystem. LightEval provides:

- Access to standard benchmarks like MMLU, TruthfulQA, BBH, and GSM8K
- Flexible framework for creating custom evaluation tasks
- Efficient batch processing and parallelization with [vLLM](https://docs.vllm.ai/en/latest/) backend
- Integration with the Hugging Face model Hub
- Reproducible evaluation pipelines

## Installation and Setup

To get started with LightEval and [vLLM](https://docs.vllm.ai/en/latest/), install the required packages:

```bash
# Install LightEval with vLLM support
pip install lighteval[vllm]

# Or install separately
pip install lighteval
pip install vllm
```

[vLLM](https://docs.vllm.ai/en/latest/) provides significant speed improvements for evaluation by:
- Optimized attention mechanisms
- Efficient memory management
- Automatic batching and parallelization
- Support for tensor and data parallelism across multiple GPUs
Preference alignment addresses several key challenges in AI development. Models trained with preference alignment demonstrate improved behavior across multiple areas. They generate fewer harmful, biased, or inappropriate responses, and their outputs become more useful and relevant to actual human needs. Such models provide more truthful answers while reducing hallucinations, and their responses better reflect human values and ethics. Overall, preference-aligned models exhibit enhanced coherence, relevance, and response quality.
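
To make the idea of "human preferences" concrete, here is a minimal sketch of what a single preference record might look like. The `prompt` / `chosen` / `rejected` field names follow a common convention for preference datasets and are an assumption here, not something this unit prescribes; the example text and the `is_valid_preference_record` helper are hypothetical and only for illustration.

```python
# Hypothetical preference record: one prompt with a preferred ("chosen")
# and a dispreferred ("rejected") response. Field names are an assumed convention.
preference_example = {
    "prompt": "How do I reset my password?",
    "chosen": "Open Settings, choose Security, then select Reset password and follow the emailed link.",
    "rejected": "Just figure it out.",
}

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def is_valid_preference_record(record):
    """Return True if every required field is a non-empty string
    and the chosen and rejected responses differ."""
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return record["chosen"] != record["rejected"]


print(is_valid_preference_record(preference_example))
```

A check like this can be useful before training, since a pair whose two responses are identical carries no preference signal.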

<Tip>


There are certainly great alternatives to LightEval, which users might prefer to use, but for the purposes of this course we will stick with LightEval. Mainly because it offers a reproducible and complete set of evaluation tasks and metrics for all major benchmarks.

Some alternatives that we might explore later in the course are:

- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [OpenBench](https://github.com/groq/openbench)
For a deeper dive into alignment techniques, check out the [Direct Preference Optimization paper](https://huggingface.co/papers/2305.18290) which is the original paper that introduced DPO.

</Tip>

For a deeper dive into evaluation concepts and best practices, check out the [Evaluation Guidebook](https://github.com/huggingface/evaluation-guidebook).

## Types of Evaluation

To start, we will explore the two main types of evaluation: automatic benchmarks and domain-specific evaluation.

### 1. Automatic Benchmarks

Standard benchmarks provide a common ground for model comparison. They test various capabilities:

- General Knowledge: For example, MMLU tests knowledge across 57 subjects
- Truthfulness: For example, TruthfulQA evaluates tendency to reproduce misconceptions
- Reasoning: For example, BBH and GSM8K test logical thinking and mathematical problem-solving
- Language Understanding: For example, WinoGrande tests common sense reasoning

While these benchmarks are valuable for baseline comparisons, they have limitations:
- May not reflect real-world performance where models are deployed.
- Can be gamed through overfitting and leaking data.
- Don't capture domain-specific or user-specific requirements.

### 2. Domain-Specific Evaluation

Custom evaluations tailored to your use case provide more relevant insights:

- Test actual tasks your model will perform in a real-world setting.
- Use real examples from your domain.
- Implement metrics that matter for your application.
- Evaluate edge cases and failure modes.

### 3. Multi-Layered Evaluation Strategy

A comprehensive approach combines multiple evaluation methods:

1. Automated metrics for quick feedback during development
2. Human evaluation for nuanced quality assessment
3. Domain expert review for specialized applications
4. A/B testing in controlled production environments

During training, you will use automatic benchmarks to evaluate your model's performance and make modeling or parameter decisions based on the results. However, during deployment, you will need to use domain-specific evaluation to ensure that your model is performing as expected.

## Understanding Evaluation Metrics

Common metrics you'll encounter:
## Direct Preference Optimization (DPO)

- Accuracy: Percentage of correct predictions
- F1 Score: Harmonic mean of precision and recall
- Perplexity: How well the model predicts text
- BLEU/ROUGE: Text generation quality metrics
- Custom metrics: Domain-specific measurements
DPO revolutionizes preference alignment by eliminating the need for separate reward models and complex reinforcement learning. In this unit, we'll explore this leading technique for aligning language models with human preferences.

Each metric has strengths and limitations. Choose metrics that align with your application's requirements.
The DPO alignment pipeline is much simpler than the Reinforcement Learning from Human Feedback (RLHF) alignment pipeline. The process involves two main stages:

## Best Practices
1. Adapt the base model to follow instructions through supervised fine-tuning.
2. Directly optimize the model using preference data through Direct Preference Optimization.

1. Start with relevant benchmarks: Establish baselines using standard benchmarks related to your domain
2. Develop custom evaluations early: Don't wait until the end to create domain-specific tests
3. Version control everything: Track evaluation datasets, code, and results
4. Document your methodology: Record assumptions, limitations, and design decisions
5. Iterate based on findings: Use evaluation results to guide model improvements
This streamlined approach allows training on preference data without a separate reward model or complex reinforcement learning, while achieving comparable or better results. Don't worry if this is your first time seeing RLHF; we'll review it in more detail later in the course and see how it compares to DPO.
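
The second stage above can be sketched as a per-example loss. This is a minimal, framework-free illustration of the DPO objective from the original paper, not the implementation used in this course. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for readability; each log-probability is assumed to be the summed token log-probability of a full response under either the model being trained (the policy) or a frozen reference model, typically the SFT model from stage 1.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a response log-probability.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    # so that large positive or negative margins do not overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the margin is zero, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Note that no reward model appears anywhere: the preference signal comes entirely from comparing the policy's log-probabilities with the reference model's, which is what makes the pipeline simpler than RLHF.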

## What's Next
For the exercises in this unit, we will once again use [SmolLM3](https://hf.co/blog/smollm3), this time for preference alignment. You can start from either the instruction-tuned model or the result of the Unit 1 exercise.

In the following sections, we'll dive deeper into:
## What You'll Build

- Running automatic benchmarks with LightEval
- Creating custom evaluation tasks and metrics
- Building comprehensive evaluation pipelines
- Hands-on exercises to practice these concepts
Throughout this unit, you'll develop practical skills in preference alignment through hands-on implementation:

- You'll learn to train SmolLM3 using DPO on preference datasets.
- You'll master DPO hyperparameter configuration and tuning techniques.
- You'll compare DPO results with baseline instruction-tuned models.
- You'll evaluate model safety and alignment quality using standard benchmarks.
- You'll submit your aligned model to the course leaderboard.
- Finally, you'll explore how to deploy aligned models for practical applications.

Let's start by exploring how to use standard benchmarks effectively!
Ready to make your models more aligned with human preferences using DPO? Let's begin!