This repository contains implementations of key inference optimization techniques for Large Language Models (LLMs).
A from-scratch implementation of Key-Value (KV) Caching for a GPT-style transformer.
- Features:
  - Custom `unoptimized_gpt` vs `cached_gpt` comparison.
  - Implements the caching mechanism in the attention layer.
  - Demonstrates significant speedup in autoregressive generation.
- Base Model: GPT-2 XL.
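
To illustrate the caching idea, here is a minimal, dependency-free sketch of single-head causal attention with a KV cache. It is not the code from `kv_cache_implementation.py`; the names `CachedAttention`, `attend`, and `uncached_last_output` are hypothetical, and the random weights are toy values chosen for illustration. The point it aims to show is that caching past keys and values gives the same output as recomputing them for the whole prefix, while projecting only the newest token at each step.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


class CachedAttention:
    """Single-head causal attention that stores past keys and values."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x):
        # Project only the newest token; earlier K/V come from the cache.
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        return attend(matvec(self.Wq, x), self.keys, self.values)


def uncached_last_output(xs, Wq, Wk, Wv):
    """Recompute K and V for the entire prefix, as an uncached model would."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


# Toy setup (assumed values): 4-dimensional embeddings, 6 tokens.
rng = random.Random(0)
d = 4


def rand_matrix():
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]


Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]

layer = CachedAttention(Wq, Wk, Wv)
cached_outputs = [layer.step(x) for x in tokens]
uncached_outputs = [uncached_last_output(tokens[: t + 1], Wq, Wk, Wv)
                    for t in range(len(tokens))]

# Largest element-wise gap between the two code paths.
max_diff = max(abs(a - b)
               for c, u in zip(cached_outputs, uncached_outputs)
               for a, b in zip(c, u))
print("max difference:", max_diff)
```

In this sketch the uncached path performs a number of key/value projections that grows with the prefix length at every step, while the cached path performs exactly one of each per step. That difference is the source of the speedup in autoregressive generation.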
Implementation of Speculative Sampling to accelerate inference without degrading model quality.
- Algorithm: Based on *Accelerating Large Language Model Decoding with Speculative Sampling* (DeepMind).
- Logic: Uses a smaller "draft" model to propose tokens and a larger "target" model to verify them in parallel.
- Key Files:
  - `specdec.py`: Core logic for `sd_sample` (speculative sampling) and `ar_sample` (autoregressive sampling).
  - `benchmark.py`: Script to compare speed and match rate between standard and speculative decoding.
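
The draft-and-verify logic can be sketched as follows. This is a toy, dependency-free illustration of the accept/reject rule from the speculative sampling paper, not the repository's `sd_sample`. The function `speculative_step` and the fixed-distribution `draft_model` and `target_model` are hypothetical stand-ins. Each "model" here maps a token prefix to a probability list over a 3-token vocabulary.

```python
import random


def sample(probs, rng):
    """Draw an index from a categorical distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(prefix, draft_model, target_model, k, rng):
    """Propose k draft tokens, then accept or reject them against the target."""
    # 1. The draft model proposes k tokens autoregressively.
    drafted, q_dists = [], []
    for _ in range(k):
        q = draft_model(prefix + drafted)
        drafted.append(sample(q, rng))
        q_dists.append(q)

    # 2. The target model scores all k + 1 positions. A real system would do
    #    this in one batched forward pass; here it is a plain loop.
    p_dists = [target_model(prefix + drafted[:i]) for i in range(k + 1)]

    # 3. Accept each drafted token with probability min(1, p(x) / q(x)).
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q).
        # rng.choices normalizes the weights, so no explicit division is needed.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        accepted.append(sample(residual, rng))
        return accepted

    # Every draft was accepted: take one extra token from the target.
    accepted.append(sample(p_dists[k], rng))
    return accepted


# Toy models (assumed): context-independent distributions over 3 tokens.
def draft_model(context):
    return [0.5, 0.3, 0.2]


def target_model(context):
    return [0.2, 0.3, 0.5]


rng = random.Random(0)
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    out = speculative_step([], draft_model, target_model, k=4, rng=rng)
    counts[out[0]] += 1

# Empirical distribution of the first emitted token.
freqs = [c / trials for c in counts]
print(freqs)
```

The empirical frequencies of the first emitted token should approximate the target distribution rather than the draft distribution. This is the sense in which the method does not degrade model quality: the accept/reject rule plus residual resampling yields samples distributed according to the target model. Each call emits between 1 and `k + 1` tokens.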
- Install Dependencies: `pip install -r requirements.txt`
- Run KV Cache Demo: `python kv_cache_implementation.py`

  Note: This script compares the generation time of the standard and cached implementations.
- Run Speculative Decoding Benchmark: `cd speculative_decoding`, then `python benchmark.py`