yandex-research/context-intensive-kv-offloading

KV Cache Offloading for Context-Intensive Tasks

Supplementary code for "KV Cache Offloading for Context-Intensive Tasks"  

The Text2JSON dataset gathered and used in the experiments is available as ./text_to_json.jsonl.gz. An upcoming update will move the dataset to Hugging Face (HF) and provide a detailed evaluation config for ShadowKV and improved variants on Text2JSON and other benchmarks integrated into OpenCompass.
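For a quick look at the dataset before running the full evaluation, the archive can be read with the Python standard library alone. This is a minimal sketch: it assumes only that the file is gzip-compressed JSON Lines (one JSON value per line), as the extension suggests, and assumes nothing about field names. `read_jsonl_gz` is a hypothetical helper, not part of this repository.

```python
import gzip
import json
from pathlib import Path


def read_jsonl_gz(path, limit=None):
    """Yield parsed records from a gzip-compressed JSON Lines file.

    `limit` caps the number of lines read (blank lines included).
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for i, line in enumerate(fh):
            if limit is not None and i >= limit:
                break
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    dataset = Path("text_to_json.jsonl.gz")
    if dataset.exists():
        for record in read_jsonl_gz(dataset, limit=3):
            # Show only top-level keys; the schema is not assumed here.
            print(sorted(record) if isinstance(record, dict) else type(record))
```

Reading lazily with a generator avoids loading the whole file into memory, which matters if individual samples are long.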

Evaluation

Transformers==4.52.4

To run the experiments, you first need to build a Docker image with our environment:

docker build -t eval_image .

Run the image with:

docker run -it -d --gpus all --volume /path/to_your_downloaded_models:/mnt/LLM eval_image

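Then open a shell inside the running container (replace container_id with the ID printed by docker run, or look it up with docker ps):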

docker exec -it container_id bash

Note: if you prefer to download models on the fly instead of mounting pre-downloaded ones, remove the --volume flag and unset the HF_HOME variable inside the container.

Inside the container, run opencompass on run_cfg.py with the appropriate CUDA_VISIBLE_DEVICES. We ran our experiments on A100-80G GPUs.

To run validation with YAKV on the MultiNeedle, Text2JSON and LongProc datasets with Qwen3-30B-A3B-Instruct-2507, Qwen3-4B-Instruct-2507, Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct, run

CUDA_VISIBLE_DEVICES=0 opencompass -w outputs run_cfg.py

from /workspace/. To use a subset of the models or datasets, modify run_cfg.py accordingly. You can change the sparse budget, chunk size, HIGGS grid, etc. by editing the corresponding values in the same run_cfg.py.

Transformers==5.6.1

You can also use a newer transformers version. The setup is the same, just use Dockerfile_v5 instead of Dockerfile:

docker build -t eval_image -f Dockerfile_v5 .

Then use the same docker run and docker exec commands as above.

The newer version allows you to use:

  • newer models such as Qwen3.5
  • newer benchmarks such as LongBench_v2
  • faster MoE kernels (relevant for Qwen3-30B-A3B-Instruct-2507). Be cautious: the faster kernels take more memory, so you are more likely to encounter OOM.

If you encounter OOM with Qwen3-30B-A3B on multineedle, you can set keys_on_device=False, which makes validation slower but less memory-demanding.
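To get a feel for why long contexts press against an 80 GB GPU, a back-of-the-envelope estimate of the dense (unoffloaded, unquantized) KV cache can help. The sketch below is illustrative only: `kv_cache_bytes` is a hypothetical helper, and the architecture values for Qwen3-30B-A3B (48 layers, 4 KV heads, head dimension 128) are assumptions that should be checked against the model's config.json. It does not model what YAKV actually keeps on the device.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """Size of a dense KV cache: keys and values for every layer and token."""
    # The leading factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Assumed architecture values (not stated in this README); verify in config.json.
ASSUMED_QWEN3_30B_A3B = dict(num_layers=48, num_kv_heads=4, head_dim=128)

if __name__ == "__main__":
    # 128k-token context, 16-bit cache entries (bytes_per_elem=2).
    total = kv_cache_bytes(num_tokens=128 * 1024, **ASSUMED_QWEN3_30B_A3B)
    print(f"{total / 2**30:.1f} GiB")
```

Under these assumptions the cache alone comes to 12 GiB per 128k-token sequence, on top of model weights and activations, and keys account for half of it.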

Baselines

You can also evaluate baselines such as ArkVale and LRQK.

To do this, follow the installation steps below.

ArkVale

The Git repo cloned below is a fork of the official repo. The fork differs in four ways:

  • It supports newer version of transformers (so you must install transformers==5.6.1)
  • It supports a dynamic token budget: the budget is a percentage of the prompt length, not a fixed number of tokens.
  • It does not offload generated tokens, for a fairer comparison with ShadowKV and YAKV (i.e. an upper bound of ArkVale is evaluated)
  • It supports additional group_size (from GQA) values, so that models like Llama-3.2-3B with group_size=3 can be evaluated
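The dynamic token budget mentioned in the list above can be sketched as follows. This is a hypothetical illustration, not the fork's code: the function name, the rounding rule (floor) and the minimum of one token are all assumptions.

```python
import math


def dynamic_token_budget(prompt_len, budget_fraction, min_tokens=1):
    """Return a token budget proportional to the prompt length.

    budget_fraction is the share of prompt tokens to keep, in (0, 1].
    """
    if prompt_len < 1:
        raise ValueError("prompt_len must be positive")
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")
    # Assumption: round down, but never below min_tokens or above the prompt.
    budget = math.floor(prompt_len * budget_fraction)
    return max(min_tokens, min(prompt_len, budget))


if __name__ == "__main__":
    # A fixed budget would treat both prompts identically; this one scales.
    for n in (4_000, 64_000):
        print(n, dynamic_token_budget(n, 0.25))
```

Scaling the budget with the prompt keeps the compression ratio constant across samples of different lengths, which makes results comparable between short and long inputs.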

Setup

mkdir -p baselines
git clone -b GQA_3_fix https://github.com/AndreyBocharnikov/ArkVale.git baselines/ArkVale
cd baselines/ArkVale
git submodule update --init --recursive --depth 1
cd source
TORCH_CUDA_ARCH_LIST="8.0" uv pip install --no-build-isolation -e .
cd ..

Set TORCH_CUDA_ARCH_LIST according to the GPU you are using; TORCH_CUDA_ARCH_LIST="8.0" is for A100.
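If you are unsure which value applies to your GPU, the compute capability can be queried from PyTorch. The snippet below is a small sketch: `arch_list_entry` is a hypothetical helper, and it assumes PyTorch with CUDA support is installed in the environment where you build ArkVale.

```python
def arch_list_entry(major, minor):
    """Format a compute capability as a TORCH_CUDA_ARCH_LIST entry, e.g. (8, 0) -> '8.0'."""
    return f"{major}.{minor}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None  # PyTorch is only needed for the live query below

    if torch is not None and torch.cuda.is_available():
        # get_device_capability returns a (major, minor) tuple for the device.
        major, minor = torch.cuda.get_device_capability(0)
        print(f'TORCH_CUDA_ARCH_LIST="{arch_list_entry(major, minor)}"')
    else:
        print("No CUDA device visible; set TORCH_CUDA_ARCH_LIST manually.")
```

The printed value can be used in place of "8.0" in the install command above.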

Launch

Just like with YAKV, you can control which models and datasets are used through the run_config.py file in the ArkVale folder. To start the evaluation, run this from the ArkVale folder:

python opencompass_run.py run_config.py -w outputs

LRQK

LRQK works with both transformers==4.52.4 and transformers==5.6.1. It is a more memory-demanding baseline, so on an A100 with Qwen3-30B-A3B you will encounter OOM even on text2json (66k tokens max). To evaluate Qwen3-30B-A3B in a memory-constrained regime (80 GB of GPU RAM), we recommend using transformers==4.52.4 and, optionally, cache_on_device=True (relevant for multineedle: 128k tokens on an 80 GB GPU).

For the more memory-demanding setup (Qwen3-30B-A3B on an 80 GB GPU), use the qwen_30B_long_context branch; otherwise, use transformers_v5. Substitute the chosen branch for branch_name in the setup command below.

Just like with ArkVale, you will clone a fork of the official repo. This fork supports a dynamic token budget and fixes random seeds for reproducibility.

Setup

mkdir -p baselines
git clone -b branch_name https://github.com/AndreyBocharnikov/LRQK.git baselines/LRQK
cd baselines/LRQK
cd cpp_kernel
make
cd ..

Launch

To start the evaluation, run this from the LRQK folder:

python ./opencompass_run.py run_config.py

GPU Inference

We provide separate, detailed instructions for compiling and running the GPU inference experiments in ./inference; see the README.md there.

Code Ownership

Our evaluation code uses significant parts of OpenCompass and implements our benchmarks and offloading methods as components. We also include a ShadowKV implementation based on the original ByteDance-Seed/ShadowKV project. In turn, our GPU inference code modifies the mini-sglang framework, a minimal version of SGLang. We do not own the OpenCompass codebase or the mini-sglang framework, and we are immensely grateful to their authors for their implementations. Our own code implements YAKV and additional evaluation benchmarks within these frameworks. The specific versions that we use are under the Apache-2.0 License for OpenCompass, the MIT License for mini-sglang, and the Apache-2.0 License for ShadowKV.
