Code Assistant Usage Guide

This document describes how to install and use CleanTest-Agent skills with various AI coding assistants.

Supported Coding Assistants

| Assistant | Status | Installation Method |
| --- | --- | --- |
| CodeBuddy | Tested | Copy to `~/.codebuddy/skills/` |
| Claude Code | Compatible | Copy to `~/.claude/skills/` |
| Cursor | Compatible | Copy to project `.cursor/skills/` |
| Other (SKILL.md compatible) | Should work | Copy to the assistant's skill directory |

Installation

Step 1: Install Python Dependencies

cd cleantest-agent
pip install -r requirements.txt

Step 2: Install Skills

Global Install (all projects):

For CodeBuddy:

make install
# This copies all 4 skills to ~/.codebuddy/skills/

For Claude Code:

mkdir -p ~/.claude/skills
for s in cleantest-pipeline cleantest-syntax-filter \
         cleantest-relevance-filter cleantest-coverage-filter; do
  # Copy into the parent directory so a re-run does not nest $s inside itself
  cp -R "skills/$s" ~/.claude/skills/
done

Project-Level Install (current project only):

mkdir -p .codebuddy/skills
cp -R skills/* .codebuddy/skills/
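
The copy commands above can also be scripted when the same skills go to several assistants. The following is a minimal sketch, assuming each skill is a folder under `skills/` that contains a `SKILL.md`; the helper name `install_skills` is hypothetical and not part of CleanTest-Agent.

```python
import shutil
from pathlib import Path


def install_skills(source_dir, target_dir):
    """Copy every folder that contains a SKILL.md from source_dir into target_dir.

    Hypothetical helper for illustration; returns the names of the skills copied.
    """
    source = Path(source_dir)
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    installed = []
    for skill in sorted(source.iterdir()):
        if not (skill / "SKILL.md").is_file():
            continue  # skip files and folders that are not skills
        # dirs_exist_ok (Python 3.8+) lets a re-run overwrite an earlier install
        shutil.copytree(skill, target / skill.name, dirs_exist_ok=True)
        installed.append(skill.name)
    return installed
```

For example, `install_skills("skills", "~/.claude/skills")` would mirror the Claude Code loop, and `install_skills("skills", ".codebuddy/skills")` the project-level install.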

Step 3: Restart Your Coding Assistant

The assistant scans for new skills on startup. After installation, restart your IDE or coding assistant session.

How to Use the Skills

Method 1: Natural Language Trigger (Recommended)

Describe what you want in natural language. The assistant picks the skill to invoke by matching your request against the trigger phrases defined in each SKILL.md.

| What You Want | What to Say | Skill Invoked |
| --- | --- | --- |
| Clean a full dataset | "Help me clean this unit test training dataset" | `cleantest-pipeline` |
| Check syntax noise | "Check this code for syntax noise" | `cleantest-syntax-filter` |
| Check test relevance | "Is this test relevant to its focal method?" | `cleantest-relevance-filter` |
| Predict coverage | "Predict the coverage of this test" | `cleantest-coverage-filter` |
| Run full pipeline | "Run the CleanTest pipeline on my CSV" | `cleantest-pipeline` |

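Chinese triggers also work (English meaning in parentheses):

"帮我清洗这份单元测试数据集" ("Help me clean this unit test dataset")
"检测语法噪声" ("Detect syntax noise")
"检查测试相关性" ("Check test relevance")
"运行 CleanTest" ("Run CleanTest")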

Method 2: CLI (Without Coding Assistant)

# Full pipeline (no LLM, skip coverage)
python -m cleantest_agent.pipeline \
  --input_csv path/to/data.csv \
  --output_dir ./output \
  --skip_coverage

# Full pipeline with LLM enhancement
export OPENAI_API_KEY="sk-..."
python -m cleantest_agent.pipeline \
  --input_csv path/to/data.csv \
  --output_dir ./output \
  --llm_enhance

# Individual filters
python skills/cleantest-syntax-filter/scripts/syntax_filter.py \
  --input_csv data.csv --output_csv filtered.csv

python skills/cleantest-relevance-filter/scripts/relevance_filter.py \
  --input_csv data.csv --output_csv filtered.csv
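
When the pipeline is driven from another Python script, the same command line can be assembled programmatically. This is a minimal sketch that uses only the flags shown above; the function names `build_pipeline_cmd` and `run_pipeline` are hypothetical wrappers, not part of the package.

```python
import subprocess
import sys


def build_pipeline_cmd(input_csv, output_dir, skip_coverage=False, llm_enhance=False):
    """Build the argument list for `python -m cleantest_agent.pipeline`."""
    cmd = [
        sys.executable, "-m", "cleantest_agent.pipeline",
        "--input_csv", str(input_csv),
        "--output_dir", str(output_dir),
    ]
    if skip_coverage:
        cmd.append("--skip_coverage")
    if llm_enhance:
        # The CLI example above exports OPENAI_API_KEY before using this flag
        cmd.append("--llm_enhance")
    return cmd


def run_pipeline(**kwargs):
    """Run the pipeline; raises subprocess.CalledProcessError on a non-zero exit."""
    return subprocess.run(build_pipeline_cmd(**kwargs), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths that contain spaces.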

Method 3: Invoke Specific Skill by Name

In CodeBuddy, you can also explicitly invoke a skill:

"Use the cleantest-syntax-filter skill to check my dataset for noisy annotations."

The assistant will read the corresponding SKILL.md and follow its instructions.

Skill Details

cleantest-pipeline (Orchestrator)

File: skills/cleantest-pipeline/SKILL.md

Purpose: Runs all 3 filters in sequence and generates a noise report.

Input: CSV with src_fm and target columns.

Output:

filtered_data.csv -- cleaned dataset
noise_report.json -- structured statistics
summary.md -- human-readable report
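
Because the pipeline expects `src_fm` and `target` columns, it can help to check the header before starting a long run. This is a minimal sketch using the standard `csv` module; the helper name `missing_columns` is hypothetical.

```python
import csv

REQUIRED_COLUMNS = ("src_fm", "target")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])  # empty file -> empty header
    return [name for name in required if name not in header]
```

An empty return value means both columns are present; anything else lists what to fix before invoking the skill.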

cleantest-syntax-filter (Filter 1)

File: skills/cleantest-syntax-filter/SKILL.md

Purpose: Detects 6 syntactic noise types (N1–N6) from the CleanTest paper, including unnecessary annotations matched against a 21,954-pattern dictionary.

Key Resources:

references/noise_modifier_fm.txt -- Dictionary of 21,954 annotation patterns
Uses Aho-Corasick automaton for fast matching
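
To illustrate why an Aho-Corasick automaton suits a dictionary of this size, here is a minimal pure-Python sketch of the algorithm. It is not the skill's actual implementation, and the patterns in the usage note are illustrative rather than taken from `noise_modifier_fm.txt`.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: one pass over the text reports every pattern hit."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state (trie edges)
        self.fail = [0]    # fail[state] -> state for the longest proper suffix
        self.out = [[]]    # out[state] -> patterns that end at this state
        for pattern in patterns:
            if pattern:  # ignore empty patterns
                self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                fallback = self.fail[state]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[nxt] = self.goto[fallback].get(ch, 0)
                # Inherit patterns that end at the suffix state
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find_all(self, text):
        """Return a list of (start_index, pattern) for every occurrence in text."""
        hits = []
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pattern in self.out[state]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits
```

A call such as `AhoCorasick(["@Override", "@Test"]).find_all(source)` scans `source` once regardless of how many patterns are loaded, which is the property that matters with tens of thousands of dictionary entries.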

cleantest-relevance-filter (Filter 2)

File: skills/cleantest-relevance-filter/SKILL.md

Purpose: Checks if test cases are relevant to their focal methods.

Two-Stage Design:

AST name matching (fast, deterministic)
LLM semantic judgment (used only for the ~12.7% of cases that are borderline)
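
The two-stage idea can be sketched as follows. Several things here are assumptions made for illustration: a regular expression stands in for real AST parsing, a test that never calls the focal method by name is treated as the borderline case, and `llm_judge` is a hypothetical callback rather than the skill's actual LLM interface.

```python
import re

# Regex stand-in for the AST step: the identifier right before the first "("
_NAME_BEFORE_PAREN = re.compile(r"([A-Za-z_]\w*)\s*\(")


def focal_method_name(focal_method_src):
    """Return the first call-like identifier in the source, or None."""
    match = _NAME_BEFORE_PAREN.search(focal_method_src)
    return match.group(1) if match else None


def is_relevant(focal_method_src, test_src, llm_judge=None):
    """Stage 1: cheap name match. Stage 2: defer unresolved cases to llm_judge."""
    name = focal_method_name(focal_method_src)
    if name and re.search(r"\b" + re.escape(name) + r"\s*\(", test_src):
        return True  # the test calls the focal method by name
    if llm_judge is None:
        return False  # no second stage configured
    # Assumed borderline case: no direct call found, so ask the second stage
    return bool(llm_judge(focal_method_src, test_src))
```

Ordering the stages this way keeps the expensive judgment off the common path, since most samples are settled by the deterministic check.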

cleantest-coverage-filter (Filter 3)

File: skills/cleantest-coverage-filter/SKILL.md

Purpose: Filters samples by branch coverage. Default label mode reads the condition_cover_rate column from the input CSV; optional model mode uses a fine-tuned Qwen2.5-Coder-0.5B regression model when no label is available.

Requirements: Label mode needs only pandas. Model mode needs a trained Qwen2.5-Coder-0.5B checkpoint and a GPU. Pass --skip_coverage to skip this filter entirely.
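
Label mode amounts to thresholding an existing column. This is a minimal sketch under stated assumptions: it uses the standard `csv` module instead of pandas to stay dependency-free, the threshold `min_rate` and its scale are chosen by the caller because this guide does not state the skill's default, and rows with a missing or unparsable label are dropped. The function name `filter_by_coverage` is hypothetical.

```python
import csv


def filter_by_coverage(in_csv, out_csv, min_rate, column="condition_cover_rate"):
    """Keep rows whose coverage label is >= min_rate; return (kept, dropped)."""
    kept = dropped = 0
    with open(in_csv, newline="", encoding="utf-8") as src, \
         open(out_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(
            dst, fieldnames=reader.fieldnames or [], extrasaction="ignore"
        )
        writer.writeheader()
        for row in reader:
            try:
                rate = float(row[column])
            except (KeyError, TypeError, ValueError):
                rate = None  # missing or unparsable label (assumed: drop)
            if rate is not None and rate >= min_rate:
                writer.writerow(row)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

The returned counts make it easy to confirm how many samples a given threshold would remove before committing to it.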

Example Session (CodeBuddy)

User: "I have a CSV file at data/all_train.csv with unit test training
       data. Please clean it using CleanTest to remove noisy samples."

Assistant: [Invokes cleantest-pipeline skill]
"I will run the CleanTest-Agent pipeline on your dataset.
Loaded 593,953 unique samples (after deduplication on the `target` column).
Running Filter 1 (Syntax Noise)... removed 290,685 samples.
Running Filter 2 (Relevance)... removed 29,885 samples.
Filter 3 (Coverage) was skipped (no model provided).
Pipeline complete! 273,383 clean samples saved to output/filtered_data.csv.
See output/summary.md for the full noise report."

User: "What types of noise were found?"

Assistant: [Reads output/noise_report.json]
"Here is the breakdown:
- Unnecessary Annotations: 259,427 (43.68%)
- No Relevance: 29,885 (5.03%)
- Syntax Errors: 16,372 (2.76%)
- Ambiguous Data Type: 10,335 (1.74%)
- Empty Exception: 2,259 (0.38%)
- Non-English: 1,479 (0.25%)
- Missing Implementation: 813 (0.14%)"
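
The figures in this session are internally consistent, which can be checked with a few lines of arithmetic. The snippet below only restates numbers quoted above; the percentages are taken relative to the 593,953 deduplicated samples.

```python
total = 593_953            # unique samples after deduplication
removed_syntax = 290_685   # Filter 1
removed_relevance = 29_885 # Filter 2

syntax_breakdown = {
    "Unnecessary Annotations": 259_427,
    "Syntax Errors": 16_372,
    "Ambiguous Data Type": 10_335,
    "Empty Exception": 2_259,
    "Non-English": 1_479,
    "Missing Implementation": 813,
}

# Samples left once both filters have run
remaining = total - removed_syntax - removed_relevance

# The six syntax-noise counts add up to the Filter 1 removals
syntax_sum = sum(syntax_breakdown.values())

# Share of each noise type, as a percentage of all samples
share = {name: round(100 * count / total, 2) for name, count in syntax_breakdown.items()}
relevance_share = round(100 * removed_relevance / total, 2)
```

Here `remaining` matches the 273,383 clean samples reported, and `syntax_sum` matches the 290,685 samples removed by Filter 1.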
