This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
baitUtils is a Python CLI toolkit for analyzing and visualizing bait sequences used in in-solution hybridization. The package provides comprehensive statistical analysis, plotting, mapping, and coverage evaluation capabilities for bait sequence quality assessment.
The project follows a modular, well-factored architecture with clear separation of concerns:
__main__.py: CLI dispatcher using argparse subcommands
- stats → sequence_statistics.py + sequence_analysis.py - Sequence statistics calculation
- plot → statistical_plots.py + plotting_utils.py - Statistical visualization generation
- map → sequence_mapping.py + mapping_utils.py - Sequence mapping against references
- check → coverage_evaluation.py + coverage_checking.py - Coverage evaluation and gap analysis
- fill → gap_filling.py + gap_filling_algorithm.py + coverage_analysis.py - Multi-pass gap filling
- evaluate → evaluate.py + supporting modules - Comprehensive oligo set evaluation
- compare → compare.py + comparative analysis modules - Multi-set comparison

- sequence_analysis.py: Core sequence analysis functions (GC content, Tm, MFE, entropy, complexity)
- coverage_analysis.py: Coverage calculation utilities, PSL parsing, BED operations
- plotting_utils.py: Comprehensive plotting utilities with multiple chart types
- mapping_utils.py: Sequence mapping utilities with pblat integration
- coverage_checking.py: Coverage checking and uncovered region analysis
- gap_filling_algorithm.py: Multi-pass greedy selection algorithms

- quality_scorer.py: Quantitative quality assessment system
- benchmark.py: Performance benchmarking against theoretical optimal
- reference_analyzer.py: Reference sequence complexity analysis
- interactive_plots.py: Interactive Plotly visualizations
- report_generator.py: HTML report generation

- comparative_analyzer.py: Multi-set comparison framework
- differential_analysis.py: Statistical testing for set comparisons
- comparative_visualizations.py: Comparative plotting suite
- comparative_report_generator.py: Comparative HTML reports
# Create conda environment with dependencies
conda create -n baitutils_env python=3.12 numpy pandas matplotlib-base seaborn scikit-learn biopython
conda activate baitutils_env
# Install optional external tools
conda install -c bioconda pblat bedtools
# Note: ViennaRNA installation is optional for MFE calculations
conda install -c bioconda viennarna
# Install package in development mode
pip install -e .

# Run all tests
python -m unittest discover tests/
# Run specific test files
python -m unittest tests.test_evaluate
python -m unittest tests.test_comparative_analysis
# Run tests with verbose output
python -m unittest discover tests/ -v
# Run tests with coverage (if coverage.py installed)
coverage run -m unittest discover tests/
coverage report

# Run via installed command (after pip install -e .)
baitUtils --help
baitUtils stats --help
baitUtils evaluate --help
# Run via Python module
python -m baitUtils --help
python -m baitUtils stats -i sequences.fasta -o results/

# Run linting (if installed)
ruff check baitUtils/
flake8 baitUtils/
# Format code (if installed)
black baitUtils/

Each command follows a consistent pattern:
- add_arguments(parser) - defines CLI arguments for the subcommand
- main(args) - entry point that receives parsed arguments
- Processor classes handle the main logic (e.g., SequenceStatsProcessor)
- Utility classes handle specific functionality
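The pattern above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `EchoProcessor`, `build_parser`, `dispatch`, and the long option names `--input` and `--outdir` are hypothetical names chosen for the example.

```python
import argparse


class EchoProcessor:
    """Hypothetical stand-in for a processor class such as SequenceStatsProcessor."""

    def __init__(self, input_path, output_dir):
        self.input_path = input_path
        self.output_dir = output_dir

    def run(self):
        # A real processor would do the analysis; this one only reports its inputs.
        return f"{self.input_path} -> {self.output_dir}"


def add_arguments(parser):
    """Define CLI arguments for the subcommand (option names are assumptions)."""
    parser.add_argument("-i", "--input", required=True, help="Input FASTA file")
    parser.add_argument("-o", "--outdir", required=True, help="Output directory")


def main(args):
    """Entry point that receives the parsed arguments."""
    return EchoProcessor(args.input, args.outdir).run()


def build_parser():
    """Dispatcher side: register a subcommand and remember its main()."""
    parser = argparse.ArgumentParser(prog="baitUtils")
    subparsers = parser.add_subparsers(dest="command", required=True)
    stats = subparsers.add_parser("stats", help="Sequence statistics")
    add_arguments(stats)
    stats.set_defaults(func=main)
    return parser


def dispatch(argv):
    """Parse argv and call the main() of whichever subcommand was selected."""
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Keeping `add_arguments` and `main` as plain module-level functions means the dispatcher only needs to know the module, and each command can be tested by passing an `argparse.Namespace` directly to `main`.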
- Scientific Python stack: numpy, pandas, matplotlib, seaborn, scikit-learn
- Bioinformatics: biopython for sequence handling
- Interactive plotting: plotly for enhanced visualizations
- Optional external tools:
  - ViennaRNA (for RNA folding/MFE calculations)
  - pblat (for sequence mapping)
  - bedtools (for genomic interval operations)
- Primary input: FASTA/FASTA.GZ files containing sequences
- Output formats: TSV statistics, various plot formats (PNG/PDF/SVG), filtered FASTA, HTML reports
- Directory structure: Results organized in structured output directories
Core sequence analysis functionality:
- GC content calculation using BioPython
- Melting temperature via BioPython's MeltingTemp module
- Minimum Free Energy (MFE) using ViennaRNA (optional)
- Shannon entropy and sequence complexity metrics
- Homopolymer run analysis
- Masked base counting
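A few of the metrics listed above can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, it uses the standard library rather than BioPython, and it assumes that "masked" means soft-masked (lowercase) bases plus `N`.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per base) of the base composition."""
    seq = seq.upper()
    if not seq:
        return 0.0
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base (case-insensitive)."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        prev = base
        best = max(best, run)
    return best


def masked_bases(seq):
    """Count lowercase (soft-masked) bases plus uppercase N (assumed definition)."""
    return sum(1 for base in seq if base.islower() or base == "N")
```

For example, a sequence with all four bases in equal proportion reaches the maximum entropy of 2 bits per base, while a pure homopolymer has entropy 0.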
Main statistics calculation workflow:
- SequenceStatsCalculator: Orchestrates analysis with parallel processing support
- SequenceFilter: Configurable filtering based on multiple criteria
- Supports both sequential and parallel processing modes
- Handles compressed FASTA files
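Handling of plain and gzip-compressed FASTA input might look like the sketch below. It uses only the standard library and hypothetical function names (`open_fasta`, `read_fasta`); the package itself relies on BioPython for sequence handling, so treat this as an illustration of the idea rather than the actual reader.

```python
import gzip


def open_fasta(path):
    """Open a FASTA file as text, transparently handling .gz compression."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (header, sequence) tuples one record at a time."""
    header, chunks = None, []
    with open_fasta(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because `read_fasta` is a generator, records are produced one at a time and the whole file never has to sit in memory.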
Comprehensive visualization system:
- Multiple plot types: histograms, boxplots, scatterplots, violin plots, PCA
- Color coding support for categorical data
- Batch generation of pairwise combination plots
- Statistical plotting with seaborn integration
Sequence mapping workflow:
- pblat integration with configurable parameters
- PSL file parsing with filtering capabilities
- Results categorization into mapped/unmapped sequences
- FASTA output generation for different categories
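The PSL parsing and mapped/unmapped categorization steps can be sketched as below. The column positions follow the standard 21-column PSL layout; the function names, the default threshold, and the simplified identity formula (matches over matches plus mismatches, ignoring repeat matches and gaps) are assumptions made for illustration.

```python
def parse_psl_line(line):
    """Parse one PSL alignment line into a dict of the fields used here."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 21:
        raise ValueError("PSL alignment lines have 21 tab-separated columns")
    return {
        "matches": int(f[0]),
        "mismatches": int(f[1]),
        "strand": f[8],
        "q_name": f[9],
        "q_size": int(f[10]),
        "t_name": f[13],
        "t_start": int(f[15]),
        "t_end": int(f[16]),
    }


def identity(hit):
    """Percent identity over aligned bases (simplified, assumed formula)."""
    aligned = hit["matches"] + hit["mismatches"]
    return 100.0 * hit["matches"] / aligned if aligned else 0.0


def split_mapped(psl_lines, query_names, min_identity=90.0):
    """Return (mapped, unmapped) sets of query names."""
    mapped = set()
    for line in psl_lines:
        if not line.strip() or not line[0].isdigit():
            continue  # skip blank lines and the optional PSL header
        hit = parse_psl_line(line)
        if identity(hit) >= min_identity:
            mapped.add(hit["q_name"])
    return mapped, set(query_names) - mapped
```

The two returned sets can then drive the per-category FASTA output: any query with no hit above the threshold ends up in the unmapped set.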
Coverage evaluation system:
- PSL to BED conversion with filtering
- Coverage calculation using bedtools integration
- Uncovered region identification and analysis
- FASTA export of uncovered regions with N-splitting
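Uncovered-region identification and N-splitting can be illustrated in pure Python. The package delegates coverage calculation to bedtools, so this sketch only shows the underlying interval logic; the function names are hypothetical, and intervals are assumed to be half-open, 0-based (start, end) pairs as in BED.

```python
import re


def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def uncovered_regions(covered, length):
    """Return the gaps in [0, length) that no covered interval touches."""
    gaps, cursor = [], 0
    for start, end in merge_intervals(covered):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < length:
        gaps.append((cursor, length))
    return gaps


def split_on_n(sequence, offset=0, min_length=1):
    """Split a region at runs of N, yielding (start, end, subsequence)."""
    for match in re.finditer(r"[^Nn]+", sequence):
        if match.end() - match.start() >= min_length:
            yield offset + match.start(), offset + match.end(), match.group()
```

Splitting on N keeps stretches of unknown sequence out of the exported FASTA, and the `offset` argument preserves coordinates relative to the original reference.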
Multi-pass optimization system:
- Greedy selection algorithm with multiple scoring criteria
- Coverage pattern analysis for difficult regions
- Spacing constraints for oligo selection
- Multi-pass iteration with convergence detection
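A greedy, multi-pass selection of this kind might be sketched as follows. This is not the algorithm in gap_filling_algorithm.py: it scores candidates by a single criterion (newly covered positions), and the relaxation schedule that halves the spacing constraint on each pass is an assumption made so that later passes have something to do. All names are hypothetical.

```python
def covered_gain(interval, remaining):
    """Number of still-uncovered positions a candidate would cover."""
    start, end = interval
    return sum(1 for pos in range(start, end) if pos in remaining)


def greedy_pass(candidates, selected, remaining, min_spacing):
    """One greedy sweep: repeatedly take the candidate with the largest gain."""
    added = 0
    pool = {n: iv for n, iv in candidates.items() if n not in selected}
    while pool and remaining:
        # sorted() makes tie-breaking deterministic (alphabetical by name).
        best = max(sorted(pool), key=lambda n: covered_gain(pool[n], remaining))
        start, end = pool.pop(best)
        if covered_gain((start, end), remaining) == 0:
            break  # no candidate left in the pool adds coverage
        if any(abs(start - candidates[s][0]) < min_spacing for s in selected):
            continue  # violates the spacing constraint; try the next best
        selected.append(best)
        remaining.difference_update(range(start, end))
        added += 1
    return added


def multi_pass_fill(candidates, uncovered, min_spacing=0, max_passes=5):
    """Run greedy passes until nothing is gained or nothing is left to cover."""
    selected, remaining = [], set(uncovered)
    for pass_index in range(max_passes):
        spacing = min_spacing // (2 ** pass_index)  # assumed relaxation schedule
        added = greedy_pass(candidates, selected, remaining, spacing)
        if added == 0 or not remaining:
            break  # converged
    return selected, remaining
```

Here `candidates` maps an oligo name to its half-open target interval and `uncovered` is a set of positions; the function returns the chosen names in selection order together with whatever positions are still uncovered.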
The codebase uses unittest with comprehensive mocking for external dependencies:
- Core algorithms: Sequence analysis functions, statistics calculations
- File I/O operations: FASTA reading, results writing with mocking
- Command workflows: End-to-end command testing
- Edge cases: Error handling, malformed inputs, empty datasets
- Integration tests: Multi-module workflows
- File operations mocked to avoid filesystem dependencies
- External tool calls (pblat, bedtools) mocked in tests
- Optional dependencies handled gracefully with availability checks
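Mocking an external tool call might look like the sketch below. `run_pblat` is a hypothetical wrapper written for this example, not the function in mapping_utils.py, and the exact command line it builds is an assumption; the point is that `subprocess.run` is patched so the test never needs pblat installed.

```python
import subprocess
import unittest
from unittest import mock


def run_pblat(reference, queries, output, threads=1):
    """Hypothetical wrapper that shells out to pblat and returns the command."""
    cmd = ["pblat", f"-threads={threads}", reference, queries, output]
    subprocess.run(cmd, check=True)
    return cmd


class RunPblatTest(unittest.TestCase):
    @mock.patch("subprocess.run")
    def test_invokes_pblat_without_running_it(self, mock_run):
        cmd = run_pblat("ref.fa", "baits.fa", "out.psl", threads=4)
        # The mock records the call instead of executing the external tool.
        mock_run.assert_called_once_with(cmd, check=True)
        self.assertEqual(cmd[0], "pblat")
        self.assertIn("-threads=4", cmd)
```

The same approach applies to bedtools calls: assert on the arguments passed to the mocked `subprocess.run` rather than on the tool's output.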
The evaluate command provides comprehensive oligo set coverage analysis:
# Basic usage
baitUtils evaluate -i oligos.fasta -r reference.fasta -o coverage_report/
# Advanced analysis with custom parameters
baitUtils evaluate -i oligos.fasta -r reference.fasta -o results/ \
--min-identity 95 --target-coverage 10 --threads 4 \
--plot-format pdf --enable-html-report

- evaluate.py: Main orchestrator integrating all analysis components
- coverage_stats.py: Coverage statistics computation
- quality_scorer.py: Quality assessment with A-F grading
- reference_analyzer.py: Reference sequence complexity analysis
- gap_analysis.py: Gap characterization with sequence correlation
# Compare multiple oligo sets
baitUtils compare -r reference.fasta -o comparison_report/ \
--sets "Design1:oligos1.fasta" "Design2:oligos2.fasta" "Design3:oligos3.fasta"
# With statistical analysis
baitUtils compare -r reference.fasta -o results/ \
--sets "Set1:design1.fasta" "Set2:design2.fasta" \
--enable-statistical-analysis --significance-level 0.01

- Multi-set analysis: Simultaneous evaluation of multiple designs
- Statistical testing: Rigorous comparison with significance testing
- Performance ranking: Composite scoring and recommendation system
- Interactive reporting: Comprehensive HTML reports with embedded analysis
- Optional dependencies handled with try/except imports
- Fallback methods when external tools unavailable
- Clear error messages for missing dependencies
- Continuation with reduced functionality when possible
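The optional-dependency handling described above can be sketched as below. The helper names (`HAS_VIENNARNA`, `tool_available`, `require_tool`, `calculate_mfe`) are hypothetical; `RNA` is the module name of the ViennaRNA Python bindings, and returning `None` when it is missing is one possible way to continue with reduced functionality.

```python
import shutil

try:
    import RNA  # ViennaRNA Python bindings (optional)
    HAS_VIENNARNA = True
except ImportError:
    RNA = None
    HAS_VIENNARNA = False


def tool_available(name):
    """Return True if an external executable is found on PATH."""
    return shutil.which(name) is not None


def require_tool(name):
    """Fail early with a clear message when a mandatory tool is missing."""
    if not tool_available(name):
        raise RuntimeError(f"Required external tool '{name}' was not found on PATH")


def calculate_mfe(sequence):
    """Return the MFE in kcal/mol, or None when ViennaRNA is unavailable."""
    if not HAS_VIENNARNA:
        return None  # caller continues without the MFE column
    _structure, mfe = RNA.fold(sequence)
    return mfe
```

Checking availability once at import time keeps the per-sequence code path free of repeated import attempts, and `require_tool` gives commands such as mapping a single place to report a missing pblat or bedtools.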
- FASTA file format validation
- Parameter range checking
- File existence verification
- Memory usage considerations for large datasets
- Multi-threaded sequence analysis where beneficial
- Configurable process counts for compute-intensive operations
- Memory-efficient streaming for large files
- Intermediate file caching for complex workflows
- Optimized data structures for coverage calculations
- Efficient algorithms for gap filling optimization
The complete baitUtils workflow supports iterative oligo design:
- Initial Analysis (evaluate): Assess current oligo set quality
- Gap Identification (check): Identify coverage gaps and problematic regions
- Gap Filling (fill): Generate improved oligo selections
- Comparison (compare): Evaluate multiple design iterations
- Visualization (plot): Generate publication-quality figures
- Statistics (stats): Detailed sequence property analysis
This modular architecture enables flexible workflows adapted to specific research needs while maintaining code quality and testability.