bbgdomains

Identifying cancer driver mutations is essential for understanding tumorigenesis and enabling precision oncology. Current driver discovery methods, like those integrated in intogen-pipeline, rely on accurate residue-level annotations to assess the functional impact of somatic mutations. However, existing domain catalogs such as Pfam are incomplete and not entirely compatible with reference transcript sets like MANE Select.

In this project, we developed bbgdomains, a comprehensive, transcript-aware catalog of protein regions based on InterPro annotations. While annotating protein regions comprehensively, our approach also aims to provide an easy-to-interpret catalogue by selecting the most suitable InterPro member databases and merging overlapping entries that reach a minimum mutual overlap, thus generating consensus regions with traceable provenance. The resulting catalog resolves redundancy, enhances annotation completeness, and ensures compatibility with user-defined transcript references.

Content

This repo provides:

code: generates a catalogue of residue-level annotations of protein domains and regions, using the data compiled and curated by InterPro as a starting point;
dataset: the catalogue of protein regions generated with this code, along with metadata for interpretation.

Resource

We provide two downloadable resource tables:

Catalogue: data/release_2025/domain_catalogue.tsv
Descriptions: data/release_2025/long_short_domain_descriptions.tsv

Resource configuration

The resource was the result of making a few critical decisions:

Raw catalogue compiled and curated by InterPro
Transcripts: MANE Select from MANE.GRCh38.v1.4
Both reviewed and unreviewed Uniprot entries are allowed
Merge strategy:
- merge region pairs with at least 70% overlap, i.e. overlap represents at least 70% of both overlapping regions
- resolved via intersection of overlapping regions
Allowed InterPro types: Active site, Binding site, Conserved site, Domain, Family, Repeat
Allowed Match types: Conserved site, Disordered, Domain, Family, Repeat
Excluded member databases: PANTHER, PIRSF, PRINTS
- PANTHER and PIRSF are intended for classifying entire proteins based on their evolutionary origin and function. Their signatures tend to cover full-length protein sequences, which goes against the intended use for fine-grained discovery of signals of positive selection.
- PIRSF and PRINTS are legacy databases that have not been updated in a while
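
The type and member-database criteria above can be read as a simple row filter. The following is a minimal sketch, not the code in 04_row_selection.ipynb: the field names `member_db` and `interpro_type` and the helper `keep_row` are hypothetical, and the match-type criterion is left out because how it combines with the other two is not specified here.

```python
# Criteria copied from the configuration list above.
ALLOWED_INTERPRO_TYPES = {
    "Active site", "Binding site", "Conserved site",
    "Domain", "Family", "Repeat",
}
EXCLUDED_MEMBER_DBS = {"PANTHER", "PIRSF", "PRINTS"}


def keep_row(row):
    """Return True if a match row passes both filters (hypothetical field names)."""
    return (
        row["member_db"].upper() not in EXCLUDED_MEMBER_DBS
        and row["interpro_type"] in ALLOWED_INTERPRO_TYPES
    )


# Toy rows for illustration only.
rows = [
    {"member_db": "Pfam", "interpro_type": "Domain"},
    {"member_db": "PANTHER", "interpro_type": "Family"},
    {"member_db": "SMART", "interpro_type": "Homologous superfamily"},
]
selected = [row for row in rows if keep_row(row)]
```

Only the first toy row survives: the second comes from an excluded database and the third has an InterPro type that is not in the allowed list.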

Slides

Including the following:

Summary of criteria used for region selection
Merge algorithm pseudo-code
Descriptive analyses and sanity checks

bbgdomains google-drive presentation

Conda environment

Create a conda environment with the required Python dependencies: $ conda env create -f environment.yml

Activate the environment: $ conda activate bbgdomains-env

Download source data

Download the entire InterPro collection of functional annotation matches from: https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/match_complete.xml.gz

Sorted scripts/notebooks to execute

01_extract_human_matches.py

Requires the following download:

https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/match_complete.xml.gz

Example command line:

python 01_extract_human_matches.py /workspace/nobackup/scratch/fcalvet/interpro/2025-05-25.match_complete.xml.gz > output.xml
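
As an illustration of how a dump of this size can be filtered without loading it into memory, here is a minimal streaming sketch. It is not the code of 01_extract_human_matches.py. It assumes that each `<protein>` element is a direct child of the root and carries a `name` attribute ending in `_HUMAN` for human entries; both points should be checked against the actual file. The function names are hypothetical.

```python
import gzip
import xml.etree.ElementTree as ET


def iter_human_proteins(stream):
    """Yield <protein> elements whose name attribute ends in '_HUMAN'."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "protein":
            if elem.get("name", "").endswith("_HUMAN"):  # assumed naming scheme
                yield elem
            root.clear()  # detach processed proteins so memory use stays flat


def human_protein_ids(gz_source):
    """Collect the ids of human proteins from a gzip-compressed XML source."""
    with gzip.open(gz_source, "rb") as handle:
        return [protein.get("id") for protein in iter_human_proteins(handle)]
```

`gz_source` can be a path to the compressed file or an open binary file object. Each yielded element should be used before requesting the next one.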
02_parse_xml2table.py

Example command line:

python 02_parse_xml2table.py --input_file output.xml --output_file output_table.tsv
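
The XML-to-table step can be pictured as flattening each protein/match/location triple into one row. The sketch below is an assumption-laden illustration, not the code of 02_parse_xml2table.py: the column names are hypothetical, the element and attribute names (`protein`, `match`, `ipr`, `lcn`, `dbname`, `start`, `end`) are assumed to follow the InterPro match XML layout, and all identifiers in the example are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical output columns.
COLUMNS = ["protein_id", "member_db", "signature_id",
           "interpro_id", "interpro_type", "start", "end"]


def xml_to_rows(xml_text):
    """Yield one dict per <lcn> location found under protein/match."""
    root = ET.fromstring(xml_text)
    for protein in root.findall("protein"):
        for match in protein.findall("match"):
            ipr = match.find("ipr")  # absent for signatures without an InterPro entry
            for lcn in match.findall("lcn"):
                yield {
                    "protein_id": protein.get("id"),
                    "member_db": match.get("dbname"),
                    "signature_id": match.get("id"),
                    "interpro_id": ipr.get("id") if ipr is not None else "",
                    "interpro_type": ipr.get("type") if ipr is not None else "",
                    "start": lcn.get("start"),
                    "end": lcn.get("end"),
                }


def write_tsv(rows, handle):
    """Write rows as a tab-separated table with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Toy input with made-up identifiers.
example = """
<interpromatch>
  <protein id="P_EXAMPLE" name="EXAMPLE_HUMAN" length="300">
    <match id="SIG_A" name="toy signature" dbname="PFAM">
      <ipr id="IPR_EXAMPLE" name="toy entry" type="Domain"/>
      <lcn start="10" end="120"/>
    </match>
    <match id="SIG_B" name="toy signature without entry" dbname="CDD">
      <lcn start="150" end="200"/>
      <lcn start="210" end="260"/>
    </match>
  </protein>
</interpromatch>
"""
buffer = io.StringIO()
write_tsv(xml_to_rows(example), buffer)
```

For a full-size extract, the same row logic would need a streaming parser rather than `ET.fromstring`, which reads the whole document at once.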
03_connections_from_InterPro_to_transcripts.ipynb
04_row_selection.ipynb
05_selection_of_transcripts.ipynb
06_merging_algorithm.ipynb
07_format_domain_catalogue.ipynb
08_testing.ipynb

Additional downstream steps for running smRegions, a driver discovery method based on the statistical analysis of mutation enrichment in protein domains. See the smRegions documentation and the publication where it was first described.

Preparation_of_smregions_table.ipynb
input4smregions_creation.sh

Datasets to check:

/data/bbg/nobackup2/scratch/fcalvet/test_smregions/build_regions/version3
code_smregions.py
Analysis_of_smregion_results.ipynb

Additional script for comparison between Pfam IntOgen dataset and bbgdomains

Comparison_between_Pfam_IntOgen_and_BBGdomains.ipynb

Extended Summary

One of the central goals of cancer genomics is to identify the genomic events that lead to tumorigenesis and to understand the mechanisms underlying their tumorigenic effects. The characterization of these events, so-called drivers, is essential for the development of precision oncology.

To date, the community has proposed several methods to identify genes harbouring genomic driver events, so-called driver discovery methods. These methods take catalogs of somatic mutations sequenced in cohorts of tumors and conduct statistical analyses of signals of positive selection, whereby the pattern of mutations observed in any given gene is compared with the pattern expected under neutral evolution. In other words, these methods test whether the observed mutation patterns deviate from those expected in the absence of clonal selection.

Intogen (intogen.org) is a computational framework that identifies cancer driver genes from somatic point mutations by combining the readouts of several driver discovery methods. Among the methods included in Intogen, smRegions tests the enrichment of protein-coding mutations perturbing protein domains, which can be broadly defined as protein subunits with distinct structural and evolutionary properties, generally implicated in a specific protein function. Thus, genes with a significant enrichment of mutations mapping to their protein domains are deemed candidate driver genes.

Therefore, a crucial aspect for smRegions, and for driver mutation identification at large, is the correct annotation of protein regions with critical functional significance. Through the structural and evolutionary analysis of protein sequences, the community has proposed several resources to systematically annotate protein domains, such as Pfam, which relies on profile Hidden Markov Models (profile HMMs) to encode and subsequently identify protein domains across the human proteome. A profile HMM can be regarded as a statistical representation of the defining motifs of a domain and of the admissible variability in its residue sequence, so that new instances of the domain can be recognized.

To date, the smRegions implementation in Intogen (as of version v2024) relies on the Pfam catalogue. Although a sensible choice, this catalogue is not comprehensive, as other domain catalogs have produced domain annotations not covered by Pfam. Another caveat of the current Pfam catalogue in use in Intogen v2024 is that its annotations are not necessarily compatible with the transcripts of interest, currently the MANE Select catalogue.

In this project we aimed to create a new catalogue of protein domains that is comprehensive (i.e. aware of as many sources as possible), resolves the redundancies and inconsistencies across catalogs, provides domain coordinates that are compatible with any given reference transcript, and is easy to maintain as the source domain catalogs publish new releases.

To address these goals, we decided to use InterPro as our functional region annotation resource. InterPro is a database developed and maintained by the European Bioinformatics Institute (EMBL-EBI) that includes protein sequence annotations from multiple sources, providing a broad and rich coverage of protein features, including protein families, domains and functional sites. It incorporates several member databases, including CATH-Gene3D, CDD, Pfam, PROSITE and SMART.

An important feature of our domain annotation pipeline, which we refer to as bbgdomains, is that the region annotations are compatible with the transcript that the user provides for each gene. Given the reference transcript, the pipeline retrieves a compatible Uniprot identifier and then retrieves the InterPro annotations matching this identifier.
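
The two-step retrieval described above amounts to chaining two lookups. The sketch below only illustrates that chaining: the function name, the dictionary-based lookup tables and all identifiers are hypothetical, and the way the pipeline decides that a Uniprot entry is compatible with a transcript is not reproduced here.

```python
def annotations_for_transcript(transcript_id, transcript_to_uniprot, uniprot_to_regions):
    """Return (uniprot_id, regions) for a transcript, or (None, []) if it is unmapped."""
    uniprot_id = transcript_to_uniprot.get(transcript_id)
    if uniprot_id is None:
        return None, []
    return uniprot_id, uniprot_to_regions.get(uniprot_id, [])


# Toy lookup tables with made-up identifiers; regions are (signature, start, end).
transcript_to_uniprot = {"TRANSCRIPT_EXAMPLE": "UNIPROT_EXAMPLE"}
uniprot_to_regions = {"UNIPROT_EXAMPLE": [("SIGNATURE_EXAMPLE", 10, 120)]}

mapped = annotations_for_transcript(
    "TRANSCRIPT_EXAMPLE", transcript_to_uniprot, uniprot_to_regions
)
unmapped = annotations_for_transcript(
    "TRANSCRIPT_MISSING", transcript_to_uniprot, uniprot_to_regions
)
```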

Although InterPro is a rich resource for protein domains, the resulting protein domain catalogue contains overlapping regions, which the current curation efforts of the InterPro maintainers have not yet fully resolved into a collection of unique entities. InterPro does select what it deems a "representative" annotation to render a non-redundant view, but the designated representatives are not always useful in practice, or they miss domain signature types that are relevant to somatic evolution and cancer driver discovery. For example, TP53, a well-known cancer driver gene, lacks "representative" domains. Consequently, any systematic filter based on InterPro's definition of representative is bound to be flawed.

To resolve these issues, we implemented a domain merging algorithm that 1) identifies overlapping domains whose overlap is high enough to regard them as the same biological entity and merges them by means of intersection; and 2) produces a unique identifier from which the raw components of each merged domain can be traced back to their source. The result is a consensus catalogue that remains compatible with the InterPro structure but is better suited for downstream applications, in particular procedures entailing multiple testing such as smRegions.

The consensus catalogue simplifies interpretation while keeping track of the source domains before merging, which makes biological insights clearer. When used with the driver discovery method smRegions, it provides complete, state-of-the-art protein domain coverage while reducing the multiple testing bias and the computational burden caused by redundancies between sources, thereby providing a more comprehensive foundation for identifying cancer driver events and enhanced resolution in somatic clonal evolution analyses.

Technical definitions

06_merging_algorithm.ipynb

To build an unambiguous catalogue, the fundamental question to answer is "which domains represent the same biological/functional entity?". If the regions retrieved from InterPro are disjoint, they appear as disjoint entries in our catalogue. For overlapping regions, however, we need a criterion to decide whether they are different instances of the same underlying functional element.

To address this, we implemented a merging algorithm intended to resolve overlapping functional elements. In essence, the algorithm iteratively intersects pairs of regions with high mutual overlap, using a threshold that the user can define (defaults to 70%).

For example, if two regions with respective lengths $L_1$ and $L_2$ have an intersection of length $L$ such that both $L/L_1$ and $L/L_2$ are at least a given threshold (by default 0.7), then both regions are removed from the catalogue and replaced by their intersection. This step is repeated, giving precedence to large assemblies of mutually intersecting regions, until no pair of regions satisfies the high mutual overlap condition.
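
The mutual overlap condition and the replace-by-intersection step can be sketched as follows. This is a simplified illustration, not the code in 06_merging_algorithm.ipynb. It merges one pair at a time, choosing the pair with the highest mutual overlap, and does not implement the precedence given to large assemblies of mutually intersecting regions. Coordinates are assumed to be inclusive, the threshold is applied as "at least", and the `Region` type, function names and source labels are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Region(NamedTuple):
    start: int          # first residue, inclusive (assumption)
    end: int            # last residue, inclusive (assumption)
    sources: frozenset  # labels of the raw regions merged into this one


def length(region):
    return region.end - region.start + 1


def mutual_overlap(a, b):
    """Return min(L/L1, L/L2), where L is the length of the intersection."""
    inter = min(a.end, b.end) - max(a.start, b.start) + 1
    if inter <= 0:
        return 0.0
    return min(inter / length(a), inter / length(b))


def merge_regions(regions, threshold=0.7):
    """Replace high-overlap pairs by their intersection until none remain."""
    regions = list(regions)
    while True:
        candidates = [
            (mutual_overlap(a, b), i, j)
            for (i, a), (j, b) in combinations(enumerate(regions), 2)
        ]
        candidates = [c for c in candidates if c[0] >= threshold]
        if not candidates:
            return sorted(regions, key=lambda r: (r.start, r.end))
        _, i, j = max(candidates)  # best-overlapping pair first
        a, b = regions[i], regions[j]
        merged = Region(max(a.start, b.start), min(a.end, b.end),
                        a.sources | b.sources)  # keep provenance of both
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)


# Toy regions on one protein, with made-up source labels.
raw = [
    Region(10, 100, frozenset({"sigA"})),
    Region(15, 105, frozenset({"sigB"})),
    Region(200, 250, frozenset({"sigC"})),
]
consensus = merge_regions(raw)
```

In this toy input the first two regions share 86 residues out of 91 each, so they are replaced by their intersection (15 to 100) carrying both source labels, while the third region is kept unchanged. Each iteration reduces the number of regions by one, so the loop always terminates.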
