Identifying cancer driver mutations is essential for understanding tumorigenesis and enabling precision oncology. Current driver discovery methods, like those integrated in intogen-pipeline, rely on accurate residue-level annotations to assess the functional impact of somatic mutations. However, existing domain catalogs such as Pfam are incomplete and not entirely compatible with reference transcript sets like MANE Select.
In this project, we developed bbgdomains, a comprehensive, transcript-aware catalog of protein regions based on InterPro annotations. Beyond comprehensive annotation of protein regions, our approach aims to provide an easy-to-interpret catalogue by selecting the most suitable InterPro member databases and merging overlapping entries that share a minimum mutual overlap, thus generating consensus regions with traceable provenance. The resulting catalog resolves redundancy, enhances annotation completeness, and ensures compatibility with user-defined transcript references.
This repo provides:
- code: generates a catalogue of residue-level annotations of protein domains and regions, using the data compiled and curated by InterPro as a starting point;
- dataset: the catalogue of protein regions generated with this code, along with metadata for interpretation.
We provide two downloadable resource tables:
- Catalogue: data/release_2025/domain_catalogue.tsv
- Descriptions: data/release_2025/long_short_domain_descriptions.tsv
The resource is the result of a few critical decisions:
- Raw catalogue: compiled and curated by InterPro
- Transcripts: MANE Select from MANE.GRCh38.v1.4
- Both reviewed and unreviewed UniProt entries are allowed
- Merge strategy:
  - merge region pairs with at least 70% mutual overlap, i.e. the overlap represents at least 70% of each of the two overlapping regions
  - merged pairs are resolved by taking the intersection of the overlapping regions
- Allowed InterPro types: Active site, Binding site, Conserved site, Domain, Family, Repeat
- Allowed match types: Conserved site, Disordered, Domain, Family, Repeat
- Excluded member databases: PANTHER, PIRSF, PRINTS
  - PANTHER and PIRSF are intended for classifying entire proteins based on their evolutionary origin and function. Their signatures tend to cover full-length protein sequences, which goes against the intended use for fine-grained discovery of signals of positive selection.
  - PIRSF and PRINTS are legacy databases that have not been updated in a while
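The selection criteria listed above can be sketched as a simple row filter. This is only an illustration, not the code in 04_row_selection.ipynb: the function name `keep_row`, the argument names, and the way the two type filters are combined (in particular, keeping signatures that are not integrated into any InterPro entry) are assumptions.

```python
from typing import Optional

# Values taken from the criteria listed in this README.
ALLOWED_INTERPRO_TYPES = {
    "Active site", "Binding site", "Conserved site", "Domain", "Family", "Repeat",
}
ALLOWED_MATCH_TYPES = {"Conserved site", "Disordered", "Domain", "Family", "Repeat"}
EXCLUDED_DATABASES = {"PANTHER", "PIRSF", "PRINTS"}


def keep_row(database: str, match_type: str, interpro_type: Optional[str]) -> bool:
    """Return True if an annotation row passes all selection criteria."""
    if database.upper() in EXCLUDED_DATABASES:
        return False
    if match_type not in ALLOWED_MATCH_TYPES:
        return False
    # Assumption: a signature without an InterPro entry (interpro_type is None)
    # is kept; the actual notebook may treat this case differently.
    return interpro_type is None or interpro_type in ALLOWED_INTERPRO_TYPES


# Hypothetical rows: (member database, match type, InterPro entry type).
rows = [
    ("Pfam", "Domain", "Domain"),
    ("PANTHER", "Family", "Family"),                # excluded member database
    ("SMART", "Domain", None),                      # not integrated into an entry
    ("CDD", "Domain", "Homologous superfamily"),    # InterPro type not allowed
]
kept = [row for row in rows if keep_row(*row)]
```

With these made-up rows, only the Pfam and SMART rows are retained.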
Including the following:
- Summary of criteria used for region selection
- Merge algorithm pseudo-code
- Descriptive analyses and sanity checks
bbgdomains google-drive presentation
Create a conda environment with the required Python dependencies:
$ conda env create -f environment.yml
Activate the environment:
$ conda activate bbgdomains-env
Download the entire InterPro collection of functional annotation matches from:
https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/match_complete.xml.gz
- 01_extract_human_matches.py
  Requires the following download:
  https://ftp.ebi.ac.uk/pub/databases/interpro/current_release/match_complete.xml.gz
  Example command line:
  python 01_extract_human_matches.py /workspace/nobackup/scratch/fcalvet/interpro/2025-05-25.match_complete.xml.gz > output.xml
- 02_parse_xml2table.py
  Example command line:
  python 02_parse_xml2table.py --input_file output.xml --output_file output_table.tsv
- 03_connections_from_InterPro_to_transcripts.ipynb
- 04_row_selection.ipynb
- 05_selection_of_transcripts.ipynb
- 06_merging_algorithm.ipynb
- 07_format_domain_catalogue.ipynb
- 08_testing.ipynb
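As a rough illustration of what the XML-to-table step involves, the sketch below streams an InterPro-style match file and yields one row per match location. It is not the code in 02_parse_xml2table.py. The element and attribute names (`protein`, `match`, `ipr`, `lcn`, `dbname`, `start`, `end`) are assumptions about the file layout, and the sample record is entirely made up.

```python
import io
import xml.etree.ElementTree as ET

# Made-up record following the assumed layout of the match file.
SAMPLE_XML = b"""<interpromatch>
  <protein id="P00001" name="EXAMPLE_HUMAN" length="300">
    <match id="PF00001" name="Example domain" dbname="PFAM">
      <ipr id="IPR000001" name="Example entry" type="Domain"/>
      <lcn start="10" end="120"/>
    </match>
    <match id="SM00001" name="Unintegrated signature" dbname="SMART">
      <lcn start="200" end="250"/>
    </match>
  </protein>
</interpromatch>"""


def iter_match_rows(source):
    """Yield one dict per match location, reading one protein at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "protein":
            continue
        for match in elem.iter("match"):
            ipr = match.find("ipr")  # None when the signature is not integrated
            for lcn in match.iter("lcn"):
                yield {
                    "protein_id": elem.get("id"),
                    "signature_id": match.get("id"),
                    "database": match.get("dbname"),
                    "interpro_id": ipr.get("id") if ipr is not None else None,
                    "interpro_type": ipr.get("type") if ipr is not None else None,
                    "start": int(lcn.get("start")),
                    "end": int(lcn.get("end")),
                }
        elem.clear()  # release the children of the processed protein


rows = list(iter_match_rows(io.BytesIO(SAMPLE_XML)))
```

Streaming with `iterparse` avoids loading the whole file in memory, which matters for a download as large as the complete InterPro match collection.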
Additional downstream steps for running smRegions, a driver discovery method based on the statistical analysis of mutation enrichment in protein domains. Check out the smRegions documentation and the publication where it was first described.
- Preparation_of_smregions_table.ipynb
- input4smregions_creation.sh
  Datasets to check:
  /data/bbg/nobackup2/scratch/fcalvet/test_smregions/build_regions/version3
- code_smregions.py
- Analysis_of_smregion_results.ipynb
Additional script for comparing the Pfam IntOgen dataset with bbgdomains:
Comparison_between_Pfam_IntOgen_and_BBGdomains.ipynb
One of the central goals of cancer genomics is to identify the genomic events that lead to tumorigenesis and to understand the mechanisms underlying their tumorigenic effects. The characterization of these events, so-called drivers, is essential for the development of precision oncology.
To date, the community has proposed several methods to identify genes harbouring genomic driver events, so-called driver discovery methods. These methods feed on catalogs of somatic mutations sequenced in cohorts of tumors to conduct statistical analyses of signals of positive selection, whereby the pattern of mutations observed at any given gene is compared with the pattern expected under neutral evolution. In other words, these methods test whether the observed mutation patterns deviate from those expected in the absence of clonal selection.
Intogen (intogen.org) is a computational framework that identifies cancer driver genes from somatic point mutations by combining the readouts of several driver discovery methods. Among the methods included in Intogen, smRegions tests the enrichment of protein-coding mutations in protein domains, which can be broadly defined as protein subunits with distinct structural and evolutionary properties, generally implicated in a specific protein function. Thus, genes with a significant enrichment of mutations mapping to their protein domains are deemed candidate driver genes.
Therefore, a crucial aspect for smRegions, and for driver mutation identification at large, is the correct annotation of protein regions with critical functional significance. Through the structural and evolutionary analysis of protein sequences, the community has proposed several methods to systematically annotate protein domains. One example is Pfam, which relies on profile Hidden Markov Models (profile HMMs) to encode and subsequently identify protein domains across the human proteome. A profile HMM can be regarded as a statistical representation of the defining motifs of a domain and of the variability that is admissible for a residue sequence to still be identified as that domain.
To date, the smRegions implementation in Intogen (as of version v2024) relies on the Pfam catalogue. Although a sensible choice, this catalogue is not comprehensive, as other domain catalogs have produced domain annotations not covered by Pfam. Another caveat of the current Pfam catalogue in use in Intogen v2024 is that its annotations are not necessarily compatible with the transcripts of interest, currently the MANE Select catalogue.
In this project we aimed to create a new catalogue of protein domains that (i) is comprehensive, i.e. aware of as many sources as possible; (ii) resolves the redundancies and inconsistencies across catalogs; (iii) provides domain coordinates that are compatible with any given reference transcript; and (iv) is easy to maintain with a view towards future releases of the respective source domain catalogs.
To address these goals, we decided to use InterPro as a functional region annotation resource. InterPro is a database developed and maintained by the European Bioinformatics Institute (EMBL-EBI) which includes protein sequence annotations from multiple sources, providing a broad and rich coverage of protein features, including protein families, domains and functional sites. Several member databases are incorporated including CATH-Gene3D, CDD, Pfam, PROSITE and SMART.
An important feature of our domain annotation pipeline, which we refer to as bbgdomains, is that the region annotations are compatible with the transcript that the user provides for each gene. Given the reference transcript, the pipeline retrieves a compatible UniProt identifier, then retrieves the InterPro annotations matching this identifier.
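The two-step lookup described above can be pictured with a minimal sketch. It is not the pipeline's actual code: the function `regions_for_transcript`, the `Region` fields, and all identifiers and coordinates below are hypothetical, and the sketch does not model how a "compatible" UniProt identifier is chosen.

```python
from typing import Dict, List, NamedTuple


class Region(NamedTuple):
    uniprot_id: str
    signature: str
    start: int  # 1-based residue position, inclusive
    end: int    # inclusive


def regions_for_transcript(
    transcript_id: str,
    transcript_to_uniprot: Dict[str, str],
    uniprot_to_regions: Dict[str, List[Region]],
) -> List[Region]:
    """Map a reference transcript to its UniProt entry, then to its regions."""
    uniprot_id = transcript_to_uniprot.get(transcript_id)
    if uniprot_id is None:
        return []  # no compatible UniProt entry for this transcript
    return list(uniprot_to_regions.get(uniprot_id, []))


# Hypothetical identifiers and coordinates, for illustration only.
transcript_to_uniprot = {"TRANSCRIPT_1": "UNIPROT_A"}
uniprot_to_regions = {
    "UNIPROT_A": [
        Region("UNIPROT_A", "SIGNATURE_X", 10, 120),
        Region("UNIPROT_A", "SIGNATURE_Y", 200, 250),
    ]
}

found = regions_for_transcript("TRANSCRIPT_1", transcript_to_uniprot, uniprot_to_regions)
missing = regions_for_transcript("TRANSCRIPT_2", transcript_to_uniprot, uniprot_to_regions)
```

Transcripts without a mapped UniProt entry simply yield no regions in this sketch.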
Although InterPro is a rich resource for protein domains, the resulting protein domain catalogue contains overlapping regions, which the current curation efforts of InterPro maintainers have not yet fully resolved into a collection of unique entities. InterPro does select what it deems a "representative" annotation to render a non-redundant view, but the designated representatives are not always useful in practice, or they miss domain signature types that are relevant to somatic evolution and cancer driver discovery. For example, TP53, a well-known cancer driver gene, lacks "representative" domains. Consequently, any systematic filter based on InterPro’s definition of representative is bound to be flawed.
To resolve these issues, we implemented a domain merging algorithm that 1) identifies domains whose overlap is high enough to consider them the same biological entity and merges them by means of intersection; 2) produces a unique identifier from which the raw components of the resulting merged domains can be traced back to their source. The result is a consensus catalogue that remains compatible with the InterPro structure but is better suited for downstream applications, such as procedures entailing multiple testing, smRegions in particular. The consensus catalogue simplifies interpretation while keeping track of the source domains before merging, thereby making biological insights clearer. When used with the driver discovery method smRegions, it provides complete, state-of-the-art protein domain coverage while reducing the multiple testing burden and the computational cost caused by redundancies between sources, thereby providing a more comprehensive foundation for identifying cancer driver events and enhanced resolution in somatic clonal evolution analyses.
06_merging_algorithm.ipynb
To build an unambiguous catalogue, the fundamental question to answer is "which domains represent the same biological/functional entity?". If the regions retrieved from InterPro are disjoint, they will appear as disjoint entries in our catalogue. However, we need a criterion to decide whether overlapping regions are different instances of the same underlying functional entity.
To address this, we implemented a merging algorithm intended to resolve overlapping functional elements. In essence, the algorithm iteratively intersects pairs of regions with high mutual overlap, using a threshold that can be defined by the user (defaults to 70%).
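The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation in 06_merging_algorithm.ipynb: the `Region` class, the order in which pairs are visited, and the use of a tuple of source labels for provenance are all hypothetical choices, and the example coordinates are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    start: int                # 1-based residue position, inclusive
    end: int                  # inclusive
    sources: Tuple[str, ...]  # labels of the raw regions merged into this one

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def mutual_overlap(a: Region, b: Region) -> float:
    """Fraction of the shorter-covered region: min over both of shared/length."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared <= 0:
        return 0.0
    return min(shared / a.length, shared / b.length)


def merge_regions(regions: List[Region], threshold: float = 0.7) -> List[Region]:
    """Repeatedly replace a highly overlapping pair by its intersection."""
    merged = list(regions)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if mutual_overlap(a, b) >= threshold:
                    consensus = Region(
                        max(a.start, b.start),
                        min(a.end, b.end),
                        tuple(sorted(set(a.sources + b.sources))),
                    )
                    rest = [r for k, r in enumerate(merged) if k not in (i, j)]
                    merged = rest + [consensus]
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated list
    return sorted(merged, key=lambda r: (r.start, r.end))


# Hypothetical regions on one protein.
example = [
    Region(10, 100, ("PF_A",)),
    Region(15, 105, ("SM_B",)),
    Region(200, 250, ("CD_C",)),
    Region(10, 300, ("G3D_D",)),
]
result = merge_regions(example)
```

Here the first two regions share 86 residues out of 91 each, so they collapse into the consensus region 15 to 100 that records both sources. The long region 10 to 300 is left untouched because the shared stretch covers less than 70% of its length, even though it fully contains the others.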
For example, if two regions with respective lengths