The goal of this capstone project is to improve small molecule binding prediction by applying machine learning (ML) methods to navigate the immense chemical space and identify promising drug candidates. Traditional drug discovery relies on individually synthesizing and testing small molecules against protein targets — a slow process, especially given the estimated 10^60 drug-like compounds compared to only ~2,000 FDA-approved novel molecules. This project leverages the Big Encoded Library for Chemical Assessment (BELKA), which includes data on ~133 million small molecules screened against three protein targets using DNA-encoded chemical library (DEL) technology. We aim to build predictive models that estimate the binding affinity of unseen compounds to protein targets, using the provided training data and exploring additional modeling strategies beyond purely empirical binding data. By combining multiple approaches, we seek to enhance predicti
- id — A unique identifier for each molecule–protein target pair.
- buildingblock1_smiles — SMILES string representing the first chemical building block.
- buildingblock2_smiles — SMILES string for the second building block.
- buildingblock3_smiles — SMILES string for the third building block.
- molecule_smiles — SMILES string of the complete molecule, combining the three building blocks and the triazine core.
- protein_name — Name of the target protein (e.g., BRD4, HSA, sEH).
- binds — The binary label indicating whether the molecule binds to the protein target.
Each SMILES molecule is represented three times to assess its binding interactions with three different protein targets: HSA, BRD4, and sEH.
The distribution of SMILES molecules that bind to each protein (BRD4, HSA, sEH) is as follows:

It offers insights into both the unique and shared positive binding interactions across the different protein targets.
The BELKA dataset (Data Source) is available here → BELKA Dataset
The workflow starts by loading raw Parquet files into Google Cloud Storage, followed by preprocessing steps such as deduplication, molecular encoding (using Morgan fingerprints, ECFP), and reshaping the data from long to wide format. Exploratory Data Analysis (EDA) is conducted to generate molecular descriptors, build a correlation matrix, and prepare the feature set for model training.
Various machine learning models—including Logistic Regression, Random Forest, CatBoost, and XGBoost—are trained, with the best-performing model saved to an artifact registry. This model is then deployed using FastAPI and integrated with Streamlit to enable user-friendly interaction.

