This repository contains code and data for the first edition of Machine Learning for Drug Discovery (Manning Publications). The companion material covers introductory topics at the intersection of machine learning, deep learning, and drug discovery, and each chapter applies them to a real-world scenario. The code and notebooks are released under the Apache 2.0 license.
For readability, the chapter notebooks only contain runnable code blocks and section titles. They omit the rest of the material in the book, i.e., text paragraphs, figures (unless generated as part of one of the code blocks), equations, and pseudocode. If you want to be able to follow what's going on, I recommend reading the notebooks side-by-side with your copy of the book!
Encounter any issues? Please let me know -- I can't fix a problem if I am not aware of its existence!
- Chapter 1: The Drug Discovery Process
- Chapter 2: Ligand-based Screening: Filtering & Similarity Searching
- Chapter 3: Ligand-based Screening: Machine Learning
- Chapter 4: Solubility Deep Dive with Linear Models
- Chapter 5: Classification: Cytochrome P450 Inhibition
- Chapter 6: Case Study: Small Molecule Binding to an RNA Target
- Chapter 7: Unsupervised Learning: Repurposing Drugs, Curating Compounds, & Screening Fragments
- Chapter 8: Introduction to Deep Learning
- Chapter 9: Structure-based Drug Design with Active Learning
- Chapter 10: Generative Models for De Novo Design
- Chapter 11: Graph Neural Networks for Drug Target Affinity Prediction
- Chapter 12: Transformer Architectures for Protein Structure Prediction
- Chapter 13: Multimodal AI Systems for End-to-End Drug Discovery Pipelines
- Appendix A: Glossary
- Appendix B: Chemical Data Repositories
- Appendix C: Knowledge Distillation: Shrinking Models for Efficient, Hierarchical Molecular Generation
- Appendix D: Technical Deep Dive into Protein Structure Prediction
Open any notebook in Colab and run the installation cells at the top!
Each notebook includes two Colab installation options:
- Quick Install: Fast pip-based setup (3-10 minutes) with only the packages needed for that chapter
- Full Install: Complete conda environment (15-20 minutes) with all packages for all chapters
Prerequisites: Python 3.12+ and git
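Before cloning, it can help to confirm both prerequisites from Python itself. The snippet below is a minimal sketch, not part of the repository: the function name `check_prerequisites` and its parameters are hypothetical, and it only checks that the interpreter is at least 3.12 and that a `git` executable is on the PATH.

```python
import shutil
import sys

# Minimum version taken from the stated prerequisite "Python 3.12+".
MIN_PYTHON = (3, 12)


def check_prerequisites(python_version=None, git_lookup=shutil.which):
    """Return a list of problems; an empty list means both checks passed.

    python_version and git_lookup are hypothetical hooks that default to the
    running interpreter and shutil.which, so the checks can be exercised
    without changing the real environment.
    """
    version = tuple(python_version or sys.version_info)[:2]
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    if git_lookup("git") is None:
        problems.append("git was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("\n".join(issues) if issues else "Prerequisites look fine.")
```

If the script reports a problem, INSTALL.md is the place to look for the supported setup.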
We provide tiered installation options so you can install only what you need:
Core Environment (Chapters 1-4) — Basic ML & QSAR
git clone https://github.com/nrflynn2/ml-drug-discovery.git
cd ml-drug-discovery
pip install -r requirements-core.txt
Includes: numpy, pandas, matplotlib, seaborn, rdkit, scikit-learn
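After installing the core requirements, a quick import check can confirm that the packages listed above are visible to your interpreter. This is a rough sketch rather than a script shipped with the repository: `missing_packages` and `CORE_IMPORT_NAMES` are hypothetical names, and the list uses import names (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the core packages; scikit-learn is imported as "sklearn".
CORE_IMPORT_NAMES = ["numpy", "pandas", "matplotlib", "seaborn", "rdkit", "sklearn"]


def missing_packages(import_names):
    """Return the top-level import names that cannot be located."""
    # find_spec returns None when a top-level module is not installed,
    # and it does not import the module itself.
    return [name for name in import_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(CORE_IMPORT_NAMES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core packages were found.")
```

The same function can be reused for the later tiers by passing a different list of import names.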
Advanced Environment (Chapters 5-8) — Gradient Boosting & Deep Learning
pip install -r requirements-advanced.txt
Adds: torch, xgboost, lightgbm, catboost, shap, umap, statsmodels
Full Environment (Chapters 9-11) — Molecular Docking & GNNs
conda env create -f ml4dd2025.yml
conda activate ml4dd2025
Adds: openmm, vina, pdbfixer, torch-geometric, mdtraj, prolif, meeko
Note: Chapters 9-11 require conda due to specialized packages (molecular dynamics, docking) that don't install reliably via pip.
Quick Reference:
- Chapters 1-4: Use requirements-core.txt
- Chapters 5-8, Appendix C: Use requirements-advanced.txt
- Chapters 9-11: Use ml4dd2025.yml (conda required)
- Chapter 12: Follow the instructions and use the notebooks within CH12_FLYNN_ML4DD
- All chapters: Use ml4dd2025.yml for a complete setup
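The quick reference above can also be written as a small lookup, which may be handy if you script your setup. This is an illustrative sketch only: the function `environment_for` is hypothetical, and because the quick reference does not list Chapter 13 or the other appendices, the sketch raises an error for them instead of guessing.

```python
def environment_for(chapter):
    """Map a chapter number (1-12) or "C" (Appendix C) to its setup target.

    The returned string is a requirements file, the conda environment file,
    or, for Chapter 12, the folder that holds its own instructions.
    """
    if chapter == "C":
        return "requirements-advanced.txt"
    # Reject anything that is not a plain integer chapter number.
    if isinstance(chapter, bool) or not isinstance(chapter, int):
        raise ValueError(f"No quick-reference entry for {chapter!r}")
    if 1 <= chapter <= 4:
        return "requirements-core.txt"
    if 5 <= chapter <= 8:
        return "requirements-advanced.txt"
    if 9 <= chapter <= 11:
        return "ml4dd2025.yml"  # conda required
    if chapter == 12:
        return "CH12_FLYNN_ML4DD"  # follow the instructions in this folder
    raise ValueError(f"No quick-reference entry for {chapter!r}")


if __name__ == "__main__":
    for item in (2, 6, "C", 10, 12):
        print(item, "->", environment_for(item))
```

Since ml4dd2025.yml is described as the complete setup, it remains the fallback for anything the lookup does not cover.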
For detailed installation instructions and troubleshooting, see INSTALL.md.
Feel free to contribute, raise issues, or propose enhancements to make this repository a comprehensive resource for everyone venturing into machine learning, drug discovery, and related applications.
If you wish to cite the book, you may use the following:
@book{flynn2025mldd,
title={Machine Learning for Drug Discovery},
author={Flynn, N.},
isbn={9781633437661},
url={https://www.manning.com/books/machine-learning-for-drug-discovery},
year={2025},
publisher={Manning Publications}
}