A comprehensive bioinformatics pipeline for extracting heteroatoms from protein structures and finding molecularly similar compounds using fingerprint-based similarity analysis.
© 2025 Standard Seed Corporation. This is an open-source project developed and released by Standard Seed Corporation under the MIT License. All rights reserved.
TrackMyPDB is a user-friendly Streamlit web application that combines two powerful components:
- Heteroatom Extraction Tool: Systematically extracts all heteroatoms from PDB structures associated with UniProt proteins
- Molecular Similarity Analyzer: Finds ligands most similar to a target molecule using Morgan fingerprints and Tanimoto similarity
- Python 3.7+
- Internet connection for API calls
- Windows OS (optimized for Windows environment)
- Clone the repository:
  git clone <repository-url>
  cd TrackMyPDB
- Install dependencies:
  pip install -r requirements.txt
- Launch the application:
  streamlit run streamlit_app.py
- Open your browser to http://localhost:8501
- Navigate to the web interface
- Choose analysis type:
  - 🔍 Heteroatom Extraction
  - 🧪 Similarity Analysis
  - 📊 Complete Pipeline
- Input your data:
  - UniProt IDs (e.g., Q9UNQ0, P37231, P06276)
  - Target SMILES structure
- Run analysis and download CSV results
- Input: UniProt protein identifiers
- Process: Fetches PDB structures, extracts heteroatoms, retrieves SMILES
- Output: Comprehensive CSV with chemical information
- APIs: RCSB PDB, PubChem integration
- Features: Progress tracking, error handling, result caching
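The extraction step above can be illustrated with a minimal sketch that pulls unique heteroatom residue codes out of PDB-format text. It is not the code in backend/heteroatom_extractor.py: the function name `extract_heteroatom_codes`, the decision to skip water (HOH), and the sample records are all assumptions made for illustration. It relies only on the fixed-column PDB layout, where the residue name occupies columns 18-20 of each HETATM record.

```python
def extract_heteroatom_codes(pdb_text, skip=("HOH",)):
    """Return sorted, unique residue names found in HETATM records.

    Hypothetical helper: `skip` lists residue names to ignore
    (water by default, which is an assumption of this sketch).
    """
    codes = set()
    for line in pdb_text.splitlines():
        if line.startswith("HETATM"):
            # PDB columns 18-20 (0-based slice 17:20) hold the residue name.
            res_name = line[17:20].strip()
            if res_name and res_name not in skip:
                codes.add(res_name)
    return sorted(codes)


# Made-up sample records in PDB fixed-column format (illustrative only).
SAMPLE_PDB = "\n".join([
    "ATOM      1  N   MET A   1",
    "HETATM 1234  PG  ATP A 501",
    "HETATM 1290 MG    MG A 502",
    "HETATM 1300  O   HOH A 601",
])

print(extract_heteroatom_codes(SAMPLE_PDB))
```

Each code returned this way could then be looked up against a chemical component service to retrieve its SMILES string.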
- Input: Target SMILES structure
- Process: Morgan fingerprint computation, Tanimoto similarity calculation
- Output: Ranked similarity results with interactive visualizations
- Features: Configurable parameters, real-time analysis, comprehensive reports
- Workflow: End-to-end processing from UniProt IDs to similarity results
- Integration: Automatic heteroatom extraction followed by similarity analysis
- Output: Both heteroatom database and similarity results
TrackMyPDB/
├── streamlit_app.py              # Main Streamlit application
├── requirements.txt              # Python dependencies
├── backend/
│   ├── __init__.py               # Package initialization
│   ├── heteroatom_extractor.py   # Heteroatom extraction logic
│   └── similarity_analyzer.py    # Similarity analysis logic
└── README.md                     # This file
- Streamlit: Web application framework
- RDKit: Cheminformatics and molecular similarity
- Pandas: Data manipulation and analysis
- Plotly: Interactive visualizations
- Requests: API communications
- NumPy: Numerical computations
- PDBe REST API: PDB structure mappings
- RCSB PDB API: Chemical component data
- PubChem API: Backup molecular data
- Morgan Fingerprints: Circular molecular fingerprints (radius=2, 2048 bits)
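- Tanimoto Similarity: Widely used fingerprint similarity metric, the ratio of shared on-bits to total on-bits across both fingerprints (0 = no overlap, 1 = identical fingerprints)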
- Interactive Visualizations: Distribution plots, similarity rankings, statistical analysis
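To make the Tanimoto score concrete, here is a small pure-Python sketch that treats each fingerprint as the set of its on-bit positions. In the application itself RDKit computes the Morgan fingerprints and the similarity; the function `tanimoto` below is a hypothetical stand-in, and returning 0.0 when both fingerprints are empty is a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints empty: defined as 0.0 here (an assumption).
        return 0.0
    return len(a & b) / union


# Two toy fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

The score depends only on which bits are set, so a larger fingerprint size (for example 4096 instead of 2048 bits) mainly reduces accidental bit collisions between unrelated substructures.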
- Modern UI: Clean, minimalist design inspired by Apple Design principles
- Responsive Layout: Optimized for different screen sizes
- Interactive Elements: Smooth animations and hover effects
- Intuitive Navigation: Clear section organization and progress indicators
- Real-time Progress: Progress bars and status updates
- Error Handling: Graceful error messages and troubleshooting
- Data Export: CSV download functionality with timestamps
- Result Caching: Session state management for efficiency
- Heteroatoms: ~1000-5000 heteroatoms per 10 UniProt proteins
- SMILES Success: ~60-80% success rate for SMILES retrieval
- Similar Ligands: ~50-200 similar compounds per target (similarity > 0.2)
- Processing Time: 30-60 minutes for complete pipeline
- heteroatom_results_YYYYMMDD_HHMMSS.csv: Complete heteroatom extraction results
- similarity_results_YYYYMMDD_HHMMSS.csv: Molecular similarity analysis results
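The YYYYMMDD_HHMMSS pattern in these file names maps directly onto a `strftime` format string. A minimal sketch, assuming a hypothetical helper named `timestamped_name` (the application may build its file names differently):

```python
from datetime import datetime


def timestamped_name(prefix, when=None):
    """Build '<prefix>_YYYYMMDD_HHMMSS.csv' from a datetime (default: now)."""
    when = when or datetime.now()
    return f"{prefix}_{when.strftime('%Y%m%d_%H%M%S')}.csv"


print(timestamped_name("heteroatom_results", datetime(2025, 1, 31, 9, 5, 7)))
```

Because the timestamp is zero-padded and ordered from year to second, sorting the files by name also sorts them chronologically.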
- UniProt IDs: Multiple input formats (comma-separated, line-separated)
- Result Caching: Previous results loading and management
- API Settings: Automatic retry logic and rate limiting
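As an illustration of accepting both comma-separated and line-separated UniProt IDs, the sketch below splits on commas and any whitespace, upper-cases each token, and drops duplicates while keeping input order. The function name `parse_uniprot_ids` and the normalisation rules are assumptions, not the application's actual parser.

```python
import re


def parse_uniprot_ids(raw):
    """Split free-form input into a de-duplicated list of UniProt IDs."""
    ids = []
    for token in re.split(r"[,\s]+", raw.strip()):
        token = token.upper()
        # Skip empty tokens and IDs already collected.
        if token and token not in ids:
            ids.append(token)
    return ids


print(parse_uniprot_ids("Q9UNQ0, P37231\nP06276\nq9unq0"))
```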
- Fingerprint Parameters:
  - Morgan radius: 1, 2, 3 (default: 2)
  - Fingerprint bits: 1024, 2048, 4096 (default: 2048)
- Analysis Parameters:
  - Top N results: 10-100 (default: 50)
  - Minimum similarity: 0.0-1.0 (default: 0.2)
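The two analysis parameters interact in a simple way: the similarity threshold filters first, then the ranked list is cut to the top N. A minimal sketch, assuming scores are held in a plain dict and that the threshold is inclusive (both are assumptions; `top_hits` is a hypothetical name):

```python
def top_hits(scores, top_n=50, min_similarity=0.2):
    """Keep ligands at or above the threshold, best first, capped at top_n."""
    kept = [(name, s) for name, s in scores.items() if s >= min_similarity]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


# Made-up scores for illustration only.
example_scores = {"ATP": 0.82, "HEM": 0.15, "NAD": 0.41, "FAD": 0.2}
print(top_hits(example_scores, top_n=2))
```

Raising the minimum similarity shrinks the candidate list before ranking, which is why a higher threshold also helps with slow runs.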
# Install dependencies
pip install -r requirements.txt
# For RDKit installation issues on Windows
conda install -c conda-forge rdkit
- Verify SMILES syntax using online validators
- Check for special characters or formatting issues
- Example valid SMILES: CCO (ethanol), CC(=O)O (acetic acid)
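Some formatting problems can be caught before a SMILES string reaches a real parser. The sketch below is only a coarse pre-check for stray whitespace and unbalanced brackets; passing it does not mean the SMILES is chemically valid. The function name `smiles_precheck` is hypothetical. With RDKit installed, the usual test is whether `Chem.MolFromSmiles` returns `None`.

```python
def smiles_precheck(smiles):
    """Return a list of obvious formatting problems (empty list if none found)."""
    problems = []
    if not smiles or any(ch.isspace() for ch in smiles):
        problems.append("empty or contains whitespace")

    closers = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closers:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != closers[ch]:
                problems.append("unbalanced brackets")
                break
    else:
        # Loop finished without a mismatch; any leftover opener is unclosed.
        if stack:
            problems.append("unbalanced brackets")
    return problems


print(smiles_precheck("CC(=O)O"))
print(smiles_precheck("CC(=O"))
```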
- Reduce number of UniProt IDs for testing
- Use higher minimum similarity threshold
- Check internet connection stability
- Wait a few minutes and retry
- Check if external APIs (RCSB, PubChem) are accessible
- Reduce batch size for large datasets
- Lead Optimization: Find similar compounds to known drugs
- Scaffold Hopping: Identify alternative molecular frameworks
- Target Analysis: Understand ligand binding preferences
- Cofactor Analysis: Study enzyme cofactor preferences
- Binding Site Analysis: Characterize pocket properties
- Cross-reactivity Prediction: Assess off-target binding
- Structural Biology: Build custom screening libraries
- Comparative Analysis: Study protein-ligand interactions
- Database Construction: Create specialized molecular databases
- Follow PEP 8 style guidelines
- Add comprehensive error handling
- Include progress indicators for long operations
- Document all functions and classes
- Test with various input formats
This project is licensed under the MIT License - see the LICENSE file for details.
Open Source Project - Free to use, modify, and distribute under the MIT License terms.
Please respect API terms of service and rate limits when using this application.
If you use TrackMyPDB in your research or project, please cite it as follows:
Sharif, S., Gamage, A., Kotawalagedara, K., Sha, S., & Bodun, D. (2025).
TrackMyPDB: A comprehensive bioinformatics pipeline for extracting heteroatoms from protein
structures and finding molecularly similar compounds using fingerprint-based similarity analysis
(Version 2.0) [Computer software]. Standard Seed Corporation.
https://trackmypdbsscai.streamlit.app/
@software{trackmypdb2025,
  author = {Sharif, Suliman and Gamage, Anu and Kotawalagedara, Kalana and
            Sha, Sakeer and Bodun, Damilola},
  title = {TrackMyPDB: A Comprehensive Bioinformatics Pipeline for Heteroatom Extraction
           and Molecular Similarity Analysis},
  year = {2025},
  version = {2.0},
  organization = {Standard Seed Corporation},
  url = {https://trackmypdbsscai.streamlit.app/},
  note = {Open-source software for protein structure analysis and molecular similarity}
}
Sharif, S., Gamage, A., Kotawalagedara, K., Sha, S., & Bodun, D. (2025).
TrackMyPDB v2.0 - Protein Structure Heteroatom Extraction & Molecular Similarity Analysis.
Standard Seed Corporation. Available at: https://trackmypdbsscai.streamlit.app/
If you have obtained a DOI for your work that uses TrackMyPDB, please consider citing both the software and your own publication.
Note: Please also acknowledge the underlying databases and tools used by TrackMyPDB:
- RCSB Protein Data Bank (RCSB PDB)
- Protein Data Bank in Europe (PDBe)
- PubChem Database
- RDKit Cheminformatics Toolkit
- RCSB PDB: Protein structure data
- PDBe: Structure mapping services
- PubChem: Chemical information database
- RDKit: Cheminformatics toolkit
- Streamlit: Web application framework
- Project Supervisor/Senior Engineer: Sul Sharif
- Lead Engineer: Anu Gamage
- Associate Engineers: Kalana Kotawalagedara, Sakeer Sha, Damilola Bodun
For issues or questions:
- Check the troubleshooting section
- Verify input data format
- Test with provided examples
- Review browser console for errors
- Contact the developers through LinkedIn
Happy molecular hunting! 🧬🔍