An AI-powered local-first system for extracting insights from PDF documents using Retrieval-Augmented Generation (RAG). This app loads documents, processes them into chunks, generates embeddings using sentence-transformers, and performs semantic search via a vector store (e.g., FAISS). A FastAPI backend can be used for interactive queries and integration.
- 📂 Load and parse PDF or text-based documents
- 🧼 Preprocess and chunk documents for optimal embedding
- 🔎 Semantic search using vector similarity (e.g., FAISS)
- 🧠 Sentence-transformer-based embedding generation
- 🔄 Retrieval-Augmented Generation engine (RAG)
- 🚀 FastAPI backend for RESTful document insight queries
- 🛠️ Modular, clean, and extensible codebase
pdfinsight/
├── app/
│ ├── main.py # Application entry point (e.g., FastAPI setup)
│ │
│ ├── loaders/
│ │ └── document_loader.py # Load PDF/text files into memory
│ │
│ ├── processors/
│ │ └── document_processor.py # Clean and split text into chunks
│ │
│ ├── embeddings/
│ │ └── embedding_service.py # Generate vector embeddings for text chunks
│ │
│ ├── vectorstores/
│ │ └── vector_store.py # Store and query embedding vectors using FAISS or similar
│ │
│ └── engines/
│ └── rag_engine.py # Perform RAG (retrieve + generate)
│
├── requirements.txt # Python dependencies
├── README.md # Project documentation (this file)
└── .gitignore # Git ignored files
- Python 3.9+
- PdfPlumber (for PDF parsing)
- langchain
- sentence-transformers
- Chroma (for vector search)
- FastAPI + Uvicorn (for API layer)
This project is licensed under the MIT License.
See CONTRIBUTING.md for guidelines.
For questions, suggestions, or feedback, open an issue or contact @TEJAS-SAI-PRASHAD-K on GitHub.