An AI-powered, local-first system for extracting insights from PDF documents using Retrieval-Augmented Generation (RAG). The app loads documents, splits them into chunks, generates embeddings with sentence-transformers, and performs semantic search over a vector store (e.g., FAISS). A FastAPI backend is available for interactive queries and integration.
- Load and parse PDF or text-based documents
- Preprocess and chunk documents for optimal embedding
- Semantic search using vector similarity (e.g., FAISS)
- Sentence-transformer-based embedding generation
- Retrieval-Augmented Generation (RAG) engine
- FastAPI backend for RESTful document insight queries
- Modular, clean, and extensible codebase
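
As a rough illustration of the chunking step, here is a minimal sketch of overlapping word-based chunks. The function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions made for this example; the actual logic in `document_processor.py` may split on characters, sentences, or tokens instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks that share `overlap` words with the previous chunk.

    Hypothetical helper for illustration only, not the project's actual API.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()  # also collapses runs of whitespace and newlines
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample_chunks = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
print(sample_chunks)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of embedding some words twice.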
pdfinsight/
├── app/
│   ├── main.py                      # Application entry point (e.g., FastAPI setup)
│   │
│   ├── loaders/
│   │   └── document_loader.py       # Load PDF/text files into memory
│   │
│   ├── processors/
│   │   └── document_processor.py    # Clean and split text into chunks
│   │
│   ├── embeddings/
│   │   └── embedding_service.py     # Generate vector embeddings for text chunks
│   │
│   ├── vectorstores/
│   │   └── vector_store.py          # Store and query embedding vectors using FAISS or similar
│   │
│   └── engines/
│       └── rag_engine.py            # Perform RAG (retrieve + generate)
│
├── requirements.txt                 # Python dependencies
├── README.md                        # Project documentation (this file)
└── .gitignore                       # Git ignored files
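
To show how the embedding, vector store, and RAG engine modules might fit together, here is a dependency-free sketch of the retrieve-then-prompt flow. Everything in it is an assumption for illustration: `toy_embed` is a bag-of-words stand-in for a sentence-transformers model, `InMemoryVectorStore` is a brute-force stand-in for a FAISS-style index, `build_prompt` is a hypothetical helper, and the sample sentences are made up. The real classes in `app/` may look quite different.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in embedding: L2-normalised bag-of-words counts (NOT a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


class InMemoryVectorStore:
    """Brute-force cosine search; a hypothetical stand-in for a FAISS-backed store."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, Dict[str, float]]] = []

    def add(self, chunks: List[str]) -> None:
        # Embed each chunk once at insert time.
        for chunk in chunks:
            self._items.append((chunk, toy_embed(chunk)))

    def search(self, query: str, k: int = 2) -> List[Tuple[str, float]]:
        # Vectors are unit length, so the dot product equals cosine similarity.
        q = toy_embed(query)
        scored = [
            (chunk, sum(weight * vec.get(word, 0.0) for word, weight in q.items()))
            for chunk, vec in self._items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def build_prompt(question: str, store: InMemoryVectorStore, k: int = 2) -> str:
    """Retrieve the top-k chunks and place them ahead of the question."""
    context = "\n".join(f"- {chunk}" for chunk, _ in store.search(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up sample chunks, standing in for text extracted from a PDF.
store = InMemoryVectorStore()
store.add([
    "Invoices are due within 30 days of receipt.",
    "The warranty covers hardware defects for two years.",
    "Support is available by email on weekdays.",
])
prompt = build_prompt("How long is the warranty?", store, k=1)
print(prompt)
```

In a full RAG setup the resulting prompt would be passed to a language model for the generation step; which model the project uses is not stated here, so that call is left out.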
- Python 3.9+
- pdfplumber (for PDF parsing)
- langchain
- sentence-transformers
- FAISS or Chroma (for vector search)
- FastAPI + Uvicorn (for API layer)
This project is licensed under the MIT License.
See CONTRIBUTING.md for guidelines.
For questions, suggestions, or feedback, open an issue or contact @TEJAS-SAI-PRASHAD-K on GitHub.