NeuraFind addresses the growing privacy concerns associated with cloud-based document search engines by providing a strictly local, offline-first semantic search architecture. By leveraging quantized transformer models via the ONNX Runtime, NeuraFind computes high-dimensional vector embeddings of local documents (PDF, DOCX, XLSX, PPTX) entirely on consumer-grade hardware. This enables users to perform complex conceptual queries and retrieve contextually relevant information without exposing sensitive data to external networks.
- Semantic Vector Space Modeling: Utilizes state-of-the-art multilingual sentence transformers (e.g., paraphrase-multilingual-MiniLM-L12-v2) to map textual data into a dense vector space, enabling meaning-based retrieval rather than relying solely on lexical overlap.
- Hybrid Retrieval Engine: Implements a multi-tiered search methodology:
  - Exact Matching: Traditional boolean retrieval for precise term isolation.
  - Fuzzy Matching: Levenshtein distance-based algorithms to account for typographical variances.
  - Semantic Search: Cosine similarity computations against the local SQLite vector store to retrieve conceptually related documents.
- Absolute Data Sovereignty: The system architecture is designed to operate in air-gapped environments. Document parsing, tokenization, embedding generation, and indexing are confined entirely to the host machine.
- Asynchronous Interface Design: Developed utilizing PySide6 (Qt), the graphical interface maintains high responsiveness during computationally intensive background tasks (indexing and tensor operations) via robust multithreading.
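The three retrieval tiers listed above can be illustrated with a minimal, standard-library-only sketch. This is not NeuraFind's actual code: the `chunks` table, its columns, the `hybrid_search` function, the thresholds, and the 3-dimensional toy vectors are all assumptions made for illustration, and the hand-written `levenshtein` stands in for whatever fuzzy-matching library the application uses.

```python
import math
import sqlite3
import struct


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine(u, v) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pack(vec) -> bytes:
    """Serialise a vector as float32 values for storage in a BLOB column."""
    return struct.pack(f"{len(vec)}f", *vec)


def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"{len(blob) // 4}f", blob))


def hybrid_search(conn, query, query_vec, max_edits=1, min_cosine=0.8):
    """Hypothetical dispatcher: try exact, then fuzzy, then semantic per row.

    Handles a single-term query only; returns (id, tier, score) tuples.
    """
    q = query.lower()
    hits = []
    for doc_id, text, blob in conn.execute("SELECT id, text, embedding FROM chunks"):
        words = text.lower().split()
        if q in words:
            hits.append((doc_id, "exact", 1.0))
            continue
        best = min((levenshtein(q, w) for w in words), default=max_edits + 1)
        if best <= max_edits:
            hits.append((doc_id, "fuzzy", 1.0 - best / max(len(q), 1)))
            continue
        similarity = cosine(query_vec, unpack(blob))
        if similarity >= min_cosine:
            hits.append((doc_id, "semantic", similarity))
    tier_rank = {"exact": 0, "fuzzy": 1, "semantic": 2}
    hits.sort(key=lambda hit: (tier_rank[hit[1]], -hit[2]))
    return hits


# Toy data: real embeddings would come from the model and be far larger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", [
    (1, "quarterly invoice summary", pack([1.0, 0.0, 0.0])),
    (2, "quartely budget draft", pack([0.0, 1.0, 0.0])),   # misspelled on purpose
    (3, "payment reminder letter", pack([0.9, 0.1, 0.0])),
    (4, "holiday photo album", pack([0.0, 0.0, 1.0])),
])

for hit in hybrid_search(conn, "quarterly", [1.0, 0.0, 0.0]):
    print(hit)
```

With this toy data, row 1 matches exactly, row 2 is caught by the fuzzy tier despite the misspelling, row 3 is retrieved only because its vector lies close to the query vector, and row 4 is filtered out. A linear scan like this is simple but grows with the size of the index; whether NeuraFind scans or uses an index structure is not stated here.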
The application is structured into decoupled modules to ensure maintainability and high performance:
graph TD
A[Local Directory Scanner] --> B(Document Parsers)
B -->|Text Extraction| C[Text Chunking & Preprocessing]
C --> D[ONNX Inference Engine]
D -->|Dense Vectors| E[(SQLite Vector Store)]
F[User Query] --> G[Query Tokenization]
G --> H[Query Embedding Generation]
H --> I{Hybrid Search Dispatcher}
I --> E
I -->|Ranked Results| J[Graphical User Interface]
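The indexing path in the diagram (text chunking, embedding, SQLite storage) might look roughly like the sketch below. Everything here is an assumption for illustration: the word-window chunking scheme, the `chunks` schema, and the function names are hypothetical, and `toy_embed` is a hash-based placeholder with no semantic meaning that merely stands in for the ONNX inference step.

```python
import hashlib
import sqlite3
import struct


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder for the model: deterministic pseudo-vector from a hash.

    Supports dim <= 32 (the SHA-256 digest length) and carries no meaning.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [(byte - 127.5) / 127.5 for byte in digest[:dim]]


def index_document(conn: sqlite3.Connection, path: str, text: str) -> int:
    """Chunk, embed, and store one document; returns the number of chunks."""
    chunks = chunk_words(text)
    for position, chunk in enumerate(chunks):
        vec = toy_embed(chunk)
        conn.execute(
            "INSERT INTO chunks (path, position, text, embedding) VALUES (?, ?, ?, ?)",
            (path, position, chunk, struct.pack(f"{len(vec)}f", *vec)),
        )
    conn.commit()
    return len(chunks)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (id INTEGER PRIMARY KEY, path TEXT, "
    "position INTEGER, text TEXT, embedding BLOB)"
)

sample = " ".join(f"w{i}" for i in range(20))  # stands in for extracted text
stored = index_document(conn, "report.txt", sample)
print(stored, "chunks stored")
```

Overlapping windows are one common way to keep a sentence that straddles a chunk boundary retrievable from either side; the window size and overlap used by the application are not specified in this document.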
- Python 3.11 or higher
- Windows 10 / 11
- Clone the repository:
    git clone https://github.com/Hussein-Furaty/NeuraFind.git
    cd NeuraFind
- Environment Configuration:
    python -m venv .venv
    .venv\Scripts\activate
    pip install -r requirements.txt
- Application Execution:
    set PYTHONPATH=.
    python src\neurafind\app.py
To generate a standalone Windows installer utilizing PyInstaller and the Inno Setup compiler:
- Install Inno Setup 6: Download the compiler from the Official JRSoftware Website. Tip: Install it in the default directory (C:\Program Files (x86)\Inno Setup 6) so the automated build script can locate the ISCC.exe compiler automatically.
- Execute the Build Pipeline: Run the batch script which automates both PyInstaller packaging and Inno Setup compilation:
    scripts\build_all.bat
Upon successful execution, the final installer (NeuraFind_Setup_v1.0.0.exe) will be generated in the dist/ directory, and intermediate portable files will be automatically cleaned.
NeuraFind is built upon robust open-source foundations. The development of this application heavily relied on the following libraries and frameworks:
- ONNX Runtime: Provides the cross-platform, high-performance machine learning inference engine used for local embedding generation.
- PySide6 (Qt for Python): The core framework driving the application's graphical user interface and multithreading architecture.
- Hugging Face Transformers: Utilized for text tokenization algorithms (XLMRobertaTokenizerFast).
- Xenova Models: Acknowledgment for the optimized, quantized ONNX port of the paraphrase-multilingual-MiniLM-L12-v2 model, which makes local inference feasible on standard CPUs.
- PyMuPDF: Enables highly efficient text extraction algorithms for PDF parsing.
- RapidFuzz: Implements the optimized string matching metrics used in the fuzzy search module.
Hussein Al-Furati
Cybersecurity Student, Software Developer, and AI Researcher.
Email: hussein.a.habeeb.sec@gmail.com
GitHub: @Hussein-Furaty
This project is released under the MIT License. See the LICENSE file for complete details. The accompanying End User License Agreement (EULA) within the installer details specific terms concerning local data privacy and liability.
