NeuraFind: Offline Semantic Document Intelligence

A Local-First Information Retrieval and Vector Search Engine for Windows Environments


License: MIT | Status: Active | Version: 1.0.0 | Python 3.11+ | Platform: Windows | Architecture: x64 | Framework: PySide6 | AI: ONNX Runtime | Database: SQLite3




Abstract

NeuraFind addresses the growing privacy concerns associated with cloud-based document search engines by providing a strictly local, offline-first semantic search architecture. By leveraging quantized transformer models via the ONNX Runtime, NeuraFind computes high-dimensional vector embeddings of local documents (PDF, DOCX, XLSX, PPTX) entirely on consumer-grade hardware. This enables users to perform complex conceptual queries and retrieve contextually relevant information without exposing sensitive data to external networks.

Core Architectural Features

  • Semantic Vector Space Modeling: Uses a multilingual sentence-transformer model (e.g., paraphrase-multilingual-MiniLM-L12-v2) to map text into a dense vector space, so retrieval is based on meaning rather than only on lexical overlap.
  • Hybrid Retrieval Engine: Implements a multi-tiered search methodology:
    • Exact Matching: Traditional boolean retrieval for precise term isolation.
    • Fuzzy Matching: Levenshtein distance-based algorithms to account for typographical variances.
    • Semantic Search: Cosine similarity computations against the local SQLite vector store to retrieve conceptually related documents.
  • Absolute Data Sovereignty: The system architecture is designed to operate in air-gapped environments. Document parsing, tokenization, embedding generation, and indexing are confined entirely to the host machine.
  • Asynchronous Interface Design: The graphical interface is built with PySide6 (Qt) and stays responsive during computationally intensive background work (indexing and tensor operations) by running those tasks on separate threads.
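
The three retrieval tiers listed above can be illustrated with a minimal, standard-library-only sketch. This is not NeuraFind's implementation (the project uses RapidFuzz and ONNX-generated embeddings); the function names `levenshtein`, `cosine_similarity`, `match_tier`, and `rank_by_similarity`, and the `max_edits` threshold, are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_tier(query: str, text: str, max_edits: int = 1) -> Optional[str]:
    """Classify a single-word query as 'exact', 'fuzzy', or None.

    The tiering rule and the max_edits threshold are assumptions.
    """
    words = text.lower().split()
    needle = query.lower()
    if needle in words:
        return "exact"
    if any(levenshtein(needle, word) <= max_edits for word in words):
        return "fuzzy"
    return None


def rank_by_similarity(query_vec: Sequence[float],
                       doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document keys ordered from most to least similar."""
    return sorted(doc_vecs,
                  key=lambda key: cosine_similarity(query_vec, doc_vecs[key]),
                  reverse=True)
```

For instance, `match_tier("invoce", "Quarterly invoice summary")` falls through the exact tier and is caught by the fuzzy tier, since one insertion turns "invoce" into "invoice". A query with no lexical match at all would be left to the semantic tier, which compares embedding vectors instead of characters.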

Data Flow & Architecture

The application is structured into decoupled modules to ensure maintainability and high performance:

graph TD
    A[Local Directory Scanner] --> B(Document Parsers)
    B -->|Text Extraction| C[Text Chunking & Preprocessing]
    C --> D[ONNX Inference Engine]
    D -->|Dense Vectors| E[(SQLite Vector Store)]
    
    F[User Query] --> G[Query Tokenization]
    G --> H[Query Embedding Generation]
    H --> I{Hybrid Search Dispatcher}
    I --> E
    I -->|Ranked Results| J[Graphical User Interface]

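The indexing half of the diagram (chunking, embedding, storage) can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual schema or code: the `chunks` table layout, the word-window chunk sizes, and all function names are hypothetical, and `toy_embed` is a stand-in for the ONNX inference step that produces no meaningful embeddings.

```python
import sqlite3
from array import array


def chunk_text(text, size=200, overlap=40):
    """Split text into overlapping windows of `size` words (sizes are illustrative)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dims=8):
    """Placeholder for the ONNX model: a character-count hash, NOT a real embedding."""
    vector = [0.0] * dims
    for char in text.lower():
        if char.isalnum():
            vector[ord(char) % dims] += 1.0
    return vector


def to_blob(vector):
    """Pack a vector as 32-bit floats so it fits in a SQLite BLOB column."""
    return array("f", vector).tobytes()


def from_blob(blob):
    """Inverse of to_blob."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


def open_store(path=":memory:"):
    """Open (or create) the hypothetical vector store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, path TEXT NOT NULL, "
        "body TEXT NOT NULL, embedding BLOB NOT NULL)"
    )
    return conn


def index_document(conn, path, text, size=200, overlap=40):
    """Chunk, embed, and store one document; returns the number of chunks written."""
    rows = [(path, chunk, to_blob(toy_embed(chunk)))
            for chunk in chunk_text(text, size, overlap)]
    conn.executemany(
        "INSERT INTO chunks (path, body, embedding) VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

At query time, the stored BLOBs would be read back with `from_blob` and compared against the query vector by cosine similarity. Storing vectors as packed 32-bit floats keeps everything inside a single SQLite file with no additional database extension.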
System Requirements and Installation

Prerequisites

  • Python 3.11 or higher
  • Windows 10 / 11

Developer Setup

  1. Clone the repository:
    git clone https://github.com/Hussein-Furaty/NeuraFind.git
    cd NeuraFind
  2. Environment Configuration:
    python -m venv .venv
    .venv\Scripts\activate
    pip install -r requirements.txt
  3. Application Execution:
    set PYTHONPATH=.
    python src\neurafind\app.py

Production Build (Executable)

To generate a standalone Windows installer using PyInstaller and the Inno Setup compiler:

  1. Install Inno Setup 6: Download the compiler from the Official JRSoftware Website. Tip: Install it in the default directory (C:\Program Files (x86)\Inno Setup 6) so the automated build script can locate the ISCC.exe compiler automatically.

  2. Execute the Build Pipeline: Run the batch script which automates both PyInstaller packaging and Inno Setup compilation:

    scripts\build_all.bat

    Upon successful execution, the final installer (NeuraFind_Setup_v1.0.0.exe) is generated in the dist/ directory, and the intermediate portable files are removed automatically.

Technical Acknowledgements & Dependencies

NeuraFind is built upon robust open-source foundations. The development of this application heavily relied on the following libraries and frameworks:

  • ONNX Runtime: Provides the cross-platform, high-performance machine learning inference engine used for local embedding generation.
  • PySide6 (Qt for Python): The core framework driving the application's graphical user interface and multithreading architecture.
  • Hugging Face Transformers: Provides the tokenizer (XLMRobertaTokenizerFast) used to prepare text for the embedding model.
  • Xenova Models: Acknowledgment for the optimized, quantized ONNX port of the paraphrase-multilingual-MiniLM-L12-v2 model, which makes local inference feasible on standard CPUs.
  • PyMuPDF: Performs text extraction for PDF parsing.
  • RapidFuzz: Implements the optimized string matching metrics used in the fuzzy search module.
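
A transformer model of this kind emits one vector per token, so a pooling step is needed to obtain a single vector per text chunk. The sketch below assumes attention-masked mean pooling followed by L2 normalization, which is a common choice for sentence-transformer models; whether NeuraFind pools and normalizes exactly this way is an assumption, and `mean_pool` is a hypothetical name. Plain Python lists stand in for the model's output tensors.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens, then scale to unit length.

    token_vectors: one list of floats per token.
    attention_mask: 1 for real tokens, 0 for padding.
    """
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        return total  # nothing to pool: return the zero vector
    mean = [value / count for value in total]
    norm = math.sqrt(sum(value * value for value in mean))
    return [value / norm for value in mean] if norm else mean
```

Normalizing to unit length does not change cosine-similarity rankings, but it lets the similarity be computed as a plain dot product.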

Author

Hussein Al-Furati
Cybersecurity Student, Software Developer, and AI Researcher.
Email: hussein.a.habeeb.sec@gmail.com
GitHub: @Hussein-Furaty

License

This project is released under the MIT License. See the LICENSE file for complete details. The accompanying End User License Agreement (EULA) within the installer details specific terms concerning local data privacy and liability.