NeuraFind: Offline Semantic Document Intelligence

A Local-First Information Retrieval and Vector Search Engine for Windows Environments

📥Download Latest Windows Installer

Abstract

NeuraFind addresses the growing privacy concerns associated with cloud-based document search engines by providing a strictly local, offline-first semantic search architecture. By leveraging quantized transformer models via the ONNX Runtime, NeuraFind computes high-dimensional vector embeddings of local documents (PDF, DOCX, XLSX, PPTX) entirely on consumer-grade hardware. This enables users to perform complex conceptual queries and retrieve contextually relevant information without exposing sensitive data to external networks.

Core Architectural Features

- Semantic Vector Space Modeling: Uses a multilingual sentence-transformer model (e.g., paraphrase-multilingual-MiniLM-L12-v2) to map text into a dense vector space, so retrieval is based on meaning rather than on lexical overlap alone.
- Hybrid Retrieval Engine: Implements a multi-tiered search methodology:
  - Exact Matching: Traditional boolean retrieval for precise term isolation.
  - Fuzzy Matching: Levenshtein distance-based algorithms to tolerate typographical errors.
  - Semantic Search: Cosine similarity computations against the local SQLite vector store to retrieve conceptually related documents.
- Absolute Data Sovereignty: The system is designed to operate in air-gapped environments. Document parsing, tokenization, embedding generation, and indexing are confined entirely to the host machine.
- Asynchronous Interface Design: Built with PySide6 (Qt), the graphical interface stays responsive during computationally intensive background tasks (indexing and tensor operations) by running them on separate threads.
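
The three retrieval tiers listed above can be illustrated with a minimal, standard-library-only sketch. It is not NeuraFind's actual implementation: the `chunks` table schema, the float32 BLOB encoding, the function names, and the `max_edits` and `top_k` parameters are all assumptions made for illustration, and the hand-written Levenshtein routine stands in for the RapidFuzz metrics the project uses. The toy 3-dimensional vectors replace real model embeddings.

```python
import math
import sqlite3
import struct


def pack_vector(vec):
    """Serialize a list of floats into a BLOB of little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob):
    """Inverse of pack_vector: a float32 BLOB back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def hybrid_search(conn, query_text, query_vec, max_edits=1, top_k=3):
    """Run all three tiers over every stored chunk (hypothetical schema)."""
    term = query_text.lower()
    exact, fuzzy, scored = [], [], []
    rows = conn.execute("SELECT id, text, embedding FROM chunks ORDER BY id")
    for chunk_id, text, blob in rows:
        words = text.lower().split()
        if term in words:
            exact.append(chunk_id)            # tier 1: exact term match
        elif any(levenshtein(term, w) <= max_edits for w in words):
            fuzzy.append(chunk_id)            # tier 2: within max_edits edits
        score = cosine_similarity(query_vec, unpack_vector(blob))
        scored.append((score, chunk_id))      # tier 3: semantic score
    scored.sort(reverse=True)
    semantic = [chunk_id for _, chunk_id in scored[:top_k]]
    return {"exact": exact, "fuzzy": fuzzy, "semantic": semantic}


# Toy data: in-memory database with hand-made 3-dimensional "embeddings".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT, embedding BLOB)")
samples = [
    (1, "quarterly invoice summary", [1.0, 0.0, 0.0]),
    (2, "unpaid invoce reminder", [0.5, 0.5, 0.0]),  # "invoce" is a deliberate typo
    (3, "holiday photo album", [0.0, 0.0, 1.0]),
]
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?)",
    [(i, text, pack_vector(vec)) for i, text, vec in samples],
)

results = hybrid_search(conn, "invoice", [1.0, 0.0, 0.0], top_k=2)
print(results)
```

Scanning every row is only reasonable for small collections; it keeps the sketch short and makes the scoring logic easy to follow. How the real application merges or ranks the three tiers is not stated here, so the sketch returns them separately.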

Data Flow & Architecture

The application is structured into decoupled modules to ensure maintainability and high performance:

graph TD
    A[Local Directory Scanner] --> B(Document Parsers)
    B -->|Text Extraction| C[Text Chunking & Preprocessing]
    C --> D[ONNX Inference Engine]
    D -->|Dense Vectors| E[(SQLite Vector Store)]
    
    F[User Query] --> G[Query Tokenization]
    G --> H[Query Embedding Generation]
    H --> I{Hybrid Search Dispatcher}
    I --> E
    I -->|Ranked Results| J[Graphical User Interface]
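
The indexing half of the diagram (scanner, parser, chunking, embedding, store) might look roughly like the sketch below. Everything named here is hypothetical: the chunk size and overlap, the `chunks` table schema, and the function names are illustrative choices, plain `.txt` files stand in for the PDF/DOCX/XLSX/PPTX parsers, and `placeholder_embed` is a hash-based stand-in for the ONNX inference step, so its vectors carry no semantic meaning.

```python
import hashlib
import sqlite3
import struct
import tempfile
from pathlib import Path


def chunk_text(text, size=40, overlap=10):
    """Split text into windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def placeholder_embed(text, dim=8):
    """Stand-in for model inference: a deterministic pseudo-vector from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def index_directory(root, conn):
    """Scan `root` for .txt files, chunk them, embed each chunk, store rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks "
        "(path TEXT, position INTEGER, text TEXT, embedding BLOB)"
    )
    count = 0
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for position, chunk in enumerate(chunk_text(text)):
            vec = placeholder_embed(chunk)
            blob = struct.pack(f"<{len(vec)}f", *vec)  # float32 BLOB
            conn.execute(
                "INSERT INTO chunks VALUES (?, ?, ?, ?)",
                (str(path), position, chunk, blob),
            )
            count += 1
    conn.commit()
    return count


# Demo: index a temporary directory holding one 100-word document.
conn = sqlite3.connect(":memory:")
with tempfile.TemporaryDirectory() as tmp:
    sample = " ".join(f"w{i}" for i in range(100))
    Path(tmp, "sample.txt").write_text(sample, encoding="utf-8")
    indexed = index_directory(tmp, conn)

print(indexed)
```

Overlapping windows are a common way to avoid cutting a sentence's context at a chunk boundary; whether NeuraFind chunks by words, sentences, or tokens is not specified in this README.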

System Requirements and Installation

Prerequisites

Python 3.11 or higher
Windows 10 / 11

Developer Setup

Clone the repository:

git clone https://github.com/Hussein-Furaty/NeuraFind.git
cd NeuraFind

Environment Configuration:

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

Application Execution:

set PYTHONPATH=.
python src\neurafind\app.py

Production Build (Executable)

To generate a standalone Windows installer utilizing PyInstaller and the Inno Setup compiler:

Install Inno Setup 6: Download the compiler from the Official JRSoftware Website. Tip: Install it in the default directory (C:\Program Files (x86)\Inno Setup 6) so the automated build script can locate the ISCC.exe compiler automatically.
Execute the Build Pipeline: Run the batch script which automates both PyInstaller packaging and Inno Setup compilation:
```
scripts\build_all.bat
```
On success, the final installer (NeuraFind_Setup_v1.0.0.exe) is written to the dist/ directory, and the intermediate portable files are cleaned up automatically.

Technical Acknowledgements & Dependencies

NeuraFind is built upon robust open-source foundations. The development of this application heavily relied on the following libraries and frameworks:

ONNX Runtime: Provides the cross-platform, high-performance machine learning inference engine used for local embedding generation.
PySide6 (Qt for Python): The core framework driving the application's graphical user interface and multithreading architecture.
Hugging Face Transformers: Utilized for text tokenization algorithms (XLMRobertaTokenizerFast).
Xenova Models: Acknowledgment for the optimized, quantized ONNX port of the paraphrase-multilingual-MiniLM-L12-v2 model, which makes local inference feasible on standard CPUs.
PyMuPDF: Enables highly efficient text extraction algorithms for PDF parsing.
RapidFuzz: Implements the optimized string matching metrics used in the fuzzy search module.

Author

Hussein Al-Furati
Cybersecurity Student, Software Developer, and AI Researcher.
Email: hussein.a.habeeb.sec@gmail.com
GitHub: @Hussein-Furaty

License

This project is released under the MIT License. See the LICENSE file for complete details. The accompanying End User License Agreement (EULA) within the installer details specific terms concerning local data privacy and liability.
