Commit 6d35dc3

Author: svalench
feat(sprint4): smart indexing, event streaming, production hardening
Adds opt-in SmartIndexer that builds structured per-section documents, an event hub for the LangGraph search pipeline, and an NDJSON/SSE streaming endpoint. All features are off by default and gated behind flags so the public API stays backward compatible.

Smart indexing
* SmartIndexer pipeline: inspect_model -> collect_fields -> enrich_document -> embed_batch -> persist
* DocumentTemplate / FieldSection dataclasses; per-model templates via SMART_INDEXING.TEMPLATES, with a deterministic heuristic fallback (default_template_for) when no template is registered
* Always appends legacy text as safety net so smart indexing never produces less content than the classic indexer
* New get_indexer() factory in indexer.py routes to SmartIndexer when enabled and back to Indexer otherwise. Signals, app-level index(), and the build_search_index command all use the factory now.

Event hub & streaming endpoint
* events.EventHub with subscribe/publish, thread-safe and tolerant of subscriber exceptions; get_default_hub() singleton helper
* Both LangGraph and the in-tree fallback runner emit lifecycle events: query_received, query_expanded, vector_search_completed, rerank_completed, completed
* Searcher now accepts an event_hub kwarg and forwards it into the graph factory only when supported (TypeError-tolerant)
* StreamingSearchAPIView: NDJSON (default) or SSE, runs search in a worker thread, drains events through queue.Queue, terminal 'end' event for reliable client-side stream termination, sets X-Accel-Buffering: no for proxies
* Endpoint returns 404 when STREAMING.ENABLED is false

Hardening
* Settings additions are dataclass-frozen and validated (STREAMING.FORMAT must be ndjson|sse, SMART_INDEXING.TEMPLATES must be a dict, etc.)
* No new hard dependencies; LangGraph and LLM SDKs remain optional

Tests: +22 tests (10 smart indexing, 12 events/streaming). All 58 tests pass.
1 parent e4ac1d9 commit 6d35dc3

14 files changed

Lines changed: 1357 additions & 25 deletions

README.md

Lines changed: 91 additions & 0 deletions
@@ -308,6 +308,97 @@ Built-in memory backends:
 Bring your own by subclassing `BaseMemoryBackend` and pointing
 `MEMORY_BACKEND` at the dotted path.
 
+## Smart indexing (optional)
+
+The classic indexer joins selected fields with whitespace. That works, but the
+embedding model loses the *role* of each value: a category name and a body
+paragraph become indistinguishable tokens. The optional `SmartIndexer` builds
+structured documents with labelled sections so the embedder sees something
+closer to:
+
+```
+Title: Pixel 8
+Description:
+Camera-first Android phone with Tensor G3.
+Category: Phones
+```
+
+Enable it from settings — your existing index, signals, and management command
+keep working because the resolver and `get_indexer()` factory pick the new
+implementation transparently:
+
+```python
+GRAPH_SEARCH = {
+    # ... your existing config ...
+    "SMART_INDEXING": {
+        "ENABLED": True,
+        # Optional per-model templates; the indexer falls back to a heuristic
+        # template based on your MODELS config when one is missing.
+        "TEMPLATES": {
+            "shop.Product": {
+                "title_field": "name",
+                "sections": [
+                    {"label": "Description", "field": "description", "multiline": True},
+                    {"label": "Category", "field": "category__name"},
+                ],
+            }
+        },
+    },
+}
+```
+
+The original deterministic text is always appended as a safety net so smart
+indexing never produces *less* searchable content than the legacy pipeline.
+Disable the flag to fall back instantly — no reindex required to switch back.
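To make the labelled-section idea concrete, here is a minimal sketch of how a template like the one in the README snippet could be rendered into the structured text shown there. `render_document` is a hypothetical helper written for illustration; it is not the project's `SmartIndexer`, and it assumes field values have already been collected into a flat dict keyed by field path.

```python
from typing import Any, Dict, List


def render_document(
    template: Dict[str, Any],
    values: Dict[str, Any],
    legacy_text: str = "",
) -> str:
    """Hypothetical helper: build labelled text from a template dict."""
    lines: List[str] = []

    # The title field, if configured and present, becomes the first line.
    title = values.get(template.get("title_field", ""))
    if title:
        lines.append(f"Title: {title}")

    for section in template.get("sections", []):
        value = values.get(section["field"])
        if value in (None, ""):
            continue  # skip empty fields rather than emitting bare labels
        if section.get("multiline"):
            # Multiline sections put the label on its own line.
            lines.append(f"{section['label']}:")
            lines.append(str(value))
        else:
            lines.append(f"{section['label']}: {value}")

    # Assumed safety net: append the classic flat text when it is supplied.
    if legacy_text:
        lines.append(legacy_text)
    return "\n".join(lines)


template = {
    "title_field": "name",
    "sections": [
        {"label": "Description", "field": "description", "multiline": True},
        {"label": "Category", "field": "category__name"},
    ],
}
values = {
    "name": "Pixel 8",
    "description": "Camera-first Android phone with Tensor G3.",
    "category__name": "Phones",
}
document = render_document(template, values)
print(document)
```

With the sample values this prints the same four lines as the README example. How the real indexer resolves `category__name` or orders sections is not shown in this diff, so treat those details as assumptions.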
+
+## Streaming search endpoint (optional)
+
+Long-running pipelines (query expansion, vector search, reranking) can stream
+lifecycle events to the client so users see progress instead of staring at a
+spinner. Two transports are supported:
+
+- `ndjson` (default): one JSON object per line, ideal for `fetch` +
+  `ReadableStream` and CLI tools like `jq`.
+- `sse`: Server-Sent Events for `EventSource` clients.
+
+Enable from settings:
+
+```python
+GRAPH_SEARCH = {
+    # ... your existing config ...
+    "STREAMING": {
+        "ENABLED": True,
+        "FORMAT": "ndjson",  # or "sse"
+        "INCLUDE_INTERNAL_EVENTS": True,
+    },
+}
+```
+
+The endpoint is registered at `/<API_URL_PREFIX>/stream/` (default
+`/api/search/stream/`) and returns HTTP 404 when disabled, so it is safe to
+leave the URL config untouched.
+
+Quick test:
+
+```bash
+curl -N "http://localhost:8000/api/search/stream/?q=phone"
+```
+
+Example event sequence (NDJSON):
+
+```json
+{"type": "query_received", "query": "phone"}
+{"type": "vector_search_completed", "candidate_count": 12}
+{"type": "completed", "total": 5}
+{"type": "results", "results": [...], "total": 5}
+{"type": "end"}
+```
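On the client side, an NDJSON stream like the sequence above can be consumed one line at a time. The sketch below is an illustration only: `iter_events` is a hypothetical helper, and the sample lines reuse the documented event types with an empty `results` list standing in for real hits.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Union


def iter_events(lines: Iterable[Union[bytes, str]]) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield decoded events until the terminal 'end'."""
    for raw in lines:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        text = text.strip()
        if not text:
            continue  # tolerate blank keep-alive lines
        event = json.loads(text)
        if event.get("type") == "end":
            return  # terminal event: stop reading even if bytes remain
        yield event


# Sample stream; the line after "end" shows that reading stops there.
sample = [
    b'{"type": "query_received", "query": "phone"}\n',
    b'{"type": "vector_search_completed", "candidate_count": 12}\n',
    b'{"type": "completed", "total": 5}\n',
    b'{"type": "results", "results": [], "total": 5}\n',
    b'{"type": "end"}\n',
    b'{"type": "ignored"}\n',
]
seen = [event["type"] for event in iter_events(sample)]
print(seen)
```

Any iterable of lines works as input, so the same function could be fed an HTTP response object that iterates line by line instead of the in-memory sample.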
+
+Under the hood the view subscribes a `queue.Queue` to a per-request
+`EventHub`, runs the search in a worker thread, and yields events as soon as
+the nodes publish them. The hub also powers structured logging and any
+custom subscribers you register from your own apps.
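The worker-thread-plus-queue pattern described here can be sketched without Django. Everything below is an assumption made for illustration: `TinyHub` is a simplified stand-in for `EventHub`, `stream_search` and `fake_search` are hypothetical, and the `error` event type is invented for the sketch rather than taken from the commit.

```python
import queue
import threading
from typing import Any, Callable, Dict, Iterator, List

Event = Dict[str, Any]


class TinyHub:
    """Simplified stand-in for EventHub (no locking, no error handling)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: Event) -> None:
        for callback in list(self._subscribers):
            callback(event)


def stream_search(run_search: Callable[[TinyHub], List[Any]]) -> Iterator[Event]:
    """Run the search in a worker thread and yield events as they arrive."""
    hub = TinyHub()
    events = queue.Queue()
    hub.subscribe(events.put)  # every published event lands in the queue

    def worker() -> None:
        try:
            results = run_search(hub)
            events.put({"type": "results", "results": results, "total": len(results)})
        except Exception as exc:  # hypothetical error event for this sketch
            events.put({"type": "error", "message": str(exc)})
        finally:
            events.put({"type": "end"})  # always terminate the stream

    threading.Thread(target=worker, daemon=True).start()
    while True:
        event = events.get()  # blocks until the worker publishes something
        yield event
        if event["type"] == "end":
            break


def fake_search(hub: TinyHub) -> List[str]:
    hub.publish({"type": "query_received", "query": "phone"})
    hub.publish({"type": "completed", "total": 1})
    return ["Pixel 8"]


emitted = [event["type"] for event in stream_search(fake_search)]
print(emitted)
```

Putting the `end` event in a `finally` block means the consumer loop always terminates, even when the search raises, which is the property the commit message calls reliable client-side stream termination.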
+
 ## Comparison
 
 | Feature | django-graph-search | Haystack | django-elasticsearch-dsl |

src/django_graph_search/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1,7 +1,7 @@
 from __future__ import annotations
 
 from .apps import DjangoGraphSearchConfig
-from .indexer import Indexer
+from .indexer import Indexer, get_indexer
 from .searcher import Searcher
 from .settings import get_settings
 
@@ -17,7 +17,7 @@ def index(instance) -> None:
     model_cfg = next((cfg for cfg in config.models if cfg.model == instance._meta.label), None)
     if model_cfg is None:
         return
-    Indexer(config=config).index_instance(instance, model_cfg)
+    get_indexer(config=config).index_instance(instance, model_cfg)
 
 
 def get_similar(instance, limit: int | None = None):
@@ -27,6 +27,7 @@ def get_similar(instance, limit: int | None = None):
 __all__ = [
     "DjangoGraphSearchConfig",
     "Indexer",
+    "get_indexer",
     "Searcher",
     "search",
     "index",

src/django_graph_search/events.py

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
1+
"""Lightweight event hook system used by the LangGraph pipelines.
2+
3+
The hub is intentionally tiny: it stores a list of subscribers and broadcasts
4+
events via plain function calls. It does not try to be a full pub/sub system —
5+
the intent is to give callers (HTTP endpoints, structured logging, future
6+
streaming integrations) a single place to plug in.
7+
8+
Events are dictionaries with a ``type`` key and optional payload. Examples:
9+
10+
* ``{\"type\": \"query_received\", \"query\": \"phone\"}``
11+
* ``{\"type\": \"vector_search_started\", \"queries\": [\"phone\", \"phones\"]}``
12+
* ``{\"type\": \"completed\", \"total\": 12}``
13+
"""
14+
from __future__ import annotations
15+
16+
import logging
17+
import threading
18+
from typing import Any, Callable, Dict, List, Optional
19+
20+
log = logging.getLogger(__name__)
21+
22+
EventCallback = Callable[[Dict[str, Any]], None]
23+
24+
25+
class EventHub:
26+
"""Per-instance hub. Use :func:`get_default_hub` for the global one."""
27+
28+
def __init__(self) -> None:
29+
self._subscribers: List[EventCallback] = []
30+
self._lock = threading.Lock()
31+
32+
def subscribe(self, callback: EventCallback) -> Callable[[], None]:
33+
"""Register ``callback`` and return a function that removes it."""
34+
with self._lock:
35+
self._subscribers.append(callback)
36+
37+
def _unsubscribe() -> None:
38+
with self._lock:
39+
try:
40+
self._subscribers.remove(callback)
41+
except ValueError: # pragma: no cover - already removed.
42+
pass
43+
44+
return _unsubscribe
45+
46+
def publish(self, event: Dict[str, Any]) -> None:
47+
"""Broadcast ``event`` to subscribers. Errors are logged, never raised."""
48+
with self._lock:
49+
subs = list(self._subscribers)
50+
for cb in subs:
51+
try:
52+
cb(event)
53+
except Exception as exc: # noqa: BLE001
54+
log.warning("Event subscriber raised: %s", exc)
55+
56+
57+
_default_hub: Optional[EventHub] = None
58+
_default_lock = threading.Lock()
59+
60+
61+
def get_default_hub() -> EventHub:
62+
global _default_hub
63+
if _default_hub is None:
64+
with _default_lock:
65+
if _default_hub is None:
66+
_default_hub = EventHub()
67+
return _default_hub
68+
69+
70+
def reset_default_hub() -> None: # pragma: no cover - testing helper
71+
global _default_hub
72+
_default_hub = EventHub()
73+
74+
75+
__all__ = ["EventHub", "EventCallback", "get_default_hub", "reset_default_hub"]

src/django_graph_search/indexer.py

Lines changed: 26 additions & 0 deletions
@@ -19,6 +19,32 @@ def make_doc_id(model_label: str, pk: object) -> str:
     return f"{model_label}:{pk}"
 
 
+def get_indexer(
+    config: Optional[GraphSearchConfig] = None,
+    **kwargs,
+):
+    """Return the configured indexer instance.
+
+    By default returns :class:`Indexer`. When
+    ``GRAPH_SEARCH["SMART_INDEXING"]["ENABLED"]`` is ``True`` the dotted-path
+    in ``SMART_INDEXING.INDEXER`` is loaded instead — typically
+    :class:`~django_graph_search.langgraph_indexer.SmartIndexer`.
+
+    The factory keeps the public surface stable: callers do not need to know
+    which implementation they are using because both expose the same methods
+    (``index_queryset``, ``index_instance``, ``delete_instance``,
+    ``rebuild_all``).
+    """
+    from .settings import get_settings
+    from django.utils.module_loading import import_string
+
+    cfg = config or get_settings()
+    if not cfg.smart_indexing.enabled:
+        return Indexer(config=cfg, **kwargs)
+    indexer_cls = import_string(cfg.smart_indexing.indexer)
+    return indexer_cls(config=cfg, **kwargs)
+
+
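The flag-gated factory in `get_indexer()` follows a common pattern: return the classic class unless a flag is set, otherwise resolve a class from a dotted path. The sketch below shows that pattern in isolation under stated assumptions. `load_dotted_path` uses `importlib` as a stand-in for Django's `import_string` so it runs without Django, and `ClassicIndexer` and `make_indexer` are hypothetical names, not part of the package.

```python
from importlib import import_module
from typing import Any


def load_dotted_path(dotted_path: str) -> Any:
    """Resolve 'package.module.Name' to the object it names."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ImportError(f"{dotted_path!r} is not a dotted path")
    module = import_module(module_path)
    try:
        return getattr(module, attr)
    except AttributeError as exc:
        raise ImportError(f"{module_path!r} has no attribute {attr!r}") from exc


class ClassicIndexer:
    """Hypothetical default implementation used when the flag is off."""


def make_indexer(enabled: bool, dotted_path: str, **kwargs: Any) -> Any:
    # Only touch the import machinery when the optional feature is enabled,
    # so a bad dotted path cannot break the default code path.
    cls = load_dotted_path(dotted_path) if enabled else ClassicIndexer
    return cls(**kwargs)


default_indexer = make_indexer(False, "collections.OrderedDict")
custom_indexer = make_indexer(True, "collections.OrderedDict")
print(type(default_indexer).__name__, type(custom_indexer).__name__)
```

`collections.OrderedDict` is used only because it is always importable; in the real factory the path comes from `SMART_INDEXING.INDEXER`.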
 class Indexer(ComponentMixin):
     def __init__(
         self,

src/django_graph_search/langgraph_agent.py

Lines changed: 63 additions & 8 deletions
@@ -23,6 +23,7 @@
 import logging
 from typing import Any, Callable, Dict, Iterable, List, Optional, Sequence, TypedDict
 
+from .events import EventHub
 from .llm.base import BaseLLMBackend, RerankCandidate
 from .settings import GraphSearchConfig
 
@@ -232,34 +233,65 @@ def postprocess_results_node(state: SearchState) -> SearchState:
 # ---------------------------------------------------------------------------
 
 
-def build_search_graph(config: GraphSearchConfig, *, embedding_backend, vector_store, llm: BaseLLMBackend):
+def build_search_graph(
+    config: GraphSearchConfig,
+    *,
+    embedding_backend,
+    vector_store,
+    llm: BaseLLMBackend,
+    event_hub: Optional[EventHub] = None,
+):
     """Build and compile the search graph.
 
     When the ``langgraph`` package is available we return a compiled
     LangGraph ``StateGraph``. Otherwise we return :class:`_FallbackGraph` so
     the rest of the code stays identical.
+
+    Pass ``event_hub`` to receive lifecycle events (``query_received``,
+    ``query_expanded``, ``vector_search_completed``, ``rerank_completed``,
+    ``completed``) — the same hub powers the streaming HTTP endpoint.
     """
     try:
         from langgraph.graph import END, StateGraph  # type: ignore
     except Exception:  # pragma: no cover - exercised when langgraph absent.
-        return _FallbackGraph(config=config, embedding_backend=embedding_backend,
-                              vector_store=vector_store, llm=llm)
+        return _FallbackGraph(
+            config=config,
+            embedding_backend=embedding_backend,
+            vector_store=vector_store,
+            llm=llm,
+            event_hub=event_hub,
+        )
+
+    def _wrap(name: str, fn):
+        if event_hub is None:
+            return fn
+
+        def _wrapped(s):
+            event_hub.publish({"type": f"{name}_started", "query": s.get("normalized_query") or s.get("query")})
+            out = fn(s)
+            event_hub.publish({
+                "type": f"{name}_completed",
+                "candidate_count": len(out.get("merged_results") or out.get("raw_results") or []),
+            })
+            return out
+
+        return _wrapped
 
     graph: Any = StateGraph(dict)
-    graph.add_node("analyze_query", lambda s: analyze_query_node(s, config=config))
+    graph.add_node("analyze_query", _wrap("analyze_query", lambda s: analyze_query_node(s, config=config)))
     graph.add_node(
         "expand_query",
-        lambda s: expand_query_node(s, config=config, llm=llm),
+        _wrap("expand_query", lambda s: expand_query_node(s, config=config, llm=llm)),
     )
     graph.add_node(
         "vector_search",
-        lambda s: vector_search_node(s, embedding_backend=embedding_backend, vector_store=vector_store),
+        _wrap("vector_search", lambda s: vector_search_node(s, embedding_backend=embedding_backend, vector_store=vector_store)),
     )
     graph.add_node(
         "rerank_results",
-        lambda s: rerank_results_node(s, config=config, llm=llm),
+        _wrap("rerank_results", lambda s: rerank_results_node(s, config=config, llm=llm)),
     )
-    graph.add_node("postprocess_results", lambda s: postprocess_results_node(s))
+    graph.add_node("postprocess_results", _wrap("postprocess_results", lambda s: postprocess_results_node(s)))
 
     graph.set_entry_point("analyze_query")
     graph.add_conditional_edges(
@@ -295,24 +327,47 @@ def __init__(
         embedding_backend,
         vector_store,
         llm: BaseLLMBackend,
+        event_hub: Optional[EventHub] = None,
     ) -> None:
         self.config = config
         self.embedding_backend = embedding_backend
         self.vector_store = vector_store
         self.llm = llm
+        self.event_hub = event_hub
+
+    def _emit(self, event: Dict[str, Any]) -> None:
+        if self.event_hub is not None:
+            self.event_hub.publish(event)
 
     def invoke(self, state: SearchState) -> SearchState:
+        self._emit({"type": "query_received", "query": state.get("query") or ""})
         state = analyze_query_node(state, config=self.config)
         if self.config.langgraph.query_expansion:
             state = expand_query_node(state, config=self.config, llm=self.llm)
+            self._emit({
+                "type": "query_expanded",
+                "queries": list(state.get("expanded_queries") or []),
+            })
         state = vector_search_node(
             state,
             embedding_backend=self.embedding_backend,
             vector_store=self.vector_store,
         )
+        self._emit({
+            "type": "vector_search_completed",
+            "candidate_count": len(state.get("merged_results") or []),
+        })
         if self.config.langgraph.reranking:
             state = rerank_results_node(state, config=self.config, llm=self.llm)
+            self._emit({
+                "type": "rerank_completed",
+                "candidate_count": len(state.get("reranked_results") or []),
+            })
         state = postprocess_results_node(state)
+        self._emit({
+            "type": "completed",
+            "total": len(state.get("final_results") or []),
+        })
         return state
 
 