Replies: 3 comments
-
Maybe try reducing the chunk overlap size and the number of source documents the answer references.
-
You can enable the GPU functionality.
-
Query speed in RAG systems has three distinct bottlenecks, and each needs a different fix:

Embedding latency: if you're using a local embedding model (sentence-transformers, etc.), warm it up at startup and keep it in memory. Cold-loading the embedding model per query adds 2-5 seconds. If query volume is high, consider a dedicated embedding service (a separate process) so it's always warm.

ANN search latency: FAISS on CPU is fine for <100K chunks. Above that, approximate search (IVF/HNSW) becomes important. Also, cache the most frequent query embeddings. If 20% of queries are "what is X?" variants, those embeddings are stable and can be pre-computed.

LLM generation latency: this is usually the biggest factor, but also the hardest to compress. Options:
Caching: cache both retrieval results (same query → same chunks) and generation results (same query + same chunks → same answer) with a short TTL. For documentation QA, many queries repeat.

We built a multi-tier latency model for KinthAI's agent queries: https://blog.kinthai.ai/openclaw-multi-tenancy-why-vm-per-user-doesnt-scale

Which step is your bottleneck: retrieval, reranking, or generation?
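To make the warm-up and caching ideas above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration, not code from any particular RAG tool: `WarmEmbedder`, `TTLCache`, `normalize`, and `answer` are hypothetical names, and `load_model`, `search`, and `generate` stand in for whatever your stack uses to load the embedding model, query the vector index, and call the LLM.

```python
import time
from functools import lru_cache
from typing import Any, Callable, Dict, Hashable, Sequence, Tuple


def normalize(query: str) -> str:
    """Collapse case and whitespace so near-identical queries share a cache key.
    Assumption: your embedding model is not sensitive to case."""
    return " ".join(query.lower().split())


class WarmEmbedder:
    """Loads the embedding model once and memoizes query embeddings."""

    def __init__(self, load_model: Callable[[], Callable[[str], Sequence[float]]],
                 cache_size: int = 1024) -> None:
        self._model = load_model()  # pay the load cost at startup, not per query
        self._model("warm-up")      # trigger any lazy initialization now
        self._cached = lru_cache(maxsize=cache_size)(
            lambda text: tuple(self._model(text))
        )

    def embed(self, query: str) -> Tuple[float, ...]:
        return self._cached(normalize(query))


class TTLCache:
    """Tiny time-to-live cache: entries older than ttl_seconds are recomputed."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute()
        self._store[key] = (now, value)
        return value


def answer(query: str, embedder: WarmEmbedder,
           search: Callable[[Tuple[float, ...]], Sequence[str]],
           generate: Callable[[str, Tuple[str, ...]], str],
           retrieval_cache: TTLCache, generation_cache: TTLCache) -> str:
    key = normalize(query)
    # Tier 1: same query -> same chunks.
    chunks = retrieval_cache.get_or_compute(
        key, lambda: tuple(search(embedder.embed(query))))
    # Tier 2: same query + same chunks -> same answer.
    return generation_cache.get_or_compute(
        (key, chunks), lambda: generate(query, chunks))
```

In this sketch a repeated question skips the embedding, search, and generation steps entirely until the TTL expires; after expiry only search and generation run again, because the query embedding stays memoized. A short TTL is the safer default if the underlying documents change often.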
-
We are experimenting with this internally at work on some of our proprietary codebases and documentation, and we're really excited about its privacy-focused nature. What can we, as non-AI developers, do to make queries faster? They currently take about 1-2 minutes for us, which is not quite what we're looking for. Thanks!