Ideas on making the querying by the model, faster? #223

frroossst · 2023-05-16T21:22:43Z


We are experimenting with this internally at work on some of our proprietary codebases and documentation, and we're really excited about its privacy-focused nature. What can we, as non-AI developers, do to make queries faster? They currently take about 1-2 minutes each, which is not quite what we're looking for. Thanks!

thekit · 2023-05-17T23:25:17Z


Maybe try reducing the chunk overlap size and the number of source documents the answer references.
#251 (comment)
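
To illustrate why chunk overlap matters, here is a minimal sketch of a character-window splitter. It is not the project's actual ingestion code; `split_into_chunks` and its parameters are hypothetical names used only for illustration. The point is that a larger overlap produces more chunks from the same text, so there is more to embed, index, and search.

```python
def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows (hypothetical helper).

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def count_chunks(text: str, chunk_size: int, chunk_overlap: int) -> int:
    """Convenience wrapper for comparing settings."""
    return len(split_into_chunks(text, chunk_size, chunk_overlap))
```

Comparing `count_chunks(doc, 500, 0)` with `count_chunks(doc, 500, 250)` on the same document shows how quickly the index grows as the overlap increases.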


maozdemir · 2023-05-22T10:16:08Z


You can enable the GPU functionality.
The notebook below is for Linux and Google Colab, though not much is different on Windows:

https://github.com/maozdemir/privateGPT-colab


kinthaiofficial · 2026-04-29T00:37:05Z


Query speed in RAG systems has three distinct bottlenecks and they each need different fixes:

Embedding latency — if you're using a local embedding model (sentence-transformers, etc.), warm it up at startup and keep it in memory. Cold-loading the embedding model per query adds 2-5 seconds. If query volume is high, consider a dedicated embedding service (separate process) so it's always warm.
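
A rough sketch of the warm-up idea, assuming the model can be held as a process-wide singleton. `FakeEmbeddingModel` is a hypothetical stand-in for whatever real embedding model you load (its `time.sleep` simulates the expensive load); the other function names are also made up for illustration.

```python
import time
from functools import lru_cache


class FakeEmbeddingModel:
    """Hypothetical stand-in for a real embedding model."""

    load_count = 0  # how many times the "expensive" load ran

    def __init__(self) -> None:
        FakeEmbeddingModel.load_count += 1
        time.sleep(0.05)  # simulate slow model loading

    def encode(self, text: str) -> list[float]:
        # Toy embedding, only here so the sketch runs end to end.
        return [float(len(text)), float(sum(map(ord, text)) % 97)]


@lru_cache(maxsize=1)
def get_embedding_model() -> FakeEmbeddingModel:
    """Load the model once per process and reuse it afterwards."""
    return FakeEmbeddingModel()


def warm_up() -> None:
    """Call once at application startup so the first query is not cold."""
    get_embedding_model().encode("warm-up")


def embed_query(query: str) -> list[float]:
    # Reuses the already-loaded model instead of reloading per query.
    return get_embedding_model().encode(query)
```

Calling `warm_up()` during startup moves the load cost out of the first user query; every later `embed_query` call reuses the same in-memory instance.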

ANN search latency — FAISS on CPU is fine for <100K chunks. Above that, approximate search (IVF/HNSW) becomes important. Also: cache the most frequent query embeddings. If 20% of queries are "what is X?" variants, those embeddings are stable and can be pre-computed.

LLM generation latency — this is usually the biggest factor, but also the hardest to compress. Options:

Context pruning — don't pass all K retrieved chunks; score them by relevance and pass the top 3-4. Less context = faster generation.
Model quantization — 4-bit GGUF models (via llama.cpp) are 2-4x faster than float16 with minimal quality loss for RAG synthesis tasks.
Streaming — start rendering the response as it generates rather than waiting for the full response. User-perceived latency drops significantly even if total time is the same.
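
For the context pruning option, a minimal sketch might be the following. It assumes the retriever already returns a relevance score where higher means more relevant; `ScoredChunk`, `prune_context`, and `build_prompt` are illustrative names, not part of any particular library.

```python
from typing import NamedTuple


class ScoredChunk(NamedTuple):
    text: str
    score: float  # assumption: higher means more relevant


def prune_context(
    chunks: list[ScoredChunk], top_n: int = 3, min_score: float = 0.0
) -> list[ScoredChunk]:
    """Keep only the top_n chunks whose score is at least min_score."""
    kept = [c for c in chunks if c.score >= min_score]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:top_n]


def build_prompt(question: str, chunks: list[ScoredChunk]) -> str:
    """Assemble a prompt from the pruned chunks only."""
    context = "\n\n".join(c.text for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The prompt sent to the model then contains only the strongest few chunks, which shortens the input the LLM has to process before it starts generating.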

Caching — cache both retrieval results (same query → same chunks) and generation results (same query + same chunks → same answer) with a short TTL. For documentation QA, many queries repeat.
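
One way to sketch the two cache layers with a short TTL, using only the standard library. Everything here (`TTLCache`, `cached_call`, `answer`, and the `retrieve`/`generate` callables) is hypothetical scaffolding; it also assumes retrieved chunks can be represented as a hashable tuple of strings.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cached_call(cache: TTLCache, key: Hashable, compute: Callable[[], Any]) -> Any:
    """Return the cached value, or compute and store it (None is never cached)."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value)
    return value


def answer(query: str, retrieve, generate,
           retrieval_cache: TTLCache, generation_cache: TTLCache) -> str:
    # Layer 1: same query -> same chunks.
    chunks = cached_call(retrieval_cache, query, lambda: tuple(retrieve(query)))
    # Layer 2: same query + same chunks -> same answer.
    return cached_call(generation_cache, (query, chunks),
                       lambda: generate(query, chunks))
```

Keying the generation cache on both the query and the chunks means a re-ingested document (which changes the retrieved chunks) does not serve a stale answer.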

We built a multi-tier latency model for KinthAI's agent queries: https://blog.kinthai.ai/openclaw-multi-tenancy-why-vm-per-user-doesnt-scale

Which step is your bottleneck — retrieval, reranking, or generation?
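
If you are not sure which step dominates, a small timing wrapper around each stage is usually enough to find out. This is a sketch with made-up names (`timed`, `slowest_stage`); wrap your own retrieval and generation calls with it.

```python
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def timed(stage: str, timings: dict[str, float]) -> Iterator[None]:
    """Accumulate wall-clock seconds spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[stage] = timings.get(stage, 0.0) + elapsed


def slowest_stage(timings: dict[str, float]) -> str:
    """Name of the stage with the largest accumulated time."""
    return max(timings, key=lambda stage: timings[stage])
```

Usage would look like `with timed("retrieval", timings): ...` around each step of one query, followed by `slowest_stage(timings)` to see where the 1-2 minutes actually go.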

