Optimal RAM & does GPU make any difference? #219

jjsarf · 2023-05-15T20:50:13Z

Does anyone know how much RAM is best for running privateGPT? Also, does the GPU play any role? If so, what config settings could we use to optimize performance? It takes minutes to get a response, irrespective of which CPU generation I run this on. Also, is there a smaller model we can use that would be faster?

ji8sw · 2023-05-15T22:27:46Z

I think the RAM needed depends on the size of your model. A number is given when you start privateGPT, something like 10GB. With this configuration it cannot access the GPU's resources, which is very unfortunate because the GPU would be much faster. There are smaller models (I'm not sure what's compatible with privateGPT), but the smaller the model, the "dumber" it is. I'm not sure where to find models, but if someone knows, do tell.

0 replies

jjsarf · 2023-05-15T22:33:24Z

Author

Can you be more specific? I did not see anywhere to allocate the memory... Thank you.

0 replies

ji8sw · 2023-05-15T22:40:15Z

When you start privateGPT, one of the first messages that pops up is a memory message. I'm not home, so I don't know the exact wording.

0 replies

thekit · 2023-05-17T23:31:02Z

I am running with 16GB of RAM using Vicuna (#233).

0 replies

kinthaiofficial · 2026-04-29T01:51:01Z

GPU vs. CPU makes a significant difference for local LLMs. Here's the practical breakdown:

RAM requirements (GPU VRAM + system RAM):

7B models: ~5-8GB (Q4 quant), ~14GB (FP16)
13B models: ~8-10GB (Q4), ~26GB (FP16)
34B models: ~20-22GB (Q4), ~68GB (FP16), so FP16 needs multiple GPUs
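
The figures above follow roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic, assuming about 4.5 bits per weight for a Q4-style quant and a flat 1GB allowance for the KV cache and buffers. Both numbers are my own rough assumptions, not something llama.cpp reports, and `estimate_model_gb` is a made-up helper name:

```python
def estimate_model_gb(params_billion: float, bits_per_weight: float,
                      overhead_gb: float = 1.0) -> float:
    """Rough memory estimate in GB: weights plus a flat overhead allowance.

    overhead_gb is an assumed allowance for KV cache and scratch buffers;
    real usage grows with context length.
    """
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weight_gb + overhead_gb


if __name__ == "__main__":
    for name, params in [("7B", 7), ("13B", 13), ("34B", 34)]:
        q4 = estimate_model_gb(params, 4.5)    # assumed ~4.5 bits/weight for Q4
        fp16 = estimate_model_gb(params, 16)   # 16 bits/weight for FP16
        print(f"{name}: ~{q4:.1f} GB (Q4), ~{fp16:.1f} GB (FP16)")
```

Treat the result as a lower bound. Longer contexts and larger batch sizes add to it.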

GPU vs CPU speed (rough numbers for 7B Q4):

CPU only (16 cores): ~2-8 tokens/sec
GPU (RTX 3080 10GB): ~30-50 tokens/sec
GPU (RTX 4090 24GB): ~80-120 tokens/sec
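
To relate these rates to the "minutes per response" complaint, here is a small sketch that converts tokens/sec into wall-clock time for one answer. The 300-token answer length is a hypothetical figure, and prompt processing time is ignored, so real RAG responses will take longer:

```python
def response_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate output_tokens at a steady generation rate."""
    return output_tokens / tokens_per_sec


if __name__ == "__main__":
    ANSWER_TOKENS = 300  # hypothetical answer length
    # (slow, fast) tokens/sec ranges copied from the rough numbers above
    speeds = {
        "CPU only (16 cores)": (2, 8),
        "GPU (RTX 3080 10GB)": (30, 50),
        "GPU (RTX 4090 24GB)": (80, 120),
    }
    for hardware, (slow, fast) in speeds.items():
        best = response_seconds(ANSWER_TOKENS, fast)
        worst = response_seconds(ANSWER_TOKENS, slow)
        print(f"{hardware}: {best:.1f} to {worst:.1f} seconds")
```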

For RAG workloads (like privateGPT), generation isn't the only bottleneck. Embedding speed matters too, and the GPU helps with both:

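# llama.cpp via llama-cpp-python with GPU offload
# (needs a llama-cpp-python build compiled with GPU support)
from llama_cpp import Llama
# LangChain wrapper; this import path assumes a recent langchain-community release
from langchain_community.embeddings import LlamaCppEmbeddings

llm = Llama(
    model_path="./models/mistral-7b.Q4_K_M.gguf",
    n_gpu_layers=35,    # offload up to 35 layers to the GPU (covers every layer of a 7B model)
    n_ctx=4096,         # context window in tokens
    n_batch=512,        # prompt tokens processed per batch
)

# Embedding model on the GPU too
embeddings = LlamaCppEmbeddings(
    model_path="./models/nomic-embed-text.gguf",
    n_gpu_layers=99,    # more than the model has, so every layer is offloaded
)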

Practical recommendation:

Set n_gpu_layers=-1 (or a large value such as 99) to offload every layer if the model fits in VRAM
If the model is too big for VRAM, a partial offload (e.g., n_gpu_layers=20) is still much better than CPU-only
Mixed CPU+GPU is slower than pure GPU but much faster than pure CPU
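
For the partial-offload case, here is a rough way to pick a starting value for n_gpu_layers. It assumes all layers are the same size and reserves some VRAM for the KV cache and the desktop. `layers_that_fit` and the 1.5GB reserve are my own illustrative choices, not part of llama-cpp-python:

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, assuming equal-sized layers.

    reserve_gb is an assumed allowance for KV cache, scratch buffers,
    and whatever else is already using the GPU.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Hypothetical case: a 13B Q4 model (~8 GB, 40 layers) on a 6 GB GPU
    print(layers_that_fit(model_gb=8.0, n_layers=40, vram_gb=6.0))
```

Pass the result as n_gpu_layers and lower it if loading fails with an out-of-memory error. The estimate is only a first guess.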

What GPU do you have available? That determines which models you can run comfortably.

0 replies

Replies: 5 comments
