Replies: 5 comments
I think the RAM needed depends on the size of your model. privateGPT prints a memory figure when it starts, something like 10 GB. With this configuration it cannot use the GPU, which is unfortunate because the GPU would be much faster. There are smaller models (I'm not sure which are compatible with privateGPT), but the smaller the model, the "dumber" it is. I'm not sure where to find models, but if someone knows, do tell.
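To put a rough number on "RAM depends on model size", here is a minimal back-of-the-envelope sketch. It assumes the weights dominate memory use and adds a flat allowance for context and runtime buffers. The function names, the 1 GB overhead, and the bit widths are illustrative assumptions, not values reported by privateGPT.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def estimate_total_gb(n_params: float, bits_per_weight: float,
                      overhead_gb: float = 1.0) -> float:
    """Weights plus an assumed flat allowance for context and runtime buffers."""
    return estimate_weights_gb(n_params, bits_per_weight) + overhead_gb


if __name__ == "__main__":
    # Parameter counts and bit widths below are illustrative only.
    for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
        for bits in (4, 8, 16):
            total = estimate_total_gb(n_params, bits)
            print(f"{name} @ {bits}-bit: ~{total:.1f} GB")
```

Real quantized files usually come out somewhat larger than the nominal bit width suggests, so treat the result as a lower bound and compare it with the figure privateGPT prints at startup.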
Can you be more specific? I did not see anywhere that we can allocate the memory... Thank you
When you start privateGPT, one of the first messages that pops up is a memory message. I'm not home, so I don't know the exact wording.
I am running with 16 GB of RAM using vicuna #233
GPU vs. CPU for local LLMs makes a significant difference. Here's the practical breakdown:

RAM requirements (GPU VRAM + system RAM):

GPU vs. CPU speed (rough numbers for 7B Q4):

For RAG workloads (like privateGPT), the bottleneck isn't just generation; it's also embedding speed. A GPU helps with both:

```python
# llama.cpp / llama-cpp-python with GPU offload
from llama_cpp import Llama
# Assumption: LlamaCppEmbeddings is the LangChain wrapper around llama-cpp-python
from langchain_community.embeddings import LlamaCppEmbeddings

llm = Llama(
    model_path="./models/mistral-7b.Q4_K_M.gguf",
    n_gpu_layers=35,  # offload 35 layers to GPU (all layers for 7B)
    n_ctx=4096,       # context window, in tokens
    n_batch=512,      # prompt tokens processed per batch
)

# Embedding model on GPU too
embeddings = LlamaCppEmbeddings(
    model_path="./models/nomic-embed-text.gguf",
    n_gpu_layers=99,  # all layers
)
```

Practical recommendation:

What GPU do you have available? That determines which models you can run comfortably.
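If the whole model does not fit in VRAM, a partial offload is still possible. The sketch below is a hypothetical helper (not part of llama-cpp-python or privateGPT) that guesses a value for `n_gpu_layers` from the model file size and the VRAM you have. It assumes every layer is roughly the same size and reserves a fixed amount of VRAM for context and scratch buffers; both are simplifications.

```python
import math


def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers can be offloaded to the GPU.

    Assumes equal-sized layers and keeps `reserve_gb` of VRAM free
    for context and scratch buffers (an assumed, not measured, value).
    """
    if model_size_gb <= 0 or n_layers <= 0:
        raise ValueError("model_size_gb and n_layers must be positive")
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # Illustrative numbers: a 4 GB model file with 32 layers.
    for vram in (2.0, 4.0, 8.0):
        print(f"{vram} GB VRAM -> n_gpu_layers={layers_that_fit(4.0, 32, vram)}")
```

The result can be passed as `n_gpu_layers`. If loading still runs out of memory, lowering the value or reducing `n_ctx` is the usual next step.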
Does anyone know how much RAM would be best for running privateGPT? Also, does the GPU play any role? If so, what config setting could we use to optimize performance? It takes minutes to get a response, irrespective of which CPU generation I run it on. Also, is there a smaller model we can use that would be faster?