Introducing GGUF Parser
GGUF is a highly efficient binary file format designed for storing models for inference with GGML and GGML-based executors. It ensures fast loading and saving of models. Models are typically developed with PyTorch or another framework and then converted to GGUF for use with GGML.
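As a quick illustration of the binary layout, every GGUF file starts with the four-byte ASCII magic GGUF followed by a little-endian version number, which you can confirm with a hex dump (the file path below is only a placeholder):
xxd -l 8 ~/Downloads/model.Q4_K_M.gguf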
GGUF Parser provides functions to parse GGUF files in Go for the following purposes:
- Read metadata from a GGUF file without downloading the whole remote model.
- Estimate the model's resource requirements.
Estimating the RAM/VRAM required to run a model is a crucial step when deploying it. This estimate helps you choose an appropriate model size and a suitable quantization method.
For more details about GGUF Parser, visit:
GitHub repo: https://github.com/gpustack/gguf-parser-go
Getting Started with GGUF Parser
Download the gguf-parser binary from the Releases page at https://github.com/gpustack/gguf-parser-go. Move the GGUF Parser binary to /usr/local/bin and grant it execution permissions (the following commands are for macOS):
mv ~/Downloads/gguf-parser-darwin-arm64 /usr/local/bin/gguf-parser
chmod +x /usr/local/bin/gguf-parser
Run the following command to view GGUF Parser's command-line parameters (macOS requires allowing gguf-parser to run in Privacy & Security settings):
gguf-parser -h
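If macOS still blocks the downloaded binary because it carries the quarantine attribute, you can either approve it once in Privacy & Security or clear the attribute yourself (this assumes the binary was installed to /usr/local/bin as above):
xattr -d com.apple.quarantine /usr/local/bin/gguf-parser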
Common parameters:
- -path: Specifies the local model file path to load (examples for -path and -url follow this list).
- -hf-repo: Specifies the Hugging Face model repository to load.
- -hf-file: Used with -hf-repo; specifies the GGUF model file name within the corresponding Hugging Face repository.
- -gpu-layers: Specifies how many layers of the model are offloaded to the GPU. The more layers you offload, the faster the inference speed. The number of layers in the model can be found in the LAYERS section of the ARCHITECTURE part of the output after execution.
- -ctx-size: Specifies the model's context size, constrained by the model. After execution, you can view the upper limit in the Max Context Len section of the ARCHITECTURE part.
- -url: Specifies the URL of a remote model file to load.
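For example, to inspect a GGUF model that is already on the local disk, or one hosted at an arbitrary URL, you can use -path or -url instead of the Hugging Face flags (the local path and the URL below are only placeholders; substitute your own model file):
gguf-parser -path ~/Downloads/rubra-meta-llama-3-8b-instruct.Q4_K_M.gguf
gguf-parser -url https://example.com/models/rubra-meta-llama-3-8b-instruct.Q4_K_M.gguf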
Run GGUF Parser and pay attention to the results in the ESTIMATE section.
gguf-parser -hf-repo rubra-ai/Meta-Llama-3-8B-Instruct-GGUF -hf-file rubra-meta-llama-3-8b-instruct.Q4_K_M.gguf
Based on the estimated results, under Apple's Unified Memory Architecture (UMA), the model will use:
- 84.16 MiB of RAM and 1.12 GiB of VRAM, totaling 1.21 GiB of memory.
Under a non-UMA architecture, the model will use:
- 234.16 MiB of RAM and 6.41 GiB of VRAM.
By default, all layers are offloaded to the GPU for acceleration, which maximizes GPU usage but may put pressure on the GPU. You can also use the -gpu-layers parameter to specify how many layers to offload to the GPU. For example, if 20 layers of the model are offloaded to the GPU in a non-UMA architecture, the model will use:
- 754.16 MiB of RAM and 3.97 GiB of VRAM.
gguf-parser -hf-repo rubra-ai/Meta-Llama-3-8B-Instruct-GGUF -hf-file rubra-meta-llama-3-8b-instruct.Q4_K_M.gguf -gpu-layers 20
Changing the model's context size affects memory usage. For example, decreasing the -ctx-size parameter from the default 8192 to 2048 reduces memory usage by 878.48 MiB (1.21 GiB - 360.16 MiB) under Apple's UMA.
gguf-parser -hf-repo rubra-ai/Meta-Llama-3-8B-Instruct-GGUF -hf-file rubra-meta-llama-3-8b-instruct.Q4_K_M.gguf -ctx-size 2048
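The parameters can also be combined, for example offloading 20 layers and limiting the context to 2048 tokens in a single run (the resulting estimate will differ from the individual cases above):
gguf-parser -hf-repo rubra-ai/Meta-Llama-3-8B-Instruct-GGUF -hf-file rubra-meta-llama-3-8b-instruct.Q4_K_M.gguf -gpu-layers 20 -ctx-size 2048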
If you want to learn more about how it works, you can join our Community to talk to our team.
GPUStack and GGUF Parser
GPUStack, an open-source GPU cluster manager for running large language models (LLMs), uses GGUF Parser to estimate LLM resource requirements and automatically schedules models to run on machines with appropriate resources.
GPUStack allows you to create a unified cluster from any brand of GPUs in Apple Macs, Windows PCs, and Linux servers. Administrators can deploy LLMs from popular repositories such as Hugging Face. Developers can then access LLMs just as easily as accessing public LLM services from vendors like OpenAI or Microsoft Azure.
If you are interested in GPUStack, visit the following links to see more information:
Introducing GPUStack: https://gpustack.ai/introducing-gpustack
User guide: https://docs.gpustack.ai
About Us
GPUStack and GGUF Parser are brought to you by Seal, Inc., a team dedicated to enabling AI access for all. Our mission is to enable enterprises to use AI in their business, and GPUStack is a significant step toward achieving that goal.
Quickly build your own LLMaaS platform with GPUStack! Start experiencing the ease of creating GPU clusters locally, running and using LLMs, and integrating them into your applications.