Convert and Upload Your GGUF Model to Hugging Face – Step-by-Step Guide

llama.cpp is the underlying implementation for Ollama, LM Studio, and many other popular projects, and is one of the inference backends in GPUStack. It provides the GGUF model format, a file format designed for optimized inference that allows models to load and run quickly.

llama.cpp also supports quantized models, which reduce storage and computational demands while preserving high model accuracy. This makes it possible to deploy LLMs efficiently on desktops, embedded devices, and other resource-limited environments, improving inference speed.

In this tutorial, we walk through how to convert models to GGUF, quantize them, and upload them to Hugging Face.

 

Signing Up and Configuring Your Hugging Face Account

  • Signing up for Hugging Face

First, go to https://huggingface.co/join and sign up for a Hugging Face account.

 

  • Configuring an SSH Key

Add your SSH public key to Hugging Face. To print your public key, run the command below (if it doesn't exist, generate an SSH key with ssh-keygen -t rsa -b 4096):
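```bash
# Print your SSH public key (assumes the default RSA key path from ssh-keygen above)
cat ~/.ssh/id_rsa.pub
```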

On Hugging Face, go to "Account > Settings > SSH and GPG Keys" to add your SSH key. This authenticates you for uploading models later.

 

Preparing llama.cpp

Create and activate a Conda environment (if Conda is not installed, refer to the installation guide: https://docs.anaconda.com/miniconda/):
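```bash
# Create and activate a dedicated environment (the name and Python version are illustrative)
conda create -n llama-cpp python=3.12 -y
conda activate llama-cpp
```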

Clone the latest llama.cpp release and build it as follows:
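```bash
# Clone the repository and build with CMake, as documented in the llama.cpp README;
# also install the Python dependencies needed by the conversion script
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install -r requirements.txt
cmake -B build
cmake --build build --config Release
```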


After building, run the following command to confirm:
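```bash
# Sanity-check the build (binary path assumes the CMake layout above)
./build/bin/llama-cli --version
```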


 

Downloading the Original Model

Download the original model that you want to convert to GGUF and quantize.

We download the model from Hugging Face using huggingface-cli. First, ensure it is installed:
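```bash
# Install the Hugging Face CLI, which ships with the huggingface_hub package
pip install -U "huggingface_hub[cli]"
```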

Here we download meta-llama/Llama-3.2-3B-Instruct. This is a gated model, so we need to request access on Hugging Face before downloading:


On Hugging Face, go to "Account > Access Tokens" to generate an access token with Read permissions:


Download meta-llama/Llama-3.2-3B-Instruct, using the --local-dir parameter to specify the target directory and --token to pass the access token created previously:
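```bash
# Download the gated model; replace the token placeholder with your Read token.
# The local directory mirrors the huggingface.co/<org>/<model> layout used below.
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir huggingface.co/meta-llama/Llama-3.2-3B-Instruct \
  --token <your-read-token>
```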

Converting to GGUF and Quantizing

Create a script that converts the model to GGUF format and quantizes it:

Fill in the following content, modifying the directory paths for llama.cpp and huggingface.co to the actual absolute paths in your environment. Change gpustack in the d variable to your Hugging Face username:
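A sketch of such a script; aside from the d variable described above, the variable names and paths are illustrative:

```bash
#!/usr/bin/env bash
set -e

# Absolute paths to llama.cpp and the download directory from the previous step
llama_cpp="/absolute/path/to/llama.cpp"
b="/absolute/path/to/huggingface.co"

s="meta-llama/Llama-3.2-3B-Instruct"  # original model
d="gpustack"                          # your Hugging Face username

name=$(basename "$s")
out="$b/$d/$name-GGUF"
mkdir -p "$out"

# Convert the original model to an FP16 GGUF model
python "$llama_cpp/convert_hf_to_gguf.py" "$b/$s" \
  --outfile "$out/$name-FP16.gguf" --outtype f16

# Quantize the FP16 model with each method
for q in Q8_0 Q6_K Q5_K_M Q5_0 Q4_K_M Q4_0 Q3_K Q2_K; do
  "$llama_cpp/build/bin/llama-quantize" \
    "$out/$name-FP16.gguf" "$out/$name-$q.gguf" "$q"
done
```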

Run the script to convert the model to an FP16 GGUF model and quantize it with the following methods: Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K, and Q2_K.

After the script has run, confirm that the FP16 GGUF model and the quantized GGUF models were generated:
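```bash
# List the generated GGUF files (path assumes the script layout above)
ls -lh /absolute/path/to/huggingface.co/gpustack/Llama-3.2-3B-Instruct-GGUF/
```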


The model files are stored in the directory named after the corresponding username:


 

Uploading the Model to Hugging Face

Go to Hugging Face, click on your profile, and select New Model to create a model repository with the matching name, formatted as original-model-name-GGUF (here, Llama-3.2-3B-Instruct-GGUF).


 

Update the README for the model:

For maintainability, record the original model and the llama.cpp commit information after the metadata block, adjusting them to match your actual setup:
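For example, sketched here as a heredoc; the metadata fields and wording are illustrative:

```bash
# Illustrative README content; fill in the actual llama.cpp commit hash
cat > README.md <<'EOF'
---
base_model: meta-llama/Llama-3.2-3B-Instruct
---

# Llama-3.2-3B-Instruct-GGUF

Original model: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct

Converted and quantized with llama.cpp (commit <commit-hash>).
EOF
```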

image-20241106173636107

Install Git LFS to manage large files:
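```bash
# Install Git LFS (Debian/Ubuntu shown; use your platform's package manager)
# and enable it for your user account
sudo apt install git-lfs
git lfs install
```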

Add the remote repository:
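```bash
# From inside the Llama-3.2-3B-Instruct-GGUF directory, initialize the
# repository and add the Hugging Face remote over SSH (username/repo from this example)
git init
git remote add origin git@hf.co:gpustack/Llama-3.2-3B-Instruct-GGUF
```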

Add the files, confirm the files to be committed with git ls-files, and use git lfs ls-files to verify that all .gguf files are managed by Git LFS:
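```bash
# Track GGUF files with LFS and stage everything
git lfs track "*.gguf"
git add .

# Verify what will be committed and that the .gguf files go through LFS
git ls-files
git lfs ls-files
```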


Enable large file (larger than 5 GB) uploads on Hugging Face. Log in to Hugging Face via the CLI, entering the token created in the Downloading the Original Model section above:
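```bash
# Log in with the access token created earlier (paste it when prompted)
huggingface-cli login
```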

Enable large file uploads for the current directory:
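```bash
# Allow pushing files larger than 5 GB from the current directory
huggingface-cli lfs-enable-largefiles .
```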


Upload the model to Hugging Face:
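```bash
# Commit and push; Hugging Face uses main as the default branch
git commit -m "Add Llama-3.2-3B-Instruct GGUF models"
git branch -M main
git push -u origin main
```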

After uploading, verify the uploaded model files on Hugging Face.

If the upload fails, try uploading with huggingface-cli instead. Make sure to use an access token with Write permissions.
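For example (repository name from this tutorial; the token is a placeholder):

```bash
# Upload the current directory to the model repository (Write token required)
huggingface-cli upload gpustack/Llama-3.2-3B-Instruct-GGUF . . --token <your-write-token>
```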

 

Conclusion

In this tutorial, we showed how to use llama.cpp to convert models to GGUF, quantize them, and upload them to Hugging Face.

The flexibility and efficiency of llama.cpp make it an ideal choice for model inference in resource-limited scenarios, and it is widely used. GGUF is the model format required to run models with llama.cpp, and we hope this tutorial helps you manage your GGUF models.
