llama.cpp is the underlying implementation for Ollama, LM Studio, and many other popular projects, and is one of the inference backends in GPUStack. It uses the GGUF model format, a file format designed for fast loading and optimized inference.
llama.cpp also supports quantized models, which reduce storage and compute requirements with minimal loss in accuracy. This makes it possible to deploy LLMs efficiently on desktops, embedded devices, and other resource-limited environments while improving inference speed.
In this tutorial, we walk through how to convert a model to GGUF, quantize it, and upload it to Hugging Face.
Signing up and configuring your Hugging Face Account
- Signing up for Hugging Face
First, go to https://huggingface.co/join and sign up for a Hugging Face account.
- Configuring an SSH key
Add your SSH public key to Hugging Face. To print your public key, run the command below (if it doesn't exist, generate an SSH key with ssh-keygen -t rsa -b 4096, as shown after the command):
cat ~/.ssh/id_rsa.pub
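If no key pair exists yet, a minimal sketch for generating one (the -C comment is optional, and the email shown is a placeholder):
ssh-keygen -t rsa -b 4096 -C "your-email@example.com"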
On Hugging Face, go to "Account
> Settings
> SSH and GPG Keys
" to add your SSH key. This provides authentication for uploading model later.
Preparing llama.cpp
Create and activate a Conda environment (if Conda is not installed, refer to the installation guide: https://docs.anaconda.com/miniconda/):
conda create -n llama-cpp python=3.12 -y
conda activate llama-cpp
which python
pip -V
Clone a tagged llama.cpp release (here b4034) and build it as follows:
cd ~
git clone -b b4034 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
brew install cmake   # macOS; on Linux, install cmake with your distribution's package manager
make
After building, run the following command to confirm the build succeeded:
./llama-quantize --help
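The conversion script used later, convert_hf_to_gguf.py, ships at the root of the llama.cpp repository. You can also confirm it runs with the installed Python dependencies:
python3 convert_hf_to_gguf.py --help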
Downloading the original model
Download the original model that you want to convert to GGUF and quantize.
Download the model from Hugging Face using huggingface-cli. First, ensure it is installed:
pip install -U huggingface_hub
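You can confirm the CLI is available on your PATH (this simply prints the available subcommands):
huggingface-cli --help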
Here we download meta-llama/Llama-3.2-3B-Instruct. This is a gated model, so we need to request access on Hugging Face before downloading:
On Hugging Face, go to "Account
> Access Tokens
" to generate an access token with Read
permissions:
Download meta-llama/Llama-3.2-3B-Instruct, using the --local-dir parameter to specify the target directory and --token to specify the access token created previously:
mkdir ~/huggingface.co
cd ~/huggingface.co/
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct --token hf_abcdefghijklmnopqrstuvwxyz
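After the download completes, you can sanity-check the local copy. A typical checkout includes config.json, tokenizer files, and one or more *.safetensors weight shards (exact file names vary by model):
ls -lh Llama-3.2-3B-Instruct/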
Convert to GGUF and quantize
Create a script to convert the model to GGUF format and quantize it:
cd ~/huggingface.co/
vim quantize.sh
Fill in the following content, modifying the directory paths for llama.cpp and huggingface.co to the actual absolute paths in your environment, and changing gpustack in the d variable to your Hugging Face username:
#!/usr/bin/env bash
# Usage: bash quantize.sh <original-model-directory-name>
llama_cpp="/Users/gpustack/llama.cpp"   # absolute path to your llama.cpp checkout
b="/Users/gpustack/huggingface.co"      # base directory holding the models
export PATH="$PATH:${llama_cpp}"
s="$1"                               # source model directory (script argument)
n="$(echo "${s}" | cut -d'/' -f2)"   # model name (the part after any namespace prefix)
d="gpustack/${n}-GGUF"               # target directory; replace gpustack with your username
# prepare
mkdir -p ${b}/${d} 1>/dev/null 2>&1
pushd ${b}/${d} 1>/dev/null 2>&1
git init . 1>/dev/null 2>&1
if [[ ! -f .gitattributes ]]; then
cp -f ${b}/${s}/.gitattributes . 1>/dev/null 2>&1 || true
echo "*.gguf filter=lfs diff=lfs merge=lfs -text" >> .gitattributes
fi
if [[ ! -d assets ]]; then
cp -rf ${b}/${s}/assets . 1>/dev/null 2>&1 || true
fi
if [[ ! -d images ]]; then
cp -rf ${b}/${s}/images . 1>/dev/null 2>&1 || true
fi
if [[ ! -d imgs ]]; then
cp -rf ${b}/${s}/imgs . 1>/dev/null 2>&1 || true
fi
if [[ ! -f README.md ]]; then
cp -f ${b}/${s}/README.md . 1>/dev/null 2>&1 || true
fi
set -e
pushd ${llama_cpp} 1>/dev/null 2>&1
# convert
[[ -f venv/bin/activate ]] && source venv/bin/activate
echo "#### convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-FP16.gguf"
python3 convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-FP16.gguf
# quantize
qs=(
"Q8_0"
"Q6_K"
"Q5_K_M"
"Q5_0"
"Q4_K_M"
"Q4_0"
"Q3_K"
"Q2_K"
)
for q in "${qs[@]}"; do
echo "#### llama-quantize ${b}/${d}/${n}-FP16.gguf ${b}/${d}/${n}-${q}.gguf ${q}"
llama-quantize ${b}/${d}/${n}-FP16.gguf ${b}/${d}/${n}-${q}.gguf ${q}
ls -lth ${b}/${d}
sleep 3
done
popd 1>/dev/null 2>&1
set +e
The script first converts the model to an FP16 GGUF model, then quantizes it using the following methods: Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K, and Q2_K. Start the conversion and quantization:
bash quantize.sh Llama-3.2-3B-Instruct
After the script has executed, confirm that the FP16 GGUF model and the quantized GGUF models were created successfully:
The models are stored in the directory named after the corresponding username:
ll gpustack/Llama-3.2-3B-Instruct-GGUF/
Uploading the Model to Hugging Face
Go to Hugging Face, click on your profile, and select New Model to create a model repository with a matching name, formatted as original-model-name-GGUF.
Update the README for the model:
cd ~/huggingface.co/gpustack/Llama-3.2-3B-Instruct-GGUF
vim README.md
For maintainability, record the original model and the llama.cpp commit information after the metadata. Be sure to modify it according to the actual information:
# Llama-3.2-3B-Instruct-GGUF
**Model creator**: [meta-llama](https://huggingface.co/meta-llama)<br/>
**Original model**: [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)<br/>
**GGUF quantization**: based on llama.cpp release [b8deef0e](https://github.com/ggerganov/llama.cpp/commit/b8deef0ec0af5febac1d2cfd9119ff330ed0b762)
---
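The metadata mentioned above refers to the YAML front matter at the very top of README.md. If the copied README did not carry one over, a minimal sketch might look like the following (the field values are illustrative; adjust the license to match your model):
---
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
tags:
- gguf
---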
Install Git LFS to manage the large files:
brew install git-lfs
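After installing, run the standard one-time setup so that Git LFS hooks into your Git configuration:
git lfs install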
Add the remote repository:
git remote add origin git@hf.co:gpustack/Llama-3.2-3B-Instruct-GGUF
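Replace gpustack with your Hugging Face username. You can confirm the remote was added correctly:
git remote -v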
Add the files, confirm the files to be committed with git ls-files, and use git lfs ls-files to verify that all .gguf files are managed by Git LFS:
git add .
git ls-files
git lfs ls-files
Hugging Face requires enabling uploads of large files (larger than 5 GB). Log into Hugging Face via the CLI, entering the token created in the Downloading the original model section above:
huggingface-cli login
Enable large file uploads for the current directory:
huggingface-cli lfs-enable-largefiles .
Upload the model to Hugging Face:
git commit -m "feat: first commit" --signoff
git push origin main -f
After uploading, verify the uploaded model files on Hugging Face.
If the upload is unsuccessful, try uploading with huggingface-cli instead. Make sure to use an access token with Write permissions.
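A sketch of the CLI-based upload, assuming you are logged in with a Write token (replace gpustack with your username; the two trailing arguments upload the current directory to the repository root):
huggingface-cli upload gpustack/Llama-3.2-3B-Instruct-GGUF . .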
Conclusion
In this tutorial, we introduced how to use llama.cpp to convert a model to GGUF, quantize it, and upload it to Hugging Face.
The flexibility and efficiency of llama.cpp make it an ideal choice for model inference in resource-limited scenarios, and it is widely used. GGUF is the model format required to run models in llama.cpp, and we hope this tutorial helps you manage your GGUF models.