llama.cpp is the underlying implementation for Ollama, LM Studio, and many other popular projects, and is one of the inference backends in GPUStack. It uses the GGUF model format, a file format designed for fast loading and optimized inference.
llama.cpp also supports quantized models, which reduce storage and compute requirements with minimal loss in accuracy. This makes it possible to deploy LLMs efficiently on desktops, embedded devices, and other resource-limited environments while improving inference speed.
In this tutorial, we walk through how to convert a model to GGUF, quantize it, and upload it to Hugging Face.
Signing up and configuring your Hugging Face Account
- Signing up for Hugging Face
First, go to https://huggingface.co/join and sign up for a Hugging Face account.
- Configuring an SSH key
Add your SSH public key to Hugging Face. To print your public key, run the command below (if it doesn't exist, generate an SSH key with ssh-keygen -t rsa -b 4096, as shown after the command):
cat ~/.ssh/id_rsa.pub
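If no key pair exists yet, a minimal sketch for generating one (the -C comment is optional, and the email shown is a placeholder):
ssh-keygen -t rsa -b 4096 -C "your-email@example.com"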
On Hugging Face, go to "Account
> Settings
> SSH and GPG Keys
" to add your SSH key. This provides authentication for uploading model later.
Preparing llama.cpp
Create and activate a Conda environment (if Conda is not installed, refer to the installation guide: https://docs.anaconda.com/miniconda/):
conda create -n llama-cpp python=3.12 -y
conda activate llama-cpp
which python
pip -V
Clone a tagged llama.cpp release (here b4034) and build it as follows:
cd ~
git clone -b b4034 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp/
pip install -r requirements.txt
brew install cmake   # macOS; on Linux, install cmake with your distribution's package manager
make
After building, run the following command to confirm the build succeeded:
./llama-quantize --help
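The conversion script used later, convert_hf_to_gguf.py, ships at the root of the llama.cpp repository. You can also confirm it runs with the installed Python dependencies:
python3 convert_hf_to_gguf.py --help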
Downloading the original model
Download the original model that you want to convert to GGUF and quantize.
Download the model from Hugging Face using huggingface-cli. First, ensure it is installed:
pip install -U huggingface_hub
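You can confirm the CLI is available on your PATH (this simply prints the available subcommands):
huggingface-cli --help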
Here we download meta-llama/Llama-3.2-3B-Instruct. This is a gated model, so we need to request access on Hugging Face before downloading:
On Hugging Face, go to "Account
> Access Tokens
" to generate an access token with Read
permissions:
Download meta-llama/Llama-3.2-3B-Instruct, using the --local-dir parameter to specify the target directory and --token to specify the access token created previously:
mkdir ~/huggingface.co
cd ~/huggingface.co/
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --local-dir Llama-3.2-3B-Instruct --token hf_abcdefghijklmnopqrstuvwxyz
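After the download completes, you can sanity-check the local copy. A typical checkout includes config.json, tokenizer files, and one or more *.safetensors weight shards (exact file names vary by model):
ls -lh Llama-3.2-3B-Instruct/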
Convert to GGUF and quantize
Create a script to convert the model to GGUF format and quantize it:
cd ~/huggingface.co/
vim quantize.sh
Fill in the following content, modifying the directory paths for llama.cpp and huggingface.co to the actual absolute paths in your environment, and changing gpustack in the d variable to your Hugging Face username:
#!/usr/bin/env bash
# Usage: bash quantize.sh <original-model-directory-name>
llama_cpp="/Users/gpustack/llama.cpp"   # absolute path to your llama.cpp checkout
b="/Users/gpustack/huggingface.co"      # base directory holding the models
export PATH="$PATH:${llama_cpp}"
s="$1"                               # source model directory (script argument)
n="$(echo "${s}" | cut -d'/' -f2)"   # model name (the part after any namespace prefix)
d="gpustack/${n}-GGUF"               # target directory; replace gpustack with your username
# prepare
mkdir -p ${b}/${d} 1>/dev/null 2>&1
pushd ${b}/${d} 1>/dev/null 2>&1
git init . 1>/dev/null 2>&1
if [[ ! -f .gitattributes ]]; then
cp -f ${b}/${s}/.gitattributes . 1>/dev/null 2>&1 || true
echo "*.gguf filter=lfs diff=lfs merge=lfs -text" >> .gitattributes
fi
if [[ ! -d assets ]]; then
cp -rf ${b}/${s}/assets . 1>/dev/null 2>&1 || true
fi
if [[ ! -d images ]]; then
cp -rf ${b}/${s}/images . 1>/dev/null 2>&1 || true
fi
if [[ ! -d imgs ]]; then
cp -rf ${b}/${s}/imgs . 1>/dev/null 2>&1 || true
fi
if [[ ! -f README.md ]]; then
cp -f ${b}/${s}/README.md . 1>/dev/null 2>&1 || true
fi
set -e
pushd ${llama_cpp} 1>/dev/null 2>&1
# convert
[[ -f venv/bin/activate ]] && source venv/bin/activate
echo "#### convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-FP16.gguf"
python3 convert_hf_to_gguf.py ${b}/${s} --outfile ${b}/${d}/${n}-FP16.gguf
# quantize
qs=(
"Q8_0"
"Q6_K"
"Q5_K_M"
"Q5_0"
"Q4_K_M"
"Q4_0"
"Q3_K"
"Q2_K"
)
for q in "${qs[@]}"; do
echo "#### llama-quantize ${b}/${d}/${n}-FP16.gguf ${b}/${d}/${n}-${q}.gguf ${q}"
llama-quantize ${b}/${d}/${n}-FP16.gguf ${b}/${d}/${n}-${q}.gguf ${q}
ls -lth ${b}/${d}
sleep 3
done
popd 1>/dev/null 2>&1
set +e
The script first converts the model to an FP16 GGUF model, then quantizes it using the following methods: Q8_0, Q6_K, Q5_K_M, Q5_0, Q4_K_M, Q4_0, Q3_K, and Q2_K. Start the conversion and quantization:
bash quantize.sh Llama-3.2-3B-Instruct
After the script has executed, confirm that the FP16 GGUF model and the quantized GGUF models were created successfully:
The models are stored in the directory named after the corresponding username:
ll gpustack/Llama-3.2-3B-Instruct-GGUF/
Uploading the Model to Hugging Face
Go to Hugging Face, click on your profile, and select New Model to create a model repository with a matching name, formatted as original-model-name-GGUF.
Update the README for the model:
cd ~/huggingface.co/gpustack/Llama-3.2-3B-Instruct-GGUF
vim README.md
For maintainability, record the original model and the llama.cpp commit information after the metadata. Be sure to modify it according to the actual information:
# Llama-3.2-3B-Instruct-GGUF
**Model creator**: [meta-llama](https://huggingface.co/meta-llama)<br/>
**Original model**: [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)<br/>
**GGUF quantization**: based on llama.cpp release [b8deef0e](https://github.com/ggerganov/llama.cpp/commit/b8deef0ec0af5febac1d2cfd9119ff330ed0b762)
---
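The metadata mentioned above refers to the YAML front matter at the very top of README.md. If the copied README did not carry one over, a minimal sketch might look like the following (the field values are illustrative; adjust the license to match your model):
---
license: llama3.2
base_model: meta-llama/Llama-3.2-3B-Instruct
tags:
- gguf
---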
Install Git LFS to manage the large files:
brew install git-lfs
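After installing, run the standard one-time setup so that Git LFS hooks into your Git configuration:
git lfs install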
Add the remote repository:
git remote add origin git@hf.co:gpustack/Llama-3.2-3B-Instruct-GGUF
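Replace gpustack with your Hugging Face username. You can confirm the remote was added correctly:
git remote -v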
Add the files, confirm the files to be committed with git ls-files, and use git lfs ls-files to verify that all .gguf files are managed by Git LFS:
git add .
git ls-files
git lfs ls-files
Hugging Face requires enabling uploads of large files (larger than 5 GB). Log into Hugging Face via the CLI, entering the token created in the Downloading the original model section above:
huggingface-cli login
Enable large file uploads for the current directory:
huggingface-cli lfs-enable-largefiles .
Upload the model to Hugging Face:
git commit -m "feat: first commit" --signoff
git push origin main -f
After uploading, verify the uploaded model files on Hugging Face.
If the upload is unsuccessful, try uploading with huggingface-cli instead. Make sure to use an access token with Write permissions.
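A sketch of the CLI-based upload, assuming you are logged in with a Write token (replace gpustack with your username; the two trailing arguments upload the current directory to the repository root):
huggingface-cli upload gpustack/Llama-3.2-3B-Instruct-GGUF . .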
Conclusion
In this tutorial, we introduced how to use llama.cpp to convert a model to GGUF, quantize it, and upload it to Hugging Face.
The flexibility and efficiency of llama.cpp make it an ideal choice for model inference in resource-limited scenarios, and it is widely used. GGUF is the model format required to run models in llama.cpp, and we hope this tutorial helps you manage your GGUF models.