ibm-granite/gguf
Description: CI/CD for IBM model GGUF conversions, quantizations and packagings for partner delivery
Language: Jupyter Notebook
License: Apache-2.0
Stars: 4
Forks: 0
Open issues: 1
Created: 2025-10-02T20:06:28Z
Pushed: 2025-10-17T16:08:16Z
Default branch: main
Fork: no
Archived: no
README:
gguf
This repository provides an automated CI/CD process to convert, test, and deploy IBM Granite models, in safetensors format, from the `ibm-granite` organization to versioned IBM GGUF collections on the Hugging Face Hub under the `ibm-research` organization. This includes:
Topic index
- [Target IBM models for format conversion](#target-ibm-models-for-format-conversion)
- [Supported IBM Granite models (GGUF)](#supported-ibm-granite-models-gguf)
  - [Language](#language)
  - [Guardian](#guardian)
  - [Vision](#vision)
  - [Embedding](#embedding-dense)
- [GGUF Conversion & Quantization](#gguf-conversion--quantization)
- [GGUF Verification Testing](#gguf-verification-testing)
- [References](#references)
- [Releasing GGUF model conversions & quantizations](#releasing-gguf-model-conversions--quantizations)
---
## Target IBM models for format conversion
Format conversions (i.e., GGUF) and quantizations will only be provided for model repositories canonically hosted in an official IBM Hugging Face organization.
Currently, this includes the following organizations:
- https://huggingface.co/ibm-granite
- https://huggingface.co/ibm-research
Additionally, only a select set of IBM models from these organizations will be converted, based on the following general criteria:
- The IBM GGUF model needs to be referenced by an AI provider service as a "supported" model.
  - *For example, a local AI provider service such as Ollama or a hosted service such as Replicate.*
- The GGUF model is referenced by a public blog, tutorial, demo, or other public use case.
  - Specifically, if the model is referenced in an IBM Granite Snack Cookbook.
Select quantizations will only be made available when:
- A small form factor is justified:
  - *e.g., a reduced model size intended for running locally on small form-factor devices such as watches and mobile devices.*
- Performance provides a significant benefit without compromising accuracy (or enabling hallucination).
## Supported IBM Granite models (GGUF)
Specifically, the following Granite model repositories are currently supported in GGUF format (by collection), with their supported quantizations listed:
###### Language
Typically, this model category includes "instruct" models.
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
| --- | --- | --- |
| ibm-granite/granite-3.2-2b-instruct | GraniteForCausalLM (gpt2) | ibm-research |
| ibm-granite/granite-3.2-8b-instruct | GraniteForCausalLM (gpt2) | ibm-research |

- Supported quantizations: `fp16`, `Q2_K`, `Q3_K_L`, `Q3_K_M`, `Q3_K_S`, `Q4_0`, `Q4_1`, `Q4_K_M`, `Q4_K_S`, `Q5_0`, `Q5_1`, `Q5_K_M`, `Q5_K_S`, `Q6_K`, `Q8_0`
###### Guardian
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
| --- | --- | --- |
| ibm-granite/granite-guardian-3.2-3b-a800m | GraniteMoeForCausalLM (granitemoe) | ibm-research |
| ibm-granite/granite-guardian-3.2-5b | GraniteMoeForCausalLM (granitemoe) | ibm-research |

- Supported quantizations: `fp16`, `Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`
###### Vision
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
| --- | --- | --- |
| ibm-granite/granite-vision-3.2-2b | GraniteForCausalLM (granite), LlavaNextForConditionalGeneration | ibm-research |

- Supported quantizations: `fp16`, `Q4_K_M`, `Q5_K_M`, `Q8_0`
###### Embedding (dense)
| Source Repo. ID | HF (llama.cpp) Architecture | Target HF Org. |
| --- | --- | --- |
| ibm-granite/granite-embedding-30m-english | Roberta (roberta-bpe) | ibm-research |
| ibm-granite/granite-embedding-125m-english | Roberta (roberta-bpe) | ibm-research |
| ibm-granite/granite-embedding-107m-multilingual | Roberta (roberta-bpe) | ibm-research |
| ibm-granite/granite-embedding-278m-multilingual | Roberta (roberta-bpe) | ibm-research |

- Supported quantizations: `fp16`, `Q8_0`

Note: The sparse model architecture (i.e., HF RobertaMaskedLM) is not currently supported; therefore, there is no conversion for ibm-granite/granite-embedding-30m-sparse.
###### RAG LoRA support
- LoRA support is planned (no date set).
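
As a rough illustration of how the per-collection lists in this section could drive a conversion matrix, the sketch below copies the source repositories and quantizations from the tables into plain Python data. The names `SOURCE_REPOS`, `SUPPORTED_QUANTIZATIONS`, and `conversion_jobs` are hypothetical and are not taken from this repository's CI/CD code.

```python
from typing import Dict, List, Tuple

# Source repositories per collection, copied from the tables in this section.
SOURCE_REPOS: Dict[str, List[str]] = {
    "language": [
        "ibm-granite/granite-3.2-2b-instruct",
        "ibm-granite/granite-3.2-8b-instruct",
    ],
    "guardian": [
        "ibm-granite/granite-guardian-3.2-3b-a800m",
        "ibm-granite/granite-guardian-3.2-5b",
    ],
    "vision": ["ibm-granite/granite-vision-3.2-2b"],
    "embedding": [
        "ibm-granite/granite-embedding-30m-english",
        "ibm-granite/granite-embedding-125m-english",
        "ibm-granite/granite-embedding-107m-multilingual",
        "ibm-granite/granite-embedding-278m-multilingual",
    ],
}

# Supported quantizations per collection, copied from the lists in this section.
SUPPORTED_QUANTIZATIONS: Dict[str, List[str]] = {
    "language": [
        "fp16", "Q2_K", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q4_0", "Q4_1", "Q4_K_M",
        "Q4_K_S", "Q5_0", "Q5_1", "Q5_K_M", "Q5_K_S", "Q6_K", "Q8_0",
    ],
    "guardian": ["fp16", "Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
    "vision": ["fp16", "Q4_K_M", "Q5_K_M", "Q8_0"],
    "embedding": ["fp16", "Q8_0"],
}


def conversion_jobs(collection: str) -> List[Tuple[str, str]]:
    """Return every (source repo, quantization) pair for one collection."""
    return [
        (repo, quant)
        for repo in SOURCE_REPOS[collection]
        for quant in SUPPORTED_QUANTIZATIONS[collection]
    ]


if __name__ == "__main__":
    for name in SOURCE_REPOS:
        print(name, len(conversion_jobs(name)))
```

Keeping the lists as data rather than hard-coding them per job means a new model or quantization is a one-line change, which is the main reason for this layout in the sketch.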
---
## GGUF Conversion & Quantization
The GGUF format is defined in the GGUF specification. The specification describes the structure of the file, how it is encoded, and what information is included.
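
To make the file structure a little more concrete, here is a minimal sketch that reads only the fixed-size header of a GGUF file. It assumes a little-endian GGUF version 2 or 3 file, where the 4-byte magic `GGUF` is followed by a 32-bit version and two 64-bit counts; other versions or big-endian files would need different handling. The names `GGUFHeader` and `read_gguf_header` are illustrative, not part of any library.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed-size header (assumes little-endian GGUF v2/v3)."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    fixed = stream.read(20)  # uint32 version + uint64 tensors + uint64 KV pairs
    if len(fixed) != 20:
        raise ValueError("truncated GGUF header")
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", fixed)
    return GGUFHeader(version, tensor_count, metadata_kv_count)


if __name__ == "__main__":
    # Synthetic header bytes, used here instead of a real model file.
    sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
    print(read_gguf_header(io.BytesIO(sample)))
```

With a real file, the same function can be called on `open("output_file.gguf", "rb")`; the metadata key/value pairs and tensor descriptions that follow the header are not parsed by this sketch.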
Currently, the primary means of converting from HF safetensors format to GGUF is the canonical llama.cpp tool `convert_hf_to_gguf.py` (named `convert-hf-to-gguf.py` in older llama.cpp checkouts).
For example:
python llama.cpp/convert_hf_to_gguf.py ./ --outfile output_file.gguf --outtype q8_0
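
The conversion script can only write a few output types directly, so K-quant targets such as `Q4_K_M` are usually produced in two steps: convert to `f16`, then run llama.cpp's `llama-quantize` on the result. The sketch below only builds the command lines and does not run them. It assumes a recent llama.cpp checkout (script named `convert_hf_to_gguf.py`; older checkouts use hyphens), a `llama-quantize` binary on `PATH`, and a hypothetical `<model_name>-<quantization>.gguf` naming scheme; `build_commands` is an illustrative helper, not this repository's pipeline.

```python
from pathlib import Path
from typing import List

# Output types the conversion script is assumed to write directly
# (a subset of the --outtype choices in recent llama.cpp checkouts).
DIRECT_OUTTYPES = {"f32", "f16", "bf16", "q8_0"}


def build_commands(
    model_dir: str,
    model_name: str,
    quantization: str,
    llama_cpp_dir: str = "llama.cpp",
) -> List[List[str]]:
    """Return the command(s) needed to produce one GGUF file."""
    # The tables above write "fp16"; the script's flag value is "f16".
    outtype = "f16" if quantization.lower() == "fp16" else quantization.lower()
    script = (Path(llama_cpp_dir) / "convert_hf_to_gguf.py").as_posix()

    if outtype in DIRECT_OUTTYPES:
        outfile = f"{model_name}-{outtype}.gguf"
        return [["python", script, model_dir,
                 "--outfile", outfile, "--outtype", outtype]]

    # Two-step path: convert to f16 first, then quantize the f16 file.
    f16_file = f"{model_name}-f16.gguf"
    quant_file = f"{model_name}-{quantization}.gguf"
    return [
        ["python", script, model_dir, "--outfile", f16_file, "--outtype", "f16"],
        ["llama-quantize", f16_file, quant_file, quantization.upper()],
    ]


if __name__ == "__main__":
    for cmd in build_commands("./", "granite-3.2-2b-instruct", "Q4_K_M"):
        print(" ".join(cmd))
```

To execute the steps, each returned list can be passed to `subprocess.run(cmd, check=True)` in order.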
#### Alternatives
##### Ollama CLI (future)
- https://github.com/ollama/ollama/blob/main/docs/import.md#quantizing-a-model
$ ollama create --quantize q4_K_M mymodel
transferring model data
quantizing F16 model to Q4_K_M
creating new layer sha256:735e246cc1abfd06e9cdcf95504d6789a6cd1ad7577108a70d9902fef503c1bd
creating new layer sha256:0853f0ad24e5865173bbf9ffcc7b0f5d56b66fd690ab1009867e45e7d2c4db0f
writing manifest
success
Note: The Ollama CLI tool only supports a subset of quantizations:
- (rounding): `q4_0`, `q4_1`, `q5_0`, `q5_1`, `q8_0`
- k-means: `q3_K_S`, `q3_K_M`, `q3_K_L`, `q4_K_S`, `q4_K_M`, `q5_K_S`, `q5_K_M`, `q6_K`
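
A quick way to see what this subset means in practice is to compare it with the quantizations listed for the Language collection. The sketch below is illustrative only: both lists are copied from this document, and `missing_from_ollama` is a hypothetical helper name.

```python
from typing import Iterable, List

# Quantizations accepted by `ollama create --quantize`, as listed in the note above.
OLLAMA_QUANTIZATIONS = [
    "q4_0", "q4_1", "q5_0", "q5_1", "q8_0",
    "q3_K_S", "q3_K_M", "q3_K_L", "q4_K_S", "q4_K_M", "q5_K_S", "q5_K_M", "q6_K",
]

# Quantizations listed for the Language collection in this document.
LANGUAGE_QUANTIZATIONS = [
    "fp16", "Q2_K", "Q3_K_L", "Q3_K_M", "Q3_K_S", "Q4_0", "Q4_1", "Q4_K_M",
    "Q4_K_S", "Q5_0", "Q5_1", "Q5_K_M", "Q5_K_S", "Q6_K", "Q8_0",
]


def missing_from_ollama(requested: Iterable[str]) -> List[str]:
    """Return requested quantizations that are not in the Ollama subset.

    The comparison is case-insensitive because the two lists use
    different capitalization (Q4_K_M vs. q4_K_M).
    """
    supported = {q.lower() for q in OLLAMA_QUANTIZATIONS}
    return [q for q in requested if q.lower() not in supported]


if __name__ == "__main__":
    print(missing_from_ollama(LANGUAGE_QUANTIZATIONS))  # ['fp16', 'Q2_K']
```

Of the Language quantizations, only `fp16` and `Q2_K` fall outside the list in the note, so those would still need another tool.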
##### Hugging Face endorsed tool "ggml-org/gguf-my-repo"
- https://huggingface.co/spaces/ggml-org/gguf-my-repo
Note:
- Similar to Ollama CLI, the web UI supports only a subset of quantizations.
---
## GGUF Verification Testing
As a baseline, each converted model MUST run successfully in the following providers:
##### llama.cpp testing
llama.cpp is the core implementation of the GGUF format and is either a direct dependency of, or forked into, nearly all downstream GGUF providers, so testing against it is essential. Specifically, the tests verify that the model can be hosted using the llama-server service.
- *See the specific…
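
One possible shape for such a hosting check is a readiness poll against the server's health endpoint. The sketch below assumes that `llama-server` has already been started with the converted model (for example `llama-server -m output_file.gguf --port 8080`) and that it answers `GET /health` with HTTP 200 once the model is loaded. The function name, timeouts, and port are assumptions made for illustration and are not taken from this repository's tests.

```python
import time
import urllib.error
import urllib.request


def wait_for_llama_server(
    base_url: str,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> bool:
    """Poll <base_url>/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not listening yet, or it answered with an error status
            # (for example while the model is still loading).
            pass
        time.sleep(poll_interval_s)
    return False


if __name__ == "__main__":
    ok = wait_for_llama_server("http://127.0.0.1:8080", timeout_s=10)
    print("server ready" if ok else "server did not become ready")
```

A readiness check like this only shows that the model loads and is served; a fuller verification would also send a prompt and inspect the response.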