Lightning AI, published Nov 15, 2023

8-bit Quantization with Lightning Fabric


Takeaways

Readers will learn the basics of Lightning Fabric’s plugin for 8-bit quantization.

Introduction

The aim of 8-bit quantization is to reduce the memory used by model parameters by storing them in a lower-precision type than full (float32) or half (bfloat16) precision. In other words, 8-bit quantization compresses models that have billions of parameters, such as Llama 2 or SDXL, so that they require less memory. Thankfully, Lightning Fabric makes quantization as easy as setting a mode flag in a plugin!

8-bit Quantization

8-bit quantization is discussed in the popular paper 8-bit Optimizers via Block-wise Quantization, and an 8-bit floating point format is proposed in FP8 Formats for Deep Learning. As the latter paper states, 8-bit precision is a natural progression after 16-bit precision. Even so, moving to 8 bits is not as simple as moving from FP32 to FP16: those two floating point types share the same representation scheme, whereas 8-bit quantization requires a new one. With only 8 bits, at most 256 distinct values can be represented, far fewer than with FP16 or FP32.

As a result, quantization can degrade model performance, so it is worth keeping this trade-off in mind. Additionally, if the weights will be used on an edge device that requires quantization, the model should be evaluated in its quantized form.

To use 8-bit quantization for inference in Lightning Fabric, set the plugin's mode flag to "int8".

    from lightning.fabric import Fabric
    from lightning.fabric.plugins import BitsandbytesPrecision

    # available 8-bit quantization modes
    # ("int8")

    mode = "int8"
    plugin = BitsandbytesPrecision(mode=mode)
    fabric = Fabric(plugins=plugin)

    model = CustomModule()  # your PyTorch model
    model = fabric.setup_module(model)  # quantizes the layers

Conclusion

Quantization is a must for most production systems, given that edge devices and consumer-grade hardware typically require models with a much smaller memory footprint than more powerful hardware such as NVIDIA’s A100 80GB. Learning about this technique gives a better understanding of how large models such as Llama 2 and SDXL are deployed, and of the requirements for edge devices in robotics, vehicles, and other systems.

Still have questions? We have an amazing community and team of core engineers ready to answer your questions. So, join us on Discourse or Discord. See you there!

Resources and References

Quantization in Lightning Fabric
Introduction to Quantization
Introduction to Quantization and API Summary
Quantization in Practice
Post Training Quantization
FP8 Formats for Deep Learning
8-bit Optimizers via Block-wise Quantization
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Automatic Mixed Precision for Deep Learning
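To make the trade-off described in the 8-bit Quantization section concrete, here is a minimal sketch of one simple scheme, absmax quantization with a single scale, written with only the Python standard library. It is an illustration under stated assumptions, not the algorithm that bitsandbytes or Lightning Fabric uses internally. The helper names (absmax_quantize, dequantize) and the random stand-in weights are hypothetical.

```python
import random
from array import array


def absmax_quantize(weights):
    """Map floats to signed 8-bit integers using a single absmax scale."""
    absmax = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale works.
    scale = absmax / 127.0 if absmax > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return array("f", (q * scale for q in quantized))


random.seed(0)
# Hypothetical stand-in for one layer's float32 weights.
weights = array("f", (random.gauss(0.0, 0.02) for _ in range(4096)))

quantized, scale = absmax_quantize(weights)
restored = dequantize(quantized, scale)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 code.
fp32_bytes = len(weights) * weights.itemsize
int8_bytes = len(quantized) * quantized.itemsize

# Precision cost: each value is rounded to the nearest multiple of `scale`.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"float32 storage: {fp32_bytes} bytes")
print(f"int8 storage:    {int8_bytes} bytes (plus one float scale)")
print(f"distinct codes used: {len(set(quantized))} of 256 possible")
print(f"max round-trip error: {max_error:.6f} (half a step is {scale / 2:.6f})")
```

The sketch shows both sides of the trade-off: the int8 copy takes a quarter of the float32 storage, and every weight is off by up to about half a quantization step after the round trip. Production libraries typically use finer-grained scales (per block or per row) to reduce this error, which is one reason a quantized model should still be evaluated before deployment.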


Additional captured pages

FP8 FORMATS FOR DEEP LEARNING Paulius Micikevicius, Dusan Stosic, Patrick Judd, John Kamalu, Stuart Oberman, Mohammad Shoeybi, Michael Siu, Hao Wu NVIDIA {pauliusm, dstosic, pjudd, jkamalu, soberman, mshoeybi, msiu, skyw}@nvidia.com Neil Burgess, Sangwon Ha, Richard…