Google (DeepMind / Gemini), published Sep 12, 2025

VaultGemma: The world's most capable differentially private LLM

September 12, 2025 Amer Sinha, Software Engineer, and Ryan McKenna, Research Scientist, Google Research

We introduce VaultGemma, the most capable model trained from scratch with differential privacy.

As AI becomes more integrated into our lives, building it with privacy at its core is a critical frontier for the field. Differential privacy (DP) offers a mathematically sound solution by adding calibrated noise to prevent memorization. However, applying DP to LLMs introduces trade-offs, and understanding them is crucial. Applying DP noise alters traditional scaling laws (rules describing performance dynamics) by reducing training stability (the model's ability to learn consistently without experiencing catastrophic events like loss spikes or divergence) and significantly increasing the required batch size (the number of training examples sent to the model simultaneously for processing) and computation costs. Our new research, “Scaling Laws for Differentially Private Language Models”, conducted in partnership with Google DeepMind, establishes laws that accurately model these intricacies, providing a complete picture of the compute-privacy-utility trade-offs. Guided by this research, we’re excited to introduce VaultGemma, the largest (1B-parameter) open model trained from scratch with differential privacy. We are releasing the weights on Hugging Face and Kaggle, alongside a technical report, to advance the development of the next generation of private AI.

Understanding the scaling laws

With a carefully thought-out experimental methodology, we aimed to quantify the benefit of increasing model sizes, batch sizes, and iterations in the context of DP training. Our work required making some simplifying assumptions to overcome the exponential number of combinations one might consider trying. We assumed that how well the model learns depends mostly on the “noise-batch ratio”, which compares the amount of random noise we add for privacy to the size of the data groups (batches) we use for training. This assumption works because the privacy noise we add is much greater than any natural randomness that comes from sampling the data. To establish a DP scaling law, we conducted a comprehensive set of experiments to evaluate performance across a variety of model sizes and noise-batch ratios. The resulting empirical data, together with known deterministic relationships between other variables, allows us to answer a variety of interesting scaling-law-style queries, such as, “For a given compute budget, privacy budget, and data budget, what is the optimal training configuration to achieve the lowest possible training loss?”

The structure of our DP scaling laws. We establish that predicted loss can be accurately modeled using primarily the model size, iterations and the noise-batch ratio, simplifying the complex interactions between the compute, privacy, and data budgets.
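To make the noise-batch ratio concrete, here is a minimal sketch of one noisy gradient computation in the style of DP-SGD. It is not the VaultGemma training code. The function names are hypothetical, and the ratio is assumed to be the noise multiplier divided by the batch size, which is one reading of the description above; the paper's exact definition may differ.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale one per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, then average."""
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise is calibrated to the clipping norm (the per-example sensitivity).
    noise_std = noise_multiplier * clip_norm
    noisy_sum = [s + rng.gauss(0.0, noise_std) for s in summed]
    return [v / batch_size for v in noisy_sum]


def noise_batch_ratio(noise_multiplier, batch_size):
    """Assumed definition: noise multiplier divided by batch size."""
    return noise_multiplier / batch_size


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.1, -0.2], [-1.0, 0.5], [0.0, 2.0]]
    print(dp_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
    for batch_size in (256, 1024, 4096):
        print(batch_size, noise_batch_ratio(1.0, batch_size))
```

In this sketch the noise left in the averaged gradient has standard deviation `noise_multiplier * clip_norm / batch_size`, so at a fixed noise multiplier a larger batch directly shrinks the noise each update sees.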

Key findings: A powerful synergy

Before diving into the full scaling laws, it’s useful to understand the dynamics and synergies between the compute budget, privacy budget, and data budget from a privacy accounting perspective, i.e., to understand how these factors influence the noise-batch ratio for a fixed model size and number of iterations. This analysis is significantly cheaper to do as it does not require any model training, yet it yields a number of useful insights. For instance, increasing the privacy budget in isolation leads to diminishing returns, unless coupled with a corresponding increase in either the compute budget (FLOPs) or data budget (tokens).

Marginal benefit of increasing the privacy budget (epsilon) and the compute budget (batch size) in terms of their effect on the noise-batch ratio.

To explore this synergy further, the visualization below shows how the optimal training configuration changes based on different constraints. As the privacy and compute budgets change, notice how the recommendation shifts between investing in a larger model versus training with larger batch sizes or more iterations.

Predicted training loss for different settings of data/privacy/compute budget, and a further detailed breakdown by the number of iterations, batch size, and model size. The plots show both the minimum achievable loss for different budget settings, along with the optimal hyper-parameter configurations.

This data provides a wealth of useful insights for practitioners. While all the insights are reported in the paper, a key finding is that one should train a much smaller model with a much larger batch size than would be used without DP. This should be unsurprising to a DP expert, given the importance of large batch sizes. Although the insight holds across many settings, the optimal training configurations do change with the privacy and data budgets. Understanding the exact trade-off is crucial to ensure that both the compute and privacy budgets are used judiciously in real training scenarios. The above visualizations also reveal that there is often wiggle room in the training configurations: a range of model sizes might provide very similar utility if paired with the correct number of iterations and/or batch size.

Applying the scaling laws to build VaultGemma

The Gemma models are designed with responsibility and safety at their core. This makes them a natural foundation for developing a production-quality, DP-trained model like VaultGemma.

Algorithmic advancements: Training at scale

The scaling laws we derived above represent an important first step towards training a useful Gemma model with DP. We used the scaling laws to determine both how much compute we needed to train a compute-optimal 1B parameter Gemma 2-based model with DP, and how to allocate that compute among batch size, iterations, and sequence length to achieve the best utility. One prominent gap between the research underlying the scaling laws and the actual training of VaultGemma was our handling of Poisson sampling, which is a central component of DP-SGD. We initially used a straightforward method of loading data in uniform batches but then switched to Poisson sampling to get the best privacy…
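For readers unfamiliar with Poisson sampling, the sketch below shows the basic idea: every example is included in a batch independently with a fixed probability, so the batch size varies from step to step instead of being uniform. This is an illustration only, not VaultGemma's data pipeline; `poisson_sample` and its parameters are hypothetical names.

```python
import random


def poisson_sample(num_examples, sampling_prob, rng):
    """Return indices of examples, each included independently with sampling_prob."""
    return [i for i in range(num_examples) if rng.random() < sampling_prob]


if __name__ == "__main__":
    rng = random.Random(0)
    num_examples, expected_batch = 10_000, 100
    # The sampling probability sets the expected batch size, not the exact one.
    q = expected_batch / num_examples
    sizes = [len(poisson_sample(num_examples, q, rng)) for _ in range(5)]
    print(sizes)  # batch sizes fluctuate around expected_batch
```

The fluctuating batch size is what separates this from loading fixed-size batches, and any training loop built on it has to cope with batches that are larger or smaller than expected.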
