What does this repo signal mean?

Meta AI (Llama) published meta-llama/llama-models (Python). This repository signal can expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo meta-llama/llama-models · language Python · Meta's flagship Llama model repo with high stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Meta AI (Llama) Repo: meta-llama/llama-models

Captured source

GitHub · github.com/meta-llama/llama-models

meta-llama/llama-models repository metadata

published Jun 27, 2024 · seen Jun 5 · captured Jun 11 · http 200 · method plain

meta-llama/llama-models

Description: Utilities intended for use with Llama models.

Language: Python

License: NOASSERTION

Stars: 7625

Forks: 1386

Open issues: 204

Created: 2024-06-27T22:14:09Z

Pushed: 2026-02-11T16:38:31Z

Default branch: main

Fork: no

Archived: no

README:

🤗 Models on Hugging Face | Blog | Website | Get Started | Llama Cookbook

---

Llama Models

Llama is an accessible, open large language model (LLM) designed for developers, researchers, and businesses to build, experiment, and responsibly scale their generative AI ideas. Part of a foundational system, it serves as a bedrock for innovation in the global community. A few key aspects:

1. Open access: Easy accessibility to cutting-edge large language models, fostering collaboration and advancements among developers, researchers, and organizations.
2. Broad ecosystem: Llama models have been downloaded hundreds of millions of times, thousands of community projects are built on Llama, and platform support is broad, from cloud providers to startups. The world is building with Llama!
3. Trust & safety: Llama models are part of a comprehensive approach to trust and safety, releasing models and tools that are designed to enable community collaboration and encourage the standardization of the development and usage of trust and safety tools for generative AI.

Our mission is to empower individuals and industry through this opportunity while fostering an environment of discovery and ethical AI advancements. The model weights are licensed for researchers and commercial entities, upholding the principles of openness.

Download

To download the model weights and tokenizer:

1. Visit the Meta Llama website.
2. Read and accept the license.
3. Once your request is approved, you will receive a signed URL via email.
4. Install the Llama Models CLI: pip install llama-models. (Start here if you have already received an email.)
5. Run llama-model list to show the latest available models and determine the model ID you wish to download. NOTE: If you want older versions of models, run llama-model list --show-all to show all the available Llama models.
6. Run: llama-model download --source meta --model-id CHOSEN_MODEL_ID
7. Pass the URL provided when prompted to start the download.

Remember that the links expire after 24 hours and a certain number of downloads. You can always re-request a link if you start seeing errors such as 403: Forbidden.

CLI Commands Reference

Once installed, the llama-model CLI provides the following commands:

llama-model list # List available models
llama-model list --show-all # List all models (including older versions)
llama-model describe -m MODEL_ID # Show detailed information about a model
llama-model download # Download models from Meta or Hugging Face
llama-model verify-download # Verify integrity of downloaded models
llama-model remove -m MODEL_ID # Remove a downloaded model
llama-model prompt-format -m MODEL_ID # Show the prompt format for a model

For detailed help on any command, run llama-model COMMAND --help.
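For scripting, the commands above can be assembled as argument lists and handed to subprocess. The following is a minimal sketch, not code from the repository: the helper names cli_args and run_cli are hypothetical, and the sketch uses only the subcommands and flags listed above, plus the --source meta --model-id form from the download steps.

```python
import shutil
import subprocess

# Subcommands taken from the CLI reference above; nothing else is assumed to exist.
KNOWN_COMMANDS = {
    "list", "describe", "download", "verify-download", "remove", "prompt-format",
}
# Commands that the reference shows with a "-m MODEL_ID" argument.
MODEL_FLAG_COMMANDS = {"describe", "remove", "prompt-format"}


def cli_args(command, model_id=None, show_all=False, source=None):
    """Build an argv list for the llama-model CLI (hypothetical helper)."""
    if command not in KNOWN_COMMANDS:
        raise ValueError(f"unknown llama-model command: {command}")
    args = ["llama-model", command]
    if command == "list" and show_all:
        args.append("--show-all")
    if command in MODEL_FLAG_COMMANDS:
        if model_id is None:
            raise ValueError(f"{command} needs a model ID")
        args += ["-m", model_id]
    if command == "download":
        # Mirrors: llama-model download --source meta --model-id CHOSEN_MODEL_ID
        if source is not None:
            args += ["--source", source]
        if model_id is not None:
            args += ["--model-id", model_id]
    return args


def run_cli(args):
    """Run a non-interactive command; return stdout, or None if the CLI is not installed."""
    if shutil.which(args[0]) is None:
        return None
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout
```

Because the download command prompts for the signed URL, it is better run directly in a terminal; run_cli is meant for non-interactive commands such as list or describe.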

Running the models

In order to run the models, you will need to install dependencies after checking out the repository.

# Run this within a suitable Python environment (uv, conda, or virtualenv)
pip install .[torch]

Example scripts are available in the models/{llama3,llama4}/scripts/ sub-directories. Note that the Llama4 series of models requires at least 4 GPUs to run inference at full (bf16) precision.

#!/bin/bash

NGPUS=4
CHECKPOINT_DIR=~/.llama/checkpoints/Llama-4-Scout-17B-16E-Instruct
PYTHONPATH=$(git rev-parse --show-toplevel) \
torchrun --nproc_per_node=$NGPUS \
-m models.llama4.scripts.chat_completion $CHECKPOINT_DIR \
--world_size $NGPUS

The above script should be used with an Instruct (Chat) model. For a Base model, update the CHECKPOINT_DIR path and use the script models.llama4.scripts.completion.
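The same launch can be driven from Python. This is a rough sketch under the assumption that torchrun is on PATH and that the caller supplies the repository root; the helper names torchrun_command and run_inference are hypothetical. It reuses only the arguments that appear in the script above and switches between the chat_completion and completion modules as described.

```python
import os
import subprocess
from pathlib import Path


def torchrun_command(checkpoint_dir, ngpus, instruct=True):
    """Build the torchrun argv used by the example script (hypothetical helper)."""
    # Instruct checkpoints use chat_completion; Base checkpoints use completion.
    script = "chat_completion" if instruct else "completion"
    return [
        "torchrun",
        f"--nproc_per_node={ngpus}",
        "-m", f"models.llama4.scripts.{script}",
        str(checkpoint_dir),
        "--world_size", str(ngpus),
    ]


def run_inference(repo_root, checkpoint_dir, ngpus=4, instruct=True):
    """Launch inference with PYTHONPATH set to the repository root."""
    # Like the shell script, this replaces PYTHONPATH rather than appending to it.
    env = dict(os.environ, PYTHONPATH=str(repo_root))
    cmd = torchrun_command(Path(checkpoint_dir).expanduser(), ngpus, instruct)
    return subprocess.run(cmd, env=env, check=True)
```

Keeping the command builder separate from the launcher makes it possible to inspect or log the exact argv before any GPUs are involved.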

Running inference with FP8 and Int4 Quantization

You can reduce the memory footprint of the models at the cost of a minimal loss in accuracy by running inference with FP8 or Int4 quantization. Use the --quantization-mode flag to specify the quantization mode. There are two modes:

fp8_mixed: Mixed precision inference with FP8 for some weights and bfloat16 for activations.
int4_mixed: Mixed precision inference with Int4 for some weights and bfloat16 for activations.

Using FP8, running Llama-4-Scout-17B-16E-Instruct requires 2 GPUs with 80GB of memory each. Using Int4, you need a single GPU with 80GB of memory.
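A back-of-the-envelope calculation shows why these GPU counts are plausible. The sketch below is illustrative only and rests on an assumption not stated on this page: that Llama 4 Scout has roughly 109 billion total parameters across its 16 experts, as Meta's public announcement reports. It also treats every weight as stored at the named precision, whereas the mixed modes quantize only some weights, so the real footprints are somewhat larger.

```python
import math

# Assumption (not stated on this page): about 109 billion total parameters,
# per Meta's public Llama 4 announcement.
TOTAL_PARAMS = 109e9
GPU_MEMORY_GB = 80

# Bytes per weight if every weight were stored at that precision.
BYTES_PER_WEIGHT = {"bf16": 2.0, "fp8_mixed": 1.0, "int4_mixed": 0.5}


def weight_memory_gb(mode):
    """Approximate weight storage in GB (10^9 bytes) for a precision mode."""
    return TOTAL_PARAMS * BYTES_PER_WEIGHT[mode] / 1e9


def min_gpus_for_weights(mode):
    """Lower bound on 80 GB GPUs needed just to hold the weights."""
    return math.ceil(weight_memory_gb(mode) / GPU_MEMORY_GB)


for mode in BYTES_PER_WEIGHT:
    print(f"{mode}: ~{weight_memory_gb(mode):.1f} GB, >= {min_gpus_for_weights(mode)} GPU(s)")
```

Under these assumptions the weights alone come to about 218 GB, 109 GB, and 54.5 GB, giving lower bounds of 3, 2, and 1 GPUs. The FP8 and Int4 bounds match the counts quoted above. The bf16 bound of 3 sits below the stated minimum of 4, which is unsurprising for a weights-only estimate that ignores activations, the KV cache, and how the model is sharded across devices.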

MODE=fp8_mixed # or int4_mixed
# Quote the variable and use the portable "=" comparison inside [ ].
if [ "$MODE" = "fp8_mixed" ]; then
  NGPUS=2
else
  NGPUS=1
fi
CHECKPOINT_DIR=~/.llama/checkpoints/Llama-4-Scout-17B-16E-Instruct
PYTHONPATH=$(git rev-parse --show-toplevel) \
torchrun --nproc_per_node=$NGPUS \
-m...

Excerpt shown — open the source for the full document.

Notability

notability 10.0/10

Meta's flagship Llama model repo with high stars.