Fork · Nous Research · published Mar 10, 2024

NousResearch/nous-llama.cpp

forked from ggml-org/llama.cpp


Description: LLM inference in C/C++ - nous research

License: MIT

Stars: 8

Forks: 1

Open issues: 0

Created: 2024-03-10T06:36:11Z

Pushed: 2024-03-15T19:59:13Z

Default branch: master

Fork: yes

Parent repository: ggml-org/llama.cpp

Archived: no

README:

llama.cpp

Roadmap / Project status / Manifesto / ggml

Inference of Meta's LLaMA model (and others) in pure C/C++

> [!IMPORTANT]
> Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
>
> Vote for which quantization type provides better responses, all other parameters being the same.

Recent API changes

  • [2024 Mar 8] llama_kv_cache_seq_rm() now returns a bool instead of void, and the new llama_n_max_seq() returns the upper limit of acceptable seq_id values in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
  • [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
  • [2024 Mar 3] struct llama_context_params https://github.com/ggerganov/llama.cpp/pull/5849

Hot topics

  • Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328

----

Table of Contents

  • Description
  • Usage
  • Get the Code
  • Build
  • BLAS Build
  • Prepare and Quantize
  • Run the quantized model
  • Memory/Disk Requirements
  • Quantization
  • Interactive mode
  • Constrained output with grammars
  • Instruct mode
  • Obtaining and using the Facebook LLaMA 2 model
  • Seminal papers and background on the models
  • Perplexity (measuring model quality)
  • Android
  • Docker
  • Contributing
  • Coding guidelines
  • Docs

Description

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

  • Plain C/C++ implementation without any dependencies
  • Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
  • AVX, AVX2 and AVX512 support for x86 architectures
  • 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
  • Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
  • Vulkan, SYCL, and (partial) OpenCL backend support
  • CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
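
To make the quantization item above concrete, here is a minimal sketch of block-wise 8-bit integer quantization with one shared scale per block. It is an illustration under stated assumptions, not the format ggml uses: the `QuantBlock` layout, the block size of 32, and the helper names are all hypothetical. The `approx_bytes` helper shows the arithmetic behind "reduced memory use" and ignores the per-block scale overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical block size and layout, chosen for illustration only.
constexpr std::size_t kBlockSize = 32;

struct QuantBlock {
    float scale;                // one scale shared by the whole block
    std::int8_t q[kBlockSize];  // quantized values in [-127, 127]
};

// Quantize up to kBlockSize floats: scale so the largest magnitude maps to 127.
QuantBlock quantize_block(const float* x, std::size_t n) {
    assert(n <= kBlockSize);
    QuantBlock b{};
    float amax = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    b.scale = amax / 127.0f;
    // An all-zero block has scale 0; avoid dividing by it.
    const float inv = b.scale > 0.0f ? 1.0f / b.scale : 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        b.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    }
    return b;
}

// Recover an approximation of the original value at index i.
float dequantize(const QuantBlock& b, std::size_t i) {
    return b.scale * static_cast<float>(b.q[i]);
}

// Approximate weight storage in bytes, ignoring per-block scale overhead.
double approx_bytes(double n_weights, double bits_per_weight) {
    return n_weights * bits_per_weight / 8.0;
}
```

By this arithmetic, `approx_bytes(7e9, 4.0)` gives 3.5e9 bytes for a hypothetical 7-billion-parameter model at 4 bits per weight, against 1.4e10 bytes at 16 bits per weight. Lower bit widths trade reconstruction accuracy for that saving.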

Since its inception, the project has improved significantly thanks to many contributions. It is the main playground for developing new features for the ggml library.

Supported platforms:

  • [X] Mac OS
  • [X] Linux
  • [X] Windows (via CMake)
  • [X] Docker
  • [X] FreeBSD

Supported models:

Typically finetunes of the base models below are supported as well.

Multimodal models:

HTTP server

The [llama.cpp web server](./examples/server) is a lightweight, OpenAI API-compatible HTTP server that serves local models and connects them to existing clients.
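
Because the server follows the OpenAI API, an existing client talks to it by sending the same JSON it would send to OpenAI. The sketch below builds such a chat request body in dependency-free C++. The helper names and the `local-model` placeholder are hypothetical. The endpoint path `/v1/chat/completions` and the address `http://localhost:8080` mentioned afterwards are assumptions based on the OpenAI convention and are not stated in this excerpt.

```cpp
#include <cstdio>
#include <string>

// Escape a string so it can sit inside a JSON string literal.
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (static_cast<unsigned char>(c) < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x",
                                  static_cast<unsigned>(static_cast<unsigned char>(c)));
                    out += buf;
                } else {
                    out += c;
                }
        }
    }
    return out;
}

// Build a minimal OpenAI-style chat request with a single user message.
std::string chat_request_body(const std::string& model,
                              const std::string& user_message) {
    return "{\"model\":\"" + json_escape(model) +
           "\",\"messages\":[{\"role\":\"user\",\"content\":\"" +
           json_escape(user_message) + "\"}]}";
}
```

The string returned by `chat_request_body("local-model", "Hello")` could then be sent with any HTTP client as a POST with `Content-Type: application/json`, for example to `http://localhost:8080/v1/chat/completions` if the server runs at that assumed address.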

Bindings:
