NousResearch/nous-llama.cpp
forked from ggml-org/llama.cpp
Description: LLM inference in C/C++ - nous research
License: MIT
Stars: 8
Forks: 1
Open issues: 0
Created: 2024-03-10T06:36:11Z
Pushed: 2024-03-15T19:59:13Z
Default branch: master
Fork: yes
Parent repository: ggml-org/llama.cpp
Archived: no
README:
llama.cpp
Roadmap / Project status / Manifesto / ggml
Inference of Meta's LLaMA model (and others) in pure C/C++
> [!IMPORTANT]
> Quantization blind testing: https://github.com/ggerganov/llama.cpp/discussions/5962
>
> Vote for which quantization type provides better responses, all other parameters being the same.
Recent API changes
- [2024 Mar 8] `llama_kv_cache_seq_rm()` returns a `bool` instead of `void`, and new `llama_n_max_seq()` returns the upper limit of acceptable `seq_id` in batches (relevant when dealing with multiple sequences) https://github.com/ggerganov/llama.cpp/pull/5328
- [2024 Mar 4] Embeddings API updated https://github.com/ggerganov/llama.cpp/pull/5796
- [2024 Mar 3] `struct llama_context_params` https://github.com/ggerganov/llama.cpp/pull/5849
Hot topics
- Initial Mamba support has been added: https://github.com/ggerganov/llama.cpp/pull/5328
----
Table of Contents
Description
Usage
Get the Code
Build
BLAS Build
Prepare and Quantize
Run the quantized model
Memory/Disk Requirements
Quantization
Interactive mode
Constrained output with grammars
Instruct mode
Obtaining and using the Facebook LLaMA 2 model
Seminal papers and background on the models
Perplexity (measuring model quality)
Android
Docker
Contributing
Coding guidelines
Docs
Description
The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.
- Plain C/C++ implementation without any dependencies
- Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
- AVX, AVX2 and AVX512 support for x86 architectures
- 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
- Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)
- Vulkan, SYCL, and (partial) OpenCL backend support
- CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
Since its inception, the project has improved significantly thanks to many contributions. It is the main playground for developing new features for the ggml library.
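To make the integer quantization listed among the features above more concrete, here is a minimal sketch of symmetric 8-bit block quantization in C. It is an illustration under stated assumptions, not the ggml implementation: the block size of 32, the `block_q8_sketch` layout, and the function names are all hypothetical. Each block stores one `float` scale plus 32 `int8_t` values, so on typical platforms 32 weights take 36 bytes instead of the 128 bytes needed for `float` storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 32 /* assumption: number of weights sharing one scale */

/* Hypothetical block layout for this sketch; not ggml's actual struct. */
typedef struct {
    float  scale;         /* per-block scale factor */
    int8_t q[BLOCK_SIZE]; /* quantized weights in [-127, 127] */
} block_q8_sketch;

/* Quantize n floats into n / BLOCK_SIZE blocks.
 * n must be a multiple of BLOCK_SIZE. */
static void quantize_q8_sketch(const float *x, block_q8_sketch *out, size_t n) {
    assert(n % BLOCK_SIZE == 0);
    for (size_t b = 0; b < n / BLOCK_SIZE; b++) {
        const float *src = x + b * BLOCK_SIZE;

        /* Find the largest magnitude in the block. */
        float amax = 0.0f;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            float a = src[i] < 0.0f ? -src[i] : src[i];
            if (a > amax) amax = a;
        }

        /* Map [-amax, amax] onto [-127, 127]; an all-zero block keeps scale 0. */
        float scale = amax / 127.0f;
        float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;

        for (int i = 0; i < BLOCK_SIZE; i++) {
            float v = src[i] * inv;
            /* Round half away from zero without needing libm. */
            out[b].q[i] = (int8_t)(v >= 0.0f ? (int)(v + 0.5f) : (int)(v - 0.5f));
        }
    }
}

/* Reconstruct approximate floats from the quantized blocks. */
static void dequantize_q8_sketch(const block_q8_sketch *in, float *y, size_t n) {
    assert(n % BLOCK_SIZE == 0);
    for (size_t b = 0; b < n / BLOCK_SIZE; b++) {
        for (int i = 0; i < BLOCK_SIZE; i++) {
            y[b * BLOCK_SIZE + i] = in[b].scale * (float)in[b].q[i];
        }
    }
}
```

With this scheme the round-trip error of each weight is bounded by about half the block's scale. Lower bit widths trade a larger error for a smaller footprint; the formats actually used by the project are more elaborate than this sketch.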
Supported platforms:
- [X] Mac OS
- [X] Linux
- [X] Windows (via CMake)
- [X] Docker
- [X] FreeBSD
Supported models:
Typically, finetunes of the base models below are supported as well.
- [X] LLaMA 🦙
- [x] LLaMA 2 🦙🦙
- [X] Mistral 7B
- [x] Mixtral MoE
- [X] Falcon
- [X] Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- [X] Vigogne (French)
- [X] Koala
- [X] Baichuan 1 & 2 + derivations
- [X] Aquila 1 & 2
- [X] Starcoder models
- [X] Refact
- [X] Persimmon 8B
- [X] MPT
- [X] Bloom
- [x] Yi models
- [X] StableLM models
- [x] Deepseek models
- [x] Qwen models
- [x] PLaMo-13B
- [x] Phi models
- [x] GPT-2
- [x] Orion 14B
- [x] InternLM2
- [x] CodeShell
- [x] Gemma
- [x] Mamba
Multimodal models:
- [x] LLaVA 1.5 models, LLaVA 1.6 models
- [x] BakLLaVA
- [x] Obsidian
- [x] ShareGPT4V
- [x] MobileVLM 1.7B/3B models
- [x] Yi-VL
HTTP server
[llama.cpp web server](./examples/server) is a lightweight OpenAI API compatible HTTP server that can be used to serve local models and easily connect them to existing clients.
Bindings:
- Python: abetlen/llama-cpp-python
- Go: go-skynet/go-llama.cpp
- Node.js: withcatai/node-llama-cpp
- JS/TS (llama.cpp server client): lgrammel/modelfusion
- JavaScript/Wasm (works in browser): tangledgroup/llama-cpp-wasm
- Ruby: yoshoku/llama_cpp.rb
- Rust (nicer API): mdrokz/rust-llama.cpp
- Rust (more direct bindings):…