Repo · Arcee AI · published Mar 29, 2024

arcee-ai/fastmlx

Python


Description: FastMLX is a high performance production ready API to host MLX models.

Language: Python

License: NOASSERTION

Stars: 358

Forks: 43

Open issues: 24

Created: 2024-03-29T18:33:55Z

Pushed: 2025-03-18T02:01:50Z

Default branch: main

Fork: no

Archived: no

README:

FastMLX


FastMLX is a high-performance, production-ready API for hosting MLX models, including Vision Language Models (VLMs) and Language Models (LMs).

  • Free software: Apache Software License 2.0
  • Documentation: https://Blaizzy.github.io/fastmlx

Features

  • OpenAI-compatible API: Easily integrate with existing applications that use OpenAI's API.
  • Dynamic Model Loading: Load MLX models on-the-fly or use pre-loaded models for better performance.
  • Support for Multiple Model Types: Compatible with various MLX model architectures.
  • Image Processing Capabilities: Handle both text and image inputs for versatile model interactions.
  • Efficient Resource Management: Optimized for high performance and scalability.
  • Error Handling: Robust error management for production environments.
  • Customizable: Easily extendable to accommodate specific use cases and model types.

Usage

1. Installation

pip install fastmlx

2. Running the Server

Start the FastMLX server:

fastmlx

or

uvicorn fastmlx:app --reload --workers 0

> [!WARNING]
> The --reload flag should not be used in production. It is only intended for development purposes.

Running with Multiple Workers (Parallel Processing)

For improved performance and parallel processing capabilities, you can specify either the absolute number of worker processes or the fraction of CPU cores to use. This is particularly useful for handling multiple requests simultaneously.

You can also set the FASTMLX_NUM_WORKERS environment variable to specify the number of workers or the fraction of CPU cores to use. If the value is neither passed explicitly nor set via the environment variable, the number of workers defaults to 2.

In order of precedence (highest to lowest), the number of workers is determined by the following:

  • Explicitly passed as a command-line argument
      • --workers 4 will set the number of workers to 4
      • --workers 0.5 will set the number of workers to half the number of CPU cores available (minimum of 1)
  • Set via the FASTMLX_NUM_WORKERS environment variable
  • Default value of 2

To use all available CPU cores, set the value to 1.0.

Example:

fastmlx --workers 4

or

uvicorn fastmlx:app --workers 4
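The precedence rules above can be sketched in a few lines of Python. This is an illustration only, not FastMLX's actual implementation: the function names `parse_workers` and `resolve_workers` are hypothetical, and the rule "an integer string is an absolute count, a decimal string is a fraction of CPU cores" is an assumption chosen so that `4` means four workers while `1.0` means all cores.

```python
import os
from typing import Mapping, Optional, Union

DEFAULT_WORKERS = 2  # default value stated in the README


def parse_workers(value: str, cpu_count: int) -> int:
    """Assumed rule: '4' is an absolute count, '0.5' is a fraction of cores."""
    try:
        return int(value)  # integer strings are taken as an absolute count
    except ValueError:
        fraction = float(value)  # e.g. '0.5' or '1.0'
        return max(1, int(cpu_count * fraction))  # never fewer than 1 worker


def resolve_workers(
    cli_value: Optional[Union[str, int, float]] = None,
    env: Optional[Mapping[str, str]] = None,
    cpu_count: Optional[int] = None,
) -> int:
    """Hypothetical helper: CLI argument, then environment variable, then default."""
    env = os.environ if env is None else env
    cpu_count = cpu_count or os.cpu_count() or 1
    if cli_value is not None:  # 1. explicit command-line argument
        return parse_workers(str(cli_value), cpu_count)
    env_value = env.get("FASTMLX_NUM_WORKERS")
    if env_value:  # 2. environment variable
        return parse_workers(env_value, cpu_count)
    return DEFAULT_WORKERS  # 3. default
```

For example, on an 8-core machine `resolve_workers("0.5", env={}, cpu_count=8)` yields 4 under these assumptions, and a command-line value always wins over `FASTMLX_NUM_WORKERS`.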

> [!NOTE]
> - --reload flag is not compatible with multiple workers
> - The number of workers should typically not exceed the number of CPU cores available on your machine for optimal performance.

Considerations for Multi-Worker Setup

1. Stateless Application: Ensure your FastMLX application is stateless, as each worker process operates independently.
2. Database Connections: If your app uses a database, make sure your connection pooling is configured to handle multiple workers.
3. Resource Usage: Monitor your system's resource usage to find the optimal number of workers for your specific hardware and application needs. Additionally, you can remove any unused models using the delete model endpoint.
4. Load Balancing: When running with multiple workers, incoming requests are automatically load-balanced across the worker processes.

By leveraging multiple workers, you can significantly improve the throughput and responsiveness of your FastMLX application, especially under high load conditions.

3. Making API Calls

Call the API the same way you would call OpenAI's chat completions endpoint:

Vision Language Model

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/nanoLLaVA-1.5-4bit",
    "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
    "messages": [{"role": "user", "content": "What are these"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())
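The call above prints the whole JSON body. If you only need the generated text, a small helper can pull it out. This is a sketch that assumes the non-streaming response follows the OpenAI chat completions shape (`choices[0].message.content`); the README states the API is OpenAI-compatible but does not show the response body, and `extract_reply` is a hypothetical name.

```python
from typing import Any, Mapping


def extract_reply(payload: Mapping[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion body.

    Assumes the shape {"choices": [{"message": {"content": "..."}}]}.
    """
    choices = payload.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    message = choices[0].get("message") or {}
    content = message.get("content")
    if content is None:
        raise ValueError("first choice has no message content")
    return content
```

With the request above, usage would look like `print(extract_reply(response.json()))`.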

With streaming:

import requests
import json

def process_sse_stream(url, headers, data):
    response = requests.post(url, headers=headers, json=data, stream=True)

    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(response.text)
        return

    full_content = ""

    try:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    event_data = line[6:]  # Remove 'data: ' prefix
                    if event_data == '[DONE]':
                        print("\nStream finished. ✅")
                        break
                    try:
                        chunk_data = json.loads(event_data)
                        content = chunk_data['choices'][0]['delta']['content']
                        full_content += content
                        print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        print(f"\nFailed to decode JSON: {event_data}")
                    except KeyError:
                        print(f"\nUnexpected data structure: {chunk_data}")

    except KeyboardInterrupt:
        print("\nStream interrupted by user.")
    except requests.exceptions.RequestException as e:
        print(f"\nAn error occurred: {e}")

if __name__ == "__main__":
    url = "http://localhost:8000/v1/chat/completions"
    headers = {"Content-Type": "application/json"}
    data = {
        "model": "mlx-community/nanoLLaVA-1.5-4bit",
        "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
        "messages": [{"role": "user", "content": "What are these?"}],
        "max_tokens": 500,
        "stream": True
    }
    process_sse_stream(url, headers, data)
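The parsing logic in the streaming example can also be separated from the network call, which makes it easy to test without a running server. The sketch below is one possible arrangement, not part of FastMLX: `iter_sse_content` is a hypothetical name, and it assumes the same `data: ` line prefix, `[DONE]` sentinel, and `choices[0].delta.content` layout used in the example above. Unlike that example, it silently skips chunks whose delta has no `content` key instead of reporting them.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from raw SSE lines until the '[DONE]' sentinel."""
    prefix = "data: "
    for raw in lines:
        if not raw:
            continue  # blank lines separate events
        line = raw.decode("utf-8")
        if not line.startswith(prefix):
            continue  # ignore anything that is not a data line
        event_data = line[len(prefix):]
        if event_data == "[DONE]":
            return  # end of stream
        chunk = json.loads(event_data)
        delta = chunk["choices"][0].get("delta", {})
        content = delta.get("content")
        if content:  # some chunks may carry no text
            yield content
```

With a live response it could be used as `for piece in iter_sse_content(response.iter_lines()): print(piece, end="", flush=True)`.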

Language Model

import requests
import json

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}
data = {
    "model": "mlx-community/gemma-2-9b-it-4bit",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 100
}

response = requests.post(url, headers=headers, data=json.dumps(data))
print(response.json())
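The vision and language requests differ only in the optional "image" field, and streaming only adds "stream": True. A small builder can make that explicit. This is a convenience sketch, with `build_chat_payload` as a hypothetical helper name; it uses only the request fields that appear in the examples in this README.

```python
from typing import Any, Dict, Optional


def build_chat_payload(
    model: str,
    prompt: str,
    image: Optional[str] = None,
    max_tokens: int = 100,
    stream: bool = False,
) -> Dict[str, Any]:
    """Build a request body for /v1/chat/completions from the README's fields."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if image is not None:
        payload["image"] = image  # only vision language models take an image
    if stream:
        payload["stream"] = True  # request a streamed response
    return payload
```

The result can be passed directly as `requests.post(url, headers=headers, json=build_chat_payload(...))`.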

With streaming:

import requests
import json

def process_sse_stream(url, headers, data):
    response = requests.post(url, headers=headers, json=data, stream=True)

    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(response.text)
        return

    full_content = ""

    try:
        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    event_data = line[6:]  # Remove 'data: ' prefix
                    if…
