databricks/lilac

Description: Curate better data for LLMs

Language: Python

License: Apache-2.0

Stars: 1071

Forks: 105

Open issues: 88

Created: 2023-03-23T21:19:10Z

Pushed: 2024-03-19T12:41:30Z

Default branch: main

Fork: no

Archived: yes

README: Lilac

Better data, better AI

🔗 Try the Lilac web demo!

Lilac is a tool for exploration, curation and quality control of datasets for training, fine-tuning and monitoring LLMs.

Lilac is used by companies like Cohere and Databricks to visualize, quantify and improve the quality of pre-training and fine-tuning data.

Lilac runs on-device using open-source LLMs with a UI and Python API.

🆒 New

Lilac Garden is our hosted platform for blazing fast dataset-level computations. Sign up to join the pilot.

  • Cluster & title millions of documents with the power of LLMs.

Explore and search over 36,000 clusters of 4.3M documents in OpenOrca

Why use Lilac?

  • Explore your data interactively with LLM-powered search, filter, clustering and annotation.
  • Curate AI data, applying best practices like removing duplicates, PII and obscure content to reduce dataset size and lower training cost and time.

  • Inspect and collaborate with your team on a single, centralized dataset to improve data quality.
  • Understand how data changes over time.

Lilac can offload expensive computations to Lilac Garden, our hosted platform for blazing fast dataset-level computations.

> See our 3min walkthrough video

🔥 Getting started

💻 Install

pip install lilac[all]

If you prefer no local installation, you can duplicate our Spaces demo by following documentation here.

For more detailed instructions, see our installation guide.

🌐 Start a webserver

Start a Lilac webserver with our lilac CLI:

lilac start ~/my_project

Or start the Lilac webserver from Python:

import lilac as ll

ll.start_server(project_dir='~/my_project')

This will start a webserver at http://localhost:5432/ where you can load datasets and explore them.

Lilac Garden

Lilac Garden is our hosted platform for running dataset-level computations. We utilize powerful GPUs to accelerate expensive signals like Clustering, Embedding, and PII. Sign up to join the pilot.

  • Cluster and title a million data points in 20 mins
  • Embed your dataset at half a billion tokens per min
  • Run your own signal

📊 Load data

Datasets can be loaded directly from HuggingFace, Parquet, CSV, JSON, LangSmith from LangChain, SQLite, LlamaHub, Pandas, and more. More documentation here.

import lilac as ll

ll.set_project_dir('~/my_project')
dataset = ll.from_huggingface('imdb')
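For a local file instead of a HuggingFace dataset, the following is a minimal sketch. It assumes the `ll.CSVSource`, `ll.DatasetConfig`, and `ll.create_dataset` calls behave as described in Lilac's own documentation. The sample rows, the file name `reviews.csv`, and the dataset name `my_reviews` are made up for illustration. `lilac` is imported lazily inside `load_csv_dataset`, so the CSV-writing part runs even where Lilac is not installed.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csv(path: Path) -> None:
    """Write a tiny two-column CSV (hypothetical sample data)."""
    rows = [
        {'text': 'A warm, funny film.', 'label': 'pos'},
        {'text': 'Two hours I will never get back.', 'label': 'neg'},
    ]
    with path.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['text', 'label'])
        writer.writeheader()
        writer.writerows(rows)


def load_csv_dataset(csv_path: Path, name: str = 'my_reviews'):
    """Register the CSV as a Lilac dataset in the current project.

    Assumption: lilac is installed and ll.set_project_dir(...) was called first.
    """
    import lilac as ll  # lazy import so write_sample_csv works without lilac

    source = ll.CSVSource(filepaths=[str(csv_path)])
    config = ll.DatasetConfig(namespace='local', name=name, source=source)
    return ll.create_dataset(config)


# Create the sample file in a fresh temporary directory.
sample_path = Path(tempfile.mkdtemp()) / 'reviews.csv'
write_sample_csv(sample_path)
```

With a project directory set via `ll.set_project_dir('~/my_project')`, calling `load_csv_dataset(sample_path)` should then register the file under the `local` namespace.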

If you prefer, you can load datasets directly from the UI without writing any Python.

🔎 Explore

> [!NOTE]
> 🔗 Explore OpenOrca and its clusters before installing!

Once we've loaded a dataset, we can explore it from the UI and get a sense for what's in the data. More documentation here.

✨ Clustering

Cluster any text column to get automated dataset insights:

dataset = ll.get_dataset('local', 'imdb')
dataset.cluster('text') # add `use_garden=True` to offload to Lilac Garden

> [!TIP]
> Clustering on device can be slow or impractical, especially on machines without a powerful GPU or large memory. Offloading the compute to Lilac Garden, our hosted data processing platform, can speed up clustering by more than 100x.

⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Neardup, etc)

Annotating data with signals will produce another column in your data.

dataset = ll.get_dataset('local', 'imdb')
dataset.compute_signal(ll.LangDetectionSignal(), 'text') # Detect language of each doc.

# [PII] Find emails, phone numbers, ip addresses, and secrets.
dataset.compute_signal(ll.PIISignal(), 'text')

# [Text Statistics] Compute readability scores, number of chars, TTR, non-ascii chars, etc.
dataset.compute_signal(ll.TextStatisticsSignal(), 'text')

# [Near Duplicates] Computes clusters based on minhash LSH.
dataset.compute_signal(ll.NearDuplicateSignal(), 'text')

# Print the resulting manifest, with the new field added.
print(dataset.manifest())
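To make "a signal produces another column" concrete, here is a small standalone sketch that does not use Lilac at all. It treats a signal as a plain function from a text value to an annotation and writes the result into a new field on each row. The helper names (`email_signal`, `type_token_ratio`, `compute_signal`), the email regex, and the sample rows are all illustrative assumptions, not Lilac's implementation of `PIISignal` or its text statistics.

```python
import re

# Deliberately simple email pattern; real PII detection is far more thorough.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')


def email_signal(text: str) -> list[str]:
    """Return every email-like substring found in the text."""
    return EMAIL_RE.findall(text)


def type_token_ratio(text: str) -> float:
    """TTR: number of distinct tokens divided by total tokens."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def compute_signal(rows, signal_fn, path, output_field):
    """Return new rows with signal_fn(row[path]) stored under output_field."""
    return [{**row, output_field: signal_fn(row[path])} for row in rows]


rows = [
    {'text': 'Contact me at jane.doe@example.com for the the details.'},
    {'text': 'Great movie, great cast.'},
]
rows = compute_signal(rows, email_signal, 'text', 'emails')
rows = compute_signal(rows, type_token_ratio, 'text', 'ttr')

for row in rows:
    print(row)
```

Each pass leaves the original `text` field untouched and adds one field, which mirrors how the manifest gains a new field after every computed signal.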

We can also compute signals from the UI.
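The near-duplicate signal is described above as clustering based on MinHash LSH. As a rough illustration of the MinHash half of that idea, the sketch below estimates Jaccard similarity between two documents from fixed-size signatures. It uses only the standard library, omits the LSH banding step entirely, and is not Lilac's implementation; the shingle size, the number of hash functions, and all function names are assumptions chosen for readability.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles of the text."""
    words = re.findall(r'\w+', text.lower())
    if len(words) < k:
        return {' '.join(words)} if words else set()
    return {' '.join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed: int, item: str) -> int:
    """64-bit hash of item; the seed (passed as a BLAKE2 salt) selects the hash function."""
    digest = hashlib.blake2b(
        item.encode('utf-8'), digest_size=8, salt=seed.to_bytes(8, 'little')
    ).digest()
    return int.from_bytes(digest, 'little')


def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [
        min((seeded_hash(seed, s) for s in items), default=0)
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    assert len(sig_a) == len(sig_b)
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = 'the movie was great and the cast was wonderful'
near_dup = 'the movie was great and the cast was superb'
unrelated = 'quarterly revenue figures are attached to this email'

sig = minhash_signature(original)
print(estimated_jaccard(sig, minhash_signature(near_dup)))
print(estimated_jaccard(sig, minhash_signature(unrelated)))
```

A full near-duplicate pipeline would additionally split each signature into bands and bucket documents by band, so that only documents sharing a bucket are compared.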

🔎 Search

Semantic and conceptual search requires computing an embedding first:

dataset.compute_embedding('gte-small', path='text')

Semantic search

In the UI, we can search by semantic similarity or by classic keyword search to find chunks of documents similar to a query.

We can run the same search in Python:

rows = dataset.select_rows(
  columns=['text', 'label'],
  searches=[
    ll.SemanticSearch(
      path='text',
      query='hilarious comedy',  # example query text; replace with your own
      embedding='gte-small')
  ],
  limit=1)

print(list(rows))
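Conceptually, a semantic search embeds the query, scores every document by vector similarity, and returns the top results. The following standalone sketch shows that ranking step under a loud simplification: `embed` is a word-count vector standing in for a real embedding model such as `gte-small`, so it only captures word overlap, not meaning. The function names and sample documents are illustrative and not part of Lilac.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: sparse word counts (NOT a neural embedding)."""
    return Counter(re.findall(r'\w+', text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_search(docs: list[str], query: str, limit: int = 1) -> list[str]:
    """Return the `limit` documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(embed(d), query_vec), reverse=True)
    return ranked[:limit]


docs = [
    'The plot was slow and boring.',
    'A hilarious comedy with a great cast.',
    'Stunning visuals but a weak script.',
]
top = semantic_search(docs, 'hilarious comedy', limit=1)
print(top)
```

Swapping `embed` for a real sentence-embedding model keeps the ranking logic unchanged while letting paraphrases match.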

Conceptual search

Conceptual search is a much more controllable and powerful version of semantic search, where "concepts" can be taught to Lilac by providing positive and negative examples of that concept.
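One way to picture "teaching" a concept from positive and negative examples is sketched below. This excerpt does not describe how Lilac scores concepts internally, so the approach here is an assumption made for illustration: a text scores high when it is closer to the positive examples than to the negative ones. As in any toy setup, `embed` is a word-count stand-in for a real embedding model, and every name and example sentence is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: sparse word counts (NOT a neural embedding)."""
    return Counter(re.findall(r'\w+', text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(texts: list[str]) -> Counter:
    """Sum of the example vectors; direction is all that matters for cosine."""
    total = Counter()
    for text in texts:
        total.update(embed(text))
    return total


def concept_score(text: str, positives: list[str], negatives: list[str]) -> float:
    """Positive when the text resembles the positive examples more than the negative ones."""
    vec = embed(text)
    return cosine(vec, centroid(positives)) - cosine(vec, centroid(negatives))


spam_examples = ['Win a free prize now', 'Click here for free money']
not_spam_examples = ['See you at the meeting tomorrow', 'The report is attached']

spam_like = concept_score('Claim your free prize', spam_examples, not_spam_examples)
ham_like = concept_score('Is the meeting still on tomorrow', spam_examples, not_spam_examples)
print(spam_like, ham_like)
```

Adding or removing examples changes the centroids and therefore the scores, which is the controllability that distinguishes a concept from a single free-text query.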

Lilac provides a set of built-in concepts, but you can create your own for very specif

We can create a concept in Python with a few examples, and search by it:

db = ll.DiskConceptDB()
db.create(namespace='local', name='spam')
# Add examples of spam and not-spam.
db.edit('local', 'spam', ll.concepts.ConceptUpdate(
insert=[…
