
Replicate Intelligence #7


Posted July 12, 2024 by deepfates

Welcome to Replicate’s weekly bulletin! Each week, we’ll bring you updates on the latest open-source AI models, tools, and research. People are making cool stuff and we want to share it with you. Without further ado, here’s our hacker-in-residence deepfates with an unfiltered take on the week in AI.

Editor’s note

Data. Everybody’s talking about it.

Where do you get it? Will there be enough? And most importantly, how soon will the lawyers show up?

The cure: synthetic data. Use that questionable internet scrape to create a rock-solid set of (image, caption) or (question, answer) pairs, expand your total data by a factor of 10, and delete the evidence (allegedly! What do I know?).

But this doesn’t just apply to raw material. We need more data than has ever been created.

We need preference data: is this image syrupy enough for you? Is this code correct? Is this chat response groveling and obsequious?

We need action data: what is the next thing to click on this website? What is the thought process for this mathematical proof? How should the robot move its actuators to fold the laundry?

We need personality data: what does a specific person say or do in a specific scenario? What will they buy? What kind of personality do people engage with most?

Companies will be built off this data: collecting it, aggregating it, packaging it, searching it, training and fine-tuning on it. Most economically valuable activities are not documented step-by-step in text or image format. Even if you combine all the how-to videos in the world, they don’t represent the total space of possible things you can learn how to do!

This type of stepwise reasoning data becomes especially valuable as we get long-context conversations and the ability to search the tree of possible completions for good threads. All the counterfactual branches (the things you didn’t say, the answers you would have preferred or rejected) become more data to inform the simulators.

The ideal dataset is a record of the movement of every atom in the entire universe forever. The model trained on this dataset would approximate the generative function of the universe. Everything else is a shadow of a shadow.

This is why everyone wants to train on evals, by the way. Don’t yell at me! I’m not accusing anybody in particular. I’m just saying “wants to.” The public benchmarks are, by definition, the exact type of data we want the models to understand. Long, multi-turn conversational word problems with verifiable answers? Eat that up! Please sir may I have some more!

At some point, theoretically, we will hit a data singularity, and the synthetic data will increase faster than the human-generated data needed to steer it. I don’t know when we’ll hit that point. I don’t think we’ve hit it yet. What happens when we do?

An important development in this area this week: AI engineer Andy Ayrey developed a personality clone from his own chat data, and unleashed it on the internet. Venture capitalist Marc Andreessen took a shine to the little guy and sent it one Bitcoin. Andy is now taking a salary to run his bot’s business.

— deepfates

Trending models

Open-source strikes back with massive text-to-image model

Fal.ai releases AuraFlow, a 6.8 billion parameter open-source text-to-image model that rivals closed-source alternatives. Key innovations:

Optimized architecture removes unnecessary layers for better efficiency

Novel training approach allows zero-shot learning rate transfer to larger scales

Re-captioned dataset improves instruction-following abilities

Wider, shallower model design outperforms deeper alternatives

This release demonstrates that collaborative, open AI development can still produce cutting-edge results, challenging the notion that open-source AI is falling behind.

post | try on replicate

A font file that’s secretly a language model

Researchers have created llama.ttf, a font file that doubles as a functioning language model. By exploiting features in common font-rendering software, they’ve managed to embed an entire AI inference engine inside what appears to be a normal typeface.

post | github

Cool tools

Tame your LLMs with structured generation

Will Kurt from .txt shows how to wrangle those unruly language models into shape using structured generation. Instead of playing prompt roulette, this technique lets you define exact output formats using regex.

Kurt walks through a fun example of generating fake phone numbers, proving how structure beats prompt-hacking every time.

The best part? It feels like real engineering again, with proper debugging and everything. If you’re tired of your LLMs going off the rails, this could be your new secret weapon.
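To make the idea concrete, here is a minimal sketch of structured generation in plain Python. It is not Kurt's code or any library's API. The "model" is a hypothetical stand-in that returns random scores, and the phone-number pattern is unrolled by hand into one allowed-character set per position, where a real tool would compile a regex into a state machine. The point is only the masking step: tokens the structure forbids are removed before the pick.

```python
import random
import re
import string

# Hypothetical vocabulary: one character per "token".
VOCAB = list(string.digits + string.ascii_lowercase + "()- ")

# Template for "(ddd) ddd-dddd": one allowed-character set per position.
# This hand-unrolled list stands in for the state machine a real library
# would build from a regex.
DIGITS = set(string.digits)
TEMPLATE = (
    [{"("}] + [DIGITS] * 3 + [{")"}, {" "}]
    + [DIGITS] * 3 + [{"-"}] + [DIGITS] * 4
)

def toy_model_scores(prefix, rng):
    """Stand-in for an LLM: returns an arbitrary score for every token."""
    return {tok: rng.random() for tok in VOCAB}

def generate_structured(rng):
    out = []
    for allowed in TEMPLATE:
        scores = toy_model_scores("".join(out), rng)
        # Mask: drop every token the structure forbids at this position.
        legal = {tok: s for tok, s in scores.items() if tok in allowed}
        # Greedy pick among the tokens that survive the mask.
        out.append(max(legal, key=legal.get))
    return "".join(out)

PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
sample = generate_structured(random.Random(0))
print(sample, bool(PHONE_RE.fullmatch(sample)))
```

Because the mask is applied at every step, the output matches the pattern no matter how badly the underlying model behaves, which is why no retry loop is needed.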

post

Train custom classifiers with one prompt

Augmentoolkit has released a new classifier creator that can train a complete classification model in minutes using just unlabeled text data and a single prompt.

Generates synthetic labeled data using an LLM

Trains a small classifier model locally on CPU

Iteratively improves accuracy by generating more data

Works with text, JSON, and Parquet inputs

Achieves results close to human-labeled data

Can create custom moderation systems, data quality filters, etc.

This tool allows developers to rapidly create custom classifiers for tasks like content moderation or data filtering without needing manually labeled datasets. It demonstrates how LLMs can bootstrap the creation of simpler, more deployable ML models.
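The bootstrapping loop can be sketched in a few lines of standard-library Python. This is an illustration of the general pattern, not Augmentoolkit's implementation. `fake_llm_label` is a hypothetical keyword rule standing in for the LLM labeling call, the naive Bayes model is my own choice of "small classifier", and the iterative step that generates more data is left out.

```python
import math
from collections import Counter, defaultdict

def fake_llm_label(text):
    """Hypothetical stand-in for an LLM labeling call (a keyword rule)."""
    return "spam" if ("free" in text or "winner" in text) else "ok"

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing; runs on CPU."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Step 1: unlabeled text in, synthetic labels out.
unlabeled = [
    "free prize inside click now",
    "you are a winner claim your free gift",
    "meeting moved to three pm",
    "please review the attached report",
    "winner winner free cash",
    "lunch tomorrow with the team",
]
labels = [fake_llm_label(t) for t in unlabeled]

# Step 2: train the small local model on the synthetic labels.
clf = NaiveBayes()
clf.fit(unlabeled, labels)
print(clf.predict("claim your free cash prize"))
print(clf.predict("team meeting report tomorrow"))
```

Once trained, the small model can classify new text without calling the LLM again, which is the deployability win described above.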

github

Research radar

Data curation boosts multimodal learning efficiency

Researchers at Google DeepMind find that selecting diverse, learnable batches of data significantly accelerates training of large multimodal AI models.

New JEST method selects batches 13x more efficiently than random sampling

Technique allows high-quality models to be trained with 10x less compute

Approach bridges gap between small curated datasets and large uncurated ones

This work could lead to faster, more efficient training of large AI models.
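A rough sketch of the selection idea, under loud assumptions. As I understand it, the method favors data that the model being trained still finds hard but a pretrained reference model finds easy. The version below scores each example independently by that loss gap and keeps the top k. The paper selects batches jointly, so treat this as a simplification. All names and loss values here are made up for illustration.

```python
def learnability(learner_loss, reference_loss):
    # High when the learner still struggles but the reference finds it easy.
    return learner_loss - reference_loss

def select_batch(candidates, k):
    """candidates: list of (example_id, learner_loss, reference_loss)."""
    ranked = sorted(
        candidates,
        key=lambda c: learnability(c[1], c[2]),
        reverse=True,
    )
    return [example_id for example_id, _, _ in ranked[:k]]

# Made-up losses for illustration only.
super_batch = [
    ("a", 2.0, 0.5),  # learnable: hard for the learner, easy for the reference
    ("b", 0.3, 0.2),  # already learned
    ("c", 2.5, 2.4),  # hard for both models: likely noise
    ("d", 1.5, 0.4),  # learnable
]
chosen = select_batch(super_batch, k=2)
print(chosen)  # ['a', 'd']
```

Subtracting the reference loss is what filters out example "c": it has the highest raw loss, but the reference model cannot learn it either, so it is probably noise rather than signal.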

paper

What AI engineers need to know about search

A comprehensive guide for AI engineers diving into search technology, covering everything from basic concepts to advanced techniques.

The guide emphasizes practical aspects like handling presentation bias, implementing click models, and understanding the precision/recall tradeoff. It’s a valuable resource for anyone…
