Replicate Intelligence #7
Replicate Blog
Posted July 12, 2024 by deepfates
Welcome to Replicate’s weekly bulletin! Each week, we’ll bring you updates on the latest open-source AI models, tools, and research. People are making cool stuff and we want to share it with you. Without further ado, here’s our hacker-in-residence deepfates with an unfiltered take on the week in AI.
Editor’s note
Data. Everybody’s talking about it.
Where do you get it? Will there be enough? And most importantly, how soon will the lawyers show up?
The cure: synthetic data. Use that questionable internet scrape to create a rock-solid set of (image,caption) or (question,answer) pairs, expand your total data by a factor of 10, and delete the evidence (allegedly! what do i know).
But this doesn’t just apply to raw material. We need more data than has ever been created.
We need preference data: is this image syrupy enough for you? Is this code correct? Is this chat response groveling and obsequious?
We need action data: what is the next thing to click on this website? What is the thought process for this mathematical proof? How should the robot move its actuators to fold the laundry?
We need personality data: what does a specific person say or do in a specific scenario? What will they buy? What kind of personality do people engage with most?
Companies will be built off this data: collecting it, aggregating it, packaging it, searching it, training and fine-tuning on it. Most economically valuable activities are not documented step-by-step in text or image format. Even if you combine all the how-to videos in the world, they don’t represent the total space of possible things you can learn how to do!
This type of stepwise reasoning data becomes especially valuable as we get long-context conversations, and the ability to search the tree of possible completions for good threads. All the counterfactual branches (the things you didn’t say, the answers you would have preferred or rejected) become more data to inform the simulators.
The ideal dataset is a record of the movement of every atom in the entire universe forever. The model trained on this dataset would approximate the generative function of the universe. Everything else is a shadow of a shadow.
This is why everyone wants to train on evals, by the way. Don’t yell at me! I’m not accusing anybody in particular. I’m just saying “wants to.” The public benchmarks are, by definition, the exact type of data we want the models to understand. Long, multi-turn conversational word problems with verifiable answers? Eat that up! Please sir may I have some more!
At some point, theoretically, we will hit a data singularity, and the synthetic data will increase faster than the human-generated data needed to steer it. I don’t know when we’ll hit that point. I don’t think we’ve hit it yet. What happens when we do?
An important development in this area this week: AI engineer Andy Ayrey developed a personality clone from his own chat data, and unleashed it on the internet. Venture capitalist Marc Andreessen took a shine to the little guy and sent it one Bitcoin. Andy is now taking a salary to run his bot’s business.
— deepfates
Trending models
Open-source strikes back with massive text-to-image model
Fal.ai releases AuraFlow, a 6.8 billion parameter open-source text-to-image model that rivals closed-source alternatives. Key innovations:
Optimized architecture removes unnecessary layers for better efficiency
Novel training approach allows zero-shot learning rate transfer to larger scales
Re-captioned dataset improves instruction-following abilities
Wider, shallower model design outperforms deeper alternatives
This release demonstrates that collaborative, open AI development can still produce cutting-edge results, challenging the notion that open-source AI is falling behind.
post | try on replicate
A font file that’s secretly a language model
Researchers have created llama.ttf, a font file that doubles as a functioning language model. By exploiting features in common font-rendering software, they’ve managed to embed an entire AI inference engine inside what appears to be a normal typeface.
post | github
Cool tools
Tame your LLMs with structured generation
Will Kurt from .txt shows how to wrangle those unruly language models into shape using structured generation. Instead of playing prompt roulette, this technique lets you define exact output formats using regex.
Kurt walks through a fun example of generating fake phone numbers, proving how structure beats prompt-hacking every time.
The best part? It feels like real engineering again, with proper debugging and everything. If you’re tired of your LLMs going off the rails, this could be your new secret weapon.
post
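To make the idea concrete, here is a minimal sketch of regex-constrained decoding for the phone number case. It is not Kurt's code and does not use any real library: `fake_model_scores` is a hypothetical stand-in for an LLM, and the regex is unrolled by hand into one allowed-character set per position, which only works because this pattern has a fixed length. Real structured-generation tools compile the regex into a state machine instead.

```python
import random
import re
import string

# Target format, written as a regex: (555) 123-4567
PHONE_PATTERN = r"\(\d{3}\) \d{3}-\d{4}"

# The same pattern unrolled by hand: one set of allowed characters per position.
# Assumption for this sketch only; it relies on the pattern being fixed-length.
DIGITS = set(string.digits)
ALLOWED = (
    [{"("}] + [DIGITS] * 3 + [{")"}, {" "}]
    + [DIGITS] * 3 + [{"-"}] + [DIGITS] * 4
)

# Toy single-character vocabulary, including plenty of invalid tokens.
VOCAB = list(string.digits + string.ascii_lowercase + "()- ")


def fake_model_scores(prefix: str, rng: random.Random) -> dict:
    """Hypothetical stand-in for an LLM: a random score for every token."""
    return {tok: rng.random() for tok in VOCAB}


def generate_phone(rng: random.Random) -> str:
    out = ""
    for allowed in ALLOWED:
        scores = fake_model_scores(out, rng)
        # Mask every token the pattern forbids at this position,
        # then take the best-scoring token among those that remain.
        valid = {tok: s for tok, s in scores.items() if tok in allowed}
        out += max(valid, key=valid.get)
    return out


phone = generate_phone(random.Random(0))
assert re.fullmatch(PHONE_PATTERN, phone)  # holds no matter what the model scores
print(phone)
```

The model never gets the chance to emit a malformed number, so there is nothing to retry or repair afterwards. That is the sense in which structure replaces prompt-hacking.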
Train custom classifiers with one prompt
Augmentoolkit has released a new classifier creator that can train a complete classification model in minutes using just unlabeled text data and a single prompt.
Generates synthetic labeled data using an LLM
Trains a small classifier model locally on CPU
Iteratively improves accuracy by generating more data
Works with text, JSON, and Parquet inputs
Achieves results close to human-labeled data
Can create custom moderation systems, data quality filters, etc.
This tool allows developers to rapidly create custom classifiers for tasks like content moderation or data filtering without needing manually labeled datasets. It demonstrates how LLMs can bootstrap the creation of simpler, more deployable ML models.
github
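The loop described above (label with an LLM, train a small model, add data until it is good enough) can be sketched in a few lines. This is an illustration of the pattern, not Augmentoolkit's code or API: `llm_label` is a hypothetical keyword rule standing in for a prompted LLM so the example runs offline, and the small classifier is a hand-rolled naive Bayes chosen only to keep the sketch dependency-free.

```python
import math
from collections import Counter, defaultdict


def llm_label(text: str) -> str:
    """Hypothetical stand-in for an LLM applying the user's single prompt."""
    spammy = ("free", "winner", "prize")
    return "spam" if any(w in text.lower() for w in spammy) else "ok"


class NaiveBayes:
    """Tiny bag-of-words classifier, cheap enough to train on CPU."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def log_prob(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label])
            for w in text.lower().split():
                score += math.log((counts[w] + 1) / total)  # Laplace smoothing
            return score
        return max(self.label_counts, key=log_prob)


def bootstrap(unlabeled, holdout, batch_size=4, target=0.9):
    """Label one batch at a time and retrain until holdout accuracy hits target."""
    texts, labels = [], []
    model, accuracy = None, 0.0
    # Assumption: the LLM's own labels on the holdout serve as the reference.
    gold = [llm_label(t) for t in holdout]
    for start in range(0, len(unlabeled), batch_size):
        batch = unlabeled[start:start + batch_size]
        texts += batch
        labels += [llm_label(t) for t in batch]
        model = NaiveBayes().fit(texts, labels)
        correct = sum(model.predict(t) == g for t, g in zip(holdout, gold))
        accuracy = correct / len(holdout)
        if accuracy >= target:
            break
    return model, accuracy


unlabeled = [
    "free prize inside click now",
    "meeting moved to friday afternoon",
    "you are a winner claim your free gift",
    "lunch at noon tomorrow",
    "winner winner free prize",
    "please review the attached report",
    "claim your prize today",
    "notes from the team meeting",
]
holdout = ["free gift for the winner", "report for the friday meeting"]

model, accuracy = bootstrap(unlabeled, holdout)
print(accuracy, model.predict("free prize"))
```

The expensive model is only used to label data. What you deploy is the small classifier, which is the point the tool is making.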
Research radar
Data curation boosts multimodal learning efficiency
Researchers at Google DeepMind find that selecting diverse, learnable batches of data significantly accelerates training of large multimodal AI models.
New JEST method matches state-of-the-art results with up to 13x fewer training iterations
Technique allows high-quality models to be trained with up to 10x less compute
Approach bridges gap between small curated datasets and large uncurated ones
This work could lead to faster, more efficient training of large AI models.
paper
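The summary above does not say how "learnable" is measured. A minimal sketch, assuming the score is the current learner's loss minus a pretrained reference model's loss: data scores high when the learner still gets it wrong but a good model gets it right. The loss values are invented, and this version ranks examples one at a time, whereas JEST scores sub-batches jointly, so treat it as a simplification rather than the paper's algorithm.

```python
def learnability(learner_loss: float, reference_loss: float) -> float:
    """High when the learner still struggles but the reference model does not."""
    return learner_loss - reference_loss


def select_top_k(examples, k):
    """Keep the k examples with the highest learnability score."""
    ranked = sorted(
        examples,
        key=lambda e: learnability(e["learner_loss"], e["reference_loss"]),
        reverse=True,
    )
    return ranked[:k]


# Made-up losses for illustration only.
candidates = [
    {"id": "already_learned", "learner_loss": 0.2, "reference_loss": 0.1},
    {"id": "noisy_or_mislabeled", "learner_loss": 3.0, "reference_loss": 2.9},
    {"id": "learnable", "learner_loss": 2.5, "reference_loss": 0.3},
    {"id": "somewhat_learnable", "learner_loss": 1.5, "reference_loss": 0.6},
]

chosen = [e["id"] for e in select_top_k(candidates, k=2)]
print(chosen)  # ['learnable', 'somewhat_learnable']
```

Subtracting the reference loss is what filters out noise: an example that is hard for both models is probably junk, not a learning opportunity.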
What AI engineers need to know about search
A comprehensive guide for AI engineers diving into search technology, covering everything from basic concepts to advanced techniques.
The guide emphasizes practical aspects like handling presentation bias, implementing click models, and understanding the precision/recall tradeoff. It’s a valuable resource for anyone…
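For readers new to the precision/recall tradeoff, a small worked example helps. The document IDs and relevance judgments below are invented, and `precision_recall_at_k` is a name made up for this sketch, not something taken from the guide.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision: share of the top k that is relevant.
    Recall: share of all relevant documents that made it into the top k."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc in top_k if doc in relevant_ids)
    return hits / k, hits / len(relevant_ids)


# Hypothetical search results in ranked order, plus the truly relevant set.
ranked = ["d1", "d7", "d3", "d9", "d4", "d2"]
relevant = {"d1", "d3", "d2"}

for k in (1, 3, 6):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
# k=1: precision=1.00 recall=0.33
# k=3: precision=0.67 recall=0.67
# k=6: precision=0.50 recall=1.00
```

Showing more results finds more of the relevant documents but dilutes the page with irrelevant ones.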