amazon-science/text_generation_diffusion_llm_topic
Description: Topic Embedding, Text Generation and Modeling using diffusion
Language: Python
License: Apache-2.0
Stars: 15
Forks: 5
Open issues: 3
Created: 2024-01-30T22:44:58Z
Pushed: 2026-06-10T20:30:25Z
Default branch: main
Fork: no
Archived: no
README:
DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM (Accepted by EMNLP 2023 as Findings)
This repository is the official implementation of DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM.
DeTiME can generate embeddings, do diffusion and
Installation
To install requirements:
pip install -r requirements.txt
Training and Evaluation
To train and evaluate the model, follow these steps:
Step 1: Specify the data. If the data is on Hugging Face, set --data_source to the dataset repository. If the data is a CSV file, set --data_source csv and give the file location with --data_path.
Step 2: Define the number of topics. For 10 topics, use --numb_embeddings 10.
Step 3: Define the metric you want to evaluate. Currently supported metrics include diversity, c_v, c_uci, etc.
Then run:
python3 main.py --data_source xwjzds/ag_news --metric diversity --topk 20
This outputs the diversity metric computed on the data in xwjzds/ag_news.
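As a rough illustration of what the diversity score measures, the sketch below computes topic diversity as the share of unique words among the top-k words of all topics. The function name and the toy topics are invented for this example. It is a minimal version of the idea, not the code used by this repository or by OCTIS.

```python
def topic_diversity(topics, topk=10):
    """Fraction of unique words among the top-k words of all topics."""
    top_words = [word for topic in topics for word in topic[:topk]]
    if not top_words:
        return 0.0
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, invented for illustration only.
topics = [
    ["market", "stock", "trade", "bank"],
    ["game", "team", "season", "trade"],
]
score = topic_diversity(topics, topk=4)
print(score)
```

A score close to 1 means the topics share few top words; a score close to 0 means the topics largely repeat the same words.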
Embedding Explained
Diffusion Explained
After getting the embeddings from the encoders of DeTiME, diffusion can be used to denoise them. The denoised embeddings can then be passed to the decoders of DeTiME to generate text.
Training the diffusor involves two steps.
Step 1: generate embeddings of the dataset using the encoders of DeTiME. The code below shows how to generate the embeddings:
import gc
from tqdm import tqdm

# dataset, tokenizer, models and args are assumed to be defined earlier
outputs = []
text_ls = dataset['summary']
batch_size = 2
batch_ls = [text_ls[ind: ind + batch_size] for ind in range(0, len(text_ls), batch_size)]
print(dataset)
for text in tqdm(batch_ls):
    # inputs = tokenizer(text, return_tensors="pt").input_ids
    # attention = tokenizer(text, return_tensors="pt").attention_mask
    # add instruction
    # text = ['repeat: ' + t for t in text]
    inputs = tokenizer(text, return_tensors="pt", padding='max_length', truncation=True, max_length=args.max_length)
    # get the inputs and attention
    inputs_id = inputs.input_ids.to(models.device)
    attention = inputs.attention_mask.to(models.device)
    # batch size * seq length * embedding size
    output = models.model.encoder(inputs_id, attention).last_hidden_state
    output = models.encoder(output)
    outputs.append(output.detach().cpu())
    gc.collect()
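The batching step in the snippet above can be tried in isolation. The sketch below uses placeholder strings instead of a real dataset, and the helper name make_batches is made up for this example. It only shows how the list of texts is split before each batch is tokenized and encoded.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Placeholder texts standing in for dataset['summary'].
texts = ["doc a", "doc b", "doc c", "doc d", "doc e"]
batches = make_batches(texts, batch_size=2)
print(batches)
```

Note that the last batch can be smaller than batch_size when the number of texts is not a multiple of it.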
Step 2: train a diffusor using the embeddings. To train a diffusor, run:
python diffuser_training.py --embedding_input './example/embed_vectors_base_7_1000_prefix.pt' --model_name 'UNet_Conv' --output_dir './example'
Here, embedding_input is the location of the embedding file, model_name is the name of the diffusor model to train, and output_dir is the location where the trained diffusor is saved.
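The README does not state which noise schedule or training objective diffuser_training.py uses. As a hedged illustration of what training a diffusor on embeddings usually involves, the sketch below assumes a standard DDPM-style forward process with a linear beta schedule. The schedule bounds, function names, and toy embedding are all assumptions made for this example, not the repository's code.

```python
import math
import random


def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (common DDPM defaults, assumed here)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) and return it with the noise that was used."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise


timesteps = 1000
alpha_bars = cumulative_alphas(linear_beta_schedule(timesteps))
rng = random.Random(0)
embedding = [0.5, -1.2, 0.3, 2.0]  # toy stand-in for one embedding vector
x_t, noise = add_noise(embedding, t=500, alpha_bars=alpha_bars, rng=rng)
```

Under this assumption, a denoising network would be trained to predict noise from x_t and t, so that it can later reverse the process step by step.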
To generate text from the denoised embeddings, three steps are involved.
Step 1: generate embedding of datasets using the encoders of the DeTiME.
Step 2: denoise the embeddings with the trained diffusor.
import torch
from diffusion.diffusion_generate import generate_diffused_embed, generate_text

# num_images, latent_dim, device and model are assumed to be defined earlier
# generate from the noise vector
sampling_turn = 2
timesteps = 1000
x_noise = torch.randn((num_images, 4, latent_dim // 4), device=device)
x_track_ls_ls_noise, x_0_track_ls_ls_noise = generate_diffused_embed(x_noise, model, timesteps, device, batch_size=2, num_generated_sample=2, return_all_time_embed=True)
Step 3: generate text from the denoised embeddings.
Interactive Code
Example of using dataset from OCTIS
from octis.dataset.dataset import Dataset
import sys
sys.path.insert(0, '../src/topicmodeling')
from model import TopicModel
from datasets import load_dataset
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence
dataset = Dataset()
dataset.fetch_dataset("20NewsGroup") #It can support 20NewsGroup, BBC_News, DBLP, DBPedia_IT
tm = TopicModel(numb_embeddings = 10)
texts = [' '.join(i) for i in dataset.get_corpus()]
model_output = tm.train_model(texts)
metric = TopicDiversity(topk=10)
topic_diversity_score = metric.score(model_output) # Compute score of diversity
cmetric = Coherence(texts = tm.tp.lemmas, measure='c_npmi')
coherence = cmetric.score(model_output) # Compute score of coherence
Example of using datasets from huggingface
import sys
sys.path.insert(0, '../src/topicmodeling')
from model import TopicModel
from datasets import load_dataset
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence
df = load_dataset('xwjzds/ag_news')
tm = TopicModel(numb_embeddings = 10)
model_output = tm.train_model(df['train']['text'])
metric = TopicDiversity(topk=10)
topic_diversity_score = metric.score(model_output) # Compute score of diversity
cmetric = Coherence(texts = tm.tp.lemmas, measure='c_npmi')
coherence = cmetric.score(model_output) # Compute score of coherence
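To give an intuition for the c_npmi coherence measure used in the examples above, the sketch below computes normalized pointwise mutual information (NPMI) for word pairs from document-level co-occurrence. This is a deliberate simplification: library implementations typically estimate probabilities from sliding windows over the text. The function names and toy documents are made up for this example.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """NPMI of two words, using whole documents as co-occurrence windows."""
    n_docs = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n_docs
    p_b = sum(word_b in doc for doc in documents) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n_docs
    if p_ab == 0:
        return -1.0  # the words never co-occur: lowest possible value
    if p_ab == 1:
        return 1.0  # the words co-occur in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(topic_words, documents):
    """Average NPMI over all word pairs of one topic."""
    pairs = list(combinations(topic_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


# Toy corpus: each document is a set of tokens, invented for illustration.
docs = [
    {"stock", "market", "bank"},
    {"stock", "market"},
    {"game", "team"},
    {"team", "season", "game"},
]
print(topic_npmi(["stock", "market"], docs))
print(npmi("stock", "game", docs))
```

Values range from -1 (words never appear together) to 1 (words always appear together), so a higher average indicates a more coherent topic.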
Arguments Explained:
--numb_embeddings: Number of topic embeddings, i.e. the number of topics (default is 10).
--epochs: Number of epochs for training (default is 20).
--batch_size: Batch size for training (default is 256).
--gpu_num: GPU number to use (default is 1).
--learning_rate: Learning rate (default is 0.002).
--weight_decay: Weight decay (default is 1.2e-6).
--penalty: Penalty term (default is 1).
--beta: Beta value (default is 1).
--temp: Temperature (default is 10).
--data_source: Data source type (default is 'huggingface'). Can be 'huggingface', 'csv', or 'txt'.
--data_path: Path to the data file for 'csv' or 'txt' (default is '').
--metrics: List of metrics to report (default is ['diversity', 'c_v', 'c_npmi', 'c_uci', 'u_mass']).
--topk: Top k words to report for diversity (default is 10).
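For reference, the options listed above could be declared with argparse roughly as follows. This is a reconstruction from the README list only, not the actual contents of main.py. The argument types and the use of nargs="+" for --metrics are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the documented options; types are assumed."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")
    parser.add_argument("--numb_embeddings", type=int, default=10)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--batch_size", type=int, default=256)
    parser.add_argument("--gpu_num", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=0.002)
    parser.add_argument("--weight_decay", type=float, default=1.2e-6)
    parser.add_argument("--penalty", type=float, default=1.0)
    parser.add_argument("--beta", type=float, default=1.0)
    parser.add_argument("--temp", type=float, default=10.0)
    parser.add_argument("--data_source", type=str, default="huggingface")
    parser.add_argument("--data_path", type=str, default="")
    parser.add_argument(
        "--metrics",
        nargs="+",
        default=["diversity", "c_v", "c_npmi", "c_uci", "u_mass"],
    )
    parser.add_argument("--topk", type=int, default=10)
    return parser


# Parse an explicit argument list so the example does not read sys.argv.
args = build_parser().parse_args(
    ["--numb_embeddings", "20", "--metrics", "diversity", "c_v"]
)
print(args.numb_embeddings, args.metrics, args.topk)
```

Any option that is not passed keeps the default shown in the list above.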
Results
Our model achieves the following performance on Ag News:
| Model name | Diversity | C_v | C_npmi |
| ---------- | --------- | ----- | ------ |
| vONT | 0.865 | 0.618 | 0.115 |
| DeTiME | 0.93 | 0.645 | 0.113 |
We use existing embeddings in this code release instead of spherical embeddings, because training spherical embeddings takes time. We noticed that this reported performance is better than the…