RepoTogether AITogether AIpublished Jun 2, 2024seen 5d

togethercomputer/Dragonfly

Python

Open original ↗

Captured source

source ↗
published Jun 2, 2024seen 5dcaptured 10hhttp 200method plain

togethercomputer/Dragonfly

Language: Python

License: NOASSERTION

Stars: 81

Forks: 12

Open issues: 2

Created: 2024-06-02T04:21:48Z

Pushed: 2024-10-17T21:54:13Z

Default branch: main

Fork: no

Archived: no

README:

🔥 News

📖 Introduction

![Dragonfly framework](assets/model_overview.png)

Recent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we go beyond recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. Our biomedical version, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and attains state-of-the-art results across the majority of performance metrics. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains.

![Example Generations](assets/examples.png)

📖 Table of Contents

  • [📖 Table of Contents](#-table-of-contents)
  • [💿 Installation](#-installation)
  • [🏁 Checkpoint](#-checkpoint)
  • [🧠 Inference](#-inference)
  • [📊 Dataset](#-dataset)
  • [🏋️‍♂️ Training](#️️-training)
  • [Stage 1](#stage-1)
  • [Stage 2](#stage-2)
  • [🏆 Credits](#-credits)
  • [📚 BibTeX](#-bibtex)
  • [🪪 License](#-license)

💿 Installation

Create a conda environment and install necessary packages

conda env create -f environment.yml
conda activate dragonfly_env

Install flash attention

pip install flash-attn --no-build-isolation

As a final step, please run the following command.

pip install --upgrade -e .

🏁 Checkpoint

*Note: These models are released under [Llama 3.1 Community License Agreement](LICENSE)*

We release two huggingface model checkpoints: `togethercomputer/Llama-3.1-8B-Dragonfly-v2` and `togethercomputer/Llama-3.1-8B-Dragonfly-Med-v2`. Please follow the script [test_dragonfly.py](test_dragonfly.py) for more details. We provide a brief description on how to use them below.

🧠 Inference

If you have successfully completed the [Installation](#installation) process, then you should be able to follow the steps below.

We provide two test examples inside [assets](assets).

Question: What is so funny about this image?

![Monalisa Dog](assets/monalisa_dog.jpg)

Load necessary packages

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed

Instantiate the tokenizer, processor, and model.

device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)

Now, lets load the image and process them.

image = Image.open("./assets/monalisa_dog.jpg")
image = image.convert("RGB")
images = [image]
# images = [None] # if you do not want to pass any images

text_prompt = "user\n\nWhat is so funny about this image?assistant\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)

Finally, let us generate the responses from the model

temperature = 0

with torch.inference_mode():
generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode(""), do_sample=temperature > 0, temperature=temperature, use_cache=True)

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)

An example response.

The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci. The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with the body of the Mona Lisa, the artist has created a…

Excerpt shown — open the source for the full document.