Repo · Replicate · published Mar 13, 2024

replicate/cog-vila

Python

Description: Cog wrapper for VILA

Language: Python

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-03-13T13:55:09Z

Pushed: 2024-03-13T14:05:04Z

Default branch: main

Fork: no

Archived: no

README:

VILA

Cog wrapper for VILA, a visual language model (VLM) pretrained with interleaved image-text data. See the paper, official repo and Replicate demos for details.

How to use the API

You need Cog and Docker installed to run this model locally. To build the Docker container with Cog and run a prediction:

cog predict -i image=@sample_images/1.jpg -i prompt="Can you describe this image?"

To start a server and send requests to your locally or remotely deployed API:

cog run -p 5000 python -m cog.server.http
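Once the server is running, a client can post a prediction request to it. The following is a minimal sketch, not part of this repository. It assumes the server listens on `localhost:5000` and exposes Cog's standard `POST /predictions` endpoint, which takes a JSON body of the form `{"input": {...}}`. The helper names `image_to_data_uri`, `build_request` and `predict` are hypothetical.

```python
import base64
import json
import mimetypes
import urllib.request
from pathlib import Path


def image_to_data_uri(path):
    """Encode a local image file as a data URI so it can travel inside JSON."""
    mime = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_request(image, prompt, url="http://localhost:5000/predictions", **options):
    """Build a POST request for the /predictions endpoint (assumed URL).

    `image` is a URL or a data URI. `options` carries the optional inputs
    listed in this README (top_p, temperature, num_beams, max_tokens).
    """
    payload = {"input": {"image": image, "prompt": prompt, **options}}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(image, prompt, **options):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(image, prompt, **options)) as resp:
        return json.loads(resp.read())
```

With the server up, a call such as `predict(image_to_data_uri("sample_images/1.jpg"), "Can you describe this image?")` should return a JSON object. In Cog's HTTP API the generated text is normally found under its `output` key, but check this against the Cog version in use.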

To use VILA, provide an image and a text prompt. The response is generated by decoding the model's output using beam search with the specified parameters. The input arguments to the API are as follows:

  • image: The image to discuss.
  • prompt: The query to generate a response for.
  • top_p: When decoding text, samples from the smallest set of most likely tokens whose cumulative probability reaches top_p; lower it to ignore less likely tokens.
  • temperature: When decoding text, higher values make the output more random (more creative); lower values make it more deterministic.
  • num_beams: Number of beams to use when decoding text; higher values are slower but more accurate.
  • max_tokens: Maximum number of tokens to generate.
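
To give a feel for what `temperature` and `top_p` do to a next-token distribution, here is a toy sketch in plain Python. It shows generic temperature scaling and nucleus (top-p) filtering. It is not VILA's decoding code, and the function names are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}
```

Lowering `temperature` concentrates probability on the top-scoring token, and lowering `top_p` shrinks the set of tokens that sampling can choose from.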

References

@misc{lin2023vila,
title={VILA: On Pre-training for Visual Language Models},
author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
year={2023},
eprint={2312.07533},
archivePrefix={arXiv},
primaryClass={cs.CV}
}