LLaVA is a multi-modal model
Say multi-modal model five times fast.
It's basically an open source version of OpenAI's "Chat with images" feature, one that you might be able to use in your own applications.
LLaVA is a novel, end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.
Upload an image and ask it different things…
This item is a small, light blue, furry Prada purse.
Not bad! Let's do another. I generated this next image with DALL-E 3 for a previous blog post :)
The image features a man with a beard sitting at a desk, working on a laptop computer. He appears to be focused on his task, possibly typing or browsing the internet. The laptop is placed on the desk, with the man sitting in front of it. In addition to the laptop, there are several other items on the desk, including a keyboard, a mouse, and a cell phone. A cup can be seen nearby, possibly containing a beverage for the man while he works. There are also two books on the desk, one closer to the laptop and the other further away. A potted plant is placed on the desk, adding a touch of greenery to the workspace.
These models are the worst they will ever be… they will only get better.
I'm going to continue to play around with them and you should too.
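If you'd rather poke at LLaVA from code than through the web demo, here's a minimal sketch using the Hugging Face transformers LLaVA integration. The `llava-hf/llava-1.5-7b-hf` checkpoint, the image path, and the question are my own assumptions for illustration, not something from this post:

```python
# Minimal sketch of asking LLaVA about a local image, assuming the
# Hugging Face transformers LLaVA integration and the public
# llava-hf/llava-1.5-7b-hf checkpoint. The image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("purse.jpg")  # any local image you want to ask about
prompt = "USER: <image>\nWhat is this item? ASSISTANT:"

# Move inputs to the model's device and cast floating tensors to fp16
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same processor/model pair handles both the image encoding and the Vicuna-style text generation, which is the "end-to-end" part of the description above.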
Please forward this newsletter to a friend.