LLaVA is a multi-modal model
Say multi-modal model five times fast.
It's basically an open source version of OpenAI's "Chat with images" feature, one that you might be able to use in your own applications.
LLaVA is a novel, end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.
Upload an image and ask it different things…
This item is a small, light blue, furry Prada purse.
Not bad! Let's do another. I generated this next image with DALL-E 3 for a previous blog post :)
The image features a man with a beard sitting at a desk, working on a laptop computer. He appears to be focused on his task, possibly typing or browsing the internet. The laptop is placed on the desk, with the man sitting in front of it. In addition to the laptop, there are several other items on the desk, including a keyboard, a mouse, and a cell phone. A cup can be seen nearby, possibly containing a beverage for the man while he works. There are also two books on the desk, one closer to the laptop and the other further away. A potted plant is placed on the desk, adding a touch of greenery to the workspace.
These models are the worst they will ever be… they will only get better.
I'm going to continue to play around with them and you should too.
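If you'd rather poke at LLaVA from code than through the web demo, here's a minimal sketch using the Hugging Face transformers LLaVA integration. The `llava-hf/llava-1.5-7b-hf` checkpoint, the image path, and the question are my own assumptions for illustration, not something from this post:

```python
# Minimal sketch of asking LLaVA about a local image, assuming the
# Hugging Face transformers LLaVA integration and the public
# llava-hf/llava-1.5-7b-hf checkpoint. The image path is a placeholder.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("purse.jpg")  # any local image you want to ask about
prompt = "USER: <image>\nWhat is this item? ASSISTANT:"

# Move inputs to the model's device and cast floating tensors to fp16
inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same processor/model pair handles both the image encoding and the Vicuna-style text generation, which is the "end-to-end" part of the description above.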
Please forward this newsletter to a friend.