Mistral Launches Its First Multimodal AI Model: Pixtral 12B

Key Highlights

Mistral's new Pixtral 12B model can process both text and images.

It builds on the company's earlier Nemo 12B text model but now includes a vision adapter for image tasks.

The model can be downloaded from GitHub and Hugging Face.

Mistral, a French AI startup, has introduced Pixtral 12B, its first AI model that handles both text and images. This multimodal model allows users to perform tasks like image captioning, object counting, and image classification, making it versatile for various applications.

What Is Pixtral 12B?

Pixtral 12B builds on Mistral’s Nemo 12B text model. What sets it apart is a vision adapter with 400 million parameters, which lets the model process images alongside text, as other advanced models such as OpenAI's GPT-4 and Anthropic’s Claude do.

With this model, users can input images either by providing a URL or encoding them in base64 format. The model breaks down images into 16x16 pixel patches, which helps it process high-resolution images effectively. It also uses a technique called 2D RoPE (Rotary Position Embeddings), which helps the model understand spatial relationships within images.
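The three ideas in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mistral's implementation: the function names are hypothetical, the data-URL form is just one common way to pass base64 images, the patch count simply rounds partial tiles up, and the 2D RoPE layout (first half of the vector rotated by row, second half by column) is one common formulation that may differ from the layout Pixtral actually uses.

```python
import base64
import math


def to_data_url(image_bytes, mime_type="image/png"):
    """Encode raw image bytes as a base64 data URL (an assumed input convention)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def patch_grid(width, height, patch=16):
    """Return (columns, rows, total) of patch x patch tiles covering an image.

    Partial tiles at the edges are rounded up; real preprocessing may
    resize or pad the image differently.
    """
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols, rows, cols * rows


def rope_2d(vec, row, col, base=10000.0):
    """Apply a simple 2D rotary embedding to one patch vector.

    Assumed layout: consecutive pairs in the first half of `vec` are rotated
    by the row index, pairs in the second half by the column index.
    """
    dim = len(vec)
    if dim % 4 != 0:
        raise ValueError("vector length must be a multiple of 4")
    half = dim // 2
    out = list(vec)
    for start, pos in ((0, row), (half, col)):
        for k in range(half // 2):
            theta = pos * base ** (-2.0 * k / half)  # per-pair rotation angle
            i = start + 2 * k
            x, y = vec[i], vec[i + 1]
            out[i] = x * math.cos(theta) - y * math.sin(theta)
            out[i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out


if __name__ == "__main__":
    print(to_data_url(b"abc"))    # data:image/png;base64,YWJj
    print(patch_grid(1024, 768))  # (64, 48, 3072)
```

Because each pair is rotated by an angle proportional to its position, the dot product between two rotated vectors depends only on the row and column offsets between the patches, which is how this kind of embedding exposes spatial relationships to attention.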

Performance and Capabilities

Pixtral 12B has 12 billion parameters. Parameter count is a rough indicator of a model's capacity for complex tasks, not a guarantee of quality. Although Pixtral 12B is much smaller than giants like GPT-3, which has 175 billion parameters, it’s still a significant step forward for Mistral in the multimodal AI space.
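To put those parameter counts in practical terms, here is a back-of-the-envelope estimate of how much memory the weights alone would occupy. The precisions listed are illustrative assumptions (the article does not say which format the weights ship in), the helper name is hypothetical, and the estimate ignores activations and other runtime overhead.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


PIXTRAL_PARAMS = 12e9  # 12 billion, as stated in the article
GPT3_PARAMS = 175e9    # 175 billion, as stated in the article

# Assumed storage precisions, for illustration only.
for label, nbytes in (("32-bit", 4), ("16-bit", 2), ("8-bit", 1)):
    size = weight_memory_gb(PIXTRAL_PARAMS, nbytes)
    print(f"{label}: about {size:.0f} GB of weights")

ratio = GPT3_PARAMS / PIXTRAL_PARAMS
print(f"GPT-3 has roughly {ratio:.1f}x as many parameters")
```

At 16-bit precision the weights come to about 24 GB, which is why a 12-billion-parameter model is within reach of a single high-memory GPU while a 175-billion-parameter model is not.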

Availability and Licensing

Pixtral 12B is currently available for download on GitHub and Hugging Face. The exact licensing for the model hasn’t been confirmed, but it may follow the Apache 2.0 license used for some of Mistral's earlier models.

For now, the model appears to be free for research and academic purposes, with commercial use requiring a paid license. Mistral has also announced that the model will soon be available for testing on its chatbot and API platforms, Le Chat and La Plateforme.

Conclusion

Mistral’s Pixtral 12B represents a promising step into the world of multimodal AI, offering both text and image processing capabilities. While it is far smaller than models like GPT-3, it provides a valuable tool for researchers and developers looking for a free-to-use model for academic work.