
Multimodal Support

Work with images, audio, and documents in llms.py

llms.py supports input modalities beyond plain text, letting you send images, audio files, and documents to models capable of processing them.

Overview

llms.py supports three main types of multimodal inputs:

  • Images 🖼️ - Process and analyze images with vision-capable models
  • Audio 🎤 - Transcribe and analyze audio files
  • Files 📎 - Process and analyze documents, especially PDFs

Each modality has its own set of features, supported models, and use cases.

Default Templates

llms.py comes with default chat templates for each modality in llms.json:

{
  "defaults": {
    "text": { ... },
    "image": {
      "model": "gemini-2.5-flash",
      "messages": [...]
    },
    "audio": {
      "model": "gpt-4o-audio-preview",
      "messages": [...]
    },
    "file": {
      "model": "gpt-5",
      "messages": [...]
    }
  }
}

These templates are applied automatically when you pass a file of the corresponding type without supplying a custom template, so each modality is routed to a suitable default model.
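Under the hood, image requests are typically expressed as OpenAI-style chat messages that pair a text prompt with the image encoded as a base64 data URI. The sketch below is illustrative only; `image_message` is a hypothetical helper, not part of llms.py's API, and assumes a PNG input.

```python
import base64


def image_message(path: str, prompt: str) -> list:
    """Build an OpenAI-style chat message pairing a text prompt with a
    base64-encoded image (hypothetical helper, for illustration only)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Images are commonly inlined as data URIs in the image_url part
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```

A default template like the `image` entry above would merge the user's prompt and file into a message shaped roughly like this before sending it to the configured model.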

Quick Start

Images

llms --image ./screenshot.png "What's in this image?"

Audio

llms --audio ./recording.mp3 "Transcribe this audio"
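Audio-capable models generally accept the clip as base64 data alongside its container format. As a rough sketch of what such a request body looks like (again, `audio_message` is a hypothetical helper and the `input_audio` shape follows the OpenAI chat-completions convention, not a documented llms.py internal):

```python
import base64


def audio_message(path: str, prompt: str, fmt: str = "mp3") -> list:
    """Build an OpenAI-style chat message carrying a base64-encoded audio
    clip and its format (hypothetical helper, for illustration only)."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The model needs the raw bytes plus the format to decode them
            {"type": "input_audio",
             "input_audio": {"data": b64, "format": fmt}},
        ],
    }]
```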

Files

llms --file ./document.pdf "Summarize this document"
