# Multimodal Support

Work with images, audio, and documents in llms.py.
llms.py provides comprehensive support for multiple input modalities beyond just text, allowing you to process images, audio files, and documents with capable AI models.
## Overview
llms.py supports three main types of multimodal inputs:
- **Images** 🖼️ - Process and analyze images with vision-capable models
- **Audio** 🎤 - Transcribe and analyze audio files
- **Files** 📎 - Process and analyze documents, especially PDFs
Each modality has its own set of features, supported models, and use cases.
## Default Templates

llms.py comes with default chat templates for each modality in `llms.json`:
```json
{
  "defaults": {
    "text": { ... },
    "image": {
      "model": "gemini-2.5-flash",
      "messages": [...]
    },
    "audio": {
      "model": "gpt-4o-audio-preview",
      "messages": [...]
    },
    "file": {
      "model": "gpt-5",
      "messages": [...]
    }
  }
}
```

These templates are used when you provide the respective file type without a custom template.
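
Because these defaults live in `llms.json`, pointing a modality at a different default model only requires editing its `model` field. A minimal sketch, assuming your config keeps the same `defaults` structure shown above (the substituted model name is only an example; use any vision-capable model available in your configuration):

```json
{
  "defaults": {
    "image": {
      "model": "gpt-4o",
      "messages": [...]
    }
  }
}
```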
## Quick Start

### Images

```sh
llms --image ./screenshot.png "What's in this image?"
```

### Audio

```sh
llms --audio ./recording.mp3 "Transcribe this audio"
```

### Files

```sh
llms --file ./document.pdf "Summarize this document"
```
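
The Quick Start commands above rely on the default templates for model selection, but you can typically override the model for a single request. A minimal sketch, assuming llms.py's `-m` model-selection flag (check `llms --help` to confirm the exact flag name):

```sh
# Illustrative: pick an explicit vision model for one request
# instead of relying on the image default template
llms -m gemini-2.5-flash --image ./screenshot.png "What's in this image?"
```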