Images, Audio & Video
Theory
Multimodal models share one brain across formats. That is why you can drop in a screenshot of an error and ask what is going wrong, or paste a recipe photo and ask for a shopping list.
The underlying move is the same as with text: the model turns whatever you give it into an internal representation, then predicts what comes next, whether that is pixels, words, or sound.
Vague prompt: "Make an image for our blog post about productivity."
The model has to invent the subject, style, framing, and palette itself. Typical result: a generic stock photo of someone smiling at a laptop.
Specific prompt: "Photo-real, top-down view of a wooden desk: open notebook, brass pen, black coffee, soft window light from the left. 16:9. No text on the page. No brand logos."
Now the model has a target and knows what to avoid. Follow-up edits become small adjustments instead of total rewrites.
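One way to make a specific prompt repeatable is to treat it as a short brief with named fields. The sketch below is illustrative only: `ImageBrief` and `to_prompt` are hypothetical names, not part of any image tool's API, and the field list is an assumption drawn from the desk example (style, framing, subject, details, lighting, aspect ratio, things to avoid).

```python
from dataclasses import dataclass, field


@dataclass
class ImageBrief:
    """Hypothetical container for the choices a vague prompt leaves to the model."""
    style: str
    framing: str
    subject: str
    details: list[str] = field(default_factory=list)
    lighting: str = ""
    aspect_ratio: str = ""
    avoid: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Lead with style and framing, then the subject and its details.
        head = f"{self.style}, {self.framing} of {self.subject}"
        if self.details:
            head += ": " + ", ".join(self.details)
        if self.lighting:
            head += f", {self.lighting}"
        parts = [head + "."]
        if self.aspect_ratio:
            parts.append(f"{self.aspect_ratio}.")
        # Exclusions go last, one short sentence each.
        parts.extend(f"No {item}." for item in self.avoid)
        return " ".join(parts)


brief = ImageBrief(
    style="Photo-real",
    framing="top-down view",
    subject="a wooden desk",
    details=["open notebook", "brass pen", "black coffee"],
    lighting="soft window light from the left",
    aspect_ratio="16:9",
    avoid=["text on the page", "brand logos"],
)
prompt = brief.to_prompt()
print(prompt)
```

Filled in this way, the brief reproduces the desk prompt word for word. The benefit of the structure is that an empty field is easy to spot, and an empty field is a decision you are handing back to the model.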
Two risks that show up fast:
- Believable fakery: a generated photo can imply that an event happened or that a person endorsed something.
- Rights and likeness: assume nothing is fair to publish until you have checked who owns the source material and whether anyone depicted has agreed to it.
Use generated media for drafts and clearly labeled creative work. For anything that looks like evidence, verify it the way a careful editor would.