Images, Audio & Video

Act 3 · ~5 min

Theory

Multimodal models use one shared model for text, images, audio, and video instead of a separate system for each format. That is why you can drop in a screenshot of an error and ask what is going wrong, or paste a recipe photo and ask for a shopping list.

The underlying move is the same as with text: the model turns whatever you give it into an internal representation, then predicts what comes next, whether that is pixels, words, or sound.

Vague brief

"Make an image for our blog post about productivity."

The model invents subject, style, framing, palette. Result: a generic stock photo of someone smiling at a laptop.

Specific brief

"Photo-real, top-down view of a wooden desk: open notebook, brass pen, black coffee, soft window light from the left. 16:9. No text on the page. No brand logos."

The model has a target and knows what to avoid. Your follow-up edits become small tweaks instead of total rewrites.
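One way to make the difference between the two briefs concrete is to treat a brief as a checklist of slots. The sketch below is only an illustration: the `ImageBrief` class and its field names are hypothetical, not part of any image tool's API, and the slot list is an assumption drawn from the specific brief above. Any slot left empty is one the model has to invent.

```python
from dataclasses import dataclass, field


@dataclass
class ImageBrief:
    """Hypothetical checklist for an image brief (illustrative only)."""
    style: str = ""
    framing: str = ""
    subject: str = ""
    details: list[str] = field(default_factory=list)
    lighting: str = ""
    aspect_ratio: str = ""
    avoid: list[str] = field(default_factory=list)

    def missing(self) -> list[str]:
        """Return the names of the slots the model would have to invent."""
        return [name for name, value in vars(self).items() if not value]

    def to_prompt(self) -> str:
        """Join the filled-in slots into a single prompt string."""
        lead = ", ".join(p for p in (self.style, self.framing) if p)
        if self.subject:
            lead = f"{lead} of {self.subject}" if lead else self.subject
        extras = ", ".join(self.details + ([self.lighting] if self.lighting else []))
        if lead and extras:
            lead = f"{lead}: {extras}"
        elif extras:
            lead = extras
        sentences = [lead] if lead else []
        if self.aspect_ratio:
            sentences.append(self.aspect_ratio)
        # Each exclusion becomes its own short "No ..." sentence.
        sentences += [f"No {item}" for item in self.avoid]
        return ". ".join(sentences) + "."


vague = ImageBrief(subject="an image for our blog post about productivity")
specific = ImageBrief(
    style="Photo-real",
    framing="top-down view",
    subject="a wooden desk",
    details=["open notebook", "brass pen", "black coffee"],
    lighting="soft window light from the left",
    aspect_ratio="16:9",
    avoid=["text on the page", "brand logos"],
)

print(vague.missing())     # every slot except the subject is left to the model
print(specific.missing())  # nothing left to guess
print(specific.to_prompt())
```

Running `missing()` on a draft before sending it is a quick way to spot what you have left to chance. The output of `to_prompt()` is plain text that you would paste into whichever image tool you use.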

Two risks that show up fast:

Believable fakery: a generated photo can imply that an event happened or that a person endorsed something.
Rights and likeness: assume nothing is fair to publish until you have checked who owns it and who appears in it.

Use generated media for drafts and clearly labeled creative work. For anything that looks like evidence, verify it the way a careful editor would.

Application
