Diffusion models
Diffusion models are generative artificial intelligence models that produce unique, often photorealistic images from text prompts. A diffusion model creates an image by gradually turning random noise into a clear picture: it starts with pure noise and, step by step, removes bits of it, shaping the random patterns into a recognizable image. This process is called “denoising.”
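To make the denoising loop concrete, here is a minimal sketch in Python. The `predict_noise` function and the step count are illustrative placeholders, not any real model:

```python
import numpy as np

def predict_noise(image, step):
    # Stand-in for a trained network (e.g. a U-Net) that estimates
    # the noise present in `image`, conditioned on the text prompt.
    return image * 0.1  # illustrative placeholder only

# Start from pure Gaussian noise...
image = np.random.randn(64, 64, 3)

# ...and remove a little of the estimated noise at every step.
for step in range(50, 0, -1):
    image = image - predict_noise(image, step)

# After the loop, a real model would have shaped `image`
# into a recognizable picture matching the prompt.
```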
Stable Diffusion and Midjourney are the most popular diffusion models, but recently more performant models such as Flux and Recraft have appeared. Here is the latest text-to-image leaderboard.
Recraft
Recraft v3 is the latest diffusion model and currently holds first place on the text-to-image leaderboard. Here is an example of Flux 1.1 Pro vs Recraft v3 for the text prompt “a wildlife photography photo of a red panda using a laptop in a snowy forest” (Recraft is the image on the right).
Can Recraft perform language tasks?
Soon after Recraft appeared, users like apolinario noticed that it could handle language tasks that diffusion models normally cannot.
That was very surprising to me, as diffusion models generate images based on patterns, styles, and visual associations learned from training data. They do not interpret requests or questions the way a natural language model does. While they can respond to prompts describing visual details, they do not “understand” complex instructions or abstract reasoning.
For example, if you use a prompt like `2+2=`, a diffusion model might focus on keywords like `2`, `+`, and `2`, but it wouldn’t compute the result of the mathematical operation (`2+2=4`).
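One way to see this is to look at how a diffusion model’s text encoder tokenizes such a prompt. Here is a small sketch using the CLIP tokenizer, which Stable Diffusion uses for its text encoder; the exact token strings may vary:

```python
from transformers import CLIPTokenizer

# Stable Diffusion's text encoder is based on CLIP; its tokenizer
# shows the prompt as a sequence of symbols, not math to evaluate.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tokenizer.tokenize("2+2="))
# Prints separate tokens for '2', '+', '2', '=' -- nothing is computed.
```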
However, Recraft is capable of doing exactly that. Here are a few examples of images generated with Recraft versus the same prompts given to Flux.
A piece of paper that prints the result of 2+2=
Mathematical operations: As you can see above, Flux just printed the text I included in the prompt (`2+2=`), but Recraft also printed the result of the math operation: `2+2=4`.
A person holding a big board that prints the capital of USA
Geographic knowledge: Flux just shows a person holding a board with a map of the USA, but Recraft shows the correct answer: a person holding a board that reads “Washington D.C.”
A person holding a paper where is written the result of base64_decode(“dGVzdA==”)
Base64 understanding: This one is a bit more complicated, as I’m asking it to perform a base64 decode operation. `base64_decode("dGVzdA==")` is indeed equal to the word `test`. Flux just printed `dGVzdA=` (dropping one equals sign), but Recraft printed the correct answer (`test`).
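You can verify the decoding claim itself in one line of Python:

```python
import base64

# "dGVzdA==" is the base64 encoding of the word "test".
print(base64.b64decode("dGVzdA==").decode())  # prints: test
```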
A beautiful forest with 2*2 shiba inu puppies running
Numerical understanding: Flux generated an image with 2 shiba inu puppies, while Recraft rendered 4. By now it’s pretty clear that Recraft does something different compared with other diffusion models.
Recraft uses an LLM to rewrite image prompts
After generating many more images and thinking it over, it became obvious that Recraft is using an LLM (Large Language Model) to rewrite prompts before they are sent to the diffusion model. Diffusion models alone are not capable of these language tasks.
I suspect Recraft uses a two-stage architecture (sketched below):
- An LLM processes and rewrites user prompts
- The processed prompt is then passed to the diffusion model
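Under that assumption, here is a minimal sketch of what such a pipeline could look like. Every name here (`call_llm`, `diffusion_model`, the rewriting instruction) is a stand-in I made up for illustration; Recraft’s actual implementation is not public:

```python
def call_llm(instruction: str, prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned rewrite
    # here so the sketch runs on its own.
    if prompt.endswith("2+2="):
        return "A piece of paper with the text '2+2=4' printed on it"
    return prompt

def diffusion_model(prompt: str) -> None:
    # Stand-in for the actual image generator.
    print(f"Diffusion model receives: {prompt!r}")

def generate_image(user_prompt: str) -> None:
    # Stage 1: an LLM resolves any "language task" in the prompt
    # (math, capitals, base64, ...) into literal visual text.
    rewritten = call_llm(
        "Rewrite this image prompt, replacing any question or "
        "computation in it with its answer.",
        user_prompt,
    )
    # Stage 2: the diffusion model only ever sees the resolved prompt.
    diffusion_model(rewritten)

generate_image("A piece of paper that prints the result of 2+2=")
# -> Diffusion model receives: "A piece of paper with the text '2+2=4' printed on it"
```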
Here is what Recraft generated for the following prompt asking which LLM is being used: