Qwen-Image image generation model: complex text display and precise image editing
Ctrl+Alt+Future, 3 Sep 2025

Qwen-Image is a foundation model for image generation developed by Alibaba's Qwen team. It has two standout capabilities: complex text rendering and precise image editing.


Complex text rendering: Qwen-Image can render text, even long paragraphs, in images with very high quality. It is particularly accurate with English and Chinese. It preserves the typographic details and layout of the text and keeps it in harmony with the rest of the image.

Precise image editing: The model supports style transfer, adding or removing objects, refining details, editing text within images, and even changing human poses. This makes near-professional editing accessible to everyday users.


It is a 20-billion-parameter MMDiT (Multimodal Diffusion Transformer) model, released as open source under the Apache 2.0 license.
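To get a feel for what 20 billion parameters means in practice, here is a rough back-of-the-envelope sketch. It only counts the storage for the weights themselves at a few common numeric precisions; the precisions listed are assumptions for illustration, and the text encoder, the VAE, and working memory during generation are not included.

```python
# Rough weight-storage estimate for a 20-billion-parameter model.
# Only the parameter count comes from the show notes; the precisions
# below are illustrative assumptions.
PARAMS = 20_000_000_000

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, bytes_per_param):
    """Return the size of the raw weights in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_memory_gib(PARAMS, size):.1f} GiB")
# float32: 74.5 GiB
# bfloat16: 37.3 GiB
# int8: 18.6 GiB
```

Real memory use will be higher than these figures, since they cover the diffusion transformer weights alone.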


Availability: Natively supported in ComfyUI, also available via Hugging Face and ModelScope, and can be tried as a demo on Qwen Chat.
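For those who want to try the Hugging Face route from Python, the following is a minimal sketch that assumes a recent version of the diffusers library with Qwen-Image support, plus PyTorch, are installed. The helper names (build_call_kwargs, generate_image), the default resolution, the step count, and the output filename are illustrative assumptions, not recommendations from the Qwen team; check the model card for the settings they suggest.

```python
# Minimal sketch: text-to-image with the "Qwen/Qwen-Image" checkpoint
# via Hugging Face diffusers. Helper names and defaults are hypothetical.

MODEL_ID = "Qwen/Qwen-Image"


def build_call_kwargs(prompt, width=1024, height=1024, steps=50):
    """Collect the keyword arguments passed to the pipeline call."""
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": steps,
    }


def generate_image(prompt, out_path="qwen_image.png", seed=42):
    """Download the model (tens of GB) and render one image to out_path."""
    # Imported lazily so the helper above works without these packages.
    import torch
    from diffusers import DiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.bfloat16 if device == "cuda" else torch.float32

    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=dtype)
    pipe = pipe.to(device)

    # A fixed seed makes the result reproducible on the same setup.
    generator = torch.Generator(device=device).manual_seed(seed)
    image = pipe(generator=generator, **build_call_kwargs(prompt)).images[0]
    image.save(out_path)
    return out_path


# Example call (commented out because it triggers a large download):
# generate_image('A coffee shop sign that reads "Ctrl+Alt+Future"')
```

Running this on a CPU is possible in principle but will be very slow for a model of this size.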

Performance: In independent evaluations it shows strong results in both image generation and image editing, and it is currently among the best open source models available.


MMDiT (Multimodal Diffusion Transformer) is the "backbone", the central building block, of the Qwen-Image image generation model. (The same approach has also proven effective in other models, such as the FLUX and Seedream series.)


Now let's see what this means exactly:

Imagine that the model works like a sculptor who starts from random noise (like the static on an untuned TV). The essence of a diffusion model is to remove this noise step by step until a clean, recognizable image emerges. This is not done directly on the pixels, but on a compressed, abstract form of the image, which we call the latent space. Qwen-Image uses a dedicated component, a VAE (Variational AutoEncoder), to transform the original images into these encoded, latent representations.
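The step-by-step noise removal can be illustrated with a toy loop. This is only a sketch under strong assumptions: the "latent" is a list of four numbers, and the denoiser is an oracle that already knows the clean result, standing in for the trained network. The straight-line (Euler) update used here is one common sampling scheme and is not claimed to be exactly what Qwen-Image runs; all names are hypothetical.

```python
import random


def euler_denoise(noise, predict_velocity, num_steps=20):
    """Walk from pure noise (t=1) to a clean latent (t=0) in small steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt                 # current noise level, 1 -> 1/num_steps
        v = predict_velocity(x, t)       # direction pointing from clean to noise
        x = [xi - dt * vi for xi, vi in zip(x, v)]  # remove a bit of noise
    return x


def make_oracle(clean_latent):
    """Stand-in for the trained model: it cheats by knowing the answer."""
    def predict_velocity(x, t):
        return [(xi - ci) / t for xi, ci in zip(x, clean_latent)]
    return predict_velocity


rng = random.Random(0)
clean_latent = [0.8, -0.3, 0.1, 0.5]               # toy "image code"
noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]  # the starting static

result = euler_denoise(noise, make_oracle(clean_latent))
print([round(v, 6) for v in result])
```

In a real system the oracle is replaced by the learned network, and the final latent is passed through the VAE decoder to obtain pixels.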


During the diffusion process, MMDiT learns the complex relationships between noisy image codes and the clean, desired image codes. In effect, it learns the "recipe" for transforming noise into specific visual content.
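What "learning the relationship between noisy and clean codes" can look like during training is sketched below. The show notes do not spell out the training objective, so this uses one common formulation as an assumption: blend a clean latent with noise at a random level t, and ask the model to predict the direction from clean to noise. The function names and the toy numbers are hypothetical.

```python
import random


def noisy_latent(clean, noise, t):
    """Blend clean latent and noise; t=0 is fully clean, t=1 is pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def velocity_target(clean, noise):
    """What the model should predict: the direction from clean to noise."""
    return [n - c for c, n in zip(clean, noise)]


def mse(prediction, target):
    """Mean squared error, the training loss in this sketch."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(target)


rng = random.Random(1)
clean = [0.8, -0.3, 0.1, 0.5]                 # toy clean image code
noise = [rng.gauss(0.0, 1.0) for _ in clean]
t = rng.random()                              # random noise level per example

model_input = noisy_latent(clean, noise, t)   # what the network sees
target = velocity_target(clean, noise)        # what it is trained to output

untrained_guess = [0.0] * len(clean)
print("loss of an untrained guess:", mse(untrained_guess, target))
print("loss of a perfect prediction:", mse(target, target))
```

Training repeats this over many images, noise samples, and noise levels, adjusting the network so the loss shrinks.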


Qwen-Image uses the Qwen2.5-VL model to turn text inputs into conditioning signals, the "instructions" that guide MMDiT, so that the generated image follows the description we gave.


Qwen-Image has multimodal capabilities. Not only can it generate images from text (Text-to-Image), but it can also edit images based on text instructions (Text-Image-to-Image). It can also perform certain image interpretation tasks, such as object recognition or depth information estimation. This is because MMDiT is designed to process and interpret text and image information simultaneously.
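To make "processing text and image information simultaneously" a little more concrete, here is a toy sketch of joint attention: text tokens and image tokens are placed in one sequence, and every token can attend to every other. It is heavily simplified; the tokens are used directly as queries, keys, and values, whereas a real transformer applies learned projections, multiple heads, and many layers. The function names and the tiny vectors are illustrative assumptions.

```python
import math


def softmax(row):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text_tokens, image_tokens):
    """Single-head attention over the concatenated text + image sequence."""
    tokens = text_tokens + image_tokens
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    weights = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights.append(softmax(scores))

    # Each output token is a weighted mix of all text and image tokens.
    mixed = [
        [sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return mixed, weights


text = [[1.0, 0.0], [0.0, 1.0]]                # two toy text tokens
image = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]   # three toy image tokens

mixed, weights = joint_attention(text, image)
# Row 2 is the first image token; its first two weights show how much
# it draws on the text tokens.
print([round(w, 3) for w in weights[2]])
```

Because the image tokens receive non-zero weight on the text tokens (and vice versa), information flows between the two modalities within a single attention step.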


Links
Qwen-Image blog: https://qwenlm.github.io/blog/qwen-image/
Qwen-Image Technical Report: https://arxiv.org/pdf/2508.02324
GitHub: https://github.com/QwenLM/Qwen-Image
Hugging Face: https://huggingface.co/Qwen/Qwen-Image
Qwen Chat: https://chat.qwen.ai/
Hugging Face Demo: https://huggingface.co/spaces/Qwen/Qwen-Image
Image Generator Arena (Képgenerátor Aréna): https://github.com/mp3pintyo/Leaderboard-Image

Episodes (15)

Qwen3-Next: Free large language model from Alibaba that could revolutionize training costs?

Qwen3-Next is a new large-scale language model (LLM) from Alibaba that has 80 billion parameters but only activates 3 billion during inference through a hybrid attention mechanism and rare Mixture-of-...

15 Sep 2025, 46 min

HunyuanImage 2.1 is an open source model that can generate high resolution (2K) images

HunyuanImage 2.1 is an open source text-to-image diffusion model capable of generating ultra-high resolution (2K) images. It stands out with its dual text encoder, two-stage architecture including a r...

12 Sep 2025, 33 min

Google Stitch: user interface (UI) design using artificial intelligence

Google Stitch is an AI-powered tool designed for app developers to generate user interfaces (UI) for mobile and web applications. It can turn ideas into UIs. By default, it uses Google DeepMind’s late...

12 Sep 2025, 33 min

Kimi K2 0905 is the latest update to Moonshot AI's large-scale Mixture-of-Experts language model

Kimi K2 0905 is the latest update to Moonshot AI’s large-scale Mixture-of-Experts (MoE) language model, which is well-suited for complex agent-like tasks. With its advanced coding and reasoning capabi...

7 Sep 2025, 29 min

Tencent HunyuanWorld-Voyager: Generating 3D-consistent video from a single photo

Tencent has unveiled its AI-powered tool called HunyuanWorld-Voyager, which can transform a single image into a directional, 3D-consistent video—providing the thrill of exploration without the need fo...

7 Sep 2025, 46 min

GLM-4.5: The Next Generation of Artificial Intelligence That Thinks and Acts

Z.ai introduces its latest flagship models, the GLM-4.5 and GLM-4.5-Air, which take the capabilities of intelligent assistants to a new level. These models uniquely combine deep analytics, master-leve...

7 Sep 2025, 35 min

Gemini 2.5 Flash Image: Advanced AI Generation and Editing

Gemini 2.5 Flash Image, also known as Nano Banana, is an advanced, multimodal image creation and editing model that can interpret both text and image commands, allowing users to create, edit, and iter...

4 Sep 2025, 49 min
