
Z.ai's open source GLM-Image beats Google's Nano Banana Pro at complex text rendering, but not aesthetics

The two big stories of AI in 2026 so far have been the incredible rise in usage and praise for Anthropic's Claude Code, and a similarly huge boost in user adoption for Google's Gemini 3 AI model family released late last year. The latter includes Nano Banana Pro (also known as Gemini 3 Pro Image), a powerful, fast, and flexible image generation model that renders complex, text-heavy infographics quickly and accurately, making it an excellent fit for enterprise use (think: collateral, trainings, onboarding, stationery, etc.).

Both, of course, are proprietary offerings. Yet open source rivals have not been far behind.

This week, we got a new open source alternative to Nano Banana Pro in the category of precise, text-heavy image generation: GLM-Image, a new 16-billion-parameter model from Chinese startup Z.ai, which recently went public.

By abandoning the industry-standard "pure diffusion" architecture that powers most leading image generator models in favor of a hybrid auto-regressive (AR) + diffusion design, GLM-Image has achieved what was previously thought to be the domain of closed, proprietary models: state-of-the-art performance in generating text-heavy, information-dense visuals like infographics, slides, and technical diagrams.

It even beats Google's Nano Banana Pro on the benchmarks shared by Z.ai, though in practice, my own quick testing found it far less accurate at instruction following and text rendering (and other users seem to agree).

But for enterprises seeking cost-effective, customizable, permissively licensed alternatives to proprietary AI models, Z.ai's GLM-Image may be good enough, or better, to take over the job of primary image generator, depending on their specific use cases, needs and requirements.

The Benchmark: Toppling the Proprietary Giant

The most compelling argument for GLM-Image is not its aesthetics, but its precision. In the CVTG-2k (Complex Visual Text Generation) benchmark, which evaluates a model's ability to render accurate text across multiple regions of an image, GLM-Image scored a Word Accuracy average of 0.9116.

To put that number in perspective, Nano Banana Pro (also known as Nano Banana 2.0), often cited as the benchmark for enterprise reliability, scored 0.7788. This isn't a marginal gain; it is a generational leap in semantic control.

While Nano Banana Pro retains a slight edge in single-stream English long-text generation (0.9808 vs. GLM-Image's 0.9524), it falters significantly when the complexity increases.

As the number of text regions grows, Nano Banana's accuracy remains in the 70-percent range, whereas GLM-Image maintains above 90% accuracy even with multiple distinct text elements.
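The gap is easier to appreciate as error rates rather than accuracy scores. A quick back-of-envelope calculation, using only the benchmark figures quoted above:

```python
# Word-accuracy figures from the CVTG-2k benchmark, as quoted above.
glm_acc = 0.9116   # GLM-Image
nano_acc = 0.7788  # Nano Banana Pro

# Convert accuracy to word-level error rate.
glm_err = 1 - glm_acc    # ~8.8% of words rendered wrong
nano_err = 1 - nano_acc  # ~22.1% of words rendered wrong

# Relative reduction in rendering errors.
reduction = (nano_err - glm_err) / nano_err
print(f"GLM-Image error rate:       {glm_err:.1%}")
print(f"Nano Banana Pro error rate: {nano_err:.1%}")
print(f"Relative error reduction:   {reduction:.1%}")
```

In other words, on this benchmark GLM-Image makes roughly 60% fewer word-level rendering errors than Nano Banana Pro.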

For enterprise use cases—where a marketing slide needs a title, three bullet points, and a caption simultaneously—this reliability is the difference between a production-ready asset and a hallucination.

Unfortunately, my own usage of a demo inference of GLM-Image on Hugging Face proved to be less reliable than the benchmarks might suggest.

My prompt to generate an "infographic labeling all the major constellations visible from the U.S. Northern Hemisphere right now on Jan 14 2026 and putting faded images of their namesakes behind the star connection line diagrams" did not result in what I asked for, instead fulfilling maybe 20% or less of the specified content.

But Google's Nano Banana Pro handled it like a champ, as you'll see below:

Of course, a large portion of this is no doubt due to the fact that Nano Banana Pro is integrated with Google Search, so it can look up information on the web in response to a prompt. GLM-Image is not, and therefore likely requires far more specific instructions about the actual text and other content the image should contain.

Still, once you're used to typing a few simple instructions and getting back a fully researched, well-populated image from Nano Banana Pro, it's hard to imagine deploying a less capable alternative unless you have strict requirements around cost, data residency, and security, or your organization's need for customizability outweighs the convenience.

Furthermore, Nano Banana Pro still edged out GLM-Image in terms of pure aesthetics — using the OneIG benchmark, Nano Banana 2.0 is at 0.578 vs. GLM-Image at 0.528 — and indeed, as the top header artwork of this article indicates, GLM-Image does not always render as crisp, finely detailed and pleasing an image as Google's generator.

The Architectural Shift: Why "Hybrid" Matters

Why does GLM-Image succeed where pure diffusion models fail? The answer lies in Z.ai’s decision to treat image generation as a reasoning problem first and a painting problem second.

Standard latent diffusion models (like Stable Diffusion or Flux) attempt to handle global composition and fine-grained texture simultaneously.

This often leads to "semantic drift," where the model forgets specific instructions (like "place the text in the top left") as it focuses on making the pixels look realistic.

GLM-Image decouples these objectives into two specialized "brains" totaling 16 billion parameters:

  1. The Auto-Regressive Generator (The "Architect"): Initialized from Z.ai’s GLM-4-9B language model, this 9-billion parameter module processes the prompt logically. It doesn't generate pixels; instead, it outputs "visual tokens"—specifically semantic-VQ tokens. These tokens act as a compressed blueprint of the image, locking in the layout, text placement, and object relationships before a single pixel is drawn. This leverages the reasoning power of an LLM, allowing the model to "understand" complex instructions (e.g., "A four-panel tutorial") in a way diffusion noise predictors cannot.

  2. The Diffusion Decoder (The "Painter"): Once the layout is locked by the AR module, a 7-billion parameter Diffusion Transformer (DiT) decoder takes over. Based on the CogView4 architecture, this module fills in the high-frequency details—texture, lighting, and style.

By separating the "what" (AR) from the "how" (Diffusion), GLM-Image solves the "dense knowledge" problem. The AR module ensures the text is spelled correctly and placed accurately, while the Diffusion module ensures the final result looks photorealistic.
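The division of labor can be sketched in miniature. The following toy pipeline is purely illustrative (it is not Z.ai's code; the `architect` and `painter` functions, the 8192-entry codebook, and the conditioning scheme are all stand-ins), but it shows the key structural property: the discrete layout blueprint is fully generated before any pixels exist.

```python
import numpy as np

rng = np.random.default_rng(0)

def architect(prompt: str, n_tokens: int = 256) -> np.ndarray:
    """Stand-in for the 9B auto-regressive module: maps a prompt to a
    sequence of discrete semantic-VQ 'blueprint' tokens (hypothetical
    8192-entry codebook). The real model conditions each token on the
    previous ones; here we just derive tokens from the prompt."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).integers(0, 8192, size=n_tokens)

def painter(layout_tokens: np.ndarray, size: int = 64) -> np.ndarray:
    """Stand-in for the 7B diffusion decoder: turns the frozen layout
    into pixels. A real DiT would iteratively denoise; we fake a single
    step conditioned on the token blueprint."""
    cond = layout_tokens.mean() / 8192.0           # crude conditioning signal
    noise = rng.standard_normal((size, size, 3))
    return np.clip(0.5 + 0.1 * noise + 0.2 * cond, 0.0, 1.0)

# The key property: layout is locked in before a single pixel is drawn.
tokens = architect("A four-panel tutorial on brewing coffee")
image = painter(tokens)
print(tokens.shape, image.shape)  # (256,) (64, 64, 3)
```

Because the painter only ever sees a finished blueprint, it cannot "drift" away from the layout the way a single-stage diffusion model can mid-denoising.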

Training the Hybrid: A Multi-Stage Evolution

The secret sauce of GLM-Image’s performance isn't just the architecture; it is a highly specific, multi-stage training curriculum that forces the model to learn structure before detail.

The training process began by freezing the text word embedding layer of the original GLM-4 model while training a new "vision word embedding" layer and a specialized vision LM head.

This allowed the model to project visual tokens into the same semantic space as text, effectively teaching the LLM to "speak" in images. Crucially, Z.ai implemented MRoPE (Multidimensional Rotary Positional Embedding) to handle the complex interleaving of text and images required for mixed-modal generation.
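For readers unfamiliar with rotary embeddings, here is a minimal sketch of the standard 1-D variant, which MRoPE generalizes to multiple axes (for example, text index plus image row and column). This is a generic illustration of the mechanism, not Z.ai's implementation:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Standard 1-D rotary positional embedding: rotates feature pairs
    by a position-dependent angle, encoding position in the direction
    of the vector rather than in added values."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation rate
    ang = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

v = np.ones(8)
r = rope(v, pos=5)
# Rotations preserve vector norm: position changes direction, not magnitude.
print(np.allclose(np.linalg.norm(r), np.linalg.norm(v)))  # True
```

MRoPE's contribution is assigning separate rotation axes to different coordinates, so a token knows both where it sits in the text stream and where it sits spatially in the image grid.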

The model was then subjected to a progressive resolution strategy:

  • Stage 1 (256px): The model trained on low-resolution, 256-token sequences using a simple raster scan order.

  • Stage 2 (512px - 1024px): As resolution increased to a mixed stage (512px to 1024px), the team observed a drop in controllability. To fix this, they abandoned simple scanning for a progressive generation strategy.

In this advanced stage, the model first generates approximately 256 "layout tokens" from a down-sampled version of the target image.

These tokens act as a structural anchor. By increasing the training weight on these preliminary tokens, the team forced the model to prioritize the global layout—where things are—before generating the high-resolution details. This is why GLM-Image excels at posters and diagrams: it "sketches" the layout first, ensuring the composition is mathematically sound before rendering the pixels.
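That weighting trick amounts to a weighted token-level cross-entropy. The sketch below is illustrative only: the function name, the 2.0 layout weight, and the toy dimensions are assumptions, not Z.ai's actual training code.

```python
import numpy as np

def weighted_token_loss(logits, targets, n_layout=256, layout_weight=2.0):
    """Cross-entropy over a token sequence where the first `n_layout`
    positions (the low-res layout 'blueprint') get extra weight, pushing
    the model to nail global composition before fine detail.
    layout_weight=2.0 is an illustrative value."""
    # Per-position negative log-likelihood via a stable log-softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    # Up-weight the preliminary layout tokens.
    weights = np.ones(len(targets))
    weights[:n_layout] = layout_weight
    return (weights * nll).sum() / weights.sum()

rng = np.random.default_rng(0)
seq_len, vocab = 1024, 512
logits = rng.standard_normal((seq_len, vocab))
targets = rng.integers(0, vocab, size=seq_len)
loss = weighted_token_loss(logits, targets)
print(round(float(loss), 3))
```

The effect is that a layout mistake costs the model more than a texture mistake, which is exactly the bias you want for posters and diagrams.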

Licensing Analysis: A Permissive, If Slightly Ambiguous, Win for Enterprise

For enterprise CTOs and legal teams, the licensing structure of GLM-Image is a significant competitive advantage over proprietary APIs, though it comes with a minor caveat regarding documentation.

The Ambiguity: There is a slight discrepancy in the release materials. The model’s Hugging Face repository explicitly tags the weights with the MIT License.

However, the accompanying GitHub repository and documentation reference the Apache License 2.0.

Why This Is Still Good News: Despite the mismatch, both licenses are the "gold standard" for enterprise-friendly open source.

  • Commercial Viability: Both MIT and Apache 2.0 allow for unrestricted commercial use, modification, and distribution. Unlike the OpenRAIL licenses common in other image models (which often restrict specific use cases) or research-only licenses (like early LLaMA releases), GLM-Image is effectively "open for business" immediately.

  • The Apache Advantage (If Applicable): If the code falls under Apache 2.0, this is particularly beneficial for large organizations. Apache 2.0 includes an explicit patent grant clause, meaning that by contributing to or using the software, contributors grant a patent license to users. This reduces the risk of future patent litigation—a major concern for enterprises building products on top of open-source codebases.

  • No "Infection": Neither license is "copyleft" (like GPL). You can integrate GLM-Image into a proprietary workflow or product without being forced to open-source your own intellectual property.

For developers, the recommendation is simple: Treat the weights as MIT (per the repository hosting them) and the inference code as Apache 2.0. Both paths clear the runway for internal hosting, fine-tuning on sensitive data, and building commercial products without a vendor lock-in contract.

The "Why Now" for Enterprise Operations

For the enterprise decision maker, GLM-Image arrives at a critical inflection point. Companies are moving beyond using generative AI for abstract blog headers and into functional territory: multilingual localization of ads, automated UI mockup generation, and dynamic educational materials.

In these workflows, a 5% error rate in text rendering is a blocker. If a model generates a beautiful slide but misspells the product name, the asset is useless. The benchmarks suggest GLM-Image is the first open-source model to cross the threshold of reliability for these complex tasks.

Furthermore, the permissive licensing fundamentally changes the economics of deployment. While Nano Banana Pro locks enterprises into a per-call API cost structure or restrictive cloud contracts, GLM-Image can be self-hosted, fine-tuned on proprietary brand assets, and integrated into secure, air-gapped pipelines without data leakage concerns.

The Catch: Heavy Compute Requirements

The trade-off for this reasoning capability is compute intensity. The dual-model architecture is heavy. Generating a single 2048x2048 image requires approximately 252 seconds on an H100 GPU. This is significantly slower than highly optimized, smaller diffusion models.

However, for high-value assets—where the alternative is a human designer spending hours in Photoshop—this latency is acceptable.

Z.ai also offers a managed API at $0.015 per image, providing a bridge for teams who want to test the capabilities without investing in H100 clusters immediately.
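A back-of-envelope comparison puts those two options side by side. The H100 hourly rate below is an assumption (on-demand cloud prices vary widely); the other figures come from this article.

```python
# Back-of-envelope deployment economics, using figures from this article
# plus an ASSUMED on-demand H100 price of ~$2.00/hour.
seconds_per_image = 252          # 2048x2048 generation time quoted above
h100_dollars_per_hour = 2.00     # assumption, not from Z.ai
api_price_per_image = 0.015      # Z.ai's managed API price

self_host_cost = seconds_per_image / 3600 * h100_dollars_per_hour
print(f"Raw GPU cost per image: ${self_host_cost:.3f}")   # $0.140
print(f"Managed API per image:  ${api_price_per_image:.3f}")
```

At these numbers, the managed API is actually cheaper per image than raw, unbatched GPU time. Self-hosting wins on control, data residency, and fine-tuning rights rather than on raw cost, unless you can keep the GPU saturated with batched work.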

GLM-Image is a signal that the open-source community is no longer just fast-following proprietary labs; in specific, high-value verticals like knowledge-dense generation, they are now setting the pace. For the enterprise, the message is clear: if your operational bottleneck is the reliability of complex visual content, the solution is no longer necessarily a closed Google product—it might be an open-source model you can run yourself.
