My sister’s anatomy test reveals what AI image models still don’t understand
The latest AI image systems, like OpenAI's Images-2, can generate beautiful visuals. But when it comes to structured knowledge like the human body, they still fall apart in ways only experts can see.
I started covering AI the same week that OpenAI’s DALL-E 2 image model was released on April 6, 2022.
It was considered so groundbreaking, such a massive step up from the original DALL-E released a year earlier, that Cade Metz wrote about it in the New York Times, calling it “the AI That Draws Anything at Your Command,” alongside an image of the avocado teapots DALL-E 2 could output, which were all the rage.
But it didn’t take long to realize that there were massive limitations to AI image models. Hands had too many fingers, text was garbled, and charts were impossible.
Another leap forward
Over the past four years, that’s changed dramatically. OpenAI’s new Images-2 model, released last week, feels like another leap forward.
The day it came out, I was flying back from Washington, DC. With a (fabulous) Starlink connection, I could play around with it on my short flight. The images it produced were striking: polished, detailed, with text that was suddenly, surprisingly coherent. Like many people trying it out for the first time, I came away impressed.
Sam Altman and me:
Me in a Peanuts-style comic talking about AI:
But when I shared my images with my sister, who is an anatomy professor at a medical school, she suggested a tougher test.
We had a funny exchange that I shared on X:
The late-night exchange and the image (which you can view at the beginning of this post) went viral, drawing more than 180,000 views. It seemed to hit a nerve, with many people reacting to how something that looked so polished could contain so many errors, none of which, to be clear, were immediately obvious to a layperson like me giving it a quick glance.
A benchmark of AI image accuracy
I asked my sister why she chose the thorax image as a kind of benchmark of AI image accuracy. She said that a couple of years ago, at meetings of the American Association for Anatomy, one of the speakers was discussing the potential for AI in education and used an AI-generated image of the thorax to show how randomly these models placed structures (in that case, she recalled, the image had two aortas and other blatantly made-up structures). Now, she said, she likes to run the same test to see how things are coming along.
And things are, clearly, coming along. In a blog post, OpenAI said that the new model offers “more precise image generation for complex creative tasks like small text, UI elements, diagrams, and dense layouts.” But it’s still not enough to satisfy anatomy experts. The question is, why is that?
The issue isn’t detail—it’s structure. Image models can generate something that looks anatomically correct, but they don’t reliably capture how parts relate to each other in space. My sister pointed out that while the latest thorax image is better than it used to be—there aren’t two aortas, after all—it’s still surprising how far off it is given how many accurate examples exist.
The problem is that image models learn statistical patterns from images rather than an explicit understanding of underlying structure. They’ve seen thousands of anatomy diagrams, but those vary in accuracy, and the model doesn’t have a reliable internal model of how structures relate in 3D. In effect, it’s drawing on patterns it has seen before rather than reasoning from a consistent anatomical framework.
Labeled diagrams are especially difficult. Text, arrows, and image content all have to align correctly—a different challenge than generating a plausible-looking image. Without reliably capturing those relationships, the result is errors like misplaced vessels, incorrect labeling, and anatomically incorrect structures.
Almost right, but not quite
That circles right back to my sister, whose immediate ‘Oy’ wasn’t just about what was wrong with the image, but about how the image could be used.
“What concerns me most is as it gets closer to correct, it will become more likely to be used,” she said. “Then it will be harder to know what to trust since things will be ‘almost right’ but not quite.”
When it comes to anatomy, ‘almost right’ will never be right enough. Even if it is still cute, tho.






