Sep 13, 2022
DALL-E2 is bad at “compositionality”, ie combining different pieces accurately. For example, here’s its response to “a red sphere on a blue cube, with a yellow pyramid on the right, all on top of a green table”.
Most of the elements - cubes, spheres, redness, yellowness, etc - are there. It even does better than chance at getting the sphere on top of the cube. But it’s not able to track how all of the words relate to each other and where everything should be.
I ran into this problem in my stained glass window post. When I asked it for a stained glass window of a woman in a library with a raven on her shoulder with a key in its mouth, it gave me everything from “a library with a stained glass window in it” to “a half-human, half-raven abomination”.