DALLE1 was able to render text[0]. That DALLE2 can't is probably a tradeoff introduced by unCLIP in exchange for diverse results. Now the Google model is better yet and doesn't have to make that tradeoff.
[0] https://openai.com/blog/dall-e/#text-rendering