The Advent of AI: Text-to-Image Generation

Another type of AI that has captured the time and minds of people around the world for months is text-to-image generation. You might have seen the website This Person Does Not Exist, which generates a new face on every reload. Text-to-image generation goes a step further: you can generate images from nothing but a text input.

The video above is one month of work by an artist, created in Midjourney out of 10,000 images. If you’re curious about what artists created with these AI generators, have a look at these stunning videos:

Disturbed – Bad Man (Official Music Video) – Midjourney
How I Faked My Life Using Ai: Or (The Life and Death of Ryan Gosling Person) – Stable Diffusion & Dreambooth
BINARY DREAMS: How A.I. Sees the Universe – Midjourney
Voyage through Time - a Generative AI journey – Stable Diffusion
Shroudcity Yuu – DALL·E
Frank Hughes OMG V1 – DALL·E
Billingsley - Serenity – DALL·E
Venosic - Denial of Decree – Midjourney
Vlad In Tears - Running Up That Hill – Midjourney
I’ve made DALLE-2 neural network extend Michelangelo’s “Creation of Adam” – DALL·E
A snippet from my full-length animated philosophy video – Stable Diffusion
A quick demonstration of how I accomplished this animation – Stable Diffusion

My First AI Project

Two weeks ago, when a colleague left our company, a designer from our team had the idea to create a fantasy-based goodbye card on a Miro board because the leaving colleague is a huge fantasy fan.

The designer created a map of Essos from Game of Thrones and added the journey of the leaving colleague to his new company as a path on the map. People could sign the route with their goodbye wishes. He created fantasy cards for each member of our team.

I had started using text-to-image generation a month earlier and had spoken with him about AI. He asked me if I could do magic and make us look more fantasy. And thanks to text-to-image AI, I was able to transform us into wizards, warriors, elves, and dwarfs. The demand for more people signing the card grew, and I was creating lots of fantasy photos for co-workers from other teams.

My chat exploded with questions: “How did you do this?”, “Can you teach me how to do this?”, and “Can I book a personal hour to learn how to do this?”

A text-to-image AI is a machine learning model that generates images from text descriptions (prompts), using a neural network trained on a dataset of text-image pairs. Currently, there are four ways the public can use text-to-image generation. Two of the options are paid; the others are open source, and you can run them yourself.

The two commercial generators are DALL·E 2 by OpenAI and Midjourney by an independent research lab.

DALL·E 2

With DALL·E 2, you pay per image, and the price depends on the resolution: between $0.016 (256×256 pixels) and $0.020 (1024×1024 pixels). You can use your free $18 credit for the generation of images (or text).
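To get a feel for what the free credit buys, the arithmetic can be sketched in a few lines of Python. The prices and the $18 credit are the figures quoted above and may have changed since; the function name is made up for illustration, and the sketch assumes the whole credit is spent on images of a single size.

```python
from decimal import Decimal

# Figures quoted in this article; they may be out of date.
PRICE_PER_IMAGE = {
    "256x256": Decimal("0.016"),
    "1024x1024": Decimal("0.020"),
}
FREE_CREDIT = Decimal("18")


def images_for_budget(budget, price_per_image):
    """Return how many whole images a budget covers at a fixed price."""
    return int(budget // price_per_image)


for size, price in PRICE_PER_IMAGE.items():
    print(size, images_for_budget(FREE_CREDIT, price))
```

With these numbers, the credit covers 1,125 images at the smallest size or 900 at the largest.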

It’s excellent at coherence, can create unique zoom animations, and is easy to use. On the downside, it’s expensive, very censored, and gives you limited artistic control. You can see incredible art created with DALL·E 2 in the DALL·E 2 subreddit.

Midjourney

Midjourney has different membership plans. The cheapest costs $10 per month and allows you to generate ~200 images. The Standard membership costs $30 per month and allows unlimited images in relax mode and 15 GPU hours for upscaling images. More upgrades are available, for example, a private visibility option. Corporate Membership costs $600 per year for one person. With a new account, you get 20 image generations for free.

You create images on a Discord server using the Midjourney bot. Midjourney has extensive documentation and a massive community. Artists and designers love Midjourney because it produces dreamy and artful images. It hits the sweet spot between creative control and ease of use, is very stylistic, and has fantastic developers working on it. On the downside, it’s less coherent. To see images generated with Midjourney, visit the official Midjourney subreddit.

Stable Diffusion

The third generator is Stable Diffusion, developed by Stability AI Ltd. It has a permissive license that allows for commercial and non-commercial usage. You can run Stable Diffusion on your local computer, but you need to meet some minimum hardware requirements: an NVIDIA GPU with at least 4 GB of VRAM and 10 GB of hard drive space. On a Mac, you’ll need an M1 chip or better to run it properly.
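As a rough illustration, the two minimums for a PC can be written down as a small check. This is only a sketch: the function names are invented for this example, the VRAM value has to be typed in by hand (reading it automatically would need a GPU-specific tool), and passing the check does not guarantee that Stable Diffusion will run well.

```python
import shutil

MIN_VRAM_GB = 4   # NVIDIA GPU memory, as stated in this article
MIN_DISK_GB = 10  # hard drive space, as stated in this article


def get_free_disk_gb(path="."):
    """Free space on the drive that holds `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def meets_minimum(vram_gb, free_disk_gb):
    """Compare two measured values against the minimums quoted above."""
    return vram_gb >= MIN_VRAM_GB and free_disk_gb >= MIN_DISK_GB


# 6 is just an example value for a card with 6 GB of VRAM.
print(meets_minimum(vram_gb=6, free_disk_gb=get_free_disk_gb()))
```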

The company also develops DreamStudio, a paid service that uses Stable Diffusion. For $10, you can create ~5,000 images with the app; 500 images are free with a new account.
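To compare the three generators on price, you can break the figures quoted so far down to a rough cost per image. The sketch below uses only numbers from this article; the image counts for Midjourney and DreamStudio are approximate, Midjourney's price is per month, and the plan labels and function name are made up for this example.

```python
from decimal import Decimal

# (price in USD, approximate number of images), as quoted in this article
PLANS = {
    "DALL-E 2, 1024x1024": (Decimal("0.020"), 1),
    "Midjourney, $10 plan": (Decimal("10"), 200),
    "DreamStudio, $10": (Decimal("10"), 5000),
}


def cost_per_image(price, images):
    """Average price of one image for a given plan."""
    return price / images


for name, (price, images) in PLANS.items():
    print(f"{name}: ${cost_per_image(price, images):.3f} per image")
```

By these numbers, an image costs about $0.020 with DALL·E 2, $0.050 with the cheapest Midjourney plan, and $0.002 with DreamStudio.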

It’s probably the generator with the highest quality: it’s very coherent, very fluid, and open source. On the downside, it requires experience with AI generators and can be confusing to newcomers.

If you want to see what it’s capable of, look at “Stable Diffusion: DALL-E 2 For Free, For Everyone!” and “Stable Diffusion Is Getting Outrageously Good!” by Two Minute Papers. The StableDiffusion subreddit is a constant stream of mind-blowing things.

Another fantastic technique you can use with Stable Diffusion is Dreambooth. It lets you train the AI with custom images (for example, photos of you or your pet) to generate personalized images. You can learn more about Dreambooth in the article Training Stable Diffusion with Dreambooth using 🧨 Diffusers. If you don’t want to train a model yourself, you can use one of the commercial platforms. On Astria you can train a model for $5 with up to 20 photos. Dreambooth is another option with a yearly price tag of $29.99. You can browse the funny gallery with famous people.

OpenArt has an incredible free Stable Diffusion Prompt Book that is constantly updated. It’s one of the best resources I know for learning Stable Diffusion.

Google Colab

Another way to run text-to-image generation is with Google Colab notebooks. A notebook wraps the complicated code for generating AI images in a step-by-step form, so that even beginners can click through the code steps. It’s fully controllable and allows for more specific use cases. On the downside, it’s still not very user-friendly, slow if you don’t pay for faster generation, and requires a lot of technical knowledge.

You can find notebooks for Disco Diffusion, VQGAN, and many others. With a bit of technical knowledge (Python), you can create wonderful things without limitations on your computer. This video gives a brief introduction to generating AI images with Disco Diffusion. But if you don’t have the hardware or don’t want to pay for fast execution, it’s slow: I used the free version of Disco Diffusion, and it took 35 minutes to create one image.

You can also run Dreambooth on Google Colab. This tutorial video or this video explains the steps.

Playgrounds and Services

Plenty of services are available to generate images. Playground lets you use Stable Diffusion (1,000 images per day), or 2,000 images per day with the Pro Plan for $15 per month. With the DALL·E add-on for $10 per month, you can create 800 DALL·E images per month.

Lexica is a Stable Diffusion search engine that allows the creation of 100 images without any payment. The best feature is that each artwork has its prompt (the text used to talk to the AI), its settings, dimensions, and seed (a random number to create noise for the image). With this information, it’s possible to create a similar image (but never the same).
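The role of the seed can be shown with a toy example. This is not Stable Diffusion's own code, only a sketch using Python's standard random module: the function name and the tiny list of numbers are made up for illustration, whereas a real generator produces a much larger block of noise in the shape of the image.

```python
import random


def starting_noise(seed, size=5):
    """Draw `size` pseudo-random values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


same_a = starting_noise(seed=1234)
same_b = starting_noise(seed=1234)
other = starting_noise(seed=1235)

print(same_a == same_b)  # compares two runs with the same seed
print(same_a == other)   # compares runs with different seeds
```

The same seed always reproduces the same starting noise, while a different seed gives different noise. The seed fixes only this starting point; the final image also depends on the prompt, the settings, and the model.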

DreamStudio is the application of the creators of Stable Diffusion and gifts users 500 images for free. 5,000 images cost $10 per month.

The image models compete constantly for the throne, and with each update, people discuss the differences between the generators. Nobody can tell you which of the generators is the best. It comes down to taste, your prompting skills, and the style you want to achieve.

Text-to-Video, Text-to-3D, Text-to-Audio, and Brain-to-Image

Here are a few links to other interesting research that I don’t cover here in detail for lack of information. Google can create impressive videos from text with Imagen Video.

Another Google project, DreamFusion, can generate 3D models from text.

Riffusion is a model that uses Stable Diffusion to create images of spectrograms that can be converted to audio. You can basically create music from text. Harmonai created Diffusion Radio, a radio station that streams AI-generated music 24/7.

And the paper Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding decodes visual stimuli from brain recordings to create images.

AutoDraw is a fun tool for creating images from your drawings. It will recognize what you tried to draw and suggest an illustration.