The Advent of AI: Introduction

Abstract

This four-part series of essays investigates different aspects of AI, from its political and social impact to AI research. I’ll dive deeper into speech-to-text, text generation, and text-to-image generation and share useful resources, videos, and links.

Humanity will remember 2022 as a year of tremendous progress in artificial intelligence (AI). This year, AI made a gigantic leap into consumer products. I now use four products with built-in AI every day.

My journey into AI started with the introduction of GitHub Copilot in the summer of 2021. I used the beta of the product, which is based on Codex, a language model trained by OpenAI to write computer code. Codex, in turn, is based on GPT-3 (Generative Pre-trained Transformer 3), another language model by OpenAI that can be used for a variety of natural language processing tasks, such as language translation, summarization, and question-answering.

Humans have always had a mixture of fear and admiration for artificial intelligence. This is expressed in countless books, movies, TV shows, and games. The Wikipedia article List of artificial intelligence films lists nearly 150 movies since 1927. Iconic movies such as 2001: A Space Odyssey, Blade Runner, The Terminator, Ex Machina, and Her show various degrees of dystopian futures with AI.

This is one potential danger of AI: through accident or malicious training, AI could endanger humanity. Recently, an aggressive AI threatened to kill humans in a test with AI interviews using GPT-3, LaMDA, and Synthesia avatars. This AI isn’t a danger yet because it’s not self-aware. But it shows that AI can be dangerous if it is not trained properly.

Other potential dangers are surveillance, manipulation, and control by authoritarian governments. AI could be used to manipulate people, their behaviors, and emotions, to spread misinformation, influence elections, and manipulate news (as seen in the computer game Deus Ex). It could be used to control the economy, for example by manipulating the stock market.

Combined with advanced robotics, AI could be used by police forces and the military to crush protests, kill people, and control the world.

These dangers lie further in the future. The most immediate danger is that AI could replace humans in many jobs. This could lead to massive unemployment, huge social unrest, and a massive wealth gap between the rich and the poor. The number of jobs that could be replaced is huge: copywriters, translators, authors, journalists, programmers, designers, lawyers, doctors, actors, influencers, and many more. Creative people feel threatened by what AI can do today. Message boards are full of angry arguments that AI is not creative, that it doesn’t create anything new, or that it just mixes existing things. Stack Overflow started to ban AI-generated answers from its platform. Angry developers filed the first lawsuits.¹ AI will likely be subject to regulation soon.

Over the last few weeks, developers everywhere have discussed the future of our profession.² But because humans are bad at predicting the future, I would treat these discussions as nothing more than guesses.

Personally, I think that AI will help humans, at least in the near future. It will serve as a tool, helper, muse, or pair programmer. For now, AI-generated output is not good enough to rely on. A lot of generated text contains wrong information, and generated images have flaws and distortions. For many years, AI will enhance our jobs, not replace us.

AI Research

A wide variety of companies and laboratories are doing research in artificial intelligence, for example DeepMind (a subsidiary of Alphabet Inc.), OpenAI, Google Brain (a deep learning research team within Google), Meta AI Research, Microsoft Research AI, Baidu Research, and AWS AI. These organizations do research in machine learning, deep learning, natural language processing, computer vision, and robotics.

OpenAI

OpenAI was founded in 2015 as a non-profit research company by Elon Musk, Sam Altman (CEO), Greg Brockman (CTO), John Schulman, Ilya Sutskever (Chief Scientist), and others.³ In 2019, it added a capped-profit arm to attract investment. The company has raised $1.7 billion in funding.

Elon Musk is a co-founder of OpenAI and served on its board until 2018. He is also the CEO of Tesla, SpaceX, Neuralink, and 𝕏, and a co-founder of SolarCity and The Boring Company. His recent interview at TED, A future worth getting excited about, is a must-watch.

The company has been working on generative models for several years. In 2019, OpenAI released GPT-2, a large-scale unsupervised language model that generates coherent paragraphs of text. In 2021, it released CLIP, a new approach to image and text understanding; DALL·E, a new approach to image generation; and Codex, a new approach to code generation.

Other interesting areas of research are robotics, music AI, gaming AI, and speech recognition. Its robotics research released Roboschool, a physics simulator for robotics research. The team also trained a human-like robot hand to perform a variety of tasks. In 2019, the hand was able to solve a Rubik’s Cube.

Its music AI research released MuseNet and Jukebox, large-scale neural networks that generate music.

Its gaming AI research produced OpenAI Five, a system of neural networks that plays Dota 2. In April 2019, it beat OG, the reigning champions of The International, the biggest Dota 2 tournament in the world. OpenAI also released Neural MMO, a massively multiplayer online game that can be played by AI agents. In 2022, its AI learned to play Minecraft with Video PreTraining (VPT).

In 2022, OpenAI released Whisper, a speech recognition system, as open source. It can transcribe speech with poor sound quality, such as in a noisy environment or with a bad microphone, in more than 50 languages and translate it into English.

OpenAI also works on making machine learning research more accessible to the public. Spinning Up in Deep RL is an educational resource that teaches the basics of deep reinforcement learning. It’s a collection of tutorials, exercises, and example code that teach you how to build and train reinforcement learning agents.

Speech-to-Text

In the following sections, I’ll show examples of AI-generated content, starting with speech-to-text. I’ll focus on the bigger topics and mention the less-researched ones only in passing.

When OpenAI released Whisper as open source, Andrej Karpathy announced on 𝕏 that he had downloaded and transcribed 322 episodes of the Lex Fridman Podcast with Whisper and published them on his project Lexicap. I wanted to try it out myself because it requires neither expensive hardware nor payment.

In this section, I’ll show how to download a video and then transcribe and translate it with Whisper. The tool requires a running Python environment with PyTorch and FFmpeg installed. It’s picky about the versions, so I’ll walk you through what I did.

Install the Dependencies

I use Homebrew as my package manager and asdf as my version manager. I install the version manager (feel free to use another one), youtube-dl to download a video as audio, and the dependencies listed for Whisper.

brew install asdf
brew install youtube-dl
brew install ffmpeg

Next, I install the correct Python version.

asdf plugin add python
asdf install python 3.9.9
asdf global python 3.9.9

Now I install Whisper and its dependencies.

pip install torch torchvision
pip install git+https://github.com/openai/whisper.git
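
Before running Whisper, it can help to confirm that the pieces installed above are actually visible from Python. The following is a minimal sketch using only the standard library; the function name check_whisper_prerequisites is made up for this example, and it only checks that things can be found, not that their versions are compatible.

```python
import importlib.util
import shutil
import sys


def check_whisper_prerequisites():
    """Report which prerequisites for Whisper appear to be available."""
    return {
        # Interpreter version as a (major, minor) tuple, e.g. (3, 9)
        "python_version": tuple(sys.version_info[:2]),
        # True if an `ffmpeg` executable is found on PATH
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # True if the packages can be located (they are not imported here)
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "whisper_installed": importlib.util.find_spec("whisper") is not None,
    }


report = check_whisper_prerequisites()
for name, value in report.items():
    print(f"{name}: {value}")
```

If one of the entries prints False, the corresponding install step above is the place to look.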

Download a Video

I decided to download a short video from Muji Global in Japanese. It’s short, so the processing will be fast. First, I list the available video and audio formats, and then I pick m4a as the audio format for the download.

youtube-dl -F "https://www.youtube.com/watch?v=j86NOoAcq24"
youtube-dl -f 140 "https://www.youtube.com/watch?v=j86NOoAcq24"

Generating the Transcript

I rename the downloaded file to input.m4a and transcribe it with Whisper. For that, Whisper has to download a 1.42 GB model. Different models are available; for some languages or use cases, a smaller model is enough. Start the command and grab a coffee.

whisper input.m4a --language Japanese --model medium --task translate
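
Whisper can also be called from Python instead of the command line. The sketch below mirrors the flags of the command above, assuming the whisper package from the install step is available; the names DECODE_OPTIONS and translate_audio are made up for this example, and "ja" is the language code for Japanese.

```python
# Options mirroring the command-line flags used above.
DECODE_OPTIONS = {"language": "ja", "task": "translate"}


def translate_audio(path, model_name="medium"):
    """Translate the speech in `path` into English with Whisper.

    Returns a list of (start_seconds, end_seconds, text) tuples.
    """
    # Imported lazily so this file can be loaded without Whisper installed.
    import whisper

    # Downloads the model on first use, then loads it from the local cache.
    model = whisper.load_model(model_name)
    result = model.transcribe(path, **DECODE_OPTIONS)
    # Each segment is a dict with "start", "end" (in seconds) and "text".
    return [
        (segment["start"], segment["end"], segment["text"].strip())
        for segment in result["segments"]
    ]
```

Calling translate_audio("input.m4a") should give roughly the same segments that the command prints to the console, only as Python tuples that can be processed further.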

It took me about 7 minutes to transcribe the 1 minute of audio on my MacBook Pro 2016.
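
To get a feeling for longer recordings, the numbers above can be turned into a rough estimate. This is a back-of-the-envelope sketch that assumes processing time grows linearly with audio length, which I have not measured; the function names are made up for this example.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """How many minutes of processing one minute of audio needs."""
    return processing_minutes / audio_minutes


def estimate_processing_minutes(audio_minutes, factor):
    """Estimate the processing time, assuming linear scaling."""
    return audio_minutes * factor


# 7 minutes of processing for 1 minute of audio, as measured above.
factor = real_time_factor(7, 1)
print(factor)  # 7.0

# Under the linear assumption, a one-hour recording would take about 7 hours.
print(estimate_processing_minutes(60, factor))  # 420.0
```

Newer hardware or a smaller model would lower the factor, so treat the result as an upper bound for this particular machine.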

Whisper will generate three files (some with time code) and print the output to the console.

[00:00.000 --> 00:02.000]  How to make an order curtain
[00:03.000 --> 00:04.000]  Hello.
[00:04.000 --> 00:09.000]  I will show you how to make an order curtain.
[00:11.000 --> 00:14.000]  There are three necessary information.
[00:15.000 --> 00:17.000]  First, the width of the curtain rail.
[00:18.000 --> 00:20.000]  Second, the length from the runner.
[00:21.000 --> 00:23.000]  Third, how to attach the curtain rail.
[00:23.000 --> 00:29.000]  The width of the curtain rail is measured from the edge of the rail.
[00:30.000 --> 00:34.000]  The height is measured from the bottom of the runner.
[00:35.000 --> 00:41.000]  When measuring the height of a large window, it is easy to measure from the floor to the top.
[00:42.000 --> 00:46.000]  Check how the curtain rail is attached.
[00:46.000 --> 00:48.000]  How was it?
[00:48.000 --> 00:52.000]  If you can confirm so far, please select the curtain article.
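
The console output above is easy to process further. The following sketch parses lines in that format into (start, end, text) tuples with times in seconds. It assumes the MM:SS.mmm timestamps shown here; recordings longer than an hour may be printed with an additional hours field, which this pattern does not handle. The names LINE_PATTERN, to_seconds, and parse_transcript are made up for this example.

```python
import re

# Matches lines such as "[00:03.000 --> 00:04.000]  Hello."
LINE_PATTERN = re.compile(
    r"\[(\d+):(\d{2})\.(\d{3}) --> (\d+):(\d{2})\.(\d{3})\]\s*(.*)"
)


def to_seconds(minutes, seconds, millis):
    """Convert the timestamp parts (as strings) into seconds."""
    return int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_transcript(text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    segments = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines without a timestamp
        m1, s1, ms1, m2, s2, ms2, spoken = match.groups()
        segments.append(
            (to_seconds(m1, s1, ms1), to_seconds(m2, s2, ms2), spoken)
        )
    return segments


sample = """\
[00:00.000 --> 00:02.000]  How to make an order curtain
[00:03.000 --> 00:04.000]  Hello.
"""
print(parse_transcript(sample))
# [(0.0, 2.0, 'How to make an order curtain'), (3.0, 4.0, 'Hello.')]
```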

The output is not too impressive, but this is only an example. I tried Whisper on long videos with poor audio quality, and it even picked up chatter in the background. If you would like to learn more, you can watch OpenAI’s Whisper Learned 680,000 Hours Of Speech! by Two Minute Papers, which describes the scientific paper behind Whisper.

¹ Matthew Butterick (2022): Maybe you don’t mind if GitHub Copilot used your open-source code without asking. But how will you feel if Copilot erases your open-source community?, https://githubcopilotinvestigation.com/.
² ThePrimagen (2022): ChatGPT - What does this mean for programmers?, https://www.youtube.com/watch?v=dQYXM4U831A.
³ Crunchbase (2022): OpenAI, https://www.crunchbase.com/organization/openai.