How do AI models generate videos?

MIT Technology Review Explains: Let our writers untangle the complex, messy world of technology to help you understand what's coming next. You can read more from the series here.

It's been a big year for video generation. In the past nine months, OpenAI has made Sora public, Google DeepMind has launched Veo 3, and the video startup Runway has released Gen-4. All of them can produce video clips that are (almost) impossible to tell apart from actual filmed footage or CGI animation. This year Netflix also debuted an AI visual effect in its show The Eternaut, the first time video generation has been used to make mass-market TV.

Sure, the clips you see in demo reels are cherry-picked to show off a company's models at the top of their game. But with the technology in the hands of more users than ever before (Sora and Veo 3 are available in the ChatGPT and Gemini apps for paying subscribers), even the most casual filmmaker can now knock out something remarkable.

The downside is that creators are competing with AI slop, and social media feeds are filling up with fake news footage. Video generation also uses a huge amount of energy, many times more than text or image generation.

So let's take a moment to talk about the technology that makes it all work.

How do you generate a video?

Let's assume you're a casual user. There is now a range of high-end tools that let professional video makers fold video generation models into their workflows, but most people will use this technology in an app or via a website. You know the drill: "Hey, Gemini, make me a video of a unicorn eating spaghetti. Now make its horn take off like a rocket." What you get back will be hit or miss, and you'll usually have to ask the model to take another pass or 10 before you get more or less what you wanted.

So what's going on under the hood? Why is it hit or miss, and why does it take so much energy? The latest wave of video models are what's known as latent diffusion transformers. Yes, that's quite a mouthful. Let's unpack each part in turn, starting with diffusion.

What is a diffusion model?

Imagine taking an image and adding a random spattering of pixels to it. Take that pixel-spattered image and spatter it again, and then again. Do that enough times and you have turned the initial image into a random mess of pixels, like static on an old TV set.

A diffusion model is a neural network trained to reverse that process, turning random static into images. During training, it is shown millions of images in various stages of pixelation. It learns how those images change each time new pixels are thrown at them, and thus how to undo those changes.

The upshot is that when you ask a diffusion model to generate an image, it starts with a random mess of pixels and, step by step, turns that mess into an image that is more or less similar to the images in its training set.

But you don't want just any image; you want the image you specified, usually with a text prompt. So the diffusion model is paired with a second model, such as a large language model (LLM) trained to match images with text descriptions, which guides each step of the cleanup process, pushing the diffusion model toward images that the language model considers a good match for the prompt.
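To make that noising-and-denoising loop concrete, here is a minimal sketch in PyTorch. It is a toy illustration built on invented assumptions: the tiny TinyDenoiser network, the simple linear noise schedule, and the stand-in text embedding are all made up for this example, and none of it reflects the actual architecture of Sora, Veo 3, or any other product.

```python
import torch
import torch.nn as nn

# Toy denoiser: given a noisy image, a timestep, and a text embedding,
# predict the noise that was added. Real systems use huge networks;
# this tiny MLP is just a stand-in.
class TinyDenoiser(nn.Module):
    def __init__(self, image_dim=32 * 32 * 3, text_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, image_dim),
        )

    def forward(self, noisy_image, t, text_embedding):
        x = torch.cat([noisy_image, text_embedding, t], dim=-1)
        return self.net(x)  # the model's guess at the added noise

def add_noise(image, noise, t):
    """Forward process: blend the image with random static.
    At t=0 the image is untouched; at t=1 it is almost pure noise."""
    return (1 - t) * image + t * noise

model = TinyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# --- One training step: learn to predict the noise that was added ---
image = torch.rand(1, 32 * 32 * 3)      # a flattened training image
text_embedding = torch.randn(1, 64)     # stand-in for a text encoder's output
t = torch.rand(1, 1)                    # how much noise to add (0..1)
noise = torch.randn_like(image)

noisy = add_noise(image, noise, t)
optimizer.zero_grad()
loss = ((model(noisy, t, text_embedding) - noise) ** 2).mean()
loss.backward()
optimizer.step()

# --- Generation: start from random static and clean it up step by step,
# --- with the text embedding steering every step toward the prompt.
steps = 1000
x = torch.randn(1, 32 * 32 * 3)         # pure static
with torch.no_grad():
    for i in reversed(range(steps)):
        t = torch.full((1, 1), i / steps)
        predicted_noise = model(x, t, text_embedding)
        x = x - predicted_noise / steps  # remove a little noise each step
```

In a real model the noise schedule, the network, and the sampling loop are all far more sophisticated, but the shape of the process is the same: predict the noise, subtract a little of it, repeat.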
An aside: this LLM isn't pulling the links between text and images out of thin air. Most text-to-image and text-to-video models today are trained on huge data sets containing billions of pairs of text and images or text and video scraped from the internet (a practice many creators are very unhappy about). It also means that what you get from such models is a distillation of the world as it is represented online, distorted by prejudice (and pornography).

It's easiest to imagine diffusion models working with images, but the technique can be used with many kinds of data, including audio and video. To generate movie clips, a diffusion model must clean up sequences of images, the consecutive frames of a video, rather than just one picture.

What is a latent diffusion model?

All of this takes a huge amount of computation (read: energy). That's why most diffusion models used for video generation rely on a technique called latent diffusion. Instead of processing raw data, the millions of pixels in each video frame, the model works in what's known as a latent space, in which the video frames (and the text prompt) are compressed into a mathematical code that captures just the essential features of the data and throws out the rest.

Something similar happens whenever you stream a video over the internet: the video is sent from a server in a compressed format so that it reaches you faster, and when it arrives, your computer or TV converts it back into something watchable.

The final step, then, is to decompress what the latent diffusion process has come up with. Once the compressed frames of random static have been turned into the compressed frames of a video that the LLM guide considers a good match for the user's prompt, the compressed video is converted into something you can watch.

With latent diffusion, the diffusion process works more or less the way it would for an image. The difference is that the pixelated video frames are now mathematical encodings of those frames rather than the frames themselves. That makes latent diffusion far more efficient than a typical diffusion model. (Even so, video generation still uses more energy than image or text generation; there is an eye-popping amount of computation involved.)

What is a latent diffusion transformer?

Still with me? There's one more piece to the puzzle: how to make sure the diffusion process produces a sequence of frames that are consistent, maintaining objects, lighting, and so on from one frame to the next. OpenAI did this with Sora by combining its diffusion model with another kind of model called a transformer. This has now become standard in generative video.

Transformers are great at processing long sequences of data, like words. That has made them the special sauce inside large language models such as OpenAI's GPT-5 and Google DeepMind's Gemini, which can generate long sequences of words that make sense and stay consistent across many dozens of sentences.

But videos are not made of words. Instead, videos get cut into chunks that can be treated as if they were. The approach OpenAI came up with was to dice videos up across both space and time. "It's as if you have a stack of all the video frames and you cut little cubes out of it," says Tim Brooks, a lead researcher on Sora.

[Video: A selection of clips created with Veo 3 and Midjourney, enhanced in post-production with Topaz, an AI video-editing tool. Credit: VaigueMan]
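Here is a minimal sketch of that dicing step, using invented toy dimensions: a clip that has already been compressed into a latent grid is cut into little space-time cubes, and each cube becomes a token that a transformer can process much the way an LLM processes words. None of the shapes or layer sizes below come from a real model.

```python
import torch
import torch.nn as nn

# Invented toy dimensions: a short clip already compressed into a latent grid
# of shape (frames, channels, height, width). Real models use learned encoders
# and far larger grids.
frames, channels, height, width = 16, 8, 32, 32
latent_video = torch.randn(frames, channels, height, width)

def dice_into_spacetime_cubes(latent, cube_t=4, cube_hw=8):
    """Cut the latent video into little cubes across time and space,
    then flatten each cube into a single token."""
    f, c, h, w = latent.shape
    cubes = []
    for t0 in range(0, f, cube_t):
        for y0 in range(0, h, cube_hw):
            for x0 in range(0, w, cube_hw):
                cube = latent[t0:t0 + cube_t, :, y0:y0 + cube_hw, x0:x0 + cube_hw]
                cubes.append(cube.reshape(-1))  # one token per cube
    return torch.stack(cubes)                   # (num_tokens, token_dim)

tokens = dice_into_spacetime_cubes(latent_video)
print(tokens.shape)  # (64, 2048): 4 x 4 x 4 cubes, each flattened to 2048 numbers

# A transformer can now treat these cubes the way an LLM treats words:
# a sequence of tokens attending to one another across both space and time.
token_dim = tokens.shape[-1]
layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
denoiser = nn.TransformerEncoder(layer, num_layers=2)
denoised_tokens = denoiser(tokens.unsqueeze(0))  # (1, num_tokens, token_dim)
```

Real systems learn the compression and the cube size together with the rest of the model; the fixed numbers here are only for illustration.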
Using transformers alongside diffusion models brings several advantages.

Because they are designed to process sequences of data, transformers also help the diffusion model stay consistent across frames as it generates them. That makes it possible to produce videos in which objects don't pop in and out of existence, for example.

And because the videos are diced up into chunks, their size and orientation don't matter. That means the latest wave of video generation models can be trained on a wide variety of example videos, from short vertical clips shot on a phone to wide-screen cinematic footage. The greater variety of training data has made video generation far better than it was just two years ago. It also means that video generation models can now be asked to produce videos in a variety of formats.

What about the audio?

A big advance with Veo 3 is that it generates video with audio, from lip-synched dialogue to sound effects to background noise. That's a first for video generation models. As Google DeepMind's CEO, Demis Hassabis, put it at this year's Google I/O, it marks the end of the silent era of video generation.

The challenge was to find a way to line up the video and audio data so that the diffusion process could work on both at the same time. Google DeepMind's breakthrough was a new way to compress audio and video into a single piece of data inside the diffusion model. When Veo 3 generates a video, its diffusion model produces the audio and the video together in a lockstep process, ensuring that the sound and the images stay in sync.

Diffusion models aren't limited to images and video, either; they can be used to generate other kinds of data as well. And using a diffusion model instead of a transformer to generate text could turn out to be far more efficient than existing LLMs in the near future.
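To illustrate the "single piece of data" idea from the audio section above, here is a conceptual sketch. It is an invented simplification, not Google DeepMind's actual architecture: audio and video latents are simply packed into one token sequence so that a single denoising loop updates both together and keeps them in step.

```python
import torch
import torch.nn as nn

# Invented toy shapes: compressed video tokens and compressed audio tokens
# for the same clip. In a real system both come from learned encoders.
video_tokens = torch.randn(1, 64, 256)  # (batch, num_video_tokens, dim)
audio_tokens = torch.randn(1, 16, 256)  # (batch, num_audio_tokens, dim)

# Pack both modalities into one sequence: the single piece of data
# that the diffusion process operates on.
joint = torch.cat([video_tokens, audio_tokens], dim=1)  # (1, 80, 256)

# A shared denoiser (stand-in transformer) sees sound and image tokens at
# once, so each denoising step refines them together, in lockstep.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
denoiser = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randn_like(joint)             # start from random static for both
with torch.no_grad():
    for _ in range(50):                 # simplified denoising loop
        predicted_noise = denoiser(x)
        x = x - 0.02 * predicted_noise  # one small cleanup step for audio and video

# Split the result back into the two modalities before decoding each one.
denoised_video = x[:, :64, :]
denoised_audio = x[:, 64:, :]
```

The point of packing everything into one sequence is that there is no separate audio pass that could drift out of step with the picture.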