Sora

Sora is a generative machine learning model for video and images developed by OpenAI. It has three main components: (1) a video encoder, which compresses videos both temporally and spatially into a latent representation; (2) a Diffusion Transformer, which operates on sequences of video patches (the analogues of text tokens) in the latent space, together with conditioning information such as a text prompt; and (3) a decoder, which upsamples the generated videos from the latent space back into pixel space. The video encoder (component 1) is used only during training. When deployed, the model starts from a noisy sample in the latent space along with the prompt, passes them through the Diffusion Transformer to generate a video representation in the latent space, and then through the decoder to obtain the high-resolution video.
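The inference flow described above (noise in latent space, denoised by a transformer conditioned on a text prompt, then decoded to pixels) can be illustrated with a small sketch. The Python/PyTorch code below is a minimal, illustrative mock-up, not OpenAI's implementation; all module names, dimensions, and the simplified denoising update are assumptions made for the example.

    # Minimal sketch of latent-space video generation with a diffusion transformer.
    # All shapes, hyperparameters, and the crude denoising step are illustrative only.
    import torch
    import torch.nn as nn

    class DiffusionTransformer(nn.Module):
        """Stand-in for the Diffusion Transformer operating on latent video patches."""
        def __init__(self, patch_dim=64, cond_dim=64, n_layers=2, n_heads=4):
            super().__init__()
            self.cond_proj = nn.Linear(cond_dim, patch_dim)
            layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=n_heads,
                                               batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.out = nn.Linear(patch_dim, patch_dim)

        def forward(self, noisy_patches, text_cond):
            # Prepend the text conditioning as an extra token, then predict the noise.
            cond_tok = self.cond_proj(text_cond).unsqueeze(1)      # (B, 1, D)
            h = torch.cat([cond_tok, noisy_patches], dim=1)        # (B, 1+N, D)
            h = self.backbone(h)[:, 1:]                            # drop the cond token
            return self.out(h)                                     # predicted noise

    class VideoDecoder(nn.Module):
        """Stand-in for the decoder mapping latent patches back to pixel space."""
        def __init__(self, patch_dim=64, pixels_per_patch=3 * 4 * 8 * 8):
            super().__init__()
            self.proj = nn.Linear(patch_dim, pixels_per_patch)

        def forward(self, latent_patches):
            return self.proj(latent_patches)   # reshaping into frames omitted

    @torch.no_grad()
    def generate(model, decoder, text_cond, n_patches=128, patch_dim=64, steps=50):
        """Start from pure noise in latent space, iteratively denoise conditioned
        on the prompt, then decode the final latents into pixel-space patches."""
        z = torch.randn(1, n_patches, patch_dim)        # noisy latent sample
        for _ in range(steps):
            eps = model(z, text_cond)                   # predict noise given prompt
            z = z - eps / steps                         # simplified update (illustrative)
        return decoder(z)

    # Usage: a random vector stands in for the encoded text prompt.
    model, decoder = DiffusionTransformer(), VideoDecoder()
    video_patches = generate(model, decoder, text_cond=torch.randn(1, 64))
    print(video_patches.shape)

Note that, as in the description above, the encoder never appears at inference time: generation begins directly from noise in the latent space.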
External reference:
https://openai.com/research/video-generation-models-as-world-simulators