Just when I thought the most realistic graphics ever generated would be The Last of Us remastered for the PS5, our beloved Sam Altman dropped a massive bombshell in the shape of a four-letter word: Sora. So big was the impact that even film titan Tyler Perry paused his $800M studio expansion, citing Sora's "shocking" capabilities.
Now coming to the main question…
What in the world is Sora?
Sora is the latest brainchild of OpenAI: a model that can generate "realistic" and "imaginative" short videos from text instructions. The overarching goal of the project is to evolve such models into world simulators. Believe me when I say it is no less than an engineering marvel.
Sora takes inspiration from large language models like GPT-3.5 and image-generation models like DALL-E 3, and is trained on immense volumes of data from the internet. Just as large language models and foundation models break text into small chunks, or "tokens", to understand and process information, Sora works with small chunks of images and videos called "visual patches".
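As a toy illustration of the idea (my own sketch, not OpenAI's code, and the 4-frame, 16-pixel patch sizes are made-up numbers), here is how a small video tensor could be chopped into flattened spacetime patches, the visual analogue of text tokens:

```python
import numpy as np

# Video tensor: (frames, height, width, channels)
video = np.random.rand(16, 64, 64, 3)

def extract_patches(video, t=4, p=16):
    """Split a video into non-overlapping t x p x p spacetime blocks,
    each flattened into one vector -- one "visual patch" per block."""
    f, h, w, c = video.shape
    return (video
            .reshape(f // t, t, h // p, p, w // p, p, c)  # carve into blocks
            .transpose(0, 2, 4, 1, 3, 5, 6)               # group block indices first
            .reshape(-1, t * p * p * c))                  # flatten each block

tokens = extract_patches(video)
print(tokens.shape)  # (64, 3072): 4*4*4 patches, each a 4*16*16*3 vector
```

The payoff of this representation is that one sequence of patch vectors can describe videos of any resolution, duration, or aspect ratio, just as one token sequence can describe any text.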
Before training, raw video data is first compressed into more digestible bits, or "latents", in both time (temporal compression) and space (spatial compression). Sora trains on patches extracted from these latents and generates new videos in the same compressed space. A corresponding decoder model then converts the generated latents back into a legible format, i.e., pixel space.
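A quick shape walk-through makes the compression concrete (the 4x temporal and 8x spatial factors below are my own illustrative assumptions, not published numbers):

```python
import numpy as np

# An encoder shrinks the video along both the time axis and the
# spatial axes before the diffusion model ever sees it.
frames, height, width, channels = 64, 256, 256, 3
t_factor, s_factor, latent_ch = 4, 8, 16

raw = np.zeros((frames, height, width, channels))
latent_shape = (frames // t_factor,   # 64 frames  -> 16 latent frames
                height // s_factor,   # 256 pixels -> 32 latent rows
                width // s_factor,    # 256 pixels -> 32 latent cols
                latent_ch)            # latent channels replace RGB

compression = raw.size / np.prod(latent_shape)
print(latent_shape, compression)  # (16, 32, 32, 16) 48.0
```

Even with these modest made-up factors, the model trains on a representation dozens of times smaller than raw pixels, which is what makes video-scale training tractable at all.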
Is it better than what we already had?
Oh, yeah. 1000 times better.
Sora is not just any model: it is a diffusion model built on a transformer architecture. This might just be the key to why the videos it generates are so hyper-realistic. Remember visual patches? Now imagine there are literally trillions of such patches available for it to train on. Using probabilistic machine learning, Sora's diffusion model starts from noisy patches and learns to predict the original "clean" patches, guided by the user's text prompt. And because the transformer backbone processes all of those patches in parallel rather than one at a time, Sora is incredibly efficient.
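To make the "predict clean patches from noise" idea concrete, here is a deliberately simplified sketch in plain Python. It is my own caricature, not Sora's training code: a hand-written rule stands in for the learned transformer denoiser, and "clean" is simply zero.

```python
import random

def denoise_step(noisy, step, total_steps):
    # Stand-in for the learned denoiser: nudge every value a little
    # closer to the "clean" target (here, just zero) at each step.
    k = total_steps - step
    return [x * k / (k + 1) for x in noisy]

def generate(n_values=4, total_steps=10, seed=0):
    random.seed(seed)
    sample = [random.gauss(0, 1) for _ in range(n_values)]  # pure noise
    start = list(sample)
    for step in range(total_steps):          # iterative denoising loop
        sample = denoise_step(sample, step, total_steps)
    return start, sample

start, out = generate()
# The per-step factors telescope to 1/(total_steps + 1) of the noise.
print([round(o / s, 2) for s, o in zip(start, out)])  # [0.09, 0.09, 0.09, 0.09]
```

The real model replaces the hand-written rule with a transformer that predicts the noise to remove at each step, conditioned on the text prompt, but the shape of the loop, many small refinements from noise toward a clean sample, is the same.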
Transformers also scale remarkably well in both language modeling and video generation: as computing power increases and training is refined, Sora produces higher-quality, more realistic videos. Sora additionally builds on what OpenAI learned from DALL-E 3 and GPT, expanding short user prompts into highly detailed captions that are then used to generate videos matching the text instructions.
Remembering our Sensei, Mr. Spiderman.
The engineer in me was completely blown away, excited to know what the future holds and wondering how much further we can actually go. And it was at this point that I was reminded of something. In my economic policy class at Ross, the professor always included a footnote in his presentation slides, particularly on those that covered important policy decisions and strategies. The footnote read, "With great power, comes great responsibility – Spiderman".
While the quote is technically delivered by Uncle Ben to Peter Parker, the sentiment resonated. The professor was reminding us that every decision we make as future leaders will lead to certain outcomes and consequences, some beyond our expectations. It is our responsibility to be cognizant of those repercussions and take proactive steps to manage them.
This is extremely relevant in the case of Sora. This leap forward in AI video technology is arriving while our society grapples with two major threats — deepfake content and information bias.
We are seeing more and more examples of deepfake media threatening the privacy, security and reputation of individuals and organizations, public and private: faked explicit pictures of Taylor Swift, deepfake Zoom calls duping employees out of millions of dollars. How do we draw the line between reality and deception? Who is responsible for drawing that line, and are they in control? The fear that Sora could become a convenient, advanced tool for malicious actors to create even more damaging content is legitimate.
Deepfake technology has far-reaching consequences in politics as well. Misleading yet convincing propaganda, quickly created by opposing parties, can affect the outcome of elections. Ahead of the 2024 New Hampshire primary in the U.S. presidential race, President Joe Biden's voice was deepfaked and used in robocalls to dissuade voters. It has happened before and it can happen again, potentially with greater efficacy given advanced tools like Sora.
GenAI also notoriously suffers from information bias. When models are trained on countless historical data points from open sources like the internet, it is nearly impossible to keep bias from creeping in. This produces misinformation and skewed results: rarely showing women as doctors or in other advanced professions, disproportionately depicting dark-skinned men committing crimes, and generating exclusively white male CEOs. If AI begins to dominate the media we consume, how will that bias be identified, challenged or corrected?
…back to Tyler Perry
Only time will reveal the impact Sora will have on the film industry, and whether Tyler Perry's decision to pause his studio expansion is wise. Sora is still far from perfect and has a number of flaws in recreating real-world physics, many of which are illustrated in OpenAI's release document. Despite this, one thing is perfectly clear: the risks to individual privacy and of misinformation exist and will persist if action is not taken now.
Both regulatory bodies and AI companies like OpenAI, Meta, and Google must create mechanisms that prevent malicious actors from producing explicit and misleading content. Sam Altman has said that Sora is not yet publicly available and that OpenAI is working with red teams of researchers, scientists and engineers to test the limits of its safeguards against misuse. He has also said that OpenAI applies reinforcement learning from human feedback (RLHF) to reduce bias in these systems, and he is optimistic that the model will eventually generate unbiased results and be free from misuse. The real question is how much we can trust these claims and safeguards, and even if we can, whether that will be enough.
* Sora: 空, 昊 (Japanese Kanji), そら (Japanese Hiragana). Pronounced SO-RA. Definition: sky; emptiness.
Further reading:
https://openai.com/research/video-generation-models-as-world-simulators#fn-26