2022 — The year superstar AI models stepped onto the stage

Note: This article originally appeared on Matt’s personal Medium page. He’s granted us permission to repost. Image generated by Stable Diffusion from text “A photo of spacemen riding horses”.

By Matthew Harris, Head of Data Science, DataKind

As we close out 2022, one can’t help but think this year has been a bit special for AI. Progress has been consistent since the rise of neural networks, and AI is already everywhere in our day-to-day lives, but this year we started to see something a little different: the rise of superstar models that are really beginning to capture the imagination of the general public.

In a flurry of spacemen on horses, Midjourney, Stable Diffusion and DALL∙E 2 generated stunning images from simple text descriptions. Galactica and GPT-3 wrote plausible-looking (if sometimes wholly inaccurate) articles, and chat AIs such as ChatGPT, Sparrow and LaMDA wrote poetry, told jokes and explained scientific concepts in terms a 6-year-old might understand. Developers signed on to use GitHub’s Copilot to magically help write code, and AlphaFold was lauded as transformative in the field of molecular biology.

What’s especially interesting is that these advances have been noticed by more than just the technical community. Over a million users signed up for ChatGPT in the first 5 days of its beta launch, many more have started interacting with these models through a range of apps and websites, and some Tinder users are even using AI to help them find love. If this is a gold rush, then there are already those selling shovels, with a secondary market developing around how best to prompt models for the best results. All this has generated a surge in mainstream news articles and social media buzz, suggesting that we are at an interesting juncture in the progress of AI.

So what’s going on? Why did 2022 turn out to be such a bumper year?

Chart showing AI models exceeding human performance on text and image recognition benchmarks. Source: OurWorldInData.org

Models have exceeded human-level accuracy in text and image recognition for the last few years now, in some cases saturating benchmarks altogether. We’ve also seen the rise of generative AIs that can produce convincing text and photorealistic images. These advances have been fueled by a few key things:

In their seminal 2017 paper “Attention Is All You Need”, Google researchers proposed “a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely”. These attention mechanisms enable models to learn deep context in a sequence of data, such as words in text. For example, in the phrase “I had a packet of cookies, but when I got home I discovered my roommate ate them all”, a transformer is able to learn that “them” refers to the cookies. This is a simple example, but given huge volumes of data, transformers can learn extremely subtle and complex structure. Context like this is key to language understanding for humans (even if sometimes those humans are a bit disappointed their cookies are, in fact, gone).
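
To make this concrete, here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the Transformer (toy dimensions only; real models add learned projections and multiple attention heads):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each query matches each key
    weights = softmax(scores, axis=-1)   # each row sums to 1: attention over tokens
    return weights @ V                   # blend token values by attention weight

# Toy self-attention: 5 "tokens", each a 4-dimensional embedding
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (5, 4): one context-aware vector per token
```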

Transformer models also support transfer learning, where a model trained on one task can be applied to a different task. Fine-tuning takes this further, refining the model with additional training on new data. For example, a model trained on internet text can be fine-tuned to generate computer code. Transfer learning and fine-tuning have been applied to great effect in recent years with language models such as BERT and XLNet, which have been made available to the wider community on vibrant platforms like Hugging Face. You no longer need a few football fields of computers to have access to state-of-the-art model performance.
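
As a rough illustration of how accessible this has become, here is a sketch of fine-tuning a pretrained BERT on a sentiment classification task using the Hugging Face transformers and datasets libraries (the dataset choice and hyperparameters here are illustrative, not a recipe):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Transfer learning: start from a model pretrained on general English text
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tuning: refine the model on a specific downstream task (here, movie reviews)
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),  # small slice for speed
)
trainer.train()
```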

More recently we’ve seen transformer models such as DeepMind’s Gato and Meta’s Data2Vec become multimodal, moving beyond just language tasks and into image, video, audio and robotics.

For several years Generative Adversarial Networks (GANs) were the champions of image generation. More recently they have been superseded by diffusion models such as DALL∙E 2 and Stable Diffusion. Using techniques inspired by thermodynamic diffusion, these models take captioned images and degrade them through a series of steps until they are pure noise, then train neural networks to reverse the process and recreate the image. The end result is a model that can generate amazingly realistic images from any input text phrase. They can even be used to edit existing images using text directives.
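
The forward (noising) half of that process has a simple closed form in the standard DDPM formulation. Here is a toy NumPy sketch of it (the hard part, the neural network trained to reverse each step, is omitted):

```python
import numpy as np

def forward_diffusion(x0, t, betas):
    """Sample x_t ~ q(x_t | x0): the image x0 after t steps of added Gaussian noise."""
    alpha_bar = np.cumprod(1.0 - betas)[t]  # fraction of original signal surviving to step t
    noise = np.random.normal(size=x0.shape)
    # Closed form: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Toy "image" and a linear noise schedule over 1,000 steps
x0 = np.random.rand(64, 64, 3)
betas = np.linspace(1e-4, 0.02, 1000)
x_T = forward_diffusion(x0, 999, betas)  # by the final step, x_T is essentially pure noise
```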

An extremely important feature of both these architectures is that they enable calculations to be run in parallel, which has cost and training-time implications, as we’ll see below.

In order for models to capture subtle nuances, and in some cases develop amazing emergent properties such as in-context learning, where the model can be ‘taught’ on the fly with examples in the prompt, a lot of data is required. Stable Diffusion was trained on a dataset of 5 billion captioned images, and GPT-3’s training corpus comprised roughly 499 billion tokens. The internet and the digitization of books, news articles and scientific papers have created a huge repository of data to satisfy this need, the majority of it gathered from the web as part of the fantastic Common Crawl initiative.

In some cases advances in self-supervision techniques have also contributed to the increase in available training data. These do away with the requirement for humans to label data, which is time consuming and expensive and limits the amount of data available for training. This is not a new concept of course; Word2Vec hit the scene almost a decade ago, but self-supervised training of transformer architectures has really supercharged things. In fact, we may actually run out of data quite soon.
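
Masked language modeling is the classic example: the ‘labels’ are just words hidden from the model, so no human annotation is needed. A quick sketch using a Hugging Face pipeline and a model pretrained this way:

```python
from transformers import pipeline

# bert-base-uncased was pretrained by masking words and predicting them
# from context alone, with no human-written labels
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("My roommate ate all of my [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```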

The general trend seems to be that model performance scales with the volume of data and computing resources (albeit with counterarguments about how exactly). State-of-the-art models require eye-watering computing resources, with billions of parameters trained on massive amounts of data. NVIDIA’s NeMo Megatron framework supports models with 530 billion parameters. For this ‘Large scale era’ of AI to even be possible, advances in parallel cloud computing have been key. It still costs many thousands of arms and legs to train a humongous model, but at least it’s possible for those with sufficient funds.

Trends in ML models between 1952 and 2022, illustrating that we have recently entered a ‘Large scale Era’. Source: Sevilla et al. 2022

Convergence towards a set of foundation models. Source: Bommasani et al. 2021

All this has led to convergence onto a relatively small set of foundation models. Trained on staggering volumes of data with billions of parameters, it is mainly commercial entities that have the resources to pay for the computing and engineering required to develop them. That said, it is this very scale that has enabled new levels of performance.

As with any new phase of AI there is a certain amount of hype and hysteria, but given the capabilities of these new models, there are some real problems that need to be addressed first.

Liar, liar (digital) pants on fire!

As the luminary Andrew Ng notes, “Large language models like Galactica and ChatGPT can spout nonsense in a confident, authoritative tone. This overconfidence — which reflects the data they’re trained on — makes them more likely to mislead”. Unlike a human, these models so far have no concept of uncertainty in their responses, and being trained mostly on web content they can be dead wrong. Meta’s Galactica was taken offline shortly after its public release when it was found to sometimes generate wildly inaccurate scientific research articles, and it’s quite easy to get ChatGPT to say some pretty crazy stuff. The new and very real danger here is that these models can be utterly convincing even when producing gibberish, with the potential for extremely damaging outcomes.

As with any AI trained on web data from us humans, biases can emerge. For example, Stable Diffusion has been shown to perpetuate racial stereotypes. ChatGPT does seem to be a bit better than its predecessors and appears to pass ad hoc tests such as the Nazi test, but even so it’s still in a formative stage, and OpenAI clearly state that the current release is intended to refine the model and that until this work is complete it can produce incorrect, biased and offensive output.

The good news is that some organizations like OpenAI are investing in mitigating issues with helpfulness, correctness and harmlessness, using a range of techniques including Reinforcement Learning from Human Feedback (RLHF). But in these early days, and with very high stakes, caution will be needed for some time.
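
RLHF begins by fitting a reward model to human preference comparisons between pairs of model responses. Here is a minimal PyTorch sketch of the pairwise loss typically used for that step (the reward model itself and the subsequent reinforcement learning loop are omitted):

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Push the reward of the human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to (preferred, rejected) response pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, 1.1])
print(preference_loss(chosen, rejected))  # shrinks as preferred responses rank higher
```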

One area of concern with advanced language models has been that students can potentially generate essays and articles from just a few lines of text, circumventing the need to actually learn anything themselves (other than how to be expert prompt engineers, securing high-paying jobs). OpenAI might be implementing digital watermarking to mitigate plagiarism, and opinion varies amongst educators as to exactly how disruptive article generation might be, but it illustrates the potential for misuse. Models that help students cheat, help hackers write convincing phishing emails and malware, or help fraudsters generate realistic fake social media profiles can have a very real negative impact on society.

Related to the plagiarism question is the possible impact on creativity itself. If generative AI models are trained on the current state of the world and the world starts using them almost exclusively, will this stifle new ideas? This of course is a deep philosophical question and perhaps it’s a bit optimistic to think that humans hold exclusive rights on creativity, but we’ve seen many unanticipated effects of machine learning and AI in the past.

Somewhat ironically, given OpenAI’s name, their models are considered proprietary and are not available for the public to download. There is an argument that this is the best route, preventing models from being used dangerously until they are proven safe, but as models become more and more advanced it is disconcerting that they might remain ‘secret’ with little in the way of external regulation.

This is the bit where I join legions of (past) futurists sagely proclaiming a brave new world, only to be ignobly dashed upon the rocks of what-actually-ended-up-happening.

I am still waiting for my flying car.

That said, below are a few potential areas where I am hoping we will see some dazzling progress in 2023.

Image generated by Stable Diffusion from text “futurists sagely proclaiming a brave new world, only to be ignobly dashed upon the rocks”. Seems they have funny arms.

Large language models today are truly amazing, but the fact that they can produce plausible gibberish and biased output is the biggest roadblock to their widespread adoption and poses a significant risk to society. As mentioned above, companies such as OpenAI are investing heavily in AI safety; others will need to do the same. I’m hoping we see progress towards nicer, less biased and more helpful models in 2023.

There is a continual march towards models trained on more data with more parameters (though with a recent emphasis on fewer parameters and more data). GPT-3.5 has been released, suggesting GPT-4 might make an appearance this coming year, and other vendors such as Google, DeepMind and Amazon, plus many others, are actively working on new models.

We may see new emergent properties at increased scale and, since this process is poorly understood, increased safety concerns. The move towards multimodality will continue, perhaps towards an even smaller set of multimodal transformer models capable of performing a wide range of tasks.

An example of in-context learning, an emergent property of OpenAI’s GPT-3. I provided the model with example answers in the voice of Cookie Monster (which was fun), and the final answer was satisfyingly cookie-related without the need for any retraining.
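
For the curious, here is a sketch of what such a few-shot prompt looks like in code, using the openai Python library as it existed in late 2022 (the model name and example answers are illustrative):

```python
import openai  # reads the OPENAI_API_KEY environment variable

# In-context learning: a few example Q&A pairs in the prompt steer the model,
# with no retraining or fine-tuning required
prompt = """Answer in the voice of Cookie Monster.

Q: What is your favorite food?
A: Me love cookies! Om nom nom!

Q: How do you feel about vegetables?
A: Veggies okay... but me rather have cookies.

Q: What should I have for dessert?
A:"""

response = openai.Completion.create(
    model="text-davinci-003",  # GPT-3.5-era completion model
    prompt=prompt,
    max_tokens=50,
)
print(response.choices[0].text.strip())
```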

There will be a place for custom on-premises machine learning and AI for some time yet, but as foundation models grow in size and complexity, the cost of training, fine-tuning and running them in-house becomes prohibitive for smaller organizations. It is therefore likely that we will see growth in AI as a Service through APIs, as is already the case for OpenAI. This is not a new idea (most of the cloud vendors already have similar offerings), but the incentive for smaller organizations grows as models reach exceptional, human-level performance.

Perhaps not something for 2023, but with multimodal support and the possible emergence of Artificial General Intelligence, we could see AI becoming a utility where we pay to buy some ‘brains’!

The world has become used to search engines for finding information: we enter a query, receive a set of results, then click through them to extract an answer. This relies on the user making decisions while parsing each result, and it isn’t always great when trying to find information scattered across multiple websites. Advanced chat AIs offer an easier approach because the response to any question is summarized into a single block of text. It’s similar to how we ask each other for information, and it doesn’t require the user to click through individual search results. If the AI challenges mentioned above can be surmounted, this would be a huge step forward for information retrieval.

OpenAI is out of the starting gates with this new approach, but I would expect Google to respond with something in 2023. Whatever happens, just as with search, where many learned how best to form queries, it’s likely that people will learn how best to form model prompts to get the best results, i.e. ‘prompt engineering’.

ChatGPT’s response when asked whether Cookie Monster inspired my roommate to steal cookies. I couldn’t agree more.

So, a very exciting 2023 ahead of us! I hope it brings joy to everybody reading this, whether you are a carbon-based lifeform or not.

Matt Harris is Head of Data Science at DataKind, overseeing data product development and data science strategy. He also has an extensive DataKind track record as a volunteer data scientist and Data Ambassador. 
