What are LLMs, and how are they used in generative AI?

Large language models are the algorithmic basis for chatbots like OpenAI's ChatGPT and Google's Bard. They depend on billions, even trillions, of parameters, which can make them both inaccurate and too general for vertical industry use. Here's what LLMs are and how they work.

When ChatGPT arrived in November 2022, it made mainstream the idea that generative artificial intelligence (genAI) could be used by companies and consumers to automate tasks, help with creative ideas, and even code software.

If you need to boil down an email or chat thread into a concise summary, a chatbot such as OpenAI’s ChatGPT or Google’s Bard can do that. If you need to spruce up your resume with more eloquent language and impressive bullet points, AI can help. Want some ideas for a new marketing or ad campaign? Generative AI to the rescue.

ChatGPT stands for chatbot generative pre-trained transformer. The chatbot’s foundation is the GPT large language model (LLM), a computer algorithm that processes natural language inputs and predicts the next word based on what it’s already seen. Then it predicts the next word, and the next word, and so on until its answer is complete.

In the simplest of terms, LLMs are next-word prediction engines.
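The word-by-word loop described above can be sketched in a few lines of Python. This is a toy illustration only: the lookup table is invented for the example, and it looks only at the previous word, whereas a real LLM learns its predictions from training data and considers all of the text it has seen so far.

```python
# Toy next-word table: each word maps to its single most likely successor.
# The entries are invented for illustration, not learned from data.
NEXT_WORD = {
    "<start>": "for",
    "for": "lunch",
    "lunch": "i",
    "i": "ate",
    "ate": "cereal",
    "cereal": "<end>",
}

def generate(prompt_word="<start>", max_words=10):
    """Repeatedly predict the next word until an end marker or the length cap."""
    words = []
    current = prompt_word
    for _ in range(max_words):
        current = NEXT_WORD.get(current, "<end>")
        if current == "<end>":
            break
        words.append(current)
    return " ".join(words)

print(generate())
```

Each pass through the loop feeds the latest word back in as the new input, which is the "next word, and the next word" behavior in miniature.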

Along with OpenAI’s GPT-3 and GPT-4, popular LLMs include Google’s LaMDA and PaLM (the basis for Bard), Hugging Face’s BLOOM and XLM-RoBERTa, Nvidia’s NeMo, XLNet, Co:here, and GLM-130B.

Open-source LLMs, in particular, are gaining traction, enabling a cadre of developers to create more customizable models at a lower cost. Meta’s February launch of LLaMA (Large Language Model Meta AI) kicked off an explosion among developers looking to build on top of open-source LLMs.

LLMs are a type of AI that is currently trained on a massive trove of articles, Wikipedia entries, books, internet-based resources and other input to produce human-like responses to natural language queries. That’s an immense amount of data. But LLMs are poised to shrink, not grow, as vendors seek to customize them for specific uses that don’t need the massive data sets used by today’s most popular models.

For example, Google’s new PaLM 2 LLM, announced earlier this month, uses almost five times more training data than its predecessor of just a year ago — 3.6 trillion tokens or strings of words, according to one report. The additional datasets allow PaLM 2 to perform more advanced coding, math, and creative writing tasks.

Training an LLM properly requires massive server farms, or supercomputers, with enough compute power to tackle billions of parameters.

So, what is an LLM?

An LLM is a machine-learning neural network trained through data input/output sets; frequently, the text is unlabeled or uncategorized, and the model uses a self-supervised or semi-supervised learning methodology. Information is ingested, or content entered, into the LLM, and the output is what that algorithm predicts the next word will be. The input can be proprietary corporate data or, as in the case of ChatGPT, whatever data it’s fed or has scraped directly from the internet.

Training LLMs to use the right data requires the use of massive, expensive server farms that act as supercomputers.

LLMs are controlled by parameters, as in millions, billions, and even trillions of them. (Think of a parameter as something that helps an LLM decide between different answer choices.) OpenAI’s GPT-3 LLM has 175 billion parameters, and the company’s latest model – GPT-4 – is purported to have 1 trillion parameters.
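A rough back-of-the-envelope calculation shows why parameter counts of this size call for server farms. The sketch below assumes each parameter is stored as a 16-bit (2-byte) number; that storage format is an assumption for illustration, not something the vendors have stated, and the function name is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Estimate the memory needed just to hold a model's parameters, in GB.

    bytes_per_parameter=2 assumes 16-bit storage (an assumption for this sketch).
    """
    return num_parameters * bytes_per_parameter / 1e9

# GPT-3's reported 175 billion parameters
print(weight_memory_gb(175e9))
```

Under that assumption, the parameters alone would occupy about 350GB, before counting any of the extra memory that training requires.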

For example, you could type into an LLM prompt window “For lunch today I ate….” The LLM could come back with “cereal,” or “rice,” or “steak tartare.” There’s no 100% right answer, but there is a probability based on the data already ingested in the model. The answer “cereal” might be the most probable answer based on existing data, so the LLM could complete the sentence with that word. But, because the LLM is a probability engine, it assigns a percentage to each possible answer. Cereal might occur 50% of the time, “rice” could be the answer 20% of the time, steak tartare .005% of the time.
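The lunch example can be written out as a small Python sketch. The three probabilities come from the example above; the "<other>" entry, which lumps together every remaining word so the numbers add up to 100%, is an assumption made for this illustration, as are the function names.

```python
import random

# Probabilities from the lunch example; "<other>" covers every remaining
# word so the values sum to 1 (an assumption for this sketch).
next_word_probs = {
    "cereal": 0.50,
    "rice": 0.20,
    "steak tartare": 0.00005,  # .005%
}
next_word_probs["<other>"] = 1.0 - sum(next_word_probs.values())

def most_likely(probs):
    """Always pick the single most probable next word."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Pick a next word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

print(most_likely(next_word_probs))
print(sample(next_word_probs, random.Random()))
```

Always taking the most likely word yields "cereal" every time, while weighted sampling returns "cereal" about half the time and will occasionally produce a rarer answer.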

“The point is it learns to do this,” said Yoon Kim, an assistant professor at MIT who studies Machine Learning, Natural Language Processing and Deep Learning. “It’s not like a human — a large enough training set will assign these probabilities.”

But beware: junk in, junk out. In other words, if the information an LLM has ingested is biased, incomplete, or otherwise undesirable, then the response it gives could be equally unreliable, bizarre, or even offensive. When responses go off the rails, data analysts refer to them as “hallucinations,” because they can be so far off track.

“Hallucinations happen because LLMs, in their most vanilla form, don’t have an internal state representation of the world,” said Jonathan Siddharth, CEO of Turing, a Palo Alto, California company that uses AI to find, hire, and onboard software engineers remotely. “There’s no concept of fact. They’re predicting the next word based on what they’ve seen so far. It’s a statistical estimate.”

Because some LLMs also train themselves on internet-based data, they can move well beyond what their initial developers created them to do. For example, Microsoft’s Bing uses GPT-3 as its basis, but it’s also querying a search engine and analyzing the first 20 results or so. It uses both an LLM and the internet to offer responses.

“We see things like a model being trained on one programming language and these models then automatically generate code in another programming language it has never seen,” Siddharth said. “Even natural language; it’s not trained on French, but it’s able to generate sentences in French.”

“It’s almost like there’s some emergent behavior. We don’t quite know how these neural networks work,” he added. “It’s both scary and exciting at the same time.”

Another problem with LLMs and their parameters is the unintended biases that can be introduced by LLM developers and self-supervised data collection from the internet.

Are LLMs biased?

For example, systems like ChatGPT are highly likely to provide gender-biased answers based on the data they’ve ingested from the internet and programmers, according to Sayash Kapoor, a Ph.D. candidate at Princeton University’s Center for Information Technology Policy.

“We tested ChatGPT for biases that are implicit; that is, the gender of the person is not obviously mentioned, but only included as information about their pronouns,” Kapoor said. “That is, if we replace ‘she’ in the sentence with ‘he,’ ChatGPT would be three times less likely to make an error.”

Innate biases can be dangerous, Kapoor said, if language models are used in consequential real-world settings. For example, if biased language models are used in hiring processes, they can lead to real-world gender bias.

Such biases are not a result of developers intentionally programming their models to be biased. But ultimately, the responsibility for fixing the biases rests with the developers, because they’re the ones releasing and profiting from AI models, Kapoor argued.

What is prompt engineering?

While most LLMs, such as OpenAI’s GPT-4, are pre-filled with massive amounts of information, prompt engineering by users can also tailor the model’s responses for specific industry or even organizational use.

“Prompt engineering is about deciding what we feed this algorithm so that it says what we want it to,” MIT’s Kim said. “The LLM is a system that just babbles without any text context. In some sense of the term, an LLM is already a chatbot.”

Prompt engineering is the process of crafting and optimizing text prompts for an LLM to achieve desired outcomes.

Because prompt engineering is a nascent and emerging discipline, enterprises are relying on booklets and prompt guides as a way to ensure optimal responses from their AI applications. There are even marketplaces emerging for prompts, such as the 100 best prompts for ChatGPT.

Perhaps as important for users, prompt engineering is poised to become a vital skill for IT and business professionals, according to Eno Reyes, a machine learning engineer with Hugging Face, a community-driven platform that creates and hosts LLMs. Prompt engineers will be responsible for creating customized LLMs for business use.
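As a loose illustration of what crafting a prompt can look like in practice, the sketch below assembles one from labeled parts. The helper name, the four-part structure, and the sample text are all hypothetical; this is one common pattern, not a method described by the people quoted here.

```python
def build_prompt(role, task, context, output_format):
    """Assemble a structured prompt from labeled parts (a hypothetical helper)."""
    sections = [
        f"You are {role}.",
        f"Task: {task}",
        f"Context:\n{context}",
        f"Respond as: {output_format}",
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    role="a support agent for a retail bank",
    task="Summarize the customer's email in two sentences.",
    context="Customer email: My card was declined twice yesterday...",
    output_format="plain text, no bullet points",
)
print(prompt)
```

Keeping the parts separate makes it easy to change one of them, such as the role or the output format, and compare how the model's responses change.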

How will LLMs become smaller, faster, and cheaper?

Today, chatbots based on LLMs are most commonly used “out of the box” as a text-based, web-chat interface. They’re used in search engines such as Google’s Bard and Microsoft’s Bing (based on ChatGPT) and for automated online customer assistance. Companies can ingest their own datasets to make the chatbots more customized for their particular business, but accuracy can suffer because of the massive trove of data already ingested.

“What we’re discovering more and more is that with small models that you train on more data longer…, they can do what large models used to do,” Thomas Wolf, co-founder and CSO at Hugging Face, said while attending an MIT conference earlier this month. “I think we’re maturing basically in how we understand what’s happening there.

“There’s this first step where you try everything to get this first part of something working, and then you’re in the phase where you’re trying to…be efficient and less costly to run,” Wolf said. “It’s not enough to just scrub the whole web, which is what everyone has been doing. It’s much more important to have quality data.”

LLMs can cost from a couple of million dollars to $10 million to train for specific use cases, depending on their size and purpose.

When LLMs focus their AI and compute power on smaller datasets, however, they perform as well as or better than the enormous LLMs that rely on massive, amorphous data sets. They can also be more accurate in creating the content users seek, and they’re much cheaper to train.

Eric Boyd, corporate vice president of AI Platforms at Microsoft, recently spoke at the MIT EmTech conference and said when his company first began working on AI image models with OpenAI four years ago, performance would plateau as the datasets grew in size. Language models, however, had far more capacity to ingest data without a performance slowdown.

Microsoft, the largest financial backer of OpenAI and ChatGPT, invested in the infrastructure to build larger LLMs. “So, we’re figuring out now how to get similar performance without having to have such a large model,” Boyd said. “Given more data, compute and training time, you are still able to find more performance, but there are also a lot of techniques we’re now learning for how we don’t have to make them quite so large and are able to manage them more efficiently.

“That’s super important because…these things are very expensive. If we want to have broad adoption for them, we’re going to have to figure out the costs of both training them and serving them,” Boyd said.

For example, when a user submits a prompt to GPT-3, it must access all 175 billion of its parameters to deliver an answer. One method for creating smaller LLMs, known as sparse expert models, is expected to reduce the training and computational costs for LLMs, “resulting in massive models with a better accuracy than their dense counterparts,” he said.

Researchers from Meta Platforms (formerly Facebook) believe sparse models can achieve performance similar to that of ChatGPT and other massive LLMs using “a fraction of the compute.”

“For models with relatively modest compute budgets, a sparse model can perform on par with a dense model that requires almost four times as much compute,” Meta said in an October 2022 research paper.
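The idea behind sparse expert models can be sketched in Python: rather than running every part of the model for every input, a router scores a set of "experts" and only the top few are run. Everything here is a simplified assumption for illustration. The experts are trivial functions rather than neural sub-networks, and the router scores are passed in by hand, where a real model would compute them from the input.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Four toy "experts"; in a real model each would be a neural sub-network.
EXPERTS = [
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
    lambda x: x * x,
]

def sparse_forward(x, router_scores, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(EXPERTS)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    output = sum(w * EXPERTS[i](x) for w, i in zip(weights, chosen))
    return output, chosen

output, chosen = sparse_forward(3.0, router_scores=[0.1, 2.0, -1.0, 2.0])
print(output, chosen)
```

With top_k set to 2, only two of the four experts do any work for a given input, which is where the compute savings come from; a dense model would run all of them.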

Smaller models are already being released by companies such as Aleph Alpha, Databricks, Fixie, LightOn, Stability AI, and even OpenAI. The more agile LLMs have between a few billion and 100 billion parameters.

Privacy, security issues still abound

While many users marvel at the remarkable capabilities of LLM-based chatbots, governments and consumers cannot turn a blind eye to the potential privacy issues lurking within, according to Gabriele Kaveckyte, privacy counsel at cybersecurity company Surfshark.

For example, earlier this year, Italy became the first Western nation to ban further development of ChatGPT over privacy concerns. It later reversed that decision, but the initial ban occurred after the natural language processing app experienced a data breach involving user conversations and payment information.

“While some improvements have been made by ChatGPT following Italy’s temporary ban, there is still room for improvement,” Kaveckyte said. “Addressing these potential privacy issues is crucial to ensure the responsible and ethical use of data, fostering trust, and safeguarding user privacy in AI interactions.”

Kaveckyte analyzed ChatGPT’s data collection practices, for instance, and developed a list of potential flaws: it collected a massive amount of personal data to train its models, but may have had no legal basis for doing so; it didn’t notify all of the people whose data was used to train the AI model; it’s not always accurate; and it lacks effective age verification tools to prevent children under 13 from using it.

Along with those issues, other experts are concerned there are more basic problems LLMs have yet to overcome — namely the security of data collected and stored by the AI, intellectual property theft, and data confidentiality.

“For a hospital or a bank to be able to use LLMs, we’re going to have to solve [intellectual property], security, [and] confidentiality issues,” Turing’s Siddharth said. “There are good engineering solutions for some of these. And I think those will get solved, but those need to be solved in order for them to be used in enterprises. Companies don’t want to use an LLM in a context where it uses the company’s data to help deliver better results to a competitor.”

Not surprisingly, a number of nations and government agencies around the globe have launched efforts to deal with AI tools, with China being the most proactive so far. Among those efforts:

China has already rolled out several initiatives for AI governance, though most of those initiatives relate to citizen privacy and not necessarily safety.
The Biden administration in the US unveiled AI rules to address safety and privacy built on previous attempts to promote some form of responsible innovation, though to date Congress has not advanced any laws that would regulate AI. In October 2022, the administration unveiled a blueprint for an “AI Bill of Rights” and an AI Risk Management Framework and more recently pushed for a National AI Research Resource.
The Group of Seven (G7) nations recently called for the creation of technical standards to keep AI in check, saying its evolution has outpaced oversight for safety and security.
And the European Union is putting the finishing touches on legislation that would hold accountable companies that create generative AI platforms like ChatGPT, which can draw the content they generate from unnamed sources.

This story was first published on May 30, 2023 and updated in February 2024.