Understanding attention in large language models

In this test case from the study, the transformer model settles on the relevant part of the image – the frog – over hundreds of training rounds (epochs). Initially, attention is spread randomly across the image, but with training, the model learns to ignore the parts of the image that are not the frog. Image credit: Davoud Ataee Tarzanagh and Xuechen Zhang, SOTA Lab, University of Michigan


Chatbot users are often advised to treat a series of prompts as a conversation, but how does a chatbot know what you are referring back to? A new study reveals the mechanism that transformer models – the kind driving modern chatbots – use to decide what to pay attention to.

“Let’s say you have a very long text, and you ask your chatbot to identify, group, and summarize the main topics. To do that, you have to be able to focus on exactly the right kinds of details,” said Samet Oymak, an assistant professor of electrical and computer engineering at the University of Michigan, who supervised the study to be presented at the Neural Information Processing Systems (NeurIPS) conference on Wednesday, December 13.

“We have shown mathematically for the first time how transformers learn to do this,” he said.

Transformer architectures, first proposed in 2017, have revolutionized natural language processing because they are so good at handling very long strings of text – GPT-4 can take in entire books. Transformers break text into smaller chunks, called tokens, which are processed in parallel while keeping track of the context surrounding each word. The large language model GPT spent years digesting text from the internet before it arrived on the scene as a chatbot knowledgeable enough to pass the bar exam.

The key to transformers is the attention mechanism: it decides what information is most important. What Oymak’s team found is that part of the way transformers do this is decidedly old-fashioned – they essentially use support vector machines (SVMs), which were invented 30 years ago. An SVM draws a boundary so that data falls into one of two categories; for example, SVMs are used to separate positive and negative sentiment in customer reviews. It turns out that transformers do something similar when deciding what to pay attention to and what to ignore.
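To make the comparison concrete, here is a minimal sketch – not code from the study – of a linear SVM drawing that kind of boundary between positive and negative reviews. It assumes scikit-learn is installed, and the tiny review set is made up purely for illustration.

```python
# Minimal sketch (not from the paper): a linear SVM separating positive and
# negative review snippets. The reviews and labels are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

reviews = [
    "great product, works perfectly",
    "absolutely love it, highly recommend",
    "terrible quality, broke after a day",
    "waste of money, very disappointed",
]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative sentiment

# Turn each review into a numeric vector, then fit a max-margin boundary.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)
classifier = LinearSVC()
classifier.fit(X, labels)

print(classifier.predict(vectorizer.transform(["pretty great, would buy again"])))
```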

Although it feels like you are talking to a person, ChatGPT is actually performing multidimensional calculations. Each token becomes a string of numbers called a vector. When you enter a prompt, ChatGPT uses its mathematical attention mechanism to attach weights to each vector – and thus to each word and phrase – to determine what to take into account as it formulates its response. At heart it is a word-prediction algorithm: it predicts the first word that might begin a good response, then the next, and so on until the response is complete.
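As a rough illustration of that weighting step – a simplified sketch, not the model’s actual implementation – the following NumPy snippet computes scaled dot-product attention over a handful of made-up token vectors:

```python
# A minimal sketch of attention weighting. The 4x8 "token vectors" are random
# stand-ins for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # 4 tokens, each an 8-dimensional vector

# Queries, keys and values are linear projections of the token vectors.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Scaled dot-product scores say how strongly each token attends to each other token.
scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

output = weights @ V  # each token becomes a weighted mix of the value vectors
print(weights.round(2))  # each row sums to 1: the "what to pay attention to" weights
```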

Then, when you enter the next prompt, it feels to you like a continuation of the conversation, but ChatGPT reads the entire exchange from the beginning, assigns new weights to each token, and formulates a response based on that fresh evaluation. This is how it gives the impression of remembering something said earlier. It is also why, if you gave it the first hundred lines of Romeo and Juliet and asked it to explain the feud between the Montagues and the Capulets, it could summarize the most relevant interactions.
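A hedged sketch of that re-reading behavior is below; the generate_reply function is a hypothetical stand-in for the model’s word-by-word prediction loop, not an actual ChatGPT API call.

```python
# Sketch: each turn re-reads the whole conversation from the start.
def respond(history, new_prompt, generate_reply):
    history = history + [f"User: {new_prompt}"]
    # The full conversation is re-tokenized and re-weighted on every turn;
    # nothing is "remembered" outside this text.
    full_context = "\n".join(history)
    reply = generate_reply(full_context)
    return history + [f"Assistant: {reply}"], reply

# Example with a dummy generator that just reports how much context it was given.
history = []
history, _ = respond(history, "Summarize act one of Romeo and Juliet.",
                     lambda ctx: f"(reply after reading {len(ctx)} characters)")
history, _ = respond(history, "Why do the families hate each other?",
                     lambda ctx: f"(reply after reading {len(ctx)} characters)")
print("\n".join(history))
```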

This much was already known about how transformer neural networks work. However, transformer architectures are not designed with an explicit rule for what to pay attention to and what to ignore. That is where the SVM-like mechanism comes in.

“We don’t understand what these black box models are doing, and they are going mainstream,” Oymak said. “This is one of the first studies to clearly demonstrate how an attention mechanism can find and retrieve a needle of useful information in a haystack of text.”

The team intends to use this knowledge to make large language models more efficient and easier to interpret, and they expect it will be useful to others working in areas of artificial intelligence where attention matters, such as perception, image processing and audio processing.

A second paper that delves deeper into the topic, “Transformers as Support Vector Machines,” will be presented at the Mathematics of Modern Machine Learning Workshop at NeurIPS 2023.

More information:
Max-Margin Token Selection in Attention Mechanism (2023).

