Chatbot users are often advised to treat a series of prompts like a conversation, but how does a chatbot know what you're referring back to? A new study reveals the mechanism that transformer models, like the ones driving modern chatbots, use to decide what to pay attention to.
"Let's say you have a very long text, and you ask your chatbot to identify, group, and summarize the main topics. To do that, you have to be able to focus on exactly the right kinds of details," said Samet Oymak, an assistant professor of electrical and computer engineering at the University of Michigan, who supervised the study. It will be presented at the Conference on Neural Information Processing Systems (NeurIPS) on Wednesday, December 13.
“We have shown mathematically for the first time how transformers learn to do this,” he said.
Transformer architectures, first proposed in 2017, have revolutionized natural language processing because they are so good at consuming very long strings of text that GPT-4 can handle entire books. Transformers break text into smaller chunks, called tokens, which are processed in parallel while keeping track of the context surrounding each word. The large language model GPT spent years digesting text from the internet before ChatGPT came on the scene as a chatbot knowledgeable enough to pass the bar exam.
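To make tokenization concrete, here is a deliberately simplified sketch that splits text into word-level tokens and maps each to an integer ID. Real models use learned subword vocabularies (such as byte-pair encoding), so this toy tokenizer is a stand-in for illustration, not the actual GPT tokenizer.

```python
# Toy word-level tokenizer: a simplified stand-in for the learned
# subword tokenizers real transformer models use.
def tokenize(text):
    words = text.lower().split()
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    return [vocab[w] for w in words], vocab

ids, vocab = tokenize("the cat sat on the mat")
print(ids)    # [4, 0, 3, 2, 4, 1] -- "the" maps to the same ID both times
print(vocab)  # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
```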
The key to transformers is the attention mechanism: it decides what information is most important. What Oymak's team found is that part of the way transformers do this is distinctly old-fashioned: they essentially use support vector machines, a technique invented 30 years ago. An SVM draws a boundary so that data falls into one of two categories; SVMs are used, for example, to separate positive from negative sentiment in customer reviews. It turns out that transformers do something similar when deciding what to pay attention to and what to ignore.
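For readers unfamiliar with SVMs, here is a minimal sentiment example using scikit-learn. The tiny "review" dataset is invented purely for illustration; the point is that the classifier learns a maximum-margin boundary separating the two classes, the same kind of separating rule the study identifies inside attention.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy reviews, for illustration only.
reviews = ["great product, loved it", "works perfectly, very happy",
           "terrible quality, broke fast", "awful, do not buy"]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(reviews)
clf = LinearSVC().fit(X, labels)  # learns a max-margin separating hyperplane

# New text lands on one side of the boundary or the other.
print(clf.predict(vec.transform(["loved it, great quality"])))  # likely [1]
```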
Although it looks like you are talking to a person, ChatGPT is actually performing multidimensional calculations. Each token of text becomes a string of numbers called a vector. Each time you enter a prompt, ChatGPT uses its mathematical attention mechanism to attach weights to each vector, and thus to each word and group of words, to determine what to focus on as it formulates its response. At heart it is a word-prediction algorithm: it predicts the first word of a good response, then the next, until the response is complete.
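A minimal sketch of that attention computation, in NumPy: each token vector is scored against every other token, the scores are normalized into weights with a softmax, and the output is a weighted mix of the token vectors. This is the textbook scaled dot-product formula, not the exact code running inside ChatGPT.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # relevance of each token to each query
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V, weights    # output = weighted mix of token vectors

# Three tokens, each a 4-dimensional vector (random stand-ins).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = attention(X, X, X)  # self-attention: tokens attend to each other
print(w.round(2))            # each row sums to 1: one token's attention weights
```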
Then, when you enter the next prompt, it feels to you like a continuation of the conversation, but ChatGPT actually reads the entire conversation from the beginning, assigns new weights to each token, and formulates a response based on this fresh evaluation. This is how it gives the impression of remembering what was said earlier. It is also why, if you gave it the first hundred lines of Romeo and Juliet and asked it to explain the feud between the Montagues and the Capulets, it could summarize the most relevant interactions.
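In code terms, that "memory" is just re-submission of the transcript. A hypothetical sketch of the pattern, where `generate` stands in for a real model call:

```python
# Each turn, the full transcript is re-fed to the model, which re-weights
# every token from scratch. `generate` is a hypothetical stand-in for a
# real model API call.
history = []

def chat(user_prompt, generate):
    history.append(f"User: {user_prompt}")
    full_context = "\n".join(history)  # the model re-reads everything so far
    reply = generate(full_context)     # fresh attention weights each turn
    history.append(f"Assistant: {reply}")
    return reply
```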
This much was already known about how transformer neural networks work. However, transformer architectures are not designed with an explicit rule for what to pay attention to and what to ignore. That is what the SVM-like mechanism turns out to provide.
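The papers formalize this as a maximum-margin token-selection problem: as attention scores grow during training, the softmax sharpens toward hard selection of the highest-scoring tokens, much as an SVM separates data with a hard boundary. The toy numbers below illustrate that sharpening; they are a rough intuition aid, not the paper's proof.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])  # toy relevance scores for three tokens

# Scaling up the scores drives the softmax toward a hard, SVM-like
# selection of the top-scoring token -- the separating-boundary behavior.
for scale in [1, 5, 25]:
    print(scale, softmax(scale * scores).round(3))
# scale=1  -> soft, spread-out weights
# scale=25 -> weight ~1.0 on the first token, ~0 on the rest
```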
“We don’t understand what these black box models are doing, and they are going mainstream,” Oymak said. “This is one of the first studies to clearly demonstrate how an attention mechanism can find and retrieve a needle of useful information in a haystack of text.”
The team intends to use this knowledge to make large language models more efficient and easier to interpret, and they expect it will be useful to others working in areas of artificial intelligence where attention is important, such as perception, image processing, and audio processing.
The study was selected as a spotlight poster at NeurIPS 2023: "Max-Margin Token Selection in Attention Mechanism."
A second paper, which goes deeper into the topic, will be presented at the Mathematics of Modern Machine Learning workshop at NeurIPS 2023: "Transformers as Support Vector Machines."