Microsoft AI launches LLMLingua: a unique coarse-to-fine prompt compression technique that accelerates inference for large language models (LLMs)

https://www.microsoft.com/en-us/research/blog/llmlingua-innovating-llm-efficiency-with-prompt-compression/

Large language models (LLMs), thanks to their strong generalization and reasoning abilities, have contributed greatly to the advancement of the artificial intelligence (AI) community. These models have proven remarkably capable, demonstrating strong performance in Natural Language Processing (NLP), Natural Language Generation (NLG), Computer Vision, and more. However, more recent developments, including In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, have led to the deployment of much longer prompts, sometimes exceeding tens of thousands of tokens. This poses problems for model inference in terms of cost-effectiveness and computational efficiency.

To overcome these challenges, a team of researchers from Microsoft introduced LLMLingua, a unique coarse-to-fine prompt compression technique. LLMLingua has been developed with the primary goal of reducing the expense of processing lengthy prompts and accelerating model inference. To do this, LLMLingua uses the following core strategies:

  1. Budget controller: A dynamic budget controller allocates compression ratios across the different components of the original prompt, such as the instruction, demonstrations, and question. This ensures that the semantic integrity of the prompt is maintained even at large compression ratios.
  2. Token-level iterative compression algorithm: A token-level iterative compression algorithm is integrated into LLMLingua. It enables finer-grained compression by capturing the interdependence between compressed contents while preserving key information; a simplified sketch of this idea follows this list.
  3. Instruction-tuning-based approach: The team proposed an instruction-tuning-based method to deal with the distribution mismatch between language models. Aligning the distributions improves the fit between the small language model used for prompt compression and the target LLM.
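To make the token-level idea concrete, here is a minimal, hypothetical Python sketch of perplexity-based token pruning: a small causal language model scores every token, and the tokens the model predicts most confidently (the least informative ones) are dropped until a target budget is met. This is a simplified illustration under our own assumptions, not LLMLingua's actual algorithm, which compresses iteratively over segments and coordinates with the budget controller.

```python
# Simplified, hypothetical sketch of perplexity-based token pruning.
# NOT the actual LLMLingua algorithm: the real method works iteratively
# over prompt segments and coordinates with a budget controller.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def compress(prompt: str, keep_ratio: float = 0.5) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Per-token negative log-likelihood given its prefix:
    # high NLL = surprising = informative, so keep those tokens.
    nll = torch.nn.functional.cross_entropy(
        logits[0, :-1], ids[0, 1:], reduction="none"
    )
    n_keep = max(1, int(nll.numel() * keep_ratio))
    keep = set(torch.topk(nll, n_keep).indices.tolist())
    # Keep the first token, then the most informative ones, in order.
    kept_ids = [ids[0, 0].item()] + [
        ids[0, i + 1].item() for i in range(nll.numel()) if i in keep
    ]
    return tokenizer.decode(kept_ids)

print(compress("Step 1: add the two numbers. Step 2: report the sum.", 0.6))
```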

The team conducted analysis and experiments on four datasets covering different scenarios to verify the usefulness of LLMLingua: GSM8K and BBH for reasoning, ShareGPT for conversation, and Arxiv-March23 for summarization. The results showed that the proposed approach achieves state-of-the-art performance in each of these scenarios, and that LLMLingua supports compression of up to 20x while sacrificing very little performance.

The small language model used in the experiments was LLaMA-7B, and the closed-source LLM was GPT-3.5-Turbo-0301. LLMLingua outperforms previous compression techniques by preserving reasoning, summarization, and dialogue capabilities even at a maximum compression ratio of 20x, demonstrating its flexibility, economy, effectiveness, and recoverability.
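For readers who want to try the released library, the sketch below follows the usage shown on the project's GitHub page. Exact parameter names, return keys, and defaults may differ between versions, so treat this as a hedged example rather than a definitive reference.

```python
# Hedged usage sketch based on the LLMLingua GitHub README;
# parameter names and defaults may vary by package version.
from llmlingua import PromptCompressor

# By default the compressor loads a LLaMA-family small model to
# score tokens; a GPU is recommended for reasonable speed.
compressor = PromptCompressor()

result = compressor.compress_prompt(
    context=["Demonstration 1 (step-by-step)...", "Demonstration 2..."],
    instruction="Answer the math question below.",
    question="What is 37 * 12?",
    target_token=200,  # desired compressed token budget
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```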

The effectiveness of LLMLingua has been observed across a range of closed-source LLMs and small language models. When using GPT-2-small as the compressor, LLMLingua achieved strong performance results, nearly matching those obtained with larger models. It has also proven successful with powerful LLMs, with compressed prompts sometimes exceeding the results of the original prompts.

LLMLingua’s recoverability is one noteworthy aspect: when GPT-4 was used to restore compressed prompts, it effectively recovered important reasoning information from the full nine-step CoT prompt while preserving the meaning and similarity of the original. This ensures that key information is retained and can be recovered even after compression, which adds to LLMLingua’s overall appeal.
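As a hedged illustration of this recovery step, one could ask a strong LLM to expand a compressed prompt back into fluent text. The instruction wording, example prompt, and model choice below are assumptions for illustration, not the exact setup used in the paper.

```python
# Hypothetical sketch of prompt recovery with a strong LLM.
# The instruction text and model choice are assumptions, not the
# exact configuration reported in the LLMLingua paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

compressed = "Sam: 12 apples, gives 5, buys 3x2. total?"
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Reconstruct the following compressed prompt into "
                   "full, fluent text, keeping all details:\n\n" + compressed,
    }],
)
print(response.choices[0].message.content)
```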

In conclusion, LLMLingua provides a comprehensive solution to the difficulties posed by long prompts in LLM applications. The method demonstrates excellent performance and offers a practical way to improve the effectiveness and affordability of LLM-based applications.


Check out the Paper, GitHub, and Blog. All credit for this research goes to the researchers on this project. Also, don’t forget to join our 33k+ ML SubReddit, 41k+ Facebook community, Discord channel, and Email newsletter, where we share the latest AI research news, cool AI projects, and more.

If you like our work, you’ll love our newsletter.

Tanya Malhotra is a final-year undergraduate at the University of Petroleum and Energy Studies, Dehradun, pursuing a B.Tech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with strong analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading teams, and managing work in an organized manner.



