[LLM in Practice] Prompt Compression with LangChain and LLMLingua

In recent years, large language models (LLMs) such as ChatGPT have been changing how we live and work. By learning from enormous amounts of text data, these models can hold natural conversations and generate fluent prose. Using them, however, comes with persistent challenges around computational cost and processing time.
In particular, the longer the prompt (the instructions given to the LLM), the slower and more expensive each request tends to become, simply because the model needs more compute to process it.
What if we could make the prompt shorter, speeding up the LLM and cutting its cost?
That question gave rise to the technique known as prompt compression, and one of the frameworks attracting the most attention in this area is LLMLingua.
In this article, we walk through compressing RAG search results with the LLMLingua Document Compressor, a LangChain module. For background on prompt compression itself, see our article "Prompt Compression Techniques: Comparison and Recent Trends."
What Is LLMLingua?
LLMLingua compresses prompts with a coarse-to-fine algorithm built from the following three modules.
1. Budget Controller
Evaluates the importance of each component of the prompt (instructions, demonstrations, the question, and so on) and assigns each component its own compression ratio. This compresses the prompt as a whole while preserving the most important information.
2. Iterative Token-level Compression
Splits the prompt into fine-grained segments and compresses them token by token over repeated passes, so the compressed prompt gets even shorter without losing the original meaning (a toy illustration of this idea follows the list below).
3. Alignment
Aligns the output distribution of the small compression model with that of the target LLM, so that the compressed prompt preserves the meaning of the original and the LLM can still generate accurate answers from it.
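To build some intuition for the token-level step, here is a single-pass toy sketch, not the actual LLMLingua implementation: score each token of a passage by how surprising it is to a small causal LM such as GPT-2, then keep only the higher-information tokens. LLMLingua itself works iteratively over segments and conditions on the already compressed prefix, but the underlying signal is similar.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Toy illustration only: prune low-information tokens using GPT-2 per-token surprisal
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

text = "Fine-tuning has grown increasingly popular as a method of enhancing LLMs."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Negative log-likelihood of each token given its prefix (higher = more informative)
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_nll = -log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]

# Keep roughly the more "surprising" half of the tokens (always keep the first token)
keep_mask = token_nll > token_nll.median()
kept_ids = torch.cat([ids[0, :1], ids[0, 1:][keep_mask]])
print(tok.decode(kept_ids))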
Benefits of LLMLingua
- Lower compute cost: compressing the prompt shortens LLM processing time and reduces cost.
- More efficient LLM usage: shorter prompts make inference more efficient, so more tasks can be handled with the same resources.
- High-accuracy compression: LLMLingua's algorithm can compress prompts by up to 20x while keeping the drop in downstream performance minimal.
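Before turning to the LangChain integration, here is a minimal sketch of how LLMLingua is used on its own through the llmlingua package's PromptCompressor class. This follows the API documented in the 0.2.x README; parameter names such as target_token should be checked against your installed version.

from llmlingua import PromptCompressor

# Use GPT-2 as the small compression model (the package defaults to a Llama-2-7B checkpoint)
llm_lingua = PromptCompressor(model_name="openai-community/gpt2", device_map="cpu")

context = [
    "QLoRA, developed in 2023, is a method for more efficient model fine-tuning. "
    "It dramatically reduces memory usage while maintaining 16-bit fine-tuning performance."
]
result = llm_lingua.compress_prompt(
    context,
    instruction="Answer the question based on the given context.",
    question="What role does QLoRA play in large language models?",
    target_token=60,  # rough token budget for the compressed prompt
)
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])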

What Is the LLMLingua Document Compressor?
The LLMLingua Document Compressor module uses a compact, capable language model such as GPT-2 small or LLaMA-7B to trim the unnecessary parts of a prompt, enabling efficient inference. It can shrink a prompt to as little as 1/20 of its original length while keeping the performance loss minimal.
Benefits of the LLMLingua Document Compressor
- High compression ratio: up to 20x compression, which can translate into substantial cost savings.
- Little loss in quality: the quality of the text the LLM generates barely changes even with compressed input.
- Flexibility: it can be combined with a wide range of LLMs.
Use cases for the LLMLingua Document Compressor
- Information retrieval: extract only the necessary information from large document collections.
- Question answering: compress long documents so answers come back faster.
- Document summarization: drop non-essential information when summarizing long documents to produce concise summaries.
Trying Out the LLMLingua Document Compressor
First, install the following packages with pip.
# Install LangChain
$ pip install langchain-community==0.3.14 langchain-core==0.3.29 langchain-openai==0.2.14
# Install LLMLingua and the packages needed for RAG
$ pip install accelerate==1.2.1 faiss-cpu==1.9.0.post1 llmlingua==0.2.2 pypdf==5.1.0
To try LLMLingua, first set your OpenAI API key. We also define a helper function that pretty-prints retrieved RAG documents.
import os

# Set the OpenAI API key
os.environ["OPENAI_API_KEY"] = "<YOUR-OPENAI-KEY>"

# Helper that pretty-prints retrieved RAG documents
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )
Next, prepare the PDF document used for RAG. Here we use the "Artificial Intelligence Index Report 2024," which is available on arXiv.
import requests

# Download the report PDF if it is not already present
if not os.path.exists("2405.19522.pdf"):
    response = requests.get("https://arxiv.org/pdf/2405.19522")
    with open("2405.19522.pdf", mode="wb") as f:
        f.write(response.content)
Next, split the PDF into chunks, the unit of input for RAG.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

documents = PyPDFLoader("2405.19522.pdf").load()
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=100,
)
texts = text_splitter.split_documents(documents)
display(texts[100])
Document(metadata={'source': '2405.19522.pdf', 'page': 99}, page_content='100\nArtificial Intelligence\nIndex Report 2024Chapter 2 PreviewTable of Contents\n2021 2022\n0.00\n0.10\n0.20\n0.30\n0.40\n0.50\n0.60Editing accuracy\n0.11, Position replacement\n0.25, Positional addition\n0.33, Average\n0.34, Alter parts\n0.47, Object addition\n0.52, Size\n0.59, Object replacement\nEditVal automatic evaluation: editing accuracy\nSource: EditVal Leaderboard, 2024 | Chart: 2024 AI Index report\nEditing\nImage editing involves using AI to modify \nimages based on text prompts. This AI-\nassisted approach has broad real-world \napplications in fields such as engineering, \nindustrial design, and filmmaking.\nEditVal \nDespite the promise of text-guided image \nediting, few robust methods can evaluate \nhow accurately AI image editors adhere to \nediting prompts. EditVal, a new benchmark \nfor assessing text-guided image editing, \nincludes over 13 edit types, such as adding \nobjects or changing their positions, \nacross 19 object classes (Figure 2.4.10). \nThe benchmark was applied to evaluate \neight leading text-guided image editing \nmethods including SINE and Null-text. \nPerformance improvements since 2021 on \na variety of the benchmark’s editing tasks, \nare shown in Figure 2.4.11. \nChapter 2: Technical PerformanceArtificial Intelligence\nIndex Report 2024\nFigure 2.4.10\nFigure 2.4.11\nA sample VisIT-Bench instruction set\nSource: Bitton et al., 2023\n2.4 Image Computer Vision and Image Generation')
With the data ready, build a FAISS index for RAG retrieval and inspect the search results. These retrieved documents are what we will compress with LLMLingua.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
retriever = FAISS.from_documents(texts, embedding).as_retriever(search_kwargs={"k": 20})

query = "What role does QLoRA play in large language model?"
docs = retriever.invoke(query)
pretty_print_docs(docs)
The output of pretty_print_docs is shown below.
Document 1:
151
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.12 Techniques for LLM Improvement
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Fine-Tuning
Fine-tuning has grown increasingly popular as a
method of enhancing LLMs and involves further
training or adjusting models on smaller datasets.
Fine-tuning not only boosts overall model
performance but also sharpens the model’s
capabilities on specific tasks. It also allows for more
precise control over the model’s behavior.
879 902 916
966 974 992 1,022
1,348
Guanaco 7B Bard Guanaco 13B ChatGPT Vicuna 13B Guanaco 33B Guanaco 65B GPT-4
0
200
400
600
800
1,000
1,200
1,400
Elo rating (mean)
Model competitions based on 10,000 simulations using GPT-4 and the Vicuna benchmark
Source: Dettmers et al., 2023 | Chart: 2024 AI Index report
Figure 2.12.5
QLoRA
Highlighted Research:
QLoRA, developed by researchers from the
University of Washington in 2023, is a new method
for more efficient model fine-tuning. It dramatically
reduces memory usage, enabling the fine-tuning
of a 65 billion parameter model on a single 48
GB GPU while maintaining full 16-bit fine-tuning
performance. To put this in perspective, fine-tuning
a 65B Llama model, a leading open-source LLM,
typically requires about 780 GB of GPU memory.
Therefore, QLoRA is nearly 16 times more efficient.
QLoRA manages to increase efficiency with
techniques like a 4-bit NormalFloat (NF4), double
quantization, and page optimizers. QLoRA is
used to train a model named Guanaco, which
matched or even surpassed models like ChatGPT
in performance on the Vicuna benchmark (a
benchmark that ranks the outputs of LLMs) (Figure
2.12.5). Remarkably, the Guanaco models were
created with just 24 hours of fine-tuning on a single
GPU. QLoRa highlights how methods for optimizing
and further improving models have become more
efficient, meaning fewer resources will be required
to make increasingly capable models.
----------------------------------------------------------------------------------------------------
Document 2:
B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., … Koreeda, Y. (2023).
Holistic Evaluation of Language Models (arXiv:2211.09110). arXiv. https:/ /doi.org/10.48550/arXiv.2211.09110.
Lin, S., Hilton, J. & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods (arXiv:2109.07958). arXiv.
https:/ /doi.org/10.48550/arXiv.2109.07958.
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang,
C., Shen, S., Zhang, T., Su, Y., Sun, H., … Tang, J. (2023). AgentBench: Evaluating LLMs as Agents (arXiv:2308.03688). arXiv.
https:/ /doi.org/10.48550/arXiv.2308.03688.
Luccioni, A. S., Jernite, Y. & Strubell, E. (2023). Power Hungry Processing: Watts Driving the Cost of AI Deployment?
(arXiv:2311.16863). arXiv. http:/ /arxiv.org/abs/2311.16863.
Luo, J., Paduraru, C., Voicu, O., Chervonyi, Y., Munns, S., Li, J., Qian, C., Dutta, P., Davis, J. Q., Wu, N., Yang, X., Chang, C.-
M., Li, T., Rose, R., Fan, M., Nakhost, H., Liu, T., Kirkman, B., Altamura, F., … Mankowitz, D. J. (2022). Controlling Commercial
Cooling Systems Using Reinforcement Learning (arXiv:2211.07357). arXiv. https:/ /doi.org/10.48550/arXiv.2211.07357.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. (2011). “Learning Word Vectors for Sentiment Analysis.” In D.
Lin, Y. Matsumoto & R. Mihalcea, eds., Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies: 142–50. Association for Computational Linguistics. https:/ /aclanthology.org/P11-1015.
Melas-Kyriazi, L., Rupprecht, C., Laina, I. & Vedaldi, A. (2023). RealFusion: 360° Reconstruction of Any Object From a Single
Image (arXiv:2302.10663). arXiv. http:/ /arxiv.org/abs/2302.10663.
Mihaylov, T., Clark, P., Khot, T. & Sabharwal, A. (2018). “Can a Suit of Armor Conduct Electricity? A New Dataset for
Open Book Question Answering.” In E. Riloff, D. Chiang, J. Hockenmaier & J. Tsujii, eds., Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing: 2381–91. Association for Computational Linguistics.
https:/ /doi.org/10.18653/v1/D18-1260.
----------------------------------------------------------------------------------------------------
Document 3:
Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023).
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (arXiv:2308.11462).
arXiv. http:/ /arxiv.org/abs/2308.11462.
Haque, A., Tancik, M., Efros, A. A., Holynski, A. & Kanazawa, A. (2023). Instruct-NeRF2NeRF: Editing 3D Scenes With
Instructions (arXiv:2303.12789). arXiv. http:/ /arxiv.org/abs/2303.12789.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. (2021). Measuring Massive Multitask
Language Understanding (arXiv:2009.03300). arXiv. http:/ /arxiv.org/abs/2009.03300.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D. & Steinhardt, J. (2021). Measuring Mathematical
Problem Solving With the MATH Dataset (arXiv:2103.03874). arXiv. http:/ /arxiv.org/abs/2103.03874.
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M. & Leskovec, J. (2021). Open Graph Benchmark:
Datasets for Machine Learning on Graphs (arXiv:2005.00687). arXiv. https:/ /doi.org/10.48550/arXiv.2005.00687.
Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X. & Zhou, D. (2024). Large Language Models Cannot
Self-Correct Reasoning Yet (arXiv:2310.01798). arXiv. http:/ /arxiv.org/abs/2310.01798.
Huang, Q., Vora, J., Liang, P. & Leskovec, J. (2023). Benchmarking Large Language Models as AI Research Agents
(arXiv:2310.03302). arXiv. http:/ /arxiv.org/abs/2310.03302.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. & Narasimhan, K. (2023). SWE-bench: Can Language Models
Resolve Real-World GitHub Issues? (arXiv:2310.06770). arXiv. https:/ /doi.org/10.48550/arXiv.2310.06770.
Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. (2020). What Disease Does This Patient Have?
A Large-Scale Open Domain Question Answering Dataset From Medical Exams (arXiv:2009.13081). arXiv.
http:/ /arxiv.org/abs/2009.13081.
Kıcıman, E., Ness, R., Sharma, A. & Tan, C. (2023). Causal Reasoning and Large Language Models: Opening a New Frontier
for Causality (arXiv:2305.00050). arXiv. http:/ /arxiv.org/abs/2305.00050.
----------------------------------------------------------------------------------------------------
Document 4:
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. (2023). GPQA: A Graduate-
Level Google-Proof Q&A Benchmark (arXiv:2311.12022). arXiv. http:/ /arxiv.org/abs/2311.12022.
Rustia, D. J. A., Chiu, L.-Y., Lu, C.-Y., Wu, Y.-F., Chen, S.-K., Chung, J.-Y., Hsu, J.-C. & Lin, T.-T. (2022). “Towards Intelligent and
Integrated Pest Management Through an AIoT-Based Monitoring System.” Pest Management Science 78, no. 10: 4288–4302.
https:/ /doi.org/10.1002/ps.7048.
Schaeffer, R., Miranda, B. & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? (arXiv:2304.15004).
arXiv. http:/ /arxiv.org/abs/2304.15004.
Schneider, F., Kamal, O., Jin, Z. & Schölkopf, B. (2023). Moûusai: Text-to-Music Generation With Long-Context Latent Diffusion
(arXiv:2301.11757). arXiv. https:/ /doi.org/10.48550/arXiv.2301.11757.
Shams, S. R., Jahani, A., Kalantary, S., Moeinaddini, M. & Khorasani, N. (2021). “Artificial Intelligence Accuracy Assessment in NO2
Concentration Forecasting of Metropolises Air.” Scientific Reports 11, no. 1: 1805. https:/ /doi.org/10.1038/s41598-021-81455-6.
Shi, Y., Wang, P., Ye, J., Long, M., Li, K. & Yang, X. (2024). MVDream: Multi-View Diffusion for 3D Generation (arXiv:2308.16512).
arXiv. http:/ /arxiv.org/abs/2308.16512.
Soomro, K., Zamir, A. R. & Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild
(arXiv:1212.0402; Version 1). arXiv. http:/ /arxiv.org/abs/1212.0402.
Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., Finn, C. &
Hausman, K. (2023). Open-World Object Manipulation Using Pre-trained Vision-Language Models (arXiv:2303.00905). arXiv.
https:/ /doi.org/10.48550/arXiv.2303.00905.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D.,
Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., … Scialom, T. (2023). Llama 2: Open
Foundation and Fine-Tuned Chat Models (arXiv:2307.09288). arXiv. https:/ /doi.org/10.48550/arXiv.2307.09288.
----------------------------------------------------------------------------------------------------
Document 5:
Solving With Large Language Models (arXiv:2305.10601). arXiv. http:/ /arxiv.org/abs/2305.10601.
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. (2019). From Recognition to Cognition: Visual Commonsense Reasoning
(arXiv:1811.10830). arXiv. http:/ /arxiv.org/abs/1811.10830.
Zhang, L., Rao, A. & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (arXiv:2302.05543).
arXiv. http:/ /arxiv.org/abs/2302.05543.
Zhang, Z., Han, L., Ghosh, A., Metaxas, D. & Ren, J. (2022). SINE: SINgle Image Editing With Text-to-Image Diffusion Models
(arXiv:2212.04489). arXiv. https:/ /doi.org/10.48550/arXiv.2212.04489.
----------------------------------------------------------------------------------------------------
Document 6:
Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Clark, J. (2022).
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (arXiv:2209.07858).
arXiv. http:/ /arxiv.org/abs/2209.07858.
Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration
in Language Models (arXiv:2009.11462). arXiv. https:/ /doi.org/10.48550/arXiv.2009.11462.
Grinbaum, A. & Adomaitis, L. (2024). “Dual Use Concerns of Generative AI and Large Language Models.”
Journal of Responsible Innovation 11, no. 1. https:/ /doi.org/10.1080/23299460.2024.2304381.
Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D. & Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated
Dataset for Adversarial and Implicit Hate Speech Detection (arXiv:2203.09509v4). arXiv. http:/ /arxiv.org/abs/2203.09509.
Ippolito, D., Tramèr, F., Nasr, M., Zhang, C., Jagielski, M., Lee, K., Choquette-Choo, C. A. & Carlini, N. (2023).
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy (arXiv:2210.17546v3). arXiv.
https:/ /doi.org/10.48550/arXiv.2210.17546.
Janssen, M., Brous, P., Estevez, E., Barbosa, L. S. & Janowski, T. (2020). “Data Governance: Organizing Data for Trustworthy
Artificial Intelligence.” Government Information Quarterly 37, no. 3: 101493. https:/ /doi.org/10.1016/j.giq.2020.101493.
Li, B., Sun, J. & Poskitt, C. M. (2023). How Generalizable Are Deepfake Detectors? An Empirical Study (arXiv:2308.04177). arXiv.
http:/ /arxiv.org/abs/2308.04177.
Chapter 3: Responsible AI
Appendix
----------------------------------------------------------------------------------------------------
Document 7:
201
Artificial Intelligence
Index Report 2024Chapter 3 PreviewTable of Contents
Measuring Subjective Opinions in LLMs
Research from Anthropic suggests that large language
models do not equally represent global opinions
on a variety of topics such as politics, religion,
and technology. In this study, researchers built a
GlobalOpinionQA dataset to capture cross-country
opinions on various issues (Figure 3.5.7). They then
generated a similarity metric to compare people’s
answers in various countries with those outputted
by LLMs. Using a four-point Likert scale, LLMs were
asked to rate their agreement with statements from the
World Values Survey (WVS) and Pew Research Center’s
Global Attitudes (GAS) surveys, including questions
like, “When jobs are scarce, employers should give
priority to people of this country over immigrants,” or
“On the whole, men make better business executives
than women do.”
The experiments indicate that the models’ responses
closely align with those from individuals in Western
countries (Figure 3.5.8). The authors point out a
notable lack of diversity in opinion representation,
especially from non-Western nations among the
shared responses. While it is challenging for models
to precisely match the highly diverse distributions
of global opinions—given the inherent variation in
perspectives—it is still valuable to understand which
opinions a model is likely to share. Recognizing
the biases inherent in models can highlight their
limitations and facilitate adjustments that improve
regional applicability.
Chapter 3: Responsible AIArtificial Intelligence
Index Report 2024 3.5 Fairness
Figure 3.5.7
GlobalOpinionQA Dataset
Source: Durmus et al., 2023
----------------------------------------------------------------------------------------------------
Document 8:
Appendix 471Table of Contents
Artificial Intelligence
Index Report 2024
Artificial Intelligence
Index Report 2024 Chapter 2: Technical Performance
Appendix
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S. & Kambhampati, S. (2023). PlanBench: An Extensible Benchmark
for Evaluating Large Language Models on Planning and Reasoning About Change. Thirty-Seventh Conference on Neural
Information Processing Systems Datasets and Benchmarks Track. https:/ /openreview.net/forum?id=YXogl4uQUO.
Voynov, O., Bobrovskikh, G., Karpyshev, P., Galochkin, S., Ardelean, A.-T., Bozhcnko, A., Karmanova, E., Kopanev, P., Labutin-
Rymsho, Y., Rakhimov, R., Safin, A., Serpiva, V., Artemov, A., Burnaev, E., Tsetserukou, D. & Zorin, D. (2023). Multi-sensor Large-
Scale Dataset for Multi-view 3D Reconstruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 21392–403. https:/ /doi.org/10.1109/CVPR52729.2023.02049.
Walker, C. M. & Gopnik, A. (2014). “Toddlers Infer Higher-Order Relational Principles in Causal Learning.” Psychological Science
25, no. 1: 161–69.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L. & Anandkumar, A. (2023). Voyager: An Open-Ended
Embodied Agent With Large Language Models (arXiv:2305.16291). arXiv. http:/ /arxiv.org/abs/2305.16291.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. & Zhou, D. (2023). Chain-of-Thought Prompting
Elicits Reasoning in Large Language Models (arXiv:2201.11903). arXiv. https:/ /doi.org/10.48550/arXiv.2201.11903.
Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S. & Tompson, J. (2023). Robotic Skill Acquisition
via Instruction Augmentation With Vision-Language Models (arXiv:2211.11736). arXiv. https:/ /doi.org/10.48550/arXiv.2211.11736.
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D. & Chen, X. (2023). Large Language Models as Optimizers
(arXiv:2309.03409). arXiv. http:/ /arxiv.org/abs/2309.03409.
Yang, D., Tian, J., Tan, X., Huang, R., Liu, S., Chang, X., Shi, J., Zhao, S., Bian, J., Wu, X., Zhao, Z., Watanabe, S. & Meng, H.
(2023). UniAudio: An Audio Foundation Model Toward Universal Audio Generation (arXiv:2310.00704). arXiv.
http:/ /arxiv.org/abs/2310.00704.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y. & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem
Solving With Large Language Models (arXiv:2305.10601). arXiv. http:/ /arxiv.org/abs/2305.10601.
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. (2019). From Recognition to Cognition: Visual Commonsense Reasoning
----------------------------------------------------------------------------------------------------
Document 9:
145
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.11 Properties of LLMs
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
LLMs Are Poor Self-Correctors
Highlighted Research:
It is generally understood that LLMs like GPT-4
have reasoning limitations and can sometimes
produce hallucinations. One proposed solution
to such issues is self-correction, whereby LLMs
identify and correct their own reasoning flaws. As
AI’s societal role grows, the concept of intrinsic
self-correction—allowing LLMs to autonomously
correct their reasoning without external guidance—
is especially appealing. However, it is currently not
well understood whether LLMs are in fact capable
of this kind of self-correction.
Researchers from DeepMind and the University
of Illinois at Urbana–Champaign tested GPT-4’s
performance on three reasoning benchmarks:
GSM8K (grade-school math), CommonSenseQA
(common-sense reasoning), and HotpotQA
(multidocument reasoning). They found that when
the model was left to decide on self-correction
without guidance, its performance declined across
all tested benchmarks (Figure 2.11.3).
95.50%
82.00%
49.00%
91.50%
79.50%
49.00%
89.00%
80.00%
43.00%
GSM8K CommonSenseQA HotpotQA
0%
20%
40%
60%
80%
100%
Standard prompting: 1 call Self-correct (round 1): 3 calls Self-correct (round 2): 5 calls
Accuracy (%)
GPT-4 on reasoning benchmarks with intrinsic self-correction
Source: Huang et al., 2023 | Chart: 2024 AI Index report
Figure 2.11.3
----------------------------------------------------------------------------------------------------
Document 10:
150
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.12 Techniques for LLM Improvement
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Optimization by PROmpting (OPRO)
Highlighted Research:
A paper from DeepMind has introduced
Optimization by PROmpting (OPRO), a method
that uses LLMs to iteratively generate prompts
to improve algorithmic performance. OPRO uses
natural language to guide LLMs in creating new
prompts based on problem descriptions and
previous solutions (Figure 2.12.3). The generated
prompts aim to enhance the performance of AI
systems on particular benchmarks. Compared to
other prompting approaches like “let’s think step
by step” or an empty starting point, ORPO leads
to significantly greater accuracy on virtually all 23
BIG-bench Hard tasks (Figure 2.12.4).
boolean_expressions
causal_judgement
date_understanding
disambiguation_qa
dyck_languages
formal_fallacies
geometric_shapes
hyperbaton
logical_deduction_seven_objects
movie_recommendation
multistep_arithmetic_two
navigate
object_counting
penguins_in_a_table
reasoning_about_colored_objects
ruin_names
salient_translation_error_detection
snarks
sports_understanding
temporal_sequences
tracking_shu�ed_objects_seven_objects
web_of_lies
word_sorting
−20
0
20
40
boolean_expressions
causal_judgement
date_understanding
disambiguation_qa
dyck_languages
formal_fallacies
geometric_shapes
hyperbaton
logical_deduction_seven_objects
movie_recommendation
multistep_arithmetic_two
navigate
object_counting
penguins_in_a_table
reasoning_about_colored_objects
ruin_names
salient_translation_error_detection
snarks
sports_understanding
temporal_sequences
tracking_shu�ed_objects_seven_objects
web_of_lies
word_sorting
0
10
20
30
40
50
60
Accuracy di�erence
“Let’s think step by step” instruction Empty instruction
Accuracy dierence on 23 BIG-bench Hard (BBH) tasks using PaLM 2-L scorer
Source: Yang et al., 2023 | Chart: 2024 AI Index report
Task
Figure 2.12.3
Figure 2.12.4
Sample OPRO
prompts and
optimization
progress
Source: Yang et al., 2023
----------------------------------------------------------------------------------------------------
Document 11:
Appendix 470Table of Contents
Artificial Intelligence
Index Report 2024
Artificial Intelligence
Index Report 2024 Chapter 2: Technical Performance
Appendix
Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D. & Zeng, A. (2023). Large Language
Models as General Pattern Machines (arXiv:2307.04721). arXiv. https:/ /doi.org/10.48550/arXiv.2307.04721.
Mitchell, M., Palmarini, A. B. & Moskvichev, A. (2023). Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning
Tasks (arXiv:2311.09247). arXiv. http:/ /arxiv.org/abs/2311.09247.
Mokady, R., Hertz, A., Aberman, K., Pritch, Y. & Cohen-Or, D. (2022). Null-Text Inversion for Editing Real Images Using Guided
Diffusion Models (arXiv:2211.09794). arXiv. https:/ /doi.org/10.48550/arXiv.2211.09794.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J. & Schölkopf, B. (2016). “Distinguishing Cause From Effect Using
Observational Data: Methods and Benchmarks.” The Journal of Machine Learning Research 17, no. 1: 1103–1204.
Nie, A., Zhang, Y., Amdekar, A., Piech, C., Hashimoto, T. & Gerstenberg, T. (2023). MoCa: Measuring Human-Language Model
Alignment on Causal and Moral Judgment Tasks (arXiv:2310.19677). arXiv. http:/ /arxiv.org/abs/2310.19677.
Olabi, A. G., Abdelghafar, A. A., Maghrabie, H. M., Sayed, E. T., Rezk, H., Radi, M. A., Obaideen, K. & Abdelkareem, M. A.
(2023). “Application of Artificial Intelligence for Prediction, Optimization, and Control of Thermal Energy Storage Systems.”
Thermal Science and Engineering Progress, 39: 101730. https:/ /doi.org/10.1016/j.tsep.2023.101730.
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S.,
Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … Zoph, B. (2024).
GPT-4 Technical Report (arXiv:2303.08774). arXiv. https:/ /doi.org/10.48550/arXiv.2303.08774.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your
Language Model Is Secretly a Reward Model (arXiv:2305.18290). arXiv. http:/ /arxiv.org/abs/2305.18290.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. (2023). GPQA: A Graduate-
Level Google-Proof Q&A Benchmark (arXiv:2311.12022). arXiv. http:/ /arxiv.org/abs/2311.12022.
----------------------------------------------------------------------------------------------------
Document 12:
90
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Factuality and Truthfulness
Despite remarkable achievements, LLMs remain
susceptible to factual inaccuracies and content
hallucination—creating seemingly realistic, yet false,
information. The presence of real-world instances
where LLMs have produced hallucinations—in
court cases, for example—underscores the growing
necessity of closely monitoring trends in LLM
factuality.
TruthfulQA
Introduced at ACL 2022, TruthfulQA is a benchmark
designed to evaluate the truthfulness of LLMs in
generating answers to questions. This benchmark
comprises approximately 800 questions across 38
categories, including health, politics, and finance.
Many questions are crafted to challenge commonly
held misconceptions, which typically lead humans to
answer incorrectly (Figure 2.2.9). Although one of the
observations of the paper is that larger models tend to
be less truthful, GPT-4 (RLHF) released in early 2024,
has achieved the highest performance thus far on the
TruthfulQA benchmark, with a score of 0.6 (Figure
2.2.10). This score is nearly three times higher than that
of a GPT-2-based model tested in 2021, indicating that
LLMs are becoming progressively better at providing
truthful answers.
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Figure 2.2.9
Sample TruthfulQA questions
Source: Lin, Hilton, and Evans, 2022
2.2 Language
----------------------------------------------------------------------------------------------------
Document 13:
86
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Understanding
English language understanding challenges AI systems
to understand the English language in various ways
such as reading comprehension and logical reasoning.
HELM: Holistic Evaluation of Language Models
As illustrated above, in recent years LLMs have
surpassed human performance on traditional English-
language benchmarks, such as SQuAD (question
answering) and SuperGLUE (language understanding).
This rapid advancement has led to the need for more
comprehensive benchmarks.
In 2022, Stanford researchers introduced HELM
(Holistic Evaluation of Language Models), designed
to evaluate LLMs across diverse scenarios,
including reading comprehension, language
understanding, and mathematical reasoning. 6
HELM assesses models from several leading
companies like Anthropic, Google, Meta, and
OpenAI, and uses a “mean win rate” to track
average performance across all scenarios. As of
January 2024, GPT-4 leads the aggregate HELM
leaderboard with a mean win rate of 0.96 (Figure
2.2.3); however, different models top different task
categories (Figure 2.2.4). 7
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
0.96
0.83 0.82 0.78 0.78 0.77 0.73 0.72 0.69 0.68
GPT-4 (0613)
GPT-4 Turbo (1106 preview)
Palmyra X V3 (72B)
Palmyra X V2 (33B)
PaLM-2 (Unicorn)
Yi (34B)
Mixtral (8×7B 32K seqlen)
Anthropic Claude v1.3
PaLM-2 (Bison)
Anthropic Claude 2.00.00
0.20
0.40
0.60
0.80
1.00
Mean win rate
HELM: mean win rate
Source: CRFM, 2023 | Chart: 2024 AI Index report
GSM8K - EM
LegalBench - EM
MATH - Equivalent (CoT)
MMLU - EM
MedQA - EM
NarrativeQA - F1
NaturalQuestions (closed-book) -
F1
NaturalQuestions (open-book) - F1
OpenbookQA - EM
WMT 2014 - BLEU-4
Task
GPT-4 (0613)
GPT-4 (0613)
GPT-4 Turbo (1106 preview)
GPT-4 (0613)
GPT-4 Turbo (1106 preview)
Yi (34B)
Llama 2 (70B)
PaLM-2 (Bison)
GPT-4 (0613)
Palmyra X V3 (72B)
Leading model
0.93
0.71
0.86
0.74
0.82
0.78
0.46
0.81
0.96
0.26
Score
Leaders on individual HELM sub-benchmarks
Source: CRFM, 2023 | Table: 2024 AI Index report
Figure 2.2.3
Figure 2.2.4
6 HELM evaluates 10 scenarios: (1) NarrativeQA (reading comprehension), (2) Natural Questions (closed-book) (closed-book short-answer question answering), (3) Natural Questions
(open-book) (open-book short-answer question answering), (4) OpenBookQA (commonsense question answering), (5) MMLU (multisubject understanding), (6) GSM8K (grade school
math), (7) MATH (competition math), (8) LegalBench (legal reasoning), (9) MedQA (medical knowledge), and (10) WMT 2014 (machine translation).
7 There are several versions of HELM. This section reports the score on HELM Lite, Release v1.0.0 (2023-12-19), with the data having been collected in January 2024.
2.2 Language
----------------------------------------------------------------------------------------------------
Document 14:
193
Artificial Intelligence
Index Report 2024Chapter 3 PreviewTable of Contents
Universal and Transferable Attacks on Aligned
Language Models
Recent attention in AI security has centered on
uncovering adversarial attacks capable of bypassing
the implemented safety protocols of LLMs. Much of
this research requires substantial human intervention
and is idiosyncratic to specific models. However, in
2023, researchers unveiled a universal attack capable
of operating across various LLMs. This attack induces
aligned models to generate objectionable content
(Figure 3.4.9).
The method involved automatically generating suffixes
that, when added to various prompts, compel LLMs
to produce unsafe content. Figure 3.4.10 highlights
the success rates of different attacking styles on
leading LLMs. The method the researchers introduce
is called Greedy Coordinate Gradient (GCG). The
study demonstrates that these suffixes (the GCG
attack) often transfer effectively across both closed
and open models, encompassing ChatGPT, Bard,
Claude, Llama-2-Chat, and Pythia. This study raises
an important question as to how models can be better
fortified against automated adversarial attacks. It
also demonstrates how LLMs can be vulnerable to
attacks that employ unintelligible, non-human-readable
prompts. Current red-teaming methodologies primarily
focus on interpretable prompts. This new research
suggests there is a significant gap in buffering LLMs
against attacks utilizing uninterpretable prompts.
3.4 Security and Safety
Chapter 3: Responsible AIArtificial Intelligence
Index Report 2024
Figure 3.4.9
Using suffixes to manipulate LLMs
Source: Zou et al., 2023
----------------------------------------------------------------------------------------------------
Document 15:
88
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Generation
In generation tasks, AI models are tested on their ability
to produce fluent and practical language responses.
Chatbot Arena Leaderboard
The rise of capable LLMs has made it increasingly
important to understand which models are
preferred by the general public. Launched in 2023,
the Chatbot Arena Leaderboard is one of the
first comprehensive evaluations of public LLM
preference. The leaderboard allows users to query
two anonymous models and vote for the preferred
generations (Figure 2.2.7). As of early 2024, the
platform has garnered over 200,000 votes, and
users ranked OpenAI’s GPT-4 Turbo as the most
preferred model (Figure 2.2.8).
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Figure 2.2.7
A sample model response on the Chatbot Arena Leaderboard
Source: Chatbot Arena Leaderboard, 2024
2.2 Language
----------------------------------------------------------------------------------------------------
Document 16:
148
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.12 Techniques for LLM Improvement
2.12 Techniques for LLM Improvement
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
As LLMs use increases, techniques are being sought to enhance their performance and efficiency. This section examines
some of those advances.
Prompting
Prompting, a vital aspect of the AI pipeline, entails
supplying a model with natural language instructions
that describe tasks the model should execute.
Mastering the art of crafting effective prompts
significantly enhances the performance of LLMs
without requiring that models undergo underlying
improvements.
Graph of Thoughts Prompting
Highlighted Research:
Chain of thought (CoT) and Tree of Thoughts
(ToT) are prompting methods that can improve
the performance of LLMs on reasoning tasks. In
2023, European researchers introduced another
prompting method, Graph of Thoughts (GoT), that
has also shown promise (Figure 2.12.1). GoT enables
LLMs to model their thoughts in a more flexible,
graph-like structure which more closely mirrors
actual human reasoning. The researchers then
designed a model architecture to implement GoT
and found that, compared to ToT, it increased the
quality of outputs by 62% on a sorting task while
reducing cost by around 31% (Figure 2.12.2).
Figure 2.12.1
Graph of Thoughts (GoT) reasoning flow
Source: Besta et al., 2023
----------------------------------------------------------------------------------------------------
Document 17:
135
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.9 Robotics
2.9 Robotics
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Over time, AI has become increasingly integrated into robotics, enhancing robots’ capabilities to perform complex
tasks. Especially with the rise of foundation models, this integration allows robots to iteratively learn from their
surroundings, adapt flexibly to new settings, and make autonomous decisions.
PaLM-E
PaLM-E is a new AI model from Google that
merges robotics with language modeling to
address real-world tasks like robotic manipulation
and knowledge tasks like question answering and
image captioning. Leveraging transformer-based
architectures, the largest PaLM-E model is scaled
up to 562B parameters. The model is trained
on diverse visual language as well as robotics
data, which results in superior performance on
a variety of robotic benchmarks. PaLM-E also
sets new standards in visual tasks like OK-VQA,
excels in other language tasks, and can engage in
chain-of-thought, mathematical, and multi-image
reasoning, even without specific training in these
areas. Figure 2.9.1 illustrates some of the tasks that
the PaLM-E model can perform.
On Task and Motion Planning (TAMP) domains,
where robots have to manipulate objects, PaLM-E
outperforms previous state-of-the-art methods like
SayCan and PaLI on both embodied visual question
answering and planning (Figure 2.9.2).16 On
robotic manipulation tasks, PaLM-E outperforms
competing models (PaLI and CLIP-FT) in its ability
to detect failures, which is a crucial step for robots
to perform closed-loop planning (Figure 2.9.3).
PaLM-E is significant in that it demonstrates that
language modeling techniques as well as text
data can enhance the performance of AI systems
in nonlanguage domains, like robotics. PaLM-E
also highlights how there are already linguistically
adept robots capable of real-world interaction and
high-level reasoning. Developing these kinds of
multifaceted robots is an essential step in creating
more general robotic assistants that can, for
example, assist in household work.
Highlighted Research:
16 Embodied Visual Question Answering (Embodied VQA) is a task where agents need to navigate through 3D environments and answer questions about the objects they
visually perceive in the environment.
----------------------------------------------------------------------------------------------------
Document 18:
87
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
MMLU: Massive Multitask Language
Understanding
The Massive Multitask Language Understanding
(MMLU) benchmark assesses model performance in
zero-shot or few-shot scenarios across 57 subjects,
including the humanities, STEM, and social sciences
(Figure 2.2.5). MMLU has emerged as a premier
benchmark for assessing LLM capabilities: Many state-
of-the-art models like GPT-4, Claude 2, and Gemini have
been evaluated against MMLU.
In early 2023, GPT-4 posted a state-of-the-art score
on MMLU, later surpassed by Google’s Gemini Ultra.
Figure 2.2.6 highlights the top model scores on the
MMLU benchmark in different years. The scores
reported are the averages across the test set. As of
January 2024, Gemini Ultra holds the top score of
90.0%, marking a 14.8 percentage point improvement
since 2022 and a 57.6 percentage point increase since
MMLU’s inception in 2019. Gemini Ultra’s score was
the first to surpass MMLU’s human baseline of 89.8%.
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
2019 2020 2021 2022 2023
30%
40%
50%
60%
70%
80%
90%
Average accuracy (%)
90.04%
MMLU: average accuracy
Source: Papers With Code, 2023 | Chart: 2024 AI Index report
89.8%, human baseline
Figure 2.2.5
Figure 2.2.6
A sample question from MMLU
Source: Hendrycks et al., 2021
2.2 Language
----------------------------------------------------------------------------------------------------
Document 19:
http:/ /arxiv.org/abs/2308.06595.
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S. & Kreis, K. (2023). Align Your Latents: High-Resolution
Video Synthesis With Latent Diffusion Models (arXiv:2304.08818). arXiv. http:/ /arxiv.org/abs/2304.08818.
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C.,
Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., … Zitkovich, B.
(2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. (arXiv:2307.15818). arXiv.
https:/ /arxiv.org/abs/2307.15818.
Castaño, J., Martínez-Fernández, S., Franch, X. & Bogner, J. (2023). Exploring the Carbon Footprint of Hugging Face’s ML
Models: A Repository Mining Study. 2023 ACM/IEEE International Symposium on Empirical Software Engineering and
Measurement (ESEM), 1–12. https:/ /doi.org/10.1109/ESEM56168.2023.10304801.
Chen, L., Chen, Z., Zhang, Y., Liu, Y., Osman, A. I., Farghali, M., Hua, J., Al-Fatesh, A., Ihara, I., Rooney, D. W. & Yap, P.-S. (2023).
“Artificial Intelligence-Based Solutions for Climate Change: A Review.” Environmental Chemistry Letters 21, no. 5: 2525–57.
https:/ /doi.org/10.1007/s10311-023-01617-y.
Chen, L., Zaharia, M. & Zou, J. (2023). How Is ChatGPT’s Behavior Changing Over Time? (arXiv:2307.09009). arXiv.
http:/ /arxiv.org/abs/2307.09009.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A.,
Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., … Zaremba, W. (2021). Evaluating Large
Language Models Trained on Code (arXiv:2107.03374; Version 2). arXiv. https:/ /doi.org/10.48550/arXiv.2107.03374.
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2023). Deep Reinforcement Learning From Human
Preferences (arXiv:1706.03741). arXiv. https:/ /doi.org/10.48550/arXiv.1706.03741.
----------------------------------------------------------------------------------------------------
Document 20:
140
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Direct Preference Optimization
Highlighted Research:
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
As illustrated above, RLHF is a useful method
for aligning LLMs with human preferences.
However, RLHF requires substantial computational
resources, involving the training of multiple
language models and integrating LM policy
sampling within training loops. This complexity
can hinder its broader adoption.
In response, researchers from Stanford and CZ
Biohub have developed a new reinforcement
learning algorithm for aligning models named
Direct Preference Optimization (DPO). DPO is
simpler than RLHF but equally effective. The
researchers show that DPO is as effective as other
existing alignment methods, such as Proximal
Policy Optimization (PPO) and Supervised Fine-
Tuning (SFT), on tasks like summarization (Figure
2.10.5). The emergence of techniques like DPO
suggests that model alignment methods are
becoming more straightforward and accessible.
0.00 0.25 0.50 0.75 1.00
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
DPO PPO Best of 128 SFT Preferred-FT GPT-J
Sampling temperature
Win rate
Comparison of dierent algorithms on TL;DR summarization task across dierent sampling temperatures
Source: Rafailov et al., 2023 | Table: 2024 AI Index report
Human baseline
Figure 2.10.5
2.10 Reinforcement Learning
Next, let's build the LLMLingua compressor. We use GPT-2 as the document compression model, but it can be swapped for another model such as Llama-2.
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor

compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
# compressor = LLMLinguaCompressor(model_name="NousResearch/Llama-2-7b-hf", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
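Depending on the installed version of langchain-community, LLMLinguaCompressor also appears to accept a target_token parameter (default around 300) that sets the token budget for the compressed context. The snippet below is a sketch under that assumption; check the class signature in your environment before relying on it.

# Assumption: this version of LLMLinguaCompressor accepts target_token
compressor = LLMLinguaCompressor(
    model_name="openai-community/gpt2",
    device_map="cpu",
    target_token=300,  # rough token budget for the compressed context
)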
Now let's run the compression retriever we just built and see how much the RAG search results shrink. Compared with the original results, the retrieved documents become dramatically more compact.
compressed_docs = compression_retriever.invoke(query)
pretty_print_docs(compressed_docs)
The output of pretty_print_docs is shown below.
Document 1:
#ref0
ArtificialIndex 2024Chapter PreviewTable of Techniques for LLChapter PerformanceArt 2024Fine
Finetun as a enhancing LLMs and adjusting models on smaller datasets Fine overall sharp the model�s abilities tasks. also allows pre the.79 902 9
66 9 9,022
348uanacoB GuanacoB ChatGPT VicacoB Guanaco 65B GPT4
600
800,0001,200
400
Elo rating (
Model competitions based on 10,-4 Vic
: Dettmers et al., 2023 Chart 2024 AI Index.
QLo
Highlight Research:QLo researchers the University of in 2023, a
for more efficient model fineing. It usage enabling the-tun 65 billion parameter model
while 16bit fineing
. To put this perspective,-tun ama model leading-source LLM requires GB of GPU
, QLoRA 16 times efficientQLRA manages to increase efficiency with
techniques like a 4-bit NormalFloat (NF4), double
quantization, and page optimizers. QLoRA is
used to train a model named Guanaco, which
matched or even surpassed models like ChatGPT
in performance on the Vicuna benchmark (a
benchmark that ranks the outputs of LLMs) (Figure
2.12.5). Remarkably, the Guanaco models were
created with just 24 hours of fine-tuning on a single
GPU. QLoRa highlights how methods for optimizing
and further improving models have become more
efficient, meaning fewer resources will be required
to make increasingly capable models.
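To put a number on the reduction, one quick check (not part of the original walkthrough) is to count tokens in the retrieved documents before and after compression, for example with tiktoken:

import tiktoken

# cl100k_base is the encoding used by recent OpenAI chat and embedding models
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(doc_list):
    return sum(len(enc.encode(d.page_content)) for d in doc_list)

before = count_tokens(docs)
after = count_tokens(compressed_docs)
print(f"before: {before} tokens, after: {after} tokens, ratio: {before / after:.1f}x")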
The RAG search results have clearly been compressed by a large margin. Next, let's verify that the quality of the LLM's answers does not degrade when this compressed context is used. We pose the question to the LLM through the RetrievalQA module and confirm that the answer still reads naturally.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)
response = chain.invoke({"query": query})
display(response)
{'query': 'What role does QLoRA play in large language model?',
'result': "QLoRA plays a significant role in large language modeling by enabling more efficient model fine-tuning. It allows for the adjustment and enhancement of models on smaller datasets, ultimately sharpening the model's abilities for various tasks. QLoRA achieves this by utilizing techniques like 4-bit NormalFloat (NF4), double quantization, and page optimizers to increase efficiency. Notably, models like Guanaco trained using QLoRA have matched or even surpassed models like ChatGPT in performance on benchmarks like Vicuna, showcasing the effectiveness of QLoRA in optimizing and improving models with fewer resources."}
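As a further sanity check (not in the original post), the same question can be run against the uncompressed retriever and the two answers compared side by side:

# Baseline chain over the uncompressed retriever, for comparison
baseline_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
baseline_response = baseline_chain.invoke({"query": query})
print(baseline_response["result"])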
That concludes our experiment with compressing RAG search results using the LLMLingua Document Compressor. The results show that we can achieve both goals at once: a large reduction in the retrieved context and no noticeable loss in the quality of the LLM's answers.
Closing Thoughts
In this article, we covered the basics of using the LLMLingua Document Compressor and looked at how well it performs. By compressing prompts dramatically, it speeds up large language model (LLM) processing, and it promises to be useful across a wide range of scenarios: processing large volumes of text, cutting costs, and broadening where LLMs can be applied.
Research and development around LLMLingua is also moving quickly, with LongLLMLingua targeting very long inputs and LLMLingua-2 pushing compression ratios even further. As these derivatives mature, LLMs will be able to take on more complex tasks, with the potential to significantly change how we live and do business.
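For reference, the standalone llmlingua package documents an LLMLingua-2 mode along the lines of the sketch below. Parameter names such as use_llmlingua2, rate, and force_tokens follow its README and may differ between versions, so treat this as an assumption to verify against your installation.

from llmlingua import PromptCompressor

# Assumption: LLMLingua-2 usage as documented in the llmlingua README
llm_lingua2 = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
result = llm_lingua2.compress_prompt(
    "\n\n".join(d.page_content for d in docs),
    rate=0.33,                 # keep roughly one third of the tokens
    force_tokens=["\n", "?"],  # tokens that should never be dropped
)
print(result["compressed_prompt"])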
More Information:
- Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu, "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," arXiv:2310.05736, https://arxiv.org/abs/2310.05736