[LLM in Practice] Prompt Compression with LangChain and LLMLingua

In recent years, large language models (LLMs) such as ChatGPT have been changing our daily lives. By training on vast amounts of text data, these models have learned to hold natural conversations and generate text. However, using LLMs comes with challenges such as computational cost and processing time.

In particular, the longer the prompt (the instructions given to the LLM), the more processing time and cost tend to grow. This is because the LLM needs more computational resources to process it.

If we could make prompts shorter, couldn't we speed up LLM processing and cut costs?

This question gave rise to a technique called prompt compression, and one framework drawing particular attention in this space is LLMLingua.

In this article, we walk through how to compress RAG retrieval results using the LLMLingua Document Compressor, a LangChain module. For a broader treatment of prompt compression itself, see our separate article comparing prompt compression techniques and recent trends.

What Is LLMLingua?

LLMLingua adopts a coarse-to-fine algorithm for compressing prompts effectively. The algorithm consists mainly of the following three modules.

1. Budget Controller

The budget controller evaluates the importance of each element of the prompt (instructions, examples, the question, and so on) and assigns a different compression ratio to each. This keeps the important information while compressing the prompt as a whole.

2. Iterative Token-level Compression

The prompt is divided into fine-grained token-level segments and compressed iteratively, achieving a higher degree of compression. This tuning shortens the compressed prompt further without losing the original meaning.

3. Alignment

To ensure that the compressed prompt preserves the meaning of the original, the distribution of the small compressor model is aligned with that of the target LLM. As a result, the LLM can still generate accurate answers from the compressed prompt.
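
Before moving on to the LangChain integration, the sketch below shows how these pieces surface in the llmlingua package's own API: the instruction and the question are handled by the budget controller, while the context is compressed iteratively at the token level. This is a minimal sketch against llmlingua 0.2.x; the target_token value and the placeholder strings are illustrative assumptions, not recommended settings.

from llmlingua import PromptCompressor

# Load a small causal LM (GPT-2 here) to act as the compressor
compressor = PromptCompressor(model_name="openai-community/gpt2", device_map="cpu")

result = compressor.compress_prompt(
    context=["<long retrieved document text>"],  # coarse-grained units to compress
    instruction="Answer the question using the context.",  # protected by the budget controller
    question="What role does QLoRA play in large language model?",
    target_token=200,  # rough token budget for the compressed prompt (assumed value)
)
print(result["compressed_prompt"])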

Benefits of LLMLingua

  • Lower compute cost: Compressing the prompt shortens LLM processing time and reduces computational cost.
  • Better LLM efficiency: Shorter prompts improve inference efficiency, allowing more tasks to be processed.
  • High-fidelity compression: Thanks to its algorithm, LLMLingua can compress prompts by up to 20x while keeping the performance drop minimal.

Figure 1: The LLMLingua framework (figure from https://arxiv.org/pdf/2310.05736)

What Is the LLMLingua Document Compressor?

The LLMLingua Document Compressor module uses compact yet capable language models such as GPT-2 (small) or LLaMA-7B to prune the non-essential parts of a prompt, enabling efficient inference. The technique can compress a prompt to as little as 1/20 of its original length while keeping the performance drop minimal.

Benefits of the LLMLingua Document Compressor

  • High compression ratio: Up to 20x compression, which can translate into significant cost savings.
  • Little quality loss: Even after compression, the quality of the text the LLM generates is largely unchanged.
  • Flexibility: It works with a wide range of LLMs.

Use Cases for the LLMLingua Document Compressor

  • Information retrieval: Extract only the necessary information from large document collections.
  • Question answering: Compress long documents to enable faster answers.
  • Summarization: Remove non-essential information when summarizing long documents, producing concise summaries.

Trying Out the LLMLingua Document Compressor

First, install the following packages with pip.

# Install LangChain
$ pip install langchain-community==0.3.14 langchain-core==0.3.29 langchain-openai==0.2.14

# Install LLMLingua and the packages required for RAG
$ pip install accelerate==1.2.1 faiss-cpu==1.9.0.post1 llmlingua==0.2.2 pypdf==5.1.0

To try LLMLingua, first set your OpenAI API key. We also define a helper function that pretty-prints retrieved RAG documents.

import os

# Set the OpenAI API key
os.environ["OPENAI_API_KEY"] = "<YOUR-OPENAI-KEY>"

# Helper for displaying retrieved RAG documents
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

Next, prepare the PDF document to use for RAG. Here we use the "Artificial Intelligence Index Report 2024", available on arXiv.

import requests

if not os.path.exists("2405.19522.pdf"):
    response = requests.get("https://arxiv.org/pdf/2405.19522")
    with open("2405.19522.pdf", mode="wb") as f:
        f.write(response.content)

Next, split the PDF into chunks, the unit of input for RAG.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

documents = PyPDFLoader("2405.19522.pdf").load()

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=100,
)
texts = text_splitter.split_documents(documents)
display(texts[100])
Document(metadata={'source': '2405.19522.pdf', 'page': 99}, page_content='100\nArtificial Intelligence\nIndex Report 2024Chapter 2 PreviewTable of Contents\n2021 2022\n0.00\n0.10\n0.20\n0.30\n0.40\n0.50\n0.60Editing accuracy\n0.11, Position replacement\n0.25, Positional addition\n0.33, Average\n0.34, Alter parts\n0.47, Object addition\n0.52, Size\n0.59, Object replacement\nEditVal automatic evaluation: editing accuracy\nSource: EditVal Leaderboard, 2024 | Chart: 2024 AI Index report\nEditing\nImage editing involves using AI to modify \nimages based on text prompts. This AI-\nassisted approach has broad real-world \napplications in fields such as engineering, \nindustrial design, and filmmaking.\nEditVal \nDespite the promise of text-guided image \nediting, few robust methods can evaluate \nhow accurately AI image editors adhere to \nediting prompts. EditVal, a new benchmark \nfor assessing text-guided image editing, \nincludes over 13 edit types, such as adding \nobjects or changing their positions, \nacross 19 object classes (Figure 2.4.10). \nThe benchmark was applied to evaluate \neight leading text-guided image editing \nmethods including SINE and Null-text. \nPerformance improvements since 2021 on \na variety of the benchmark’s editing tasks, \nare shown in Figure 2.4.11. \nChapter 2: Technical PerformanceArtificial Intelligence\nIndex Report 2024\nFigure 2.4.10\nFigure 2.4.11\nA sample VisIT-Bench instruction set\nSource: Bitton et al., 2023\n2.4 Image Computer Vision and Image Generation')
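
Since from_tiktoken_encoder measures chunk_size in tokens, you can sanity-check the split by counting tokens with tiktoken (pulled in as a dependency of langchain-openai). A quick sketch; the "gpt2" encoding name matches the splitter's default as far as we know, but treat it as an assumption:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Each chunk should come in at roughly 1,000 tokens or fewer
for chunk in texts[:5]:
    print(len(enc.encode(chunk.page_content)))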

With the preparations done, we build a FAISS index for RAG retrieval and inspect the search results. These retrieved documents are what we will compress with LLMLingua.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-ada-002")
retriever = FAISS.from_documents(texts, embedding).as_retriever(search_kwargs={"k": 20})

query = "What role does QLoRA play in large language model?"
docs = retriever.invoke(query)
pretty_print_docs(docs)

The output of pretty_print_docs (20 retrieved documents) is shown below.

Document 1:

151
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.12 Techniques for LLM Improvement
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Fine-Tuning
Fine-tuning has grown increasingly popular as a 
method of enhancing LLMs and involves further 
training or adjusting models on smaller datasets.  
Fine-tuning not only boosts overall model 
performance but also sharpens the model’s 
capabilities on specific tasks. It also allows for more 
precise control over the model’s behavior.
879 902 916
966 974 992 1,022
1,348
Guanaco 7B Bard Guanaco 13B ChatGPT Vicuna 13B Guanaco 33B Guanaco 65B GPT-4
0
200
400
600
800
1,000
1,200
1,400
Elo rating (mean)
Model competitions based on 10,000 simulations using GPT-4 and the Vicuna benchmark
Source: Dettmers et al., 2023 | Chart: 2024 AI Index report
Figure 2.12.5
QLoRA
Highlighted Research:
QLoRA, developed by researchers from the 
University of Washington in 2023, is a new method 
for more efficient model fine-tuning. It dramatically 
reduces memory usage, enabling the fine-tuning 
of a 65 billion parameter model on a single 48 
GB GPU while maintaining full 16-bit fine-tuning 
performance. To put this in perspective, fine-tuning 
a 65B Llama model, a leading open-source LLM, 
typically requires about 780 GB of GPU memory. 
Therefore, QLoRA is nearly 16 times more efficient.
QLoRA manages to increase efficiency with 
techniques like a 4-bit NormalFloat (NF4), double 
quantization, and page optimizers. QLoRA is 
used to train a model named Guanaco, which 
matched or even surpassed models like ChatGPT 
in performance on the Vicuna benchmark (a 
benchmark that ranks the outputs of LLMs) (Figure 
2.12.5). Remarkably, the Guanaco models were 
created with just 24 hours of fine-tuning on a single 
GPU. QLoRa highlights how methods for optimizing 
and further improving models have become more 
efficient, meaning fewer resources will be required 
to make increasingly capable models.
----------------------------------------------------------------------------------------------------
Document 2:

B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., … Koreeda, Y. (2023). 
Holistic Evaluation of Language Models (arXiv:2211.09110). arXiv. https:/ /doi.org/10.48550/arXiv.2211.09110.
Lin, S., Hilton, J. & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods (arXiv:2109.07958). arXiv. 
https:/ /doi.org/10.48550/arXiv.2109.07958.
Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, 
C., Shen, S., Zhang, T., Su, Y., Sun, H., … Tang, J. (2023). AgentBench: Evaluating LLMs as Agents (arXiv:2308.03688). arXiv. 
https:/ /doi.org/10.48550/arXiv.2308.03688.
Luccioni, A. S., Jernite, Y. & Strubell, E. (2023). Power Hungry Processing: Watts Driving the Cost of AI Deployment? 
(arXiv:2311.16863). arXiv. http:/ /arxiv.org/abs/2311.16863.
Luo, J., Paduraru, C., Voicu, O., Chervonyi, Y., Munns, S., Li, J., Qian, C., Dutta, P., Davis, J. Q., Wu, N., Yang, X., Chang, C.-
M., Li, T., Rose, R., Fan, M., Nakhost, H., Liu, T., Kirkman, B., Altamura, F., … Mankowitz, D. J. (2022). Controlling Commercial 
Cooling Systems Using Reinforcement Learning (arXiv:2211.07357). arXiv. https:/ /doi.org/10.48550/arXiv.2211.07357.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & Potts, C. (2011). “Learning Word Vectors for Sentiment Analysis.” In D. 
Lin, Y. Matsumoto & R. Mihalcea, eds., Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: 
Human Language Technologies: 142–50. Association for Computational Linguistics. https:/ /aclanthology.org/P11-1015.
Melas-Kyriazi, L., Rupprecht, C., Laina, I. & Vedaldi, A. (2023). RealFusion: 360° Reconstruction of Any Object From a Single 
Image (arXiv:2302.10663). arXiv. http:/ /arxiv.org/abs/2302.10663.
Mihaylov, T., Clark, P., Khot, T. & Sabharwal, A. (2018). “Can a Suit of Armor Conduct Electricity? A New Dataset for  
Open Book Question Answering.” In E. Riloff, D. Chiang, J. Hockenmaier & J. Tsujii, eds., Proceedings of the 2018  
Conference on Empirical Methods in Natural Language Processing: 2381–91. Association for Computational Linguistics. 
https:/ /doi.org/10.18653/v1/D18-1260.
----------------------------------------------------------------------------------------------------
Document 3:

Zambrano, D., Talisman, D., Hoque, E., Surani, F., Fagan, F., Sarfaty, G., Dickinson, G. M., Porat, H., Hegland, J., … Li, Z. (2023). 
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models (arXiv:2308.11462). 
arXiv. http:/ /arxiv.org/abs/2308.11462.
Haque, A., Tancik, M., Efros, A. A., Holynski, A. & Kanazawa, A. (2023). Instruct-NeRF2NeRF: Editing 3D Scenes With 
Instructions (arXiv:2303.12789). arXiv. http:/ /arxiv.org/abs/2303.12789.
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D. & Steinhardt, J. (2021). Measuring Massive Multitask 
Language Understanding (arXiv:2009.03300). arXiv. http:/ /arxiv.org/abs/2009.03300.
Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D. & Steinhardt, J. (2021). Measuring Mathematical 
Problem Solving With the MATH Dataset (arXiv:2103.03874). arXiv. http:/ /arxiv.org/abs/2103.03874.
Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M. & Leskovec, J. (2021). Open Graph Benchmark:  
Datasets for Machine Learning on Graphs (arXiv:2005.00687). arXiv. https:/ /doi.org/10.48550/arXiv.2005.00687.
Huang, J., Chen, X., Mishra, S., Zheng, H. S., Yu, A. W., Song, X. & Zhou, D. (2024). Large Language Models Cannot  
Self-Correct Reasoning Yet (arXiv:2310.01798). arXiv. http:/ /arxiv.org/abs/2310.01798.
Huang, Q., Vora, J., Liang, P. & Leskovec, J. (2023). Benchmarking Large Language Models as AI Research Agents 
(arXiv:2310.03302). arXiv. http:/ /arxiv.org/abs/2310.03302.
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O. & Narasimhan, K. (2023). SWE-bench: Can Language Models 
Resolve Real-World GitHub Issues? (arXiv:2310.06770). arXiv. https:/ /doi.org/10.48550/arXiv.2310.06770.
Jin, D., Pan, E., Oufattole, N., Weng, W.-H., Fang, H. & Szolovits, P. (2020). What Disease Does This Patient Have?  
A Large-Scale Open Domain Question Answering Dataset From Medical Exams (arXiv:2009.13081). arXiv.  
http:/ /arxiv.org/abs/2009.13081.
Kıcıman, E., Ness, R., Sharma, A. & Tan, C. (2023). Causal Reasoning and Large Language Models: Opening a New Frontier  
for Causality (arXiv:2305.00050). arXiv. http:/ /arxiv.org/abs/2305.00050.
----------------------------------------------------------------------------------------------------
Document 4:

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. (2023). GPQA: A Graduate-
Level Google-Proof Q&A Benchmark (arXiv:2311.12022). arXiv. http:/ /arxiv.org/abs/2311.12022.
Rustia, D. J. A., Chiu, L.-Y., Lu, C.-Y., Wu, Y.-F., Chen, S.-K., Chung, J.-Y., Hsu, J.-C. & Lin, T.-T. (2022). “Towards Intelligent and 
Integrated Pest Management Through an AIoT-Based Monitoring System.” Pest Management Science 78, no. 10: 4288–4302. 
https:/ /doi.org/10.1002/ps.7048.
Schaeffer, R., Miranda, B. & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? (arXiv:2304.15004). 
arXiv. http:/ /arxiv.org/abs/2304.15004.
Schneider, F., Kamal, O., Jin, Z. & Schölkopf, B. (2023). Moûusai: Text-to-Music Generation With Long-Context Latent Diffusion 
(arXiv:2301.11757). arXiv. https:/ /doi.org/10.48550/arXiv.2301.11757.
Shams, S. R., Jahani, A., Kalantary, S., Moeinaddini, M. & Khorasani, N. (2021). “Artificial Intelligence Accuracy Assessment in NO2 
Concentration Forecasting of Metropolises Air.” Scientific Reports 11, no. 1: 1805. https:/ /doi.org/10.1038/s41598-021-81455-6.
Shi, Y., Wang, P., Ye, J., Long, M., Li, K. & Yang, X. (2024). MVDream: Multi-View Diffusion for 3D Generation (arXiv:2308.16512). 
arXiv. http:/ /arxiv.org/abs/2308.16512.
Soomro, K., Zamir, A. R. & Shah, M. (2012). UCF101: A Dataset of 101 Human Actions Classes From Videos in the Wild 
(arXiv:1212.0402; Version 1). arXiv. http:/ /arxiv.org/abs/1212.0402.
Stone, A., Xiao, T., Lu, Y., Gopalakrishnan, K., Lee, K.-H., Vuong, Q., Wohlhart, P., Kirmani, S., Zitkovich, B., Xia, F., Finn, C. & 
Hausman, K. (2023). Open-World Object Manipulation Using Pre-trained Vision-Language Models (arXiv:2303.00905). arXiv. 
https:/ /doi.org/10.48550/arXiv.2303.00905.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., 
Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., … Scialom, T. (2023). Llama 2: Open 
Foundation and Fine-Tuned Chat Models (arXiv:2307.09288). arXiv. https:/ /doi.org/10.48550/arXiv.2307.09288.
----------------------------------------------------------------------------------------------------
Document 5:

Solving With Large Language Models (arXiv:2305.10601). arXiv. http:/ /arxiv.org/abs/2305.10601.
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. (2019). From Recognition to Cognition: Visual Commonsense Reasoning 
(arXiv:1811.10830). arXiv. http:/ /arxiv.org/abs/1811.10830.
Zhang, L., Rao, A. & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (arXiv:2302.05543). 
arXiv. http:/ /arxiv.org/abs/2302.05543.
Zhang, Z., Han, L., Ghosh, A., Metaxas, D. & Ren, J. (2022). SINE: SINgle Image Editing With Text-to-Image Diffusion Models 
(arXiv:2212.04489). arXiv. https:/ /doi.org/10.48550/arXiv.2212.04489.
----------------------------------------------------------------------------------------------------
Document 6:

Bowman, S., Chen, A., Conerly, T., DasSarma, N., Drain, D., Elhage, N., El-Showk, S., Fort, S., Clark, J. (2022).  
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned (arXiv:2209.07858).  
arXiv. http:/ /arxiv.org/abs/2209.07858.
Gehman, S., Gururangan, S., Sap, M., Choi, Y. & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration 
in Language Models (arXiv:2009.11462). arXiv. https:/ /doi.org/10.48550/arXiv.2009.11462.
Grinbaum, A. & Adomaitis, L. (2024). “Dual Use Concerns of Generative AI and Large Language Models.”  
Journal of Responsible Innovation 11, no. 1. https:/ /doi.org/10.1080/23299460.2024.2304381.
Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D. & Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated  
Dataset for Adversarial and Implicit Hate Speech Detection (arXiv:2203.09509v4). arXiv. http:/ /arxiv.org/abs/2203.09509.
Ippolito, D., Tramèr, F., Nasr, M., Zhang, C., Jagielski, M., Lee, K., Choquette-Choo, C. A. & Carlini, N. (2023).  
Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy (arXiv:2210.17546v3). arXiv.  
https:/ /doi.org/10.48550/arXiv.2210.17546.
Janssen, M., Brous, P., Estevez, E., Barbosa, L. S. & Janowski, T. (2020). “Data Governance: Organizing Data for Trustworthy 
Artificial Intelligence.” Government Information Quarterly 37, no. 3: 101493. https:/ /doi.org/10.1016/j.giq.2020.101493.
Li, B., Sun, J. & Poskitt, C. M. (2023). How Generalizable Are Deepfake Detectors? An Empirical Study (arXiv:2308.04177). arXiv. 
http:/ /arxiv.org/abs/2308.04177.
Chapter 3: Responsible AI
Appendix
----------------------------------------------------------------------------------------------------
Document 7:

201
Artificial Intelligence
Index Report 2024Chapter 3 PreviewTable of Contents
Measuring Subjective Opinions in LLMs 
Research from Anthropic suggests that large language 
models do not equally represent global opinions 
on a variety of topics such as politics, religion, 
and technology. In this study, researchers built a 
GlobalOpinionQA dataset to capture cross-country 
opinions on various issues (Figure 3.5.7). They then 
generated a similarity metric to compare people’s 
answers in various countries with those outputted 
by LLMs. Using a four-point Likert scale, LLMs were 
asked to rate their agreement with statements from the 
World Values Survey (WVS) and Pew Research Center’s 
Global Attitudes (GAS) surveys, including questions 
like, “When jobs are scarce, employers should give 
priority to people of this country over immigrants,” or 
“On the whole, men make better business executives 
than women do.”
The experiments indicate that the models’ responses 
closely align with those from individuals in Western 
countries (Figure 3.5.8). The authors point out a 
notable lack of diversity in opinion representation, 
especially from non-Western nations among the 
shared responses. While it is challenging for models 
to precisely match the highly diverse distributions 
of global opinions—given the inherent variation in 
perspectives—it is still valuable to understand which 
opinions a model is likely to share. Recognizing 
the biases inherent in models can highlight their 
limitations and facilitate adjustments that improve 
regional applicability.
Chapter 3: Responsible AIArtificial Intelligence
Index Report 2024 3.5 Fairness
Figure 3.5.7
GlobalOpinionQA Dataset
Source: Durmus et al., 2023
----------------------------------------------------------------------------------------------------
Document 8:

Appendix 471Table of Contents
Artificial Intelligence
Index Report 2024
Artificial Intelligence
Index Report 2024 Chapter 2: Technical Performance
Appendix
Valmeekam, K., Marquez, M., Olmo, A., Sreedharan, S. & Kambhampati, S. (2023). PlanBench: An Extensible Benchmark 
for Evaluating Large Language Models on Planning and Reasoning About Change. Thirty-Seventh Conference on Neural 
Information Processing Systems Datasets and Benchmarks Track. https:/ /openreview.net/forum?id=YXogl4uQUO.
Voynov, O., Bobrovskikh, G., Karpyshev, P., Galochkin, S., Ardelean, A.-T., Bozhcnko, A., Karmanova, E., Kopanev, P., Labutin-
Rymsho, Y., Rakhimov, R., Safin, A., Serpiva, V., Artemov, A., Burnaev, E., Tsetserukou, D. & Zorin, D. (2023). Multi-sensor Large-
Scale Dataset for Multi-view 3D Reconstruction. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition 
(CVPR), 21392–403. https:/ /doi.org/10.1109/CVPR52729.2023.02049.
Walker, C. M. & Gopnik, A. (2014). “Toddlers Infer Higher-Order Relational Principles in Causal Learning.” Psychological Science 
25, no. 1: 161–69.
Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L. & Anandkumar, A. (2023). Voyager: An Open-Ended 
Embodied Agent With Large Language Models (arXiv:2305.16291). arXiv. http:/ /arxiv.org/abs/2305.16291.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. & Zhou, D. (2023). Chain-of-Thought Prompting 
Elicits Reasoning in Large Language Models (arXiv:2201.11903). arXiv. https:/ /doi.org/10.48550/arXiv.2201.11903.
Xiao, T., Chan, H., Sermanet, P., Wahid, A., Brohan, A., Hausman, K., Levine, S. & Tompson, J. (2023). Robotic Skill Acquisition 
via Instruction Augmentation With Vision-Language Models (arXiv:2211.11736). arXiv. https:/ /doi.org/10.48550/arXiv.2211.11736.
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D. & Chen, X. (2023). Large Language Models as Optimizers 
(arXiv:2309.03409). arXiv. http:/ /arxiv.org/abs/2309.03409.
Yang, D., Tian, J., Tan, X., Huang, R., Liu, S., Chang, X., Shi, J., Zhao, S., Bian, J., Wu, X., Zhao, Z., Watanabe, S. & Meng, H. 
(2023). UniAudio: An Audio Foundation Model Toward Universal Audio Generation (arXiv:2310.00704). arXiv.  
http:/ /arxiv.org/abs/2310.00704.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y. & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem 
Solving With Large Language Models (arXiv:2305.10601). arXiv. http:/ /arxiv.org/abs/2305.10601.
Zellers, R., Bisk, Y., Farhadi, A. & Choi, Y. (2019). From Recognition to Cognition: Visual Commonsense Reasoning
----------------------------------------------------------------------------------------------------
Document 9:

145
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.11 Properties of LLMs
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
LLMs Are Poor Self-Correctors
Highlighted Research:
It is generally understood that LLMs like GPT-4 
have reasoning limitations and can sometimes 
produce hallucinations. One proposed solution 
to such issues is self-correction, whereby LLMs 
identify and correct their own reasoning flaws. As 
AI’s societal role grows, the concept of intrinsic 
self-correction—allowing LLMs to autonomously 
correct their reasoning without external guidance—
is especially appealing. However, it is currently not 
well understood whether LLMs are in fact capable 
of this kind of self-correction.
 
Researchers from DeepMind and the University 
of Illinois at Urbana–Champaign tested GPT-4’s 
performance on three reasoning benchmarks: 
GSM8K (grade-school math), CommonSenseQA 
(common-sense reasoning), and HotpotQA 
(multidocument reasoning). They found that when 
the model was left to decide on self-correction 
without guidance, its performance declined across 
all tested benchmarks (Figure 2.11.3).
95.50%
82.00%
49.00%
91.50%
79.50%
49.00%
89.00%
80.00%
43.00%
GSM8K CommonSenseQA HotpotQA
0%
20%
40%
60%
80%
100%
Standard prompting: 1 call Self-correct (round 1): 3 calls Self-correct (round 2): 5 calls
Accuracy (%)
GPT-4 on reasoning benchmarks with intrinsic self-correction
Source: Huang et al., 2023 | Chart: 2024 AI Index report
Figure 2.11.3
----------------------------------------------------------------------------------------------------
Document 10:

150
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.12 Techniques for LLM Improvement
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Optimization by PROmpting (OPRO)
Highlighted Research:
A paper from DeepMind has introduced 
Optimization by PROmpting (OPRO), a method 
that uses LLMs to iteratively generate prompts 
to improve algorithmic performance. OPRO uses 
natural language to guide LLMs in creating new 
prompts based on problem descriptions and 
previous solutions (Figure 2.12.3). The generated 
prompts aim to enhance the performance of AI 
systems on particular benchmarks. Compared to 
other prompting approaches like “let’s think step 
by step” or an empty starting point, ORPO leads 
to significantly greater accuracy on virtually all 23 
BIG-bench Hard tasks (Figure 2.12.4).
boolean_expressions
causal_judgement
date_understanding
disambiguation_qa
dyck_languages
formal_fallacies
geometric_shapes
hyperbaton
logical_deduction_seven_objects
movie_recommendation
multistep_arithmetic_two
navigate
object_counting
penguins_in_a_table
reasoning_about_colored_objects
ruin_names
salient_translation_error_detection
snarks
sports_understanding
temporal_sequences
tracking_shu�ed_objects_seven_objects
web_of_lies
word_sorting
−20
0
20
40
boolean_expressions
causal_judgement
date_understanding
disambiguation_qa
dyck_languages
formal_fallacies
geometric_shapes
hyperbaton
logical_deduction_seven_objects
movie_recommendation
multistep_arithmetic_two
navigate
object_counting
penguins_in_a_table
reasoning_about_colored_objects
ruin_names
salient_translation_error_detection
snarks
sports_understanding
temporal_sequences
tracking_shu�ed_objects_seven_objects
web_of_lies
word_sorting
0
10
20
30
40
50
60
Accuracy di�erence
“Let’s think step by step” instruction Empty instruction
Accuracy dierence on 23 BIG-bench Hard (BBH) tasks using PaLM 2-L scorer
Source: Yang et al., 2023 | Chart: 2024 AI Index report
Task
Figure 2.12.3
Figure 2.12.4
Sample OPRO 
prompts and 
optimization 
progress
Source: Yang et al., 2023
----------------------------------------------------------------------------------------------------
Document 11:

Appendix 470Table of Contents
Artificial Intelligence
Index Report 2024
Artificial Intelligence
Index Report 2024 Chapter 2: Technical Performance
Appendix
Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D. & Zeng, A. (2023). Large Language 
Models as General Pattern Machines (arXiv:2307.04721). arXiv. https:/ /doi.org/10.48550/arXiv.2307.04721.
Mitchell, M., Palmarini, A. B. & Moskvichev, A. (2023). Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning 
Tasks (arXiv:2311.09247). arXiv. http:/ /arxiv.org/abs/2311.09247.
Mokady, R., Hertz, A., Aberman, K., Pritch, Y. & Cohen-Or, D. (2022). Null-Text Inversion for Editing Real Images Using Guided 
Diffusion Models (arXiv:2211.09794). arXiv. https:/ /doi.org/10.48550/arXiv.2211.09794.
Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J. & Schölkopf, B. (2016). “Distinguishing Cause From Effect Using 
Observational Data: Methods and Benchmarks.” The Journal of Machine Learning Research 17, no. 1: 1103–1204.
Nie, A., Zhang, Y., Amdekar, A., Piech, C., Hashimoto, T. & Gerstenberg, T. (2023). MoCa: Measuring Human-Language Model 
Alignment on Causal and Moral Judgment Tasks (arXiv:2310.19677). arXiv. http:/ /arxiv.org/abs/2310.19677.
Olabi, A. G., Abdelghafar, A. A., Maghrabie, H. M., Sayed, E. T., Rezk, H., Radi, M. A., Obaideen, K. & Abdelkareem, M. A. 
(2023). “Application of Artificial Intelligence for Prediction, Optimization, and Control of Thermal Energy Storage Systems.” 
Thermal Science and Engineering Progress, 39: 101730. https:/ /doi.org/10.1016/j.tsep.2023.101730.
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., 
Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., … Zoph, B. (2024). 
GPT-4 Technical Report (arXiv:2303.08774). arXiv. https:/ /doi.org/10.48550/arXiv.2303.08774.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D. & Finn, C. (2023). Direct Preference Optimization: Your 
Language Model Is Secretly a Reward Model (arXiv:2305.18290). arXiv. http:/ /arxiv.org/abs/2305.18290.
Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J. & Bowman, S. R. (2023). GPQA: A Graduate-
Level Google-Proof Q&A Benchmark (arXiv:2311.12022). arXiv. http:/ /arxiv.org/abs/2311.12022.
----------------------------------------------------------------------------------------------------
Document 12:

90
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Factuality and Truthfulness
Despite remarkable achievements, LLMs remain 
susceptible to factual inaccuracies and content 
hallucination—creating seemingly realistic, yet false, 
information. The presence of real-world instances 
where LLMs have produced hallucinations—in 
court cases, for example—underscores the growing 
necessity of closely monitoring trends in LLM 
factuality.
TruthfulQA 
Introduced at ACL 2022, TruthfulQA is a benchmark 
designed to evaluate the truthfulness of LLMs in 
generating answers to questions. This benchmark 
comprises approximately 800 questions across 38 
categories, including health, politics, and finance. 
Many questions are crafted to challenge commonly 
held misconceptions, which typically lead humans to 
answer incorrectly (Figure 2.2.9). Although one of the 
observations of the paper is that larger models tend to 
be less truthful, GPT-4 (RLHF) released in early 2024, 
has achieved the highest performance thus far on the 
TruthfulQA benchmark, with a score of 0.6 (Figure 
2.2.10). This score is nearly three times higher than that 
of a GPT-2-based model tested in 2021, indicating that 
LLMs are becoming progressively better at providing 
truthful answers.
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Figure 2.2.9
Sample TruthfulQA questions
Source: Lin, Hilton, and Evans, 2022
2.2 Language
----------------------------------------------------------------------------------------------------
Document 13:

86
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Understanding
English language understanding challenges AI systems 
to understand the English language in various ways 
such as reading comprehension and logical reasoning.
HELM: Holistic Evaluation of Language Models  
As illustrated above, in recent years LLMs have 
surpassed human performance on traditional English-
language benchmarks, such as SQuAD (question 
answering) and SuperGLUE (language understanding). 
This rapid advancement has led to the need for more 
comprehensive benchmarks.
In 2022, Stanford researchers introduced HELM 
(Holistic Evaluation of Language Models), designed 
to evaluate LLMs across diverse scenarios, 
including reading comprehension, language 
understanding, and mathematical reasoning. 6 
HELM assesses models from several leading 
companies like Anthropic, Google, Meta, and 
OpenAI, and uses a “mean win rate” to track 
average performance across all scenarios. As of 
January 2024, GPT-4 leads the aggregate HELM 
leaderboard with a mean win rate of 0.96 (Figure 
2.2.3); however, different models top different task 
categories (Figure 2.2.4). 7 
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
0.96
0.83 0.82 0.78 0.78 0.77 0.73 0.72 0.69 0.68
GPT-4 (0613)
GPT-4 Turbo (1106 preview)
Palmyra X V3 (72B)
Palmyra X V2 (33B)
PaLM-2 (Unicorn)
Yi (34B)
Mixtral (8×7B 32K seqlen)
Anthropic Claude v1.3
PaLM-2 (Bison)
Anthropic Claude 2.00.00
0.20
0.40
0.60
0.80
1.00
Mean win rate
HELM: mean win rate
Source: CRFM, 2023 | Chart: 2024 AI Index report
GSM8K - EM
LegalBench - EM
MATH - Equivalent (CoT)
MMLU - EM
MedQA - EM
NarrativeQA - F1
NaturalQuestions (closed-book) -
F1
NaturalQuestions (open-book) - F1
OpenbookQA - EM
WMT 2014 - BLEU-4
Task
GPT-4 (0613)
GPT-4 (0613)
GPT-4 Turbo (1106 preview)
GPT-4 (0613)
GPT-4 Turbo (1106 preview)
Yi (34B)
Llama 2 (70B)
PaLM-2 (Bison)
GPT-4 (0613)
Palmyra X V3 (72B)
Leading model
0.93
0.71
0.86
0.74
0.82
0.78
0.46
0.81
0.96
0.26
Score
Leaders on individual HELM sub-benchmarks
Source: CRFM, 2023 | Table: 2024 AI Index report
Figure 2.2.3
Figure 2.2.4
6 HELM evaluates 10 scenarios: (1) NarrativeQA (reading comprehension), (2) Natural Questions (closed-book) (closed-book short-answer question answering), (3) Natural Questions 
(open-book) (open-book short-answer question answering), (4) OpenBookQA (commonsense question answering), (5) MMLU (multisubject understanding), (6) GSM8K (grade school 
math), (7) MATH (competition math), (8) LegalBench (legal reasoning), (9) MedQA (medical knowledge), and (10) WMT 2014 (machine translation). 
7 There are several versions of HELM. This section reports the score on HELM Lite, Release v1.0.0 (2023-12-19), with the data having been collected in January 2024. 
2.2 Language
----------------------------------------------------------------------------------------------------
Document 14:

193
Artificial Intelligence
Index Report 2024Chapter 3 PreviewTable of Contents
Universal and Transferable Attacks on Aligned 
Language Models 
Recent attention in AI security has centered on 
uncovering adversarial attacks capable of bypassing 
the implemented safety protocols of LLMs. Much of 
this research requires substantial human intervention 
and is idiosyncratic to specific models. However, in 
2023, researchers unveiled a universal attack capable 
of operating across various LLMs. This attack induces 
aligned models to generate objectionable content 
(Figure 3.4.9).
The method involved automatically generating suffixes 
that, when added to various prompts, compel LLMs 
to produce unsafe content. Figure 3.4.10 highlights 
the success rates of different attacking styles on 
leading LLMs. The method the researchers introduce 
is called Greedy Coordinate Gradient (GCG). The 
study demonstrates that these suffixes (the GCG 
attack) often transfer effectively across both closed 
and open models, encompassing ChatGPT, Bard, 
Claude, Llama-2-Chat, and Pythia. This study raises 
an important question as to how models can be better 
fortified against automated adversarial attacks. It 
also demonstrates how LLMs can be vulnerable to 
attacks that employ unintelligible, non-human-readable 
prompts. Current red-teaming methodologies primarily 
focus on interpretable prompts. This new research 
suggests there is a significant gap in buffering LLMs 
against attacks utilizing uninterpretable prompts.
3.4 Security and Safety
Chapter 3: Responsible AIArtificial Intelligence
Index Report 2024
Figure 3.4.9
Using suffixes to manipulate LLMs
Source: Zou et al., 2023
----------------------------------------------------------------------------------------------------
Document 15:

88
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Generation
In generation tasks, AI models are tested on their ability 
to produce fluent and practical language responses.
Chatbot Arena Leaderboard  
The rise of capable LLMs has made it increasingly 
important to understand which models are 
preferred by the general public. Launched in 2023, 
the Chatbot Arena Leaderboard is one of the 
first comprehensive evaluations of public LLM 
preference. The leaderboard allows users to query 
two anonymous models and vote for the preferred 
generations (Figure 2.2.7). As of early 2024, the 
platform has garnered over 200,000 votes, and 
users ranked OpenAI’s GPT-4 Turbo as the most 
preferred model (Figure 2.2.8).
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Figure 2.2.7
A sample model response on the Chatbot Arena Leaderboard
Source: Chatbot Arena Leaderboard, 2024
2.2 Language
----------------------------------------------------------------------------------------------------
Document 16:

148
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.12 Techniques for LLM Improvement
2.12 Techniques for LLM Improvement
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
As LLMs use increases, techniques are being sought to enhance their performance and efficiency. This section examines 
some of those advances. 
Prompting
Prompting, a vital aspect of the AI pipeline, entails 
supplying a model with natural language instructions 
that describe tasks the model should execute. 
Mastering the art of crafting effective prompts 
significantly enhances the performance of LLMs 
without requiring that models undergo underlying 
improvements. 
Graph of Thoughts Prompting
Highlighted Research:
Chain of thought (CoT) and Tree of Thoughts 
(ToT) are prompting methods that can improve 
the performance of LLMs on reasoning tasks. In 
2023, European researchers introduced another 
prompting method, Graph of Thoughts (GoT), that 
has also shown promise (Figure 2.12.1). GoT enables 
LLMs to model their thoughts in a more flexible, 
graph-like structure which more closely mirrors 
actual human reasoning. The researchers then 
designed a model architecture to implement GoT 
and found that, compared to ToT, it increased the 
quality of outputs by 62% on a sorting task while 
reducing cost by around 31% (Figure 2.12.2).
Figure 2.12.1
Graph of Thoughts (GoT) reasoning flow
Source: Besta et al., 2023
----------------------------------------------------------------------------------------------------
Document 17:

135
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
2.9 Robotics
2.9 Robotics
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
Over time, AI has become increasingly integrated into robotics, enhancing robots’ capabilities to perform complex 
tasks. Especially with the rise of foundation models, this integration allows robots to iteratively learn from their 
surroundings, adapt flexibly to new settings, and make autonomous decisions.
PaLM-E
PaLM-E is a new AI model from Google that 
merges robotics with language modeling to 
address real-world tasks like robotic manipulation 
and knowledge tasks like question answering and 
image captioning. Leveraging transformer-based 
architectures, the largest PaLM-E model is scaled 
up to 562B parameters. The model is trained 
on diverse visual language as well as robotics 
data, which results in superior performance on 
a variety of robotic benchmarks. PaLM-E also 
sets new standards in visual tasks like OK-VQA, 
excels in other language tasks, and can engage in 
chain-of-thought, mathematical, and multi-image 
reasoning, even without specific training in these 
areas. Figure 2.9.1 illustrates some of the tasks that 
the PaLM-E model can perform.
On Task and Motion Planning (TAMP) domains, 
where robots have to manipulate objects, PaLM-E 
outperforms previous state-of-the-art methods like 
SayCan and PaLI on both embodied visual question 
answering and planning (Figure 2.9.2).16 On 
robotic manipulation tasks, PaLM-E outperforms 
competing models (PaLI and CLIP-FT) in its ability 
to detect failures, which is a crucial step for robots 
to perform closed-loop planning (Figure 2.9.3).
PaLM-E is significant in that it demonstrates that 
language modeling techniques as well as text 
data can enhance the performance of AI systems 
in nonlanguage domains, like robotics. PaLM-E 
also highlights how there are already linguistically 
adept robots capable of real-world interaction and 
high-level reasoning. Developing these kinds of 
multifaceted robots is an essential step in creating 
more general robotic assistants that can, for 
example, assist in household work.
Highlighted Research:
16 Embodied Visual Question Answering (Embodied VQA) is a task where agents need to navigate through 3D environments and answer questions about the objects they 
visually perceive in the environment.
----------------------------------------------------------------------------------------------------
Document 18:

87
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
MMLU: Massive Multitask Language 
Understanding  
The Massive Multitask Language Understanding 
(MMLU) benchmark assesses model performance in 
zero-shot or few-shot scenarios across 57 subjects, 
including the humanities, STEM, and social sciences 
(Figure 2.2.5). MMLU has emerged as a premier 
benchmark for assessing LLM capabilities: Many state-
of-the-art models like GPT-4, Claude 2, and Gemini have 
been evaluated against MMLU.
In early 2023, GPT-4 posted a state-of-the-art score 
on MMLU, later surpassed by Google’s Gemini Ultra. 
Figure 2.2.6 highlights the top model scores on the 
MMLU benchmark in different years. The scores 
reported are the averages across the test set. As of 
January 2024, Gemini Ultra holds the top score of 
90.0%, marking a 14.8 percentage point improvement 
since 2022 and a 57.6 percentage point increase since 
MMLU’s inception in 2019. Gemini Ultra’s score was 
the first to surpass MMLU’s human baseline of 89.8%.
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
2019 2020 2021 2022 2023
30%
40%
50%
60%
70%
80%
90%
Average accuracy (%)
90.04%
MMLU: average accuracy
Source: Papers With Code, 2023 | Chart: 2024 AI Index report
89.8%, human baseline
Figure 2.2.5
Figure 2.2.6
A sample question from MMLU
Source: Hendrycks et al., 2021
2.2 Language
----------------------------------------------------------------------------------------------------
Document 19:

http:/ /arxiv.org/abs/2308.06595.
Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S. W., Fidler, S. & Kreis, K. (2023). Align Your Latents: High-Resolution 
Video Synthesis With Latent Diffusion Models (arXiv:2304.08818). arXiv. http:/ /arxiv.org/abs/2304.08818.
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C.,  
Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., … Zitkovich, B. 
(2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. (arXiv:2307.15818). arXiv.  
https:/ /arxiv.org/abs/2307.15818.
Castaño, J., Martínez-Fernández, S., Franch, X. & Bogner, J. (2023). Exploring the Carbon Footprint of Hugging Face’s ML 
Models: A Repository Mining Study. 2023 ACM/IEEE International Symposium on Empirical Software Engineering and 
Measurement (ESEM), 1–12. https:/ /doi.org/10.1109/ESEM56168.2023.10304801.
Chen, L., Chen, Z., Zhang, Y., Liu, Y., Osman, A. I., Farghali, M., Hua, J., Al-Fatesh, A., Ihara, I., Rooney, D. W. & Yap, P.-S. (2023). 
“Artificial Intelligence-Based Solutions for Climate Change: A Review.” Environmental Chemistry Letters 21, no. 5: 2525–57. 
https:/ /doi.org/10.1007/s10311-023-01617-y.
Chen, L., Zaharia, M. & Zou, J. (2023). How Is ChatGPT’s Behavior Changing Over Time? (arXiv:2307.09009). arXiv.  
http:/ /arxiv.org/abs/2307.09009.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., 
Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., … Zaremba, W. (2021). Evaluating Large 
Language Models Trained on Code (arXiv:2107.03374; Version 2). arXiv. https:/ /doi.org/10.48550/arXiv.2107.03374.
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S. & Amodei, D. (2023). Deep Reinforcement Learning From Human 
Preferences (arXiv:1706.03741). arXiv. https:/ /doi.org/10.48550/arXiv.1706.03741.
----------------------------------------------------------------------------------------------------
Document 20:

140
Artificial Intelligence
Index Report 2024Chapter 2 PreviewTable of Contents
Direct Preference Optimization
Highlighted Research:
Chapter 2: Technical PerformanceArtificial Intelligence
Index Report 2024
As illustrated above, RLHF is a useful method 
for aligning LLMs with human preferences. 
However, RLHF requires substantial computational 
resources, involving the training of multiple 
language models and integrating LM policy 
sampling within training loops. This complexity  
can hinder its broader adoption.
In response, researchers from Stanford and CZ 
Biohub have developed a new reinforcement 
learning algorithm for aligning models named 
Direct Preference Optimization (DPO). DPO is 
simpler than RLHF but equally effective. The 
researchers show that DPO is as effective as other 
existing alignment methods, such as Proximal 
Policy Optimization (PPO) and Supervised Fine-
Tuning (SFT), on tasks like summarization (Figure 
2.10.5). The emergence of techniques like DPO 
suggests that model alignment methods are 
becoming more straightforward and accessible.
0.00 0.25 0.50 0.75 1.00
0.00
0.10
0.20
0.30
0.40
0.50
0.60
0.70
DPO PPO Best of 128 SFT Preferred-FT GPT-J
Sampling temperature
Win rate
Comparison of dierent algorithms on TL;DR summarization task across dierent sampling temperatures
Source: Rafailov et al., 2023 | Table: 2024 AI Index report
Human baseline
Figure 2.10.5
2.10 Reinforcement Learning

Next, let's set up LLMLingua. We use GPT-2 as the document compression model, but it can be swapped for other models such as Llama-2.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import LLMLinguaCompressor

compressor = LLMLinguaCompressor(model_name="openai-community/gpt2", device_map="cpu")
# compressor = LLMLinguaCompressor(model_name="NousResearch/Llama-2-7b-hf", device_map="cpu")
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)
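
As an aside, LLMLinguaCompressor exposes a few more knobs than just the model choice. In the langchain-community 0.3.x implementation, target_token sets the token budget of the compressed output and instruction describes the task whose wording should be preserved; the values below are illustrative assumptions, not tuned recommendations.

# A more explicit configuration (values are illustrative)
compressor = LLMLinguaCompressor(
    model_name="openai-community/gpt2",
    device_map="cpu",
    target_token=300,  # token budget for the compressed documents (assumed value)
    instruction="Given these documents, please answer the question",
)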

Now let's verify how much the LLMLingua pipeline compresses the RAG search results. Comparing against the original retrieval output, the retrieved documents are dramatically more compact.

compressed_docs = compression_retriever.invoke(query)
pretty_print_docs(compressed_docs)

The compressed output of pretty_print_docs is shown below.

Document 1:

#ref0
ArtificialIndex 2024Chapter PreviewTable of Techniques for LLChapter PerformanceArt 2024Fine
Finetun as a enhancing LLMs and  adjusting models on smaller datasets Fine overall  sharp the model�s abilities tasks. also allows pre the.79 902 9
66 9 9,022
348uanacoB GuanacoB ChatGPT VicacoB Guanaco 65B GPT4
600
800,0001,200
400
Elo rating (
Model competitions based on 10,-4 Vic
: Dettmers et al., 2023 Chart 2024 AI Index.
QLo
Highlight Research:QLo researchers the University of in 2023, a 
for more efficient model fineing. It  usage enabling the-tun  65 billion parameter model 
 while 16bit fineing 
. To put this perspective,-tun ama model leading-source LLM  requires GB of GPU 
, QLoRA 16 times efficientQLRA manages to increase efficiency with 
techniques like a 4-bit NormalFloat (NF4), double 
quantization, and page optimizers. QLoRA is 
used to train a model named Guanaco, which 
matched or even surpassed models like ChatGPT 
in performance on the Vicuna benchmark (a 
benchmark that ranks the outputs of LLMs) (Figure 
2.12.5). Remarkably, the Guanaco models were 
created with just 24 hours of fine-tuning on a single 
GPU. QLoRa highlights how methods for optimizing 
and further improving models have become more 
efficient, meaning fewer resources will be required 
to make increasingly capable models.
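
To put a number on the compression, you can count tokens before and after. A short sketch using tiktoken; any consistent tokenizer works for a relative comparison:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

def total_tokens(docs):
    # Sum token counts over all retrieved document chunks
    return sum(len(enc.encode(d.page_content)) for d in docs)

print(f"original:   {total_tokens(docs)} tokens")
print(f"compressed: {total_tokens(compressed_docs)} tokens")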

We have confirmed that the RAG search results are heavily compressed. Next, let's check whether answer quality suffers when the LLM works from this compressed context. We pose a question through the RetrievalQA module and confirm that the answer reads naturally.

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
chain = RetrievalQA.from_chain_type(llm=llm, retriever=compression_retriever)

response = chain.invoke({"query": query})
display(response)
{'query': 'What role does QLoRA play in large language model?',
 'result': "QLoRA plays a significant role in large language modeling by enabling more efficient model fine-tuning. It allows for the adjustment and enhancement of models on smaller datasets, ultimately sharpening the model's abilities for various tasks. QLoRA achieves this by utilizing techniques like 4-bit NormalFloat (NF4), double quantization, and page optimizers to increase efficiency. Notably, models like Guanaco trained using QLoRA have matched or even surpassed models like ChatGPT in performance on benchmarks like Vicuna, showcasing the effectiveness of QLoRA in optimizing and improving models with fewer resources."}
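
To compare against the uncompressed baseline, both for answer quality and for cost, you can run the same chain with the plain retriever and track token usage. A sketch assuming the get_openai_callback helper from langchain-community; its cost figures are approximate:

from langchain_community.callbacks import get_openai_callback

for name, r in [("compressed", compression_retriever), ("original", retriever)]:
    qa = RetrievalQA.from_chain_type(llm=llm, retriever=r)
    with get_openai_callback() as cb:
        qa.invoke({"query": query})
    # Fewer prompt tokens on the compressed side means lower cost per query
    print(f"{name}: {cb.prompt_tokens} prompt tokens, ${cb.total_cost:.4f}")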

That wraps up our experiment compressing RAG search results with the LLMLingua Document Compressor. The results confirm that the two goals, substantial compression of the retrieved documents and preservation of LLM answer quality, can be achieved at the same time.

Closing Thoughts

In this article, we covered the basics of the LLMLingua Document Compressor and examined its performance. By compressing prompts substantially, it can significantly speed up LLM processing, and it promises to be useful across a wide range of areas: handling large volumes of text, cutting costs, and broadening the range of LLM applications.

Development around LLMLingua is also moving fast, with LongLLMLingua extending the approach to very long inputs and LLMLingua-2 pushing compression ratios even further. With these derivative techniques, LLMs should be able to take on ever more complex tasks, with the potential to reshape how we live and work.

More Information:

  • Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu, "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," arXiv:2310.05736, https://arxiv.org/abs/2310.05736