It has been another month of AI research, and it is hard to pick a favorite.
Beyond the new research, there were also many other notable announcements. Among them, xAI open-sourced its Grok-1 model, which, at 314 billion parameters, is the largest open-source model to date. In addition, Claude 3 has reportedly been approaching or even exceeding GPT-4's performance. Then there were Open-Sora 1.0 (a fully open-source project for video generation), Eagle 7B (a new RWKV-based model), Mosaic's 132-billion-parameter DBRX (a mixture-of-experts model), and AI21's Jamba (a Mamba-based SSM-transformer model).
However, since detailed information about these models is rather sparse, I will focus on discussing research papers instead. This month, I am covering a paper on strategies for the continued pretraining of large language models, followed by a discussion of reward modeling as used in reinforcement learning from human feedback (a popular LLM alignment method), along with a new benchmark.
Continued pretraining is an important topic for large language models (LLMs) because it allows us to update existing LLMs, for instance, to keep them in sync with the latest information and trends. It also lets us adapt them to new target domains without retraining them from scratch.
Reward modeling is important because it allows us to align LLMs more closely with human preferences and, to some extent, helps with safety. Beyond optimizing for human preferences, it also provides a mechanism for teaching and adapting LLMs to complex tasks by providing instruction-output examples in cases where explicitly programming the correct behavior is challenging or impractical.
Happy reading!
1. Simple and Scalable Strategies to Continually Pre-train Large Language Models
We often discuss finetuning LLMs to follow instructions. In practice, however, updating LLMs with new knowledge or domain-specific data is just as important. The recent paper Simple and Scalable Strategies to Continually Pre-train Large Language Models offers valuable insights into how to continue pretraining an LLM on new data.
Specifically, the researchers compare models trained in three different ways:
Regular pretraining: The model is initialized with random weights and pretrained on dataset D1.
Continued pretraining: The pretrained model from the scenario above is further pretrained on dataset D2.
Retraining on the combined dataset: The model is initialized with random weights, as in the first scenario, but it is trained on the combination (union) of datasets D1 and D2.
Method three, retraining on the combined dataset, is common practice in the field, as I wrote last year when discussing the BloombergGPT paper. That is because retraining usually helps with finding a good learning rate schedule (typically a linear warmup followed by a half-cycle cosine decay) and helps mitigate catastrophic forgetting.
Catastrophic forgetting refers to the phenomenon where a neural network, especially in sequential learning tasks, forgets previously learned information upon learning new information. This is particularly problematic in models trained across diverse datasets or tasks over time.
Hence, by retraining the model on a combined dataset that contains both old and new information, the model can adapt to the new data while maintaining its performance on previously learned tasks.
1.1 Takeaways and Results
This 24-page paper reports a large number of experimental results and comes with countless figures, which makes it very thorough by today's standards. To boil it down into a digestible form, the figure below summarizes the main result: continued pretraining can reach the same good performance as retraining from scratch, without requiring the combined dataset.

Continued pretraining costs half as much as retraining (since a pretrained model already exists and only half of the data needs to be used) while reaching the same good performance. Source: annotated figure from https://arxiv.org/abs/2403.08763.
What are the "tricks" to successfully apply continued pretraining?
Re-warming and re-decaying the learning rate (see the next section).
Adding a small fraction (e.g., 5%) of the original pretraining data (D1) to the new dataset (D2) to prevent catastrophic forgetting. Note that smaller fractions such as 0.5% and 1% also worked. (A minimal sketch of this data mixing follows below.)
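To make the second point more concrete, here is a minimal sketch of how one might mix a small fraction of the original pretraining data into the new dataset. The function name, the toy document lists, and the exact way the replay fraction is computed are illustrative assumptions, not the paper's implementation.

```python
import random

def mix_datasets(d1_samples, d2_samples, d1_fraction=0.05, seed=42):
    """Add a small fraction of the original pretraining data (D1) to the
    new dataset (D2) to counteract catastrophic forgetting.

    The 5% default mirrors the fraction mentioned above; whether the
    fraction is computed relative to D2 or the total token budget is an
    implementation detail left open here.
    """
    rng = random.Random(seed)
    n_replay = int(len(d2_samples) * d1_fraction)
    replay_subset = rng.sample(d1_samples, min(n_replay, len(d1_samples)))
    mixed = list(d2_samples) + replay_subset
    rng.shuffle(mixed)
    return mixed

# Toy usage with placeholder "documents"
d1 = [f"old_doc_{i}" for i in range(10_000)]
d2 = [f"new_doc_{i}" for i in range(5_000)]
continued_pretraining_data = mix_datasets(d1, d2, d1_fraction=0.05)
```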
1.2 Learning Rate Schedules
When pretraining or finetuning LLMs, it is common to use a learning rate schedule that starts with a linear warmup followed by a half-cycle cosine decay, as shown below.

A common learning rate schedule for pretraining and finetuning LLMs. Source: Build a Large Language Model From Scratch, https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-D/01_main-chapter-code/appendix-D.ipynb
As shown in the figure above, during the linear warmup phase, the learning rate starts at a low value and is gradually increased to a predetermined value early in training. This helps stabilize the model's weight parameters before the main training phase. After the warmup period, a cosine decay schedule gradually lowers the learning rate over the remainder of training.
Given that pretraining ends with a very low learning rate, how do we adjust the learning rate for continued pretraining? Typically, we re-introduce a warmup phase followed by another decay phase, which is referred to as re-warming and re-decaying. Simply put, we use exactly the same learning rate schedule as in the initial pretraining stage (sketched below).
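Here is a minimal sketch of such a schedule as a plain Python function. The step counts and learning rate values are placeholders; for continued pretraining, the idea is simply to run this same schedule again on the new dataset.

```python
import math

def lr_at_step(step, warmup_steps=1_000, total_steps=100_000,
               peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup followed by a half-cycle cosine decay.

    For continued pretraining, the same schedule is simply run again
    (re-warming and re-decaying) on the new dataset.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # half-cycle cosine
    return min_lr + (peak_lr - min_lr) * cosine

# Inspect the schedule at a few steps
for s in (0, 500, 1_000, 50_000, 100_000):
    print(s, f"{lr_at_step(s):.6f}")
```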

Learning rate schedules for continued pretraining. Figure based on Build a Large Language Model From Scratch, https://github.com/rasbt/LLMs-from-scratch/blob/main/appendix-D/01_main-chapter-code/appendix-D.ipynb
The authors found that re-warming and re-decaying indeed works well. In addition, they compared it to a so-called "infinite learning rate" schedule, which refers to the schedule outlined in the 2021 Scaling Vision Transformers paper. This schedule starts with a gentle cosine (or optionally inverse-square-root) decay, transitions to a constant learning rate, and ends with a sharp decay for annealing.
Infinite learning rate schedules can be convenient because pretraining can be stopped at any point during the constant learning rate phase via a short annealing phase (rather than having to complete the cosine half-cycle). However, the results shown above indicate that an "infinite learning rate" schedule is not necessary for pretraining and continued pretraining: the common re-warming and re-decaying approach reaches the same final loss as the infinite learning rate schedule.
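For comparison, here is a rough sketch of what such an "infinite" schedule might look like. The phase lengths, decay shapes, and learning rate values are illustrative assumptions rather than the exact configuration from the paper.

```python
import math

def infinite_lr_at_step(step, warmup_steps=1_000, cooldown_steps=10_000,
                        peak_lr=3e-4, constant_lr=1e-4, min_lr=3e-5,
                        anneal_start=None, anneal_steps=2_000):
    """Warmup -> gentle cosine decay to a plateau -> constant learning
    rate -> optional sharp annealing phase that can begin at any step.
    """
    if anneal_start is not None and step >= anneal_start:
        t = min((step - anneal_start) / anneal_steps, 1.0)
        return constant_lr + (min_lr - constant_lr) * t  # sharp final decay
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    if step < warmup_steps + cooldown_steps:
        t = (step - warmup_steps) / cooldown_steps
        cosine = 0.5 * (1.0 + math.cos(math.pi * t))
        return constant_lr + (peak_lr - constant_lr) * cosine  # gentle decay
    return constant_lr  # plateau: training can be stopped and annealed at any time
```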
1.3 Conclusion and Caveats
As far as I can tell, re-warming and re-decaying, as well as adding a portion of the original pretraining data to the new data, are more or less common knowledge. However, I really appreciate that the researchers took the time to formally test these methods and lay them out in this detailed 24-page report.
In addition, it was interesting to see that the "infinite learning rate" schedule is not necessary and that the same final loss can be achieved with the common linear warmup followed by a half-cycle cosine decay.
While I appreciate the comprehensive suite of experiments in this paper, one potential caveat is that most experiments were conducted on relatively small 405M-parameter models with a classic LLM architecture (GPT-NeoX). However, the authors show that the results also hold for a 10B-parameter model, which gives reason to believe that they apply to larger models (e.g., 70B parameters) and possibly other architecture variants as well.
The researchers also focused on pretraining datasets of similar size. The appendix shows that the results are consistent when the continued pretraining dataset is only 50% or 30% the size of the initial pretraining dataset. An interesting follow-up study would be to investigate whether these trends and recommendations still hold when the continued pretraining dataset is much smaller than the initial one, which is common in practice.
Another interesting direction for future research would be to test how continued pretraining affects the instruction-following abilities of instruction-finetuned LLMs. In particular, I would like to know whether another round of instruction finetuning is needed after updating an LLM's knowledge via continued pretraining.
As a side note, if you are interested in efficient LLM pretraining, we recently open-sourced a PyTorch compiler called Thunder.
When my colleagues applied it to LitGPT, the open-source LLM library I help develop, they achieved a 40% runtime performance improvement when pretraining a Llama 2 7B model.
2. Evaluating Reward Models for Language Modeling
RewardBench: Evaluating Reward Models for Language Modeling introduces a benchmark for reward models used in reinforcement learning from human feedback (RLHF), the popular instruction-finetuning and alignment procedure for LLMs.
Before we get to the main takeaways of this paper, let's take a quick detour and briefly discuss RLHF and reward modeling in the next section.
2.1 An Introduction to Reward Modeling and RLHF
RLHF aims to improve LLMs so that the outputs they generate align more closely with human preferences, which usually refers to the helpfulness and harmlessness of the model's responses. I described the RLHF process in more detail in an earlier article: https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives.
Note that the focus of this paper is on evaluating the performance of the reward models themselves, not the instruction-following LLMs (such as ChatGPT and Llama 2-chat) that result from instruction finetuning. The process used to create these instruction-following LLMs, namely the RLHF procedure, is summarized in the figure below.
As shown in the figure above, creating the reward model is an intermediate step in the RLHF process. Moreover, the reward model is itself an LLM.
The difference between the reward model and the original base LLM is that we adapt the reward model's output layer so that it returns a score that can serve as a reward label. To achieve this, there are two options: (1) replace the existing output layer with a new linear layer that produces a single logit, or (2) repurpose one of the existing output logits and finetune it using the reward labels.
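Below is a minimal PyTorch sketch of option (1). The backbone interface (a module that maps token IDs of shape batch × sequence to hidden states of shape batch × sequence × hidden) is an assumption made for illustration, not a specific library API.

```python
import torch.nn as nn

class RewardModel(nn.Module):
    """Wraps a pretrained LLM backbone and replaces its output (unembedding)
    layer with a linear head that returns a single scalar score per input.

    `backbone` is assumed to map token IDs of shape (batch, seq_len) to
    hidden states of shape (batch, seq_len, hidden_size).
    """
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.reward_head = nn.Linear(hidden_size, 1)  # option (1): new linear output layer

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)       # (batch, seq_len, hidden_size)
        last_hidden = hidden[:, -1, :]          # hidden state at the final token position
        return self.reward_head(last_hidden).squeeze(-1)  # (batch,) reward scores
```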
The procedure and loss function for training a reward model resemble those used for training a neural network classifier. In regular binary classification, we predict whether an input example belongs to class 1 or class 0. We model this class-membership probability, that is, the probability that an input example belongs to class 1, using the logistic function.
The main concepts behind binary classification via the logistic function are summarized in the figure below.
If the logistic function for training classifiers is new to you, you can find more information here:
The PyTorch article on optimizing the negative log-likelihood and cross-entropy losses
Unit 4 of my free lecture series, Training Multilayer Neural Networks (specifically the 5+3+5=13 videos in Units 4.1, 4.2, and 4.3; alternatively, these videos are also available on YouTube, linked here)
For reward modeling, we could train the reward model using the logistic loss from binary classification, where outcomes are labeled 0 or 1.
However, for reward models, it is more common to use something like the Bradley-Terry model, which is designed specifically for pairwise comparison tasks, where the goal is not to classify items into categories independently but to determine preferences or rankings between items.
The Bradley-Terry model is particularly useful in scenarios that focus on relative comparisons, such as "Which of these two items is preferred?" rather than absolute classifications such as "Is this item a 0 or a 1?"
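In code, this pairwise objective boils down to maximizing the log-sigmoid of the difference between the chosen and rejected reward scores. Here is a minimal sketch with toy tensors; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry-style pairwise loss for reward models:
    maximize log sigmoid(r_chosen - r_rejected), i.e., the log-probability
    that the chosen response receives the higher reward.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scores for three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.8, -0.1, 1.5])
print(pairwise_reward_loss(chosen, rejected))  # small loss, since chosen > rejected
```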
2.2 Reinforcement Learning from Human Feedback (RLHF) vs. Direct Preference Optimization (DPO)
In most models, such as Llama 2 and OpenAI's InstructGPT (presumably the method behind the ChatGPT models), the reward model is trained as a classifier to predict the human preference probability between two answers, as described above.
However, training a reward model requires an additional step, and in practice it would be easier if we could optimize for the reward directly without creating an explicit reward model. This approach, known as Direct Preference Optimization (DPO), has recently become very popular.
In DPO, the goal is to optimize the policy π (where "policy" is the term used for the model being trained) such that it maximizes the expected reward while staying reasonably close to a reference policy π_ref. This helps the new policy π retain some of the desirable properties (such as stability or safety) of π_ref.
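For reference, the DPO objective as introduced in the original DPO paper can be written as follows, where y_w and y_l denote the preferred and rejected responses for a prompt x, and σ is the logistic sigmoid:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$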
The β in the equation above acts as a temperature parameter that controls how sensitive the probability distribution is to differences in the policy scores. A higher beta makes the distribution more sensitive to differences, resulting in a steeper function where preferences between options are more pronounced. A lower beta makes the model less sensitive to score differences, resulting in a flatter function that represents weaker preferences. In essence, beta calibrates how strongly preferences are expressed in the probability model.
Thanks to its relative simplicity, namely that no separate reward model needs to be trained, LLMs finetuned via DPO have become very popular. But the elephant in the room is: how well does it actually perform? According to the original DPO paper, DPO performs very well, as shown in the table below. However, this has to be taken with a grain of salt, because RLHF with a dedicated reward model (that is, RLHF-PPO) is harder to train and requires larger datasets and more compute, so the comparison may not reflect how the best DPO model stacks up against the best RLHF-PPO model.
Moreover, many DPO models can be found on most LLM leaderboards. However, because RLHF with a dedicated reward model is more involved than DPO, there simply exist more DPO models. It is therefore hard to say whether DPO is actually better in a direct comparison, since equivalent counterparts of these models do not exist (that is, models trained with DPO versus RLHF with a dedicated reward model on exactly the same architecture and dataset).
2.3 RewardBench
After this brief introduction to RLHF and reward modeling, this section turns to the RewardBench: Evaluating Reward Models for Language Modeling paper itself, which proposes a benchmark to evaluate reward models as well as the reward scores of DPO models.
The proposed benchmark suite evaluates the scores assigned to the chosen (preferred) and rejected responses, as shown in the figure below.

RewardBench casts reward model and DPO model evaluation as a prediction task and computes how often a method selects the "chosen" (preferred) response. (Annotated figure from the RewardBench paper, https://arxiv.org/abs/2403.13787)
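In other words, the core metric is simply the fraction of prompts for which a model assigns a higher score to the chosen response. A minimal sketch, with made-up scores, might look like this:

```python
import torch

def rewardbench_accuracy(chosen_scores, rejected_scores):
    """Fraction of prompts for which the model assigns a higher score to the
    chosen (preferred) response than to the rejected one."""
    return (chosen_scores > rejected_scores).float().mean().item()

# Made-up scores for three prompts
chosen = torch.tensor([2.1, 0.4, 1.0])
rejected = torch.tensor([1.5, 0.9, 0.2])
print(rewardbench_accuracy(chosen, rejected))  # 0.666... (2 out of 3 correct)
```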
The next table lists the top 20 models according to the RewardBench evaluation. The data in this table essentially confirms what I mentioned earlier: many DPO models can be found on most LLM leaderboards, most likely because DPO is simpler to use than RLHF with a dedicated reward model, so more DPO models have been developed.

Note that the difference between existing leaderboards and RewardBench lies in what they evaluate. While other leaderboards evaluate the question-answering and conversational performance of LLMs trained with reward models, RewardBench focuses on the reward scores used to train those LLMs.
Another interesting takeaway from this paper is that the measured reward accuracy correlates positively with model size, as one would expect, as shown in the table below. (Unfortunately, this comparison is only available for the DPO models.)

2.4 Conclusion, Caveats, and Suggestions for Future Research
While this paper does not introduce any new LLM finetuning methodology, it offers a nice opportunity to discuss reward modeling and DPO. Moreover, it is nice to finally see a benchmark for evaluating reward models, and the researchers deserve credit for creating and sharing it.
As a small caveat, it would be interesting to know whether the RewardBench rankings correlate strongly with the public leaderboard results of the LLM chat models produced with these reward models. Since both the public leaderboard data and the RewardBench data are openly available, this should motivate someone to analyze the data in a future paper.
Another small caveat, which the authors acknowledge in the paper, is that RewardBench is somewhat skewed toward DPO models, simply because far more DPO models than reward models currently exist.
For future research, it would be interesting to see controlled experiments that compare RLHF reward models and DPO models trained with a fixed compute budget and dataset to determine which approach performs better.
Ahead of AI is a personal passion project that does not offer direct financial compensation. However, for those who would like to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues.
Other Interesting Research Papers (March 2024)
Below is a selection of other interesting papers I stumbled upon this month. Given the length of this list, I highlighted the ones I found particularly interesting with an asterisk (*). However, please note that this list and its annotations are purely based on my interests and relevance to my own projects.
Model Stock: All We Need Is Just a Few Fine-Tuned Models by Jang, Yun, and Han (28 Mar), https://arxiv.org/abs/2403.19522
The paper presents an efficient finetuning technique called Model Stock that uses just two models for layer-wise weight averaging.
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions by Zhang, Luan, Hu, et al. (28 Mar), https://arxiv.org/abs/2403.19651
MagicLens is a self-supervised image retrieval model framework that leverages text instructions to facilitate the search for images based on a broad spectrum of relations beyond visual similarity.
Mechanistic Design and Scaling of Hybrid Architectures by Poli, Thomas, Nguyen, et al. (26 Mar), https://arxiv.org/abs/2403.17844
This paper introduces a mechanistic architecture design pipeline that simplifies deep learning development by using synthetic tasks for efficient architecture evaluation, revealing that hybrid and sparse architectures outperform traditional models in scalability and efficiency.
* LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning by Pan, Liu, Diao, et al. (26 Mar), https://arxiv.org/abs/2403.17919
This research introduces a simple technique of randomly freezing middle layers during training based on importance sampling, which is efficient and can outperform both LoRA and full LLM finetuning by a noticeable margin in terms of model performance.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models by Li, Zhang, Wang et al. (27 Mar), https://arxiv.org/abs/2403.18814
Mini-Gemini is a framework aimed at improving multi-modal vision language models (VLMs) through high-resolution visual tokens, a high-quality dataset, and VLM-guided generation.
Long-form Factuality in Large Language Models by Wei, Yang, Song, et al. (27 Mar), https://arxiv.org/abs/2403.18802
LongFact is a comprehensive prompt set for benchmarking the long-form factuality of LLMs across 38 topics.
ViTAR: Vision Transformer with Any Resolution by Fan, You, Han, et al. (27 Mar), https://arxiv.org/abs/2403.18361
This paper addresses the challenge of Vision Transformers' limited scalability across various image resolutions, introducing dynamic resolution adjustment and fuzzy positional encoding.
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text by Bolton, Venigalla, Yasunaga, et al. (27 Mar), https://arxiv.org/abs/2403.18421
BioMedLM is a compact GPT-style LLM trained on biomedical papers from PubMed, serving as another nice case study for creating "small," specialized, yet capable LLMs.
The Unreasonable Ineffectiveness of the Deeper Layers by Gromov, Tirumala, Shapourian, et al. (26 Mar), https://arxiv.org/abs/2403.17887
The study demonstrates that selectively pruning up to half the layers of pretrained LLMs, followed by strategic finetuning with quantization and QLoRA, minimally impacts performance on question-answering tasks.
LLM Agent Operating System by Mei, Li, Xu, et al. (25 Mar), https://arxiv.org/abs/2403.16971
This paper introduces AIOS, an operating system designed to integrate LLMs with intelligent agents.
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement by Lee, Wattanawong, Kim, et al. (22 Mar), https://arxiv.org/abs/2403.15042
LLM2LLM is a data augmentation strategy that improves the performance of large language models in low-data scenarios by using a teacher model to generate synthetic data from errors made by a student model during initial training.
Can Large Language Models Explore In-Context? by Krishnamurthy, Harris, Foster, et al. (22 Mar), https://arxiv.org/abs/2403.15371
This study finds that contemporary Large Language Models, including GPT-3.5, GPT-4, and Llama2, do not reliably engage in exploratory behavior in multi-armed bandit environments without significant interventions.
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time Series by Patro and Agneeswaran (22 Mar), https://arxiv.org/abs/2403.15360
SiMBA introduces a novel architecture combining Einstein FFT for channel modeling and the Mamba block for sequence modeling to address stability issues in large-scale networks in both image and time-series domains.
RakutenAI-7B: Extending Large Language Models for Japanese by Levine, Huang, Wang, et al. (21 Mar), https://arxiv.org/abs/2403.15484
RakutenAI-7B is a Japanese-oriented suite of large language models under the Apache 2.0 license, including specialized instruction and chat models, achieving top performance on the Japanese LM Harness benchmarks.
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models by Zheng, Zhang, Zhang, et al. (20 Mar), https://arxiv.org/abs/2403.13372
LlamaFactory introduces a versatile framework with a user-friendly web UI, LlamaBoard, enabling efficient, code-free finetuning of over 100 large language models.
* RewardBench: Evaluating Reward Models for Language Modeling by Lambert, Pyatkin, Morrison, et al. (20 Mar), https://arxiv.org/abs/2403.13787
The paper introduces RewardBench, a benchmark dataset and toolkit designed for the comprehensive evaluation of reward models used in Reinforcement Learning from Human Feedback (RLHF) to align pretrained language models with human preferences.
* PERL: Parameter Efficient Reinforcement Learning from Human Feedback by Sidahmed, Phatale, Hutcheson, et al. (19 Mar), https://arxiv.org/abs/2403.10704
This work introduces Parameter Efficient Reinforcement Learning (PERL) using Low-Rank Adaptation (LoRA) for training models with Reinforcement Learning from Human Feedback (RLHF), a method that aligns pretrained base LLMs with human preferences efficiently.
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression by Hong, Duan, Zhang, et al. (18 Mar), https://arxiv.org/abs/2403.15447
This study analyzes the complex relationship between LLM compression techniques and trustworthiness, finding that quantization is better than pruning for maintaining efficiency and trustworthiness.
TnT-LLM: Text Mining at Scale with Large Language Models by Wan, Safavi, Jauhar, et al. (18 Mar), https://arxiv.org/abs/2403.12173
The paper introduces TnT-LLM, a framework leveraging LLMs for automating label taxonomy generation and assignment with minimal human input.
* RAFT: Adapting Language Model to Domain Specific RAG by Zhang, Patil, Jain, et al. (15 Mar), https://arxiv.org/abs/2403.10131
This paper introduces Retrieval Augmented FineTuning (RAFT) for enhancing LLMs for open-book, in-domain question answering by training them to identify and disregard non-helpful "distractor" documents while accurately citing relevant information from the right sources.
* MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training by McKinzie, Gan, Fauconnier, et al. (14 Mar), https://arxiv.org/abs/2403.09611
This work advances multimodal LLMs by analyzing architecture and data strategies and proposes the 30B MM1 model series, which excels in pretraining and finetuning across benchmarks.
GiT: Towards Generalist Vision Transformer through Universal Language Interface by Wang, Tang, Jiang, et al. (14 Mar), https://arxiv.org/abs/2403.09394
GiT is a framework leveraging a basic Vision Transformer (ViT) for a wide range of vision tasks that is focused on simplifying the architecture by using a universal language interface for tasks like captioning, detection, and segmentation.
LocalMamba: Visual State Space Model with Windowed Selective Scan by Huang, Pei, You, et al. https://arxiv.org/abs/2403.09338
This work improves Vision Mamba tasks by optimizing scan directions, employing a local scanning method to better capture 2D dependencies and a dynamic layer-specific scan optimization, which leads to substantial performance gains on benchmarks like ImageNet.
BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences by Ao, Zhao, Han, et al. (14 Mar), https://arxiv.org/abs/2403.09347
"BurstAttention" optimizes distributed attention in Transformer-based models for long sequences, cutting communication overhead by 40% and doubling processing speed on GPUs.
Language Models Scale Reliably With Over-Training and on Downstream Tasks by Gadre, Smyrnis, Shankar, et al. (13 Mar) https://arxiv.org/abs/2403.08540
This paper explores the gaps in scaling laws for LLMs by focusing on overtraining and the relationship between model perplexity and downstream task performance.
* Simple and Scalable Strategies to Continually Pre-train Large Language Models, by Ibrahim, Thérien, Gupta, et al. (13 Mar), https://arxiv.org/abs/2403.08763
This work demonstrates that LLMs can be efficiently updated with new data through a combination of simple learning rate rewarming and adding a small fraction of previous training data to counteract catastrophic forgetting.
Chronos: Learning the Language of Time Series by Ansari, Stella, Turkmen, et al. (12 Mar), https://arxiv.org/abs/2403.07815
Chronos applies transformer-based models to time series forecasting, achieving good performance on both known and unseen datasets by training on a mix of real and synthetic data.
* Stealing Part of a Production Language Model by Carlini, Paleka, Dvijotham, et al. (11 Mar), https://arxiv.org/abs/2403.06634
Researchers present a new model-stealing attack capable of precisely extracting information from black-box language models like OpenAI's ChatGPT and Google's PaLM-2 (revealing for the first time the hidden dimensions of these models).
Algorithmic Progress in Language Models by Ho, Besiroglu, and Erdil (9 Mar), https://arxiv.org/abs/2403.05812
The study finds that since 2012, the computational efficiency for pretraining language models (including large language models) has doubled approximately every 8 months, a pace much faster than the hardware advancements predicted by Moore's Law.
LLM4Decompile: Decompiling Binary Code with Large Language Models by Tan, Luo, Li, and Zhang (8 Mar), https://arxiv.org/abs/2403.05286
This paper describes the release of open-source LLMs for decompilation, pretrained on a substantial dataset comprising both C source code and corresponding assembly code.
Is Cosine-Similarity of Embeddings Really About Similarity? by Steck, Ekanadham, and Kallus (8 Mar), https://arxiv.org/abs/2403.05440
The paper examines the effectiveness and limitations of using cosine similarity for determining semantic similarities between high-dimensional objects through low-dimensional embeddings.
Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context by Reid, Savinov, Teplyashin, et al. (8 Mar), https://arxiv.org/abs/2403.05530
This technical report introduces Gemini 1.5 Pro, a multimodal model from Google's Gemini family that excels in long-context tasks across various modalities.
* Common 7B Language Models Already Possess Strong Math Capabilities by Li, Wang, Hu, et al. (7 Mar), https://arxiv.org/abs/2403.04706
This study reveals the LLaMA-2 7B model's surprising mathematical skills even though it only underwent standard pretraining, and shows that its consistency improves with scaled-up supervised instruction-finetuning data.
How Far Are We from Intelligent Visual Deductive Reasoning? by Zhang, Bai, Zhang, et al. (7 Mar), https://arxiv.org/abs/2403.04732
This study explores the capabilities of state-of-the-art Vision-Language Models (VLMs) like GPT-4V in the nuanced field of vision-based deductive reasoning, uncovering significant blindspots in visual deductive reasoning, and finding that techniques effective for text-based reasoning in LLMs don't directly apply to visual reasoning challenges.
Stop Regressing: Training Value Functions via Classification for Scalable Deep RL by Farebrother, Orbay, Vuong, et al. (6 Mar), https://arxiv.org/abs/2403.03950
This paper explores the potential of enhancing deep reinforcement learning (RL) scalability by training value functions, which are central to RL, using categorical cross-entropy classification instead of traditional regression.
* GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection by Zhao, Zhang, Chen, et al. (6 Mar), https://arxiv.org/abs/2403.03507
Gradient Low-Rank Projection (GaLore) is a new training strategy that significantly reduces memory usage by up to 65.5% for optimizer states during the training of LLMs, without sacrificing performance.
MedMamba: Vision Mamba for Medical Image Classification by Yue and Li (2024), https://arxiv.org/abs/2403.03849
MedMamba tackles medical image classification by blending CNNs with state space models (Conv-SSM) for efficient long-range dependency modeling and local feature extraction.
3D Diffusion Policy by Ze, Zhang, Zhang, et al. (6 Mar), https://arxiv.org/abs/2403.03954
3D Diffusion Policy is a new visual imitation learning approach integrating 3D visual representations with diffusion policies to improve efficiency and generalization in robot training with fewer demonstrations and enhanced safety.
Are Language Models Puzzle Prodigies? Algorithmic Puzzles Unveil Serious Challenges in Multimodal Reasoning by Ghosal, Han, Ken, and Poria (6 Mar), https://arxiv.org/abs/2403.03864
This paper introduces a new multimodal puzzle-solving challenge revealing that models like GPT4-V and Gemini struggle significantly with the complex puzzles.
SaulLM-7B: A pioneering Large Language Model for Law by Colombo, Pires, Boudiaf, et al. (6 Mar), https://arxiv.org/abs/2403.03883
SaulLM-7B is a 7 billion-parameter language model specialized for the legal domain, built on the Mistral 7B architecture and trained on a massive corpus of English legal texts.
Learning to Decode Collaboratively with Multiple Language Models by Shen, Lang, Wang, et al. (6 Mar), https://arxiv.org/abs/2403.03870
This approach enables multiple large language models to collaboratively generate text at the token level, automatically learning when to contribute or defer to others, enhancing performance across various tasks by leveraging the combined expertise of generalist and specialist models.
Backtracing: Retrieving the Cause of the Query by Wang, Wirawarn, Khattab, et al. (6 Mar), https://arxiv.org/abs/2403.03956
The study introduces "backtracing" as a task to help content creators like lecturers identify the text segments that led to user queries, aiming to enhance content delivery in education, news, and conversation domains.
* ShortGPT: Layers in Large Language Models are More Redundant Than You Expect by Men, Xu, Zhang, et al. (6 Mar), https://arxiv.org/abs/2403.03853
This study introduces the Block Influence (BI) metric to assess each layer's importance in LLMs and proposes ShortGPT, a pruning approach that removes redundant layers based on BI scores.
Design2Code: How Far Are We From Automating Front-End Engineering? by Si, Zhang, Yang, et al. (5 Mar), https://arxiv.org/abs/2403.03163
This research introduces Design2Code, a benchmark for how well multimodal LLMs convert visual designs into code, using a curated set of 484 real-world webpages for evaluation, where GPT-4V emerged as the top-performing model.
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Esser, Kulal, Blattmann, et al. (5 Mar), https://arxiv.org/abs/2403.03206
This work enhances rectified flow models for high-resolution text-to-image synthesis by improving noise sampling and introducing a novel transformer-based architecture that enhances text comprehension and image quality, showing better performance through extensive evaluation and human preference ratings.
Enhancing Vision-Language Pre-training with Rich Supervisions by Gao, Shi, Zhu et al. (5 Mar), https://arxiv.org/abs/2403.03346
Strongly Supervised pretraining with ScreenShots (S4) introduces a new pretraining approach for vision-LLMs using web screenshots along with leveraging the inherent tree-structured hierarchy of HTML elements.
Evolution Transformer: In-Context Evolutionary Optimization by Lange, Tian, and Tang (5 Mar), https://arxiv.org/abs/2403.02985
The proposed evolution transformer leverages a causal transformer architecture for meta-optimization.
* The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning by Li, Pan, Gopal et al. (5 Mar), https://arxiv.org/abs/2403.03218
The WMDP benchmark is a curated dataset of over 4,000 questions designed to gauge and mitigate LLMs' knowledge in areas with misuse potential, such as biosecurity and cybersecurity.
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures by Duan, Wang, Chen, et al. (4 Mar), https://arxiv.org/abs/2403.02308
VRWKV adapts the RWKV model from NLP to computer vision, outperforming vision transformer (ViTs) like DeiT in classification speed and memory usage, and excelling in dense prediction tasks.
Training-Free Pretrained Model Merging, by Xu, Yuan, Wang, et al. (4 Mar), https://arxiv.org/abs/2403.01753
The proposed model merging framework addresses the challenge of balancing unit similarity inconsistencies between weight and activation spaces during model merging by linearly combining similarity matrices of both, resulting in better multi-task model performance.
The Hidden Attention of Mamba Models by Ali, Zimerman, and Wolf (3 Mar), https://arxiv.org/abs/2403.01590
This paper shows that selective state space models such as Mamba can be viewed as attention-driven models.
Improving LLM Code Generation with Grammar Augmentation by Ugare, Suresh, Kang (3 Mar), https://arxiv.org/abs/2403.01632
SynCode is a framework that improves code generation with LLMs by using the grammar of programming languages (essentially an offline-constructed efficient lookup table) for syntax validation and to constrain the LLM’s vocabulary to only syntactically valid tokens.
Learning and Leveraging World Models in Visual Representation Learning by Garrido, Assran, Ballas et al. (1 Mar), https://arxiv.org/abs/2403.00504
The study extends the popular Joint-Embedding Predictive Architecture (JEPA) by introducing Image World Models (IWMs) to go beyond masked image modeling.
This magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues.

Your support means a great deal! Thank you!