
AIGC Newsletter

AIGC Weekly

AIGC Weekly | #68

AIGC Top Papers and AI news of the week

pxiaoer
May 20, 2024
In the modern digital world, data compression has become increasingly important. Whether for audio, video, or images, compression can significantly reduce file size, saving storage space and bandwidth. Common compressed formats include FLAC, JPEG, and MP3.

#### Audio compression

Audio compression falls into two categories: lossy and lossless. Lossy compression (such as MP3) shrinks files by discarding some audio information, while lossless compression (such as FLAC) preserves all of the original signal. Research shows that lossless compression yields better audio quality than lossy compression, though the resulting files are larger [20].

#### Image compression

Image compression is likewise either lossy or lossless. JPEG is a common lossy format, widely used in digital photography and web images; lossless formats (such as PNG) are preferred where high image quality is required. Figure 1 compares the different compression formats.

Figure 1: Comparison of image compression formats

#### Video compression

Video compression plays a key role in streaming and video storage. Common video formats include H.264 and H.265; H.265 offers higher compression efficiency than H.264 but also demands more computation [20].

Overall, data compression plays an indispensable role in modern information processing and transmission. As the technology advances, compression algorithms continue to be optimized to meet growing demand.
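The lossless/lossy distinction above can be demonstrated with Python's built-in `zlib`, a lossless codec: decompression recovers the input byte-for-byte, which is exactly the guarantee lossy codecs like MP3 or JPEG give up in exchange for smaller files. The sample data below is made up for illustration.

```python
import zlib

# Highly repetitive data compresses well under a lossless codec.
original = b"audio-sample " * 1000

compressed = zlib.compress(original, 9)  # 9 = maximum compression level
restored = zlib.decompress(compressed)

# Lossless: the round trip recovers every byte of the input.
assert restored == original
print(f"original: {len(original)} bytes, compressed: {len(compressed)} bytes")
```

A lossy codec would report an even smaller compressed size, but the equality check above would no longer hold.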

Top Papers of the Week (May 13 - May 19)

1.) Hello GPT-4o( link )

GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
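As a sketch of how mixed text-and-image input reaches GPT-4o through the API: the request below follows the public chat-completions message format, but the image URL is a placeholder and sending the request would require the `openai` package and an API key, so only the payload is constructed here.

```python
# Hypothetical request payload mixing text and image input for GPT-4o.
# The message structure follows the public chat-completions format;
# the image URL is a placeholder.
payload = {
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
}

# Sending it would look roughly like (requires the `openai` package and a key):
# from openai import OpenAI
# response = OpenAI().chat.completions.create(**payload)
```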

2.) Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context ( paper )

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality.

3.) Observational Scaling Laws and the Predictability of Language Model Performance ( paper )

Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ∼80 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities.
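The observational idea can be sketched as an ordinary regression: rather than training models at many scales, fit a trend to (compute, benchmark score) pairs observed from existing public models, then extrapolate. The numbers below are invented for illustration and are not from the paper.

```python
import math

# Made-up (training FLOPs, benchmark score) pairs for public models.
observations = [
    (1e21, 0.42), (3e21, 0.48), (1e22, 0.55),
    (3e22, 0.61), (1e23, 0.67),
]

# Least-squares fit of score = a * log10(FLOPs) + b.
xs = [math.log10(c) for c, _ in observations]
ys = [s for _, s in observations]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Extrapolate to a scale at which no observed model was trained.
predicted = a * math.log10(1e24) + b
```

No new model is trained at any point; the law comes entirely from models that already exist, which is what makes the approach cheap relative to classical scaling-law studies.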

4.) RLHF Workflow: From Reward Modeling to Online RLHF ( paper )

We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF.
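The reward-modeling stage of an RLHF pipeline is typically trained on preference pairs with a Bradley-Terry style objective; the paper's full recipe has more moving parts, but a minimal version of the pairwise loss looks like this (the scores below are stand-in scalars, not real model outputs):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing it pushes the reward of the preferred response
    above that of the rejected one.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the preferred response is scored higher.
well_ordered = pairwise_reward_loss(2.0, -1.0)   # small loss
mis_ordered = pairwise_reward_loss(-1.0, 2.0)    # large loss
```

In the online-iterative setting, the policy's fresh generations are repeatedly labeled and fed back through this kind of objective, which is the loop the report's recipe makes reproducible.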

5.) LayGA: Layered Gaussian Avatars for Animatable Clothing Transfer ( webpage | paper )

Animatable clothing transfer, aiming at dressing and animating garments across characters, is a challenging problem. Most human avatar works entangle the representations of the human body and clothing together, which leads to difficulties for virtual try-on across identities. What’s worse, the entangled representations usually fail to exactly track the sliding motion of garments. To overcome these limitations, we present Layered Gaussian Avatars (LayGA), a new representation that formulates body and clothing as two separate layers for photorealistic animatable clothing transfer from multiview videos.

6.) LMD3: Language Model Data Density Dependence ( paper )

We develop a methodology for analyzing language model task performance at the individual example level based on training data density estimation. Experiments with paraphrasing as a controlled intervention on finetuning data demonstrate that increasing the support in the training distribution for specific test queries results in a measurable increase in density, which is also a significant predictor of the performance increase caused by the intervention. Experiments with pretraining data demonstrate that we can explain a significant fraction of the variance in model perplexity via density measurements. We conclude that our framework can provide statistical evidence of the dependence of a target model's predictions on subsets of its training data, and can more generally be used to characterize the support (or lack thereof) in the training data for a given test task.
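A minimal illustration of the density idea (not the paper's actual estimator): embed training and test examples, then score each test query by a kernel density estimate over the training embeddings. Queries with high density sit in well-supported regions of the training distribution. The 2-D "embeddings" below are toy stand-ins.

```python
import math

def kde_score(query, train_points, bandwidth=1.0):
    """Gaussian kernel density of `query` under the training embeddings."""
    total = 0.0
    for p in train_points:
        sq_dist = sum((q - t) ** 2 for q, t in zip(query, p))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return total / len(train_points)

# Toy 2-D "embeddings": the training data clusters near the origin.
train = [(0.0, 0.0), (0.1, -0.2), (-0.1, 0.1), (0.2, 0.2)]

in_support = kde_score((0.0, 0.1), train)      # near the training cluster
out_of_support = kde_score((5.0, 5.0), train)  # far from any training data
```

The paper's claim is that this kind of density gap predicts per-example performance: queries like `in_support` benefit from training data in their neighborhood, while queries like `out_of_support` lack it.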

7.) Chameleon: Mixed-Modal Early-Fusion Foundation Models ( paper )

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed modal generation.
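"Early fusion" means image and text are mapped into a single token stream before the transformer sees them, rather than being fused later by separate modality encoders. A toy sketch of that interleaving follows; the token IDs and the vocabulary split are invented for illustration, not Chameleon's real tokenizer layout.

```python
# Invented vocabulary layout: text tokens occupy IDs [0, 32000),
# image tokens (from a discrete image tokenizer) occupy [32000, ...).
TEXT_VOCAB = 32_000
IMAGE_TOKEN_OFFSET = TEXT_VOCAB

def to_mixed_sequence(segments):
    """Flatten alternating text/image segments into one token stream."""
    stream = []
    for kind, tokens in segments:
        if kind == "text":
            stream.extend(tokens)
        elif kind == "image":
            # Shift image-codebook indices into their own ID range.
            stream.extend(IMAGE_TOKEN_OFFSET + t for t in tokens)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return stream

# "Describe <image> briefly" as one early-fused sequence.
mixed = to_mixed_sequence([
    ("text", [101, 2543]),
    ("image", [7, 12, 99]),
    ("text", [88]),
])
```

Because the model only ever sees one flat sequence like `mixed`, the same autoregressive decoder can both understand and generate either modality at any position.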

8.) Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? ( paper )

When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge.

9.) What Can Natural Language Processing Do for Peer Review? ( paper )

The number of scientific articles produced every year is growing rapidly. Providing quality control over them is crucial for scientists and, ultimately, for the public good. In modern science, this process is largely delegated to peer review -- a distributed procedure in which each submission is evaluated by several independent experts in the field. Peer review is widely used, yet it is hard, time-consuming, and prone to error. Since the artifacts involved in peer review -- manuscripts, reviews, discussions -- are largely text-based, Natural Language Processing has great potential to improve reviewing. As the emergence of large language models (LLMs) has enabled NLP assistance for many new tasks, the discussion on machine-assisted peer review is picking up the pace.

10.) MediSyn: Text-Guided Diffusion Models for Broad Medical 2D and 3D Image Synthesis ( paper )

Diffusion models have recently gained significant traction due to their ability to generate high-fidelity and diverse images and videos conditioned on text prompts. In medicine, this application promises to address the critical challenge of data scarcity, a consequence of barriers in data sharing, stringent patient privacy regulations, and disparities in patient population and demographics. By generating realistic and varying medical 2D and 3D images, these models offer a rich, privacy-respecting resource for algorithmic training and research. To this end, we introduce MediSyn, a pair of instruction-tuned text-guided latent diffusion models with the ability to generate high-fidelity and diverse medical 2D and 3D images across specialties and modalities. Through established metrics, we show significant improvement in broad medical image and video synthesis guided by text prompts.

AIGC News of the Week (May 13 - May 19)

1.) llama3 implemented from scratch ( link)

2.) UFO: A UI-Focused Agent for Windows OS Interaction. ( link )

3.) Veo: most capable generative video model ( link)

4.) MambaOut: Do We Really Need Mamba for Vision? ( link )

5.) amazon/MistralLite: MistralLite is a fine-tuned Mistral-7B-v0.1 language model with enhanced long-context processing (up to 32K tokens) ( link )

The newsletter's AI-startup column has started publishing. You are welcome to subscribe: AI Startup

More AIGC news: AINews

AIGC Newsletter is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

© 2024 pxiaoer