
Topic 4: What is JEPA?

We discuss the Joint Embedding Predictive Architecture (JEPA), how it differs from Transformers, and provide you with a list of models based on JEPA

Introduction

Current AI architectures like Transformers are powerful and have achieved impressive results, including generalization to previously unseen data and emergent abilities as the models are scaled. At the same time, they are still constrained compared to humans and animals, who don’t need to see millions of data points before drawing the right conclusions or learning new skills like speaking. Examples include crows that can solve puzzles as well as five-year-olds, orcas’ sophisticated hunting methods that often involve brutal, coordinated attacks, and elephants’ cooperative abilities.

The Moravec paradox highlights that tasks that are difficult for humans, such as computing, are simple for computers to handle because they can be described and modeled easily. However, perception and sensory processing, which come naturally to humans, are challenging for machines to master. Simply scaling the model and feeding it more data might not be a viable solution. Some argue that this approach will not lead to qualitatively different results that enable AI models to reach a new level of reasoning or world perception. Therefore, alternative methods must be explored to enable AI to attain human-level intelligence. Yann LeCun (one of the “Godfathers of AI”) insists that JEPA is the first step.

In today’s episode, we will cover:

  • What are the limitations of LLMs?

  • So what’s the potential solution?

  • How does JEPA work?

  • What can one build on JEPA?

  • I-JEPA – JEPA for Images

  • MC-JEPA - Multitasking JEPA

  • V-JEPA – JEPA for Video

  • Generalizing JEPA

  • Bonus: All resources in one place

Yann LeCun has always been level-headed about the latest models: he has exposed their limitations and argued to the public that the fear of AGI, or of AI taking over humanity, is unfounded. In February 2022, Yann LeCun proposed his vision for achieving human-level reasoning in AI, with the Joint Embedding Predictive Architecture (JEPA) at its core. Let’s figure out what it is!

What are the limitations of LLMs?

Yann LeCun also gave several talks presenting his vision for objective-driven AI: talk 1 (March 28, 2024), and talk 2 (September 9, 2023). There, he extensively discussed the limitations of large language models (LLMs):

  • LLMs have no common sense: LLMs have limited knowledge of the underlying reality and make strange mistakes called “hallucinations.” This paper showed that LLMs are good at formal linguistic competence (knowledge of linguistic rules and patterns), while their performance on functional linguistic competence (understanding and using language in the world) remains unstable.

  • LLMs have no memory and can’t plan their answers: the PlanBench benchmark demonstrated this.

So what’s the potential solution?

When proposing new ideas, it always helps to return to the roots and the fundamental disciplines. For the task of building intelligent AI, one needs to revisit cognitive science, psychology, and neuroscience alongside the engineering sciences. This is, in fact, the strategy the creators of AI took in the 1960s. Professor LeCun did the same in the 2020s and identified the key ingredients for success, which we discuss below.

World models

The fundamental part of LeCun’s vision is the concept of "world models," which are internal representations of how the world functions. He argues that giving the model a context of the world around it could improve its results.

“The idea that humans, animals, and intelligent systems use world models goes back many decades in psychology and in fields of engineering such as control and robotics.”

Yann LeCun

Self-supervised learning

Another important aspect is using self-supervised learning (SSL) akin to babies who learn the world by observing it. Models like GPT, BERT, LLaMa and other foundation models are based on SSL and have changed the way we use machine learning.

Abstract representations

Apart from SSL, the model also needs to understand what its sensors should capture and what they should not. In other words, the model needs to pick out the relevant information in each state it observes. The human eye, for example, is perfectly wired for that: what may seem to be a limitation in fact allows us to extract the essence.

The invisible gorilla study published in 1999 is the most famous example of a phenomenon called “inattentional blindness.” When we pay close attention to one thing, we often fail to notice other things, even obvious ones. This is just one example of how our eyes function; scientists have also shown that our eyes need some time to refocus, just like the camera on your smartphone.

Using this analogy, Yann LeCun proposed that a model should use abstract representations* of images rather than compare raw pixels.

*Abstract representations simplify complex information into a form that is more manageable and meaningful for specific tasks or analyses. By focusing on the essential aspects and ignoring the less important details, these representations help systems (whether human or machine) to process information more efficiently and effectively.
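
As a toy illustration (not from the article), an “abstract representation” can be as crude as average pooling: two views of the same scene that differ only in pixel-level noise look far more alike in the pooled space than in pixel space. All numbers here are our own illustrative choices.

```python
import numpy as np

# Two views of the "same" scene differing only in pixel-level noise.
rng = np.random.default_rng(0)
scene = rng.normal(size=(8, 8))
view_a = scene + 0.1 * rng.normal(size=(8, 8))
view_b = scene + 0.1 * rng.normal(size=(8, 8))

def abstract(img):
    """A crude abstraction: 2x2 average pooling discards pixel-level detail."""
    return img.reshape(4, 2, 4, 2).mean(axis=(1, 3))

pixel_dist = np.mean((view_a - view_b) ** 2)        # compare raw pixels
abstract_dist = np.mean((abstract(view_a) - abstract(view_b)) ** 2)
print(abstract_dist < pixel_dist)  # pooling averages out the noise
```

In the difference between the two views, the shared scene cancels out, so what remains is noise, and the pooled representation averages that noise away.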

Architecture – Objective-Driven AI

LeCun proposes a modular, configurable architecture for autonomous intelligence, emphasizing the development of self-supervised learning methods to enable AI to learn these world models without extensive labeled data.

Image Credit: Meta AI blog post, Yann LeCun on a vision to make AI systems learn and reason like animals and humans

Here’s a detailed view of the components of the system architecture for autonomous intelligence:

  • Configurator: Acts as the executive control center of the AI system by dynamically configuring other components of the system based on the specific task or context. For instance, it adjusts the parameters of the perception, world model, and actor modules to optimize performance for the given task.

  • Perception module: Captures and interprets sensory data from various sensors to estimate the current state of the world. This component is the basis for all higher-level processing and decision-making.

  • World model module: Predicts future states of the environment and fills in missing information. It acts as a simulator, using current and past data to forecast future conditions and possible scenarios. This component is key to the AI’s ability to perform hypothetical reasoning and planning, which is essential for navigating complex, dynamic environments.

  • Cost module: Evaluates the potential consequences of actions in terms of predefined costs associated with a given state or action. It has two submodules:

    • Intrinsic cost: Hard-wired, calculating immediate discomfort or risk

    • Critic: Trainable, estimating future costs based on current actions

  • Actor module: Decides and proposes specific actions based on the predictions and evaluations provided by other components of the architecture. It computes optimal action sequences that minimize the predicted costs, often using methods akin to those in optimal control theory.

  • Short-term memory: Keeps track of the immediate history of the system’s interactions with the environment. It stores recent data on the world state, actions taken, and the associated costs, allowing the system to reference this information in real-time decision-making.
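
The loop formed by these modules can be sketched in a few lines of Python. This is a hypothetical toy: the class names, the one-step dynamics, and the quadratic cost are our own stand-ins, since the architecture is a research vision, not published code.

```python
import numpy as np

class Perception:
    """Estimate the current world state from raw sensor data."""
    def encode(self, observation):
        return np.asarray(observation, dtype=float)

class WorldModel:
    """Predict the next state given the current state and a candidate action."""
    def predict(self, state, action):
        return state + action  # toy dynamics: an action simply shifts the state

class CostModule:
    """Intrinsic cost (hard-wired) plus a trainable critic (placeholder here)."""
    def __init__(self, goal):
        self.goal = np.asarray(goal, dtype=float)
    def intrinsic(self, state):
        return float(np.sum((state - self.goal) ** 2))  # distance to the goal
    def critic(self, state):
        return 0.0  # would be learned from experience in the real architecture

class Actor:
    """Pick the candidate action whose predicted outcome minimizes total cost."""
    def act(self, state, world_model, cost, candidates):
        def total(a):
            next_state = world_model.predict(state, a)
            return cost.intrinsic(next_state) + cost.critic(next_state)
        return min(candidates, key=total)

# One perception -> prediction -> action cycle toward a goal state.
perception, wm = Perception(), WorldModel()
cost, actor = CostModule(goal=[1.0, 1.0]), Actor()
state = perception.encode([0.0, 0.0])
candidates = [np.array([1.0, 1.0]), np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
best = actor.act(state, wm, cost, candidates)
print(best)  # the action whose predicted outcome lands closest to the goal
```

The configurator and short-term memory are omitted for brevity; they would retune these modules per task and log each (state, action, cost) triple.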

How does Joint Embedding Predictive Architecture (JEPA) work?

Joint Embedding Predictive Architecture (JEPA) is a central element in the pursuit of developing AI that can understand and interact with the world as humans do. It encapsulates the key elements we mentioned above. JEPA allows the system to handle uncertainty and ignore irrelevant details while maintaining essential information for making predictions.

Image Credit: Yann LeCun’s Harvard presentation (March 28, 2024)

JEPA is built from these elements:

  • Inputs: JEPA takes pairs of related inputs, for example sequential frames of a video (x could be the current frame and y the next frame)

  • Encoders: They transform the inputs, x and y, into abstract representations (sx and sy) which capture only essential features of the inputs and omit irrelevant details.

  • Predictor module: It is trained to predict the abstract representation of the next frame, sy, based on the abstract representation of the current frame, sx.

JEPA handles uncertainty in predictions in either of two ways:

  • During the encoding phase, when the encoder drops irrelevant information. For example, the encoder checks which features of the input data are too uncertain or noisy and decides not to include these in the abstract representation.

  • After the encoding, based on the latent variable z, which represents elements present in sy but not observable in sx. To handle uncertainty, z is varied across a predefined set of values, each representing different hypothetical scenarios or aspects of the future state y that might not be directly observable from x. By altering z, the predictive model can simulate how small changes in unseen factors could influence the upcoming state.
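
A minimal numpy sketch of this pipeline: toy linear encoders produce sx and sy, a predictor maps (sx, z) to a guess for sy, and sweeping z over a handful of hypotheses yields the best-matching prediction. All weights and dimensions are illustrative assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB, D_Z = 8, 4, 2                       # toy dimensions

W_enc = rng.normal(size=(D_EMB, D_IN))           # shared encoder weights (toy)
W_pred = rng.normal(size=(D_EMB, D_EMB + D_Z))   # predictor weights (toy)

def encode(v):
    """Map an input to its abstract representation (a toy nonlinear encoder)."""
    return np.tanh(W_enc @ v)

def predict(s_x, z):
    """Predict s_y from s_x and a latent z capturing unseen factors."""
    return np.tanh(W_pred @ np.concatenate([s_x, z]))

x = rng.normal(size=D_IN)                        # current frame
y = rng.normal(size=D_IN)                        # next frame
s_x, s_y = encode(x), encode(y)

# Sweep z over a small set of hypotheses and keep the best-matching prediction.
# Training would minimize this energy over data; here we only evaluate it.
zs = [rng.normal(size=D_Z) for _ in range(16)]
energy = min(float(np.sum((predict(s_x, z) - s_y) ** 2)) for z in zs)
print(energy)
```

The key point the sketch preserves is that the prediction error is computed between representations (sx, sy), never between raw inputs x and y.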

Image Credit: Yann LeCun’s Munich presentation (September 29, 2023)

Interestingly, several JEPAs could be combined into a multistep/recurrent JEPA or stacked into a Hierarchical JEPA that could be used to perform predictions at several levels of abstraction and several time scales.

Image Credit: Yann LeCun’s Munich presentation (September 29, 2023)

What can one build on JEPA?

Following the proposed JEPA architecture, Meta AI researchers, with Yann LeCun as a co-author, published several specialized models. What are they?

Image-based Joint-Embedding Predictive Architecture (I-JEPA) – JEPA for Images

I-JEPA, proposed in June 2023, was the first model based on JEPA.

Image Credit: Meta AI blog post, The first AI model based on Yann LeCun’s vision for more human-like AI

I-JEPA is a non-generative, self-supervised learning framework designed for processing images. It works by masking parts of an image and then trying to predict those masked parts:

  • Masking: The image is divided into numerous patches. Some of these patches, referred to as "target blocks," are masked (hidden) so that the model doesn’t have information about them

  • Context sampling: A portion of the image, called the "context block," is left unmasked. This part is used by the context encoder to understand the visible aspects of the image.

  • Prediction: The predictor then tries to predict the hidden parts (target blocks) based only on what it can see in the context block.

  • Iteration: This process involves updating the model's parameters to reduce the difference between predicted and actual patches.
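
The masking-and-prediction step can be sketched as follows. The real I-JEPA uses Vision Transformers and learned weights; here toy linear maps stand in for the three encoders, purely to show where the loss lives: in representation space, not pixel space.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PATCHES, D = 16, 8                       # 16 patches, 8-dim toy embeddings

image_patches = rng.normal(size=(N_PATCHES, D))
target_idx = np.array([3, 4, 7])           # masked "target blocks"
context_idx = np.setdiff1d(np.arange(N_PATCHES), target_idx)  # "context block"

W_ctx = rng.normal(size=(D, D))            # context encoder (toy stand-in for a ViT)
W_tgt = rng.normal(size=(D, D))            # target encoder
W_pred = rng.normal(size=(D, D))           # predictor

# Encode only the visible patches, then predict the hidden ones' representations.
context_repr = np.tanh(image_patches[context_idx] @ W_ctx.T).mean(axis=0)
target_repr = np.tanh(image_patches[target_idx] @ W_tgt.T)   # what to match
predicted = np.tanh(W_pred @ context_repr)                   # one shared guess

# Loss in representation space: distance between the prediction and each
# masked patch's target representation. Training would minimize this.
loss = float(np.mean((target_repr - predicted) ** 2))
print(loss)
```

In the actual model the predictor emits one prediction per masked block (conditioned on its position) rather than a single shared guess, and the target encoder is updated as a moving average of the context encoder; both details are dropped here for brevity.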

I-JEPA consists of three parts, each of which is a Vision Transformer (ViT):

  • Context encoder: Processes parts of the image that are visible, known as the "context block"

  • Predictor: Uses the output from the context encoder to predict what the masked (hidden) parts of the image look like

  • Target encoder: Generates representations from the target blocks (hidden parts) that the model uses to learn and make predictions about hidden parts of the image.

The overall goal of I-JEPA is to train the predictor to accurately predict the representations of the hidden image parts from the visible context. This self-supervised learning process allows the model to learn powerful image representations without relying on explicit labels.

MC-JEPA (Motion-Content Joint-Embedding Predictive Architecture) – Multitasking JEPA

MC-JEPA is another JEPA variant, designed to simultaneously interpret both the dynamic elements (motion) and static details (content) of video data using a shared encoder. It was proposed in July 2023, just a month after I-JEPA.

Image Credit: MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

MC-JEPA is a more comprehensive and robust visual representation model that can be used in real-world applications in computer vision like autonomous driving, video surveillance, and activity recognition.

Video-based Joint-Embedding Predictive Architecture (V-JEPA) – JEPA for Video

V-JEPA is designed to enhance AI’s understanding of video content, which was marked as an important future direction after the initial I-JEPA publication.

Image Credit: Meta AI blog post, V-JEPA: The next step toward advanced machine intelligence

V-JEPA consists of two main components:

  • Encoder: Transforms input video frames into a high-dimensional space where similar features are closer together. The encoder captures essential visual cues from the video.

  • Predictor: Takes the encoded features of one part of the video and predicts the features of another part. This prediction is based on learning the temporal and spatial transformations within the video, aiding in understanding motion and changes over time.
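
A toy sketch of this encoder-predictor pair over time: encode each frame, then predict the next frame’s features from the current frame’s. The random weights are stand-ins for the trained networks, and the one-step-ahead setup is our own simplification of predicting one part of the video from another.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_IN, D_EMB = 6, 12, 4                 # 6 frames, toy dimensions

frames = rng.normal(size=(T, D_IN))       # a tiny "video" of feature vectors
W_enc = rng.normal(size=(D_EMB, D_IN))    # encoder weights (toy)
W_pred = rng.normal(size=(D_EMB, D_EMB))  # predictor weights (toy)

features = np.tanh(frames @ W_enc.T)      # encoder: frame -> feature space
past, future = features[:-1], features[1:]  # predict each next frame's features
predicted = np.tanh(past @ W_pred.T)

# As in all JEPA variants, the loss lives in feature space, not pixel space.
loss = float(np.mean((predicted - future) ** 2))
print(loss)
```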

V-JEPA's design allows it to learn from videos in a way that mimics some aspects of human learning – observing and predicting the visual world without needing explicit annotations. The model's ability to generalize from unsupervised video data to diverse visual tasks makes it a powerful tool for advancing how machines understand and interact with dynamic visual environments.

Generalizing JEPA

The latest paper, “Learning and Leveraging World Models in Visual Representation Learning” (March 2024), introduces the concept of Image World Models (IWM) and explores how the JEPA architecture can be generalized to a broader set of corruptions (changes to input images such as color jitter and blur) beyond masking.

Image Credit: Learning and Leveraging World Models in Visual Representation Learning

The study explores two types of world models:

  • Invariant models: Recognize and maintain stable, unchanged features across different scenarios

  • Equivariant models: Adapt to changes in the input data, preserving the relationships and transformations that occur
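
The distinction can be stated as two small checks (a toy example with our own choice of transform and functions, not from the paper): an invariant function satisfies f(T(x)) = f(x), while an equivariant one satisfies f(T(x)) = T'(f(x)) for a corresponding output transform T'.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

def transform(v):
    """A simple input 'corruption': reverse the vector."""
    return v[::-1]

def f_invariant(v):
    """Invariant: sorting ignores order, so reversal leaves the output unchanged."""
    return np.sort(v)

def f_equivariant(v):
    """Equivariant: scaling commutes with reversal, so the output transforms too."""
    return 2.0 * v

inv_holds = np.allclose(f_invariant(transform(x)), f_invariant(x))
eqv_holds = np.allclose(f_equivariant(transform(x)), transform(f_equivariant(x)))
print(inv_holds, eqv_holds)  # True True
```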

The research found that, by leveraging these world models, machines can predict and adapt to visual changes more accurately, resulting in more resilient and adaptable systems. This method challenges traditional AI approaches and offers a new way to improve the effectiveness of machine learning models without requiring direct supervision.

Bonus: All resources in one place

Original models

  1. First proposal of JEPA: Yann LeCun on a vision to make AI systems learn and reason like animals and humans

  2. I-JEPA: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture 

  3. MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features

  4. V-JEPA: The next step toward advanced machine intelligence

  5. Generalizing JEPA: Learning and Leveraging World Models in Visual Representation Learning

Yann LeCun talks:

JEPA-inspired models

We also created for you a list of related models inspired by JEPA architecture. They are grouped based on their application domains:

Audio and Speech Applications
  1. A-JEPA: Focused on audio data using masked-modeling principles for improving contextual semantic understanding in audio and speech classification tasks.

  2. Investigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learning: Analyzes masking strategies and sample durations in self-supervised audio representation learning.

Visual and Spatial Data Applications
  1. S-JEA: Enhances visual representation learning through hierarchical semantic representations in stacked joint embedding architectures.

  2. DMT-JEPA: Targets image modeling with a focus on local semantic understanding, applicable to classification, object detection, and segmentation.

  3. JEP-KD: Aligns visual speech recognition models with audio features, improving performance in visual speech recognition.

  4. Point-JEPA: Applied to point cloud data, enhancing efficiency and representation learning in spatial datasets.

  5. Signal-JEPA: Focuses on EEG signal processing, improving cross-dataset transfer and classification in EEG analysis.

Graph and Dynamic Data Applications
  1. Graph-JEPA: First joint-embedding architecture for graphs, using hyperbolic coordinate prediction for subgraph representation.

  2. ST-JEMA: Enhances learning of dynamic functional connectivity from fMRI data, focusing on high-level semantic representations.

Time-Series and Remote Sensing Applications
  1. LaT-PFN: Combines time-series forecasting with joint embedding architecture, leveraging related series for robust in-context learning.

  2. Time-Series JEPA: Optimizes remote control over limited-capacity networks through spatio-temporal correlations in sensor data.

  3. Predicting Gradient is Better: Utilizes self-supervised learning for SAR ATR, leveraging gradient features for automatic target recognition.

Evaluation and Methodological Studies
  1. LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures: Introduces a metric for evaluating representations in joint-embedding self-supervised learning architectures, focusing on linear probing performance.


Thank you for reading! Share this article with three friends and get a 1-month subscription free! 🤍
