Main

De novo protein design seeks to generate proteins with specified structural and/or functional properties, for example, making a binding interaction with a given target12, folding into a particular topology13 or containing a catalytic site4. Denoising diffusion probabilistic models (DDPMs), a powerful class of machine learning models recently demonstrated to generate new photorealistic images in response to text prompts14,15, have several properties well suited to protein design. First, DDPMs generate highly diverse outputs, as they are trained to denoise data (for instance, images or text) that have been corrupted with Gaussian noise. By learning to stochastically reverse this corruption, diverse outputs closely resembling the training data are generated. Second, DDPMs can be guided at each step of the iterative generation process towards specific design objectives through provision of conditioning information. Third, for almost all protein design applications it is necessary to explicitly model three-dimensional (3D) structures; rotationally equivariant DDPMs can do this in a global representation frame independent manner. Recent work has adapted DDPMs for protein monomer design by conditioning on small protein ‘motifs’5,9 or on secondary structure and block-adjacency (‘fold’) information8. Although promising, these attempts have shown limited success in generating sequences that fold to the intended structures in silico5,16, probably due to the limited ability of the denoising networks to generate realistic protein backbones, and have not been tested experimentally.

We reasoned that improved diffusion models for protein design could be developed by taking advantage of the deep understanding of protein structure implicit in powerful structure prediction methods such as AlphaFold2 (ref. 17) (AF2) and RoseTTAFold18 (RF). RF has properties well suited for use in a protein design DDPM (Fig. 1a): it generates protein structures with high precision, operates on a rigid-frame representation of residues with rotational equivariance and has an architecture enabling conditioning on design specifications at the individual residue, inter-residue distance and orientation, and 3D coordinate levels. In previous work, we fine-tuned RF to complete protein backbones around input functional motifs in a single step (RFjoint Inpainting4). Experimental characterization showed that the method can scaffold a wide range of protein functional motifs with atomic accuracy19, but the approach fails on minimalist site descriptions that do not sufficiently constrain the overall fold and, because it is deterministic, can produce only a limited diversity of designs for a given problem. We reasoned that by fine-tuning RF as the denoising network in a generative diffusion model instead, we could overcome both problems: because the starting point is random noise, each denoising trajectory yields a different solution, and because structure is built up progressively through many denoising iterations, little to no starting structural information should be required. In this study, we used an updated version of RF18 as the basis for the denoising network architecture (Supplementary Methods), but other equivariant structure prediction networks (AF2 (ref. 17), OmegaFold20, ESMFold21) could in principle be substituted into an analogous DDPM.

Fig. 1: Protein design using RFdiffusion.

a, Diffusion models for proteins are trained to recover corrupted (noised) protein structures and to generate new structures by reversing the corruption process through iterative denoising of initially random noise $X_T$ into a realistic structure $X_0$ (top panel). The RF structure prediction network (middle panel, left side) is fine-tuned with minimal architectural changes into RFdiffusion (middle panel, right side); the denoising network of a DDPM is also shown. In RF, the primary input to the model is the sequence. In RFdiffusion, the primary input is diffused residue frames (coordinates and orientations). In both cases, the model predicts final 3D coordinates (denoted $\hat{X}_0$ in RFdiffusion). The bottom panel shows that in RFdiffusion, the model receives its previous prediction as a template input (‘self-conditioning’, Supplementary Methods). At each timestep $t$ of a trajectory (typically 200 steps), RFdiffusion takes $\hat{X}_0^{t+1}$ from the previous step and $X_t$ and then predicts an updated $X_0$ structure ($\hat{X}_0^{t}$). The next coordinate input to the model ($X_{t-1}$) is generated by a noisy interpolation (interp) towards $\hat{X}_0^{t}$. b, RFdiffusion is broadly applicable for protein design. RFdiffusion generates protein structures either without further input (top row) or by conditioning on (top to bottom): symmetry specifications; binding targets; protein functional motifs or symmetric functional motifs. In each case random noise, along with conditioning information, is input to RFdiffusion, which iteratively refines that noise until a final protein structure is designed. c, An example of an unconditional design trajectory for a 300-residue chain, depicting the input to the model ($X_t$) and the corresponding $\hat{X}_0$ prediction. At early timesteps (high $t$), $\hat{X}_0$ bears little resemblance to a protein but is gradually refined into a realistic protein structure.

We construct an RF-based diffusion model, RFdiffusion, using the RF frame representation that comprises a Cα coordinate and N-Cα-C rigid orientation for each residue. We generate training inputs by noising structures sampled from the Protein Data Bank (PDB) for up to 200 steps22. For translations, we perturb Cα coordinates with 3D Gaussian noise. For residue orientations, we use Brownian motion on the manifold of rotation matrices (building on refs. 23,24). To enable RFdiffusion to learn to reverse each step of the noising process, we train the model by minimizing a mean-squared error (m.s.e.) loss between frame predictions and the true protein structure (without alignment), averaged across all residues (Supplementary Methods). This loss drives denoising trajectories to match the data distribution at each timestep and hence to converge on structures of designable protein backbones (Extended Data Fig. 2a). The m.s.e. loss contrasts with the loss used in RF structure prediction training (frame aligned point error or FAPE) in that, unlike FAPE, m.s.e. loss is not invariant to the global reference frame and therefore promotes continuity of the global coordinate frame between timesteps (Supplementary Methods).
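
As an illustration only, and not the authors' implementation, the translational part of the noising process and the unaligned m.s.e. loss can be sketched as follows. The sketch assumes a standard variance-preserving DDPM with a linear variance schedule; the schedule values and all function names are hypothetical, and the rotational noise (Brownian motion on rotation matrices) is omitted.

```python
import numpy as np


def linear_beta_schedule(num_steps=200, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear variance schedule (an assumption, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)


def noise_ca_coords(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) for C-alpha coordinates x0 of shape (L, 3).

    Uses the closed-form DDPM marginal
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I),
    with t counted from 1 to len(betas).
    """
    alpha_bar_t = np.cumprod(1.0 - betas)[t - 1]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def mse_loss(x0_pred, x0_true):
    """Squared coordinate error averaged over residues, with no superposition step."""
    return float(np.mean(np.sum((x0_pred - x0_true) ** 2, axis=-1)))
```

Because no alignment is performed, rigidly rotating a perfect prediction produces a non-zero loss; this is the sense in which the m.s.e. loss is not invariant to the global reference frame.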

To generate a new protein backbone, we first initialize random residue frames and RFdiffusion makes a denoised prediction. Each residue frame is updated by taking a step in the direction of this prediction with some noise added to generate the input to the next step. The nature of the noise added and the size of this reverse step is chosen such that the denoising process matches the distribution of the noising process (Supplementary Methods and Extended Data Fig. 2a). RFdiffusion initially seeks to match the full breadth of possible protein structures compatible with the purely random frames with which it is initialized, and hence the denoised structures do not initially seem protein-like (Fig. 1c, left). However, through many such steps, the breadth of possible protein structures from which the input could have arisen narrows and RFdiffusion predictions come to closely resemble protein structures (Fig. 1c, right). We use the ProteinMPNN network1 to subsequently design sequences encoding these structures, typically sampling eight sequences per design in line with previous work5,16 (but see Supplementary Fig. 2a). We also considered simultaneously designing structure and sequence within RFdiffusion, but given the excellent performance of combining ProteinMPNN with the diffusion of structure alone, we did not extensively explore this possibility.
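
The generation loop described above can be sketched, under stated assumptions, for Cα coordinates alone. Here the "noisy interpolation" is taken to be the standard DDPM posterior q(x_{t-1} | x_t, x0_hat); the actual update rule, including the treatment of residue orientations, is given in the paper's Supplementary Methods and may differ. The `denoiser` callable stands in for RFdiffusion and its signature is hypothetical; passing the previous prediction back in mimics self-conditioning.

```python
import numpy as np


def reverse_step(x_t, x0_hat, t, betas, rng):
    """One reverse step: move x_t towards the prediction x0_hat and add noise.

    Implements the standard DDPM posterior mean and variance; t runs from
    len(betas) down to 1, and the final step (t = 1) returns x0_hat itself.
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    abar_t = alpha_bar[t - 1]
    abar_prev = alpha_bar[t - 2] if t > 1 else 1.0
    coef_x0 = np.sqrt(abar_prev) * betas[t - 1] / (1.0 - abar_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = coef_x0 * x0_hat + coef_xt * x_t
    var = betas[t - 1] * (1.0 - abar_prev) / (1.0 - abar_t)
    noise = rng.standard_normal(x_t.shape) if t > 1 else np.zeros_like(x_t)
    return mean + np.sqrt(var) * noise


def sample_backbone(denoiser, length, betas, rng):
    """Run a full trajectory from random coordinates to a final structure."""
    x_t = rng.standard_normal((length, 3))  # random initial coordinates
    x0_prev = None                          # no previous prediction at the first step
    for t in range(len(betas), 0, -1):
        x0_hat = denoiser(x_t, t, x0_prev)  # denoised prediction at this step
        x_t = reverse_step(x_t, x0_hat, t, betas, rng)
        x0_prev = x0_hat                    # fed back as self-conditioning input
    return x_t
```

With a toy denoiser that always returns the same target coordinates, the trajectory ends exactly on that target, which is a quick sanity check of the update coefficients.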

Figure 1a highlights the similarities between RF structure prediction and an RFdiffusion denoising step: in both cases, the networks transform coordinates into a predicted structure, conditioned on inputs to the model. In RF, sequence is the primary input, with extra structural information provided as templates and initial coordinates to the model. In RFdiffusion, the primary input is the noised coordinates from the previous step. For specific design tasks, a range of auxiliary conditioning information, including partial sequence, fold information or fixed functional-motif coordinates can be provided (Fig. 1b and Supplementary Methods).

We explored two different strategies for training RFdiffusion: (1) in a manner akin to ‘canonical’ diffusion models, with predictions at each timestep independent of predictions at previous timesteps (as in previous work5,8,9,16), and (2) with self-conditioning25, in which the model can condition on previous predictions between timesteps (Fig. 1a, bottom row and Supplementary Methods). The latter strategy was inspired by the success of ‘recycling’ in AF2, which is also central to the more recent RF model used here (Supplementary Methods). Self-conditioning within RFdiffusion notably improved performance on in silico benchmarks encompassing both conditional and unconditional protein design tasks (Fig. 2e and Extended Data Fig. 1e). Increased coherence of predictions within self-conditioned trajectories may, at least in part, explain these performance increases (Extended Data Fig. 1h). Fine-tuning RFdiffusion from pretrained RF weights was far more successful than training for an equivalent length of time from untrained weights (Extended Data Fig. 1f,g, also Supplementary Fig. 1) and the m.s.e. loss was also crucial for unconditional generation (Extended Data Fig. 1d). For all in silico benchmarks in this paper, we use the AF2 structure prediction network17 for validation and define an in silico ‘success’ as an RFdiffusion output for which the AF2 structure predicted from a single sequence is (1) of high confidence (mean predicted aligned error (pAE), less than five), (2) globally within a 2 Å backbone root mean-squared deviation (r.m.s.d.) of the designed structure and (3) within 1 Å backbone r.m.s.d. on any scaffolded functional site (Supplementary Methods). This measure of in silico success has been found to correlate with experimental success4,7,26 and is significantly more stringent than template modelling (TM)-score-based metrics used elsewhere5,16,27,28,29 (Supplementary Fig. 2c,d).
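
The three-part in silico success criterion can be written down as a small check. This is a sketch for illustration: the class and function names are hypothetical, the metrics are assumed to have been computed elsewhere from an AF2 single-sequence prediction, and the handling of values exactly at a threshold (strict for pAE, inclusive for r.m.s.d.) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Af2Validation:
    """Metrics comparing an AF2 single-sequence prediction with a design model."""
    mean_pae: float                              # mean predicted aligned error
    global_backbone_rmsd: float                  # in angstroms, whole backbone
    motif_backbone_rmsd: Optional[float] = None  # in angstroms; None if no scaffolded site


def is_in_silico_success(v, pae_cutoff=5.0, global_cutoff=2.0, motif_cutoff=1.0):
    """Return True if all criteria that apply to this design are met."""
    if v.mean_pae >= pae_cutoff:
        return False  # prediction is not confident enough
    if v.global_backbone_rmsd > global_cutoff:
        return False  # predicted fold deviates from the design
    if v.motif_backbone_rmsd is not None and v.motif_backbone_rmsd > motif_cutoff:
        return False  # scaffolded functional site is not recapitulated
    return True
```

For unconditional designs there is no scaffolded site, so only the first two criteria apply.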

Unconditional protein monomer generation

As shown in Fig. 2a–c and Supplementary Fig. 3c,d, starting from random noise, RFdiffusion can readily generate elaborate protein structures with little overall structural similarity to structures seen during training, indicating considerable generalization beyond the PDB (see Supplementary Table 1 for a comparison of all designs in the paper to the PDB). The designs are diverse (Supplementary Fig. 3a), spanning a wide range of alpha, beta and mixed alpha–beta topologies, with AF2 and ESMFold (Fig. 2c, Extended Data Fig. 1b,c and Supplementary Fig. 2b) predictions very close to the design structure models for de novo designs with as many as 600 residues. RFdiffusion generates plausible structures for even very large proteins, but these are difficult to validate in silico as they are probably generally beyond the single sequence prediction capabilities of AF2 and ESMFold. The quality and diversity of designs that are sampled are inherent to the model, and do not depend on any auxiliary conditioning input (for example, secondary structure information8). We experimentally characterized six of the 300 amino acid designs and three of the 200 amino acid designs, and found that they have circular dichroism spectra consistent with the mixed alpha–beta topologies of the designs and are extremely thermostable (Extended Data Fig. 3). Physics-based protein design methodologies have struggled in unconstrained generation of diverse protein monomers because of the difficulty of sampling on the very large and rugged conformational landscape30, and overcoming this limitation has been a primary test of deep-learning based protein design approaches5,6,8,16,27,31. RFdiffusion strongly outperforms (based on the AF2 success metric described above) Hallucination with RF, an experimentally validated method using Monte Carlo search or gradient descent to identify sequences predicted to fold into stable structures (Fig. 2d). 
RFdiffusion generation is also more compute-efficient than unconstrained Hallucination with RF, and efficiency can be greatly improved by taking larger steps at inference time and by truncating trajectories early, which is possible because RFdiffusion predicts the final structure at each timestep (Extended Data Fig. 2b,c). For example, a 100-residue protein can be generated in as little as 11 s on an NVIDIA RTX A4000 graphics processing unit (GPU), in contrast to RF Hallucination, which takes around 8.5 min.

Fig. 2: Outstanding performance of RFdiffusion for monomer generation.

a, RFdiffusion can generate new monomeric proteins of different lengths (left 300, right 600) with no conditioning information. Grey, design model; colours, AF2 prediction. r.m.s.d. AF2 versus design (Å), left to right: 0.90, 0.98, 1.15, 1.67. b, Unconditional designs from RFdiffusion are new and not present in the training set as quantified by highest TM-score to the PDB; the divergence from previously known structures increases with length. c, Unconditional samples are closely repredicted by AF2 up to about 400 amino acids. d, RFdiffusion significantly outperforms Hallucination (with RF) at unconditional monomer generation (two-proportion z-test of in silico success: n = 400 designs per condition, z = 9.5, P = 1.6 × 10−21). Although Hallucination successfully generates designs up to 100 amino acids in length, in silico success rates rapidly deteriorate beyond this length. e, Ablating pretraining (by starting from untrained RF), RFdiffusion fine-tuning (that is, using original RF structure prediction weights as the denoiser), self-conditioning or m.s.e. losses (by training with FAPE) each notably decrease the performance of RFdiffusion. r.m.s.d. between design and AF2 is shown, for the unconditional generation of 300 amino acid proteins (Supplementary Methods). f, Two example 300 amino acid proteins that expressed as soluble monomers. Designs (grey) overlaid with AF2 predictions (colours) are shown on the left, alongside circular dichroism (CD) spectra (top) and melt curves (bottom) on the right. The designs are highly thermostable. g, RFdiffusion can condition on fold information. An example TIM barrel is shown (bottom left), conditioned on the secondary structure and block adjacency of a previously designed TIM barrel, PDB 6WVS (top left). Designs have very similar circular dichroism spectra to PDB 6WVS (top right) and are highly thermostable (bottom right). See also Extended Data Fig. 3 for further traces. 
Boxplots represent median ± interquartile range; tails are minimum and maximum excluding outliers (±1.5× interquartile range).

It is often desirable to be able to specify a protein fold during design (such as triose-phosphate isomerase (TIM) barrels or cavity-containing NTF2s for small molecule binder and enzyme design32,33), and thus we further fine-tuned RFdiffusion to condition on secondary structure and/or fold information, enabling rapid and accurate generation of diverse designs with the desired topologies (Fig. 2g and Extended Data Fig. 4). In silico success rates were 42.5 and 54.1% for TIM barrels and NTF2 folds, respectively (Extended Data Fig. 4d), and experimental characterization of 11 TIM barrel designs indicated that at least eight designs were soluble, thermostable and had circular dichroism spectra consistent with the design model (Fig. 2g and Extended Data Fig. 4e,f).

Design of higher-order oligomers

There is considerable interest in designing symmetric oligomers, which can serve as vaccine platforms34, delivery vehicles35 and catalysts36. Cyclic oligomers have been designed using structure prediction networks with an adaptation of Hallucination that searches for sequences predicted to fold to the desired cyclic symmetry, but this approach fails for higher-order dihedral, tetrahedral, octahedral and icosahedral symmetries, probably in part because of the much lower representation of such structures in the PDB7.

We set out to generalize RFdiffusion to create symmetric oligomeric structures with any specified point group symmetry. Given a specification of a point group symmetry for an oligomer with n chains, and the monomer chain length, we generate random starting residue frames for a single monomer subunit as in the unconditional generation case, and then generate n − 1 copies of this starting point arranged with the specified point group symmetry. Because RFdiffusion is equivariant (inherited from RF) with respect to rotation and relabelings of chains, symmetry is largely maintained in the denoising predictions; we explicitly resymmetrize at each step but this changes the structures only slightly (compare grey and coloured chains in Extended Data Fig. 5a and Supplementary Methods). For octahedral and icosahedral architectures, we explicitly model only the smallest subset of monomers required to generate the full assembly (for example, for icosahedra, the subunits at the five-, three- and twofold symmetry axes) to reduce the computational cost and memory footprint.
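
For the simplest case, cyclic (C_n) symmetry about the z axis, the symmetric initialization and the per-step resymmetrization can be sketched as below. This is an assumption-laden illustration: the function names are hypothetical, only Cα coordinates are handled, rebuilding the assembly from the first chain is just one simple way to resymmetrize, and the dihedral, tetrahedral, octahedral and icosahedral cases would need the corresponding sets of rotation matrices.

```python
import numpy as np


def rotation_about_z(angle):
    """3x3 rotation matrix for a rotation by `angle` radians about the z axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def cyclic_copies(monomer_xyz, n):
    """Return an (n, L, 3) array: the monomer plus n - 1 copies rotated by k * 2*pi / n."""
    rotations = [rotation_about_z(2.0 * np.pi * k / n) for k in range(n)]
    return np.stack([monomer_xyz @ rot.T for rot in rotations])


def resymmetrize(assembly_xyz):
    """Rebuild an exactly C_n-symmetric assembly from its first chain."""
    n = assembly_xyz.shape[0]
    return cyclic_copies(assembly_xyz[0], n)
```

Rotating the whole assembly by one symmetry step maps each chain onto the next, so the assembly is unchanged up to a relabelling of chains.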

Despite not being trained on symmetric inputs, RFdiffusion is able to generate symmetric oligomers with high in silico success rates (Extended Data Fig. 5b), particularly when guided by an auxiliary inter- and intrachain contact potential (Extended Data Fig. 5c). As illustrated in Fig. 3 and Extended Data Fig. 5e, RFdiffusion designs are nearly indistinguishable from AF2 predictions of the structures adopted by the designed sequences, and many show little resemblance to previously solved protein structures (Extended Data Fig. 5d and Supplementary Table 1). Several of the oligomeric topologies are not seen in the PDB, including two-layer beta barrels (Fig. 3a, C10 symmetry) and complex mixed alpha/beta topologies (Fig. 3a, C8 symmetry; closest TM align in PDB 6BRP, 0.47, and PDB 6BRO, 0.43, respectively).

Fig. 3: Design and experimental characterization of symmetric oligomers.

a, RFdiffusion-generated assemblies overlaid with the AF2 structure predictions based on the designed sequences; in all five cases they are nearly indistinguishable (for the octahedron (bottom), the prediction was for the C3 substructure). Symmetries are indicated to the left of the design models. b,c, Designed assemblies characterized by nsEM. Model symmetries are as follows: cyclic, C3 (HE0822, 350 amino acids (AA) per chain), C6 (HE0626, 100 AA per chain) and C8 (HE0675, 60 AA per chain) (b); dihedral, D3 (HE0490, 80 AA per chain) and D4 (HE0537, 100 AA per chain) (c). From left to right: (1) symmetric design model, (2) AF2 prediction of design following sequence design with ProteinMPNN, (3) 2D class averages showing both top and side views (scale bar, 60 Å for all class averages) and (4) 3D reconstructions from class averages with the design model fit into the density map. The overall shapes are consistent with the design models, and confirm the intended oligomeric state. As in a, AF2 predictions of each design are nearly indistinguishable from the design model (backbone r.m.s.d.s (Å) for HE0822, HE0626, HE0490, HE0675 and HE0537, are 1.33, 1.03, 0.60, 0.74 and 0.75, respectively). d, nsEM characterization of an icosahedral particle (HE0902, 100 AA per chain). The design model, including the AF2 prediction of the C3 subunit are shown on the left. nsEM data are shown on the right: on top, a representative micrograph is shown alongside 2D class averages along each symmetry axis (C3, C2 and C5, from left to right) with the corresponding 3D reconstruction map views shown directly below overlaid on the design model.

We selected 608 designs for experimental characterization and found using size-exclusion chromatography (SEC) that at least 87 had oligomerization states closely consistent with the design models (within the 95% confidence interval, 126 designs within the 99% confidence interval, as determined by SEC calibration curves; Supplementary Figs. 4 and 5). We took advantage of the increased size of these oligomers (compared to the smaller unconditional and fold-conditioned monomers described above) and collected negative stain electron microscopy (nsEM) data on a subset of these designs across different symmetry groups. For most, distinct particles were evident with shapes resembling the design models in both the raw micrographs and subsequent two-dimensional (2D) classifications (Fig. 3 and Extended Data Fig. 5f). nsEM characterization of a C3 design (HE0822) with 350 residue subunits (1,050 residues in total) suggests that the actual structure is very close to the design, both over the 350 residue subunits and the overall C3 architecture. 2D class averages are clearly consistent with both top and side views of the design model, and a 3D reconstruction of the density has key features consistent with the design, including the distinctive pinwheel shape (Fig. 3b, top row). Electron microscopy 2D class averages of C5 and C6 designs with more than 750 residues (HE0794, HE0789, HE0841) were also consistent with the respective design models (Extended Data Fig. 5f).

RFdiffusion also generated cyclic oligomers with alpha and/or beta barrel structures that resemble expanded TIM barrels and provide an interesting comparison between innovation during natural evolution and innovation through deep learning. The TIM barrel fold, with eight strands and eight helices, is one of the most abundant folds in nature37. nsEM confirmed the structure of two RFdiffusion designed cyclic oligomers, which considerably extend beyond this fold (Fig. 3b, bottom rows). HE0626 is a C6 alpha–beta barrel composed of 18 strands and 18 helices, and HE0675 is a C8 octamer composed of an inner ring of 16 strands and an outer ring of 16 helices arranged locally in a very similar repeating pattern to the TIM barrel (1:1 helix:strand). For both HE0626 and HE0675 we obtained nsEM 3D reconstructions that are in agreement with the computational design models. The HE0600 design is also an alpha–beta barrel (Extended Data Fig. 5f), but has two strands for every helix (24 strands and 12 helices in total) and hence is locally different from a TIM barrel. Whereas natural evolution has extensively explored structural variations of the classic eight-strand or eight-helix TIM barrel fold, RFdiffusion can more readily explore global changes in barrel curvature, enabling discovery of TIM barrel-like structures with many more helices and strands.

RFdiffusion also readily generated structures with dihedral, tetrahedral and icosahedral symmetries (Fig. 3c,d and Extended Data Fig. 5e,f). SEC characterization indicated that 38 D2, seven D3 and three D4 designs had the expected molecular weights (these have four, six and eight chains, respectively) (Supplementary Fig. 5). Although the D2 dihedrals are too small for nsEM, 2D class averages (and, for some, 3D reconstructions) of D3 and D4 designs were congruent with the overall topologies of the design models (Fig. 3c and Extended Data Fig. 5f). Similarly, 3D reconstruction (Fig. 3c) and cryogenic electron microscopy (cryo-EM) 2D class averages (Extended Data Fig. 5g and Supplementary Fig. 6) of the D4 HE0537 closely match the design model, recapitulating the roughly 45° offset between tetrameric subunits. 2D nsEM class averages for a 12-chain tetrahedron (HE0964) were consistent with the design model (Extended Data Fig. 5f). Forty-eight icosahedra were selected for experimental validation, and one, HE0902, a 15 nm (diameter) highly porous assembly (Fig. 3d, left) was observed in nsEM micrographs to form homogeneous particles. 2D class averages and a 3D reconstruction very closely match the design model (Fig. 3d), with triangular hubs arrayed around the empty C5 axes. Designs such as HE0902 (and future similar large assemblies) should be useful as new nanomaterials and vaccine scaffolds, with robust assembly and (in the case of HE0902) the outward facing N and C termini offering many possibilities for antigen display.

Functional-motif scaffolding

We next investigated the use of RFdiffusion for scaffolding protein structural motifs that carry out binding and catalytic functions, in which the role of the scaffold is to hold the motif in precisely the 3D geometry needed for optimal function. In RFdiffusion, we input motifs as 3D coordinates (including sequence and sidechains) both during conditional training and inference, and build scaffolds that hold the motif atomic coordinates in place. Many deep-learning methods have been developed recently to address this problem, including RFjoint Inpainting4, constrained Hallucination4 and other DDPMs5,8,29. To rigorously evaluate the performance of these methods in comparison to RFdiffusion across a broad set of design challenges, we established an in silico benchmark test (Supplementary Table 9) comprising 25 motif-scaffolding design problems addressed in six recent publications encompassing several design methodologies4,5,29,38,39,40. The challenges span a broad range of motifs, including simple ‘inpainting’ problems, viral epitopes, receptor traps, small molecule binding sites, binding interfaces and enzyme active sites.

RFdiffusion solves 23 of the 25 benchmark problems, compared to 15 for Hallucination and 19 for RFjoint Inpainting (Fig. 4a,b). For 19 of the 23 problems solved by RFdiffusion, the fraction of successful designs is higher than for either Hallucination or RFjoint Inpainting. The excellent performance of RFdiffusion required no hyperparameter tuning or external potentials; this contrasts with Hallucination, for which problem-specific optimization can be required. In 17 of the 23 problems, RFdiffusion generated successful solutions with higher in silico success rates when noise was not added during the reverse diffusion trajectories (see Extended Data Fig. 1i for further discussion on the effect of noise on design quality, and Supplementary Fig. 8 for analysis of design diversity). The ability of RFdiffusion to scaffold functional motifs is not related to their presence in the RFdiffusion training set (Supplementary Fig. 7).
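The in silico success criterion behind these counts is stated in the Fig. 4a legend: AF2 r.m.s.d. to the design model below 2 Å, AF2 r.m.s.d. to the native functional motif below 1 Å, and AF2 pAE below 5. A minimal sketch of that filter follows; the class and field names are hypothetical and are not taken from the RFdiffusion code base.

```python
from dataclasses import dataclass


@dataclass
class DesignMetrics:
    """AF2-derived metrics for one design (field names are illustrative)."""
    rmsd_af2_vs_design: float  # r.m.s.d., AF2 prediction vs design model (Å)
    rmsd_af2_vs_motif: float   # r.m.s.d., AF2 prediction vs native motif (Å)
    af2_pae: float             # AF2 predicted aligned error


def is_in_silico_success(m: DesignMetrics) -> bool:
    """Apply the three thresholds quoted in the Fig. 4a legend."""
    return (
        m.rmsd_af2_vs_design < 2.0
        and m.rmsd_af2_vs_motif < 1.0
        and m.af2_pae < 5.0
    )


def success_rate(designs: list[DesignMetrics]) -> float:
    """Fraction of designs passing all thresholds (0.0 for an empty list)."""
    if not designs:
        return 0.0
    return sum(is_in_silico_success(d) for d in designs) / len(designs)


# Metrics reported for the 5TRV long example in the Fig. 4b legend.
example = DesignMetrics(rmsd_af2_vs_design=1.17, rmsd_af2_vs_motif=0.57, af2_pae=4.73)
print(is_in_silico_success(example))  # True
```

In the benchmark, a rate of this kind would be computed over the 100 designs generated for each problem.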

Fig. 4: Scaffolding of diverse functional sites with RFdiffusion.

a, RFdiffusion outperforms other methods across 25 benchmark motif-scaffolding problems collected from six recent publications (Supplementary Table 9). In silico success is defined as AF2 r.m.s.d. to design model less than 2 Å, AF2 r.m.s.d. to the native functional motif less than 1 Å and AF2 pAE less than 5. One hundred designs were generated per problem, with no previous optimization on the benchmark set (some optimization was necessary for Hallucination). Supplementary Table 10 presents full results. In silico success rates on the problems are correlated between the methods, and RFdiffusion can still struggle on challenging problems in which all methods have low success. b, Four examples of designs in which RFdiffusion significantly outperforms existing methods. Teal, native motif; colours, AF2 prediction of a design. Metrics (r.m.s.d. AF2 versus design/versus native motif (Å), AF2 pAE): 5TRV long, 1.17/0.57, 4.73; 6E6R long, 0.89/0.27, 4.56; 7MRX long, 0.84/0.82, 4.32; 5TPN, 0.59/0.49, 3.77. c, RFdiffusion can scaffold the p53 helix that binds MDM2 (left) and make extra contacts with the target (right, average 31% increased surface area; design was p53_design_89). Designs were generated with an RFdiffusion model fine-tuned on complexes. d, BLI measurements indicate high-affinity binding to MDM2 (p53_design_89, 0.7 nM; p53_design_53, 0.5 nM); the native affinity is 600 nM (ref. 42). e, Out of 95 designs, 55 showed binding to MDM2 (more than 50% of maximum response). Thirty-two of these were monomeric (Supplementary Fig. 10h). f, After fine-tuning (Supplementary Methods), RFdiffusion can scaffold enzyme active sites. An oxidoreductase example (EC1) is shown (PDB 1A4I); catalytic site (teal); RFdiffusion output (grey, model; colours, AF2 prediction); zoom of active site. AF2 versus design backbone r.m.s.d. 0.88 Å, AF2 versus design motif backbone r.m.s.d. 0.53 Å, AF2 versus design motif full-atom r.m.s.d. 1.05 Å, AF2 pAE 4.47. g, In silico success rates on active sites derived from EC1-5 (AF2 motif r.m.s.d. versus native: backbone less than 1 Å, backbone and sidechain atoms less than 1.5 Å; r.m.s.d. AF2 versus design less than 2 Å; AF2 pAE less than 5).

One of the benchmark problems is the scaffolding of the p53 helix that binds MDM2. Competitively inhibiting this interaction with high affinity, by scaffolding the p53 helix and making further interactions with MDM2, is a promising therapeutic avenue41. In silico success has been described elsewhere4, but experimental success has not been reported. We used an RFdiffusion model fine-tuned on protein complexes (Supplementary Methods) to generate 96 designs scaffolding this helix. We scaffolded the p53 helix in the presence of MDM2 so that RFdiffusion could design extra interactions, and experimentally identified 0.5 and 0.7 nM binders (Fig. 4c,d), an affinity three orders of magnitude higher than the reported 600 nM affinity of the p53 peptide alone42. The overall success rate was quite high: out of the 96 designs, 55 showed some detectable binding at 10 μM (Fig. 4e and Supplementary Fig. 10h).
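The "three orders of magnitude" figure can be checked directly from the dissociation constants quoted above (600 nM for the p53 peptide alone, 0.5 and 0.7 nM for the two designs). The helper below is only a worked-arithmetic sketch; its name is illustrative.

```python
import math


def orders_of_magnitude(kd_reference_nm: float, kd_design_nm: float) -> float:
    """log10 of the fold improvement in affinity (lower Kd = tighter binding)."""
    return math.log10(kd_reference_nm / kd_design_nm)


print(round(orders_of_magnitude(600.0, 0.5), 2))  # 3.08 (1,200-fold)
print(round(orders_of_magnitude(600.0, 0.7), 2))  # 2.93 (about 857-fold)
```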

Scaffolding enzyme active sites

A grand challenge in protein design is to scaffold minimal descriptions of enzyme active sites comprising a few single amino acids. Whereas some in silico success has been reported previously4, a general solution that can readily produce high-quality, orthogonally validated outputs remains elusive. Following fine-tuning on a task mimicking this problem (Supplementary Methods), RFdiffusion was able to scaffold enzyme active sites comprising many sidechain and backbone functional groups with high accuracy and in silico success rates across a range of enzyme classes (Fig. 4f and Extended Data Fig. 6a–d; in silico success required fine tuning). Although RFdiffusion is unable to explicitly model bound small molecules at present (however, see our conclusions), the substrate can be implicitly modelled using an external potential to guide the generation of ‘pockets’ around the active site. As a demonstration, we scaffold a retroaldolase active site triad while implicitly modelling the reaction substrate (Extended Data Fig. 6e–h).

Symmetric functional-motif scaffolding

Several important design challenges involve the scaffolding of several copies of a functional motif in symmetric arrangements. For example, many viral glycoproteins are trimeric, and symmetry-matched arrangements of inhibitory domains can be extremely potent43,44,45,46. Conversely, symmetric presentation of viral epitopes in an arrangement that mimics the virus could induce new classes of neutralizing antibodies47,48. To explore this general direction, we sought to design trimeric multivalent binders to the SARS-CoV-2 spike protein. In previous work, flexibly linking a binder that targets the ACE2 binding site (on the spike protein receptor binding domain) to a trimerization domain yielded a high-affinity inhibitor that had potent and broadly neutralizing antiviral activity in animal models43. Ideally, however, symmetric fusions to binders would be rigid, so as to reduce the entropic cost of binding while maintaining the avidity benefits from multivalency. We used RFdiffusion to design C3-symmetric trimers that rigidly hold three binding domains (the functional motif in this case) such that they exactly match the ACE2 binding sites on the SARS-CoV-2 spike protein trimer. The designs were confidently predicted by AF2 both to assemble as C3-symmetric oligomers and to scaffold the AHB2 SARS-CoV-2 binder interface with high accuracy (Fig. 5a).

Fig. 5: Symmetric motif scaffolding with RFdiffusion.

a, Design of symmetric oligomers scaffolding the binding interface of ACE2 mimic AHB2 (left, teal) against the SARS-CoV-2 spike trimer (left, grey). Three AHB2 copies are input to RFdiffusion along with C3 noise (middle); output are C3-symmetric oligomers holding the three AHB2 copies in place to engage all spike subunits. AF2 predictions (right) recapitulate the AHB2 structure with 0.6 Å r.m.s.d. over the asymmetric unit and 2.9 Å r.m.s.d. over the C3 assembly. b, Design of C4-symmetric oligomers to scaffold a Ni2+ binding motif (left). Starting from square-planar histidine rotamers within helical fragments (Supplementary Methods), RFdiffusion generates a C4 oligomer scaffolding the binding domain (middle). AF2 predictions (colour) agree closely with the design model (grey), with backbone r.m.s.d. less than 1.0 Å (right). c, nsEM 2D class averages (scale bar, 60 Å) and 3D reconstruction density are consistent with the symmetry and structure of the NiB1.17 design model shown superimposed on the density in ribbon representation (top). Isothermal titration calorimetry binding isotherm of design NiB1.17 (blue) indicates a dissociation constant less than 20 nM at a metal:monomer stoichiometry of 1:4. The H52A mutant isotherm (pink) ablates binding, indicating scaffolded histidine residues are critical for metal binding. d, Additional experimentally characterized Ni2+ binders NiB2.15 (left), NiB1.12 (middle) and NiB1.20 (right). Metal-coordinating sidechains in the design models (top, teal) are closely recapitulated in the AF2 predictions (colours). 2D nsEM class averages (middle; scale bar, 60 Å) are consistent with design models. Binding isotherms for wild-type (WT) and H52A mutant (bottom) indicate Ni2+ binding mediated directly by the scaffolded histidines at the designed stoichiometry. Note that for ITC plots, points represent single measurements.

The ability to scaffold functional sites with any desired symmetry opens up new approaches to designing metal-coordinating protein assemblies49,50. Divalent transition metal ions show distinct preferences for specific coordination geometries (for example, square planar, tetrahedral and octahedral) with ion-specific optimal sidechain–metal bond lengths. RFdiffusion provides a general route to building up symmetric protein assemblies around such sites, with the symmetry of the assembly matching the symmetry of the coordination geometry. As a first test, we sought to design square-planar Ni2+ binding sites. We designed C4 protein assemblies with four central histidine imidazoles arranged in an ideal Ni2+-binding site with square-planar coordination geometry (Fig. 5b). Diverse designs starting from distinct C4-symmetric histidine square-planar sites had good in silico success with the histidine residues in near ideal geometries for coordinating metal in the AF2-predicted structures (Supplementary Fig. 9).
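To illustrate why C4 symmetry matches a square-planar site, the sketch below places one coordinating atom off the symmetry axis and generates its symmetry mates by rotation about z. This is a geometric illustration only, not the RFdiffusion symmetrization procedure; the function name is hypothetical and the 2.0 Å metal-to-ligand distance is a placeholder, not a value from the paper.

```python
import math


def cyclic_copies(point, n):
    """Return the n images of `point` under Cn rotation about the z axis."""
    x, y, z = point
    copies = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n  # rotation angle of the k-th subunit
        c, s = math.cos(theta), math.sin(theta)
        copies.append((c * x - s * y, s * x + c * y, z))
    return copies


# Metal ion at the origin; one ligand atom at a placeholder distance of 2.0 Å.
ligands = cyclic_copies((2.0, 0.0, 0.0), 4)
for atom in ligands:
    print(tuple(round(v, 3) for v in atom))
```

The four images lie in one plane, are equidistant from the axis and are separated by 90°, which is the square-planar arrangement. The same function with n = 3 gives the threefold arrangement used for trimeric targets.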

We expressed and purified 44 designs in Escherichia coli, and found that 37 had SEC chromatograms consistent with the intended oligomeric state (Extended Data Fig. 7b). Of the designs, 36 were tested for Ni2+ coordination by isothermal titration calorimetry, and 18 were found to bind Ni2+ with dissociation constants ranging from low nanomolar to low micromolar (Fig. 5c,d and Extended Data Fig. 7a). The inflection points in the wild-type isotherms indicate binding with the designed stoichiometry, a one to four ratio of ion to monomer. Although most of the designed proteins showed exothermic metal coordination, in a few cases binding was endothermic (Fig. 5d, left and Extended Data Fig. 7a: NiB2.9, NiB2.10, NiB2.15 and NiB2.23), suggesting that Ni2+ coordination is entropically driven in these assemblies. To confirm that Ni2+ binding was indeed mediated by the scaffolded histidine 52, we mutated this residue to alanine, which abolished or notably reduced binding in 17 out of 17 cases with successful expression (Extended Data Figs. 7a,c and Fig. 5c,d; one mutant did not express). We structurally characterized by nsEM a subset of the designs—NiB1.12, NiB1.15, NiB1.17 and NiB1.20—that showed histidine-dependent binding. All four designs showed clear fourfold symmetry both in the raw micrographs and in 2D class averages (Fig. 5c,d), with design NiB1.17 also clearly showing twofold axis side views with a measured diameter approximating the design model. A 3D reconstruction of NiB1.17 was in close agreement with the design model (Fig. 5c).

Design of protein-binding proteins

The design of high-affinity binders to target proteins is a grand challenge in protein design, with numerous therapeutic applications51. A general method for de novo binder design from target structure information alone using the physically based Rosetta method was recently described12, and the subsequent use of ProteinMPNN for sequence design and AF2 for design filtering was found to improve design success rates26. However, experimental success rates were low, still requiring many thousands of designs to be screened for each design campaign12, and the approach relied on prespecifying a particular set of protein scaffolds as the basis for the designs, inherently limiting the diversity and shape complementarity of possible solutions12. To our knowledge, no deep-learning method has yet demonstrated general experimental success in designing completely de novo binders.

We reasoned that RFdiffusion might be able to address this challenge by directly generating binding proteins in the context of the target. For many therapeutic applications, for example, blocking a protein–protein interaction, it is desirable to bind to a particular site on a target protein. To enable this, we fine-tuned RFdiffusion on protein complex structures, providing a feature as input indicating a subset of the residues on the target chain (called ‘interface hotspots’) to which the diffused chain binds (Fig. 6a and Extended Data Fig. 8a,b). For design challenges in which a particular binder fold might be especially compatible, we enabled coarse-grained control over binder scaffold topology by fine-tuning an extra model to condition binder diffusion on secondary structure and block-adjacency information, in addition to conditioning on interface hotspots (Extended Data Fig. 8c,d and Supplementary Methods).

Fig. 6: De novo design of protein-binding proteins.

a, RFdiffusion generates protein binders given a target and specification of interface hotspot residues. b, De novo binders were designed to five protein targets (Influenza A H1 HA, IL-7Rα, InsR, PD-L1 and TrkA), and hits with BLI response greater than or equal to 50% of the positive control were identified for all targets. For IL-7Rα, InsR, PD-L1 and TrkA, RFdiffusion has success rates roughly two orders of magnitude higher than the original design campaigns. We attribute one order of magnitude to RFdiffusion, and the second to filtering with AF2 (estimated success rates for previous campaigns if AF2 filtering had been used: HA, 0%; IL-7Rα, 2.2%; InsR, 5.5%; PD-L1, 3.7%; TrkA, 1.5%). c, For IL-7Rα, InsR, PD-L1 and TrkA, the highest affinity binder is shown above a BLI titration series. Reported KD values are based on global kinetic fitting with fixed global Rmax. d, The highest affinity HA binder, HA_20, binds with a KD of 28 nM. c,d, Yellow or orange, target or hotspot residues; grey, design model; purple, AF2 prediction (r.m.s.d. AF2 versus design). Binders: IL7Ra_55 (2.1 Å), InsulinR_30 (2.6 Å), PDL1_77 (1.5 Å), TrkA_88 (1.4 Å) (left to right in c) and HA_20 (1.7 Å) (d). e, Cryo-EM 2D class averages of HA_20 bound to influenza HA, strain A/USA:Iowa/1943 H1N1 (scale bar, 10 nm). f, 2.9 Å cryo-EM 3D reconstruction of the complex viewed along two orthogonal axes. HA_20 (purple) is bound to H1 along the stem of all three subunits. g, The cryo-EM structure of the HA_20 binder in complex closely matches the design model (r.m.s.d. to RFdiffusion design, 0.63 Å; yellow, influenza HA). h, Structure of the HA_20 binder alone superimposed on the design model viewed along two orthogonal axes. For cryo-EM panels: yellow, Influenza H1 map and/or structure; grey, HA_20 binder design model; purple, HA_20 binder map or structure.

To compare RFdiffusion to previous binder design methods, we performed binder design campaigns against five targets: Influenza A H1 Haemagglutinin (HA)52, Interleukin-7 Receptor-α (IL-7Rα)12, Programmed Death-Ligand 1 (PD-L1)12, Insulin Receptor (InsR) and Tropomyosin Receptor Kinase A (TrkA)12. We designed putative binders to each target, both with and without conditioning on compatible fold information, with high in silico success rates (Extended Data Fig. 8e,f). Designs were filtered by AF2 confidence in the interface and monomer structure26, and 95 were selected for each target for experimental characterization.

The designed binders were expressed in E. coli and purified, and binding was assessed through single point biolayer interferometry (BLI) screening at 10 μM binder concentration (Extended Data Fig. 8g). The overall experimental success rate, defined as binding at or above 50% of the maximal response for the positive control, was 19% (this is a conservative estimate as some designs that showed binding had insufficient material to permit screening at 10 μM: Extended Data Fig. 8g), an increase of roughly two orders of magnitude over our previous Rosetta-based method on the same targets (Fig. 6b). Binders were identified for all five targets, with fewer than 100 designs tested per target compared to thousands in previous studies. Full BLI titrations for a subset of the designs showed nanomolar affinities with no further experimental optimization, including HA and IL-7Rα binders with affinities of roughly 30 nM (Fig. 6c). Binding interfaces were often highly distinct from interfaces to these targets in the PDB (Supplementary Figs. 11 and 12). To assess binder specificity, six of the highest affinity IL-7Rα binders were assessed by means of competition BLI, and all six competed for binding with a structurally validated positive control binding to the same site (Supplementary Fig. 10a; further work is required to fully characterize proteome-wide specificity).
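The success criterion described above (BLI response at or above 50% of the positive-control response) amounts to a simple threshold count. The following sketch shows the calculation; the function name is illustrative and the response values are invented placeholders, not measurements from this study.

```python
def hit_rate(responses, positive_control, threshold=0.5):
    """Fraction of designs whose BLI response is at least
    `threshold` times the positive-control response."""
    if not responses:
        raise ValueError("at least one response is required")
    cutoff = threshold * positive_control
    hits = sum(1 for r in responses if r >= cutoff)
    return hits / len(responses)


# Hypothetical single-point responses (arbitrary units) for four designs.
placeholder_responses = [0.10, 0.60, 0.50, 0.20]
print(hit_rate(placeholder_responses, positive_control=1.0))  # 0.5
```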

We solved the structure of the highest affinity Influenza binder, HA_20, in complex with Iowa43 HA using cryo-EM (Extended Data Table 1). Raw electron micrographs revealed a well-folded HA glycoprotein with clearly discernible side, top and tilted view orientations suspended in a thin layer of vitreous ice (Extended Data Fig. 9a). The 2D class averages further show clear secondary structure elements corresponding to both Iowa43 HA (Extended Data Fig. 9b) and the HA_20 binder bound to the stem (Fig. 6e). The 3D heterogeneous refinement without symmetry revealed full occupancy of all three HA stem epitopes by the HA_20 binder. A final non-uniform 3D refinement reconstruction with C3 symmetry yielded a 2.9 Å map of the HA/HA_20 protein–protein complex (Fig. 6f) and corresponding 3D structure that almost perfectly matches the computational design model (0.63 Å, Fig. 6f,g; the sidechain interactions at the interface are very different from the closest structure in the PDB; Extended Data Fig. 9h). Over the binder alone, the experimental structure deviates from the RFdiffusion design by only 0.6 Å (Fig. 6h). These results demonstrate the ability of RFdiffusion to generate new proteins with atomic level accuracy, and to precisely target functionally relevant sites on therapeutically important proteins.
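The agreement figures quoted here are root-mean-square deviations between matched atoms. As a reminder of what that number measures, the sketch below computes an r.m.s.d. for two coordinate sets that are assumed to be already superimposed; structural comparisons normally perform an optimal superposition first, which is omitted here. The function name and toy coordinates are illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) coordinates, assumed to be already superimposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Toy example: every atom displaced by 0.6 Å along y.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
observed = [(x, y + 0.6, z) for x, y, z in model]
print(round(rmsd(model, observed), 2))  # 0.6
```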

Discussion

RFdiffusion is a comprehensive improvement over current protein design methods. RFdiffusion readily generates diverse unconditional designs up to 600 residues in length that are accurately predicted by AF2, far exceeding the complexity and accuracy achieved by most previous methods (a recent Hallucination-based approach also achieved high unconditional performance53). Half of our tested unconditional designs express in a soluble way and have circular dichroism spectra consistent with the design models and high thermostability. Despite their substantially increased complexity, the ideality and stability of RFdiffusion designs are akin to those of de novo protein designs generated using previous methods such as Rosetta. RFdiffusion enables generation of higher-order architectures with any desired symmetry, unlike Hallucination methods, which have so far been limited to cyclic symmetries. Electron microscopy confirmed that the structures of these oligomers are very similar to the design models, which in many cases show little global similarity to known protein oligomers.

There has been recent progress in scaffolding protein functional motifs using deep-learning methods (RF Hallucination, RFjoint Inpainting and diffusion), but Hallucination is slow for large systems, Inpainting fails when insufficient starting information is provided and previous diffusion methods had low accuracy. RFdiffusion outperforms these previous methods in the complexity of the motifs that can be scaffolded, the precision with which sidechains are positioned (for catalysis and other functions), and the accuracy of motif recapitulation by AF2. The design of MDM2 binding proteins with affinities three orders of magnitude higher than that of the scaffolded p53 motif demonstrates the robustness of RFdiffusion motif scaffolding. Combining accurate motif scaffolding with the design of symmetric assemblies enabled consistent and atomically precise positioning of sidechains to coordinate Ni2+ ions across diverse tetrameric assemblies.

For binder design from target structural information alone, previous work required testing tens of thousands of sequences12. RFdiffusion, when combined with improved filtering26, raises experimental success rates by two orders of magnitude; high-affinity binders can be identified from dozens of designs, in many cases eliminating the requirement for slow and expensive high-throughput screening (at least for the non-polar sites targeted here; further studies will be required to assess success rates on more polar target sites and sites without native binding partners). A high-resolution cryo-EM structure of one of these designs in complex with influenza HA shows that RFdiffusion can design functional proteins with atomic accuracy. Vázquez Torres et al. demonstrate the ability of RFdiffusion to design picomolar affinity binders to flexible helical peptides54, further highlighting its use for de novo binder design. Vázquez Torres et al. also show how RFdiffusion can be extended for protein model refinement by partial noising and denoising, which enables tuneable sampling around a given input structure. For peptide binder design, this enabled increases in affinity of nearly three orders of magnitude without high-throughput screening.

The breadth and complexity of problems solvable with RFdiffusion and the robustness and accuracy of the solutions far exceed what has been achieved previously. In a manner reminiscent of the generation of images from text prompts, RFdiffusion makes possible, with minimal specialist knowledge, the generation of functional proteins from minimal molecular specifications (for example, high-affinity binders to a user-specified target protein, and diverse protein assemblies from user-specified symmetries).

The power and scope of RFdiffusion can be extended in several directions. RF has recently been extended to nucleic acids and protein–nucleic acid complexes55, which should enable RFdiffusion to design nucleic acid binding proteins and perhaps folded RNA structures. Extension of RF to incorporate ligands should similarly enable extension of RFdiffusion to explicitly model ligand atoms, and allow the design of protein–ligand interactions. The ability to customize RFdiffusion to specific design challenges by addition of external potentials and by fine-tuning (as illustrated here for catalytic site scaffolding, binder-targeting and fold specification), along with continued improvements to the underlying methodology, should enable de novo protein design to achieve still higher levels of complexity, to approach and, in some cases, surpass what natural evolution has achieved.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.