
TrOCR: Transformer-based Optical Character Recognition
with Pre-trained Models

Minghao Li1, Tengchao Lv2, Jingye Chen2, Lei Cui2,

Yijuan Lu2, Dinei Florencio2, Cha Zhang2, Zhoujun Li1, Furu Wei2
Work done during internship at Microsoft Research Asia.
Abstract

Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built based on CNN for image understanding and RNN for char-level text generation. In addition, another language model is usually needed to improve the overall accuracy as a post-processing step. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on the printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.

Introduction

Optical Character Recognition (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo or from subtitle text superimposed on an image. Typically, an OCR system includes two main modules: a text detection module and a text recognition module. Text detection aims to localize all text blocks within the text image, either at word-level or textline-level. The text detection task is usually considered as an object detection problem where conventional object detection models such as YOLOv5 and DBNet (Liao et al. 2019) can be applied. Meanwhile, text recognition aims to understand the text image content and transcribe the visual signals into natural language tokens. The text recognition task is usually framed as an encoder-decoder problem, where existing methods leverage a CNN-based encoder for image understanding and an RNN-based decoder for text generation. In this paper, we focus on the text recognition task for document images and leave text detection as future work.

Recent progress in text recognition (Diaz et al. 2021) has witnessed significant improvements by taking advantage of the Transformer (Vaswani et al. 2017) architecture. However, existing methods are still based on CNNs as the backbone, where the self-attention is built on top of CNN backbones as encoders to understand the text image. For decoders, Connectionist Temporal Classification (CTC) (Graves et al. 2006) is usually used together with an external character-level language model to improve the overall accuracy. Despite the great success achieved by the hybrid encoder/decoder method, there is still a lot of room to improve with pre-trained CV and NLP models: 1) the network parameters in existing methods are trained from scratch with synthetic/human-labeled datasets, leaving large-scale pre-trained models unexplored; 2) as image Transformers become more and more popular (Dosovitskiy et al. 2021; Touvron et al. 2021), especially the recent self-supervised image pre-training (Bao, Dong, and Wei 2021), it is straightforward to investigate whether pre-trained image Transformers can replace CNN backbones, meanwhile exploiting the pre-trained image Transformers to work together with the pre-trained text Transformers in a single framework on the text recognition task.

Figure 1: The architecture of TrOCR, where an encoder-decoder model is designed with a pre-trained image Transformer as the encoder and a pre-trained text Transformer as the decoder.

To this end, we propose TrOCR, an end-to-end Transformer-based OCR model for text recognition with pre-trained CV and NLP models, which is shown in Figure 1. Distinct from the existing text recognition models, TrOCR is a simple but effective model which does not use the CNN as the backbone. Instead, following (Dosovitskiy et al. 2021), it first resizes the input text image into $384\times 384$ and then the image is split into a sequence of $16\times 16$ patches which are used as the input to image Transformers. Standard Transformer architecture with the self-attention mechanism is leveraged on both encoder and decoder parts, where wordpiece units are generated as the recognized text from the input image. To effectively train the TrOCR model, the encoder can be initialized with pre-trained ViT-style models (Dosovitskiy et al. 2021; Touvron et al. 2021; Bao, Dong, and Wei 2021) while the decoder can be initialized with pre-trained BERT-style models (Devlin et al. 2019; Liu et al. 2019; Dong et al. 2019; Wang et al. 2020b), respectively. Therefore, the advantage of TrOCR is three-fold. First, TrOCR uses the pre-trained image Transformer and text Transformer models, which take advantage of large-scale unlabeled data for image understanding and language modeling, with no need for an external language model. Second, TrOCR does not require any convolutional network for the backbone and does not introduce any image-specific inductive biases, which makes the model very easy to implement and maintain. Finally, experimental results on OCR benchmark datasets show that TrOCR can achieve state-of-the-art results on printed, handwritten and scene text image datasets without any complex pre/post-processing steps. Furthermore, TrOCR can be easily extended for multilingual text recognition with minimal effort, by simply leveraging multilingual pre-trained models on the decoder side and expanding the dictionary.

The contributions of this paper are summarized as follows:

  1. We propose TrOCR, an end-to-end Transformer-based OCR model for text recognition with pre-trained CV and NLP models. To the best of our knowledge, this is the first work that jointly leverages pre-trained image and text Transformers for the text recognition task in OCR.

  2. TrOCR achieves state-of-the-art results with a standard Transformer-based encoder-decoder model, which is convolution free and does not rely on any complex pre/post-processing steps.

  3. The TrOCR models and code are publicly available at https://aka.ms/trocr.

TrOCR

Model Architecture

TrOCR is built up with the Transformer architecture, including an image Transformer for extracting the visual features and a text Transformer for language modeling. We adopt the vanilla Transformer encoder-decoder structure in TrOCR. The encoder is designed to obtain the representation of the image patches, while the decoder generates the wordpiece sequence under the guidance of the visual features and previous predictions.

Encoder

The encoder receives an input image $x_{\rm img}\in\Re^{3\times H_{0}\times W_{0}}$, and resizes it to a fixed size $(H,W)$. Since the Transformer encoder cannot process the raw images unless they are a sequence of input tokens, the encoder decomposes the input image into a batch of $N=HW/P^{2}$ square patches with a fixed size of $(P,P)$, while the width $W$ and the height $H$ of the resized image are guaranteed to be divisible by the patch size $P$. Subsequently, the patches are flattened into vectors and linearly projected to $D$-dimensional vectors, i.e., the patch embeddings. $D$ is the hidden size of the Transformer through all of its layers.
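As a concrete illustration, the patch decomposition and linear projection can be sketched in a few lines of PyTorch. The tensor shapes follow the $384\times 384$ resolution and $16\times 16$ patch size used later in the paper, which gives $N = 384\cdot 384/16^{2} = 576$ patches; the batch size and variable names are our own and the snippet is only a sketch of the mechanism, not the released implementation.

```python
import torch
import torch.nn as nn

B, H, W, P, D = 8, 384, 384, 16, 768   # batch, resized height/width, patch size, hidden size
N = (H * W) // (P * P)                 # number of patches: 384*384 / 16^2 = 576

images = torch.randn(B, 3, H, W)       # resized input images

# Split each image into non-overlapping P x P patches and flatten them.
patches = images.unfold(2, P, P).unfold(3, P, P)               # (B, 3, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, N, 3 * P * P)

# Linear projection to D-dimensional patch embeddings, plus learnable 1D position embeddings.
proj = nn.Linear(3 * P * P, D)
pos_emb = nn.Parameter(torch.zeros(1, N, D))
patch_embeddings = proj(patches) + pos_emb                     # (B, N, D), fed to the encoder
```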

Similar to ViT (Dosovitskiy et al. 2021) and DeiT (Touvron et al. 2021), we keep the special token “[CLS]” that is usually used for image classification tasks. The “[CLS]” token brings together all the information from all the patch embeddings and represents the whole image. Meanwhile, we also keep the distillation token in the input sequence when using the DeiT pre-trained models for encoder initialization, which allows the model to learn from the teacher model. The patch embeddings and two special tokens are given learnable 1D position embeddings according to their absolute positions.

Unlike the features extracted by a CNN-like network, the Transformer models have no image-specific inductive biases and process the image as a sequence of patches, which makes it easier for the model to pay attention to either the whole image or the individual patches.

Decoder

We use the original Transformer decoder for TrOCR. The standard Transformer decoder also has a stack of identical layers, which have structures similar to the layers in the encoder, except that the decoder inserts the "encoder-decoder attention" between the multi-head self-attention and the feed-forward network to distribute different attention over the output of the encoder. In the encoder-decoder attention module, the keys and values come from the encoder output, while the queries come from the decoder input. In addition, the decoder leverages attention masking in the self-attention to prevent itself from accessing more information during training than is available at prediction time. Based on the fact that the output of the decoder is shifted right by one place from the input of the decoder, the attention mask needs to ensure that the output for position $i$ can only pay attention to the previous outputs, i.e., the inputs at positions less than $i$:

$h_{i} = \mathrm{Proj}(\mathrm{Emb}(\mathrm{Token}_{i}))$
$\sigma(h_{ij}) = \frac{e^{h_{ij}}}{\sum_{k=1}^{V}e^{h_{ik}}} \quad \text{for } j=1,2,\dots,V$

The hidden states from the decoder are projected by a linear layer from the model dimension to the dimension of the vocabulary size $V$, and the probabilities over the vocabulary are computed from them by the softmax function. We use beam search to get the final output.
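A minimal PyTorch sketch of the causal attention mask and the output projection described above is given below; the dimensions are illustrative (the vocabulary size matches RoBERTa's 50,265 wordpieces), and the greedy step at the end stands in for the beam search actually used.

```python
import torch
import torch.nn.functional as F

T, D, V = 5, 1024, 50265     # decoder length, hidden size, vocabulary size (illustrative values)

# Causal mask: position i may only attend to positions j <= i.
# Masked positions receive -inf before the softmax inside self-attention.
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

decoder_hidden = torch.randn(T, D)      # hidden states h_i from the last decoder layer
output_proj = torch.nn.Linear(D, V)     # projection from the model dimension to the vocabulary size

logits = output_proj(decoder_hidden)    # (T, V)
probs = F.softmax(logits, dim=-1)       # sigma(h_ij) over the vocabulary, as in the equation above
next_token = probs[-1].argmax()         # greedy choice; the paper uses beam search instead
```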

Model Initialization

Both the encoder and the decoder are initialized by the public models pre-trained on large-scale labeled and unlabeled datasets.

Encoder Initialization

The DeiT (Touvron et al. 2021) and BEiT (Bao, Dong, and Wei 2021) models are used for the encoder initialization in the TrOCR models. DeiT trains the image Transformer with ImageNet (Deng et al. 2009) as the sole training set. The authors try different hyper-parameters and data augmentation to make the model data-efficient. Moreover, they distill the knowledge of a strong image classifier to a distilled token in the initial embedding, which leads to a competitive result compared to the CNN-based models.

Referring to the Masked Language Model pre-training task, BEiT proposes the Masked Image Modeling task to pre-train the image Transformer. Each image will be converted to two views: image patches and visual tokens. They tokenize the original image into visual tokens by the latent codes of discrete VAE (Ramesh et al. 2021), randomly mask some image patches, and make the model recover the original visual tokens. The structure of BEiT is the same as the image Transformer and lacks the distilled token when compared with DeiT.

Decoder Initialization

We use the RoBERTa (Liu et al. 2019) models and the MiniLM (Wang et al. 2020b) models to initialize the decoder. Generally, RoBERTa is a replication study of (Devlin et al. 2019) that carefully measures the impact of many key hyperparameters and training data size. Based on BERT, they remove the next sentence prediction objective and dynamically change the masking pattern of the Masked Language Model.

The MiniLM models are compressed versions of large pre-trained Transformer models that retain about 99% of the performance. Instead of using the soft target probabilities of masked language modeling predictions or the intermediate representations of the teacher models to guide the training of the student models as in previous work, the MiniLM models are trained by distilling the self-attention module of the last Transformer layer of the teacher models and introducing a teacher assistant to assist the distillation.

When loading the above models into the decoders, the structures do not precisely match, since both of them are only the encoder of the Transformer architecture. For example, the encoder-decoder attention layers are absent in these models. To address this, we initialize the decoders with the RoBERTa and MiniLM models by manually setting the corresponding parameter mapping, and the absent parameters are randomly initialized.
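The parameter-mapping step can be pictured as a partial state-dict load: weights whose names match a decoder parameter are copied over, and everything else (such as the encoder-decoder cross-attention) keeps its random initialization. The sketch below is a generic PyTorch illustration under assumed module names, not the exact mapping used in the released code.

```python
import torch

def init_decoder_from_encoder_only_lm(decoder: torch.nn.Module, lm_checkpoint_path: str):
    """Copy matching parameters from an encoder-only LM (e.g., RoBERTa/MiniLM) into the decoder.

    Parameters without a counterpart in the checkpoint (such as the encoder-decoder
    cross-attention) keep their random initialization.
    """
    lm_state = torch.load(lm_checkpoint_path, map_location="cpu")   # assumed to be a raw state dict
    decoder_state = decoder.state_dict()

    copied, skipped = [], []
    for name, tensor in decoder_state.items():
        # A manually defined mapping would translate decoder parameter names to LM names;
        # here we assume identical names for the parameters that do correspond.
        if name in lm_state and lm_state[name].shape == tensor.shape:
            decoder_state[name] = lm_state[name]
            copied.append(name)
        else:
            skipped.append(name)   # e.g., cross-attention weights stay randomly initialized

    decoder.load_state_dict(decoder_state)
    return copied, skipped
```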

Task Pipeline

In this work, the pipeline of the text recognition task is that given the textline images, the model extracts the visual features and predicts the wordpiece tokens relying on the image and the context generated before. The sequence of ground truth tokens is followed by an “[EOS]” token, which indicates the end of a sentence. During training, we shift the sequence backward by one place and add the “[BOS]” token to the beginning indicating the start of generation. The shifted ground truth sequence is fed into the decoder, and the output of that is supervised by the original ground truth sequence with the cross-entropy loss. For inference, the decoder starts from the “[BOS]” token to predict the output iteratively while continuously taking the newly generated output as the next input.
在这项工作中,文本识别任务的流水线是给定文本行图像,模型提取视觉特征并依赖于图像和之前生成的上下文来预测单词标记。地面真值标记的序列后面是一个“[EOS]”标记,表示句子的结束。在训练过程中,我们将序列向后移动一个位置,并将“[BOS]”标记添加到开头,指示生成的开始。移位后的地面真值序列被馈送到解码器,其输出由具有交叉熵损失的原始地面真值序列监督。对于推理,解码器从“[BOS]”令牌开始迭代地预测输出,同时连续地将新生成的输出作为下一个输入。
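A minimal sketch of this teacher-forcing setup is shown below; the special-token ids and the example wordpiece ids are assumptions chosen for illustration only.

```python
import torch
import torch.nn.functional as F

BOS, EOS, V = 0, 2, 50265                    # assumed special-token ids and vocabulary size

# Ground truth wordpiece ids for one textline, terminated by "[EOS]".
target = torch.tensor([4215, 87, 913, EOS])

# Decoder input: the target shifted by one place, starting from "[BOS]".
decoder_input = torch.cat([torch.tensor([BOS]), target[:-1]])   # [BOS, 4215, 87, 913]

# During training the decoder consumes `decoder_input` (plus the encoder output)
# and produces one distribution over the vocabulary per position.
logits = torch.randn(len(decoder_input), V)  # stand-in for the decoder output

# The output at every position is supervised by the original (unshifted) target.
loss = F.cross_entropy(logits, target)
```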

Pre-training

We use the text recognition task for the pre-training phase, since this task makes the models learn knowledge of both visual feature extraction and language modeling. The pre-training process is divided into two stages that differ in the datasets used. In the first stage, we synthesize a large-scale dataset consisting of hundreds of millions of printed textline images and pre-train the TrOCR models on it. In the second stage, we build two relatively small datasets corresponding to the printed and handwritten downstream tasks, containing millions of textline images each. We use the existing and widely adopted synthetic scene text datasets for the scene text recognition task. Subsequently, we pre-train separate models on these task-specific datasets in the second stage, all initialized by the first-stage model.

Fine-tuning

Except for the experiments regarding scene text recognition, the pre-trained TrOCR models are fine-tuned on the downstream text recognition tasks. The outputs of the TrOCR models are based on Byte Pair Encoding (BPE) (Sennrich, Haddow, and Birch 2015) and SentencePiece (Kudo and Richardson 2018) and do not rely on any task-related vocabularies.
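For illustration, wordpiece-level output (as opposed to character-level output) looks like the following; the model file path is hypothetical, and the exact subword split depends on the tokenizer shipped with the released models.

```python
import sentencepiece as spm

# Hypothetical path to a trained SentencePiece/BPE model.
sp = spm.SentencePieceProcessor()
sp.Load("trocr_wordpiece.model")

text = "RECEIPT TOTAL 23.50"
pieces = sp.EncodeAsPieces(text)   # subword pieces, e.g. ['▁RE', 'CE', 'IPT', '▁TOTAL', ...]
ids = sp.EncodeAsIds(text)         # the wordpiece ids the decoder is trained to generate

# Detokenization maps the predicted pieces back to the output string.
restored = sp.DecodePieces(pieces)
print(pieces, ids, restored)
```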

Encoder | Decoder | Precision | Recall | F1
DeiT_BASE | RoBERTa_BASE | 69.28 | 69.06 | 69.17
BEiT_BASE | RoBERTa_BASE | 76.45 | 76.18 | 76.31
ResNet50 | RoBERTa_BASE | 66.74 | 67.29 | 67.02
DeiT_BASE | RoBERTa_LARGE | 77.03 | 76.53 | 76.78
BEiT_BASE | RoBERTa_LARGE | 79.67 | 79.06 | 79.36
ResNet50 | RoBERTa_LARGE | 72.54 | 71.13 | 71.83
Table 1: Ablation study on the SROIE dataset, where all the models are trained using the SROIE dataset only.
Model | Precision | Recall | F1
From Scratch | 38.06 | 38.43 | 38.24
+ Pretrained Model | 72.95 | 72.56 | 72.75
+ Data Augmentation | 82.58 | 82.03 | 82.30
+ First-Stage Pretrain | 95.31 | 95.65 | 95.48
+ Second-Stage Pretrain | 95.76 | 95.91 | 95.84
Table 2: Ablation study of pretrained model initialization, data augmentation and two stages of pre-training on the SROIE dataset.

Data Augmentation

We leverage data augmentation to enhance the variety of the pre-training and fine-tuning data. Six kinds of image transformations plus keeping the original are taken for the printed and handwritten datasets: random rotation (-10 to 10 degrees), Gaussian blurring, image dilation, image erosion, downscaling, and underlining. We randomly decide which image transformation to take with equal probability for each sample. For scene text datasets, RandAugment (Cubuk et al. 2020) is applied following (Atienza 2021), and the augmentation types include inversion, curving, blur, noise, distortion, rotation, etc.
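A minimal sketch of the equal-probability choice among the six transformations (plus keeping the original) for printed/handwritten textlines is shown below, using Pillow. The parameter values, the morphological filters standing in for dilation/erosion, and the sample file name are illustrative assumptions rather than the exact settings used for TrOCR.

```python
import random
from PIL import Image, ImageDraw, ImageFilter

def random_augment(im: Image.Image) -> Image.Image:
    """Apply one of seven options (six transforms + keep original) with equal probability."""
    def underline(img):
        img = img.copy()
        w, h = img.size
        ImageDraw.Draw(img).line([(0, h - 2), (w, h - 2)], fill=0, width=1)
        return img

    transforms = [
        lambda img: img,                                                # keep the original
        lambda img: img.rotate(random.uniform(-10, 10), expand=True,
                               fillcolor=255),                         # random rotation
        lambda img: img.filter(ImageFilter.GaussianBlur(radius=1.5)),  # Gaussian blurring
        lambda img: img.filter(ImageFilter.MaxFilter(3)),              # dilation (max filter)
        lambda img: img.filter(ImageFilter.MinFilter(3)),              # erosion (min filter)
        lambda img: img.resize((img.width // 2, img.height // 2)),     # downscaling
        underline,                                                      # underlining
    ]
    return random.choice(transforms)(im)

# "textline.png" is a placeholder sample image.
augmented = random_augment(Image.open("textline.png").convert("L"))
```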

Experiments

Data

Pre-training Dataset

To build a large-scale high-quality dataset, we sample two million document pages from the publicly available PDF files on the Internet. Since the PDF files are digital-born, we can get pretty printed textline images by converting them into page images and extracting the textlines with their cropped images. In total, the first-stage pre-training dataset contains 684M textlines.

We use 5,427 handwritten fonts (obtained from https://fonts.google.com/?category=Handwriting and https://www.1001fonts.com/handwritten-fonts.html) to synthesize handwritten textline images with TRDG (https://github.com/Belval/TextRecognitionDataGenerator), an open-source text recognition data generator. The text used for generation is crawled from random pages of Wikipedia. The handwritten dataset for the second-stage pre-training consists of 17.9M textlines, including the IIIT-HWS dataset (Krishnan and Jawahar 2016). In addition, we collect around 53K receipt images in the real world and recognize the text on them by commercial OCR engines. According to the results, we crop the textlines by their coordinates and rectify them into normalized images. We also use TRDG to synthesize 1M printed textline images with two receipt fonts and the built-in printed fonts. In total, the printed dataset consists of 3.3M textlines. The second-stage pre-training data for the scene text recognition are MJSynth (MJ) (Jaderberg et al. 2014) and SynthText (ST) (Gupta, Vedaldi, and Zisserman 2016), totaling about 16M text images.
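TRDG handles this rendering at scale; the core idea of turning a crawled sentence plus a handwriting font into a textline image can be pictured with plain Pillow as below. The font path and the example sentence are placeholders, and this sketch omits the backgrounds, distortions and spacing options that TRDG provides.

```python
from PIL import Image, ImageDraw, ImageFont

def render_textline(text: str, font_path: str, font_size: int = 48) -> Image.Image:
    """Render one synthetic textline image from a text string and a handwriting font."""
    font = ImageFont.truetype(font_path, font_size)          # e.g., one of the 5,427 handwriting fonts
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("L", (right - left + 20, bottom - top + 20), color=255)  # white background
    ImageDraw.Draw(img).text((10 - left, 10 - top), text, font=font, fill=0)
    return img

# Placeholder inputs: a sentence crawled from Wikipedia and a downloaded handwriting font.
line = render_textline("The quick brown fox jumps over the lazy dog", "handwriting_font.ttf")
line.save("synthetic_textline.png")
```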

Benchmarks

The SROIE (Scanned Receipts OCR and Information Extraction) dataset (Task 2) focuses on text recognition in receipt images. There are 626 receipt images and 361 receipt images in the training and test sets of SROIE. Since the text detection task is not included in this work, we use cropped images of the textlines for evaluation, which are obtained by cropping the whole receipt images according to the ground truth bounding boxes.

The IAM Handwriting Database is composed of handwritten English text, and it is the most popular dataset for handwritten text recognition. We use the Aachen partition of the dataset (https://github.com/jpuigcerver/Laia/tree/master/egs/iam): 6,161 lines from 747 forms in the train set, 966 lines from 115 forms in the validation set and 2,915 lines from 336 forms in the test set.

Recognizing scene text images is more challenging than printed text images, as many images in the wild suffer from blur, occlusion, or low-resolution problems. Here we leverage some widely-used benchmarks, including IIIT5K-3000 (Mishra, Alahari, and Jawahar 2012), SVT-647 (Wang, Babenko, and Belongie 2011), IC13-857, IC13-1015 (Karatzas et al. 2013), IC15-1811, IC15-2077 (Karatzas et al. 2015), SVTP-645 (Phan et al. 2013), and CT80-288 (Risnumawan et al. 2014) to evaluate the capacity of the proposed TrOCR.

Model | Recall | Precision | F1
CRNN | 28.71 | 48.58 | 36.09
Tesseract OCR | 57.50 | 51.93 | 54.57
H&H Lab | 96.35 | 96.52 | 96.43
MSOLab | 94.77 | 94.88 | 94.82
CLOVA OCR | 94.3 | 94.88 | 94.59
TrOCR_SMALL | 95.89 | 95.74 | 95.82
TrOCR_BASE | 96.37 | 96.31 | 96.34
TrOCR_LARGE | 96.59 | 96.57 | 96.58
Table 3: Evaluation results (word-level Precision, Recall, F1) on the SROIE dataset, where the baselines come from the SROIE leaderboard (https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=2).

Settings

The TrOCR models are built upon Fairseq (Ott et al. 2019), a popular sequence modeling toolkit. For the model initialization, the DeiT models are implemented and initialized by the code and the pre-trained models from the timm library (Wightman 2019), while the BEiT models and the MiniLM models are from the UniLM official repository (https://github.com/microsoft/unilm). The RoBERTa models come from the corresponding page in the Fairseq GitHub repository. We use 32 V100 GPUs with 32GB of memory for pre-training and 8 V100 GPUs for fine-tuning. For all the models, the batch size is set to 2,048 and the learning rate is 5e-5. We use the BPE and SentencePiece tokenizer from Fairseq to tokenize the textlines into wordpieces.

We employ the $384\times 384$ resolution and $16\times 16$ patch size for the DeiT and BEiT encoders. The DeiT_SMALL has 12 layers with 384 hidden sizes and 6 heads. Both the DeiT_BASE and the BEiT_BASE have 12 layers with 768 hidden sizes and 12 heads, while the BEiT_LARGE has 24 layers with 1024 hidden sizes and 16 heads. We use 6 layers, 256 hidden sizes and 8 attention heads for the small decoders, 512 hidden sizes for the base decoders, and 12 layers, 1,024 hidden sizes and 16 heads for the large decoders. For this task, we only use the last half of all layers from the corresponding RoBERTa model, i.e., the last 6 layers of RoBERTa_BASE and the last 12 layers of RoBERTa_LARGE. The beam size is set to 10 for the TrOCR models.
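For readers who only want to run the released models, a minimal inference sketch is shown below, assuming the public checkpoints are loaded through the Hugging Face transformers integration; the checkpoint name and the image file are assumptions not stated in the paper, and the beam size simply mirrors the value of 10 used above.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Assumed checkpoint name; the official release is linked from https://aka.ms/trocr.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.open("receipt_textline.png").convert("RGB")          # placeholder textline image
pixel_values = processor(images=image, return_tensors="pt").pixel_values   # resized and patchified

generated_ids = model.generate(pixel_values, num_beams=10, max_length=64)  # beam search, beam size 10
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```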

We take the CRNN model (Shi, Bai, and Yao 2016) as the baseline model. The CRNN model is composed of convolutional layers for image feature extraction, recurrent layers for sequence modeling and the final frame label prediction, and a transcription layer to translate the frame predictions into the final label sequence. To address the character alignment issue, they use the CTC loss to train the CRNN model. For a long time, the CRNN model has been the dominant paradigm for text recognition. We use the PyTorch implementation (https://github.com/meijieru/crnn.pytorch) and initialize the parameters with the provided pre-trained model.

Evaluation Metrics

The SROIE dataset is evaluated using the word-level precision, recall and f1 score. If repeated words appear in the ground truth, they are also supposed to appear in the prediction. The precision, recall and f1 score are described as:

$\mathrm{Precision}=\frac{\text{Correct matches}}{\text{The number of the detected words}}$
$\mathrm{Recall}=\frac{\text{Correct matches}}{\text{The number of the ground truth words}}$
$F1=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$

The IAM dataset is evaluated by the case-sensitive Character Error Rate (CER). The scene text datasets are evaluated by the Word Accuracy. For fair comparison, we filter the final output string to suit the popular 36-character charset (lowercase alphanumeric) in this task.
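The word-level metrics and the CER can be computed as follows; this is our own minimal implementation of these standard definitions, not the official evaluation scripts.

```python
from collections import Counter

def word_f1(predicted: list[str], ground_truth: list[str]):
    """Word-level precision/recall/F1; repeated words must be matched as many times as they occur."""
    matches = sum((Counter(predicted) & Counter(ground_truth)).values())
    precision = matches / len(predicted) if predicted else 0.0
    recall = matches / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def cer(predicted: str, ground_truth: str) -> float:
    """Case-sensitive character error rate: edit distance divided by ground-truth length."""
    m, n = len(predicted), len(ground_truth)
    dp = list(range(n + 1))                     # rolling row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (predicted[i - 1] != ground_truth[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)

print(word_f1("total 23.50 eur".split(), "total 23.50 eur eur".split()))
print(cer("hell0 world", "hello world"))
```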

Results

Architecture Comparison

We compare different combinations of the encoder and decoder to find the best settings. For encoders, we compare DeiT, BEiT and the ResNet-50 network. Both the DeiT and BEiT are the base models in their original papers. For decoders, we compare the base decoders initialized by RoBERTa_BASE and the large decoders initialized by RoBERTa_LARGE. For further comparison, we also evaluate the CRNN baseline model and Tesseract OCR in this section, where the latter is an open-source OCR engine using the LSTM network.

Table 1 shows the results of the combined models. From the results, we observe that the BEiT encoders show the best performance among the three types of encoders, while the best decoders are the RoBERTa_LARGE decoders. Apparently, the models pre-trained on vision tasks improve the performance of text recognition models, and the pure Transformer models are better than the CRNN models and Tesseract on this task. According to the results, we mainly use three settings in the subsequent experiments: TrOCR_SMALL (total parameters = 62M) consists of the encoder of DeiT_SMALL and the decoder of MiniLM; TrOCR_BASE (total parameters = 334M) consists of the encoder of BEiT_BASE and the decoder of RoBERTa_LARGE; TrOCR_LARGE (total parameters = 558M) consists of the encoder of BEiT_LARGE and the decoder of RoBERTa_LARGE. In Table 2, we also report ablation experiments to verify the effect of pre-trained model initialization, data augmentation, and the two stages of pre-training. All of them bring large improvements to the TrOCR models.

Model | Architecture | Training Data | External LM | CER
(Bluche and Messina 2017) | GCRNN / CTC | Synthetic + IAM | Yes | 3.2
(Michael et al. 2019) | LSTM/LSTM w/Attn | IAM | No | 4.87
(Wang et al. 2020a) | FCN / GRU | IAM | No | 6.4
(Kang et al. 2020) | Transformer w/ CNN | Synthetic + IAM | No | 4.67
(Diaz et al. 2021) | S-Attn / CTC | Internal + IAM | No | 3.53
(Diaz et al. 2021) | S-Attn / CTC | Internal + IAM | Yes | 2.75
(Diaz et al. 2021) | Transformer w/ CNN | Internal + IAM | No | 2.96
TrOCR_SMALL | Transformer | Synthetic + IAM | No | 4.22
TrOCR_BASE | Transformer | Synthetic + IAM | No | 3.42
TrOCR_LARGE | Transformer | Synthetic + IAM | No | 2.89
Table 4: Evaluation results (CER) on the IAM Handwriting dataset.
Model | Parameters | Total Sentences | Total Tokens | Time | Speed (sentences/s) | Speed (tokens/s)
TrOCR_SMALL | 62M | 2,915 | 31,081 | 348.4s | 8.37 | 89.22
TrOCR_BASE | 334M | 2,915 | 31,959 | 633.7s | 4.60 | 50.43
TrOCR_LARGE | 558M | 2,915 | 31,966 | 666.8s | 4.37 | 47.94
Table 5: Inference time on the IAM Handwriting dataset.
Test datasets and # of samples:
Model | IIIT5k (3,000) | SVT (647) | IC13 (857) | IC13 (1,015) | IC15 (1,811) | IC15 (2,077) | SVTP (645) | CUTE (288)
PlugNet (Mou et al. 2020): 94.4, 92.3, 95.0, 82.2, 84.3, 85.0
SRN (Yu et al. 2020): 94.8, 91.5, 95.5, 82.7, 85.1, 87.8
RobustScanner (Yue et al. 2020): 95.4, 89.3, 94.1, 79.2, 82.9, 92.4
TextScanner (Wan et al. 2020): 95.7, 92.7, 94.9, 83.5, 84.8, 91.6
AutoSTR (Zhang et al. 2020a): 94.7, 90.9, 94.2, 81.8, 81.7
RCEED (Cui et al. 2021): 94.9, 91.8, 82.2, 83.6, 91.7
PREN2D (Yan et al. 2021): 95.6, 94.0, 96.4, 83.0, 87.6, 91.7
VisionLAN (Wang et al. 2021): 95.8, 91.7, 95.7, 83.7, 86.0, 88.5
Bhunia (Bhunia et al. 2021b): 95.2, 92.2, 95.5, 84.0, 85.7, 89.7
CVAE-Feed (Bhunia et al. 2021a): 95.2, 95.7, 84.6, 88.9, 89.7
STN-CSTR (Cai, Sun, and Xiong 2021): 94.2, 92.3, 96.3, 94.1, 86.1, 82.0, 86.2
ViTSTR-B (Atienza 2021): 88.4, 87.7, 93.2, 92.4, 78.5, 72.6, 81.8, 81.3
CRNN (Shi, Bai, and Yao 2016): 84.3, 78.9, 88.8, 61.5, 64.8, 61.3
TRBA (Baek, Matsui, and Aizawa 2021): 92.1, 88.9, 93.1, 74.7, 79.5, 78.2
ABINet (Fang et al. 2021): 96.2, 93.5, 97.4, 86.0, 89.3, 89.2
Diaz (Diaz et al. 2021): 96.8, 94.6, 96.0, 80.4
PARSeq_A (Bautista and Atienza 2022): 97.0, 93.6, 97.0, 96.2, 86.5, 82.9, 88.9, 92.2
MaskOCR (ViT-B) (Lyu et al. 2022): 95.8, 94.7, 98.1, -, 87.3, -, 89.9, 89.2
MaskOCR (ViT-L) (Lyu et al. 2022): 96.5, 94.1, 97.8, -, 88.7, -, 90.2, 92.7
TrOCR_BASE (Syn): 90.1, 91.0, 97.3, 96.3, 81.1, 75.0, 90.7, 86.8
TrOCR_LARGE (Syn): 91.0, 93.2, 98.3, 97.0, 84.0, 78.0, 91.0, 89.6
TrOCR_BASE (Syn+Benchmark): 93.4, 95.2, 98.4, 97.4, 86.9, 81.2, 92.1, 90.6
TrOCR_LARGE (Syn+Benchmark): 94.1, 96.1, 98.4, 97.3, 88.1, 84.1, 93.0, 95.1
Table 6: Word accuracy on the six benchmark datasets (36-char), where "Syn" indicates the model using synthetic data only and "Syn+Benchmark" indicates the model using synthetic data and benchmark datasets.

SROIE Task 2

Table 3 shows the results of the TrOCR models and the current SOTA methods on the leaderboard of the SROIE dataset. To capture the visual information, all of these baselines leverage CNN-based networks as the feature extractors, while the TrOCR models use the image Transformer to embed the information from the image patches. For language modeling, MSO Lab (Sang and Cuong 2019) and CLOVA OCR (Sang and Cuong 2019) use LSTM layers and H&H Lab (Shi, Bai, and Yao 2016) uses GRU layers, while the TrOCR models use the Transformer decoder with a pure attention mechanism. According to the results, the TrOCR models outperform the existing SOTA models with pure Transformer structures. It also confirms that Transformer-based text recognition models achieve performance competitive with CNN-based networks for visual feature extraction and RNN-based networks for language modeling on this task, without any complex pre/post-processing steps.

IAM Handwriting Database

Table 4 shows the results of the TrOCR models and the existing methods on the IAM Handwriting Database. According to the results, the methods with CTC decoders show good performance on this task, and an external LM results in a significant reduction in CER. By comparing the methods of (Bluche and Messina 2017) with the TrOCR models, the TrOCR_LARGE achieves a better result, which indicates that the Transformer decoder is more competitive than the CTC decoder in text recognition and has enough ability for language modeling instead of relying on an external LM. Most of the methods use sequence models in their encoders after the CNN-based backbone, except for the FCN encoders in (Wang et al. 2020a), which leads to a significant improvement on CER. Instead of relying on the features from a CNN-based backbone, the TrOCR models, using the information from the image patches, get similar and even better results, illustrating that the Transformer structures are competent to extract visual features well after pre-training. From the experimental results, the TrOCR models exceed all the methods which only use synthetic/IAM as the sole training set with pure Transformer structures and achieve a new state-of-the-art CER of 2.89. Without leveraging any extra human-labeled data, TrOCR even gets comparable results with the methods in (Diaz et al. 2021) that use an additional internal human-labeled dataset.

Scene Text Datasets

In Table 6, we compare the TrOCR_BASE and TrOCR_LARGE models fine-tuned with synthetic data only, and fine-tuned with synthetic data plus benchmark datasets (the training sets of IC13, IC15, IIIT5K, SVT), to popular and recent SOTA methods. Compared to all of them, the TrOCR models establish five new SOTA results out of eight experiments while getting comparable results on the rest. Our model underperforms on the IIIT5K dataset, and we find that some scene text sample images contain symbols, but the ground truth does not. This is inconsistent with the behavior in our pre-training data (which retains symbols in the ground truth), causing the model to still tend to process symbols. There are two kinds of mistakes: outputting symbols but truncating the output in advance to ensure that the number of wordpieces is consistent with the ground truth, or identifying symbols as similar characters.

Inference Speed

Table 5 shows the inference speed of the different TrOCR model settings on the IAM Handwriting Database. We can conclude that there is no significant margin in inference speed between the base models and the large models. In contrast, the small model shows comparable results for printed and handwriting text recognition even though the number of parameters is an order of magnitude smaller and the inference speed is twice as fast. The low number of parameters and high inference speed mean fewer computational resources and less user waiting time, making it more suitable for deployment in industrial applications.

Related Work

Scene Text Recognition

For text recognition, the most popular approaches are usually CTC-based models. (Shi, Bai, and Yao 2016) proposed the standard CRNN, an end-to-end architecture combining CNN and RNN. The convolutional layers are used to extract the visual features and convert them to a sequence by concatenating the columns, while the recurrent layers predict the per-frame labels. They use a CTC decoding strategy to remove the repeated symbols and all the blanks from the labels to achieve the final prediction. (Su and Lu 2014) used the Histogram of Oriented Gradients (HOG) features extracted from the image patches in the same column of the input image, instead of the features from a CNN network. A BiLSTM is then trained for labeling the sequential data with the CTC technique to find the best match. (Gao et al. 2019) extracted the features with a densely connected network incorporating the residual attention block and captured the contextual information and sequential dependency with a CNN network. They compute the probability distribution on the output of the CNN network instead of using an RNN network to model them. After that, CTC translates the probability distributions into the final label sequence.

The Sequence-to-Sequence models (Zhang et al. 2020b; Wang et al. 2019; Sheng, Chen, and Xu 2019; Bleeker and de Rijke 2019; Lee et al. 2020; Atienza 2021) are gradually attracting more attention, especially after the advent of the Transformer architecture (Vaswani et al. 2017). SaHAN (Zhang et al. 2020b), standing for the scale-aware hierarchical attention network, is proposed to address the character scale-variation issue. The authors use the FPN network and the CRNN models as the encoder as well as a hierarchical attention decoder to retain the multi-scale features. (Wang et al. 2019) extracted a sequence of visual features from the input images by a CNN with an attention module and a BiLSTM. The decoder is composed of the proposed Gated Cascade Attention Module (GCAM) and generates the target characters from the feature sequence extracted by the encoder. For the Transformer models, (Sheng, Chen, and Xu 2019) first applied the Transformer to scene text recognition. Since the input of the Transformer architecture is required to be a sequence, a CNN-based modality-transform block is employed to transform 2D input images into 1D sequences. (Bleeker and de Rijke 2019) added a direction embedding to the input of the decoder for bidirectional text decoding with a single decoder, while (Lee et al. 2020) utilized a two-dimensional dynamic positional embedding to keep the spatial structures of the intermediate feature maps for recognizing texts with arbitrary arrangements and large inter-character spacing. (Yu et al. 2020) proposed semantic reasoning networks to replace RNN-like structures for more accurate text recognition. (Atienza 2021) only used the image Transformer without a text Transformer for text recognition in a non-autoregressive way.

The texts in natural images may appear in irregular shapes caused by perspective distortion. (Shi et al. 2016; Baek et al. 2019; Litman et al. 2020; Shi et al. 2018; Zhan and Lu 2019) addressed this problem by processing the input images with an initial rectification step. For example, thin-plate spline transformation (Shi et al. 2016; Baek et al. 2019; Litman et al. 2020; Shi et al. 2018) is applied to find a smooth spline interpolation between a set of fiducial points and normalize the text region to a predefined rectangle, while (Zhan and Lu 2019) proposed an iterative rectification network to model the middle line of scene texts as well as the orientation and boundary of textlines. (Baek et al. 2019; Diaz et al. 2021) proposed universal architectures for comparing different recognition models.

Handwritten Text Recognition

(Memon et al. 2020) gave a systematic literature review of modern methods for handwriting recognition. Various attention mechanisms and positional encodings are compared in (Michael et al. 2019) to address the alignment between the input and output sequences. The combination of RNN encoders (mostly LSTM) and CTC decoders (Bluche and Messina 2017; Graves and Schmidhuber 2008; Pham et al. 2014) accounted for a large part of the related work for a long time. Besides, (Graves and Schmidhuber 2008; Voigtlaender, Doetsch, and Ney 2016; Puigcerver 2017) have also tried multidimensional LSTM encoders. Similar to scene text recognition, the seq2seq methods and the scheme for attention decoding have been verified in (Michael et al. 2019; Kang et al. 2020; Chowdhury and Vig 2018; Bluche 2016). (Ingle et al. 2019) addressed the problems in building a large-scale system.

Conclusion

In this paper, we present TrOCR, an end-to-end Transformer-based OCR model for text recognition with pre-trained models. Distinct from existing approaches, TrOCR does not rely on conventional CNN models for image understanding. Instead, it leverages an image Transformer model as the visual encoder and a text Transformer model as the textual decoder. Moreover, we use the wordpiece as the basic unit of the recognized output instead of the character-based methods, which saves the computational cost introduced by the additional language modeling. Experimental results show that TrOCR achieves state-of-the-art results on printed, handwritten and scene text recognition with just a simple encoder-decoder model, without any post-processing steps.

References

  • Atienza (2021) 阿蒂恩扎(2021) Atienza, R. 2021. Vision Transformer for Fast and Efficient Scene Text Recognition. arXiv preprint arXiv:2105.08582.
    阿蒂恩萨河2021.视觉Transformer用于快速高效的场景文本识别。arXiv预印本arXiv:2105.08582。
  • Baek et al. (2019) Baek等人(2019) Baek, J.; Kim, G.; Lee, J.; Park, S.; Han, D.; Yun, S.; Oh, S. J.; and Lee, H. 2019. What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4715–4723.
    Baek,J.; Kim,G.;李,J.;公园,S。; Han,D.; Yun,S.;哦,S。J.;和Lee,H. 2019.场景文本识别模型比较有什么问题?数据集和模型分析。IEEE/CVF计算机视觉国际会议论文集,4715-4723。
  • Baek, Matsui, and Aizawa (2021)
    白、松井、相泽(2021)
    Baek, J.; Matsui, Y.; and Aizawa, K. 2021. What if We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3113–3122.
    Baek,J.; Matsui,Y.; Aizawa,K. 2021.如果我们只使用真实的数据集进行场景文本识别呢?使用更少标签的场景文本识别。IEEE/CVF计算机视觉与模式识别会议论文集(CVPR),3113-3122。
  • Bao, Dong, and Wei (2021)
    Bao,Dong,and Wei(2021)
    Bao, H.; Dong, L.; and Wei, F. 2021. BEiT: BERT Pre-Training of Image Transformers. arXiv:2106.08254.
    Bao,H.;董湖; Wei,F. 2021. BEiT:BERT Pre-Training of Image Transformers。arXiv:2106.08254。
  • Bautista and Atienza (2022)
    Bautista和Atienza(2022)
    Bautista, D.; and Atienza, R. 2022. Scene Text Recognition with Permuted Autoregressive Sequence Models. arXiv preprint arXiv:2207.06966.
    Bautista,D.;和Atienza,R. 2022.基于置换自回归序列模型的场景文本识别。arXiv预印本arXiv:2207.06966。
  • Bhunia et al. (2021a) Bhunia等人(2021 a) Bhunia, A. K.; Chowdhury, P. N.; Sain, A.; and Song, Y.-Z. 2021a. Towards the Unseen: Iterative Text Recognition by Distilling From Errors. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 14950–14959.
    Bhunia,A. K.; Chowdhury,P. N.; Sain,A.;和宋,Y.- Z. 2021年a。走向看不见的:从错误中提取的迭代文本识别。IEEE/CVF计算机视觉国际会议(ICCV),14950-14959。
  • Bhunia et al. (2021b) Bhunia等人(2021 b) Bhunia, A. K.; Sain, A.; Kumar, A.; Ghose, S.; Chowdhury, P. N.; and Song, Y.-Z. 2021b. Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 14940–14949.
    Bhunia,A. K.; Sain,A.; Kumar,A.; Ghose,S.; Chowdhury,P. N.;和宋,Y.- Z. 2021年b。联合视觉语义推理:用于文本识别的多级解码器。IEEE/CVF计算机视觉国际会议(ICCV),14940-14949。
  • Bleeker and de Rijke (2019)
    Bleeker and de Rijke(2019)
    Bleeker, M.; and de Rijke, M. 2019. Bidirectional scene text recognition with a single decoder. arXiv preprint arXiv:1912.03656.
    Bleeker,M.;和de Rijke,M. 2019.单解码器双向场景文本识别。arXiv预印本arXiv:1912.03656。
  • Bluche (2016) 电影Bluche(2016) Bluche, T. 2016. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. Advances in Neural Information Processing Systems, 29: 838–846.
    Bluche,T. 2016.用于端到端手写段落识别的联合行分割和转录。神经信息处理系统进展,29:838-846。
  • Bluche and Messina (2017)
    Bluche and Messina(2017)
    Bluche, T.; and Messina, R. 2017. Gated convolutional recurrent neural networks for multilingual handwriting recognition. In 2017 14th IAPR international conference on document analysis and recognition (ICDAR), volume 1, 646–651. IEEE.
    Bluche,T.;和Messina,R. 2017.用于多语言手写识别的门控卷积递归神经网络。2017年第14届IAPR文件分析和识别国际会议(ICDAR),第1卷,646-651。美国电气与电子工程师协会。
  • Cai, Sun, and Xiong (2021)
    Cai,Sun,and Xiong(2021)
    Cai, H.; Sun, J.; and Xiong, Y. 2021. Revisiting Classification Perspective on Scene Text Recognition.
    Cai,H.; Sun,J.;和Xiong,Y. 2021.再论场景文本识别的分类视角。
  • Chowdhury and Vig (2018)
    Chowdhury and Vig(2018)
    Chowdhury, A.; and Vig, L. 2018. An efficient end-to-end neural model for handwritten text recognition. arXiv preprint arXiv:1807.07965.
    Chowdhury,A.;和Vig,L. 2018.一个用于手写文本识别的高效端到端神经模型。arXiv预印本arXiv:1807.07965。
• Cubuk et al. (2020) Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 702–703.
• Cui et al. (2021) Cui, M.; Wang, W.; Zhang, J.; and Wang, L. 2021. Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition. In Lladós, J.; Lopresti, D.; and Uchida, S., eds., Document Analysis and Recognition – ICDAR 2021, 156–170. Cham: Springer International Publishing. ISBN 978-3-030-86337-1.
• Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
• Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
• Diaz et al. (2021) Diaz, D. H.; Qin, S.; Ingle, R.; Fujii, Y.; and Bissacco, A. 2021. Rethinking Text Line Recognition Models. arXiv:2104.07787.
• Dong et al. (2019) Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. arXiv:1905.03197.
• Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
• Fang et al. (2021) Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; and Zhang, Y. 2021. Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 7098–7107.
• Gao et al. (2019) Gao, Y.; Chen, Y.; Wang, J.; Tang, M.; and Lu, H. 2019. Reading scene text with fully convolutional sequence modeling. Neurocomputing, 339: 161–170.
• Graves et al. (2006) Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, 369–376.
• Graves and Schmidhuber (2008) Graves, A.; and Schmidhuber, J. 2008. Offline handwriting recognition with multidimensional recurrent neural networks. Advances in neural information processing systems, 21: 545–552.
• Gupta, Vedaldi, and Zisserman (2016) Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2315–2324.
• Ingle et al. (2019) Ingle, R. R.; Fujii, Y.; Deselaers, T.; Baccash, J.; and Popat, A. C. 2019. A scalable handwritten text recognition system. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 17–24. IEEE.
• Jaderberg et al. (2014) Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. In Workshop on Deep Learning, NIPS.
• Kang et al. (2020) Kang, L.; Riba, P.; Rusiñol, M.; Fornés, A.; and Villegas, M. 2020. Pay attention to what you read: Non-recurrent handwritten text-line recognition. arXiv preprint arXiv:2005.13044.
• Karatzas et al. (2015) Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V. R.; Lu, S.; et al. 2015. ICDAR 2015 competition on robust reading. In ICDAR.
• Karatzas et al. (2013) Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L. G.; Mestre, S. R.; Mas, J.; Mota, D. F.; Almazan, J. A.; and De Las Heras, L. P. 2013. ICDAR 2013 robust reading competition. In ICDAR.
• Krishnan and Jawahar (2016) Krishnan, P.; and Jawahar, C. V. 2016. Generating Synthetic Data for Text Recognition. arXiv:1608.04224.
• Kudo and Richardson (2018) Kudo, T.; and Richardson, J. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
• Lee et al. (2020) Lee, J.; Park, S.; Baek, J.; Oh, S. J.; Kim, S.; and Lee, H. 2020. On recognizing texts of arbitrary shapes with 2D self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 546–547.
• Liao et al. (2019) Liao, M.; Wan, Z.; Yao, C.; Chen, K.; and Bai, X. 2019. Real-time Scene Text Detection with Differentiable Binarization. arXiv:1911.08947.
• Litman et al. (2020) Litman, R.; Anschel, O.; Tsiper, S.; Litman, R.; Mazor, S.; and Manmatha, R. 2020. Scatter: selective context attentional scene text recognizer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11962–11972.
• Liu et al. (2019) Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
• Lyu et al. (2022) Lyu, P.; Zhang, C.; Liu, S.; Qiao, M.; Xu, Y.; Wu, L.; Yao, K.; Han, J.; Ding, E.; and Wang, J. 2022. MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining. arXiv preprint arXiv:2206.00311.
• Memon et al. (2020) Memon, J.; Sami, M.; Khan, R. A.; and Uddin, M. 2020. Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access, 8: 142642–142668.
• Michael et al. (2019) Michael, J.; Labahn, R.; Grüning, T.; and Zöllner, J. 2019. Evaluating sequence-to-sequence models for handwritten text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 1286–1293. IEEE.
• Mishra, Alahari, and Jawahar (2012) Mishra, A.; Alahari, K.; and Jawahar, C. 2012. Top-down and bottom-up cues for scene text recognition. In CVPR.
• Mou et al. (2020) Mou, Y.; Tan, L.; Yang, H.; Chen, J.; Liu, L.; Yan, R.; and Huang, Y. 2020. Plugnet: Degradation aware scene text recognition supervised by a pluggable super-resolution unit. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16, 158–174. Springer.
• Ott et al. (2019) Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; and Auli, M. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
• Pham et al. (2014) Pham, V.; Bluche, T.; Kermorvant, C.; and Louradour, J. 2014. Dropout improves recurrent neural networks for handwriting recognition. In 2014 14th international conference on frontiers in handwriting recognition, 285–290. IEEE.
• Phan et al. (2013) Phan, T. Q.; Shivakumara, P.; Tian, S.; and Tan, C. L. 2013. Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, 569–576.
• Puigcerver (2017) Puigcerver, J. 2017. Are multidimensional recurrent layers really necessary for handwritten text recognition? In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, 67–72. IEEE.
• Ramesh et al. (2021) Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; and Sutskever, I. 2021. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.
• Risnumawan et al. (2014) Risnumawan, A.; Shivakumara, P.; Chan, C. S.; and Tan, C. L. 2014. A robust arbitrary text detection system for natural scene images. Expert Systems with Applications.
• Sang and Cuong (2019) Sang, D. V.; and Cuong, L. T. B. 2019. Improving CRNN with EfficientNet-like feature extractor and multi-head attention for text recognition. In Proceedings of the Tenth International Symposium on Information and Communication Technology, 285–290.
• Sennrich, Haddow, and Birch (2015) Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
• Sheng, Chen, and Xu (2019) Sheng, F.; Chen, Z.; and Xu, B. 2019. NRTR: A no-recurrence sequence-to-sequence model for scene text recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 781–786. IEEE.
• Shi, Bai, and Yao (2016) Shi, B.; Bai, X.; and Yao, C. 2016. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence, 39(11): 2298–2304.
• Shi et al. (2016) Shi, B.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2016. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4168–4176.
• Shi et al. (2018) Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; and Bai, X. 2018. Aster: An attentional scene text recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence, 41(9): 2035–2048.
• Su and Lu (2014) Su, B.; and Lu, S. 2014. Accurate scene text recognition based on recurrent neural network. In Asian Conference on Computer Vision, 35–48. Springer.
• Touvron et al. (2021) Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, 10347–10357. PMLR.
• Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in neural information processing systems, 5998–6008.
• Voigtlaender, Doetsch, and Ney (2016) Voigtlaender, P.; Doetsch, P.; and Ney, H. 2016. Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. In 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), 228–233. IEEE.
  • Wan et al. (2020) Wan, Z.; He, M.; Chen, H.; Bai, X.; and Yao, C. 2020. Textscanner: Reading characters in order for robust scene text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 12120–12127.
  • Wang, Babenko, and Belongie (2011) Wang, K.; Babenko, B.; and Belongie, S. 2011. End-to-end scene text recognition. In 2011 International conference on computer vision, 1457–1464. IEEE.
• Wang et al. (2019) Wang, S.; Wang, Y.; Qin, X.; Zhao, Q.; and Tang, Z. 2019. Scene text recognition via gated cascade attention. In 2019 IEEE International Conference on Multimedia and Expo (ICME), 1018–1023. IEEE.
• Wang et al. (2020a) Wang, T.; Zhu, Y.; Jin, L.; Luo, C.; Chen, X.; Wu, Y.; Wang, Q.; and Cai, M. 2020a. Decoupled Attention Network for Text Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence.
• Wang et al. (2020b) Wang, W.; Wei, F.; Dong, L.; Bao, H.; Yang, N.; and Zhou, M. 2020b. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957.
• Wang et al. (2021) Wang, Y.; Xie, H.; Fang, S.; Wang, J.; Zhu, S.; and Zhang, Y. 2021. From Two to One: A New Scene Text Recognizer With Visual Language Modeling Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 14194–14203.
• Wightman (2019) Wightman, R. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models.
• Yan et al. (2021) Yan, R.; Peng, L.; Xiao, S.; and Yao, G. 2021. Primitive Representation Learning for Scene Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 284–293.
• Yu et al. (2020) Yu, D.; Li, X.; Zhang, C.; Liu, T.; Han, J.; Liu, J.; and Ding, E. 2020. Towards Accurate Scene Text Recognition With Semantic Reasoning Networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12110–12119.
• Yue et al. (2020) Yue, X.; Kuang, Z.; Lin, C.; Sun, H.; and Zhang, W. 2020. Robustscanner: Dynamically enhancing positional clues for robust text recognition. In European Conference on Computer Vision, 135–151. Springer.
• Zhan and Lu (2019) Zhan, F.; and Lu, S. 2019. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2059–2068.
• Zhang et al. (2020a) Zhang, H.; Yao, Q.; Yang, M.; Xu, Y.; and Bai, X. 2020a. AutoSTR: Efficient backbone search for scene text recognition. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, 751–767. Springer.
• Zhang et al. (2020b) Zhang, J.; Luo, C.; Jin, L.; Wang, T.; Li, Z.; and Zhou, W. 2020b. SaHAN: Scale-aware hierarchical attention network for scene text recognition. Pattern Recognition Letters, 136: 205–211.