Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction†

†This research was supported by NSFC (Nos. 61672399, U1401258), and the 973 Program (No. 2015CB352400).

Junbo Zhang¹, Yu Zheng¹,²,³,⁴, Dekang Qi²,¹
¹Microsoft Research, Beijing, China
²School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
³School of Computer Science and Technology, Xidian University, China
⁴Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
{junbo.zhang, yuzheng}@microsoft.com, dekangqi@outlook.com
Corresponding author. This work was done when the third author was an intern at Microsoft Research.

Abstract

Forecasting the flow of crowds is of great importance to traffic management and public safety, and very challenging as it is affected by many complex factors, such as inter-region traffic, events, and weather. We propose a deep-learning-based approach, called ST-ResNet, to collectively forecast the inflow and outflow of crowds in each and every region of a city. We design an end-to-end structure of ST-ResNet based on unique properties of spatio-temporal data. More specifically, we employ the residual neural network framework to model the temporal closeness, period, and trend properties of crowd traffic. For each property, we design a branch of residual convolutional units, each of which models the spatial properties of crowd traffic. ST-ResNet learns to dynamically aggregate the output of the three residual neural networks based on data, assigning different weights to different branches and regions. The aggregation is further combined with external factors, such as weather and day of the week, to predict the final traffic of crowds in each and every region. Experiments on two types of crowd flows in Beijing and New York City (NYC) demonstrate that the proposed ST-ResNet outperforms six well-known methods.

Introduction

Predicting crowd flows in a city is of great importance to traffic management and public safety (?). For instance, massive crowds of people streamed into a strip region at the 2015 New Year’s Eve celebrations in Shanghai, resulting in a catastrophic stampede that killed 36 people. In mid-July of 2016, hundreds of “Pokemon Go” players ran through New York City’s Central Park in hopes of catching a particularly rare digital monster, leading to a dangerous stampede there. If one can predict the crowd flow in a region, such tragedies can be mitigated or prevented by utilizing emergency mechanisms, such as conducting traffic control, sending out warnings, or evacuating people, in advance.

In this paper, we predict two types of crowd flows (?): inflow and outflow, as shown in Figure 1(a). Inflow is the total traffic of crowds entering a region from other places during a given time interval. Outflow is the total traffic of crowds leaving a region for other places during a given time interval. Both flows track the transition of crowds between regions, and knowing them is very beneficial for risk assessment and traffic management. Inflow/outflow can be measured by the number of pedestrians, the number of cars driven on nearby roads, the number of people traveling on public transportation systems (e.g., metro, bus), or all of them together if data is available. Figure 1(b) presents an example. Using mobile phone signals to measure the number of pedestrians, the inflow and outflow of $r_{2}$ are $(3,1)$, respectively. Similarly, using the GPS trajectories of vehicles, the inflow and outflow are $(0,3)$, respectively.

Figure 1: (a) Inflow and outflow.

Simultaneously forecasting the inflow and outflow of crowds in each region of a city, however, is very challenging, affected by the following three complex factors:

1. Spatial dependencies. The inflow of region $r_{2}$ (shown in Figure 1(a)) is affected by the outflows of nearby regions (like $r_{1}$) as well as distant regions. Likewise, the outflow of $r_{2}$ affects the inflows of other regions (e.g., $r_{3}$). The inflow of region $r_{2}$ also affects its own outflow.

2. Temporal dependencies. The flow of crowds in a region is affected by earlier time intervals, both near and far. For instance, traffic congestion occurring at 8am will affect traffic at 9am. In addition, traffic conditions during morning rush hours may be similar on consecutive workdays, repeating every 24 hours. Furthermore, morning rush hours may gradually happen later as winter comes: when the temperature gradually drops and the sun rises later in the day, people get up later and later.

3. External influence. Some external factors, such as weather conditions and events, may change the flow of crowds tremendously in different regions of a city.

To tackle these challenges, we propose a deep spatio-temporal residual network (ST-ResNet) to collectively predict inflow and outflow of crowds in every region. Our contributions are four-fold:

• ST-ResNet employs convolution-based residual networks to model nearby and distant spatial dependencies between any two regions in a city, while ensuring the model's prediction accuracy is not compromised by the deep structure of the neural network.

• We summarize the temporal properties of crowd flows into three categories: temporal closeness, period, and trend. ST-ResNet uses three residual networks to model these properties, respectively.

• ST-ResNet dynamically aggregates the output of the three aforementioned networks, assigning different weights to different branches and regions. The aggregation is further combined with external factors (e.g., weather).

• We evaluate our approach using Beijing taxicabs' trajectories and meteorological data, and NYC bike trajectory data. The results demonstrate the advantages of our approach compared with 6 baselines.

Preliminaries

In this section, we briefly revisit the crowd flows prediction problem (?; ?) and introduce deep residual learning (?).

Formulation of Crowd Flows Problem

Definition 1 (Region (?))

There are many definitions of a location in terms of different granularities and semantic meanings. In this study, we partition a city into an $I\times J$ grid map based on the longitude and latitude where a grid denotes a region, as shown in Figure 2(a).

Definition 2 (Inflow/outflow (?))

Let $\mathbb{P}$ be a collection of trajectories at the $t^{th}$ time interval. For a grid $(i,j)$ that lies at the $i^{th}$ row and the $j^{th}$ column, the inflow and outflow of the crowds at the time interval $t$ are defined respectively as

$$x_{t}^{in,i,j}=\sum_{Tr\in\mathbb{P}}\left|\{k>1\mid g_{k-1}\notin(i,j)\wedge g_{k}\in(i,j)\}\right|$$

$$x_{t}^{out,i,j}=\sum_{Tr\in\mathbb{P}}\left|\{k\geq 1\mid g_{k}\in(i,j)\wedge g_{k+1}\notin(i,j)\}\right|$$

where $Tr:g_{1}\rightarrow g_{2}\rightarrow\cdots\rightarrow g_{|Tr|}$ is a trajectory in $\mathbb{P}$ , and $g_{k}$ is the geospatial coordinate; $g_{k}\in(i,j)$ means the point $g_{k}$ lies within grid $(i,j)$ , and vice versa; $|\cdot|$ denotes the cardinality of a set.

At the $t^{th}$ time interval, inflow and outflow in all $I\times J$ regions can be denoted as a tensor $\mathbf{X}_{t}\in\mathbb{R}^{2\times I\times J}$ where $(\mathbf{X}_{t})_{0,i,j}=x_{t}^{in,i,j}$ , $(\mathbf{X}_{t})_{1,i,j}=x_{t}^{out,i,j}$ . The inflow matrix is shown in Figure 2(b).
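As a rough illustration of Definition 2, the sketch below counts inflow and outflow on a small grid. It assumes each trajectory has already been mapped to a sequence of (row, column) cells for one time interval, and that the last point of a trajectory contributes no outflow because $g_{k+1}$ does not exist. The function name `crowd_flows` is illustrative and does not come from the paper.

```python
def crowd_flows(trajectories, rows, cols):
    """Return X with X[0][i][j] = inflow and X[1][i][j] = outflow.

    trajectories: list of trajectories, each a list of (row, col) cells.
    """
    flows = [[[0] * cols for _ in range(rows)] for _ in range(2)]
    for cells in trajectories:
        # Every consecutive pair that changes cell is one departure from
        # the previous cell and one arrival in the next cell.
        for prev, curr in zip(cells, cells[1:]):
            if prev != curr:
                flows[1][prev[0]][prev[1]] += 1  # outflow of the cell left
                flows[0][curr[0]][curr[1]] += 1  # inflow of the cell entered
    return flows


toy_trajectories = [
    [(0, 0), (0, 1), (0, 1), (1, 1)],
    [(1, 1), (0, 1)],
]
X_t = crowd_flows(toy_trajectories, rows=2, cols=2)
print(X_t[0])  # inflow:  [[0, 2], [0, 1]]
print(X_t[1])  # outflow: [[1, 1], [0, 1]]
```

Staying in the same cell for two consecutive points adds nothing to either count, which matches the set conditions in the two sums.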

Formally, for a dynamical system over a spatial region represented by an $I\times J$ grid map, there are 2 types of flows in each grid over time. Thus, the observation at any time can be represented by a tensor $\mathbf{X}\in\mathbb{R}^{2\times I\times J}$.

Problem 1

Given the historical observations $\{\mathbf{X}_{t}|t=0,\cdots,n-1\}$ , predict $\mathbf{X}_{n}$ .

Deep Residual Learning

Deep residual learning (?) allows convolutional neural networks to have very deep structures of 100 layers, or even more than 1000 layers. This method has shown state-of-the-art results on multiple challenging recognition tasks, including image classification, object detection, segmentation and localization (?).

Formally, a residual unit with an identity mapping (?) is defined as:

$$\mathbf{X}^{(l+1)}=\mathbf{X}^{(l)}+\mathcal{F}(\mathbf{X}^{(l)})\tag{1}$$

where $\mathbf{X}^{(l)}$ and $\mathbf{X}^{(l+1)}$ are the input and output of the $l^{th}$ residual unit, respectively; $\mathcal{F}$ is a residual function, e.g., a stack of two $3\times 3$ convolution layers in (?). The central idea of the residual learning is to learn the additive residual function $\mathcal{F}$ with respect to $\mathbf{X}^{(l)}$ (?).
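To make Eq. 1 concrete, here is a minimal sketch of a residual unit acting on a flat list of numbers. The residual functions used (`zero_residual`, `half_residual`) are toy stand-ins chosen for illustration, not the convolution stacks used in the paper.

```python
def residual_unit(x, residual_fn):
    """Compute x + F(x) element-wise, as in Eq. 1."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x):
    """Toy residual function that outputs zeros."""
    return [0.0] * len(x)


def half_residual(x):
    """Toy residual function that outputs half of its input."""
    return [0.5 * v for v in x]


x0 = [1.0, -2.0, 0.5]
print(residual_unit(x0, zero_residual))  # [1.0, -2.0, 0.5]
print(residual_unit(x0, half_residual))  # [1.5, -3.0, 0.75]
```

When the residual function outputs zeros, the unit reduces to the identity mapping, so the input passes through unchanged.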

Deep Spatio-Temporal Residual Networks

Figure 3 presents the architecture of ST-ResNet, which comprises four major components modeling temporal closeness, period, trend, and external influence, respectively. As illustrated in the top-right part of Figure 3, we first turn the inflow and outflow throughout a city at each time interval into a 2-channel image-like matrix, using the approach introduced in Definitions 1 and 2. We then divide the time axis into three fragments, denoting recent time, near history and distant history. The 2-channel flow matrices of the intervals in each time fragment are fed into the first three components separately to model the three temporal properties: closeness, period and trend, respectively. The first three components share the same network structure: a convolutional neural network followed by a sequence of residual units. Such a structure captures the spatial dependency between nearby and distant regions. In the external component, we manually extract some features from external datasets, such as weather conditions and events, and feed them into a two-layer fully-connected neural network. The outputs of the first three components are fused as $\mathbf{X}_{Res}$ based on parameter matrices, which assign different weights to the results of different components in different regions. $\mathbf{X}_{Res}$ is further integrated with the output of the external component $\mathbf{X}_{Ext}$. Finally, the aggregation is mapped into $[-1,1]$ by a Tanh function, which yields faster convergence than the standard logistic function in back-propagation learning (?).

Structures of the First Three Components

The first three components (i.e. closeness, period, trend) share the same network structure, which is composed of two sub-components: convolution and residual unit, as shown in Figure 4.

Convolution. A city is usually very large, containing many regions at different distances from one another. Intuitively, the flows of crowds in nearby regions may affect each other, which can be effectively handled by the convolutional neural network (CNN), which has shown a powerful ability to hierarchically capture spatial structural information (?). In addition, subway systems and highways connect locations that are far apart, leading to dependencies between distant regions. In order to capture the spatial dependency of any region, we need to design a CNN with many layers, because one convolution only accounts for spatially near dependencies, limited by the size of its kernel. The same problem has also been found in the video sequence generation task, where the input and output have the same resolution (?). Several methods have been introduced to avoid the loss of resolution brought about by subsampling while preserving distant dependencies (?). Unlike the classical CNN, we do not use subsampling, but only convolutions (?). As shown in Figure 4(a), there are three levels of feature maps that are connected by a few convolutions. A node in the high-level feature map depends on nine nodes of the middle-level feature map, which in turn depend on all nodes in the lower-level feature map (i.e. input). This means that one convolution naturally captures spatially near dependencies, and a stack of convolutions can further capture distant, even citywide, dependencies.

The closeness component of Figure 3 adopts a few 2-channel flow matrices of intervals in the recent time to model temporal closeness dependence. Let the recent fragment be $[\mathbf{X}_{t-{l_{c}}},\mathbf{X}_{t-{(l_{c}-1)}},\cdots,\mathbf{X}_{t-1}]$, which is also known as the closeness dependent sequence. We first concatenate them along the first axis (i.e. time interval) as one tensor $\mathbf{X}_{c}^{(0)}\in\mathbb{R}^{2l_{c}\times I\times J}$, which is followed by a convolution (i.e. Conv1 shown in Figure 3) as:

$$\mathbf{X}_{c}^{(1)}=f\left(W^{(1)}_{c}*\mathbf{X}_{c}^{(0)}+b^{(1)}_{c}\right)$$

where $*$ denotes the convolution¹; $f$ is an activation function, e.g. the rectifier $f(z):=\max(0,z)$ (?); $W^{(1)}_{c},b_{c}^{(1)}$ are the learnable parameters in the first layer.

¹To make the input and output have the same size (i.e. $I\times J$) in a convolutional operator, we employ a border-mode which allows a filter to go outside the border of an input, padding each area outside the border with a zero.

Residual Unit. It is well known that very deep convolutional networks are hard to train effectively, even when well-known activation functions (e.g. ReLU) and regularization techniques are applied (?; ?; ?). On the other hand, we still need a very deep network to capture very large citywide dependencies. For typical crowd flows data, assume that the input size is $32\times 32$ and the kernel size of the convolution is fixed to $3\times 3$. If we want to model citywide dependencies (i.e., each node in a high-level layer depends on all nodes of the input), the network needs more than 15 consecutive convolutional layers. To address this issue, we employ residual learning (?) in our model, which has been demonstrated to be very effective for training very deep neural networks of more than 1000 layers.
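The layer count quoted above can be checked with a small receptive-field calculation. The sketch below assumes stride-1 convolutions with a square kernel and no subsampling, as described in the text. The helper names are illustrative.

```python
def receptive_field(num_layers, kernel_size=3):
    """Side length of the input window seen by one output node after
    num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def layers_to_span(grid_size, kernel_size=3):
    """Smallest number of stacked convolutions whose receptive field is
    at least grid_size wide."""
    layers = 0
    while receptive_field(layers, kernel_size) < grid_size:
        layers += 1
    return layers


print(receptive_field(1))   # 3
print(receptive_field(15))  # 31, still narrower than a 32-wide grid
print(layers_to_span(32))   # 16
```

This counts only the width of the window. A node close to one border would need additional layers before its window reaches the opposite border, so 16 should be read as a lower bound under these assumptions.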

In our ST-ResNet (see Figure 3), we stack $L$ residual units upon Conv1 as follows,

$$\mathbf{X}_{c}^{(l+1)}=\mathbf{X}_{c}^{(l)}+\mathcal{F}(\mathbf{X}_{c}^{(l)};\theta_{c}^{(l)}),\quad l=1,\cdots,L\tag{2}$$

where $\mathcal{F}$ is the residual function (i.e. two combinations of "ReLU + Convolution", see Figure 4(b)), and $\theta_{c}^{(l)}$ includes all learnable parameters in the $l^{th}$ residual unit. We also experiment with Batch Normalization (BN) (?), added before each ReLU. On top of the $L^{th}$ residual unit, we append a convolutional layer (i.e. Conv2 shown in Figure 3). With 2 convolutions and $L$ residual units, the output of the closeness component of Figure 3 is $\mathbf{X}_{c}^{(L+2)}$.

Likewise, using the above operations, we can construct the period and trend components of Figure 3. Assume that there are $l_{p}$ time intervals from the period fragment and the period is $p$. The period dependent sequence is then $[\mathbf{X}_{t-{l_{p}}\cdot p},\mathbf{X}_{t-({l_{p}}-1)\cdot p},\cdots,\mathbf{X}_{t-p}]$. With the convolutional operation and $L$ residual units, as in the Conv1 equation above and Eq. 2, the output of the period component is $\mathbf{X}_{p}^{(L+2)}$. Meanwhile, the output of the trend component is $\mathbf{X}_{q}^{(L+2)}$ with the input $[\mathbf{X}_{t-{l_{q}}\cdot q},\mathbf{X}_{t-({l_{q}}-1)\cdot q},\cdots,\mathbf{X}_{t-q}]$, where $l_{q}$ is the length of the trend dependent sequence and $q$ is the trend span. Note that $p$ and $q$ are actually two different types of periods. In the detailed implementation, $p$ is equal to one day, which describes daily periodicity, and $q$ is equal to one week, which reveals the weekly trend.

The Structure of the External Component

Traffic flows can be affected by many complex external factors, such as weather and events. Figure 5(a) shows that crowd flows during holidays (Chinese Spring Festival) can be significantly different from the flows during normal days. Figure 5(b) shows that heavy rain sharply reduces the crowd flows at Office Area compared to the same day of the following week. Let $E_{t}$ be the feature vector that represents these external factors at the predicted time interval $t$. In our implementation, we mainly consider weather, holiday events, and metadata (i.e. DayOfWeek, Weekday/Weekend). The details are introduced in Table 1. To predict flows at time interval $t$, the holiday event and metadata can be directly obtained. However, the weather at the future time interval $t$ is unknown. Instead, one can use the forecast weather at time interval $t$ or approximate it with the weather at time interval $t-1$. Formally, we stack two fully-connected layers upon $E_{t}$: the first layer can be viewed as an embedding layer for each sub-factor, followed by an activation; the second layer maps the low-dimensional embedding to a high-dimensional output that has the same shape as $\mathbf{X}_{t}$. The output of the external component of Figure 3 is denoted as $\mathbf{X}_{Ext}$ with the parameters $\theta_{Ext}$.

Fusion

In this section, we discuss how to fuse the four components of Figure 3. We first fuse the first three components with a parametric-matrix-based fusion method, and the result is then combined with the external component.

Figures 6(a) and (d) show the ratio curves using the Beijing trajectory data presented in Table 1, where the $x$-axis is the time gap between two time intervals and the $y$-axis is the average ratio between any two inflows that have the same time gap. The curves from the two regions both show an empirical temporal correlation in the time series: inflows of recent time intervals are more relevant than those of distant time intervals, which implies temporal closeness. The two curves have different shapes, which demonstrates that different regions may have different characteristics of closeness. Figures 6(b) and (e) depict inflows at all time intervals of 7 days. We can see obvious daily periodicity in both regions. In Office Area, the peak values on weekdays are much higher than those on weekends. Residential Area has similar peak values for both weekdays and weekends. Figures 6(c) and (f) describe inflows at a certain time interval (9:00pm-9:30pm) on Tuesdays from March 2015 to June 2015. As time goes by, the inflow progressively decreases in Office Area and increases in Residential Area, showing different trends in different regions. In summary, the inflows of the two regions are both affected by closeness, period, and trend, but the degrees of influence may be very different. We find the same properties in other regions, as well as in their outflows.

In short, different regions are all affected by closeness, period and trend, but to different degrees. Inspired by these observations, we propose a parametric-matrix-based fusion method.

Parametric-matrix-based fusion. We fuse the first three components (i.e. closeness, period, trend) of Figure 3 as follows

$$\mathbf{X}_{Res}=\mathbf{W}_{c}\circ\mathbf{X}_{c}^{(L+2)}+\mathbf{W}_{p}\circ\mathbf{X}_{p}^{(L+2)}+\mathbf{W}_{q}\circ\mathbf{X}_{q}^{(L+2)}\tag{3}$$

where $\circ$ is the Hadamard product (i.e., element-wise multiplication), and $\mathbf{W}_{c}$, $\mathbf{W}_{p}$ and $\mathbf{W}_{q}$ are the learnable parameters that adjust the degrees to which the result is affected by closeness, period and trend, respectively.

Fusing the external component. We here directly merge the output of the first three components with that of the external component, as shown in Figure 3. Finally, the predicted value at the $t^{th}$ time interval, denoted by $\widehat{\mathbf{X}}_{t}$ , is defined as

$$\widehat{\mathbf{X}}_{t}=\tanh(\mathbf{X}_{Res}+\mathbf{X}_{Ext})\tag{4}$$

where $\tanh$ is a hyperbolic tangent that ensures the output values are between -1 and 1.

Our ST-ResNet can be trained to predict $\mathbf{X}_{t}$ from three sequences of flow matrices and external factor features by minimizing mean squared error between the predicted flow matrix and the true flow matrix:

$$\mathcal{L}(\theta)=\|\mathbf{X}_{t}-\widehat{\mathbf{X}}_{t}\|^{2}_{2}\tag{5}$$

where $\theta$ denotes all learnable parameters in the ST-ResNet.
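Eqs. 3 to 5 can be sketched in a few lines of plain Python. The version below is a simplification for illustration: each $2\times I\times J$ tensor is assumed to be flattened into a list, the weights are given rather than learned, and the names `fuse` and `squared_error` are hypothetical.

```python
import math


def fuse(branch_outputs, branch_weights, x_ext):
    """Weighted element-wise sum of the branch outputs (Eq. 3), plus the
    external output, squashed by tanh (Eq. 4)."""
    size = len(x_ext)
    x_res = [
        sum(w[i] * x[i] for w, x in zip(branch_weights, branch_outputs))
        for i in range(size)
    ]
    return [math.tanh(r + e) for r, e in zip(x_res, x_ext)]


def squared_error(target, prediction):
    """Squared L2 norm of the difference (Eq. 5)."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction))


# Two grid cells; order of branches: closeness, period, trend.
outputs = [[0.5, 0.25], [0.25, 0.0], [0.0, -0.5]]
weights = [[1.0, 0.5], [0.5, 1.0], [0.0, 0.25]]
x_ext = [0.0, 0.0]

prediction = fuse(outputs, weights, x_ext)
print(prediction)  # first cell is tanh(0.625), second cell is 0.0
print(squared_error([1.0, 0.0], [0.0, 0.0]))  # 1.0
```

Because every cell has its own weight in each branch, the first cell here is dominated by closeness while the trend branch only matters for the second cell, which is the per-region behaviour the fusion is meant to allow.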

Algorithm and Optimization

Algorithm 1 outlines the ST-ResNet training process. We first construct the training instances from the original sequence data (lines 1-6). Then, ST-ResNet is trained via backpropagation and Adam (?) (lines 7-11).

Input: Historical observations: $\{\mathbf{X}_{0},\cdots,\mathbf{X}_{n-1}\}$;
    external features: $\{E_{0},\cdots,E_{n-1}\}$;
    lengths of closeness, period, trend sequences: $l_{c}$, $l_{p}$, $l_{q}$;
    period: $p$; trend span: $q$.
Output: Learned ST-ResNet model

    // construct training instances
1   $\mathcal{D}\longleftarrow\emptyset$
2   for all available time interval $t\ (1\leq t\leq n-1)$ do
3       $\mathcal{S}_{c}=[\mathbf{X}_{t-{l_{c}}},\mathbf{X}_{t-({l_{c}}-1)},\cdots,\mathbf{X}_{t-1}]$
4       $\mathcal{S}_{p}=[\mathbf{X}_{t-{l_{p}}\cdot p},\mathbf{X}_{t-({l_{p}}-1)\cdot p},\cdots,\mathbf{X}_{t-p}]$
5       $\mathcal{S}_{q}=[\mathbf{X}_{t-{l_{q}}\cdot q},\mathbf{X}_{t-({l_{q}}-1)\cdot q},\cdots,\mathbf{X}_{t-q}]$
        // $\mathbf{X}_{t}$ is the target at time $t$
6       put a training instance $(\{\mathcal{S}_{c},\mathcal{S}_{p},\mathcal{S}_{q},E_{t}\},\mathbf{X}_{t})$ into $\mathcal{D}$
    // train the model
7   initialize all learnable parameters $\theta$ in ST-ResNet
8   repeat
9       randomly select a batch of instances $\mathcal{D}_{b}$ from $\mathcal{D}$
10      find $\theta$ by minimizing the objective (5) with $\mathcal{D}_{b}$
11  until the stopping criterion is met

Algorithm 1: ST-ResNet Training Algorithm
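The instance-construction loop of Algorithm 1 can be sketched as index arithmetic. The version below is an illustration under two assumptions that the paper does not spell out: time intervals are contiguous (no gaps in the data), and a target $t$ is treated as available only when all three sequences have full history. The function name `build_instances` is hypothetical.

```python
def build_instances(n, l_c, l_p, l_q, p, q):
    """For each usable target index t, list the time indices that feed
    the closeness, period and trend branches."""
    first_t = max(l_c, l_p * p, l_q * q)  # earliest t with full history
    instances = []
    for t in range(first_t, n):
        instances.append({
            "closeness": [t - k for k in range(l_c, 0, -1)],
            "period": [t - k * p for k in range(l_p, 0, -1)],
            "trend": [t - k * q for k in range(l_q, 0, -1)],
            "target": t,
        })
    return instances


# With 30-minute intervals, one day is 48 intervals and one week is 336.
instances = build_instances(n=700, l_c=3, l_p=1, l_q=1, p=48, q=336)
print(len(instances))  # 364
print(instances[0])
# {'closeness': [333, 334, 335], 'period': [288], 'trend': [0], 'target': 336}
```

Each instance would then be paired with the external feature vector $E_{t}$ and the tensors at the listed indices before being placed in $\mathcal{D}$.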

Experiments

Settings

Datasets. We use two different sets of data as shown in Table 1. Each dataset contains two sub-datasets: trajectories and weather, as detailed below.

• TaxiBJ: The data comprises taxicab GPS trajectories and meteorology data in Beijing from four time spans: 1st Jul. 2013 - 30th Oct. 2013, 1st Mar. 2014 - 30th Jun. 2014, 1st Mar. 2015 - 30th Jun. 2015, 1st Nov. 2015 - 10th Apr. 2016. Using Definition 2, we obtain two types of crowd flows. We choose data from the last four weeks as the testing data, and all data before that as training data.

• BikeNYC: Trajectory data is taken from the NYC Bike system in 2014, from Apr. 1st to Sept. 30th. Trip data includes trip duration, starting and ending station IDs, and start and end times. The last 10 days are chosen as testing data, and the rest as training data.

Table 1: Datasets (holidays include adjacent weekends).

Dataset	TaxiBJ	BikeNYC
Data type	Taxi GPS	Bike rent
Location	Beijing	New York
Time span	7/1/2013 - 10/30/2013; 3/1/2014 - 6/30/2014; 3/1/2015 - 6/30/2015; 11/1/2015 - 4/10/2016	4/1/2014 - 9/30/2014
Time interval	30 minutes	1 hour
Grid map size	(32, 32)	(16, 8)
Trajectory data
Average sampling rate (s)	~60	N/A
# taxis/bikes	34,000+	6,800+
# available time intervals	22,459	4,392
External factors (holidays and meteorology)
# holidays	41	20
Weather conditions	16 types (e.g., Sunny, Rainy)	N/A
Temperature / °C	[-24.6, 41.0]	N/A
Wind speed / mph	[0, 48.6]	N/A

Baselines. We compare our ST-ResNet with the following 6 baselines:

• HA: We predict the inflow and outflow of crowds by the average value of historical inflow and outflow in the corresponding periods. For example, for 9:00am-9:30am on a Tuesday, the corresponding periods are all historical time intervals from 9:00am to 9:30am on all historical Tuesdays.

• ARIMA: Auto-Regressive Integrated Moving Average (ARIMA) is a well-known model for understanding and predicting future values in a time series.

• SARIMA: Seasonal ARIMA.

• VAR: Vector Auto-Regressive (VAR) is a more advanced spatio-temporal model, which can capture the pairwise relationships among all flows, but has heavy computational costs due to the large number of parameters.

• ST-ANN: It first extracts spatial (nearby 8 regions' values) and temporal (8 previous time intervals) features, which are then fed into an artificial neural network.

• DeepST (?): a deep neural network (DNN)-based prediction model for spatio-temporal data, which shows state-of-the-art results on crowd flows prediction. It has 4 variants, DeepST-C, DeepST-CP, DeepST-CPT, and DeepST-CPTM, which focus on different temporal dependencies and external factors.
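The HA baseline in the list above is simple enough to sketch directly. The code below handles a single flow series for one region and assumes the history is contiguous and starts at the first slot of a week. The function names and the toy slot count are illustrative only.

```python
from collections import defaultdict


def historical_average(history, slots_per_week):
    """Average the observed flow separately for each weekly time slot.

    history[t] is the flow observed at time index t.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t, value in enumerate(history):
        slot = t % slots_per_week
        totals[slot] += value
        counts[slot] += 1
    return {slot: totals[slot] / counts[slot] for slot in totals}


def predict_ha(averages, t, slots_per_week):
    """Predict the flow at time index t from its weekly slot average."""
    return averages[t % slots_per_week]


# Toy example with 3 slots per "week" and two weeks of history.
history = [10, 20, 30, 14, 20, 34]
averages = historical_average(history, slots_per_week=3)
print(averages)                    # {0: 12.0, 1: 20.0, 2: 32.0}
print(predict_ha(averages, 6, 3))  # 12.0
```

For 30-minute intervals, `slots_per_week` would be 336, so that each Tuesday 9:00am-9:30am interval maps to the same slot.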

Preprocessing. In the output of the ST-ResNet, we use $\tanh$ as our final activation (see Eq. 4), whose range is between -1 and 1. Accordingly, we use the Min-Max normalization method to scale the data into the range $[-1,1]$. In the evaluation, we re-scale the predicted values back to the original scale before comparing them with the ground truth. For external factors, we use one-hot coding to transform metadata (i.e., DayOfWeek, Weekend/Weekday), holidays and weather conditions into binary vectors, and use Min-Max normalization to scale the temperature and wind speed into the range $[0,1]$.
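The scaling and encoding steps can be sketched as follows. This is a minimal illustration, assuming the minimum and maximum are taken from the training data and that the maximum is strictly larger than the minimum. The helper names are hypothetical.

```python
def scale(x, lo, hi):
    """Map a value from [lo, hi] linearly onto [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def unscale(y, lo, hi):
    """Inverse of scale: map a value from [-1, 1] back to [lo, hi]."""
    return (y + 1.0) / 2.0 * (hi - lo) + lo


def one_hot(index, size):
    """Binary vector of length size with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector


train_flows = [0.0, 50.0, 200.0]
lo, hi = min(train_flows), max(train_flows)

print([scale(v, lo, hi) for v in train_flows])  # [-1.0, -0.5, 1.0]
print(unscale(-0.5, lo, hi))                    # 50.0
print(one_hot(2, 7))                            # [0, 0, 1, 0, 0, 0, 0]
```

Applying `unscale` to the network output before computing the error keeps the reported metric in the original flow units.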

Hyperparameters. The python libraries, including Theano (?) and Keras (?), are used to build our models. The convolutions of Conv1 and all residual units use 64 filters of size $3\times 3$ , and Conv2 uses a convolution with 2 filters of size $3\times 3$ . The batch size is 32. We select 90% of the training data for training each model, and the remaining 10% is chosen as the validation set, which is used to early-stop our training algorithm for each model based on the best validation score. Afterwards, we continue to train the model on the full training data for a fixed number of epochs (e.g., 10, 100 epochs). There are 5 extra hyperparamers in our ST-ResNet, of which $p$ and $q$ are empirically fixed to one-day and one-week, respectively. For lengths of the three dependent sequences, we set them as: $l_{c}\in\{3,4,5\},l_{p}\in\{1,2,3,4\},l_{q}\in\{1,2,3,4\}$ .

Evaluation Metric: We measure our method by Root Mean Square Error (RMSE) as

$$\mathrm{RMSE}=\sqrt{\frac{1}{z}\sum_{i}(x_{i}-\hat{x}_{i})^{2}}$$

where $\hat{x}$ and $x$ are the predicted value and the ground truth, respectively, and $z$ is the number of all predicted values.
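For concreteness, the metric can be computed as in the short NumPy sketch below. The function name `rmse` is illustrative, and the sketch assumes predictions have already been re-scaled to the original units.

```python
import numpy as np


def rmse(truth, prediction):
    """Root mean square error over all predicted values."""
    truth = np.asarray(truth, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    # Mean over every element, so z is the total number of predicted values.
    return float(np.sqrt(np.mean((truth - prediction) ** 2)))
```

Because the mean runs over every element, the arrays may have any matching shape, for example (time steps, flow channels, grid rows, grid columns).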

Results on TaxiBJ

We first compare ST-ResNet with 6 other models on TaxiBJ, as shown in Table 2. We report 7 variants of ST-ResNet with different numbers of layers and different factors. Taking L12-E as an example, it considers all available external factors and has 12 residual units, each of which comprises two convolutional layers. All 7 variants outperform the 6 baselines. Compared with the previous state-of-the-art model, DeepST (RMSE $18.18$), L12-E-BN reduces the RMSE to $16.69$, a relative reduction of about 8.2%.

Table 2: Comparison among different methods on TaxiBJ

Model	Description	RMSE
HA		57.69
ARIMA		22.78
SARIMA		26.88
VAR		22.88
ST-ANN		19.57
DeepST		18.18
ST-ResNet [ours]
L2-E	2 residual units + E	17.67
L4-E	4 residual units + E	17.51
L12-E	12 residual units + E	16.89
L12-E-BN	L12-E with BN	16.69
L12-single-E	12 residual units (1 conv) + E	17.40
L12	12 residual units	17.00
L12-E-noFusion	12 residual units + E without fusion	17.96

Effects of Different Components. We take L12-E as the reference model.

• Number of residual units: The results of L2-E, L4-E and L12-E show that the RMSE decreases (17.67, 17.51, 16.89) as the number of residual units increases. With residual learning, the deeper networks in this comparison produce more accurate results.
• Internal structure of residual unit: We try three different types of residual units. L12-E adopts the standard residual unit (see Figure 4(b)). Compared with L12-E, the residual unit of L12-single-E contains only 1 ReLU followed by 1 convolution, and the residual unit of L12-E-BN adds two batch normalization layers, each inserted before a ReLU. We observe that L12-single-E is worse than L12-E, and L12-E-BN is the best, demonstrating the effectiveness of batch normalization.
• External factors: L12-E considers the external factors, including meteorology data, holiday events and metadata. Without them, the model reduces to L12. L12-E is better than L12 (16.89 vs. 17.00), suggesting that the external factors are beneficial.
• Parametric-matrix-based fusion: Unlike L12-E, L12-E-noFusion does not use parametric-matrix-based fusion (see Eq. 3). Instead, L12-E-noFusion fuses the three branches with a straightforward sum, i.e., $\mathbf{X}_{c}^{(L+2)}+\mathbf{X}_{p}^{(L+2)}+\mathbf{X}_{q}^{(L+2)}$. The RMSE rises from 16.89 to 17.96, which demonstrates the effectiveness of our proposed parametric-matrix-based fusion.
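The difference between the two fusion schemes compared above can be sketched in a few lines of NumPy. Eq. 3 is not reproduced in this section, so the sketch assumes it takes the element-wise (Hadamard) form $\mathbf{W}_{c}\circ\mathbf{X}_{c}+\mathbf{W}_{p}\circ\mathbf{X}_{p}+\mathbf{W}_{q}\circ\mathbf{X}_{q}$ with one learnable weight per branch, channel and grid cell. The function names and array shapes are illustrative only; in a real model the weights would be learned during training.

```python
import numpy as np


def parametric_fusion(x_c, x_p, x_q, w_c, w_p, w_q):
    """Weighted fusion: each branch is scaled element-wise before summing.

    Assumed form of Eq. 3; all arrays share one shape, e.g. (2, I, J)
    for inflow/outflow channels on an I x J grid.
    """
    return w_c * x_c + w_p * x_p + w_q * x_q


def plain_sum_fusion(x_c, x_p, x_q):
    """The noFusion variant: branches are simply added."""
    return x_c + x_p + x_q
```

The plain sum is the special case in which every weight equals 1, so the parametric version can only gain flexibility, letting each region weight closeness, period and trend differently.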

Results on BikeNYC

Table 3 shows the results of our model and other baselines on BikeNYC. Unlike TaxiBJ, BikeNYC consists of two different types of crowd flows, new-flow and end-flow (?). Here, we adopt an ST-ResNet with 4 residual units, and use the metadata as external features, as in DeepST (?). ST-ResNet achieves a relative RMSE reduction of $14.8\%$ up to $37.1\%$ over these baselines, demonstrating that our proposed model generalizes well to other flow prediction tasks.

Table 3: Comparisons with baselines on BikeNYC. The results of ARIMA, SARIMA, VAR and 4 DeepST variants are taken from (?).

Model	RMSE
ARIMA	10.07
SARIMA	10.56
VAR	9.92
DeepST-C	8.39
DeepST-CP	7.64
DeepST-CPT	7.56
DeepST-CPTM	7.43
ST-ResNet [ours, 4 residual units]	6.33
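The relative reductions quoted for BikeNYC can be checked directly from the RMSE values in Table 3. The sketch below assumes the relative reduction is defined as (baseline − ours) / baseline; the variable and function names are illustrative.

```python
# RMSE values copied from Table 3.
BASELINE_RMSE = {
    "ARIMA": 10.07,
    "SARIMA": 10.56,
    "VAR": 9.92,
    "DeepST-C": 8.39,
    "DeepST-CP": 7.64,
    "DeepST-CPT": 7.56,
    "DeepST-CPTM": 7.43,
}
ST_RESNET_RMSE = 6.33


def relative_reduction(baseline, ours):
    """Relative RMSE reduction of `ours` over `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    for name, value in BASELINE_RMSE.items():
        print(f"{name}: {relative_reduction(value, ST_RESNET_RMSE):.1f}%")
```

Under this definition, the 14.8% figure corresponds to DeepST-CPTM and the 37.1% figure to ARIMA; applying the same formula to the SARIMA row gives a somewhat larger value.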

Related Work

Crowd Flow Prediction. Several previously published works predict an individual’s movement based on their location history (?; ?). They mainly forecast millions, even billions, of individuals’ mobility traces rather than the aggregated crowd flows in a region. Such a task may require huge computational resources, and it is not always necessary for public safety applications. Other researchers aim to predict travel speed and traffic volume on roads (?; ?; ?). Most of them predict single or multiple road segments rather than citywide traffic. Recently, researchers have started to focus on city-scale traffic flow prediction (?; ?). Both works differ from ours: their methods focus on individual regions rather than the whole city, and they do not partition the city with a grid-based method, so they first need a more complex method to find irregular regions.

Deep Learning. CNNs have been successfully applied to various problems, especially in the field of computer vision (?). Residual learning (?) allows such networks to have a very deep structure. Recurrent neural networks (RNNs) have been used successfully for sequence learning tasks (?). The incorporation of long short-term memory (LSTM) enables RNNs to learn long-term temporal dependencies. However, each of these two kinds of networks captures only spatial or only temporal dependencies. Recently, researchers combined the above networks and proposed a convolutional LSTM network (?) that learns spatial and temporal dependencies simultaneously. Such a network cannot model very long-range temporal dependencies (e.g., period and trend), and training becomes more difficult as depth increases.

In our previous work (?), a general prediction model based on DNNs was proposed for spatio-temporal data. In this paper, to model a specific spatio-temporal prediction task (i.e., citywide crowd flows) effectively, we mainly propose employing residual learning and a parametric-matrix-based fusion mechanism. A survey on data fusion methodologies can be found in (?).

Conclusion and Future Work

We propose a novel deep-learning-based model for forecasting the flow of crowds in each and every region of a city, based on historical trajectory data, weather and events. We evaluate our model on two types of crowd flows in Beijing and NYC, where it performs significantly better than 6 baseline methods, indicating that our model is better suited to crowd flow prediction. The code and datasets have been released at: https://www.microsoft.com/en-us/research/publication/deep-spatio-temporal-residual-networks-for-citywide-crowd-flows-prediction.

In the future, we will consider other types of flows (e.g., taxi/truck/bus trajectory data, phone signals data, metro card swiping data), and use all of them to generate more types of flow predictions, and collectively predict all of these flows with an appropriate fusion mechanism.

References

[Abadi, Rajabioun, and Ioannou 2015] Abadi, A.; Rajabioun, T.; and Ioannou, P. A. 2015. Traffic flow prediction for road transportation networks with limited traffic data. IEEE Transactions on Intelligent Transportation Systems 16(2):653–662.
[Chollet 2015] Chollet, F. 2015. Keras. https://github.com/fchollet/keras.
[Fan et al. 2015] Fan, Z.; Song, X.; Shibasaki, R.; and Adachi, R. 2015. Citymomentum: an online approach for crowd behavior prediction at a citywide level. In ACM UbiComp, 559–569. ACM.
[He et al. 2015] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep residual learning for image recognition. In IEEE CVPR.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Identity mappings in deep residual networks. In ECCV.
[Hoang, Zheng, and Singh 2016] Hoang, M. X.; Zheng, Y.; and Singh, A. K. 2016. Forecasting citywide crowd flows based on big data. In ACM SIGSPATIAL.
[Ioffe and Szegedy 2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 448–456.
[Jain et al. 2007] Jain, V.; Murray, J. F.; Roth, F.; Turaga, S.; Zhigulin, V.; Briggman, K. L.; Helmstaedter, M. N.; Denk, W.; and Seung, H. S. 2007. Supervised learning of image restoration with convolutional networks. In ICCV, 1–8. IEEE.
[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
[LeCun et al. 2012] LeCun, Y. A.; Bottou, L.; Orr, G. B.; and Müller, K.-R. 2012. Efficient backprop. In Neural networks: Tricks of the trade. Springer.
[Li et al. 2015] Li, Y.; Zheng, Y.; Zhang, H.; and Chen, L. 2015. Traffic prediction in a bike-sharing system. In ACM SIGSPATIAL.
[Long, Shelhamer, and Darrell 2015] Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In IEEE CVPR, 3431–3440.
[Mathieu, Couprie, and LeCun 2015] Mathieu, M.; Couprie, C.; and LeCun, Y. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
[Nair and Hinton 2010] Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In ICML, 807–814.
[Silva, Kang, and Airoldi 2015] Silva, R.; Kang, S. M.; and Airoldi, E. M. 2015. Predicting traffic volumes and estimating the effects of shocks in massive transportation systems. Proceedings of the National Academy of Sciences 112(18):5643–5648.
[Song et al. 2014] Song, X.; Zhang, Q.; Sekimoto, Y.; and Shibasaki, R. 2014. Prediction of human emergency behavior and their mobility following large-scale disaster. In ACM SIGKDD, 5–14. ACM.
[Sutskever, Vinyals, and Le 2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
[Theano Development Team 2016] Theano Development Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688.
[Xingjian et al. 2015] Xingjian, S.; Chen, Z.; Wang, H.; Yeung, D.-Y.; Wong, W.-k.; and Woo, W.-c. 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 802–810.
[Xu et al. 2014] Xu, Y.; Kong, Q.-J.; Klette, R.; and Liu, Y. 2014. Accurate and interpretable bayesian mars for traffic flow prediction. IEEE Transactions on Intelligent Transportation Systems 15(6):2457–2469.
[Zhang et al. 2016] Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; and Yi, X. 2016. DNN-based prediction model for spatial-temporal data. In ACM SIGSPATIAL.
[Zheng et al. 2014] Zheng, Y.; Capra, L.; Wolfson, O.; and Yang, H. 2014. Urban computing: concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 5(3):38.
[Zheng 2015] Zheng, Y. 2015. Methodologies for cross-domain data fusion: An overview. IEEE transactions on big data 1(1):16–34.