Perceive, Attend, and Drive:
Learning Spatial Attention for Safe Self-Driving

Bob Wei

^{*, 1, 2}

, Mengye Ren

^{*, 1, 3}

, Wenyuan Zeng

^{1, 3}

, Ming Liang

^{1}

, Bin Yang

^{1, 3}

, Raquel Urtasun

^{1, 2}

^{*}

Equal contribution

^{1}

Uber ATG

^{2}

University of Waterloo

^{3}

University of Toronto. Correspondence: q25wei@uwaterloo.ca, [mren3, wenyuan, ming.liang, byang10, urtasun]@uber.com

Abstract

In this paper, we propose an end-to-end self-driving network featuring a sparse attention module that learns to automatically attend to important regions of the input. The attention module specifically targets motion planning, whereas prior literature only applied attention in perception tasks. Learning an attention mask directly targeted for motion planning significantly improves the planner safety by performing more focused computation. Furthermore, visualizing the attention improves interpretability of end-to-end self-driving.

1 Introduction

Self-driving is one of today’s most impactful technological challenges, one that promises to bring safe and affordable transportation everywhere. Tremendous improvements have been made in self-driving perception systems, thanks to the success of deep learning. This has enabled accurate detection and localization of obstacles, providing a holistic understanding of the surrounding world, which is then sent to the motion planner to decide subsequent driving actions.

Despite the eminent success of these perception systems, their detection objective is mis-aligned with the self-driving vehicle’s overall goal—to drive safely to the destination. We typically train the perception systems to detect all objects in the sensor range, assigning each object an equal weight even if some objects are not important as they will never interact with the self-driving vehicle. For example, they could be far away or parked on the other side of the road as in Figure 1. As a result, a vast amount of computation and model capacity is wasted in recognizing very difficult instances that matter only for common metrics such as average precision (AP), but not so much for driving. This is in striking contrast with how humans drive: we focus our visual attention in areas that directly impact safe planning. Inspired by the use of visual attention in our brain, we aim to introduce attention to self-driving systems to efficiently and selectively process complex scenes.

Numerous studies in the past have explored adding sparse attention in deep neural networks to improve computation efficiency in classification [47, 8] and object detection [35, 11, 36]. In order to perform well on the metrics employed in common benchmarks, the attention mask in [35] still needs to cover all actors in the scene, slowing the network when the scene has many vehicles.

In this paper, our aim is to address these inconsistencies such that the amount of computation is optimized towards the end goal of motion planning. Specifically our contributions are as follows:

We learn an attention mask directly towards the motion planning objective for safe self-driving.
We use the attention mask to reweight object detection and motion forecasting losses in our joint end-to-end training, focusing the model capacity on objects that matter most. Different from manually prioritizing instances [34], here the weighting is entirely data-driven.
Our attention-based model achieves significantly reduced collision rate and improved planning performance, while at a much lower computation cost.
Attention mask visualization improves interpretability of end-to-end deep learning models in self-driving.

Figure 1: Left: a toy example. Right: our learned spatial attention in red with ego vehicle in blue and others in green. Not all actors impact our safe driving and so we should prioritize accordingly.

2 Related Work

Attention mechanism in deep learning: Human and other primate visual perception systems feature visual attention to reduce the complexity of the scene and speed-up inference [14, 13]. Earlier studies in visual saliency aimed to predict human gaze with no particular task in mind [17]. Attention mechanisms nowadays are built in as part of end-to-end models to optimize towards specific tasks. The attention modules are typically implemented as multiplicative gates to select features. This schema has shown to improve performance and interpretability on downstream tasks such as object recognition [1, 22, 47], instance segmentation [36], image captioning [49], question answering [29, 50], as well as other natural language processing applications [2, 46, 6]. The visualization of the end-to-end learned attention suggests that deep attention-based models have an intelligent understanding of the inputs by focusing on the most informative parts of the input.

Sparse activation in neural networks: Sparse coding models [31] use an overcomplete dictionary to achieve sparse activation in the feature space. In modern convolutional neural networks (CNNs), sparsity is typically brought by the widespread use of ReLU activation functions, but these are rather unstructured, and speed-up has only been shown on specially designed hardware [16, 42]. Structured spatial sparsity, on the other hand, can be made efficient by using a sparse convolution operator [9, 35, 10], which in turn allows the network to shift its focus on more difficult parts of the inputs [8, 23, 35, 20]. In self-driving, [34] proposed a ranking function to prioritize computations that would have the most impact on motion planning. Weight pruning [26, 28] is another popular way to achieve sparsity in the parameter space, which is an orthogonal direction to our method.

Attention and loss weighting in multi-task learning: Our end-to-end self-driving network is an instance of multi-task learning as all three tasks—perception, prediction and motion planning—are simultaneously solved by individual output branches with shared features. It is common to use a summation of all the loss functions, but sometimes there are conflicting objectives among the tasks. Prior literature in multi-task learning has studied dynamic weighting towards different loss components, by using training signals such as uncertainty [18], gradient norm [5], difficulty level [12], or entirely data-driven objectives [25, 37]. In [25, 37], task and example weights are learned by optimizing the performance of the main task. The attention mechanism has also been used in multi-task learning: in [27], a network applies task-specific attention masks on shared features to encourage the outputs to be more selective. Similar to dynamic loss weighting models [25], we exploit the learned attention towards weighting instance detection losses. Instead of using multiple attentions, as was done in [27], we use one single attention mask to optimize our main task: driving.

Safety-driven learnable motion planning: One of the primary motivations of introducing attention into an end-to-end motion planning network is to improve safety. Traditionally, safety for self-driving models was done in terms of formal model checking and validation [41, 45, 32, 33, 48]. More recently, with the widely available driving data, imitation learning has been introduced in self-driving to learn from cautious human driving [51, 7, 40, 39, 52]. Safety has also been considered in terms of explicitly learning a risk sensitive measure from human demonstration [44, 21]. In our work, although safety is not explicitly encoded in our loss function, we have experimentally verified that the sparse attention models are significantly better at avoiding collisions.

Figure 2: The full neural motion planner (NMP) [51] with backbone network and header networks.

3 Perceive, Attend and Drive

In this section, we present our framework for using learned, motion-planning aware attention. We first describe the end-to-end neural motion planner that serves as the starting point of our work, and then introduce our proposed attention module and attention-driven loss function, which enable us to focus the computation in areas that matter for the end task of driving.

Figure 3: Our sparse attention neural motion planner (SA-NMP), which takes in LiDAR and HDMaps data, and outputs perception, prediction, and planning. Attention is generated from a U-Net and applied on the input branches of residual blocks. Our attention is learned towards the planning task, which is a direct model output.

3.1 A review on Neural Motion Planner (NMP)

Our proposed model extends upon the neural motion planner (NMP), which jointly solves the perception, prediction and planning problems for self-driving. In this section we briefly review NMP depicted in Figure 2, and refer the reader to [51] for more details.

Input and backbone: NMP voxelizes LiDAR point clouds to a birds-eye-view (BEV) feature map and fuses them with $M$ channels of rasterized HDMap features to produce an input representation of size $(Z T^{'} + M) \times H \times W$ , where $Z, H, W$ are the height and spatial dimensions and $T^{'} = 10$ is the number of input LiDAR sweeps. The backbone consists of 5 blocks, with the first 4 producing multi-scale features that are concatenated and fed to the final block. Overall the backbone downsamples the spatial dimension by 4.

Multi-task headers: Given the features computed by the backbone, $X \in R^{128 \times \frac{H}{4} \times \frac{W}{4}}$ , NMP uses two separate headers for perception & prediction, and motion planning. The perception & prediction header consists of separate branches for classification and regression. The classification branch outputs a score for each anchor box at each spatial location over the feature map $X$ , while the regression branch outputs regression targets for each anchor box, including targets for localization offset, size, and heading angle. The planning header consists of convolution and deconvolution layers to produce a cost volume $C \in R^{T \times \frac{H}{4} \times \frac{W}{4}}$ representing the cost for the self-driving vehicle to be at each location and time, with $T$ the fixed future planning horizon.

Planning inference: At inference time, NMP samples $N$ trajectories that are physically realizable, and chooses the lowest cost trajectory for the ego-car:

τ^{⋆} (X) = a r g m i n τ_{1 \dots N} c (τ_{i}, X) .

(1)

where the cost of a trajectory $τ = (x_{t}, y_{t})_{t = 1}^{T}$ is the sum of all its waypoints in the cost volume:

c (τ, X) = T \sum t = 1 C_{t, x_{t}, y_{t}} (X),

(2)

We sample trajectories using a mixture of Clothoid, circle, and straight curves [43]. We refer the readers to [51] for more details on the sampling procedure.

3.2 Sparse Attention Neural Motion Planner (SA-NMP)

In this section, we propose our sparse spatial attention module for self-driving, shown in Figure 3, which learns to save computation while performing well on the end task of driving safely to the goal.

Input and backbone: We exploit the same input representation as NMP and use the same perception, prediction, and planning headers. The NMP backbone is replaced with the state-of-the-art backbone network of PnPNet [24], which uses cross-scale blocks throughout to fuse BEV sensory input. Each cross-scale block consists of 3 parallel branches at different resolutions that downsample the feature map, perform bulk computations, and then upsample back to the backbone resolution, before finally fusing cross-scale features across all branches. There is an additional residual connection across each cross-scale block. The final output feature from the backbone consists of 128 channels at 4x downsampled resolution, which is forwarded to the planning and detection headers. In addition to the improved performance, PnPNet can be easily scaled for different computational budgets by varying the depth and width of the cross-scale blocks.

Existing attention-driven approaches [35] tackle only the perception task and use either a road mask obtained from map information or a vehicle mask produced by a different perception module. As a consequence, they waste computation on areas that will not affect the self-driving car. We instead propose a novel approach that is end-to-end trainable and performs computation selectively for planning a safe maneuver. As shown in Fig. 3, the learned attention mask then gates the backbone network, limiting computation to areas where attention is active. By using binary attention, we can leverage sparse convolution to improve the computational efficiency.

Generating binary attention: Computational efficiency has been shown to be one of the most prominent advantages of using the attention mechanism. For soft attention masks the computation is still dense across the entire activation map, and therefore no computation savings can be achieved. SBNet [35] showed that a sparse convolution operator can achieve significant speed-ups with a given discrete binary attention mask. Here, we would also like to exploit the computational benefit of sparse convolution by using discrete attention outputs.

We utilize a network to predict a scalar score for each spatial location, and binarize the score to represent our sparse attention map. For efficiency and simplicity, we use a small U-Net [38] with skip connections and two downsample/upsample stages. We would like to apply the generated attention back to the BEV features in the model backbone so as to sparsify the spatial information, allowing computation to be focused on the important regions only. We choose to do so in a residual manner [47, 35] to avoid deteriorating the features throughout the backbone. Let $x + F (x)$ denote the normal residual block. Our attention mask is multiplied with the input to the residual block as follows:

R e s A t t e n d (x) = x + F (x ⊙ A),

(3)

where $⊙$ denotes elementwise multiplication. See Fig. 3 for an illustration of our architecture.

Learning binary attention with Gumbel softmax: In order to learn the attention generator and backpropagate through the binary attention map, we make use of the Gumbel softmax technique [15, 30] since the step function is not differentiable, and using the standard sigmoid function suffers from a more severe bias-variance trade-off [15]. Let $i, j$ denote spatial coordinates, and $z_{i, j}$ the scalar output from the attention U-Net. We first add the Gumbel noise on the logits as follows:

$π_{i, j}$	$= s i g m o i d (z_{i, j})$	(4)
$α_{i, j}^{(0)}$	$= log π_{i, j} + g_{i, j}^{(0)}$	(5)
$α_{i, j}^{(1)}$	$= log (1 - π_{i, j}) + g_{i, j}^{(1)},$	(6)

where $g_{i, j} = - log (- log u)$ , and $u$ is sampled from $Uniform [0, 1]$ . At inference time, hard attention $A_{i, j}$ can be obtained by comparing the logits,

A_{i, j}

= {\begin{matrix} 1 & if α_{i, j}^{(0)} \geq α_{i, j}^{(1)} 0 & otherwise . \end{matrix}

(7)

During training, however, we would like to approximate the gradient by using the straight-through estimator [15, 3]. Hence, in the backward pass, the step function is replaced with a softmax function with a temperature constant $K$ (where $~ A$ is the underlying soft attention):

{~ A}_{i, j}

= \frac{exp (α_{i, j}^{(0)} / K)}{exp (α_{i, j}^{(0)} / K) + exp (α_{i, j}^{(1)} / K)} .

(8)

3.3 Multi-Task Learning

We train our sparse neural motion planner (including the attention) end-to-end with a multi-task learning objective that combines planning ( $L_{plan}$ ) with perception & motion forecasting ( $L_{cls}$ , $L_{reg}$ ):

L = λ_{plan} L_{plan} + λ_{cls} L_{cls} + λ_{reg} L_{reg} + λ_{A} L_{A} + λ ∥ w ∥_{2}^{2},

(9)

where $L_{A}$ is an $ℓ_{1}$ loss, defined in Equation 17, that controls the sparsity of the attention mask, and $∥ w ∥_{2}^{2}$ is the standard weight decay term. Following [51, 24], we fix $λ_{plan} = 0.001, λ_{cls} = 1.0, λ_{reg} = 0.5$ .

Motion planning loss: The motion planning loss utilitizes the max-margin objective, where the ground-truth driving trajectory (performed by a human) should be of lower cost than other trajectories sampled by the model. Let $(x_{t}, y_{t})$ be the groundtruth trajectory and let $c_{t}$ be the cost volume output by the model. We sample $N$ lowest cost trajectories based on the cost volume: ${x_{t}^{(i)}, y_{t}^{(i)}}_{i = 1}^{N}$ , and penalize the maximum margin violation between the groundtruth and model samples:

L_{plan}

= max i = 1 \dots N T \sum t = 1 max {0, c_{t} - c_{t}^{(i)} + Δ_{t}^{(i)}},

(10)

where $Δ_{t}^{(i)}$ is the task loss capturing spatial differences, and traffic violations $v_{t}$ encoded in binary:

Δ_{t}^{(i)}

= {∥ ∥ (x_{t}, y_{t}) - (x_{t}^{(i)}, y_{t}^{(i)}) ∥ ∥}_{2} + v_{t}^{(i)} .

(11)

Perception & prediction (PnP) loss: This loss follows the standard classification and regression objectives for object detection. The classification part uses binary cross-entropy:

L_{cls, i, j} = \sum k - {^y}_{i, j, k} log (y_{i, j, k}) - (1 - {^y}_{i, j, k}) log (1 - y_{i, j, k}),

(12)

where $^y$ is the predicted classification score between 0 and 1, and $y$ is the binary groundtruth. For each detected instance, the model outputs a bounding box, and a pair of coordinates and angles for each future step. We reparameterize the shift of a bounding box $(x, y, w, h, θ)$ from an anchor bounding box $(x_{a}, y_{a}, w_{a}, h_{a}, θ_{a})$ in a 6-dimensional vector $δ$ :

δ_{t} = [\frac{x_{a} - x}{w}, \frac{y_{a} - y}{h}, log \frac{w_{a}}{w}, log \frac{h_{a}}{h}, sin (θ_{a} - θ), cos (θ_{a} - θ)] .

(13)

A regression loss is then applied for the trajectory of the instance up to time $T$ . For each spatial coordinate $(i, j)$ , we sum up the losses of all bounding boxes $b$ where the ground truth belongs to this location:

L_{reg, i, j} = \sum b \in (i, j) T \sum t = 0 S m o o t h L 1 ({^δ}_{b, t}, δ_{b, t}),

(14)

with $^δ$ the model predicted shifts and $δ$ the ground truth shifts.

Auxiliary loss masking: Our overall objective is to achieve good performance in motion planning, so PnP is an auxiliary task. Since the majority of our computation happens within the attended area, intuitively the model should not be penalized as severely for mis-detecting objects not in the attended area. We therefore propose to use our spatial attention mask $A$ to re-weight the PnP losses as follows:

	$L_{cls}$	$= γ_{1} \sum i, j A_{i, j} L_{cls, i, j} + γ_{0} \sum i, j L_{cls, i, j}$		(15)
	$L_{reg}$	$= γ_{1} \sum i, j A_{i, j} L_{reg, i, j} + γ_{0} \sum i, j L_{reg, i, j}$		(16)

where $γ_{1}$ weights attended instances, and $γ_{0}$ weights all instances. We fix $γ_{0} = 0.1$ and $γ_{1} = 0.9$ .

Attention sparsity loss: To encourage focused attention and high sparsity, we use an $ℓ_{1}$ regularizer on the attention mask as follows. We control sparsity with $λ_{A}$ in Equation 9.

L_{A} = \sum i, j A_{i, j} .

(17)

Figure 4: Planning performance of our learned sparse attention model compared to other baselines at varying computation budgets (lower is better on both metrics). Left: Drive4D; Right: nuScenes. Note that all models except for NMP use the same SA-NMP backbone, which can be scaled by changing the depth and width, and allows us to vary the computation of Dense SA-NMP and SA-NMP+Learned Attention.

4 Experimental Evaluation

We evaluated on a real-world driving dataset (Drive4D), training on over 1 million frames from 5,000 scenarios and validating on 5,000 frames from 500 scenarios, using both LiDAR and HD-maps. We also evaluated on nuScenes v1.0 [4], a large-scale public dataset, with a training set of over 200,000 frames and a test set of 5,000 frames. Due to the inaccurate localization they provide, we omitted HDMaps and only used LiDAR [24].

4.1 Implementation Details and Metrics

Training: To jointly train SA-NMP with attention, we use pretrained weights for the backbone and headers from training a SA-NMP without attention (dense) for two epochs. We train all our models with batch size 5 across 16 GPUs in parallel using Adam [19] optimizer. We use an initial learning rate of 0.0001 and a decay of 0.1 at 1.0 and 1.6 epoch(s), for a total of another 2.0 epochs.

Evaluation: To evaluate driving and safety performance, we focus on the following planning metrics which are accumulated over all 6 future timesteps (3s): Planning L2 is the L2 distance between waypoints of the predicted future ego trajectory and those of the ground-truth trajectory (characterized by human driving). Collision rate is the frequency of collisions between the planned ego trajectory and the ground truth trajectories of other actors in the scene. Lane violation rate measures the number of lane boundary violations by the planned ego trajectory. we do not evaluate this on nuScenes due to the inaccurate localization they provide,

Baselines: We compare our learned attention to baselines that are end-to-end trained using static attention masks obtained from priors. Road Mask covers the entire road as provided from the map data. Vehicle Mask strictly covers all detections in the input space, obtained from a PSPNet [53] trained for segmentation. Proximity Mask is a circular radius around the ego vehicle. Dense is not using sparse attention.

Drive4D	Backbone	Sparsity	Planning L2	Collision Rate	Lane Violation
Dataset	FLOPS		at 3s (m)	over 3s (%)	over 3s (%)
NMP [51]	39.18B	0.0%	2.279	0.657	2.780
Dense SA-NMP	22.73B	0.0%	2.207	0.639	1.350
SA-NMP+Vehicle Mask	5.43B	93.6%	2.297	0.639	1.387
SA-NMP+Proximity Mask	5.31B	94.0%	2.276	0.584	1.387
SA-NMP+Road Mask	12.85B	68.9%	2.194	0.548	1.296
SA-NMP+Learned Attn (Ours)	5.22B	95.0%	2.102	0.511	1.338

nuScenes	Backbone	Sparsity	Planning L2	Collision Rate
Dataset	FLOPS		at 3s (m)	over 3s (%)
NMP [51]	39.18B	0.0%	2.310	1.918
Dense SA-NMP	22.73B	0.0%	2.271	2.198
SA-NMP+Vehicle Mask	5.44B	94.0%	2.263	2.234
SA-NMP+Proximity Mask	5.31B	94.0%	2.103	2.073
SA-NMP+Learned Attn (Ours)	5.34B	94.0%	2.052	1.588

Table 1: Performance and efficiency of our learned attention model vs. dense and simple attention baselines.

4.2 Results

Quantitative results: With a sparse attention mask learned towards motion planning, we can leverage the sparsity in the network backbone to greatly reduce computational costs, while not only maintaining but improving model performance. In our experimental results, we use theoretical FLOPs to show the efficiency of our network, but this also translates to real time gains as SBNet [35] has been shown to leverage sparsity to achieve real speed-ups. The increase in efficiency from leveraging sparsity is shown in Table 1, where our learned attention model uses $\sim 80 %$ fewer FLOPs than Dense SA-NMP thanks to its 95% sparse attention mask. Also, even with an identical SA-NMP backbone as the baselines (except NMP), our model with learned attention performs better in all motion planning metrics, which indicates that focused backbone computation is greatly advantageous to the overall goal of safe planning. NMP+Road performs slightly better in Lane Violation due to the road mask attention focusing on all road/lane markings. However, this baseline uses more than double the FLOPs since its attention mask looks at the entire road surface at only 68.9% sparsity. From Fig. 4, our learned attention model clearly outperforms other baselines in collision rate and planning L2, across all computational budgets.

Figure 5: Visualization of the attention masks and planned trajectory comparing dense, road mask, vehicle mask, proximity mask, and our learned attention. Col A: baselines turn too fast and collide with vehicle ahead. B: baselines collide with the future position of a left-turning vehicle. C: a tight left-turn where all models collide with or nearly miss parked vehicles. D: a rear-end collision for all models.

Qualitative results: In Figure 5, we show examples of our learned attention compared to baselines. As expected, our model focuses on the road and vehicles directly ahead; however, it also diverts some amount of attention to distant vehicles and road markings. This ability to dynamically distribute attention is likely why our model outperforms the baselines, which attend either indiscriminately (Dense and Road) or too selectively (Vehicle and Proximity). From the visualizations, we can better understand our model’s improved collision avoidance. Since the attention is dynamic, our model is more effective at anticipating other vehicles resulting in more cautious planning. This is illustrated by Columns A, B in Fig. 5, where our planned trajectory avoids future collisions with others. The failure cases mostly arise from rear-end collisions, one of which is Column D where all models are hit by the trailing vehicle. Note that this arises as we evaluate in open-loop. Our model focuses on surrounding vehicles and not enough on the open road to its right, which would give the option of making a right turn.

Sparsity of learned attention: Table 2 shows the result of varying $λ_{A}$ from Eq. 9, which weights the $ℓ_{1}$ regularization term from Eq. 17, with other settings held constant. We found that overall motion planning performance improves with increased sparsity, or essentially more focused computation, and peaks at 95% sparsity.

Sparsity	$λ_{A}$	Planning L2	Collision Rate	Lane Violation
		at 3s (m)	over 3s (%)	over 3s (%)
Dense 0%	-	2.207	0.639	1.350
Ours 20%	$1.0 e^{- 8}$	2.179	0.547	1.352
Ours 50%	$1.0 e^{- 7}$	2.138	0.620	1.361
Ours 75%	$5.0 e^{- 7}$	2.132	0.584	1.387
Ours 95%	$1.0 e^{- 6}$	2.102	0.511	1.338
Ours 99%	$5.0 e^{- 6}$	2.211	0.566	1.367

Table 2: Varying learned attention sparsity with

λ_{A}

Model	Planning L2	Collision Rate	Lane Violation
	at 3s (m)	over 3s (%)	over 3s (%)
$γ_{1} = 1.00$	2.210	0.548	1.378
$γ_{1} = 0.90$	2.102	0.511	1.338
$γ_{1} = 0.75$	2.230	0.621	1.393
$γ_{1} = 0.50$	2.194	0.637	1.405
$γ_{1} = 0.00$	2.188	0.633	1.397

Table 3: Influence of loss reweighting ratio

γ_{1}

Perception and prediction (PnP) loss reweighting: Table 3 shows results with varying $γ_{1}$ and $γ_{0} = 1 - γ_{1}$ from Eq. 16 which control the weighting of the PnP loss computed on actors inside vs. outside the attention mask. All other variables are fixed, including sparsity at 95%. As $γ_{1}$ increases, the learned attention is less restricted by detection performance on all actors, and is able to focus on only the most important actors and parts of the road, distributing attention towards improving motion planning performance. Note that $γ_{1} = 1.0$ is an extreme case where PnP loss is computed only on actors within attention mask: the model learns to cheat by generating attention that avoids all actors resulting in no PnP learning signal, hence the poor performance. For our main experiments, we fix $γ_{1} = 0.9$ .

Detection performance: Since the overall goal is improved motion-planning with lighter computation, focusing on accurately detecting all actors indiscriminately would contradict the purpose of our learned sparse attention. We should not care as much about far away or irrelevant actors that have no effect on safe planning, and should instead focus our computation on important input regions. Table 4 compares detection performance between our learned attention and the baseline dense model evaluated on different subsets of actors in the scene. Full includes all actors in the input, while Attended Region is the subset of actors that lie within the attention mask. For evaluating the dense model, we use the attention mask generated by our learned model to get the Attended Region, ensuring that both models are evaluated on the same actor subsets in both settings. The results shows that our 95% sparse, learned attention model is better than the dense model at detecting actors within the attention mask, meaning that its performance is better focused on actors that it believes are important. This may explain the overall improved planning performance of our attention-driven models as demonstrated in the main quantitative and qualitative results.

Model	mAP on Full			mAP on Attended Region
	IoU@0.3	IoU@0.5	IoU@0.7	IoU@0.3	IoU@0.5	IoU@0.7
Dense SA-NMP	97.8	94.7	80.3	94.1	93.3	87.9
Ours 95% Sparse	96.3	92.1	74.9	94.2	93.8	88.5

Table 4: Detection performance on different input regions.

5 Conclusion

In this work we propose an end-to-end learned, sparse visual attention mechanism for self-driving, where the sparse attention mask gates the feature backbone computation. As opposed to existing methods that focus on using attention for perception only, our attention masks are directly optimized for motion planning, which enables our network to output better planned trajectories while achieving more efficiency with higher sparsity. As future work, the attention module can be extended to have recurrent feedbacks from the output layers to better leverage temporal information.

References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu (2015) Multiple object recognition with visual attention. In ICLR, Cited by: §2.
[2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §2.
[3] Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432. Cited by: §3.2.
[4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: A multimodal dataset for autonomous driving. CoRR abs/1903.11027. Cited by: §4.
[5] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) GradNorm: gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, Cited by: §2.
[6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §2.
[7] H. Fan, F. Zhu, C. Liu, L. Zhang, L. Zhuang, D. Li, W. Zhu, J. Hu, H. Li, and Q. Kong (2018) Baidu apollo EM motion planner. CoRR abs/1807.08048. Cited by: §2.
[8] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. P. Vetrov, and R. Salakhutdinov (2017) Spatially adaptive computation time for residual networks. In CVPR, Cited by: §1, §2.
[9] M. Figurnov, A. Ibraimova, D. P. Vetrov, and P. Kohli (2016) PerforatedCNNs: acceleration through elimination of redundant convolutions. In NIPS, Cited by: §2.
[10] B. Graham, M. Engelcke, and L. van der Maaten (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, Cited by: §2.
[11] C. Guo, B. Fan, J. Gu, Q. Zhang, S. Xiang, V. Prinet, and C. Pan (2019) Progressive sparse local attention for video object detection. In ICCV, Cited by: §1.
[12] M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In ECCV, Cited by: §2.
[13] L. Itti, C. Koch, and E. Niebur (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20 (11), pp. 1254–1259. External Links: ISSN 0162-8828 Cited by: §2.
[14] L. Itti, G. Rees, and J. K. Tsotsos (2005) Neurobiology of attention. Academic Press, Burlington. Cited by: §2.
[15] E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: §3.2.
[16] P. Judd, A. D. Lascorz, S. Sharify, and A. Moshovos (2017) Cnvlutin2: ineffectual-activation-and-weight-free deep neural network computing. CoRR abs/1705.00125. Cited by: §2.
[17] T. Judd, K. A. Ehinger, F. Durand, and A. Torralba (2009) Learning to predict where humans look. In ICCV, Cited by: §2.
[18] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, Cited by: §2.
[19] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In ICLR, Cited by: §4.1.
[20] S. Kong and C. C. Fowlkes (2019) Pixel-wise attentional gating for scene parsing. In WACV, Cited by: §2.
[21] J. Lacotte, M. Ghavamzadeh, Y. Chow, and M. Pavone (2019) Risk-sensitive generative adversarial imitation learning. In AISTATS, Cited by: §2.
[22] H. Larochelle and G. E. Hinton (2010) Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, Cited by: §2.
[23] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In CVPR, Cited by: §2.
[24] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun (2020) PnPNet: end-to-end perception and prediction with tracking in the loop. In CVPR, Cited by: §3.2, §3.3, §4.
[25] X. Lin, H. S. Baweja, G. Kantor, and D. Held (2019) Adaptive auxiliary task weighting for reinforcement learning. In NeurIPS, Cited by: §2.
[26] B. Liu, M. Wang, H. Foroosh, M. F. Tappen, and M. Pensky (2015) Sparse convolutional neural networks. In CVPR, Cited by: §2.
[27] S. Liu, E. Johns, and A. J. Davison (2019) End-to-end multi-task learning with attention. In CVPR, Cited by: §2.
[28] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV, Cited by: §2.
[29] J. Lu, J. Yang, D. Batra, and D. Parikh (2016) Hierarchical question-image co-attention for visual question answering. In NIPS, Cited by: §2.
[30] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, Cited by: §3.2.
[31] B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision Research 37, pp. 3311–3325. Cited by: §2.
[32] P. F. Orzechowski, A. Meyer, and M. Lauer (2018) Tackling occlusions & limited sensor range with set-based safety verification. In ITSC, Cited by: §2.
[33] C. Pek and M. Althoff (2018) Computationally efficient fail-safe trajectory planning for self-driving vehicles using convex optimization. In ITSC, Cited by: §2.
[34] K. S. Refaat, K. Ding, N. Ponomareva, and S. Ross (2019) Agent prioritization for autonomous navigation. In IROS, Cited by: 2nd item, §2.
[35] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun (2018) SBNet: sparse blocks network for fast inference. In CVPR, Cited by: §1, §2, §3.2, §3.2, §3.2, §4.2.
[36] M. Ren and R. S. Zemel (2017) End-to-end instance segmentation with recurrent attention. In CVPR, Cited by: §1, §2.
[37] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In ICML, Cited by: §2.
[38] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §3.2.
[39] A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and R. Urtasun (2020) Perceive, predict, and plan: safe motion planning through interpretable semantic representations. In ECCV, Cited by: §2.
[40] A. Sadat, M. Ren, A. Pokrovsky, Y. Lin, E. Yumer, and R. Urtasun (2019) Jointly learnable behavior and trajectory planning for self-driving vehicles. In IROS, Cited by: §2.
[41] S. Shalev-Shwartz, S. Shammah, and A. Shashua (2017) On a formal model of safe and scalable self-driving cars. CoRR abs/1708.06374. Cited by: §2.
[42] S. Shi and X. Chu (2017) Speeding up convolutional neural networks by exploiting the sparsity of rectifier units. CoRR abs/1704.07724. Cited by: §2.
[43] D. H. Shin and S. Singh (1990) Path generation for robot vehicles using composite clothoid segments. Cited by: §3.1.
[44] S. Singh, J. Lacotte, A. Majumdar, and M. Pavone (2018) Risk-sensitive inverse reinforcement learning via semi- and non-parametric methods. I. J. Robotics Res. 37 (13-14). Cited by: §2.
[45] Ö. S. Tas, F. Hauser, and C. Stiller (2018) Decision- time postponing motion planning for combinatorial uncertain maneuvering. In ITSC, Cited by: §2.
[46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §2.
[47] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In CVPR, Cited by: §1, §2, §3.2.
[48] K. Wong, Q. Zhang, M. Liang, B. Yang, R. Liao, A. Sadat, and R. Urtasun (2020) Testing the safety of self-driving vehicles by simulating perception and prediction. In ECCV, Cited by: §2.
[49] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In ICML, Cited by: §2.
[50] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola (2016) Stacked attention networks for image question answering. In CVPR, Cited by: §2.
[51] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun (2019) End-to-end interpretable neural motion planner. In CVPR, Cited by: Figure 2, §2, §3.1, §3.1, §3.3, Table 1.
[52] W. Zeng, S. Wang, R. Liao, Y. Chen, B. Yang, and R. Urtasun (2020) DSDNet: deep structured self-driving network. In ECCV, Cited by: §2.
[53] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. CVPR. Cited by: §4.1.

Perceive, Attend, and Drive: Learning Spatial Attention for Safe Self-Driving