PooDLe: Pooled and Dense Self-Supervised Learning from Naturalistic Videos
Abstract
Self-supervised learning has driven significant progress in learning from single-subject, iconic images. However, there are still unanswered questions about the use of minimally-curated, naturalistic video data, which contain dense scenes with many independent objects, imbalanced class distributions, and varying object sizes. In this paper, we propose PooDLe, a self-supervised learning method that combines an invariance-based objective on pooled representations with a dense SSL objective that enforces equivariance to optical flow warping. Our results show that a unified objective applied at multiple feature scales is essential for learning effective image representations from naturalistic videos. We validate our method with experiments on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective.
1 Introduction
Humans and other animals learn visual understanding from a continuous stream of inputs with little explicit supervision. Self-supervised learning (SSL) (Chen et al., 2020; Grill et al., 2020; Chen and He, 2021; Caron et al., 2021; Bardes et al., 2021; He et al., 2022; Assran et al., 2023; Bardes et al., 2023a; He et al., 2020) has made great strides in learning without human annotations, becoming competitive with supervised learning. However, many methods still revolve around ImageNet (Deng et al., 2009), which is implicitly supervised through iconic images that contain a single, clear subject and a balanced class distribution. In contrast, naturalistic data like egocentric videos contain cluttered scenes, imbalanced classes, and objects of varying sizes, making them ill-suited for iconic methods.
Nevertheless, these naturalistic videos are still valuable for their information density and ease of collection, while also mimicking the real-life perspective of humans. Unfortunately, iconic methods, which pool global image representations, may perform poorly as dense scenes produce views containing independent subjects that are semantically incompatible (Figure 1(b), red boxes). Recent works have attempted to address this weakness by introducing 1) cropping (Selvaraju et al., 2021) or attention (Venkataramanan et al., 2024) mechanisms to account for multiple subjects, and 2) “dense SSL” objectives (Xiong et al., 2021; Wang et al., 2021b) with losses defined over regions of unpooled image representations.
While dense SSL methods avoid semantic mismatches, we discover that they are susceptible to spatial imbalance, where large background classes like the sky dominate the representation while smaller classes like pedestrians are underrepresented. This is undesirable because smaller foreground objects should be prioritized over low-detail, repetitive background classes. Furthermore, this can be dangerous in applications like self-driving (Yu et al., 2020), where critical objects like pedestrians occupy only a small fraction of a video frame (Figure 1(b), green boxes, and 1(c)). This contrasts with ImageNet (Deng et al., 2009) training, where models can easily learn semantics from iconic images with clear, single-subject views and a balanced class distribution. Surprisingly, dense methods like FlowE (Xiong et al., 2021) and supervised ImageNet pretraining achieve similar downstream performance while converging to very different solutions; the former prioritizes large background classes while the latter captures many small and rare classes, but with relatively poor specificity. Some dense SSL methods (Wang et al., 2021b; Parthasarathy et al., 2023; Venkataramanan et al., 2024) include losses that optimize a global, pooled representation to learn semantic information from dense scenes, but do not explore how to integrate the two objectives through architecture and augmentation strategies.
To address the challenges of cluttered scenes and spatial imbalance, we propose a joint Pooled and Dense Learning (PooDLe) method that optimizes a dense SSL objective over full images and a pooled objective on smaller, semantically-aligned views. The combination of objectives captures both high-level semantics and fine-grained details, effectively representing both small objects and scene-level understanding. Our dense objective is adopted from FlowE, using optical flow warping to align dense feature maps. To adapt the pooled objective to dense scene data, we introduce a flow-informed cropping procedure that generates pairs of smaller “subcrops” with high alignment. These subcrops serve as pseudo-iconic views of foreground objects, functionally increasing the prevalence of smaller objects (Figure 1). Finally, we introduce a lightweight spatial decoder module (SDM) with top-down layers and UNet-like lateral connections (Ronneberger et al., 2015) to upsample high-level semantic representations and preserve smaller objects in the dense objective. We show that both objectives, combined with the SDM, are essential for capturing the semantics of smaller objects and achieving strong downstream task performance.
We pretrain on BDD100K (Yu et al., 2020), a dataset of dashcam driving videos, as well as Walking Tours (Venkataramanan et al., 2024), a dataset of first-person walking videos. PooDLe achieves state-of-the-art performance on semantic segmentation and object detection benchmarks, with notable gains on recognizing small objects. We also introduce Walking Tours Semantic (WT-Sem) as a new in-distribution semantic segmentation evaluation for Walking Tours. In our ablations, we show that our joint objective formulation and the SDM are critical for success. Finally, we study the effect of crop area, input resolution, number of subcrops, and temporal stride.
In summary, our contributions are as follows:
- We introduce PooDLe, a new SSL method that overcomes the challenges of spatial imbalance and cluttered scenes by unifying a flow-equivariance dense SSL objective with a pooled objective over pseudo-iconic subcrops, alongside a spatial decoder module, to effectively learn from naturalistic video. PooDLe achieves state-of-the-art performance on BDD100K (Yu et al., 2020) and Cityscapes (Cordts et al., 2016) semantic segmentation and BDD object detection. It also obtains the highest mIoU on ADE20K (Zhou et al., 2017) and on WT-Sem, our new in-distribution semantic segmentation task for Walking Tours.
- We deconstruct the BDD100K semantic segmentation task, identifying class categories by frequency and size within the dataset. We show that existing dense SSL methods and supervised ImageNet training produce different results across these categories, while PooDLe learns a balanced semantic and spatial representation to achieve strong, consistent performance.
- We study the effects of global crop and subcrop area, input resolution, and temporal stride between paired frames. We show the importance of maintaining pixel density by adjusting crop area when training at larger resolutions, and verify that smaller subcrop areas better capture smaller classes. We believe these observations will help guide future work on dense, naturalistic data.
2 Related Work
Self-supervised learning with iconic images.
Representation learning on iconic image datasets has a long history, from denoising autoencoders (Vincent et al., 2010) to joint embedding methods (Chen et al., 2020; He et al., 2020; Grill et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Caron et al., 2021) to joint-embedding predictive architectures (Assran et al., 2023; Bardes et al., 2023b). Joint embedding methods learn representation invariance to visual changes produced by augmentations using contrastive (Chen et al., 2020; Oord et al., 2018), mean squared error (Grill et al., 2020), or classification (Caron et al., 2021, 2020) losses between corresponding pairs, pushing SSL to new heights on ImageNet classification. Later works extend these methods to curated, internet-scale data (Oquab et al., 2023) and include other modalities like text (Radford et al., 2021). Separately, MAE (He et al., 2022) learns via reconstruction of masked image regions. iBOT (Zhou et al., 2021) combines joint embedding methods with token reconstruction to achieve impressive results on ImageNet classification. The methods above have been primarily designed for iconic images and contain assumptions that may not transfer well to uncurated datasets, e.g. dense scenes. Methods leveraging multi-crop (Caron et al., 2020, 2021; Oquab et al., 2023; Zhou et al., 2021) generate small crops optimized to predict the representations of global crops for training on iconic images with little additional compute. In contrast, our subcrop strategy yields small, aligned crops as pseudo-iconic, paired views from otherwise dense scenes.
Training using dense multi-subject images.
Following the success of SSL on ImageNet, other works seek to learn from dense, multi-subject images where augmented views may not contain corresponding subjects for invariance learning. Wang et al. (2021b); Xie et al. (2021); Chen et al. (2021) extend joint embedding methods by leveraging feature similarity bootstrapped from standard invariance learning to identify positive pairs across dense, unpooled feature maps. Hénaff et al. (2021); Wang et al. (2021a) optimize dense losses, contrasting pixels belonging to different semantic classes; these methods require off-the-shelf segmentation modules. Ziegler and Asano (2022); Guo et al. (2023) utilize DINO (Caron et al., 2021) attention maps to identify training pairs, while ADCLR (Zhang et al., 2023b) identifies pairs using small “query” crops and the patches that attend to them. These methods advance the ability to learn from dense images with multiple objects, but still have limitations. Some rely on learning objectives that make assumptions about iconic data, while others struggle with the spatial imbalance problem that is especially prevalent in naturalistic data.
Learning image representations from video data.
Extending beyond images, other works have sought to capture the variance of objects through time by training on pairs of video frames. Gordon et al. (2020) adapts contrastive learning to use correlated frames as positive examples, while Jabri et al. (2020); Parthasarathy et al. (2023) identify positive pairs based on high similarity in representation space. FlowE (Xiong et al., 2021) builds on BYOL (Grill et al., 2020) and identifies positive spatial regions between frames using off-the-shelf flow. MC-JEPA (Bardes et al., 2023b) learns motion using video data by aligning latent representations throughout the feature pyramid while performing representation learning on ImageNet. Most recently, DoRA (Venkataramanan et al., 2024) proposes a new dense video dataset and extends DINO by clustering over many frames to identify and track objects for representation learning. In the MAE paradigm, Tong et al. (2022); Feichtenhofer et al. (2022) directly reconstruct sequences of frames while Weinzaepfel et al. (2022); Gupta et al. (2023) perform reconstruction given a corresponding overlapping frame. While PooDLe learns a rich image representation from video data similar to these existing methods, it distinguishes itself by leveraging a unified dense and pooled objective architecture, specifically designed to tackle the challenges posed by naturalistic data.

3 PooDLe: Pooled and Dense Learning from naturalistic videos
We present PooDLe, a self-supervised method for learning visual representations using paired frames from naturalistic, first-person videos. PooDLe combines two SSL objectives: a dense objective for learning representations of dense, crowded scenes; and a pooled objective on small subcrops sampled using flow-aware cropping augmentations. We also propose a lightweight spatial decoder module (SDM) that uses top-down decoder layers and UNet-like lateral connections to earlier encoder representations to both upsample the high-level representations and resurface fine-grained details and small objects that may get lost in downsampling operations. For a high-level overview of PooDLe, see Figure 2.
Preliminaries.
Inputs to the model are video frame pairs $(x_t, x_{t+k})$ of spatial dimensions $H \times W$, together with the dense optical flow $F$ between them. Randomly sampled augmentations $a_1$ and $a_2$ are applied to each example to create positive training pairs. In a similar setup to BYOL, the encoder and projector are denoted as a function $h$ using either online weights $\theta$ (i.e., $h_\theta$) or offline, EMA-updated weights $\xi$ (i.e., $h_\xi$). The predictor module $q_\theta$ only has online weights. Separate projector and predictor modules are used for the pooled and dense objectives, but are not annotated for simplicity. We use a ResNet-50 backbone, as well as projectors and predictors following FlowE (Xiong et al., 2021) and BYOL (Grill et al., 2020), which are discarded after pretraining.
Dense SSL with flow equivariance.
The dense objective follows FlowE (Xiong et al., 2021) by using optical flow to align the paired feature projections $z_1 = h_\theta(a_1(x_t))$ and $z_2 = h_\xi(a_2(x_{t+k}))$. At a high level, this objective minimizes differences in representation between corresponding regions. More specifically, the inverse augmentations $a_1^{-1}$, $a_2^{-1}$ and the flow warp $\mathcal{W}_F$ are used to align the representations after upsampling them to the input resolution, and the objective is the squared error

$$\mathcal{L}_{\text{dense}} = \left\lVert q_\theta\!\left(a_1^{-1}(z_1)\right) - \mathcal{W}_F\!\left(a_2^{-1}(z_2)\right) \right\rVert_2^2, \tag{1}$$

where normalization is applied after the predictor and after flow warping.
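To make Eq. 1 concrete, the following is a minimal PyTorch sketch of a flow-warped dense loss, assuming the projections have already been upsampled to input resolution and inverse-augmented; the helper names (`warp_with_flow`, `dense_loss`) and the flow channel ordering are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Warp a feature map (B, C, H, W) with dense flow (B, 2, H, W) via bilinear sampling.
    Assumes flow[:, 0] is the horizontal and flow[:, 1] the vertical displacement."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    # Displace the base sampling grid by the flow, then normalize to [-1, 1] for grid_sample.
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def dense_loss(pred_online, proj_target, flow, mask=None):
    """Per-pixel squared error between normalized online predictions and
    flow-warped EMA targets (Eq. 1)."""
    target = warp_with_flow(proj_target.detach(), flow)
    p = F.normalize(pred_online, dim=1)
    t = F.normalize(target, dim=1)
    err = ((p - t) ** 2).sum(dim=1)
    if mask is not None:  # e.g. an occlusion or out-of-view mask
        return (err * mask).sum() / mask.sum().clamp(min=1)
    return err.mean()
```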
Pooled objective with flow-informed subcrops.
First, we identify pseudo-iconic subcrop pairs. Unlike with iconic data, random crops from paired frames are unlikely to contain a common subject. To mitigate this problem, we once again use optical flow, this time in a flow-informed cropping procedure that identifies aligned training pairs. For each subcrop pair, we sample a random point in the target frame to serve as the crop center; it is then warped into the earlier frame using the flow, plus random jitter, to obtain the paired center. A crop is made around each center, with an area sampled as a fraction of the global crop, yielding subcrops $s_1$ and $s_2$.

As we require both crop centers to land within the bounds of the image, subcrops tend to be center-biased (Peng et al., 2022) and lack diversity. To remedy this, we employ a grid-sampling procedure for selecting the initial crop center: each global crop is divided into a regular grid of cells, cells are selected without replacement, and a center is then uniformly sampled within each selected cell. A sketch of this procedure follows.
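The sketch below illustrates the flow-informed, grid-based subcrop pairing under stated assumptions: the flow maps points from the target frame to the earlier frame, and the grid size, jitter magnitude, and area range are placeholder values.

```python
import numpy as np

def sample_subcrop_pairs(flow, img_hw, n_crops=3, grid_cells=4,
                         area_range=(0.05, 0.20), jitter_frac=0.05, rng=None):
    """Sample paired subcrop centers: grid-sampled in the target frame, then
    warped into the earlier frame with optical flow plus random jitter.

    flow: (H, W, 2) array of (dy, dx) displacements from target to earlier frame (assumed ordering).
    Returns a list of ((cy_t, cx_t, side), (cy_s, cx_s, side)) center/size pairs.
    """
    rng = rng or np.random.default_rng()
    H, W = img_hw
    cell_h, cell_w = H // grid_cells, W // grid_cells
    # Choose grid cells without replacement so centers cover the frame.
    cells = rng.choice(grid_cells * grid_cells, size=n_crops, replace=False)
    pairs = []
    for c in cells:
        cy, cx = divmod(int(c), grid_cells)
        # Uniform center within the chosen cell of the target frame.
        y_t = cy * cell_h + int(rng.integers(cell_h))
        x_t = cx * cell_w + int(rng.integers(cell_w))
        # Warp the center into the earlier frame and add random jitter.
        dy, dx = flow[y_t, x_t]
        y_s = int(np.clip(y_t + dy + rng.normal(0, jitter_frac * H), 0, H - 1))
        x_s = int(np.clip(x_t + dx + rng.normal(0, jitter_frac * W), 0, W - 1))
        # Shared square crop size, drawn as a fraction of the global crop area.
        side = int(np.sqrt(rng.uniform(*area_range) * H * W))
        pairs.append(((y_t, x_t, side), (y_s, x_s, side)))
    return pairs
```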
After subcrop pairs are generated, they are encoded by the backbone and the pooled-objective projector. Unlike the dense objective, no alignment or upsampling is performed; instead, each projection is average-pooled over its spatial dimensions before computing the loss

$$\mathcal{L}_{\text{pool}} = \left\lVert q_\theta\!\left(\mathrm{pool}\big(h_\theta(s_1)\big)\right) - \mathrm{pool}\big(h_\xi(s_2)\big) \right\rVert_2^2, \tag{2}$$

where $\mathrm{pool}(\cdot)$ denotes average pooling over spatial dimensions followed by normalization. Our objective has each subcrop predict its corresponding pair, which contains the same object in a different frame. This differs from multi-crop (Caron et al., 2020), where local crops predict global crops; that formulation would be less effective for dense scenes because local crops only capture a subset of the objects in a frame.
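A minimal sketch of Eq. 2, assuming `h_online` and `h_target` are the online and EMA encoder+projector branches and `q_online` is the pooled predictor (names are illustrative); normalization follows the BYOL-style ordering.

```python
import torch.nn.functional as F

def pooled_loss(h_online, h_target, q_online, subcrop_a, subcrop_b):
    """BYOL-style regression between average-pooled subcrop projections (Eq. 2)."""
    z_a = h_online(subcrop_a).mean(dim=(2, 3))           # average pool over spatial dims
    z_b = h_target(subcrop_b).detach().mean(dim=(2, 3))  # stop-gradient EMA target
    p_a = F.normalize(q_online(z_a), dim=1)
    z_b = F.normalize(z_b, dim=1)
    return ((p_a - z_b) ** 2).sum(dim=1).mean()
```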
Spatial Decoder Module (SDM).
We introduce the SDM (Figure 2(b)) to upsample high-level encoder features and preserve information from lower layers, particularly smaller foreground objects that may be lost during pooling operations. Its design draws inspiration from the convolutional UNet (Ronneberger et al., 2015) and FPN (Lin et al., 2017), and it improves upon FlowE's use of dilated convolutions in place of pooling: the SDM maintains high-resolution representations while reducing activation size and memory usage.
The SDM consists of a stack of decoder blocks, each comprising an upsample operation, a computation block of processing layers $\phi_i$, and a UNet-like lateral connection. The output of the $i$-th block is computed as

$$d_i = \phi_i\!\left(\mathrm{up}(d_{i+1}) + \mathrm{lat}_i(e_i)\right), \tag{3}$$

where $d_{i+1}$ is the representation entering the block (the encoder output for the first block), $\mathrm{up}(\cdot)$ is the upsampling operation, $\mathrm{lat}_i$ is the lateral connection, and $e_i$ is an earlier encoder feature map with the same spatial dimensions as $\mathrm{up}(d_{i+1})$. The use of computation blocks and lateral connections is ablated in Table 4. Figure 3 contrasts a naive implementation that places both objectives at the top encoder level with PooDLe, which uses the SDM to integrate the two objectives in a complementary fashion.
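A hedged sketch of one decoder block implementing Eq. 3, using torchvision's ResNet Bottleneck as the computation block; the 2x upsampling factor and the example channel widths are assumptions, not confirmed values.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models.resnet import Bottleneck

class SDMBlock(nn.Module):
    """One spatial-decoder block: upsample, add a lateral projection of an earlier
    encoder feature map, then apply a Bottleneck computation block (Eq. 3)."""

    def __init__(self, in_ch, lateral_ch, out_ch):
        super().__init__()
        # Lateral connection: 1x1 conv projecting the encoder skip to the decoder width.
        self.lateral = nn.Conv2d(lateral_ch, in_ch, kernel_size=1)
        # Bottleneck expands planes by 4, so planes = out_ch // 4; the downsample
        # branch matches the residual identity to the new channel count.
        self.block = Bottleneck(in_ch, out_ch // 4,
                                downsample=nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = x + self.lateral(skip)  # UNet-like lateral connection
        return self.block(x)

# Example (hypothetical channel sizes for a ResNet-50): SDMBlock(2048, 1024, 1024)
# upsamples the stage-4 output and merges it with the stage-3 feature map.
```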
4 Experiments
We pretrain PooDLe on raw videos from BDD100K (Yu et al., 2020) and Walking Tours (WT) (Venkataramanan et al., 2024) and evaluate the resulting models on semantic segmentation and object detection benchmarks. The BDD100K-pretrained model is evaluated on in-distribution tasks as well as Cityscapes (Cordts et al., 2016), and the Walking Tours model on ADE20K (Zhou et al., 2017) and our newly proposed Walking Tours Semantic benchmark. We also ablate our combination of loss functions and decoder components, as well as the effects of crop area and input resolution.
4.1 Experiment Setup
Pretraining datasets.
1) BDD (Yu et al., 2020) consists of 100,000 dashcam driving videos collected in various weather conditions and times of day in New York and the San Francisco Bay Area. Each video is ~40 seconds long at 720p and 30 fps. We pretrain with the 70,000 videos in the training split and evaluate on the dataset's semantic segmentation and object detection tasks. 2) Walking Tours (WT) (Venkataramanan et al., 2024) is a dataset of first-person YouTube videos, each a continuous walkaround through a city in Europe or Asia or a wildlife safari. There are 10 videos, ranging from 59 minutes to 2 hours 55 minutes, at 720p and 30 fps. Each video contains a large number of unique objects per frame and natural transitions in lighting and location. We use either the Venice video (WT-Venice) or all 10 videos (WT-all), following DoRA (Venkataramanan et al., 2024).
Technical details.
We use ResNet-50 (R50) (He et al., 2016a) as our feature encoder, with the dense projector and predictor networks following FlowE (Xiong et al., 2021) and their pooled counterparts following BYOL (Grill et al., 2020). For the SDM, we use two decoder stages, each consisting of an upsample operation, a ResNet Bottleneck block (He et al., 2016a), and a 2-layer convolutional MLP for the lateral connection. When training on BDD, we sample two frames a fixed number of seconds apart from each video. We then take two large crops from the same image coordinates, covering a fixed fraction of the original frame, and resize them before applying augmentations. For each training epoch on WT, we divide each video into 10-second clips, randomly sample two frames from each clip with the same temporal stride, and use a separate crop-area range. For both datasets, we apply color distortion and Gaussian blurring independently to each frame following BYOL (Grill et al., 2020). For the dense objective, we also apply random reversible affine transformations similar to FlowE (Xiong et al., 2021): random scaling of 0.9–1.1 and rotation of -10 to 10 degrees. For the pooled objective, we sample subcrop pairs covering a small fraction of the initial crop area and resize them to a fixed resolution for both BDD and WT. Random spatial jitter is applied to subcrop centers as a fraction of the initial crops' height and width. A sketch of the augmentation pipeline follows.
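As a rough illustration, the following sketch applies BYOL-style photometric augmentations independently to each frame and a reversible affine transform for the dense objective; the color-jitter and blur parameters are placeholders borrowed from BYOL's recipe, and the helper names are not from the released code.

```python
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

# Photometric augmentations applied independently to each frame.
photometric = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
])

def random_affine_params():
    """Reversible affine for the dense objective: scale 0.9-1.1, rotation -10 to 10 degrees."""
    return {"angle": random.uniform(-10, 10), "scale": random.uniform(0.9, 1.1),
            "translate": [0, 0], "shear": [0.0, 0.0]}

def augment_pair(frame_t, frame_tk):
    """Return augmented frames and their affine parameters (kept so the transform
    can later be inverted on the feature maps for the dense loss)."""
    x1, x2 = photometric(frame_t), photometric(frame_tk)
    p1, p2 = random_affine_params(), random_affine_params()
    return (TF.affine(x1, **p1), p1), (TF.affine(x2, **p2), p2)
```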
Baselines.
We use official implementations of DenseCL, PixPro, DINO, iBOT, and DoRA, and our own implementation of FlowE, for pretraining on BDD. We use torchvision weights for supervised ImageNet (IN1K) and weights released online for ImageNet-pretrained DINO. We obtain weights from the authors of DoRA for iBOT, DINO-ViT, and DoRA pretrained on WT, and use official implementations of DINO-R50, MAE, and PixPro for pretraining on WT. For PixPro, we use either its FPN decoder or high-resolution crops for pretraining and report results from the better-performing setting. We train all R50 baselines with matched crop settings for fair comparison.
Evaluation.
We adopt the evaluation protocol from FlowE (Xiong et al., 2021) for BDD and Cityscapes. We use DeepLab v1 (Chen et al., 2018) as the “linear” readout head and UperNet (Xiao et al., 2018) as the heavier readout head for semantic segmentation, and Faster R-CNN with ResNet-C4 and Faster R-CNN with FPN (He et al., 2016b) as the standard and heavier readout heads for object detection. We do not include ViT object detection due to the lack of an established recipe. For semantic segmentation on ADE20K, we perform both linear readout following the BDD setup and UperNet finetuning as described in iBOT (Zhou et al., 2021). We retain the SDM when evaluating PooDLe on semantic segmentation with linear readout but discard it when using UperNet. We report mean intersection-over-union (mIoU), pixel-level accuracy (Acc), and mean average precision (mAP) as our evaluation metrics. Additional details on implementation and hyperparameters are provided in Appendix A.
4.2 Main Results
Method | Arch | Ep. | Pretrain | BDD Lin. mIoU | BDD Lin. Acc | BDD UperNet mIoU | BDD UperNet Acc | BDD Det C4 mAP | BDD Det FPN mAP | Cityscapes Lin. mIoU | Cityscapes Lin. Acc | Cityscapes UperNet mIoU | Cityscapes UperNet Acc
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Scratch | R50 | - | - | 9.7 | 55.0 | 26.1 | 81.2 | 0.0 | 7.7 | 9.8 | 58.0 | 30.7 | 84.1 |
DINO (Caron et al., 2021) | ViT-S | 300 | BDD | 29.6 | 86.8 | 41.1 | 90.1 | - | - | 35.1 | 87.9 | 51.5 | 91.9 |
iBOT (Zhou et al., 2021) | ViT-S | 800 | BDD | 27.2 | 85.4 | 35.5 | 88.7 | - | - | 32.0 | 86.2 | 44.0 | 90.3 |
DoRA (Venkataramanan et al., 2024) | ViT-S | 200 | BDD | 33.2 | 88.1 | 43.3 | 90.7 | - | - | 37.4 | 88.7 | 50.8 | 92.0 |
DINO (Caron et al., 2021) | R50 | 100 | BDD | 13.1 | 64.7 | 25.6 | 80.3 | 0.3 | 11.9 | 14.9 | 69.4 | 29.2 | 81.4 |
PixPro (Xie et al., 2021) | R50 | 100 | BDD | 21.8 | 80.0 | 37.3 | 88.0 | 0.7 | 18.4 | 25.5 | 81.0 | 44.3 | 89.5 |
DenseCL (Wang et al., 2021b) | R50 | 100 | BDD | 24.2 | 84.9 | 41.8 | 90.0 | 0.7 | 20.3 | 26.6 | 85.6 | 53.2 | 91.9 |
FlowE (Xiong et al., 2021) | R50 | 100 | BDD | 35.7 | 88.5 | 47.3 | 91.5 | 3.2 | 23.8 | 43.1 | 89.5 | 57.7 | 93.1 |
PooDLe | R50 | 100 | BDD | 39.2 | 89.2 | 49.9 | 91.8 | 4.9 | 25.2 | 47.2 | 90.2 | 60.7 | 93.5 |
Supervised | R50 | 600 | IN1K | 36.7 | 84.7 | 55.2 | 92.0 | 3.6 | 24.9 | 46.8 | 87.4 | 63.4 | 93.7 |
PooDLe | R50 | 100 | BDD* | 44.7 | 90.7 | 54.1 | 92.7 | 3.9 | 28.0 | 52.0 | 91.5 | 65.1 | 94.4 |

BDD100K-pretrained models.
We report results for semantic segmentation and object detection on the BDD100K benchmark in Table 1. PooDLe achieves superior performance on all readout tasks compared to prior methods, outperforming the strongest baseline, FlowE, by 3.5 mIoU on linear readout and 2.6 mIoU on UperNet for semantic segmentation, and by 1.7 mAP on C4 and 1.4 mAP on FPN for object detection. We find that PooDLe's improved performance (Table 3) is attributable to better recognition of small and rare object classes. We also evaluate the transfer of PooDLe representations to new tasks on the Cityscapes benchmark, where PooDLe outperforms all baselines. Figure 4 shows predicted segmentation masks, and Figures 14 and 15 show additional evaluation results.
PooDLe also outperforms supervised IN1K pretraining, despite the latter's advantage in learning the small and rare classes present in BDD100K (spatial imbalance shown in Figure 1), owing to ImageNet being a class-balanced dataset with iconic views of objects. In addition, we pretrain PooDLe on BDD100K with weights initialized from the supervised IN1K checkpoint, improving linear semantic segmentation by 8.0 mIoU and 6.0 Acc over the initialization weights. In Appendix G, we show PooDLe remains competitive against IN1K-pretrained baselines despite being trained in the challenging naturalistic video setting.
Method | Arch | Epoch | Pretrain | ADE20K Lin. mIoU | ADE20K Lin. Acc | ADE20K Finetune mIoU | ADE20K Finetune Acc | WT-Sem Lin. mIoU | WT-Sem Lin. Acc | WT-Sem Finetune mIoU | WT-Sem Finetune Acc
---|---|---|---|---|---|---|---|---|---|---|---
DINO (Caron et al., 2021) | R50 | 800 | IN1K | 15.7 | 61.5 | 43.0 | 80.5 | 8.8 | 76.7 | 17.8 | 87.6 |
DINO (Caron et al., 2021) | ViT-S | 100 | IN1K | - | - | 33.9 | - | - | - | - | - |
DINO (Caron et al., 2021) | ViT-S | 100 | WT-Venice | 7.8 | 57.7 | 29.2 | 74.7 | 4.6 | 73.7 | 11.0 | 83.0 |
iBOT (Zhou et al., 2021) | ViT-S | 100 | WT-Venice | - | - | 33.9 | - | - | - | - | - |
MAE (He et al., 2022) | ViT-S | 100 | WT-Venice | 7.4 | 55.1 | 24.1 | 71.4 | 4.3 | 72.6 | 8.9 | 81.5 |
DoRA (Venkataramanan et al., 2024) | ViT-S | 100 | WT-Venice | 14.1 | 63.5 | 35.2 | 77.7 | 6.2 | 76.9 | 13.6 | 85.7 |
DINO (Caron et al., 2021) | R50 | 100 | WT-Venice | 6.9 | 48.2 | 35.7 | 77.4 | 4.2 | 69.0 | 12.3 | 84.7 |
PixPro (Xie et al., 2021) | R50 | 100 | WT-Venice | 4.6 | 48.6 | 36.0 | 77.6 | 3.7 | 69.3 | 11.5 | 84.2 |
PooDLe | R50 | 20 | WT-Venice | 14.6 | 59.0 | 36.6 | 77.9 | 6.4 | 75.7 | 13.7 | 85.4 |
DINO (Caron et al., 2021) | ViT-S | 100 | WT-all | - | - | 34.1 | - | - | - | - | - |
MAE (He et al., 2022) | ViT-S | 100 | WT-all | 10.6 | 60.4 | 31.4 | 75.9 | 6.6 | 77.7 | 12.7 | 85.2 |
DoRA (Venkataramanan et al., 2024) | ViT-S | 100 | WT-all | 13.9 | 64.4 | 38.3 | 79.3 | 7.8 | 79.4 | 15.9 | 87.5 |
PooDLe | R50 | 20 | WT-all | 16.5 | 63.9 | 41.0 | 79.6 | 11.2 | 81.3 | 17.0 | 86.9 |
Method | Pretrain | All | Small | Large | Rare | Common
---|---|---|---|---|---|---
DINO | BDD | 29.6 | 8.4 | 42.0 | 1.0 | 42.8 |
DenseCL | BDD | 24.2 | 1.6 | 37.4 | 0.0 | 35.4 |
DoRA | BDD | 33.2 | 11.9 | 45.6 | 2.8 | 47.3 |
FlowE | BDD | 35.7 | 12.2 | 49.3 | 10.7 | 47.2 |
PooDLe | BDD | 39.2 | 18.3 | 51.4 | 12.0 | 51.8 |
Supervised | IN1K | 36.7 | 27.2 | 42.2 | 16.1 | 46.2 |
PooDLe | BDD* | 44.7 | 25.2 | 56.1 | 17.9 | 57.1 |
WT-pretrained models.
We also train PooDLe on WT-Venice and WT-all. Table 2 shows results on ADE20K (Zhou et al., 2017) semantic segmentation using linear readout and finetuning, following (Venkataramanan et al., 2024). Notably, when pretrained on WT-all, PooDLe obtains 2.6 higher mIoU than DoRA on ADE20K linear readout and 2.7 higher mIoU on UperNet finetuning. PooDLe also performs better on WT-Venice, with gains of 1.4 mIoU over DoRA and 0.6 mIoU over PixPro on ADE20K UperNet finetuning. We note that PooDLe uses a smaller ResNet-50 backbone and is trained for fewer epochs than DoRA, the strongest baseline. Despite these differences, these results show that PooDLe learns strong representations from naturalistic video captured in the open world. Figure 16 shows predicted segmentation masks for ADE20K.
Walking Tours Semantic benchmark.
While ADE20K is a challenging benchmark, it contains a mixture of indoor and outdoor scenes that can be out-of-distribution relative to Walking Tours. We therefore introduce Walking Tours Semantic (WT-Sem) as a more in-distribution benchmark to accompany the WT dataset. When pretrained on WT-all, PooDLe outperforms DoRA (Venkataramanan et al., 2024) by 3.4 and 1.1 mIoU on linear readout and UperNet finetuning, respectively. To generate the dataset, we use OpenSeeD (Zhang et al., 2023a), a strong open-vocabulary segmentation model, to generate semantic segmentation masks for all videos in WT-all as well as 3 new walkaround videos. We use the Swin-L (Liu et al., 2021) variant of OpenSeeD finetuned on ADE20K semantic segmentation, with a vocabulary of the 150 ADE20K class labels, to generate the masks. See Appendix D for visualizations and details of WT-Sem.
Variant | Dense | Pool | Top-Down | Lateral | Flow | All | Small | Large | Rare | Common
---|---|---|---|---|---|---|---|---|---|---
1 FlowE | ✓ | | | | RAFT | 28.8 | 8.7 | 40.5 | 1.8 | 29.2
2 | ✓ | ✓ | | | RAFT | 28.9 | 7.2 | 41.6 | 2.2 | 28.7
3 | ✓ | ✓ | ✓ | | RAFT | 30.3 | 6.8 | 44.0 | 4.3 | 30.2
4 | ✓ | ✓ | | ✓ | RAFT | 30.3 | 10.9 | 41.7 | 2.4 | 31.1
5 | ✓ | | ✓ | ✓ | RAFT | 31.8 | 12.8 | 42.8 | 8.3 | 31.7
6 PooDLe | ✓ | ✓ | ✓ | ✓ | UFlow | 33.7 | 14.1 | 45.1 | 8.9 | 33.8
7 PooDLe | ✓ | ✓ | ✓ | ✓ | RAFT | 34.2 | 15.0 | 45.5 | 9.0 | 34.5
Subcrop area | Small classes | Large classes | All classes
---|---|---|---
0.04 - 0.18 | 16.0 | 41.6 | 34.6 |
0.18 - 0.36 | 14.2 | 41.2 | 33.7 |
0.36 - 0.54 | 13.6 | 39.5 | 32.4 |
0.54 - 1.00 | 12.3 | 38.8 | 31.5 |
Class-based performance and IN1K initialization.
Naturalistic videos have imbalanced class representation and object sizes (Figure 1(c)): for example, in BDD100K, “road” occupies over 20% of pixels on average while “bicycle” occupies less than 1% (see Table 10). Capturing information on these underrepresented classes is very challenging. To further analyze this phenomenon, we categorize BDD classes as “small” or “large” based on the average fraction of pixels they occupy per image, and separately as “rare” or “common” based on the fraction of images in which they appear. Table 3 shows linear readout mIoU for these class groupings, highlighting the impact of class and spatial imbalance. Full class-level statistics and designations are in Appendix H. We observe that FlowE performs well on large and common classes due to its dense loss, but struggles on small and rare classes. Meanwhile, supervised IN1K, benefiting from balanced pretraining data, effectively learns about smaller classes. PooDLe, with its unified objectives and spatial decoder module, significantly outperforms other BDD-pretrained models across all class groupings, particularly on small and rare classes. PooDLe initialized from supervised IN1K weights significantly improves upon supervised IN1K on large classes, from 42.2 to 56.1 mIoU, due to the dense objective, while remaining competitive with supervised IN1K on small classes.
4.3 Ablation studies
Table 4 shows ablation experiments testing each of our contributions, beginning from FlowE. Models trained without the decoder use dilated convolutions in place of pooling operations, as in FlowE (Xiong et al., 2021). Figure 3 shows how the dense and pooled objectives are composed with and without the decoder. For the ablations, models are trained for 40 epochs on BDD with a reduced resolution and crop area for the initial crops (see Appendix B), and we evaluate on BDD semantic segmentation using linear readout.
We observe that adding the pooled loss alone has little benefit (row 2), and including either the decoder as a spatial upsampler (row 3) or only the UNet-style lateral connections (row 4) also does not yield much benefit. Row 5 achieves +3.0 mIoU over FlowE, showing that the top-down decoder is only effective when combined with the lateral connections in the full SDM; this suggests that preserving high-resolution information and including some capacity for feature processing are both important. When the pooled loss is re-added on top of the decoder with lateral connections (row 7), we see a substantial +5.4 mIoU improvement over FlowE. While the dense objective benefits from the full SDM, it has an even greater synergistic effect with the pooled loss. This may be because the pooled objective with subcrops can effectively learn about small objects, while the full decoder helps propagate those semantic representations through to the dense loss.


We also demonstrate that PooDLe performs well even with self-supervised flow. We train a UFlow (Jonschkowski et al., 2020) model on KITTI (Geiger et al., 2013) and finetune it on BDD, resulting in only a 0.5 mIoU drop compared to pretraining with RAFT flow (Table 4, row 6 vs. row 7). See Appendices A and J for details and visualizations.
4.4 Spatial and temporal cropping in self-supervised video learning
In this section, we study the effect of frame intervals and image cropping parameters used in data augmentation. Without a 1:1 image-to-concept relationship as in iconic data, the visible area of each frame can greatly affect representation learning. To study this, we perform 4 experiments varying: (1) subcrop area, (2) global crop area, (3) number of subcrops, and (4) temporal stride between paired frames. Crop and subcrop area refer to the fraction of the frame taken during random-resized cropping. Figure 6 depicts how crops transition from global to pseudo-iconic with decreasing crop area. The training recipe follows the ablations in Section 4.3, and results should be compared to row 7 in Table 4.


Varying subcrop area.
First, we study how subcrop area affects the learned representations. We train 4 PooDLes using different subcrop area ranges and a fixed global crop area range of [0.125, 0.25] at a fixed input resolution. Results are shown in Table 5 for all classes, as well as the small and large class subgroupings. We observe that larger subcrop areas result in worse performance, with a larger relative drop for smaller classes. This is likely because larger crops contain multiple smaller objects, which breaks the pseudo-iconic assumption and produces false invariances.
Varying global crop area.
Next, we vary the global crop area taken from the raw video frame along with the input resolution to study their effects on self-supervised pretraining. We select three input resolutions and sample the crop area from a truncated Gaussian with varying mean; the two smaller resolutions are also trained with smaller crop areas for higher pixel granularity. Our results in Figure 5 show that larger crop areas and higher input resolutions, together, are important for maximizing performance. The highest-resolution model produces the best results and peaks at the largest crop area, while the other two peak at smaller crop areas; it also degrades more slowly in performance as crop area increases.
Varying number of subcrops.
We also study how varying the number of subcrops affects performance. We train 4 PooDLes using different numbers of subcrops on BDD100K and evaluate using linear readout, with results shown in Figure 7. Using 3 subcrops gives a large initial performance jump, and additional subcrops provide more modest gains. We therefore choose a small number of subcrops as our default to balance performance and computational efficiency.
Effect of temporal stride during frame sampling.
We study the effect of temporal stride by training PooDLe with a range of strides on BDD100K, evaluated using linear readout (Figure 9). Performance peaks at an intermediate stride, degrades only slightly at neighboring values, and drops further at the extremes. When the stride is small, there is limited variance in object appearance, diminishing the value of video data; when it is too large, correspondence between frames decreases and optical flow becomes unreliable. Note that for the smallest stride, we add jitter to the initial large crop by a fraction of the image size. Figure 8 shows frame sequences from 2 different videos, highlighting the high variability of motion in BDD100K.
4.5 Subcrops as pseudo-iconic training images
To understand the impact of pseudo-iconic subcrops on PooDLe's performance, particularly on small objects, we analyze their effect on object prevalence in the pooled objective. Using a simulated circular object (Figure 10(a)), we calculate the probability that a subcrop captures it as a pseudo-iconic view, i.e., a subcrop “hit”, and compare this to the probability of a pixel landing on the object, emulating a pixel-level dense SSL objective. We count a subcrop hit if the subcrop reaches at least 5% object coverage, which we justify on the grounds that background classes generally have little visual variation and consequently minimal impact on the pooled representations. We extend this analysis to the BDD100K semantic segmentation dataset, empirically simulating subcrops and computing subcrop hit and pixel probabilities for varying object sizes. The simulation results, illustrated in Figure 10(b), show a greater relative difference between subcrop hit and pixel probabilities for smaller objects, indicating that pseudo-iconic subcrops increase their prevalence in the pooled objective. This likely contributes to PooDLe's improved performance on small object classes compared to dense SSL methods. Further details of the analysis can be found in Appendix C.
5 Conclusion
Self-supervised learning on naturalistic videos presents many unsolved challenges, especially due to the presence of high-resolution, crowded, multi-object scenes with severe spatial imbalance. Iconic methods rely on single-subject images, and dense methods struggle with the scale imbalance of objects. We propose PooDLe, which combines a pooled region-invariance learning objective and a dense flow-equivariance learning objective in a unified framework. PooDLe achieves state-of-the-art performance on downstream semantic segmentation and object detection evaluations compared to prior methods pretrained on the same video datasets, particularly on recognizing small objects. Our study of the effects of crop area, input resolution, and temporal stride also offers key insights into design choices for video self-supervised learning.
Acknowledgements
We thank Jenny Zhu for her assistance in generating semantic segmentation labels for the WT-Sem dataset. The work is supported in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under grant RS-2024-00469482, funded by the Ministry of Science and ICT (MSIT) of the Republic of Korea in connection with the Global AI Frontier Lab International Collaborative Research. AW is supported by the NSERC PGS-D Scholarship. CH is supported by the DoD NDSEG Fellowship. The compute is supported in part through the Microsoft Accelerating Foundation Model Research (AFMR) program, a Google Cloud Platform (GCP) award, and NYU IT’s High Performance Computing resources, services, and staff expertise.
References
- Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision.
- Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- V-JEPA: latent video prediction for visual representation learning. arXiv preprint arXiv:2404.08471.
- VICReg: variance-invariance-covariance regularization for self-supervised learning. In International Conference on Learning Representations.
- MC-JEPA: a joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv preprint arXiv:2307.12698.
- Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems.
- Emerging properties in self-supervised vision transformers. In IEEE/CVF International Conference on Computer Vision.
- MultiSiam: self-supervised multi-instance siamese representation learning for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning.
- Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Masked autoencoders as spatiotemporal learners. In Advances in Neural Information Processing Systems.
- Vision meets robotics: the KITTI dataset. International Journal of Robotics Research.
- Watching the world go by: representation learning from unlabeled videos. arXiv preprint arXiv:2003.07990.
- Bootstrap your own latent: a new approach to self-supervised learning. In Advances in Neural Information Processing Systems.
- Multi-level contrastive learning for dense prediction task. arXiv preprint arXiv:2304.02010.
- Siamese masked autoencoders. In Advances in Neural Information Processing Systems, Vol. 36, pp. 40676–40693.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Efficient visual pretraining with contrastive detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Space-time correspondence as a contrastive random walk. In Advances in Neural Information Processing Systems.
- What matters in unsupervised optical flow. In European Conference on Computer Vision.
- Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Microsoft COCO: common objects in context. In European Conference on Computer Vision.
- DDFlow: learning optical flow with unlabeled data distillation. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Swin Transformer: hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- DINOv2: learning robust visual features without supervision. Transactions on Machine Learning Research.
- Self-supervised video pretraining yields robust and more human-aligned visual representations. In Advances in Neural Information Processing Systems.
- Crafting better contrastive views for siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
- U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention.
- CASTing your model: learning to localize improves self-supervised representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Objects365: a large-scale, high-quality dataset for object detection. In IEEE/CVF International Conference on Computer Vision.
- RAFT: recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision.
- VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems.
- Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. In International Conference on Learning Representations.
- Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
- Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- CroCo: self-supervised pre-training for 3D vision tasks by cross-view completion. In Advances in Neural Information Processing Systems.
- Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision.
- Propagate yourself: exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Self-supervised representation learning from flow equivariance. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- BDD100K: a diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- Barlow Twins: self-supervised learning via redundancy reduction. In International Conference on Machine Learning.
- A simple framework for open-vocabulary segmentation and detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
- Patch-level contrasting without patch correspondence for accurate and dense contrastive representation learning. In International Conference on Learning Representations.
- Scene parsing through ADE20K dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- iBOT: image BERT pre-training with online tokenizer. In International Conference on Learning Representations.
- Self-supervised learning of object parts for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Appendix
Appendix A Implementation details
Backbone.
As discussed in the pretraining details, we use a ResNet-50 (He et al., 2016a) as our backbone architecture. Each projector is a non-linear, 2-layer MLP (linear layers for the pooled objective, 1x1 convolutions for the dense objective) that projects from a wider hidden dimension down to a lower output dimension; each predictor has the same architecture. We follow BYOL (Grill et al., 2020) with an EMA momentum that starts high and is annealed toward 1 throughout training.
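A minimal sketch of the projector/predictor heads and the EMA update; the hidden and output widths shown are placeholders, since the exact dimensions are omitted above.

```python
import copy
import torch
import torch.nn as nn

def mlp_head(in_dim, hidden_dim=4096, out_dim=256, conv=False):
    """2-layer projector/predictor: 1x1 convolutions for the dense branch,
    linear layers for the pooled branch (widths are placeholder values)."""
    if conv:
        return nn.Sequential(
            nn.Conv2d(in_dim, hidden_dim, 1), nn.BatchNorm2d(hidden_dim), nn.ReLU(inplace=True),
            nn.Conv2d(hidden_dim, out_dim, 1))
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim))

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, momentum: float):
    """Update the offline (EMA) branch from the online weights."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(momentum).add_(p_o.data, alpha=1.0 - momentum)

# The target branch is a gradient-free copy of the online branch, updated only via EMA:
# target_net = copy.deepcopy(online_net).requires_grad_(False)
```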
Decoder details.
Each decoder block uses a single Bottleneck block from the ResNet architecture with a reduction in the number of channels. The decoder upsamples the feature map at the start of each block, and the lateral connection is a single linear convolutional layer that up-projects the incoming encoder latent to match the decoder channels of that block. As mentioned, 2 decoder blocks are used, so the encoder output is upsampled twice in total.
Supervised and self-supervised flow prediction.
Flow is predicted using an off-the-shelf supervised RAFT model or an unsupervised UFlow (Jonschkowski et al., 2020) model that we train ourselves. For unsupervised training, we exactly follow UFlow and train on the KITTI (Geiger et al., 2013) dataset before finetuning on BDD100K (Yu et al., 2020) for 100,000 steps on daytime-only videos. The training resolution was set to match the inference setting. KITTI uses adjacent frames (10 Hz video), while BDD frames were sampled with a temporal stride of 10 frames (30 Hz video).
Local cropping details.
Paired local crops are sampled using the procedure described in Section 3. Cropping is performed using RandomResizedCrop with a fixed output resolution. Jitter is 10% of the input image size, and the standard aspect ratio range of (3/4, 4/3) is used.
Loss details.
We sum our two loss functions directly with equal weight. The loss computation and warping function are applied to representations after reversing the affine transform and resizing to the input image resolution; this takes full advantage of high-resolution flow, as in FlowE (Xiong et al., 2021). We also use flow-based occlusion masking to avoid aligning occluded regions that have no correspondence, using the same occlusion formulation as DDFlow (Liu et al., 2019). We additionally mask out regions that are not visible after the affine transformations.
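The occlusion mask can be estimated with a forward-backward consistency check in the spirit of DDFlow; the sketch below uses placeholder thresholds `alpha1` and `alpha2`, since the exact parameter values are omitted above.

```python
import torch
import torch.nn.functional as F

def occlusion_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Forward-backward consistency check; returns 1 for visible pixels.

    flow_fw: (B, 2, H, W) flow from frame 1 to frame 2.
    flow_bw: (B, 2, H, W) flow from frame 2 to frame 1.
    alpha1/alpha2 are placeholder thresholds.
    """
    # Warp the backward flow into frame 1 using the forward flow.
    B, _, H, W = flow_fw.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=flow_fw.device),
                            torch.arange(W, device=flow_fw.device), indexing="ij")
    grid_x = (xs + flow_fw[:, 0]) / (W - 1) * 2 - 1
    grid_y = (ys + flow_fw[:, 1]) / (H - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1).float()
    bw_warped = F.grid_sample(flow_bw, grid, align_corners=True)

    # A pixel is occluded if the forward and (warped) backward flows do not cancel out.
    sq_diff = (flow_fw + bw_warped).pow(2).sum(dim=1)
    sq_mag = flow_fw.pow(2).sum(dim=1) + bw_warped.pow(2).sum(dim=1)
    return (sq_diff < alpha1 * sq_mag + alpha2).float()
```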
Our loss is symmetrized: we swap the roles of the two frames so that each is encoded by the online weights and contributes to optimization at each training step.
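A minimal sketch of the symmetrized loss, assuming `loss_one_direction` runs its first argument through the online branch and its second through the EMA branch.

```python
def symmetric_loss(loss_one_direction, frame_a, frame_b, flow_ab, flow_ba):
    """Average the loss over both directions so each frame is encoded by the online weights."""
    return 0.5 * (loss_one_direction(frame_a, frame_b, flow_ab)
                  + loss_one_direction(frame_b, frame_a, flow_ba))
```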
Optimization details.
AdamW is used as the optimizer with weight decay. The learning rate is set for training on 32 GPUs with several image pairs per GPU. Cosine learning rate decay is used with a schedule set for 300 epochs, despite earlier termination due to compute limitations, and LR warmup is used for the first 2 training epochs. Full float32 precision is used during training.
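A hedged sketch of the optimization setup (AdamW with linear warmup followed by cosine decay); the learning rate and weight decay below are placeholders, since the exact values are omitted above.

```python
import math
import torch

def build_optimizer_and_scheduler(model, steps_per_epoch, total_epochs=300,
                                  warmup_epochs=2, base_lr=1e-3, weight_decay=0.05):
    """AdamW with linear warmup followed by cosine decay (placeholder hyperparameters)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```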
Evaluation settings.
For all BDD and Cityscapes semantic segmentation and object detection readout tasks, we follow the setup described in FlowE (Xiong et al., 2021) for ResNet-based methods. For ViT-based methods, we adopt those settings but use AdamW as the optimizer and a square crop size rather than the usual rectangular one, to accommodate the square aspect ratio used in ViT pretraining, following the semantic segmentation linear readout setup described in iBOT (Zhou et al., 2021). In addition, ViT-based methods require sliding-window inference to achieve performance competitive with convolution-based methods.
For ADE20K and WT-Sem linear readout, we simply use the respective BDD linear readout settings for ResNet and ViT methods. For ADE20K and WT-Sem UperNet finetuning, we follow the procedure described in iBOT (Zhou et al., 2021) except we use a batch size of 4 for WT-Sem finetuning.
Ablation data sampling.
For all ablation experiments, we employ repeated sampling as in MAE-st (Feichtenhofer et al., 2022), which draws multiple frame pairs each time a video is encountered for faster data loading. Each pass through every video in the dataset therefore counts as multiple epochs.
Appendix B Compute resources
The full model is trained on 16 A100s and takes about 30h for 100 epochs on BDD100K or 18min per epoch. Walking Tours takes longer at 40min per epoch, as the number of training samples per epoch is larger.
Ablation-sized experiments were run on 2 or 4 H100/A100 GPUs for a total of 40 epochs, taking 20–40h depending on the configuration.
Appendix C Subcrop analysis
For the toy simulation of subcrops, we place a foreground object as a centered circle of varying size within a frame. We then simulate all possible subcrops over a range of areas. For each subcrop area, we compute subcrop hits, i.e., whether at least 5% of the subcrop contains the object, using numerical grid-based integration. We compute the subcrop hit probability (subcrop hits over valid subcrops), averaged across subcrop areas, as well as the pixel probability (object pixels over total image pixels). A minimal sketch of this simulation follows.
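A minimal numerical sketch of the toy simulation, assuming a square image, a centered circular object, and a 5% coverage threshold for a hit; the stride over subcrop positions approximates the grid-based integration.

```python
import numpy as np

def subcrop_hit_probability(obj_radius_frac, crop_area, img_size=256,
                            hit_thresh=0.05, stride=4):
    """Probability that a square subcrop of the given relative area covers at least
    `hit_thresh` of its pixels with a centered circular object, vs. the pixel probability."""
    H = W = img_size
    side = int(np.sqrt(crop_area * H * W))
    yy, xx = np.mgrid[0:H, 0:W]
    obj = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2) <= (obj_radius_frac * H / 2) ** 2

    hits, total = 0, 0
    for y in range(0, H - side + 1, stride):       # sweep over valid subcrop positions
        for x in range(0, W - side + 1, stride):
            coverage = obj[y:y + side, x:x + side].mean()
            hits += int(coverage >= hit_thresh)
            total += 1
    hit_prob = hits / total
    pixel_prob = obj.mean()                         # baseline: pixel-level dense objective
    return hit_prob, pixel_prob
```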
We also emulate our training procedure in our empirical simulation of subcrops. For each of the 7,000 images in the BDD100K semantic segmentation training dataset, we sample two global crops with areas drawn from the training range and, for each global crop, 4096 subcrops. We compute subcrop probability and pixel probability independently for the pixels of each foreground class: pole, traffic light, traffic sign, person, rider, car, truck, bus, train, motorcycle, bicycle. We then group the results into 10% quantile bins by object size (i.e., pixel proportion) and average the subcrop and pixel probabilities. We use a slightly different subcrop area range in the empirical simulation because our two-step global crop and subcrop procedure results in a logarithmic-like distribution of areas.

We hypothesize that PooDLe’s improvement on spatially underrepresented classes, as shown in Table 10, is due to this subcrop effect. To quantify this effect on real data, we perform a similar exercise as above on the BDD100K semantic segmentation training set. We sample subcrops following our method and assign a class label to each subcrop. If over 10% of the subcrop is a foreground class (not road, sky, building, vegetation, sidewalk, fence, terrain), then we label the subcrop as the majority foreground class. Otherwise, the majority background class label is assigned. In Figure 11, we show the relative change in class distribution when using this subcrop class assignment. Foreground classes (green) increase in occurrence while background classes (blue) decrease in frequency, besides road.
Appendix D Walking Tours Semantic benchmark
We create the WT-Sem benchmark by sampling a frame every 2 seconds from each of the 10 videos in WT-all as well as 3 new walkaround videos. The new walkaround videos are filmed in Rome, Torun, and Poznan, sourced from the same YouTube channel as WT (Venkataramanan et al., 2024) under the Creative Commons (CC-BY) license. The Swin-L variant of OpenSeeD (Zhang et al., 2023a), pretrained on COCO (Lin et al., 2014) and Objects365 (Shao et al., 2019) and finetuned on ADE20K, is used to generate semantic segmentation masks. We use the 25,910 frames sourced from WT-all as the training set and the 6,170 frames sourced from the 3 new videos as the validation set. Figure 12 shows our analysis of WT-Sem in comparison to ADE20K (Zhou et al., 2017): both datasets have long-tailed class distributions, and WT-Sem has a slightly higher number of unique classes per frame. We also visualize examples from the WT-Sem benchmark in Figure 13.

Appendix E Additional visualizations
We provide additional visualizations of results on our evaluated benchmarks: BDD100K (Yu et al., 2020) semantic segmentation (Figure 14), object detection (Figure 15) and ADE20K semantic segmentation (Figure 16). Once again, we note that PooDLe produces segmentation maps with clearer boundaries while also effectively capturing small objects.



Appendix F Accuracy values for class breakdown and ablations
Method | Pretrain | All | Small | Large | Rare | Common
---|---|---|---|---|---|---
DINO | BDD | 86.8 | 12.3 | 88.3 | 2.3 | 87.8 |
DenseCL | BDD | 84.9 | 2.0 | 86.6 | 0.0 | 86.0 |
DoRA | BDD | 88.1 | 19.3 | 89.5 | 7.2 | 89.1 |
FlowE | BDD | 88.5 | 18.2 | 89.9 | 32.0 | 89.2 |
PooDLe | BDD | 89.2 | 33.6 | 90.3 | 34.2 | 89.9 |
Supervised | IN1K | 84.7 | 36.9 | 85.3 | 23.8 | 85.1 |
PooDLe | BDD* | 90.7 | 35.6 | 91.2 | 46.9 | 91.2 |
Variant | Dense | Pool | Top-Down | Lateral | Flow | All | Small | Large | Rare | Common
---|---|---|---|---|---|---|---|---|---|---
1 FlowE | ✓ | | | | RAFT | 85.0 | 22.8 | 86.3 | 6.1 | 86.0
2 | ✓ | ✓ | | | RAFT | 86.2 | 14.2 | 87.6 | 6.3 | 87.1
3 | ✓ | ✓ | ✓ | | RAFT | 86.8 | 11.9 | 87.7 | 13.6 | 87.7
4 | ✓ | ✓ | | ✓ | RAFT | 86.6 | 22.1 | 87.9 | 16.9 | 87.5
5 | ✓ | | ✓ | ✓ | RAFT | 84.2 | 25.5 | 85.3 | 28.2 | 84.9
6 PooDLe | ✓ | ✓ | ✓ | ✓ | UFlow | 86.0 | 26.4 | 87.2 | 29.6 | 86.7
7 PooDLe | ✓ | ✓ | ✓ | ✓ | RAFT | 86.5 | 26.6 | 87.7 | 28.5 | 87.1
Appendix G Additional evaluation results
Method | Arch | Ep. | Pretrain | BDD Lin. mIoU | BDD Lin. Acc | BDD UperNet mIoU | BDD UperNet Acc | BDD Det C4 mAP | BDD Det FPN mAP | Cityscapes Lin. mIoU | Cityscapes Lin. Acc | Cityscapes UperNet mIoU | Cityscapes UperNet Acc
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Scratch | R50 | - | - | 9.7 | 55.0 | 26.1 | 81.2 | 0.0 | 7.7 | 9.8 | 58.0 | 30.7 | 84.1 |
PooDLe | R50 | 100 | BDD | 39.2 | 89.2 | 49.9 | 91.8 | 4.9 | 25.2 | 47.2 | 90.2 | 60.7 | 93.5 |
Supervised | R50 | 600 | IN1K | 36.7 | 84.7 | 55.2 | 92.0 | 3.6 | 24.9 | 46.8 | 87.4 | 63.4 | 93.7 |
BYOL (Grill et al., 2020) | R50 | 1000 | IN1K | 28.3 | - | 52.4 | - | 2.8 | 26.0 | 39.9 | - | 60.3 | - |
DenseCL (Wang et al., 2021b) | R50 | 200 | IN1K | 21.3 | 82.7 | 52.8 | 91.6 | 0.3 | 25.0 | 27.3 | 84.0 | 63.7 | 93.7 |
Supervised | ViT-S | 300 | IN1K | 41.9 | 88.5 | 50.9 | 91.4 | - | - | 46.8 | 87.4 | 63.4 | 93.7 |
DINO (Caron et al., 2021) | ViT-S | 800 | IN1K | 38.5 | 88.1 | 52.3 | 92.0 | - | - | 47.1 | 90.3 | 63.6 | 94.0 |
iBOT (Zhou et al., 2021) | ViT-S | 800 | IN1K | 44.4 | 89.6 | 54.2 | 92.2 | - | - | 52.1 | 91.5 | 65.3 | 94.3 |
PooDLe | R50 | 100 | BDD* | 44.7 | 90.7 | 54.1 | 92.7 | 3.9 | 28.0 | 52.0 | 91.5 | 65.1 | 94.4 |
We compare PooDLe against ImageNet-pretrained baselines in Table 8 and observe that PooDLe outperforms most baselines except iBOT and ImageNet supervised ViT. This result is encouraging, as pretraining on naturalistic video is more challenging due to spatial and class imbalance, yet is also a more realistic setting that enables the use of broader sets of usable data. Furthermore, we note that pretraining on class-balanced data such as ImageNet particularly benefits mIoU, which weighs all classes equally despite some classes only appearing in a tiny proportion of pixels in evaluation. Finally, PooDLe pretrained on BDD with weights initialized from the ImageNet supervised checkpoint surpasses all ImageNet-pretrained baselines on linear semantic segmentation.
Appendix H Per-class evaluation results
Method | Pretrain | Rd | Sky | Bldg | Veg | Car | Bus | Fence | Truck | Wall | S-walk | Terrain | Train | Pole | Bicycle | Person | M-cycle | Tr. Sign | Rider | Tr. Light
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DINO | BDD | 88.6 | 93.0 | 72.3 | 77.3 | 73.7 | 0.8 | 11.7 | 5.3 | 5.1 | 38.5 | 37.5 | 0 | 8.1 | 0 | 19.6 | 0 | 13.4 | 0 | 17.7 |
DenseCL | BDD | 82.0 | 88.2 | 68.3 | 72.6 | 63.0 | 0 | 1.5 | 0.5 | 0 | 17.7 | 7.2 | 0 | 0.3 | 0 | 0.1 | 0 | 1.1 | 0 | 9.4 |
DoRA | BDD | 89.9 | 93.6 | 75.4 | 79.9 | 76.6 | 5.1 | 17.9 | 11.0 | 10.8 | 44.6 | 42.7 | 0 | 13.3 | 0.7 | 25.2 | 0 | 20.5 | 0 | 23.9 |
FlowE | BDD | 90.6 | 92.9 | 75.8 | 79.6 | 80.8 | 32.9 | 23.5 | 22.7 | 15.3 | 45.7 | 32.4 | 0 | 12.9 | 11.8 | 28.7 | 4.1 | 15.9 | 0 | 12.1 |
PooDLe | BDD | 91.3 | 93.5 | 77.0 | 80.4 | 81.7 | 34.0 | 29.4 | 24.3 | 17.2 | 49.6 | 38.1 | 0 | 24.3 | 18.0 | 35.2 | 2.9 | 26.6 | 0 | 21.2 |
Supv. | IN | 79.8 | 88.8 | 70.0 | 77.2 | 72.6 | 24.9 | 21.8 | 14.4 | 7.2 | 18.4 | 31.8 | 0 | 22.8 | 36.8 | 40.2 | 19.5 | 31.8 | 8.2 | 31.2 |
PooDLe | BDD* | 92.6 | 94.0 | 80.3 | 82.2 | 84.8 | 54.7 | 34.9 | 33.4 | 17.8 | 56.3 | 42.2 | 0 | 25.7 | 27.1 | 41.5 | 7.7 | 39.0 | 0.1 | 35.2 |
Statistic | Rd | Sky | Bldg | Veg | Car | Bus | Fence | Truck | Wall | S-walk | Terrain | Train | Pole | Bicycle | Person | M-cycle | Tr. Sign | Rider | Tr. Light
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Avg Pix. % / Im. | 22.0 | 18.2 | 15.0 | 14.4 | 8.4 | 3.7 | 3.4 | 3.2 | 3.1 | 3.1 | 2.8 | 2.1 | 1.0 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.4 |
Total % of Pix. | 21.3 | 17.3 | 13.2 | 13.2 | 8.1 | 0.6 | 1.0 | 1.0 | 0.5 | 2.0 | 1.0 | 0.0 | 0.9 | 0.1 | 0.3 | 0.0 | 0.3 | 0.0 | 0.2 |
Total % of Im. | 96.5 | 94.8 | 88.4 | 91.7 | 97.3 | 15.0 | 30.6 | 30.5 | 15.4 | 66.7 | 36.7 | 0.7 | 95.0 | 6.4 | 34.7 | 3.8 | 75.3 | 5.2 | 47.1 |
Size Grp. | L | L | L | L | L | L | L | L | L | L | L | L | S | S | S | S | S | S | S |
Freq. Grp. | C | C | C | C | C | R | C | C | R | C | C | R | C | R | C | R | C | R | C |
We provide a breakdown of IoU per class on BDD semantic segmentation linear readout in Table 9. In Table 10, we also provide dataset-level statistics for each class, computed over the training split of 7,000 images in the BDD semantic segmentation dataset: the average pixel percentage per image, the total percentage of pixels over the dataset, and the total percentage of images in which the class appears. Size and frequency groupings are then independently defined using these statistics and used in Table 3. A class is considered 'Large' (L) if its average pixel percentage per image exceeds a threshold and 'Small' (S) otherwise; separately, we define a class as 'Common' (C) if the total percentage of images it appears in exceeds a threshold and 'Rare' (R) otherwise. Notably, PooDLe achieves significant gains on small classes such as 'Pole', 'Bicycle', 'Traffic Sign', and 'Traffic Light'. Methods trained on BDD underperform supervised IN1K on classes rare in BDD such as 'Rider', likely because IN1K offers both abundant and iconic images of these object categories.
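The per-class statistics and groupings can be computed from the segmentation masks as sketched below; the size and frequency thresholds are placeholders, since the exact cutoffs are omitted above.

```python
import numpy as np

def class_groupings(masks, num_classes=19, size_thresh=2.0, freq_thresh=20.0):
    """Compute per-class statistics over a list of (H, W) label masks and assign
    size ('S'/'L') and frequency ('R'/'C') groups. Thresholds are placeholders."""
    sum_pix_pct = np.zeros(num_classes)
    n_present = np.zeros(num_classes)
    for mask in masks:
        pix = np.bincount(mask.ravel(), minlength=num_classes)[:num_classes]
        present = pix > 0
        sum_pix_pct[present] += 100.0 * pix[present] / mask.size
        n_present += present
    # Average pixel % per image containing the class, and % of images containing it.
    avg_pix_pct = sum_pix_pct / np.maximum(n_present, 1)
    img_pct = 100.0 * n_present / len(masks)
    size_group = np.where(avg_pix_pct >= size_thresh, "L", "S")
    freq_group = np.where(img_pct >= freq_thresh, "C", "R")
    return avg_pix_pct, img_pct, size_group, freq_group
```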
Appendix I Backbone computation cost
We provide a table detailing the number of FLOPs associated with various SSL methods and backbones. We note that our SDM is by far the most efficient upsampling approach for dense representation learning methods.
Architecture | Associated Methods | GFLOPs |
---|---|---|
ResNet-50 | DINO-R50 | 43.3 |
ResNet-50 + SDM | PooDLe | 60.5 |
ResNet-50 + FPN decoder | PixPro | 124.4 |
ResNet-50 + dilated convolutions | FlowE, DINO, DenseCL | 200.7 |
ViT-S/16 | DINO, iBOT, DoRA, MAE | 82.9 |
Appendix J Flow visualizations

In Figure 17, we compare the predicted flow maps generated by RAFT (Teed and Deng, 2020), an off-the-shelf supervised model, and by our own unsupervised UFlow (Jonschkowski et al., 2020) model. The frame pairs are randomly sampled with a fixed temporal stride. We note that self-supervised flow, particularly on BDD100K, may exhibit noisy or splotchy results, possibly due to inconsistent motion and large dark regions that do not offer a sufficient photometric supervisory signal. This is in contrast to RAFT, which learns sharp edges from supervised labels. Nevertheless, we find that this self-supervised flow is sufficient for training PooDLe.