Figure 1: The pipeline of our proposed 3D-to-2D generative pre-training. We design a query generator to encode pose conditions and use cross-attention layers to transform 3D point cloud features into 2D view image features according to the pose instruction. The pose-dependent view image predicted by the 2D generator is supervised by the ground truth view image via an MSE loss.
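The data flow in Figure 1 can be illustrated with a minimal NumPy sketch. It is not the TAP implementation: the pose encoding (a flattened 3x3 rotation matrix), the patch grid, all dimensions, the random weights, and the helper names `make_pose_queries`, `cross_attention`, and `mse_loss` are assumptions chosen for illustration, and a single linear projection stands in for the 2D generator.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_pose_queries(pose, grid_h, grid_w, w_query):
    """Hypothetical query generator: one query per output image patch,
    built from the patch's 2D grid position and the flattened pose."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, grid_h),
                         np.linspace(-1, 1, grid_w), indexing="ij")
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1)            # (Q, 2)
    pose_tiled = np.tile(pose.reshape(1, -1), (grid.shape[0], 1))  # (Q, P)
    raw = np.concatenate([grid, pose_tiled], axis=1)             # (Q, 2 + P)
    return raw @ w_query                                         # (Q, D)

def cross_attention(queries, point_feats, w_k, w_v):
    """Single-head cross-attention: pose queries attend to 3D point features."""
    keys = point_feats @ w_k                                     # (N, D)
    values = point_feats @ w_v                                   # (N, D)
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])       # (Q, N)
    attn = softmax(scores, axis=-1)
    return attn @ values, attn                                   # (Q, D), (Q, N)

def mse_loss(pred, target):
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
N, C, D = 128, 32, 16        # points, backbone feature dim, attention dim (assumed)
H = W = 4                    # patch grid of the output view image (assumed)
P = 9                        # flattened 3x3 rotation used as the pose (assumed)
PATCH_PIXELS = 8 * 8 * 3     # RGB pixels per patch (assumed)

point_feats = rng.normal(size=(N, C))   # stand-in for 3D backbone output
pose = np.eye(3)                        # identity rotation as an example pose
w_query = rng.normal(size=(2 + P, D)) * 0.1
w_k = rng.normal(size=(C, D)) * 0.1
w_v = rng.normal(size=(C, D)) * 0.1
w_out = rng.normal(size=(D, PATCH_PIXELS)) * 0.1  # stand-in for the 2D generator

queries = make_pose_queries(pose, H, W, w_query)
view_feats, attn = cross_attention(queries, point_feats, w_k, w_v)
pred_patches = view_feats @ w_out                      # predicted view image patches
gt_patches = rng.uniform(size=(H * W, PATCH_PIXELS))   # placeholder ground truth
loss = mse_loss(pred_patches, gt_patches)
```

Because the queries depend only on the pose and the output grid, while the keys and values come from the point features, a sketch like this places no constraint on which 3D backbone produces `point_feats`.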
With the overwhelming trend of masked image modeling led by MAE, generative pre-training has shown remarkable potential to boost the performance of fundamental models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. As the pre-training scheme, we propose to generate view images from different instructed poses via the cross-attention mechanism. Generating view images provides more precise supervision than generating point clouds, and thus helps 3D backbones gain a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results demonstrate the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks.
We show that our proposed TAP can be applied to any type of point cloud model and brings consistent improvements.
We show that our proposed TAP outperforms other 3D generative pre-training methods on the standard Transformer model.
We show that our proposed TAP achieves state-of-the-art performance on ScanObjectNN classification and ShapeNetPart part segmentation tasks without introducing pre-trained weights from other domains.
We show that TAP also brings consistent improvements to scene-level object detection and semantic segmentation backbones. Although we pre-train 3DETR with TAP only on the object-level ShapeNet55 dataset, the improvement is still remarkable.
We show the visualization of images produced by our proposed 3D-to-2D generative pre-training.
Table 4: Ablation studies on the Photograph module in the TAP pre-training pipeline. We choose PointMLP as the backbone model and conduct ablation studies on the ScanObjectNN dataset from two aspects: overall architectural designs and query designs. Please refer to our paper for more ablation details.
Figure 2: Visualization results of our proposed 3D-to-2D generative pre-training. The first row displays view images generated by our TAP pre-training pipeline and the second row shows the ground truth images. TAP produces view images with appropriate shapes and reflection colors, demonstrating its ability to capture geometric structure and stereoscopic knowledge.