KeyDiff3D: Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors

CVPR 2026

Yonsei University

KeyDiff3D enables 3D keypoint prediction and object manipulation from a single image using multi-view diffusion priors. It generalizes effectively to in-the-wild and out-of-domain scenarios across diverse categories, including both human and animal domains.

Abstract

Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that accurately predicts 3D keypoints from a single image, eliminating the need for such expensive data acquisition. To achieve this, we leverage the powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single image; these images serve as supervision signals that provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms the implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and manipulation of 3D objects generated from a single image.

Method

From a single image, the pipeline proceeds in three steps: (1) a pretrained multi-view diffusion model provides novel views and multi-view features; (2) these features are aggregated and lifted into a 3D feature volume for keypoint prediction; and (3) the predicted 3D keypoints are projected onto the generated views to provide structural cues for self-supervised reconstruction.

Figure 2. The overall pipeline of KeyDiff3D.
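
Step (3) relies on projecting 3D keypoints onto each generated view. The page does not state which camera model KeyDiff3D uses, so the following is only a minimal sketch that assumes a standard pinhole camera; the function name `project_keypoints`, the intrinsics `K`, and the pose `(R, t)` are hypothetical and chosen for illustration.

```python
import numpy as np

def project_keypoints(kp_world, K, R, t):
    """Project (N, 3) world-space keypoints to (N, 2) pixel coordinates.

    Assumes a pinhole camera (an assumption, not confirmed by the source):
      K: (3, 3) intrinsics
      R: (3, 3) world-to-camera rotation
      t: (3,)   world-to-camera translation
    """
    kp_cam = kp_world @ R.T + t        # world frame -> camera frame
    uvw = kp_cam @ K.T                 # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide

# Hypothetical camera: 256x256 image, focal length 500 px, 2 units from the object.
K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])

# Two illustrative keypoints in the object-centered frame.
kp = np.array([[0.0, 0.0, 0.0],
               [0.1, -0.2, 0.0]])
uv = project_keypoints(kp, K, R, t)
print(uv)  # pixel coordinates, one row per keypoint
```

Repeating the projection with the pose of each generated view would give per-view 2D keypoints, which is the kind of structural cue the reconstruction step consumes.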

3D Keypoint Estimation on Human3.6M

Our method outperforms all unsupervised single-view baselines and is competitive with multi-view baselines while using only a single image. It also achieves a better P-MPJPE than monocular human pose estimation methods that employ human-specific priors.

Figure 3. Qualitative comparison on the Human3.6M dataset.

Animal Keypoint Estimation

We train our model on diverse animal categories using CUB-200-2011 and Stanford Dogs, both of which consist of single images captured in natural environments, without multi-view or 3D annotations. Our method reliably captures semantic parts and 3D structure under varied poses, articulations, and occlusions.

Figure 4. Qualitative results on CUB-200-2011 and Stanford Dogs.

Generalization to In-the-Wild and Out-of-Domain Inputs

Although trained only on indoor Human3.6M (five subjects) or on Stanford Dogs, our models generalize well to in-the-wild DAVIS images and to AP-10K animal species, respectively, including rhinoceroses, zebras, and giraffes. Despite the large variations in shape, appearance, and limb structure, our method consistently predicts semantically meaningful keypoints.

Human3.6M (train) → DAVIS (test)

Stanford Dogs (train) → AP-10K (test)

Figure 5. Cross-domain generalization. Models trained on Human3.6M and Stanford Dogs generalize to in-the-wild DAVIS videos and AP-10K animal species.

Animatable 3D Object Generation

Our predicted 3D keypoints are aligned with the coordinate system of the diffusion model and the generated multi-view images. Combined with Gaussian Frosting reconstructions, this enables articulation and deformation of generated 3D objects without requiring object-specific skeleton design or manual rigging.

Figure 6. Animatable 3D model results.

Quantitative Comparison

3D keypoint accuracy on Human3.6M. Lower is better. * denotes results on a simplified subset with six actions.

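| Setting | Method | #Views | #KP | Regression | MPJPE ↓ | N-MPJPE ↓ | P-MPJPE ↓ |
|---|---|---|---|---|---|---|---|
| Human Pose | Sosa et al. | 1 | 18 | - | - | - | 96.4 |
| Human Pose | Kundu et al. | 1 | 18 | - | 99.2 | - | - |
| Human Pose | Kundu et al. | 1 | 18 | - | - | - | 89.4 |
| Human Pose | Yang et al. * | 4 | 18 | - | 85.6 | 85.6 | 79.3 |
| Multi-View | BKinD-3D | 4 | 15 | Linear | 125 | - | 105 |
| Multi-View | BKinD-3D | 2 | 15 | Linear | 155 | - | 117 |
| Multi-View | Honari et al. | 4 | 32 | 2 hid MLP | 73.8 | 72.6 | 63.0 |
| Single-View | Keypoint-net | 1 | 32 | 2 hid MLP | 158.7 | 156.8 | 112.9 |
| Single-View | Honari et al. | 1 | 32 | 2 hid MLP | 125.73 | 121.04 | 89.05 |
| Ours | KeyDiff3D | 1 | 18 | Linear | 130.58 | 127.69 | 96.83 |
| Ours | KeyDiff3D | 1 | 18 | 2 hid MLP | 121.34 | 118.29 | 85.26 |
| Ours | KeyDiff3D | 1 | 32 | Linear | 127.41 | 124.93 | 96.18 |
| Ours | KeyDiff3D | 1 | 32 | 2 hid MLP | 119.07 | 116.02 | 85.37 |
| Ours * | KeyDiff3D * | 1 | 18 | Linear | 102.39 | 100.60 | 80.16 |
| Ours * | KeyDiff3D * | 1 | 18 | 2 hid MLP | 85.47 | 84.38 | 66.73 |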

Table 1. Quantitative comparison of 3D keypoint estimation on Human3.6M.
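
For readers unfamiliar with the three error columns, the sketch below follows their commonly used definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, N-MPJPE first rescales the prediction by the least-squares optimal scale, and P-MPJPE first applies a full similarity (Procrustes) alignment. The exact evaluation protocol behind Table 1 (root-centering, the regression from discovered keypoints to joints, units) is not given on this page, so the function names and the synthetic data here are illustrative assumptions only.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def n_mpjpe(pred, gt):
    """MPJPE after rescaling pred by the least-squares optimal scale."""
    scale = (pred * gt).sum() / (pred * pred).sum()
    return mpjpe(scale * pred, gt)

def p_mpjpe(pred, gt):
    """MPJPE after similarity (Procrustes) alignment of pred to gt."""
    mu_g = gt.mean(axis=0)
    P = pred - pred.mean(axis=0)             # centered prediction
    G = gt - mu_g                            # centered ground truth
    U, S, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(U @ Vt))       # -1 if the best fit is a reflection
    D = np.array([1.0, 1.0, d])
    R = U @ np.diag(D) @ Vt                  # proper rotation with P @ R ~ G
    scale = (S * D).sum() / (P ** 2).sum()   # optimal isotropic scale
    return mpjpe(scale * P @ R + mu_g, gt)

# Synthetic check: a rotated, scaled, and shifted copy of the ground truth.
rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3))
c, s = np.cos(0.3), np.sin(0.3)
rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
pred = 1.5 * gt @ rot + np.array([0.2, -0.1, 0.4])

print(mpjpe(pred, gt), n_mpjpe(pred, gt), p_mpjpe(pred, gt))
```

In this synthetic case P-MPJPE is numerically zero because the prediction differs from the ground truth only by a similarity transform, while MPJPE and N-MPJPE remain large. This is why P-MPJPE is usually the lowest of the three columns for a given method.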

BibTeX

@inproceedings{jeon2026keydiff3d,
  title     = {KeyDiff3D: Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors},
  author    = {Jeon, Subin and Cho, In and Hong, Junyoung and Cho, Woong Oh and Kim, Seon Joo},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}

Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-II220124, No. RS-2024-00457882), and by the Artificial Intelligence Graduate School Program grant funded by Yonsei University (No. RS-2020-II201361).