Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

¹Department of Computer Science, University of Central Florida   ²SketchX, CVSSP, University of Surrey, United Kingdom

Keypoint detection, usually approached in a supervised setting and combined with cross-domain adaptation, can be cast in a few-shot paradigm that learns to localize novel keypoints from a limited number of annotated photos while also adapting to novel classes. Unlike existing works, we approach the problem in a cross-modal setup. (Right) The proposed few-shot framework adapts to localize novel keypoints on photos of unseen classes given a few annotated sketches.

Abstract

Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.

Architecture

Overview of the proposed few-shot keypoint detection framework, which processes sketches or edgemaps in the support set and photos in the query set. An encoder extracts deep feature maps, from which keypoint embeddings are derived through Gaussian pooling. Support keypoint prototypes are constructed by averaging the keypoint embeddings after disentangling style information with the de-stylization network. Support-query correlation is performed by point-wise multiplication of the prototype with the query feature map; a descriptor network then forms a query descriptor, which the grid-based locator (GBL) module uses for localization.
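For readers who prefer code, below is a minimal PyTorch sketch of this data flow. All module choices, layer sizes, tensor shapes, and the exact grid-based locator head are illustrative assumptions, not the released implementation.

```python
# Minimal PyTorch sketch of the described pipeline. Module choices, layer
# sizes, and the grid-based locator head are illustrative stand-ins.
import torch
import torch.nn as nn


def gaussian_pool(feat, kp_xy, sigma=0.1):
    """Pool a keypoint embedding from a feature map using a Gaussian window
    centred on the (normalised) keypoint location kp_xy in [0, 1]^2."""
    B, C, H, W = feat.shape
    ys = torch.linspace(0, 1, H, device=feat.device).view(1, H, 1)
    xs = torch.linspace(0, 1, W, device=feat.device).view(1, 1, W)
    g = torch.exp(-((xs - kp_xy[:, 0].view(B, 1, 1)) ** 2 +
                    (ys - kp_xy[:, 1].view(B, 1, 1)) ** 2) / (2 * sigma ** 2))
    g = g / g.sum(dim=(1, 2), keepdim=True)             # (B, H, W), sums to 1
    return (feat * g.unsqueeze(1)).sum(dim=(2, 3))       # (B, C) keypoint embedding


class FewShotKeypointDetector(nn.Module):
    def __init__(self, C=256, grid=8):
        super().__init__()
        self.encoder = nn.Sequential(                    # stand-in backbone
            nn.Conv2d(3, C, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(C, C, 3, 2, 1), nn.ReLU())
        self.destylize = nn.Linear(2 * C, C)             # stand-in for the de-stylization network Z
        self.descriptor = nn.Conv2d(C, C, 3, 1, 1)       # descriptor network
        self.locator = nn.Conv2d(C, grid * grid + 2, 1)  # grid-based locator (GBL) stand-in

    def forward(self, support_sketch, support_kp, query_photo):
        # 1) shared encoder for support sketches/edgemaps and query photos
        fs = self.encoder(support_sketch)                # (K, C, H, W) for K support shots
        fq = self.encoder(query_photo)                   # (1, C, H, W)
        # 2) Gaussian pooling -> per-shot keypoint embeddings, de-stylized with global context
        local = gaussian_pool(fs, support_kp)            # (K, C)
        global_ctx = fs.mean(dim=(2, 3))                 # (K, C) global context
        emb = self.destylize(torch.cat([local, global_ctx], dim=1))
        # 3) support keypoint prototype = average of de-stylized embeddings
        proto = emb.mean(dim=0).view(1, -1, 1, 1)        # (1, C, 1, 1)
        # 4) support-query correlation by point-wise multiplication, then descriptor net
        desc = self.descriptor(fq * proto)               # (1, C, H, W) query descriptor
        # 5) grid-based localization: grid-cell logits plus a within-cell offset
        out = self.locator(desc).mean(dim=(2, 3))        # (1, grid*grid + 2)
        return out[:, :-2], out[:, -2:]                  # cell logits, offset
```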

 
De-stylization network
The design of the de-stylization network \(Z\), which disentangles styles by fusing global context into local keypoint embeddings.
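A hedged sketch of one plausible form of \(Z\) follows: each local keypoint embedding is concatenated with a global context vector (here a global-average-pooled feature map) and passed through a small MLP with a residual connection. The exact fusion used in the paper may differ.

```python
# One plausible (assumed) form of the de-stylization network Z: fuse each
# local keypoint embedding with a global context vector through a small MLP.
import torch
import torch.nn as nn


class DeStylization(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim))

    def forward(self, local_emb, feat_map):
        # local_emb: (K, C) keypoint embeddings; feat_map: (K, C, H, W)
        global_ctx = feat_map.mean(dim=(2, 3))              # (K, C) global context
        fused = torch.cat([local_emb, global_ctx], dim=1)   # (K, 2C)
        # residual: keep the keypoint identity while suppressing style
        return local_emb + self.mlp(fused)                  # (K, C) de-stylized embedding
```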

Dataset Visualization

Dataset
Sample support and query images for training (top) and evaluation (bottom) across all evaluation settings. The support set is composed of edgemaps of the photos, obtained with edge detectors.
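As an illustration of how such edgemaps can be produced, the snippet below uses OpenCV's Canny detector; the actual edge detector and thresholds used to build the support set may differ.

```python
# Hedged example: converting a photo into an edgemap with OpenCV's Canny
# detector. The actual detector and thresholds used for the dataset may differ.
import cv2


def photo_to_edgemap(photo_path, low=100, high=200):
    img = cv2.imread(photo_path, cv2.IMREAD_GRAYSCALE)
    img = cv2.GaussianBlur(img, (5, 5), 0)   # light smoothing before edge detection
    edges = cv2.Canny(img, low, high)        # binary edge map (0 or 255)
    return 255 - edges                       # dark strokes on white, sketch-like


edgemap = photo_to_edgemap("photo.jpg")      # hypothetical input file
cv2.imwrite("edgemap.png", edgemap)
```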

Results

PCK@0.1 on Animal Pose (left block) and Animal Kingdom (right block):

| Class  | Keypoints | Method    | Cat   | Cow   | Dog   | Horse | Sheep | Mean  | Mammal | Amphibian | Reptile | Bird  | Fish  | Mean  |
|--------|-----------|-----------|-------|-------|-------|-------|-------|-------|--------|-----------|---------|-------|-------|-------|
| Seen   | Base      | B-Vanilla | 54.12 | 39.27 | 44.65 | 45.58 | 37.17 | 44.16 | 22.32  | 21.07     | 18.94   | 21.77 | 18.19 | 20.46 |
| Seen   | Base      | FSKD      | 58.95 | 44.61 | 49.53 | 50.21 | 40.47 | 48.75 | 25.52  | 24.42     | 21.81   | 25.73 | 22.24 | 23.94 |
| Seen   | Base      | Proposed  | 67.34 | 49.89 | 56.28 | 56.35 | 45.65 | 55.10 | 31.31  | 29.93     | 28.41   | 30.88 | 27.87 | 29.68 |
| Seen   | Novel     | B-Vanilla | 24.70 | 15.62 | 19.08 | 12.45 | 18.44 | 18.06 | 9.67   | 7.24      | 6.55    | 8.41  | 4.96  | 7.37  |
| Seen   | Novel     | FSKD      | 47.70 | 35.44 | 39.81 | 35.42 | 31.59 | 37.99 | 18.89  | 17.39     | 16.72   | 19.23 | 15.94 | 17.63 |
| Seen   | Novel     | Proposed  | 55.69 | 43.09 | 46.58 | 43.94 | 36.39 | 45.14 | 25.05  | 23.27     | 22.56   | 24.04 | 21.78 | 23.34 |
| Unseen | Base      | B-Vanilla | 43.03 | 40.31 | 36.28 | 44.72 | 38.03 | 40.47 | 12.83  | 14.12     | 12.57   | 13.68 | 12.41 | 13.12 |
| Unseen | Base      | FSKD      | 41.54 | 38.10 | 33.72 | 41.02 | 36.30 | 38.14 | 14.86  | 14.28     | 13.79   | 15.65 | 12.16 | 14.15 |
| Unseen | Base      | Proposed  | 47.36 | 42.97 | 38.30 | 46.17 | 41.03 | 43.17 | 21.98  | 20.15     | 18.96   | 21.52 | 17.19 | 19.96 |
| Unseen | Novel     | B-Vanilla | 22.71 | 15.56 | 16.92 | 13.58 | 18.18 | 17.39 | 7.26   | 5.12      | 3.93    | 5.69  | 4.08  | 5.22  |
| Unseen | Novel     | FSKD      | 36.75 | 35.76 | 32.84 | 32.66 | 31.58 | 33.92 | 10.96  | 9.34      | 9.68    | 11.45 | 8.86  | 10.06 |
| Unseen | Novel     | Proposed  | 44.42 | 40.13 | 36.91 | 37.77 | 35.77 | 39.00 | 16.48  | 14.62     | 13.76   | 15.91 | 11.33 | 14.42 |
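For reference, PCK@α counts a predicted keypoint as correct when its distance to the ground truth falls within α times a normalising length of the object (commonly the longest side of its bounding box). A minimal sketch follows; the bounding-box-side normalisation is an assumption, as benchmarks differ in this detail.

```python
# Minimal sketch of PCK@alpha: a predicted keypoint is correct if its distance
# to the ground truth is within alpha times the longest side of the object's
# bounding box (benchmarks may differ in the exact normalisation).
import numpy as np


def pck(pred, gt, bbox_wh, alpha=0.1):
    """pred, gt: (N, 2) keypoint coordinates; bbox_wh: (N, 2) box width/height."""
    dist = np.linalg.norm(pred - gt, axis=1)       # Euclidean error per keypoint
    thresh = alpha * bbox_wh.max(axis=1)           # per-instance threshold
    return float((dist <= thresh).mean()) * 100.0  # percentage of correct keypoints


# toy usage: the first keypoint is within 10% of the 100 px box side, the second is not
pred = np.array([[10.0, 12.0], [50.0, 40.0]])
gt = np.array([[12.0, 12.0], [70.0, 40.0]])
boxes = np.array([[100.0, 80.0], [100.0, 80.0]])
print(pck(pred, gt, boxes))                        # -> 50.0
```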
Qualitative
Visualising detections (⨯) and ground truth (●) for novel keypoints on base classes (top) and unseen classes (bottom).

Results on Real Sketches

PCK@0.1 on Animal Pose:

| Class  | Keypoints | Support | Cat   | Cow   | Dog   | Horse | Sheep | Mean  |
|--------|-----------|---------|-------|-------|-------|-------|-------|-------|
| Seen   | Base      | Edgemap | 67.34 | 49.89 | 56.28 | 56.35 | 45.65 | 55.10 |
| Seen   | Base      | Sketch  | 66.69 | 45.79 | 55.43 | 56.13 | 43.40 | 53.29 |
| Seen   | Novel     | Edgemap | 55.69 | 43.09 | 46.58 | 43.94 | 36.39 | 45.14 |
| Seen   | Novel     | Sketch  | 55.45 | 42.96 | 46.35 | 43.88 | 36.31 | 44.99 |
| Unseen | Base      | Edgemap | 47.36 | 42.97 | 38.30 | 46.17 | 41.03 | 43.17 |
| Unseen | Base      | Sketch  | 45.90 | 42.47 | 37.82 | 45.36 | 40.45 | 42.40 |
| Unseen | Novel     | Edgemap | 44.42 | 40.13 | 36.91 | 37.77 | 35.77 | 39.00 |
| Unseen | Novel     | Sketch  | 43.79 | 39.91 | 36.17 | 37.56 | 35.02 | 38.49 |
Despite being trained on synthetic sketches or edgemaps, our framework generalizes well to real sketches, achieving comparable performance.

 
 
Real Sketch
Inference (⨯) with ground-truth (●) for base (top) and novel (bottom) keypoints with annotated sketch prompts.

Photo-based Few-shot Keypoint Detection

PCK@0.1:

| Class  | Keypoints | Method    | Cat   | Cow   | Dog   | Horse | Sheep | Mean  |
|--------|-----------|-----------|-------|-------|-------|-------|-------|-------|
| Seen   | Base      | FSKD      | 68.66 | 52.70 | 59.24 | 58.53 | 45.04 | 56.83 |
| Seen   | Base      | Ours      | 66.97 | 51.38 | 57.72 | 57.31 | 43.81 | 55.44 |
| Seen   | Base      | Ours (MM) | 80.16 | 61.34 | 73.70 | 67.44 | 57.85 | 68.10 |
| Seen   | Novel     | FSKD      | 60.84 | 47.78 | 53.44 | 49.21 | 38.47 | 49.95 |
| Seen   | Novel     | Ours      | 59.17 | 46.49 | 51.89 | 47.93 | 37.65 | 48.63 |
| Seen   | Novel     | Ours (MM) | 67.51 | 49.92 | 59.05 | 53.06 | 43.45 | 54.60 |
| Unseen | Base      | FSKD      | 56.38 | 48.24 | 51.29 | 49.77 | 43.95 | 49.93 |
| Unseen | Base      | Ours      | 55.67 | 46.94 | 50.47 | 48.21 | 42.88 | 48.83 |
| Unseen | Base      | Ours (MM) | 57.68 | 52.06 | 51.75 | 52.27 | 47.74 | 52.30 |
| Unseen | Novel     | FSKD      | 52.36 | 44.07 | 47.94 | 42.77 | 36.60 | 44.75 |
| Unseen | Novel     | Ours      | 50.88 | 43.34 | 46.67 | 42.52 | 35.19 | 43.72 |
| Unseen | Novel     | Ours (MM) | 54.61 | 45.92 | 48.02 | 43.86 | 40.31 | 46.54 |
A quantitative comparison of the proposed method on query photos, using photos alone as support versus both edgemaps and photos (MM) as support. Additional modalities such as sketches or edgemaps, alongside photos in the support set, provide additional guidance and boost performance.
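As a hedged sketch of what multi-modal (MM) support can look like, the snippet below simply averages keypoint prototypes computed from photo and edgemap support before correlation with the query feature map; the paper's actual fusion may differ.

```python
# Hedged sketch of multi-modal (MM) support: average keypoint prototypes from
# photo and edgemap support before correlating with the query feature map.
import torch


def multimodal_prototype(photo_emb, edgemap_emb):
    """photo_emb, edgemap_emb: (K, C) de-stylized support keypoint embeddings."""
    proto_photo = photo_emb.mean(dim=0)          # (C,) photo prototype
    proto_edge = edgemap_emb.mean(dim=0)         # (C,) edgemap prototype
    return 0.5 * (proto_photo + proto_edge)      # (C,) fused prototype
```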

BibTeX

@inproceedings{maity2025dykp,
  title     = {Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection},
  author    = {Subhajit Maity and Ayan Kumar Bhunia and Subhadeep Koley and Pinaki Nath Chowdhury and Aneeshan Sain and Yi-Zhe Song},
  booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}

Copyright: CC BY-NC-SA 4.0 © Subhajit Maity | Last updated: 10 Jul 2025 | Template Credit: Nerfies