Sketch Down the FLOPs:

Towards Efficient Networks for Human Sketch

1SketchX, CVSSP, University of Surrey, United Kingdom 2Department of Computer Science, University of Central Florida

Our SketchyNetV1 compresses existing heavy FG-SBIR networks to deliver smaller models. Further enhanced via a canvas-selector module, our SketchyNetV2 dynamically minimises sketch resolution to reduce FLOPs.

Abstract

As sketch research has collectively matured over time, its adaptation for mass commercialisation emerges on the immediate horizon. Despite an already mature body of work on efficient inference for photos, there is no research on efficient inference designed specifically for sketch data. In this paper, we first demonstrate that existing state-of-the-art efficient lightweight models designed for photos do not work on sketches. We then propose two sketch-specific components that work in a plug-and-play manner on any efficient photo network to adapt it to sketch data. We chose fine-grained sketch-based image retrieval (FG-SBIR) as a demonstrator, it being the most recognised sketch problem with immediate commercial value. Technically speaking, we first propose a cross-modal knowledge distillation network to transfer existing efficient photo networks to be compatible with sketch, which brings down the number of FLOPs and model parameters by 97.96% and 84.89% respectively. We then exploit the abstract trait of sketch to introduce a reinforcement-learning-based canvas selector that dynamically adjusts to the abstraction level of each sketch, further cutting the number of FLOPs by two thirds. The end result is an overall reduction of 99.37% in FLOPs (from 40.18G to 0.254G) compared with a full network, while retaining accuracy (33.03% vs 32.77%) -- finally delivering an efficient network for sparse sketch data that exhibits even fewer FLOPs than the best photo counterpart.

Pilot Study

Unlike photos, which hold pixel-dense information, sketches are sparse black-and-white lines. This raises the question of whether a sketch rendered at a higher resolution, with its added computational burden, conveys any more semantic information than one at a lower resolution. From the pilot study we observe that while FG-IBIR accuracy falls rapidly with decreasing canvas size, FG-SBIR stays relatively stable, as photos (unlike sketches) contain pixel-dense information and lose much of it when down-scaled. Furthermore, the positive FG-SBIR accuracy at 32 × 32 shows that some sketches hold sufficient semantic information for retrieval even at the minimal canvas size.
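To see why canvas size matters so much for compute, consider a single convolution layer: its FLOPs grow quadratically with the canvas side length. The snippet below is purely illustrative (the layer shape and channel counts are assumptions, not the paper's architecture):

```python
# Illustrative: FLOPs of one 3x3, stride-1, same-padded conv layer
# scale quadratically with canvas size, so rasterising a sketch on a
# smaller canvas cuts compute dramatically.

def conv_flops(canvas, in_ch=3, out_ch=64, k=3):
    """Multiply-add count for a stride-1, same-padded convolution."""
    return canvas * canvas * in_ch * out_ch * k * k * 2  # 2: mul + add

full = conv_flops(256)  # 256 x 256 canvas
tiny = conv_flops(32)   # 32 x 32 canvas
print(f"reduction: {1 - tiny / full:.2%}")  # (32/256)^2 = 1/64 -> 98.44%
```

Hence a sketch that remains retrievable at 32 × 32 costs roughly 1/64 of the per-layer compute of its 256 × 256 rendering.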

 

Architecture

SketchyNetV1: A smaller student network is trained from a larger pre-trained teacher via Knowledge Distillation.
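A minimal sketch of the distillation objective described above, in the standard temperature-softened form of Hinton et al.; the temperature and logits here are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients keep a consistent magnitude across T."""
    p = softmax(teacher_logits, T)  # soft targets from frozen teacher
    q = softmax(student_logits, T)  # student predictions
    return T * T * float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [5.0, 1.0, -2.0]  # assumed example logits
student = [4.0, 1.5, -1.0]
print(distillation_loss(student, teacher))  # > 0 while they disagree
```

Minimising this term pulls the small student's output distribution toward the larger teacher's, which is what lets the compressed network retain the teacher's retrieval accuracy.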

 
We progress to SketchyNetV2 by training a canvas-size selector guided by the twin objectives of increasing performance and reducing compute. It takes a sketch in vector form and predicts the optimal canvas size at which the sketch is rasterised. The rasterised sketch, when processed by the trained student network, minimises overall FLOPs while retaining the accuracy of the corresponding full-resolution sketch image.

Results

Qualitative
Example sketches shown at their optimal canvas sizes.

BibTeX

@inproceedings{sain2025sketchdowntheflops,
title={Sketch Down the FLOPs: Towards Efficient Networks for Human Sketch},
author={Aneeshan Sain and Subhajit Maity and Pinaki Nath Chowdhury and Subhadeep Koley and Ayan Kumar Bhunia and Yi-Zhe Song},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}}

Copyright: CC BY-NC-SA 4.0 © Subhajit Maity | Last updated: 30 Apr 2025 | Template Credit: Nerfies