"Scale-invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation"
by Xiaofei Zhou, Jia Lin, Dongmei Chen, Deyang Liu, Jiyong Zhang and Runmin Cong (Corresponding author: Dongmei Chen, Runmin Cong)
Accepted by IEEE Transactions on Image Processing (TIP)
📑 Paper | 🌐 Project Page
We propose SFM-Net, a novel framework for V-D-T (visible-depth-thermal) few-shot semantic segmentation.
✨ Key Highlights:
- 🔄 Asymmetric Multi-modal Fusion: Thermal images are fused with RGB in the encoder stage to extract rich semantic features. In contrast, depth is treated as prior geometric information and is integrated via a Prior-related Fusion (PF) module in the later stages to refine coarse predictions, avoiding noise interference from sparse depth maps during early feature extraction.
- 📏 Scale-invariant Feature Matching: To address significant object scale variations between support and query images, we propose Pixel-to-Patch Pooling (PTP-pool) units that utilize multi-scale pooling kernels to generate feature patches, enabling robust correlation modeling between pixels and patches across different sizes.
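The Pixel-to-Patch Pooling idea above can be sketched in a few lines. This is a minimal, hypothetical NumPy re-implementation for illustration only, not the released code: it pools a support feature map into patch tokens with several kernel sizes, then scores every query pixel against every support patch by cosine similarity. All function names and the choice of kernel sizes here are assumptions.

```python
import numpy as np

def ptp_pool(feat, kernel_sizes=(2, 4)):
    """Pool a support feature map (C, H, W) into patch tokens at
    several scales (illustrative sketch; names are assumptions)."""
    C, H, W = feat.shape
    patches = []
    for k in kernel_sizes:
        # Non-overlapping k x k average pooling at this scale.
        for i in range(0, H - k + 1, k):
            for j in range(0, W - k + 1, k):
                patches.append(feat[:, i:i + k, j:j + k].mean(axis=(1, 2)))
    return np.stack(patches)  # (num_patches, C)

def pixel_patch_correlation(query, patches):
    """Cosine similarity between every query pixel and every support
    patch, giving an (H*W, num_patches) correlation map."""
    C, H, W = query.shape
    q = query.reshape(C, H * W).T  # (HW, C)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    p = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    return q @ p.T  # (HW, num_patches)
```

Because the same query pixel is matched against patches of several sizes, the correlation stays informative whether the support object is much larger or much smaller than the query object, which is the scale-invariance the PTP-pool units target.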
- 260221: adapted the code to PyTorch 2.7 / CUDA 12.6
- Python 3.10
- PyTorch 2.7.0
- CUDA 12.6
Conda environment settings:
conda create -n sfmnet python=3.10
conda activate sfmnet
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126  # adjust to match your environment
pip install -r requirements.txt
- Pre-trained backbones:
- ResNet-50 pretrained on ImageNet-1K by TIMM
- ResNet-101 pretrained on ImageNet-1K by TIMM
- Swin-B pretrained on ImageNet by Swin-Transformer
- Models we provided: Quark Drive (夸克网盘), Google Drive
Use scripts/train.sh and scripts/test.sh for training and inference, respectively.
| Benchmark | fold0 | fold1 | fold2 | fold3 | mIoU | FB-IoU |
|---|---|---|---|---|---|---|
| VDT-2048-5i (1shot) | 82.7 | 71.6 | 66.7 | 57.5 | 69.6 | 83.6 |
| VDT-2048-5i (5shot) | 82.8 | 71.7 | 67.2 | 57.5 | 69.8 | 83.7 |
| Tokyo (1shot) | 32.1 | 16.9 | 34.8 | 47.5 | 32.8 | 62.2 |
| Tokyo (5shot) | 34.4 | 21.3 | 37.2 | 48.2 | 35.3 | 63.5 |
The benchmark results of our work can be accessed at:
This work builds on DCAMA and PMNet. Thanks to the authors for their open-source contributions!
If you find our work useful, please cite our paper. Thank you!
@ARTICLE{zhou2026scale,
  author={Zhou, Xiaofei and Lin, Jia and Chen, Dongmei and Liu, Deyang and Zhang, Jiyong and Cong, Runmin},
  journal={IEEE Transactions on Image Processing},
  title={Scale-Invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation},
  year={2026},
  volume={35},
  number={},
  pages={2198-2209},
  doi={10.1109/TIP.2026.3663882}
}

