"Scale-invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation"
by Xiaofei Zhou, Jia Lin, Dongmei Chen, Deyang Liu, Jiyong Zhang and Runmin Cong (Corresponding author: Dongmei Chen, Runmin Cong)
Accepted by IEEE Transactions on Image Processing (TIP)
📑 Paper | 🌐 Project Page
We propose SFM-Net, a novel framework for V-D-T (visible-depth-thermal) few-shot semantic segmentation.
✨ Key Highlights:
- 🔄 Asymmetric Multi-modal Fusion: Thermal images are fused with RGB in the encoder stage to extract rich semantic features. In contrast, depth is treated as prior geometric information and is integrated via a Prior-related Fusion (PF) module in the later stages to refine coarse predictions, avoiding noise interference from sparse depth maps during early feature extraction.
- 📏 Scale-invariant Feature Matching: To address significant object scale variations between support and query images, we propose Pixel-to-Patch Pooling (PTP-pool) units that utilize multi-scale pooling kernels to generate feature patches, enabling robust correlation modeling between pixels and patches across different sizes.
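The Pixel-to-Patch Pooling idea above can be sketched in a few lines. This is a minimal, hypothetical NumPy re-implementation for illustration only, not the released code: it pools a support feature map into patch tokens with several kernel sizes, then scores every query pixel against every support patch by cosine similarity. All function names and the choice of kernel sizes here are assumptions.

```python
import numpy as np

def ptp_pool(feat, kernel_sizes=(2, 4)):
    """Pool a support feature map (C, H, W) into patch tokens at
    several scales (illustrative sketch; names are assumptions)."""
    C, H, W = feat.shape
    patches = []
    for k in kernel_sizes:
        # Non-overlapping k x k average pooling at this scale.
        for i in range(0, H - k + 1, k):
            for j in range(0, W - k + 1, k):
                patches.append(feat[:, i:i + k, j:j + k].mean(axis=(1, 2)))
    return np.stack(patches)  # (num_patches, C)

def pixel_patch_correlation(query, patches):
    """Cosine similarity between every query pixel and every support
    patch, giving an (H*W, num_patches) correlation map."""
    C, H, W = query.shape
    q = query.reshape(C, H * W).T  # (HW, C)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    p = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    return q @ p.T  # (HW, num_patches)
```

Because the same query pixel is matched against patches of several sizes, the correlation stays informative whether the support object is much larger or much smaller than the query object, which is the scale-invariance the PTP-pool units target.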
- 260221: adapted the code to PyTorch 2.7 / CUDA 12.6
- Python 3.10
- PyTorch 2.7.0
- CUDA 12.6
Conda environment settings:
conda create -n sfmnet python=3.10
conda activate sfmnet
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126  # adjust to match your environment
pip install -r requirements.txt
- Pre-trained backbones:
- ResNet-50 pretrained on ImageNet-1K by TIMM
- ResNet-101 pretrained on ImageNet-1K by TIMM
- Swin-B pretrained on ImageNet by Swin-Transformer
- Models we provided: Quark Drive (夸克网盘), Google Drive
Use scripts/train.sh and scripts/test.sh for training and inference, respectively.
| Benchmark | fold0 | fold1 | fold2 | fold3 | mIoU | FB-IoU |
|---|---|---|---|---|---|---|
| VDT-2048-5i (1shot) | 82.7 | 71.6 | 66.7 | 57.5 | 69.6 | 83.6 |
| VDT-2048-5i (5shot) | 82.8 | 71.7 | 67.2 | 57.5 | 69.8 | 83.7 |
| Tokyo (1shot) | 32.1 | 16.9 | 34.8 | 47.5 | 32.8 | 62.2 |
| Tokyo (5shot) | 34.4 | 21.3 | 37.2 | 48.2 | 35.3 | 63.5 |
The benchmark results of our work can be accessed at:
This work builds on DCAMA and PMNet. Thanks to the authors for their open-source contributions!
If you find our work useful, please cite our paper. Thank you!
@ARTICLE{zhou2026scale,
  author={Zhou, Xiaofei and Lin, Jia and Chen, Dongmei and Liu, Deyang and Zhang, Jiyong and Cong, Runmin},
  journal={IEEE Transactions on Image Processing},
  title={Scale-Invariant Feature Matching Network for V-D-T Few-Shot Semantic Segmentation},
  year={2026},
  volume={35},
  number={},
  pages={2198-2209},
  doi={10.1109/TIP.2026.3663882}
}

