DOSeg | Camouflaged Pest Instance Segmentation

DOSeg (Domain-Knowledge-Guided Orthogonal Subspace Learning for Camouflaged Pest Instance Segmentation)

Timely perception of crop pests in complex field environments is a prerequisite for early prevention and targeted intervention. Accurate pest instance segmentation delineates pest individuals and pest-occupied areas at the pixel level, providing spatial evidence for incipient infestation detection and targeted control. However, severe camouflage interference arises from the high similarity between pests and host backgrounds in color, texture, and contour, while subtle morphological differences among related species further intensify fine-grained inter-class confusion. To address these challenges, we propose DOSeg, a domain-knowledge-guided orthogonal subspace learning framework for camouflaged pest instance segmentation. Specifically, dual-granularity semantic priors are constructed from fine-grained textual descriptions, preserving target-specific details while introducing category-level exclusion anchors. To alleviate feature entanglement caused by shared cross-class attributes, an Orthogonal Subspace Purification module is designed to explicitly suppress interference directions from non-target classes. Finally, semantic-guided mask generation decomposes prediction into semantic anchoring and mask assembly. Semantic activation maps suppress background responses in advance and improve the robustness of class-aware instance masks without compromising inference efficiency. In addition, we release PestCamouflage, the first public multimodal segmentation dataset for camouflaged pests, with instance-level pixel masks and fine-grained textual annotations. Experiments show that DOSeg achieves 76.9% mAP@50-95, outperforms the compared visual and multimodal baselines, and maintains low inference latency.

Contributions.

We construct PestCamouflage, the first public multimodal instance segmentation dataset for camouflaged pests, containing 4,166 images, instance-level pixel masks for 25 pest categories, and fine-grained textual descriptions.
We propose DOSeg, a domain-knowledge-guided multimodal instance segmentation framework that uses dual-granularity semantic priors to align visual and textual features and improve segmentation accuracy under severe camouflage.
We design an Orthogonal Subspace Purification module that identifies and suppresses interference directions from non-target classes, thereby alleviating feature entanglement caused by cross-class shared attributes.
Experiments show that DOSeg outperforms the compared visual-only and multimodal baselines in segmentation accuracy, reaching 76.9% mAP@50-95 while maintaining low inference latency.
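
To make the idea behind Orthogonal Subspace Purification more concrete, the sketch below removes from a target-class embedding its component lying in the subspace spanned by non-target class embeddings. This page does not give the module's equations, so this is only an illustration under a stated assumption: that purification can be modeled as projection onto the orthogonal complement of the interference subspace. The function name `purify`, the SVD-based basis, and the final renormalization are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def purify(target, interferers, eps=1e-8):
    """Project `target` onto the orthogonal complement of span(interferers).

    Assumption (not from the paper): the interference directions are the
    embeddings of non-target classes, given as at least one row vector.
    """
    target = np.asarray(target, dtype=float)          # shape (d,)
    A = np.asarray(interferers, dtype=float)          # shape (k, d), k >= 1

    # Orthonormal basis of the interference subspace. SVD tolerates
    # linearly dependent interferers by dropping near-zero singular values.
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    basis = vt[s > eps * s.max()]                     # shape (r, d)

    # Subtract the component of the target that lies in that subspace.
    purified = target - basis.T @ (basis @ target)

    # Renormalize so the result can be used as a unit-length text prior.
    norm = np.linalg.norm(purified)
    return purified / norm if norm > eps else purified


if __name__ == "__main__":
    target = [1.0, 2.0, 2.0]
    others = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    p = purify(target, others)
    # Residual similarity to every interferer should be numerically zero.
    print(np.abs(np.asarray(others) @ p).max())
```

After purification, the target embedding has no remaining similarity to any interferer, which is one simple way to read "suppressing interference directions from non-target classes".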

Vision-language paradigms for camouflaged pests

Figure 1: Comparison of two vision-language paradigms in camouflaged pest scenarios. (a) Attention-based vision-language fusion enhances target responses but may still cause feature confusion. (b) Contrastive vision-language alignment pulls target features closer to textual semantics while pushing interfering features away, thereby yielding clearer boundaries.
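
The contrastive paradigm in panel (b) can be illustrated with a generic InfoNCE-style objective, in which each visual feature is pulled toward its matching text embedding and pushed away from the others. This is a minimal sketch of that general technique, not DOSeg's training loss, which this page does not specify. The function name `info_nce` and the temperature value are assumptions made for illustration.

```python
import numpy as np

def info_nce(visual, text, temperature=0.07):
    """Generic visual-to-text InfoNCE loss.

    visual, text: arrays of shape (n, d); row i of `visual` is assumed to
    match row i of `text`, and all other rows act as negatives.
    """
    v = np.asarray(visual, dtype=float)
    t = np.asarray(text, dtype=float)

    # Cosine similarity via L2-normalized embeddings.
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / temperature                    # shape (n, n)

    # Numerically stable log-softmax over the text candidates.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Matching pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    emb = np.eye(3)
    print(info_nce(emb, emb))                         # well-aligned pairs
    print(info_nce(emb, np.roll(emb, 1, axis=0)))     # mismatched pairs
```

Well-aligned pairs give a loss near zero, while mismatched pairs give a much larger loss, which is the "pull closer, push away" behavior the caption describes.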

Figure 2: Overview of the PestCamouflage dataset. (a) Pest categories across multiple orders and their image counts. (b) Examples of instance-level pixel masks in camouflaged field scenes. (c) After multi-source collection and data cleaning, candidate fine-grained textual descriptions are generated using templates and Qwen3-VL, while SAM2 is employed for mask-assisted annotation. Boxes, color highlights, and schematic flows are used only for visualization and are not included in the formal annotations.

Figure 3: Overall architecture of DOSeg. (a) Multiscale Visual Representation extracts pest features in field scenes through a lightweight backbone and a Path Aggregation Network, preserving semantic and boundary cues. (b) Text Semantic Construction builds dual-granularity semantic priors from target descriptions and non-target corpora, and suppresses interference among closely related pest categories through orthogonal subspace purification. (c) Semantic-Guided Mask Generation uses purified textual priors to derive class-specific activation maps, suppress visually similar backgrounds, and guide pest instance mask assembly.
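
Stage (c) can be sketched as two steps: a class-specific activation map from the similarity between per-pixel visual features and a purified text prior, followed by mask assembly gated by that map. The page does not describe how DOSeg assembles masks, so the sketch assumes a prototype-based scheme (a linear combination of mask prototypes weighted by per-instance coefficients). The names `semantic_activation` and `assemble_mask`, the temperature, and the multiplicative gating are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def semantic_activation(features, text_prior, temperature=0.1):
    """Semantic anchoring: cosine similarity between each pixel and the prior.

    features: (C, H, W) visual feature map; text_prior: (C,) class embedding.
    Returns an (H, W) map in (0, 1); low values mark background-like pixels.
    """
    f = np.asarray(features, dtype=float)
    t = np.asarray(text_prior, dtype=float)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
    t = t / (np.linalg.norm(t) + 1e-8)
    cos = np.einsum("chw,c->hw", f, t)
    return sigmoid(cos / temperature)

def assemble_mask(prototypes, coeffs, activation):
    """Mask assembly (assumed prototype-based), gated by the activation map.

    prototypes: (K, H, W) mask prototypes; coeffs: (K,) instance coefficients.
    """
    raw = np.einsum("khw,k->hw", np.asarray(prototypes, dtype=float),
                    np.asarray(coeffs, dtype=float))
    return sigmoid(raw) * activation


if __name__ == "__main__":
    # Two pixels: the first agrees with the text prior, the second opposes it.
    features = np.array([[[1.0, -1.0]], [[0.0, 0.0]]])   # (C=2, H=1, W=2)
    prior = np.array([1.0, 0.0])
    act = semantic_activation(features, prior)
    mask = assemble_mask(np.ones((1, 1, 2)), [5.0], act)
    print(act, mask)
```

In this toy case the raw prototype response is identical at both pixels, and the activation map alone suppresses the pixel that disagrees with the text prior. This mirrors suppressing visually similar backgrounds before the instance mask is finalized.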