It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective
Conference proceeding

Yunqian Wang, Xiaohong Li, Ruitao Feng, Yao Zhang, Yuekang Li and Zhiping Zhou
Proceedings 32nd Asia-Pacific Software Engineering Conference (APSEC), pp.947-956
32nd Asia-Pacific Software Engineering Conference (APSEC) (Macau, China, 02/12/2025–05/12/2025)
02/2026

Abstract

As software vulnerabilities pose an escalating threat to the security of modern software systems, a growing number of deep learning (DL)-based vulnerability detectors have been developed. However, their practical reliability, consistency in usage, and adaptability across diverse software contexts remain unclear. This uncertainty may lead to unreliable detection results in practice, increased false positives and false negatives, and limited adaptability to newly emerged vulnerabilities. A large-scale, in-depth analysis of DL-based vulnerability detectors can help uncover the critical factors that influence detection performance, improve the design and training of these models, and enhance their deployment in real-world scenarios. In this paper, we present VulTegra, a novel evaluation framework that, for the first time, conducts a multidimensional assessment comparing scratch-trained models and pre-trained-based models for vulnerability detection, while verifying key factors influencing detection performance. Our framework reveals that state-of-the-art (SOTA) detectors still suffer from low consistency, limited practical detection capabilities, and limited adaptability. Moreover, comparative results indicate that the increasingly favored pre-trained-based models are not universally superior to scratch-trained models; instead, they exhibit distinct strengths and application scenarios. Most importantly, our study highlights the limitations of relying solely on CWE-based classification and reveals a set of critical factors that significantly influence detection performance. Experimental validation shows that these factors have a substantial impact: modifying any single factor alone improved recall across all seven evaluated SOTA detectors, with six detectors also achieving higher F1 scores.
Our findings provide deep insights into model behavior, highlighting the need to consider both vulnerability types and inherent code features to ensure practical applicability in real-world software environments.
