Haifeng Shen

Professor, Faculty of Science and Engineering, Southern Cross University

Software engineering

Artificial intelligence

Human-centred computing

Software and application security

Journal article Peer reviewed

Automated detection of affected libraries from vulnerability reports

by Jinwei Xu, He Zhang, Xin Zhou, Yanjing Yang, Runfeng Mao, Xiaokang Li, Lanxin Yang and Haifeng Shen

Published 11/2025

Automated software engineering, 32, 2, 1 - 38

The growing reuse of third-party libraries in software supply chains increases the risk of being affected by the involved vulnerabilities. To strengthen software security, security vendors such as Snyk manage up-to-date vulnerability databases by associating reported vulnerabilities with their affected libraries, and contemporary digital organizations such as banking and software enterprises detect the third-party libraries they use if affected by these reported vulnerabilities. Existing studies focus on automating the detection process but make few efforts on detecting newly affected libraries, although new libraries (previously healthy) are constantly disclosed to be affected by new vulnerabilities. Moreover, existing studies do not seriously consider digital organizations’ concerns only about the libraries they use. In this paper, we propose an approach LibAlarm to address these challenges. We implement LibAlarm as a large language model-powered approach and compare it with the baseline approaches from multiple perspectives. Our experimental evaluation using 16,238 NVD reports indicates that LibAlarm improves the F1 by over 14% compared with baselines and detects over 40% newly affected libraries. For contemporary digital organizations, LibAlarm performs better than the baseline approaches with the F1 above 70% and the reduced false alarm ratio to 20%. Our case analysis using 540 NVD reports and 20 projects from Microsoft and Google demonstrates the effectiveness of LibAlarm. These results indicate that LibAlarm can help security vendors and digital organizations detect affected libraries from vulnerability reports.

Conference proceeding Open access

AUCAD: Automated Construction of Alignment Dataset from Log-Related Issues for Enhancing LLM-based Log Generation

by Hao Zhang, Dongjun Yu, Lei Zhang, Guoping Rong, Yongda Yu, Haifeng Shen, He Zhang, Dong Shao and Hongyu Kuang

Published 27/10/2025

Proceedings of the 16th International Conference on Internetware (2025), 413 - 425

Internetware 2025: the 16th International Conference on Internetware, 20/06/2025–22/06/2025, Trondheim, Norway

Log statements have become an integral part of modern software systems. Prior research efforts have focused on supporting the decisions of placing log statements, such as where/what to log. With the increasing adoption of Large Language Models (LLMs) for code-related tasks such as code completion or generation, automated approaches for generating log statements have gained much momentum. However, the performance of these approaches still has a long way to go. This paper explores enhancing the performance of LLM-based solutions for automated log statement generation by post-training LLMs with a purpose-built dataset. Thus the primary contribution is a novel approach called AUCAD, which automatically constructs such a dataset with information extracted from log-related issues. Researchers have long noticed that a significant portion of the issues in the open-source community are related to log statements. However, distilling this portion of data requires manual efforts, which is labor-intensive and costly, rendering it impractical. Utilizing our approach, we automatically extract log-related issues from 1,537 entries of log data across 88 projects and identify 808 code snippets (i.e., methods) with retrievable source code both before and after modification of each issue (including log statements) to construct a dataset. Each entry in the dataset consists of a data pair representing high-quality and problematic log statements, respectively. With this dataset, we proceed to post-train multiple LLMs (primarily from the Llama series) for automated log statement generation. Both human and experimental evaluations indicate that these models significantly outperform existing LLM-based solutions, thereby validating the efficacy of our method for constructing a post-training dataset to enhance LLM-based log statement generation.

Journal article

DeepMetaIoT: A Multimodal Deep Learning Framework Harnessing Metadata for IoT Sensor Data Classification

by Muhammad Sakib Khan Inan, Kewen Liao, Haifeng Shen, Prem Prakash Jayaraman, Federico Montori and Dimitrios Georgakopoulos

Published 15/10/2025

IEEE internet of things journal, 12, 20, 42352 - 42363

Internet of Things (IoT) sensor data, which capture time series physical measurements such as temperature and humidity, often lack proper classification. This limits their effective understanding, integration, and reuse. While sensor metadata—textual descriptions of the measurements—is sometimes available, it is frequently incomplete or ambiguous. As a result, classification often depends solely on the time series data. Leveraging both time series sensor readings and textual metadata for automated and accurate classification remains a challenge due to the heterogeneity and inconsistency of these data sources. In this paper, we propose DeepMetaIoT, a multimodal deep learning framework that integrates time series and textual data for classification. DeepMetaIoT employs a cross-residual architecture comprising a time series encoder and a text encoder based on a pre-trained large language model, enabling effective fusion of both modalities. Experimental results on real-world IoT sensor datasets show that DeepMetaIoT consistently outperforms state-of-the-art machine learning and deep learning baselines.

Journal article Open access Peer reviewed

Fine-Tuning Large Language Models to Improve Accuracy and Comprehensibility of Automated Code Review

by Yongda Yu, Guoping Rong, Haifeng Shen, He Zhang, Dong Shao, Min Wang, Zhao Wei, Yong Xu and Juhong Wang

Published 31/01/2025

ACM transactions on software engineering and methodology, 34, 1, 1 - 26

As code review is a tedious and costly software quality practice, researchers have proposed several machine learning-based methods to automate the process. The primary focus has been on accuracy, that is, how accurately the algorithms are able to detect issues in the code under review. However, human intervention still remains inevitable since results produced by automated code review are not 100% correct. To assist human reviewers in making their final decisions on automatically generated review comments, the comprehensibility of the comments underpinned by accurate localization and relevant explanations for the detected issues with repair suggestions is paramount. However, this has largely been neglected in the existing research. Large language models (LLMs) have the potential to generate code review comments that are more readable and comprehensible by humans thanks to their remarkable processing and reasoning capabilities. However, even mainstream LLMs perform poorly in detecting the presence of code issues because they have not been specifically trained for this binary classification task required in code review. In this paper, we contribute Carllm (Comprehensibility of Automated Code Review using Large Language Models), a novel fine-tuned LLM that has the ability to improve not only the accuracy but, more importantly, the comprehensibility of automated code review, as compared to state-of-the-art pre-trained models and general LLMs.

Conference proceeding Peer reviewed

Code Comment Inconsistency Detection and Rectification Using a Large Language Model

by Guoping Rong, Yongda Yu, Song Liu, Xin Tan, Tianyi Zhang, Haifeng Shen and Jidong Hu

Published 2025

Proceedings from 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), 443

International Conference on Software Engineering, 27/04/2025–03/05/2025, Ottawa, Ontario, Canada

Comments are widely used in source code. If a comment is consistent with the code snippet it intends to annotate, it would aid code comprehension. Otherwise, Code Comment Inconsistency (CCI) is not only detrimental to the understanding of code, but more importantly, it would negatively impact the development, testing, and maintenance of software. To tackle this issue, existing research has been primarily focused on detecting inconsistencies with varied performance. It is evident that detection alone does not solve the problem; it merely paves the way for solving it. A complete solution requires detecting inconsistencies and, more importantly, rectifying them by amending comments. However, this type of work is scarce. In this paper, we contribute C4RLLaMA, a fine-tuned large language model based on the open-source CodeLLaMA. It not only has the ability to rectify inconsistencies by correcting relevant comment content but also outperforms state-of-the-art approaches in detecting inconsistencies. Experiments with various datasets confirm that C4RLLaMA consistently surpasses both post hoc and just-in-time CCI detection approaches. More importantly, C4RLLaMA outperforms substantially the only known CCI rectification approach in terms of multiple performance metrics. To further examine C4RLLaMA's efficacy in rectifying inconsistencies, we conducted a manual evaluation, and the results showed that the percentage of correct comment updates by C4RLLaMA was 65.0% and 55.9% in just-in-time and post hoc, respectively, implying C4RLLaMA's real potential in practical use.

Journal article Peer reviewed

DLAP: A Deep Learning Augmented Large Language Model Prompting framework for software vulnerability detection

by Yanjing Yang, Xin Zhou, Runfeng Mao, Jinwei Xu, Lanxin Yang, Yu Zhang, Haifeng Shen and He Zhang

Published 01/2025

The Journal of systems and software, 219, 112234

Software vulnerability detection is generally supported by automated static analysis tools, which have recently been reinforced by deep learning (DL) models. However, despite the superior performance of DL-based approaches over rule-based ones in research, applying DL approaches to software vulnerability detection in practice remains a challenge. This is due to the complex structure of source code, the black-box nature of DL, and the extensive domain knowledge required to understand and validate the black-box results for addressing tasks after detection. Conventional DL models are trained by specific projects and, hence, excel in identifying vulnerabilities in these projects but not in others. These models with poor performance in vulnerability detection would impact the downstream tasks such as location and repair. More importantly, these models do not provide explanations for developers to comprehend detection results. In contrast, Large Language Models (LLMs) with prompting techniques achieve stable performance across projects and provide explanations for results. However, using existing prompting techniques, the detection performance of LLMs is relatively low and cannot be used for real-world vulnerability detections. This paper contributes DLAP, a Deep Learning Augmented LLMs Prompting framework that combines the best of both DL models and LLMs to achieve exceptional vulnerability detection performance. Experimental evaluation results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts, as well as fine-tuning on multiple metrics.

Journal article Peer reviewed

KRIOTA: A framework for Knowledge-management of dynamic Reference Information and Optimal Task Assignment in hybrid edge–cloud environments to support situation-aware robot-assisted operations

by Muhammad Aufeef Chauhan, Muhammad Ali Babar and Haifeng Shen

Published 11/2024

Future generation computer systems, 160, 489 - 504

Enabling an autonomous robotic system (ARS) to be aware of its operating environment can equip the system to deal with unknown and uncertain situations. While several conceptual models have been proposed to establish the fundamental concepts of situational awareness, it remains a challenge to make an ARS situation aware, in particular using a combination of low-cost resource-constraint robots at the tactical edge and powerful remote cloud nodes. This paper proposes a dynamic reference information (DRI) based knowledge management and optimal task assignment framework that manages knowledge extracted from DRI to assess the current situation as per given mission objectives and assigns tasks to different computing nodes, which include a combination of edge robots, edge computing nodes and cloud-hosted services. The proposed framework is referred to as KRIOTA. The framework has been designed using an architecture-centric approach. We have designed ontologies to classify and structure different elements of DRI hierarchically and associate the processing components of an ARS with the DRI. We have devised algorithms for the ARS to optimally assign tasks to relevant processing components on robots, edge computing nodes and cloud-hosted services for adaptive behaviour. We have evaluated the framework by demonstrating its implementation in a testbed named RoboPatrol. We have also demonstrated the performance, effectiveness and feasibility of the KRIOTA framework.

Conference proceeding Peer reviewed

DeepHeteroIoT: Deep Local and Global Learning over Heterogeneous IoT Sensor Data

by Muhammad Sakib Khan Inan, Kewen Liao, Haifeng Shen, Prem Prakash Jayaraman, Dimitrios Georgakopoulos and Ming Jian Tang

Published 19/07/2024

Mobile and Ubiquitous Systems: Computing, Networking and Services: 20th EAI International Conference, MobiQuitous 2023, Melbourne, VIC, Australia, November 14–17, 2023, Proceedings, Part I, 593, 119 - 135

EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking, and Services (MobiQuitous), 14/11/2023–17/11/2023, Melbourne, Australia

Internet of Things (IoT) sensor data or readings evince variations in timestamp range, sampling frequency, geographical location, unit of measurement, etc. Such presented sequence data heterogeneity makes it difficult for traditional time series classification algorithms to perform well. Therefore, addressing the heterogeneity challenge demands learning not only the sub-patterns (local features) but also the overall pattern (global feature). To address the challenge of classifying heterogeneous IoT sensor data (e.g., categorizing sensor data types like temperature and humidity), we propose a novel deep learning model that incorporates both Convolutional Neural Network and Bi-directional Gated Recurrent Unit to learn local and global features respectively, in an end-to-end manner. Through rigorous experimentation on heterogeneous IoT sensor datasets, we validate the effectiveness of our proposed model, which outperforms recent state-of-the-art classification methods as well as several machine learning and deep learning baselines. In particular, the model achieves an average absolute improvement of 3.37% in Accuracy and 2.85% in F1-Score across datasets.

Journal article Peer reviewed

Distilling Quality Enhancing Comments From Code Reviews to Underpin Reviewer Recommendation

by Guoping Rong, Yongda Yu, Yifan Zhang, He Zhang, Haifeng Shen, Dong Shao, Hongyu Kuang, Min Wang, Zhao Wei and Yong Xu ... (11 authors)

Published 07/2024

IEEE Transactions on Software Engineering, 50, 7, 1658 - 1674

Code review is an important practice in software development. One of its main objectives is for the assurance of code quality. For this purpose, the efficacy of code review is subject to the credibility of reviewers, i.e., reviewers who have demonstrated strong evidence of previously making quality-enhancing comments are more credible than those who have not. Code reviewer recommendation (CRR) is designed to assist in recommending suitable reviewers for a specific objective and, in this context, assurance of code quality. Its performance is susceptible to the relevance of its training dataset to this objective, composed of all reviewers’ historical review comments, which, however, often contains a plethora of comments that are irrelevant to the enhancement of code quality. Furthermore, recommendation accuracy has been adopted as the sole metric to evaluate a recommender's performance, which is inadequate as it does not take reviewers’ relevant credibility into consideration. These two issues form the ground truth problem in CRR as they both originate from the relevance of dataset used to train and evaluate CRR algorithms. To tackle this problem, we first propose the concept of Quality-Enhancing Review Comments (QERC), which includes three types of comments - change-triggering inline comments, informative general comments, and approve-to-merge comments. We then devise a set of algorithms and procedures to obtain a distilled dataset by applying QERC to the original dataset. We finally introduce a new metric – reviewer's credibility for quality enhancement (RCQE) – as a complementary metric to recommendation accuracy for evaluating the performance of recommenders. To validate the proposed QERC-based approach to CRR, we conduct empirical studies using real data from seven projects containing over 82K pull requests and 346K review comments. Results show that: (a) QERC can effectively address the ground truth problem by distilling quality-enhancing comments from the dataset containing original code reviews, (b) QERC can assist recommenders in finding highly credible reviewers at a slight cost of recommendation accuracy, and (c) even “wrong” recommendations using the distilled dataset are likely to be more credible than those using the original dataset.

Journal article Peer reviewed

Hamstring Strain Injury Risk Factors in Australian Football Change over the Course of the Season

by Aylwin Sim, Ryan G. Timmins, Joshua D. Ruddy, Haifeng Shen, Kewen Liao, Nirav Maniar, Jack T. Hickey, Morgan D. Williams and David A. Opar

Published 02/2024

Medicine and science in sports and exercise, 56, 2, 297 - 306

Background/aim
This study aimed to determine which factors were most predictive of hamstring strain injury (HSI) during different stages of the competition in professional Australian Football.

Methods
Across two competitive seasons, eccentric knee flexor strength and biceps femoris long head architecture of 311 Australian Football players (455 player seasons) were assessed at the start and end of preseason and in the middle of the competitive season. Details of any prospective HSI were collated by medical staff of participating teams. Multiple logistic regression models were built to identify important risk factors for HSI at the different time points across the season.

Results
There were 16, 33, and 21 new HSIs reported in preseason, early in-season, and late in-season, respectively, across two competitive seasons. Multivariate logistic regression and recursive feature selection revealed that risk factors were different for preseason, early in-season, and late in-season HSIs. A combination of previous HSI, age, height, and muscle thickness were most associated with preseason injuries (median area under the curve [AUC], 0.83). Pennation angle and fascicle length had the strongest association with early in-season injuries (median AUC, 0.86). None of the input variables were associated with late in-season injuries (median AUC, 0.46). The identification of early in-season HSI and late in-season HSI was not improved by the magnitude of change of data across preseason (median AUC, 0.67).

Conclusions
Risk factors associated with prospective HSI were different across the season in Australian Rules Football, with nonmodifiable factors (previous HSI, age, and height) mostly associated with preseason injuries. Early in-season HSI were associated with modifiable factors, notably biceps femoris long head architectural measures. The prediction of in-season HSI was not improved by assessing the magnitude of change in data across preseason.
