Output list
Journal article
Automated detection of affected libraries from vulnerability reports
Published 11/2025
Automated software engineering, 32, 2, 1 - 38
The growing reuse of third-party libraries in software supply chains increases the risk of being affected by the vulnerabilities involved. To strengthen software security, security vendors such as Snyk maintain up-to-date vulnerability databases by associating reported vulnerabilities with their affected libraries, and contemporary digital organizations such as banking and software enterprises check whether the third-party libraries they use are affected by these reported vulnerabilities. Existing studies focus on automating the detection process but make little effort to detect newly affected libraries, although new (previously healthy) libraries are constantly disclosed to be affected by new vulnerabilities. Moreover, existing studies do not seriously consider that digital organizations are concerned only with the libraries they use. In this paper, we propose an approach, LibAlarm, to address these challenges. We implement LibAlarm as a large language model-powered approach and compare it with the baseline approaches from multiple perspectives. Our experimental evaluation using 16,238 NVD reports indicates that LibAlarm improves the F1 by over 14% compared with baselines and detects over 40% of newly affected libraries. For contemporary digital organizations, LibAlarm performs better than the baseline approaches, with an F1 above 70% and a false alarm ratio reduced to 20%. Our case analysis using 540 NVD reports and 20 projects from Microsoft and Google demonstrates the effectiveness of LibAlarm. These results indicate that LibAlarm can help security vendors and digital organizations detect affected libraries from vulnerability reports.
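For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how F1 and a false alarm ratio can be computed from raw detection counts. The function names and the counts are hypothetical, and the false alarm ratio is assumed here to mean the share of raised alarms that are wrong (FP / (TP + FP)); the paper may define it differently.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def false_alarm_ratio(tp: int, fp: int) -> float:
    """Share of raised alarms that are wrong (assumed definition: FP / (TP + FP))."""
    alarms = tp + fp
    return fp / alarms if alarms else 0.0


# Hypothetical counts, not taken from the paper:
# 80 affected libraries correctly flagged, 20 flagged wrongly, 40 missed.
print(round(f1_score(tp=80, fp=20, fn=40), 3))
print(false_alarm_ratio(tp=80, fp=20))
```

With these illustrative counts, precision is 0.8 and recall is about 0.667, so F1 lands a little above 0.72 while one alarm in five is false.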
Journal article
Published 15/10/2025
IEEE internet of things journal, 12, 20, 42352 - 42363
Internet of Things (IoT) sensor data, which capture time series physical measurements such as temperature and humidity, often lack proper classification. This limits their effective understanding, integration, and reuse. While sensor metadata—textual descriptions of the measurements—is sometimes available, it is frequently incomplete or ambiguous. As a result, classification often depends solely on the time series data. Leveraging both time series sensor readings and textual metadata for automated and accurate classification remains a challenge due to the heterogeneity and inconsistency of these data sources. In this paper, we propose DeepMetaIoT, a multimodal deep learning framework that integrates time series and textual data for classification. DeepMetaIoT employs a cross-residual architecture comprising a time series encoder and a text encoder based on a pre-trained large language model, enabling effective fusion of both modalities. Experimental results on real-world IoT sensor datasets show that DeepMetaIoT consistently outperforms state-of-the-art machine learning and deep learning baselines.
Journal article
Fine-Tuning Large Language Models to Improve Accuracy and Comprehensibility of Automated Code Review
Published 31/01/2025
ACM transactions on software engineering and methodology, 34, 1, 1 - 26
As code review is a tedious and costly software quality practice, researchers have proposed several machine learning-based methods to automate the process. The primary focus has been on accuracy, that is, how accurately the algorithms are able to detect issues in the code under review. However, human intervention still remains inevitable since results produced by automated code review are not 100% correct. To assist human reviewers in making their final decisions on automatically generated review comments, the comprehensibility of the comments underpinned by accurate localization and relevant explanations for the detected issues with repair suggestions is paramount. However, this has largely been neglected in the existing research. Large language models (LLMs) have the potential to generate code review comments that are more readable and comprehensible by humans thanks to their remarkable processing and reasoning capabilities. However, even mainstream LLMs perform poorly in detecting the presence of code issues because they have not been specifically trained for this binary classification task required in code review. In this paper, we contribute Carllm (Comprehensibility of Automated Code Review using Large Language Models), a novel fine-tuned LLM that has the ability to improve not only the accuracy but, more importantly, the comprehensibility of automated code review, as compared to state-of-the-art pre-trained models and general LLMs.
Journal article
Published 01/2025
The Journal of systems and software, 219, 112234
Software vulnerability detection is generally supported by automated static analysis tools, which have recently been reinforced by deep learning (DL) models. However, despite the superior performance of DL-based approaches over rule-based ones in research, applying DL approaches to software vulnerability detection in practice remains a challenge. This is due to the complex structure of source code, the black-box nature of DL, and the extensive domain knowledge required to understand and validate the black-box results for addressing tasks after detection. Conventional DL models are trained on specific projects and, hence, excel in identifying vulnerabilities in these projects but not in others. The poor performance of these models in vulnerability detection would impact downstream tasks such as location and repair. More importantly, these models do not provide explanations for developers to comprehend detection results. In contrast, Large Language Models (LLMs) with prompting techniques achieve stable performance across projects and provide explanations for results. However, using existing prompting techniques, the detection performance of LLMs is relatively low and cannot be used for real-world vulnerability detection. This paper contributes DLAP, a Deep Learning Augmented LLMs Prompting framework that combines the best of both DL models and LLMs to achieve exceptional vulnerability detection performance. Experimental evaluation results confirm that DLAP outperforms state-of-the-art prompting frameworks, including role-based prompts, auxiliary information prompts, chain-of-thought prompts, and in-context learning prompts, as well as fine-tuning on multiple metrics.
Journal article
Published 11/2024
Future generation computer systems, 160, 489 - 504
Enabling an autonomous robotic system (ARS) to be aware of its operating environment can equip the system to deal with unknown and uncertain situations. While several conceptual models have been proposed to establish the fundamental concepts of situational awareness, it remains a challenge to make an ARS situation-aware, in particular using a combination of low-cost resource-constrained robots at the tactical edge and powerful remote cloud nodes. This paper proposes a dynamic reference information (DRI) based knowledge management and optimal task assignment framework that manages knowledge extracted from DRI to assess the current situation as per given mission objectives and assigns tasks to different computing nodes, which include a combination of edge robots, edge computing nodes and cloud-hosted services. The proposed framework is referred to as KRIOTA. The framework has been designed using an architecture-centric approach. We have designed ontologies to classify and structure different elements of DRI hierarchically and associate the processing components of an ARS with the DRI. We have devised algorithms for the ARS to optimally assign tasks to relevant processing components on robots, edge computing nodes and cloud-hosted services for adaptive behaviour. We have evaluated the framework by demonstrating its implementation in a testbed named RoboPatrol. We have also demonstrated the performance, effectiveness and feasibility of the KRIOTA framework.
Journal article
Distilling Quality Enhancing Comments From Code Reviews to Underpin Reviewer Recommendation
Published 07/2024
IEEE Transactions on Software Engineering, 50, 7, 1658 - 1674
Code review is an important practice in software development. One of its main objectives is the assurance of code quality. For this purpose, the efficacy of code review is subject to the credibility of reviewers, i.e., reviewers who have demonstrated strong evidence of previously making quality-enhancing comments are more credible than those who have not. Code reviewer recommendation (CRR) is designed to assist in recommending suitable reviewers for a specific objective, in this context the assurance of code quality. Its performance is susceptible to the relevance of its training dataset to this objective. The dataset is composed of all reviewers’ historical review comments and often contains a plethora of comments that are irrelevant to the enhancement of code quality. Furthermore, recommendation accuracy has been adopted as the sole metric to evaluate a recommender's performance, which is inadequate as it does not take reviewers’ relevant credibility into consideration. These two issues form the ground truth problem in CRR as they both originate from the relevance of the dataset used to train and evaluate CRR algorithms. To tackle this problem, we first propose the concept of Quality-Enhancing Review Comments (QERC), which includes three types of comments: change-triggering inline comments, informative general comments, and approve-to-merge comments. We then devise a set of algorithms and procedures to obtain a distilled dataset by applying QERC to the original dataset. We finally introduce a new metric – reviewer's credibility for quality enhancement (RCQE) – as a complementary metric to recommendation accuracy for evaluating the performance of recommenders. To validate the proposed QERC-based approach to CRR, we conduct empirical studies using real data from seven projects containing over 82K pull requests and 346K review comments.
Results show that: (a) QERC can effectively address the ground truth problem by distilling quality-enhancing comments from the dataset containing original code reviews, (b) QERC can assist recommenders in finding highly credible reviewers at a slight cost of recommendation accuracy, and (c) even “wrong” recommendations using the distilled dataset are likely to be more credible than those using the original dataset.
Journal article
Hamstring Strain Injury Risk Factors in Australian Football Change over the Course of the Season
Published 02/2024
Medicine and science in sports and exercise, 56, 2, 297 - 306
Background/aim
This study aimed to determine which factors were most predictive of hamstring strain injury (HSI) during different stages of the competition in professional Australian Football.
Methods
Across two competitive seasons, eccentric knee flexor strength and biceps femoris long head architecture of 311 Australian Football players (455 player seasons) were assessed at the start and end of preseason and in the middle of the competitive season. Details of any prospective HSI were collated by medical staff of participating teams. Multiple logistic regression models were built to identify important risk factors for HSI at the different time points across the season.
Results
There were 16, 33, and 21 new HSIs reported in preseason, early in-season, and late in-season, respectively, across two competitive seasons. Multivariate logistic regression and recursive feature selection revealed that risk factors were different for preseason, early in-season, and late in-season HSIs. A combination of previous HSI, age, height, and muscle thickness was most associated with preseason injuries (median area under the curve [AUC], 0.83). Pennation angle and fascicle length had the strongest association with early in-season injuries (median AUC, 0.86). None of the input variables were associated with late in-season injuries (median AUC, 0.46). The identification of early in-season HSI and late in-season HSI was not improved by the magnitude of change of data across preseason (median AUC, 0.67).
Conclusions
Risk factors associated with prospective HSI were different across the season in Australian Rules Football, with nonmodifiable factors (previous HSI, age, and height) mostly associated with preseason injuries. Early in-season HSI were associated with modifiable factors, notably biceps femoris long head architectural measures. The prediction of in-season HSI was not improved by assessing the magnitude of change in data across preseason.
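As a reading aid for the AUC values reported in this abstract, the sketch below computes AUC as the probability that a randomly chosen injured player receives a higher risk score than a randomly chosen uninjured player, with ties counted as half. An AUC near 0.5 therefore means the score separates the two groups no better than chance. The `auc` function and the sample data are illustrative assumptions; they are not the study's models or measurements.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    labels: 1 for an injured player, 0 for an uninjured player.
    scores: predicted risk, higher meaning more likely to be injured.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # injured player ranked above uninjured player
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up risk scores for five players, two of whom were injured.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]
print(round(auc(labels, scores), 3))
```

The pairwise form is quadratic in the number of players, which is fine for cohorts of a few hundred; a rank-based formulation gives the same value more efficiently on larger data.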
Journal article
Published 12/2023
IEEE transactions on software engineering, 49, 12, 5223 - 5249
Continuous Integration (CI) enables developers to detect defects early and thus reduce lead time. However, the high frequency and long duration of executing CI have a detrimental effect on this practice. Existing studies have focused on using CI outcome predictors to reduce frequency. Since there is no reported project using predictive CI, it is difficult to evaluate its economic impact. This research aims to investigate predictive CI from a process perspective, including why and when to adopt predictors, which predictors to use, and how to practice predictive CI in real projects. We innovatively employ Software Process Simulation to simulate a predictive CI process with a Discrete-Event Simulation (DES) model and conduct simulation-based experiments. We develop the Rollback-based Identification of Defective Commits (RIDEC) method to account for the negative effects of false predictions in simulations. Experimental results show that: 1) using predictive CI generally improves the effectiveness of CI, reducing time costs by up to 36.8% and the average waiting time before executing CI by 90.5%; 2) the time savings vary across projects, with higher commit frequency projects benefiting more; and 3) predictor performance does not strongly correlate with time savings, but the precision of both failed and passed predictions deserves more attention. Simulation-based evaluation helps identify overlooked aspects in existing research. Predictive CI saves time and resources, but improved prediction performance has limited cost-saving benefits. The primary value of predictive CI lies in providing accurate and quick feedback to developers, aligning with the goal of CI.
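The trade-off described in this abstract (skipped builds save time, but a wrong "pass" prediction has a cost) can be illustrated with a toy cost model. This is a deliberately simplified sketch, not the paper's DES model or the RIDEC method: the `Commit` class, the `ci_time` function, the build and rework durations, and the sample commits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    will_fail: bool       # true build outcome (not known in advance in practice)
    predicted_fail: bool  # output of a hypothetical CI outcome predictor


def ci_time(commits, build_minutes, rework_minutes, use_predictor):
    """Total minutes spent on CI under a simple 'skip if predicted to pass' policy."""
    total = 0.0
    for c in commits:
        if not use_predictor or c.predicted_fail:
            total += build_minutes  # the build is executed
        elif c.will_fail:
            # A skipped build hid a real failure: assume a later build
            # plus extra rework to identify the defective commit.
            total += build_minutes + rework_minutes
        # Otherwise the build was correctly skipped and costs nothing.
    return total


commits = (
    [Commit(will_fail=False, predicted_fail=False)] * 7   # correctly skipped
    + [Commit(will_fail=True, predicted_fail=True)] * 2   # correctly built
    + [Commit(will_fail=True, predicted_fail=False)]      # missed failure
)
baseline = ci_time(commits, build_minutes=10, rework_minutes=15, use_predictor=False)
predictive = ci_time(commits, build_minutes=10, rework_minutes=15, use_predictor=True)
print(baseline, predictive)
```

Even in this toy setting, the saving depends on how often "pass" predictions are wrong and how costly the rework is, which is consistent with the abstract's emphasis on the precision of predictions rather than overall predictor performance.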
Journal article
Revisit security in the era of DevOps: An evidence-based inquiry into DevSecOps industry
Published 08/2023
IET software, 17, 4, 435 - 454
By adopting agile and lean practices, DevOps aims to achieve rapid value delivery by speeding up development and deployment cycles, which, however, leads to more security concerns that cannot be fully addressed by an isolated security role only in the final stage of development. DevSecOps promotes security as a shared responsibility integrated into the DevOps process that seamlessly intertwines development, operations, and security from the start to the end of cycles. While some companies have already begun to embrace this new strategy, both industry and academia are still seeking a common understanding of the DevSecOps movement. The goal of this study is to report the state-of-the-practice of DevSecOps, including the impact of DevOps on security, practitioners' understanding of DevSecOps, and the practices associated with DevSecOps as well as the challenges of implementing DevSecOps. The authors used a mixed-methods approach for this research. The authors carried out a grey literature review on DevSecOps, and surveyed the practitioners of DevSecOps in industry in China. The status quo of DevSecOps in industry is summarized. Three major software security risks are identified with DevOps, where the establishment of a DevOps pipeline provides opportunities for security-related activities. The authors classify the interpretations of DevSecOps into three core aspects: DevSecOps capabilities, cultural enablers, and technological enablers. To materialise the interpretations into daily software production activities, the authors recommend DevSecOps practices from three perspectives: people, process, and technology. Although a preliminary consensus is that DevSecOps is regarded as an extension of DevOps, there is a debate on whether DevSecOps is a superfluous term. While DevSecOps is attracting increasing attention from industry, it is still in its infancy and more effort needs to be invested to promote it in both research and industry communities.
Journal article
Published 01/05/2023
IEEE transactions on software engineering, 49, 5, 3071 - 3088
The microservice architecture has been commonly adopted by large scale software systems exemplified by a wide range of online services. Service monitoring through anomaly detection and root cause analysis (RCA) is crucial for these microservice systems to provide stable and continued services. However, compared with monolithic systems, software systems based on the layered microservice architecture are inherently complex and commonly involve entities at different levels of granularity. Therefore, for effective service monitoring, these systems have a special requirement of multi-granular RCA. Furthermore, as a large proportion of anomalies in microservice systems pertain to problematic code, to timely troubleshoot these anomalies, these systems have another special requirement of RCA at the finest code-level. Microservice systems rely on telemetry data to perform service monitoring and RCA of service anomalies. The majority of existing RCA approaches are only based on a single type of telemetry data and as a result can only support uni-granular RCA at either application-level or service-level. Although there are attempts to combine metric and tracing data in RCA, their objective is to improve RCA's efficiency or accuracy rather than to support multi-granular RCA. In this article, we propose a new RCA solution TrinityRCL that is able to localize the root causes of anomalies at multiple levels of granularity including application-level, service-level, host-level, and metric-level, with the unique capability of code-level localization by harnessing all three types of telemetry data to construct a causal graph representing the intricate, dynamic, and nondeterministic relationships among the various entities related to the anomalies. 
By implementing and deploying TrinityRCL in a real production environment, we evaluate TrinityRCL against two baseline methods and the results show that TrinityRCL has a significant performance advantage in terms of accuracy at the same level of granularity with comparable efficiency and is particularly effective in supporting large-scale systems with massive telemetry data.