Biography and expertise
Biography
Ruitao FENG is a lecturer at the Faculty of Science and Engineering, Southern Cross University. He is driven by curiosity and passion, with a love for music, movies, food, and travel. He is eager to collaborate with brilliant, ambitious students who dream big. Let's work together to make those dreams a reality!
He is a member of SCU's Research Clusters:
- ZeroWaste
- Harvest to Health
His work contributes to the UN Sustainable Development Goals.
Research
His research centres on security and quality assurance in software-enabled systems, particularly AI4Sec & SE. This encompasses learning-based intrusion/anomaly detection, malicious behaviour recognition for malware, and code vulnerability detection. His previous research outcomes have appeared primarily in top conferences and journals (CORE A/A*, CCF A) in Computer Security and Software Engineering.
Teaching
INFO6002 Cybersecurity Essentials
ISYS1003 Cybersecurity Management
ISYS6002 Advanced Cybersecurity
COMP2004 Systems Security and Operation
DATA2001 Database Systems
ISYS3001 Managing Software Development (UA)
PROG6001 Managing Software Development Projects (UA)
Other
He is also looking for self-motivated Ph.D. students with strong programming skills and relevant experience. Please email your CV and cover letter if you are interested. SCU undergraduate and master's students who wish to undertake an honours or minor thesis are also welcome to reach out.
Highlights - Output
Conference proceeding
CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift
Published 19/06/2025
Proceedings of the ACM on software engineering, 2, FSE, 1687 - 1709
With the rapid advancement of cloud-native computing, securing cloud environments has become an important task. Log-based Anomaly Detection (LAD) is the most representative technique used in different systems for attack detection and safety guarantee, where multiple LAD methods and relevant datasets have been proposed. However, even though some of these datasets are specifically prepared for cloud systems, they only cover limited cloud behaviors and lack information from a whole-system perspective. Another critical issue to consider is normality shift, which implies that the test distribution could differ from the training distribution and highly affect the performance of LAD. Unfortunately, existing works only focus on simple shift types such as chronological changes, while other cloud-specific shift types are ignored, e.g., different deployed cloud architectures. Therefore, a dataset that captures diverse cloud system behaviors and various types of normality shifts is essential.
To fill this gap, we construct a dataset CAShift to evaluate the performance of LAD in cloud, which considers different roles of software in cloud systems, supports three real-world normality shift types (application shift, version shift, and cloud architecture shift), and features 20 different attack scenarios in various cloud system components. Based on CAShift, we conduct a comprehensive empirical study to investigate the effectiveness of existing LAD methods in normality shift scenarios. Additionally, to explore the feasibility of shift adaptation, we further investigate three continuous learning approaches, which are the most common methods to mitigate the impact of distribution shift. Results demonstrated that 1) all LAD methods suffer from normality shift where the performance drops up to 34%, and 2) existing continuous learning methods are promising to address shift drawbacks, but the ratio of data used for model retraining and the selection of algorithms highly affect the shift adaptation, with an increase in the F1-Score of up to 27%. Based on our findings, we offer valuable implications for future research in designing more robust LAD models and methods for LAD shift adaptation.
Journal article
MedExChain: Enabling Secure and Efficient PHR Sharing Across Heterogeneous Blockchains
First online publication 12/06/2025
IEEE internet of things journal, First online, 16, 1 - 1
With the proliferation of intelligent healthcare systems, patients' Personal Health Records (PHR) generated by the Internet of Medical Things (IoMT) in real time play a vital role in disease diagnosis. The integration of emerging blockchain technologies has significantly enhanced data security inside intelligent medical systems. However, data sharing across different systems based on varied blockchain architectures is still constrained by unsolved performance and security challenges. This paper constructs a cross-chain data sharing scheme, termed MedExChain, which aims to securely share PHR across heterogeneous blockchain systems. The MedExChain scheme ensures that PHR can be shared across chains even under the performance limitations of IoMT devices. Additionally, the scheme incorporates a Cryptographic Reverse Firewall (CRF) and a blockchain audit mechanism to defend against both internal and external security threats. The robustness of our scheme is validated through BAN logic, the Scyther tool, and security analysis against Chosen Plaintext Attack (CPA) and Algorithm Substitution Attack (ASA). Extensive evaluations demonstrate that MedExChain significantly reduces computation and communication overhead, making it suitable for IoMT devices and fostering the efficient circulation of PHR across diverse blockchain systems.
Journal article
First online publication 31/01/2025
ACM transactions on software engineering and methodology, First online, 1 - 38
Malware family labels and key features used for the decision-making of Android malware detection models fall short of precise comprehension of malicious behaviors due to their coarse granularity. To solve these problems, in this paper, we first introduce the concept of the malicious behavior trajectory (MBT) and propose an innovative approach called ProMal. ProMal aims to automatically generate malware descriptions with fine granularity through extracted MBTs from malware for users. Specifically, a labeled dataset of MBTs is constructed through substantial human efforts to build a behavioral knowledge graph (BxKG). The BxKG is scalable and can be automatically updated using two strategies to ensure its completeness and timeliness: 1) taking into consideration the evolution of Android SDKs, and 2) mining new MBTs by leveraging the widely-used malware datasets. We highlight that the knowledge graph is essential in ProMal, which can reason new MBTs based on existing MBTs because of its structured data representation and semantic relation modeling, and thus helps effectively extract real MBTs in Android malware. We evaluated ProMal on a recent malware dataset where researcher-crafted malware descriptions are available, and the Precision, Recall, and F1-Score of MBT identification based on BxKG reached 96.97%, 91.43%, and 0.94, respectively, outperforming the state-of-the-art approaches. Taking MBTs identified from Android malware as inputs, precise, fine-grained, and human-readable descriptions can be generated using the large language model, whose readability and usability are verified through a user study. The generated descriptions play a significant role in interpreting and comprehending malware behaviors.
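The abstract above reports precision and recall as percentages and the F1-Score as a fraction. As a quick consistency check, assuming the standard definition of F1 as the harmonic mean of precision and recall, the three figures agree. The function name `f1_score` below is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, converted from percentages to fractions.
precision = 0.9697
recall = 0.9143

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.94"
```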
Conference proceeding
Published 02/12/2024
Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, 4509 - 4523
CCS '24: ACM SIGSAC Conference on Computer and Communications Security, 14/10/2024–18/10/2024, Salt Lake City, Utah, United States
Deep learning (DL) based anomaly detection has shown great promise in the field of security due to its remarkable performance in various tasks. However, the issue of poor interpretability in DL models has significantly impeded their deployment in practical security applications. Despite the progress made in existing studies on DL explanations, the majority of them focus on providing local explanations for individual samples, neglecting the global understanding of the model knowledge. Furthermore, most explanations for supervised models fail to apply to anomaly detection due to their different learning mechanisms.
In this work, we address the gap in the existing research by proposing GEAD, a novel global explanation for DL-based anomaly detection, to extract high-fidelity rules from DL models. We apply GEAD to two security applications, network intrusion detection and system log anomaly detection, and demonstrate the efficacy with three usages: comparing model knowledge with expert knowledge, identifying knowledge discrepancies between models, and combining model and expert knowledge. We provide several case studies to showcase how GEAD can significantly enhance existing anomaly detection systems. Moreover, we provide a real-world deployment in a SCADA system to showcase the potential in practice. Some important insights are drawn to help the community understand and improve anomaly detection systems in security.
Journal article
Reinventing Multi-User Authentication Security From Cross-Chain Perspective
Published 18/09/2024
IEEE transactions on information forensics and security, 19, 8908 - 8923
Blockchain systems encompass many distinct and autonomous entities, each utilizing its own self-contained identity authentication algorithm. Unlike identity authentication within a singular blockchain, cross-chain scenarios demand special attention due to their pivotal role in enabling the acknowledgment of users' identities across diverse domains. This capability is the foundational prerequisite for the circulation of resources across different chains. Consequently, the central challenge for cross-chain systems lies in establishing mutual recognition and trust in users' digital identities. This paper proposes a Multi-User Proxy Re-Signature (MU-PRS) algorithm, facilitating the cross-chain conversion of signatures from multiple users. Concurrently, this paper proposes the Multi-Notary Signature Conversion (MN-SC) mechanism, designed to address the challenge posed by disparate system mechanisms across blockchains during cross-chain authentication. Leveraging the MU-PRS algorithm and MN-SC mechanism, we present a Multi-User Cross-Chain Authentication Scheme (MU-CCAS) within a heterogeneous blockchain environment. This scheme enables the verification of identities of multiple cross-chain users through a single signature verification. This innovative approach not only addresses the centralization issues inherent in third-party cross-chain authentication but also significantly enhances the efficiency of identity authentication. The evaluation results demonstrate MU-CCAS's superior security over existing solutions in three dimensions: BAN logic, Scyther verification, and security attribute analysis. Additionally, the evaluation establishes that MU-PRS and MU-CCAS have low computational overhead, are easy to implement, and excel in algorithm, scheme, and cross-chain performance. Overall, our work provides a robust and efficient framework for cross-chain authentication, addressing centralization challenges and enhancing digital security.
Journal article
Beyond Fidelity: Explaining Vulnerability Localization of Learning-Based Detectors
Published 06/2024
ACM transactions on software engineering and methodology, 33, 5, 1 - 33
Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. However, the shroud of opacity surrounding the decision-making process of these detectors makes their decisions difficult for security analysts to comprehend. To address this, various explanation approaches have been proposed to explain the predictions by highlighting important features, and these approaches have been shown to be effective in domains such as computer vision and natural language processing. Unfortunately, there is still a lack of in-depth evaluation of vulnerability-critical features, such as fine-grained vulnerability-related code lines, learned and understood by these explanation approaches. In this study, we first evaluate the performance of ten explanation approaches for vulnerability detectors based on graph and sequence representations, measured by two quantitative metrics: fidelity and vulnerability line coverage rate. Our results show that fidelity alone is insufficient for evaluating these approaches, as fidelity incurs significant fluctuations across different datasets and detectors. We subsequently check the precision of the vulnerability-related code lines reported by the explanation approaches, and find poor accuracy in this task among all of them. This can be attributed to the inefficiency of explainers in selecting important features and the presence of irrelevant artifacts learned by DL-based detectors.
Conference proceeding
TransRepair: Context-aware Program Repair for Compilation Errors
Published 05/01/2023
Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, ASE 2022, 1 - 13
ASE '22: 37th IEEE/ACM International Conference on Automated Software Engineering, 10/10/2022–14/10/2022, Rochester, MI, United States
Automatically fixing compilation errors can greatly raise the productivity of software development by guiding novice or AI programmers to write and debug code. Recently, learning-based program repair has gained extensive attention and become the state of the art in practice, but it still leaves plenty of room for improvement. In this paper, we propose an end-to-end solution, TransRepair, to locate the error lines and create the correct substitute for a C program simultaneously. Unlike its counterparts, our approach takes into account the context of the erroneous code and the diagnostic compilation feedback. We then devise a Transformer-based neural network to learn the ways of repair from the erroneous code as well as its context and the diagnostic feedback. To increase the effectiveness of TransRepair, we summarize 5 types and 74 fine-grained sub-types of compilation errors from two real-world program datasets and the Internet. A program corruption technique is then developed to synthesize a large dataset with 1,821,275 erroneous C programs. Through extensive experiments, we demonstrate that TransRepair outperforms the state of the art in both single repair accuracy and full repair accuracy. Further analysis sheds light on the strengths and weaknesses of contemporary solutions for future improvement.
Journal article
Multi-label Classification for Android Malware Based on Active Learning
Published 17/10/2022
IEEE transactions on dependable and secure computing, 1 - 18
The existing malware classification approaches (i.e., binary and family classification) can barely benefit subsequent analysis with their outputs. Even the family classification approaches suffer from the lack of a formal naming standard and an incomplete definition of malicious behaviors. More importantly, the existing approaches are powerless for a single malware sample with multiple malicious behaviors, even though this is a very common phenomenon for Android malware in the wild. As a result, neither kind of approach can provide researchers with a sufficiently direct and comprehensive understanding of malware. In this paper, we propose MLCDroid, an ML-based multi-label classification approach that can directly indicate the existence of pre-defined malicious behaviors. With an in-depth analysis, we summarize 6 basic malicious behaviors from real-world malware with security reports and construct a labeled dataset. We compare the results of 70 algorithm combinations to evaluate the effectiveness (best at 73.3%). Faced with the high cost of data annotation, we further propose an active learning approach based on data augmentation, which can improve the overall accuracy to 86.7% with a data augmentation of 5,000+ high-quality samples from an unlabeled malware dataset. This is the first multi-label Android malware classification approach intending to provide more information on fine-grained malicious behaviors.
Journal article
Deep Learning for Coverage-Guided Fuzzing: How Far are We?
Published 09/09/2022
IEEE transactions on dependable and secure computing, 1 - 13
Fuzzing is a widely used software vulnerability discovery technique, and many fuzzers are optimized using coverage feedback. Recently, some techniques propose to train deep learning (DL) models to predict the branch coverage of an arbitrary input and to use the models' always-available gradients as a guide. Those techniques have proved their success in improving coverage and discovering bugs under different experimental settings. However, DL models, usually treated as a magic black box, notoriously lack explainability. Moreover, their performance can be sensitive to the collected runtime coverage information for training, indicating potentially unstable performance. In this work, we conduct a systematic empirical study on 4 types of DL models across 6 projects to (1) revisit the performance of DL models on predicting branch coverage, (2) demystify what specific knowledge the models actually learn, (3) study the scenarios where the DL models can outperform and underperform the traditional fuzzers, and (4) gain insight into the challenges of applying DL models to fuzzing. Our empirical results reveal that existing DL-based fuzzers do not perform as well as expected, which is largely due to the dependencies between branches, unbalanced sample distribution, and limited model expressiveness. In addition, the estimated gradient information tends to be less helpful in our experiments. Finally, we further pinpoint the research directions based on our summarized challenges.
Journal article
Enhancing Security Patch Identification by Capturing Structures in Commits
Published 25/07/2022
IEEE transactions on dependable and secure computing, 1 - 15
With the rapidly increasing amount of open source software (OSS), the majority of the software vulnerabilities in open source components are fixed silently, which leaves the deployed software that integrates them unable to get a timely update. Hence, it is critical to design a security patch identification system to ensure the security of the utilized software. However, most of the existing works for security patch identification just treat the changed code and the commit message of a commit as a flat sequence of tokens and use simple neural networks to learn its semantics, while the structure information is ignored. To address these limitations, in this paper, we propose our approach E-SPI, which extracts the structure information hidden in a commit for effective identification. Specifically, it consists of a code change encoder, which extracts the syntactic structure of the changed code and uses a BiLSTM to learn the code representation, and a message encoder, which constructs the dependency graph for the commit message and uses a graph neural network (GNN) to learn the message representation. We further enhance the code change encoder by embedding contextual information related to the changed code. To demonstrate the effectiveness of our approach, we conduct extensive experiments against six state-of-the-art approaches on the existing dataset and in a real deployment environment. The experimental results confirm that our approach can significantly outperform current state-of-the-art baselines.