Output list
Preprint
Posted to a preprint site 09/02/2026
arXiv (Cornell University)
In many applications involving intelligent agents, the overwhelming volume of alerts (mostly false) generated by the agents may desensitize users and cause them to overlook critical issues, leading to the so-called ''alert fatigue''. A common strategy is to train a reflection model as a filter to intercept false alerts with labelled data collected from user verification feedback. However, a key challenge is the noisy nature of such data as it is often collected in production environments. As cleaning noise via manual annotation incurs high costs, this paper proposes a novel method ConceptRM for constructing a high-quality corpus to train a reflection model capable of effectively intercepting false alerts. With only a small amount of expert annotations as anchors, ConceptRM creates perturbed datasets with varying noise ratios and utilizes co-teaching to train multiple distinct models for collaborative learning. By analyzing the consensus decisions of these models, it effectively identifies reliable negative samples from a noisy dataset. Experimental results demonstrate that ConceptRM significantly enhances the interception of false alerts with minimal annotation cost, outperforming several state-of-the-art LLM baselines by up to 53.31% on in-domain datasets and 41.67% on out-of-domain datasets.
Preprint
AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context
Posted to a preprint site 27/01/2026
arXiv (Cornell University)
High-quality evaluation benchmarks are pivotal for deploying Large Language Models (LLMs) in Automated Code Review (ACR). However, existing benchmarks suffer from two critical limitations: first, the lack of multi-language support in repository-level contexts, which restricts the generalizability of evaluation results; second, the reliance on noisy, incomplete ground truth derived from raw Pull Request (PR) comments, which constrains the scope of issue detection. To address these challenges, we introduce AACR-Bench a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects often overlooked in original PRs, resulting in a 285\% increase in defect coverage. Extensive evaluations of mainstream LLMs on AACR-Bench reveal that previous assessments may have either misjudged or only partially captured model capabilities due to data limitations. Our work establishes a more rigorous standard for ACR evaluation and offers new insights on LLM based ACR, i.e., the granularity/level of context and the choice of retrieval methods significantly impact ACR performance, and this influence varies depending on the LLM, programming language, and the LLM usage paradigm e.g., whether an Agent architecture is employed. The code, data, and other artifacts of our evaluation set are available at https://github.com/alibaba/aacr-bench .
Preprint
Leveraging Self-Paced Learning for Software Vulnerability Detection
Posted to a preprint site 12/11/2025
arXiv (Cornell University)
Software vulnerabilities are major risks to software systems. Recently, researchers have proposed many deep learning approaches to detect software vulnerabilities. However, their accuracy is limited in practice. One of the main causes is low-quality training data (i.e., source code). To this end, we propose a new approach: SPLVD (Self-Paced Learning for Software Vulnerability Detection). SPLVD dynamically selects source code for model training based on the stage of training, which simulates the human learning process progressing from easy to hard. SPLVD has a data selector that is specifically designed for the vulnerability detection task, which enables it to prioritize the learning of easy source code. Before each training epoch, SPLVD uses the data selector to recalculate the difficulty of the source code, select new training source code, and update the data selector. When evaluating SPLVD, we first use three benchmark datasets with over 239K source code in which 25K are vulnerable for standard evaluations. Experimental results demonstrate that SPLVD achieves the highest F1 of 89.2%, 68.7%, and 43.5%, respectively, outperforming the state-of-the-art approaches. Then we collect projects from OpenHarmony, a new ecosystem that has not been learned by general LLMs, to evaluate SPLVD further. SPLVD achieves the highest precision of 90.9%, demonstrating its practical effectiveness.
Preprint
Posted to a preprint site 25/09/2025
arXiv (Cornell University)
Large Language Models (LLMs) have shown great potential in supporting automated code review due to their impressive capabilities in context understanding and reasoning. However, these capabilities are still limited compared to human-level cognition because they are heavily influenced by the training data. Recent research has demonstrated significantly improved performance through fine-tuning LLMs with code review data. However, compared to human reviewers who often simultaneously analyze multiple dimensions of code review to better identify issues, the full potential of these methods is hampered by the limited or vague information used to fine-tune the models. This paper contributes MelcotCR, a chain-of-thought (COT) fine-tuning approach that trains LLMs with an impressive reasoning ability to analyze multiple dimensions of code review by harnessing long COT techniques to provide rich structured information. To address context loss and reasoning logic loss issues that frequently occur when LLMs process long COT prompts, we propose a solution that combines the Maximum Entropy (ME) modeling principle with pre-defined reasoning pathways in MelcotCR to enable more effective utilization of in-context knowledge within long COT prompts while strengthening the logical tightness of the reasoning process. Empirical evaluations on our curated MelcotCR dataset and the public CodeReviewer dataset reveal that a low-parameter base model, such as 14B Qwen2.5, fine-tuned with MelcotCR can surpass state-of-the-art methods in terms of the accuracy of detecting and describing code issues, with its performance remarkably on par with that of the 671B DeepSeek-R1 model.
Preprint
Distilling Desired Comments for Enhanced Code Review with Large Language Models
Posted to a preprint site 29/12/2024
arXiv
There has been a growing interest in using Large Language Models (LLMs) for code review thanks to their proven proficiency in code comprehension. The primary objective of most review scenarios is to generate desired review comments (DRCs) that explicitly identify issues to trigger code fixes. However, existing LLM-based solutions are not so effective in generating DRCs for various reasons such as hallucination. To enhance their code review ability, they need to be fine-tuned with a customized dataset that is ideally full of DRCs. Nevertheless, such a dataset is not yet available, while manual annotation of DRCs is too laborious to be practical. In this paper, we propose a dataset distillation method, Desiview, which can automatically construct a distilled dataset by identifying DRCs from a code review dataset. Experiments on the CodeReviewer dataset comprising more than 150K review entries show that Desiview achieves an impressive performance of 88.93%, 80.37%, 86.67%, and 84.44% in terms of Precision, Recall, Accuracy, and F1, respectively, surpassing state-of-the-art methods. To validate the effect of such a distilled dataset on enhancing LLMs' code review ability, we first fine-tune the latest LLaMA series (i.e., LLaMA 3 and LLaMA 3.1) to build model Desiview4FT. We then enhance the model training effect through KTO alignment by feeding those review comments identified as non-DRCs to the LLMs, resulting in model Desiview4FA. Verification results indicate that Desiview4FA slightly outperforms Desiview4FT, while both models have significantly improved against the base models in terms of generating DRCs. Human evaluation confirms that both models identify issues more accurately and tend to generate review comments that better describe the issues contained in the code than the base LLMs do.
Preprint
Posted to a preprint site 25/12/2024
arXiv (Cornell University)
Log statements have become an integral part of modern software systems. Prior research efforts have focused on supporting the decisions of placing log statements, such as where/what to log. With the increasing adoption of Large Language Models (LLMs) for code-related tasks such as code completion or generation, automated approaches for generating log statements have gained much momentum. However, the performance of these approaches still has a long way to go. This paper explores enhancing the performance of LLM-based solutions for automated log statement generation by post-training LLMs with a purpose-built dataset. Thus the primary contribution is a novel approach called AUCAD, which automatically constructs such a dataset with information extracting from log-related issues. Researchers have long noticed that a significant portion of the issues in the open-source community are related to log statements. However, distilling this portion of data requires manual efforts, which is labor-intensive and costly, rendering it impractical. Utilizing our approach, we automatically extract log-related issues from 1,537 entries of log data across 88 projects and identify 808 code snippets (i.e., methods) with retrievable source code both before and after modification of each issue (including log statements) to construct a dataset. Each entry in the dataset consists of a data pair representing high-quality and problematic log statements, respectively. With this dataset, we proceed to post-train multiple LLMs (primarily from the Llama series) for automated log statement generation. Both human and experimental evaluations indicate that these models significantly outperform existing LLM-based solutions, thereby validating the efficacy of our method for constructing a post-training dataset to enhance LLM-based log statement generation.