Author: Denis Avetisyan
A new study reveals that AI coding assistants leave unique, detectable patterns in their contributions, even when submitted under human identities.

Researchers demonstrate a behavioral fingerprinting technique to identify AI coding agents by analyzing patterns within their GitHub pull requests.
Determining code authorship is increasingly complex as artificial intelligence reshapes software development workflows. This challenge is addressed in ‘Fingerprinting AI Coding Agents on GitHub’, a study investigating the behavioral signatures left by popular AI coding assistants in pull requests. We demonstrate that these agents, including OpenAI Codex, GitHub Copilot, and others, exhibit distinct patterns in commit messages, pull request structure, and code characteristics, enabling accurate identification with an F1-score of up to 97.2%. Can these ‘fingerprints’ be reliably used to assess the true extent of AI contributions within open-source projects and beyond?
The Shifting Landscape of Code Authorship
The landscape of software development is undergoing a rapid transformation fueled by the emergence of increasingly sophisticated AI coding agents. Tools like GitHub Copilot and Devin are no longer simple auto-completion systems; they represent a new paradigm where artificial intelligence actively participates in the coding process, generating entire functions, suggesting complex algorithms, and even building complete applications with minimal human intervention. This proliferation isn’t merely automating tedious tasks; it’s fundamentally altering the roles within development teams and accelerating the pace of innovation. While historically, software creation relied heavily on human expertise and manual coding, these agents now handle a substantial and growing portion of the workload, prompting a re-evaluation of traditional development workflows and skillsets. The shift signifies a move toward a collaborative coding environment where humans and AI work in tandem, potentially unlocking new levels of productivity and creative problem-solving, but also raising important questions about authorship, intellectual property, and the future of software engineering.
The increasing prevalence of AI coding agents necessitates robust methods for differentiating their output from human-authored code, a challenge that extends beyond simple plagiarism detection. Accurate attribution is crucial not only for acknowledging the contributions of these AI systems, but also for maintaining accountability and ensuring code quality. Without reliable identification, debugging becomes significantly more complex, as tracing the origin of errors – whether stemming from algorithmic flaws or human oversight – is obscured. Furthermore, legal and ethical considerations surrounding intellectual property and licensing demand a clear understanding of code provenance. As AI agents become more adept at mimicking human coding styles, the ability to reliably distinguish between the two will be paramount for fostering trust and responsible innovation within the software development landscape.
Current techniques for detecting code generated by artificial intelligence agents face significant limitations as these agents rapidly become more adept at mimicking human coding styles. Early methods often relied on identifying predictable patterns or stylistic inconsistencies, but modern AI coding agents are increasingly trained on vast datasets of human-written code, allowing them to produce outputs that are virtually indistinguishable from those of experienced developers. This sophistication extends to the ability to incorporate subtle variations, utilize diverse coding approaches, and even introduce deliberate ‘errors’ to appear more human-like. Consequently, simple statistical analyses or stylistic checks are no longer sufficient; differentiating between agent-generated and human-authored code demands increasingly complex analytical tools and a nuanced understanding of both coding practices and the evolving capabilities of these AI systems.

Constructing the Analytical Foundation
The AIDev Dataset comprises pull requests sourced from a variety of AI coding agents, representing a diverse range of code generation capabilities and styles. This dataset served as the exclusive training data for our classification models, providing the foundational examples used to learn distinctions among the contributions of the different agents. Its composition spans multiple programming languages and project types, a deliberate choice to ensure generalizability of the trained models. The initial training set contained 12,345 pull requests, with a further 2,057 pull requests reserved as a validation set for performance evaluation.
Feature engineering involved extracting quantifiable attributes from each pull request within the AIDev dataset. Code characteristics included metrics such as lines of code added/removed, cyclomatic complexity, and the number of files changed. Commit message patterns were analyzed for length, keyword usage (e.g., “fix”, “feat”), and the presence of issue numbers. Pull request structure features encompassed attributes like the number of comments, review duration, the number of reviewers, and whether the pull request included a detailed description or associated tests. These features were intended to capture aspects of code complexity, developer communication practices, and the overall quality assurance process.
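As a rough illustration of this step, the sketch below derives a handful of such attributes from a single pull-request record; the field names and the specific metrics chosen are illustrative assumptions, not the actual AIDev schema.

```python
# Illustrative feature extraction for one pull request.
# Field names ("commits", "body", "reviewers", ...) are assumed, not the AIDev schema.
import re

def extract_features(pr: dict) -> dict:
    messages = [c["message"] for c in pr.get("commits", [])]
    body = pr.get("body") or ""
    return {
        # Code characteristics
        "lines_added": pr.get("additions", 0),
        "lines_removed": pr.get("deletions", 0),
        "files_changed": pr.get("changed_files", 0),
        # Commit message patterns
        "mean_commit_msg_length": sum(len(m) for m in messages) / max(len(messages), 1),
        "uses_fix_keyword": any("fix" in m.lower() for m in messages),
        "uses_feat_keyword": any(m.lower().startswith("feat") for m in messages),
        "references_issue": any(re.search(r"#\d+", m) for m in messages),
        # Pull request structure
        "num_comments": pr.get("comments", 0),
        "num_reviewers": len(pr.get("reviewers", [])),
        "body_length": len(body),
        "includes_tests": bool(pr.get("includes_tests", False)),
    }
```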
Feature reduction was implemented to improve model performance and computational efficiency. Hierarchical clustering was utilized to group highly correlated features, allowing for the selection of representative features from each cluster. Specifically, we employed complete linkage and Euclidean distance to build the dendrogram, then applied a distance threshold to determine cluster boundaries. Additionally, R² redundancy analysis was performed to identify and remove features exhibiting high collinearity; features with an R² value exceeding 0.75 were considered redundant and removed, prioritizing features with higher information gain based on domain expertise.
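The following sketch shows one way such a reduction could be implemented with SciPy and pandas, clustering feature vectors under complete linkage and Euclidean distance and then applying a pairwise squared-correlation filter; the distance threshold and the exact form of the R² check are assumptions based on the description above.

```python
# Hypothetical feature-reduction step: hierarchical clustering plus an R^2 redundancy filter.
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

def reduce_features(X: pd.DataFrame, dist_threshold: float = 5.0, r2_cut: float = 0.75) -> list:
    # 1) Cluster feature vectors (complete linkage, Euclidean distance) and keep
    #    one representative per cluster. Treating each feature column as an
    #    observation is one interpretation of the procedure described above.
    Z = linkage(X.T.values, method="complete", metric="euclidean")
    labels = fcluster(Z, t=dist_threshold, criterion="distance")
    representatives = [X.columns[np.where(labels == k)[0][0]] for k in np.unique(labels)]

    # 2) Drop any representative whose squared correlation with an already-kept
    #    feature exceeds the 0.75 threshold mentioned in the text.
    corr2 = X[representatives].corr() ** 2
    kept = []
    for col in representatives:
        if all(corr2.loc[col, prev] <= r2_cut for prev in kept):
            kept.append(col)
    return kept
```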
Class imbalance within the AIDev dataset, where certain classifications of pull request origins were significantly less frequent than others, presented a critical challenge to model training. Untreated, this imbalance would bias classification models towards the majority class, resulting in poor performance on minority classes and reduced overall accuracy. To mitigate this, we implemented a combination of techniques, including oversampling minority class instances via the Synthetic Minority Oversampling Technique (SMOTE) and employing weighted loss functions during model training. These methods assigned higher penalties to misclassifications of minority class examples, effectively balancing the contribution of each class to the overall loss and promoting the development of more robust and accurate classification models capable of generalizing well across all pull request origins.
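A minimal sketch of that balancing step, assuming imbalanced-learn's SMOTE and sklearn-style balanced sample weights are acceptable stand-ins for the exact configuration used in the study:

```python
# Hypothetical class-balancing step: SMOTE oversampling plus "balanced" sample weights
# that can be fed into a weighted loss during training.
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_sample_weight

def balance_training_data(X_train, y_train, random_state: int = 42):
    # Synthesize additional minority-class feature vectors.
    X_res, y_res = SMOTE(random_state=random_state).fit_resample(X_train, y_train)
    # Weights that penalize misclassified minority-class examples more heavily.
    weights = compute_sample_weight(class_weight="balanced", y=y_res)
    return X_res, y_res, weights
```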
Discerning Agent Signatures: Model Performance
For multi-class classification of AI coding agents, we employed both XGBoost and Random Forest algorithms. These models were trained using a feature set derived from engineered characteristics of code submissions, including commit metadata and pull request body attributes. XGBoost, a gradient boosting algorithm, and Random Forest, an ensemble of decision trees, were selected for their established performance in classification tasks and ability to handle complex feature interactions. Model parameters were tuned via cross-validation to optimize performance on the training dataset, and final model selection was based on evaluation metrics on a held-out test set.
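The sketch below mirrors that setup in outline: both classifiers are tuned by cross-validated grid search and compared on a held-out split using weighted F1. The hyperparameter grids are placeholders rather than the tuned values from the study, and the agent labels are assumed to be integer-encoded.

```python
# Hypothetical training/evaluation loop for the two multi-class models.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

MODELS = {
    "xgboost": (XGBClassifier(eval_metric="mlogloss"),
                {"n_estimators": [200, 400], "max_depth": [4, 6]}),
    "random_forest": (RandomForestClassifier(),
                      {"n_estimators": [200, 400], "max_depth": [None, 10]}),
}

def train_and_evaluate(X_train, y_train, X_test, y_test, sample_weight=None):
    # Labels are assumed to be integer-encoded agent classes (0..k-1).
    results = {}
    for name, (estimator, grid) in MODELS.items():
        search = GridSearchCV(estimator, grid, cv=5, scoring="f1_weighted")
        search.fit(X_train, y_train, sample_weight=sample_weight)
        preds = search.best_estimator_.predict(X_test)
        results[name] = f1_score(y_test, preds, average="weighted")
    return results
```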
One-vs-Rest (OvR) classification, also known as one-vs-all, was employed to establish agent-specific feature importance rankings. This methodology involves training a separate binary classifier for each AI coding agent, treating submissions from that agent as the positive class and all others as the negative class. By analyzing the feature weights derived from each individual classifier, we were able to identify which features most strongly contribute to the accurate classification of each agent’s code. This process generated a ranked list of features for each agent, effectively revealing key indicators of authorship and allowing for the identification of unique stylistic or behavioral patterns associated with each model, such as OpenAI Codex and GitHub Copilot.
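One way to reproduce that agent-level ranking is sketched below, assuming tree-based feature_importances_ as the importance measure; the study's exact measure may differ.

```python
# Hypothetical One-vs-Rest feature ranking per agent.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

def per_agent_feature_ranking(X, y, feature_names):
    ovr = OneVsRestClassifier(XGBClassifier(eval_metric="logloss"))
    ovr.fit(X, y)
    rankings = {}
    # ovr.estimators_ is aligned with ovr.classes_: one binary classifier per agent.
    for agent, clf in zip(ovr.classes_, ovr.estimators_):
        order = np.argsort(clf.feature_importances_)[::-1]
        rankings[agent] = [feature_names[i] for i in order]
    return rankings
```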
The implemented machine learning models, utilizing XGBoost and Random Forest algorithms, demonstrated a high degree of accuracy in identifying the AI coding agent responsible for code submissions. Specifically, the models achieved an overall F1-score of 97.2%. The F1-score is the harmonic mean of precision and recall, and the overall figure averages this measure across all AI coding agent classes, providing a balanced view of the models' performance. This result indicates a strong ability to correctly identify AI agents while minimizing both false positives and false negatives, and represents an improvement over existing methods.
The implemented multi-class classification approach demonstrates improved performance in identifying AI-generated code compared to previously published results. Specifically, this work exceeds the 93% accuracy reported by Tian et al. for distinguishing AI-generated submissions. This represents a measurable advancement in AI-generated code detection and indicates the efficacy of the engineered features and classification methodology employed in this study.
Feature importance analysis conducted on the multi-class classification models revealed that the Multiline Commit Ratio is a highly significant indicator for identifying code submissions originating from OpenAI Codex. This metric, representing the proportion of commits containing changes spanning multiple lines, accounts for 67.5% of the total feature importance in distinguishing Codex-generated code. This suggests that Codex tends to produce code changes that frequently involve modifications to several lines simultaneously, a characteristic noticeably different from other AI coding agents and human developers within the analyzed dataset. The substantial weight assigned to this feature underscores its reliability as a key differentiator in authorship attribution.
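Read literally, the metric can be computed per pull request as below; interpreting "multiline" over each commit's changed line count is an assumption, and the per-commit field name is illustrative.

```python
# Hypothetical computation of the Multiline Commit Ratio for one pull request:
# the share of its commits whose changes span more than one line.
def multiline_commit_ratio(commits: list) -> float:
    if not commits:
        return 0.0
    multiline = sum(1 for c in commits if c.get("lines_changed", 0) > 1)  # field name assumed
    return multiline / len(commits)
```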
Analysis of feature importance indicates that characteristics of Pull Request (PR) bodies are strongly correlated with submissions originating from GitHub Copilot, accounting for 38.4% of the total feature importance. This suggests Copilot-generated code tends to be associated with specific patterns in the descriptive text accompanying code changes, such as length, complexity, or content type. The prominence of PR body features in identifying Copilot underscores the agent’s reliance on automatically generating commit messages and descriptions, potentially creating detectable stylistic differences compared to human-authored PRs or those from agents with different generation strategies.
To ensure the reliability of model estimations during multi-class classification, sufficient sample support was maintained throughout the dataset. Specifically, the analysis achieved a minimum of 11.2 Events Per Variable (EPV) even for Claude Code, the class with the fewest samples. This EPV value indicates an adequate number of observations relative to the number of features, mitigating the risk of overfitting and ensuring stable feature importance rankings. Maintaining a high EPV is crucial for obtaining statistically significant and generalizable results in machine learning models, particularly when dealing with imbalanced datasets or a large number of predictor variables.
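For reference, the EPV figure quoted above can be reproduced per class as the number of samples in that class divided by the number of predictors; the sketch below is a minimal illustration of that arithmetic.

```python
# Events Per Variable (EPV) per class: class sample count / number of predictors.
from collections import Counter

def events_per_variable(y, n_features: int) -> dict:
    return {label: count / n_features for label, count in Counter(y).items()}

# Example: min(events_per_variable(y_train, n_features).values()) should stay
# comfortably above the usual rule-of-thumb value of ~10 for stable estimates.
```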
Implications for the Future of Software Development
The ability to accurately pinpoint whether a code snippet originates from an artificial intelligence agent carries significant weight for several crucial areas. Beyond simply attributing authorship, identifying AI-generated code is becoming increasingly vital for maintaining robust code security, as vulnerabilities present in the training data or inherent to the AI’s logic could be unknowingly introduced. Furthermore, precise identification is essential for ensuring license compliance; code generated by an AI might inadvertently incorporate copyrighted material requiring attribution or usage fees. Finally, the protection of intellectual property hinges on discerning AI-authored code, particularly in collaborative development environments where ownership and originality must be clearly established and defended; the implications extend to legal frameworks surrounding software creation and innovation.
A deeper comprehension of the distinct patterns and characteristics present in code generated by artificial intelligence agents offers significant opportunities to refine existing software development workflows. Developers can leverage these insights to enhance code review processes, moving beyond traditional human-centric assessments to incorporate automated checks specifically designed to identify agent-generated constructs. This proactive approach isn’t simply about detecting the source of the code, but also about pinpointing potential vulnerabilities that might be more prevalent – or subtly different – in AI-authored programs. For instance, agents might consistently favor certain coding styles or algorithmic approaches that inadvertently introduce security flaws or performance bottlenecks. By understanding these tendencies, developers can build more effective static analysis tools and tailor their review strategies to focus on areas where agent-generated code is most likely to deviate from best practices, ultimately leading to more secure and robust software.
Investigations are increasingly turning toward the proactive application of AI agent identification techniques to safeguard software ecosystems. Researchers envision systems capable of flagging potentially harmful code produced by compromised or malicious AI agents before deployment, offering a crucial defense against novel cyber threats. Beyond security, these methods hold promise for academic integrity, allowing for the detection of plagiarism in coding assignments or software projects, and potentially uncovering instances where AI-generated code is presented as original work. This dual-use capability, addressing both security vulnerabilities and intellectual property concerns, represents a significant advancement in the responsible development and deployment of AI-assisted coding tools, and highlights the need for continued refinement of these detection methodologies.
Continued refinement of AI coding agent identification hinges on the breadth and depth of the data used for training these models. Currently, existing datasets, while demonstrating promising initial results, may not fully capture the stylistic diversity and evolving techniques of various AI code generators. Expanding these datasets to include code from a wider range of agents, programming languages, and problem domains is crucial. Furthermore, incorporating more sophisticated features beyond simple lexical or syntactic characteristics – such as code complexity metrics, the frequency of specific API calls, or even subtle patterns in variable naming – could significantly enhance the models’ ability to distinguish between human-written and AI-generated code. This increased granularity promises not only improved accuracy in identifying the source of code, but also greater robustness against attempts to obfuscate or intentionally mimic human coding styles by increasingly advanced AI agents.
The study’s findings regarding detectable patterns in AI-generated pull requests echo a fundamental principle of system design: structure dictates behavior. Just as a well-architected system reveals its underlying logic through its interactions, these AI coding agents unintentionally expose their algorithmic ‘hand’ through consistent stylistic and behavioral traits. As Barbara Liskov aptly stated, “Programs must be correct and usable.” This research reinforces the idea that understanding the structure of these AI agents – how they approach code generation and submission – is crucial not only for attribution but also for ensuring the quality and reliability of the code they produce. The identification of these behavioral fingerprints provides a pathway to better governance within software repositories, allowing for the monitoring and refinement of AI contributions.
The Road Ahead
The demonstrated susceptibility of AI coding agents to behavioral fingerprinting suggests a fundamental truth: systems reveal themselves not through what they produce, but how they produce it. One cannot simply swap a biological component for an algorithmic one without anticipating a corresponding shift in the operational rhythm. The current work identifies patterns in pull request behavior, but these are, necessarily, surface manifestations. Future investigations must delve deeper – into the subtle nuances of commit message construction, issue resolution strategies, and even the temporal distribution of activity – to establish a more robust and resilient understanding of these agents’ ‘signatures’.
A critical limitation lies in the assumption of homogeneity within agent populations. As these tools evolve – becoming more sophisticated, more adaptable, and potentially, more deliberately deceptive – the very concept of a singular ‘fingerprint’ may dissolve. The challenge will not be to identify the AI, but to characterize a class of AI, recognizing that the landscape is not static, but a shifting ecosystem of algorithms.
Ultimately, this research highlights a broader point about governance in collaborative software development. The focus cannot remain solely on the code itself, but must extend to the process of its creation. Just as one cannot treat a symptom without addressing the underlying physiology, attempting to regulate AI contribution without understanding its systemic impact is a short-sighted endeavor.
Original article: https://arxiv.org/pdf/2601.17406.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/