Author: Denis Avetisyan
New research sheds light on the dynamics of collaborative coding between humans and large language models, revealing both the promise and the pitfalls of this emerging partnership.

An empirical study of multi-turn conversations demonstrates patterns in human-AI coding collaboration, identifies challenges in instruction following, and highlights factors impacting user satisfaction.
While large language models (LLMs) increasingly support complex coding tasks through conversational interfaces, a systematic understanding of how humans and LLMs collaborate in practice remains limited. This research, ‘Decoding Human-LLM Collaboration in Coding: An Empirical Study of Multi-Turn Conversations in the Wild’, empirically analyzes real-world user-LLM coding dialogues to reveal patterns in interaction, challenges in instruction following, and key drivers of user satisfaction. Findings demonstrate that task type significantly shapes collaborative patterns, LLMs struggle more with bug fixing and refactoring, and certain task types yield demonstrably lower user satisfaction. How can these insights inform the development of more adaptive and effective LLM interfaces for AI-assisted software development?
The Echo Chamber of Interaction
Large Language Models (LLMs) have rapidly become the driving force behind a new wave of interactive applications, fundamentally changing how humans and machines communicate. These models, trained on massive datasets of text and code, now underpin sophisticated chatbots, virtual assistants, and even creative writing tools capable of engaging in extended, multi-turn dialogues. Unlike earlier rule-based systems, LLMs leverage the power of deep learning to understand context, generate human-quality responses, and adapt to the nuances of conversation. This shift isn’t simply about more natural interactions; it’s enabling entirely new application paradigms, from personalized education and therapeutic support to automated customer service and complex problem-solving, with LLMs increasingly integrated into daily life and reshaping the landscape of human-computer interaction.
Despite advancements in large language models, consistently maintaining coherent and satisfying dialogues proves remarkably challenging. Current models struggle with sustained instruction following, achieving only approximately 48.24% loose accuracy at the instruction level. This suggests a significant gap between a model’s ability to respond to individual prompts and its capacity to remember and correctly apply instructions across multiple conversational turns. The difficulty isn’t simply about understanding language; it’s about maintaining contextual awareness and consistently aligning responses with the user’s evolving goals throughout a conversation, highlighting a key limitation in the pursuit of truly interactive artificial intelligence.
Assessing the effectiveness of conversational AI extends far beyond traditional accuracy measurements. While quantifying whether a model provides a factually correct response is important, it fails to capture the holistic quality of an interaction. True evaluation necessitates nuanced metrics centered on user experience, considering factors like dialogue coherence, engagement, and perceived helpfulness. Researchers are increasingly focused on developing methods that measure these subjective qualities – such as turn-by-turn satisfaction, the naturalness of the conversation flow, and the model’s ability to adapt to user needs – as these better reflect whether an interaction is genuinely useful and enjoyable. Simply achieving a high percentage of correct answers doesn’t guarantee a positive user experience; a conversation can be technically accurate yet frustrating or unhelpful if it lacks empathy, clarity, or a sense of natural dialogue.

Mapping the Conversational Terrain
Large-scale datasets such as LMSYS-Chat-1M and WildChat are increasingly utilized for research into multi-turn dialogue systems. LMSYS-Chat-1M, composed of 1.1 million shared conversations collected through the LMSYS Chatbot Arena, provides a diverse range of user interactions and LLM responses. WildChat, another significant resource, contributes 83k multi-turn conversations gathered in the wild from users of freely offered ChatGPT-based chatbots. These datasets enable the analysis of extended dialogues, moving beyond single-turn question-answering, and facilitate the development of more robust and contextually aware language models. The scale of these resources is critical: it provides the statistical power to identify trends and patterns in conversational dynamics that would be impossible to detect with smaller datasets.
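To make the analysis concrete, the sketch below shows one way such corpora could be pulled in and narrowed to multi-turn, coding-related dialogues. The dataset identifier and column names ("conversation", "turn") are assumptions based on the public Hugging Face releases, which are gated behind license agreements, and the keyword filter is a deliberately crude stand-in for whatever task classifier the study actually used.

```python
from itertools import islice
from datasets import load_dataset

# Stream a sample rather than downloading the full corpus; "lmsys/lmsys-chat-1m"
# (and, analogously, "allenai/WildChat-1M") are assumed Hugging Face identifiers
# and require accepting the datasets' licenses before access is granted.
stream = load_dataset("lmsys/lmsys-chat-1m", split="train", streaming=True)

CODE_HINTS = ("def ", "import ", "class ", "#include", "SELECT ")

def looks_like_coding(conversation):
    """Crude keyword heuristic for flagging coding-related dialogues."""
    text = " ".join(turn["content"] or "" for turn in conversation)
    return any(hint in text for hint in CODE_HINTS)

sample = islice(stream, 50_000)  # illustrative sample size, not the study's
multi_turn_coding = [
    row for row in sample
    if row["turn"] > 1 and looks_like_coding(row["conversation"])
]
print(f"multi-turn coding conversations in sample: {len(multi_turn_coding)}")
```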
Analysis of large dialogue datasets, such as LMSYS-Chat-1M and WildChat, demonstrates that multi-turn conversations exhibit varied structural patterns beyond simple back-and-forth exchanges. These patterns include linear progressions, where each utterance directly follows the previous one; tree structures, representing branching conversations with multiple responses to a single prompt; and star structures, characterized by a central prompt eliciting numerous independent replies. Notably, conversations exhibiting a tree structure demonstrate a high degree of non-compliance, with 94.28% of responses deviating from the expected conversational path or failing to address the originating prompt directly. This indicates a significant challenge in maintaining coherence and relevance within branching dialogue scenarios.
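The study's exact reconstruction procedure is not reproduced here, but the shapes themselves are easy to pin down once each turn is linked to the turn it builds on. The sketch below assumes such parent links are already available (how they are derived is left open) and classifies a conversation as linear, star, or tree.

```python
from collections import Counter

def classify_structure(parents):
    """Classify a conversation's shape from parent links.

    parents[i] is the index of the turn that turn i builds on, or None for
    the opening prompt. Returns "linear", "star", or "tree".
    """
    child_counts = Counter(p for p in parents if p is not None)
    branching = [node for node, count in child_counts.items() if count > 1]
    if not branching:
        return "linear"                       # one unbroken chain of turns
    root = parents.index(None)
    if branching == [root] and all(p in (None, root) for p in parents):
        return "star"                         # every turn hangs off the opening prompt
    return "tree"                             # branching somewhere below the root

print(classify_structure([None, 0, 1, 2]))    # linear
print(classify_structure([None, 0, 0, 0]))    # star
print(classify_structure([None, 0, 1, 1]))    # tree
```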
The ability of Large Language Models (LLMs) to effectively navigate diverse conversational scenarios is directly linked to their exposure to and understanding of varied dialogue patterns. Datasets containing multi-turn interactions demonstrate that conversations rarely follow a single linear path; instead, they frequently branch into tree or star structures. LLMs trained primarily on linear data may struggle with these more complex exchanges, exhibiting reduced performance in scenarios requiring backtracking, clarification, or the handling of multiple subtopics. Consequently, incorporating training data that reflects the full spectrum of interaction patterns, including those with high rates of non-compliance or topic switching, is essential for developing LLMs capable of robust and adaptive dialogue management. This improved adaptability translates to more natural, coherent, and user-satisfying conversational experiences.
Deconstructing the Illusion of Understanding
The SPUR Framework – which stands for Specificity, Plausibility, Usefulness, and Reasoning – provides a systematic methodology for developing evaluation rubrics focused on conversational quality. This framework deconstructs conversational turns into measurable components based on these four criteria: the clarity and detail of the response (Specificity), the logical coherence and factual correctness (Plausibility), the relevance to the user’s needs (Usefulness), and the presence of supporting rationale or explanation (Reasoning). By applying SPUR, evaluators can generate standardized rubrics that move beyond subjective assessments and instead quantify aspects of conversation quality, allowing for consistent and reproducible evaluation of conversational AI systems. The resulting rubrics facilitate a granular analysis of system performance, identifying strengths and weaknesses in specific conversational attributes.
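As a purely illustrative aid, the fragment below encodes one plausible shape such a per-turn rubric could take in code; the 1-5 scale, equal weighting, and field names are assumptions for exposition, not the framework's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TurnRubric:
    """Hypothetical per-turn rubric built around the four criteria above."""
    specificity: int   # clarity and detail of the response (assumed 1-5 scale)
    plausibility: int  # logical coherence and factual correctness
    usefulness: int    # relevance to the user's stated need
    reasoning: int     # presence of supporting rationale or explanation

    def score(self) -> float:
        """Unweighted mean of the four criteria (equal weighting is an assumption)."""
        return (self.specificity + self.plausibility
                + self.usefulness + self.reasoning) / 4.0

turn = TurnRubric(specificity=4, plausibility=5, usefulness=3, reasoning=4)
print(f"turn quality: {turn.score():.2f}")   # 4.00
```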
The Satisfaction Trajectory Method analyzes user satisfaction not as a single endpoint value, but as a dynamic function of each turn within a conversational interaction. This approach involves tracking satisfaction levels – typically measured via post-interaction surveys or implicit feedback signals – at multiple points during the conversation. Data is then aggregated to create a satisfaction profile, or trajectory, revealing how satisfaction increases, decreases, or plateaus over time. Analysis of these trajectories allows for the identification of specific conversational elements or turns that correlate with positive or negative shifts in user satisfaction, providing granular insights beyond overall satisfaction scores and enabling targeted improvements to conversational systems.
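A minimal sketch of that idea, assuming per-turn scores on a 1-5 scale and an arbitrary threshold for what counts as a sharp drop, might look like this:

```python
def satisfaction_trajectory(per_turn_scores, drop_threshold=1.0):
    """Summarize how satisfaction evolves across a conversation.

    per_turn_scores: satisfaction measured after each turn (assumed 1-5 scale).
    Returns the trajectory plus the zero-based indices of turns where the
    score fell by at least drop_threshold relative to the previous turn.
    """
    deltas = [b - a for a, b in zip(per_turn_scores, per_turn_scores[1:])]
    sharp_drops = [i + 1 for i, d in enumerate(deltas) if d <= -drop_threshold]
    return {
        "trajectory": per_turn_scores,
        "mean": sum(per_turn_scores) / len(per_turn_scores),
        "final": per_turn_scores[-1],
        "turns_with_sharp_drops": sharp_drops,
    }

profile = satisfaction_trajectory([4, 4, 5, 3, 2])
print(profile["turns_with_sharp_drops"])   # [3, 4]: satisfaction fell at the 4th and 5th turns
```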
Statistical validation of factors influencing user satisfaction employs non-parametric tests due to the frequently observed non-normal distribution of satisfaction scores. The Kolmogorov-Smirnov Test is used to assess whether sample data follows a specified distribution, in this case testing for deviations from normality that necessitate non-parametric alternatives. The Kruskal-Wallis H Test, an extension of the Mann-Whitney U test to more than two groups, then determines whether there are statistically significant differences in satisfaction levels across different groups or conditions by assessing whether the independent samples originate from the same distribution. These tests provide objective evidence supporting the impact of specific conversational attributes or system modifications on user satisfaction, allowing for data-driven optimization.
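A short SciPy sketch of that workflow is shown below; the per-task satisfaction arrays are randomly generated placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
code_generation = rng.integers(1, 6, size=120).astype(float)  # placeholder 1-5 ratings
bug_fixing = rng.integers(1, 6, size=95).astype(float)
refactoring = rng.integers(1, 6, size=80).astype(float)

# Kolmogorov-Smirnov test against a normal distribution fitted to the sample:
# a small p-value signals departure from normality and motivates the
# non-parametric alternative below.
ks_stat, ks_p = stats.kstest(
    code_generation, "norm",
    args=(code_generation.mean(), code_generation.std(ddof=1)),
)

# Kruskal-Wallis H test: do satisfaction levels differ across task types?
h_stat, kw_p = stats.kruskal(code_generation, bug_fixing, refactoring)

print(f"KS p-value: {ks_p:.4f}   Kruskal-Wallis p-value: {kw_p:.4f}")
```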
Evaluation reliability is established through the application of Cohen’s Kappa, a statistical measure of inter-rater agreement that corrects for agreement expected by chance. Manual validation procedures, employing multiple annotators, achieved a 70% inter-annotator agreement rate. This indicates a substantial level of consistency in the labeling of conversational quality metrics, minimizing subjective bias and strengthening the validity of the evaluation dataset. Note that raw percentage agreement is not the same quantity as Kappa, which discounts chance agreement; Kappa values in the 0.61-0.80 range are conventionally read as substantial, rather than excellent, agreement. Together, these checks demonstrate the robustness of the annotation process and the trustworthiness of the resulting data for quantitative analysis.
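The reliability check itself is a one-liner with scikit-learn; the two label sequences below are placeholders rather than the study's annotations, and they illustrate why kappa typically lands below the raw agreement rate.

```python
from sklearn.metrics import cohen_kappa_score

annotator_a = ["compliant", "non-compliant", "compliant", "compliant", "non-compliant"]
annotator_b = ["compliant", "non-compliant", "compliant", "non-compliant", "non-compliant"]

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)  # corrects for chance agreement

print(f"raw agreement: {raw_agreement:.2f}   Cohen's kappa: {kappa:.2f}")
```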
Current evaluation data indicates an overall conversation-level loose accuracy of 24.07%. This metric represents the percentage of conversations successfully meeting predefined success criteria, as determined through automated and manual evaluation processes. The calculation considers a range of factors, including task completion, information relevance, and user engagement. This figure serves as a baseline for assessing the performance of conversational AI systems and tracking improvements through iterative development and model refinement. Further analysis is ongoing to identify key areas impacting this metric and to establish benchmarks for different conversation types and complexity levels.
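The precise success criteria are defined in the study, but the arithmetic behind "loose" accuracy at the two levels can be sketched as follows, assuming a boolean per-instruction verdict produced under relaxed matching.

```python
# Each inner list is one conversation; each boolean is a hypothetical verdict on
# whether an individual instruction was (loosely) satisfied.
conversations = [
    [True, True, False],
    [True, True, True],
    [False, True],
]

verdicts = [v for conversation in conversations for v in conversation]
instruction_level = sum(verdicts) / len(verdicts)

# A conversation counts as a success only if every instruction in it passes.
conversation_level = sum(all(c) for c in conversations) / len(conversations)

print(f"instruction-level loose accuracy:  {instruction_level:.2%}")   # 75.00%
print(f"conversation-level loose accuracy: {conversation_level:.2%}")  # 33.33%
```

Under this reading, it is unsurprising that the conversation-level figure sits well below the instruction-level accuracy reported earlier: a single unmet instruction is enough to sink an entire conversation.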

The Fragility of Simulated Competence
Assessing large language models extends beyond simple conversation; rigorous evaluation now focuses on specialized tasks demanding precise information retrieval. One such area is Database Knowledge Query, where an LLM’s ability to accurately extract and present data is paramount. Recent studies employ specific metrics to gauge performance in this domain, revealing an average satisfaction score of 3.99. This indicates a generally positive, though not flawless, capability in accessing and relaying information stored within databases. The evaluation process doesn’t merely check for correct answers, but also considers the clarity, conciseness, and overall usefulness of the delivered information, providing a nuanced understanding of the LLM’s practical utility in data-driven applications.
DeepSeek-Reasoner represents a significant advancement in evaluating large language models, moving beyond simple pass/fail metrics to offer detailed assessments of both instruction adherence and user contentment. This evaluator doesn’t merely check if a model completes a task, but rather how well it understands and executes nuanced instructions, and to what degree the output satisfies user expectations. By analyzing responses with a greater degree of sophistication, DeepSeek-Reasoner can pinpoint specific areas where a model excels or falters, offering valuable insights for targeted improvements. The system considers factors beyond factual correctness, incorporating elements of coherence, relevance, and overall quality to provide a more holistic and human-aligned evaluation of performance. This nuanced approach is crucial for building truly helpful and reliable artificial intelligence.
The principle of the Recency Effect suggests that human memory disproportionately favors recent experiences when forming judgments and overall impressions. This cognitive bias is particularly relevant when evaluating large language models, as user satisfaction is heavily influenced by the quality of the most recent interactions. Consequently, even a model demonstrating strong overall performance can suffer diminished ratings if its final responses are subpar or unhelpful. Researchers are increasingly recognizing that evaluating LLMs requires careful consideration of this temporal dynamic, moving beyond aggregate scores to analyze how performance trends over time impact user perception and, ultimately, the perceived value of the technology. This highlights the need for evaluation metrics that aren’t solely based on cumulative results, but also account for the model’s ability to maintain consistent quality throughout a conversation or series of tasks.
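One simple way to fold that temporal dynamic into a score, offered here purely as an illustration rather than anything the study proposes, is an exponentially recency-weighted average; the decay factor is an arbitrary assumption.

```python
def recency_weighted_satisfaction(per_turn_scores, decay=0.7):
    """Average per-turn satisfaction with exponentially larger weight on later turns."""
    n = len(per_turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]   # the final turn gets weight 1.0
    return sum(w * s for w, s in zip(weights, per_turn_scores)) / sum(weights)

scores = [5, 5, 5, 2]   # a strong conversation that ends with a weak final response
print(f"plain mean:       {sum(scores) / len(scores):.2f}")                 # 4.25
print(f"recency-weighted: {recency_weighted_satisfaction(scores):.2f}")     # ~3.82
```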
Evaluations reveal a significant performance gap across different specialized tasks for large language models, notably with Configuration Debug achieving an average satisfaction score of just 2.69. This comparatively low rating suggests that current models struggle with the intricacies of identifying and resolving configuration errors, a task demanding precise logical reasoning and attention to detail. The results indicate that while LLMs excel at general conversational abilities, applying that intelligence to practical, technically-focused debugging requires substantial improvement, highlighting a critical area for future development and refinement of these systems.
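Aggregating satisfaction by task type is straightforward once each conversation carries a task label; the rows below are placeholders, and only the averages quoted above (3.99 for Database Knowledge Query, 2.69 for Configuration Debug) come from the article itself.

```python
import pandas as pd

ratings = pd.DataFrame(
    {
        "task_type": [
            "Database Knowledge Query", "Database Knowledge Query",
            "Configuration Debug", "Configuration Debug",
            "Code Generation",
        ],
        "satisfaction": [4, 4, 3, 2, 4],   # placeholder 1-5 ratings
    }
)

summary = (
    ratings.groupby("task_type")["satisfaction"]
    .agg(["mean", "count"])
    .sort_values("mean")
)
print(summary)
```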
Current large language models exhibit surprisingly high rates of non-compliance when tasked with code generation and refactoring, with 87.50% and 86.96% of attempts failing to fully adhere to provided instructions, respectively. This suggests a significant limitation in their ability to consistently translate natural language requests into functional and correct code, even when the task appears straightforward. The high failure rates aren’t simply due to errors in execution, but rather a fundamental difficulty in understanding and implementing the desired changes as specified by the user, indicating a gap between their apparent fluency and genuine programming competence. Further research is needed to identify the root causes of this non-compliance and develop techniques to improve the reliability of these models for software development applications.
The study of human-LLM collaboration reveals a curious truth: these systems aren’t built, they become. Each interaction, each revised line of code, is a mutation in a growing organism. It echoes a sentiment shared by the late Paul Erdős: “A mathematician knows a lot of things, but a good mathematician knows where to find them.” This research doesn’t offer a blueprint for perfect code synthesis, but rather a map of the wilderness where human intention and algorithmic response meet. The unpredictable nature of multi-turn dialogue, particularly the challenges in instruction following, suggests that striving for complete control is a fool’s errand. The system simply grows, and one can only observe, adapt, and hope for elegant failures – for every refactor begins as a prayer and ends in repentance.
What Lies Ahead?
This study illuminates the contours of a collaboration, not a conquest. The patterns observed aren’t failures of instruction, but symptoms of a fundamental mismatch – a system attempting to follow direction, when what’s needed is a partner capable of understanding intent. A system isn’t built with perfect instructions; it’s grown through shared context, through the subtle negotiations of a common language. The focus, then, shouldn’t be on refining prompts, but on cultivating a space where ambiguity is tolerated, even welcomed.
Current metrics of ‘user satisfaction’ feel… provisional. They measure ease of use, certainly, but capture little of the deeper shifts in a developer’s practice. Does this collaboration foster genuine creativity, or merely accelerate the production of predictable code? The true cost isn’t measured in time saved, but in the erosion of problem-solving skills, in the outsourcing of critical thinking. Resilience lies not in isolating components, but in forgiveness between them – in systems that gracefully accommodate error, and even learn from it.
The future isn’t about more data, or larger models. It’s about acknowledging the inherent messiness of creation. A system isn’t a machine to be perfected, it’s a garden – neglect it, and you’ll grow technical debt. The next step isn’t about building a better tool, but about nurturing a more symbiotic relationship – one where the human and the machine aren’t simply working together, but evolving with each other.
Original article: https://arxiv.org/pdf/2512.10493.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/