Author: Denis Avetisyan
A new review examines the promise and limitations of using generative AI tools to analyze qualitative data in software engineering research.
Generative AI offers potential benefits for coding assistance, but requires careful methodological alignment and sustained human oversight to ensure rigorous qualitative research.
While qualitative insights are central to understanding the human elements of software engineering, claims of fully automating analysis with generative AI often overstate current capabilities. This paper, ‘GenAI Is No Silver Bullet for Qualitative Research in Software Engineering’, critically examines the emerging intersection of large language models and qualitative methods within the field. Our analysis reveals that while GenAI tools offer promising assistance for tasks like coding, their effective implementation requires careful alignment with research epistemology and sustained human oversight to ensure rigor. What practical strategies can researchers adopt to harness the benefits of GenAI while safeguarding the validity and trustworthiness of qualitative findings in software engineering?
The Essence of Understanding: Unveiling Human Realities
Software engineering, while often quantified through lines of code and bug reports, fundamentally concerns human endeavors – the needs, motivations, and collaborative processes of developers and users alike. Qualitative research serves as an indispensable tool for unraveling these complexities, offering insights that purely quantitative methods often miss. By employing techniques like interviews and ethnographic studies, researchers can explore the ‘why’ behind observed behaviors, uncover hidden assumptions influencing design choices, and gain a nuanced understanding of how software integrates into people’s lives. This deeper comprehension is crucial for building truly user-centered systems, fostering effective teamwork, and ultimately, creating software that not only works but also meaningfully addresses human needs and enhances the overall experience.
Traditional qualitative research, while providing rich contextual understanding, often presents practical limitations for software engineering projects. Methods like in-depth interviews and ethnographic studies demand significant time investment from both researchers and participants, creating bottlenecks in iterative development cycles. The intensive nature of data collection and analysis-typically involving manual coding and thematic interpretation-makes scaling these approaches to larger user groups or more frequent feedback loops exceptionally challenging. This resource intensity frequently delays critical insights, potentially leading to misaligned development efforts and ultimately hindering the ability to rapidly adapt software based on genuine user needs and experiences.
Establishing validity and reliability in qualitative software engineering research presents unique hurdles, amplified when dealing with intentionally small sample sizes-a common necessity due to the resource-intensive nature of in-depth investigations. Unlike quantitative studies where statistical power mitigates the effects of limited data, qualitative findings hinge on the depth and nuanced interpretation of participant experiences. Researchers must therefore employ rigorous techniques – such as triangulation, member checking, and detailed audit trails – to demonstrate the trustworthiness of their conclusions. These methods help to establish confidence that the identified themes accurately reflect participant perspectives, rather than researcher bias, and that the findings are transferable to other, similar contexts despite the limited scope of the study. A transparent articulation of the research process and a clear justification for sample size are also critical to building credibility and acknowledging the inherent limitations of drawing broad generalizations from focused qualitative inquiries.
A robust qualitative study in software engineering doesn’t merely collect data, but fundamentally grounds its interpretation within a clearly defined epistemology – a theory of knowledge that shapes how researchers understand and validate findings. This philosophical underpinning dictates which data are considered relevant, how observations are framed, and ultimately, what constitutes meaningful insight. Without explicit epistemological awareness, interpretations risk being subjective or biased, hindering the transferability and trustworthiness of results. Researchers might unknowingly prioritize certain perspectives, overlook crucial contextual factors, or impose pre-conceived notions onto the data. Consequently, acknowledging and articulating the chosen epistemology-be it constructivism, interpretivism, or another framework-is not simply an academic exercise, but a vital step in ensuring the rigor and credibility of qualitative inquiry, allowing for transparent and defensible conclusions regarding complex human-centered challenges.
Gathering Richness: Methods for Deep Data Collection
Respondent strategies, encompassing both interviews and surveys, are foundational to qualitative data collection due to their direct engagement with individuals possessing relevant experiences or perspectives. Interviews, typically unstructured or semi-structured, allow for in-depth exploration of participant viewpoints and nuanced responses unattainable through other methods. Surveys, while often quantitative, can incorporate open-ended questions to gather qualitative data on attitudes, beliefs, and behaviors from a larger sample size. The selection of either method, or a mixed-methods approach, depends on the research question and the desired level of detail; however, both rely on establishing rapport and minimizing researcher bias to ensure data validity and reliability. Careful consideration of sampling techniques is also critical to ensure the respondents adequately represent the population of interest.
Field studies involve the direct observation of activities within naturally occurring environments to gather comprehensive data regarding behaviors, interactions, and contextual factors. This method prioritizes in situ data collection, allowing researchers to document practices as they unfold without the artificiality of controlled experiments or the recall bias inherent in interviews. Data acquisition typically employs techniques such as participant observation, where the researcher immerses themselves within the studied environment, and systematic observation, utilizing pre-defined coding schemes to quantify specific behaviors. The resulting data provides a nuanced understanding of how actions are embedded within complex social, cultural, and physical contexts, offering insights unattainable through other research methodologies. Data gathered may include field notes, audio/video recordings, and artifact analysis, all contributing to a holistic portrayal of the phenomenon under investigation.
Grounded Theory is a research methodology focused on developing theory from data, rather than testing pre-existing hypotheses. The process involves iterative data collection and analysis, beginning with initial coding of qualitative data – such as interview transcripts or field notes – to identify key concepts and patterns. These initial codes are then refined through constant comparison, where new data is systematically compared to existing data to identify similarities, differences, and emerging themes. This comparative analysis leads to the development of categories and, ultimately, theoretical propositions. Theoretical sampling – purposefully selecting data to further develop and refine these emerging concepts – is a core component, continuing until theoretical saturation is reached, meaning no new insights are being gained from additional data collection.
Reflexivity, as a methodological practice, necessitates researchers explicitly acknowledge and critically examine their own preconceptions, experiences, and potential biases throughout the research process. This involves ongoing self-assessment regarding how the researcher’s background, beliefs, and values might influence data collection, analysis, and interpretation. Documentation of this reflective process, often through research memos or journals, provides transparency and allows for critical evaluation of potential subjectivity. Failing to account for researcher bias can lead to skewed findings, inaccurate conclusions, and compromised validity; therefore, consistent reflexive practice is essential for ensuring research rigor and trustworthiness, particularly in qualitative studies where the researcher is a primary instrument of data gathering and analysis.
From Data to Insight: The Art of Coding
Coding, in the context of qualitative research, is a systematic process of analyzing data – such as interview transcripts, field notes, or documents – to identify recurring patterns, concepts, and themes. This involves assigning labels, or ‘codes’, to segments of data that represent these identified elements. These codes are not simply summaries; they represent interpretations of the data’s meaning and are foundational for building higher-level analytical frameworks. The process is iterative, with initial codes often refined and reorganized as the researcher’s understanding of the data evolves. Effective coding moves beyond descriptive labeling to capture nuanced meaning and facilitate the development of theoretically-informed insights from the raw data.
Researchers utilize three primary coding strategies: deductive, inductive, and hybrid. Deductive coding begins with a pre-defined coding scheme based on existing theory or prior research, applied to the data to confirm or refute hypotheses. Its strength lies in focused analysis, but it may overlook emergent themes. Inductive coding, conversely, starts with the data itself, allowing themes to emerge organically without pre-conceived notions; while maximizing discovery, it can be time-consuming and subjective. Hybrid coding combines both approaches, starting with a preliminary deductive framework and then allowing for inductive refinement as new patterns arise, offering a balance between focused inquiry and open exploration.
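The deductive case above can be sketched mechanically. The snippet below applies a pre-defined coding scheme to text segments; the codes and keyword rules are hypothetical, and real deductive coding is an interpretive act that keyword matching only caricatures — it illustrates the "pre-defined categories" idea, nothing more.

```python
import re

# Hypothetical deductive coding scheme: code name -> matching rule.
SCHEME = {
    "collaboration": re.compile(r"\b(pair|review|standup|team)\b", re.I),
    "tooling":       re.compile(r"\b(IDE|compiler|CI|build)\b", re.I),
}

def deductive_code(segment):
    """Return every scheme code whose rule matches the segment."""
    return [code for code, rule in SCHEME.items() if rule.search(segment)]

print(deductive_code("The team does code review before every build."))
# → ['collaboration', 'tooling']
```

Inductive coding has no such sketch: its codes do not exist until the analyst derives them from the data, which is precisely why it resists automation.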
Qualitative Data Analysis Software (QDAS) packages, such as NVivo, Atlas.ti, and MAXQDA, facilitate the coding process by enabling researchers to import, organize, and annotate qualitative data – including interview transcripts, field notes, and documents. These tools allow for efficient searching, tagging, and retrieval of coded segments, and can assist in identifying relationships between codes. However, QDAS does not autonomously generate meaningful interpretations; expert judgment remains crucial for defining codes, ensuring their consistent application, resolving ambiguities, and contextualizing findings within the broader research framework. The software serves as an aid to, but does not replace, the researcher’s analytical thinking and theoretical understanding.
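At its core, the tagging-and-retrieval function that QDAS packages provide is an index from codes to data segments. The minimal sketch below (hypothetical class and example segments, not any tool's actual API) shows that structure, including the code co-occurrence queries used to explore relationships between codes:

```python
from collections import defaultdict

class CodeIndex:
    """Toy stand-in for a QDAS code index: maps codes to tagged segments."""

    def __init__(self):
        self._index = defaultdict(list)

    def tag(self, code, segment):
        """Attach a code label to a segment of qualitative data."""
        self._index[code].append(segment)

    def retrieve(self, code):
        """Return all segments tagged with the given code."""
        return list(self._index[code])

    def co_occurring(self, code_a, code_b):
        """Segments tagged with both codes — a basis for relating codes."""
        return [s for s in self._index[code_a] if s in self._index[code_b]]

idx = CodeIndex()
idx.tag("onboarding_friction", "New hires said the build docs were outdated.")
idx.tag("tooling", "New hires said the build docs were outdated.")
idx.tag("tooling", "The team standardized on one IDE last year.")
print(idx.co_occurring("onboarding_friction", "tooling"))
# → ['New hires said the build docs were outdated.']
```

Everything interpretive — deciding what counts as a code, where a segment begins and ends, what a co-occurrence means — sits outside this structure, which is the point the paragraph above makes about expert judgment.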
Member checking, a recognized trustworthiness procedure in qualitative research, involves systematically sharing preliminary findings with study participants to assess the accuracy and credibility of the researcher’s interpretations. This process typically entails providing participants with transcripts, codes, or thematic summaries and soliciting their feedback on whether the analysis accurately reflects their experiences and perspectives. Discrepancies identified through member checking can then be addressed through further data analysis, clarification with participants, or refinement of the research findings. While not guaranteeing validity, member checking demonstrably increases confidence in the data’s representational accuracy and enhances the study’s overall trustworthiness by directly incorporating participant perspectives into the analytic process.
The Emerging Landscape: AI and Qualitative Inquiry
Generative AI, and specifically Large Language Models, is rapidly emerging as a powerful resource for qualitative researchers. These models offer the potential to significantly streamline traditionally labor-intensive processes, such as the initial organization and analysis of textual or interview data. While qualitative research prioritizes nuanced understanding and interpretive depth, GenAI tools can assist with preliminary coding, identifying potential themes, and summarizing extensive datasets – effectively reducing the time spent on foundational tasks. This allows researchers to focus more intently on the interpretive aspects of their work, exploring the ‘why’ behind observed patterns and developing richer, more insightful conclusions. Though still in its early stages, the integration of these technologies promises to reshape the landscape of qualitative inquiry, offering a compelling blend of computational efficiency and human expertise.
Generative AI is rapidly becoming a valuable asset in managing the complexities of qualitative data analysis. Researchers are increasingly leveraging these tools to expedite traditionally time-consuming processes, such as preliminary coding of interview transcripts or observational field notes. Beyond simply organizing data, GenAI assists in identifying emergent themes within large datasets, effectively sifting through substantial volumes of text to highlight recurring patterns and concepts. This capability extends to summarizing key findings, providing concise overviews that allow researchers to quickly grasp the essence of complex qualitative information and focus their interpretive efforts. While not intended to replace human analysis, these AI-powered tools offer a powerful means of enhancing efficiency and uncovering initial insights from expansive qualitative research projects.
Current integration of generative artificial intelligence within computer-supported cooperative work research is nascent, yet demonstrably growing. A preliminary analysis of papers at the CSCW 2025 conference indicates that GenAI tools were utilized in the research behind 3.3% of them - seven of 209 papers. This figure, though small, provides a baseline for tracking the adoption of these technologies within qualitative research methodologies. It suggests an emerging trend, as researchers begin to explore the potential of large language models to assist with - and potentially transform - traditional approaches to understanding human behavior and social interaction.
Current applications of generative AI in qualitative research reveal a distinct pattern of utilization; the technology demonstrates strong alignment with human coders – achieving substantial agreement, as measured by Cohen’s κ values consistently at or above 0.7 – when applied to deductive coding schemes. These schemes rely on pre-defined categories, allowing the AI to efficiently categorize text based on explicit rules. However, this capability sharply diminishes when confronted with the nuances of interpretive methodologies such as grounded theory or ethnographic analysis. These approaches demand the identification of emergent themes and subtle contextual understandings, skills that currently lie beyond the capacity of these models and explain the relative absence of AI assistance in these more exploratory qualitative domains.
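As a concrete illustration of the agreement statistic cited above, Cohen's κ corrects raw agreement for chance: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected from each coder's marginal code frequencies. The sketch below computes it for two hypothetical coders applying a two-code deductive scheme:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders over the same segments."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of segments given identical codes.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's marginal code frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes ("U" = usability, "P" = performance) on 10 segments.
human = ["U"] * 5 + ["P"] * 5
llm   = ["U"] * 4 + ["P"] * 6
print(round(cohens_kappa(human, llm), 3))  # → 0.8, substantial agreement
```

Here 9 of 10 segments agree (p_o = 0.9) against a chance baseline of p_e = 0.5, giving κ = 0.8 — above the 0.7 threshold the studies above report for deductive schemes.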
Recent investigations, notably the work of Montes et al., demonstrate a clear preference for codes generated by Large Language Models, with 61% of participants favoring LLM-derived themes over human-generated alternatives. However, this preference doesn’t necessarily equate to analytical depth; while LLMs excel at identifying surface-level patterns and frequently aligning with pre-defined categories, human analysts consistently uncover more nuanced, latent interpretations within qualitative data. This suggests that while GenAI can significantly aid in the initial stages of qualitative analysis – offering efficiency and consistency – it currently struggles to replicate the interpretive leaps and contextual understanding characteristic of experienced researchers. The findings highlight a potential for synergistic collaboration, where LLMs handle repetitive coding tasks, freeing human analysts to focus on the more complex work of uncovering meaning and theoretical insight.
The exploration of Generative AI’s role in qualitative software engineering research reveals a necessary tension between efficiency and epistemological rigor. The study rightly emphasizes that while GenAI tools offer potential for automating tasks like initial coding, maintaining the integrity of interpretive analysis - central to approaches such as grounded theory and thematic analysis - demands critical human oversight. This aligns with Tim Berners-Lee’s assertion, “The Web is more a social creation than a technical one.” The study’s insistence on careful alignment between AI assistance and research philosophy underscores the fundamentally social act of knowledge construction, even when mediated by technology. Automation, devoid of critical engagement, risks obscuring the nuanced understanding crucial to robust qualitative inquiry. Unnecessary complexity introduced by uncritical AI implementation is indeed violence against attention.
Further Refinements
The demonstrated utility of Generative AI in assisting qualitative analysis-particularly in the laborious phases of coding-does not resolve the fundamental questions. The alignment of algorithmic output with epistemological commitments remains a critical, and largely unaddressed, challenge. Software engineering, as a field, frequently prioritizes function over fundamental understanding; the temptation to accept computationally derived insights without rigorous validation is substantial, and predictably human. Clarity is compassion for cognition, and a superficially coherent theme, generated by an opaque process, offers little genuine insight.
Future work must move beyond simply applying these tools and instead focus on their limits. Investigation into the biases inherent within large language models-and how those biases manifest in qualitative datasets-is paramount. The field needs less enthusiasm for automation, and more precise metrics for assessing the fidelity of GenAI-assisted analysis. Emotion is a side effect of structure, and a beautifully rendered, yet structurally unsound, analysis is merely a pleasing illusion.
Ultimately, the value of this technology will not be determined by what it can do, but by what researchers are willing to forgo in the pursuit of efficiency. Perfection is reached not when there is nothing more to add, but when there is nothing left to take away, and the current trajectory suggests a willingness to accept substantial subtraction from the foundations of rigorous qualitative inquiry.
Original article: https://arxiv.org/pdf/2603.08951.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-11 16:12