Can AI Write Science? A New Study Suggests It’s Becoming Possible

Author: Denis Avetisyan


Research reveals that AI-generated scientific abstracts, with human oversight, can achieve acceptance rates similar to those of abstracts written entirely by humans.

Expert authors exhibit a paradoxical editing behavior when revising abstracts: when the source is undisclosed, they revise human-written text as heavily as AI-generated content, but when told an abstract originates from AI, their edits narrow to meticulous stylistic refinements aimed at improving its chances of acceptance, suggesting that beliefs about authorship, rather than any inherent ability to detect quality, shape critical assessment.

The study investigates the impact of AI authorship disclosure on the editing process and reviewer perceptions of AI-assisted scientific writing.

While scientific writing demands precision and expertise, the potential of Large Language Models to genuinely assist, rather than simply generate, remains unclear. This study, ‘Accepted with Minor Revisions: Value of AI-Assisted Scientific Writing’, investigates how LLMs can support domain experts in composing scientific abstracts through a randomized controlled trial that incentivizes thorough editing. Results demonstrate that AI-generated abstracts, with careful revision, can reach levels of acceptability comparable to human-authored work, though perceptions of authorship significantly influence editing behavior and reviewer decisions. Does this suggest a future of collaborative scientific writing in which source disclosure is paramount to fostering trust and maximizing the value of AI assistance?


Deconstructing the Scholarly Record: A System Under Examination

The advancement of scientific knowledge hinges fundamentally on the clear and accurate dissemination of research findings, yet the process of translating complex data into accessible written form often presents a substantial hurdle for researchers. While expertise in a specific field doesn’t automatically translate to proficiency in scientific writing, this skill is crucial for securing funding, publishing impactful papers, and ultimately, driving innovation. This bottleneck isn’t merely a matter of stylistic finesse; poorly communicated research can lead to misinterpretations, hinder replication, and slow the overall pace of discovery. Consequently, addressing the challenges researchers face in effectively conveying their work is paramount to accelerating scientific progress and ensuring that valuable insights reach the widest possible audience.

The integration of Large Language Models (LLMs) into scientific writing presents a double-edged sword. While offering the potential to alleviate the burdens of drafting and editing, and potentially accelerating the dissemination of research, their use introduces critical concerns regarding the quality and veracity of published work. LLMs, trained on vast datasets, can generate text that appears coherent and scientifically sound, yet may contain subtle inaccuracies, unsupported claims, or even outright fabrications. This necessitates rigorous scrutiny and verification – a process complicated by the ‘black box’ nature of many LLMs, making it difficult to trace the origins of information or assess the reasoning behind generated content. Maintaining transparency regarding the use of AI assistance, therefore, becomes paramount to ensure accountability and uphold the integrity of scientific communication, demanding new standards for authorship and peer review in an era of increasingly sophisticated artificial intelligence.

The integration of artificial intelligence into scientific writing necessitates a detailed examination of the interplay between authors and reviewers when AI-assisted content is submitted for publication. Recent research indicates that abstracts generated by Large Language Models, when subjected to careful human editing, can achieve acceptance rates statistically indistinguishable from those crafted solely by human researchers. This suggests that AI isn’t necessarily replacing scientific authors, but rather functioning as a powerful tool – one that requires judicious oversight to maintain scientific rigor and clarity. Understanding how reviewers evaluate these blended submissions – identifying potential biases or focusing on specific elements – is vital for optimizing the use of AI and ensuring that the pursuit of scientific knowledge remains both efficient and trustworthy. The study highlights a crucial shift: the focus isn’t solely on whether AI can assist, but on how that assistance is perceived and validated within the peer-review process.

A word cloud analysis reveals key linguistic differences between texts generated by AI and humans, as identified by both authors and reviewers.
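While the study's own analysis is presented visually, the underlying contrast can be approximated by tallying word frequencies separately for each corpus and looking at which terms one side overuses. The sketch below is a minimal, purely illustrative version; the sample abstracts, tokenization choices, and thresholds are hypothetical rather than taken from the paper.

```python
from collections import Counter
import re

def tokenize(text):
    # Lowercase, keep alphabetic tokens longer than three characters.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 3]

# Hypothetical stand-ins for the two corpora of abstracts.
ai_abstracts = ["This study leverages a novel framework to demonstrate significant improvements."]
human_abstracts = ["We measured editing effort across abstracts and found modest differences."]

ai_counts = Counter(w for t in ai_abstracts for w in tokenize(t))
human_counts = Counter(w for t in human_abstracts for w in tokenize(t))

# Terms that appear more often in the AI-generated corpus than in the human one.
overused = {w: c - human_counts.get(w, 0)
            for w, c in ai_counts.items() if c > human_counts.get(w, 0)}
print(sorted(overused, key=overused.get, reverse=True)[:10])
```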

Mapping the Editing Process: A Behavioral Dissection

An incentivized randomized trial was conducted to model editing behaviors observed in author-reviewer interactions. The study design leveraged principles of Behavioral Science to create a realistic simulation wherein participants – acting as authors – were presented with abstracts, both human- and AI-generated, and compensated for revisions. Random assignment ensured that participants were not systematically informed of the text’s origin – a critical element for isolating the impact of source disclosure. This approach allowed for the controlled observation of editing patterns and the quantification of modifications made to both types of abstracts under conditions mirroring a typical academic submission process. The incentive structure was designed to encourage thorough editing and reflect the effort expected in a genuine peer-review scenario.

The experimental design incorporated a manipulation of source disclosure to determine its impact on editing practices. Participants were presented with abstracts – both human- and AI-generated – under conditions where authorship was either disclosed (indicating AI generation) or concealed. This allowed for a controlled comparison of editing behaviors – specifically the frequency and nature of structural revisions and cohesion-focused modifications – based solely on the knowledge, or lack thereof, regarding the text’s origin. By isolating this single variable, the study aimed to quantify the extent to which awareness of AI generation influences how individuals approach and modify written content.
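For readers who think in code, a minimal sketch of such a between-subjects assignment might cross two factors, text source and disclosure, and deal participants evenly across the resulting cells. The condition labels and balancing scheme below are illustrative assumptions, not the study's materials.

```python
import random
from itertools import product

# Hypothetical factors: where the abstract came from, and whether that is disclosed.
SOURCES = ["human", "ai"]
DISCLOSURES = ["disclosed", "undisclosed"]
CONDITIONS = list(product(SOURCES, DISCLOSURES))

def assign(participant_ids, seed=0):
    """Shuffle participants, then deal them round-robin into the four cells
    so the conditions stay roughly balanced."""
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: CONDITIONS[i % len(CONDITIONS)] for i, pid in enumerate(ids)}

print(assign(range(8)))
```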

The study’s methodology focused on text editing analysis, specifically quantifying Structural Revisions – alterations to the abstract’s organization and framework – and modifications made to ensure Cohesion and Alignment with scientific communication standards. A total of 495 abstracts were subjected to this analysis, allowing for a statistically significant evaluation of editing patterns. These revisions were categorized and measured to determine the extent and nature of changes made by editors, providing quantitative data on how editors approach text refinement and clarity.
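The paper's categorization of structural and cohesion edits follows its own coding scheme, but the general idea of measuring how much an abstract changed can be sketched with a simple token-level similarity ratio. The draft and revision below are invented examples, and the metric is only a rough proxy for the study's measures.

```python
import difflib

def edit_fraction(original: str, revised: str) -> float:
    """Fraction of word-level content that differs between draft and revision."""
    matcher = difflib.SequenceMatcher(a=original.split(), b=revised.split())
    return 1.0 - matcher.ratio()

draft = "The results was significant and suggests strong effects."
revision = "The results were significant and suggest a strong effect."
print(f"{edit_fraction(draft, revision):.2f}")  # fraction of tokens touched by the revision
```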

Study 2 (N=495) revealed that disclosing human authorship significantly reduced editing of abstracts compared to non-disclosure, while the effect of disclosing AI authorship depended on authors’ willingness to use generative AI, and source disclosure negligibly impacted confidence levels.

Decoding Linguistic Weaknesses: The Anatomy of Revision

Analysis of AI-generated abstracts revealed a consistent pattern of linguistic weakness related to nominalization, the process of forming nouns from verbs or adjectives. This manifests as an overuse of abstract nouns and noun phrases where verb-based constructions would improve clarity and concision. Specifically, the research identified instances where actions or processes were presented as static entities, hindering readability and requiring targeted editing to convert nominalized phrases back into more dynamic, verb-driven expressions. Correcting these instances consistently improved the overall coherence and accessibility of the abstracts.
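As a purely illustrative aside, a crude heuristic can flag candidate nominalizations by their noun-forming suffixes; real linguistic annotation, and presumably whatever the authors used, would rely on part-of-speech tagging rather than string matching.

```python
import re

# Common noun-forming suffixes; a rough proxy, not a linguistic analysis.
NOMINAL_SUFFIXES = ("tion", "ment", "ance", "ence", "ness", "ity")

def flag_nominalizations(sentence: str):
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    return [w for w in words if w.endswith(NOMINAL_SUFFIXES) and len(w) > 7]

before = "The utilization of the model led to the improvement of performance."
after = "Using the model improved performance."
print(flag_nominalizations(before))  # ['utilization', 'improvement', 'performance']
print(flag_nominalizations(after))   # ['performance']
```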

Analysis of editing patterns revealed a statistically significant correlation between author education level and the ability to identify and correct linguistic issues in AI-generated abstracts. Researchers with higher levels of education, as indicated by advanced degrees and publication history, consistently demonstrated a greater capacity for recognizing errors related to grammar, clarity, and style. This was evidenced by a higher frequency of substantive edits – those addressing linguistic flaws rather than content – compared to researchers with less experience. The observed trend suggests that prolonged engagement with academic writing and peer review processes cultivates a more refined sensitivity to linguistic nuance and strengthens the skillset required for effective content evaluation and revision.

Analysis of author evaluation and editing behavior revealed a strong correlation between perceived readability and confidence in assessing content, irrespective of its origin. Specifically, authors provided substantive comments on 139 out of 297 abstracts, indicating engagement with linguistic or content-related issues. Conversely, 54 of the 297 abstracts received a designation of “no comment,” suggesting authors found these abstracts readily acceptable based on initial readability assessments. This data indicates that when abstracts were perceived as less readable, authors were significantly more likely to offer detailed feedback, while a substantial proportion of abstracts were accepted without comment when they met a basic readability threshold.
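One common way to operationalize readability is a formula such as Flesch reading ease. The sketch below implements that standard formula with a deliberately rough syllable counter; it is a stand-in for whatever readability measure the study actually relied on.

```python
import re

def count_syllables(word: str) -> int:
    # Very rough heuristic: count runs of vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch formula: higher scores mean easier reading.
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

print(round(flesch_reading_ease("We tested a simple model. It worked well."), 1))
```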

Analysis of 495 abstracts reveals that AI-generated text requires significantly fewer edits than human-written text (p = 0.0293), with the reduction in editing effort correlated with editor education level and abstract readability, suggesting that perceived smoothness, rather than AI authorship, drives editor confidence.
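The p-value above is the paper's result; the sketch below only shows the general shape of such a comparison, running an ordinary two-sample t-test on made-up per-abstract edit counts.

```python
from scipy import stats

# Hypothetical per-abstract edit counts; not the study's data.
edits_on_ai_text = [4, 6, 3, 5, 2, 4, 3, 5]
edits_on_human_text = [7, 9, 6, 8, 5, 7, 10, 6]

t_stat, p_value = stats.ttest_ind(edits_on_ai_text, edits_on_human_text)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```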

The Implications of Assisted Authorship: A Shifting Paradigm

Research indicates that the perceived origin of a scientific abstract – whether attributed to a human author or an artificial intelligence – significantly shapes the subsequent editing behavior of peer reviewers. Studies consistently revealed that abstracts labeled as AI-generated received more extensive and often more critical revisions, particularly concerning language clarity and logical flow, even when directly compared to abstracts of equivalent quality presented as human-authored. This suggests a demonstrable bias wherein reviewers approach AI-generated text with a heightened scrutiny, potentially focusing on stylistic imperfections rather than the underlying scientific merit. The findings highlight a crucial need to address these implicit biases within the peer review process, ensuring that evaluations are based solely on the quality and validity of the research itself, and not on preconceived notions about its authorship.

Analysis of reviewer feedback, captured through sentiment analysis and visualized as word clouds, revealed recurring themes surrounding the perceived deficiencies of AI-generated abstracts. These visualizations highlighted concerns not with factual accuracy, but rather with issues of stylistic coherence and nuanced argumentation; terms like “awkward,” “repetitive,” and “vague” frequently appeared, indicating that while the content often presented information correctly, it lacked the polish and subtle reasoning expected in strong scientific writing. This qualitative data suggests that current AI models struggle with the more subtle aspects of communication – maintaining a consistent voice, building a logical flow, and effectively conveying complex ideas – leading reviewers to identify a disconnect between technical correctness and overall quality. Understanding these specific challenges is crucial for refining AI tools and fostering more effective collaboration between humans and artificial intelligence in scientific publishing.

The increasing prevalence of artificial intelligence in scientific writing necessitates a fundamental shift towards transparency and the development of supporting tools. This research highlights that while AI can assist in content creation, it doesn’t inherently guarantee clarity or logical flow – qualities crucial for effective scientific communication. Consequently, there’s a growing need for tools designed not just to generate text, but to actively assess and improve its cohesion, ensuring alignment with established scientific standards. Such tools could flag potential ambiguities, identify logical gaps, and promote a consistent voice – ultimately fostering a more rigorous and reliable dissemination of knowledge. Addressing these challenges proactively will be essential to maintain the integrity of scientific publishing and build trust in AI-assisted research.

The experiment randomly assigned authors to revise either AI- or human-written abstracts, with or without source disclosure, and then used a blinded review process with majority voting to assess the quality of the revisions.
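Aggregating several blinded reviewers' votes into a single decision is straightforward to sketch. The accept/reject labels and tie-breaking rule below are hypothetical assumptions, not the paper's exact procedure.

```python
from collections import Counter

def majority_decision(votes):
    """Return the decision backed by most reviewers; ties count as 'reject'."""
    tally = Counter(votes)
    accepts, rejects = tally.get("accept", 0), tally.get("reject", 0)
    return "accept" if accepts > rejects else "reject"

# Hypothetical votes from three blinded reviewers per abstract.
print(majority_decision(["accept", "accept", "reject"]))  # accept
print(majority_decision(["reject", "accept", "reject"]))  # reject
```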

The study’s acceptance of AI-assisted abstracts, even with disclosure, mirrors a fundamental principle of systems analysis. Kolmogorov once stated, “The regularities of nature are not given to us ready-made; we must discover them.” This research doesn’t simply accept the status quo of scientific writing; it actively probes the boundaries of authorship and evaluation. By subjecting AI-generated text to peer review, the researchers aren’t merely assessing readability; they’re reverse-engineering the very mechanisms of scientific validation. The findings suggest that the ‘rules’ governing acceptance aren’t absolute, but rather adaptable, revealing a system ripe for intelligent augmentation and challenging long-held assumptions about human creativity and scholarly contribution. The willingness to test these boundaries, to intentionally introduce a variable like AI authorship, exemplifies the core tenet of understanding through controlled disruption.

What’s Next?

This work confirms a suspicion long held by those who treat knowledge as a fundamentally reconstructible phenomenon: the output is separable from the originator. That an AI can generate text accepted by peer review, even with editing, is less a revelation about artificial intelligence and more a demonstration of the inherent, often unacknowledged, redundancy within the scientific communication system. The code, it seems, can be rewritten without altering the program’s execution.

The subtle shifts in editing behavior based on AI disclosure, however, are the truly interesting artifact. It suggests reviewers aren’t evaluating what is written, but who wrote it – or, more precisely, what assumptions they bring to the task. Future research should focus not on improving AI’s ability to mimic human prose, but on quantifying and mitigating this bias. Can blinding protocols be developed that extend to authorship origin?

Ultimately, this line of inquiry forces a reckoning. If the structure of a scientific argument can be successfully replicated by a non-sentient entity, then the value proposition of scientific writing isn’t necessarily in the prose itself, but in the conceptual scaffolding – the questions asked, the experimental design, the interpretation of data. The next step isn’t to perfect the algorithm, but to reverse-engineer the assumptions embedded within the human evaluation process – to fully map the operating system before attempting to debug it.


Original article: https://arxiv.org/pdf/2511.12529.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
