The AI Muse and the Question of Originality

Author: Denis Avetisyan


As generative AI reshapes creative landscapes, a fundamental challenge arises: how do we define and protect ownership in a world of algorithmically derived art?

The replication of the box generator, denoted as [latex] g_{box} [/latex], confirms the consistency and robustness of the foundational system established in Figure 1.

This review introduces a novel framework for evaluating copyright infringement based on dependence analysis and the geometric structure of permissible generation in expansive datasets.

Existing copyright law, predicated on substantial similarity, struggles to address the stylistic imitation central to modern generative AI. In ‘Creative Ownership in the Age of AI’, we propose a novel criterion defining infringement not by copying content, but by dependence on existing works within the AI’s training corpus. We model generative systems using closure operators and characterize the structural properties of ‘permissible generation’ (outputs free from infringement), revealing an asymptotic dichotomy linked to the distribution of creative influence. Will this framework enable sustainable innovation while safeguarding the rights of creators in an increasingly AI-driven landscape?


Unraveling the Generative Landscape

Contemporary creative endeavors are rapidly being reshaped by generative systems – algorithms capable of producing novel content ranging from text and images to music and code. This shift, however, introduces unprecedented challenges to established notions of originality and ownership. Traditionally, copyright law protects the expression of an author’s individual creativity; yet, when an algorithm synthesizes content based on a vast dataset, authorship becomes significantly harder to determine. The question isn’t simply whether the output is new, but how new, and to what extent it reflects the creative contributions of the algorithm’s designers, the data used to train it, or even a genuinely emergent property of the system itself. These developments necessitate a re-evaluation of intellectual property frameworks to address the unique characteristics of AI-generated works and establish clear guidelines for attributing creation and managing rights in this evolving landscape.

The core of any generative system resides in what can be termed the ‘Generator’ – a complex mathematical function that fundamentally dictates the range of possible outputs derived from a specific input dataset, or corpus. This Generator isn’t simply a tool for replication; it’s a transformative map that takes the statistical properties of the input – patterns, frequencies, relationships – and translates them into a multidimensional space of potential creations. [latex] G(x) = y [/latex], where ‘x’ represents the input and ‘y’ the generated output, is a simplification of this concept. The Generator defines the boundaries of this creative space, determining not only what can be produced, but also the probability of any given output occurring, and crucially, how similar or divergent it will be from the original source material. Understanding the specific characteristics of this mathematical function is therefore paramount to evaluating the novelty and originality of generated content.

The legal ramifications of generative systems hinge on characterizing the underlying ‘Generator’ – the mathematical function transforming input data into novel outputs. Three properties of this Generator – preservation, monotonicity, and idempotence – are central to the analysis; together they are the defining axioms of the closure operators used to model generative systems. Preservation means that every work in the corpus is itself among the possible outputs, so the generator can always reproduce its inputs. Monotonicity means that enlarging the corpus never removes a possible output: anything generable from a smaller corpus remains generable from a larger one. Idempotence means that feeding the generator’s outputs back in as a new corpus yields nothing further, so a single application already exhausts what the corpus can produce. Analyzing these properties allows for a more nuanced understanding of generative processes, moving beyond simple comparisons to the input corpus and informing legal decisions regarding authorship and intellectual property rights in this rapidly evolving technological landscape.
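Read as the axioms of a closure operator, which is how the abstract says generative systems are modeled, the three properties can be checked mechanically on a small example. The sketch below is illustrative only: works are stand-in integers, and `interval_hull`, `one_step_neighbors`, and `closure_report` are hypothetical names, not generators or functions taken from the paper.

```python
from itertools import combinations


def subsets(universe):
    """Yield every subset of a small finite universe as a frozenset."""
    items = sorted(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def interval_hull(works):
    """Toy generator (assumption): everything between the smallest and largest work."""
    if not works:
        return frozenset()
    return frozenset(range(min(works), max(works) + 1))


def one_step_neighbors(works, top=4):
    """Toy generator (assumption): each work plus its immediate successor, capped at `top`."""
    return frozenset(works) | frozenset(x + 1 for x in works if x + 1 <= top)


def closure_report(generator, universe):
    """Brute-force check of the three closure-operator axioms over all corpora."""
    corpora = list(subsets(universe))
    return {
        # Preservation: every corpus is contained in what it generates.
        "preservation": all(c <= generator(c) for c in corpora),
        # Monotonicity: a larger corpus generates at least as much.
        "monotonicity": all(
            generator(a) <= generator(b)
            for a in corpora for b in corpora if a <= b
        ),
        # Idempotence: generating from the outputs adds nothing new.
        "idempotence": all(generator(generator(c)) == generator(c) for c in corpora),
    }


if __name__ == "__main__":
    print("interval_hull:     ", closure_report(interval_hull, range(5)))
    print("one_step_neighbors:", closure_report(one_step_neighbors, range(5)))
```

In this toy setting the interval generator passes all three checks, while the neighbor generator preserves its inputs and is monotone but fails idempotence, because applying it a second time keeps reaching further.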

The intersection of the complements of the convex hull violation sets, [latex]P=\bigcap_{i=1}^{5}P_{c_{i}}[/latex], identifies the set of points that satisfy all constraints defined by the corpus [latex]C = \{c_{1}, c_{2}, \dots, c_{5}\}[/latex].

Deconstructing Generative Outputs: Defining Boundaries of Creation

Generative models, when assessed for copyright implications, yield two distinct output sets. The ‘Permissible Set’ consists of creations demonstrably independent of any pre-existing copyrighted work within the training corpus; these outputs are considered legally safe for distribution. Conversely, the ‘Violation Set’ comprises outputs that exhibit substantial similarity to, or are directly derived from, protected materials. This determination is made by analyzing the generated content for elements that correspond to specific input works, establishing a reliance that could constitute copyright infringement. The delineation between these sets is critical for evaluating the legal risk associated with generative AI outputs.

The Convex Hull Generator defines a permissible creation set by identifying the outermost points – representing stylistic or feature boundaries – within a corpus of existing works; any creation falling within this ‘hull’ is considered independent. Conversely, the Splice Generator constructs the Violation Set by directly combining or interpolating elements from the corpus, resulting in outputs demonstrably derived from protected source material. Both generators operate by analyzing a dataset of existing works to statistically define boundaries, with the Convex Hull focusing on permissible variation and the Splice Generator explicitly highlighting derivative content. These methods allow for the quantifiable distinction between independent creation and copyright infringement by establishing a measurable relationship between generated outputs and the originating corpus.

The ‘Box Generator’ functions as a comprehensive model of generative AI output by systematically combining the ‘Permissible Set’ and ‘Violation Set’ creation methodologies. This process involves constructing a high-dimensional ‘box’ representing the entirety of possible outputs, then identifying regions within that box definitively attributable to source material (the ‘Violation Set’) and those independently created (the ‘Permissible Set’). By analyzing the relative volumes of these sets and the transition areas between them, the ‘Box Generator’ provides a quantifiable framework for assessing the likelihood of copyright infringement in generated content, effectively mapping the boundaries between lawful and unlawful creative production. This analytical approach allows for a data-driven determination of originality and reliance on existing protected works.

The convex hull generator’s permissible set remains unchanged when adding certain works, as shown in panel (a), but can be expanded by others, illustrated in panel (b) where adding [latex]c_4[/latex] creates a singleton intersection point between the line segments [latex]\overline{c_1c_3}[/latex] and [latex]\overline{c_2c_4}[/latex].
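The situation in panel (b) can be reproduced with a short script. This is a minimal sketch under stated assumptions: works are points in the plane with integer coordinates, the generator is the ordinary convex hull, and a point counts as permissible when it remains inside the hull after any single work is removed. The function names and the example coordinates are illustrative, not taken from the paper.

```python
def cross(o, a, b):
    """Signed area of the parallelogram spanned by o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def in_hull(p, points):
    """True if p lies in the convex hull of points (boundary included)."""
    hull = convex_hull(points)
    if not hull:
        return False
    if len(hull) == 1:
        return p == hull[0]
    if len(hull) == 2:  # degenerate hull: a line segment
        a, b = hull
        return (cross(a, b, p) == 0
                and min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
                and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def is_permissible(p, corpus):
    """Leave-one-out test: p must stay generable without any single work."""
    if not corpus:
        return False
    return all(in_hull(p, [c for c in corpus if c != w]) for w in corpus)


if __name__ == "__main__":
    three_works = [(0, 0), (4, 0), (0, 4)]
    four_works = [(0, 0), (2, 0), (2, 2), (0, 2)]  # c1, c2, c3, c4
    print(is_permissible((1, 1), three_works))  # interior point of the triangle
    print(is_permissible((1, 1), four_works))   # where the two diagonals cross
    print(is_permissible((1, 0), four_works))   # a point on the edge c1-c2
```

With three works in a triangle, no point survives the removal of every work, so even the interior point is rejected. With the four works placed at the corners of a square, only the crossing point of the two diagonals passes, matching the singleton intersection described in the caption.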

The Role of Distribution: Sculpting the Landscape of Generative Risk

The composition of the ‘Permissible’ and ‘Violation’ sets in generative model risk assessment is directly determined by the probability distribution of features present in the training corpus. A corpus with a distribution concentrated on a limited number of feature combinations will yield a smaller ‘Permissible’ set – the space of outputs sufficiently distinct from existing works – and a correspondingly larger ‘Violation’ set. Conversely, a corpus exhibiting a broad, diffuse distribution of features expands the ‘Permissible’ set, providing more room for original content generation, while shrinking the ‘Violation’ set. This relationship is fundamental because the generator, during sampling, effectively operates within the constraints of this distribution; outputs are more likely to fall within frequently observed feature combinations, increasing the probability of unintentional replication of protected material if the distribution is highly concentrated. Therefore, understanding the underlying feature distribution is crucial for quantifying and mitigating generative risk.

The statistical distribution of features within the training data directly impacts the likelihood of generating derivative or infringing content. A ‘light-tailed distribution’ indicates that most data points cluster closely around the mean, reducing the probability of sampling rare or extreme values that closely resemble protected works during generation. Conversely, a ‘heavy-tailed distribution’ contains a higher proportion of extreme values, increasing the probability that the generator will sample and reproduce features highly similar to those found in copyrighted material. This heightened similarity elevates the risk of producing outputs that constitute copyright infringement, as the generator is more likely to rely on, and replicate, specific elements present in the training corpus.

Generator stability and predictable output are refined by mathematical properties such as Uniform Lower Semicontinuity, which, roughly, ensures small changes in input lead to correspondingly small changes in output. The Radon Number, a notion from convex geometry, provides a sufficient condition for the non-emptiness of the permissible set – the range of outputs that do not infringe on existing works – when dealing with convex-valued generators. For ordinary convex hulls the Radon Number is [latex]d+2[/latex], where [latex]d[/latex] represents the dimensionality of the generator’s output space: by Radon’s theorem, any [latex]d+2[/latex] points can be split into two disjoint groups whose convex hulls intersect. Removing a single work leaves at least one of the two groups intact, so a point in that intersection remains generable, and a corpus of at least [latex]d+2[/latex] works therefore has a non-empty permissible set. This mathematical framework allows for a more precise assessment of generative model behavior and potential intellectual property risks.
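Radon's theorem in the plane ([latex]d = 2[/latex], so [latex]d+2 = 4[/latex] works) can be made concrete with a few lines of arithmetic. The sketch below assumes four points that are not all collinear and uses the standard signed-area construction of an affine dependence; `radon_point` is a hypothetical helper name, and nothing here is taken from the paper's own proofs.

```python
from fractions import Fraction


def orient(a, b, c):
    """Twice the signed area of triangle a, b, c."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def radon_point(p0, p1, p2, p3):
    """Return (point, group_a, group_b) for four points in the plane.

    The weights below satisfy sum(w_i) = 0 and sum(w_i * p_i) = 0, so the
    positively weighted points and the remaining points have intersecting
    convex hulls. Assumes the four points are not all collinear.
    """
    pts = [p0, p1, p2, p3]
    weights = [
        orient(p1, p2, p3),
        -orient(p0, p2, p3),
        orient(p0, p1, p3),
        -orient(p0, p1, p2),
    ]
    if not any(weights):
        raise ValueError("all four points are collinear; not handled in this sketch")
    positive = [(w, p) for w, p in zip(weights, pts) if w > 0]
    total = sum(w for w, _ in positive)
    # Weighted average of the positively weighted points, kept exact.
    x = sum(Fraction(w * p[0]) for w, p in positive) / total
    y = sum(Fraction(w * p[1]) for w, p in positive) / total
    group_a = [p for w, p in zip(weights, pts) if w > 0]
    group_b = [p for w, p in zip(weights, pts) if w <= 0]
    return (x, y), group_a, group_b


if __name__ == "__main__":
    point, group_a, group_b = radon_point((0, 0), (2, 0), (2, 2), (0, 2))
    print(point, group_a, group_b)
```

Because the returned point lies in the hulls of both groups, deleting any one work still leaves a group whose hull contains it. That is the sense in which reaching the Radon Number keeps the permissible set from being empty.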

Copyright and the Future of Generative AI: Navigating a Shifting Paradigm

Under the proposed criterion, the core of copyright infringement isn’t simply replication, but rather a question of origination – specifically, whether a creative work would have come into existence without the influence of a pre-existing, copyrighted piece. This principle, termed ‘counterfactual generability’, assesses whether the allegedly infringing work is a derivative born from the protected work, or whether it would have independently emerged even in the absence of that influence. Determining this requires a hypothetical exploration: if the copyrighted work were removed from the corpus, would the new creation still be realized in the same form? A finding of infringement hinges on establishing that the protected work served as a crucial, non-substitutable impetus for the creation, effectively demonstrating that the work wouldn’t exist in its current form without it. This shifts the focus from surface similarities to the underlying generative process, introducing a complex, yet vital, layer to copyright assessment.
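The removal test described above can be phrased as a small routine that works with any generator. The sketch below is a hedged illustration: works are modeled as sets of motifs, and `union_generator` is a made-up toy generator that blends works by taking unions. Neither it nor `indispensable_works` is defined in the paper.

```python
from itertools import combinations


def union_generator(corpus):
    """Toy generator (assumption): every blend formed as a union of one or more works."""
    works = list(corpus)
    outputs = set()
    for size in range(1, len(works) + 1):
        for combo in combinations(works, size):
            outputs.add(frozenset().union(*combo))
    return outputs


def indispensable_works(output, corpus, generator):
    """Works whose removal makes `output` impossible to generate.

    An empty result means the output is counterfactually generable without
    any single work, i.e. permissible in the sense used in the text.
    """
    corpus = frozenset(corpus)
    if output not in generator(corpus):
        raise ValueError("output is not generable from this corpus at all")
    return {work for work in corpus if output not in generator(corpus - {work})}


if __name__ == "__main__":
    c1 = frozenset({"a", "b"})
    c2 = frozenset({"b", "c"})
    c3 = frozenset({"a", "c"})
    corpus = {c1, c2, c3}

    blend = frozenset({"a", "b", "c"})  # obtainable from any two of the works
    copy_of_c1 = c1                     # obtainable only if c1 is present

    print(indispensable_works(blend, corpus, union_generator))
    print(indispensable_works(copy_of_c1, corpus, union_generator))
```

In this toy corpus the three-motif blend depends on no single work, since any two works suffice to produce it, while the verbatim copy of `c1` cannot be produced once `c1` is removed.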

The increasing prevalence of collective copyright – arrangements where creative works are jointly owned by multiple parties – introduces significant challenges to determining copyright infringement in the age of generative AI. Unlike works with a single, clearly defined owner, identifying the precise ‘Violation Set’ – the specific portions of a training dataset contributing to infringing output – becomes exponentially more complex when rights are shared. Each copyright holder within a collective possesses a claim, broadening the scope of potential infringement, as any unauthorized use of a collectively owned element could trigger legal repercussions from multiple sources. This distributed ownership model necessitates a shift in legal thinking, moving beyond simple one-to-one comparisons of input and output to a more granular analysis of contribution and permission within the collective, a task considerably more difficult than assessing infringement for singularly owned works.

Generative artificial intelligence fundamentally operates by identifying patterns within vast datasets, and subsequently creating new content based on those learned associations. This process necessitates a careful consideration of copyright law, as the legality of generated outputs hinges not simply on resemblance to existing works, but on whether the creation would have independently emerged. As these systems are fed increasingly large and diverse datasets – a scenario described as ‘light-tailed innovation’ – the proportion of legitimately permissible creations to the total number of possible outputs is predicted to approach unity. This suggests a future where, statistically, infringement becomes less likely, not because the systems are avoiding copying, but because the sheer scale of the generative space dwarfs the protected content, effectively diluting the possibility of unlawful replication and rendering traditional notions of copyright increasingly difficult to apply.
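A one-dimensional simulation gives a feel for why the tail of the distribution matters for this proportion. The sketch below is an illustration under simplifying assumptions, not the paper's result: works are single numbers, the generator is the interval between the smallest and largest work, and the permissible part is what survives removing any one work, namely the interval between the second-smallest and second-largest work. The normal and Pareto distributions are stand-ins chosen for the example.

```python
import random


def permissible_share(works):
    """Length of the leave-one-out interval divided by the full interval length."""
    xs = sorted(works)
    return (xs[-2] - xs[1]) / (xs[-1] - xs[0])


def mean_share(draw, n_works, trials, seed=0):
    """Average permissible share over repeated random corpora."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += permissible_share([draw(rng) for _ in range(n_works)])
    return total / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # Light tail: standard normal. Heavy tail: Pareto with shape 1.
        light = mean_share(lambda rng: rng.gauss(0.0, 1.0), n_works=n, trials=200)
        heavy = mean_share(lambda rng: rng.paretovariate(1.0), n_works=n, trials=200)
        print(f"n={n:5d}  light-tailed share={light:.3f}  heavy-tailed share={heavy:.3f}")
```

In the light-tailed case the most extreme work is only slightly more extreme than the runner-up, so removing it costs little and the share drifts toward one as the corpus grows. In the heavy-tailed case a single outlying work can account for a large part of the range, so the share stays well below one however many works are added.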

The exploration of permissible generation, as detailed in the paper, inherently defines boundaries: a space of creative output constrained by existing works. This resonates with Albert Camus’ assertion: “The struggle itself…is enough to fill a man’s heart. One must imagine Sisyphus happy.” The continuous refinement of the ‘violation set’ and the application of mathematical principles like the Radon number represent a Sisyphean task – perpetually defining and redefining the limits of originality in an ever-expanding corpus. Just as Sisyphus finds meaning in his endless labor, so too does this research find value in meticulously mapping the contours of creative ownership, even knowing those boundaries are fluid and subject to constant re-evaluation. The study’s focus on light-tailed distributions, and the consequential impact on defining acceptable deviation, highlights the necessity of acknowledging what lies outside the permissible generation – the unseen works that could potentially influence outcomes.

What Lies Ahead?

The proposed framework, viewing permissible generation as a growing, structurally defined space, offers a new lens through which to assess the thorny issue of copyright in the age of generative AI. However, the model remains, at its heart, a simplification. The reliance on convex hulls and Radon numbers, while mathematically elegant, presupposes a degree of order in the ‘violation set’ – the collection of copyrighted works – that may not exist in practice. Real-world datasets are rarely so neatly bounded; the edges of permissible generation are likely far more fractal, riddled with unpredictable anomalies.

Future work must grapple with the distribution of similarity. The study hints at the importance of ‘light-tailed distributions’ in differentiating permissible from impermissible generation, but a deeper exploration of the statistical landscape is needed. Is there a universal threshold beyond which any degree of similarity constitutes infringement, or is it a context-dependent judgment, shifting with the artistic intent and cultural significance of the generated work? The answer, it seems, will not be found solely within the equations.

Ultimately, this investigation serves as a reminder: the map is not the territory. The model is a microscope, illuminating certain patterns within the vast and complex space of creative expression. But the true nature of originality, and the evolving relationship between human and artificial creativity, will continue to elude complete capture, remaining a subject of ongoing inquiry and, perhaps, perpetual debate.


Original article: https://arxiv.org/pdf/2602.12270.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-02-13 18:47