The Rise of Language Models in Creative Writing Assistance
Artificial intelligence (AI), particularly in the form of large language models (LLMs), has the potential to revolutionize how we approach creative writing and expression. Recent advances have enabled LLMs to generate human-like text that can assist writers with a variety of tasks, from crafting compelling arguments to producing creative fiction and poetry. Tools powered by LLMs, such as Google Workspace Labs, Grammarly, and Sudowrite, have become increasingly popular, offering users a range of writing support features.
However, as these LLM-based writing assistants gain widespread adoption, concerns have emerged regarding their impact on human creativity and linguistic diversity. LLM-generated text often exhibits idiosyncrasies that can lead to homogenized content, potentially diminishing the unique voice and style of individual writers. Clichés, unnecessary exposition, awkward phrasing, and lack of specificity are just a few of the common issues that can arise when relying heavily on AI-generated text.
Embracing a Human-Centered Approach to AI Writing Assistance
To address these challenges and ensure that AI writing tools enhance rather than hinder human creativity, a human-centered research approach is essential. By collaborating closely with expert writers and editors, we can develop a comprehensive understanding of the specific strengths and weaknesses of LLM-generated text, ultimately guiding the design of more effective and aligned writing assistance tools.
Taxonomy of Edits: Identifying and Categorizing LLM Idiosyncrasies
As a first step, we conducted a formative study with a group of experienced creative writers, tasking them with editing LLM-generated paragraphs across various genres, from literary fiction to creative nonfiction. Through this process, we identified a taxonomy of seven key edit categories:
- Clichés: Hackneyed phrases or overly common imagery that lack originality or depth.
- Unnecessary/Redundant Exposition: Non-essential or repetitive passages that could be cut or rephrased for conciseness.
- Purple Prose: Overly ornamental, verbose language that disrupts the narrative flow.
- Poor Sentence Structure: Sentences whose construction undermines flow, clarity, or impact.
- Awkward Word Choice and Phrasing: Imprecise or clumsy wording that harms clarity and readability.
- Lack of Specificity and Detail: Passages that need more concrete details or specific information to enrich the text and make it engaging.
- Tense Consistency: Inconsistencies in verb tense that must be resolved for uniformity.
This taxonomy, grounded in the expertise of professional writers, serves as a valuable framework for understanding and addressing the common idiosyncrasies that plague LLM-generated text.
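For readers who want to work with the taxonomy programmatically, a natural encoding is a simple enumeration. The sketch below is illustrative only; the identifiers are our own and not part of any released artifact:

```python
from enum import Enum

class EditCategory(Enum):
    """The seven edit categories from the taxonomy (identifiers are ours)."""
    CLICHE = "cliche"
    UNNECESSARY_EXPOSITION = "unnecessary_exposition"
    PURPLE_PROSE = "purple_prose"
    POOR_SENTENCE_STRUCTURE = "poor_sentence_structure"
    AWKWARD_WORD_CHOICE = "awkward_word_choice"
    LACK_OF_SPECIFICITY = "lack_of_specificity"
    TENSE_CONSISTENCY = "tense_consistency"
```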
The LAMP Corpus: Annotated Edits for Enhancing AI Writing
Armed with this taxonomy, we embarked on a large-scale annotation study, collaborating with 18 MFA-trained creative writers to edit 1,057 LLM-generated paragraphs. The resulting LAMP (Language model Authored, Manually Polished) corpus contains over 8,000 fine-grained edits, providing a rich dataset for analyzing the editing process and informing the development of more effective AI writing assistance tools.
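To make the shape of such a corpus concrete, here is one hypothetical record layout, building on the enumeration sketched above; the actual LAMP release may use different field names and formats:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    """One fine-grained edit to a generated paragraph (hypothetical schema)."""
    category: EditCategory   # one of the seven categories sketched above
    start: int               # character offset where the problematic span begins
    end: int                 # character offset where it ends
    original_span: str       # text as generated by the LLM
    revised_span: str        # the writer's replacement ("" if deleted)
    meaning_changing: bool   # True if the edit alters the text's semantics

@dataclass
class AnnotatedParagraph:
    paragraph_id: str
    model_family: str        # e.g. "gpt-4", "claude-3.5-sonnet", "llama-3.1"
    genre: str               # e.g. "literary fiction", "creative nonfiction"
    original_text: str
    edits: list[Edit]
```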
Our analysis of the LAMP corpus revealed several key insights:
- Consistent Editing Patterns Across LLM Families: Surprisingly, we found no significant differences in perceived writing quality or in the types of edits needed across texts generated by different LLM families, including GPT-4, Claude 3.5 Sonnet, and Llama 3.1. This suggests that the idiosyncrasies identified in our taxonomy are common across LLM-generated content, regardless of the specific model used.
- The Importance of Meaning-Preserving and Meaning-Changing Edits: The majority (70%) of the edits made by writers were meaning-preserving, focusing on improving clarity, flow, and readability, while the remaining 30% involved more substantial semantic changes to enhance the text’s specificity, depth, and originality (see the aggregation sketch after this list).
- The Varied Approaches of Expert Editors: When examining paragraphs edited by multiple writers, we observed that individual approaches can differ significantly. Some editors prioritize preserving the original voice and make minimal changes, while others take a more interventionist stance, heavily revising the text to align with their vision. This diversity in editing styles is a positive aspect, as it works against homogenization and helps preserve individual expression.
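Under the hypothetical record layout sketched earlier, the analysis behind the second insight reduces to straightforward aggregation. The function below is illustrative only; the field names are our own assumptions, not the released corpus schema:

```python
from collections import Counter

def edit_statistics(paragraphs: list[AnnotatedParagraph]) -> dict:
    """Aggregate simple statistics over the hypothetical schema above."""
    edits = [e for p in paragraphs for e in p.edits]
    total = len(edits)
    preserving = sum(1 for e in edits if not e.meaning_changing)
    return {
        "total_edits": total,
        # fraction of edits that keep the original meaning (~0.7 in LAMP)
        "meaning_preserving_fraction": preserving / total if total else 0.0,
        # how often each taxonomy category appears across the corpus
        "edits_per_category": dict(Counter(e.category.value for e in edits)),
    }
```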
Automated Editing: Bridging the Gap Between Human and AI Writing
While expert editing can effectively enhance the quality of LLM-generated text, scaling this approach to meet the growing demand for AI writing assistance is not feasible. To address this challenge, we explored the potential of automated editing methods, leveraging the insights and annotations from the LAMP corpus.
Detecting Problematic Spans in LLM-Generated Text
We first tackled the task of automatically identifying problematic spans in LLM-generated text, using few-shot prompting to guide models in detecting and categorizing issues according to our established taxonomy. Our experiments showed that these automated methods achieve moderate overlap with the edits made by human experts, demonstrating the potential for scalable identification of LLM idiosyncrasies.
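To illustrate what such few-shot detection can look like in practice, here is a minimal prompt-construction sketch; the instruction wording, example annotations, and output format are all assumptions rather than our exact experimental setup:

```python
# Few-shot examples in an assumed '- "span" -> category' output format.
FEW_SHOT_EXAMPLES = """\
Paragraph: Her heart pounded like a drum as the storm raged outside.
Problematic spans:
- "pounded like a drum" -> cliche

Paragraph: He walked to the store. He walked quickly. He was in a hurry.
Problematic spans:
- "He walked quickly. He was in a hurry." -> unnecessary_exposition
"""

def build_detection_prompt(paragraph: str) -> str:
    """Assemble a few-shot prompt asking an LLM to flag problematic spans."""
    return (
        "You are an expert editor. List the problematic spans in the "
        "paragraph and label each with one of: cliche, "
        "unnecessary_exposition, purple_prose, poor_sentence_structure, "
        "awkward_word_choice, lack_of_specificity, tense_consistency.\n\n"
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Paragraph: {paragraph}\n"
        "Problematic spans:"
    )
```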
Rewriting Problematic Spans with Category-Specific Prompts
Building on these detection capabilities, we developed a two-step automated editing pipeline that first identifies problematic spans and then generates revisions to address them. Using category-specific prompts that incorporate examples from the LAMP corpus, our LLM-based rewriting methods produced edits that, when paired with the automated detection, expert writers consistently preferred over the original LLM-generated text.
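Put together, the two steps can be sketched as below. Here `llm` stands for any callable mapping a prompt string to a completion string, and `parse_spans` assumes the output format from the detection sketch above; all of this is illustrative, not our exact pipeline:

```python
import re

def parse_spans(completion: str) -> list[tuple[str, str]]:
    """Parse '- "span" -> category' lines from the assumed detection output."""
    return re.findall(r'- "(.+?)" -> (\w+)', completion)

def automated_edit(paragraph: str, llm) -> str:
    """Two-step sketch: detect problematic spans, then rewrite each with a
    category-specific prompt. A real pipeline would also include per-category
    few-shot examples drawn from the LAMP corpus."""
    spans = parse_spans(llm(build_detection_prompt(paragraph)))
    revised = paragraph
    for span_text, category in spans:
        rewrite_prompt = (
            f"The following span exhibits the issue '{category}'. Rewrite it "
            "to fix the issue while preserving the author's meaning and voice.\n"
            f"Span: {span_text}\nRewrite:"
        )
        # Replace only the first occurrence to avoid clobbering repeats.
        revised = revised.replace(span_text, llm(rewrite_prompt).strip(), 1)
    return revised
```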
Towards a Brighter Future for AI-Assisted Writing
The findings from our human-centered research approach highlight the importance of aligning LLM-based writing tools with the standards and practices of expert writers. By incorporating the insights from the LAMP corpus and the automated editing techniques we developed, we can create AI writing assistants that enhance human creativity and preserve linguistic diversity, rather than homogenizing content or diminishing individual expression.
As the use of LLMs in creative writing continues to expand, it is crucial that we prioritize transparency, user-centered design, and a deep understanding of the strengths and limitations of these powerful language models. By working closely with writers, editors, and other domain experts, we can ensure that AI writing assistance empowers and elevates the art of written expression, rather than replacing or devaluing the human touch.
Conclusion
In the age of large language models, the future of creative writing lies in a harmonious collaboration between human expertise and AI-powered tools. By embracing a human-centered research approach, we have gained valuable insights into the idiosyncrasies of LLM-generated text and developed methods to effectively enhance its quality through targeted editing. As we continue to refine and expand these techniques, we can look forward to a future where AI writing assistants seamlessly complement and amplify the creativity and unique voices of human writers, ushering in a new era of literary expression.