IBC2022: This Technical Paper proposes a preliminary evaluation framework for story and script development that identifies four criteria (creative, emotional, information flow and realism) as a way to classify and review Artificial Intelligence (AI) authored materials.


Assessing the quality and relevance of the output of Natural Language Generation (NLG) systems is a challenge, and it is difficult to do empirically. The proposed preliminary Story and Script Evaluation Framework (SSEF) seeks to address this by combining qualitative and quantitative methods to evaluate AI-created material against four criteria: plotting codified story elements on Freytag’s Pyramid, measuring emotional connectedness and reactions, assessing whether the created scenes or the story flow logically, and determining how real or plausible the story is. The planned framework is flexible enough to work with different story genres, though it is primarily designed for scripts, screenplays and short or long-form novels. A key feature of SSEF is that it examines AI-generated content from the point of view of the reader or audience member. It focuses on the impact a story has on the individual, not on the technology or on adherence to a particular narrative theory or story genre. Developing techniques to streamline and assess the emotional criterion requires a deep understanding of emotions, emotional connection and emotional responses, and of the bond between an author or writer and an audience. Doing this successfully involves recognising the importance of empathy and emotional connection in storytelling. Injecting empathy could also go some way towards enabling AI to create contextually correct, emotionally challenging stories. If or when AI achieves high-level, intense connections with an audience, its storytelling will have the ability to be more immersive, more challenging, more compelling and, ultimately, more enjoyable. Part of the evolution of NLG is the development of tools like the Story and Script Evaluation Framework, which can provide another way to refine the creation of a story. Such tools can help mitigate two issues: AI not knowing the meaning of the sentences it is creating, and AI not knowing the implications of decisions the generating engine makes about a story's emotional depth.


This paper proposes a preliminary evaluation framework for story and script development that identifies four criteria (creative, emotional, information flow and realism) as a way to classify and review Artificial Intelligence (AI) authored materials. The evaluation process will be undertaken from the point of view of the reader or the audience and will not focus on the technology, data set or natural language generation characteristics. As an alternative to Untrained Automatic Metrics techniques, SSEF will also allow evaluation of human-authored material. This facilitates a side-by-side comparison, from the point of view of the audience, of written output that AI and humans have created from the same brief. Techniques like Untrained Automatic Metrics look only at the text and not at its effect on a person. Undertaking an evaluation using the suggested criteria will require codifying story elements, measuring emotional connectedness and reactions, assessing whether the created scenes or the story flow logically, and determining how real or plausible the story is.
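As a purely illustrative sketch, a side-by-side comparison of AI and human output against the four criteria might be recorded as follows. The paper does not specify a data model or scoring scale, so the names `ReaderRating` and `summarise`, the 0-to-1 scale and the simple averaging are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable

# The four SSEF criteria named in the paper.
CRITERIA = ("creative", "emotional", "information_flow", "realism")


@dataclass
class ReaderRating:
    """One reader's scores for one text (hypothetical structure)."""
    source: str               # who wrote the text, e.g. "ai" or "human"
    scores: Dict[str, float]  # criterion -> score; the 0-to-1 scale is assumed

    def __post_init__(self) -> None:
        # Every rating must cover all four criteria.
        missing = set(CRITERIA) - set(self.scores)
        if missing:
            raise ValueError(f"missing criteria: {sorted(missing)}")


def summarise(ratings: Iterable[ReaderRating]) -> Dict[str, Dict[str, float]]:
    """Average each criterion per source so AI and human texts can be compared."""
    by_source: Dict[str, list] = {}
    for rating in ratings:
        by_source.setdefault(rating.source, []).append(rating)
    return {
        source: {c: mean(r.scores[c] for r in group) for c in CRITERIA}
        for source, group in by_source.items()
    }
```

Calling `summarise` on ratings collected for an AI-authored and a human-authored text written from the same brief returns one averaged score per criterion for each source, which is the kind of audience-centred comparison the framework aims to support.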

The purpose of a story, communicated through narrative, imagery and drama, is to evoke an emotional response and move a person in some way. Stories are either descriptive or narrative in form and include poetry, fiction, short stories, scripts and screenplays. Irrespective of the form, they are all made up of the same core elements and all aim to engage and emotionally affect an audience.

Automatic story creation using AI requires Natural Language Generation (NLG) technologies to create long, coherent passages that realistically express a logical progression of events. AI has had some success in writing expository passages that have featured in newspaper stories and as background profiles of, for example, sports stars. At present, such passages are limited to approximately 1500 words, a ceiling that reflects the level of development of NLG technology. NLG is an emerging technology with many unsolved problems. Its challenges arise from data sparsity and complexity, and from the dynamic characteristics of the data available for the system to learn from.

There are examples of scripts and short stories created using any one of several natural language generation engines. Some scripts have also been made into short films, which are viewable but not (yet) Academy Award winners. The projects currently in the public domain have all found ways to work around shortcomings of AI such as the word limit or the use of less-than-perfect data sets. The other key limitation is the absence of empathy in AI material. Empathy and an emotional connection with an audience are crucial if AI is to be a really useful tool, either to create stories and scripts or to assist human authors and writers. While cognitive and intellectual empathy can be learnt, emotional empathy must be experienced. This is something AI cannot do: it can learn about emotional empathy but cannot truly understand it, because it has not experienced it. This presents a big challenge for AI.

The proposed framework will be suitable and adaptable for use across all five writing genres: expository, descriptive, narrative, persuasive, and journals and letters. However, the current focus is on fiction, scripts and screenplays. These areas offer the greatest challenge as well as the widest opportunity to innovate and to build a body of knowledge of methods and techniques to evaluate, review, edit and curate AI-authored text. Existing evaluation techniques include extrinsic or task-based evaluation, subjective human ratings, and metrics-based evaluation using automated systems such as LEPOR, ROUGE, BLEU and METEOR. The automated solutions are predominantly linguistic-based and summarisation-based and do not evaluate the AI material from the point of view of the intended audience. Both task-based and subjective human evaluations are currently manual processes and are time-consuming. The proposed preliminary Story and Script Evaluation Framework addresses the time and cost issues by combining the existing understanding of story structure with novel data collection methods. The aim is to build a body of knowledge of emotional reaction and intensity, married to the story elements and viewed from the point of view of the audience.
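To illustrate why such automated metrics see only the text, the sketch below computes a simplified unigram-overlap recall in the spirit of ROUGE-1. It is not the official ROUGE implementation: the function name `unigram_recall` and the whitespace tokenisation are assumptions made for this example, and real toolkits add stemming, proper tokenisation and further variants.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Share of the reference's words that also appear in the candidate.

    Simplified ROUGE-1-style recall: lower-cased whitespace tokens, with
    each word's overlap clipped to the number of times the candidate has it.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / total


# Same words, opposite meaning: the score cannot tell the two stories apart.
print(unigram_recall("dog bites man", "man bites dog"))  # 1.0
```

Because the score depends only on which words are shared, two texts with very different effects on a reader can score identically, which is the gap an audience-centred evaluation is intended to fill.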

Much of the activity in the sphere of AI and storytelling or story creation is around mechanics and structure. Exponents are like builders and landlords: more focussed on the building (what it is) than on who the tenants are (who is creating material), what they are doing (their creations) and how much people are enjoying the structure (their emotional reaction or attachment). AI can write fiction, poetry, short stories, scripts and screenplays, but it needs human input to outline a premise and basic story elements. AI cannot think on its own. It needs a reference. While Natural Language Generation uses knowledge of the art of how humans communicate, it does not know about creativity or the spark of an idea.
