Technical paper: This paper presents the main challenges of using archive metadata for dataset production.


The current trend of technological development in AI suggests that it will become a pervasive, end-to-end component of all future media systems, from production to distribution, and will be considered one of the “new normals” of a typical media production infrastructure in the near future.

In this context, the immense (and growing) number of archived objects held by the main European and world broadcasters represents a key asset, and these objects, together with their associated metadata, are seen by all major technology providers as a treasure trove of ground truth. But how true is this belief? Is it really so advantageous to consider archive metadata an easy-to-use source of ground truth for machine learning tools? This paper presents the main challenges behind this approach and how they could be addressed by applying a rigorous and structured approach to what can be identified as a new process: dataset production.

By defining and following key requirements for the dataset production process, the paper illustrates some basic tools enabling decisions about the effectiveness of the possible alternatives (e.g., metadata adaptation vs. metadata re-make) and proposes a theoretical background for the generation of future-proof datasets.
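The adaptation-vs-re-make decision mentioned above can, at its simplest, be framed as a cost comparison. The sketch below is a hypothetical illustration only: the function name, parameters and cost model are assumptions introduced here, not tools defined by the paper.

```python
# Hypothetical sketch: deciding between adapting existing archive
# metadata and re-making it from scratch, using simple per-item cost
# estimates. All names and figures are illustrative assumptions.

def choose_strategy(n_items: int,
                    adapt_cost_per_item: float,
                    remake_cost_per_item: float,
                    adaptation_tooling_cost: float = 0.0) -> str:
    """Return 'adapt' or 'remake', whichever minimises the total cost.

    Adapting typically has a fixed up-front tooling cost (mapping
    scripts, validation rules) but a low per-item cost; re-making has
    no tooling cost but a high per-item (manual annotation) cost.
    """
    adapt_total = adaptation_tooling_cost + n_items * adapt_cost_per_item
    remake_total = n_items * remake_cost_per_item
    return "adapt" if adapt_total <= remake_total else "remake"
```

Under this toy model, adaptation wins only when the archive is large enough to amortise the tooling cost; for small collections, re-annotating from scratch can be cheaper.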


In the current era, the use of Artificial Intelligence (AI) technologies in industrial processes is becoming commonplace in many sectors, including finance, manufacturing, automotive and – of course – media and entertainment. The range of applications is extremely wide, spanning business data analytics, automated quality control, web and social mining, multimedia classification, automated driving and many more (6)(7).

The penetration of AI tools in production processes is evolving from simple support for business decisions to full-fledged substitution of human decision makers. If, on the one hand, this scenario poses unprecedented challenges in terms of ethics, labour policy, safety and liability, on the other hand it represents an unmissable opportunity to implement new areas of business otherwise unfeasible. The media sector is certainly one in which AI technologies may give their best results, and probably one in which, due to its inherent nature, the risks of applying AI in the value chain can be better mitigated than in other critical sectors (e.g., healthcare, finance, automotive). There is also another key enabling factor in the media domain, namely the alleged availability of an immense amount of data. However, how much of this data is actually usable, through what processes, and at what cost?

This paper elaborates on this problem, drawing on observations and experiences from more than 20 years of applied R&D on AI in media processes (especially in archives).
