Artificial intelligence (AI) systems built on incomplete or biased data will often exhibit problematic outcomes, leading to negative unintended consequences that affect the very communities that are already marginalized, underserved, or underrepresented. And yet there are few, if any, standard methods of data analysis to check for the ‘health’ of data, particularly before model development.
With the aim of mitigating harms caused by automated decision-making systems, the Dataset Nutrition Label tool enhances the context, content, and legibility of datasets. Drawing on the analogy of the Nutrition Facts Label on food, the Label highlights the “ingredients” of a dataset to help shed light on how (or whether) the dataset is healthy for use.
The Dataset Nutrition Label is a diagnostic framework that lowers the barrier to standardized data analysis by providing a distilled yet comprehensive overview of dataset “ingredients” before AI model development. This framework is optimized for the data practitioner journey and leverages potential use cases for the data alongside alerts or flags that highlight known issues and possible mitigation strategies.
The Label is intended to drive robust data analysis practices by making it easier and faster for data scientists to interrogate and select datasets; increase overall quality of models by driving the use of better and more appropriate datasets for those models; and enable the creation and publishing of responsible datasets by those who collect, clean and publish data.
The web-based Label presents its information in three distinct panes: Label Overview (below), Objectives & Alerts, and Dataset Info.
The Label Overview provides overall dataset information including known issues (alerts) and indicators (badges) for key questions such as whether the data has undergone ethical review.
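To make the idea of alerts and badges concrete, here is a minimal sketch of how an overview like this might be represented in code. It is an illustration only, not the Label's actual schema: the class names, field names, badge keys, and example values are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """A known issue with the dataset, plus an optional mitigation (hypothetical shape)."""
    title: str
    description: str
    mitigation: str = ""


@dataclass
class LabelOverview:
    """Hypothetical stand-in for the Label Overview pane."""
    dataset_name: str
    # Badges answer key yes/no questions; the keys used below are made up.
    badges: dict[str, bool] = field(default_factory=dict)
    alerts: list[Alert] = field(default_factory=list)

    def summary(self) -> str:
        # List only the badges whose answer is "yes", in a stable order.
        passed = sorted(name for name, ok in self.badges.items() if ok)
        return (
            f"{self.dataset_name}: {len(self.alerts)} alert(s); "
            f"badges: {', '.join(passed) or 'none'}"
        )


# Illustrative values only; none of this describes a real dataset.
overview = LabelOverview(
    dataset_name="example-dataset",
    badges={"ethical_review": True, "example_badge": False},
    alerts=[
        Alert(
            title="Sampling gap",
            description="One region is underrepresented.",
            mitigation="Reweight or collect more data before training.",
        )
    ],
)
print(overview.summary())
```

A practitioner could scan a summary like this before model development to see at a glance whether a dataset carries open alerts and whether key questions, such as ethical review, have been answered.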
The Data Nutrition Project is a research organization and product development team composed of technologists, designers, academics and scientists. Together, we are excited to continue the work of driving better AI through the exploration and development of practical tools.
Since launching the second generation of the Dataset Nutrition Label in early 2021, the team has turned its focus to a number of initiatives for this year and beyond: