Datasets, bias, licenses, and terms of use: A large and longitudinal study on the documentation of Hugging Face machine learning models
Nardone V.; Canfora G.; Di Penta M.
2026-01-01
Abstract
Similar to what happens when reusing libraries available through dependency management systems, developers of Machine Learning (ML)-intensive systems often reuse (and extend) pre-trained models available on ML model-specific forges. However, if these models are not adequately documented, the lack of transparency can lead to undesired consequences in terms of bias, fairness, and the trustworthiness of the underlying data, as well as potential legal implications. In this paper, we study the level of transparency of ML models hosted on Hugging Face (HF), a popular hub for pre-trained ML models. We look at the extent to which model descriptions (i) specify the datasets used for pre-training, (ii) discuss possible training bias, (iii) declare their licensing, and whether projects using such models take these licenses into account, and (iv) as a complement to licensing, declare terms of use. Moreover, we examine how transparency along the investigated dimensions changed between 2023 and 2024. The study combines (i) a manual analysis of samples of top-downloaded models and (ii) an automated licensing compatibility analysis of GitHub projects leveraging HF models. Results indicate that, even after more than one year of observation, pre-trained models still disclose little about their training datasets, possible biases, and adopted licenses. Additionally, we identified several instances of potential licensing violations by client projects. Terms of use, which forbid illegal/unethical use and use beyond a model's capabilities, are present in a limited percentage of models, belonging to a few model families. Our findings motivate further research to enhance the transparency of ML models, which may lead to the development and adoption of Artificial Intelligence Bills of Materials.
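
As a rough illustration of the kind of automated transparency check the abstract describes, the minimal Python sketch below probes a single model repository for declared licenses, declared training datasets, and any discussion of bias in its model card, using the public huggingface_hub library. The model id and the keyword heuristic for bias discussions are illustrative placeholders, not the authors' actual tooling.

# Minimal sketch (assumption: not the authors' pipeline) of inspecting a
# Hugging Face model for the transparency dimensions studied in the paper.
from huggingface_hub import ModelCard, model_info

MODEL_ID = "bert-base-uncased"  # hypothetical example model

info = model_info(MODEL_ID)
# When declared, license and training datasets appear as repository tags,
# e.g. "license:apache-2.0" and "dataset:bookcorpus".
license_tags = [t for t in info.tags if t.startswith("license:")]
dataset_tags = [t for t in info.tags if t.startswith("dataset:")]

# The free-text model card may (or may not) discuss training bias; a simple
# keyword check is a crude stand-in for the paper's manual analysis.
card = ModelCard.load(MODEL_ID)
mentions_bias = "bias" in card.text.lower()

print(f"license declared:  {license_tags or 'none'}")
print(f"datasets declared: {dataset_tags or 'none'}")
print(f"card discusses bias: {mentions_bias}")

A full compatibility analysis, as performed in the study, would additionally compare each declared model license against the licenses of the GitHub client projects that use the model.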


