
How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study

Pepe, Federica; Nardone, V.; Canfora, G.; Di Penta, M.
2024-01-01

Abstract

Pre-trained Machine Learning (ML) models help to create ML-intensive systems without having to spend substantial resources on training a new model from the ground up. However, the lack of transparency of such models could lead to undesired consequences in terms of bias, fairness, trustworthiness of the underlying data, and potentially even legal implications. Taking as a case study the transformer models hosted by Hugging Face, a popular hub for pre-trained ML models, this paper empirically investigates the transparency of pre-trained transformer models. We look at the extent to which model descriptions (i) specify the datasets used for their pre-training, (ii) discuss their possible training bias, and (iii) declare their license, and at whether projects using such models take these licenses into account. Results indicate that pre-trained models still provide limited exposure of their training datasets, possible biases, and adopted licenses. We also found several cases of possible licensing violations by client projects. Our findings motivate further research to improve the transparency of ML models, which may result in the definition, generation, and adoption of Artificial Intelligence Bills of Materials.
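
As an illustration of the kind of transparency check the abstract describes, the sketch below shows how a model card's declared training datasets, license, and bias discussion could be inspected programmatically. This is a hypothetical example using the huggingface_hub Python library and an arbitrary model id; it is not the paper's actual analysis pipeline.

    # Minimal sketch (not the authors' replication package): inspecting a Hugging Face
    # model card for the three transparency aspects studied in the paper -- declared
    # training datasets, bias discussion, and license. Assumes the `huggingface_hub`
    # Python library; the model id is purely illustrative.
    from huggingface_hub import ModelCard

    model_id = "bert-base-uncased"  # illustrative model, not one singled out by the paper
    card = ModelCard.load(model_id)

    # Structured metadata from the YAML header of the model card (may be missing).
    datasets = card.data.datasets or []
    license_tag = card.data.license or "not declared"

    # Naive keyword check for a bias/limitations discussion in the free-text card body.
    mentions_bias = any(kw in card.text.lower() for kw in ("bias", "limitation", "fairness"))

    print(f"{model_id}: datasets={datasets or 'not specified'}, "
          f"license={license_tag}, discusses bias={mentions_bias}")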
2024
Bias and Fairness
Deep Learning
Empirical Study
ML-Intensive Systems
Pre-trained models
Transparency
Files in this product:
There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12070/67171
Citations
  • Scopus: 1