An empirical study on the accuracy of GitHub's dependency graph and the nature of its inaccuracy

Bifolco D.; Romano Simone; Nocera S.; Francese R.; Scanniello G.; Di Penta M.

2025-01-01

Abstract

Context: GitHub's dependency graph is a tool that eases Software Composition Analysis (SCA). It is leveraged not only by other tools and by practitioners in their analyses but also by researchers conducting studies on open-source projects. However, its potential inaccuracy may seriously harm its applicability and usefulness. Objective: This paper quantitatively and qualitatively analyzes the accuracy of GitHub's dependency graph for Java and Python projects, how that accuracy has changed over time, and what the likely pitfalls and limitations of the dependency graph are. Method: After creating statistically significant samples of Java and Python projects, we analyzed their dependency graphs in two directions, forward (by looking at dependencies) and backward (by looking at dependents), and inspected their manifest/lock files. Results: In our sample, the inaccuracy rate exceeds 27% for dependencies and reaches up to 10% for dependents. Errors stem from several causes, including oversimplified processing of manifest/lock files by the dependency graph generator. Conclusion: Our results provide (i) guidelines for researchers to understand the threats arising in studies based on the dependency graph and (ii) insights for practitioners and tool builders to enhance their SCA, given the current limitations of the dependency graph.
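To make the forward check described in the abstract more concrete, the following minimal Python sketch compares the packages declared in a requirements.txt-style manifest with a list of package names reported for the same project. It is an illustration under stated assumptions, not the authors' procedure: the function names (normalize, parse_requirements, compare_dependencies), the simplified manifest parsing, and the mismatch ratio are all hypothetical, and the paper's own definition of inaccuracy may differ.

```python
import re


def normalize(name):
    """Normalize a package name (lowercase, runs of '-', '_', '.' become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text):
    """Return the set of normalized package names declared in a
    requirements.txt-style manifest (simplified parsing, assumption)."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def compare_dependencies(declared, reported_names):
    """Compare declared names with reported ones.

    Returns (missing, spurious, rate), where rate is the share of
    mismatching names over all names seen on either side. This ratio
    is an illustrative choice, not the metric used in the paper.
    """
    reported = {normalize(n) for n in reported_names}
    missing = declared - reported    # declared in the manifest, not reported
    spurious = reported - declared   # reported, not declared in the manifest
    union = declared | reported
    rate = (len(missing) + len(spurious)) / len(union) if union else 0.0
    return missing, spurious, rate
```

In this sketch, reported_names stands for whatever list of dependency names one has obtained for the project; how that list is retrieved is deliberately left out. Lock files and Java manifests would need their own parsers.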

Year: 2025
Keywords: Dependency graph; Empirical study; GitHub
Publication type: 1.1 Journal article

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12070/73669
