An empirical study on the accuracy of GitHub's dependency graph and the nature of its inaccuracy
Bifolco D.;Di Penta M.
2025-01-01
Abstract
Context: GitHub's dependency graph is a tool that eases Software Composition Analysis (SCA). It is leveraged not only by other tools and by practitioners in their analyses, but also by researchers conducting studies on open-source projects. However, its potential inaccuracy may seriously harm its applicability and usefulness. Objective: This paper quantitatively and qualitatively analyzes the accuracy of GitHub's dependency graph for Java and Python projects, how that accuracy has changed over time, and what the likely pitfalls and limitations of the dependency graph are. Method: After creating statistically significant samples of Java and Python projects, we analyzed their dependency graphs in two directions, forward (by looking at dependencies) and backward (by looking at dependents), and inspected their manifest/lock files. Results: In our sample, dependencies show an inaccuracy rate of over 27%, and dependents of up to 10%. The errors stem from several causes, among them an oversimplified processing of manifest/lock files by the dependency graph generator. Conclusion: Our results provide (i) guidelines for researchers to understand the threats arising in studies based on the dependency graph and (ii) insights for practitioners and tool builders to enhance their SCA, given the current limitations of the dependency graph.


