Abstract
Much of the data about free/libre and open source software (FLOSS) development comes from studies of the code repositories used to manage projects. This paper presents a method for integrating data about open source projects by matching projects (entities) and removing duplicates across multiple code repositories. After a review of the relevant literature, a few of these methods are chosen and applied to the FLOSS domain, including a simple scoring system for confidence in pairwise project matches. Finally, the paper describes the limitations of this approach and offers recommendations for future work.
Copyright information
© 2007 International Federation for Information Processing
About this paper
Cite this paper
Conklin, M. (2007). Project Entity Matching across FLOSS Repositories. In: Feller, J., Fitzgerald, B., Scacchi, W., Sillitti, A. (eds) Open Source Development, Adoption and Innovation. OSS 2007. IFIP — The International Federation for Information Processing, vol 234. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-72486-7_4
DOI: https://doi.org/10.1007/978-0-387-72486-7_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-72485-0
Online ISBN: 978-0-387-72486-7
eBook Packages: Computer Science (R0)