Much of the data about free, libre, and open source software (FLOSS) development comes from studies of the code repositories used to manage projects. This paper presents a method for integrating data about open source projects by matching projects (entities) and removing duplicates across multiple code repositories. After a review of the relevant literature, a few methods are selected and applied to the FLOSS domain, including a simple scoring system for confidence in pairwise project matches. Finally, the paper describes the limitations of this approach and offers recommendations for future work.


Keywords: Textual Description; Open Source Project; Frequency Match; Dictionary Word; Code Repository


Copyright information

© International Federation for Information Processing 2007

Authors and Affiliations

  • Megan Conklin (Elon, USA)
