Hierarchical Word Mover Distance for Collaboration Recommender System
Natural Language Processing (NLP) techniques enable automated analysis of large document collections, which makes it possible to compare researcher profiles quantitatively based on their publications. This paper proposes a novel system for measuring researcher similarity that combines several techniques, including topic modelling, Word2vec and Word Mover's Distance calculations on publication abstracts. The proposed method, implemented in Python, matches researchers based on the text of their documents by evaluating the semantic meanings of words and topics. The distances between researchers are calculated over various text features in a hierarchical structure. Results show that the system successfully identifies existing co-authorships in sample data even after co-authorship properties have been removed, and that it suggests valid potential academic collaboration links from related research areas irrespective of previous collaboration activity.
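The idea of applying a word-level transport distance again at a higher level can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the 2-D vectors in `EMBEDDINGS` are made-up stand-ins for trained Word2vec vectors, the function names are hypothetical, and the distance used is the relaxed lower bound on the Word Mover's Distance (each unit of mass moves to its nearest counterpart) rather than the exact optimal-transport solution. The second level treats each abstract as a "word" and the abstract-to-abstract distance as the ground cost, which is only one possible reading of the hierarchy. The topic-modelling component is omitted.

```python
import math
from collections import Counter

# Hypothetical toy embeddings; a real system would load trained Word2vec vectors.
EMBEDDINGS = {
    "topic": (1.0, 0.0),
    "model": (0.9, 0.1),
    "word": (0.0, 1.0),
    "embedding": (0.1, 0.9),
    "protein": (-1.0, 0.0),
    "cell": (-0.9, -0.1),
}


def normalise(items):
    """Turn a collection of hashable items into weights that sum to 1."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


def one_sided(weights_a, items_b, cost):
    """Send all mass of each item in A to its cheapest counterpart in B."""
    return sum(w * min(cost(a, b) for b in items_b) for a, w in weights_a.items())


def relaxed_distance(weights_a, weights_b, cost):
    """Relaxed transport distance: the larger of the two one-sided bounds."""
    return max(
        one_sided(weights_a, list(weights_b), cost),
        one_sided(weights_b, list(weights_a), cost),
    )


def word_cost(a, b):
    """Ground cost at the word level: Euclidean distance between embeddings."""
    return math.dist(EMBEDDINGS[a], EMBEDDINGS[b])


def document_distance(tokens_a, tokens_b):
    """Level 1: relaxed distance between two non-empty tokenised abstracts."""
    known_a = normalise(t for t in tokens_a if t in EMBEDDINGS)
    known_b = normalise(t for t in tokens_b if t in EMBEDDINGS)
    return relaxed_distance(known_a, known_b, word_cost)


def researcher_distance(docs_a, docs_b):
    """Level 2: abstracts play the role of words, document_distance is the cost."""
    weights_a = normalise(tuple(doc) for doc in docs_a)
    weights_b = normalise(tuple(doc) for doc in docs_b)
    return relaxed_distance(weights_a, weights_b, document_distance)


# Three hypothetical researchers, each described by tokenised abstracts.
alice = [["topic", "model"], ["word", "embedding"]]
bob = [["model", "topic"], ["embedding", "word", "word"]]
carol = [["protein", "cell"]]

print("alice-bob:  ", researcher_distance(alice, bob))
print("alice-carol:", researcher_distance(alice, carol))
```

In this toy data, `alice` and `bob` share vocabulary while `carol` uses vocabulary from a different region of the embedding space, so the first distance comes out smaller than the second. Because the relaxation ignores how much mass each target can absorb, it can underestimate the exact distance; an exact version would solve the transport problem with a linear-programming or dedicated earth mover's distance solver.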
Dr Joel Nothman from the Sydney Informatics Hub provided valuable suggestions and feedback on this work.
Prof. Nick Enfield, director of SSSHARC, Faculty of Arts and Social Sciences, the University of Sydney, initiated the question and supported this work.
The major development was conducted under the Capstone student project program initiated by the School of IT, the University of Sydney.