P2V: large-scale academic paper embedding

  • Yi Zhang
  • Fen Zhao
  • Jianguo Lu


Academic papers contain not only text but also links in the form of citations. Representing such data is crucial for many tasks, such as classification, disambiguation, duplicate detection, recommendation, and influence prediction. The success of the skip-gram model has inspired many algorithms for learning embeddings of words, documents, and networks. However, there is limited research on learning representations of linked documents such as academic papers. In this paper, we propose a new neural-network-based algorithm, called P2V (paper2vector), to learn high-quality embeddings for academic papers on large-scale datasets. We compare our model with traditional non-neural-network algorithms and state-of-the-art neural network methods on four datasets of various sizes. The largest dataset we used contains 46.64 million papers and 528.68 million citation links. Experimental results show that P2V achieves state-of-the-art performance in paper classification, paper similarity, and paper influence prediction tasks.


Keywords: Embedding, Data Representation, Academic Paper



Funding was provided by the Natural Sciences and Engineering Research Council of Canada (Grant No. RGPIN-2014-04463).


  1. School of Computer Science, University of Windsor, Windsor, Canada

