Topic Modeling: Measuring Scholarly Impact Using a Topical Lens

Measuring Scholarly Impact

Abstract

Topic modeling is a well-received, unsupervised method that learns thematic structures from large document collections. Numerous algorithms for topic modeling have been proposed, and the results of those algorithms have been used to summarize, visualize, and explore the target document collections. In general, a topic modeling algorithm takes a document collection as input. It then discovers a set of salient themes that are discussed in the collection and the degree to which each document exhibits those topics. Scholarly communication has been an attractive application domain for topic modeling to complement existing methods for comparing entities of interest. In this chapter, we explain how to apply an open source topic modeling tool to conduct topic analysis on a set of scholarly publications. We also demonstrate how to use the results of topic modeling for bibliometric analysis.

References

  • Asuncion, A., Welling, M., Smyth, P., & Teh, Y. (2009). On smoothing and inference for topic models. Proceedings of the Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 18–21 June.

  • Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

  • Blei, D. M., Griffiths, T. L., & Jordan, M. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2), 1–30.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Borg, I., & Groenen, P. J. F. (2005). Modern multidimensional scaling (2nd ed.). New York: Springer.

  • Buntine, W. L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.

  • Chang, J., & Blei, D. M. (2010). Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1), 124–150.

  • Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. Proceedings of 23rd Advances in Neural Information Processing Systems, Vancouver, Canada, 7–12 December.

  • Ding, Y. (2011a). Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics, 5(1), 187–203.

  • Ding, Y. (2011b). Topic-based PageRank on author co-citation networks. Journal of the American Society for Information Science and Technology, 62(3), 449–466.

  • Ding, Y. (2011c). Community detection: Topological vs. topical. Journal of Informetrics, 5(4), 498–514.

  • Erosheva, E., Fienberg, S., & Lafferty, J. (2004). Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101(1), 5220–5227.

  • Gerrish, S., & Blei, D. M. (2010). A language-based approach to measuring scholarly impact. Proceedings of the 26th International Conference on Machine Learning, Haifa, Israel, 21–24 June.

  • Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3–12.

  • Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.

  • Hofmann, T. (1999, August 15–19). Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50–57), Berkeley, CA, USA.

  • Kim, H., Sun, Y., Hockenmaier, J., & Han, J. (2012). ETM: Entity topic models for mining documents associated with entities. 2012 IEEE 12th International Conference on Data Mining (pp. 349–358). IEEE.

  • Liu, X., Zhang, J., & Guo, C. (2012). Full-text citation analysis: Enhancing bibliometric and scientific publication ranking. Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 1975–1979), Brussels, Belgium. ACM.

  • Mann, G. S., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. The ACM Joint Conference on Digital Libraries, Chapel Hill, North Carolina, USA, 11–15 June.

  • Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. Proceedings of Knowledge Discovery and Data Mining Conference (pp. 490–499).

  • Nallapati, R., Ahmed, A., Xing, E. P., & Cohen, W. W. (2008). Joint latent topic models for text and citations. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, 24–27 August.

  • Natale, F., Fiore, G., & Hofherr, J. (2012). Mapping the research on aquaculture. A bibliometric analysis of aquaculture literature. Scientometrics, 90(3), 983–999.

  • Newman, D., Chemudugunta, C., & Smyth, P. (2006). Statistical entity-topic models. Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, Pennsylvania, USA, 20–23 August.

  • Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 69, 026113.

  • Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5), 323–351.

  • Ponte, J. M., & Croft, W. B. (1998, August 24–28). A language modeling approach to information retrieval. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia (pp. 275–281).

  • Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada (pp. 487–494).

  • Song, M., Kim, S. Y., Zhang, G., Ding, Y., & Chambers, T. (2014). Productivity and influence in bioinformatics: A bibliometric analysis using PubMed central. Journal of the American Society for Information Science and Technology, 65(2), 352–371.

  • Steyvers, M., Smyth, P., & Griffiths, T. (2004, August 22–25). Probabilistic author-topic models for information discovery. Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 306–315), Seattle, Washington, USA.

  • Tang, J., Jin, R., & Zhang, J. (2008, December 15–19). A topic modeling approach and its integration into the random walk framework for academic search. Proceedings of 2008 IEEE International Conference on Data Mining (ICDM2008) (pp. 1055–1060), Pisa, Italy.

  • Van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635–1651.

  • Van Eck, N. J., Waltman, L., Noyons, E. C. M., & Butter, R. K. (2010). Automatic term identification for bibliometric mapping. Scientometrics, 82(3), 581–596.

  • Zhai, C., & Lafferty, J. (2001, September 9–13). A study of smoothing methods for language models applied to ad hoc information retrieval. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 334–342), New Orleans, LA, USA.

Author information

Corresponding author

Correspondence to Min Song.

Appendix: Normalization, Mapping, and Clustering Techniques Used by VOSviewer

In this appendix, we provide a more detailed description of the normalization, mapping, and clustering techniques used by VOSviewer.

1.1 Normalization

We first discuss the association strength normalization (Van Eck & Waltman, 2009) used by VOSviewer to normalize for differences between nodes in the number of edges they have to other nodes. Let $a_{ij}$ denote the weight of the edge between nodes $i$ and $j$, where $a_{ij} = 0$ if there is no edge between the two nodes. Since VOSviewer treats all networks as undirected, we always have $a_{ij} = a_{ji}$. The association strength normalization constructs a normalized network in which the weight of the edge between nodes $i$ and $j$ is given by

$$ {s}_{ij}=\frac{2m{a}_{ij}}{k_i{k}_j}, $$
(11.12)

where $k_i$ ($k_j$) denotes the total weight of all edges of node $i$ (node $j$) and $m$ denotes the total weight of all edges in the network. In mathematical terms,

$$ k_i=\sum_j a_{ij} \quad \text{and} \quad m=\frac{1}{2}\sum_i k_i. $$
(11.13)

We sometimes refer to $s_{ij}$ as the similarity of nodes $i$ and $j$. For an extensive discussion of the rationale of the association strength normalization, we refer to Van Eck and Waltman (2009).
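As a minimal sketch of (11.12) and (11.13), the normalization can be computed directly from a symmetric weight matrix. The function name `association_strength`, the toy network, and the treatment of isolated nodes and self-loops are illustrative assumptions; this is not VOSviewer's own code.

```python
def association_strength(a):
    """Normalize a symmetric weight matrix: s_ij = 2 * m * a_ij / (k_i * k_j)."""
    n = len(a)
    k = [sum(row) for row in a]      # k_i: total weight of all edges of node i
    m = sum(k) / 2.0                 # m: total weight of all edges in the network
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: self-loops are skipped and isolated nodes (k = 0)
            # are given similarity 0 to avoid dividing by zero.
            if i != j and k[i] > 0 and k[j] > 0:
                s[i][j] = 2.0 * m * a[i][j] / (k[i] * k[j])
    return s


# Hypothetical three-node network: node 0 is linked to nodes 1 and 2.
toy_weights = [
    [0, 2, 1],
    [2, 0, 0],
    [1, 0, 0],
]
similarities = association_strength(toy_weights)
print(similarities[0][1], similarities[0][2], similarities[1][2])
```

In this toy network, $k = (3, 2, 1)$ and $m = 3$, so the edges 0–1 and 0–2 both receive similarity 2 even though their raw weights differ, while the unconnected pair 1–2 stays at 0.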

1.2 Mapping

We now consider the VOS mapping technique used by VOSviewer to position the nodes in the network in a two-dimensional space. The VOS mapping technique minimizes the function

$$ V\left({\mathbf{x}}_1,\dots, {\mathbf{x}}_n\right)={\displaystyle \sum_{i<j}{s}_{ij}{\left\Vert {\mathbf{x}}_i-{\mathbf{x}}_j\right\Vert}^2} $$
(11.14)

subject to the constraint

$$ \frac{2}{n\left(n-1\right)}{\displaystyle \sum_{i<j}\left\Vert {\mathbf{x}}_i-{\mathbf{x}}_j\right\Vert }=1, $$
(11.15)

where $n$ denotes the number of nodes in the network, $\mathbf{x}_i$ denotes the location of node $i$ in a two-dimensional space, and $\Vert \mathbf{x}_i - \mathbf{x}_j \Vert$ denotes the Euclidean distance between nodes $i$ and $j$. VOSviewer uses a variant of the SMACOF algorithm (e.g., Borg & Groenen, 2005) to minimize (11.14) subject to (11.15). We refer to Van Eck et al. (2010) for a more extensive discussion of the VOS mapping technique, including a comparison with multidimensional scaling.
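The following sketch only evaluates the objective (11.14) and enforces the constraint (11.15) for a given layout; it does not implement the SMACOF-based minimization. The function names, the similarity matrix, and the two candidate layouts are illustrative assumptions.

```python
import math
from itertools import combinations


def average_distance(x):
    """Left-hand side of (11.15): mean Euclidean distance over all node pairs."""
    pairs = list(combinations(range(len(x)), 2))
    return sum(math.dist(x[i], x[j]) for i, j in pairs) / len(pairs)


def rescale(x):
    """Scale a layout so that its average pairwise distance equals 1."""
    d = average_distance(x)
    return [(px / d, py / d) for px, py in x]


def vos_objective(s, x):
    """Objective (11.14): similarity-weighted sum of squared distances."""
    return sum(s[i][j] * math.dist(x[i], x[j]) ** 2
               for i, j in combinations(range(len(x)), 2))


# Hypothetical similarities: node 0 is similar to nodes 1 and 2,
# while nodes 1 and 2 are not similar to each other.
toy_similarities = [
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]

# Two candidate layouts on a line, both rescaled to satisfy (11.15).
hub_in_middle = rescale([(0.0, 0.0), (-1.0, 0.0), (1.0, 0.0)])
hub_at_end = rescale([(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)])

v_middle = vos_objective(toy_similarities, hub_in_middle)
v_end = vos_objective(toy_similarities, hub_at_end)
print(v_middle < v_end)
```

Placing node 0 between its two neighbours yields the lower objective value, which is the behaviour the technique aims for: strongly related nodes end up close together, and the constraint prevents the trivial solution in which all nodes collapse onto one point.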

1.3 Clustering

Finally, we discuss the clustering technique used by VOSviewer. Nodes are assigned to clusters by maximizing the function

$$ V\left({c}_1,\dots, {c}_n\right)={\displaystyle \sum_{i<j}\delta \left({c}_i,{c}_j\right)\left({s}_{ij}-\gamma \right)} $$
(11.16)

where $c_i$ denotes the cluster to which node $i$ is assigned, $\delta(c_i, c_j)$ denotes a function that equals 1 if $c_i = c_j$ and 0 otherwise, and $\gamma$ denotes a resolution parameter that determines the level of detail of the clustering. The higher the value of $\gamma$, the larger the number of clusters that will be obtained. The function in (11.16) is a variant of the modularity function introduced by Newman and Girvan (2004) and Newman (2005) for clustering the nodes in a network. There is also an interesting mathematical relationship between the problem of minimizing (11.14) subject to (11.15) and the problem of maximizing (11.16). Because of this relationship, the mapping and clustering techniques used by VOSviewer constitute a unified approach to mapping and clustering the nodes in a network. We refer to Waltman et al. (2010) for more details. We further note that VOSviewer uses the recently introduced smart local moving algorithm (Waltman & Van Eck, 2013) to maximize (11.16).
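To illustrate how the resolution parameter affects the result, the sketch below evaluates (11.16) and maximizes it by exhaustive search over all cluster assignments. Exhaustive search is a stand-in that is feasible only for very small networks; it is not the smart local moving algorithm. The function names and the similarity matrix are illustrative assumptions.

```python
from itertools import combinations, product


def vos_quality(s, clusters, gamma):
    """Quality function (11.16): sum of (s_ij - gamma) over same-cluster pairs."""
    return sum(s[i][j] - gamma
               for i, j in combinations(range(len(clusters)), 2)
               if clusters[i] == clusters[j])


def best_clustering(s, gamma):
    """Brute-force maximizer of (11.16); only practical for a handful of nodes."""
    n = len(s)
    return max(product(range(n), repeat=n),
               key=lambda c: vos_quality(s, c, gamma))


# Hypothetical similarities: node 0 is similar to nodes 1 and 2,
# while nodes 1 and 2 are not similar to each other.
toy_similarities = [
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]

coarse = best_clustering(toy_similarities, gamma=0.5)
fine = best_clustering(toy_similarities, gamma=3.0)
print(len(set(coarse)), len(set(fine)))
```

With $\gamma = 0.5$ every similar pair contributes positively and all three nodes fall into one cluster. With $\gamma = 3$ every pair contributes negatively, so each node becomes its own cluster, consistent with the statement that higher values of $\gamma$ produce more clusters.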

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Song, M., Ding, Y. (2014). Topic Modeling: Measuring Scholarly Impact Using a Topical Lens. In: Ding, Y., Rousseau, R., Wolfram, D. (eds) Measuring Scholarly Impact. Springer, Cham. https://doi.org/10.1007/978-3-319-10377-8_11

  • DOI: https://doi.org/10.1007/978-3-319-10377-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10376-1

  • Online ISBN: 978-3-319-10377-8

  • eBook Packages: Computer Science (R0)