Topic Modeling: Measuring Scholarly Impact Using a Topical Lens

Song, Min; Ding, Ying

doi:10.1007/978-3-319-10377-8_11

Abstract

Topic modeling is a well-received, unsupervised method that learns thematic structures from large document collections. Numerous algorithms for topic modeling have been proposed, and the results of those algorithms have been used to summarize, visualize, and explore the target document collections. In general, a topic modeling algorithm takes a document collection as input. It then discovers a set of salient themes that are discussed in the collection and the degree to which each document exhibits those topics. Scholarly communication has been an attractive application domain for topic modeling to complement existing methods for comparing entities of interest. In this chapter, we explain how to apply an open source topic modeling tool to conduct topic analysis on a set of scholarly publications. We also demonstrate how to use the results of topic modeling for bibliometric analysis.

References

Asuncion, A., Welling, M., Smyth, P., & Teh, Y. (2009). On smoothing and inference for topic models. Proceedings of the Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 18–21 June.
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.
Blei, D. M., Griffiths, T. L., & Jordan, M. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2), 1–30.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Borg, I., & Groenen, P. J. F. (2005). Modern multidimensional scaling (2nd ed.). New York: Springer.
Buntine, W. L. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
Chang, J., & Blei, D. M. (2010). Hierarchical relational models for document networks. The Annals of Applied Statistics, 4(1), 124–150.
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J. L., & Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. Proceedings of 23rd Advances in Neural Information Processing Systems, Vancouver, Canada, 7–12 December.
Ding, Y. (2011a). Scientific collaboration and endorsement: Network analysis of coauthorship and citation networks. Journal of Informetrics, 5(1), 187–203.
Ding, Y. (2011b). Topic-based PageRank on author co-citation networks. Journal of the American Society for Information Science and Technology, 62(3), 449–466.
Ding, Y. (2011c). Community detection: Topological vs. topical. Journal of Informetrics, 5(4), 498–514.
Erosheva, E., Fienberg, S., & Lafferty, J. (2004). Mixed-membership models of scientific publications. Proceedings of the National Academy of Sciences, 101(1), 5220–5227.
Gerrish, S., & Blei, D. M. (2010). A language-based approach to measuring scholarly impact. Proceedings of the 26th International Conference on Machine Learning, Haifa, Israel, 21–24 June.
Getoor, L., & Diehl, C. P. (2005). Link mining: A survey. ACM SIGKDD Explorations Newsletter, 7(2), 3–12.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228–5235.
Hofmann, T. (1999, August 15–19). Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50–57), Berkeley, CA, USA.
Kim, H., Sun, Y., Hockenmaier, J., & Han, J. (2012). ETM: Entity topic models for mining documents associated with entities. 2012 IEEE 12th International Conference on Data Mining (pp. 349–358). IEEE.
Liu, X., Zhang, J., & Guo, C. (2012). Full-text citation analysis: Enhancing bibliometric and scientific publication ranking. Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 1975–1979), Brussels, Belgium. ACM.
Mann, G. S., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. The ACM Joint Conference on Digital Libraries, Chapel Hill, North Carolina, USA, 11–15 June.
Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. Proceedings of Knowledge Discovery and Data Mining Conference (pp. 490–499).
Nallapati, R., Ahmed, A., Xing, E. P., & Cohen, W. W. (2008). Joint latent topic models for text and citations. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, 24–27 August.
Natale, F., Fiore, G., & Hofherr, J. (2012). Mapping the research on aquaculture. A bibliometric analysis of aquaculture literature. Scientometrics, 90(3), 983–999.
Newman, D., Chemudugunta, C., & Smyth, P. (2006). Statistical entity-topic models. Proceedings of 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, Pennsylvania, USA, 20–23 August.
Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, 69, 026113.
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5), 323–351.
Ponte, J. M., & Croft, W. B. (1998, August 24–28). A language modeling approach to information retrieval. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia (pp. 275–281).
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada (pp. 487–494).
Song, M., Kim, S. Y., Zhang, G., Ding, Y., & Chambers, T. (2014). Productivity and influence in bioinformatics: A bibliometric analysis using PubMed Central. Journal of the American Society for Information Science and Technology, 65(2), 352–371.
Steyvers, M., Smyth, P., & Griffiths, T. (2004, August 22–25). Probabilistic author-topic models for information discovery. Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 306–315), Seattle, Washington, USA.
Tang, J., Jin, R., & Zhang, J. (2008, December 15–19). A topic modeling approach and its integration into the random walk framework for academic search. Proceedings of 2008 IEEE International Conference on Data Mining (ICDM2008) (pp. 1055–1060), Pisa, Italy.
Van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60(8), 1635–1651.
Van Eck, N. J., Waltman, L., Noyons, E. C. M., & Buter, R. K. (2010). Automatic term identification for bibliometric mapping. Scientometrics, 82(3), 581–596.
Zhai, C., & Lafferty, J. (2001, September 9–13). A study of smoothing methods for language models applied to ad hoc information retrieval. Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 334–342), New Orleans, LA, USA.

Author information

Authors and Affiliations

Department of Library and Information Science, Yonsei University, Seoul, South Korea
Min Song
Department of Information and Library Science, School of Informatics and Computing, Indiana University, Bloomington, IN, USA
Ying Ding

Corresponding author

Correspondence to Min Song.

Editor information

Editors and Affiliations

School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Ying Ding
University of Antwerp, Antwerp, Belgium
Ronald Rousseau
University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA
Dietmar Wolfram

Appendix: Normalization, Mapping, and Clustering Techniques Used by VOSviewer

In this appendix, we provide a more detailed description of the normalization, mapping, and clustering techniques used by VOSviewer.

1.1 Normalization

We first discuss the association strength normalization (Van Eck & Waltman, 2009), which VOSviewer uses to normalize for differences between nodes in the number of edges they have to other nodes. Let a_ij denote the weight of the edge between nodes i and j, where a_ij = 0 if there is no edge between the two nodes. Since VOSviewer treats all networks as undirected, we always have a_ij = a_ji. The association strength normalization constructs a normalized network in which the weight of the edge between nodes i and j is given by

$$ s_{ij} = \frac{2 m a_{ij}}{k_i k_j}, \tag{11.12} $$

where k_i (k_j) denotes the total weight of all edges of node i (node j) and m denotes the total weight of all edges in the network. In mathematical terms,

$$ k_i = \sum_j a_{ij} \quad \text{and} \quad m = \frac{1}{2} \sum_i k_i. \tag{11.13} $$

We sometimes refer to s_ij as the similarity of nodes i and j. For an extensive discussion of the rationale of the association strength normalization, we refer to Van Eck and Waltman (2009).
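As an illustration of (11.12) and (11.13), the following is a minimal Python sketch, not VOSviewer's own code. The function name `association_strength` and the three-node toy network are hypothetical, and the weight matrix is assumed to be symmetric with non-negative entries and a zero diagonal.

```python
def association_strength(a):
    """Return s_ij = 2 * m * a_ij / (k_i * k_j) for a symmetric weight matrix.

    `a` is a list of lists with a[i][j] == a[j][i] and a zero diagonal
    (an assumption of this sketch).
    """
    n = len(a)
    k = [sum(row) for row in a]   # k_i: total weight of all edges of node i
    m = sum(k) / 2.0              # m: total weight of all edges in the network
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Pairs without an edge keep similarity 0.
            if i != j and a[i][j] != 0:
                s[i][j] = 2.0 * m * a[i][j] / (k[i] * k[j])
    return s


# Hypothetical toy network: edges 0-1 (weight 2), 0-2 (weight 1), 1-2 (weight 1).
a = [[0, 2, 1],
     [2, 0, 1],
     [1, 1, 0]]
s = association_strength(a)
```

Here k = (3, 3, 2) and m = 4, so s_01 = 16/9 while s_02 = s_12 = 4/3. The heavier edge between nodes 0 and 1 keeps a higher similarity after the node totals are divided out.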

1.2 Mapping

We now consider the VOS mapping technique used by VOSviewer to position the nodes in the network in a two-dimensional space. The VOS mapping technique minimizes the function

$$ V(\mathbf{x}_1, \dots, \mathbf{x}_n) = \sum_{i<j} s_{ij} \left\Vert \mathbf{x}_i - \mathbf{x}_j \right\Vert^2 \tag{11.14} $$

subject to the constraint

$$ \frac{2}{n(n-1)} \sum_{i<j} \left\Vert \mathbf{x}_i - \mathbf{x}_j \right\Vert = 1, \tag{11.15} $$

where n denotes the number of nodes in the network, x_i denotes the location of node i in a two-dimensional space, and ||x_i − x_j|| denotes the Euclidean distance between nodes i and j. VOSviewer uses a variant of the SMACOF algorithm (e.g., Borg & Groenen, 2005) to minimize (11.14) subject to (11.15). We refer to Van Eck et al. (2010) for a more extensive discussion of the VOS mapping technique, including a comparison with multidimensional scaling.
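To make the objective and the constraint concrete, the sketch below evaluates (11.14) for a given layout and rescales a layout so that it satisfies (11.15). It does not implement SMACOF or any other minimization step. All function names, the similarity values, and the coordinates are hypothetical, and the layout is assumed to contain at least two distinct points.

```python
import math


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def vos_objective(s, x):
    """Evaluate sum_{i<j} s_ij * ||x_i - x_j||^2, i.e. the function in (11.14)."""
    n = len(x)
    return sum(s[i][j] * distance(x[i], x[j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def mean_pairwise_distance(x):
    """Left-hand side of (11.15): the average distance over all node pairs."""
    n = len(x)
    total = sum(distance(x[i], x[j])
                for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))


def rescale_to_constraint(x):
    """Scale a layout uniformly so that its average pairwise distance is 1."""
    d = mean_pairwise_distance(x)
    return [(px / d, py / d) for px, py in x]


# Hypothetical similarities and an arbitrary starting layout.
s = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
x = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
y = rescale_to_constraint(x)
value = vos_objective(s, y)
```

The starting layout has pairwise distances 3, 4, and 5, so dividing by their mean of 4 yields a layout that meets the constraint. A minimizer would then search among such layouts for the one with the smallest objective value, which pulls strongly similar nodes close together.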

1.3 Clustering

Finally, we discuss the clustering technique used by VOSviewer. Nodes are assigned to clusters by maximizing the function

$$ V(c_1, \dots, c_n) = \sum_{i<j} \delta(c_i, c_j) \left( s_{ij} - \gamma \right) \tag{11.16} $$

where c_i denotes the cluster to which node i is assigned, δ(c_i, c_j) denotes a function that equals 1 if c_i = c_j and 0 otherwise, and γ denotes a resolution parameter that determines the level of detail of the clustering. The higher the value of γ, the larger the number of clusters that will be obtained. The function in (11.16) is a variant of the modularity function introduced by Newman and Girvan (2004) and Newman (2005) for clustering the nodes in a network. There is also an interesting mathematical relationship between the problem of minimizing (11.14) subject to (11.15) and the problem of maximizing (11.16). Because of this relationship, the mapping and clustering techniques used by VOSviewer constitute a unified approach to mapping and clustering the nodes in a network. We refer to Waltman et al. (2010) for more details. We further note that VOSviewer uses the recently introduced smart local moving algorithm (Waltman & Van Eck, 2013) to maximize (11.16).
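The effect of the resolution parameter can be seen by evaluating (11.16) for a few candidate assignments. The sketch below only scores a given assignment; it is not the smart local moving algorithm. The function name `clustering_quality`, the similarity values, and the two γ values are hypothetical.

```python
def clustering_quality(s, c, gamma):
    """Evaluate sum_{i<j} delta(c_i, c_j) * (s_ij - gamma), i.e. (11.16).

    `s` is a symmetric similarity matrix and `c[i]` is the cluster label
    of node i. Only pairs in the same cluster contribute to the sum.
    """
    n = len(c)
    return sum(s[i][j] - gamma
               for i in range(n) for j in range(i + 1, n)
               if c[i] == c[j])


# Hypothetical similarities: nodes 0-1 and 2-3 are strongly related (3.0),
# every other pair is weakly related (0.5).
s = [[0.0, 3.0, 0.5, 0.5],
     [3.0, 0.0, 0.5, 0.5],
     [0.5, 0.5, 0.0, 3.0],
     [0.5, 0.5, 3.0, 0.0]]

one_cluster = [0, 0, 0, 0]
two_clusters = [0, 0, 1, 1]
singletons = [0, 1, 2, 3]

low = {name: clustering_quality(s, c, 1.0)
       for name, c in [("one", one_cluster), ("two", two_clusters), ("single", singletons)]}
high = {name: clustering_quality(s, c, 4.0)
        for name, c in [("one", one_cluster), ("two", two_clusters), ("single", singletons)]}
```

With γ = 1 the two-cluster assignment scores highest (4 versus 2 for one cluster and 0 for singletons). With γ = 4 every within-cluster pair contributes a negative term, so the singleton assignment scores highest (0 versus −2 and −16). This matches the statement that a higher γ produces more clusters.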

About this chapter

Cite this chapter

Song, M., Ding, Y. (2014). Topic Modeling: Measuring Scholarly Impact Using a Topical Lens. In: Ding, Y., Rousseau, R., Wolfram, D. (eds) Measuring Scholarly Impact. Springer, Cham. https://doi.org/10.1007/978-3-319-10377-8_11

DOI: https://doi.org/10.1007/978-3-319-10377-8_11
Published: 29 September 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10376-1
Online ISBN: 978-3-319-10377-8
eBook Packages: Computer Science, Computer Science (R0)
