Analytics and visualization of citation network applying topic-based clustering
Surveying papers is not an easy task for novice researchers, because they may miss appropriate keywords for their survey. It often takes young researchers a long time to find research papers, even when they use well-known search engines such as Google Scholar. In addition, they may not be able to readily understand the positions of papers in their research fields. To resolve this problem, many researchers have studied citation network visualization techniques for surveying papers. However, it is still often difficult to observe complicated relations across multiple research fields or to traverse all the relations of interest. Additional clues, as well as a citation network, are therefore important for surveying papers. In this paper, we propose a visualization technique for citation networks that applies topic-based paper clustering. Our technique categorizes papers by applying LDA (latent Dirichlet allocation) and constructs clustered networks consisting of the papers. We applied the technique to three datasets. The visualization results demonstrated that the proposed technique could help users understand the positions of papers in their research fields. We also conducted a subjective evaluation against a time-oriented technique and demonstrated that our technique was more helpful for novice researchers such as students in finding papers.
Keywords: Visualization, Citation network, Edge bundling, Reference, Topic-based clustering
1 Introduction
Finding research papers is an important task for understanding trends in research fields and discovering related work. Researchers use text-based portal websites such as Google Scholar (2018), ACM Digital Library (2018), and IEEE Xplore Digital Library (2018). They also look up the references of papers they have read. However, it is not always easy for novice researchers to find the papers they want to read and to understand instantly, from their search results, the positions of those papers in the research fields. Moreover, young researchers may miss papers if they do not find the appropriate keywords, or if the papers they really want to survey straddle multiple research fields. We define keywords in this paper as the terms that make up a topic, not the terms with which authors annotate their papers. A topic includes multiple keywords.
There have been many studies on the visualization of citation networks, including Mackinlay et al. (1995) and Small (1999), which aimed to alleviate these difficulties. However, we believe that many open problems remain in the visualization of citation networks. For example, researchers continually create new fusions of multiple fields, and therefore they need to organize and understand the relations of papers that cover multiple research fields. Another problem arises when surveying papers in unfamiliar research fields. Papers in unfamiliar research fields sometimes do not include the terms commonly used in the research field with which the users are familiar. Conversely, the same term may have very different meanings depending on the research field. In such cases, we find that the papers are not what we expected only after we have read them. Understanding the positions of papers in the research fields is important for researchers to identify whether the papers are related to the topics they want to survey.
R1: Find much-cited papers that include user-specified topics.
R2: Find papers whose contents are similar to the focused papers, such as papers that do not contain the user-specified keywords but belong to the user-interested topics.
R3: Find the contents of papers that have citation relations with papers that use user-specified keywords or belong to user-interested topics.
R4: Find tightly related pairs of topics.
Categorize papers that have similar topics to the same group (Sect. 3.1).
Place papers that belong to the category in a circular region (Sect. 3.2).
Place citing and cited papers closer by applying a force-directed layout algorithm (Sect. 3.2).
Summarize citation relations by applying an edge bundling method (Sect. 3.2).
We applied the proposed technique to the datasets describing the citations of papers published in ACM SIGGRAPH, IEEE Transactions on Visualization and Computer Graphics, and IEEE Computer Graphics and Applications. This paper introduces the visualization results with these datasets and discusses the effectiveness of our technique.
Our technique applies a general-purpose graph layout technique, whereas many studies on the visualization of citation relations apply a time-oriented visualization design. We aimed to represent the positions of user-interested papers within a set of topics and the relations between pairs of topics. We therefore applied our own clustered graph layout algorithm (Itoh et al. 2009; Nakazawa et al. 2012) rather than a time-oriented visualization. This paper presents our subjective evaluation in comparison with an existing time-ordered technique. We administered questionnaires to 21 graduate students as examples of novice researchers. We have previously presented results of our technique (Nakazawa et al. 2015); this paper additionally introduces a way to apply the technique to datasets covering multiple conferences.
Proposed a technique to visualize citation networks based on the topics of the papers.
Demonstrated, through the visualization results, that our technique could help users grasp the positions of papers in research fields.
Compared our technique with a time-ordered visualization technique.
2 Related work
This section introduces existing visualization techniques for the topics or keywords of text corpora and scientific literature. There have been many visualization techniques for topics in text corpora (Liu et al. 2012; Stasko et al. 2008). Some of these techniques focus on the topics of scientific literature. Lee et al. (2005) presented PaperLens, a visualization technique that first applies a mixture distribution model to the titles and keywords, then estimates their topics, and finally shows papers by topic and publication year. Shahaf et al. (2012) introduced a technique that visualizes the relationships among terms in a form resembling metro maps. As other examples of topic analysis of publications, Henry et al. (2007) visualized networks of frequently used keywords and co-authors, and analyzed the features of multiple conferences. CiteRivers (Heimerl et al. 2016) visualized the trends of the topics mentioned in papers and the number of papers for each conference or journal as citation flows. Users can easily understand the trend of the topic they focus on and which conferences or journals are related to it. These techniques help users understand the transitions and trends of research topics in a certain conference or journal. However, they only visualize the topics mentioned in the papers; they do not support citation relations. It is difficult with these techniques to navigate to related papers that are not directly related to the user-interested topics but are cited by papers on those topics. Therefore, they do not satisfy the requirements mentioned in Sect. 1.
Many researchers have studied the visualization of citation networks. Citation network topology (Brandes and Willhalm 2002; Chen 2006) is one approach to visualizing citation patterns. Brandes and Willhalm (2002) presented a visualization technique for citation networks based on topographic maps, which places the hub papers cited by many papers higher than the other papers. It also arranges papers that have similar citation patterns closer together. This enables users to easily find the hub papers and the groups of papers that have similar citation patterns. These techniques help users find well-cited papers and understand citation patterns. However, it is difficult to find papers by applying the citation topology alone when users do not know appropriate papers to serve as starting points for tracking the citation relations.
Many researchers have also studied time-ordered visualization techniques (Matejka et al. 2012; Stasko et al. 2013; Van Eck and Waltman 2014) for citation networks. CitNetExplorer (Van Eck and Waltman 2014) applies a transitive reduction to the citation network and places the publications in chronological order. It colors the nodes, which denote publications, according to attributes such as successor and predecessor. Citeology (Matejka et al. 2012) orders papers by their citation counts within each year and places them outward from the center of the display space. It can visualize up to eight hops of citations. These techniques represent the structures of citation networks by placing the nodes corresponding to papers in time-series order. When a citation network has complicated relations across multiple research fields, this causes serious edge crossings and clutter, which harm readability. Heavily cluttered visualization results prevent users from finding the positions of papers, even though users want to understand the positions of the papers of interest in their research fields. Because new papers always cite older papers, a time-ordered visualization design is not necessary to show the direction of citations. CiteVis (Stasko et al. 2013) visualizes citation relations by highlighting the citing nodes and the cited nodes when a user clicks a node. This technique reduces the visual clutter of edges and enables users to understand the citation relations of the clicked paper and the trend in the number of papers presented at a conference.
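The transitive reduction mentioned above for CitNetExplorer removes every citation edge that is already implied by a longer citation path. The following is a minimal sketch of that general graph operation, assuming the citation network is acyclic; it is not taken from CitNetExplorer, and the function name `transitive_reduction` and the paper labels are hypothetical.

```python
def transitive_reduction(edges):
    """Return the edges of a DAG that are not implied by a longer path.

    edges: iterable of (citing, cited) pairs. The graph is assumed to be
    acyclic, which holds when newer papers only cite older ones.
    """
    # Build the adjacency sets; every node gets an entry, even sinks.
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    def reachable_from(start):
        """All nodes reachable from start by one or more edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reduced = set()
    for u in succ:
        for v in succ[u]:
            # (u, v) is redundant if v can be reached through another
            # direct successor w of u, i.e. via a path of length >= 2.
            if not any(v in reachable_from(w) for w in succ[u] if w != v):
                reduced.add((u, v))
    return reduced


# Hypothetical example: C cites B and A, and B cites A.
# The direct edge C -> A is implied by C -> B -> A and is removed.
example_edges = [("C", "B"), ("B", "A"), ("C", "A")]
reduced_edges = transitive_reduction(example_edges)
```

The sketch recomputes reachability for each edge, which is simple but slow; it is meant only to illustrate why the reduced network has fewer edges to draw.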
One of our goals is to help novice researchers find the papers they want to read and understand instantly, from search results, the positions of those papers in the research fields. To achieve this goal, we think that both the citation relations and the topics of the papers are important, as described in the previous section. Therefore, we propose a visualization technique that considers both the citation relations and the topics of the papers. Only a small number of studies have addressed both. Dunne et al. (2012) proposed an integrated visualization of a citation network and a summary description of papers. Users can simultaneously look at the citations, a ranking based on citation counts, and a summary description of the papers in each cluster generated by graph clustering based on the citation structure. A drawback of this representation is that it may require a large display space. Also, the network visualization shows only the papers extracted by a keyword-based search. Therefore, users may miss papers that do not use the user-specified keywords or that are not cited by papers using such keywords.
Although these novel visualization techniques have been presented, it is still not always easy to find important papers with them. The first reason is that these existing techniques often require users to manually specify the papers whose citations they want to explore. Novice researchers often do not know all the appropriate keywords, and therefore it is not easy for them to determine which papers they should read. The second reason is that users may miss papers that have similar contents but no citation relations when they focus only on citation relations to find papers. The third reason is that many recent research fields have emerged from fusions of multiple research fields. Researchers need to understand, in an organized manner, the relations of papers that cover such multiple fields along with their fusion. However, there seem to be no visualization techniques that address this problem.
3 Proposed technique
This section describes the processing flow of the presented technique. We treat the papers as nodes and citations as directed edges of a network. The technique classifies the papers based on their contents to construct a hierarchical network. The technique then applies our hierarchical network layout technique with an edge bundling algorithm. Our implementation also provides rendering and interaction techniques.
3.1 Clustering papers
The proposed technique applies LDA (latent Dirichlet allocation) (Blei et al. 2003) to categorize papers based on their contents. LDA is a generative topic model that allows a document to include various topics. It is widely used because it can avoid overfitting the data. As a paper can include multiple topics, we think LDA is appropriate for our purpose. It can address the problem of categorizing papers that straddle multiple research fields. The technique applies LDA to the set of all the paper abstracts to estimate topics and to calculate the topic distribution for each abstract. LDA needs to be given the number of topics, so we determine the number heuristically. We regard these topics as research fields and categorize all papers based on them. The technique regards a paper as related to a particular topic if the value of the topic distribution for that topic is larger than a threshold. As a preprocessing step, we removed unnecessary words from the abstracts to improve the quality of the clustering results. The removed words included unimportant words such as prepositions and overly frequent terms such as “propose” and “technique.” Then, we inferred the contents of each topic from the 20 words with the highest probability in the topic. Our clustering allows a paper to belong to multiple topics, and a cluster in this paper can include multiple topics. For example, one cluster includes papers on topic A only, whereas papers on both topic A and topic B belong to another cluster.
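The thresholding and grouping step described above can be sketched as follows. This is a minimal illustration, assuming that the per-paper topic distributions have already been estimated by LDA; the function names, the threshold value of 0.3, and the example distributions are hypothetical and are not taken from our implementation.

```python
from collections import defaultdict


def assign_topics(topic_dist, threshold):
    """Return the indices of the topics whose weight exceeds the threshold."""
    return tuple(i for i, weight in enumerate(topic_dist) if weight > threshold)


def cluster_by_topic_set(paper_topic_dists, threshold=0.3):
    """Group papers that are related to exactly the same set of topics.

    paper_topic_dists: dict mapping a paper id to its topic distribution
    (a list of weights, one per topic). Papers with no topic above the
    threshold end up in the cluster keyed by the empty tuple.
    """
    clusters = defaultdict(list)
    for paper_id, dist in paper_topic_dists.items():
        clusters[assign_topics(dist, threshold)].append(paper_id)
    return dict(clusters)


# Hypothetical topic distributions over three topics (0, 1, 2).
example = {
    "p1": [0.80, 0.10, 0.10],  # topic 0 only
    "p2": [0.45, 0.45, 0.10],  # topics 0 and 1
    "p3": [0.70, 0.20, 0.10],  # topic 0 only
    "p4": [0.10, 0.15, 0.75],  # topic 2 only
}
clusters = cluster_by_topic_set(example, threshold=0.3)
# p1 and p3 share the cluster for topic 0, whereas p2, which straddles
# topics 0 and 1, forms a separate cluster.
```

Keying each cluster by the full set of topics is what lets a paper on topics A and B sit in a different cluster from a paper on topic A alone.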
3.2 Network layout
3.3 Color scaling for network rendering
Use the color except pink to compare the edges of the clicked node and the other edges
Use two different colors to compare the nodes which are clicked at the first time and the next time
3.4 User interface
Figure 5 shows a snapshot of the user interface we implemented. The left side of the window features the drawing space, while the right side features two tabs. One of the tabs features various GUI widgets. Users can scale and shift the view, switch the edge bundling mode, and set its threshold using the GUI widgets shown in Fig. 5a, c. When a user clicks a node corresponding to a particular paper, the technique displays the details of the paper, such as the digital object identifier (DOI), title, authors, year, and abstract, on the panel featured by the other tab. At the same time, it highlights the edges of the clicked node and those of the nodes connected to the clicked node. This edge highlighting function can be applied to two nodes at once, which enables users to compare the citations of the two papers.
We implemented the proposed technique with Java Development Kit (JDK) 1.6.0. In this section, we show some results visualizing citation networks.
4.1 An example of a conference proceeding
We applied the technique to a citation network dataset consisting of 1072 full papers published in the SIGGRAPH conferences during 1990–1994 and 2000–2010, provided by the ACM Digital Library (2018). We extracted the title, publication year, abstract, references, and authors from the HTML files of the papers. We did not use the papers from 1995–1999, because we could not extract their abstracts from the ACM Digital Library.
4.1.1 Example of hardware and GPU
Suppose that a user surveys research papers on “hardware and GPU.” Figure 6 shows an example in which the user selected the “hardware and GPU” category. We observed that the cluster in the center, which contains papers categorized only as “hardware and GPU,” has dense relationships with the “physical simulation,” “lighting,” and “shape modeling” categories. We also found that the cited bundles of the “hardware and GPU” cluster are thicker than the citing ones. This means that many papers in the “physical simulation,” “lighting,” and “shape modeling” fields refer to papers in the “hardware and GPU” cluster, and that research in these fields has often evolved from research in the “hardware and GPU” category. Among these relationships, the one between the “hardware and GPU” and “lighting” clusters shows this most clearly. Therefore, we expect that the “hardware and GPU” cluster could give a clue to a research team developing hardware systems when they want to know in which research fields their products are widely applied.
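The comparison of cited and citing bundles can be illustrated by counting the citation edges between clusters in each direction. The sketch below assumes that the thickness of a bundle reflects the number of citation edges it summarizes; the function names, paper ids, and citation data are hypothetical and do not come from the SIGGRAPH dataset.

```python
from collections import Counter


def bundle_counts(citations, cluster_of):
    """Count citation edges for each ordered pair of distinct clusters.

    citations: iterable of (citing_paper, cited_paper) pairs.
    cluster_of: dict mapping a paper id to its cluster label.
    Edges inside a single cluster are ignored.
    """
    counts = Counter()
    for citing, cited in citations:
        src, dst = cluster_of[citing], cluster_of[cited]
        if src != dst:
            counts[(src, dst)] += 1
    return counts


def cited_vs_citing(counts, cluster):
    """Return (edges pointing into the cluster, edges leaving the cluster)."""
    cited = sum(n for (src, dst), n in counts.items() if dst == cluster)
    citing = sum(n for (src, dst), n in counts.items() if src == cluster)
    return cited, citing


# Hypothetical papers and citations.
cluster_of = {
    "g1": "hardware and GPU",
    "g2": "hardware and GPU",
    "l1": "lighting",
    "l2": "lighting",
    "s1": "shape modeling",
}
citations = [
    ("l1", "g1"),  # lighting paper cites a hardware paper
    ("l2", "g1"),
    ("s1", "g2"),  # shape modeling paper cites a hardware paper
    ("g2", "l1"),  # hardware paper cites a lighting paper
    ("g1", "g2"),  # citation inside one cluster, ignored
]
counts = bundle_counts(citations, cluster_of)
cited, citing = cited_vs_citing(counts, "hardware and GPU")
# Here cited is 3 and citing is 1, so the cited bundles would be thicker.
```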
4.1.2 Example of lighting and CG algorithm
A fast shadow algorithm for area light sources using back projection (in 1994)
The irradiance Jacobian for partially occluded polyhedral sources (in 1994)
A clustering algorithm for radiosity in complex environments (in 1994)
Illuminating micro-geometry based on precomputed visibility (in 2000)
Efficient image-based methods for rendering soft shadows (in 2000)
Conservative volumetric visibility with occluder fusion (in 2000)
4.1.3 Example with a keyword
Continuous capture of skin deformation (Sand et al. 2003)
Building efficient, accurate character skins from examples (Mohr and Gleicher 2003)
Capturing and animating skin deformation in human motion (Park and Hodgins 2006)
Data-driven modeling of skin and muscle deformation (Park and Hodgins 2008)
4.2 Examples of journals
VR and AR
Geometry and modeling
Animation and motion
Lighting and rendering
Software and environment
Color and projection
GPU and hardware
5.1 Preliminary questionnaires
What do you want to know when you search for papers?
What technique do you want for surveying papers well?
What do you want to know if you look into the citation network visualization in a particular conference for twenty years?
Regarding question 2, more than half of the students mentioned that smart word-based search techniques, including synonym recommendation and search refinement, are important for the paper survey process. This result indicates that novice researchers, including graduate students, have trouble selecting keywords when searching for papers. Regarding question 3, we roughly divided the answers into three categories: “the transition of research fields,” “the citation relations,” and “both research fields and citation relations, or what they reveal in combination.” This demonstrates the demand for understanding both research fields and citations.
5.2 Evaluation: comparison with time-oriented visual representation
The transition in the number of papers published in the conference each year.
The main topic of the conference.
The trend of a research topic by year.
The research fields that seem to have a strong relationship with a field you focus on.
Much-cited papers on a certain topic.
The latest paper on a certain topic.
The content trends of papers citing the paper you read (or clicked).
Papers whose contents are similar to the paper you read (or clicked).
Papers that had a great influence on the paper you read (or clicked).
We presented a visualization technique of citation networks for surveying research papers and discussed the visualization results. Our technique applies topic-based paper clustering to construct a hierarchical network. It then applies a hybrid force-directed and space-filling network layout algorithm, and an edge bundling technique with Catmull–Rom spline curves. Our GUI design satisfies requirement R1 through its topic selection filtering function. We applied the technique to publication datasets from ACM SIGGRAPH, IEEE Transactions on Visualization and Computer Graphics, and IEEE Computer Graphics and Applications. The results showed that our technique could help users understand the positions of papers in research fields and find papers even when they do not know all the appropriate keywords. Our technique could also show the trends of topics and citation relations in a particular conference or journal. The visualization with the keyword “skin” demonstrated that our technique satisfies requirements R2 and R3, and the case of the hardware topic of ACM SIGGRAPH showed that it satisfies requirement R4. This paper also introduced the results of a user evaluation in comparison with a time-oriented visualization technique. The results demonstrated that our technique was more helpful for novice researchers such as students in finding papers. For future work, we think that combining paper information and author information, as in PivotPaths (Dörk et al. 2012), would be helpful for surveying papers.
- ACM Digital Library, http://dl.acm.org/
- Brandes U, Willhalm T (2002) Visualization of bibliographic networks with a reshaped landscape metaphor. In: Proceedings of the symposium on data visualisation, vol 2002, pp 159–164
- Google Scholar, https://scholar.google.co.jp/
- IEEE Xplore Digital Library, http://ieeexplore.ieee.org/
- Park SI, Hodgins JK (2008) Data-driven modeling of skin and muscle deformation. In: ACM SIGGRAPH 2008 papers (SIGGRAPH ’08). ACM, New York, NY, USA, Article 96
- Itoh T, Muelder C, Ma K, Sese J (2009) A hybrid space-filling and force-directed layout method for visualizing multiple-category graphs. In: IEEE pacific visualization symposium, pp 121–128
- Lee B, Czerwinski M, Robertson G, Bederson BB (2005) Understanding research trends in conferences using PaperLens. In: CHI’05 extended abstracts on Human factors in computing systems, pp 1969–1972
- Liu S, Zhou MX, Pan S, Song Y, Qian W, Cai W, Lian X (2012) TIARA: interactive, topic-based visual text summarization and analysis. ACM Trans Intell Syst Technol 3, 2, Article 25, 28
- Mackinlay JD, Rao R, Card SK (1995) An organic user interface for searching citation links. In: The SIGCHI conference on Human factors in computing systems, pp 67–73
- Matejka J, Grossman T, Fitzmaurice G (2012) Citeology: visualizing paper genealogy. In: CHI’12 extended abstracts on human factors in computing systems, pp 181–190
- Nakazawa R, Itoh T, Saito T (2015) A visualization of research papers based on the topics and citation network. In: 19th international conference on information visualisation (iV), pp 283–289
- Nakazawa R, Itoh T, Sese J, Terada A (2012) Integrated visualization of gene network and ontology applying a hierarchical graph visualization technique. In: 16th International conference on information visualization (iV), pp 81–86
- Sand P, McMillan L, Popović J (2003) Continuous capture of skin deformation. In: ACM SIGGRAPH 2003 papers (SIGGRAPH ’03). ACM, New York, NY, USA, pp 578–586
- Shahaf D, Guestrin C, Horvitz E (2012) Metro maps of science. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1122–1130
- Stasko J, Choo J, Han Y, Hu M, Pileggi H, Sadana R, Stolper CD (2013) Citevis: Exploring conference paper citation data visually. Posters of IEEE InfoVis
- Tsumura N, Ojima N, Sato K, Shiraishi M, Shimizu H, Nabeshima H, Akazaki S, Hori K, Miyake Y (2003) Image-based skin color and texture analysis/synthesis by extracting hemoglobin and melanin information in the skin. ACM Trans. Graph. 22, 3 (July 2003), pp 770–779
- Van Eck NJ, Waltman L (2014) CitNetExplorer: a new software tool for analyzing and visualizing citation networks. J Inf 8(4):802–823
- Weyrich T, Matusik W, Pfister H, Bickel B, Donner C, Tu C, McAndless J, Lee J, Ngan A, Jensen HW, Gross M (2006) Analysis of human faces using a measurement-based skin reflectance model. ACM Trans. Graph. 25, 3 (July 2006), pp 1013–1024