Volume 100, Issue 3, pp 787–799

Empirical study of constructing a knowledge organization system of patent documents using topic modeling

  • Zhengyin Hu
  • Shu Fang
  • Tian Liang


A knowledge organization system (KOS) can help reveal the deep knowledge structure of a patent document set. Compared to classification code systems, a personalized KOS made up of topics can represent technology information in a more agile and detailed manner. This paper presents an approach to automatically construct a KOS of patent documents based on term clumping, the Latent Dirichlet Allocation (LDA) model, K-Means clustering and Principal Components Analysis (PCA). Term clumping is adopted to generate a better bag-of-words for topic modeling, and the LDA model is applied to generate raw topics. Then, by iteratively applying K-Means clustering and PCA to the document set and the topic matrix, we generate new upper topics and compute the relationships between topics to construct the KOS. Finally, documents are mapped to the KOS. The nodes of the KOS are topics, which are represented by terms and their weights, and the leaves are patent documents. We evaluated the approach on a set of Large Aperture Optical Elements (LAOE) patent documents as an empirical study and constructed the LAOE KOS. The method uncovered deep semantic relationships between the topics and helped better describe the technology themes of LAOE. Based on the KOS, two types of applications were implemented: automatic classification of patent documents and categorical refinement of search results.
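The grouping of raw topics into upper topics and the mapping of documents onto leaves can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the topic-term and document-topic weights have already been produced by an LDA tool, runs a single K-Means pass with no PCA step and no iteration over levels, and uses made-up toy weights, vocabulary and patent identifiers (`P1`, `P2`). The function names `kmeans` and `build_kos` are hypothetical.

```python
import math
import random

def kmeans(vectors, k, iters=50, seed=0):
    """Plain Lloyd's K-Means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def build_kos(topic_term, doc_topic, vocab, k):
    """Build a two-level tree: upper topics -> raw topics -> documents."""
    labels = kmeans(topic_term, k)
    kos = {}
    for topic_id, upper_id in enumerate(labels):
        node = kos.setdefault(upper_id, {"topics": {}})
        terms = sorted(zip(vocab, topic_term[topic_id]), key=lambda tw: -tw[1])
        node["topics"][topic_id] = {"terms": terms, "docs": []}
    # Represent each upper topic by the mean term weights of its children.
    for node in kos.values():
        children = [topic_term[t] for t in node["topics"]]
        mean = [sum(col) / len(children) for col in zip(*children)]
        node["terms"] = sorted(zip(vocab, mean), key=lambda tw: -tw[1])
    # Attach each document as a leaf under its highest-weight raw topic.
    for doc_id, dist in doc_topic.items():
        best = max(range(len(dist)), key=dist.__getitem__)
        kos[labels[best]]["topics"][best]["docs"].append(doc_id)
    return kos

# Toy inputs (illustrative only): 4 raw topics over a 4-term vocabulary.
vocab = ["lens", "mirror", "coating", "polishing"]
topic_term = [
    [0.6, 0.4, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.4, 0.6],
]
doc_topic = {"P1": [0.7, 0.1, 0.1, 0.1], "P2": [0.1, 0.1, 0.2, 0.6]}

kos = build_kos(topic_term, doc_topic, vocab, k=2)
```

Each node keeps its terms sorted by weight, mirroring the description of topics as weighted term lists. Placing a document under its single highest-weight topic is one simple mapping choice; a threshold on topic weight would allow a document to appear under several topics.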


Topic model; Term clumping; Knowledge organization system; Text clustering; Principal Component Analysis



  • DII: Derwent Innovations Index
  • KOS: Knowledge Organization System
  • LAOE: Large Aperture Optical Elements
  • LDA: Latent Dirichlet Allocation
  • MALLET: MAchine Learning for LanguagE Toolkit (a machine learning toolkit developed by Andrew McCallum et al. at the University of Massachusetts Amherst)
  • NLP: Natural Language Processing
  • PCA: Principal Components Analysis



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2014

Authors and Affiliations

  1. Chengdu Document and Information Center, Chinese Academy of Sciences, Chengdu, China
  2. University of Chinese Academy of Sciences, Beijing, China
