Volume 121, Issue 3, pp 1239–1268

Analysis of the effect of data properties in automated patent classification

  • Juan Carlos Gomez


Patent classification is a task performed by experts in patent offices around the world, who assign category codes to a patent application based on its technical content. The number of applications is constantly growing, and there is an economic interest in developing accurate and fast models to automate the classification task. In this paper, we present a methodology to systematically analyze the effect of three patent data properties and two classification details on the patent classification task: the patent section used for training and testing, the document representation, the patent codes used for training, the use of the hierarchy of categories, and the base classifier. For the analysis, we create a diverse set of models by combining different options for these properties. We evaluate the models in detail on standard patent datasets in two languages, English and German, using three performance metrics, validating the results with statistical tests, and comparing them with other models in the literature. Our findings indicate that it is important to follow a methodology to properly choose the options for the data properties, so that the resulting model matches the intended goal in terms of classification accuracy and computational efficiency. Some combinations of options yield models with good results but high computational cost, while others yield models that produce slightly worse results at a fraction of the training time.


Keywords: Patent classification, Hierarchical classification, Multilabel classification, Document representation, Supervised learning, IPC



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2019

Authors and Affiliations

  1. Departamento de Ingeniería Electrónica, DICIS, Universidad de Guanajuato, Salamanca, Mexico
