Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Algorithms for numeric data classification have been applied to text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections, avoiding the high sparsity and allowing relationships among the different objects that compose a text collection to be modeled. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent textual collections by a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. This algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights, for each class of the text collection, to the objects that represent the terms. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
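
The idea of a document-term bipartite network with per-class term weights can be illustrated with a small sketch. The abstract does not state how IMBHN computes its weights, so the code below assumes a simple least-mean-squares error-correction update; the function names (`build_bipartite`, `train`, `classify`), the learning rate, and the toy documents are all hypothetical and are not taken from the article.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Return one {term: frequency} dict per document.

    Each dict holds the weighted edges from a document object to its
    term objects; no document-document similarity is computed.
    """
    edges = []
    for text in docs:
        freq = defaultdict(int)
        for term in text.lower().split():
            freq[term] += 1
        edges.append(dict(freq))
    return edges


def train(edges, labels, classes, rate=0.1, epochs=50):
    """Learn one weight per (term, class) pair.

    ASSUMPTION: a least-mean-squares update is used here for illustration;
    the abstract does not specify the actual IMBHN update rule.
    """
    weights = defaultdict(float)  # (term, class) -> weight
    for _ in range(epochs):
        for doc, label in zip(edges, labels):
            for c in classes:
                # Class score: term weights propagated along the doc's edges.
                score = sum(f * weights[(t, c)] for t, f in doc.items())
                target = 1.0 if c == label else 0.0
                error = target - score
                for t, f in doc.items():
                    weights[(t, c)] += rate * f * error
    return weights


def classify(weights, doc, classes):
    """Assign the class whose term weights give the highest score."""
    def score(c):
        return sum(f * weights.get((t, c), 0.0) for t, f in doc.items())
    return max(classes, key=score)


if __name__ == "__main__":
    docs = ["goal match team", "team goal score",
            "vote election party", "party vote law"]
    labels = ["sport", "sport", "politics", "politics"]
    classes = ["sport", "politics"]

    model = train(build_bipartite(docs), labels, classes)
    new_doc = build_bipartite(["election vote"])[0]
    print(classify(model, new_doc, classes))
```

Because the learned model is just a table of term weights per class, an unseen document can be classified from its own terms alone, which is what makes the model inductive rather than transductive.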

Author information

Correspondence to Rafael Geraldeli Rossi.

Additional information

This work was supported by the São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9.

About this article

Cite this article

Rossi, R.G., de Andrade Lopes, A., de Paulo Faleiros, T. et al. Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network. J. Comput. Sci. Technol. 29, 361–375 (2014). https://doi.org/10.1007/s11390-014-1436-7

  • DOI: https://doi.org/10.1007/s11390-014-1436-7