Soft Computing, Volume 23, Issue 4, pp 1239–1255

CCODM: conditional co-occurrence degree matrix document representation method

  • Wei Wei
  • Chonghui Guo (Email author)
  • Jingfeng Chen
  • Lin Tang
  • Leilei Sun
Methodologies and Application

Abstract

Document representation is a key problem in document analysis and processing tasks such as document classification, clustering and information retrieval. For unstructured text data in particular, the choice of document representation method affects the performance of the downstream algorithms used in applications and research. In this paper, we propose a novel document representation method based on word co-occurrence, called the conditional co-occurrence degree matrix document representation method (CCODM). CCODM considers not only the co-occurrence of terms but also their conditional dependencies in a specific context, so that more of the useful structural and semantic information in the original documents is retained. Extensive classification experiments with different supervised and unsupervised feature selection methods show that CCODM achieves better performance than the vector space model, latent Dirichlet allocation, the general co-occurrence matrix representation method and the document embedding method.
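
To make the idea of a conditional, asymmetric co-occurrence measure concrete, the following is a minimal sketch and not the CCODM definition from the paper, which this abstract does not give. It assumes a document is already split into tokenized sentences, treats the sentence as the context, and estimates the degree for an ordered pair (a, b) as the fraction of sentences containing a that also contain b. The function names `cooccurrence_counts` and `conditional_degree_matrix` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count sentences per term and sentences per ordered term pair.

    Assumption: the sentence is the co-occurrence context. The paper's
    actual context definition is not stated in the abstract.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each term once per sentence
        term_counts.update(unique)
        for a, b in combinations(unique, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return term_counts, pair_counts


def conditional_degree_matrix(sentences):
    """Return (vocab, matrix) where matrix[i][j] is the fraction of
    sentences containing vocab[i] that also contain vocab[j].

    Illustrative estimate only; the diagonal is left at 0.0.
    """
    term_counts, pair_counts = cooccurrence_counts(sentences)
    vocab = sorted(term_counts)
    index = {term: i for i, term in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in vocab]
    for (a, b), n in pair_counts.items():
        matrix[index[a]][index[b]] = n / term_counts[a]
    return vocab, matrix


if __name__ == "__main__":
    doc = [["a", "b"], ["a"], ["a", "b"]]
    vocab, matrix = conditional_degree_matrix(doc)
    for term, row in zip(vocab, matrix):
        print(term, row)
```

Unlike a plain co-occurrence count, this matrix is not symmetric: in the toy document, every sentence containing "b" also contains "a", but only two of the three sentences containing "a" contain "b".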

Keywords

Document representation · Word co-occurrence · Conditional co-occurrence degree matrix · Classification · Feature selection

Notes

Acknowledgements

This work was supported in part by the Natural Science Foundation of China [Grant Numbers 71771034, 71501023, 71421001] and the Open Program of State Key Laboratory of Software Architecture [Item Number SKLSAOP1703]. In addition, we are very grateful to Dr. Deqing Wang (Wang et al. 2016b) for giving us all the code for RP-GSO and to Dr. Xiangzhu Meng for guiding us through all the doc2vec experiments. We also thank the anonymous reviewers for their constructive comments on this paper.

Compliance with ethical standards

Conflict of interest

Wei Wei, Chonghui Guo and Lin Tang have received research grants from Neusoft Corporation (Shenyang, PR China). Jingfeng Chen and Leilei Sun declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

References

  1. Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39(5):4760–4768. doi: 10.1016/j.eswa.2011.09.160
  2. Benabdeslem K, Elghazel H, Hindawi M (2016) Ensemble constrained Laplacian score for efficient and robust semi-supervised feature selection. Knowl Inf Syst 49(3):1161–1185. doi: 10.1007/s10115-015-0901-0
  3. Bengio Y, Courville A, Vincent P (2014) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828. doi: 10.1109/TPAMI.2013.50
  4. Bengio Y, Schwenk H, Sencal J, Morin F, Gauvain J (2003) Neural probabilistic language models. J Mach Learn Res 3(6):1137–1155. doi: 10.1162/153244303322533223. http://dl.acm.org/citation.cfm?id=944919.944966
  5. Bernotas M, Laurutis R (2007) The peculiarities of the text document representation, using ontology and tagging-based clustering technique. J Inf Technol Control 36(2):217–220
  6. Bettina G, Kurt H (2017) Topicmodels: an R package for fitting topic models. Version 0.2-6. doi: 10.18637/jss.v040.i13
  7. Bhushan S, Danti A (2017) Classification of text documents based on score level fusion approach. Pattern Recognit Lett 94:118–126. doi: 10.1016/j.patrec.2017.05.003
  8. Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022. http://dl.acm.org/citation.cfm?id=944919.944937
  9. Boulares M, Jemni M (2016) Learning sign language machine translation based on elastic net regularization and latent semantic analysis. Artif Intell Rev 46(2):145–166. doi: 10.1007/s10462-016-9460-3
  10. Bullinaria J, Levy J (2012) Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behav Res Methods 44(3):890–907. doi: 10.3758/s13428-011-0183-8
  11. Cambria E, Gastaldo P, Bisio F, Zunino R (2015) An ELM-based model for affective analogical reasoning. Neurocomputing 149:443–455. doi: 10.1016/j.neucom.2014.01.064
  12. Cheng X, Yan X, Lan Y, Guo J (2014) BTM: topic modeling over short texts. IEEE Trans Knowl Data Eng 26(12):2928–2941. doi: 10.1109/TKDE.2014.2313872
  13. Du Y, Liu W, Lv X, Peng G (2015) An improved focused crawler based on semantic similarity vector space model. Appl Soft Comput 36:392–407. doi: 10.1016/j.asoc.2015.07.026
  14. Farahat A, Kamel M (2011) Statistical semantics for enhancing document clustering. Knowl Inf Syst 28(2):365–393. doi: 10.1007/s10115-010-0367-z
  15. Franco-Salvador M, Gupta P, Rosso P, Banchs R (2016) Cross-language plagiarism detection over continuous-space- and knowledge graph-based representations of language. Knowl Based Syst 111:87–99. doi: 10.1016/j.knosys.2016.08.004
  16. Hsu C, Huang W (2016) Integrated dimensionality reduction technique for mixed-type data involving categorical values. Appl Soft Comput 43:199–209. doi: 10.1016/j.asoc.2016.02.015
  17. Huang H, Kuo Y (2010) Cross-lingual document representation and semantic similarity measure: a fuzzy set and rough set based approach. IEEE Trans Fuzzy Syst 18(6):1098–1111. doi: 10.1142/S0218001411008890
  18. Ibrahim O, Landa-Silva D (2016) Term frequency with average term occurrences for textual information retrieval. Soft Comput 20(8):3045–3061. doi: 10.1007/s00500-015-1935-7
  19. Jin L, Gong W, Fu W, Wu H (2015) A text classifier of English movie reviews based on information gain. In: The 3rd international conference on applied computing and information technology/2nd international conference on computational science and intelligence, pp 454–457. doi: 10.1109/ACIT-CSI.2015.86
  20. Johnson-Laird P, Oatley K (1989) The language of emotions: an analysis of a semantic field. Cogn Emot 3(3):81–123. doi: 10.1080/02699938908408075
  21. Keikha M, Khonsari A, Oroumchian F (2009) Rich document representation and classification: an analysis. Knowl Based Syst 22(1):67–71. doi: 10.1016/j.knosys.2008.06.002
  22. Lau R, Xia Y, Ye Y (2014) A probabilistic generative model for mining cybercriminal networks from online social media. IEEE Comput Intell Mag 9(1):31–43. doi: 10.1109/MCI.2013.2291689
  23. Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: Proceedings of the 31st international conference on machine learning (ICML-14), pp 1188–1196
  24. Li J, Li J, Fu X, Masud M, Huang J (2016) Learning distributed word representation with multi-contextual mixed embedding. Knowl Based Syst 106:220–230. doi: 10.1016/j.knosys.2016.05.045
  25. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/
  26. Liaw A, Wiener M (2015) Package ’randomForest’. Breiman and Cutler’s random forests for classification and regression. Version 4.6-12. https://www.stat.berkeley.edu/~breiman/RandomForests/
  27. Liu Q, Zhang H, Yu H, Cheng X (2004) Chinese lexical analysis using cascaded hidden Markov model. J Comput Res Dev 41(8):1421–1429
  28. Liu Z, Yu W, Deng Y, Bian Z (2010) A feature selection method for document clustering based on part-of-speech and word co-occurrence. In: 2010 Seventh international conference on fuzzy systems and knowledge discovery, vol 5, pp 2331–2334. doi: 10.1109/FSKD.2010.5569827
  29. Lopez-Gazpio I, Maritxalar M, Gonzalez-Agirre A, Rigau G, Uria L, Agirre E (2017) Interpretable semantic textual similarity: finding and explaining differences between sentences. Knowl Based Syst 119:186–199. doi: 10.1016/j.knosys.2016.12.013
  30. Lu Y, Mei Q, Zhai C (2011) Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf Retr J 14(2):178–203. doi: 10.1007/s10791-010-9141-9
  31. Lu M, Zhao X, Zhang L, Li F (2016) Semi-supervised concept factorization for document clustering. Inf Sci 331:86–98. doi: 10.1016/j.ins.2015.10.038
  32. Miao Y, Grefenstette E, Blunsom P (2017) Discovering discrete latent topics with neural variational inference. arXiv preprint arXiv:1706.00359
  33. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013b) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119
  34. Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space, pp 1–12. arXiv preprint arXiv:1301.3781
  35. Neubig G, Watanabe T (2016) Optimization for statistical machine translation: a survey. Comput Linguist 42(1):1–54. doi: 10.1162/COLI_a_00241
  36. Nguyen A, Yosinski J, Clune J (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 427–436. http://arxiv.org/abs/1412.1897
  37. Pessiot J, Kim Y, Amini M, Gallinari P (2010) Improving document clustering in a learned concept space. Inf Process Manag 46(2):180–192. doi: 10.1016/j.ipm.2009.09.007
  38. Phan X, Nguyen C, Le D, Nguyen L, Horiguchi S, Ha Q (2011) A hidden topic-based framework toward building applications with short web documents. IEEE Trans Knowl Data Eng 23(7):961–976. doi: 10.1109/TKDE.2010.27
  39. Radim Ř, Petr S (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, pp 45–50
  40. Ravi D, Bober M, Farinella G, Guarnera M, Battiato S (2016) Semantic segmentation of images exploiting DCT based features and random forest. Pattern Recognit 52:260–273. doi: 10.1016/j.patcog.2015.10.021
  41. Ren F, Sohrab M (2013) Class-indexing-based term weighting for automatic text classification. Inf Sci 236:109–125. doi: 10.1016/j.ins.2013.02.029
  42. Rule A, Cointet J, Bearman P (2015) Lexical shifts, substantive changes, and continuity in State of the Union discourse. Proc Natl Acad Sci USA 112(35):10837–10844. doi: 10.1073/pnas.1512221112
  43. Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. doi: 10.1145/361219.361220
  44. Tang G, Xia Y, Sun J, Zhang M, Zheng TF (2015) Statistical word sense aware topic models. Soft Comput 19(1):13–27
  45. Trovati M, Bessis N (2016) An influence assessment method based on co-occurrence for topologically reduced big data sets. Soft Comput 20(5):2021–2030. doi: 10.1007/s00500-015-1621-9
  46. Vila M, Bardera A, Feixas M, Sbert M (2011) Tsallis mutual information for document classification. Entropy 13(9):1694–1707. doi: 10.3390/e13091694
  47. Wang H (2015) Study on the application of feature selection for big text data using expected cross entropy. J Inf Comput Sci 12(18):6835–6843. doi: 10.12733/jics20150077
  48. Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-Test feature selection approach based on term frequency for text categorization. Pattern Recognit Lett 45(11):1–10. doi: 10.1016/j.patrec.2014.02.013
  49. Wang D, Shen H, Truong Y (2016a) Efficient dimension reduction for high-dimensional matrix-valued data. Neurocomputing 190:25–34. doi: 10.1016/j.neucom.2015.12.096
  50. Wang D, Zhang H, Liu R, Liu X, Wang J (2016b) Unsupervised feature selection through Gram–Schmidt orthogonalization: a word co-occurrence perspective. Neurocomputing 173(P3):845–854. doi: 10.1016/j.neucom.2015.08.038
  51. Wu Z, Zhu H, Li G, Cui Z, Huang H, Li J, Chen E, Xu G (2017) An efficient Wikipedia semantic matching approach to text document classification. Inf Sci 393:15–28. doi: 10.1016/j.ins.2017.02.009
  52. Xiao Q, Song R (2017) Motion retrieval based on motion semantic dictionary and HMM inference. Soft Comput 21(1):255–265. doi: 10.1007/s00500-016-2059-4
  53. Xu H, Zhang F, Wang W (2015) Implicit feature identification in Chinese reviews using explicit topic mining model. Knowl Based Syst 76:166–175. doi: 10.1016/j.knosys.2014.12.012
  54. Yan H, Yang J (2014) Joint Laplacian feature weights learning. Pattern Recognit 47(3):1425–1432. doi: 10.1016/j.patcog.2013.09.038
  55. Yang Y, Pedersen J (1997) A comparative study on feature selection in text categorization. In: Proceedings of fourteenth international conference on machine learning (ICML), vol 4, pp 412–420. http://dl.acm.org/citation.cfm?id=645526.657137
  56. Zheng Y, Han W, Zhu C (2014) A novel feature selection method based on category distribution and phrase attributes. In: International conference on trustworthy computing and services (ISCTCS), Berlin, Heidelberg, pp 25–32. doi: 10.1007/978-3-662-47401-3_4
  57. Zhou Q, Zhou H, Li T (2016) Cost-sensitive feature selection using random forest: selecting low-cost subsets of informative features. Knowl Based Syst 95:1–11. doi: 10.1016/j.knosys.2015.11.010

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  • Wei Wei (1, 2)
  • Chonghui Guo (1, 2), Email author
  • Jingfeng Chen (1)
  • Lin Tang (1, 2, 3)
  • Leilei Sun (1)

  1. Institute of Systems Engineering, Dalian University of Technology, Dalian, People’s Republic of China
  2. State Key Laboratory of Software Architecture (Neusoft Corporation), Shenyang, People’s Republic of China
  3. City Institute, Dalian University of Technology, Dalian, People’s Republic of China
