International Journal of Speech Technology, Volume 22, Issue 3, pp 785–815

An efficient framework of utilizing the latent semantic analysis in text extraction

  • Ahmad Hussein Ababneh
  • Joan Lu
  • Qiang Xu


Applying latent semantic analysis (LSA) in text mining is costly in both space and time. This paper proposes a new text extraction method that sets out a framework for employing statistical semantic analysis in text extraction efficiently. The method uses the centrality feature and omits segments of the text that have a high verbatim, statistical, or semantic similarity with previously processed segments. Similarity is identified by a new multi-layer similarity method that computes similarity in three statistical layers: it uses the Jaccard similarity in the first layer, the vector space model in the second, and LSA in the third. The multi-layer similarity restricts the third layer to segments whose similarity the first and second layers failed to estimate. The Rouge tool is used in the evaluation, but because Rouge does not consider the extract’s size, we supplemented it with a new evaluation strategy based on the compression rate and the ratio of sentence intersections between the automatic and the reference extracts. Our comparisons with classical LSA and traditional statistical extraction showed that we reduced the use of the LSA procedure by 52% and obtained a 65% reduction in the original matrix dimensions, while also achieving remarkable accuracy results. We conclude that employing the centrality feature with the proposed multi-layer framework yields a significant solution in terms of efficiency and accuracy in the field of text extraction.
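The layered idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say when a layer "fails" to estimate a similarity, so the sketch assumes a single hypothetical `threshold` and escalates to the next layer only when the cheaper one does not reach it. The function names, whitespace tokenization, raw term-frequency weighting, and rank `k` are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def tokens(segment):
    """Lowercase whitespace tokenization (placeholder for real preprocessing)."""
    return segment.lower().split()


def jaccard(a, b):
    """Layer 1: word-set overlap between two segments."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def vsm_cosine(a, b):
    """Layer 2: cosine between raw term-frequency vectors (assumed weighting)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def lsa_cosine(segments, i, j, k=2):
    """Layer 3: cosine in a rank-k space obtained from a truncated SVD."""
    import numpy as np  # imported lazily: only this layer needs NumPy

    vocab = sorted({t for s in segments for t in tokens(s)})
    row = {t: r for r, t in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(segments)))  # terms x segments
    for col, s in enumerate(segments):
        for t in tokens(s):
            matrix[row[t], col] += 1
    _, sigma, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(sigma))
    reduced = (sigma[:k, None] * vt[:k]).T  # one row per segment
    u, v = reduced[i], reduced[j]
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


def multilayer_similarity(segments, i, j, threshold=0.8, k=2):
    """Return (score, layer); the SVD runs only if layers 1 and 2 stay below threshold."""
    score = jaccard(segments[i], segments[j])
    if score >= threshold:
        return score, 1
    score = vsm_cosine(segments[i], segments[j])
    if score >= threshold:
        return score, 2
    return lsa_cosine(segments, i, j, k), 3
```

Under these assumptions, two verbatim-identical segments are resolved in layer 1, and two segments that share a dominant repeated term but differ in vocabulary are resolved in layer 2, so the singular value decomposition is paid for only on the remaining pairs.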





Automatic text extraction


Automatic text summarization


Compression rate


Latent semantic analysis


Multi-layer similarity


Natural language processing


Ratio of sentences intersections


Vector space model




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Computing and Engineering, University of Huddersfield, Huddersfield, UK
