Combining Apriori Approach with Support-Based Count Technique to Cluster the Web Documents

  • Conference paper
Computational Intelligence in Data Mining

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 556)

Abstract

The dynamic Web, where thousands of pages are updated every second, is growing at lightning speed. Hence, retrieving the required Web documents in a fraction of a second is becoming a challenging task for present-day search engines. Clustering, an important data mining technique, can shed light on this problem, and association techniques of data mining play a vital role in clustering Web documents. This paper is an effort in that direction, where the following techniques are proposed:

  1. A new feature selection technique named term-term correlation is introduced, which reduces the size of the corpus by eliminating noisy and redundant features.

  2. A novel technique named Support-Based Count (SBC) is proposed, which is combined with the traditional Apriori approach to cluster the Web documents.

Empirical results on two benchmark datasets show that the proposed approach is more promising than traditional clustering approaches.
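The abstract names the traditional Apriori approach without describing it. As a rough illustration only, the sketch below runs the classical level-wise Apriori procedure on a toy corpus in which each document is reduced to its set of terms. The function name `apriori`, the toy documents, and the absolute `min_support` threshold are all assumptions made for this example. It does not implement the authors' Support-Based Count or term-term correlation techniques, whose details are not given in the abstract.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(terms): support_count} for all frequent term sets.

    Classical level-wise Apriori; min_support is an absolute document count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many documents contain each single term.
    counts = {}
    for t in transactions:
        for term in t:
            key = frozenset([term])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-sets that contain exactly k terms.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: support is the number of documents containing the candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy corpus (hypothetical): each document is the set of terms it contains.
docs = [
    {"web", "search", "engine"},
    {"web", "search", "cluster"},
    {"web", "cluster", "apriori"},
    {"search", "engine"},
]
frequent_sets = apriori(docs, min_support=2)
```

Documents that share a frequent term set, such as the two containing both "web" and "search", could then be treated as a candidate cluster. How the proposed method refines such candidates is not stated in the abstract.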

Corresponding author

Correspondence to Rajendra Kumar Roul.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Roul, R.K., Sahay, S.K. (2017). Combining Apriori Approach with Support-Based Count Technique to Cluster the Web Documents. In: Behera, H., Mohapatra, D. (eds) Computational Intelligence in Data Mining. Advances in Intelligent Systems and Computing, vol 556. Springer, Singapore. https://doi.org/10.1007/978-981-10-3874-7_12

  • DOI: https://doi.org/10.1007/978-981-10-3874-7_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-3873-0

  • Online ISBN: 978-981-10-3874-7

  • eBook Packages: Engineering, Engineering (R0)
