Abstract
Many approaches to web page classification exist, but they tend to be computationally expensive and yield limited accuracy. An efficient algorithm is therefore needed to reduce computation time and improve classification results. A few tags, such as the title, generally carry the principal substance of a page's text, and these patterns can affect the effectiveness of text classification. However, the most widely used text weighting scheme, term frequency-inverse document frequency (TF-IDF), does not consider the structure of web pages. To address this issue, a new feature tag weighting algorithm is put forward. It takes into account web page structure information, such as the title, meta tags, and head, as well as the useful information in the content. In the proposed study, web page data are first pre-processed and text weights are computed using TF-IDF; the feature tag weighting algorithm then identifies frequent and important tags; finally, the web document is classified on the basis of the text weights and tag weights.
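The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no numeric tag weights, no exact TF-IDF variant, and no classifier, so the values in `TAG_WEIGHTS`, the smoothed IDF formula, and the nearest-neighbour cosine rule in `classify` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights: the abstract names title, meta and head as
# informative but gives no numbers, so these values are placeholders.
TAG_WEIGHTS = {"title": 3.0, "meta": 2.0, "h1": 2.0}
DEFAULT_WEIGHT = 1.0


class TagTextParser(HTMLParser):
    """Collects (tag, text) pairs so each term keeps its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # Meta tags are void elements; their text lives in `content`.
            content = dict(attrs).get("content")
            if content:
                self.chunks.append(("meta", content))
            return
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop until the matching open tag has been removed.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else "body"
        if data.strip() and tag not in ("script", "style"):
            self.chunks.append((tag, data))


def weighted_counts(html):
    """Term counts in which each occurrence counts as its tag's weight."""
    parser = TagTextParser()
    parser.feed(html)
    counts = Counter()
    for tag, text in parser.chunks:
        weight = TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            counts[term] += weight
    return counts


def tfidf_vectors(pages):
    """Tag-weighted TF multiplied by a smoothed IDF (an assumed variant)."""
    counts = [weighted_counts(page) for page in pages]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(pages)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1.0
        vectors.append({
            term: (freq / total) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, freq in c.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(train_pages, train_labels, page):
    """Assumed classifier: label of the most similar training page."""
    vectors = tfidf_vectors(train_pages + [page])
    query = vectors[-1]
    best = max(range(len(train_pages)), key=lambda i: cosine(vectors[i], query))
    return train_labels[best]
```

Calling `classify(train_pages, train_labels, page)` with a list of labelled HTML strings and one unlabelled HTML string returns the label of the closest training page. Because a term in the title contributes three times as much as the same term in body text under these placeholder weights, pages that share title vocabulary are pulled together more strongly than pages that only share body vocabulary.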
Acknowledgements
I would like to thank everyone who helped me gain an understanding of these research papers. I am thankful to Dr. Prateek Srivastava and Dr. Prasun Chakrabarti for their encouragement and guidance on this topic, which helped me speed up the work on structure-based web page classification for fast search. Finally, I would like to acknowledge all the websites and IEEE papers that I consulted and referred to in preparing this research paper.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Verma, K., Srivastava, P., Chakrabarti, P. (2018). Exploring Structure Oriented Feature Tag Weighting Algorithm for Web Documents Identification. In: Zelinka, I., Senkerik, R., Panda, G., Lekshmi Kanthan, P. (eds) Soft Computing Systems. ICSCS 2018. Communications in Computer and Information Science, vol 837. Springer, Singapore. https://doi.org/10.1007/978-981-13-1936-5_20
DOI: https://doi.org/10.1007/978-981-13-1936-5_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1935-8
Online ISBN: 978-981-13-1936-5
eBook Packages: Computer Science (R0)