Data Engineered Content Extraction Studies for Indian Web Pages

  • Bhanu Prakash Kolla
  • Arun Raja Raman
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 711)


Recent innovations in the Internet and cellular communications have opened many interesting and exciting areas of social and research activity, and one of the basic driving forces behind this is the Web page, which contains data in different forms. The data can be mobile or Internet based, online or off-line, and normally ranges in size from kilobytes to terabytes. In the Indian context, it can be computer-generated, printed, or archived data in different languages and dialects. The present study focuses on applying engineering aspects to data so that a smart set can be used to generate content in a short period, making further developments easier. After a brief overview of the complexities of Indian Web pages and current approaches in data mining, a basic pixel-based approach is developed, along with data reduction and abstraction, for use with classification approaches for content extraction. For data reduction, an engineering approach based on organizing and adapting the data into suitable inputs for classification is highlighted, and a case study is given for analysis.
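
The abstract does not spell out the pixel-based reduction or the classifier, so the following is only a minimal sketch of the general idea under stated assumptions: a grayscale page region is reduced to a short vector of block means, and that vector is matched to the nearest labelled prototype. The function names `reduce_blocks` and `nearest_label`, the block size, and the labels "text" and "image" are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Assumption: a page region is a grayscale image stored as rows of 0-255 values.
Image = List[List[int]]


def reduce_blocks(image: Image, block: int) -> List[float]:
    """Reduce an image to a feature vector holding the mean of each block."""
    height, width = len(image), len(image[0])
    features: List[float] = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            # Collect the pixels of one block, clipping at the image border.
            cells = [
                image[r][c]
                for r in range(top, min(top + block, height))
                for c in range(left, min(left + block, width))
            ]
            features.append(sum(cells) / len(cells))
    return features


def nearest_label(features: List[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def dist(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(prototypes, key=lambda label: dist(features, prototypes[label]))


# Toy 4x4 region: 16 pixels are reduced to 4 block means.
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
features = reduce_blocks(page, 2)

# Hypothetical prototypes; a real system would learn these from labelled data.
prototypes = {
    "text": [0.0, 255.0, 255.0, 0.0],
    "image": [128.0, 128.0, 128.0, 128.0],
}
label = nearest_label(features, prototypes)
```

The reduction step is what keeps the classifier input small: the feature vector grows with the number of blocks rather than the number of pixels, which is one plausible reading of "organizing and adapting" data into suitable inputs for classification.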


Classification Data engineering Reduction Pixel based Knowledge Mining 



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Department of Computer Science Engineering, Koneru Lakshmaiah Education Foundation, Guntur, India
  2. Department of Structural Engineering, IIT Madras, Chennai, India
