HiRE - A Heuristic Approach for User Generated Record Extraction

Chandrakanth, S.; Thilagam, P. Santhi

doi:10.1007/978-3-319-28034-9_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9581)

Included in the following conference series:

International Conference on Distributed Computing and Internet Technology

Abstract

User Generated Content extraction is the extraction of user posts, namely reviews and comments. Extracting such content requires identifying its record structure, so that once the content is extracted, suitable filtering mechanisms can be applied to eliminate noise. Record structure identification is therefore an important prerequisite for text analytics. Most existing record structure identification techniques find records by searching for repeating patterns. This paper proposes a heuristic-based approach that uses the implicit logical organization present in the records to output the record structure.
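
The repeating-pattern search that the abstract attributes to most existing techniques can be illustrated with a minimal sketch. This is not the HiRE heuristic, whose details are not given here. The class `ChildSignatureCounter`, the function `most_repeated_child`, and the choice of "tag plus class attribute" as the signature are all assumptions made for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ChildSignatureCounter(HTMLParser):
    """Counts, per parent path, how often each child signature occurs."""

    def __init__(self):
        super().__init__()
        self.stack = ["root"]   # signatures of the currently open elements
        self.children = {}      # parent path (tuple) -> Counter of child signatures

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        signature = f"{tag}.{css_class}" if css_class else tag
        parent_path = tuple(self.stack)
        self.children.setdefault(parent_path, Counter())[signature] += 1
        if tag not in VOID_TAGS:
            self.stack.append(signature)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break


def most_repeated_child(html):
    """Return (parent_path, child_signature, count) for the most repeated child.

    Ties are broken in favour of the shallowest parent, on the assumption
    that whole records repeat at a higher level than their inner fields.
    """
    parser = ChildSignatureCounter()
    parser.feed(html)
    parser.close()
    best = None
    for parent_path, counts in parser.children.items():
        signature, count = counts.most_common(1)[0]
        key = (count, -len(parent_path))
        if best is None or key > best[0]:
            best = (key, parent_path, signature)
    if best is None:
        return None
    (count, _), parent_path, signature = best
    return "/".join(parent_path), signature, count


SAMPLE_PAGE = """
<div class="comments">
  <div class="comment"><span class="user">a</span><p>First</p></div>
  <div class="comment"><span class="user">b</span><p>Second</p></div>
  <div class="comment"><span class="user">c</span><p>Third</p></div>
</div>
"""

result = most_repeated_child(SAMPLE_PAGE)
```

On `SAMPLE_PAGE`, the candidate record is the `div.comment` element repeated under `div.comments`. A purely pattern-based search like this one is fragile: records whose class attributes differ (for example alternating row styles) would not be counted as the same pattern.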

Author information

Authors and Affiliations

National Institute of Technology Karnataka, Surathkal, Karnataka, India
S. Chandrakanth & P. Santhi Thilagam

Corresponding author

Correspondence to P. Santhi Thilagam.

Editor information

Editors and Affiliations

Microsoft Research, Redmond, Washington, USA
Nikolaj Bjørner
Indian Institute of Technology Delhi, New Delhi, India
Sanjiva Prasad
IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA
Laxmi Parida

About this paper

Cite this paper

Chandrakanth, S., Thilagam, P.S. (2016). HiRE - A Heuristic Approach for User Generated Record Extraction. In: Bjørner, N., Prasad, S., Parida, L. (eds) Distributed Computing and Internet Technology. ICDCIT 2016. Lecture Notes in Computer Science, vol 9581. Springer, Cham. https://doi.org/10.1007/978-3-319-28034-9_4

DOI: https://doi.org/10.1007/978-3-319-28034-9_4
Published: 25 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28033-2
Online ISBN: 978-3-319-28034-9
eBook Packages: Computer Science, Computer Science (R0)
