Abstract
A large amount of information on the World Wide Web is available in semi-structured form. Knowledge discovery in semi-structured documents has been recognized as a promising task. Such data is typically hidden within HTML formatting intended for human viewing. The details of this formatting vary widely from site to site and change frequently, so a global schema cannot be constructed, and discovering interesting rules from the data is a complex and tedious process. Most existing systems use hand-coded wrappers to extract information, which is monotonous and time-consuming. An intelligent, automated method is needed to process such documents. Learning grammatical information from a given sample of semi-structured documents has attracted considerable attention in past decades. To understand what the data says, it is necessary to know the structure of the data and thereby discover the syntactic and semantic knowledge of its language.
The problem of learning the correct grammar for an unknown language from a finite set of examples of that language is known as the grammatical inference problem. In automated grammar learning, the task is to infer grammar rules from given information about the target language. An example that belongs to the target language is called a positive example; otherwise it is called a negative example. In this paper we propose a grammar inference methodology to automate the construction of grammar rules and facilitate information extraction. We use a hybrid technique that combines association analysis with a sequential algorithm to generate context-free grammar rules from semi-structured (HTML) documents.
Our algorithm infers a sequential pattern from a sequence of discrete HTML tags. The basic insight is that a sub-string is selected on the basis of a high support factor, taking entire sentences into account. A sub-string that appears frequently in the string can be replaced by a grammatical rule that generates it, and this process is repeated many times, producing single-length rules for the sequence. The result is strictly a set of context-free grammar rules, which provides a compact summary of the corpora that aids understanding of their properties.
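The idea of replacing frequent sub-strings of a tag sequence with grammar rules can be illustrated with a minimal sketch. This is not the authors' algorithm: it swaps in a simple Re-Pair-style replacement of the most frequent adjacent pair of symbols, and it treats "support" as the count of non-overlapping occurrences, which is an assumption. The names `TagCollector`, `tag_sequence`, `infer_rules`, `expand`, and the `min_support` parameter are all hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    """Turn an HTML string into a flat list of tag symbols."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair of symbols."""
    counts = Counter()
    last_end = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_end.get(pair, -1) < i:  # skip overlaps such as "a a a"
            counts[pair] += 1
            last_end[pair] = i + 1
    return counts


def replace_pair(seq, pair, symbol):
    """Replace every non-overlapping occurrence of pair with symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def infer_rules(seq, min_support=2):
    """Repeatedly replace the most frequent pair with a new rule.

    Assumption: support is the pair's occurrence count. Nonterminals are
    upper-case (R1, R2, ..., S), so they cannot clash with the lower-case
    tag names produced by HTMLParser.
    """
    rules = {}
    seq = list(seq)
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, support = max(counts.items(), key=lambda kv: kv[1])
        if support < min_support:
            break
        symbol = f"R{len(rules) + 1}"
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    rules["S"] = tuple(seq)  # start rule: what remains of the sequence
    return rules


def expand(rules, symbol="S"):
    """Expand a rule back into terminal tags (checks the grammar is lossless)."""
    out = []
    for s in rules[symbol]:
        out.extend(expand(rules, s) if s in rules else [s])
    return out
```

For the input `<ul><li>a</li><li>b</li><li>c</li></ul>`, this sketch yields the rule `R1 -> li /li` and the start rule `S -> ul R1 R1 R1 /ul`. Expanding `S` reproduces the original tag sequence, so the grammar is a compact but lossless summary of the input.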
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thakur, R., Jain, S., Chaudhari, N.S. (2012). Incremental Discovery of Sequential Pattern from Semi-structured Document Using Grammatical Inference. In: Ramanujam, R., Ramaswamy, S. (eds) Distributed Computing and Internet Technology. ICDCIT 2012. Lecture Notes in Computer Science, vol 7154. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28073-3_30
DOI: https://doi.org/10.1007/978-3-642-28073-3_30
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28072-6
Online ISBN: 978-3-642-28073-3
eBook Packages: Computer Science, Computer Science (R0)