Learning from syntax generalizations for automatic semantic annotation

Journal of Intelligent Information Systems

Abstract

A huge amount of textual data now comes from online social communities such as Twitter and from encyclopedic resources such as Wikipedia and similar platforms. This Big Data era has created new challenges: making sense of large data stores and efficiently finding specific information within them. In more domain-specific scenarios, such as the management of legal documents, the extraction of semantic knowledge can help domain engineers find relevant information more quickly and can assist in the construction of application-based legal ontologies. In this work, we address the problem of automatically extracting structured knowledge to improve semantic search and ontology creation on textual databases. To achieve this goal, we propose an approach that first relies on well-known Natural Language Processing techniques such as Part-Of-Speech tagging and Syntactic Parsing. We then transform this information into generalized features that aim to capture the linguistic variability surrounding the target semantic units. Finally, these feature vectors are fed into a Support Vector Machine classifier, which learns a model that automates the semantic annotation. We first tested our technique on the problem of automatically extracting semantic entities and involved objects within legal texts. We then focused on the identification of hypernym relations and definitional sentences, demonstrating the validity of the approach on different tasks and domains.
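The feature-generalization step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact feature scheme, so everything below is an assumption made for illustration: the hand-written toy parse, the `TARGET` placeholder, the `relation(head,dependent)` feature format, and the helper names `generalize` and `vectorize`. The idea shown is only that words around a target unit are abstracted to their Part-Of-Speech tags, so that features describe syntactic context rather than specific vocabulary.

```python
from collections import Counter

# Toy, hand-written parse (an assumption for illustration): (word, POS tag)
# tokens plus dependency triples (head_index, relation, dependent_index).
# In practice these would be produced by a POS tagger and a syntactic parser.
tokens = [("A", "DT"), ("contract", "NN"), ("binds", "VBZ"),
          ("the", "DT"), ("parties", "NNS")]
dependencies = [(1, "det", 0), (2, "nsubj", 1), (2, "dobj", 4), (4, "det", 3)]


def generalize(tokens, dependencies, target_index):
    """Return generalized dependency features for one target token.

    The target word becomes the placeholder "TARGET" and every other word
    is replaced by its POS tag, so the features abstract away from lexicon.
    """
    def label(i):
        return "TARGET" if i == target_index else tokens[i][1]

    features = Counter()
    for head, relation, dependent in dependencies:
        # Keep only the dependencies that involve the target token.
        if target_index in (head, dependent):
            features[f"{relation}({label(head)},{label(dependent)})"] += 1
    return features


def vectorize(feature_counts, vocabulary):
    """Map feature counts onto a fixed, ordered vocabulary (bag of features)."""
    return [feature_counts.get(name, 0) for name in vocabulary]


# Features for the target word "contract" (index 1).
contract_features = generalize(tokens, dependencies, target_index=1)
vocabulary = sorted(contract_features)
vector = vectorize(contract_features, vocabulary)
```

Vectors built this way for labeled target units could then be passed to any Support Vector Machine implementation for training; that step is omitted here to keep the sketch free of external dependencies.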

Notes

  1. http://www.statisticbrain.com/twitter-statistics/

  2. http://www.telegraph.co.uk/technology/twitter/9945505/Twitter-in-numbers.html

  3. http://www.wikipedia.org/

  4. http://nlp.stanford.edu/software/index.shtml

  5. We only used the constraint that the hypernym has to be different from the hyponym.

  6. http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

Author information

Corresponding author

Correspondence to Luigi Di Caro.

Additional information

This work was funded by the ITxLaw project with Compagnia di San Paolo.

About this article

Cite this article

Boella, G., Caro, L.D., Ruggeri, A. et al. Learning from syntax generalizations for automatic semantic annotation. J Intell Inf Syst 43, 231–246 (2014). https://doi.org/10.1007/s10844-014-0320-9

  • DOI: https://doi.org/10.1007/s10844-014-0320-9
