Making Structured Data Searchable via Natural Language Generation

with an Application to ESG Data
  • Jochen L. Leidner
  • Darya Kamkova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8132)

Abstract

Relational databases store structured data, which is typically accessed through report builders based on SQL queries. To search, users must understand and fill out forms, which imposes a high cognitive load. Owing to the success of Web search engines, users have become accustomed to the easier mechanism of natural language search for accessing unstructured data. However, keyword-based search methods of this kind are not easily applicable to structured data, especially where structured records contain non-textual content such as numbers.

We present a method to make structured data, including numeric data, searchable with a Web search engine-like keyword search access mechanism. Our method is based on the creation of surrogate text documents using Natural Language Generation (NLG) methods that can then be retrieved by off-the-shelf search methods.
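The idea of surrogate documents can be illustrated with a minimal sketch. It is not the authors' implementation: the records, field names, sentence template, and the tiny inverted index below are all assumptions made for illustration, and the index stands in for an off-the-shelf search engine.

```python
import re
from collections import defaultdict

# Hypothetical records; companies, fields, and values are invented for illustration.
RECORDS = [
    {"id": 1, "company": "Acme Corp", "year": 2011,
     "indicator": "CO2 emissions", "value": 1200, "unit": "tonnes"},
    {"id": 2, "company": "Globex", "year": 2012,
     "indicator": "water usage", "value": 530, "unit": "cubic metres"},
]


def render_surrogate(record):
    """Template-based generation: verbalise one database row as a sentence."""
    return (f"In {record['year']}, {record['company']} reported "
            f"{record['indicator']} of {record['value']} {record['unit']}.")


def tokenize(text):
    """Lowercase and split into alphanumeric tokens, so numbers become terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(records):
    """Inverted index from token to the ids of records whose surrogate contains it."""
    index = defaultdict(set)
    for record in records:
        for token in tokenize(render_surrogate(record)):
            index[token].add(record["id"])
    return index


def search(index, query):
    """Rank record ids by the number of distinct query terms they match."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for record_id in index.get(token, ()):
            scores[record_id] += 1
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


if __name__ == "__main__":
    idx = build_index(RECORDS)
    print(search(idx, "Acme CO2 2011"))
```

Because the numeric fields are written into the generated sentence, a query term such as `2011` or `530` matches like any other word, which is the property that plain keyword search over raw tables lacks.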

We demonstrate that this method is effective by applying it, in a federated scenario, to two databases of real-life size: a proprietary database of corporate Environmental, Social and Governance (ESG) data and a public-domain environmental pollution database. Our evaluation includes speed and index size investigations and indicates the effectiveness (P@1 = 84%, P@5 = 92%) and practicality of the method.
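For readers unfamiliar with the P@k notation, the following sketch computes the standard precision at rank k and its mean over a set of queries. The abstract does not spell out how the authors computed their figures, so the function names and the textbook definition used here are assumptions rather than a description of their evaluation.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (textbook definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalises result lists shorter than k.
    return hits / k


def mean_precision_at_k(runs, k):
    """Average P@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
```

For example, `precision_at_k([3, 1, 2], {1, 2}, 2)` counts one relevant result among the top two and therefore yields 0.5.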

Keywords

Keyword Search; United Nations Environment Programme; Database Schema; Unstructured Data; Natural Language Generation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Jochen L. Leidner
    • 1
  • Darya Kamkova
    • 1
  1. Catalyst Lab, Thomson Reuters Global Resources, Baar, Switzerland
