Flint: From Web Pages to Probabilistic Semantic Data

Part of the book series: Data-Centric Systems and Applications (DCSA)

Abstract

A large and growing number of web sites publish structured data about recognizable concepts such as stock quotes, movies, and restaurants. The opportunity to build applications on the huge amount of data available from these sites has been discussed for more than a decade, but in practice only a small fraction of this information is currently used. The main reason is that extracting and integrating web data of good quality is expensive and often requires human intervention. In this chapter, we present the main results of the Flint project, which aims to develop automatic, domain-independent tools for all the steps required to benefit from web data: discovering data-intensive web sites that contain information about entities of interest, extracting and integrating the published data, and performing a probabilistic analysis to characterize the imprecision of the data and the accuracy of the sources. The result of this processing is semantically annotated data that can be used to populate a probabilistic database and to develop novel applications.

Notes

  1. The distance between an attribute and a mapping is measured from the centroid of the mapping.

  2. The names of the models presented in this chapter are inspired by those introduced by Dong et al. in [31].

References

  1. Agichtein, E., Gravano, L.: Snowball: extracting relations from large plain-text collections. DL ’00, pp. 85–94 (2000)

  2. Amento, B., Terveen, L.G., Hill, W.C.: Does “authority” mean quality? predicting expert quality ratings of web documents. SIGIR, pp. 296–303 (2000)

  3. Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. ACM SIGMOD international conference on management of data (SIGMOD’2003), San Diego, California, pp. 337–348 (2003)

  4. Banko, M., Cafarella, M., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction from the web. IJCAI (2007)

  5. Batini, C., Scannapieco, M.: Data Quality: Concepts, Methodologies, and Techniques. Springer, Berlin, Heidelberg, New York (2008)

  6. Bilke, A., Naumann, F.: Schema matching using duplicates. ICDE, pp. 69–80 (2005)

  7. Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Exploiting information redundancy to wring out structured data from the web. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, pp. 1063–1064. ACM, New York (2010)

  8. Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Redundancy-driven web data extraction and integration. WebDB (2010)

  9. Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P.: Automatically building probabilistic databases from the web. WWW (Companion Volume), pp. 185–188 (2011)

  10. Blanco, L., Crescenzi, V., Merialdo, P.: Efficiently locating collections of web pages to wrap. WEBIST (2005)

  11. Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Supporting the automatic construction of entity aware search engines. WIDM, pp. 149–156 (2008)

  12. Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Probabilistic models to reconcile complex data from inaccurate data sources. CAiSE, pp. 83–97 (2010)

  13. Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Contextual data extraction and instance-based integration. International workshop on searching and integrating new web data sources (VLDS) (2011)

  14. Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Wrapper generation for overlapping web sources. Web Intelligence (WI) (2011)

  15. Blanco, L., Dalvi, N.N., Machanavajjhala, A.: Highly efficient algorithms for structural clustering of large websites. WWW, pp. 437–446 (2011)

  16. Brin, S.: Extracting patterns and relations from the World Wide Web. Proceedings of the First Workshop on the Web and Databases (WebDB’98) (in conjunction with EDBT’98), pp. 102–108 (1998)

  17. Cafarella, M.J., Etzioni, O., Suciu, D.: Structured queries over web text. IEEE Data Eng. Bull. 29(4), 45–51 (2006)

  18. Cafarella, M.J., Halevy, A.Y., Khoussainova, N.: Data integration for the relational web. PVLDB 2(1), 1090–1101 (2009)

  19. Cafarella, M.J., Halevy, A.Y., Wang, D.Z., Wu, E., Zhang, Y.: Webtables: exploring the power of tables on the web. PVLDB 1(1), 538–549 (2008)

  20. Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific Web resource discovery. Comput. Networks (Amsterdam, Netherlands) 31(11–16), 1623–1640 (1999)

  21. Chang, K.C.C., Bin, H., Zhen, Z.: Toward large scale integration: building a metaquerier over databases on the web. CIDR 2005, pp. 44–66 (2005)

  22. Chuang, S.L., Chang, K.C.C., Zhai, C.X.: Context-aware wrapping: synchronized data extraction. VLDB, pp. 699–710 (2007)

  23. Clemen, R.T., Winkler, R.L.: Combining probability distributions from experts in risk analysis. Risk Anal. 19(2), 187–203 (1999)

  24. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: towards automatic data extraction from large Web sites. International conference on very large data bases (VLDB 2001), Roma, Italy, 11–14 September 2001, pp. 109–118 (2001)

  25. Dalvi, N.N., Suciu, D.: Management of probabilistic data: foundations and challenges. PODS, pp. 1–12 (2007)

  26. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J.A., Zien, J.Y.: Semtag and seeker: bootstrapping the semantic web via automated semantic annotation. WWW ’03: proceedings of the 12th International Conference on World Wide Web, pp. 178–186. ACM, New York, NY, USA (2003). http://doi.acm.org/10.1145/775152.775178

  27. Do, H.H., Rahm, E.: Matching large schemas: approaches and evaluation. Inf. Syst. 32(6), 857–885 (2007)

  28. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Learning to map between ontologies on the semantic web. WWW ’02, pp. 662–673 (2002)

  29. Doan, A., Ramakrishnan, R., Chen, F., DeRose, P., Lee, Y., McCann, R., Sayyadian, M., Shen, W.: Community information management. IEEE Data Eng. Bull. 29(1), 64–72 (2006)

  30. Dong, X., Berti-Equille, L., Hu, Y., Srivastava, D.: Global detection of complex copying relationships between sources. PVLDB 3(1), 1358–1369 (2010)

  31. Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: the role of source dependence. PVLDB 2(1), 550–561 (2009)

  32. Dong, X.L., Berti-Equille, L., Srivastava, D.: Truth discovery and copying detection in a dynamic world. PVLDB 2(1), 562–573 (2009)

  33. Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. IJCAI, pp. 1034–1041 (2005)

  34. Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. VLDB, pp. 216–225 (1997)

  35. Galland, A., Abiteboul, S., Marian, A., Senellart, P.: Corroborating information from disagreeing views. Proceedings of WSDM, New York, USA (2010)

  36. Guha, R., McCool, R.: Tap: a semantic web platform. Comput. Networks 42(5), 557–577 (2003)

  37. Madhavan, J., Bernstein, P.A., Doan, A., Halevy, A.Y.: Corpus-based schema matching. ICDE, pp. 57–68 (2005)

  38. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008). http://www.informationretrieval.org

  39. Pant, G., Srinivasan, P.: Learning to crawl: comparing classification schemes. ACM Trans. Inf. Syst. 23(4), 430–462 (2005)

  40. Sarma, A.D., Dong, X., Halevy, A.Y.: Bootstrapping pay-as-you-go data integration systems. SIGMOD conference, pp. 861–874 (2008)

  41. Shen, W., DeRose, P., McCann, R., Doan, A., Ramakrishnan, R.: Toward best-effort information extraction. SIGMOD conference, pp. 1031–1042 (2008)

  42. Shen, W., DeRose, P., Vu, L., Doan, A., Ramakrishnan, R.: Source-aware entity matching: a compositional approach. ICDE, pp. 196–205. IEEE Computer Society, Silver Spring, MD (2007)

  43. Sizov, S., Biwer, M., Graupmann, J., Siersdorfer, S., Theobald, M., Weikum, G., Zimmer, P.: The bingo! system for information portal generation and expert web search. CIDR 2003, First Biennial conference on innovative data systems research, Asilomar, CA, USA (2003)

  44. Vidal, M.L.A., da Silva, A.S., de Moura, E.S., Cavalcanti, J.M.B.: Structure-driven crawler generation by example. In: Efthimiadis, E.N., Dumais, S.T., Hawking, D., Järvelin, K. (eds.) SIGIR, pp. 292–299. ACM, New York (2006)

  45. Wu, M., Marian, A.: Corroborating answers from multiple web sources. WebDB (2007)

  46. Yin, X., Han, J., Yu, P.S.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)

Author information

Corresponding author

Correspondence to Paolo Merialdo.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Blanco, L., Bronzi, M., Crescenzi, V., Merialdo, P., Papotti, P. (2012). Flint: From Web Pages to Probabilistic Semantic Data. In: De Virgilio, R., Guerra, F., Velegrakis, Y. (eds) Semantic Search over the Web. Data-Centric Systems and Applications. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25008-8_13

  • DOI: https://doi.org/10.1007/978-3-642-25008-8_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25007-1

  • Online ISBN: 978-3-642-25008-8

  • eBook Packages: Computer Science (R0)
