Pragmatic Quality Assessment for Automatically Extracted Data

  • Conference paper
Conceptual Modeling (ER 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9974)

Abstract

Automatically extracted data is rarely “clean” with respect to pragmatic (real-world) constraints, which hinders applications that depend on quality data. We propose a solution for detecting pragmatic constraint violations that works via a declarative and semantically enabled constraint-violation checker. In conjunction with an ensemble of automated information extractors, the implemented prototype checks both hard and soft constraints: hard constraints are either satisfied or not, whereas soft constraints are satisfied probabilistically with respect to a threshold. An experimental evaluation shows that the constraint checker identifies semantic errors with high precision and recall and that pragmatic error identification can improve results.
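The distinction between hard and soft constraints can be illustrated with a minimal sketch. It is not the authors' prototype: the `Person` record, the two example constraints, the plausibility scores, and the 0.1 threshold are all hypothetical, chosen only to show a hard check returning true or false and a soft check comparing a probability-like score against a threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical record of extracted facts; any field may be missing.
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    mother_birth_year: Optional[int] = None


def born_before_death(p: Person) -> bool:
    """Hard constraint: either satisfied or not."""
    if p.birth_year is None or p.death_year is None:
        return True  # nothing to check when a fact is missing
    return p.birth_year <= p.death_year


def mother_age_plausibility(p: Person) -> float:
    """Soft constraint: an illustrative (made-up) plausibility score
    for the mother's age at the child's birth."""
    if p.birth_year is None or p.mother_birth_year is None:
        return 1.0
    age = p.birth_year - p.mother_birth_year
    if 15 <= age <= 45:
        return 0.99
    if 12 <= age <= 55:
        return 0.05
    return 0.0


def check(p: Person, threshold: float = 0.1) -> List[str]:
    """Return the constraint violations found for one person."""
    violations = []
    if not born_before_death(p):
        violations.append("hard: birth after death")
    # A soft constraint is violated when its score falls below the threshold.
    if mother_age_plausibility(p) < threshold:
        violations.append("soft: implausible mother age")
    return violations
```

For example, `check(Person("A", birth_year=1800, death_year=1790))` flags the hard violation, while a record whose mother would have been 50 at the birth is flagged only by the soft check under the assumed threshold.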

Notes

  1. Fe6: Form-based ensemble with 6 pipeline phases that accepts an OCRed document as input and generates a conceptualization of document-asserted facts as output.

Author information

Corresponding author

Correspondence to David W. Embley.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Woodfield, S.N., Lonsdale, D.W., Liddle, S.W., Kim, T.W., Embley, D.W., Almquist, C. (2016). Pragmatic Quality Assessment for Automatically Extracted Data. In: Comyn-Wattiau, I., Tanaka, K., Song, IY., Yamamoto, S., Saeki, M. (eds) Conceptual Modeling. ER 2016. Lecture Notes in Computer Science, vol 9974. Springer, Cham. https://doi.org/10.1007/978-3-319-46397-1_16

  • DOI: https://doi.org/10.1007/978-3-319-46397-1_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46396-4

  • Online ISBN: 978-3-319-46397-1

  • eBook Packages: Computer Science, Computer Science (R0)
