Large-Scale Information Extraction from Emails with Data Constraints

  • Conference paper
Big Data Analytics (BDA 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11932)

Abstract

Email is the most frequently used web application for communication and collaboration because it is easy to access, fast, and convenient to manage. Business-to-consumer (B2C) emails, such as flight reservations, payment reminders, and order confirmations, account for more than 60% of email traffic. Most of these emails are generated by filling a template with user-specific or transaction-specific values drawn from databases. In this paper, we describe algorithms for extracting important information from such emails.

Unlike web pages, emails are personal: because of privacy and legal considerations, no human other than the recipient may view them. Adapting extraction techniques designed for web pages, such as HTML wrapper-based techniques, therefore raises privacy and scalability challenges. We describe an end-to-end information extraction system for emails, covering data collection, anonymization, classification, building the information extraction models, deployment, and monitoring. To handle the privacy and scalability issues, we focus on algorithms that need only a minimal number of human-annotated samples to build classifiers and extractors. Similarly, we present algorithms that minimize the number of samples requiring human inspection to detect precision and recall gaps in the extraction pipeline.
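
The observation that B2C emails are produced by filling a template can be illustrated with a small sketch. The code below is not the algorithm from the paper; it is a toy illustration that assumes two plain-text emails rendered from the same template, aligns their whitespace-separated tokens with Python's standard `difflib.SequenceMatcher`, treats the shared tokens as fixed template text and the differing runs as variable slots, and then reuses the induced template to pull values out of a third email. The function names `induce_template`, `template_to_regex`, and `extract` are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional


def induce_template(email_a: str, email_b: str) -> List[Optional[str]]:
    """Align two emails token by token.

    Tokens shared by both emails are kept as fixed template text;
    each run of differing tokens is collapsed into one slot (None).
    """
    a, b = email_a.split(), email_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    template: List[Optional[str]] = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            template.extend(a[i1:i2])
        elif not template or template[-1] is not None:
            template.append(None)  # one slot per differing run
    return template


def template_to_regex(template: List[Optional[str]]) -> "re.Pattern[str]":
    """Turn the template into a regex with one capture group per slot."""
    parts = [r"(.+?)" if tok is None else re.escape(tok) for tok in template]
    return re.compile(r"^\s*" + r"\s+".join(parts) + r"\s*$", re.DOTALL)


def extract(template: List[Optional[str]], email: str) -> Optional[List[str]]:
    """Return the slot values of an email, or None if it does not fit."""
    match = template_to_regex(template).match(email)
    return list(match.groups()) if match else None


# Two made-up emails assumed to come from the same template.
email_a = "Hello Alice your order 1234 will ship on Monday"
email_b = "Hello Bob your order 9876 will ship on Friday"
template = induce_template(email_a, email_b)

# A third, unseen email from the same (assumed) template.
values = extract(template, "Hello Carol your order 5555 will ship on Tuesday")
print(template)
print(values)
```

Two samples are enough only for this toy case: if both emails happened to share a slot value, that value would be mistaken for fixed text, so a realistic system would need more samples per template and would operate on anonymized data, in line with the privacy constraints described in the abstract.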

Acknowledgements

We would like to acknowledge the efforts of various members of our team for this paper. Specifically, we would like to thank Richa Bhagat for her work on reducing human involvement in the extraction pipeline. We would also like to thank Pankaj Khanzode and Chakrapani Ravi for their helpful comments.

Corresponding author

Correspondence to Rajeev Gupta.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Gupta, R., Kondapally, R., Guha, S. (2019). Large-Scale Information Extraction from Emails with Data Constraints. In: Madria, S., Fournier-Viger, P., Chaudhary, S., Reddy, P. (eds) Big Data Analytics. BDA 2019. Lecture Notes in Computer Science, vol 11932. Springer, Cham. https://doi.org/10.1007/978-3-030-37188-3_8

  • DOI: https://doi.org/10.1007/978-3-030-37188-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37187-6

  • Online ISBN: 978-3-030-37188-3

  • eBook Packages: Computer Science, Computer Science (R0)
