Large-Scale Information Extraction from Emails with Data Constraints

Gupta, Rajeev; Kondapally, Ranganath; Guha, Siddharth

doi:10.1007/978-3-030-37188-3_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11932)

Included in the following conference series:

International Conference on Big Data Analytics

Abstract

Email is the most frequently used web application for communication and collaboration because of its easy access, fast interactions, and convenient management. Business-to-consumer (B2C) emails (e.g., flight reservations, payment reminders, and order confirmations) constitute more than 60% of email traffic. Most of these emails are generated by filling a template with user- or transaction-specific values from databases. In this paper, we describe various algorithms for extracting important information from these emails.

Unlike web pages, emails are personal, and because of privacy and legal considerations, no one other than the recipient can view them. Thus, adapting extraction techniques used for web pages, such as HTML wrapper-based techniques, poses privacy and scalability challenges. We describe an end-to-end information extraction system for emails: data collection, anonymization, classification, building the information extraction models, deployment, and monitoring. To handle the privacy and scalability issues, we focus on algorithms that can work with a minimal number of human-annotated samples for building classifiers and extraction techniques. Similarly, we present algorithms that minimize the number of samples requiring human inspection to detect precision and recall gaps in the extraction pipeline.
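
The observation that most B2C emails are produced by filling a template with database values can be illustrated with a small sketch. It is not the system described in the paper: the template text, the field names, and the helper functions `fill`, `template_to_regex`, and `extract` are all hypothetical, and the template is assumed to be known rather than induced from data. The sketch only shows why a known template makes extraction a matter of inverting the fill step.

```python
import re
from string import Template

# Hypothetical B2C template; the wording and field names are assumptions.
TEMPLATE = ("Hello $name, your flight $flight departs on $date. "
            "Confirmation code: $code.")


def fill(template: str, values: dict) -> str:
    """Generate an email body by substituting values into the template."""
    return Template(template).substitute(values)


def template_to_regex(template: str) -> re.Pattern:
    """Turn the template into a regex with one named group per field."""
    # re.split with a capturing group alternates literal text and field names.
    parts = re.split(r"\$(\w+)", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)      # fixed template text
        else:
            pattern += f"(?P<{part}>.+?)"   # variable field
    return re.compile("^" + pattern + "$")


def extract(template: str, email: str):
    """Recover field values from an email, or None if it does not match."""
    match = template_to_regex(template).match(email)
    return match.groupdict() if match else None


values = {"name": "Asha", "flight": "AI 202",
          "date": "12 Dec 2019", "code": "X7Q9ZT"}
email = fill(TEMPLATE, values)
recovered = extract(TEMPLATE, email)
```

In this toy setting `recovered` equals `values`, and an email that does not follow the template yields `None`. In practice the template is not available to the extractor and the email content cannot be inspected by people, which is the setting the abstract describes.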

Acknowledgements

We would like to acknowledge the efforts of various members of our team for this paper. Specifically, we would like to thank Richa Bhagat for her work on reducing human involvement in the extraction pipeline. We would also like to thank Pankaj Khanzode and Chakrapani Ravi for their helpful comments.

Author information

Authors and Affiliations

Microsoft R&D, Hyderabad, India
Rajeev Gupta, Ranganath Kondapally & Siddharth Guha

Corresponding author

Correspondence to Rajeev Gupta.

Editor information

Editors and Affiliations

Missouri University of Science and Technology, Rolla, MO, USA
Sanjay Madria
Harbin Institute of Technology, Shenzhen, China
Philippe Fournier-Viger
Ahmedabad University, Ahmedabad, India
Sanjay Chaudhary
International Institute of Information Technology, Hyderabad, India
P. Krishna Reddy

About this paper

Cite this paper

Gupta, R., Kondapally, R., Guha, S. (2019). Large-Scale Information Extraction from Emails with Data Constraints. In: Madria, S., Fournier-Viger, P., Chaudhary, S., Reddy, P. (eds) Big Data Analytics. BDA 2019. Lecture Notes in Computer Science, vol 11932. Springer, Cham. https://doi.org/10.1007/978-3-030-37188-3_8

DOI: https://doi.org/10.1007/978-3-030-37188-3_8
Published: 12 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37187-6
Online ISBN: 978-3-030-37188-3
eBook Packages: Computer Science, Computer Science (R0)
