Abstract
Email is the most frequently used web application for communication and collaboration because it is easy to access, fast, and convenient to manage. Business-to-consumer (B2C) emails (e.g., flight reservations, payment reminders, order confirmations) make up more than 60% of email traffic. Most of these emails are generated by filling a template with user- or transaction-specific values from databases. In this paper we describe various algorithms for extracting important information from these emails.
Unlike web pages, emails are personal, and due to privacy and legal considerations no one other than the recipient can view them. Thus, adapting extraction techniques used for web pages, such as HTML wrapper-based techniques, poses privacy and scalability challenges. We describe an end-to-end information extraction system for emails: data collection, anonymization, classification, building the information extraction models, deployment, and monitoring. To handle the privacy and scalability issues, we focus on algorithms that can build classifiers and extractors from a minimal number of human-annotated samples. Similarly, we present algorithms that minimize the number of samples requiring human inspection to detect precision and recall gaps in the extraction pipeline.
Acknowledgements
We would like to acknowledge the efforts of various members of our team for this paper. Specifically, we would like to thank Richa Bhagat for her work on reducing human involvement in the extraction pipeline. We would also like to thank Pankaj Khanzode and Chakrapani Ravi for their helpful comments.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Gupta, R., Kondapally, R., Guha, S. (2019). Large-Scale Information Extraction from Emails with Data Constraints. In: Madria, S., Fournier-Viger, P., Chaudhary, S., Reddy, P. (eds) Big Data Analytics. BDA 2019. Lecture Notes in Computer Science, vol 11932. Springer, Cham. https://doi.org/10.1007/978-3-030-37188-3_8
DOI: https://doi.org/10.1007/978-3-030-37188-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37187-6
Online ISBN: 978-3-030-37188-3
eBook Packages: Computer Science (R0)