Outlier Detection on Mixed-Type Data: An Energy-Based Approach

Do, Kien; Tran, Truyen; Phung, Dinh; Venkatesh, Svetha

doi:10.1007/978-3-319-49586-6_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10086)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for a single data type, either continuous or discrete. However, real-world data is increasingly heterogeneous, and a single data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. In this paper, we propose a new unsupervised outlier detection method for mixed-type data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that models data density. We propose to use the free energy derived from the Mv.RBM as the outlier score, so that outliers are the data points lying in low-density regions. The method is fast to learn and compute, and is scalable to massive datasets. At the same time, the outlier score is identical to the negative log-density of the data up to an additive constant. We evaluate the proposed method on synthetic and real-world datasets and demonstrate that (a) proper handling of mixed types is necessary in outlier detection, and (b) the free energy of the Mv.RBM is a powerful and efficient outlier scoring method that is highly competitive with state-of-the-art methods.
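To illustrate why free energy can serve as an outlier score, the following is a minimal sketch that assumes the textbook RBM form with binary hidden units, where P(v) = exp(-F(v)) / Z and therefore -log P(v) = F(v) + log Z. The abstract does not give the Mv.RBM equations, so the function name `free_energy`, the `params` layout, the restriction to binary and unit-variance Gaussian inputs, and the toy weights are all illustrative assumptions rather than the authors' implementation.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def free_energy(v_bin, v_gauss, params):
    """Free energy of a toy RBM with binary and unit-variance Gaussian inputs.

    `params` is a hypothetical dict (not from the paper) with keys:
      a_bin, a_gauss : visible biases
      b              : hidden biases
      W_bin, W_gauss : weights, indexed [visible][hidden]
    """
    # Visible-only terms: -a*v for binary inputs, v^2/2 - a*v for Gaussian ones.
    energy = -sum(a * v for a, v in zip(params["a_bin"], v_bin))
    energy += sum(0.5 * v * v - a * v for a, v in zip(params["a_gauss"], v_gauss))
    # Each binary hidden unit is summed out analytically, giving a softplus term.
    for k, b_k in enumerate(params["b"]):
        act = b_k
        act += sum(W[k] * v for W, v in zip(params["W_bin"], v_bin))
        act += sum(W[k] * v for W, v in zip(params["W_gauss"], v_gauss))
        energy -= softplus(act)
    return energy

# Toy, hand-picked parameters (assumed for illustration, not learned).
toy_params = {
    "a_bin": [0.5, -0.5],
    "a_gauss": [0.0],
    "b": [0.1, -0.1],
    "W_bin": [[0.3, -0.2], [0.1, 0.4]],
    "W_gauss": [[0.2, 0.2]],
}

points = [([1, 0], [0.1]), ([1, 0], [6.0])]
scores = [free_energy(vb, vg, toy_params) for vb, vg in points]
# A higher free energy means a lower modelled density, so rank in descending order.
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
```

Because log Z is the same for every data point, ranking by free energy gives the same order as ranking by negative log-density, which is why the intractable normalising constant never has to be computed for scoring.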

Notes

1. The original Mv.RBM also covers rank, but we do not consider it in this paper.
2. https://archive.ics.uci.edu/ml/datasets.html

Acknowledgments

This work is partially supported by the Telstra-Deakin Centre of Excellence in Big Data and Machine Learning.

Author information

Authors and Affiliations

Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
Kien Do, Truyen Tran, Dinh Phung & Svetha Venkatesh

Corresponding author

Correspondence to Kien Do.

Editor information

Editors and Affiliations

University of Technology, Sydney, New South Wales, Australia
Jinyan Li
University of Queensland, Brisbane, Australia
Xue Li
Beijing Institute of Technology, Beijing, China
Shuliang Wang
University of Western Australia, Crawley, West Australia, Australia
Jianxin Li
University of Adelaide, Adelaide, South Australia, Australia
Quan Z. Sheng

About this paper

Cite this paper

Do, K., Tran, T., Phung, D., Venkatesh, S. (2016). Outlier Detection on Mixed-Type Data: An Energy-Based Approach. In: Li, J., Li, X., Wang, S., Li, J., Sheng, Q. (eds) Advanced Data Mining and Applications. ADMA 2016. Lecture Notes in Computer Science, vol 10086. Springer, Cham. https://doi.org/10.1007/978-3-319-49586-6_8

DOI: https://doi.org/10.1007/978-3-319-49586-6_8
Published: 13 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49585-9
Online ISBN: 978-3-319-49586-6
eBook Packages: Computer Science, Computer Science (R0)
