Approximate Document Outlier Detection Using Random Spectral Projection

Aouf, Mazin; Park, Laurence A. F.

doi:10.1007/978-3-642-35101-3_49

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7691)

Abstract

Outlier detection is an important process for text document collections, but as the collection grows, detection becomes computationally expensive. Random projection has been shown to provide a good, fast approximation of sparse data, such as document vectors, for outlier detection. Random samples of the Fourier and cosine spectra have been shown to provide good approximations of sparse data when performing document clustering. In this article, we investigate the utility of these random Fourier and cosine spectral projections for document outlier detection. We show that using random samples of the Fourier spectrum for outlier detection provides better accuracy and requires less storage than random projection. We also show that using random samples of the cosine spectrum provides similar accuracy and computational time to random projection, but requires much less storage.
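
The storage contrast described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the function names, the orthonormal DCT-II basis, and the distance-to-n-th-nearest-neighbour score are all assumptions made for illustration, and only the cosine variant is shown. A Gaussian random projection has to store all k × d matrix entries, whereas a random cosine-spectrum sample only has to store k frequency indices, because each basis row can be regenerated on demand.

```python
import math
import random


def random_projection_matrix(d, k, rng):
    """Dense Gaussian projection: k rows of d entries, all of which are stored."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def sample_cosine_frequencies(d, k, rng):
    """Random cosine-spectrum sample: only k frequency indices are stored."""
    return sorted(rng.sample(range(d), k))


def cosine_row(freq, d):
    """Row `freq` of the orthonormal DCT-II basis, generated on demand."""
    norm = math.sqrt(1.0 / d) if freq == 0 else math.sqrt(2.0 / d)
    return [norm * math.cos(math.pi * (n + 0.5) * freq / d) for n in range(d)]


def project(doc, rows):
    """Project a sparse document (dict: term index -> weight) onto each row."""
    return [sum(w * row[t] for t, w in doc.items()) for row in rows]


def knn_outlier_scores(points, n_neighbours):
    """Assumed score: distance from each point to its n-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[n_neighbours - 1])
    return scores


def toy_collection(d, n_docs, rng):
    """Hypothetical data: n_docs similar documents plus one off-topic document."""
    docs = []
    for _ in range(n_docs):
        doc = {t: 1.0 for t in range(8)}              # shared core vocabulary
        doc[rng.randrange(8, d // 2)] = rng.random()  # one varying term
        docs.append(doc)
    docs.append({t: 1.0 for t in range(d // 2, d // 2 + 8)})  # the outlier
    return docs


rng = random.Random(0)
d, k = 64, 16  # vocabulary size and reduced dimension (illustrative values)
docs = toy_collection(d, 30, rng)

gaussian_rows = random_projection_matrix(d, k, rng)
freqs = sample_cosine_frequencies(d, k, rng)
cosine_rows = [cosine_row(f, d) for f in freqs]

rp_scores = knn_outlier_scores([project(doc, gaussian_rows) for doc in docs], 3)
cos_scores = knn_outlier_scores([project(doc, cosine_rows) for doc in docs], 3)

print("stored values, random projection:", k * d)
print("stored values, cosine sample:", len(freqs))
print("top outlier (random projection):", rp_scores.index(max(rp_scores)))
print("top outlier (cosine sample):", cos_scores.index(max(cos_scores)))
```

Rescaling the sampled cosine coefficients by sqrt(d/k) would make distances comparable in magnitude with the original space; it is omitted here because it does not change the ranking of the scores.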

Author information

Authors and Affiliations

School of Computing, Engineering and Mathematics, University of Western Sydney, Australia
Mazin Aouf & Laurence A. F. Park

Editor information

Editors and Affiliations

School of Computer Science and Engineering, University of New South Wales, 2052, Sydney, NSW, Australia
Michael Thielscher
School of Computing and Mathematics, University of Western Sydney, 1797, Penrith South DC, NSW, Australia
Dongmo Zhang

Cite this paper

Aouf, M., Park, L.A.F. (2012). Approximate Document Outlier Detection Using Random Spectral Projection. In: Thielscher, M., Zhang, D. (eds) AI 2012: Advances in Artificial Intelligence. AI 2012. Lecture Notes in Computer Science, vol 7691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35101-3_49

DOI: https://doi.org/10.1007/978-3-642-35101-3_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35100-6
Online ISBN: 978-3-642-35101-3
eBook Packages: Computer Science, Computer Science (R0)
