Approximate Document Outlier Detection Using Random Spectral Projection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7691)

Abstract

Outlier detection is an important process for text document collections, but as a collection grows, detection becomes computationally expensive. Random projection has been shown to provide a good, fast approximation of sparse data, such as document vectors, for outlier detection. Random samples of the Fourier and cosine spectra have been shown to provide good approximations of sparse data when performing document clustering. In this article, we investigate the utility of these random Fourier and cosine spectral projections for document outlier detection. We show that using random samples of the Fourier spectrum for outlier detection provides better accuracy and requires less storage than random projection. We also show that using random samples of the cosine spectrum provides similar accuracy and computational time to random projection, but requires much less storage.
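
The general idea of a random Fourier spectral projection can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy document matrix, the rescaling factor, and the choice of a k-th nearest neighbour distance as the outlier score are all assumptions made for illustration, since the abstract does not specify them. The sketch materialises the sampled rows of the discrete Fourier transform for clarity, although only the sampled frequency indices would need to be stored to regenerate them.

```python
import numpy as np

def random_fourier_projection(X, m, rng):
    """Project rows of X (n_docs x n_terms) onto m randomly sampled DFT frequencies."""
    n_terms = X.shape[1]
    # Only these m indices define the projection (hypothetical storage scheme).
    freqs = rng.choice(n_terms, size=m, replace=False)
    positions = np.arange(n_terms)
    # Rows of the unitary DFT matrix for the sampled frequencies, shape (m, n_terms).
    basis = np.exp(-2j * np.pi * np.outer(freqs, positions) / n_terms) / np.sqrt(n_terms)
    # Rescale so that squared norms are preserved in expectation (assumed normalisation).
    return (X @ basis.T) * np.sqrt(n_terms / m)

def knn_outlier_scores(Z, k):
    """Score each row by the distance to its k-th nearest neighbour (assumed scoring rule)."""
    diffs = Z[:, None, :] - Z[None, :, :]
    dists = np.sqrt((np.abs(diffs) ** 2).sum(axis=2))
    # Column 0 of each sorted row is the zero self-distance, so column k is the k-th neighbour.
    return np.sort(dists, axis=1)[:, k]

# Toy data (fabricated for illustration only): 39 documents drawn from a shared
# 50-term vocabulary and one document that uses 60 terms from outside it.
rng = np.random.default_rng(0)
X = np.zeros((40, 500))
for row in X[:39]:
    row[rng.choice(50, size=10, replace=False)] = 1.0
X[39, rng.choice(np.arange(50, 500), size=60, replace=False)] = 1.0

Z = random_fourier_projection(X, m=64, rng=rng)  # 500 terms reduced to 64 complex samples
scores = knn_outlier_scores(Z, k=3)              # larger score = more outlying
```

Sampling every frequency (`m` equal to the number of terms) makes the projection unitary, so distances are preserved exactly; smaller `m` trades accuracy for speed and space. A cosine-spectrum variant would replace the complex exponential basis with rows of a discrete cosine transform.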

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Aouf, M., Park, L.A.F. (2012). Approximate Document Outlier Detection Using Random Spectral Projection. In: Thielscher, M., Zhang, D. (eds) AI 2012: Advances in Artificial Intelligence. AI 2012. Lecture Notes in Computer Science, vol 7691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35101-3_49

  • DOI: https://doi.org/10.1007/978-3-642-35101-3_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35100-6

  • Online ISBN: 978-3-642-35101-3

  • eBook Packages: Computer Science (R0)
