Abstract
Novelty detection in text streams is a challenging task that arises in many scenarios, ranging from email threads to RSS news feeds on a cell phone. An efficient novelty detection algorithm can save the user a great deal of time when accessing interesting information. Most recent research on detecting novel documents in text streams uses either geometric distances or distributional similarities. The former typically perform better but are slower, because each incoming document must be compared with all previously seen ones. In this paper, we propose a new novelty detection algorithm based on the Inverse Document Frequency (IDF) scoring function. Computing novelty from IDF avoids similarity comparisons with previous documents in the text stream, which leads to faster execution times. At the same time, our proposed approach outperforms several commonly used baselines when applied to a real-world dataset of news articles.
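The idea of scoring novelty from IDF alone, without comparing against earlier documents, can be illustrated with a minimal sketch. This is not the scoring function proposed in the paper: the class name IdfNoveltyScorer, the regular-expression tokenizer, the smoothed IDF formula, and the normalisation to the range [0, 1] are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


class IdfNoveltyScorer:
    """Streaming novelty scorer built on smoothed IDF (illustrative only)."""

    def __init__(self):
        self.num_docs = 0          # number of documents seen so far
        self.doc_freq = Counter()  # term -> number of seen documents containing it

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase alphanumeric runs, as a set of distinct terms.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def idf(self, term):
        # Assumed smoothing: log((N + 1) / (df + 1)); unseen terms get the maximum.
        return math.log((self.num_docs + 1) / (self.doc_freq[term] + 1))

    def score(self, text):
        """Mean IDF of the document's terms, normalised to [0, 1]."""
        terms = self.tokenize(text)
        if not terms:
            return 0.0
        if self.num_docs == 0:
            return 1.0  # nothing seen yet, so the first document counts as novel
        mean_idf = sum(self.idf(t) for t in terms) / len(terms)
        return mean_idf / math.log(self.num_docs + 1)

    def update(self, text):
        # Only the document-frequency table changes; past documents are not stored.
        self.num_docs += 1
        self.doc_freq.update(self.tokenize(text))

    def score_and_update(self, text):
        novelty = self.score(text)
        self.update(text)
        return novelty


scorer = IdfNoveltyScorer()
for doc in [
    "stocks fall on rate fears",
    "stocks fall on rate fears",   # exact repeat of the first document
    "volcano erupts in iceland",   # shares no terms with the earlier documents
]:
    print(round(scorer.score_and_update(doc), 3), doc)
```

In this sketch, scoring a document costs time proportional to its number of distinct terms and does not depend on how many documents came before, which is the efficiency argument made in the abstract. A threshold on the score (not shown, and also an assumption) would turn it into a novel or not-novel decision.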
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Karkali, M., Rousseau, F., Ntoulas, A., Vazirgiannis, M. (2013). Efficient Online Novelty Detection in News Streams. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_5
DOI: https://doi.org/10.1007/978-3-642-41230-1_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science, Computer Science (R0)