
Part of the book series: Advances in Pattern Recognition ((ACVPR))


Abstract

The initial motivation and, at the same time, the final objective for creating documents is the preservation and transmission of information. From this perspective, the availability of huge quantities of documents, far from being beneficial, carries the risk of scattering and hiding information in loads of noise. As a consequence, the development of effective and efficient automatic techniques for identifying interesting documents in a collection, and relevant information items within a document, becomes a fundamental factor for the practical exploitation of digital document repositories. This chapter is concerned with the management of document contents. Several Information Retrieval approaches are presented first, ranging from term-based indexing to concept-based organization strategies for supporting document search. Then, tasks more closely related to the information conveyed by documents are presented: Text Categorization (aimed at identifying the subject of interest of a document), Keyword Extraction (to single out prominent terms in the document text) and Information Extraction (which produces a structured representation of the features of noteworthy events described in a text).


Notes

  1.

    A possible refinement of these objectives is the ability to query a document base by issuing the request as a question in natural language (a field known as Question Answering).

  2.

    Another measure typically used to evaluate the performance of Machine Learning systems is Accuracy (Acc), representing the overall ratio of correct choices made by the system:

    $$\mathit{Acc} = \frac{\mathit{TP}+\mathit{TN}}{\mathit{TP} + \mathit{TN} + \mathit{FP} + \mathit{FN}} \in[0,1]$$

    or its complement, Error Rate (ER):

    $$\mathit{ER} = 1 - \mathit{Acc} \in[0,1].$$

    This measure, which is intuitive and sufficiently informative when the numbers of positive and negative instances are balanced, becomes misleading when these quantities differ widely. Indeed, when negative instances are an overwhelming majority (say 99%), as is the case in IR (where almost all documents in the collection are irrelevant to the query), a trivial system that labels everything as irrelevant would reach 99% accuracy, although its performance is clearly worthless.
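    The two measures, and the imbalance pitfall this note describes, can be sketched in a few lines of Python (the counts are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Overall ratio of correct decisions (true positives plus true negatives)."""
    return (tp + tn) / (tp + tn + fp + fn)

def error_rate(tp, tn, fp, fn):
    """Complement of accuracy."""
    return 1.0 - accuracy(tp, tn, fp, fn)

# A balanced case: accuracy is informative.
print(accuracy(tp=40, tn=45, fp=5, fn=10))   # 0.85

# The degenerate IR case from the note: 1 relevant document out of 100,
# and a trivial system that labels everything "irrelevant".
print(accuracy(tp=0, tn=99, fp=0, fn=1))     # 0.99, yet it retrieves nothing
```

    Despite the 99% accuracy, the trivial system has zero recall, which is why Precision and Recall are preferred in retrieval settings.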

  3.

    Because of synonymy, a person issuing a query might use different words from those appearing in a relevant document to denote the same concepts, so the document would not be retrieved; because of polysemy, uninteresting documents concerning the alternative meanings might be retrieved.

  4.

    In the rest of this section, not to distract the reader from the main matter, most subsidiary information, such as reminders of mathematical concepts, comments and implementation hints, is provided in footnotes. First of all, it may be useful to recall some notation that will be used in the following:

    • \(A^{T}\), transpose of matrix A;

    • \(\mathrm{diag}(a_{1},\dots,a_{n})\), diagonal matrix of size n×n;

    • \(I_{n}\), identity matrix of order n;

    • rank(A), rank of matrix A;

    • \(A^{-1}\), inverse matrix of A.

    Moreover, given a square matrix \(M_{n\times n}\) (having the same number of rows and columns):

    • A number λ is an eigenvalue of M if there exists a non-null vector v such that Mv=λv;

    • Such a vector v is an eigenvector of M with respect to λ;

    • σ(M) denotes the set of eigenvalues of M.

    Singular values are a generalization of the concept of eigenvalues to rectangular matrices.

  5.

    U describes the entities associated with the rows as vectors of values of derived orthogonal coefficients; V describes the entities associated with the columns in the same way; and W contains the scale values that allow going back to the original matrix.

  6.

    Some methods that carry out the SVD do not return the diagonal of W sorted in decreasing order. In such a case it must be sorted, and the columns of U and the rows of \(V^{T}\) must be permuted accordingly.

  7.

    This allows the conversion of m-dimensional vectors into k-dimensional ones, which is needed during the document retrieval phase (and by some index-updating techniques for LSI).

  8.

    Since \(W_{k}\) is a diagonal matrix, its inverse is obtained by simply inverting each diagonal element: for \(i = 1,\dots,k\), \(W_{k}^{-1}(i, i) = \frac{1}{W_{k}(i, i)}\).
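    A minimal NumPy sketch of the operations discussed in these notes: computing the SVD, truncating it to the k largest singular values, and folding an m-dimensional query vector into the k-dimensional concept space. The term-document counts are hypothetical:

```python
import numpy as np

# Toy term-document matrix A (m = 4 terms x n = 3 documents); made-up counts.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

# SVD: A = U @ diag(w) @ Vt. NumPy already returns the singular values
# sorted in decreasing order, so no permutation is needed here.
U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the k largest singular values (rank-k approximation).
k = 2
U_k, Vt_k = U[:, :k], Vt[:k, :]

# Fold a query (an m-dimensional term vector) into the concept space:
# q_k = q^T U_k W_k^{-1}. W_k is diagonal, so its inverse is just the
# elementwise reciprocal of its diagonal.
q = np.array([1., 0., 1., 0.])
q_k = q @ U_k @ np.diag(1.0 / w[:k])

# Documents live in the same space as the rows of Vt_k; rank by cosine.
docs_k = Vt_k.T
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims)
```

    Ranking the documents by these cosine values implements retrieval in the reduced concept space rather than over raw terms.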

  9.

    Concept Indexing is not to be mistaken for Conceptual Indexing [28], proposed in the same period. In addition to the well-known problems of classical indexing techniques, where synonymy and polysemy spoil the search results, Conceptual Indexing tackles a shortcoming of hierarchical topic/concept categories: a document that spans two topics must be filed under just one of them, and is not found when searching under the other. It is based on a generality relation that allows assessing whether a concept is subsumed by another, expressed by a set of basic facts (called subsumption axioms) recorded in a (possibly hand-made) dictionary or thesaurus (such as WordNet); using it, the system can determine whether a text is more general than another according to the generality relationships among their constituents. Thus, new texts can be mapped onto such a taxonomy, and the taxonomy itself can be progressively built and extended. Conceptual Indexing recognizes the conceptual structure of each passage in the corpus, i.e., the way in which its constituent elements are interlinked to produce its meaning. Four components are needed for this technique: a concept extractor, which identifies the terms and sentences to be indexed (and the documents where they appear); a concept assimilator, which analyzes the structure and meaning of a passage to determine where it should be placed in the conceptual taxonomy (and to which other concepts it should be linked); a conceptual retrieval system, which exploits the conceptual taxonomy to associate the query to the indexed information; and a conceptual navigator, which allows the user to explore the conceptual taxonomy and to browse the concepts and their occurrences in the indexed material.

  10.

    A quick reference to Machine Learning techniques, including k-NN, clustering and k-means, is provided in Appendix B; a more extensive treatment can be found in [20].

  11.

    Attention must be paid when p(D|key), p(PT|key) or p(PS|key) is zero, in which case the whole probability would be 0 as well, and when p(T,D,PT,PS)=0, in which case a division by zero would occur.
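    A common remedy for the zero-probability case is additive (Laplace) smoothing of the conditional estimates. A minimal sketch, with made-up counts; the function name and parameters are illustrative, not taken from the cited work:

```python
def smoothed_cond_prob(count_feature_and_key, count_key, n_feature_values, alpha=1.0):
    """Add-alpha (Laplace) estimate of p(feature | key).

    Stays strictly positive even when the feature value was never
    observed together with the keyword class in the training data,
    so a single unseen feature cannot zero out the whole product.
    """
    return (count_feature_and_key + alpha) / (count_key + alpha * n_feature_values)

# Unsmoothed, 0/20 would make the whole Naive Bayes product collapse to 0;
# the smoothed estimate is small but non-zero: (0 + 1) / (20 + 10) = 1/30.
print(smoothed_cond_prob(0, 20, 10))
```

    The same guard also avoids the division by zero: the smoothed joint estimate in the denominator can never be exactly 0.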

  12.

    A matrix \(M_{n\times k}\) in which \(m_{tg}\) reports in how many sentences term t of the text co-occurs with term g ∈ G is useful to represent such information. Since k<n, M is not square, and no diagonal can be considered.
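    Such a matrix can be built directly from the sentences. A small illustrative sketch; the sentences and term sets are invented:

```python
def cooccurrence_matrix(sentences, terms, G):
    """M[i][j] = number of sentences in which terms[i] co-occurs with G[j].

    With k = len(G) < n = len(terms), M is n x k and thus not square.
    """
    M = [[0] * len(G) for _ in terms]
    for sent in sentences:
        words = set(sent.split())          # naive whitespace tokenization
        for i, t in enumerate(terms):
            if t not in words:
                continue
            for j, g in enumerate(G):
                if g in words and g != t:  # a term does not co-occur with itself
                    M[i][j] += 1
    return M

sentences = ["the cat sat", "the cat ran", "a dog ran"]
terms = ["cat", "dog", "sat", "ran"]       # all candidate terms (n = 4)
G = ["cat", "ran"]                         # hypothetical frequent-term set (k = 2)
print(cooccurrence_matrix(sentences, terms, G))
# [[0, 1], [0, 1], [1, 0], [1, 0]]
```

    Rows whose co-occurrence distribution deviates most from the overall distribution are the ones such χ²-style keyword extractors flag as prominent.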

  13.

    Some studies revealed that the overlap between the classifications provided by different experts with the same skills and experience is only around 30%.

  14.

    In the case of regular grammars, the corresponding Finite State Automata are learned by starting from the maximally specific case (called the canonical acceptor) and progressively merging states according to considerations that depend on the specific algorithm.
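    A minimal sketch of building the canonical acceptor as a prefix-tree automaton from positive sample strings; the transition-dictionary representation is one possible choice, and the state-merging generalization step is not shown:

```python
def canonical_acceptor(samples):
    """Build the maximally specific finite-state acceptor (a prefix tree)
    for a set of positive sample strings. State-merging algorithms then
    generalize it by merging compatible states."""
    delta = {}          # (state, symbol) -> state
    accepting = set()
    next_state = 1      # state 0 is the initial state
    for word in samples:
        state = 0
        for ch in word:
            if (state, ch) not in delta:
                delta[(state, ch)] = next_state
                next_state += 1
            state = delta[(state, ch)]
        accepting.add(state)
    return delta, accepting

def accepts(delta, accepting, word):
    """Run the (deterministic) automaton on a word."""
    state = 0
    for ch in word:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in accepting

delta, acc = canonical_acceptor(["ab", "abb", "b"])
print(accepts(delta, acc, "ab"), accepts(delta, acc, "a"))  # True False
```

    The canonical acceptor recognizes exactly the training sample; merging states is what makes the learned automaton accept unseen strings as well.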

References

  1. Addis, A., Angioni, M., Armano, G., Demontis, R., Tuveri, F., Vargiu, E.: A novel semantic approach to create document collections. In: Proceedings of the IADIS International Conference—Intelligent Systems and Agents (2008)

  2. Begeja, L., Renger, B., Saraclar, M.: A system for searching and browsing spoken communications. In: HLT/NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 1–8 (2004)

  3. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)

  4. Berry, M.W., Dumais, S.T., O’Brien, G.W.: Using linear algebra for intelligent information retrieval. SIAM Review 37(4), 573–595 (1995)

  5. Boulgouris, N.V., Kompatsiaris, I., Mezaris, V., Simitopoulos, D., Strintzis, M.G.: Segmentation and content-based watermarking for color image and image region indexing and retrieval. EURASIP Journal on Applied Signal Processing 1, 418–431 (2002)

  6. Cestnik, B.: Estimating probabilities: A crucial task in machine learning. In: Proceedings of the 9th European Conference on Artificial Intelligence (ECAI), pp. 147–149 (1990)

  7. Chen, Y., Li, J., Wang, J.Z.: Machine Learning and Statistical Modeling Approaches to Image Retrieval. Kluwer, Amsterdam (2004)

  8. Cowie, J., Wilks, Y.: Information extraction. In: Dale, R., Moisl, H., Somers, H. (eds.) Handbook of Natural Language Processing, pp. 241–260. Marcel Dekker, New York (2000)

  9. Deb, S.: Multimedia Systems and Content-Based Image Retrieval. IGI Publishing (2003)

  10. Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)

  11. Freitag, D.: Machine learning for information extraction in informal domains. Machine Learning 39, 169–202 (2000)

  12. Grishman, R.: Information extraction: Techniques and challenges. In: International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. Lecture Notes in Computer Science, vol. 1299, pp. 10–27. Springer, Berlin (1997)

  13. Hunyadi, L.: Keyword extraction: Aims and ways today and tomorrow. In: Proceedings of the Keyword Project: Unlocking Content through Computational Linguistics (2001)

  14. Karypis, G., Han, E.H.S.: Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Tech. rep. TR 00-016, University of Minnesota—Department of Computer Science and Engineering (2000)

  15. Karypis, G., Han, E.H.S.: Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval. In: Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM), pp. 12–19 (2000)

  16. Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 111–140 (1997)

  17. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998)

  18. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

  19. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)

  20. Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)

  21. O Kit Hong, F., Bink-Aghai, R.P.: A Web prefetching model based on content analysis (2000)

  22. O’Brien, G.W.: Information management tools for updating an SVD-encoded indexing scheme. Tech. rep. CS-94-258, University of Tennessee, Knoxville (1994)

  23. Rocchio, J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System, pp. 313–323. Prentice-Hall, New York (1971)

  24. Salton, G., Wong, A., Yang, C.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)

  25. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)

  26. Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning 34, 233–272 (1999)

  27. Uzun, Y.: Keyword extraction using Naive Bayes (2005)

  28. Woods, W.A.: Conceptual indexing: A better way to organize knowledge. SML Technical Report Series. Sun Microsystems (1997)


Author information


Corresponding author

Correspondence to Stefano Ferilli.


Copyright information

© 2011 Springer-Verlag London Limited


Cite this chapter

Ferilli, S. (2011). Information Management. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_7


  • DOI: https://doi.org/10.1007/978-0-85729-198-1_7

  • Publisher Name: Springer, London

  • Print ISBN: 978-0-85729-197-4

  • Online ISBN: 978-0-85729-198-1

  • eBook Packages: Computer Science (R0)
