Abstract
The initial motivation and, at the same time, the final objective for creating documents is the preservation and transmission of information. From this perspective, the availability of huge quantities of documents, far from being beneficial, carries the risk of scattering and hiding information in loads of noise. As a consequence, the development of effective and efficient automatic techniques for identifying interesting documents in a collection, and relevant information items in a document, becomes a fundamental factor for the practical exploitation of digital document repositories. This chapter is concerned with the management of document contents. Several Information Retrieval approaches are presented first, ranging from term-based indexing to concept-based organization strategies for supporting document search. Then, tasks more closely related to the information conveyed by documents are presented: Text Categorization (aimed at identifying the subject of interest of a document), Keyword Extraction (which singles out prominent terms in the document text) and Information Extraction (which produces a structured representation of the features of noteworthy events described in a text).
Notes
- 1.
A possible specification/refinement/evolution of these objectives involves the possibility of querying a document base by issuing the request as a question in natural language (a field known as Question Answering).
- 2.
Another measure typically used to evaluate performance of Machine Learning systems is Accuracy (Acc), representing the overall ratio of correct choices made by the system:
$$\mathit{Acc} = \frac{\mathit{TP}+\mathit{TN}}{\mathit{TP} + \mathit{TN} + \mathit{FP} + \mathit{FN}} \in[0,1]$$or its complement, the Error Rate (ER):
$$\mathit{ER} = 1 - \mathit{Acc} \in[0,1].$$This measure, which is intuitive and sufficiently significant when the numbers of positive and negative instances are balanced, becomes misleading when these quantities differ widely. Indeed, when negative instances are an overwhelming majority (say 99%), as is the case in IR (where almost all documents in the collection are not relevant to the query), a trivial system declaring everything irrelevant would reach 99% accuracy, although its performance is clearly worthless.
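The imbalance pitfall above can be reproduced in a few lines. This is an illustrative sketch with hypothetical counts: a collection of 1000 documents of which only 10 are relevant, and a trivial system that rejects everything.

```python
# Accuracy of the trivial "everything is irrelevant" classifier
# under heavy class imbalance (hypothetical counts).

def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# 10 relevant documents out of 1000; the trivial system misses all of them
# (FN = 10) and "correctly" rejects the 990 irrelevant ones (TN = 990).
tp, fn = 0, 10
tn, fp = 990, 0

print(accuracy(tp, tn, fp, fn))  # 0.99 -- yet recall is 0, so the system is useless
```

Precision and recall, which the chapter relies on instead, both come out as 0 here, exposing what accuracy hides.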
- 3.
Because of synonymy, a person issuing a query might use different words from those appearing in an interesting document to denote the same concepts, so the document would not be retrieved; because of polysemy, uninteresting documents concerning the alternate meanings of the query terms might be retrieved.
- 4.
In the rest of this section, so as not to distract the reader from the main matter, most subsidiary information, such as recalls of mathematical concepts, comments and implementation hints, is provided in footnotes. First of all, it may be useful to recall some notation that will be used in the following:
- \(A^{T}\), the transpose of matrix A;
- \(\mathit{diag}(a_{1},\dots,a_{n})\), the n×n diagonal matrix with elements \(a_{1},\dots,a_{n}\) on its diagonal;
- \(I_{n}\), the identity matrix of order n;
- \(\mathit{rank}(A)\), the rank of matrix A;
- \(A^{-1}\), the inverse of matrix A.
Moreover, given a square matrix \(M_{n \times n}\) (having the same number of rows and columns):
- A number λ is an eigenvalue of M if there exists a non-null vector v such that \(Mv = \lambda v\);
- Such a vector v is an eigenvector of M with respect to λ;
- \(\sigma(M)\) denotes the set of eigenvalues of M.
Singular values are a generalization of the concept of eigenvalues to rectangular matrices.
- 5.
U describes the entities associated with the rows as vectors of values of derived orthogonal coefficients; V describes the entities associated with the columns in the same way; and W contains the scale values that allow reconstructing the original matrix.
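The three-factor decomposition can be sketched with NumPy's SVD routine; the term-document matrix A below is a hypothetical toy example, and `numpy.linalg.svd` returns U, the diagonal of W as a vector `w`, and \(V^{T}\) directly.

```python
import numpy as np

# Hypothetical 4-terms x 3-documents matrix.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# U: row (term) coefficients; w: singular values (diagonal of W);
# Vt: column (document) coefficients, already transposed.
U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back recovers the original matrix.
A_rec = U @ np.diag(w) @ Vt
print(np.allclose(A, A_rec))  # True
```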
- 6.
Some methods to carry out the SVD do not return the diagonal of W ordered decreasingly. In such a case it must be sorted, and the columns of U and the rows of V T must be permuted accordingly.
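The reordering can be done with a single permutation, as in the sketch below (NumPy's own `svd` already returns the values sorted, so the shuffled factors here are artificial).

```python
import numpy as np

def sort_svd(U, w, Vt):
    """Reorder singular values decreasingly, permuting the columns
    of U and the rows of V^T with the same permutation."""
    order = np.argsort(w)[::-1]  # indices of w in decreasing order
    return U[:, order], w[order], Vt[order, :]

# A deliberately unsorted toy decomposition.
w = np.array([0.5, 3.0, 1.2])
U = np.eye(3)
Vt = np.eye(3)

U2, w2, Vt2 = sort_svd(U, w, Vt)
print(w2)  # singular values now in decreasing order: 3.0, 1.2, 0.5
```

Since the product \(U W V^{T}\) is a sum of rank-one terms \(w_i u_i v_i^{T}\), applying the same permutation to all three factors leaves it unchanged.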
- 7.
This allows the conversion of m-dimensional vectors into k-dimensional ones, which is needed during the document retrieval phase (and by some index updating techniques for LSI).
- 8.
Since W k is a diagonal matrix, its inverse is obtained by simply inverting each diagonal element. That is, for \(i = 1,\dots,k : W_{k}^{-1}(i, i) = \frac{1}{W_{k}(i, i)}\).
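A short sketch of this step: inverting the diagonal \(W_{k}\) element-wise and using it to fold an m-dimensional term vector q into the k-dimensional space as \(q_k = q^{T} U_k W_k^{-1}\) (the matrix A and query q are hypothetical toy data).

```python
import numpy as np

k = 2
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

U, w, Vt = np.linalg.svd(A)
Uk, wk = U[:, :k], w[:k]          # rank-k truncation

# Inverse of the diagonal W_k: just 1 / W_k(i, i) on the diagonal.
Wk_inv = np.diag(1.0 / wk)

q = np.array([1.0, 0.0, 1.0])     # hypothetical query over m = 3 terms
q_k = q @ Uk @ Wk_inv             # its k-dimensional representation
print(q_k.shape)                  # (2,)
```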
- 9.
Concept Indexing is not to be mistaken for Conceptual Indexing [28], proposed in the same period. Besides the well-known problem of classical indexing techniques, namely synonymy and polysemy spoiling the search results, Conceptual Indexing also tackles a shortcoming of hierarchical topic/concept categories: a document spanning two topics must be cast under either of the two, and is not found when searching under the other. Based on a generality relation that allows assessing whether a concept is subsumed by another, expressed by a set of basic facts (called subsumption axioms) and reported in a (possibly hand-made) dictionary or thesaurus such as WordNet, it can determine whether a text is more general than another according to the generality relationships among their constituents. Thus, new texts can be mapped onto such a taxonomy, and the taxonomy itself can be progressively built and extended. Conceptual Indexing recognizes the conceptual structure of each passage in the corpus (i.e., the way in which its constituent elements are interlinked to produce its meaning). Four components are needed for this technique: a concept extractor, which identifies terms and sentences to be indexed (and the documents where they appear); a concept assimilator, which analyzes the structure and meaning of a passage to determine where it should be placed in the conceptual taxonomy (and to which other concepts it should be linked); a conceptual retrieval system, which exploits the conceptual taxonomy to associate the query with the indexed information; and a conceptual navigator, which allows the user to explore the conceptual taxonomy and browse the concepts and their occurrences in the indexed material.
- 10.
A quick reference to Machine Learning techniques, including k-NN, clustering and k-means, is provided in Appendix B; a more extensive treatise can be found in [20].
- 11.
Attention must be paid when p(D|key), p(PT|key) or p(PS|key) is zero, in which case the whole product would be 0 as well, or when p(T,D,PT,PS)=0, in which case a division by zero would occur.
- 12.
A matrix \(M_{n \times k}\) in which \(m_{tg}\) reports in how many sentences term t of the text co-occurs with term g∈G is useful to represent such information. Since k<n, M is not square, and no diagonal can be considered.
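Such a co-occurrence matrix can be built as in the sketch below, on a hypothetical toy text with naive whitespace tokenization and a hand-picked set G of frequent terms.

```python
import numpy as np

sentences = ["the cat sat on the mat",
             "the dog sat on the log",
             "the cat chased the dog"]
tokenized = [set(s.split()) for s in sentences]

terms = sorted(set().union(*tokenized))   # all n terms of the text
G = ["cat", "dog"]                        # k selected frequent terms, k < n

# M[t, g] = number of sentences in which term t co-occurs with term g.
M = np.zeros((len(terms), len(G)), dtype=int)
for sent in tokenized:
    for i, t in enumerate(terms):
        for j, g in enumerate(G):
            if t != g and t in sent and g in sent:
                M[i, j] += 1

print(M[terms.index("sat"), G.index("cat")])  # 1: they share one sentence
```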
- 13.
Some studies revealed that the overlapping ratio between classifications provided by different experts with the same skills and experience is just around 30%.
- 14.
In the case of regular grammars, the corresponding Finite State Automata are learned by starting from the maximally specific case (called the canonical acceptor) and progressively merging states according to considerations that depend on the specific algorithm.
References
Addis, A., Angioni, M., Armano, G., Demontis, R., Tuveri, F., Vargiu, E.: A novel semantic approach to create document collections. In: Proceedings of the IADIS International Conference—Intelligent Systems and Agents (2008)
Begeja, L., Renger, B., Saraclar, M.: A system for searching and browsing spoken communications. In: HLT/NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 1–8 (2004)
Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)
Berry, M.W., Dumais, S.T., O’Brien, G.W.: Using linear algebra for intelligent information retrieval. SIAM Review 37(4), 573–595 (1995)
Boulgouris, N.V., Kompatsiaris, I., Mezaris, V., Simitopoulos, D., Strintzis, M.G.: Segmentation and content-based watermarking for color image and image region indexing and retrieval. EURASIP Journal on Applied Signal Processing 1, 418–431 (2002)
Cestnik, B.: Estimating probabilities: A crucial task in machine learning. In: Proceedings of the 9th European Conference on Artificial Intelligence (ECAI), pp. 147–149 (1990)
Chen, Y., Li, J., Wang, J.Z.: Machine Learning and Statistical Modeling Approaches to Image Retrieval. Kluwer, Amsterdam (2004)
Cowie, J., Wilks, Y.: Information Extraction. In: Dale, R., Moisl, H., Somers, H. (eds.) Handbook of Natural Language Processing, pp. 241–260. Marcel Dekker, New York (2000)
Deb, S.: Multimedia Systems and Content-Based Image Retrieval. IGI Publishing (2003)
Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Freitag, D.: Machine learning for information extraction in informal domains. Machine Learning 39, 169–202 (2000)
Grishman, R.: Information extraction: Techniques and challenges. In: International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. Lecture Notes in Computer Science, vol. 1299, pp. 10–27. Springer, Berlin (1997)
Hunyadi, L.: Keyword extraction: Aims and ways today and tomorrow. In: Proceedings of the Keyword Project: Unlocking Content through Computational Linguistics (2001)
Karypis, G., Han, E.H.S.: Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Tech. rep. TR 00-016, University of Minnesota—Department of Computer Science and Engineering (2000)
Karypis, G., Han, E.H.S.: Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval. In: Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM), pp. 12–19 (2000)
Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 111–140 (1997)
Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
O Kit Hong, F., Bink-Aghai, R.P.: A Web prefetching model based on content analysis (2000)
O’Brien, G.W.: Information management tools for updating an SVD-encoded indexing scheme. Tech. rep. CS-94-258, University of Tennessee, Knoxville (1994)
Rocchio, J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System, pp. 313–323. Prentice-Hall, New York (1971)
Salton, G., Wong, A., Yang, C.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning 34, 233–272 (1999)
Uzun, Y.: Keyword extraction using Naive Bayes (2005)
Woods, W.A.: Conceptual indexing: A better way to organize knowledge. SML Technical Report Series. Sun Microsystems (1997)
© 2011 Springer-Verlag London Limited
Ferilli, S. (2011). Information Management. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_7
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1