Abstract
The initial motivation and, at the same time, the final objective for creating documents is the preservation and transmission of information. From this perspective, the availability of huge quantities of documents, far from being beneficial, carries the risk of scattering and hiding information in loads of noise. As a consequence, the development of effective and efficient automatic techniques for identifying interesting documents in a collection, and relevant information items in a document, becomes a fundamental factor for the practical exploitation of digital document repositories. This chapter is concerned with the management of document contents. Several Information Retrieval approaches are presented first, ranging from term-based indexing to concept-based organization strategies for supporting document search. Then, tasks more closely related to the information conveyed by documents are presented: Text Categorization (aimed at identifying the subject of interest of a document), Keyword Extraction (which singles out prominent terms in the document text) and Information Extraction (which produces a structured representation of the features of noteworthy events described in a text).
Notes
- 1.
A possible specification/refinement/evolution of these objectives involves the possibility of querying a document base by issuing the request as a question in natural language (a field known as Question Answering).
- 2.
Another measure typically used to evaluate performance of Machine Learning systems is Accuracy (Acc), representing the overall ratio of correct choices made by the system:
$$\mathit{Acc} = \frac{\mathit{TP}+\mathit{TN}}{\mathit{TP} + \mathit{TN} + \mathit{FP} + \mathit{FN}} \in[0,1]$$or its complement, the Error Rate (ER):
$$\mathit{ER} = 1 - \mathit{Acc} \in[0,1].$$This measure, which is intuitive and sufficiently significant when the numbers of positive and negative instances are balanced, becomes misleading when these quantities differ widely. Indeed, when negative instances are an overwhelming majority (say 99%), as is the case in IR (where almost all documents in the collection are not relevant to the query), a trivial system declaring everything irrelevant would reach 99% accuracy, although its performance is clearly worthless.
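The imbalance pitfall above can be reproduced in a few lines. This is an illustrative sketch with hypothetical counts: a collection of 1000 documents of which only 10 are relevant, and a trivial system that rejects everything.

```python
# Accuracy of the trivial "everything is irrelevant" classifier
# under heavy class imbalance (hypothetical counts).

def accuracy(tp, tn, fp, fn):
    """Acc = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# 10 relevant documents out of 1000; the trivial system misses all of them
# (FN = 10) and "correctly" rejects the 990 irrelevant ones (TN = 990).
tp, fn = 0, 10
tn, fp = 990, 0

print(accuracy(tp, tn, fp, fn))  # 0.99 -- yet recall is 0, so the system is useless
```

Precision and recall, which the chapter relies on instead, both come out as 0 here, exposing what accuracy hides.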
- 3.
Because of synonymy, a person issuing a query might use different words from those appearing in an interesting document to denote the same concepts, so the document would not be retrieved; because of polysemy, uninteresting documents concerning the alternate meanings of the query terms might be retrieved.
- 4.
In the rest of this section, so as not to distract the reader from the main matter, most subsidiary information, such as recalls of mathematical concepts, comments and implementation hints, is provided in footnotes. First of all, it may be useful to recall some notation that will be used in the following:
- \(A^{T}\), the transpose of matrix A;
- \(\mathit{diag}(a_{1},\dots,a_{n})\), the n×n diagonal matrix with elements \(a_{1},\dots,a_{n}\) on its diagonal;
- \(I_{n}\), the identity matrix of order n;
- \(\mathit{rank}(A)\), the rank of matrix A;
- \(A^{-1}\), the inverse of matrix A.
Moreover, given a square matrix \(M_{n \times n}\) (having the same number of rows and columns):
- A number λ is an eigenvalue of M if there exists a non-null vector v such that \(Mv = \lambda v\);
- Such a vector v is an eigenvector of M with respect to λ;
- \(\sigma(M)\) denotes the set of eigenvalues of M.
Singular values are a generalization of the concept of eigenvalues to rectangular matrices.
- 5.
U describes the entities associated with the rows as vectors of values of derived orthogonal coefficients; V describes the entities associated with the columns in the same way; and W contains the scale values that allow reconstructing the original matrix.
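The three-factor decomposition can be sketched with NumPy's SVD routine; the term-document matrix A below is a hypothetical toy example, and `numpy.linalg.svd` returns U, the diagonal of W as a vector `w`, and \(V^{T}\) directly.

```python
import numpy as np

# Hypothetical 4-terms x 3-documents matrix.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# U: row (term) coefficients; w: singular values (diagonal of W);
# Vt: column (document) coefficients, already transposed.
U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back recovers the original matrix.
A_rec = U @ np.diag(w) @ Vt
print(np.allclose(A, A_rec))  # True
```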
- 6.
Some methods to carry out the SVD do not return the diagonal of W ordered decreasingly. In such a case it must be sorted, and the columns of U and the rows of V T must be permuted accordingly.
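The reordering can be done with a single permutation, as in the sketch below (NumPy's own `svd` already returns the values sorted, so the shuffled factors here are artificial).

```python
import numpy as np

def sort_svd(U, w, Vt):
    """Reorder singular values decreasingly, permuting the columns
    of U and the rows of V^T with the same permutation."""
    order = np.argsort(w)[::-1]  # indices of w in decreasing order
    return U[:, order], w[order], Vt[order, :]

# A deliberately unsorted toy decomposition.
w = np.array([0.5, 3.0, 1.2])
U = np.eye(3)
Vt = np.eye(3)

U2, w2, Vt2 = sort_svd(U, w, Vt)
print(w2)  # singular values now in decreasing order: 3.0, 1.2, 0.5
```

Since the product \(U W V^{T}\) is a sum of rank-one terms \(w_i u_i v_i^{T}\), applying the same permutation to all three factors leaves it unchanged.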
- 7.
This allows the conversion of m-dimensional vectors into k-dimensional ones, which is needed during the document retrieval phase (and by some index updating techniques for LSI).
- 8.
Since W k is a diagonal matrix, its inverse is obtained by simply inverting each diagonal element. That is, for \(i = 1,\dots,k : W_{k}^{-1}(i, i) = \frac{1}{W_{k}(i, i)}\).
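A short sketch of this step: inverting the diagonal \(W_{k}\) element-wise and using it to fold an m-dimensional term vector q into the k-dimensional space as \(q_k = q^{T} U_k W_k^{-1}\) (the matrix A and query q are hypothetical toy data).

```python
import numpy as np

k = 2
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

U, w, Vt = np.linalg.svd(A)
Uk, wk = U[:, :k], w[:k]          # rank-k truncation

# Inverse of the diagonal W_k: just 1 / W_k(i, i) on the diagonal.
Wk_inv = np.diag(1.0 / wk)

q = np.array([1.0, 0.0, 1.0])     # hypothetical query over m = 3 terms
q_k = q @ Uk @ Wk_inv             # its k-dimensional representation
print(q_k.shape)                  # (2,)
```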
- 9.
Concept Indexing is not to be mistaken for Conceptual Indexing [28], proposed in the same period. Besides the well-known problem of classical indexing techniques, namely synonymy and polysemy spoiling the search results, Conceptual Indexing also tackles a shortcoming of hierarchical topic/concept categories: a document spanning two topics must be cast under either of the two, and is not found when searching under the other. Based on a generality relation that allows assessing whether a concept is subsumed by another, expressed by a set of basic facts (called subsumption axioms) and reported in a (possibly hand-made) dictionary or thesaurus such as WordNet, it can determine whether a text is more general than another according to the generality relationships among their constituents. Thus, new texts can be mapped onto such a taxonomy, and the taxonomy itself can be progressively built and extended. Conceptual Indexing recognizes the conceptual structure of each passage in the corpus (i.e., the way in which its constituent elements are interlinked to produce its meaning). Four components are needed for this technique: a concept extractor, which identifies terms and sentences to be indexed (and the documents where they appear); a concept assimilator, which analyzes the structure and meaning of a passage to determine where it should be placed in the conceptual taxonomy (and to which other concepts it should be linked); a conceptual retrieval system, which exploits the conceptual taxonomy to associate the query with the indexed information; and a conceptual navigator, which allows the user to explore the conceptual taxonomy and browse the concepts and their occurrences in the indexed material.
- 10.
A quick reference to Machine Learning techniques, including k-NN, clustering and k-means, is provided in Appendix B; a more extensive treatise can be found in [20].
- 11.
Attention must be paid when p(D|key), p(PT|key) or p(PS|key) is zero, in which case the whole product would be 0 as well, or when p(T,D,PT,PS)=0, in which case a division by zero would occur.
- 12.
A matrix \(M_{n \times k}\) in which \(m_{tg}\) reports in how many sentences term t of the text co-occurs with term g∈G is useful to represent such information. Since k<n, M is not square, and no diagonal can be considered.
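Such a co-occurrence matrix can be built as in the sketch below, on a hypothetical toy text with naive whitespace tokenization and a hand-picked set G of frequent terms.

```python
import numpy as np

sentences = ["the cat sat on the mat",
             "the dog sat on the log",
             "the cat chased the dog"]
tokenized = [set(s.split()) for s in sentences]

terms = sorted(set().union(*tokenized))   # all n terms of the text
G = ["cat", "dog"]                        # k selected frequent terms, k < n

# M[t, g] = number of sentences in which term t co-occurs with term g.
M = np.zeros((len(terms), len(G)), dtype=int)
for sent in tokenized:
    for i, t in enumerate(terms):
        for j, g in enumerate(G):
            if t != g and t in sent and g in sent:
                M[i, j] += 1

print(M[terms.index("sat"), G.index("cat")])  # 1: they share one sentence
```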
- 13.
Some studies revealed that the overlapping ratio between classifications provided by different experts with the same skills and experience is just around 30%.
- 14.
In the case of regular grammars, the corresponding Finite State Automata are learned by starting from the maximally specific case (called the canonical acceptor) and progressively merging states according to considerations that depend on the specific algorithm.
References
Addis, A., Angioni, M., Armano, G., Demontis, R., Tuveri, F., Vargiu, E.: A novel semantic approach to create document collections. In: Proceedings of the IADIS International Conference—Intelligent Systems and Agents (2008)
Begeja, L., Renger, B., Saraclar, M.: A system for searching and browsing spoken communications. In: HLT/NAACL Workshop on Interdisciplinary Approaches to Speech Indexing and Retrieval, pp. 1–8 (2004)
Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001)
Berry, M.W., Dumais, S.T., O’Brien, G.W.: Using linear algebra for intelligent information retrieval. SIAM Review 37(4), 573–595 (1995)
Boulgouris, N.V., Kompatsiaris, I., Mezaris, V., Simitopoulos, D., Strintzis, M.G.: Segmentation and content-based watermarking for color image and image region indexing and retrieval. EURASIP Journal on Applied Signal Processing 1, 418–431 (2002)
Cestnik, B.: Estimating probabilities: A crucial task in machine learning. In: Proceedings of the 9th European Conference on Artificial Intelligence (ECAI), pp. 147–149 (1990)
Chen, Y., Li, J., Wang, J.Z.: Machine Learning and Statistical Modeling Approaches to Image Retrieval. Kluwer, Amsterdam (2004)
Cowie, J., Wilks, Y.: Information Extraction. In: Dale, R., Moisl, H., Somers, H. (eds.) Handbook of Natural Language Processing, pp. 241–260. Marcel Dekker, New York (2000)
Deb, S.: Multimedia Systems and Content-Based Image Retrieval. IGI Publishing (2003)
Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Freitag, D.: Machine learning for information extraction in informal domains. Machine Learning 39, 169–202 (2000)
Grishman, R.: Information extraction: Techniques and challenges. In: International Summer School on Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology. Lecture Notes in Computer Science, vol. 1299, pp. 10–27. Springer, Berlin (1997)
Hunyadi, L.: Keyword extraction: Aims and ways today and tomorrow. In: Proceedings of the Keyword Project: Unlocking Content through Computational Linguistics (2001)
Karypis, G., Han, E.H.S.: Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval & categorization. Tech. rep. TR 00-016, University of Minnesota—Department of Computer Science and Engineering (2000)
Karypis, G., Han, E.H.S.: Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval. In: Proceedings of the 9th International Conference on Information and Knowledge Management (CIKM), pp. 12–19 (2000)
Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 111–140 (1997)
Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25, 259–284 (1998)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(1), 157–169 (2004)
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
O Kit Hong, F., Bink-Aghai, R.P.: A Web prefetching model based on content analysis (2000)
O’Brien, G.W.: Information management tools for updating an SVD-encoded indexing scheme. Tech. rep. CS-94-258, University of Tennessee, Knoxville (1994)
Rocchio, J.: Relevance feedback in information retrieval. In: Salton, G. (ed.) The SMART Retrieval System, pp. 313–323. Prentice-Hall, New York (1971)
Salton, G., Wong, A., Yang, C.: A vector space model for automatic indexing. Communications of the ACM 18(11), 613–620 (1975)
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning 34, 233–272 (1999)
Uzun, Y.: Keyword extraction using Naive Bayes (2005)
Woods, W.A.: Conceptual indexing: A better way to organize knowledge. SML Technical Report Series. Sun Microsystems (1997)
© 2011 Springer-Verlag London Limited
Ferilli, S. (2011). Information Management. In: Automatic Digital Document Processing and Management. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-0-85729-198-1_7
Print ISBN: 978-0-85729-197-4
Online ISBN: 978-0-85729-198-1