A Document Model Based on Relevance Modeling Techniques for Semi-structured Information Warehouses
Over the last decade, data warehouse and OLAP techniques have helped companies gather, organize and analyze the structured data they produce. Meanwhile, digital libraries have applied Information Retrieval (IR) mechanisms to query their repositories of unstructured, text-rich documents. In this paper we explain how XML allows these two approaches to converge, making it possible to develop warehouses for semi-structured information. So far, proposals to extend data warehouse technology to semi-structured information have not been able to exploit the textual contents, mainly because they are not based on a proper document model. In our opinion, such a model must integrate IR and OLAP techniques. We present a set of requirements for semi-structured information warehouses, as well as a document model to support their construction. In this model, new Relevance Modeling mechanisms rank the facts described in the text of the documents according to their relevance to an IR-OLAP query. Preliminary evaluations show the usefulness of the document model.
Keywords: Digital Library, Document Model, News Item, Path Expression, Relevance Ranking