Abstract
The reference string indexing method is motivated by the goal of retrieving any piece of information by specifying arbitrary parts of it. Common restrictions should be removed without losing the performance needed for interactive use. Examples are document retrieval systems that accept only a fixed set of descriptors or complete keywords, and formatted files whose queries may specify only certain (inverted) attribute values.
The solution described here rests on a realistic assumption: the frequency distribution is not uniform but highly hyperbolic, or "Zipfian". This applies to the occurrence of character strings of a certain length, words, or word sequences in textual files, and to the occurrence of attribute values or value combinations in formatted files. The same holds for the usage of data, as expressed by the "80-20" law. Exploiting this assumption, a (small) set of "reference strings" is generated by a statistical analysis of collected queries or, if none are available, by estimating usage from the original data. Inverting these reference strings with respect to records or record clusters yields the reference string index.
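As an illustration only, the selection and inversion steps might be sketched as follows. This is not the procedure of the paper: the length bounds, the `min_count` threshold, the function names, and the toy records and query log are all assumptions made for the example.

```python
from collections import Counter


def substrings(text, min_len, max_len):
    """Yield every substring of `text` whose length lies in [min_len, max_len]."""
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def choose_reference_strings(queries, min_len=4, max_len=5, min_count=2):
    """Keep substrings that occur at least `min_count` times in the query log.

    The threshold is a hypothetical stand-in for the statistical analysis.
    """
    counts = Counter()
    for q in queries:
        counts.update(substrings(q, min_len, max_len))
    return {s for s, c in counts.items() if c >= min_count}


def build_index(records, reference_strings):
    """Invert the reference strings: map each one to the ids of records containing it."""
    index = {r: set() for r in reference_strings}
    for rid, text in records.items():
        for r in reference_strings:
            if r in text:
                index[r].add(rid)
    return index


# Hypothetical toy data: three records and a small log of collected queries.
records = {1: "database index", 2: "data retrieval", 3: "string index"}
query_log = ["index", "index", "data", "data", "string"]

refs = choose_reference_strings(query_log)
index = build_index(records, refs)
```

With this toy log, only the fragments of the repeated queries "index" and "data" reach the threshold, so the index stays small while covering the frequently used strings.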
Depending on its estimated usage frequency, a search argument is either available in full as a reference string or must be decomposed into shorter reference strings. Reference string access is therefore adaptive: a routine query may be answered faster than a non-routine one.
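The adaptive behaviour can be sketched under simplifying assumptions. In the hypothetical `search` function below, an argument that is itself a reference string is answered by one index lookup. Any other argument is decomposed into the reference strings it contains, their record sets are intersected, and the surviving candidates are checked against the record text. The small hand-written index, the final verification scan, and the full-scan fallback are assumptions of this example, not details taken from the paper.

```python
def search(argument, index, records):
    """Return the sorted ids of records that contain `argument`."""
    # Routine query: the whole argument is a reference string.
    if argument in index:
        return sorted(index[argument])

    # Non-routine query: decompose into the reference strings it contains.
    parts = [r for r in index if r in argument]
    if parts:
        # A matching record must contain every part, so intersect the sets.
        candidates = set.intersection(*(index[r] for r in parts))
    else:
        # No reference string applies: fall back to scanning all records.
        candidates = set(records)

    # Containing all parts is necessary but not sufficient, so verify.
    return sorted(rid for rid in candidates if argument in records[rid])


# Hypothetical toy data: records and a hand-written reference string index.
records = {1: "database index", 2: "data retrieval", 3: "string index"}
index = {"index": {1, 3}, "data": {1, 2}, "base": {1}}

routine = search("index", index, records)         # single lookup
decomposed = search("database", index, records)   # "data" and "base"
unindexed = search("retrieval", index, records)   # full scan
```

The verification step matters for an argument such as "data index": both parts occur in record 1, but the argument itself does not, so the candidate is rejected.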
The reference string index may be used as a new adaptive index in information retrieval systems and, in formatted files, as a single- or multi-attribute index. In addition, it can be applied to phonetic and general record similarity search.
© 1978 Springer-Verlag Berlin Heidelberg
Cite this paper
Schek, H.J. (1978). The reference string indexing method. In: Bracchi, G., Lockemann, P.C. (eds) Information Systems Methodology. ECI 1978. Lecture Notes in Computer Science, vol 65. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-08934-9_92
DOI: https://doi.org/10.1007/3-540-08934-9_92
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-08934-6
Online ISBN: 978-3-540-35731-5
eBook Packages: Springer Book Archive