Abstract
The theory of indexing texts is well researched; the same cannot be said of indexing other data structures, such as trees. This paper presents a simple method of indexing a tree for subsequences of its string paths by means of a finite automaton. The use of the index is demonstrated on indexing XML documents for queries inspired by the XPath descendant-or-self axis. Given a subject tree \(\mathcal{T}\) with n nodes, the tree is preprocessed and an index is constructed; the index is a directed acyclic subsequence graph for a set of strings. The searching phase uses the index: it reads an input string path subsequence \(\mathcal{Q}\) of size m, inspired by the specific XPath query, and computes the list of positions of all occurrences of \(\mathcal{Q}\) in the tree \(\mathcal{T}\). Searching takes \(\mathcal {O}(m)\) time and does not depend on n. Although the number of distinct valid queries is \(\mathcal {O}(2^n)\), the size of the index is \(\mathcal {O}(h^k)\), where h is the height of the tree \(\mathcal{T}\) and k is the number of its leaves. Moreover, we argue that when a common XML document is indexed, the size of the index is smaller still, namely \(\mathcal {O}(h \cdot 2^k)\).
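To make the query semantics concrete, the following is a minimal Python sketch of what such a search is expected to return. It is not the automaton-based index described above: it is a naive traversal that takes time proportional to the size of the tree per query, and is meant only as a reference against which an index could be checked. The names `Node` and `occurrences` are hypothetical. It also assumes that a "position" is the preorder number of the node at which the last symbol of \(\mathcal{Q}\) is matched, in the manner of an XPath query of the form `//q1//q2//...//qm`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labelled ordered tree node (hypothetical helper type)."""
    label: str
    children: list["Node"] = field(default_factory=list)


def occurrences(root: Node, query: list[str]) -> list[int]:
    """Return preorder numbers of nodes where `query` ends.

    A node v is reported when its label equals the last symbol of `query`
    and the remaining prefix of `query` is a subsequence of the labels of
    the proper ancestors of v (read from the root downwards).
    """
    m = len(query)
    hits: list[int] = []
    if m == 0:
        return hits
    counter = 0  # preorder numbering, starting at 1

    def visit(node: Node, matched: int) -> None:
        # `matched` is the length of the longest prefix of `query` that is
        # a subsequence of the ancestor labels (greedy matching is optimal).
        nonlocal counter
        counter += 1
        ident = counter
        if node.label == query[-1] and matched >= m - 1:
            hits.append(ident)
        if matched < m and node.label == query[matched]:
            matched += 1
        for child in node.children:
            visit(child, matched)

    visit(root, 0)
    return hits


# Example tree in preorder: a(1) [ b(2) [ c(3) ], c(4) [ b(5) [ c(6) ] ] ]
tree = Node("a", [
    Node("b", [Node("c")]),
    Node("c", [Node("b", [Node("c")])]),
])

print(occurrences(tree, ["b", "c"]))  # nodes labelled c below some b
print(occurrences(tree, ["c", "c"]))  # nodes labelled c below some c
```

On this example tree, the query `["b", "c"]` reports nodes 3 and 6, and `["c", "c"]` reports node 6 only. An index of the kind described in the abstract would aim to return the same answers after reading only the m symbols of the query.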
J. Janoušek: This research has been partially supported by the Czech Science Foundation (GAČR) as project No. GA-13-03253S and by the Technology Agency of the Czech Republic (TAČR) as project No. TA03010964 in the \(\alpha \) programme.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Šestáková, E., Janoušek, J. (2015). Tree String Path Subsequences Automaton and Its Use for Indexing XML Documents. In: Sierra-Rodríguez, JL., Leal, JP., Simões, A. (eds) Languages, Applications and Technologies. SLATE 2015. Communications in Computer and Information Science, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-319-27653-3_17
DOI: https://doi.org/10.1007/978-3-319-27653-3_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27652-6
Online ISBN: 978-3-319-27653-3
eBook Packages: Computer Science, Computer Science (R0)