Abstract
Word segmentation ambiguity in Thai affects data indexing because the inverted index is built from the segmentation results, and this leads to unreasonable search results. This article proposes a dictionary-based Thai word Safe segmentation algorithm that solves this problem by making every distinct term in an ambiguous part of a sentence queryable. It then presents a bounding extension that improves the performance of Safe segmentation. It also compares several off-the-shelf implementations of the trie, which we believe is the best data structure for dictionary-based Thai word segmentation, and compares the efficiency of serialization libraries for deserializing the trie during the analyzer's initialization. Finally, it evaluates Safe segmentation through several implementations, called Safe Analyzer. The experimental results show that the linked-list trie and the Protostuff library give the best results. Safe segmentation resolves the ambiguity problem, but it cannot yet handle misspellings within the text accurately.
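To make the goal of the abstract concrete, the following is a minimal sketch of dictionary lookup over a trie that reports every dictionary term found in an ambiguous span, so that each one could be added to an inverted index. It is not the Safe segmentation algorithm itself, whose details (including the bounding extension) are not given in the abstract. The names `TrieNode`, `build_trie`, and `all_dictionary_terms`, the four-word dictionary, and the hash-map trie layout are all assumptions made for illustration. The input "ตากลม" is a commonly cited ambiguous string that can be read as "ตา|กลม" or "ตาก|ลม".

```python
class TrieNode:
    """One trie node; children are kept in a dict (an assumed layout)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every dictionary word into a fresh trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def all_dictionary_terms(text, root):
    """Return every (start, end, term) such that text[start:end] is a dictionary word.

    Unlike longest matching, which commits to one segmentation, this keeps
    all overlapping candidates so that none of them is lost to the index.
    """
    matches = []
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break  # no dictionary word continues with this character
            if node.is_word:
                matches.append((start, end + 1, text[start:end + 1]))
    return matches


# Hypothetical toy dictionary covering both readings of the ambiguous string.
DICTIONARY = ["ตา", "ตาก", "กลม", "ลม"]
terms = all_dictionary_terms("ตากลม", build_trie(DICTIONARY))
print(terms)
```

In this sketch all four terms are reported with their character offsets, so a query for any of them could match the text regardless of which segmentation a conventional segmenter would have chosen. The cost is a larger index, which is presumably the kind of overhead a bounding step would aim to limit.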
© 2019 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Klahan, A., Pannoi, S., Uewichitrapochana, P., Wiangsripanawan, R. (2019). Thai Word Safe Segmentation with Bounding Extension for Data Indexing in Search Engine. In: Unger, H., Sodsee, S., Meesad, P. (eds) Recent Advances in Information and Communication Technology 2018. IC2IT 2018. Advances in Intelligent Systems and Computing, vol 769. Springer, Cham. https://doi.org/10.1007/978-3-319-93692-5_9
DOI: https://doi.org/10.1007/978-3-319-93692-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93691-8
Online ISBN: 978-3-319-93692-5
eBook Packages: Intelligent Technologies and Robotics (R0)