Weighting of Noun Phrases Based on Local Frequency of Nouns

Yamada, Yasuhiro; Himeno, Yuusuke; Nakatoh, Tetsuya

doi:10.1007/978-3-319-72550-5_42

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 700)

Included in the following conference series:

International Conference on Soft Computing and Data Mining

Abstract

Tf-idf is a well-known weighting measure for words in texts. It reflects both how frequently a word occurs in a text and how localized the word is to particular texts, and it is often used in information retrieval and text mining. However, many infrequent words share the same tf-idf value. In this study, the words to be weighted are noun phrases. This paper proposes a novel weighting measure for noun phrases that uses the local frequency of the nouns that make up a noun phrase. The proposed measure combines the tf-idf of a noun phrase with the average difference between the frequency of the phrase and the frequencies of the nouns within it. The measure was evaluated in experiments on two datasets: 19,997 newsgroup texts written in English and 206 Wikipedia pages written in Japanese. The experiments showed that fewer noun phrases share the same value under the proposed measure than under tf-idf.
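The tie problem and the idea of breaking ties with noun frequencies can be illustrated with a small Python sketch. The abstract does not state the exact formula, so everything specific here is an assumption made for illustration: the additive combination, the direction of the difference (noun frequency minus phrase frequency), the function names, and the toy counts. It should not be read as the authors' definition.

```python
import math
from collections import Counter


def tf_idf(term, doc_index, counts_per_doc):
    """Plain tf-idf: raw frequency in one text times log(N / document frequency)."""
    tf = counts_per_doc[doc_index][term]
    df = sum(1 for counts in counts_per_doc if counts[term] > 0)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(counts_per_doc) / df)


def mean_noun_gap(phrase, phrase_counts, noun_counts):
    """Average over the nouns of the phrase of (noun frequency - phrase frequency).

    ASSUMPTION: the sign and the whitespace tokenization are illustrative choices.
    """
    nouns = phrase.split()
    gaps = [noun_counts[noun] - phrase_counts[phrase] for noun in nouns]
    return sum(gaps) / len(gaps)


def combined_weight(phrase, doc_index, phrase_counts_per_doc, noun_counts_per_doc):
    """ASSUMPTION: combine the two parts by simple addition."""
    base = tf_idf(phrase, doc_index, phrase_counts_per_doc)
    gap = mean_noun_gap(
        phrase, phrase_counts_per_doc[doc_index], noun_counts_per_doc[doc_index]
    )
    return base + gap


# Toy corpus of two texts (hypothetical counts).
phrase_counts_per_doc = [
    Counter({"information retrieval": 1, "text mining": 1}),
    Counter({"computer science": 1}),
]
noun_counts_per_doc = [
    Counter({"information": 5, "retrieval": 3, "text": 1, "mining": 1}),
    Counter({"computer": 1, "science": 1}),
]

for phrase in ("information retrieval", "text mining"):
    print(
        phrase,
        tf_idf(phrase, 0, phrase_counts_per_doc),
        combined_weight(phrase, 0, phrase_counts_per_doc, noun_counts_per_doc),
    )
```

In this toy corpus both phrases occur once in the first text and in no other text, so their tf-idf values are identical. The nouns of "information retrieval" occur more often on their own than the nouns of "text mining", so the combined value separates the two phrases.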

Notes

1. The original definition of noun phrases is more complex than the definition used in this paper.
2. The 20-newsgroups dataset is a set of 19,997 newsgroup texts written in English [3]. The dataset has 20 different groups. We concatenated the texts in each group into a single text, so the number of texts in the experiments of this paper is 20.
3. In Japanese, a noun phrase is written as \(p = n_1, n_2, \ldots, n_m\) without spaces.
4. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
5. The texts were collected on June 22, 2017 from a Wikipedia page that lists countries. The URL of the page is https://ja.wikipedia.org/wiki/%e5%9b%bd%e3%81%ae%e4%b8%80%e8%a6%a7.
6. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html
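The preprocessing step in Note 2, merging all texts of a newsgroup into one text, might look like the following Python sketch. The function name, the (group, text) input format, and the newline separator are assumptions for illustration; the notes do not say how the concatenation was implemented.

```python
from collections import defaultdict


def concatenate_by_group(labelled_texts):
    """Merge (group, text) pairs into one text per group, preserving input order."""
    merged = defaultdict(list)
    for group, text in labelled_texts:
        merged[group].append(text)
    # ASSUMPTION: texts are joined with a newline; any separator would do.
    return {group: "\n".join(texts) for group, texts in merged.items()}


# Hypothetical miniature input: three posts from two groups.
posts = [
    ("sci.space", "Launch delayed again."),
    ("rec.autos", "Which tires for winter?"),
    ("sci.space", "New orbit data posted."),
]
texts_by_group = concatenate_by_group(posts)
print(len(texts_by_group))
```

With the full dataset, the same step would reduce 19,997 posts to 20 texts, one per group.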

References

1. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill Inc, New York (1983)
2. Zipf, G.K.: The Psychobiology of Language. Routledge, London (1936)
3. Home Page for 20 Newsgroups Data Set. http://qwone.com/~jason/20Newsgroups/. Accessed 28 June 2017
4. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Cambridge University Press (2008)
5. Rousseau, F., Vazirgiannis, M.: Composition of TF normalizations: new insights on scoring functions for ad hoc IR. In: 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 917–920. ACM, New York (2013)
6. Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M., Gatford, M.: Okapi at TREC-3. In: 3rd Text REtrieval Conference, pp. 109–126 (1994)
7. Trotman, A., Puurula, A., Burgess, B.: Improvements to BM25 and language models examined. In: 2014 Australasian Document Computing Symposium, pp. 58–65. ACM, New York (2014)
8. Lipani, A., Lupu, M., Hanbury, A., Aizawa, A.: Verboseness fission for BM25 document length normalization. In: 2015 International Conference on The Theory of Information Retrieval, pp. 385–388. ACM, New York (2015)
9. Kita, K., Kato, Y., Omoto, T., Yano, Y.: A comparative study of automatic extraction of collocations from corpora: mutual information vs. cost criteria. J. Nat. Lang. Process. 1(1), 21–33 (1994)
10. Frantzi, K.T., Ananiadou, S.: Extracting nested collocations. In: 16th Conference on Computational Linguistics, vol. 1, pp. 41–46. Association for Computational Linguistics, Stroudsburg (1996)
11. Li, S., Li, J., Song, T., Li, W., Chang, B.: A novel topic model for automatic term extraction. In: 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 885–888. ACM, New York (2013)
12. Astrakhantsev, N.A., Fedorenko, D.G., Turdakov, D.Yu.: Methods for automatic term recognition in domain-specific text collections: a survey. J. Program. Comput. Softw. 41(6), 336–349 (2015)
13. Kathait, S.S., Tiwari, S., Varshney, A., Sharma, A.: Unsupervised key-phrase extraction using noun phrases. Int. J. Comput. Appl. 162(1), 1–5 (2017)
14. Yamada, Y., Nakatoh, T., Baba, K., Ikeda, D.: Mining pure patterns in texts. In: 2012 IIAI International Conference on Advanced Applied Informatics, pp. 285–290 (2012)

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 15K00426.

Author information

Authors and Affiliations

Interdisciplinary Graduate School of Science and Engineering, Shimane University, 1060 Nishikawatsu-cho, Matsue-shi, Shimane, 690-8504, Japan
Yasuhiro Yamada
Interdisciplinary Faculty of Science and Engineering, Shimane University, 1060 Nishikawatsu-cho, Matsue-shi, Shimane, 690-8504, Japan
Yuusuke Himeno
Research Institute for Information Technology, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, 819-0395, Japan
Tetsuya Nakatoh

Corresponding author

Correspondence to Yasuhiro Yamada.

Editor information

Editors and Affiliations

Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
Rozaida Ghazali
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
Mustafa Mat Deris
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Johor, Malaysia
Nazri Mohd Nawi
School of Information Technology, Deakin University, Geelong, Victoria, Australia
Jemal H. Abawajy

About this paper

Cite this paper

Yamada, Y., Himeno, Y., Nakatoh, T. (2018). Weighting of Noun Phrases Based on Local Frequency of Nouns. In: Ghazali, R., Deris, M., Nawi, N., Abawajy, J. (eds) Recent Advances on Soft Computing and Data Mining. SCDM 2018. Advances in Intelligent Systems and Computing, vol 700. Springer, Cham. https://doi.org/10.1007/978-3-319-72550-5_42

DOI: https://doi.org/10.1007/978-3-319-72550-5_42
Published: 12 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72549-9
Online ISBN: 978-3-319-72550-5
eBook Packages: Engineering, Engineering (R0)
