Constructing Language Models from Online Forms to Aid Better Document Representation for More Effective Clustering

Bradshaw, Stephen; O’Riordan, Colm; Bradshaw, Daragh

doi:10.1007/978-3-030-15640-4_4

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 976))

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

Clustering is the practice of finding tacit patterns in a dataset by grouping its items by similarity. When clustering documents, this is achieved by converting the corpus into a numeric format and applying clustering techniques to that representation. Values are assigned to terms based on their frequency within a particular document, weighed against their general occurrence in the corpus. One obstacle to this aim is the polysemous nature of terms: words can have multiple meanings, and the intended meaning is only discernible from the context in which the word is used. Disambiguating the intended meaning of a term can therefore greatly improve the efficacy of a clustering algorithm. One approach to this end is to use an ontology such as WordNet, which can act as a look-up for the intended meaning of a term. WordNet, however, is a static source and does not keep pace with the changing nature of language. The aim of this paper is to show that, while WordNet can be effective, its static nature means it does not capture some contemporary usage of terms, particularly when the dataset is taken from online conversation forums, which are not structured in a standard document format. Our proposed solution uses Reddit as a contemporary source that moves with new trends in word usage. To illustrate this point, we cluster comments found in online threads such as Reddit and compare the efficacy of different representations of these document sets.
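
The term weighting described in the abstract (frequency within a document set against occurrence across the corpus) can be sketched as a plain TF-IDF computation. The following is a minimal illustration only, not the authors' implementation: the function names `tf_idf` and `cosine`, the whitespace tokenisation, and the three toy documents are all assumptions made for this example. It also hints at the polysemy problem: "bank" appears in every document, so it receives zero weight and the representation cannot tell the financial sense from the river sense.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / document frequency)
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical): "bank" is used in two different senses.
docs = [
    "the bank approved the loan",
    "the bank raised the loan rate",
    "we fished from the river bank",
]
vectors = tf_idf(docs)
sim_finance = cosine(vectors[0], vectors[1])  # both documents mention "loan"
sim_mixed = cosine(vectors[0], vectors[2])    # only "the" and "bank" are shared
```

Here the two financial documents end up more similar to each other than to the river document, but only because of the shared term "loan"; the ambiguous term itself contributes nothing, which is the gap that a sense-aware representation aims to close.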

Author information

Authors and Affiliations

National University of Galway, College Road, Galway, Ireland
Stephen Bradshaw & Colm O’Riordan
National University of Limerick, Castletroy, Limerick, Ireland
Daragh Bradshaw

Corresponding author

Correspondence to Stephen Bradshaw.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Lisbon, Portugal
Ana Fred
University of Madeira, Funchal, Portugal
David Aveiro
Delft University of Technology, Delft, The Netherlands
Jan L. G. Dietz
Henley Business School, University of Reading, Reading, UK
Kecheng Liu
University of Coimbra, Coimbra, Portugal
Jorge Bernardino
Federal University of Pernambuco, Recife, Brazil
Ana Salgado
INSTICC and Instituto Politecnico de Setúbal, Setúbal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Bradshaw, S., O’Riordan, C., Bradshaw, D. (2019). Constructing Language Models from Online Forms to Aid Better Document Representation for More Effective Clustering. In: Fred, A., et al. Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2017. Communications in Computer and Information Science, vol 976. Springer, Cham. https://doi.org/10.1007/978-3-030-15640-4_4

DOI: https://doi.org/10.1007/978-3-030-15640-4_4
Published: 15 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15639-8
Online ISBN: 978-3-030-15640-4
eBook Packages: Computer Science, Computer Science (R0)
