Abstract
This contribution investigates whether companies cluster together according to their field of industry using word embeddings and in particular word2vec models on general news text. We explore to what extent this can be utilised for identifying company-industry affiliations automatically. We present an experiment in which we test seven different classification methods on four different word2vec models trained on a 600-million-word corpus from the Guardian newspaper. For training and testing our classifiers we obtained company-industry assignments from the Dbpedia knowledge base for those companies occurring in both the news corpus and Dbpedia. The majority of the 28 scrutinized classification paradigms displays F1 scores near 80%, with some exceeding this threshold. We found differences across industries, with some industries appearing to be more distinctly defined, while others are less clearly delineated from neighbouring fields. To test the robustness of our approach we conducted a field test, identifying candidate companies absent from Dbpedia with a named-entity recognizer, establishing ground truth on company and industry status manually through web search. We found classifier performance to be less reliable in the field test and of varying quality across industries. with precision at 25 values ranging from 16% to 88%, depending on industry. In summary, the presented approach showed some promise, but also some limitations and may in its current form be only robust enough for semi-automated classification.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349, 255–260 (2015)
Gupta, V., Lehal, G.S.: A survey of text mining techniques and applications. J. Emerg. Technol. Web Intell. 1, 60–76 (2009)
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B., Kochut, K.: A brief survey of text mining: classification, clustering and extraction techniques (2017). arXiv Preprint: arXiv:1707.02919
Kahle, K.M., Walkling, R.A.: The impact of industry classifications on financial research. J. Financ. Quant. Anal. 31, 309 (1996)
Gopikrishnan, P., Rosenow, B., Plerou, V., Stanley, H.E.: Identifying business sectors from stock price fluctuations (2000)
Bernstein, A., Clearwater, S., Provost, F.: The relational vector-space model and industry classification. In: Proceedings of IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, pp. 8–18 (2003)
Drury, B., Almeida, J.J.: Identification, extraction and population of collective named entities from business news. In: Entity 2010 – Workshop on Resources and Evaluation for Entity Resolution and Entity Management, LREC 2010, pp. 19–22 (2010)
Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th conference on Computational linguistics, vol. 2, pp. 539–545 (1992)
Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)
Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, pp. 1003–1011 (2009)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv Preprint: arXiv:1301.3781
Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT, pp. 746–751 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word embeddings. In: ACL, vol. 1, pp. 1199–1209 (2014)
Sugathadasa, K., Ayesha, B., de Silva, N., Perera, A.S., Jayawardana, V., Lakmal, D., Perera, M.: Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity (2017). arXiv Preprint: arXiv:1706.01967
Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)
Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer (2010)
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Van Der Maaten, L.J.P., Hinton, G.E.: Visualizing high-dimensional data using t-sne. J. Mach. Learn. Res. 9, 2579–2605 (2008)
Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370. Association for Computational Linguistics (2005)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Lamby, M., Isemann, D. (2018). Classifying Companies by Industry Using Word Embeddings. In: Silberztein, M., Atigui, F., Kornyshova, E., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2018. Lecture Notes in Computer Science(), vol 10859. Springer, Cham. https://doi.org/10.1007/978-3-319-91947-8_39
Download citation
DOI: https://doi.org/10.1007/978-3-319-91947-8_39
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91946-1
Online ISBN: 978-3-319-91947-8
eBook Packages: Computer ScienceComputer Science (R0)