Classifying Companies by Industry Using Word Embeddings

Lamby, Martin; Isemann, Daniel

doi:10.1007/978-3-319-91947-8_39

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10859)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

This contribution investigates whether companies cluster by industry in word embedding spaces, in particular in word2vec models trained on general news text. We explore to what extent such clustering can be used to identify company-industry affiliations automatically. We present an experiment in which we test seven classification methods on four word2vec models trained on a 600-million-word corpus from the Guardian newspaper. To train and test our classifiers, we obtained company-industry assignments from the DBpedia knowledge base for those companies occurring in both the news corpus and DBpedia. Most of the 28 classification setups examined (seven methods on each of four models) achieve F1 scores near 80%, with some exceeding this threshold. We found differences across industries: some appear to be more distinctly defined, while others are less clearly delineated from neighbouring fields. To test the robustness of our approach, we conducted a field test in which we identified candidate companies absent from DBpedia with a named-entity recognizer and established ground truth on company and industry status manually through web search. Classifier performance was less reliable in the field test and varied in quality across industries, with precision-at-25 values ranging from 16% to 88%, depending on industry. In summary, the presented approach shows some promise but also some limitations, and in its current form it may only be robust enough for semi-automated classification.
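To illustrate the field-test metric mentioned in the abstract, the following is a minimal sketch of precision at k over a ranked candidate list. It is not the authors' pipeline: the toy two-dimensional vectors stand in for word2vec embeddings, the cosine-to-centroid score is an assumed stand-in for the seven (unnamed here) classification methods, and all names such as `cand_a` and `train_industry` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def precision_at_k(ranked_labels, k):
    """Fraction of the top-k ranked candidates that are true members."""
    if k <= 0:
        return 0.0
    return sum(1 for label in ranked_labels[:k] if label) / k


# Hypothetical toy data: vectors of companies known to belong to one industry.
train_industry = [[1.0, 0.1], [0.9, 0.2]]

# Hypothetical candidates: (vector, manually established ground truth).
candidates = {
    "cand_a": ([0.95, 0.15], True),
    "cand_b": ([0.10, 1.00], False),
    "cand_c": ([0.80, 0.30], True),
    "cand_d": ([0.60, 0.60], False),
}

# Score each candidate by similarity to the industry centroid (an assumed
# scoring rule), then rank from most to least similar.
prototype = centroid(train_industry)
ranked = sorted(
    candidates,
    key=lambda name: cosine(candidates[name][0], prototype),
    reverse=True,
)
ranked_labels = [candidates[name][1] for name in ranked]

print(ranked)
print(precision_at_k(ranked_labels, 2))
```

In the paper's setting k is 25 and the labels come from manual web search. The sketch divides by k itself, so a list shorter than k is penalised, which is one common convention.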

Author information

Authors and Affiliations

Chair of Media Informatics, University of Regensburg, Regensburg, Germany
Martin Lamby & Daniel Isemann

Corresponding author

Correspondence to Daniel Isemann.

Editor information

Editors and Affiliations

Université de Franche-Comté, Besançon, France
Max Silberztein
Conservatoire National des Arts et Métiers, Paris, France
Faten Atigui
Conservatoire National des Arts et Métiers, Paris, France
Elena Kornyshova
Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Salford, Manchester, United Kingdom
Farid Meziane

About this paper

Cite this paper

Lamby, M., Isemann, D. (2018). Classifying Companies by Industry Using Word Embeddings. In: Silberztein, M., Atigui, F., Kornyshova, E., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2018. Lecture Notes in Computer Science, vol 10859. Springer, Cham. https://doi.org/10.1007/978-3-319-91947-8_39

DOI: https://doi.org/10.1007/978-3-319-91947-8_39
Published: 22 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91946-1
Online ISBN: 978-3-319-91947-8
eBook Packages: Computer Science, Computer Science (R0)
