Skip to main content

Classifying Companies by Industry Using Word Embeddings

  • Conference paper
  • First Online:
Natural Language Processing and Information Systems (NLDB 2018)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10859))

Abstract

This contribution investigates whether companies cluster together according to their field of industry using word embeddings and in particular word2vec models on general news text. We explore to what extent this can be utilised for identifying company-industry affiliations automatically. We present an experiment in which we test seven different classification methods on four different word2vec models trained on a 600-million-word corpus from the Guardian newspaper. For training and testing our classifiers we obtained company-industry assignments from the Dbpedia knowledge base for those companies occurring in both the news corpus and Dbpedia. The majority of the 28 scrutinized classification paradigms displays F1 scores near 80%, with some exceeding this threshold. We found differences across industries, with some industries appearing to be more distinctly defined, while others are less clearly delineated from neighbouring fields. To test the robustness of our approach we conducted a field test, identifying candidate companies absent from Dbpedia with a named-entity recognizer, establishing ground truth on company and industry status manually through web search. We found classifier performance to be less reliable in the field test and of varying quality across industries. with precision at 25 values ranging from 16% to 88%, depending on industry. In summary, the presented approach showed some promise, but also some limitations and may in its current form be only robust enough for semi-automated classification.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349, 255–260 (2015)

    Article  MathSciNet  Google Scholar 

  2. Gupta, V., Lehal, G.S.: A survey of text mining techniques and applications. J. Emerg. Technol. Web Intell. 1, 60–76 (2009)

    Google Scholar 

  3. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E.D., Gutierrez, J.B., Kochut, K.: A brief survey of text mining: classification, clustering and extraction techniques (2017). arXiv Preprint: arXiv:1707.02919

  4. Kahle, K.M., Walkling, R.A.: The impact of industry classifications on financial research. J. Financ. Quant. Anal. 31, 309 (1996)

    Article  Google Scholar 

  5. Gopikrishnan, P., Rosenow, B., Plerou, V., Stanley, H.E.: Identifying business sectors from stock price fluctuations (2000)

    Google Scholar 

  6. Bernstein, A., Clearwater, S., Provost, F.: The relational vector-space model and industry classification. In: Proceedings of IJCAI 2003 Workshop on Learning Statistical Models from Relational Data, pp. 8–18 (2003)

    Google Scholar 

  7. Drury, B., Almeida, J.J.: Identification, extraction and population of collective named entities from business news. In: Entity 2010 – Workshop on Resources and Evaluation for Entity Resolution and Entity Management, LREC 2010, pp. 19–22 (2010)

    Google Scholar 

  8. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th conference on Computational linguistics, vol. 2, pp. 539–545 (1992)

    Google Scholar 

  9. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.M., Shaked, T., Soderland, S., Weld, D.S., Yates, A.: Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134 (2005)

    Article  Google Scholar 

  10. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, vol. 2, pp. 1003–1011 (2009)

    Google Scholar 

  11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013). arXiv Preprint: arXiv:1301.3781

  12. Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of NAACL-HLT, pp. 746–751 (2013)

    Google Scholar 

  13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

    Google Scholar 

  14. Fu, R., Guo, J., Qin, B., Che, W., Wang, H., Liu, T.: Learning semantic hierarchies via word embeddings. In: ACL, vol. 1, pp. 1199–1209 (2014)

    Google Scholar 

  15. Sugathadasa, K., Ayesha, B., de Silva, N., Perera, A.S., Jayawardana, V., Lakmal, D., Perera, M.: Synergistic Union of Word2Vec and Lexicon for Domain Specific Semantic Similarity (2017). arXiv Preprint: arXiv:1706.01967

  16. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)

    MATH  Google Scholar 

  17. Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Citeseer (2010)

    Google Scholar 

  18. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52

    Chapter  Google Scholar 

  19. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  20. Van Der Maaten, L.J.P., Hinton, G.E.: Visualizing high-dimensional data using t-sne. J. Mach. Learn. Res. 9, 2579–2605 (2008)

    MATH  Google Scholar 

  21. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 363–370. Association for Computational Linguistics (2005)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Daniel Isemann .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Lamby, M., Isemann, D. (2018). Classifying Companies by Industry Using Word Embeddings. In: Silberztein, M., Atigui, F., Kornyshova, E., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2018. Lecture Notes in Computer Science(), vol 10859. Springer, Cham. https://doi.org/10.1007/978-3-319-91947-8_39

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-91947-8_39

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91946-1

  • Online ISBN: 978-3-319-91947-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics