Extract Knowledge from Web Pages in a Specific Domain

  • Yihong Lu
  • Shuiyuan Yu
  • Minyong Shi
  • Chunfang Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11061)


Most NLP tasks rely on large, well-organized general-domain corpora, while little work has addressed specific domains because suitable corpora and evaluation datasets are lacking. However, domain-specific applications are now widely needed. In this paper, we propose a fast, inexpensive, model-assisted method for training a high-quality distributional model from scattered, unstructured web pages, so that the model captures knowledge from a specific domain. The approach requires neither a pre-organized corpus nor much human effort, and hence suits domains that cannot afford manually constructed corpora or complex training. We use Word2vec to help create a term set and an evaluation dataset for the embroidery domain. Next, we train a distributional model on filtered search results for the term set and perform task-specific tuning with two simple but practical evaluation metrics: word-pair similarity and coverage of in-domain terms. Our much smaller models outperform a word embedding model trained on a large, general corpus on our task. This work demonstrates the effectiveness of our method, and we hope it can serve as a reference for researchers who extract high-quality knowledge in specific domains.
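The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not say how either metric is computed, so everything here is an assumption: coverage is taken as the fraction of in-domain terms that have a vector, and word-pair similarity is scored as the Spearman rank correlation between cosine similarities and reference scores, a common convention for word-similarity benchmarks. All function names, the toy vectors, and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def coverage(domain_terms, vocabulary):
    """Fraction of distinct in-domain terms that appear in the model vocabulary."""
    terms = set(domain_terms)
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in vocabulary) / len(terms)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def pair_similarity_score(pairs, vectors):
    """Spearman correlation between model and reference similarity scores.

    pairs: iterable of (word_a, word_b, reference_score).
    Pairs with an out-of-vocabulary word are skipped (an assumption).
    """
    model, gold = [], []
    for a, b, score in pairs:
        if a in vectors and b in vectors:
            model.append(cosine(vectors[a], vectors[b]))
            gold.append(score)
    if len(model) < 2:
        return 0.0
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(ranks(model), ranks(gold))


# Toy data, invented for illustration only.
toy_vectors = {"satin": [1.0, 0.0], "stitch": [0.9, 0.1], "thread": [0.0, 1.0]}
toy_terms = ["satin", "stitch", "hoop", "thread"]
toy_pairs = [("satin", "stitch", 9.0), ("satin", "thread", 2.0), ("stitch", "thread", 3.0)]

print(coverage(toy_terms, toy_vectors))
print(pair_similarity_score(toy_pairs, toy_vectors))
```

In practice the `vectors` mapping would come from a trained embedding model rather than a hand-written dictionary, and the two scores could be tracked together while tuning hyperparameters.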


Knowledge extraction; Specific domain; Web corpus; Word2vec



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Computer, Communication University of China, Beijing, China
