Skip to main content

Impact of Context on Keyword Identification and Use in Biomedical Literature Mining

  • Conference paper
  • First Online:
Proceedings of the Future Technologies Conference (FTC) 2018 (FTC 2018)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 880))

Included in the following conference series:

  • 1628 Accesses

Abstract

The use of two statistical metrics in automatically identifying important keywords associated with a concept such as a gene by mining scientific literature is reviewed. Starting with a subset of MEDLINE® abstracts that contain the name or synonyms of a gene in their titles, the aforementioned metrics contrast the prevalence of specific words in these documents against a broader “background set” of abstracts. If a word occurs substantially more often in the document subset associated with a gene than in the background set that acts as a reference, then the word is viewed as capturing some specific attribute of the gene.

The keywords thus automatically identified may be used as gene features in clustering algorithms. Since the background set is the reference against which keyword prevalence is contrasted, the authors hypothesize that different background document sets can lead to somewhat different sets of keywords to be identified as specific to a gene. Two different background sets are discussed that are useful for two somewhat different purposes, namely, characterizing the function of a gene, and clustering a set of genes based on their shared functional similarities. Experimental results that reveal the significance of the choice of background set are presented.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 169.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 219.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    This is also sometimes called the collection frequency of the term in the set of documents, and counts the total number of occurrences of the term in all the documents of the collection. It differs from the document frequency of a term in a collection of documents in that the document frequency just counts how many documents contain the term (with no distinction on the number of occurrences).

References

  1. Andrade, M., Valencia, A.: Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600–607 (1998). https://doi.org/10.1093/bioinformatics/14.7.600

    Article  Google Scholar 

  2. Cherepinsky, V., Feng, J., Rejali, M., Mishra, B.: Shrinkage based similarity metric for cluster analysis of microarray data. Proc. Natl. Acad. Sci. USA 100(17), 418–427 (2003). https://doi.org/10.1073/pnas.1633770100

    Article  MathSciNet  MATH  Google Scholar 

  3. Dasigi, V., Karam, O., Pydimarri, S.: An evaluation of keyword selection on gene clustering in biomedical literature mining. In: Proceedings of Fifth IASTED International Conference on Computational Intelligence, pp. 119–124 (2010). URL: http://www.actapress.com/Abstract.aspx?paperId=43008

  4. Hamdan, H., Bellot, P., Béchet, F.: The impact of Z-score on Twitter sentiment analysis. In: Proceedings of 8th International Workshop on Semantic Evaluation, pp. 596–600 (2014). https://doi.org/10.3115/v1/s14-2113

  5. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a K-means clustering algorithm. J. R. Stat. Soc. Ser. C (Appl. Stat.) 28(1), 100–108 (1979). https://doi.org/10.2307/2346830

    Article  MATH  Google Scholar 

  6. Ikeda, D., Suzuki, E.: Mining peculiar compositions of frequent substrings from sparse text data using background texts. In: Proceedings of European Conference on Machine Learning and Knowledge Discovery in Databases, Springer Lecture Notes in Artificial Intelligence, vol. 5781, pp. 596–611 (2009). https://doi.org/10.1007/978-3-642-04180-8_56

    Chapter  Google Scholar 

  7. Liu, Y., Navathe, S., Pivoshenko, A., Dasigi, V., Dingledine, R., Ciliax, B.: Text analysis of MEDLINE for discovering functional relationships among genes: evaluation of keyword extraction weighting schemes. Int. J. Data Min. Bioinform. 1(1), 88–110 (2006). https://doi.org/10.1504/ijdmb.2006.009923

    Article  Google Scholar 

  8. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523 (1988). https://doi.org/10.1016/0306-4573(88)90021-0

    Article  Google Scholar 

Download references

Acknowledgments

The authors acknowledge that the MEDLINE® data used in this research are covered by a license agreement supported by the U.S. National Library of Medicine. Thanks are also due to Professor Rajnish Singh (Kennesaw State University) for her assistance in relation to evaluating the keywords for the various genes, and for her help in other ways related to this work.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Venu G. Dasigi .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Dasigi, V.G., Karam, O., Pydimarri, S. (2019). Impact of Context on Keyword Identification and Use in Biomedical Literature Mining. In: Arai, K., Bhatia, R., Kapoor, S. (eds) Proceedings of the Future Technologies Conference (FTC) 2018. FTC 2018. Advances in Intelligent Systems and Computing, vol 880. Springer, Cham. https://doi.org/10.1007/978-3-030-02686-8_38

Download citation

Publish with us

Policies and ethics