RCorp: a resource for chemical disease semantic extraction in Chinese
To robustly identify synergistic combinations of drugs, high-throughput screenings are desirable. Automatically identifying such relations in published papers with machine learning based tools would be of great help. To support chemical disease semantic relation extraction, especially for chronic diseases, we manually annotated a chronic disease specific corpus for combination therapy discovery in Chinese (RCorp).
In this study, we extracted abstracts from a Chinese medical literature server and followed the annotation framework of the BioCreative CDR corpus, with the guidelines modified to make the combination therapy related relations available. An annotation tool was incorporated into the standard annotation process.
The resulting RCorp consists of 339 Chinese biomedical articles with 2367 annotated chemicals, 2113 diseases, 237 symptoms, 164 chemical-induce-disease relations, 163 chemical-induce-symptom relations, and 805 chemical-treat-disease relations. Each annotation includes both the mention text span and a normalized concept identifier. Measured by F score, the inter-annotator agreement of the corpus is 0.883 for chemical entities and 0.791 for disease entities. The F score for chemical-treat-disease relations is 0.788 after unifying the entity mentions.
We extracted and manually annotated a chronic disease specific corpus for combination therapy discovery in Chinese. The result analysis of the corpus proves its quality for the combination therapy related knowledge discovery task. Our annotated corpus would be a useful resource for the modelling of entity recognition and relation extraction tools. In the future, an evaluation based on the corpus will be held.
Keywords: Corpus annotation; Chemical-disease relations; Chronic diseases; Combination therapy; Relation extraction
Abbreviations
Chinese Biomedical Semantic Annotation System
CCKS: China Conference on Knowledge graph and Semantic computing
CHV: Consumer health vocabularies
CMeSH: Chinese Medical Subject Headings
EU-ADR: a group to exploit different European electronic healthcare records (EHR) databases for drug safety signal detection
I2B2: The Informatics for Integrating Biology and the Bedside
NER: Named entity recognition
NLP: Natural language processing
SNOMED-CT: Systematized Nomenclature of Medicine-Clinical Terms
UMLS: Unified medical language system
Relations between chemicals (drugs) and diseases (Chemical-Disease Relations or CDRs) play critical roles in drug discovery, biocuration, pharmacovigilance, etc. Combination therapies, that is, disease treatments with two or more drugs, have the potential to improve efficacy while limiting toxicity. Various studies have demonstrated that a drug combination therapy may be beneficial in the treatment and management of chronic medical conditions, such as diabetes mellitus, Alzheimer’s disease, rheumatoid arthritis and pulmonary disorders [3, 4, 5], which are now among the most common and costly health problems worldwide [6, 7]. To robustly identify synergistic combinations, high-throughput screenings are desirable. Mounting clinical evidence in biomedical text can also help knowledge discovery of combination therapy for chronic diseases, so identifying the relations in published papers would be of great help. However, because the amount of biomedical text has increased significantly, the relation discovery and extraction process needs to be assisted by text mining tools. Although some Information Extraction (IE) research has focused on unsupervised methods of developing systems [9, 10], most practical modern IE work requires data that have been manually annotated with the events, entities and relationships that are considered to express key content for the given domain.
Much effort has been devoted to manually curating entities and their relations. Roberts et al. constructed a semantically annotated corpus of 150 clinical texts from the textual component of patient records, which includes condition, intervention, drug, locus and their interaction relations. I2B2 organized a challenge on concepts, assertions, and relations in clinical text and released a corpus with 871 annotated clinical records. The annotation framework of I2B2 is similar to the work of Roberts but with a more suitable design. The I2B2 corpus focused on medical problem concepts, and its relation classification task focused on assigning relation types that hold between medical problems, tests, and treatments. The medical problems in the I2B2 corpus include diseases and symptoms, which are treated separately in some other studies. The work mentioned above is mainly based on clinical text, and other efforts have been based on scientific literature. EU-ADR constructed a corpus annotated for drugs, disorders, genes and their inter-relationships. For each of the drug–disorder, drug–target, and target–disorder relations, three experts annotated a set of 100 abstracts. To investigate the semantic relationships in biomedical texts, Rosario et al. extracted sentences from titles and abstracts of Medline 2001 articles, and distinguished seven relation types that can occur between the entities “treatment” and “disease” in bioscience texts. The limitation of the corpus is that only relations within a sentence are provided, while bioscience texts are more likely to express a relation across several sentences or a paragraph. The Comparative Toxicogenomics Database provides 254,173 manually curated toxicogenomic interactions (152,173 chemical-disease, 58,572 chemical-gene, 5345 gene-disease and 38,083 phenotype interactions). However, the entity annotations, which are key features for machine learning tasks, are lacking.
BioCreative V developed a corpus for both named entity recognition and chemical-disease relations in the literature. A total of 1500 articles have been annotated with automated assistance from PubTator [16, 17]. However, the combination therapies are recorded as several separate chemicals. To promote the performance of clinical named entity recognition on Chinese clinical text, the 2017 and 2018 China Conference on Knowledge Graph and Semantic Computing (CCKS) organized a named entity recognition (NER) evaluation task to identify and extract the anatomy, symptom, independent symptom, drug and operation entities from Chinese clinical text. However, no semantic relations among the entities were released in the corpus.
In summary, existing corpora with CDRs cannot support chemical disease semantic relation extraction, especially for chronic diseases in Chinese, although their annotation frameworks are useful for reference, especially that of BioCreative V CDR. Therefore, to support chemical disease semantic relation extraction, especially for chronic diseases, a Chinese biomedical semantic relation corpus (RCorp) was manually annotated with a guideline clarifying combination therapies. The corpus aims to provide a standard dataset for the modelling of natural language processing tools that mine knowledge about combination therapy of chronic diseases from biomedical text. In the future, the mined CDRs could be further visualized to enhance the reading efficiency of researchers.
To construct a corpus for chemical disease semantic extraction in Chinese, we followed the annotation framework of the BioCreative CDR corpus, with the guidelines modified to make the combination therapy related relations available.
In our work, we selected a well-known Chinese medical literature server (WANFANG MED ONLINE) as the source of biomedical abstracts. The topics of RCorp articles were predefined and limited to a number of typical chronic diseases, including asthma, chronic obstructive pulmonary disease, tuberculosis of intestines, hypertension, diabetes mellitus, thyroadenitis, hepatitis, Sjogren’s syndrome, cerebral stroke, systemic sclerosis, chronic kidney disease, indolent lymphoma and leucocythemia.
The topic distributions of RCorp
The knowledge about combination therapies can be indicated by the chemical, disease and symptom entities and their relations. In the selected articles, few symptoms are mentioned and almost all of the chemicals are used to treat diseases rather than symptoms. Therefore, the relation types are defined as chemical-induce-disease, chemical-induce-symptom and chemical-treat-disease, with chemical-treat-disease being the key relation in our work. We performed manual annotation of all chemicals, diseases, symptoms and their interactions mentioned in the articles. For each entity occurrence, we not only annotated its text span but also assigned a relevant concept identifier from the Chinese Medical Subject Headings (CMeSH), a controlled vocabulary of biomedical concepts provided by the Institute of Medical Information, Chinese Academy of Medical Sciences.
We recruited three CMeSH indexers, all of whom had a medical training background and curation experience. Each article was annotated independently by two annotators (i.e., double annotation). Differences were resolved by a third and senior annotator.
The task organizers followed the usual practice of biomedical corpus annotation for entity annotation and entity relation annotation. An important difference in the entity and relation annotation guideline is that a combination therapy should be annotated as a single mention, to provide more hints for relation recognition. In BioCreative CDR, a combination of chemicals is annotated as two separate chemical mentions, and thus two separate relation mentions are annotated. As a result, the combination therapy information, which is important for combination therapy related knowledge discovery, is missing from the final annotated results. Therefore, to make the combination therapy information explicit, RCorp provides an alternative expression of chemical combinations by annotating the “AND” relations of chemicals in a combination therapy both on the entity level and on the relation level. For example, “培美曲塞联合顺铂” (Pemetrexed combined with Cisplatin) should be annotated as an entry “C0210657 C1859690” in the sentence “培美曲塞联合顺铂治疗非小细胞肺癌29例临床评价” (Clinical Evaluation on Pemetrexed Combined with Cisplatin in Treating 29 Patients with Non-Small Cell Lung Cancer). Accordingly, the relation in which the chemical combination “Pemetrexed combined with Cisplatin” treats the disease “Non-Small Cell Lung Cancer” should be annotated as a single relation mention “C0210657 C1859690 CTD C0007131” rather than as the pair “C0210657 CTD C0007131” and “C1859690 CTD C0007131”, where “CTD” is the abbreviation of “chemical-treat-disease”.
If there are two individual mentions of chemicals for the same disease in an article, they will not be treated as a combination therapy unless the relation between the two chemicals is “AND”. For example, in a comparison study of different medications, the relation between the medications is “OR” rather than “AND”, and thus they will not be annotated as a combination therapy.
Unlike in the BioCreative CDR annotation task, the annotators were asked to annotate the relations based on their own entity annotations rather than on gold-standard entity annotations, which was designed to improve the annotation efficiency.
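The combination rule above can be illustrated with a minimal sketch. It assumes that a relation entry is a whitespace-separated string laid out as in the example “C0210657 C1859690 CTD C0007131”: one or more chemical IDs, then a relation type, then the ID of the treated or induced concept. The names Relation and parse_relation are hypothetical and are not taken from the RCorp annotation tool.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Relation:
    """One relation mention; several chemical IDs mean an "AND" combination."""
    chemicals: FrozenSet[str]
    relation_type: str  # e.g. "CTD" for chemical-treat-disease
    target: str         # concept ID of the disease (or symptom)

    @property
    def is_combination(self) -> bool:
        return len(self.chemicals) > 1


def parse_relation(entry: str) -> Relation:
    """Parse an entry such as "C0210657 C1859690 CTD C0007131" (assumed layout)."""
    tokens = entry.split()
    if len(tokens) < 3:
        raise ValueError(f"expected chemical ID(s), relation type and target: {entry!r}")
    # Everything before the last two tokens is treated as a chemical ID.
    *chemicals, relation_type, target = tokens
    return Relation(frozenset(chemicals), relation_type, target)


combined = parse_relation("C0210657 C1859690 CTD C0007131")
single = parse_relation("C0210657 CTD C0007131")
print(combined.is_combination, single.is_combination)
```

Under this representation, two medications compared in an “OR” setting would simply be stored as two single-chemical relations, so only true “AND” combinations carry more than one chemical ID.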
Annotation data formats
All annotation data are available in the PubTator format, which consists of a straightforward tab-delimited text file. Two versions of the annotations are provided: a version in which relations are expressed with separate chemicals, and a version with combination therapies. The annotators are required to provide only the version with combination therapies, and the annotation tool automatically converts it to the one with separate chemicals.
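The conversion from the combination therapy version to the separate chemical version can be sketched as follows. This is only an illustration under stated assumptions: relations are held in memory as (chemical IDs, relation type, target ID) tuples, the function name split_combinations is hypothetical, and dropping duplicate triples is a design choice of this sketch rather than documented behaviour of the annotation tool.

```python
from typing import Iterable, List, Set, Tuple

# (chemical IDs in the entry, relation type, target concept ID)
CombinedRelation = Tuple[Tuple[str, ...], str, str]
# (single chemical ID, relation type, target concept ID)
SeparateRelation = Tuple[str, str, str]


def split_combinations(relations: Iterable[CombinedRelation]) -> List[SeparateRelation]:
    """Expand each combination relation into one relation per chemical."""
    separate: List[SeparateRelation] = []
    seen: Set[SeparateRelation] = set()
    for chemicals, relation_type, target in relations:
        for chemical in chemicals:
            triple = (chemical, relation_type, target)
            # Assumption: a triple that arises more than once is kept only once.
            if triple not in seen:
                seen.add(triple)
                separate.append(triple)
    return separate


combined_version = [(("C0210657", "C1859690"), "CTD", "C0007131")]
for relation in split_combinations(combined_version):
    print("\t".join(relation))
```

The reverse direction is not recoverable from the separate version alone, which is why only the combination therapy version needs to be annotated by hand.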
Inter-annotator agreement (IAA) analysis
To assess the consistency of the entity and entity relation annotations, we used metrics equivalent to those commonly used in IE evaluations. We measured pairwise agreement of the duplicate annotations using the F score, with one annotator's independent annotations serving as the benchmark set for the other's.
For the CTD task, if a CTD relation mention has exactly the same article ID, chemical ID, disease ID and relation type as an entry in the benchmark set, it is counted as a true positive (TP). For a combination therapy, the order of the chemical IDs is ignored. For example, “C0210657 C1859690” is treated the same as “C1859690 C0210657”. Following the work of Roberts, a relaxed IAA will be evaluated in a future study. A relaxed matching F score is also used, in which mentions with the same start and end points but different concept identifiers are counted as TP. For relation mentions, a relaxed F score is computed based on a unified entity mention set.
The resulting corpus consists of 339 Chinese biomedical articles with 2367 annotated chemicals, 2113 diseases, 237 symptoms, 164 chemical-induce-disease relations (CID), 163 chemical-induce-symptom relations (CIS), and 805 chemical-treat-disease relations (CTD). For entity mentions, chemical and disease mentions far outnumber symptom mentions. For relation mentions, there are more CTD mentions than CID and CIS ones. The corpus therefore appears better suited to chemical, disease and CTD recognition.
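The matching rules above can be sketched in a few lines. This is a minimal illustration, not the evaluation script used for RCorp: the helper names, the in-memory tuple layout and the article ID "doc1" are assumptions made for the example. A frozenset makes the chemical IDs of a combination order-insensitive, and the relaxed entity key simply drops the concept identifier.

```python
from typing import AbstractSet, FrozenSet, Hashable, Tuple


def f_score(annotated: AbstractSet[Hashable], benchmark: AbstractSet[Hashable]) -> float:
    """F score of one annotator's set, with the other annotator's set as benchmark."""
    true_positives = len(annotated & benchmark)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(annotated)
    recall = true_positives / len(benchmark)
    return 2 * precision * recall / (precision + recall)


def relation_key(article_id: str, chemical_ids: str, disease_id: str,
                 relation_type: str) -> Tuple[str, FrozenSet[str], str, str]:
    """Strict relation key; the frozenset ignores the order of chemical IDs."""
    return (article_id, frozenset(chemical_ids.split()), disease_id, relation_type)


def strict_entity_key(article_id: str, start: int, end: int,
                      concept_id: str) -> Tuple[str, int, int, str]:
    """Strict entity key: same span and same concept identifier."""
    return (article_id, start, end, concept_id)


def relaxed_entity_key(article_id: str, start: int, end: int,
                       concept_id: str) -> Tuple[str, int, int]:
    """Relaxed entity key: same span, concept identifier ignored."""
    return (article_id, start, end)


annotator_a = {relation_key("doc1", "C0210657 C1859690", "C0007131", "CTD")}
annotator_b = {relation_key("doc1", "C1859690 C0210657", "C0007131", "CTD")}
print(f_score(annotator_a, annotator_b))
```

Because swapping the two sets only exchanges precision and recall, the F score is the same whichever annotator is taken as the benchmark.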
The overall corpus statistics
Inter-annotator agreement for mention annotation
Existing corpora often focus on the annotation of single entities and do not provide inter-annotator agreement scores. In our work, inter-annotator agreement scores are presented for both entities and relations.
Inter-annotator agreement F scores of the corpus
The relaxed F scores, which ignore differences in concept identifiers, are higher than the strict F scores, which indicates that a large proportion of disagreements stem from identifier discrepancies.
The relation IAAs indicate that the relation annotation work is more subjective than the entity annotation work. They also indicate that the pipeline workflow, which annotates entities and relationships at the same time, can easily cause cascading errors, especially for relation annotation, where disagreements produced during entity annotation may be amplified.
Discrepancies between the two independent annotators were checked. For entity annotations, disagreements fall into two types: 1) inconsistent boundaries, including omitted mentions, wrong mentions and different boundaries; 2) inconsistent concept IDs, including erroneous IDs or different choices of IDs. For chemical mentions, 47.45% of disagreements are boundary ones and 52.55% are ID ones. Of the ID disagreements, the concept ID “-1” accounts for 16.83%, which indicates that unknown chemical entries in the CMeSH dictionary influence the annotation quality. Combination annotations, which have two or more chemical IDs in an entry, are also more likely to produce boundary disagreements than single ones. For example, for “硫酸沙丁胺醇溶液+布地奈德混悬液” (Salbutamol Sulfate Solution Combined with Budesonide Inhalation Solution), annotators disagreed on whether “混悬液” (Inhalation Solution) should be included in the mention. For disease mentions, 57.41% of disagreements are boundary ones and 42.59% are ID ones. The concept ID “-1” accounts for only 5.17% of the disease concept ID disagreements, which indicates that the CMeSH dictionary coverage of diseases is much better than that of chemicals. For relation annotations, a large proportion of discrepancies stem from inconsistent entity annotations; in other words, the efficiency of the pipeline workflow comes at the expense of accuracy.
A comparison with other related works
A comparison of works on the corpus building of CDRs
Corpus or author name
Condition, intervention, drug, locus and their interaction relations
Relation types that hold between medical problems, tests, and treatments
Drugs, disorders, targets and their inter-relationships
Relationships between treatment and disease
75 docs/5410 sentences
Relationships between entities indicating adverse drug reaction events
BioCreative CDR 
Relationships between chemicals and diseases (CID)
Relationships between chemicals and diseases (CTD)
Limitations and future studies
In this study, the topics of the articles were limited, which limits the applications of the corpus. The combination therapy related relations, especially the CID and CIS relations, are not sufficient for training models. Our next step is to enlarge the annotation scope and size. To improve the agreement rates, we will change to a two-phase approach for entity annotation and relation annotation, as in [16, 24]. An evaluation of relation extraction will also be held in the future. We hope that the corpus will serve as an important resource for developing relation extraction tools that automatically mine relations from biomedical abstracts in Chinese.
In this study, we presented a new annotation effort for chemical disease semantic extraction in Chinese. The corpus is chronic disease specific and targeted at combination therapy related mining from biomedical abstracts in Chinese. The result analysis of the corpus proves its quality for the chemical-treat-disease relation identification task. Our annotated corpus would be a useful resource for the modelling of relation extraction tools. In the future, we will further enlarge the corpus and use it to evaluate related semantic relation tools, which will be applied in information providing platforms to enhance the visualization of biomedical texts and help knowledge graph construction.
The authors would like to thank the BioCreative V organizers for providing a practical example of how to organize a standard biomedical corpus. The authors also would like to thank WANFANG for providing the chronic disease related abstracts as the original dataset and thank all the CMeSH annotators from IMICAMS for manually annotating the entities and entity relations.
About this supplement
This article has been published as part of BMC Medical Informatics and Decision Making Volume 19 Supplement 5, 2019: Selected articles from the second International Workshop on Health Natural Language Processing (HealthNLP 2019). The full contents of the supplement are available online at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-19-supplement-5.
Li Hou conducted the abstract extraction and corpus annotation study. Yueping Sun and Jiao Li designed the annotation framework, made the rules of annotation and analyzed the results. Yueping Sun, Lu Qin and Yan Liu verified the annotated data by different annotators and revised the annotation guidelines. Li Hou and Qing Qian revised the result analysis part. All the authors wrote and revised the manuscript, and all have read and approved the final manuscript.
The publication cost of this article was funded by the National Social Science Foundation for Young Scientists of China (Grant No. 18CTQ024). The research was also funded by the National Social Science Foundation of China (Grant No. 14BTQ032), the National Key Technology Research and Development Program of China (Grant No. 2016YFC0901901), the program of China Knowledge Center for Engineering Sciences and Technology (Medical Knowledge Service System) (Grant No. CKCEST-2019-1-10), the program of National Engineering Laboratory for Internet Medical Systems and Applications under Award Number NELIMSA2018P02, and the medical knowledge service program of the Key Laboratory of Knowledge Technology for Medical Integrative Publishing.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
- 4. Orloff DG. Fixed combination drugs for cardiovascular disease risk reduction: regulatory approach. Am J Cardiol. 2005;96(9), Suppl 1:28–33.
- 6. World Health Organization. Global status report on noncommunicable diseases. 2014. https://www.who.int/nmh/publications/ncd-status-report-2014/en/. Accessed 21 Dec 2018.
- 7. Wikipedia. Chronic disease in China. https://en.wikipedia.org/wiki/Chronic_disease_in_China. Accessed 21 Dec 2018.
- 9. Taewijit S, Theeramunkong T, Ikeda M. Distant supervision with transductive learning for adverse drug reaction identification from electronic medical records. J Healthcare Eng. 2017. https://doi.org/10.1155/2017/7575280.
- 10. Kim Y, Riloff E, Meystre SM. Exploiting unlabeled texts with clustering-based instance selection for medical relation classification. In: AMIA Annu Symp Proc; 2017. p. 1060–9.
- 15. Davis AP, Wiegers TC, Roberts PM, King BL, Lay JM, Lennon-Hopkins K, et al. A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions. Database (Oxford). 2013. https://doi.org/10.1093/database/bat080.
- 16. Li J, Sun Y, Johnson RJ, Sciaky D, Wei CH, Leaman R, et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford). 2016. https://doi.org/10.1093/database/baw068.
- 17. Wei CH, Peng Y, Robert L, Davis AP, Mattingly CJ, Li J, et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford). 2016. https://doi.org/10.1093/database/baw032.
- 18. Xia Y, Wang Q. Clinical named entity recognition: ECUST in the CCKS-2017 shared task 2. In: China Conference on Knowledge Graph and Semantic Computing; 2017. p. 43–8.
- 19. Li D, Hu T, Zhu W, Qian Q, Ren H, Li J, et al. Retrieval system for the Chinese medical subject headings. Chin J Med Library. 2004;4:1–2,9.
- 21. Wei CH, Harris BR, Li D, Berardini TZ, Huala E, Kao HY, et al. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database (Oxford). 2012. https://doi.org/10.1093/database/bas041.
- 22. Roberts A, Gaizauskas R, Hepple M, Demetriou G, Guo Y, Setzer A. Semantic annotation of clinical text: the CLEF corpus. In: Proceedings of the LREC 2008 workshop on building and evaluating resources for biomedical text mining; 2008. p. 19–26.
- 23. Schuemie M, Jelier R, Kors J. Peregrine: lightweight gene name normalization by dictionary lookup. In: Second BioCreative Workshop; 2007. p. 131–3.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.