Applicability of Machine Learning Methods to Multi-label Medical Text Classification

Lenivtceva, Iuliia; Slasten, Evgenia; Kashina, Mariya; Kopanitsa, Georgy

doi:10.1007/978-3-030-50423-6_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12140)

Included in the following conference series:

International Conference on Computational Science

Abstract

Structuring medical text using international standards improves interoperability and the quality of predictive modelling. Medical text classification facilitates information extraction. In this work we investigate the applicability of several machine learning models and classifier chains (CC) to the classification of unstructured medical text. The experimental study was performed on a corpus of 11671 manually labeled Russian medical notes. The results showed that the CC strategy improves classification performance. The ensemble of classifier chains based on linear SVC showed the best result: 0.924 micro F-measure, 0.872 micro precision and 0.927 micro recall.

1 Introduction

Medical data standardization is crucial for data exchange and integration, as data formats vary greatly from one healthcare provider to another. Many international standards for terminologies (SNOMED CT [1], LOINC [2]) and data exchange (openEHR [3], ISO 13606 [4], HL7 standards [5]) have been successfully implemented and perform well in practice. The most actively developing and promising standard for medical information today is HL7 FHIR [6].

Medical databases usually store data in structured, semi-structured or unstructured form. Structured and semi-structured data can be mapped to standards with minimal loss of information [7]. However, a large part of the Electronic Health Record (EHR) is free text [8]. Unstructured medical records are harder to process, but they usually contain detailed information on patients that is valuable in modeling and research [9].

The extraction of useful knowledge becomes more challenging as medical databases become more available and contain a wider range of texts [10]. Sorting documents and searching for concepts and entities in texts manually is time-consuming. Text classification is an important task that sorts documents or notes according to predefined classes [11], which facilitates the extraction of entities such as symptoms [12], drug names [13], dosage [14], drug reactions [15], etc. The task of information extraction (IE) is domain specific, and this specificity has to be considered in practice. Thus, high performance in IE can be achieved by first classifying free text into a particular domain [16].

Applications and methods developed for processing free text are language specific [17]. Processing Russian medical free text is challenging mostly because there are no open-source medical corpora [18]. Moreover, each medical team develops its own storage format, which makes it difficult to standardize, exchange and integrate Russian medical data.

Our long-term goal is to develop methods for extracting data from Russian unstructured clinical notes and mapping these data to FHIR for better interoperability and personalized medicine. The purpose of this article is to investigate the applicability of machine learning algorithms to the classification of Russian unstructured and semi-structured allergy anamnesis in order to facilitate entity extraction.

2 Related Work

Studies on text classification using machine learning methods are widely represented in the literature.

Jain et al. [16] describe classifiers based on Multinomial Naïve Bayes (MNB), k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM) as the most popular models for multi-label classification. Logistic regression (LR) is also a widespread model for the task [19].

The binary relevance (BR) approach trains N independent binary classifiers for multi-label classification with N labels. This approach has linear complexity; however, it does not consider interdependences between labels [19]. Classifier Chains (CC) is a popular and representative algorithm for multi-label classification. CC links N binary classifiers in a chain with random ordering, which has been shown to give better predictive performance. The labels predicted by earlier classifiers are treated as extra features for the next classifiers in the chain. CC and ensembles [20] are known to mitigate the over-fitting problem. CC are more computationally demanding than simple binary classifiers [21].
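The difference between BR and CC can be sketched with scikit-learn. This is a minimal illustration on hypothetical toy data (the arrays `X` and `Y` are invented here), not the configuration used in the study: `OneVsRestClassifier` plays the role of binary relevance, and `ClassifierChain` feeds the labels earlier in the chain to later classifiers as extra features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain

# Hypothetical toy data: 6 notes, 3 bag-of-words features, 2 labels.
X = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [0, 0], [1, 0]])

# Binary relevance: one independent binary classifier per label.
br = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Classifier chain: each classifier also receives the labels that come
# earlier in the (randomly ordered) chain as additional input features.
cc = ClassifierChain(LogisticRegression(), order="random", random_state=0).fit(X, Y)

br_pred = br.predict(X)
cc_pred = cc.predict(X)  # columns are returned in the original label order
print("chain order:", cc.order_)
```

With N labels both approaches fit N binary models; the chain only adds up to N-1 extra feature columns per model, but its models must be applied sequentially at prediction time.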

The performance metrics of multi-label classifiers applied to medical text are presented in Table 1. The literature review showed that there is no consensus on which metrics to use when evaluating multi-label classifiers.

Table 1. Performance of medical multi-label classifiers

3 Methods

3.1 Data Description

Clinical documents (written in Russian) of more than 250 thousand patients were provided by Almazov National Medical Research Centre (St. Petersburg, Russia) for the research. The patients’ personal information was discarded. To find all the notes containing any information on allergies and intolerances, we searched for different forms of the words «allergy» and «(in)tolerance» (Russian equivalents «аллергия», «(не)переносимость») using regular expressions. After the search and the removal of duplicates, a corpus of 269 thousand notes was obtained. We classified allergy notes according to four labels, which are described in Table 2.
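The keyword search and duplicate removal could look roughly like the sketch below. The exact regular expressions used in the study are not given, so the stems `аллерг` and `перенос`, the function name `mentions_allergy` and the sample notes are assumptions made for illustration only.

```python
import re

# Assumed stems: "аллерг" covers forms of «аллергия»; "(не)перенос" covers
# forms of «(не)переносимость» and of the related verb «переносить».
ALLERGY_RE = re.compile(r"\b(?:аллерг\w*|(?:не)?перенос\w*)", re.IGNORECASE)


def mentions_allergy(note: str) -> bool:
    """Return True if the note contains any form of the target words."""
    return ALLERGY_RE.search(note) is not None


# Hypothetical sample notes, including one exact duplicate.
notes = [
    "Аллергия на пенициллин.",
    "Непереносимость лактозы отрицает.",
    "Жалоб нет.",
    "Аллергия на пенициллин.",
]

# dict.fromkeys drops exact duplicates while preserving the original order.
corpus = [n for n in dict.fromkeys(notes) if mentions_allergy(n)]
print(corpus)
```

A stem-based pattern of this kind is deliberately broad, so it also matches notes that only mention how a treatment is tolerated; such notes have to be filtered out later by classification.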

Table 2. Classes description

Two experts assigned an appropriate label to each note. In case of disagreement the decision was made by consensus.

The final corpus contains 11671 labeled notes.

3.2 Task Description

AllergyIntolerance is one of the FHIR resources; it contains structured information on a patient’s allergies, intolerances and symptoms. Mapping these data to FHIR requires machine learning methods because the data are stored in unstructured form. Figure 1 shows the main blocks of information that can be mapped to FHIR. Bold blocks denote information that is mentioned in the processed corpus.

The underlying mechanism can be extracted by searching for the keywords «allergy» and «intolerance» in the corpus. Category refers to the exact substance type. The most sophisticated task is to extract exact substances and clinical symptoms written in Russian and to bind them to the corresponding codes from international terminological systems to ensure interoperability. To facilitate this task, classification of multi-topic clinical notes is required.
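As a rough illustration of the simplest part of the mapping, the sketch below fills the `type` element (allergy or intolerance) of an AllergyIntolerance-like JSON structure from a keyword match. The element names `resourceType`, `type`, `patient` and `note` follow the FHIR R4 resource as we understand it; the function name, the keyword stems and the patient identifier are hypothetical, and substance, category and reaction are left out because they require the entity extraction discussed above.

```python
def to_allergy_intolerance(note: str, patient_id: str) -> dict:
    """Build a minimal AllergyIntolerance-like dict from a free-text note."""
    text = note.lower()
    # Check "intolerance" first, since its stem is the more specific one.
    if "непереносим" in text:
        mechanism = "intolerance"
    elif "аллерг" in text:
        mechanism = "allergy"
    else:
        mechanism = None

    resource = {
        "resourceType": "AllergyIntolerance",
        "patient": {"reference": f"Patient/{patient_id}"},
        "note": [{"text": note}],  # keep the original free text
    }
    if mechanism is not None:
        resource["type"] = mechanism
    return resource


example = to_allergy_intolerance("Непереносимость лактозы", "123")
print(example)
```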

3.3 Preprocessing

The steps of preprocessing are:

1. Clean medical notes of symbols and extra spaces. Full stops are kept because they play an important role in sentence tokenization.
2. Reduce notes to minimize noise during classification, as the original note might contain up to 9239 words. Only 2 meaningful sentences before and after the match of the regular expression («аллергия», «(не)переносимость») are kept.
3. Correct syntactic, case and spacing errors using regular expressions.
4. Correct spelling using a dictionary and Levenshtein distance.
5. Tokenize and normalize words.
6. Split the corpus into a training set of 7819 notes and a test set of 3852 notes.
7. Vectorize both the training and test sets using the Bag of Words (BOW) representation. The dictionary size for BOW is 8000 words.
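Steps 2, 4 and 7 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' scripts: the helper names (`reduce_note`, `levenshtein`, `correct`), the keyword stems, the maximum edit distance of 1 and the toy notes are all hypothetical, and lexical normalization (step 5) is omitted.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

KEYWORD_RE = re.compile(r"(?:аллерг|перенос)", re.IGNORECASE)  # assumed stems


def reduce_note(note, window=2):
    """Step 2: keep each keyword sentence plus `window` sentences on each side."""
    sentences = [s.strip() for s in re.split(r"(?<=\.)\s+", note) if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        if KEYWORD_RE.search(sentence):
            keep.update(range(max(0, i - window), min(len(sentences), i + window + 1)))
    return " ".join(sentences[i] for i in sorted(keep))


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def correct(word, dictionary, max_dist=1):
    """Step 4: replace a word by its nearest dictionary entry if close enough."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word


reduced = reduce_note("A. B. C. Аллергия на йод. D. E. F.")
fixed = correct("аллергя", {"аллергия", "нет"})

# Step 7: the vocabulary is learned on the training set only and is capped
# at 8000 words; the test set is transformed with the same vocabulary.
train_notes = ["аллергия на пенициллин", "аллергия отрицает"]
test_notes = ["аллергия на йод"]
vectorizer = CountVectorizer(max_features=8000)
X_train = vectorizer.fit_transform(train_notes)
X_test = vectorizer.transform(test_notes)
```

Fitting the vectorizer on the training notes alone keeps test-set vocabulary from leaking into the model; words unseen during training are simply ignored at test time.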

3.4 Classification

We applied four shallow machine learning models (MNB, LR, SVM and k-NN) and two ensembles of classifier chains (ECCLR and ECCSVM). The optimal parameters of the shallow models were found by grid search and are listed in Table 3.
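A grid search over one hyperparameter of a multi-label model might be set up as in the sketch below. The parameter grid, the number of folds, the micro F-measure as the selection criterion and the synthetic data are assumptions chosen for illustration; the values actually searched in the study are not stated here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for a BOW matrix with two labels derived from features.
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(60, 10))
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),
    (X[:, 1] + X[:, 2] > 2).astype(int),
])

search = GridSearchCV(
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    param_grid={"estimator__C": [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="f1_micro",
    cv=3,
)
search.fit(X, Y)
best_C = search.best_params_["estimator__C"]
print("best C:", best_C)
```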

Table 3. Parameters of classifiers

The pipeline was built using Python version 3.7.1. «pymorphy2» was used for lexical normalization. All the preprocessing steps were implemented with custom scripts. The «scikit-learn» package was used to implement supervised learning algorithms, evaluate models and perform t-SNE. «Bokeh», «matplotlib» and «plotly» were used for visualization.
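An ensemble of classifier chains with a linear SVC base model can be assembled from scikit-learn parts roughly as follows. The number of chains, the majority-vote threshold and the helper names `fit_ecc` and `predict_ecc` are assumptions for this sketch; the paper does not specify how its ensembles were combined.

```python
import numpy as np
from sklearn.multioutput import ClassifierChain
from sklearn.svm import LinearSVC


def fit_ecc(X, Y, n_chains=10, seed=0):
    """Fit several chains, each with its own random label order."""
    return [
        ClassifierChain(LinearSVC(), order="random", random_state=seed + i).fit(X, Y)
        for i in range(n_chains)
    ]


def predict_ecc(chains, X, threshold=0.5):
    """Average the binary votes of all chains and threshold the result."""
    votes = np.mean([chain.predict(X) for chain in chains], axis=0)
    return (votes >= threshold).astype(int)


# Synthetic stand-in data: 80 notes, 12 features, 3 labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(80, 12))
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),
    (X[:, 1] + X[:, 2] > 2).astype(int),
    (X[:, 3] > 1).astype(int),
])

chains = fit_ecc(X, Y, n_chains=5)
Y_pred = predict_ecc(chains, X)
```

Voting on hard predictions is used here because `LinearSVC` does not provide class probabilities; with a probabilistic base model such as LR, averaging `predict_proba` outputs would be a natural alternative.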

3.5 Evaluation Metrics

According to [21], macro- and micro-averaged precision, recall and F-measure are often used to evaluate multi-label classification performance. We therefore used these metrics to evaluate the classifiers.

Micro-averaging:

$$ B_{micro}(h) = B\left( \sum\nolimits_{j=1}^{q} TP_{j}, \sum\nolimits_{j=1}^{q} FP_{j}, \sum\nolimits_{j=1}^{q} TN_{j}, \sum\nolimits_{j=1}^{q} FN_{j} \right) \tag{1} $$

Macro-averaging:

$$ B_{macro}(h) = \frac{1}{q} \sum\nolimits_{j=1}^{q} B(TP_{j}, FP_{j}, TN_{j}, FN_{j}) \tag{2} $$

B ∈ {Precision, Recall, F^β}, q – number of class labels.

Precision (positive predictive value) is the fraction of correctly identified examples of the class among all the examples identified as this class.

$$ Precision(TP_{j}, FP_{j}, TN_{j}, FN_{j}) = \frac{TP_{j}}{TP_{j} + FP_{j}} \tag{3} $$

Recall evaluates the fraction of identified examples from the class among all the examples of this class.

$$ Recall(TP_{j}, FP_{j}, TN_{j}, FN_{j}) = \frac{TP_{j}}{TP_{j} + FN_{j}} \tag{4} $$

The F-measure is the harmonic mean of precision and recall when β = 1.

$$ F^{\beta}(TP_{j}, FP_{j}, TN_{j}, FN_{j}) = \frac{(1 + \beta^{2}) TP_{j}}{(1 + \beta^{2}) TP_{j} + FP_{j} + \beta^{2} FN_{j}} \tag{5} $$

TP – true positive examples, TN – true negative examples, FP – false positive examples, FN – false negative examples, β = 1.
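Equations (1)-(5) translate directly into a few lines of NumPy. The sketch below uses hypothetical label matrices and compares the hand-written averages with scikit-learn's `precision_recall_fscore_support`; TN is not computed because none of the three metrics uses it.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def confusion_counts(Y_true, Y_pred):
    """Per-label TP, FP and FN counts (one entry per label j)."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
    return tp, fp, fn


def precision(tp, fp, fn):
    return tp / (tp + fp)                                   # Eq. (3)


def recall(tp, fp, fn):
    return tp / (tp + fn)                                   # Eq. (4)


def f_beta(tp, fp, fn, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + fp + b2 * fn)   # Eq. (5)


def micro(metric, tp, fp, fn):
    return metric(tp.sum(), fp.sum(), fn.sum())             # Eq. (1)


def macro(metric, tp, fp, fn):
    return np.mean(metric(tp, fp, fn))                      # Eq. (2)


# Hypothetical ground truth and predictions: 4 notes, 3 labels.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]])

tp, fp, fn = confusion_counts(Y_true, Y_pred)
micro_f = micro(f_beta, tp, fp, fn)
macro_p = macro(precision, tp, fp, fn)
sk_micro = precision_recall_fscore_support(Y_true, Y_pred, average="micro")
sk_macro = precision_recall_fscore_support(Y_true, Y_pred, average="macro")
```

Micro-averaging pools the counts over all labels before applying the metric, so frequent labels dominate; macro-averaging applies the metric per label and then averages, so every label counts equally.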

t-SNE was performed using the predicted probabilities for each label. The perplexity was set to 30 according to the recommendations of van der Maaten and Hinton [29].
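A minimal sketch of this projection with scikit-learn is shown below. The random matrix `proba` merely stands in for the per-label probabilities predicted by a classifier, and the fixed `random_state` is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for predicted probabilities: 100 notes, 4 labels.
rng = np.random.RandomState(0)
proba = rng.rand(100, 4)

# Project the 4-dimensional probability vectors to 2-D with perplexity 30.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(proba)
print(embedding.shape)
```

The perplexity must be smaller than the number of samples, which is why the stand-in matrix has more than 30 rows.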

4 Results

After text cleaning, there were still notes that contained neither allergies nor intolerances.

Figure 2 illustrates the distribution of classes in the corpus. The classes are imbalanced.

The performance of the different classifiers is presented in Table 4. LR and linear SVM showed the best results among the shallow classifiers. However, using CC with LR and linear SVM as base classifiers improved the performance metrics and gave the best results overall.

Table 4. Performance of the applied classifiers

The classification report for the best classifier is presented in Table 5.

Table 5. Classification report for ECCSVM

Figure 3 illustrates the t-SNE representation of the classes.

Figures 4, 5, 6 and 7 show the 10 most important keywords in the corpus that indicate that a note belongs to the corresponding class. The diagrams show how often each word occurs in the corpus (word counts) and how important the word is for classification (classifier weights). The diagrams are plotted using LR weights.
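The pairing of word counts with LR weights can be reproduced along the lines of the sketch below, which ranks vocabulary words by the coefficient of a binary LR model for a single class. The function name `top_keywords`, the English toy notes and their labels are hypothetical; in a multi-label setting the same procedure would be repeated with one binary model per class.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression


def top_keywords(notes, labels, k=10):
    """Return (word, LR weight, corpus count) for the k highest-weighted words."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(notes)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)

    # Recover words in column order from the fitted vocabulary mapping.
    words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    counts = np.asarray(X.sum(axis=0)).ravel()
    weights = clf.coef_[0]

    order = np.argsort(weights)[::-1][:k]
    return [(words[i], float(weights[i]), int(counts[i])) for i in order]


# Hypothetical notes; label 1 means the note names an allergen or substance.
notes = [
    "allergy to penicillin",
    "food intolerance",
    "no allergy",
    "denies allergy",
    "allergy to iodine rash",
    "no complaints",
]
labels = [1, 1, 0, 0, 1, 0]

keywords = top_keywords(notes, labels, k=5)
for word, weight, count in keywords:
    print(f"{word:12s} weight={weight:+.3f} count={count}")
```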

5 Discussion

Many authors of previous studies on multi-label medical text classification use existing applications for entity extraction and algorithm implementation (Table 1). However, there are no open-source applications for medical purposes, such as MetaMap [30], developed for the Russian language. Thus, all the steps were implemented manually and with custom scripts.

In the multi-label medical text classification task with limited labeled data, we concentrated on improving the F-measure, as it enforces a better balance between performance on relevant and irrelevant labels and is thus suitable for evaluating multi-label tasks [31]. Also, precision, recall and F-measure are relatively insensitive to class imbalance.

Two of the proposed shallow classifiers, LR and linear SVM, performed well on real unstructured labeled data. The CC strategy improved the results of the basic classifiers, and the best performance was shown by the ensemble of classifier chains based on linear SVC. The classification report for this classifier (Table 5) shows that the three most important labels for mapping (AL, R and NN) are well separated from each other and from the fourth class, N. The fourth class showed lower performance, which may be caused by it having the smallest number of labeled notes in the corpus and by the variety of topics it covers.

Recall is higher than precision for all classifiers and for both averaging strategies. This means that the classifiers are good at identifying classes and differentiating them from each other. The number of false negatives is low, which means that the classifiers tend not to miss important notes. This result is satisfactory for the mapping task, as it is important to find as many class representatives as possible.

The result obtained by ECCSVM (0.924 micro F-measure, 0.872 micro precision and 0.927 micro recall) outperformed almost all the results presented in Table 1. Baghdadi et al. [24] reported high overall performance of the implemented classifiers, and their data had been standardized beforehand. Weng et al. [25] used additional tools for clinical text processing and information extraction. The closest task in terms of manual labeling of real data was solved by Argaw et al. [10]. All the metrics obtained by our ECCSVM are higher; however, the number of labels in our classification task is lower.

The t-SNE representation shows that the classes are well separated.

Figure 4 shows the 10 most important words associated with allergens and substances. The list of keywords for this class contains such entities as «intolerance», which indicates the presence of a patient’s intolerance in the text of the anamnesis; «food», which is associated with the category of allergy in the FHIR resource; medications such as «concor», which might be associated with a substance in the FHIR resource; and a number of verbs indicating the presence of allergy, such as «follow» and «have». The words «intolerance» and «food» are also the most frequent words of this class in the corpus.

Figure 5 shows the 10 most important words associated with reactions and clinical symptoms in FHIR resources. All the most frequent keywords of this class are symptoms.

Figure 6 shows the 10 most important words associated with the situation when no allergy was detected. The keywords of this class contain many negative words such as «no», «deny», «not complicated» and general-purpose normalized words that usually occur in an unremarkable allergy anamnesis: «calm», «be», «notice». The keywords of this group are not frequent in the corpus because of the low number of labeled notes for this class. The NN notes would be marked as «no allergy» and would not be considered during information extraction and mapping.

Figure 7 shows the 10 most important words associated with class N, which indicates that the note is not connected with allergy or intolerance. The most important and most frequent keyword in this class is «tolerate (переносить)». This word shares a root with the word «intolerance (непереносимость)». Thus, this word is frequent because of the initial search mechanism. Other keywords represent different topics not connected with allergy and intolerance. Thus, the notes from this class would not be considered during information extraction and mapping.

6 Conclusion

In this study we investigated the applicability of several classifiers to the task of clinical free-text allergy anamnesis classification for filtering multi-topic data.

The research showed that LR, linear SVM, ECCLR and ECCSVM performed well and can be applied to the task of clinical free-text allergy anamnesis classification. The chaining strategy improved the performance of the shallow classifiers.

In the future we plan to apply a Named Entity Recognition (NER) model to extract named entities such as allergies and symptoms from medical free text and map them to FHIR. We also plan to develop a model to identify Russian ICD-10 codes and terms in medical free-text allergy anamnesis.

References

1. Fung, K.W., Xu, J., Rosenbloom, S.T., Campbell, J.R.: Using SNOMED CT-encoded problems to improve ICD-10-CM coding—a randomized controlled experiment. Int. J. Med. Inform. 126, 19–25 (2019). https://doi.org/10.1016/j.ijmedinf.2019.03.002
2. Fiebeck, J., Gietzelt, M., Ballout, S., et al.: Implementing LOINC: current status and ongoing work at the Hannover Medical School. In: Studies in Health Technology and Informatics, pp. 247–248. IOS Press (2019)
3. Mascia, C., Uva, P., Leo, S., Zanetti, G.: OpenEHR modeling for genomics in clinical practice. Int. J. Med. Inform. 120, 147–156 (2018). https://doi.org/10.1016/j.ijmedinf.2018.10.007
4. Santos, M.R., Bax, M.P., Kalra, D.: Building a logical EHR architecture based on ISO 13606 standard and semantic web technologies. In: Studies in Health Technology and Informatics (2010)
5. Ulrich, H., Kock, A.K., Duhm-Harbeck, P., et al.: Metadata repository for improved data sharing and reuse based on HL7 FHIR. In: Studies in Health Technology and Informatics (2017)
6. Hong, N., Wen, A., Mojarad, M.R., et al.: Standardizing heterogeneous annotation corpora using HL7 FHIR for facilitating their reuse and integration in clinical NLP. In: AMIA Annual Symposium Proceedings AMIA Symposium, pp. 574–583 (2018)
7. Lenivtseva, Y., Kopanitsa, G.: Investigation of content overlap in proprietary medical mappings. Stud. Health Technol. Inform. 258, 41–45 (2019). https://doi.org/10.3233/978-1-61499-959-1-41
8. Kaur, R., Ginige, J.A.: Analysing effectiveness of multi-label classification in clinical coding. In: ACM International Conference Proceeding Series. Association for Computing Machinery (2019)
9. Wang, Y., Wang, L., Rastegar-Mojarad, M., et al.: Clinical information extraction applications: a literature review. J. Biomed. Inform. 77, 34–49 (2018)
10. Alemu, A., Hulth, A., Megyesi, B.: General-purpose text categorization applied to the medical domain. Comput. Sci. 16 (2007)
11. Onan, A., Korukoǧlu, S., Bulut, H.: Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst. Appl. 57, 232–247 (2016). https://doi.org/10.1016/j.eswa.2016.03.045
12. Métivier, J.-P., Serrano, L., Charnois, T., Cuissart, B., Widlöcher, A.: Automatic symptom extraction from texts to enhance knowledge discovery on rare diseases. In: Holmes, J.H., Bellazzi, R., Sacchi, L., Peek, N. (eds.) AIME 2015. LNCS (LNAI), vol. 9105, pp. 249–254. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19551-3_33
13. Levin, M.A., Krol, M., Doshi, A.M., Reich, D.L.: Extraction and mapping of drug names from free text to a standardized nomenclature. In: AMIA Annual Symposium Proceedings, pp. 438–442 (2007)
14. Xu, H., Jiang, M., Oetjens, M., et al.: Facilitating pharmacogenetic studies using electronic health records and natural-language processing: a case study of warfarin. J. Am. Med. Inform. Assoc. 18, 387–391 (2011). https://doi.org/10.1136/amiajnl-2011-000208
15. Wang, X., Hripcsak, G., Markatou, M., Friedman, C.: Active computerized pharmacovigilance using natural language processing, statistics, and electronic health records: a feasibility study. J. Am. Med. Inform. Assoc. 16, 328–337 (2009). https://doi.org/10.1197/jamia.M3028
16. Jain, A., Mandowara, J.: Text classification by combining text classifiers to improve the efficiency of classification. Int. J. Comput. Appl. 6, 1797–2250 (2016)
17. Ali, A.R., Ijaz, M.: Urdu text classification. In: Proceedings of the 6th International Conference on Frontiers of Information Technology, FIT 2009 (2009)
18. Toldova, S., Lyashevskaya, O., Bonch-Osmolovskaya, A., Ionov, M.: Evaluation for morphologically rich language: Russian NLP. In: Proceedings on the International Conference on Artificial Intelligence (ICAI), pp. 300–306. CSREA Press, Las Vegas (2015)
19. Cheng, W., Hüllermeier, E.: Combining instance-based learning and logistic regression for multilabel classification. Mach. Learn. 76, 211–225 (2009). https://doi.org/10.1007/s10994-009-5127-5
20. Tahir, M.A., Kittler, J., Bouridane, A.: Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recogn. Lett. 33, 513–523 (2012). https://doi.org/10.1016/j.patrec.2011.10.019
21. Zhang, M.L., Zhou, Z.H.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26, 1819–1837 (2014)
22. Zhao, R.W., Li, G.Z., Liu, J.M., Wang, X.: Clinical multi-label free text classification by exploiting disease label relation. In: Proceedings - 2013 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2013, pp. 311–315 (2013)
23. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS (LNAI), vol. 5782, pp. 254–269. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04174-7_17
24. Baghdadi, Y., Bourrée, A., Robert, A., et al.: Automatic classification of free-text medical causes from death certificates for reactive mortality surveillance in France. Int. J. Med. Inform. 131. https://doi.org/10.1016/j.ijmedinf.2019.06.022
25. Weng, W.-H., Wagholikar, K.B., McCray, A.T., et al.: Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach. BMC Med. Inform. Decis. Mak. 17, 155 (2017). https://doi.org/10.1186/s12911-017-0556-8
26. Spat, S., et al.: Multi-label classification of clinical text documents considering the impact of text pre-processing and training size. In: 23rd International Conference of the European Federation for Medical Informatics (2011)
27. Lita, L.V., Yu, S., Niculescu, S., Bi, J.: Large scale diagnostic code classification for medical patient records. In: IJCNLP, pp. 877–882 (2008)
28. Baumel, T., Nassour-Kassis, J., Cohen, R., et al.: Multi-label classification of patient notes a case study on ICD code assignment. In: AAAI Conference on Artificial Intelligence, pp. 409–416 (2017)
29. van der Maaten, L.J.P., Hinton, G.E.: Visualizing high-dimensional data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
30. Aronson, A.R., Lang, F.M.: An overview of MetaMap: historical perspective and recent advances. J. Am. Med. Inform. Assoc. 17, 229–236 (2010). https://doi.org/10.1136/jamia.2009.002733
31. Dembczynski, K., Jachnik, A., Kotłowski, W., et al.: Optimizing the F-measure in multi-label classification: plug-in rule approach versus structured loss minimization. In: ICML 2013: Proceedings of the 30th International Conference on Machine Learning, pp. 1130–1138 (2013)

Acknowledgements

This work was financially supported by the government of the Russian Federation through the ITMO fellowship and professorship program, by the Russian Fund for Basic Research (grant 18-37-20002) and by the National Center for Cognitive Research of ITMO University.

Author information

Authors and Affiliations

ITMO University, 49 Kronverkskiy Prospect, 197101, Saint Petersburg, Russian Federation
Iuliia Lenivtceva

Corresponding author

Correspondence to Iuliia Lenivtceva.

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Valeria V. Krzhizhanovskaya
University of Amsterdam, Amsterdam, The Netherlands
Gábor Závodszky
University of Amsterdam, Amsterdam, The Netherlands
Michael H. Lees
University of Tennessee, Knoxville, TN, USA
Jack J. Dongarra
University of Amsterdam, Amsterdam, The Netherlands
Peter M. A. Sloot
Intellegibilis, Setúbal, Portugal
Sérgio Brissos
Intellegibilis, Setúbal, Portugal
João Teixeira

Cite this paper

Lenivtceva, I., Slasten, E., Kashina, M., Kopanitsa, G. (2020). Applicability of Machine Learning Methods to Multi-label Medical Text Classification. In: Krzhizhanovskaya, V.V., et al. Computational Science – ICCS 2020. ICCS 2020. Lecture Notes in Computer Science, vol 12140. Springer, Cham. https://doi.org/10.1007/978-3-030-50423-6_38

DOI: https://doi.org/10.1007/978-3-030-50423-6_38
Published: 15 June 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-50422-9
Online ISBN: 978-3-030-50423-6
eBook Packages: Computer Science, Computer Science (R0)
