1 Introduction

Medical data standardization is crucial for data exchange and integration, as data formats vary greatly from one healthcare provider to another. Many international standards for terminologies (SNOMED CT [1], LOINC [2]) and data exchange (openEHR [3], ISO 13606 [4], HL7 standards [5]) have been successfully implemented and perform well in practice. The most actively developing and promising standard for medical information today is HL7 FHIR [6].

Medical databases usually store data in structured, semi-structured or unstructured form. Structured and semi-structured data can be mapped to standards with minimal loss of information [7]. However, a large part of the Electronic Health Record (EHR) is free text [8]. Unstructured medical records are more complicated to process; however, they usually contain detailed patient information that is valuable for modeling and research [9].

The extraction of useful knowledge becomes more challenging as medical databases become more available and contain a wider range of texts [10]. Manually sorting documents and searching texts for concepts and entities is time-consuming. Text classification is an important task that sorts documents or notes into predefined classes [11], which facilitates the extraction of entities such as symptoms [12], drug names [13], dosage [14], drug reactions [15], etc. The task of information extraction (IE) is domain specific, and this specificity must be taken into account in practice. Thus, high IE performance can be achieved by first classifying free text into a particular domain [16].

Existing applications and methods for processing free text are language specific [17]. Processing Russian medical free text is challenging mostly because there are no open-source medical corpora [18]. Moreover, each medical team develops its own storage format, which makes it difficult to standardize, exchange and integrate Russian medical data.

Our long-term goal is to develop methods for extracting data from Russian unstructured clinical notes and mapping these data to FHIR for better interoperability and personalized medicine. The purpose of this article is to investigate the applicability of machine learning algorithms to the classification of Russian unstructured and semi-structured allergy anamnesis in order to facilitate entity extraction.

2 Related Work

Studies on text classification using machine learning methods are widely represented in the literature.

Jain et al. [16] describe classifiers based on Multinomial Naïve Bayes (MNB), k-Nearest Neighbors (k-NN) and Support Vector Machine (SVM) as the most popular models for multi-label classification. Logistic regression (LR) is also a widespread model for the task [19].

The binary relevance (BR) approach trains N independent binary classifiers for multi-label classification with N labels. This approach has linear complexity; however, it does not consider interdependences between labels [19]. Classifier Chains (CC) is a popular and representative algorithm for multi-label classification. CC links N binary classifiers in a chain with random ordering, which shows better predictive performance. The labels predicted by earlier classifiers are treated as extra features by the next classifiers in the chain. CC and ensembles [20] are known to address the over-fitting problem. CC is more computationally demanding than simple binary classifiers [21].

The performance metrics of multi-label classifiers applied to medical text are presented in Table 1. The literature review showed that there is no consensus on which metrics to use when evaluating multi-label classifiers.

Table 1. Performance of medical multi-label classifiers

3 Methods

3.1 Data Description

Clinical documents (written in Russian) of more than 250 thousand patients were provided by the Almazov National Medical Research Centre (St. Petersburg, Russia) for the research. The patients’ personal information was discarded. We searched for different forms of the words «allergy» and «(in)tolerance» (Russian equivalents «аллергия», «(не)переносимость») using regular expressions to find all the notes containing any information on allergies and intolerances. A corpus of 269 thousand notes was created after the search and duplicate removal. We classified the allergy notes according to four labels, which are described in Table 2.
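
The keyword search described above can be sketched as follows. The exact regular expressions are not given in the text, so the stem-based pattern and the helper name `mentions_allergy` are assumptions made for illustration only; the broad stem «перенос» also matches verb forms such as «переносит».

```python
import re

# Illustrative stem-based pattern (an assumption, not the authors' expression):
#   "аллерг..."        covers аллергия, аллергии, аллергическая, ...
#   "(не)перенос..."   covers переносимость, непереносимости, переносит, ...
ALLERGY_RE = re.compile(r"\b(?:аллерг\w*|(?:не)?перенос\w*)", re.IGNORECASE)


def mentions_allergy(note: str) -> bool:
    """Return True if the note contains any form of the target words."""
    return ALLERGY_RE.search(note) is not None


# Made-up example notes, not taken from the corpus.
notes = [
    "Аллергия на пенициллин: крапивница.",
    "Непереносимость лактозы с детства.",
    "Жалобы на головную боль, сон спокойный.",
]
selected = [n for n in notes if mentions_allergy(n)]
```

Here `selected` keeps the first two notes and drops the third, which mentions neither stem.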

Table 2. Classes description

Two experts assigned an appropriate label to each note. In case of disagreement the decision was made by consensus.

The final corpus contains 11671 labeled notes.

3.2 Task Description

AllergyIntolerance is one of the FHIR resources; it contains structured information on a patient’s allergies, intolerances and symptoms. Because this information is stored in unstructured form, mapping it to FHIR requires machine learning methods. Figure 1 shows the main blocks of information that can be mapped to FHIR. Blocks in bold denote information that is mentioned in the processed corpus.

Fig. 1. Blocks of information to be mapped to FHIR

The underlying mechanism can be extracted by searching for the keywords «allergy» and «intolerance» in the corpus. Category refers to the exact substance type. The most sophisticated task is to extract exact substances and clinical symptoms written in Russian and to bind the corresponding codes from international terminological systems to ensure interoperability. To facilitate this task, the multi-topic clinical notes must first be classified.
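
To make the mapping target concrete, the sketch below builds a minimal AllergyIntolerance resource as a Python dictionary. It assumes the FHIR R4 element names (`type`, `category`, `code`, `patient`, `reaction.manifestation`); the input record, the helper `to_allergy_intolerance` and the patient reference are hypothetical, and only free-text fields are filled because code binding is a separate step.

```python
import json

# Hypothetical extraction result for one note; all values are illustrative.
extracted = {
    "mechanism": "allergy",      # "allergy" or "intolerance"
    "category": "medication",    # food | medication | environment | biologic
    "substance": "пенициллин",
    "symptoms": ["крапивница"],
}


def to_allergy_intolerance(item: dict, patient_ref: str) -> dict:
    """Build a minimal FHIR R4-style AllergyIntolerance resource as a dict."""
    return {
        "resourceType": "AllergyIntolerance",
        "type": item["mechanism"],
        "category": [item["category"]],
        # Only free text is filled; coded terms (e.g. SNOMED CT) would be
        # bound in a later step and are deliberately left out here.
        "code": {"text": item["substance"]},
        "patient": {"reference": patient_ref},
        "reaction": [
            {"manifestation": [{"text": s} for s in item["symptoms"]]}
        ],
    }


resource = to_allergy_intolerance(extracted, "Patient/example")
print(json.dumps(resource, ensure_ascii=False, indent=2))
```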

3.3 Preprocessing

The steps of preprocessing are:

  1. Clean the medical notes of symbols and extra spaces. Full stops are kept because they play an important role in sentence tokenization.

  2. Reduce the notes to minimize noise during classification, as an original note might contain up to 9239 words. Only the 2 meaningful sentences before and after the regular-expression match («аллергия», «(не)переносимость») are kept.

  3. Correct syntactic, case and spacing errors using regular expressions.

  4. Correct spelling using a dictionary and Levenshtein distance calculation.

  5. Tokenize and normalize the words.

  6. Split the corpus into a training set of 7819 notes and a test set of 3852 notes.

  7. Vectorize both the training and test sets using a Bag of Words (BOW) representation. The dictionary size for BOW is 8000 words.
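
Two of the steps above, note reduction and dictionary-based spelling correction, can be sketched in plain Python. This is an illustration under stated assumptions rather than the custom scripts used in the study: sentences are split naively on full stops, the keyword stems, the function names and the `max_dist` threshold are all hypothetical.

```python
import re
from typing import Sequence

KEYWORD_RE = re.compile(r"(?:аллерг|перенос)", re.IGNORECASE)  # assumed stems


def reduce_note(note: str, window: int = 2) -> str:
    """Keep sentences with a keyword plus `window` sentences on each side."""
    sentences = [s.strip() for s in note.split(".") if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        if KEYWORD_RE.search(sentence):
            lo = max(0, i - window)
            hi = min(len(sentences), i + window + 1)
            keep.update(range(lo, hi))
    return ". ".join(sentences[i] for i in sorted(keep))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct(word: str, dictionary: Sequence[str], max_dist: int = 2) -> str:
    """Replace a word by its nearest dictionary entry if it is close enough."""
    best = min(dictionary, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word
```

For example, `correct("алергия", ["аллергия", "крапивница"])` returns `"аллергия"`, since the two words differ by a single missing letter. A real pipeline would need a sentence tokenizer that handles abbreviations, which the naive split on full stops does not.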

3.4 Classification

We applied four shallow machine learning models (MNB, LR, SVM and k-NN) and two ensembles of classifier chains (ECCLR and ECCSVM). The optimal parameters of the shallow models were found by grid search and are listed in Table 3.
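
An ensemble of classifier chains with a linear SVM base classifier can be sketched with scikit-learn as shown below. The synthetic data stand in for the BOW matrix and the four labels, and the number of chains (10), the value `C=1.0` and the 0.5 voting threshold are assumptions chosen for illustration, not the parameters of the study.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain
from sklearn.svm import LinearSVC

# Synthetic stand-in for the BOW features and the four labels.
X, Y = make_multilabel_classification(
    n_samples=600, n_features=50, n_classes=4, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=0
)

# Ensemble: several chains, each with its own random label order.
chains = [
    ClassifierChain(LinearSVC(C=1.0), order="random", random_state=i)
    for i in range(10)
]
for chain in chains:
    chain.fit(X_train, Y_train)

# Average the binary votes of the chains and threshold at 0.5.
votes = np.mean([chain.predict(X_test) for chain in chains], axis=0)
Y_pred = (votes >= 0.5).astype(int)

micro_f1 = f1_score(Y_test, Y_pred, average="micro")
macro_f1 = f1_score(Y_test, Y_pred, average="macro")
```

Averaging the votes of chains with different label orders is one common way to build such an ensemble; it reduces the dependence of the result on any single ordering.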

Table 3. Parameters of classifiers

The pipeline was built using Python 3.7.1. «pymorphy2» was used for lexical normalization. All the preprocessing steps were implemented with custom scripts. The «scikit-learn» package was used to implement the supervised learning algorithms, evaluate the models and perform t-SNE. «Bokeh», «matplotlib» and «plotly» were used for visualization.

3.5 Evaluation Metrics

According to [21], macro- and micro-averaged precision, recall and F-measure are often used to evaluate multi-label classification performance. We therefore used these metrics to evaluate the performance of the classification.

Micro-averaging:

$$ B_{micro} \left( h \right) = B\left( {\sum\nolimits_{j = 1}^{q} {TP_{j} } ,\sum\nolimits_{j = 1}^{q} {FP_{j} } ,\sum\nolimits_{j = 1}^{q} {TN_{j} } ,\sum\nolimits_{j = 1}^{q} {FN_{j} } } \right) $$
(1)

Macro-averaging:

$$ B_{macro} \left( h \right) = \frac{1}{q}\sum\nolimits_{j = 1}^{q} {B(TP_{j} ,FP_{j} ,TN_{j} ,FN_{j} )} $$
(2)

Here B ∈ {Precision, Recall, F^β} and q is the number of class labels.

Precision (positive predictive value) is the fraction of correctly identified examples of the class among all the examples identified as this class.

$$ Precision\left( {TP_{j} ,FP_{j} ,TN_{j} ,FN_{j} } \right) = \frac{{TP_{j} }}{{TP_{j} + FP_{j} }} $$
(3)

Recall evaluates the fraction of identified examples from the class among all the examples of this class.

$$ Recall\left( {TP_{j} ,FP_{j} ,TN_{j} ,FN_{j} } \right) = \frac{{TP_{j} }}{{TP_{j} + FN_{j} }} $$
(4)

The F-measure is the weighted harmonic mean of precision and recall; with β = 1 it is the plain harmonic mean.

$$ F^{\beta } \left( {TP_{j} ,FP_{j} ,TN_{j} ,FN_{j} } \right) = \frac{{(1 + \beta^{2} )TP_{j} }}{{(1 + \beta^{2} )TP_{j} + FP_{j} + \beta^{2} FN_{j} }} $$
(5)

TP – true positive examples, TN – true negative examples, FP – false positive examples, FN – false negative examples, β = 1.
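
The difference between the two averaging strategies in Eqs. (1) and (2) can be seen in a small worked example. The per-label counts below are made up for illustration (one frequent and one rare label), and the function names are hypothetical.

```python
# Each tuple holds (TP, FP, TN, FN) for one label.
def precision(tp, fp, tn, fn):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fp, tn, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(tp, fp, tn, fn, beta=1.0):
    b2 = beta ** 2
    denom = (1 + b2) * tp + fp + b2 * fn
    return (1 + b2) * tp / denom if denom else 0.0


def micro(metric, counts):
    """Eq. (1): sum the counts over all labels, then apply the metric once."""
    totals = [sum(c[k] for c in counts) for k in range(4)]
    return metric(*totals)


def macro(metric, counts):
    """Eq. (2): apply the metric per label, then average over the labels."""
    return sum(metric(*c) for c in counts) / len(counts)


# Made-up counts: a frequent label and a rare label.
counts = [(90, 10, 890, 10), (5, 5, 985, 5)]

micro_p = micro(precision, counts)   # 95 / 110
macro_p = macro(precision, counts)   # (0.9 + 0.5) / 2
micro_f = micro(f_beta, counts)      # 190 / 220
macro_f = macro(f_beta, counts)      # (0.9 + 0.5) / 2
```

With these counts the micro-averaged values are dominated by the frequent label, while the macro-averaged values weight both labels equally and are therefore pulled down by the rare one.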

t-SNE was performed on the predicted probabilities for each label. The perplexity was set to 30, following the recommendations of van der Maaten et al. [29].

4 Results

After text cleaning, there were still notes that contained neither allergies nor intolerances.

Figure 2 illustrates the distribution of classes in the corpus. The classes are imbalanced.

Fig. 2. Classes distribution in the corpus

The performance of the different classifiers is presented in Table 4. LR and linear SVM showed the best results among the shallow classifiers. However, using CC with LR and linear SVM as base classifiers improved the performance metrics and showed the best results overall.

Table 4. Performance of the applied classifiers

The classification report for the best classifier is presented in Table 5.

Table 5. Classification report for ECCSVM

Figure 3 illustrates the t-SNE representation of the classes.

Fig. 3. t-SNE representation of classes

Figures 4, 5, 6 and 7 show the 10 most important keywords in the corpus that indicate that a note belongs to the corresponding class. The diagrams show how often each word occurs in the corpus (word counts) and how important the word is for classification (classifier weights). The diagrams are plotted using the LR weights.

Fig. 4. Top 10 positive keywords for label AL

Fig. 5. Top 10 positive keywords for label R

Fig. 6. Top 10 positive keywords for label NN

Fig. 7. Top 10 positive keywords for label N

5 Discussion

In previous studies on multi-label medical text classification, many authors used existing applications for entity extraction and algorithm implementation (Table 1). However, there are no open-source applications for medical purposes, such as MetaMap [30], developed for the Russian language. Thus, all the steps were implemented manually and with custom scripts.

In the task of multi-label medical text classification with limited labeled data, we concentrated on improving the F-measure, as it enforces a better balance between performance on relevant and irrelevant labels and is thus suitable for evaluating multi-label tasks [31]. Also, precision, recall and F-measure do not depend on the number of true negatives, which makes them less sensitive to class imbalance.

Two of the proposed shallow classifiers, LR and linear SVM, performed well on real unstructured labeled data. Using the CC strategy improved the results of the base classifiers, and the best performance was shown by the ensemble of classifier chains based on linear SVM. The classification report for this classifier (Table 5) shows that the three labels most important for mapping (AL, R and NN) are well separated from each other and from the fourth class N. The fourth class showed lower performance, which may be caused by the smallest number of labeled notes in the corpus and by the variety of topics it covers.

Recall is higher than precision for all classifiers and for both averaging strategies. This means that the classifiers are good at identifying classes and differentiating them from each other. The number of false negatives is low, which means that the classifiers tend not to miss important notes. This result is satisfactory for the mapping task, where it is important to find as many class representatives as possible.

The result obtained by ECCSVM (0.924 μ F-measure, 0.872 μ Precision and 0.927 μ Recall) outperformed almost all the results presented in Table 1. Baghdadi et al. [24] reported high overall performance of the implemented classifiers on data that had been standardized beforehand. Weng et al. [25] used additional tools for clinical text processing and information extraction. The closest task, in terms of manual labeling of real data, was solved by Argaw et al. [10]. All the metrics obtained by our ECCSVM are higher; however, the number of labels in our classification task is lower.

The t-SNE representation shows that the classes are well separated.

Figure 4 shows the 10 most important words associated with allergens and substances. The list of keywords for this class contains such entities as «intolerance», which indicates the presence of a patient’s intolerance in the text of the anamnesis; «food», which is associated with the category of allergy in the FHIR resource; medications such as «concor», which might be associated with a substance in the FHIR resource; and a number of verbs indicating the presence of an allergy, such as «follow» and «have». The words «intolerance» and «food» are also the most frequent words of this class in the corpus.

Figure 5 shows the 10 most important words associated with reactions, which correspond to clinical symptoms in the FHIR resource. All the most frequent keywords of this class are symptoms.

Figure 6 shows the 10 most important words associated with the situation when no allergy was detected. The keywords of this class contain many negations such as «no», «deny» and «not complicated», as well as general-purpose normalized words that usually occur in an unremarkable allergy anamnesis: «calm», «be», «notice». The keywords of this group are not frequent in the corpus because of the low number of labeled notes for this class. The NN notes would be marked as «no allergy» and would not be considered during information extraction and mapping.

Figure 7 shows the 10 most important words associated with class N, which indicates that the note is not connected with allergy or intolerance. The most important and most frequent keyword in this class is «tolerate (переносить)». This word shares a root with the word «intolerance (непереносимость)», so it is frequent because of the initial search mechanism. The other keywords represent different topics not connected with allergy and intolerance. Thus, the notes from this class would not be considered during information extraction and mapping.

6 Conclusion

In this study we investigated the applicability of several classifiers to the task of classifying clinical free-text allergy anamnesis in order to filter multi-topic data.

The research showed that LR, linear SVM, ECCLR and ECCSVM performed well and can be applied to the task of classifying clinical free-text allergy anamnesis. The chaining strategy improved the performance of the shallow classifiers.

In the future we plan to apply a Named Entity Recognition (NER) model to extract named entities such as allergies and symptoms from medical free text and map them to FHIR. We also plan to develop a model for identifying Russian ICD-10 codes and terms in medical free-text allergy anamnesis.