End-to-End Joint Entity Extraction and Negation Detection for Clinical Text

Bhatia, Parminder; Busra Celikkaya, E.; Khalilia, Mohammed

doi:10.1007/978-3-030-24409-5_13


Part of the book series: Studies in Computational Intelligence (SCI, volume 843)

Included in the following conference series:

International Workshop on Health Intelligence


Abstract

Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We treat it as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size, especially in low-resource settings, we propose the Conditional Softmax Shared Decoder architecture, which achieves state-of-the-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset.


1 Introduction

In recent years, natural language processing (NLP) techniques have demonstrated increasing effectiveness in clinical text mining. Electronic health record (EHR) narratives, e.g., discharge summaries and progress notes, contain a wealth of medically relevant information such as diagnosis information and adverse drug events. Automatic extraction of such information and representation of clinical knowledge in standardized formats could be employed for a variety of purposes such as clinical event surveillance, decision support, pharmacovigilance, and drug efficacy studies.

Although many NLP applications that successfully extract findings from medical reports have been developed in recent years, identifying assertions such as positive (present), negative (absent), and hypothetical remains a challenging task, and systems that do so are especially hard to generalize [15]. However, identifying assertions is critical since negative and uncertain findings are frequent in clinical notes, and information extraction algorithms that do not distinguish between them will not paint a clear picture of the patient.

In this paper, we focus on identifying the negated findings. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and negation detection. Previous efforts in this area include both rule-based and machine-learning approaches.

Rule-based systems rely on negation keywords and rules to determine the cue of negation. NegEx [2] is a widely used algorithm that consists of an ontology lookup to index findings and a negation regular expression search in a fixed scope. ConText [7] extends NegEx to other attributes such as hypothetical and makes the scope variable by searching for a termination term. NegBio [10] uses a universal dependency graph for scope detection. A similar work is Gkotsis et al. [6], which utilizes a constituency-based parse tree to prune out the parts outside the scope. However, these approaches use rules and regular expressions for cue detection, which rely solely on surface text and are thus limited when attempting to capture complex syntactic constructions such as long noun phrases.
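To make the fixed-scope limitation concrete, the following is a minimal sketch of a NegEx-style check. It is not the NegEx implementation: the cue list, the window of five tokens, and the function name `negated_findings` are all assumptions chosen for illustration, and the findings are passed in directly instead of coming from an ontology lookup.

```python
import re

# Hypothetical, tiny cue list; a real trigger lexicon is far larger.
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)
SCOPE_TOKENS = 5  # assumed fixed window of tokens following the cue


def negated_findings(sentence, findings):
    """Return the findings that occur inside a fixed window after a negation cue."""
    negated = set()
    for cue in NEGATION_CUES.finditer(sentence):
        # Keep only the first SCOPE_TOKENS whitespace-separated tokens after the cue.
        window_tokens = sentence[cue.end():].split()[:SCOPE_TOKENS]
        window = " ".join(window_tokens).lower()
        for finding in findings:
            if finding.lower() in window:
                negated.add(finding)
    return negated


short = "Patient denies chest pain but reports shortness of breath."
long_np = "No evidence of acute intracranial hemorrhage or large territorial infarct."

short_result = negated_findings(short, ["chest pain", "shortness of breath"])
long_result = negated_findings(
    long_np, ["acute intracranial hemorrhage", "large territorial infarct"]
)
```

In the second sentence the window ends after "hemorrhage", so the second finding is missed even though it is negated. This is the kind of long noun phrase that a surface-level, fixed-scope rule struggles with.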

Kernel-based approaches are also very common, especially in the 2010 i2b2/VA task of predicting assertions. The state-of-the-art system in that challenge applies support vector machines (SVM) to assertion prediction as a separate step after concept extraction [4]. It trains classifiers to predict the assertion of each concept word, and a separate classifier to predict the assertion of the whole concept. Shivade et al. [12] proposed the Augmented Bag of Words Kernel (ABoW), which generates features based on NegEx rules along with bag-of-words features. Cheng et al. [3] use a CRF for classification of cues and scope detection. These machine learning-based approaches often suffer in generalizability, i.e., the ability to perform well on unseen text.

Recently, neural network models such as Fancellu et al. [5] and Rumeng et al. [11] have been proposed. Fancellu et al. [5] exploit feedforward and bidirectional Long Short Term Memory (BiLSTM) networks for generic negation scope detection. This is a slightly different task since the negation cue is assumed to be given as input. Most relevant to our work is Rumeng et al. [11], where gated recurrent units (GRUs) are used to represent the clinical events and their context, along with an attention mechanism. Given a text annotated with events, it classifies the presence and period of the events. However, this approach is not end-to-end as it does not predict the events. Additionally, these models generally require a large annotated corpus to perform well, and such clinical text data is not easily available.

In this paper, we propose a multi-task learning (MTL) approach to negation detection that overcomes some of the limitations of existing models, such as data accessibility. MTL leverages overlapping representations across sub-tasks and is one of the most effective solutions for knowledge transfer across tasks. In the context of neural network architectures, we perform MTL by sharing parameters across tasks. We draw on parameter sharing methods [9] to transfer overlapping representations between the two tasks.

To the best of our knowledge, this is the first work to jointly model named entity and negation in an end-to-end system. Our main contributions are summarized below:

An end-to-end hierarchical neural model consisting of a shared encoder and different decoding schemes to jointly extract entities and negations. Using our proposed model, we obtain substantial improvement over prior models for both entities and negations on the 2010 i2b2/VA challenge task as well as on a proprietary de-identified clinical note dataset for medical conditions.
A conditional softmax shared decoder model that addresses low resource settings (datasets with limited amounts of training data) and achieves state-of-the-art results across different datasets.
A thorough empirical analysis of parameter sharing in low resource settings, highlighting the significance of the shared decoder.

2 Methodology

We first present a standard neural framework for named entity recognition. To facilitate multi-task learning, we expand on that architecture by building the two decoder model. Finally, we introduce the single decoder conditional softmax architecture.

2.1 Named Entity Recognition Architecture

A sequence tagging problem such as NER can be formulated as maximizing the conditional probability distribution over tags $\mathbf {y}$ given an input sequence $\mathbf {x}$, and model parameters $\theta $.

$$\begin{aligned} P(\mathbf {y} | \mathbf {x}, \theta ) = {\displaystyle \prod _{t=1}^{T} P(y_t | x_t, y_{1:t-1}, \theta )} \end{aligned}$$

(1)

T is the length of the sequence, and $y_{1:t-1}$ are tags for the previous words. The architecture we use as a foundation is that of [8, 16]. The model consists of three main components: the (i) character and (ii) word encoders, and the (iii) decoder/tagger.
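As a small worked illustration of Eq. (1), the sequence probability is the product of the per-step conditional probabilities, which is usually computed as a sum of logs for numerical stability. The probabilities below are hypothetical values for a three-token sequence, not outputs of the model described here.

```python
import math


def sequence_log_prob(step_probs):
    """Log of Eq. (1): sum over t of log P(y_t | x_t, y_{1:t-1}, theta)."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities the model assigns to the gold tag at each timestep.
step_probs = [0.9, 0.8, 0.95]

log_p = sequence_log_prob(step_probs)
joint = math.exp(log_p)  # same value as 0.9 * 0.8 * 0.95
```

Maximizing this log-probability over the training set is equivalent to minimizing the summed per-step cross-entropy.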

Encoders Given an input sequence $\mathbf {x} \in \mathbb {N}^T$ whose coordinates indicate the words in the input vocabulary, we first encode the character level representation for each word. For each $x_t$ the corresponding sequence $\mathbf {c}^{(t)} \in \mathbb {R}^{L \times e_c}$ of character embeddings is fed into an encoder, where L is the length of a given word and $e_c$ is the size of the character embedding. The character encoder employs two LSTM units which produce $\overrightarrow{h^{(t)}_{1:l}}$, and $\overleftarrow{h^{(t)}_{1:l}}$, the forward and backward hidden representations, respectively, where l is the last timestep in both sequences. We concatenate the last timestep of each of these as the final encoded representation, $h_c^{(t)} = [\overrightarrow{h^{(t)}_l} || \overleftarrow{h^{(t)}_l}]$, of $x_t$ at the character level.

The output of the character encoder is concatenated with a pre-trained word embedding, $m_t = [h_c^{(t)} || \text {emb}_{word}(x_t)]$, which is used as the input to the word level encoder. Using learned character embeddings alongside word embeddings has been shown to be useful for learning word level morphology, as well as for mitigating the loss of representation for out-of-vocabulary words. Similar to the character encoder, we use a BiLSTM to encode the sequence at the word level. The word encoder does not lose resolution, meaning the output at each timestep is the concatenated output of both word LSTMs, $h_t = [\overrightarrow{h_t} || \overleftarrow{h_t}]$.

Decoder and Tagger Finally, the concatenated output of the word encoder is used as input to the decoder, along with the label embedding of the previous timestep. During training we use teacher forcing [14] to provide the gold standard label as part of the input.

$$\begin{aligned} o_t = \text {LSTM}(o_{t-1}, [h_t || \hat{y}_{t-1}]) \end{aligned}$$

(2)

$$\begin{aligned} \hat{y}_t = \text {Softmax}(\mathbf {W}o_t + b^s) \end{aligned}$$

(3)

where $\mathbf {W} \in \mathbb {R}^{d \times n}$, d is the number of hidden units in the decoder LSTM, and n is the number of tags. The model is trained in an end-to-end fashion using a standard cross-entropy objective.
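The softmax in Eq. (3) and the cross-entropy objective can be sketched in a few lines. This is a plain-Python illustration, not the MXNet code used for the experiments; the logits stand in for $\mathbf{W}o_t + b^s$ and are hypothetical values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_per_step, gold_tags):
    """Mean negative log-likelihood of the gold tag index at each timestep."""
    losses = []
    for logits, gold in zip(logits_per_step, gold_tags):
        probs = softmax(logits)
        losses.append(-math.log(probs[gold]))
    return sum(losses) / len(losses)


# Hypothetical logits for a 2-token sequence with 3 possible tags.
logits_per_step = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
gold_tags = [0, 1]
loss = cross_entropy(logits_per_step, gold_tags)
```

With teacher forcing, the gold tag at step t-1 would be embedded and fed to the decoder at step t during training, while the predicted tag would be used at inference time.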

2.2 Two Decoder Model

To facilitate multi-task learning, we start with a two decoder model, in which two decoders use the shared encoder representation to jointly predict entities and the negation attribute (Fig. 1). This is a standard multi-task architecture: each task has its own decoder LSTM (Eq. 2) followed by its own softmax. This model mitigates the issues associated with rule-based models that rely solely on surface text and are thus limited when attempting to capture complex syntactic constructions. With a shared contextual encoder representation built from character and word embeddings, the architecture provides an effective means of knowledge transfer across tasks, which improves its ability to perform well on unseen text. However, this architecture does not scale well, since the number of decoders grows linearly with the number of attributes. We also observed performance degradation in extremely low resource settings, where the additional parameters prevent the model from generalizing well.

2.3 Shared Decoder Model

To overcome the issues with the two decoder model, we propose a shared decoder model. We share both the encoder and the decoder between the two tasks, and the common decoder output is fed into two different softmax layers, one for entities and one for negations.

$$\begin{aligned} \hat{y}^{Entity}_t = \text {Softmax}^{Ent}(\mathbf {W^{Ent}}o_t + b^s) \end{aligned}$$

(4)

$$\begin{aligned} \hat{y}^{Neg}_t = \text {Softmax}^{Neg}(\mathbf {W^{Neg}}o_t + b^s) \end{aligned}$$

(5)

Conditional Softmax Decoder Model While the single decoder model is more scalable, we found that it did not perform as well for negation as the two decoder model. This can be attributed to the fact that negations occur less frequently than entities, so the decoder primarily focuses on making entity extraction predictions. To mitigate this issue and provide more context to the negation attribute, we add an additional input to the negation softmax, namely the softmax output from entity extraction (Fig. 2). The model thus learns from the input as well as from the label distribution predicted for entity extraction. For example, in the i2b2 dataset negation applies only to the problem entity. Providing the entity prediction distribution helps the negation model make better predictions: it learns that if the predicted probability is not inclined towards the problem entity, it should not predict negation, irrespective of the word representation.

$$\begin{aligned} \hat{y}^{Entity}_t, \text {SoftOut}^{Entity}_t = \text {Softmax}^{Ent}(\mathbf {W^{Ent}}o_t + b^s) \end{aligned}$$

(6)

$$\begin{aligned} \hat{y}^{Neg}_t = \text {Softmax}^{Neg} (\mathbf {W^{Neg}}[o_t,\text {SoftOut}^{Entity}_t] + b^s) \end{aligned}$$

(7)

where $\text {SoftOut}^{Entity}_t$ is the softmax output of the entity prediction at time step t.
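Eqs. (6) and (7) can be sketched as a single decoding step in plain Python. This is an illustration under stated assumptions, not the authors' MXNet code: the toy sizes, the weight values, the use of a separate bias per task, and the helper names `affine` and `conditional_softmax_step` are all hypothetical. Weights are stored with one row per output tag.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, vec):
    """Compute W * vec + b, with weights stored as one row per output tag."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def conditional_softmax_step(o_t, w_ent, b_ent, w_neg, b_neg):
    """One timestep: entity softmax (Eq. 6), then negation softmax (Eq. 7)."""
    soft_out_entity = softmax(affine(w_ent, b_ent, o_t))
    # Concatenate the decoder output with the entity distribution.
    neg_input = o_t + soft_out_entity
    y_neg = softmax(affine(w_neg, b_neg, neg_input))
    return soft_out_entity, y_neg


# Toy sizes (assumed): d = 2 decoder units, 3 entity tags, 2 negation tags.
o_t = [0.5, -1.0]
w_ent = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_ent = [0.0, 0.0, 0.0]
w_neg = [[0.2, -0.1, 1.0, 0.0, 0.0], [0.0, 0.3, 0.0, 1.0, 1.0]]
b_neg = [0.0, 0.0]

entity_probs, neg_probs = conditional_softmax_step(o_t, w_ent, b_ent, w_neg, b_neg)
```

The negation weight matrix has d plus the number of entity tags as its input width, so the only parameters added over the plain shared decoder are the extra columns that read the entity distribution.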

2.4 Results

Since no prior work has solved the two tasks with a joint model, we report the best published results for each individual task (Table 1). Our baseline NER model, presented in the methodology section, outperforms the best prior model [1] on the i2b2 challenge. The two decoder and conditional decoder models achieve even better NER results than our baseline, and the conditional decoder model sets a new state of the art for the 2010 i2b2/VA challenge task. The single decoder model underperformed the other two models, which can be attributed to its single decoder focusing primarily on entity extraction predictions, since entities are more frequent than negations. On the negation prediction task, the conditional decoder outperformed the baseline model by about 8% in F1 score, which suggests that modeling the named entity and negation tasks together yields better results than handling each task independently.

Table 1 Test set performance during multi-task training. (A) displays results from i2b2. (B) uses our medical condition data. The baseline is the current state-of-the-art optimized architecture


Table 2 Conditional softmax decoder is more robust in extreme low resource setting than its two decoder counterpart


We compare our models for negation detection against NegEx [2] and ABoW [12], the latter having the best reported results for the negation detection task on the i2b2 dataset. The conditional softmax decoder model outperforms both NegEx and ABoW (Table 1). The low performance of NegEx and ABoW is mainly attributed to their reliance on ontology lookup to index findings and on negation regular expression search within a fixed scope.

A similar trend was observed on the medical condition dataset. Of particular note is the low F1 score for NegEx, which can primarily be attributed to abbreviations and misspellings in clinical notes that rule-based systems cannot handle well.

To understand the advantage of the conditional softmax decoder, we evaluated our models in extreme low data settings, using only a sample of our training data. In those settings the conditional softmax decoder outperforms the two decoder model by 6% in F1 score (Table 2). As the data size increases, the performance gap narrows, which demonstrates that the conditional softmax decoder is robust in low resource settings.

2.5 Conclusion

In this paper, we have shown that named entity and negation assertion can be modeled in a multi-task setting. Joint learning with shared parameters provides a better contextual representation and helps alleviate the problems associated with using neural networks for negation detection, thereby achieving better results than the rule-based systems. Our proposed conditional softmax decoder achieves the best results across both tasks and remains robust in extreme low data settings. For future work, we plan to investigate the model on other related tasks, such as relation extraction and normalization, as well as the use of more advanced conditional models.

References

1. Chalapathy, R., Borzeshi, E.Z., Piccardi, M.: Bidirectional LSTM-CRF for clinical concept extraction. arXiv:1611.08373 (2016)
2. Chapman, W.W., Bridewell, W., Hanbury, P., Cooper, G.F., Buchanan, B.G.: A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inf. 34(5), 301–310 (2001)
3. Cheng, K., Baldwin, T., Verspoor, K.: Automatic negation and speculation detection in veterinary clinical text. In: Proceedings of the Australasian Language Technology Association Workshop 2017, pp. 70–78 (2017)
4. de Bruijn, B., Cherry, C., Kiritchenko, S., Martin, J., Zhu, X.: Machine-learned solutions for three stages of clinical information extraction: the state of the art at i2b2 2010. J. Am. Med. Inf. Assoc. 18(5), 557–562 (2011)
5. Fancellu, F., Lopez, A., Webber, B.: Neural networks for negation scope detection. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 495–504 (2016)
6. Gkotsis, G., Velupillai, S., Oellrich, A., Dean, H., Liakata, M., Dutta, R.: Don’t let notes be misunderstood: a negation detection method for assessing risk of suicide in mental health records. In: Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pp. 95–105 (2016)
7. Harkema, H., Dowling, J.N., Thornblade, T., Chapman, W.W.: Context: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J. Biomed. Inf. 42(5), 839–851 (2009)
8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of NAACL-HLT, pp. 260–270 (2016)
9. Peng, N., Dredze, M.: Multi-task domain adaptation for sequence tagging. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 91–100 (2017)
10. Peng, Y., Wang, X., Lu, L., Bagheri, M., Summers, R., Lu, Z.: NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Jt. Summits Transl. Sci. Proc. 2017, 188 (2018)
11. Rumeng, L., Jagannatha Abhyuday, N., Hong, Y.: A hybrid neural network model for joint prediction of presence and period assertions of medical events in clinical notes. In: AMIA Annual Symposium Proceedings, vol. 2017, p. 1149. American Medical Informatics Association (2017)
12. Shivade, C., de Marneffe, M.-C., Fosler-Lussier, E., Lai, A.M.: Extending NegEx with kernel methods for negation detection in clinical text. In: Proceedings of the Second Workshop on Extra-Propositional Aspects of Meaning in Computational Semantics (ExProM 2015), pp. 41–46 (2015)
13. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
14. Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural networks. Neural Comput. 1(2), 270–280 (1989)
15. Wu, S., Miller, T., Masanz, J., Coarr, M., Halgrim, S., Carrell, D., Clark, C.: Negation’s not solved: generalizability versus optimizability in clinical natural language processing. PloS One 9(11), e112774 (2014)
16. Yang, Z., Salakhutdinov, R., Cohen, W.: Multi-task cross-lingual sequence tagging from scratch. arXiv:1603.06270 (2016)


Author information

Authors and Affiliations

Amazon, Seattle, WA, USA
Parminder Bhatia, E. Busra Celikkaya & Mohammed Khalilia

Corresponding author

Correspondence to Parminder Bhatia.

Editor information

Editors and Affiliations

Department of Pediatrics, The University of Tennessee Health Science Center – Oak-Ridge National Lab (UTHSC-ORNL) Center for Biomedical Informatics, Memphis, TN, USA
Arash Shaban-Nejad
School of Nursing, University of Minnesota, Minneapolis, MN, USA
Martin Michalowski

Appendix

1.1 Experiments

Dataset We evaluated our model on two datasets. The first is the 2010 i2b2/VA challenge dataset for “test, treatment, problem” (TTP) entity extraction and assertion detection (i2b2 dataset). Unfortunately, only part of this dataset was made public after the challenge, so we cannot directly compare with the published NegEx and ABoW results. We followed the original data split from [1] of 170 notes for training and 256 for testing. The second dataset is proprietary and consists of 4,200 de-identified annotated clinical notes with medical conditions (proprietary dataset). Below is a summary of the datasets (Table 3).

Table 3 Overview of the i2b2 and the proprietary medical condition datasets


Model settings Word, character and tag embeddings are 100, 25, and 50 dimensions, respectively. Word embeddings are initialized using GloVe, while character and tag embeddings are learned. Character and word encoders have 50, and 100 hidden units, respectively, while the decoder LSTM has a hidden size of 50. Dropout is used after every LSTM, as well as for word embedding input. We use Adam as an optimizer. Our model is built using MXNet. Hyperparameters are tuned using Bayesian Optimization [13].
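To give a rough sense of why the two decoder model is heavier than a shared decoder, the sketch below counts decoder-side parameters from the sizes listed above. Several things here are assumptions and not stated in the text: that the 100 word-encoder units are per direction (so $h_t$ has 200 dimensions), that each LSTM uses a single bias vector per gate (some frameworks store two), that each of the two decoders receives its own 50-dimensional tag embedding, and the tag counts `N_ENTITY_TAGS` and `N_NEG_TAGS`, which are hypothetical.

```python
def lstm_params(input_size, hidden_size):
    """Weights and biases of one LSTM layer with a single bias vector per gate."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def softmax_params(input_size, num_tags):
    """Weights and biases of one linear output layer."""
    return input_size * num_tags + num_tags


WORD_ENC_OUT = 2 * 100  # assumed: 100 hidden units per direction
TAG_EMB = 50
DEC_HIDDEN = 50
N_ENTITY_TAGS = 7       # hypothetical tag inventory size
N_NEG_TAGS = 3          # hypothetical tag inventory size

decoder = lstm_params(WORD_ENC_OUT + TAG_EMB, DEC_HIDDEN)

# Two decoder model: one decoder LSTM and one softmax per task.
two_decoder = (2 * decoder
               + softmax_params(DEC_HIDDEN, N_ENTITY_TAGS)
               + softmax_params(DEC_HIDDEN, N_NEG_TAGS))

# Conditional softmax shared decoder: one LSTM, and the negation softmax
# also reads the entity distribution.
conditional = (decoder
               + softmax_params(DEC_HIDDEN, N_ENTITY_TAGS)
               + softmax_params(DEC_HIDDEN + N_ENTITY_TAGS, N_NEG_TAGS))
```

Under these assumptions the decoder LSTM dominates the count, so sharing it roughly halves the decoder-side parameters, while conditioning the negation softmax on the entity distribution adds only a handful of weights.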

Training details Our models are trained until convergence, and we use the development set for both tasks to evaluate performance for early stopping. We performed two sets of experiments. The first set evaluates the performance of NER and negation assertion of the baseline, two decoder, shared decoder and conditional softmax decoder models on i2b2 and the medical condition datasets. The second set uses low resource settings, where we evaluate the performance of negation assertion of the conditional softmax decoder model on 5, 10 and 20% of the proprietary medical condition training data. Development and test sets are kept at the original size.
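The low resource experiments can be mimicked with a simple subsampling routine. The text does not say how the 5, 10 and 20% subsets were drawn, so uniform sampling without replacement with a fixed seed is an assumption here, as are the function name `subsample` and the placeholder note identifiers.

```python
import random


def subsample(train_notes, fraction, seed=0):
    """Draw a fixed-seed random subset containing `fraction` of the training notes."""
    rng = random.Random(seed)
    k = max(1, round(len(train_notes) * fraction))
    return rng.sample(train_notes, k)


# Hypothetical training set of 200 note identifiers.
notes = [f"note_{i}" for i in range(200)]

# Only the training data shrinks; development and test sets keep their original size.
subsets = {frac: subsample(notes, frac) for frac in (0.05, 0.10, 0.20)}
```

Fixing the seed keeps the subsets reproducible across the models being compared, so differences in F1 are not caused by different samples.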

Cite this chapter

Bhatia, P., Busra Celikkaya, E., Khalilia, M. (2020). End-to-End Joint Entity Extraction and Negation Detection for Clinical Text. In: Shaban-Nejad, A., Michalowski, M. (eds) Precision Health and Medicine. W3PHAI 2019. Studies in Computational Intelligence, vol 843. Springer, Cham. https://doi.org/10.1007/978-3-030-24409-5_13


DOI: https://doi.org/10.1007/978-3-030-24409-5_13
Published: 02 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-24408-8
Online ISBN: 978-3-030-24409-5
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
