1 Introduction

In recent years, natural language processing (NLP) techniques have demonstrated increasing effectiveness in clinical text mining. Electronic health record (EHR) narratives, e.g., discharge summaries and progress notes, contain a wealth of medically relevant information such as diagnosis information and adverse drug events. Automatic extraction of such information and representation of clinical knowledge in standardized formats could be employed for a variety of purposes such as clinical event surveillance, decision support, pharmacovigilance, and drug efficacy studies.

Although many NLP applications that successfully extract findings from medical reports have been developed in recent years, identifying assertions such as positive (present), negative (absent), and hypothetical remains a challenging task, particularly when systems must generalize to new text [15]. Identifying assertions is nevertheless critical, since negative and uncertain findings are frequent in clinical notes, and information extraction algorithms that do not distinguish between them will not paint a clear picture of the patient.

In this paper, we focus on identifying the negated findings. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and negation detection. Previous efforts in this area include both rule-based and machine-learning approaches.

Rule-based systems rely on negation keywords and rules to determine the cue of negation. NegEx [2] is a widely used algorithm that consists of ontology lookup to index findings and a negation regular expression search within a fixed scope. ConText [7] extends NegEx to other attributes such as hypothetical and makes the scope variable by searching for a termination term. NegBio [10] uses a universal dependency graph for scope detection. Another similar work is Gkotsis et al. [6], who utilize a constituency-based parse tree to prune out the parts outside the scope. However, these approaches use rules and regular expressions for cue detection, which rely solely on surface text and thus are limited when attempting to capture complex syntactic constructions such as long noun phrases.

Kernel-based approaches are also very common, especially in the 2010 i2b2/VA task of predicting assertions. The state of the art in that challenge applies support vector machines (SVM) to assertion prediction as a separate step after concept extraction [4]. They train classifiers to predict the assertion of each concept word, and a separate classifier to predict the assertion of the whole concept. Shivade et al. [12] proposed the Augmented Bag of Words Kernel (ABoW), which generates features based on NegEx rules along with bag-of-words features. Cheng et al. [3] use a CRF for classification of cues and scope detection. These machine learning based approaches often generalize poorly, that is, they perform worse on unseen text.

Recently, neural network models such as Fancellu et al. [5] and Rumeng et al. [11] have been proposed. Fancellu et al. [5] exploit feedforward and bidirectional Long Short Term Memory (BiLSTM) networks for generic negation scope detection. This is a slightly different task since the negation cue is assumed to be given as input. Most relevant to our work is Rumeng et al. [11], where gated recurrent units (GRUs) are used to represent the clinical events and their context, along with an attention mechanism. Given a text annotated with events, it classifies the presence and period of the events. However, this approach is not end-to-end as it does not predict the events. Additionally, these models generally require a large annotated corpus to perform well. Unfortunately, such clinical text data is not easily available.

In this paper, we propose a multi-task learning (MTL) approach to negation detection that overcomes some of the limitations of existing models, such as data accessibility. MTL leverages overlapping representations across sub-tasks and is one of the most effective solutions for knowledge transfer across tasks. In the context of neural network architectures, we perform MTL by sharing parameters across tasks. We look towards parameter sharing methods [9] to transfer the overlapping representation between the two tasks.

To the best of our knowledge, this is the first work to jointly model named entity and negation in an end-to-end system. Our main contributions are summarized below:

  • An end-to-end hierarchical neural model consisting of a shared encoder and different decoding schemes to jointly extract entities and negations. Using our proposed model, we obtain substantial improvement over prior models for both entities and negations on the 2010 i2b2/VA challenge task as well as on a proprietary de-identified clinical note dataset for medical conditions.

  • A conditional softmax shared decoder model for low resource settings (datasets that have limited amounts of training data), which achieves state-of-the-art results across different datasets.

  • A thorough empirical analysis of parameter sharing in low resource settings, highlighting the significance of the shared decoder.

2 Methodology

We first present a standard neural framework for named entity recognition. To facilitate multi-task learning, we expand on that architecture by building the two decoder model. Finally, we introduce the single decoder conditional softmax architecture.

2.1 Named Entity Recognition Architecture

A sequence tagging problem such as NER can be formulated as maximizing the conditional probability distribution over tags \(\mathbf {y}\) given an input sequence \(\mathbf {x}\), and model parameters \(\theta \).

$$\begin{aligned} P(\mathbf {y} | \mathbf {x}, \theta ) = {\displaystyle \prod _{t=1}^{T} P(y_t | x_t, y_{1:t-1}, \theta )} \end{aligned}$$
(1)

T is the length of the sequence, and \(y_{1:t-1}\) are tags for the previous words. The architecture we use as a foundation is that of [8, 16]. The model consists of three main components: the (i) character and (ii) word encoders, and the (iii) decoder/tagger.
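As a small worked illustration of Equation 1, the sketch below multiplies per-step tag probabilities along one tag sequence, working in log space to avoid underflow. The three-tag inventory and the probability values are hypothetical, and each row stands for the distribution the model would produce at step t given the previous tags.

```python
import math

def sequence_log_prob(step_probs, tags):
    """Log of the product in Eq. 1: sum of log P(y_t | x_t, y_{1:t-1})."""
    return sum(math.log(probs[tag]) for probs, tag in zip(step_probs, tags))

# Hypothetical per-step distributions over three tags (O, B-problem, I-problem).
# In the model each row would already be conditioned on the previous tags.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
tags = [0, 1, 2]  # the tag sequence whose probability we want

log_p = sequence_log_prob(step_probs, tags)
prob = math.exp(log_p)  # equals 0.7 * 0.8 * 0.7
```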

Encoders Given an input sequence \(\mathbf {x} \in \mathbb {N}^T\) whose coordinates indicate the words in the input vocabulary, we first encode the character level representation for each word. For each \(x_t\) the corresponding sequence \(\mathbf {c}^{(t)} \in \mathbb {R}^{L \times e_c}\) of character embeddings is fed into an encoder, where L is the length of a given word and \(e_c\) is the size of the character embedding. The character encoder employs two LSTM units which produce \(\overrightarrow{h^{(t)}_{1:l}}\), and \(\overleftarrow{h^{(t)}_{1:l}}\), the forward and backward hidden representations, respectively, where l is the last timestep in both sequences. We concatenate the last timestep of each of these as the final encoded representation, \(h_c^{(t)} = [\overrightarrow{h^{(t)}_l} || \overleftarrow{h^{(t)}_l}]\), of \(x_t\) at the character level.

The output of the character encoder is concatenated with a pre-trained word embedding, \(m_t = [h_c^{(t)} || \text {emb}_{word}(x_t)]\), which is used as the input to the word level encoder. Using learned character embeddings alongside word embeddings has been shown to be useful for learning word level morphology, as well as for mitigating loss of representation for out-of-vocabulary words. Similar to the character encoder, we use a BiLSTM to encode the sequence at the word level. The word encoder does not lose resolution, meaning the output at each timestep is the concatenated output of both word LSTMs, \(h_t = [\overrightarrow{h_t} || \overleftarrow{h_t}]\).
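The two encoders can be sketched in NumPy as follows. This is a minimal forward-pass illustration under assumed settings, not the authors' implementation: all sizes, the random weights (including the stand-in for pre-trained word embeddings), and the helper names `init_lstm`, `run_lstm`, `bilstm`, and `encode_sentence` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(in_dim, hidden):
    """Random weights for one LSTM; the four gates are stacked row-wise."""
    return rng.normal(0.0, 0.1, (4 * hidden, in_dim + hidden)), np.zeros(4 * hidden)

def run_lstm(xs, params):
    """Run an LSTM over the rows of xs; return the hidden state at every step."""
    W, b = params
    hidden = b.size // 4
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x in xs:
        i, f, g, o = np.split(W @ np.concatenate([x, h]) + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        states.append(h)
    return np.stack(states)

def bilstm(xs, fwd, bwd):
    """Forward and backward hidden states, both aligned to input order."""
    return run_lstm(xs, fwd), run_lstm(xs[::-1], bwd)[::-1]

def encode_sentence(char_ids, word_ids, char_emb, word_emb, char_lstms, word_lstms):
    m = []
    for chars, word in zip(char_ids, word_ids):
        f, b = bilstm(char_emb[chars], *char_lstms)
        h_c = np.concatenate([f[-1], b[0]])              # final state of each direction
        m.append(np.concatenate([h_c, word_emb[word]]))  # m_t = [h_c || emb_word(x_t)]
    f, b = bilstm(np.stack(m), *word_lstms)
    return np.concatenate([f, b], axis=1)                # h_t for every timestep

# Hypothetical sizes: character embedding, character LSTM, word embedding, word LSTM.
e_c, char_hidden, e_w, word_hidden = 4, 3, 5, 6
char_emb = rng.normal(size=(30, e_c))
word_emb = rng.normal(size=(10, e_w))  # random stand-in for pre-trained embeddings
char_lstms = (init_lstm(e_c, char_hidden), init_lstm(e_c, char_hidden))
word_in = 2 * char_hidden + e_w
word_lstms = (init_lstm(word_in, word_hidden), init_lstm(word_in, word_hidden))

char_ids = [[1, 2], [3, 4, 5, 6], [7]]  # three words of different lengths
word_ids = [0, 1, 2]
H = encode_sentence(char_ids, word_ids, char_emb, word_emb, char_lstms, word_lstms)
```

Because the backward states are re-aligned to input order, the final state of the backward character LSTM sits at index 0, which is why `b[0]` is paired with `f[-1]`. The word encoder keeps one row per timestep, so `H` has shape `(3, 12)` here.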

Decoder and Tagger Finally, the concatenated output of the word encoder is used as input to the decoder, along with the label embedding of the previous timestep. During training we use teacher forcing [14] to provide the gold standard label as part of the input.

$$\begin{aligned} o_t = \text {LSTM}(o_{t-1}, [h_t || \hat{y}_{t-1}]) \end{aligned}$$
(2)
$$\begin{aligned} \hat{y}_t = \text {Softmax}(\mathbf {W}o_t + b^s) \end{aligned}$$
(3)

where \(\mathbf {W} \in \mathbb {R}^{n \times d}\), \(b^s\) is a bias term, d is the number of hidden units in the decoder LSTM, and n is the number of tags. The model is trained in an end-to-end fashion using a standard cross-entropy objective.
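Equations 2 and 3, together with teacher forcing and the cross-entropy objective, might look like the following NumPy sketch. It covers the forward pass and loss only (no gradient step). The sizes, random weights, the extra start-symbol row in `label_emb`, and the function name `decode_teacher_forced` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
enc_dim, lab_dim, d, n = 12, 4, 8, 3  # hypothetical sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W_lstm = rng.normal(0.0, 0.1, (4 * d, enc_dim + lab_dim + d))
b_lstm = np.zeros(4 * d)
W_out = rng.normal(0.0, 0.1, (n, d))  # stored as (n, d) so W_out @ o_t gives n tag scores
b_out = np.zeros(n)
label_emb = rng.normal(0.0, 0.1, (n + 1, lab_dim))  # extra row: assumed start symbol

def decode_teacher_forced(H, gold):
    """Return per-step tag distributions and the mean cross-entropy loss."""
    o, c = np.zeros(d), np.zeros(d)
    prev = n  # index of the start symbol
    probs, loss = [], 0.0
    for h_t, y_t in zip(H, gold):
        x = np.concatenate([h_t, label_emb[prev], o])
        i, f, g, out = np.split(W_lstm @ x + b_lstm, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        o = sigmoid(out) * np.tanh(c)        # Eq. 2
        p = softmax(W_out @ o + b_out)       # Eq. 3
        probs.append(p)
        loss -= np.log(p[y_t])
        prev = y_t  # teacher forcing: feed the gold label, not the prediction
    return np.stack(probs), loss / len(gold)

H = rng.normal(size=(5, enc_dim))  # stand-in for word encoder outputs
gold = [0, 1, 2, 0, 0]
probs, loss = decode_teacher_forced(H, gold)
```

At inference time the gold label is unavailable, so `prev` would instead be set to the predicted tag, for example `int(p.argmax())`.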

2.2 Two Decoder Model

To facilitate the multi-task learning setting, we started with a two decoder model in which two decoders use the shared encoder representation to jointly predict entities and the negation attribute (Fig. 1). This is a standard architecture in multi-task learning: each task has its own LSTM (Equation 2) followed by its own softmax. This model mitigates the issues associated with rule-based models that rely solely on surface text and thus are limited when attempting to capture complex syntactic constructions. With a shared contextual encoder representation built from character and word embedding based models, the architecture provides an effective solution for knowledge transfer across tasks, improving the ability to perform well on unseen text. However, this architecture is not scalable, because the number of decoders grows linearly with the number of attributes. Another problem we observed is performance degradation in extremely low resource settings, where the additional parameters prevent the model from generalizing well.
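The scaling argument can be made concrete with a rough parameter count. The sketch below assumes a standard LSTM with four gates and a single bias vector per gate, identical input and tag sizes for every task, and hypothetical dimensions; it compares one decoder per task against a single decoder with one softmax layer per task.

```python
def lstm_params(in_dim, d):
    """Weights and biases of one LSTM: four gates over [input, hidden] plus bias."""
    return 4 * d * (in_dim + d + 1)

def head_params(d, n_tags):
    """One softmax layer: weight matrix plus bias."""
    return d * n_tags + n_tags

def separate_decoders(num_tasks, in_dim, d, n_tags):
    """One LSTM decoder and one softmax layer per task."""
    return num_tasks * (lstm_params(in_dim, d) + head_params(d, n_tags))

def single_decoder(num_tasks, in_dim, d, n_tags):
    """One shared LSTM decoder, one softmax layer per task."""
    return lstm_params(in_dim, d) + num_tasks * head_params(d, n_tags)

# Hypothetical sizes: decoder input 16, hidden size 8, 3 tags per task.
two_tasks = (separate_decoders(2, 16, 8, 3), single_decoder(2, 16, 8, 3))
four_tasks = (separate_decoders(4, 16, 8, 3), single_decoder(4, 16, 8, 3))
```

With these sizes, going from two to four attributes doubles the decoder parameters when each task has its own decoder, while a single shared decoder only adds the small softmax layers.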

2.3 Shared Decoder Model

To overcome the issues with the two decoder model, we propose a shared decoder model. We share the encoder and decoder between the two tasks, and the common output from the decoder is fed into two separate softmax layers, one for entities and one for negation.

$$\begin{aligned} \hat{y}^{Entity}_t = \text {Softmax}^{Ent}(\mathbf {W^{Ent}}o_t + b^s) \end{aligned}$$
(4)
$$\begin{aligned} \hat{y}^{Neg}_t = \text {Softmax}^{Neg}(\mathbf {W^{Neg}}o_t + b^s) \end{aligned}$$
(5)
Fig. 1 Two decoder model: the upper decoder performs NER and the lower decoder predicts negation, with the encoder providing the same input to both decoders

Conditional Softmax Decoder Model While the single decoder model is more scalable, we found that it did not perform as well for negation as the two decoder model. This can be attributed to the fact that negation occurs less frequently than entities, so the decoder primarily focuses on making entity extraction predictions. To mitigate this issue and provide more context to the negation attribute, we add an additional input to the negation softmax, namely the softmax output from entity extraction (Fig. 2). Thus, the model learns more about the input as well as the label distribution from the entity extraction prediction. As an example, we use negation only for the problem entity in the i2b2 dataset. Providing the entity prediction distribution helps the negation model make better predictions: the negation model learns that if the predicted probability is not inclined towards the problem entity, then it should not predict negation irrespective of the word representation.

$$\begin{aligned} \hat{y}^{Entity}_t, \text {SoftOut}^{Entity}_t = \text {Softmax}^{Ent}(\mathbf {W^{Ent}}o_t + b^s) \end{aligned}$$
(6)
$$\begin{aligned} \hat{y}^{Neg}_t = \text {Softmax}^{Neg} (\mathbf {W^{Neg}}[o_t,\text {SoftOut}^{Entity}_t] + b^s) \end{aligned}$$
(7)

where \(\text {SoftOut}^{Entity}_t\) is the softmax output of the entity classifier at time step t.
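A single decoding step of Equations 6 and 7 can be sketched in NumPy as below. The sizes and random weights are hypothetical, and the sketch uses a separate bias for each softmax layer (an assumption, since the two layers have different output sizes).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ent, n_neg = 8, 7, 2  # hypothetical decoder size and tag counts

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W_ent = rng.normal(0.0, 0.1, (n_ent, d))
b_ent = np.zeros(n_ent)
W_neg = rng.normal(0.0, 0.1, (n_neg, d + n_ent))  # sees o_t and the entity distribution
b_neg = np.zeros(n_neg)

def conditional_softmax_step(o_t):
    """Entity distribution first, then negation conditioned on it."""
    soft_out_ent = softmax(W_ent @ o_t + b_ent)                  # Eq. 6
    neg_input = np.concatenate([o_t, soft_out_ent])
    neg_probs = softmax(W_neg @ neg_input + b_neg)               # Eq. 7
    return int(soft_out_ent.argmax()), soft_out_ent, neg_probs

o_t = rng.normal(size=d)  # stand-in for the shared decoder output at step t
ent_tag, ent_probs, neg_probs = conditional_softmax_step(o_t)
```

Feeding the full distribution rather than only the predicted tag lets the negation layer see how confident the entity prediction is.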

Fig. 2 Conditional softmax decoder model

2.4 Results

Since there has been no prior work that solves the two tasks with a joint model, we report the best results for each of the individual tasks (Table 1). We observe that our baseline model for NER presented in the methodology section outperforms the best model [1] on the i2b2 challenge. The two decoder and conditional decoder models achieve even better results for NER than our baseline model, with the conditional decoder model achieving a new state of the art for the 2010 i2b2/VA challenge task. The single decoder underperformed the other two models, which can be attributed to a single decoder primarily focusing on entity extraction predictions, which are more frequent than negations. The conditional decoder outperformed the baseline model on the negation prediction task, with an improvement of about 8% in F1 score, which suggests that modeling the named entity and negation tasks together yields better results than handling each task independently.

Table 1 Test set performance during multi-task training. (A) displays results from i2b2. (B) uses our medical condition data. The baseline is the current state-of-the-art optimized architecture
Table 2 Conditional softmax decoder is more robust in extreme low resource setting than its two decoder counterpart

We compare our models for negation detection against NegEx [2] and ABoW [12], which has the best results for the negation detection task on the i2b2 dataset. The conditional softmax decoder model outperforms both NegEx and ABoW (Table 1). The lower performance of NegEx and ABoW is mainly attributed to their reliance on ontology lookup to index findings and on negation regular expression search within a fixed scope.

A similar trend was observed in the medical condition dataset. The important thing to note is the low F1 score for NegEx. This can primarily be attributed to abbreviations and misspellings in clinical notes, which cannot be handled well by rule-based systems.

To understand the advantage of the conditional softmax decoder, we evaluated our model in extreme low data settings, where we used only a sample of our training data. We observed that the conditional softmax decoder outperforms the two decoder model, with an improvement of 6% in F1 score in those settings (Table 2). As the data size increases, the performance gap narrows, which indicates that the conditional softmax decoder is the more robust of the two in low resource settings.
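For readers less familiar with the metric quoted throughout this section, the sketch below computes F1 over predicted spans. It assumes exact-match scoring on `(start, end, label)` tuples, which may differ from the evaluation script actually used, and the spans shown are made up for illustration.

```python
def span_f1(gold, pred):
    """F1 over exact-match spans; each span is a (start, end, label) tuple."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans: the second prediction has the wrong end offset.
gold = [(0, 2, "problem"), (5, 6, "problem"), (9, 9, "test")]
pred = [(0, 2, "problem"), (5, 7, "problem"), (9, 9, "test")]
score = span_f1(gold, pred)  # precision = recall = 2/3
```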

2.5 Conclusion

In this paper, we have shown that named entity recognition and negation assertion can be modeled in a multi-task setting. Joint learning with parameter sharing provides a better contextual representation and helps alleviate problems associated with using neural networks for negation detection, thereby achieving better results than the rule-based systems. Our proposed conditional softmax decoder achieves the best results across both tasks and remains robust in extreme low data settings. For future work, we plan to investigate the model on other related tasks such as relation extraction and normalization, as well as the use of advanced conditional models.