3.1 Introduction

Following the distributional hypothesis, one can project the semantic meaning of a word into a low-dimensional real-valued vector according to its context information. A further problem then arises: how can we compress a higher-level semantic unit, such as a phrase, into a vector or another kind of mathematical representation such as a matrix or a tensor? In this chapter, we introduce representation learning approaches to modeling semantic composition functions from a linguistic perspective.

Compositionality enables natural languages to construct complicated semantic meanings from combinations of basic semantic elements following particular rules. The basic principle is that the semantic meaning of a whole is a function of the semantic meanings of its parts. Therefore, the semantic meanings of complex structures depend on how their semantic elements are combined. Many previous studies are dedicated to the representation learning of compositional semantics. Among them, the article on composition in distributional models of semantics [7] proposes a comprehensive framework for compositional semantics and has become a representative summary of this line of work. In this chapter, we adopt this framework to introduce compositional semantics along with our understanding and discussion (Fig. 3.1).

Fig. 3.1 Higher-level linguistic units are composed of basic linguistic units guided by certain rules (the figure depicts the hierarchy document → sentence → phrase → word, associated with coherence, syntax, combination, and morphology, respectively)

Here, we consider two basic semantic units and use u and v to denote them, respectively. The most intuitive way to define the joint representation is to directly build a mapping function:

$$\displaystyle \begin{aligned} \mathbf{p} = f(\mathbf{u}, \mathbf{v}), \end{aligned} $$
(3.1)

where p corresponds to the representation of the joint semantic unit (u, v). Generally, u and v could denote words, phrases, sentences, paragraphs, or even higher-level semantic units.

However, given the representations of two semantic constituents, it is not enough to derive their joint embedding without syntactic information. For instance, although the phrases machine learning and learning machine share the same vocabulary, they have different meanings: machine learning refers to a research field in artificial intelligence, while learning machine refers to specific learning algorithms. That is to say, the way the units are combined is also an essential part of semantic composition. This phenomenon stresses the importance of syntactic and order information in a compositional unit. Hence, an improved version of the framework takes the role of syntactic and order information into consideration [9]. Specifically, in terms of formulation, we can introduce a term \(\mathcal {R}\) to represent the relationship between units.

The complex semantics of a combined unit in the real world can also be influenced by human background knowledge. In other words, some sentences are difficult to understand by merely paying attention to the constituent units and the syntax. For example, the sentence Tom and Jerry is one of the most popular comedies in that style. requires two main pieces of background knowledge: first, Tom and Jerry is a special noun phrase or knowledge entity that refers to a cartoon comedy rather than two ordinary people; second, the phrase that style needs further explanation from the preceding sentences. Hence, a full understanding of compositional semantics needs to take external knowledge into account. We complete the formulation by adding another item \(\mathcal {K}\) to denote such background knowledge. The composition function in Eq. (3.1) is redefined to combine the syntactic relationship rule \(\mathcal {R}\) between the semantic units u and v and human background knowledge \(\mathcal {K}\):

$$\displaystyle \begin{aligned} \mathbf{p} = f(\mathbf{u}, \mathbf{v}, \mathcal{R},\mathcal{K}). \end{aligned} $$
(3.2)

From the perspective of computational linguistics, the meaning of a word is not isolated from its context. That is, the semantics of the combined unit are constituted from its components, while the meanings of the components are in turn derived from the combined unit [3]. This also echoes the concept of structuralism stated in Chap. 1. Compositionality is also a matter of degree rather than an either-or issue [8]. We can divide linguistic structures into several groups according to their degree of compositionality. For example, in a fully compositional structure, the combined high-level semantics is completely composed of the independent semantics of the basic units (e.g., white fur). In partly compositional expressions, the basic units still have separate meanings, but when combined, they derive additional semantics (e.g., take care). In non-compositional idioms or multi-word expressions, the combined semantics has little relationship with the semantics of the basic units (e.g., at large).

From the above equations formulating the composition function, it can be concluded that composition is more than a specific binary operation. Syntactic information helps to indicate a particular composition approach, while background knowledge helps to explain obscure words or context-dependent entities such as pronouns. From this, we can see that the complexity of language comes from the nearly infinite combination of finite elements. This chapter can be seen as the transition from word representation to sentence and document representation, aiming to introduce the basic concepts and methods for dealing with compositional semantics from a linguistic point of view. We first explain basic binary composition functions in Sect. 3.2, including the additive model and the multiplicative model, and then give a brief introduction to modeling methods for more complex N-ary composition in Sect. 3.3, including sequential order, recursive order, and convolutional order modeling. We will introduce the specific methods of learning sentence and document representations in modern NLP in detail in the next chapter.

3.2 Binary Composition

The goal of compositional semantics is to construct vector representations for higher-level linguistic units from their basic units via binary composition. Without loss of generality, we assume that each constituent of a phrase (or an even higher-level linguistic unit) is embedded into a computable vector, which is further used to generate a representation vector for the phrase.

In this section, we focus on binary composition, where two objects are involved in each operation. We consider phrases consisting of two components: a head and a modifier or complement. If we cannot model binary composition (i.e., phrase representation), it is almost impossible to build more complex compositional representations for higher-level linguistic units. Even in today's age of neural networks, the concept of binary composition is still important. For example, in the Transformer architecture, the calculation of the attention between two units can be regarded as a type of binary operation.

Given a phrase with two constituent words, machine learning, as well as the representations u and v of the words machine and learning, respectively, our primary goal is to construct a representation vector p of the phrase from the representations of the words. Using a simple semantic space where each vector is represented by five integers, we let the hypothetical vectors for machine and learning be [0, 3, 1, 5, 2] and [1, 4, 2, 2, 0], respectively. If we simply use the addition operator to represent the phrase machine learning, it becomes [0, 3, 1, 5, 2] + [1, 4, 2, 2, 0] = [1, 7, 3, 7, 2]. The key to this problem is designing a primitive composition function as a binary operator. Based on this function, one can apply it to a word sequence recursively to derive the composition of longer text.

Modeling the binary composition function is a well-studied but still challenging problem. There are two main perspectives on this question, namely the additive model and the multiplicative model, according to the basic operator used. We will introduce the basic concepts and computation principles of these two approaches in this section.

3.2.1 Additive Model

The additive model, as the name implies, is a modeling method with addition as the basic operation. Recall that in the introductory part, we derived a formulation that may contain complex relationships between units and external background knowledge, which would make our discussion exceedingly broad. In this section, to narrow the space of considered functions and establish a fundamental understanding of compositional semantics, we start by simplifying the formula to p = f(u, v), omitting the relationship and background items. Naturally, if we aim to perform addition correctly, p, u, and v should lie in the same semantic space.

One of the simplest ways is to directly use the sum to represent the joint representation:

$$\displaystyle \begin{aligned} \mathbf{p} = \mathbf{u} + \mathbf{v}. \end{aligned} $$
(3.3)

As computed above, the sum of the two vectors representing machine and learning is w(machine) + w(learning) = [1, 7, 3, 7, 2]. This assumes that the composition of different constituents is a symmetric function, where p = u + v = v + u. That is, it does not consider the order of the constituents. Although it has drawbacks, such as the inability to model word order and the absence of syntactic or background knowledge information, this approach still provides a relatively strong baseline [6].
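
To make this concrete, here is a minimal NumPy sketch of the naive additive model, using the hypothetical five-dimensional vectors from the running example:

```python
import numpy as np

# Hypothetical five-dimensional vectors for "machine" and "learning".
machine = np.array([0, 3, 1, 5, 2], dtype=float)
learning = np.array([1, 4, 2, 2, 0], dtype=float)

# Naive additive composition: p = u + v (Eq. (3.3)).
p = machine + learning
print(p)  # [1. 7. 3. 7. 2.]

# The operation is symmetric, so word order is lost:
# "machine learning" and "learning machine" get the same vector.
print(np.array_equal(machine + learning, learning + machine))  # True
```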

To overcome the word order issue, one simple variant is to apply a weighted sum instead of uniform weights. That is, the composition has the following form:

$$\displaystyle \begin{aligned} \mathbf{p} = \alpha \mathbf{u} +\beta \mathbf{v}, \end{aligned} $$
(3.4)

where α and β correspond to different weights for the two vectors. Under this setting, the two sequences (u, v) and (v, u) have different representations when α ≠ β, which is consistent with real language phenomena. For example, machine learning and learning machine have different meanings and require different representations. To this end, we assign different importance scores to different components. For instance, if we set α to 0.3 and β to 0.7, then 0.3 × w(machine) = [0, 0.9, 0.3, 1.5, 0.6] and 0.7 × w(learning) = [0.7, 2.8, 1.4, 1.4, 0], and machine learning is represented by their sum 0.3 × w(machine) + 0.7 × w(learning) = [0.7, 3.7, 1.7, 2.9, 0.6].
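
The following sketch repeats the computation with the weighted additive model of Eq. (3.4); the weights 0.3 and 0.7 are simply the hypothetical importance scores used above:

```python
import numpy as np

machine = np.array([0, 3, 1, 5, 2], dtype=float)
learning = np.array([1, 4, 2, 2, 0], dtype=float)

alpha, beta = 0.3, 0.7  # hypothetical importance weights

# Weighted additive composition: p = alpha * u + beta * v (Eq. (3.4)).
machine_learning = alpha * machine + beta * learning
learning_machine = alpha * learning + beta * machine

print(machine_learning)  # [0.7 3.7 1.7 2.9 0.6]
print(learning_machine)  # [0.3 3.3 1.3 4.1 1.4]  -- order now matters
```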

Now, we attempt to incorporate prior knowledge and syntax information into the additive model in a straightforward way. To achieve this, one can incorporate the semantics of the nearest neighbors of each unit into the composition, deriving:

$$\displaystyle \begin{aligned} \mathbf{p} = \mathbf{u} + \sum_{i=1}^{L} {\mathbf{m}}_i + \mathbf{v} +\sum_{i=1}^{K} {\mathbf{n}}_i, \end{aligned} $$
(3.5)

where \({\mathbf {m}}_1, {\mathbf {m}}_2, \ldots , {\mathbf {m}}_L\) denote the semantic neighbors (i.e., synonyms) of u, and \({\mathbf {n}}_1, {\mathbf {n}}_2, \ldots , {\mathbf {n}}_K\) denote the semantic neighbors of v. In this way, the method ensembles the synonyms of each component into the composition function as a smoothing factor, which reduces the variance of language. For example, suppose that in the composition of machine and learning, the chosen neighbors are computer and optimizing, with w(computer) = [1, 0, 0, 0, 1] and w(optimizing) = [1, 5, 3, 2, 1], respectively. Then the representation of machine learning becomes w(machine) + w(computer) + w(learning) + w(optimizing) = [3, 12, 6, 9, 4]. Although it is a simple strategy, using synonyms to improve the robustness of language models is still a very effective practice in modern NLP.
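
A minimal sketch of the synonym-smoothed variant in Eq. (3.5), assuming one hypothetical neighbor per word as in the example above:

```python
import numpy as np

machine = np.array([0, 3, 1, 5, 2], dtype=float)
learning = np.array([1, 4, 2, 2, 0], dtype=float)

# Hypothetical semantic neighbors used for smoothing.
machine_neighbors = [np.array([1, 0, 0, 0, 1], dtype=float)]   # "computer"
learning_neighbors = [np.array([1, 5, 3, 2, 1], dtype=float)]  # "optimizing"

# Eq. (3.5): add each unit together with its semantic neighbors.
p = machine + sum(machine_neighbors) + learning + sum(learning_neighbors)
print(p)  # [ 3. 12.  6.  9.  4.]
```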

When it comes to measuring the similarity between representations, the cosine function is a natural choice in the semantic space. We will take a closer look at the additive model by computing the cosine similarity between p = u + v (we go back to the naive additive model for computational simplicity) and an arbitrary word w. The cosine similarity, denoted as s(⋅), can be derived as:

$$\displaystyle \begin{aligned} s(\mathbf{p}, \mathbf{w}) &= \frac{\mathbf{p} \cdot \mathbf{w}}{\Vert\mathbf{p} \Vert \cdot \Vert\mathbf{w} \Vert } =\frac{(\mathbf{u} + \mathbf{v}) \mathbf{w}}{\Vert\mathbf{u} + \mathbf{v}\Vert \Vert\mathbf{w} \Vert } \end{aligned} $$
(3.6)
$$\displaystyle \begin{aligned} &=\frac{\Vert\mathbf{u}\Vert}{\Vert\mathbf{u} + \mathbf{v}\Vert} s(\mathbf{u}, \mathbf{w}) + \frac{\Vert\mathbf{v}\Vert}{\Vert\mathbf{u} + \mathbf{v}\Vert} s(\mathbf{v}, \mathbf{w}). \end{aligned} $$
(3.7)

From the above derivation, it can be concluded that this composition function depends on both the magnitudes and directions of the two component vectors. The similarity between the composed unit and another linguistic unit can be viewed as a linear combination of the similarities of its two components. In other words, if one vector dominates the magnitude, it will also dominate the similarity. For example, if \(\Vert \mathbf {u}\Vert = 10^3\) and \(\Vert \mathbf {v}\Vert = 10^{-3}\), the similarity between p and w will be mostly determined by the semantics of u. This could happen if u is an entity with a strong, specific meaning like Europe while v is an empty word like there. Furthermore, in terms of the norm of the composed representation, we have:

$$\displaystyle \begin{aligned} \Vert\mathbf{p}\Vert = \Vert\mathbf{u}+\mathbf{v}\Vert \leq \Vert\mathbf{u}\Vert + \Vert\mathbf{v}\Vert. \end{aligned} $$
(3.8)

This lemma suggests that a semantic unit with a deeper-rooted parsing tree could determine the joint representation when combined with a shallow unit. That is, the closer a unit is to the final combined unit, the more likely it is to exert a greater influence on the overall semantics.
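
The properties in Eqs. (3.6)-(3.8) can be checked numerically. The sketch below uses arbitrary random vectors (the seed and dimensionality are chosen only for illustration):

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
u, v, w = rng.normal(size=(3, 5))  # three arbitrary 5-dimensional vectors
p = u + v

# Eq. (3.7): s(p, w) is a norm-weighted combination of s(u, w) and s(v, w).
lhs = cos(p, w)
rhs = (np.linalg.norm(u) / np.linalg.norm(p)) * cos(u, w) \
    + (np.linalg.norm(v) / np.linalg.norm(p)) * cos(v, w)
print(np.isclose(lhs, rhs))  # True

# Eq. (3.8): triangle inequality on the norm of the composed vector.
print(np.linalg.norm(p) <= np.linalg.norm(u) + np.linalg.norm(v))  # True
```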

3.2.2 Multiplicative Model

Though the additive model achieves considerable success in semantic composition, its simplicity may also prevent it from modeling more complex interactions. Different from the additive model, which regards composition as a simple linear transformation, the multiplicative model aims to capture higher-order interactions by using multiplication as the basic operator. Among all models from this perspective, the most intuitive approach applies the element-wise product as an approximation of the composition function. In this method, the composition function is defined as follows:

$$\displaystyle \begin{aligned} \mathbf{p} = \mathbf{u} \odot \mathbf{v}, \end{aligned} $$
(3.9)

where \(p_i = u_i \cdot v_i\), which implies that each dimension of the output only depends on the corresponding dimension of the two input vectors. However, similar to the simplest additive model, this model also suffers from the inability to model word order and the absence of syntactic or background knowledge information.
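
A short sketch of the element-wise multiplicative model of Eq. (3.9), again with the hypothetical vectors of the running example:

```python
import numpy as np

machine = np.array([0, 3, 1, 5, 2], dtype=float)
learning = np.array([1, 4, 2, 2, 0], dtype=float)

# Element-wise product: p_i = u_i * v_i (Eq. (3.9)).
p = machine * learning
print(p)  # [ 0. 12.  2. 10.  0.]
# Any dimension where either word is zero is zeroed out, and the
# operation is symmetric, i.e., it still ignores word order.
```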

In the additive model, we used p = αu + βv to alleviate the word order issue by assigning different weights to the two items. Here, α and β are scalars, which can naturally be generalized to two matrices. The composition function can then be represented as:

$$\displaystyle \begin{aligned} \mathbf{p} = {\mathbf{W}}_{\alpha} \cdot \mathbf{u} + {\mathbf{W}}_{\beta} \cdot \mathbf{v}, \end{aligned} $$
(3.10)

where Wα and Wβ are weight matrices that indicate the importance of the components u and v to the combined unit p. With this expression, the composition becomes more expressive and flexible, although much harder to train.
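
A minimal sketch of Eq. (3.10). In practice the two matrices would be learned from data; here they are randomly initialized placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
u = np.array([0, 3, 1, 5, 2], dtype=float)  # "machine"
v = np.array([1, 4, 2, 2, 0], dtype=float)  # "learning"

# Hypothetical (randomly initialized) weight matrices, normally learned.
W_alpha = rng.normal(scale=0.1, size=(dim, dim))
W_beta = rng.normal(scale=0.1, size=(dim, dim))

# Eq. (3.10): p = W_alpha u + W_beta v.
p = W_alpha @ u + W_beta @ v
print(p.shape)  # (5,)
```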

By generalizing the models above, another approach is to utilize a tensor as the multiplicative descriptor, and the composition function can be written as:

$$\displaystyle \begin{aligned} \mathbf{p} = \overrightarrow{\mathbf{W}} \cdot \mathbf{u} \mathbf{v}, \end{aligned} $$
(3.11)

where \(\overrightarrow {\mathbf {W}}\) denotes a three-order tensor, i.e., the formula above can be written as \(p_k = \sum _{i,j} W_{ijk} \cdot u_i \cdot v_j\). Hence, this model ensures that each element of p can be influenced by all elements of both u and v through a linear combination that assigns each pair (i, j) a unique weight.
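A sketch of the third-order tensor composition of Eq. (3.11), with a randomly initialized tensor standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
u = rng.normal(size=dim)
v = rng.normal(size=dim)

# Hypothetical third-order tensor W with dim**3 parameters.
W = rng.normal(scale=0.1, size=(dim, dim, dim))

# Eq. (3.11): p_k = sum_{i,j} W_{ijk} * u_i * v_j.
p = np.einsum('ijk,i,j->k', W, u, v)
print(p.shape)  # (5,)
```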

Starting from this simple but general baseline, some researchers proposed to make the function asymmetric to account for word order in the sequence, paying more attention to the first element. The composition function becomes:

$$\displaystyle \begin{aligned} \mathbf{p} = \overrightarrow{\mathbf{W}} \cdot \mathbf{uuv}, \end{aligned} $$
(3.12)

where \(\overrightarrow {\mathbf {W}}\) denotes a four-order tensor. This method can be understood as asymmetrically replacing the linear transformation in u with a quadratic one, while v remains linear. It is thus a variant of the tensor-based multiplicative compositional model.
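
The quadratic-in-u variant of Eq. (3.12) differs only in the order of the tensor; a small sketch follows (the dimensionality is kept tiny because the parameter count grows as dim**4):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
u = rng.normal(size=dim)
v = rng.normal(size=dim)

# Hypothetical fourth-order tensor W with dim**4 parameters.
W = rng.normal(scale=0.1, size=(dim, dim, dim, dim))

# Eq. (3.12): p_l = sum_{i,j,k} W_{ijkl} * u_i * u_j * v_k (quadratic in u).
p = np.einsum('ijkl,i,j,k->l', W, u, u, v)
print(p.shape)  # (4,)
```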

Different from expanding the simple multiplicative model into more complex ones, other approaches are proposed to reduce the parameter space. By reducing the parameter size, compositions become much more efficient than the \(O(n^3)\) complexity of the tensor-based model. Thus, some compression techniques can be applied to the original tensor model. One representative instance is the circular convolution model, which can be written as:

$$\displaystyle \begin{aligned} \mathbf{p} = \mathbf{u} \circledast \mathbf{v}, \end{aligned} $$
(3.13)

where \(\circledast\) represents the circular convolution operation with the following definition:

$$\displaystyle \begin{aligned} {\mathbf{p}}_i = \sum_{j}{{\mathbf{u}}_j \cdot {\mathbf{v}}_{i - j}}. \end{aligned} $$
(3.14)

If we assign each pair a unique weight, the composition function becomes:

$$\displaystyle \begin{aligned} {\mathbf{p}}_i = \sum_{j}{{\mathbf{W}}_{ij} \cdot {\mathbf{u}}_j \cdot {\mathbf{v}}_{i - j}}. \end{aligned} $$
(3.15)

The circular convolution model can be viewed as a special instance of the tensor-based composition model. If we write the circular convolution in tensor form, we have \(W_{ijk} = 0\) where \(k \neq i + j\) (with indices taken modulo n). Thus, the number of parameters is reduced from \(n^3\) to \(n^2\), while maintaining interactions between each pair of dimensions in the input vectors.
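
A sketch of the circular convolution composition of Eq. (3.14). The direct double loop follows the definition; the FFT version exploits the fact that circular convolution is a pointwise product in the frequency domain:

```python
import numpy as np

u = np.array([0, 3, 1, 5, 2], dtype=float)
v = np.array([1, 4, 2, 2, 0], dtype=float)
n = len(u)

# Eq. (3.14): p_i = sum_j u_j * v_{(i - j) mod n}.
p_loop = np.array([sum(u[j] * v[(i - j) % n] for j in range(n))
                   for i in range(n)])

# Equivalent computation via the FFT.
p_fft = np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)))

print(np.allclose(p_loop, p_fft))  # True
```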

In both the additive and multiplicative models, the basic assumption is that all components lie in the same semantic space as the output. Nevertheless, modeling different types of words in different semantic spaces can bring us different perspectives. For instance, given (u, v), the multiplicative model can be reformulated as:

$$\displaystyle \begin{aligned} \mathbf{p} = \overrightarrow{\mathbf{W}} \cdot \mathbf{u} \mathbf{v} = (\overrightarrow{\mathbf{W}} \cdot \mathbf{u}) \cdot \mathbf{v} = \mathbf{U} \cdot \mathbf{v}. \end{aligned} $$
(3.16)

where \(\mathbf {U} = \overrightarrow {\mathbf {W}} \cdot \mathbf {u}\). This implies that the left unit can be treated as an operation on the representation of the right one. In other words, the left unit can be formulated as a transformation matrix, while the right one is represented as a semantic vector. This argument can be meaningful, especially for certain kinds of phrase composition. Baroni et al. [2] argue that for adjective-noun phrases, the joint semantic information can be viewed as the conjunction of the semantic meanings of the two components. Given the phrase red car, its semantic meaning is the conjunction of all red things and all kinds of cars. Thus, red can be formulated as an operator on the vector of car, deriving a new semantic vector that expresses the meaning of red car. These observations lead to another genre of semantic compositional modeling: the compositional matrix space.
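
A sketch of this matrix-space view: the adjective is a (learned) matrix that acts on the noun vector. Both parameters below are random placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Hypothetical parameters: "red" as an operator matrix, "car" as a vector.
red = rng.normal(scale=0.1, size=(dim, dim))
car = rng.normal(size=dim)

# "red car": the adjective matrix transforms the noun vector (p = U v).
red_car = red @ car
print(red_car.shape)  # (5,)
```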

3.3 N-ary Composition

In real-world NLP tasks, the input is typically a sequence of multiple words or tokens rather than just a pair of words. Therefore, besides designing a suitable binary compositional operator, the order in which binary operations are applied is also important. In this section, we introduce mainstream strategies for N-ary composition, taking language modeling as an example. To illustrate the task more clearly, the composition problem of modeling a sentence or even a document can be formulated as follows. Given a sentence/document consisting of a word sequence \(\{w_1, w_2, \ldots , w_N\}\), we aim to design the following components to obtain the joint semantic representation of the whole sentence/document:

  1. A semantic representation method, such as the semantic vector space or the compositional matrix space.

  2. A binary compositional operation function f(u, v), as introduced in the previous sections. Here the inputs u and v denote the representations of two constituent semantic units, while the output is a representation in the same space.

  3. An order in which to apply the binary function in step 2. In detail, we can use brackets to indicate the order in which the composition function is applied. For instance, we can use \(((w_1, w_2), w_3)\) to represent the sequential order from beginning to end (a toy sketch follows this list).
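
As noted in item 3, here is a toy sketch of applying a binary composition function (the weighted additive model is assumed for concreteness) from left to right over a word sequence:

```python
from functools import reduce

import numpy as np

def compose(u, v, alpha=0.3, beta=0.7):
    """A hypothetical binary composition function (weighted additive model)."""
    return alpha * u + beta * v

# A toy "sentence" of four random word vectors.
rng = np.random.default_rng(0)
words = [rng.normal(size=5) for _ in range(4)]

# Sequential order: (((w1, w2), w3), w4).
sentence_vec = reduce(compose, words)
print(sentence_vec.shape)  # (5,)
```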

Methods to model sentence semantics and tackle the above problems can be classified by the order in which words are composed: sequential order and convolutional order. These composition methods can be implemented by neural networks with corresponding structures. We introduce the fundamental concepts of the modeling here and leave the specific neural network methods to the next chapter.

Sequential Order

To design the order in which binary compositional functions are applied, the most intuitive method is to follow the sequence itself. Namely, the composition order is \(s_n = (s_{n-1}, w_n)\), where \(s_{n-1}\) denotes the composition of the first n − 1 words. In this case, the most suitable neural network is the recurrent neural network (RNN). An RNN applies the composition function sequentially and derives the representations of hidden semantic units. These hidden semantic units can then be used in specific NLP tasks such as sentiment analysis or text classification. Also, note that basic RNNs only utilize the sequential information from the head to the tail of a sentence/document. To improve the representation ability, RNNs can be enhanced by considering both sequential and reverse-sequential information. In an RNN, each hidden state is controlled by the previous hidden state and the input embedding at the current time step, thereby forming the composition function of the sequential order.
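
A minimal sketch of this idea: an RNN-style step that combines the previous state with the current word embedding. All parameters are randomly initialized placeholders rather than trained weights:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5

# Hypothetical (randomly initialized) RNN parameters.
W_h = rng.normal(scale=0.1, size=(dim, dim))  # recurrent weights
W_x = rng.normal(scale=0.1, size=(dim, dim))  # input weights
b = np.zeros(dim)

def rnn_step(h_prev, x):
    # Each hidden state depends on the previous state and the current input.
    return np.tanh(W_h @ h_prev + W_x @ x + b)

words = [rng.normal(size=dim) for _ in range(6)]  # a toy word sequence
h = np.zeros(dim)
for x in words:
    h = rnn_step(h, x)  # sequential order: s_n = f(s_{n-1}, w_n)

print(h.shape)  # (5,) -- the final state as the sequence representation
```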

Convolutional Order

In addition to the sequential and recursive orders derived from linguistic intuition, we can also model high-level semantics in a convolutional order. Naturally, this is implemented by a convolutional neural network (CNN), which extracts local features with a convolution layer and then integrates them via pooling operations to produce sentence-level representations. The starting point of such methods is also to model local features of basic units and then synthesize a global representation of the entire input. The difference from the previous approaches is that it follows neither the sequence order nor the syntactic structure but lets the convolutional layer complete this combination automatically.
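
A toy sketch of convolutional composition: filters score sliding windows of word embeddings, and max pooling aggregates the local features into a fixed-size vector. Sizes and parameters below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words, window, n_filters = 5, 8, 3, 4

words = rng.normal(size=(n_words, dim))                    # toy word embeddings
filters = rng.normal(scale=0.1, size=(n_filters, window * dim))

# Convolution: each filter scores every window of consecutive words.
feature_maps = np.stack([
    filters @ words[i:i + window].reshape(-1)
    for i in range(n_words - window + 1)
])                                                         # (n_windows, n_filters)

# Max pooling over positions yields a sentence-level representation.
sentence_vec = feature_maps.max(axis=0)
print(sentence_vec.shape)  # (4,)
```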

For the sake of simplicity, this chapter ignores the relationship \(\mathcal {R}\) and the external knowledge \(\mathcal {K}\) when introducing compositional semantics. These two items are difficult to define heuristically and apply in traditional computational linguistics. However, modern NLP, typically based on deep neural networks, changes the situation. Their tremendous capacity enables neural networks to model almost arbitrarily complex semantic structures in an implicit way, which can be regarded as modeling the \(\mathcal {R}\) item (introduced in Chap. 4). Advances in knowledge representation learning and knowledge-guided NLP can naturally be seen as modeling the \(\mathcal {K}\) item (introduced in Chap. 9).

3.4 Summary and Further Readings

In this chapter, we first introduce the semantic space for compositional semantics. Afterward, we take phrase representation as an example to introduce representative models for binary semantic composition, including additive models and multiplicative models. Finally, we introduce typical methods for N-ary semantic composition. We use fundamental principles and concepts to illustrate the core idea of compositional semantics: to build complex semantics with the combinations of basic components. For further understanding of compositional semantics, readers can refer to some recommended surveys and books that comprehensively introduce the area. For example, the framework applied in this chapter is from the inspiring article of Pelletier et al. [10].

For better modeling of compositional semantics, some directions require further efforts in the future. For example, neurobiology-inspired compositional semantics is a promising research topic that explores the neurobiological underpinnings of compositional semantics [11]. Analyses of how language builds meaning, and the directions they lay out for neurobiological research, may provide instructive references for modeling compositional semantics in representation learning. It is valuable to design novel compositional forms inspired by recent neurobiological advances. There are also studies that attempt to incorporate discrete symbols into deep neural networks [1, 4], triggering new research on the combination of neural and symbolic models.

Generally speaking, modeling the complex semantics distributed in sentences and even documents can be extremely difficult, and it may be hard to complete such modeling with heuristic methods alone. At this point, the powerful fitting and generalization capabilities of neural networks are needed. In the next chapter, we will introduce the concepts, methodologies, and applications of sentence and document representation, with a particular focus on neural network approaches.