1 Introduction

Question Answering (QA) is a computer science discipline born from the Information Retrieval (IR) field and aimed at providing exact answers to natural language questions. Although QA is a long-standing research problem, it has recently gained increasing attention, since it is advocated as a core function of digital assistants like Siri, Google Assistant, Alexa, and Cortana. Furthermore, it has been identified as fundamental to building the next generation of Cognitive Systems [10] able to offer complex analytic algorithms as composite services in the cloud [1,2,3].

Generally speaking, a QA system processes a question and extracts as many clues as possible to understand what it is being asked for (Question Processing). Then, a knowledge base/corpus is queried to retrieve relevant information about the question (Information Retrieval). Finally, all the possible answers are evaluated in order to select the most confident one (Answer Processing).

Driven by the observation that knowledge bases are far from complete and, thus, may not always contain the information required to answer questions, this work studies corpus-based QA systems, with the focus on directly mining answers from a collection of documents pertaining to a closed domain. Building such systems, however, raises several challenging issues.

First, in the Question Processing phase, all the clues collected so far are used to generate a query aimed at retrieving relevant documents. Therefore, the policy employed to extract relevant terms from the question has to be carefully defined, according to the type of information the question is asking for. Second, in the Answer Processing phase, determining the likelihood that a candidate answer is correct is a very thorny task. Most corpus-based QA systems use pattern-based filtering algorithms to score the set of candidates and determine the most pertinent one. However, filtering requires more complex heuristics when applied to a corpus made of unstructured documents and closed-domain knowledge. Finally, while a huge literature exists for English QA systems, few studies are available for corpus-based QA on closed domains in non-English languages.

For these reasons, this paper proposes a novel pipeline for corpus-based QA in Italian, developed within C3, a Cognitive Computing Center in Naples involving the National Research Council of Italy and the Italian Labs of the IBM Research and Development Division.

The first distinctive element of this pipeline is its capacity to analyze a question expressed in Italian and determine, by exploiting only lexical features, the most useful terms for IR within a closed domain. Secondly, no restriction is made on the expected answer type that can be handled. In particular, two main types are considered, namely Factoid, for precise words, and Description, for short paragraphs talking about concepts or named entities. The Factoid type is further organized into a two-level taxonomy in order to finely classify the expected answers. Finally, the third novel functionality is the manner in which candidate answers are evaluated and selected. In particular, a set of filtering procedures is proposed to score the evidence for each candidate with respect to its type, i.e. factoid or description.

The rest of the paper is organized as follows. Section 2 describes the common approaches in literature, highlighting differences with the proposed one. Section 3 describes the pipeline, the choices made and its implementation. Section 4 describes the experimental evaluation and presents the results achieved. Section 5 concludes the work with several considerations and future activities.

2 Related Work

In the last decade, the literature about QA has flourished. Plenty of works are available, spanning quite different approaches.

Most studies are limited to the English language only. The most relevant solution is represented by Watson [12], developed in the IBM facilities. Among the other proposals working with the English language, three notable studies are YodaQA [4], Watsonsim [8], and OpenEphyra [18].

YodaQA is an end-to-end QA pipeline designed to answer factoid questions. It has been built by exploiting UIMA (Unstructured Information Management Architecture) and relies on different NLP tools, like Stanford CoreNLP, OpenNLP, and LanguageTool. Watsonsim is a class project developed at the University of North Carolina; it is inspired by IBM Watson, based on Lucene and OpenNLP, and meant to answer questions from Jeopardy!. OpenEphyra is a QA pipeline that relies on pattern learning techniques to interpret questions and extract answers from structured and unstructured knowledge bases.

With reference to other languages, few examples exist, which address multilingual QA [19] or focus on a specific language [5, 13, 14]. As far as Italian is concerned, two studies stand out [14, 16]. The former proposes a QA system named Question Cube, relying on NLP algorithms for both Question Processing and Answer Processing, machine learning for question classification, and probabilistic IR models for Information Retrieval. The latter, named Quasit, is a QA system that leverages previously available linguistic knowledge and bases its reasoning process on both rules and ontologies; it uses the structured knowledge base DBpedia.

More recently, novel end-to-end QA approaches have been proposed, essentially based on deep learning algorithms [7, 23]. In particular, Yu et al. [25] match the semantic representations of questions and candidate answers by means of Convolutional Neural Networks. In another case, the feedback of the users is taken into account [22] to improve the quality of the answers.

Furthermore, most of the existing QA systems are meant to be open-domain [4, 8, 9, 12, 21], while very few are meant to be closed-domain [20]. Of the studies on closed domains, only a fraction uses unstructured knowledge bases: the majority of works rely on structured knowledge bases [7, 16, 23], whereas a few others exploit unstructured knowledge bases [11, 24] or hybrid knowledge bases [15].

Summarizing, most of the studies target the English language and open domains, while few systems work on non-English languages and closed domains. Moreover, to the best of our knowledge, no system exists that answers factoid and description questions expressed in Italian by operating on unstructured documents in a closed domain. Compared to these previous works, our contribution consists in: (i) supporting both Factoid and Description questions in Italian; (ii) modeling the expected answer type by means of a two-level taxonomy; (iii) employing contextual stop words to generate the query from the question; (iv) filtering candidate answers by means of novel scoring procedures.

3 The Proposed Question Answering Pipeline

The proposed QA pipeline, depicted in Fig. 1, extends the one defined in [6] and operates on a corpus of documents in Italian related to a closed domain. It consists of four main phases: (i) Indexing; (ii) Question Processing; (iii) Information Retrieval; (iv) Answer Processing.

Fig. 1. The proposed QA pipeline.

3.1 Indexing

This phase consists of an off-line procedure that processes the documents of the corpus and produces annotations representing the semantic types of pieces of text. In particular, the technique of Predictive Annotations [17] has been employed here, which indexes not only the textual contents of the documents for the IR phase, but also the semantic classes of entities or pieces of text, in order to focus the search on candidate answers of the right type.

According to this idea, annotations are designed to represent classes of information that people ask questions about, overlapping with the expected answer types. Two different approaches have been used here to generate annotations for factoid and description answer types, respectively. In the first case, the annotations are anchored to entities and are used to univocally refer to them, whereas in the second one, they are anchored to entities belonging to sentences or short paragraphs and are used to univocally refer to these pieces of text.

A two-level hierarchy of semantic classes has been proposed for annotating precise entities in the text (i.e. factoid answer types), where Person, Entity, Location, and Date are examples of top-level categories, and Address, City, Region, State, and Artifact are examples of bottom-level categories. On the other hand, a particular semantic class, named Description, has been defined to annotate sentences or short paragraphs in the text (i.e. description answer types).
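
As a purely illustrative sketch, the taxonomy can be thought of as a mapping from top-level to bottom-level classes, plus the special Description class. The parent/child assignments below are assumptions made for the example only, not the exact taxonomy of the pipeline.

```python
# Illustrative sketch of the two-level taxonomy of semantic classes.
# The parent/child assignments are assumptions for the example only.
FACTOID_TAXONOMY = {
    "Person":   [],                                      # possible further specializations
    "Entity":   ["Artifact"],                            # assumed parent of Artifact
    "Location": ["Address", "City", "Region", "State"],  # assumed parent of these classes
    "Date":     [],
}

# The special class used to annotate sentences or short paragraphs.
DESCRIPTION_CLASS = "Description"
```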

3.2 Question Processing

This phase employs the following set of procedures to process the question text and extract a set of features, ultimately generating a query for the IR phase.

The Answer Type Detection identifies the expected answer type of a question, generally referred to as the Lexical Answer Type (LAT), in accordance with the hierarchy of semantic classes defined for annotating the documents. This task, also known as Question Classification, has been performed by adopting the deep learning approach described in [6], suitably tuned with an extended training set in order to handle novel LATs. For instance, the question “Where is the Sistine Chapel?” is classified with Location as its LAT.

The Named Entity Extraction identifies all the Named Entities (NEs) occurring in the question text, intended as entities, such as persons, locations, organizations, artifacts, and so on, that can be denoted with a proper name. For instance, in the question “Where is the Sistine Chapel?”, Sistine Chapel is a NE. Such NEs are subsequently used to generate the most appropriate query for the IR phase. For more details about this procedure, please refer to [6].

The Keyword Extraction extracts significant terms from the question and, in detail, nouns and high-content verbs. Further details are available in [6].

The NLP Metadata Extraction is in charge of applying standard NLP techniques, like POS Tagging, Lemmatisation, and Stemming, to the question terms in order to determine their role. More details are reported in [6].

The Stop Word Removal - SWR determines and removes stop words, intended as terms that add little or no information to the text and, therefore, are useless in the query for the IR phase. Two types of stop words have been considered here. The first type is represented by the most common words in the Italian language, such as prepositions, determiners, conjunctions, question words and so on, in accordance with a classical perspective. The second type includes words that are correlated to the LAT of the question and, for this reason, are superfluous. For instance, in the question “In what year did Michelangelo Buonarroti paint the Sistine Chapel?”, the word year is considered a stop word and thus removed, since it is tied to the question's LAT, i.e. Date. It is worth noting that, for this second type, the list of stop words changes depending on the LAT considered.
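
The following minimal sketch (in Python, with purely illustrative word lists and names) shows how such a two-level stop word removal could be realized; it is not the actual implementation of the pipeline.

```python
# Generic Italian stop words (illustrative subset).
GENERIC_STOP_WORDS = {"il", "la", "di", "in", "che", "quale", "dove", "quando"}

# Contextual stop words: terms made superfluous by the question's LAT
# (illustrative lists, one per LAT).
LAT_STOP_WORDS = {
    "Date":     {"anno", "data", "giorno"},   # e.g. "year" is dropped for Date questions
    "Location": {"luogo", "posto"},
    "Person":   {"persona"},
}

def remove_stop_words(tokens, lat):
    """Drop both generic stop words and the ones tied to the question's LAT."""
    contextual = LAT_STOP_WORDS.get(lat, set())
    return [t for t in tokens if t.lower() not in GENERIC_STOP_WORDS | contextual]
```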

The Query Formulation generates the query to pass to the IR phase by using a subset of the features collected by the aforementioned procedures, depending on the expected answer type. In detail, for the factoid type, all the features are used, whereas, for the description type, only two features are employed: the LAT and the NEs.
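
A possible sketch of this step, assuming the features are gathered in a simple dictionary with illustrative field names, is the following:

```python
def formulate_query(features):
    """Build the IR query from the extracted question features (sketch)."""
    lat = features["lat"]
    if lat == "Description":
        # Description questions: only the LAT and the named entities are used.
        terms = list(features["named_entities"])
    else:
        # Factoid questions: all the collected features contribute to the query.
        terms = list(features["named_entities"]) + list(features["keywords"])
    # The expected LAT restricts the search to text annotated with that class.
    return {"terms": terms, "expected_annotation": lat}
```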

3.3 Information Retrieval

This phase retrieves a set of documents (five by default) satisfying the query. Each returned document has a score and a list of annotations added in the Indexing phase. Each document is then split into sentences for further processing.

3.4 Answer Processing

This phase first filters and scores sentences containing candidate answers, intended as pieces of text annotated with the expected LAT and, next, selects the best candidate answer. Each sentence can have zero or more candidate answers.

Two categories of answer filtering procedures have been developed, namely Cutting filtering and Scoring filtering.

The Cutting filtering removes sentences without eligible candidate answers. Two procedures of this category have been implemented and are described below.

The Answer Type Filtering removes all the sentences that do not contain the expected LAT. For instance, given the question “When was the Sistine Chapel built?”, with Date as LAT, the sentence “The Sistine Chapel is a chapel in the Apostolic Palace, the official residence of the Pope, in Vatican City” is discarded, since it does not include a piece of text annotated with the expected LAT.

The Duplicated Named Entity Filtering - DNEF removes sentences containing a candidate answer annotated with the expected LAT but represented by a named entity also present in the question. For instance, given the question “Who was the father of Leonardo da Vinci?”, with Person as LAT and the sentence “Leonardo was the firstborn son of the notary Piero from Vinci”, it ensures that Leonardo is discarded as a possible answer, even if its LAT is equal to Person.
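
A minimal sketch of the two cutting filters is reported below; sentences are assumed to carry their annotated candidates as (text, LAT) pairs, and sentences left without a valid candidate are dropped. Data structures and names are illustrative, not those of the actual pipeline.

```python
def answer_type_filter(sentences, expected_lat):
    """Keep only the sentences containing at least one candidate of the expected LAT."""
    return [s for s in sentences
            if any(lat == expected_lat for _, lat in s["candidates"])]

def duplicated_ne_filter(sentences, question_nes, expected_lat):
    """Discard candidates of the expected LAT that repeat a named entity of the question."""
    kept = []
    for s in sentences:
        s["candidates"] = [
            (text, lat) for text, lat in s["candidates"]
            if not (lat == expected_lat
                    and any(text.lower() in ne.lower() for ne in question_nes))
        ]
        if s["candidates"]:          # sentences left with no candidate are removed
            kept.append(s)
    return kept
```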

With reference to the second category of answer filtering procedures, the Scoring filtering assigns a score to each candidate answer. A collection of filtering procedures of this category has been proposed, where each of them is independent of all the others and may be applied depending on the type of answer. All the scores they calculate are normalized and exhibit a monotonic behavior, i.e. they vary consistently with the likelihood of the answer being correct. These procedures are described as follows.

The Keywords Overlapping scores each candidate answer in relation to the number of keywords (not including stop words and NEs) occurring in both the sentence containing it and the question. For instance, given the question “Which famous artwork did Michelangelo realize based on a commission from the Pope?” and the sentences “The Sistine Chapel is the most famous artwork realized by Michelangelo, commissioned by the Pope Julius II...” and “Pope Julius II requested Michelangelo to paint the ceiling of the Sistine Chapel, that now is considered one of the most notorious artworks of Michelangelo”, the procedure assigns a higher score to the candidate answer “Pope Julius II” in the first sentence, since it has a greater percentage of keywords matching the question.
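
A minimal sketch of this score, computed here as the fraction of question keywords found in the sentence (an assumption on the exact normalization), could be:

```python
def keywords_overlapping(question_keywords, sentence_tokens):
    """Fraction of question keywords (stop words and NEs excluded) found in the sentence."""
    if not question_keywords:
        return 0.0
    tokens = {t.lower() for t in sentence_tokens}
    matched = sum(1 for k in question_keywords if k.lower() in tokens)
    return matched / len(question_keywords)   # normalized in [0, 1]
```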

The NE Matching scores each candidate answer depending on the number of NEs occurring in both the sentence containing it and the question. For instance, given the question “When did Pope Julius II hire Michelangelo to paint the Sistine Chapel?”, with “Pope Julius II”, “Michelangelo” and “Sistine Chapel” as NEs, the procedure assigns the greatest score to the candidate answer “between 1536 and 1541”, since it is contained in a sentence like “between 1536 and 1541, Pope Julius II commissioned Michelangelo to paint the giant fresco The Last Judgment behind the Sistine Chapel altar.” containing all those NEs.
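
Analogously, a sketch of the NE Matching score, again normalized as a fraction (an assumption), is:

```python
def ne_matching(question_nes, sentence_text):
    """Fraction of question named entities that also occur in the candidate's sentence."""
    if not question_nes:
        return 0.0
    text = sentence_text.lower()
    matched = sum(1 for ne in question_nes if ne.lower() in text)
    return matched / len(question_nes)        # normalized in [0, 1]
```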

The Proximity with Gap - PwG scores each candidate answer in relation to the distribution of question terms that are present in the sentence containing it. This procedure is inspired by the dynamic programming algorithms typically used to align sequences. In particular, given the text containing a candidate answer to a question, the procedure inspects all the neighboring terms of the candidate answer and, in case of a match with a question term, it increments the partial score by a default value equal to 1, whereas, in case of a gap, it decrements that score by a value calculated using the series described in Eq. 1, where X is the set of terms of the sentence placed on the left/right of the candidate answer. The index i is set back to 2 every time a match is encountered.

$$\begin{aligned} \sum _{i = 2}^{|X|}{\frac{1}{2^{i}}} \end{aligned} \qquad (1)$$

The procedure looks at both sides of the candidate answer, keeping the calculation of the partial scores distinct. At the end, the partial scores calculated on the two sides are summed and normalized to form the final score. For instance, given the question “Who hired Michelangelo to paint the Sistine Chapel’s ceiling?”, the procedure calculates a greater score for the candidate answer “Pope Julius II” in the sentence “In 1508, Pope Julius II hired Michelangelo to paint the ceiling of the Sistine Chapel” than for the same answer in the sentence “Pope Julius II, after 5 years from the beginning of his reign, hired Michelangelo to paint the ceiling of the Sistine Chapel, which later became the most famous piece of artwork of the artist.”. Indeed, the matching terms around the candidate answer are less sparse in the first sentence than in the second one.
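
The following sketch implements the scoring rule just described (+1 per matching neighboring term, a penalty from the series of Eq. 1 per gap, with the index reset after every match); the terms on each side are assumed to be ordered moving away from the candidate answer, and the final division by the number of inspected terms is an assumed normalization.

```python
def pwg_side_score(terms, question_terms):
    """Score one side of the candidate answer; terms are ordered outwards from it."""
    score, i = 0.0, 2
    for term in terms:
        if term.lower() in question_terms:
            score += 1.0            # match with a question term
            i = 2                   # the gap index is set back to 2 after a match
        else:
            score -= 1.0 / 2 ** i   # gap penalty taken from the series in Eq. (1)
            i += 1
    return score

def pwg(left_terms, right_terms, question_terms):
    """Sum the partial scores of both sides and normalize (assumed normalization)."""
    q = {t.lower() for t in question_terms}
    raw = pwg_side_score(left_terms, q) + pwg_side_score(right_terms, q)
    total = len(left_terms) + len(right_terms)
    return raw / total if total else 0.0
```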

The Proximity of Highlighted Text to LAT - PHTL scores each candidate answer depending on its average distance, in terms of characters, from all the terms in the sentence containing it that match the query terms, named highlighted terms. For instance, given the question “Who painted the Sistine Chapel?”, this procedure assigns a greater score to the candidate answer “Michelangelo” in the sentence “In 1508 Michelangelo painted the Sistine Chapel, at the request of Pope Julius II” than to the same answer in the sentence “In 1508 Michelangelo went back to Rome and received the request from Pope Julius II to paint the Sistine Chapel”. Indeed, in the first sentence the matching terms are closer to the candidate answer than in the second one.
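
A sketch of this score follows; turning the average character distance into a normalized score via 1/(1 + distance) is an assumption, the only requirement stated above being that closer highlighted terms yield a higher score.

```python
def phtl(sentence, candidate, highlighted_terms):
    """Score a candidate by its average character distance from the highlighted terms."""
    text = sentence.lower()
    cand_pos = text.find(candidate.lower())
    if cand_pos < 0:
        return 0.0
    distances = []
    for term in highlighted_terms:
        pos = text.find(term.lower())
        if pos >= 0:
            distances.append(abs(pos - cand_pos))
    if not distances:
        return 0.0
    avg_distance = sum(distances) / len(distances)
    return 1.0 / (1.0 + avg_distance)   # smaller average distance, higher score
```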

Finally, the Answer Selection procedure identifies as the right answer the candidate having the highest score. The final score is calculated as the normalized algebraic sum of the scores assigned by each of the aforementioned Scoring filters. It is worth noting that, in this sum, all the scores produced by the Scoring filtering procedures have the same weight, equal to 1.
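
In code, the selection step reduces to an equally weighted sum followed by an argmax, as in the sketch below (names are illustrative):

```python
def select_answer(candidates, scoring_filters):
    """Pick the candidate with the highest sum of scoring-filter scores (equal weights of 1)."""
    def total_score(candidate):
        return sum(score(candidate) for score in scoring_filters)
    return max(candidates, key=total_score)
```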

4 Experimental Evaluation

The QA pipeline is implemented as a RESTful API, as described in [6], and an ad-hoc tool named Carya has been used to test it. In more detail, the corpus used for testing consists of 16 documents in Italian crawled from the Internet and specific to a cultural heritage theme of the city of Naples. The test suite consists of approximately 920 questions, split according to their LAT and stored in CSV files. The LATs considered are Artifact, Date, Description, Location, and Person.

Several configurations of the QA pipeline have been used for the tests. The first one, named Old Pipeline, represents the first version of the pipeline described in [6]. It includes Answer Type Filtering as Cutting filtering, and NE Matching and Keywords Overlapping as Scoring filtering. Next, this configuration is enriched with the remaining filtering procedures defined here, by adding first one procedure at a time, then pairs of procedures, and finally all of them together.

Measurements of both performance and accuracy have been carried out. The response time remains under 5 s, and under 3 s on average. Table 1 shows the Accuracy@1 for each new filtering procedure added, and for their combinations.

Table 1. Accuracy@1 for each filtering procedure and their combinations.

PwG contributes the most, with an average boost of 5.42% and the greatest improvement when the question type is Date (18.08%). PHTL instead generates an average improvement of 4.74%, with a peak of 14.75% when the question type is Artifact. DNEF improves the number of correct answers in the first position by 1.74% (Person), and also the number of correct answers in the second and third positions, with an increment of 0.87%, again when the question type is Person.

Based on these results, a new configuration, named New Pipeline, has been manually defined that applies filtering procedures depending on the question type. So, for example, it does not apply PHTL when the question type is Person.

On top of this configuration, another one, named New Pipeline + SWR, has been set, which adds the Stop Word Removal procedure to the pipeline. The three configurations Old Pipeline, New Pipeline, and New Pipeline + SWR are compared and the results are listed in Table 2.

Table 2. Accuracy@1 for different configurations of the QA pipeline.

It is worth noting that the Old Pipeline configuration cannot be applied to questions whose type is Description, since the previous version of the pipeline did not address this question type. The New Pipeline configuration has a positive impact on three question types, showing an average improvement of 10.06% with a peak of 17.97% (Artifact). For Location, the Accuracy@1 drops by 0.72%. The New Pipeline + SWR configuration shows an average improvement of 2.61%, ranging from a minimum of 1.54% (Date) to a maximum of 8.9% (Location). This configuration boosts the system performance, achieving an average improvement, in comparison to the Old Pipeline, of 12.67%, with a minimum of 6.08% (Artifact) and a maximum of 18.47% (Date). It is worth highlighting that the question type Description is not taken into account when computing the average improvements, since the filtering procedures are not used for it.

5 Conclusion

The paper presented a Question Answering pipeline for Italian based on a corpus of documents pertaining to a closed domain. The novelty of this pipeline essentially relies on its capacity of: (i) analyzing natural language questions in Italian by using lexical features; (ii) handling both factoid and description answer types and, depending on them, filtering contextual stop words from the questions; (iii) scoring and selecting candidate answers with respect to their type in order to determine the best one.

The proposed pipeline has been evaluated using an Italian corpus of 16 documents crawled from the Internet and specific to a cultural heritage theme of the city of Naples. The response time is under 3 s on average, whereas the percentage of correct answers returned as the first response is well over 70%. These results show the positive impact of the new Question Processing and Answer Processing procedures on the whole pipeline.

Although the results are encouraging, there is still room for improvement. In particular, further developments will concern the usage of: (i) syntactic features, exploiting the structure of a sentence, to improve the relevance of the terms included in the query and the selection of answer candidates; (ii) semantic features, exploiting WordNet, to automatically expand the query, improve the quality of answer filtering, and automatically build a set of stop words per LAT; (iii) a machine learning approach to adaptively combine the scores produced by the scoring filtering procedures and determine the best answer.