1 Introduction

The World Intellectual Property Organization reports that more than 3 million patents were published worldwide in 2016, 8.3% more than in 2015 [1]. Alongside this massive source of technical knowledge, scientific articles are also a major source of expert content, with at least 2.5 million new scientific papers published each year since 2013 [2]. While patents are legal documents that provide technical information about the innovations they aim to protect, scientific articles provide theoretical as well as technical answers that help the scientific community understand and gain control of its environment through diverse applications. Because the number of publications is so large and they are not always easily accessible, science news websites have developed on the Internet. They share important research and innovative breakthroughs in a condensed and accessible form, making it possible for the scientific community to stay up to date and learn about advances in other science-related fields.

With innovation moving very fast, engineers face the challenge of finding creative ideas that may lead to innovation. To support them in their innovation process, research has been conducted since the 1990s to develop and adapt Altshuller’s theory of inventive problem solving (TRIZ) [3]. Yet, although patents seem to be the best way for engineers to find solutions [4], this does not mean that other science-related documents cannot also lead to innovation. In order to determine the significance of such documents for innovation, we need to find out whether they include enough content that is useful for applying TRIZ. Over the last few years, our team has been developing a tool for the automated extraction of knowledge related to the Inventive Design Method (IDM) from English-language patents [5].

Building on this research, we want to adapt the information extraction from patents to scientific papers and science news articles, i.e. to automate the extraction of topics, problems, partial solutions, evaluation and action parameters, as well as values, from these two alternative types of science-related documents.

This is a challenging project in that both scientific research articles and science news articles are unstructured data, with much more freedom in their structure and writing style than patents. Therefore, we need to evaluate the structural, syntactic and semantic features implemented in the patent-knowledge extraction tool [5], and then find new features that are specific to the new document types. This implies the use of machine learning and other Natural Language Processing methods.

In this article, we give an overview of the literature on the Inventive Design Method (IDM) and its tool for patent-knowledge extraction, followed by the literature on scientific research papers and science news articles (Sect. 2). We then present our methodology, along with the current advances on the project and an emphasis on our evaluation method (Sect. 3).

2 State of the Art

2.1 The Inventive Design Method

The Inventive Design Method (IDM) is an application method derived from TRIZ, the theory of inventive problem solving. It describes four steps for innovation [6]: during the first phase, the users must extract knowledge and organize it into a graph of problems and partial solutions. With this graph, they must then formulate a set of contradictions, which will be solved individually in phase 3, and finally, they must choose the most innovative Solution Concept before they can invest in it and set it up.

For the contradiction formulation, the Inventive Design Method offers a formal and practical definition of the broad TRIZ notion of contradiction, which is very useful for industrial innovation and introduces other notions linked to the IDM ontology [7]. This contradiction “is characterized by a set of three parameters […] where one of the parameters can take two possible opposite values Va and \( \overline{Va} \)” (Fig. 1). The first parameter is called the action parameter (AP) and is defined by its ability to “tend towards two opposite values” and to “have an impact on one or more other parameters”. Moreover, “the designers have the possibility of modifying them”. The two other parameters in a contradiction are called evaluation parameters (EP) and are defined by their capacity to “evolve under the influence of one or more action parameters”, thus making it possible to “evaluate the positive aspect of a choice made by the designer”.

Fig. 1. Possible representation model of contradictions [7]
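As an illustration, the three-parameter contradiction described above could be held in a small data structure. This is only a minimal sketch and not the model used in the tool from [5]: the class name, the field names and the wall-thickness example are hypothetical, and the reading that each of the two opposite values favours one evaluation parameter is the usual TRIZ interpretation, assumed here because Fig. 1 is not reproduced in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contradiction:
    """One contradiction: an action parameter (AP) and two evaluation parameters (EP).

    All field names are hypothetical and chosen for this sketch only.
    """
    action_parameter: str         # AP: can be modified by the designers
    value: str                    # Va
    opposite_value: str           # the opposite value (Va with an overline in the text)
    ep_improved_by_value: str     # EP assumed to benefit when the AP takes Va
    ep_improved_by_opposite: str  # EP assumed to benefit when the AP takes the opposite value

    def describe(self) -> str:
        """Spell out the contradiction as a sentence."""
        return (
            f"If {self.action_parameter} is {self.value}, "
            f"{self.ep_improved_by_value} improves but "
            f"{self.ep_improved_by_opposite} degrades; "
            f"if {self.action_parameter} is {self.opposite_value}, the reverse holds."
        )


# Invented example, for illustration only.
example = Contradiction(
    action_parameter="wall thickness",
    value="high",
    opposite_value="low",
    ep_improved_by_value="strength",
    ep_improved_by_opposite="lightness",
)
print(example.describe())
```

Keeping the two values and the two evaluation parameters in one record makes it straightforward to check that an extracted parameter has both an opposite value and at least one evaluation parameter attached before it is proposed as a contradiction.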

Knowing how a contradiction must be formulated helps with both knowledge extraction and graph creation, because the elements that must be extracted from the documentation in priority are: problems, partial solutions, evaluation and action parameters, as well as their possible values Va and \( \overline{Va} \). As for problems and partial solutions, research on IDM provides clear definitions, both of their form (syntax and graphical representation) and of their content [8]. A problem (Fig. 2) “describes a situation where an obstacle prevents a progress, an advance or the achievement of what has to be done”. A partial solution (Fig. 3) “expresses a result that is known in the domain and verified by experience.” Cavallucci et al. describe the concept of partial solution more precisely [8]:

Fig. 2. Graphical representation of a problem

Fig. 3. Graphical representation of a partial solution

“It may materialize a tacit or explicit knowledge of one or more members of the design team upon their past experience, a patent filled by the company or a competitor or any partial solution known in the field of competence of the members of the design team. We wish also to remind that a partial solution is supposed to bring the least possible uncertainty about the assertions of its effects on the associated problem. Confusion can appear between a “solution concept” (which is the result of an assumption made by a member) and a “partial solution”, which has been validated by experience, tests, calculations or results known and verified.”

2.2 Patent-Knowledge Extraction

An important principle of both TRIZ and IDM is that inventive solutions are generally found by analogy, and therefore thanks to solutions or tools that belong to another domain. Patents are documents with very rich technical content, which is generally not found elsewhere, for instance in scientific papers [4]. In order to save engineers time and to help them reach these analogies, Souili et al. developed a tool that automatically creates a graph of problems, partial solutions and parameters from English-language patents selected by a user in a large database [5] (Fig. 4). This corresponds to the first phase of the Inventive Design Method, which has been automated thanks to Natural Language Processing techniques.

Fig. 4. An example of a problem-solution graph

2.3 Scientific Research Papers and Science News Articles

Scientific Research Papers.

Many studies have been conducted on automated information extraction from scientific research papers. While some of them are strongly tied to a domain or a specific application (such as biology) [9,10,11,12,13], others present, for general purposes, the extraction of general features of articles such as keywords or key phrases [13,14,15,16]. However, scientific research papers have not yet been studied for TRIZ or IDM applications.

One of the main challenges for the adaptation of the patent-knowledge extraction tool is to understand and characterize the differences between the different types of documents: patents, scientific research papers and science news articles.

While all patents have a very similar structure with four sections, despite the higher degree of freedom in the description section, research papers differ much more from one another. For instance, section titles, even when they indicate the same concepts, are not formulated in the same way. Moreover, the number and topics of the sections vary considerably, which makes text-mining techniques more difficult to apply.

However, much research has been conducted on the structure and content of research papers. Pontille, building on previous research by Bazerman [17], describes a generic structure named IMRAD (Introduction, Materials and methods, Results And Discussion) [18] as a stable tool for the harmonization of professional practice; “a particular form of proof expression” and “a standardized argumentative matrix” which became an official norm in the USA in 1979 with the Standard for the preparation of scientific papers for written or oral presentation.

According to Pontille, authors generally define a problem and propose hypotheses in the introduction. Therefore, it will be interesting for our research to study how the main problem is introduced in this section, in order to improve its extraction.

The section on “Materials and methods” explains “the way the study has been conducted” and has become central to the argumentation in recent decades [18]. In this section, we will not focus on intermediate problems and solutions, since they are often criticized in the course of the article. However, Pontille mentions the occasional presence of a subsection describing “variables”, which we can link to the IDM-ontology concept of parameter. Therefore, it could be interesting to extract the main parameters and to find a way to determine, with the help of their limited context, whether they are action or evaluation parameters, i.e. whether they have an impact on the final output and can be modified, or whether they are central to the evaluation of the output.

The discussion section, as described by Pontille, is also interesting for our extraction tool. Since authors use it for “identifying the practical and theoretical implications of their analysis” and “opening perspectives for future research” [18], we could find there a restatement of the main problem, hopefully the main partial solution of the article, and potential unresolved new problems. Moreover, since this section interprets the evaluation of the results, it will be particularly interesting to look for evaluation parameters here.

Regarding content, Gosden’s study on rhetorical themes, and more specifically on contextualizing frames throughout research articles [19], analyzes their functions and even draws some links to problems and solutions. It is therefore very relevant for extracting features that will help us in our information extraction task.

The abstract is also a key part of a research article. According to a study of articles in the fields of linguistics and educational technology, three moves seem to be unavoidable in research article abstracts: “Presenting the research”, “Describing the methodology” and “Summarizing the findings” [20]. These moves are very close to actual sections of the IMRAD structure, and since the abstract is a summary of the research, it very probably contains the major problem, partial solution and parameters.

Science News Articles.

Science news articles are not a widely studied research topic, but media coverage of science is [21,22,23,24,25,26,27]. An increasing amount of science-related content is published regularly on the Internet, in very diverse forms: online versions of traditional print newspapers and magazines, science sections of online newspapers, science blog articles from professional researchers, videos, specialized science news websites, etc. While traditional online sources are declining in terms of audience, nontraditional online sources such as blogs and “other online-only media sources for information about specific scientific issues” now attract much more public attention [21]. Science news websites therefore seem to be interesting sources of condensed information, with new publications appearing regularly in multiple research fields.

This kind of source is nevertheless contested in the literature because of doubts about the trustworthiness of its content [22]. Since the target audience is interested in scientific innovation and breakthroughs but does not have an extensive scientific background, information is often summarized, with the risk that it diverges from the intention of the original author. In order to prevent engineers from using erroneous extracted data during the IDM process, we will have to compare the results of the extraction from the science news article on the one hand, and from the full original document on the other hand.

3 Methodology

The information extraction tool we aim to create is mainly an adaptation of the existing tool used with patents. This adaptation requires a certain number of steps, from building a corpus to the evaluation of the final program. As mentioned above, the major challenge for this project is to adapt the features needed to select the candidate sentences for problem or partial solution classification.

3.1 The Corpus Creation

Annotated Corpus.

The first step for adapting the extraction tool is to build an annotated corpus that will be used for classification, in order to extract the main linguistic features (words or phrases) for each category – problems, partial solutions, evaluation parameters, action parameters, values.

Since research articles and science news articles have very different structures, they should not be treated the same way, but preferably with two separate corpora. Moreover, there are multiple sources publishing articles in diverse fields, with differences in the overall structure as well as in the length and style of the articles. Therefore, each corpus must contain articles from different sources.

We chose to give the annotation task to 44 students enrolled in a course on Inventive Design. Given this constraint, we built both corpora so that they match the level and number of students who will annotate them.

For the annotation task, all articles were cleaned and, when necessary, converted to PDF for reading comfort and to keep the annotation procedure homogeneous. This was needed at least for the science news articles, which were originally in HTML format.

The corpus of science news articles contains 44 articles from the 7 following sources:

  • Machine Design [28] – 6 articles, about 1–2 pages long

  • New Atlas [29] – 7 articles, about 1 full page long

  • Phys.org [30] – 7 articles, about 1–2 pages long

  • Research & Development [31] – 6 articles, about 1 page long

  • Science Daily [32] – 6 articles, about 1 full page long

  • Science News [33] – 6 articles, about 1–2 pages long

  • Science News for Students [34] – 6 articles, about 2 pages long

The corpus of scientific research articles contains 44 articles from the 4 following sources:

  • Accounts of Chemical Research [35] – 15 articles, about 8–10 pages long

  • Annual Review of Condensed Matter Physics [36] – 8 articles, 21–28 pages long

  • Chemistry of Materials [37] – 11 articles, 6–12 pages long

  • Proceedings of the National Academy of Sciences of the USA (PNAS) [38] – 10 articles, about 6 pages long

The annotation is done on the PDF files with the highlighting tool of Adobe Acrobat, following a given color legend for each element: problem, solution, evaluation parameter, action parameter, value. After collecting all annotations, and evaluating and modifying them when necessary, we use Sumnotes, a web application [39], to extract a list of all sequences corresponding to each element.

Clean Corpus for the Extraction.

For the extraction of the different elements related to IDM, we cleaned and transformed the articles into JSON format. Since most of the articles were available in HTML format, both for science news articles and scientific research papers, we used the BeautifulSoup library for Python 3 [40], which is a very efficient library for cleaning data from HTML and XML documents.
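The cleaning step can be pictured with the following minimal sketch. It deliberately uses Python's standard `html.parser` module as a dependency-free stand-in, not the BeautifulSoup library used in the project, and it is not the project's actual cleaning code. The set of skipped tags and the JSON field names (`title`, `paragraphs`) are assumptions made for this example.

```python
import json
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collects the <h1> title and <p> texts, ignoring scripts, styles and navigation."""

    SKIPPED = {"script", "style", "nav", "footer"}  # assumed boilerplate containers
    KEPT = {"h1", "p"}                              # assumed content-bearing tags

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._current = None   # content tag currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.KEPT and self._skip_depth == 0:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._current:
            # Join the collected pieces and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "h1":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._skip_depth == 0:
            self._buffer.append(data)


def html_to_json(html: str) -> str:
    """Return a JSON string with the article title and its paragraphs."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return json.dumps(
        {"title": parser.title, "paragraphs": parser.paragraphs},
        ensure_ascii=False,
    )


sample = (
    "<html><body><nav><p>Menu</p></nav><h1>New  alloy</h1>"
    "<p>It is <b>light</b>.</p><script>var x;</script></body></html>"
)
print(html_to_json(sample))
```

Storing paragraphs as a list keeps the original order of the text, which matters if later steps treat article sections differently.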

However, we are still looking for a way to efficiently clean PDF articles from PNAS, because the GROBID service [41] we used for transforming them into TEI-XML documents mixed up all the sections, therefore making our information extraction task impossible. In the meantime, we had to drop this source for the features extraction, implementation and evaluation steps.

3.2 The Extraction and Selection of Features

The adaptation of the linguistic features needed to select and classify sentences, keywords or phrases involves three steps. Its goal is to create lists and dictionaries of words, phrases or n-grams that make the extraction efficient and specific to both scientific research articles and science news articles.

The three steps are:

  • Patent features assessment

  • Specific features extraction

  • Features selection

Evaluation Method.

The whole feature adaptation process involves multiple evaluations, which have to be undertaken with the same parameters. The main purpose of the extraction program is to produce results that are good enough for further use in the inventive design process following IDM-TRIZ. All evaluations are based on a single Gold Standard for each element category (e.g. parameters or problems), built from the annotated corpora. The program has a two-step classification process for problems and partial solutions: the first step extracts candidate sentences and the second step confirms or rejects the first choice. The rejected sentences are then classified as “neutral”. For a complete evaluation, we also consider those sentences and add them to the “concepts” category (which thus contains all extracted problems, partial solutions and neutrals). The evaluation criteria are:

  • The average number of concepts, parameters, partial solutions, neutrals and problems extracted per article

  • The precision for each category, which is the ratio between the number of relevant elements found automatically in the category and the total number of elements automatically retrieved in the category.

  • The recall for each category, which is the ratio between the number of relevant elements retrieved automatically in the category and those found manually in the Gold Standard for this category.

  • The rate of misclassification, i.e. the proportion of sentences that should have been classified in another category but are nevertheless considered useful. We consider as “useful” neutrals those sentences that are really interesting for the research or its future developments but do not belong to the category of problems or partial solutions.
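The precision and recall criteria listed above can be sketched as follows. This is a minimal illustration rather than the evaluation script used in the project. It assumes that an extracted element counts as relevant only when it matches a Gold Standard entry exactly, a matching rule the text does not specify. The function name and the sample sets are hypothetical.

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Compute (precision, recall) for one category.

    precision = relevant extracted elements / all extracted elements
    recall    = relevant extracted elements / all Gold Standard elements
    Exact matching between extracted and gold elements is an assumption.
    """
    relevant = extracted & gold
    precision = len(relevant) / len(extracted) if extracted else 0.0
    recall = len(relevant) / len(gold) if gold else 0.0
    return precision, recall


# Invented example: four extracted problems, three problems in the Gold Standard.
extracted_problems = {"p1", "p2", "p3", "p4"}
gold_problems = {"p1", "p2", "p5"}
precision, recall = precision_recall(extracted_problems, gold_problems)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Returning 0.0 when a set is empty avoids a division by zero for categories in which nothing was extracted or annotated.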

Patent Features Assessment.

The first step is the assessment of the linguistic features from the original program, which was designed for patents. With a first minimal adaptation of the original program that keeps all the features intact, we assess the features globally, and then individually. The global assessment is made by evaluating the whole program following the procedure described above. It serves as a point of comparison for the individual assessments that follow, for which the program runs in a loop and evaluates its results, each time ignoring a different feature. This makes it possible to remove the features that do not have any positive impact on the final result. However, considering the relatively small number of articles in our corpus, the removal should be done manually, so that we can keep some features that we estimate could have an impact with a bigger corpus.
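The individual assessment loop described above amounts to a leave-one-feature-out comparison against the global score. The sketch below illustrates the idea under stated assumptions: `evaluate` stands for any function that runs the program with a given feature list and returns a single score, the toy scoring function and the feature names are invented, and the result is only a list of removal candidates, since the final decision is made manually.

```python
from typing import Callable, Dict, List


def ablation_report(
    features: List[str],
    evaluate: Callable[[List[str]], float],
) -> Dict[str, float]:
    """For each feature, return the score change observed when it is ignored."""
    baseline = evaluate(features)  # global assessment with all features
    deltas = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        deltas[feature] = evaluate(reduced) - baseline
    return deltas


def toy_evaluate(active: List[str]) -> float:
    """Invented scoring function: helpful features add 1, harmful ones subtract 1."""
    helpful = {"helps_1", "helps_2"}
    harmful = {"hurts"}
    return float(len(helpful & set(active)) - len(harmful & set(active)))


features = ["helps_1", "helps_2", "hurts", "no_effect"]
deltas = ablation_report(features, toy_evaluate)

# A non-negative change means the feature had no positive impact on the score.
removal_candidates = [f for f, d in deltas.items() if d >= 0]
print(removal_candidates)
```

In practice the score could be precision, recall or a combination of both; the text does not say which indicator drives the decision, so a single number is used here for simplicity.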

Specific Features Extraction.

The second step is the extraction of new features. This task will involve several tools for Natural Language Processing (NLP):

  • The feature selection tool included in the Weka software [42]

  • Wordclouds created with the programming language Python

  • Graphs of ranked token frequencies for a specific concept

  • A graph created with the Bokeh library in Python which shows the relative polarity of the most frequent tokens for partial solutions and problems

  • A manual selection of feature candidates observed during the creation of the Gold Standard.
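The data behind the ranked-frequency and relative-polarity graphs can be computed with a few lines of Python. The sketch below is an illustration only: the text does not define relative polarity, so the formula used here, (s - p) / (s + p) with s and p the relative frequencies of a token in partial solutions and in problems, is an assumption. The tokenizer, the function names and the two sample sentences are also invented.

```python
import re
from collections import Counter


def token_counts(sentences):
    """Count lowercase alphabetic tokens over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


def relative_polarity(problem_sentences, solution_sentences, top_n=10):
    """Map each of the top_n most frequent tokens to a value in [-1, 1].

    -1 means the token only occurs in problems, +1 only in partial solutions.
    The formula is an assumption made for this sketch.
    """
    prob = token_counts(problem_sentences)
    sol = token_counts(solution_sentences)
    total_prob = sum(prob.values()) or 1
    total_sol = sum(sol.values()) or 1
    most_frequent = [tok for tok, _ in (prob + sol).most_common(top_n)]
    polarity = {}
    for tok in most_frequent:
        p = prob[tok] / total_prob  # relative frequency in problems
        s = sol[tok] / total_sol    # relative frequency in partial solutions
        polarity[tok] = (s - p) / (s + p)
    return polarity


# Invented sentences, for illustration only.
problems = ["The process is expensive"]
solutions = ["The cost is reduced by using a coating"]
polarity = relative_polarity(problems, solutions, top_n=20)
for token, value in sorted(polarity.items(), key=lambda item: item[1]):
    print(f"{token:10s} {value:+.2f}")
```

Tokens with a polarity close to -1 or +1 are natural feature candidates for problems or partial solutions respectively, while tokens near 0 discriminate poorly between the two categories.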

Features Selection.

The third task consists in selecting the final linguistic features from the remaining patent features and the previously extracted candidate features, then implementing them in the main classification program, and finally assessing these features with the same methodology as for the patent features assessment. Depending on the results, it could be necessary to make slight changes and run a new evaluation in order to get the best possible performance.

4 Results and Outlook

Because the annotation task took more time than expected, we chose to focus on a corpus of 23 science news articles, from which we built a Gold Standard for each kind of element related to IDM (Table 1).

Table 1. Gold Standard characteristics

This Gold Standard was the basis of our evaluation during the project. The reference performance of the program (Table 2), established with the original features used for patents, shows a precision of about 40% for topics and parameters. These results are not very satisfying, but they are far less concerning than the results for problems and partial solutions, where only 30 elements were extracted, of which only one was classified as a partial solution, and wrongly so.

Table 2. Reference performances

With such initial results, we concentrated our efforts on improving the performance of the extraction of problems and partial solutions, especially through the linguistic feature assessment, extraction and selection tasks presented in the methodology section.

The assessment task allowed us to remove the following linguistic features: “common”, “high”, “manufacturing”, “complex”, “espial”, “field”, “step”, “might”, “claim 1”, “claim 2”, “claim 3”, “^-”, “c higher than those for e-glass”. Removing those linguistic features, combined with a few minor changes in the program, already improved the performance for problems, partial solutions and parameters.

The extraction and selection task allowed us to add 10 new linguistic features to the program. The following new linguistic features are useful for the extraction of partial solutions: “by”, “discovered”, “modeling”, “resulting”, “using”, “through”. The following linguistic features are useful for the extraction of problems: “because”, “expensive”, “how”, “often”.
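To make the role of these ten features concrete, the sketch below shows one possible way of using them to flag candidate sentences. It is an assumption-laden illustration, not the classification logic of the program: the text does not describe how features are matched, so simple whole-word matching is assumed, a sentence may receive both labels, and the function name and sample sentences are invented.

```python
import re

# The ten features reported above, grouped by the category they support.
SOLUTION_FEATURES = {"by", "discovered", "modeling", "resulting", "using", "through"}
PROBLEM_FEATURES = {"because", "expensive", "how", "often"}


def candidate_labels(sentence: str) -> set:
    """Return the categories for which the sentence is a candidate.

    Whole-word, case-insensitive matching is an assumption of this sketch.
    """
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    labels = set()
    if tokens & PROBLEM_FEATURES:
        labels.add("problem")
    if tokens & SOLUTION_FEATURES:
        labels.add("partial solution")
    return labels


# Invented sentences, for illustration only.
for sentence in (
    "Such sensors are often expensive.",
    "The team reduced drag by using a textured coating.",
    "The study was published in May.",
):
    print(sentence, "->", sorted(candidate_labels(sentence)))
```

A sentence that matches features of both categories would need the second classification step, or a finer segmentation of the sentence, to be resolved.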

The changes made to the program and to the linguistic feature lists during this project have led to a significant increase in the performance indicators (Table 3).

Table 3. Performances after the changes in the program and features lists

5 Outlook and Summary

In conclusion, this article reports on the project at CSIP to extend its automatic extraction techniques to research and science news articles in the field of the Inventive Design Method. Due to unexpected obstacles during the annotation phase of the project, we focused mainly on science news articles, for which we could evaluate and adjust the linguistic features, starting from another extraction program developed by our team and based on patents.

In spite of the increase in the program's performance indicators, the results still need to be improved before engineers can use the tool for their projects. A major point for improvement is to create bigger corpora of annotated articles, which combine quality and quantity.

Moreover, our program is able to extract information from research articles, but the number of extractions is too high (often nearly 70 extracted elements for one article). It therefore seems even more important to completely separate the extraction processes for science news articles and research articles, and, for research papers, to find new sets of linguistic features as well as a system that behaves differently depending on the article section.

A final avenue for improving precision and recall is to segment the article content more finely than at sentence level, making sure that a problem and a partial solution cannot be in the same meaning unit.