
1 Introduction

Searching prior art to determine whether a patent can be granted for a claimed invention is one of the most crucial and time-intensive aspects of patent examination. Deciding whether any previous publication discloses the claimed invention requires a comprehensive prior art search. Currently, patent examiners perform extensive keyword-based searches over existing patents and other published documents (non-patent literature), drawing on a variety of sources such as technical documents and computer databases to identify terms that are related to the filed patent application. Manually searching and evaluating results in this way is laborious and time consuming, and the length of the search depends on the complexity of the invention. Furthermore, the overall efficiency of identifying related patent terms in existing patents and publications relies heavily on the expertise of the patent examiner.

In recent years, word embeddings have been shown to represent words in a meaningful way and have thus become popular in natural language processing applications [4,5,6]. In word embedding techniques, each word is represented as a vector in an n-dimensional space, learned from a very large dataset, such that the semantic relationships between words are captured. Most word embedding techniques are based on Harris' distributional hypothesis [7], which states that words that occur in the same contexts tend to have similar meanings. Individual words are represented as vectors of real numbers derived from their contexts, so that similar words tend to have similar vector representations. Over the years, numerous algorithms have been developed to generate word embedding models [8,9,10,11,12]. More recently, methods based on shallow neural networks that can learn phrase vector representations have been introduced and have been found to be very effective on word similarity and analogy tasks [2,3].

The purpose of this study is to develop a centralized user interface that can help patent examiners identify related patent terms and thus aid in making decisions regarding prior art.

2 Related Terms Search

We have built a search system that allows patent examiners and other domain experts to evaluate the quality of related patent terms returned by different models. In our study, we first created clusters of patent data corresponding to areas of interest such as Computer Architecture, Biology, Surface Transportation, Plants, and Optics. Word embedding models were then trained on each of these clustered groups of patent data and used to query and identify related patent terms in the patent corpus. For example, when “apple” is entered as an input key term, all the terms related to apple are displayed for the Computer Architecture group and the oil wells group (see Fig. 1). Our application thereby provides a centralized interface that acts as a knowledge management and curation tool for patent examiners in identifying related patent terms.
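To make this multi-model lookup concrete, a minimal sketch in Python using the gensim library is shown below. The per-field model file names (computer_architecture.kv, oil_wells.kv) and the choice of gensim are illustrative assumptions, not details taken from our system.

from gensim.models import KeyedVectors

# Hypothetical file names for two per-field embedding models.
models = {
    "Computer Architecture": KeyedVectors.load("computer_architecture.kv"),
    "Oil Wells": KeyedVectors.load("oil_wells.kv"),
}

key_term = "apple"
for field, vectors in models.items():
    if key_term in vectors:  # skip fields whose vocabulary lacks the term
        print(field, vectors.most_similar(key_term, topn=10))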

The “Technical Fields” menu allows users to choose the different models to compare. For each chosen model, a tile with the top 10 closest words appears in the results. Clicking on the ellipsis expands the result set and shows more terms. By clicking on the star icons, users can select multiple terms across several models. As shown in the “Saved Terms” menu on the right side, these selected terms can then be exported. The following sections describe the dataset variations and the alternative word embedding technologies used to create these models.

3 Dataset

The patent text of granted United States patents and pre-grant published patent applications, publicly available at http://patents.reedtech.com/, serves as the primary data source. All of the patent text, including the full text describing the invention as well as tables, mathematical expressions, etc., is used to train the word embedding models.

The patent dataset is first clustered into different areas of interest such as Computer Architecture, Biology, Surface Transportation, Plants, and Optics. This clustered raw data is pre-processed; for example, stop words and non-alphanumeric characters are removed. The processed text serves as input for generating the word embedding models (see Fig. 2).

Fig. 2. System architecture.
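A minimal pre-processing sketch in Python is shown below. The stop-word list, the tokenization scheme, and the variable cluster_documents (a list of raw patent texts for one cluster) are illustrative assumptions, since the paper does not specify the exact tools used.

import re

# Illustrative stop-word list; the exact list used is not specified here.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "or", "the", "to"}

def preprocess(raw_text):
    """Lower-case the text, drop non-alphanumeric characters, and remove stop words."""
    cleaned = re.sub(r"[^0-9a-z\s]", " ", raw_text.lower())
    return [token for token in cleaned.split() if token not in STOP_WORDS]

# Each patent document in a cluster becomes one token list; the resulting
# list of lists is the input format expected by the trainers in Sect. 4.
sentences = [preprocess(doc) for doc in cluster_documents]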

4 Methods

Two state-of-the-art methods are used for generating word embeddings: Word2Vec [2] and fastText [3].

4.1 Word2Vec

The processed clustered data is trained using the Word2Vec neural network model. Word2vec is a family of related models that learn word embeddings in an unsupervised manner. These models are shallow neural networks trained to capture the semantic relationships between words. Word2vec models take a large corpus of text as input and encode each unique token as an N-dimensional vector. This mapping of words to vectors places words with similar contexts close to one another in the vector space. The following hyperparameters were set for training the skip-gram model: minimum count of words, number of negative samples, embedding size, and epochs to train.
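A minimal training sketch, assuming the gensim implementation of Word2Vec and illustrative hyperparameter values (the paper lists the parameters but not their values), is given below.

from gensim.models import Word2Vec

# `sentences` holds the tokenized patent text of one technical-field cluster
# (see the pre-processing sketch in Sect. 3). Values below are illustrative.
model = Word2Vec(
    sentences,
    sg=1,             # skip-gram architecture
    vector_size=300,  # embedding size
    negative=10,      # number of negative samples
    min_count=5,      # minimum count of words
    epochs=10,        # epochs to train
)

# Related-term lookup against the trained model.
print(model.wv.most_similar("apple", topn=10))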

4.2 fastText

We also trained the processed clustered data using the fastText word embedding method. In the process of computing embeddings, fastText enriches word vectors with subword information using additional character n-grams of variable length. This allows the algorithm to capture prefixes, suffixes, morphological nuances, and syntactic structure, and makes it useful for morphologically rich languages. The following hyperparameters were set for training the fastText word vector model: window size, learning rate, wordNgrams, minimum count of words, dimensions, and epochs to train.
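A corresponding sketch using the fastText Python bindings is shown below; the corpus file name and the hyperparameter values are illustrative assumptions.

import fasttext

# "cluster_corpus.txt" is a hypothetical file containing the pre-processed
# patent text of one technical-field cluster, one document per line.
model = fasttext.train_unsupervised(
    "cluster_corpus.txt",
    model="skipgram",
    ws=5,           # window size
    lr=0.05,        # learning rate
    wordNgrams=2,   # word n-gram length
    minCount=5,     # minimum count of words
    dim=300,        # dimensions
    epoch=10,       # epochs to train
)

print(model.get_nearest_neighbors("apple", k=10))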

5 Infrastructure

Amazon Web Services (AWS) cloud infrastructure and Docker [13] were utilized for identifying related patent terms. An AWS m4.16xlarge spot instance (64-core Intel Xeon E5-2676 v3 Haswell processor and 256 GB DDR3 RAM) was used for generating the word embeddings. AWS EC2 spot instances were chosen because they provide surplus computing capacity at a lower price than on-demand instances. In addition, lightweight Docker containers were configured to ease the configuration and setup of the TensorFlow framework [14].

6 Explanation – Search Logs

Word embedding models are a flexible and general way to capture co-occurrence relations between words. However, these models do not offer explanations or reasons for their answers. When users spot an unexpected word among the related terms, they often want to know why it appears. Our use case, patent search, lends itself to explaining such occurrences. By mining patent search logs shared by examiners, we are able to show instances where patent examiners have used the two terms together in a search query. These examples offer insight into why and how the related term is used (Fig. 3).

Fig. 3. Explanation from search logs.
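A minimal sketch of this log-mining step is given below. The CSV layout (one query per row under a "query" column) and the example terms are assumed for illustration and do not reflect the actual structure of the examiner search logs.

import csv

def co_query_examples(log_path, key_term, related_term, limit=5):
    """Return up to `limit` logged examiner queries containing both terms."""
    examples = []
    with open(log_path, newline="") as log_file:
        for row in csv.DictReader(log_file):   # assumed column name: "query"
            query = row["query"].lower()
            if key_term in query and related_term in query:
                examples.append(row["query"])
                if len(examples) == limit:
                    break
    return examples

# Illustrative call: show why a suggested related term co-occurs with the key term.
print(co_query_examples("search_logs.csv", "apple", "gala"))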

7 Conclusion

Using recent advances in word embedding techniques [1], we have been able to produce several models for identifying patent terms related to a keyword. However, no single model was sophisticated enough to match an expert's choice of keyword expansion. In this effort, we have designed and implemented a centralized user interface that allows users to consolidate and curate related terms across the different models. In addition, the interface offers explanation and insight into why a related term was suggested. In future work, we would like to explore crowdsourcing of the curated patent terms.