1 Introduction

Several contexts require finding the similarity between a pair of documents, and pairwise document similarity also lays the groundwork for clustering similar documents together. Most of the initial research in this domain was based on cosine distance over tf-idf term vectors. Topic modeling techniques such as LSA and LDA learn an intuitive set of topics from a given corpus of documents, and the topic distribution vectors of documents can be used to find document similarities or to cluster documents together. More recently, word2vec and doc2vec based document similarity methods have been gaining popularity.

All document clustering techniques group similar documents together while keeping dissimilar documents in different groups. Various document similarity measures such as cosine similarity, Dice's coefficient and Jaccard's coefficient [13] have been used in the literature to evaluate document similarity. However, what makes a pair of documents similar depends on the problem context. For example, finding the same joke [10] told differently is a vastly different problem from finding whether term paper submissions from two students are the same. Irrespective of the clustering technique used, most document similarity learning methods, e.g. LDA [17] and Doc2Vec [15], require a large corpus of data to learn document features well. Such methods do not work well if the corpus is small, or if only two documents are to be compared.

In our work of enabling hiring solutions via cognitive collaboration, wherein several agents/players such as a job match agent, a diversity champion and a cultural assessment agent come together to make a holistic hiring decision, one problem that we have faced repeatedly is identifying which job requisitions are similar. This problem arises in two contexts:

  1. Grouping jobs together: A typical application of machine learning in hiring is to learn success models for various jobs. To be meaningful, the models need to be learned at a sufficient level of granularity; hence the need to cluster jobs together. Grouping jobs together is also necessary for a cultural assessment agent, since similar assessments can be used for alike jobs.

  2. Matching candidates' previous jobs: Candidates' previous jobs need to be matched with the opening they apply to (or with the openings that will be recommended to them). This requires comparing the job descriptions from their previous jobs with the job openings available in the ATS (applicant tracking system).

Job requisitions typically consist of several well defined components: skill and years-of-experience requirements, job location and a job description. Since the rest are structured fields, the job title (covered in [8]) and the job description, which typically consists of the roles and responsibilities the job entails, become the primary components that need to be matched across jobs. RISE [19] proposes a method of job classification followed by a similarity establishment process that leverages both structured and unstructured components of a job.

In this paper, we compare a novel part-of-speech tagging based document similarity (POSDC) calculation method with doc2vec and LDA based topic modeling methods. POSDC analyzes the actions (verbs), the objects of each action (nouns) and the attributes of the objects (adjectives) that appear in the two job descriptions to be compared.

This paper is organized as follows. In the next section, we describe the literature on document similarity/clustering. Section 3 describes our job description similarity computation methodology and experimental setup. Section 4 explains the evaluation criteria. Section 5 concludes and discusses some future work.

2 Literature Survey

In typical text document classification and clustering tasks, the definition of a distance or similarity measure is essential. The most common methods employ keyword matching techniques. Methods such as TF-IDF [9] leverage the frequency of words occurring in a document to infer similarity. The assumption is that if two documents have a similar distribution of words or have common keywords, then they are similar. Researchers have also extended this to n-gram based models [16], where groups of consecutive words are taken together to capture context. Large n-gram models typically require a large corpus of documents to obtain sufficient statistical information.

[16, 17] extended these approaches to include a probabilistic generative model that explains the frequency of occurrence of words. These methods include PLSI (probabilistic latent semantic indexing) and LSA (latent semantic analysis). The assumption is that there is an underlying latent set of topics (each with its own distribution of words describing the topic) and each document is generated from a mixture of these topics. The methods described above fall in the category of bag-of-words models. The major limitation of bag-of-words models is that the text is essentially represented as an unordered set of words (or n-grams, for n-gram models). Long-range word relations are not captured, leading to loss of information. Another issue with these techniques is that they rely on the surface information of the words and not their semantics. That is, words with two or more meanings (polysemy) are represented in the same way and two or more words with the same meaning (synonymy) are denoted differently.

To alleviate the drawback of bag-of-words model, Le et al. [15] proposed Paragraph to Vector, an unsupervised algorithm that learns feature representations from variable-length pieces of texts, such as paragraphs. The algorithm represents each paragraph by a dense vector which may be used to predict words in the paragraph. Its construction captures semantics and has the potential to overcome the weaknesses of bag-of-words models. Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations and have achieved state-of-the-art results on text classification and sentiment analysis tasks [15].

Another approach to document similarity is via concept modeling (Wikipedia concepts [11], IBM Watson Natural Language Understanding service [4]). The main idea is to use many concepts from Wikipedia or any other encyclopedia to construct a reference space, where each document is mapped from a keyword vector to a concept vector. This captures the semantic information contained in the document. [18] have demonstrated the effectiveness of concept matching in overcoming the semantic mismatch problem. However, the concepts themselves are not independent. [12] extended Wikipedia matching to document clustering by enriching the feature vector of a text document using the correlation information between concept articles.

Fig. 1. Main flow describing the four steps for computation of job description similarity

3 Methodology

Our proposed approach to generate a similarity score between two job descriptions D and \(D'\) can be divided into four parts, as illustrated in Fig. 1. In this section we explain these steps, alongside their system implementation, in greater detail.

Fig. 2. Dictionary structure of keywords and their synonyms

Fig. 3. JSON structure of a dictionary entry

3.1 Document Representation

Each document is treated as a collection of sentences \(set_{sent}\), with each sentence sent being further represented as a collection of (action, object, attributes) triplets. Consider the following illustrative example.

Job Description Document

Determines operational feasibility by evaluating analysis, problem definition, requirements, solution development, and proposed solutions.

Representation of Job Description Document

  1. Action: determines, Object: feasibility, Attributes: [operational]

  2. Action: evaluating, Object: problem definition, Attributes: [ ]

  3. Action: evaluating, Object: requirements, Attributes: [ ]

  4. Action: evaluating, Object: solution development, Attributes: [ ]

  5. Action: evaluating, Object: solutions, Attributes: [proposed]

where action symbolizes the main activity described by that particular sent, object represents the entity on which the activity is acted upon, and attributes corresponds to the characteristics of the object. We will refer to the triplet as \(t_{POS}\), the set of triplets corresponding to a sentence sent as \(sent_{t_{POS}}\), and the set of triplets corresponding to a document D as \(D_{t_{POS}}\). Our hypothesis is that these sets of triplets can properly describe a job description document. To verify this, we performed a small experiment. We chose five people (experts in the job analytics domain) and gave them the generated sets of triplets for 10 job descriptions. Without seeing the original job description documents, they could easily extract the actual essence of the job descriptions from these triplet sets.

As all the job description documents are in English, without loss of generality it can be said that the main activity of a sent, i.e. the action, is represented by the non-auxiliary verb v. The entity on which the activity v acts is generally the object noun corresponding to v in sent. Characteristics of an entity are portrayed by adjectives in English, so we depict the attributes of an object as the adjectives corresponding to the object noun present in sent. We assume all the sentences in the job description documents are in a particular format from which we can extract the triplets. In cases where an entity of \(t_{POS}\) contains multiple elements, such as in the case of compound nouns, a list of elements is created instead.
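As an illustration, the representation of the example sentence above can be rendered as a simple Python data structure; the Triplet type and its field names below are our own choice, not part of the paper.

from collections import namedtuple

# Illustrative rendering of the (action, object, attributes) triplets for the
# example sentence; compound nouns become lists of elements.
Triplet = namedtuple('Triplet', ['action', 'obj', 'attributes'])

sentence_triplets = [
    Triplet('determines', ['feasibility'], ['operational']),
    Triplet('evaluating', ['problem', 'definition'], []),   # compound noun -> list
    Triplet('evaluating', ['requirements'], []),
    Triplet('evaluating', ['solution', 'development'], []),
    Triplet('evaluating', ['solutions'], ['proposed']),
]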

For faster computation we used the Apache Spark [1] environment to parallelize the sequential loop in Algorithm 1. Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines that is maintained in a fault-tolerant way [2]. We modified Algorithm 1 into Algorithm 3 to incorporate distributed system capabilities.
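A minimal pySpark sketch of this parallelization is shown below; extract_triplets stands for the per-sentence triplet extraction of Algorithm 1 (a sketch of it appears in Sect. 3.2), and document_sentences is assumed to hold the sentences of one document. This is an illustration of the RDD-based approach, not the authors' exact code.

from pyspark import SparkContext

sc = SparkContext(appName='posdc-triplet-extraction')

def parse_partition(sentences):
    # heavy parser objects would be constructed once per partition here
    for sentence in sentences:
        yield extract_triplets(sentence)

d_tpos = (sc.parallelize(document_sentences)
            .mapPartitions(parse_partition)
            .collect())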

3.2 Document Parsing and Dictionary Creation

Given a document D, its corresponding representation discussed in Sect. 3.1 is obtained by parsing the dependency tree \(tree_{dep}\) generated by the Stanford Dependency Parser from the NLTK library [3] for each sentence. The procedure for creating \(D_{t_{POS}}\) is described in Algorithm 1.

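A minimal sketch of the per-sentence extraction in Algorithm 1 follows. It assumes a CoreNLP server running at localhost:9000, and the dependency relations used (dobj/obj for the action-object pair, amod for attributes) are our illustrative choices rather than the exact rules of the authors' implementation.

from nltk.parse.corenlp import CoreNLPDependencyParser

parser = CoreNLPDependencyParser(url='http://localhost:9000')

def extract_triplets(sentence):
    # parse one sentence and walk its dependency triples
    parse, = parser.raw_parse(sentence)
    deps = list(parse.triples())   # ((gov_word, gov_tag), relation, (dep_word, dep_tag))
    triplets = []
    for (gov, gov_tag), rel, (dep, dep_tag) in deps:
        if rel in ('dobj', 'obj') and gov_tag.startswith('VB'):
            # adjectives modifying the object noun become its attributes
            attrs = [d for (g, _), r, (d, _) in deps if r == 'amod' and g == dep]
            triplets.append((gov, [dep], attrs))
    return triplets

print(extract_triplets('You determine operational feasibility by evaluating proposed solutions.'))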

After obtaining \(D_{t_{POS}}\) we used memoization and precomputation techniques to build a dictionary Dict of the words present in \(D_{t_{POS}}\). The structure of the dictionary is depicted in Fig. 2. In the dictionary, every word w is stored with its synonym list \(syn_w\). We used the WordNet dictionary from NLTK [5] to get \(syn_w\) for a given w. We used a Cloudant database to store this dictionary as JSON documents; the structure of the JSON is shown in Fig. 3. \(syn_w\) for a w consists only of the words which exist in Dict and cross a threshold of semantic similarity score (\(sim_{sem}\)). The algorithm to update the dictionary is given in Algorithm 5.

WordNet is a large lexical database of the English language. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms called synsets. Each synset expresses a distinct concept, and synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet provides the synsets for a given English word [6]. To calculate \(sim_{sem}\) between \(w_1\) and \(w_2\), we calculate the wup similarity score between the synsets corresponding to \(w_1\) and \(w_2\). Wu-Palmer similarity, or wup similarity, provides a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their least common subsumer (most specific ancestor node) [7]. After computing the score between each pair of synsets, we take the average of the scores to get the semantic similarity score between \(w_1\) and \(w_2\), denoted \(sim_{sem_{w_1 , w_2}}\). The algorithm to find \(sim_{sem}\) is described in Algorithm 4.
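A sketch of \(sim_{sem}\) (Algorithm 4) using NLTK's WordNet interface could look as follows; returning 0 for word pairs with no comparable synsets is our assumption.

from nltk.corpus import wordnet as wn

def sim_sem(w1, w2):
    # average Wu-Palmer (wup) similarity over all synset pairs of w1 and w2
    scores = []
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            score = s1.wup_similarity(s2)   # None when no common ancestor exists
            if score is not None:
                scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

print(sim_sem('requirement', 'specification'))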

When Dict is empty and the algorithm encounters a new word, it creates Dict and stores an entry corresponding to the word. When Dict already exists in the Cloudant database and the algorithm encounters a word w, it first checks whether w is present in Dict. If w is not present in Dict, it creates an entry for w and generates the corresponding \(syn_w\) by calculating \(sim_{sem}\) with every other word in Dict; the synonym lists of the other words in Dict are also updated accordingly. While processing each entry in \(D_{t_{POS}}\), we precompute the semantic similarity scores among the words and store them in a database. Processing of \(D_{t_{POS}}\) is described in Algorithm 6.
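A simplified, in-memory sketch of the dictionary update (Algorithm 5) is given below; the threshold value and the plain Python dict standing in for the Cloudant-backed Dict are our assumptions, and sim_sem is reused from the sketch above.

SIM_THRESHOLD = 0.8   # illustrative value, not taken from the paper

def update_dict(dictionary, word):
    # add a word to Dict and cross-link synonym lists above the threshold
    if word in dictionary:
        return
    dictionary[word] = []                      # syn_w list for the new word
    for other in list(dictionary):
        if other == word:
            continue
        score = sim_sem(word, other)
        if score >= SIM_THRESHOLD:
            dictionary[word].append({'word': other, 'score': score})
            dictionary[other].append({'word': word, 'score': score})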


3.3 Assignment Problem Formulation

After the document parsing stage, the two documents D and \(D'\) are represented as two sets of triplets, \(D_{t_{POS}}\) and \(D'_{t_{POS}}\), with elements \(t_{POS_1}\), \(t_{POS_2}\), ..., \(t_{POS_n}\) and \(t_{POS_1}'\), \(t_{POS_2}'\), ..., \(t_{POS_m}'\) respectively. The similarity score between the two job descriptions can now be interpreted as the similarity score between these two sets. The similarity function is explained in detail in the next subsection, and is denoted by \(\mathbf F \) for now. To calculate the similarity score between two sets, a naive approach would be to calculate the similarity score between each pair of elements from the two sets (S and \(S'\)), greedily pick the pair with the highest similarity score, and repeat the process till either one of the sets has no element left. This greedy approach, although simple, does not provide an optimal match between the sets being compared. We assume that there are no repeating descriptions in the descriptive document; hence, the representative set for a document will not have synonymous elements, i.e. no repetition of the same action on the same object. This assumption motivates a one-to-one mapping between the two sets being compared for similarity.


To find an optimal one-to-one mapping between the aforementioned two sets, we formulate the problem as an assignment problem [14]. In a generic assignment problem, given the cost of assignment between each pair of elements in two sets, the task is to find an optimal one-to-one assignment of the elements that maximizes/minimizes the total cost of assignment. Our problem of finding such a one-to-one mapping between the representative sets of the descriptive documents can be formulated in a similar way: given \(\mathbf F \) as the cost-of-assignment function between each pair of elements in the two representative sets, the task is to find an optimal one-to-one assignment of the elements that maximizes the aggregate similarity score. Since the two sets being compared can have unequal numbers of elements, this is a case of an imbalanced assignment problem.

After formulating the problem as a similarity score maximization assignment problem, we use the Hungarian Method [14] to extract the matches. This method takes as input an \(n \times n\) square cost matrix and, after applying a set of matrix operations, outputs an optimal set of n assignments, one per row and column, which offer a maximum cumulative assignment score. Since ours is a case of an imbalanced assignment problem, given two sets with m and n triplets each, we start with an \(m \times n\) cost matrix, where each cell contains the similarity score between the corresponding row and column elements of the matrix. Without loss of generality, we assume \(n>m\), and add zero padding to extend the \(m \times n\) matrix to an \(n \times n\) one. The rest of the steps for applying the Hungarian Method remain the same as for a typical score maximization assignment problem. We will refer to the Hungarian Method as \(Assign_{Hung.}\) in the rest of the paper.
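As a sketch, \(Assign_{Hung.}\) on the zero-padded matrix can be realized with SciPy's Hungarian-algorithm solver; this is our substitution for illustration, not necessarily the authors' implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_hung(sim_matrix):
    # zero-pad the m x n similarity matrix to n x n (assumes m <= n) and solve
    # the maximization assignment with the Hungarian algorithm
    m, n = sim_matrix.shape
    padded = np.zeros((n, n))
    padded[:m, :] = sim_matrix
    rows, cols = linear_sum_assignment(padded, maximize=True)
    # drop assignments involving padded (dummy) rows
    return [(i, j) for i, j in zip(rows, cols) if i < m]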

Post this assignment, the following subsection defines the similarity and aggregation functions.

3.4 Score Calculation

We define four similarity functions for calculating the similarity score between two job description documents D and \(D'\). We refer to \(sim_{sem}\) as the function calculating semantic similarity between two words, as described in Algorithm 4. The functions are defined as follows.

Definition 1

If \(v_1\) and \(v_2\) are two single-word verbs and \(V_{sim}\) is the similarity function between two verbs then \(V_{sim}\) is defined as,

$$\begin{aligned} V_{sim} (v_1,v_2) = sim_{sem_{v_1,v_2}} \end{aligned}$$
(1)

Definition 2

If \(N_1\) and \(N_2\) are two sets of nouns (nouns can form a set in \(t_{POS}\) in the case of compound nouns) and \(N_{sim}\) is the similarity function between two noun sets, then \(N_{sim}\) is defined as,

$$\begin{aligned} N_{sim} (N_1,N_2) = \dfrac{( 1*|N_1\cap {N_2}| + \dfrac{\sum _{i=1}^{n} \sum _{j=1}^{m} sim_{sem_{N'_{1_i},N'_{2_j}}} }{|N'_1|*|N'_2|})}{ (1+|N_1\cap {N_2}|)} \end{aligned}$$
(2)

where, \(N'_1 = N_1 - (N_1\cap {N_2})\) and \(N'_2 = N_2 - (N_1\cap {N_2})\).

Definition 3

If \(A_1\) and \(A_2\) are two sets of adjectives (adjectives can form a set in \(t_{POS}\) in the case of multiple adjectives corresponding to a noun) and \(A_{sim}\) is the similarity function between two adjective sets, then \(A_{sim}\) is defined as,

$$\begin{aligned} A_{sim} (A_1,A_2) = \dfrac{( 1*|A_1\cap {A_2}| + \dfrac{\sum _{i=1}^{n} \sum _{j=1}^{m} sim_{sem_{A'_{1_i},A'_{2_j}}}}{|A'_1|*|A'_2|})}{ (1+|A_1\cap {A_2}|)} \end{aligned}$$
(3)

where, \(A'_1 = A_1 - (A_1\cap {A_2})\) and \(A'_2 = A_2 - (A_1\cap {A_2})\).

Definition 4

If \(t_{POS_1}\) and \(t_{POS_2}\) are two triplets \((v_1,N_1,A_1)\) and \((v_2,N_2,A_2)\) respectively, then \(t_{sim}\), the similarity function between two triplets, is defined as,

$$\begin{aligned} \begin{aligned} t_{sim}(t_{POS_1},t_{POS_2}) =&\, \dfrac{1}{(2+\mathbf {1}_{A_1\cup {A_2} \ne null})}\\&*(V_{sim} (v_1,v_2) *(1+N_{sim} (N_1,N_2)\\ {}&*(1+A_{sim} (A_1,A_2)))) \end{aligned} \end{aligned}$$
(4)

where \(\mathbf {1}_{A_1\cup {A_2} \ne null} = 0\), if \(A_1\cup {A_2} = null\), 1 otherwise.

Calculating triplet similarity involves finding the semantic similarity between actions, objects and attributes, where the attributes set can be null but the others cannot. We have already discussed in Sect. 3.1 that actions are verbs, objects are nouns and attributes are adjectives. So, calculating similarity between actions, objects and attributes boils down to finding semantic similarity between verbs, corresponding nouns and corresponding adjectives. A triplet \(t_{POS}\) consists of exactly one action, i.e. one verb; one object, i.e. a set of nouns (in the case of a compound noun); and a set of attributes, i.e. a set of adjectives (in the case of multiple adjectives). Calculating semantic similarity between two verbs is straightforward using \(sim_{sem_{w_1 , w_2}}\) discussed in Sect. 3.2 and described in Definition 1. On the other hand, calculating semantic similarity between two noun sets or two adjective sets (Definitions 2 and 3) is a bit more involved. Both follow the same rule, so we discuss the noun similarity calculation here. We compute the intersection \(N_1\cap {N_2}\) between the two sets \(N_1\) and \(N_2\). We also compute the set differences \(N'_1\) and \(N'_2\) as \(N_1-N_1\cap {N_2}\) and \(N_2-N_1\cap {N_2}\) respectively. Then we compute the pairwise semantic similarity among elements of \(N'_1\) and \(N'_2\) using the \(sim_{sem_{w_1 , w_2}}\) function. Next, we take the average of these pairwise semantic similarities and treat it as one entity, \(sim_{sem_{nonIntersec}}\), where

$$\begin{aligned} sim_{sem_{nonIntersec}} = \dfrac{\sum _{i=1}^{n} \sum _{j=1}^{m} sim_{sem_{N'_{1_i},N'_{2_j}}}}{|N'_1|*|N'_2|} \end{aligned}$$
(5)

The other entity is the semantic similarity score for \(N_1\cap {N_2}\), which is 1. Finally, we compute the weighted average of 1 and \(sim_{sem_{nonIntersec}}\), where the weights are \(|N_1\cap {N_2}|\) and 1 respectively. After calculating these individual similarity scores, we aggregate them to compute the similarity score between two triplets such that the action gets the highest importance, followed by the object and the attributes.
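A minimal sketch of Definitions 2-4 follows, reusing sim_sem from the earlier sketch; treating the non-intersection term as 0 when either residual set is empty is our assumption for the degenerate case.

def set_sim(S1, S2):
    # weighted-average similarity of Definitions 2 and 3 (N_sim / A_sim)
    S1, S2 = set(S1), set(S2)
    inter = S1 & S2
    R1, R2 = S1 - inter, S2 - inter
    if R1 and R2:
        non_intersec = sum(sim_sem(a, b) for a in R1 for b in R2) / (len(R1) * len(R2))
    else:
        non_intersec = 0.0                    # assumption for the degenerate case
    return (len(inter) * 1.0 + non_intersec) / (1 + len(inter))

def t_sim(t1, t2):
    # triplet similarity of Eq. (4); a triplet is (verb, nouns, adjectives)
    v1, N1, A1 = t1
    v2, N2, A2 = t2
    has_attrs = 1 if (set(A1) | set(A2)) else 0
    return (sim_sem(v1, v2) * (1 + set_sim(N1, N2) * (1 + set_sim(A1, A2)))) / (2 + has_attrs)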

Given two documents D and \(D'\), we first compute their corresponding triplet representations \(D_{t_{POS}}\) and \(D'_{t_{POS}}\). Let us say \(|D_{t_{POS}}| = n\) and \(|D'_{t_{POS}}| = m\). Without loss of generality, it can be stated that \(n\ge {m}\). Then, the similarity matrix \(Mat_{sim}\) is calculated as,

$$\begin{aligned} \begin{aligned}&Mat_{sim^{i,j}} = t_{sim}(D_{t_{POS}^i},D'_{t_{POS}^j})\\&\forall i,j \in n,m \end{aligned} \end{aligned}$$
(6)

After \(Mat_{sim}\) is calculated, it is provided as the input matrix to the \(Assign_{Hung.}\) algorithm, which returns a unique one-to-one mapping \(Map_{D,D'}\). The final similarity score between the two documents, \(sim_{D,D'}\), is then calculated using Eq. 7:

$$\begin{aligned} \begin{aligned}&sim_{D,D'} = \dfrac{\sum _{k=1}^{m} t_{sim}(D_{t_{POS}^i},D'_{t_{POS}^j})}{n}\\&\textit{where } Map_{D,D'}^{k} : D_{t_{POS}^i} \rightarrow D'_{t_{POS}^j}\\&\forall i,j \in n,m \end{aligned} \end{aligned}$$
(7)

Algorithm 7 describes the full procedure for calculating the similarity between two job description documents.

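Putting the pieces together, a sketch of the computation of Eq. 7 (reusing t_sim and assign_hung from the sketches above) could look as follows:

import numpy as np

def document_similarity(d_tpos, dp_tpos):
    # ensure d_tpos is the larger triplet set, so that n >= m
    if len(d_tpos) < len(dp_tpos):
        d_tpos, dp_tpos = dp_tpos, d_tpos
    n, m = len(d_tpos), len(dp_tpos)
    # m x n similarity matrix between triplets of the two documents
    sim_matrix = np.array([[t_sim(ti, tj) for tj in d_tpos] for ti in dp_tpos])
    mapping = assign_hung(sim_matrix)
    # sum the matched similarities and normalize by the larger set size n
    return sum(sim_matrix[i, j] for i, j in mapping) / n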

3.5 Experimental Setup and Data-Sets

We used a Spark cluster with 6 executors, each having 8 GB of RAM, to run our experiments. The Apache Spark framework was used to incorporate parallelism in carrying out the experiments. All the code was written in Python using the pySpark library. Cloudant services were used as the database resource. We also used the Stanford CoreNLP parser and WordNet from the NLTK library.

Job description documents from IBM Talent Frameworks data have been used to carry out all the experiments. All the sentences in the job description documents are grammatically incomplete in the sense that each of them starts with a verb; the subject noun is missing from each sentence, e.g. “Require analytical skills”. So we add “You” or “You are” at the beginning of each sentence, depending on the form of the verb: if the verb ends with “ing” we add “You are”, otherwise we add “You”, to make the sentences grammatically correct. In the cases where the verb ends with “s” (a verb meant for the third person singular), “You” is treated as a name (third person singular), which resolves the grammatical issue. These grammatically corrected sentences are then fed to the Stanford CoreNLP parser to generate the dependency tree. As an estimate of the computation time in this setup, the action-object-attribute representation and calculation of job description similarity for 500 × 500 jobs took 3892.33 s.
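The sentence normalization described above can be sketched as follows; the exact rewriting rules (and the lowercasing of the original first word) are our simplification of the heuristic.

def normalize(sentence):
    # prepend "You are" before gerunds, otherwise "You"
    first = sentence.split()[0]
    prefix = 'You are' if first.lower().endswith('ing') else 'You'
    return f'{prefix} {sentence[0].lower()}{sentence[1:]}'

print(normalize('Determines operational feasibility by evaluating proposed solutions.'))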

4 Evaluation

To test our method, we perform a job family based evaluation. Since we are using IBM Kenexa Talent Frameworks, we can utilize its default grouping of jobs into job families. The general expectation is that jobs within a family (intra) will have higher job description similarity scores than those outside the job family (inter). Let

  • \(F = \{F_1, F_2, ..., F_n\}\) be the set of all job families in the test set.

  • \(J_i = \{J_{i,1}, J_{i,2}, ..., J_{i,n_i}\}\) be the set of all jobs in family \(F_i\).

  • \(Intra_i\) be the average similarity between all pairs of jobs within \(F_i\).

  • \(Inter_i\) be the average similarity between all pairs (A, B) of jobs such that \(A \in F_i\) and \(B \in F_j\) for all \(j \ne i\).

  • \(R_i = \frac{Intra_i}{Inter_i}\).

Then the gross metric of interest to gauge the effectiveness of a document similarity computation method is \(S = \frac{\sum _{i}{|F_i| \times R_i}}{\sum _{i}{|F_i|}}\), computed over a common test set. A higher S value thus means better performance of the similarity calculation approach.
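A sketch of this evaluation is shown below, assuming `families` maps each family id to its list of job documents and `sim(a, b)` is any pairwise document similarity function; skipping families with fewer than two jobs is our own simplification.

from itertools import combinations

def evaluate_S(families, sim):
    # weighted average of intra/inter similarity ratios, weighted by family size
    weighted_sum, total_jobs = 0.0, 0
    for fid, jobs in families.items():
        if len(jobs) < 2:
            continue                          # skip degenerate families (assumption)
        intra = [sim(a, b) for a, b in combinations(jobs, 2)]
        inter = [sim(a, b) for a in jobs
                 for other_fid, others in families.items() if other_fid != fid
                 for b in others]
        R = (sum(intra) / len(intra)) / (sum(inter) / len(inter))
        weighted_sum += len(jobs) * R
        total_jobs += len(jobs)
    return weighted_sum / total_jobs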

Since we intend to benchmark our method against existing state-of-the-art methods, we conduct several experiments with varying training corpus sizes: \(N_1 = 56\), \(N_2 = 129\) and \(N_3 = 430\) documents used for training. POSDC does not require any training corpus, therefore the corpus-varying experiments apply only to doc2vec and LDA. Note that the test set consisted of 500 randomly chosen jobs out of the 2344 available in IBM Kenexa Talent Frameworks, so that there is representation from each job family in the selected test set. The test sets did not contain any of the jobs on which the models were trained, and were selected separately for the three experiments.

As is evident from the bar charts and Table 1, when a large enough corpus is chosen, LDA gives the best overall performance; otherwise POSDC performs better. When we look at individual job families, neither LDA nor POSDC completely dominates the other. Doc2vec is consistently inferior to both LDA and POSDC, irrespective of the corpus size.

Table 1. Comparison of S value across methods
Fig. 4. 10 largest job families’ \(R_i\) values for \(N_1\)

The total number of job families in IBM Kenexa Talent Frameworks is more than 100. For the sake of brevity and clarity, we show the bar charts for the ten largest job families (in terms of number of jobs included) for all three training corpus sizes.

The comparison of \(R_i\) values for ten of the biggest job families corresponding to \(N_1\), \(N_2\) and \(N_3\) can be seen in Figs. 4, 5 and 6 respectively.

Fig. 5. 10 largest job families’ \(R_i\) values for \(N_2\)

Fig. 6. 10 largest job families’ \(R_i\) values for \(N_3\)

5 Conclusion and Future Work

The core novelty of POSDC is that, unlike LDA or doc2vec, it does not require any prior training on a large corpus. It uses the inherent semantics of job descriptions to find similarity using an available dictionary (WordNet). As can be seen in our results, it is consistently superior to doc2vec, and also superior to the LDA based method when the corpus available for training is small. Future work in this direction would be to define similar paradigm(s) for other/generic documents.

In the current approach, we have assumed that there is no duplication or alternate description of the same action-object-attribute triplet within a document. If that is not the case, then effectively the same action-object-attribute triplet in one job may get matched to different ones in another job. This can be overcome by first matching a job description with itself, and removing pairs of action-object-attribute triplets that match with a score above a threshold.

Another possible future direction is more domain-specific rather than problem-specific: since our motivation for tackling this problem is to find similar jobs, we could combine job title similarity [8] with POSDC to improve upon RISE [19].