1 Introduction

Identifying the right job positions is key to finding the right opportunity. Job descriptions (JDs) advertise job positions by providing information about them. To date, job descriptions are prescriptive and depend solely on how employers or recruiting systems choose to express job details. As a result, job descriptions differ significantly from one employer to another even for the same role or responsibility, making it difficult for job seekers to identify the right set of jobs. Job data normalization would therefore go a long way toward helping job seekers identify the opportunities that match their profiles by standardizing job requirements.

Recruiting systems across organizations share common functions such as hiring and selection, and consequently produce data (JDs, candidate CVs, hiring processes, etc.) with a high degree of commonality. Such data is often non-standard, and each organization chooses its own identifiers to refer to its records. Our analysis of about 27,000 JDs across 28 different organizations revealed that the terms used to refer to roles vary significantly by name. Identifying similar JDs by role name is therefore difficult, if not impossible. Often, manual inspection of the data is employed to assess the contents of JDs and establish similarity, and thereby identity.

In this paper, we address one such critical problem: resolving the identities of JDs and normalizing them across organizations. We present RISE, an identity resolution engine that exploits the underlying similarity in the nature of the data, using representative attributes as the fundamental concept for resolving identities. It introduces novel methods to process unstructured JDs, identify attributes, and convert unstructured textual descriptions into structured information. Similarity is established over the first-class attributes identified to represent the data. The identities (job title/department) of JDs that are established as similar are used to build rules that encode equivalence. We describe each step of the process in detail and show that our approach identifies similarity across JDs with high accuracy.

Our contributions in this paper can be summarized as follows:

  1. Identity resolution of JDs:

     (a) Identification of important attributes that are descriptive of the information in JDs.

     (b) Building a highly accurate classifier that labels unstructured text with one or more of these attributes.

     (c) Identification and extraction of keywords for each attribute from the unstructured text.

     (d) Establishing similarity among JDs based on the extracted keywords.

     (e) Establishing equivalence among the identity titles/roles of JDs.

  2. Experiments on real-world data sets:

     (a) We performed each of the above five steps on real-world data sets collected across 28 different organizations.

     (b) We extensively validated the results at each step to ensure that the overall process derives similarity with high accuracy.

The rest of the paper is organized as follows. Section 2 presents the system and its steps in detail. Section 3 describes the algorithms we use. Section 4 presents the details of the experiments and their results, and Sect. 5 reviews existing literature relevant to our work. We conclude with directions for future work in Sect. 6.

2 System Overview and Approach

Before providing system details, we present a list of terms along with their definitions in Table 1. We use these terms throughout the paper; the definitions help readers distinguish terms with superficially similar meanings.

Table 1. Important terms and definitions

Our system RISE comprises six phases. Figure 1 shows all the phases and the respective steps involved in extracting relevant information from highly unstructured text describing a job requirement. The details of each phase are as follows.

Fig. 1. Phases and steps involved in processing unstructured job descriptions

1. Attribute Identification Phase (AIP). This phase uses Principal Component Analysis (PCA) to identify attributes that are representative of the information in JDs (Step 1 in Fig. 1). This is done with the help of domain and subject matter experts. The algorithm for identifying these attributes is presented in detail in Sect. 3.1. The five attributes determined through this analysis are {Education, Skills, Experience, Roles, Responsibilities}.

2. Classifier Training Phase (CTP). This phase trains a multi-label ensemble classifier to assign one or more attributes to each line of a JD. The input to this step is training data in which every line of historical JDs is already labeled; the output is a classifier model. Ground truth, collected through manual labeling (as described in Sect. 4.1), is used to train the classifier. In step 2 (refer to Fig. 1), we extract unstructured text from the JDs only for the attributes identified in AIP. Step 3 involves unstructured text processing, where the text is parsed and broken into a set of lines. The delimiters used to break the unstructured text into lines are {, . ; newline}. Based on the keywords present in a line, a set of categories is extracted for that line; these categories may be organized hierarchically. Eventually a binary feature set is built for every line, where each entry indicates whether the corresponding category is present or absent. In step 4, this feature set is used to create training data for a classification algorithm. In step 5, a classifier is trained, producing the classifier model shown in step 6. Details of the algorithms involved in category extraction, feature set generation, and classifier training are presented in Sect. 3.2.
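As a rough illustration of the line-splitting in step 3 (a sketch, not the production parser; the helper name is ours), the raw JD text can be broken on the delimiters listed above:

```python
import re

def split_jd_into_lines(jd_text: str) -> list[str]:
    # Split on the delimiters {, . ; newline} mentioned above.
    parts = re.split(r"[,.;\n]", jd_text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

lines = split_jd_into_lines(
    "Bachelor of Engineering; 3-5 years of experience in Java.\nGood communication skills."
)
print(lines)
```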

3. Attribute Association Phase (AAP). This phase uses the classifier model built in step 5 of CTP for classification. Every new JD is passed through unstructured text processing (step 7), which extracts the features for each line (step 8). The extracted features are passed through the classifier, which associates each line with one or more attributes (step 9). The output of the classifier then passes through text standardization (step 10), whose main function is to convert the keywords in each labeled line into standard, recognizable forms. Trivial differences, such as multiple spaces between keywords or the presence of delimiters, are also cleaned up by the text standardization process. Algorithmic details of text standardization are presented in Sect. 3.3.

4. Extraction Phase (EP). After AAP, the text is still in the form of unstructured lines. These labeled lines are passed into the Extraction Phase (step 11), in which keywords referring to each of the attributes are identified and extracted from the text. This step converts the unstructured text into a structured JD; the keywords for each attribute are stored as comma-separated values. Section 3.4 presents the complete set of algorithms for extracting the keywords related to each attribute from the lines.

5. Similarity Phase (SP). Step 12 of Fig. 1 represents a similarity algorithm that finds the similarity between any two JDs. We use the Jaccard similarity measure to determine the similarity of two JDs based on the attributes education, skills, roles, and responsibilities. For similarity on the experience attribute, we provide our own approach that computes similarity based on the number of years of experience. Algorithmic details are provided in Sect. 3.6.

6. Identity Resolution Phase (IRP). In this phase, we establish and build a set of rules that can be used to easily identify two equivalent JDs (Step 13 in Fig. 1). Each job description is identified by its job title (a role-oriented descriptor) and department. For every pair of similar JDs identified in the previous phase, their job titles and departments are stored as a rule in a rules repository. Two JDs need not be analyzed for similarity in the future if their job titles and departments are already present among the rules.
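A minimal sketch of how such a rules repository could be realized, assuming rules are keyed on canonicalized (job title, department) pairs; the class and method names are illustrative and not part of RISE:

```python
class RulesRepository:
    """Stores equivalence rules between (job title, department) pairs."""

    def __init__(self):
        self._rules = set()

    def _key(self, title_a, dept_a, title_b, dept_b):
        # Store pairs in a canonical order so (A, B) and (B, A) match.
        return tuple(sorted([(title_a.lower(), dept_a.lower()),
                             (title_b.lower(), dept_b.lower())]))

    def add_rule(self, title_a, dept_a, title_b, dept_b):
        self._rules.add(self._key(title_a, dept_a, title_b, dept_b))

    def are_equivalent(self, title_a, dept_a, title_b, dept_b):
        return self._key(title_a, dept_a, title_b, dept_b) in self._rules

repo = RulesRepository()
repo.add_rule("Software Engineer", "IT", "Application Developer", "Engineering")
# Lookup works regardless of the order of the pair.
print(repo.are_equivalent("Application Developer", "Engineering",
                          "Software Engineer", "IT"))  # True
```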

3 Algorithms

In this section, we describe four algorithms: (1) identifying important attributes for JDs, (2) tagging unstructured lines of JDs with one or more of the five attributes {Education, Skills, Experience, Roles, Responsibilities}, (3) creating structured job descriptions, and (4) finding the similarity between two job descriptions.

3.1 Identifying Important Attributes for Job Descriptions

This algorithm is used in the AIP described in Sect. 2. Our aim is to identify a set of important attributes that are representative of the information in JDs. Principal Component Analysis (PCA) is commonly used for dimensionality reduction. We use MSNRPCA, a hybrid feature reduction method proposed by Yang et al. [20] that combines feature ranking with PCA.

We run MSNRPCA on the labeled data set of JDs described in Sect. 4.1. The labeled data contains values of every attribute for all JDs. An exhaustive list of the candidate attributes derived by qualitative analysis of these JDs is shown below.

(Figure a: exhaustive list of candidate attributes)

MSNRPCA assigns a score to each of these attributes based on its importance: the higher the score, the higher the importance. For robustness, we create 5 sets of labeled JDs by randomly sampling 80% of the total JDs each time. MSNRPCA is used to assign an importance score to every attribute for every sampled instance of labeled JDs. To simplify the analysis, we map all scores onto a 10-point rating scale. Table 2 captures all of these scores; they can be thought of as ratings given to every attribute by 5 different domain experts. Correlated variables are removed from this list, reducing the set of variables to a minimum number of independent attributes. We perform factor analysis on these scores to eventually identify 5 important attributes: {Education, Skills, Experience, Roles, Responsibilities}. The Skills attribute can be further divided into "Technical Skills" and "Soft Skills". All skills that involve a known technology, tool, product, or methodology are categorized as technical skills. Soft skills are those that do not involve a known tool but are gained through experience and personal disposition; examples include management and communication skills. In this paper we do not use this sub-classification of skills and consider only the 5 important attributes, which we denote \(y_\text {edu}, y_\text {skill}, y_\text {exp}, y_\text {role}\), and \(y_\text {resp}\), respectively.

Table 2. Importance scoring of all attributes
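The sketch below is a simplified stand-in for this attribute-ranking idea, not an implementation of the MSNRPCA method of Yang et al. [20]: it ranks candidate attributes by the magnitude of their PCA loadings weighted by explained variance, on toy data whose attribute names and values are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def rank_attributes(X: np.ndarray, names: list[str], n_components: int = 2):
    """Rank attributes by |loading| on the leading components, weighted by
    the variance each component explains (a simplified proxy, not MSNRPCA)."""
    pca = PCA(n_components=n_components).fit(X)
    scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order]

# Toy data: rows are JDs, columns are candidate attributes (illustrative
# per-attribute keyword counts drawn at random).
names = ["Education", "Skills", "Experience", "Roles", "Responsibilities", "Location"]
X = np.random.default_rng(0).poisson(lam=[2, 8, 3, 4, 6, 1], size=(100, 6))
print(rank_attributes(X.astype(float), names))
```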

3.2 Unstructured Text Classification

The algorithms presented here are used in the CTP and AAP phases explained in Sect. 2. A job description generally contains unstructured text describing the requirements of an open position in terms of the important attributes {Education, Skills, Experience, Roles, Responsibilities}. We observe that every line of such a job description describes one or more of these attributes. So, to create a structured description out of an unstructured one, we first identify which line describes which attributes. This leads to a multi-label classification problem in which we need to assign one or more labels to every line L of a job description. In our case, the set of possible labels is \(\mathcal {Y}= \{y_\text {edu}, y_\text {skill}, y_\text {exp}, y_\text {role}, y_\text {resp}\}\). As described in [19], there are two main methods for tackling multi-label classification: (1) problem transformation methods, which transform the multi-label problem into a set of binary classification problems, and (2) algorithm adaptation methods, which adapt algorithms to directly perform multi-label classification. We use the problem transformation method, creating 5 binary classifiers, one for each label in \(\mathcal {Y}\). All the steps involved in the multi-label classification are explained in detail below.

Feature Extraction. This section describes our approach to unstructured text processing, as mentioned in Sect. 2. A set of features for a line L is required to use any standard classification algorithm, so feature extraction is an important step in our approach. In a sense, our labels \(\mathcal {Y}\) are the categories we have to identify for every line. Such category identification needs a mapping between categories and keywords as input: it looks for keywords in the text and, based on the mapping, determines the most appropriate category. A Naive Bayes classifier can be used to classify text into one of the categories [9]. There are two main problems with this approach: (1) one keyword may be mapped to multiple categories, resulting in more than one possible category for L, and (2) it is very difficult to come up with an exhaustive list of keywords for every category. Thus, using this category identification approach for our problem leads to poor results.

Instead, we can use a taxonomy of several different categories (including \(\mathcal {Y}\)), such as academics, products, work, business, etc. An example of such a taxonomy is shown in Fig. 2; similar taxonomies are readily available in the public domain. Every parent node in the taxonomy can be considered a category and its children the relevant keywords. Using this taxonomy, we can find a set of the most suitable categories for a line L. The resultant categories may or may not include categories from \(\mathcal {Y}\), but we can deduce categories from \(\mathcal {Y}\) provided we know which combinations of categories map to which categories from \(\mathcal {Y}\).

Fig. 2. Taxonomy of categories

Since the extracted categories are a good indicator of the information present in L, we use them as features \(f^L\) for L. If the given taxonomy has m categories, then there are m binary features for every L. For all \(i \in \{1,...,m\}\), feature \(f_i=1\) if category \(c_i\) is extracted for L, and \(f_i=0\) otherwise. This creates a feature vector \(f^L = \{f_1, f_2, ..., f_m\}\).
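A minimal sketch of this taxonomy-based feature extraction; the toy taxonomy below is illustrative and not the one actually used in RISE:

```python
# Parent nodes act as categories, children as relevant keywords.
TAXONOMY = {
    "academics": {"bachelor", "master", "degree", "engineering"},
    "work": {"experience", "years"},
    "products": {"java", "python", "databases"},
    "business": {"management", "communication"},
}
CATEGORIES = sorted(TAXONOMY)

def extract_features(line: str) -> list[int]:
    # Binary feature f_i = 1 if any keyword of category c_i appears in the line.
    tokens = set(line.lower().split())
    return [int(bool(tokens & TAXONOMY[c])) for c in CATEGORIES]

print(CATEGORIES)
print(extract_features("Bachelor of Engineering with 5 years experience in Java"))
```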

Classifier Training. In this section we describe the multi-label classifier training algorithm used in step 5 of the CTP phase, as described in Sect. 2. As mentioned in the previous section, if we know the rules that map combinations of categories to one of the categories from \(\mathcal {Y}\), we can easily assign a label from \(\mathcal {Y}\) to L. A decision tree is a good choice for learning such a set of rules from the data: it generates a classification tree that can be used to classify L into one of the labels from \(\mathcal {Y}\). A decision tree assigns only one label to L, while we need multiple labels, so we create a decision tree for every label in \(\mathcal {Y}\). Although several flavors of decision trees are available in the literature, we use the generic form here for simplicity of explanation; results for C5.0 [1], CHAID [12], and C&RT [5] are presented in Sect. 4.

Decision trees require labeled data for training, so we parse several job descriptions to get a set of lines. During the ground truth collection phase, we receive a multi-label set \(y^L \subseteq \mathcal {Y}\) for every line L. Thus we get pairs \(\{L,y^L\}\) in the training dataset \(\mathcal {D}_\text {train}\). To train a particular classifier for \(y_i \in \mathcal {Y}\), we replace \(y^L\) in every \(\{L,y^L\}\) pair with a binary value \(b_i^L\), where \(b_i^L=1\) if \(y_i \in y^L\) and \(b_i^L=0\) otherwise. This gives us pairs \(\{L,b_i^L\}\) in the training dataset \(\mathcal {D}_\text {train}^i\) for label \(y_i\).

With this labeled data and the feature vectors for every line described in Sect. 3.2, we build a decision tree \(T_i\) for every label \(y_i\).

Multi-label Classification. Given a new, unseen line L, we classify it using each of the decision trees built in the training phase. Decision tree \(T_i\) classifies a line L and provides a label \(b_i^L\). For example, consider a decision tree \(T_\text {edu}\) for \(y_\text {edu}\). For any L, the decision tree returns \(b_\text {edu}^L=1\) when L describes education requirements and \(b_\text {edu}^L=0\) when L is not about education. We combine all such labels for L from all decision trees and generate a multi-label set \(y^L\) that contains a label \(y_i\) if \(b_i^L=1\). This approach is used in step 9 of the AAP phase, as mentioned in Sect. 2.
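The following sketch illustrates this problem-transformation scheme with one binary tree per label, using scikit-learn's CART-style DecisionTreeClassifier on toy data (the experiments in Sect. 4 use C5.0, CHAID, and C&RT instead); all names and data here are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

LABELS = ["edu", "skill", "exp", "role", "resp"]

def train_per_label_trees(X, Y):
    # Y is an (n_lines, n_labels) binary matrix derived from the ground truth:
    # Y[:, j] plays the role of b_j^L for every line.
    return {lbl: DecisionTreeClassifier().fit(X, Y[:, j])
            for j, lbl in enumerate(LABELS)}

def classify_line(trees, features):
    # Combine per-label predictions into a multi-label set y^L.
    f = np.asarray(features).reshape(1, -1)
    return {lbl for lbl, tree in trees.items() if tree.predict(f)[0] == 1}

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10))           # toy binary category features
Y = rng.integers(0, 2, size=(200, len(LABELS)))  # toy multi-label ground truth
trees = train_per_label_trees(X, Y)
print(classify_line(trees, X[0]))
```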

3.3 Text Standardization

To address the problem of standardizing text in step 10 of the AAP phase described in Sect. 2, we use ontologies such as WordNet and Yago. For example, some recruiters may write "MS Office" while others write "Microsoft Office". If we do not standardize words like 'MS' into 'Microsoft', it becomes difficult to find the similarity between two job descriptions, which is our final goal. Ontologies are useful because they usually contain common entities and their abbreviations; WordNet can also provide synonyms that can replace certain keywords in a line. We additionally use the Jaro-Winkler distance [6] on keywords to group similar keywords together. The inputs to this distance estimator are the keywords from the lines and a dictionary of keywords collected from large organizational databases. For instance, the names of all skills relevant to an organization can be made available as a dictionary. Each keyword in a line is compared with the keywords in the dictionary using the Jaro-Winkler distance estimator to determine closeness. If two keywords are identified as similar with a high confidence level, the keyword in the line is replaced with the keyword from the dictionary. This ensures a uniform representation of keywords across the text.
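A minimal sketch of this dictionary-based standardization, assuming the jellyfish package (which provides jaro_winkler_similarity) is available; the dictionary entries and the 0.9 confidence threshold are illustrative, not the values used in RISE:

```python
import jellyfish

SKILL_DICTIONARY = ["Microsoft Office", "Java", "Project Management"]

def standardize_keyword(keyword: str, threshold: float = 0.9) -> str:
    # Find the closest dictionary entry by Jaro-Winkler similarity.
    best, best_score = keyword, 0.0
    for entry in SKILL_DICTIONARY:
        score = jellyfish.jaro_winkler_similarity(keyword.lower(), entry.lower())
        if score > best_score:
            best, best_score = entry, score
    # Replace the keyword only when the match is confident enough.
    return best if best_score >= threshold else keyword

print(standardize_keyword("Microsft Office"))  # -> "Microsoft Office"
```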

3.4 Building Structured JDs Using Keywords Extraction

Once every line L of every job description has been classified with one or more labels from \(\mathcal {Y}\), our next task is to extract keywords from L that precisely describe the job requirements. This set of algorithms corresponds to the EP phase in Sect. 2. For example, consider the following line of a job description.

(Figure b: example line of a job description)

This line should ideally receive the labels \(y_\text {edu}\), \(y_\text {skill}\), and \(y_\text {exp}\), as it talks about education, skill, and experience requirements. After labeling, we extract the bold keywords from this line and organize them as follows. Observe that although the line is labeled \(y_\text {exp}\), we do not extract any keywords for experience: we extract only numeric information (for example, the number of years of experience) for the experience attribute, and the line in this example does not contain any such information.

(Figure c: the example line organized as extracted keywords per attribute)

We obtain a set of keywords \(S_i^L\) for each attribute \(y_i\) from \(\mathcal {Y}\) for every line L. We then take the union of the sets \(S_i^*\) over all lines to get a final set of keywords \(S_i\) for attribute \(y_i\). These sets for all \(y_i\) together form a structured job description. Next, we describe how to extract the important keywords from a line L after it has been classified into a set of attributes \(y^L\).

Keywords Extraction for Education. We observe that education is usually specified in the following format.

(Figure d: format of an education phrase: degree, preposition, field/stream/department)

For example, "Bachelor of Engineering". It is easy to obtain an exhaustive list of possible values of degree, while the set of possible values of field/stream/department can be huge, so an exhaustive list may not be attainable. However, we can exploit the correlation among the keywords of an education phrase. It is clear from the above format that the parts of speech of an education phrase are Noun-Preposition-Noun. We use NLP (Natural Language Processing) based part-of-speech (POS) tagging [3] to tag every phrase of a line labeled \(y_\text {edu}\). We use the OpenNLP tool, which tags every keyword with one of 36 different POS tags. We pick all Noun-Preposition-Noun phrases and look for degree-related keywords in the noun positions. For this purpose, we maintain a dictionary of degree keywords that is used for the lookup. Once we find a degree keyword tagged as a noun in a phrase, we tag the other noun of the phrase as the field of the degree. Finally, we extract all such phrases in which a degree and field combination is found.

Another possible format for an education phrase is a degree alone, without a field. This applies to education levels below graduation, which have no specialization. This case is easier to handle, requiring only a dictionary lookup for the degree.
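A minimal sketch of the Noun-Preposition-Noun extraction, substituting NLTK's POS tagger for OpenNLP; the degree dictionary below is illustrative, and the degree-only case is omitted.

```python
# May require: nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")
import nltk

DEGREE_KEYWORDS = {"bachelor", "bachelors", "master", "masters", "diploma", "phd"}

def extract_education(line: str) -> list[tuple[str, str]]:
    tagged = nltk.pos_tag(nltk.word_tokenize(line))
    phrases = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i:i + 3]
        # Noun-Preposition-Noun where the first noun is a known degree keyword.
        if t1.startswith("NN") and t2 == "IN" and t3.startswith("NN") \
                and w1.lower() in DEGREE_KEYWORDS:
            phrases.append((w1, w3))  # (degree, field)
    return phrases

print(extract_education("Bachelor of Engineering with experience in Java"))
```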

Keywords Extraction for Experience. If a line is labeled \(y_\text {exp}\), we extract the years or months of experience required for the job position. To find experience phrases in a line, we use an approach similar to the one for extracting education phrases. The observed formats of experience phrases are as follows.

(Figure e: observed formats of experience phrases)

For example, "... 5 years of experience in Java ...", "... 2-3 years of experience in Databases ...", etc. A number can be written either in digits or in words, so we again use NLP-based POS tagging to find phrases tagged as numbers. If a number phrase is found along with a 'year' or 'month' keyword, we extract that number as experience. This can occasionally be ambiguous when a time duration is not associated with experience, for example, "... candidate should be at least 25 years old ...". To resolve such ambiguities and boost our confidence, we also look for skill- or work-related keywords in the vicinity of the experience number; these can be found using the taxonomy of categories mentioned in Sect. 3.2. If we find two numbers separated by '-' or by keywords like 'to', as shown in the formats above, we extract the average of the two numbers as the experience. For example, we extract "2.5 years" from the line "... 2-3 years of experience in Databases".
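A minimal sketch of the experience extraction using regular expressions over digit-based numbers; numbers written as words and the vicinity check for skill keywords are omitted, and the patterns are illustrative.

```python
import re

# Ranges like "2-3 years" or "2 to 3 years"; the average of the two bounds is used.
RANGE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(?:-|to)\s*(\d+(?:\.\d+)?)\s*(years?|months?)", re.IGNORECASE)
# Single values like "5 years" or "18 months".
SINGLE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(years?|months?)", re.IGNORECASE)

def extract_experience(line: str):
    m = RANGE_PATTERN.search(line)
    if m:
        low, high = float(m.group(1)), float(m.group(2))
        return (low + high) / 2, m.group(3).lower()
    m = SINGLE_PATTERN.search(line)
    if m:
        return float(m.group(1)), m.group(2).lower()
    return None  # no duration keyword near a number

print(extract_experience("2-3 years of experience in Databases"))  # (2.5, 'years')
print(extract_experience("5 years of experience in Java"))          # (5.0, 'years')
```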

Keywords Extraction for Skills and Roles. The skills required for a job position and the roles in an organization can be very specific, and recruiters reuse them when writing job descriptions for multiple positions. Hence, it is easy to maintain dictionaries of exhaustive keywords for skills and roles. If a line in a job description is labeled \(y_\text {skill}\), we look up the skills dictionary to check whether any of its keywords are present in the line and extract all matching keywords as skills. We follow the same procedure for lines labeled \(y_\text {role}\).

The responsibilities of a job position are better understood from an entire line than from a few keywords, so we do not extract specific keywords for the responsibilities attribute; instead, we include the entire line among the responsibilities if it is labeled \(y_\text {resp}\). We use Jaro-Winkler distance [6] based string similarity for all dictionary lookups, because it tolerates minor spelling mistakes and differences in white space between keywords.

3.5 Enriching Dictionaries

The present dictionaries of keywords for skills, roles, and education may not remain exhaustive, owing to the ever-evolving need for new skills, roles, and education. We propose a way to keep enriching these dictionaries with new keywords by analyzing the frequent occurrence of nouns in lines labeled as one or more of \(y_\text {skill}\), \(y_\text {role}\), and \(y_\text {edu}\). As described in Algorithm 1, if a noun is not in any of the dictionaries, we count its frequency in the context of the different labels. For every such noun, we find the label under which the noun has the maximum frequency and insert it into the corresponding dictionary if that frequency is above a certain threshold. The frequency-based analysis is important because every line can be assigned multiple labels, and it can otherwise be unclear which dictionary a noun should be inserted into. For simplicity and accuracy, we assume that a noun belongs to only one dictionary.

After the keyword extraction for skills, roles, and education described in Sect. 3.4, we run the dictionary enrichment process and then attempt the keyword extraction again. This helps extract keywords that could not be extracted in the previous iteration because they were missing from the relevant dictionaries.

(Figure f: Algorithm 1, the dictionary enrichment procedure)
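A minimal sketch of the enrichment idea behind Algorithm 1; the data structures, names, and the frequency threshold are illustrative.

```python
from collections import defaultdict

def enrich_dictionaries(labeled_nouns, dictionaries, threshold=5):
    """labeled_nouns: iterable of (noun, label) pairs gathered from labeled lines.
    dictionaries: mapping label -> set of known keywords, updated in place."""
    counts = defaultdict(lambda: defaultdict(int))
    # Count only nouns that are not yet in any dictionary.
    for noun, label in labeled_nouns:
        if not any(noun in d for d in dictionaries.values()):
            counts[noun][label] += 1
    # Insert each new noun into the dictionary of its most frequent label,
    # provided that frequency crosses the threshold.
    for noun, per_label in counts.items():
        best_label = max(per_label, key=per_label.get)
        if per_label[best_label] >= threshold:
            dictionaries[best_label].add(noun)
    return dictionaries

dicts = {"skill": {"java"}, "role": {"manager"}, "edu": {"bachelor"}}
observations = [("kubernetes", "skill")] * 6 + [("kubernetes", "role")] * 2
print(enrich_dictionaries(observations, dicts)["skill"])  # now contains 'kubernetes'
```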

3.6 Similarity of Job Descriptions

Given two job descriptions \(J_1\) and \(J_2\), our aim is to find how similar they are in terms of the attributes \(y_\text {edu}\), \(y_\text {skill}\), \(y_\text {exp}\), \(y_\text {role}\), and \(y_\text {resp}\). Sections 3.2 and 3.4 describe in detail how to arrive at a structured job description with sets of keywords \(S_\text {edu}\), \(S_\text {skill}\), \(S_\text {exp}\), \(S_\text {role}\), and \(S_\text {resp}\) for the respective attributes. With these keyword sets, whose text has been standardized using ontologies as described in Sect. 3.3, we only need to compute the keyword-based overlap between the respective sets of the two job descriptions. Let \(S_i^{J_k}\) denote the set of keywords for job description \(J_k\) and attribute \(y_i\). We compute the Jaccard similarity between the respective sets \(S_i^{J_k}\) and \(S_i^{J_l}\) of job descriptions \(J_k\) and \(J_l\) to get a score \(\text {sim}_i^{k,l}\), as follows. Cosine similarity [2] could also be used instead of Jaccard; for ease of explanation we use only Jaccard similarity here.

$$\begin{aligned} \text {sim}_i^{k,l} = \dfrac{|S_i^{J_k} \bigcap S_i^{J_l}|}{|S_i^{J_k} \bigcup S_i^{J_l}|} \end{aligned}$$
(1)

This is repeated for all \(y_i\) except \(y_\text {exp}\): because we extract only numbers, not keywords, for the experience attribute, Jaccard similarity does not apply to \(y_\text {exp}\). Instead, we propose a novel similarity measure based on the numeric values.

Similarity for Experience. Given two numeric values \(e_k\) and \(e_l\) of the experience attribute for job descriptions \(J_k\) and \(J_l\), the dissimilarity of experience is the normalized gap between the two values. As both are non-negative numbers, the maximum gap is \(\max \{e_k, e_l\}\), which is used for normalization. Thus the similarity of experience values can be formulated as follows.

$$\begin{aligned} \text {sim}_\text {exp}^{k,l} = 1 - \dfrac{|e_k - e_l|}{\max \{e_k, e_l\}} \end{aligned}$$
(2)

We also define a weight vector \(w = \{w_\text {edu}, w_\text {skill}, w_\text {exp}, w_\text {role}, w_\text {resp}\}\) to specify the importance of each attribute across all job descriptions. A weight can be any non-negative number. All similarity scores \(\text {sim}_i\) are scaled by the weights \(w_i\), summed, and then normalized to get the final similarity score between two job descriptions, as summarized in the following equation.

$$\begin{aligned} \text {JobSim}(J_k, J_l) = \frac{\sum _{i \in \{\text {edu},\text {skill},\text {exp},\text {role},\text {resp}\}} \left( w_i \times \text {sim}_i^{k,l}\right) }{\sum _{i \in \{\text {edu},\text {skill},\text {exp},\text {role},\text {resp}\}} w_i} \end{aligned}$$
(3)
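Putting Eqs. 1-3 together, a minimal sketch of the overall similarity computation; the weights and the sample structured JDs are illustrative.

```python
def jaccard(a: set, b: set) -> float:
    # Eq. 1: |intersection| / |union|, defined as 0 for two empty sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def experience_sim(e_k: float, e_l: float) -> float:
    # Eq. 2: 1 minus the gap normalized by the larger value.
    m = max(e_k, e_l)
    return 1.0 - abs(e_k - e_l) / m if m > 0 else 1.0

def job_sim(jd_k: dict, jd_l: dict, weights: dict) -> float:
    # Eq. 3: weighted average of per-attribute similarities.
    sims = {attr: jaccard(jd_k[attr], jd_l[attr])
            for attr in ("edu", "skill", "role", "resp")}
    sims["exp"] = experience_sim(jd_k["exp"], jd_l["exp"])
    return sum(weights[a] * sims[a] for a in sims) / sum(weights.values())

jd1 = {"edu": {"bachelor", "engineering"}, "skill": {"java", "databases"},
       "role": {"developer"}, "resp": {"design", "testing"}, "exp": 3.0}
jd2 = {"edu": {"bachelor", "engineering"}, "skill": {"java"},
       "role": {"developer"}, "resp": {"design"}, "exp": 5.0}
weights = {"edu": 1, "skill": 2, "exp": 1, "role": 1, "resp": 1}
print(round(job_sim(jd1, jd2, weights), 3))
```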

4 Experiments

We evaluated the performance of every phase of our system by running a set of experiments over the data set described below. The experiments fall into three sets: the first assesses the accuracy of the classification algorithm (CTP and AAP phases), the second assesses the accuracy of keyword extraction from labeled text for creating structured JDs (EP phase), and the third assesses the accuracy of the similarity algorithm (SP phase). All of these experiments are described in the following subsections.

4.1 Data Set

We collected approximately 27,000 JDs from 28 organizations, including IBM and its clients; client names are withheld to preserve confidentiality. These JDs were taken from actual jobs posted by the organizations for hiring candidates. The distribution of JDs across the 28 organizations is 8000, 4000, and 2500 from three organizations and 500 each from the remaining organizations. The organizations are all in the Information Technology sector, so a diverse set of JDs from a single domain was considered. These JDs were highly unstructured, with free-flowing text expressing the requirements of the job; there was no explicit or consistent marking of lines as skills, experience, roles, etc. The JDs produced about 0.18 million lines, which were used as the data.

Approximately 5000 randomly chosen lines were manually tagged to collect the ground truth. Human labelers assigned one or more of the labels \(\{y_\text {edu}, y_\text {skill}, y_\text {exp}, y_\text {role}, y_\text {resp}\}\) to every line, depending on what the line described. Along with the labels, the labelers also annotated the phrases that actually described the assigned labels; in total, 6367 phrases were annotated. Ten people contributed to this ground truth collection activity, and every line was labeled and annotated by 2 labelers. Overall agreement on labels and annotations was 86%, and conflicts were carefully resolved while finalizing the ground truth.

4.2 Classifier Evaluation

We compare the performance of our classification algorithm described in Sect. 3.2 with a baseline approach. Our algorithm uses a decision tree classifier that automatically generates a set of rules for assigning an attribute as a label to every line. The baseline approach, in contrast, relies on a set of rules manually provided by domain experts for every attribute: given the set of categories extracted for a line, the baseline scans through the rules for an attribute and, if any rule is satisfied, assigns that attribute to the line. This process is repeated for every attribute.

We conducted experiments with the baseline and our algorithm over the ground truth of 5000 manually labeled lines. As its rules are readily available, the baseline approach does not require a training phase; it predicted attributes for every line, and we then compared them against the ground truth. For our algorithm, 5-fold cross validation was used to report precision and recall.

When comparing with the baseline, we computed three different sets of results for our classification algorithm by selecting a different decision tree algorithm each time, namely C5.0 [1], CHAID [12], and C&RT [5].

As this is a multi-label classification problem, we report the F1 score for every attribute, as shown in Fig. 3. The F1 score of our algorithm beats or at least matches the baseline F1 score in all three settings for all attributes except experience. The improvement ranges from 0 for skills using C&RT to 0.23 for roles using CHAID. For experience, the drop in F1 score ranges from 0.04 using CHAID to 0.15 using C&RT. Thus, our algorithm works better than the baseline in most cases, and in the remaining cases it is not far behind in terms of F1 score. Additionally, our algorithm scales to larger data sets, where the manual rules of the baseline approach may not remain exhaustive.

Comparing the three settings of our algorithm, we can infer from the above analysis that CHAID is the best suited for our algorithm and C&RT the worst of the three.

Fig. 3. F1 score based comparison

We plot ROC curves for the decision tree classifiers for every attribute using the CHAID technique, based on the classification scores obtained for every attribute and every line. These ROC curves are shown in Fig. 4. The area under the ROC curve (AU-ROC) is high for all attributes, which establishes the quality of our classification algorithm.

Fig. 4. ROC curve along with AU-ROC value of a classifier for every attribute

4.3 Keyword Extractor Evaluation

We used the 5000 lines from the ground truth that have attributes assigned to them. For each of these lines, we extracted keywords based on the attributes of the line and compared the extracted keywords with the keywords annotated for that line in the ground truth. We adopt the standard definition of precision for our keyword extraction algorithm, using the following formula.

$$\begin{aligned} \text {Precision} = \frac{\sum _{L \in \{\text {all lines}\}}\sum _{i \in \{\text {edu},\text {skill},\text {exp},\text {role},\text {resp}\}} N_{\text {anno, ext}}^{i,L}}{\sum _{L \in \{\text {all lines}\}}\sum _{i \in \{\text {edu},\text {skill},\text {exp},\text {role},\text {resp}\}} N_{\text {ext}}^{i,L}} \end{aligned}$$
(4)

where \(N_{\text {anno, ext}}^{i,L}\) is the total number of keywords that were both annotated in the ground truth and extracted by our algorithm for a line L and attribute i, and \(N_{\text {ext}}^{i,L}\) is the total number of keywords extracted by our algorithm for a line L and attribute i. This formula gives the fraction of all extracted keywords that actually describe the attributes of the lines. The value of Eq. 4 was computed to be as high as 0.954.

We similarly adopt the standard definition of recall for our keyword extraction algorithm, as follows.

$$\begin{aligned} \text {Recall} = \frac{\sum _{L \in \{\text {all lines}\}}\sum _{i \in \{\text {edu},\text {skill},\text {exp},\text {role},\text {resp}\}} N_{\text {anno, ext}}^{i,L}}{\sum _{L \in \{\text {all lines}\}}\sum _{i \in \{\text {edu},\text {skill},\text {exp},\text {role},\text {resp}\}} N_{\text {anno}}^{i,L}} \end{aligned}$$
(5)

where \(N_{\text {anno}}^{i,L}\) is the total number of keywords annotated in the ground truth for a line L and attribute i. This gives the fraction of all annotated keywords that were actually recognized by our algorithm. The value of Eq. 5 was computed to be 0.842. This implies that our keyword extraction algorithm is highly effective, with high precision and recall; the corresponding F1 score is 0.896.

4.4 Similarity Algorithm Evaluation

The similarity algorithm provides a score between 0 and 1 for a pair of JDs; the higher the score, the higher the similarity. One way to evaluate the similarity algorithm is to compute the similarity scores of a JD with every other JD in our data set of 27,000 JDs, set a threshold on the similarity score to find all pairs of similar JDs, and then manually determine how many of those pairs are actually similar. There are two problems with this evaluation approach. First, setting a threshold value is tricky: a value that works for one pair of JDs may not be valid for another. Second, inspecting all similar pairs manually is not feasible for the possible \(27000 \times 27000\) pairs, and collecting ground truth for that many pairs is time consuming and requires many human resources.

We therefore adopted a ranking-based evaluation to address these two problems. We randomly selected 50 JDs out of the data set of 27,000 JDs. For every JD in this set of 50, we computed similarity scores with every other JD in the 27,000 set and ranked all JDs in decreasing order of their similarity scores. We then picked the top 10 and manually checked how many of them were actually similar, repeating this for each of the 50 selected JDs. Thus the ground truth collection effort was brought down from \(27000 \times 27000\) to \(50 \times 10\) pairs, and no threshold value is required. Instead of computing precision and recall, we computed the area under the ROC curve (AU-ROC) for each of these 50 ranked lists of 10 JDs each.

We observed a minimum AU-ROC of 0.642, a maximum of 0.9, and a mean of 0.779. This highlights the effectiveness of the similarity algorithm in ranking similar JDs at the top. The ranked list certainly does not provide an exact list of similar job descriptions, but it provides an ordered list of JDs that a user can follow to find similar job descriptions, greatly reducing the effort of scanning all JDs in random order. Depending on the application, a threshold on the similarity scores can also be used, or the top k JDs can be picked, further reducing the user's screening effort.

For the sake of completeness, we also report the precision of the similarity algorithm using the following experiment. Given the 50 ranked lists mentioned above, we set a high threshold of 0.7 for two JDs to be considered similar. This gave us a set of pairs of JDs predicted as similar, shortlisting on average the top 15 JDs from every ranked list and increasing the manual labeling effort from \(50 \times 10\) to \(50 \times 15\) pairs. Based on the manual labeling, we observed that 88% of the predicted similar JDs were truly similar, which establishes a high precision for our similarity algorithm.

5 Related Work

The importance of entity and identity resolution has been established in several earlier research works [4, 13, 17]. Our work follows the lines of recent approaches that are variants of the Fellegi-Sunter model [8], in which identity resolution is solved as a classification problem: given a set of similarity scores for different attributes of two candidates, classify the pair as a match or a non-match. Several bodies of work have devised, compared, and learned similarity measures for use in entity resolution (e.g., [7, 18]). Typically in such work, matching is performed individually on each of the attributes and a transitive closure is then used to eliminate inconsistencies. In our work, we establish these attributes through MSNRPCA [20], a hybrid feature reduction approach that combines feature ranking with PCA, and use the similarity to resolve identity. We train classification models for the attributes using well-established decision tree algorithms such as C5.0 [1], CHAID [12], and C&RT [5].

Entity resolution has been addressed in several domains (e.g., [15, 16]) and for different types of data, including text (e.g., [14]) and images (e.g., [11]). RISE targets the resolution of entities in the domain of job descriptions in recruiting systems. We highlighted the importance of the problem earlier, and the goals of our work are motivated by real-world requirements of recruiting systems. There is a pressing demand for identity resolution systems in which candidates identified through one channel for a specific organization can be routed to other job descriptions if they are not found suitable. Furthermore, there is demand for the creation of context-sensitive job descriptions based on existing job descriptions that had the best convergence. For all these purposes, similarities must be established and identities resolved. A key distinguishing factor is that our data sources are cross-organizational, so we expect the identities of these job descriptions to differ completely from one organization to another. The problem is also more challenging because of the nature of the attributes; for instance, the numeric values of the experience attribute require a different similarity measure.

A group of researchers has focused on resolving identities in large databases, providing methods to avoid a quadratic number of comparisons between all pairs of entities (e.g., [10]). Such methods can be leveraged to reduce the number of comparisons when finding similarities between every pair of JDs.

6 Conclusion and Future Work

We have built a system called RISE that addresses one of the key issues of identity resolution among job descriptions in recruitment systems. Recruitment systems typically employ technologies that allow centralized storage of data across different organizations; although centralized, the underlying unstructured data and the lack of resolution techniques have rendered this data less usable for several valuable applications. RISE provides an end-to-end system for establishing equivalence among identities and resolving them with high precision and recall. Our future work includes enabling several key capabilities on top of this system, such as automated creation of job descriptions based on context and routing of profiles across different jobs.