1 Introduction

Product ads are a popular form of search advertising (Footnote 1) that is increasingly used as a replacement for text-based search ads, and they are currently offered as an option by Bing, Google and Yahoo under different trade names. Unlike traditional search ads, which carry only a title, a link and a description, product ads are more structured. They often include further details such as the product identifier, the brand, the model for electronics, or the gender for clothing. These details are provided as part of data feeds that merchants transmit to the search engine, and they allow search engine providers to perform better keyword-based ad retrieval and to offer additional options such as faceted search over a set of product results. The level of completeness of the product specification, however, depends on the completeness of the advertisers’ own data, their level of technical sophistication in creating data feeds, and their willingness to provide information to the search engine provider beyond the minimally required set of attributes. As a result, product ads are often very incomplete when it comes to the details of the product on offer.

In this paper, we address this problem by enriching product ads with structured data extracted from HTML pages that contain semantic annotations. Structured data in HTML pages is becoming more commonplace and it obviates the need for costly information extraction. In particular, annotations using vocabularies such as schema.org (an initiative sponsored by Bing, Google, Yahoo, and Yandex) and Facebook’s OGP (Open Graph Protocol) are increasingly popular. To our knowledge, ours is the first work to investigate the potential of this data for enriching product ads, and to provide a targeted solution for matching product ads to product descriptions on the Web in order to exploit this data. In this work we focus on data annotated with the Microdata markup format using the schema.org vocabulary. Recent works [10, 11] have shown that the Microdata format is the most commonly used markup format, with highest domain and entity coverage. Also, schema.org is the most frequently used vocabulary to describe products.

Our method relies on a combination of highly accurate Information Extraction from unstructured text (titles and descriptions of products) and efficient and effective instance matching (also called reconciliation or duplicate detection) between product descriptions. More precisely, we use the structured product specifications in Yahoo’s Gemini Product Ads as supervision to build two feature extraction models, a dictionary-based model and a Conditional Random Field tagger, both of which extract attribute-value pairs from unstructured text. We then use these features to build machine learning models that identify matching products. An evaluation on three categories related to electronics shows that we are able to identify matching products across thousands of online shops with high precision and to extract valuable structured data for enriching product ads.

The rest of this paper is structured as follows. In Sect. 2, we give an overview of related work. In Sect. 3, we formally define the problem of enriching product ads and introduce our methodology. In Sect. 4, we present the results of matching unstructured product descriptions, followed by the results of enriching product ads in Sect. 5. In Sect. 6, we adapt the proposed methodology to the task of product categorization. We conclude with a summary and an outlook on future work.

2 Related Work

While the task of enriching product ads with features from HTML annotations has not been studied so far, the problem of product matching and integration on the Web has been studied extensively in recent years.

The approach presented by Ghani et al. [6] is the first effort for enriching product databases with attribute-value pairs extracted from product descriptions on the Web. The approach uses Naive Bayes in combination with a semi-supervised co-EM algorithm to extract attribute-value pairs from text. An evaluation on apparel products shows promising results; however, the system is able to extract attribute-value pairs only if both the attribute and the value appear in the text.

One of the closest works is that of Kannan et al. [8]. The approach uses a database of structured product records to build a dictionary-based feature extraction model. The extracted product features are then used to train a Logistic Regression model for matching product offers. The approach has been used for matching offers received by Bing Shopping to the Bing product catalog.

The XploreProducts.com platform [16] is the first effort to integrate products from different online shops annotated using RDFa annotations. The approach is based on several string similarity functions for product matching. Once the matching products are identified, the system integrates the available ratings, offers and reviews into one system. The system is evaluated on an almost balanced set of 600 electronics product combinations. However, in real applications the problem of product matching is highly imbalanced. The approach was first extended in [1], using a hybrid similarity method. It was later extended in [2], where hierarchical clustering is used for matching products from multiple web shops, using the same hybrid similarity method.

Similar to our CRF feature extraction approach, the authors in [9] propose an approach for annotating product descriptions based on a sequence BIO tagging model, following an NLP text chunking process. Specifically, the authors train a linear-chain conditional random field model on a manually annotated training dataset to identify only 8 general classes of terms. However, the approach is not able to extract explicit attribute-value pairs.

The first approach to perform product matching on Microdata annotations is presented in [13]. The approach is based on the Silk rule learning framework [7], which is able to identify matching products based on their attributes. To do so, different combinations of features from the product descriptions are used, e.g., bag of words, attribute-value pairs extracted using a dictionary, features extracted using manually written regular expressions, and a combination of all of these. The work has been extended in [14], where the authors developed a genetic algorithm for learning regular expressions for extracting attribute-value pairs from products.

While there are several approaches concerned with product data categorization [8, 12, 13, 15, 16], the approach by Meusel et al. [11] is the most recent one that exploits Microdata annotations for the categorization of product data. In this approach the authors exploit the already assigned s:Category property to develop distantly supervised approaches that map the products to a set of target categories from an existing product catalog.

3 Approach

3.1 Problem Statement

We have a database A of structured product ads and a dataset of unstructured product descriptions P extracted from the Web. Every record a \(\in \) A consists of a title, a description, a URL, and a set of attribute-value pairs extracted from the title of the ad, where the attributes are numeric, categorical or free-text attributes. Every record p \(\in \) P consists of a title and a description as unstructured textual fields. Our objective is to use the structured information from the product ads set A as supervision for identifying duplicate records in P, or for matching products from P to one or more structured ads in A. More precisely, we use the structured information as supervision for building a feature extraction model able to extract attribute-value pairs from the unstructured product descriptions in P. After the feature extraction model is applied, each product p \(\in \) P is represented as a vector of attributes \(F_p = \left\{ f_1, f_2, ..., f_n\right\} \), where the attributes are numerical or categorical. Then we use the attribute vectors to build a machine learning model able to identify matching products. To train the model we manually label a small training set of matching and non-matching unstructured product offers.

3.2 Methodology

The approach we propose in this paper consists of three main steps: (i) feature extraction, (ii) calculating similarity feature vectors and (iii) classification. The overall design of our system is illustrated in Fig. 1. The product integration workflow runs in two phases: training and application. The training phase starts with preprocessing both the structured product ads and the unstructured product descriptions. Then, we use the structured product ads to build a feature extraction model. In this work we build two strategies for feature extraction (see Sect. 3.3): a dictionary-based approach and a Conditional Random Fields tagger. Next, we manually label a small training set of matching and non-matching unstructured pairs of product descriptions. We use the created feature extraction model to extract attribute-value pairs from the unstructured product descriptions. Then, we calculate the similarity feature vectors for the labeled training product pairs (see Sect. 3.5). In the final step, the similarity feature vectors are used to train a classification model (see Sect. 3.6). After the training phase is over, we have a trained feature extraction model and a classification model.

The application phase starts with preprocessing both of the datasets that are supposed to be matched (Footnote 2). Next, we generate a set M of all possible candidate matching pairs, which leads to a large number of candidates, i.e., \(|M|=n*(n-1)/2\) if we try to identify duplicates within a single dataset of n products, or \(|M|=n*m\) if we try to match two datasets of products of size n and m, respectively. To reduce the search space we use the brand value for blocking, i.e., we apply the matcher only to pairs of product descriptions sharing the same brand. Then, we extract the attribute-value pairs using the feature extraction model and calculate the similarity feature vectors. In the final step we apply the previously built classification model to identify the matching pairs of products.
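The effect of brand blocking on the candidate set can be illustrated with a minimal sketch. It is not the implementation used in the experiments; the record layout (a dict with a "brand" key) and the function name candidate_pairs are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(products):
    """Yield candidate pairs of product ids that share the same brand.

    `products` maps a product id to a dict with a hypothetical "brand" key;
    products without a detected brand are skipped in this sketch.
    """
    blocks = defaultdict(list)
    for pid, record in products.items():
        brand = record.get("brand")
        if brand:
            blocks[brand.lower()].append(pid)
    for ids in blocks.values():
        # A block of k products contributes k*(k-1)/2 candidate pairs.
        yield from combinations(sorted(ids), 2)


products = {
    "p1": {"brand": "Samsung"},
    "p2": {"brand": "samsung"},
    "p3": {"brand": "LG"},
    "p4": {"brand": "Samsung"},
}
pairs = list(candidate_pairs(products))
```

For these four products, blocking keeps 3 candidate pairs instead of the 4*3/2 = 6 pairs of the unblocked set.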

Fig. 1. System architecture overview (Color figure online)

3.3 Feature Extraction

In this section we describe two approaches for extracting attribute-value pairs from unstructured product titles and descriptions. Both approaches take unstructured text as input and output a set of attribute-value pairs.

Dictionary-Based Approach: Our dictionary-based approach is motivated by the approach described by Kannan et al. [8]. We use the set of product ads in A to generate a dictionary of attributes and values. Let F represent all the attributes present in the product ads A. The dictionary represents an inverted index D from A such that D(v) returns the attribute name f \(\in \) F associated with a string value v. Then, to extract features from a given product description p, we generate all possible n-grams (\(n \le 4\)) from the text and try to match them against the dictionary values. In case of multiple matches, we choose the longest n-gram.
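The dictionary lookup can be sketched as follows. This is one possible reading of the longest n-gram rule (greedy, left to right), not necessarily the strategy used in the experiments; the ad layout and the function names are assumptions.

```python
def build_dictionary(ads):
    """Build the inverted index D: value string -> attribute name.

    Each ad is assumed to be a dict of attribute -> value. If a value occurs
    under several attributes, the last one seen wins in this sketch.
    """
    index = {}
    for ad in ads:
        for attribute, value in ad.items():
            index[value.lower()] = attribute
    return index


def extract_attributes(text, index, max_n=4):
    """Match n-grams (n <= max_n) against D, preferring the longest n-gram."""
    tokens = text.lower().split()
    found = {}
    i = 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in index:
                found[index[ngram]] = ngram
                i += n  # continue after the matched n-gram
                break
        else:
            i += 1  # no n-gram starting at this token is in the dictionary
    return found


ads = [{"brand": "Samsung", "series": "Galaxy"}, {"model": "Galaxy S5"}]
D = build_dictionary(ads)
extracted = extract_attributes("Samsung Galaxy S5 16GB", D)
```

Here "galaxy s5" is preferred over the shorter match "galaxy", and the unseen value "16gb" is not extracted.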

Conditional Random Fields: As the dictionary-based approach is able to extract only values that were seen in the product ads, we need a more advanced approach that is able to extract unseen attribute-value pairs. A commonly used approach for tagging textual descriptions in NLP is the conditional random field (CRF) model. A CRF is a conditional sequence model which defines a conditional probability distribution over label sequences given a particular observation sequence. In this work we use the Stanford CRF implementation (Footnote 3) in order to train product-specific CRF models [5]. To train the CRF model the following features are used: current word, previous word, next word, current word character n-gram (n \(\le \) 6), current POS tag, surrounding POS tag sequence, current word shape, surrounding word shape sequence, presence of word in left window (size = 4) and presence of word in right window (size = 4).

To train the CRF model we use the structured product ads from database A. That means that the model is able to extract only attribute names that appear in the database A, but it can tag values that don’t appear in the database.
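To make the feature set more concrete, the following sketch computes a subset of the listed features for a single token. It is written in plain Python and does not use or mirror the Stanford CRF API; the feature names, the shape encoding and the function names are assumptions, and the POS-based features are omitted because they require a tagger.

```python
def word_shape(word):
    """Collapse a token to a coarse shape, e.g. 'UN55ES6500' -> 'XdXd'."""
    shape = []
    for ch in word:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch
        if not shape or shape[-1] != symbol:
            shape.append(symbol)  # merge runs of the same character class
    return "".join(shape)


def token_features(tokens, i, max_n=6, window=4):
    """Features for tokens[i]: word, neighbours, shape, n-grams, windows."""
    word = tokens[i]
    features = {
        "word": word.lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "shape": word_shape(word),
    }
    for n in range(1, max_n + 1):  # character n-grams with n <= max_n
        for start in range(len(word) - n + 1):
            features["ngram=" + word[start:start + n].lower()] = True
    for left in tokens[max(0, i - window):i]:  # words in the left window
        features["left=" + left.lower()] = True
    for right in tokens[i + 1:i + 1 + window]:  # words in the right window
        features["right=" + right.lower()] = True
    return features


tokens = ["Samsung", "UN55ES6500", "55", "inch", "LED", "TV"]
features = token_features(tokens, 1)
```

Character n-grams and word shapes are what allow a sequence tagger to generalize to values that never occur in A, such as an unseen model number with a familiar shape.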

Custom Feature Extraction: Besides the supervised extraction approaches, we extract several more features for all unstructured products. We use the product Web domain name and the product URL (both are considered long strings in the following section). The rationale for using these two fields is that important keywords can often be found in the product URL, and the domain might be a good indicator of the type of the product.

Furthermore, we noticed that some product titles and/or descriptions contain a so-called product code, which in many cases uniquely identifies the product. For example, UN55ES6500 is a unique product code for a Samsung TV. This attribute is highly relevant for the task of product matching. To extract the product code from the text we use a set of manually written regular expressions across all categories.
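The regular expressions used in the experiments are not listed here. As an illustration only, a single hypothetical pattern that captures codes such as UN55ES6500 could look as follows; the pattern and the function name are assumptions, and a practical rule set would need more patterns and category-specific exceptions.

```python
import re

# Hypothetical pattern: a token of at least 6 characters made of upper-case
# letters, digits and hyphens that contains both a letter and a digit.
PRODUCT_CODE = re.compile(
    r"\b(?=[A-Z0-9-]*[A-Z])(?=[A-Z0-9-]*\d)"
    r"[A-Z0-9][A-Z0-9-]{4,}[A-Z0-9]\b"
)


def extract_product_codes(text):
    """Return all substrings of `text` that look like product codes."""
    return PRODUCT_CODE.findall(text)


codes = extract_product_codes("Samsung UN55ES6500 55-Inch 1080p LED HDTV")
```

In this title only UN55ES6500 matches; tokens such as "1080p" or "55-Inch" are rejected because they contain lower-case letters.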

3.4 Attribute Value Normalization

Once all attribute-value pairs are extracted from the given dataset of offers, we continue with normalizing the values of the attributes. To do so, we first try to identify the data type of each of the attributes, using several manually defined regular expressions, which are able to detect the following data types: string, long string (string with more than 3 word tokens), number and number with unit of measurement. Additionally, the algorithm uses around 200 manually generated rules for converting units of measurement to the corresponding base unit (metric system), e.g., 5” will be converted to 0.127 m. Finally, string values are lowercased, and stop words and some special characters are removed.
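The type detection and unit conversion can be sketched as follows. The regular expressions and the small unit table are hypothetical stand-ins for the manually defined expressions and the roughly 200 conversion rules mentioned above, and stop-word removal is left out for brevity.

```python
import re

# Hypothetical subset of the unit rules: (factor, metric base unit).
UNIT_TO_BASE = {
    '"': (0.0254, "m"), "inch": (0.0254, "m"), "in": (0.0254, "m"),
    "cm": (0.01, "m"), "mm": (0.001, "m"),
    "kg": (1.0, "kg"), "g": (0.001, "kg"), "lb": (0.45359237, "kg"),
}
NUMBER = re.compile(r"^\d+(?:\.\d+)?$")
NUMBER_WITH_UNIT = re.compile(r'^(\d+(?:\.\d+)?)\s*("|[a-zA-Z]+)$')


def normalize(value):
    """Return (data type, normalized value) for a raw attribute value."""
    value = value.strip().replace("\u201d", '"')  # typographic inch mark
    if NUMBER.match(value):
        return "number", float(value)
    match = NUMBER_WITH_UNIT.match(value)
    if match and match.group(2).lower() in UNIT_TO_BASE:
        factor, base = UNIT_TO_BASE[match.group(2).lower()]
        amount = round(float(match.group(1)) * factor, 6)
        return "number with unit", (amount, base)
    tokens = value.lower().split()
    # A string with more than 3 word tokens is treated as a long string.
    kind = "long string" if len(tokens) > 3 else "string"
    return kind, " ".join(tokens)
```

With these rules, normalize('5"') yields ("number with unit", (0.127, "m")), which agrees with the conversion example in the text.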

Example Tagging: In Fig. 2 we give an example of feature extraction from a given product title. The extracted attribute-value pairs are shown in Table 1, as well as the normalized values, and the detected attribute data type.

Fig. 2. Example of attribute extraction from a product title

Table 1. Attributes and values normalization

3.5 Calculating Similarity Feature Vectors

After the feature extraction is done, we can define an attribute space \(F = \left\{ f_1, f_2, ..., f_n\right\} \) that contains all of the extracted attributes. To measure the similarity between two products we calculate a similarity feature vector \(F(p_i,p_j)\) for each candidate product pair. For two products \(p_1\) and \(p_2\), represented with the attribute vectors \(F_{p1} = \left\{ f_1v, f_2v, ..., f_nv\right\} \) and \(F_{p2} = \left\{ f_1v, f_2v, ..., f_nv\right\} \), respectively, we calculate the similarity feature vector \(F(p_1,p_2)\) by calculating the similarity value for each attribute f in the attribute space F. Let \(p_1\).val(f) and \(p_2\).val(f) represent the value of an attribute f from \(p_1\) and \(p_2\), respectively. The similarity between \(p_1\) and \(p_2\) for the attribute f is calculated based on the attribute data type as shown in Eq. 1.

(1)

The Jaccard similarity is calculated on character n-grams (n \(\le \) 4), and the Cosine similarity is calculated on word tokens using TF-IDF weights.
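A per-attribute similarity computation of this kind can be sketched as follows. The Jaccard and cosine parts follow the description above, with Jaccard applied to string attributes and cosine to long string attributes; the treatment of missing values (similarity 0) and of numeric attributes (exact equality) is an assumption of this sketch, as are the function names and the idf dictionary passed in by the caller.

```python
import math
from collections import Counter


def char_ngrams(text, max_n=4):
    """All character n-grams of `text` with n <= max_n."""
    return {text[i:i + n] for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)}


def jaccard(a, b):
    x, y = char_ngrams(a), char_ngrams(b)
    return len(x & y) / len(x | y) if x | y else 0.0


def tfidf_cosine(a, b, idf):
    """Cosine similarity on word tokens; unknown tokens get idf 1.0."""
    va = {t: c * idf.get(t, 1.0) for t, c in Counter(a.split()).items()}
    vb = {t: c * idf.get(t, 1.0) for t, c in Counter(b.split()).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0


def similarity_vector(p1, p2, types, idf=None):
    """One similarity value per attribute in the attribute space `types`."""
    idf = idf or {}
    vector = {}
    for attribute, kind in types.items():
        v1, v2 = p1.get(attribute), p2.get(attribute)
        if v1 is None or v2 is None:
            vector[attribute] = 0.0  # assumption: missing value gives 0
        elif kind == "string":
            vector[attribute] = jaccard(v1, v2)
        elif kind == "long string":
            vector[attribute] = tfidf_cosine(v1, v2, idf)
        else:
            # assumption: numeric attributes are compared by exact equality
            vector[attribute] = 1.0 if v1 == v2 else 0.0
    return vector
```

The resulting vector has one entry per attribute in F, so all candidate pairs share the same feature space regardless of which attributes were actually extracted for each product.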

3.6 Classification Approaches

Once the similarity feature vectors are calculated, we train four different classifiers that are commonly used for the given task: (i) Random Forest, (ii) Naive Bayes, (iii) Support Vector Machines (SVM) and (iv) Logistic Regression.

As the training dataset contains only a few matching pairs, and a lot of non-matching pairs, the dataset is highly imbalanced. To address the problem of classifying imbalanced datasets we use two sampling approaches [4]: (i) Random Under Sampling (RUS): removes samples from the majority class until the number of the samples of the minority class equals the number of samples of the majority class; (ii) Random Over Sampling (ROS): randomly samples instances from the minority class until the number of the samples of the minority class equals the number of samples of the majority class.
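Both sampling strategies are simple to state in code. The following is a minimal sketch assuming that the majority class is at least as large as the minority class; the function names and the fixed seed are choices made for illustration.

```python
import random


def random_under_sample(majority, minority, seed=0):
    """RUS: drop majority samples until both classes have equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_over_sample(majority, minority, seed=0):
    """ROS: re-draw minority samples (with replacement) up to equal size."""
    rng = random.Random(seed)
    missing = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(missing)]
    return list(majority), list(minority) + extra


non_matching = list(range(100))  # majority class
matching = ["a", "b", "c"]       # minority class
rus_majority, rus_minority = random_under_sample(non_matching, matching)
ros_majority, ros_minority = random_over_sample(non_matching, matching)
```

RUS yields a much smaller training set (here 3 + 3 instances instead of 100 + 100 for ROS), which is why it trains faster.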

4 Evaluation

In this section, we evaluate the extent to which we can use the dataset of structured product ads for the task of matching unstructured product descriptions.

4.1 Datasets

For the evaluation we use Yahoo’s Gemini Product Ads (GPA) for supervision, and we use a subset of the WebDataCommons (WDC) extraction (Footnote 4).

Table 2. Datasets used in the evaluation

Product Ads - GPA Dataset: For our experiments, we use a sample of three product categories from Yahoo’s Gemini Product Ads database. More precisely, we use a sample of 3,476 TVs, 3,372 mobile phones and 3,330 laptops. There are 35 different attributes in the TVs and mobile phones categories, and 27 attributes in the laptops category. We use this dataset to build the dictionary-based and the CRF feature extraction models.

Unstructured Product Offers - WDC Microdata Dataset: The latest extraction of WebDataCommons includes over 5 billion entities marked up by one of the three main HTML markup languages (i.e., Microdata, Microformats and RDFa) and has been retrieved from the CommonCrawl 2014 corpus (Footnote 5). From this dataset we focus on product entities annotated with Microdata using the schema.org vocabulary. To do so, we use a subset of entities annotated with http://schema.org/Product. The dataset contains 288,082,823 entities in total, or 2,829,523,589 RDF quads. 89,608 PLDs (10.9 %) annotate at least one entity as s:Product and 62,849 PLDs (7.6 %) annotate at least one entity as s:Offer. In our approach, we make use of the properties s:name and s:description for extracting attribute-value pairs.

To evaluate the approach, we built a gold standard from the WDC dataset on three categories in the Electronics domain, i.e., TVs, mobile phones and laptops. We set some constraints on the entities we select: (i) the product must contain the s:name and s:description properties in English, (ii) the s:name must contain between 3 and 50 words, (iii) the s:description must contain between 10 and 200 words, (iv) the entity must not come from a community advertisement website (e.g., gumtree.com), and (v) the product can be uniquely identified based on the title and description, i.e., they contain enough information to pinpoint the exact product.

The gold standard is generated by manually identifying matching products in the whole dataset. Two entities are labeled as matching products if both entities contain enough information to be uniquely identified, and both entities point to the same product. It is important to note that the entities do not necessarily contain the same set of product features. The number of entities, the number of matching and non-matching pairs for each of the datasets is shown in Table 2.

4.2 Experiment Setup

To evaluate the effectiveness of the approach we use the standard performance measures, i.e., Precision (P), Recall (R) and F-score (F1). The results are calculated using stratified 10-fold cross validation. For conducting the experiments, we used the RapidMiner machine learning platform and the RapidMiner development library.

We compare our approach with two baseline methods. First, we try to match the products based on TF-IDF cosine similarity. We report the best score over different matching thresholds, i.e., we iterate the matching threshold from 0.0 to 1.0 (with step 0.01) and we assume that all pairs with similarity above the threshold are matching pairs (Footnote 6).
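The threshold sweep of the first baseline can be sketched as follows. The similarity scores and labels in the example are made up for illustration and are not taken from the evaluation data; the function names and the choice of F1 as the selection criterion are assumptions.

```python
def precision_recall_f1(predicted, actual):
    """Precision, recall and F1 for the positive (matching) class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def best_threshold(similarities, labels, step=0.01):
    """Sweep thresholds 0.0, step, ..., 1.0 and keep the best F1."""
    best = (0.0, 0.0, 0.0, 0.0)  # (f1, threshold, precision, recall)
    for k in range(int(round(1 / step)) + 1):
        threshold = k * step
        # Pairs with similarity above the threshold are predicted matches.
        predicted = [s > threshold for s in similarities]
        p, r, f1 = precision_recall_f1(predicted, labels)
        if f1 > best[0]:
            best = (f1, threshold, p, r)
    return best


# Illustrative cosine similarities and gold labels for five candidate pairs.
similarities = [0.95, 0.80, 0.60, 0.40, 0.10]
labels = [True, True, False, True, False]
f1, threshold, precision, recall = best_threshold(similarities, labels)
```

Because the threshold is selected on the same pairs it is evaluated on, the reported baseline score is an optimistic upper bound for this method.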

As a second baseline we use the Silk Link Discovery Framework [7], an open-source tool for discovering links between data items within different data sources. The tool uses a genetic algorithm to learn linkage rules based on the extracted attributes. For this experiment, we first extract the features from the product title and description using our CRF model, and then represent the gold standard in RDF format. The evaluation is performed using 10-fold cross validation.

4.3 Results

The results for both baseline approaches are shown in Table 3. Both baseline approaches deliver rather poor results.

Table 3. Products matching baseline results using cosine similarity and Silk

Table 4 shows the results of our approach on the TVs dataset, using both the CRF and the dictionary-based feature extraction approaches. The best score is achieved using the CRF feature extraction approach and the Random Forest classifier without sampling. The same classifier performs slightly worse when using the dictionary-based feature extraction approach.

Table 4. Products matching performance - televisions

Table 5 shows the results on the mobile phones dataset. As before, the best score is achieved using the CRF feature extraction approach, and Random Forest classifier using ROS sampling. We can note that the results using the dictionary-based approach are significantly worse than the CRF approach. The reason is that the GPA dataset contains a lot of trendy phones from 2015, while the WDC dataset contains phones that were popular in 2014, therefore the dictionary-based approach fails to extract many attribute-value pairs.

Table 5. Products matching performance - mobile phones

Table 6 shows the results of our approach on the Laptops dataset. Again, the best score is achieved using the CRF feature extraction approach and the Random Forest classifier without sampling. We can observe that for this dataset the results drop significantly compared to the other datasets. The reason is that the matching task for laptops is more challenging, because it needs more overlapping features to conclude that two products are matching (Footnote 7).

Table 6. Products matching performance - Laptops

The results show that our approach clearly outperforms both baseline approaches on all three categories. The Random Forest classifier delivers the best result for all three categories. We can observe that the other classifiers achieve high recall, i.e., they are able to detect the matching pairs in the dataset, but they also misclassify a lot of non-matching pairs, leading to low precision. It is also interesting to observe that RUS sampling performs almost as well as the other sampling techniques, but has a considerably lower runtime.

CRF Evaluation: We also evaluate the Conditional Random Field model on the database of structured product ads. For each of the three product categories we select 70 % of the instances as a training set and the rest as a test set. The results for each category, as well as the number of instances used for training and testing, and the number of attributes are shown in Table 7. The results show that the CRF model is able to identify the attributes in the text descriptions with high precision.

Table 7. CRF evaluation on structured product ads data

5 Data Fusion

As the evaluation of the approach showed that we are able to identify duplicate products with high precision, we apply the approach to the whole WDC and GPA product datasets. First, we try to identify duplicate products within the WDC dataset for the top 10 TV brands. Then, we try to identify matching products in the WDC dataset for the product ads in the GPA dataset in the TV category.

Integrating Unstructured Product Descriptions: In the first experiment we apply the previously trained Random Forest model to identify matching products for the top 10 TV brands in the WDC dataset. To do so, we selected a subset of products from the WDC dataset that contain one of the TV brands in the s:name or s:description of the products. Furthermore, we apply the same constraints described in Sect. 4, which reduces the number of products. We use the brand name for blocking, i.e., we generate candidate matching pairs only for products that share the same brand. We use the CRF feature extraction approach to extract the features, and we tune the Random Forest model to increase precision at the cost of lower recall, i.e., a candidate product pair is considered a positive matching pair only if the classification confidence of the model is above 0.8.

We report the number of discovered matches for each of the TV brands in Table 8. The second row of the table shows the number of candidate product descriptions after we apply the selection constraints on each brand. We manually evaluated the correctness of the matches and report the precision. The results show that we are able to find a large number of matching products with high precision. By relaxing the selection constraints of product candidates the number of discovered matches would increase, but it might also reduce the precision.

Table 8. Discovered matching products in the WDC dataset

Furthermore, we try to identify how many matches and new attributes can be identified for single products. We randomly chose 5 different TVs and counted the number of discovered matches and of s:offers, s:reviews and s:aggregateRating properties from the WDC dataset, as well as the number of new attribute-value pairs we discover from the s:name and s:description using the CRF model.

The results are shown in Table 9. The results show that we are able to identify a number of matches among products, and the aggregated descriptions have at least six new attribute-value pairs in each case.

Table 9. Extracted attributes for TV products

Enriching Product Ads: In this experiment we try to identify matching products in the WDC dataset for the product ads in the GPA dataset. Similarly as before, we select WDC products based on the brand name and we apply the same filtering to reduce the sub-set of products for matching. To extract the features for the WDC products we use the CRF feature extraction model, and for the GPA products we use the already existing features provided by the merchants. To identify the matches we apply the same Random Forest model as before. The results are shown in Table 10. The second row reports the number of products of the given brand in the GPA dataset, and the third row in the WDC dataset.

The results show that we are able to identify a small number of matching products with high precision. We have to note again that we are not able to identify any matches for the products in the GPA dataset that were released after 2014, because they do not appear in the WDC dataset.

Furthermore, we analyzed the number of new attributes we can discover for the GPA products from the matching WDC products. The distribution of matches, newly discovered attribute-value pairs, offers, ratings and reviews per GPA instance is shown in Fig. 3. The results show that for each product ad for which we found a matching product description, at least 1 new attribute-value pair was discovered, and for some product ads as many as 8 new attribute-value pairs were discovered.

Table 10. Discovered matching products in the WDC dataset for product ads in the GPA dataset

Fig. 3. Distribution of newly discovered matches and attributes per product ad

6 Product Categorization

Categorization of products in a given product catalog is an important task. For example, online shops categorize products for easier navigation, and in the case of product ads, categorization allows easier ad retrieval and better user targeting. Also, identifying the category of the products before applying the matching approach might be used for blocking. Here we examine to what extent we can use a structured database of product ads to categorize unstructured product descriptions. Again we use the database of structured product ads to extract features from unstructured product descriptions, which are then used to build a classification model.

In this approach we use only the dictionary-based feature extraction approach as described in Sect. 3.3 (Footnote 8). To build the dictionary, we use all product ads across all categories in the database of product ads. To generate the feature vector for each instance, after the features are extracted from the text, the value of each feature is tokenized and lowercased, and tokens shorter than 3 characters are removed. The terms of each feature are concatenated with the feature name, e.g., for the value blue of the feature color, the final value will be blue-color.
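The construction of the categorization terms can be sketched in a few lines. The input layout (a dict of extracted feature name to raw value) and the function name are assumptions made for illustration.

```python
def feature_terms(extracted, min_length=3):
    """Turn extracted attribute-value pairs into 'token-feature' terms."""
    terms = []
    for feature, value in extracted.items():
        for token in value.lower().split():
            if len(token) >= min_length:  # drop tokens under 3 characters
                terms.append(f"{token}-{feature}")
    return terms


terms = feature_terms({"color": "Blue", "brand": "Samsung", "display": "LED TV"})
```

For this input the terms are blue-color, samsung-brand and led-display; the token "tv" is dropped because it is shorter than 3 characters. Appending the feature name keeps, for example, a brand called "blue" distinct from the color blue.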

Following Meusel et al. [11], in order to weigh the different features for each of the elements in the two input sets, we apply two different strategies, Binary Term Occurrence (BTO) and TF-IDF. In the end we use the feature vectors to build a classification model. We have experimented with 4 algorithms: Naive Bayes (NB), Support Vector Machines (SVM), Random Forest (RF) and k-Nearest Neighbors (KNN), where k = 1.
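The difference between the two weighting strategies can be illustrated with a minimal sketch. TF-IDF exists in several variants; the one below (relative term frequency times log(N/df)) is an assumption and not necessarily the variant used in the experiments, and the function names are hypothetical.

```python
import math
from collections import Counter


def binary_term_occurrence(documents):
    """BTO: 1 if the term occurs in the document, 0 otherwise."""
    vocabulary = sorted({t for doc in documents for t in doc})
    vectors = [[1 if t in doc else 0 for t in vocabulary]
               for doc in documents]
    return vocabulary, vectors


def tf_idf(documents):
    """TF-IDF with tf = relative frequency and idf = log(N / df)."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[t] / len(doc) * idf[t] if doc else 0.0
                        for t in vocabulary])
    return vocabulary, vectors


documents = [
    ["blue-color", "samsung-brand"],
    ["black-color", "samsung-brand"],
]
vocabulary, bto_vectors = binary_term_occurrence(documents)
_, tfidf_vectors = tf_idf(documents)
```

In this toy example the term samsung-brand occurs in both documents, so it keeps weight 1 under BTO but receives weight 0 under TF-IDF, which only rewards terms that discriminate between documents.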

Gold Standard: For our experiments we use the GS1 Product Catalogue (GPC) (Footnote 9) as a target hierarchy. The hierarchy is structured in six different levels, but in our experiments we try to categorize the products in the first three levels of the taxonomy: Segment, Family and Class. The first level contains 38 different categories, the second level 113 categories and the third level 783 categories.

To evaluate the proposed approach we use the Microdata products gold standard developed in [11]. We removed non-English instances from the dataset, resulting in 8,362 products. In our evaluation we use the s:name and the s:description properties for generating the features.

Table 11. The best categorization results using the dictionary-based approach and baseline approach

Evaluation: The evaluation is performed using 10-fold cross validation. We measure accuracy (Acc), Precision (P), Recall (R) and F-score (F1). We compare our approach to a TF-IDF and BTO baseline, where the text is preprocessed as before, but no dictionary is used.

Due to space constraints we show only the best performing results for each of the three levels (Table 11). The complete results can be found online (Footnote 10). The results show that the dictionary-based approach can be used for classification of products on different levels of the hierarchy with high performance. The results also show that it outperforms the baseline approaches on all three levels of classification for both accuracy and F-score. Again, we have to note that the gold standard contains products that appeared in 2014, while the GPA dataset contains relatively up-to-date products from 2015.

7 Conclusion

In this paper, we proposed an approach for enriching structured product ads with structured data extracted from HTML pages that contain semantic annotations. The approach is able to identify matching products in unstructured product descriptions using the database of structured product ads as supervision.

We identify the Microdata dataset as a valuable source for enriching existing structured product ads with new attributes. We showed that not only can we integrate some of the existing Microdata attributes, like s:offers, s:aggregateRating and s:review, but with our approach we can also extract valuable attribute-value pairs from the textual information of the products that do not exist in the structured product ads. In future work, to validate the value of the new attributes, we need to evaluate their influence on the ads ranking algorithm. We could also include other schema.org product properties in the approach, like s:mpn, s:model and s:gtin, which might be useful for identity resolution. Additionally, by mining search engine query logs we could extract valuable features for identifying matching products.

Besides integrating products over different online shops and product categorization, our approach could be used for search query processing, which would undoubtedly improve the shopping experience for users [3].