Introduction

Information extraction (IE) is a broad area in both the Natural Language Processing (NLP) and the Web communities (Chang et al. 2006a, b). The main goal of IE is to extract useful information from raw documents and webpages. Traditional IE, the setting adopted in this article, assumes a particular schema according to which information must be extracted and typed. Domain-specific applications, such as human trafficking, generally require the schema to be specific and fine-grained, supporting attributes of interest to investigators, including phone number, address and also physical features such as hair color and eye color (Fig. 1). As shown in the figure, some attributes may occur as ‘links’ (e.g., phone number) and are not directly visible in the text on the page. There is also considerable heterogeneity, both across webpages in the same Web domain (e.g., two individual webpages from backpage.com), and across Web domains (e.g., backpage.com and liveescortreviews.com). All of these observations make IE a challenging problem in an illicit Web domain such as online sex trafficking.

Fig. 1

A truncated example of a webpage (images/key information are obfuscated) from the online sex advertisement domain. White boxes indicate regions of the webpage containing critical attributes (e.g., age) that need to be processed and extracted by an information extraction (IE) system

As with other AI approaches, quality tradeoffs of rival IE systems are determined by applying them (after the appropriate training and validation, if applicable; see Footnote 1) to a withheld (but still manually labeled) test dataset (Freitag 2000). It is less clear how IE quality can be evaluated without access to such ground-truths. This article probes the issue of whether it is possible to use the relational structure of an IE system’s outputs to characterize its quality (using well defined metrics) without the provision of ground-truths. Rather than focus on a theoretical model, the article specifically considers quality evaluation in the online sex advertisement domain. We hypothesize that, as with other relational systems, network science could be used in support of this goal.

We exploit the following intuition in support of this goal. Since it is generally the case that an attribute (such as city) is not extracted from a single document, but from multiple documents, extractions tend to be ‘shared’ between documents. Furthermore, a single document can yield more than one extraction per attribute, especially if the underlying IE system is recall-favoring, and some extraction combinations are higher-probability than others. For example, ‘Charlotte’ and ‘Raleigh’ have higher probability of being extracted from the same document than ‘Charlotte’ and ‘Los Angeles’. In the same vein, some extractions are more noise-prone than others; e.g., Charlotte has higher potential to be mis-extracted as a name (in some documents) than Raleigh. Such relational connections can be used to model the set of documents and IE extractions as an attribute extraction network (AEN). The AEN is constructed by modeling documents as nodes in the network, and by modeling extractions shared between two documents as network edges. Our overarching hypothesis is that changes in the structure of this simple network can be used to ‘track’ changes in the quality of the underlying IE system that yielded the extractions in the first place. Specifically, we propose to answer the following research questions for a domain-specific corpus of online sex advertisements:

Research Question (RQ) 1 : When extractions from a corpus are represented as an attribute extraction network (AEN), how does the structure of the network change as the quality of the underlying IE system changes, where structure is measured using single-point network-theoretic metrics such as algebraic connectivity and network diameter?

Research Question (RQ) 2: How does the degree distribution of an AEN change as IE quality changes? Are degree distributions normal?

We note that, strictly speaking, RQ2 is a special case of RQ1, since the degree distribution is a function of network structure as well. However, for methodological reasons we choose to separate the two questions, with the first question covering single-point metrics (such as connectivity metrics, diameter etc.) and the second covering the degree distribution itself. Potentially, any distribution (e.g., clustering coefficient distribution) could be selected for investigation in lieu of the degree distribution. However, the motivation in choosing the degree distribution is to specifically investigate if networks such as the AEN obey (approximately) normal distributions. If not, then important concerns arise as to whether simple random sampling and labeling is appropriate when constructing a gold standard (or training dataset) for an IE system. In fact, many theoretical treatments of machine learning make critical assumptions about i.i.d. (independent and identically distributed) data. Similarly, when evaluating machine learning and NLP systems, it is often the case that (when exhaustive ground-truths are not available) outputs are randomly sampled and annotated, the hope being that the measured performance will generalize statistically with finite, but sufficiently sized, samples.

By plotting the degree distribution of the AEN, we can directly analyze whether noise in the IE system is i.i.d by verifying that the degree distribution is Gaussian. If, instead, the distribution exhibits a power-law trend as with scale-free networks, for example, non-normality would be strongly indicated, implying that we should revise our statistical assumptions when deciding how to sample and annotate extractions for IE quality assessments. Furthermore, by studying how (or whether) the degree distribution changes or undergoes drift as the level of noise in the IE outputs increases, we hope to gain interesting and direct insights into the nature of IE noise.
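As a rough illustration of the kind of check described above, the following minimal sketch computes the sample skewness and excess kurtosis of a degree sequence. Both are close to zero for approximately normal data and become large and positive for heavy-tailed (e.g., power-law-like) data. The function name and the toy degree sequences are assumptions made purely for illustration; they are not taken from the study, and moment-based checks are only a necessary, not a sufficient, indication of normality.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def skewness_and_excess_kurtosis(degrees: Sequence[int]) -> Tuple[float, float]:
    """Return (skewness, excess kurtosis) of a degree sequence.

    Uses population moments. Both values are near 0 for roughly normal data.
    """
    mu = mean(degrees)
    sigma = pstdev(degrees)
    if sigma == 0:
        # All degrees identical: the moments are undefined, so report zeros.
        return 0.0, 0.0
    n = len(degrees)
    skew = sum((d - mu) ** 3 for d in degrees) / (n * sigma ** 3)
    kurt = sum((d - mu) ** 4 for d in degrees) / (n * sigma ** 4) - 3.0
    return skew, kurt


# Hypothetical degree sequences (illustrative only).
symmetric_degrees = [1, 2, 3, 4, 5]
heavy_tailed_degrees = [1] * 90 + [2] * 8 + [50] * 2  # a few very high-degree hubs

sym_skew, sym_kurt = skewness_and_excess_kurtosis(symmetric_degrees)
heavy_skew, heavy_kurt = skewness_and_excess_kurtosis(heavy_tailed_degrees)
print(sym_skew, heavy_skew)
```

A strongly positive skewness, as in the second toy sequence, would be one signal that a Gaussian assumption is questionable; a plot of the distribution on log-log axes remains the more direct diagnostic for a power-law trend.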

For conducting these empirical studies, we use sex advertisement data scraped from the Open Web, and attributes extracted by a relatively advanced IE for an in-use investigative search engine developed in our previous work (Kejriwal and Szekely 2017b). This search engine is being used by multiple investigative agencies in the United States, and empirical work conducted in support of this article is being directly applied (Kejriwal et al. 2018). However, while our previous work was focused on describing and evaluating a search engine for HT, as well as the information extraction programs that fed into the search engine, this work centers on evaluating IE on an HT-specific corpus without access to a ground truth.

Contributions

Specific contributions in this article are as follows. We propose to empirically study Information Extraction (IE) quality in the human trafficking (HT) domain using a novel network science-based framework, without relying on traditionally required ground truths. This HT-focused empirical study is a central contribution of the article, since an important motivation for our research is to mitigate the expense of acquiring a laboriously annotated ground-truth, without which an IE cannot be evaluated (and consequently, any system that relies on IE cannot be used with confidence by investigators). Specifically, we empirically study two research questions, using a 10,000+ document corpus of sex advertisements crawled from backpage.com (Footnote 2), and a variety of IE systems executed over three attributes (name, city and phone). Our results show that there is a definite and consistent correlation between standard quality metrics as defined by the IE community, and structural metrics defined by the network science community. To the best of our knowledge, such a correlation has not been noted or exploited in prior work. Our results also suggest the possibility of using structural metrics, which can be deduced in an unsupervised manner without access to a ground truth, to study whether a given IE is deviating from the ground truth (compared to another IE) on a quality metric such as precision or F-measure.

Related work

The primary problem that is being studied in this article is the evaluation of an important AI approach (information extraction) in a domain with deep social impact (human trafficking). Because we cannot assume a ground-truth, the evaluation is conducted using network-theoretic techniques. All of these individual fields of study (information extraction, computational investigations of human trafficking, and network science) have individually received much research attention, as we describe in the sub-sections below. However, we are not aware of any network science papers that are related to evaluation of IE quality without potential access to ground truths, either within or without the context of human trafficking. This article attempts to build such a bridge within the context of the special domain of human trafficking.

Information extraction

Information extraction is a core component of any information integration pipeline over Web and natural language corpora, as ‘unstructured’ (Footnote 3) data must first be rendered into a machine-readable, structured form in order for fine-grained queries to be executed over them. With the initial advent of the Web, wrapper induction systems proved successful for several IE domains (Kushmerick et al. 1997). State-of-the-art work in the late 1990s and early 2000s (e.g., STALKER (Muslea et al. 1998)) used machine learning methods for the wrapper induction problem (Lerman et al. 2003). Such methods were inherently data-driven, and were less brittle than rule-based wrapper architectures. IE systems have continued to evolve since then; Chang et al. provide a comparative survey of many of the leading IE techniques along three dimensions (task domain, degree of automation and the actual techniques used) (Chang et al. 2006a). A key finding of the survey is the dependence of techniques on the actual input format. For example, while unsupervised and semi-supervised methods are well-suited for template pages, regular expressions and supervised approaches tend to be more robust for non-template pages (Lerman et al. 2003; Muslea et al. 1998). A consequent problem arising from such diverse methodologies is evaluating precision and recall in a consistent way (Chang et al. 2006a).

There has been much research on IE in traditional domains, and on datasets that are ‘well-formed’ (e.g., newswire) with accuracy on attributes such as person names often exceeding 80% (Nadeau and Sekine 2007). In contrast, it is well known that for more complex extractions (including relation and event extractions), accuracy is much lower (Ahn 2006). A similar problem occurs when one moves from newswire to social media and unusual domains that have not been well-studied, either socially or computationally (Ritter et al. 2012).

Evaluation of IE

As with IE systems development, IE evaluation (in the research community) was also predominantly designed for newswire-resembling corpora, with competitions and efforts such as the Message Understanding Conference (MUC) and Text Retrieval Conference (TREC) series involving the annotation of large corpora of data to ensure sufficient resources for training, validation and testing (Chinchor 1998; Voorhees et al. 1999). For illicit domains such as human trafficking, ground-truths do not exist and are hard to acquire. We note also that one cannot crowdsource the annotations due to the sensitivity of sex advertisement data. The cost of labeling is also an issue, since investigative agencies are typically resource-strapped to begin with (and cannot dedicate additional resources to an annotation service).

Domain-specific applications of IE

IE has many applications, one of which is knowledge graph construction (KGC). KGC draws on advances from a number of different research areas, including information extraction (Chang et al. 2006b), information integration (Doan et al. 2012), and inferential tasks such as entity resolution (Elmagarmid et al. 2007). Good examples of architectures that implement KGC principles are Domain-specific Insight Graphs (DIG) and DeepDive (Niu et al. 2012; Szekely et al. 2015). Both of these architectures have a significant IE component, and also rely (either directly or indirectly) on the quality of the extractions in important sub-components such as search and analytics.

More recently, Open Information Extraction or OpenIE has become a popular topic of research, owing to the need for IE techniques that do not rely on pre-specified vocabularies (Banko et al. 2007; Etzioni et al. 2008). In a preliminary version of the system, we tried state-of-the-art versions of OpenIE, including both old and new versions of the system proposed by Etzioni et al. (2008). Even when relevant extractions were obtained from the corpus of webpages, the precision and recall were both judged to be too low to be useful. This largely motivated our earlier research on focused knowledge graph construction for illicit domains, albeit only for keyword queries that were easily amenable to GUI integration (Szekely et al. 2015).

The importance of domain-specific IE has also been rising through a series of ambitious projects. For example, the Defense Advanced Research Projects Agency (DARPA) MEMEX program (Footnote 4), which funded multiple institutions in the United States to build semi-automatic, democratized domain-specific search systems, led to national efforts in using such technology for combating human trafficking (Footnote 5). IE was a crucial step in setting up such search engines.

Beyond human trafficking, investigators in other illicit domains, such as narcotics, securities fraud and illegal weapons sales, also expressed interest in using the technology. However, before one can deploy IE and search technology to such agencies, it is important to get some sense of the quality of multiple IE systems, and to also reason about changes in quality with the tuning of parameters. For example, when extracting attributes from text that has been scraped from webpages, it is intuitively plausible (and empirically the case as well (Kapoor et al. 2017)), that the more text is scraped from the webpage, the higher will be the recall of an extraction system’s output compared to the output if it were run on more conservatively parsed text. Precision tends to suffer, however, since extraneous text gets scraped and causes noise to creep into downstream extractors (such as a phone number extractor executed on the scraped text).

An empirical study on capturing such tradeoffs systematically, particularly without access to ground-truths, has thus far been lacking. This article attempts to address this need. Specifically, in contrast with much of the prior work on IE, this work proposes neither a new system nor a new algorithm, but instead describes a network science-based framework that allows the evaluation and comparison of IE systems (for the HT domain) without being restricted by the availability of large quantities of labeled data. Furthermore, the empirical data and findings described in this article offer new insights into the nature of IE noise; for example, our evidence suggests that the i.i.d. (independent and identically distributed) assumption often used in machine learning may not be applicable to IE in the HT domain.

Human trafficking (HT)

One of the most important aspects that separate this work from prior work is its focus on a non-traditional domain such as human trafficking (HT) that has an outsize presence on the Web. By some estimates, HT is a multi-billion dollar industry; however, due to both technical and social reasons, it has largely been ignored by the computational sciences till quite recently (Alvari et al. 2016; Hultgren et al. 2016). A notable exception in the knowledge graph construction domain is the DIG (Domain-specific Insight Graphs) system (Szekely et al. 2015). Similar to other systems such as DeepDive, DIG implements KGC components, in addition to a GUI, and was evaluated on human trafficking data. Those evaluations largely motivated this article, since extensive effort had to be expended to annotate even small ground truths.

More broadly, semi-supervised and minimally supervised AI has been applied to fight human trafficking in contexts beyond information extraction and search (Alvari et al. 2017; Burbano and Hernandez-Alvarez 2017; Kejriwal et al. 2017; Rabbany et al. 2018). As one example, the FlagIt system, recently developed in our group, attempts to semi-automatically mine indicators of human trafficking (which include movement, advertisement of multiple girls etc.) (Kejriwal et al. 2017). As another example, Rabbany et al. (2018) explore methods for active search of connections in order to build cases and combat human trafficking. Finally, although this work deals primarily with linguistic data (since it is focused on IE, which tends to work on linguistic data), there has also been a steady stream of work on the non-linguistic characteristics of sex ads. For example, recently, Whitney et al. describe how emojis can be used to add a layer of obfuscation to sex ads to avoid getting investigated, caught and prosecuted (Whitney et al. 2018). In part, this work is motivated by such findings: even if investigators invested the effort to painstakingly construct ground-truths, the creative and dynamic ways in which traffickers adapt (e.g., by using obfuscations such as emojis and misspellings) would soon render those ground-truths stale and obsolete. Hence, there is a real need for developing end-to-end unsupervised IE systems, both for acquiring and evaluating extractions.

Furthermore, although the research described herein is specifically designed to investigate and combat human trafficking, we believe that the core elements of the overall problem and solution can be extended to other domains (e.g., from the Dark Web (Chen 2011)) that are highly heterogeneous, dynamic and that deliberately obfuscate key information. Because illicit domains are under-studied, and obtaining both raw and ground-truth data is difficult, we use a rich trove of documents available to us from the human trafficking domain to study the research problems in this article. Currently, it is too early to tell if the findings can be empirically extended to other illicit domains. On the other hand, multiple illicit domains share both common challenges (e.g., information obfuscation), and common needs (e.g., prioritization on extracting location-specific and identifier-specific attributes to assist law enforcement). These commonalities suggest that some of our empirical findings may be generalizable to other potential illicit domains such as securities fraud and narcotics.

Finally, in the context of this paper, it is important to distinguish between causation, correlation and prediction. Many of the results we explicitly describe are correlates; our research question, in fact, can be framed in terms of finding metrics that do not require labeled data but that are correlated with actual performance (captured properly by metrics that do require labeled data as a gold standard). However, we invoke a longitudinal argument in claiming that, because the networks are constructed from extractions, they are derivatives of real data and cannot (arguably) have caused the relational dependencies between extractions. We do not claim causation in any form, however, only that the network metrics are predictive of accuracy metrics by virtue of the correlation. More formal models that treat such theoretical issues in depth were presented in (Hultgren et al. 2016, 2018; Whitney et al. 2018).

Network science

Network science is an actively researched, standard framework for studying complex systems that possess structure (Barabási et al. 2016). Such systems include networks of protein-protein interactions (Gavin et al. 2002), citation networks (Hummon and Dereian 1989) and social networks (Borgatti et al. 2009), to name only a few. Recent research has led to many exciting advances in the construction and study of complex networks, especially from ‘Big Data’. For example, Chen and Redner study the community structure of the Physical Review citation network from the mid-1890s to 2007 (Chen and Redner 2010). Other domain-specific examples include the study of patent citation networks in nanotechnology (Li et al. 2007) and the creation and influence of citation distortions (Greenberg 2009).

Another highly active sub-area of research in network science, and (arguably) one of the original motivations for employing network science as a scientific methodology for studying structure, is social networks. Work in this area can be traced back to at least the 1940s (and possibly beyond), when Moreno first proposed the ‘sociogram’ as a way of studying such systems at a structural level (Moreno 1946). Since then, there have been tens of thousands of papers and articles on the subject; a standard, highly comprehensive treatment on social network analysis was provided by Wasserman and Faust (Wasserman and Faust 1994), with a more recent book by Knoke and Yang (2008). More recently, pioneering work in this area includes a study of networks, crowds and markets by Easley and Kleinberg (Easley et al. 2010), social tie inference in heterogeneous networks (Tang et al. 2012), prediction of positive and negative links in social networks (Leskovec et al. 2010) and even ethics and privacy-related challenges in mining social network data (Kleinberg 2007). Other important applications of network science include bioinformatics, with research ranging from studies in systems pharmacology (Berger and Iyengar 2009) to tools designed for fast network motif detection (Wernicke and Rasche 2006; Schreiber and Schwöbbermeyer 2005).

This article is differentiated from the papers above by not attempting to use network science to study the properties of a domain by modeling its structure as a network; rather, we hypothesize that network science can be used to deduce (at least as a correlate) the levels of noise and data quality in real-world IE systems applied to consequential domains such as human trafficking. In that sense, the article presents a novel application of network science compared to prior related work.

Technical preliminaries

In this section, we introduce the necessary technical preliminaries to place the (subsequently described) empirical studies in context. Because the formalism is interdisciplinary and relies on both IE and network science, two constructs that do not traditionally intersect in the academic literature, we define concepts from the ground-up.

The core elements in our framework are documents, which in our specific application are blocks of text scraped from sex advertisements. A raw document D may be considered to be a pair (id,text), where id is an identifier for the document, and text is usually just a (potentially long) string. Some IE systems require a list of tokens, rather than a string, in which case a tokenizer has to be applied to text to yield a list of strings. However, the tokenizer is extraneous to the definition of a document itself. A corpus is simply a set of documents.

For the purposes of this article, we consider a very simple definition of a schema, namely as a set S of attributes. For the human trafficking domain, the set contains attributes such as phone number, hair color, eye color etc.; in essence, anything that would allow an investigator using these extractions to locate a potentially trafficked victim. More complex schemas and attributes can also be considered (e.g., cluster classes such as Vendor explored in Kejriwal and Szekely (2017b)), but will not change the formalism presented herein.

Given an attribute a, we define an information extractor \({IE}_{a}\) for that attribute as a system that takes as input the text field of a document, and outputs a set of tokens, each of which is denoted as an instance of a or equivalently, as having type a. The data types of the tokens may be strings, but could also be numbers or dates. Without loss of generality, we assume strings.

Example: Consider the text ‘Hi, my name is Elsa and I am new in town.’ A machine-learning based extractor for the attribute name would (ideally) yield the instance Elsa when applied to text i.e. the extraction Elsa would have type name.

As the example above indicates, errors in IE can occur for two reasons. First, a correct instance of an attribute may not get extracted by the extractor. Second, an incorrect instance may get extracted. Even here, there are two possibilities. The incorrect instance may be a correct instance of a differently typed extractor e.g., imagine that Charlotte got extracted in some sentence as an instance of the city extractor, when in actuality, Charlotte was the name of a person in that sentence. However, it is also quite possible that the wrongly extracted instance is not a correct instance of any type. Some of our research questions in a subsequent section will return to the issue of distinguishing between these different types of ‘noise’.

Given an IE \({IE}_{a}\) and document D, we can obtain an ‘enriched’ document (and corpus) by applying \({IE}_{a}\) on D[text] and obtaining \(D=(id, text, \{a_{1}, \ldots, a_{m}\})\), where the third element is the set containing the m instances of attribute a as occurring in D. Similarly, a sequence of IEs \({IE}_{a_{1}}, {IE}_{a_{2}}, \ldots, {IE}_{a_{n}}\) can be independently applied for the n attributes \(a_{1}, \ldots, a_{n}\) in schema S to obtain a fully enriched document D that records all extracted (from its text) instances of all attributes in its information set.
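The enrichment step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the `Document` class, the `enrich` function and the glossary-based `toy_name_extractor` are hypothetical names introduced here, and the toy extractor merely stands in for a real IE system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

# An extractor maps the text of a document to a set of string instances.
Extractor = Callable[[str], Set[str]]


@dataclass
class Document:
    id: str
    text: str
    # Maps an attribute name to the set of instances extracted for it.
    extractions: Dict[str, Set[str]] = field(default_factory=dict)


def enrich(doc: Document, extractors: Dict[str, Extractor]) -> Document:
    """Apply each attribute's extractor independently to the document text."""
    for attribute, extractor in extractors.items():
        doc.extractions[attribute] = extractor(doc.text)
    return doc


def toy_name_extractor(text: str) -> Set[str]:
    """Hypothetical glossary lookup standing in for a real name extractor."""
    glossary = {"Elsa", "Charlotte"}
    tokens = {token.strip(".,!?") for token in text.split()}
    return tokens & glossary


doc = enrich(
    Document(id="D1", text="Hi, my name is Elsa and I am new in town."),
    {"name": toy_name_extractor},
)
print(doc.extractions)  # {'name': {'Elsa'}}
```

Storing extractions per attribute keeps the extractors independent of one another, which mirrors the formalism: adding an attribute to the schema only adds one more entry to the `extractors` mapping.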

Given an attribute a, a corpus of enriched documents, and a ground-truth set G of true extractions across all documents in the corpus, we can define accuracy metrics for the extracted instances of a in the corpus. We consider three important metrics in this article, widely used in the IE community, namely precision, recall and F1-measure (the harmonic mean of precision and recall). The precision is the fraction of the extractions returned by the IE that are correct (i.e., that are positives in the ground truth), while the recall is the fraction of the positives in the ground truth that the extractor was able to retrieve. Note that each of the metrics can be individually defined for each attribute a, assuming a ground truth \(G_{a}\) of the correct extractions (called the Positives) is available. Anything which is not in Positives is assumed to be in the set Negatives. Clearly, when an \({IE}_{a}\) is applied to the corpus, any extraction has to be either in Positives or Negatives.
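A minimal sketch of these three metrics for a single attribute follows. It assumes, for illustration only, that an extraction is represented as a (document id, instance) pair and that the ground truth is given as a set of such pairs; the function name and the toy data are hypothetical.

```python
from typing import Set, Tuple

Extraction = Tuple[str, str]  # (document id, extracted instance)


def precision_recall_f1(extracted: Set[Extraction],
                        positives: Set[Extraction]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 of extractions against the Positives set."""
    true_positives = len(extracted & positives)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(positives) if positives else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical toy data: four extractions, two of which are correct,
# and one ground-truth instance that the extractor missed.
extracted = {("D1", "Elsa"), ("D2", "Charlotte"), ("D3", "Town"), ("D4", "Hi")}
positives = {("D1", "Elsa"), ("D2", "Charlotte"), ("D5", "Anna")}

precision, recall, f1 = precision_recall_f1(extracted, positives)
print(precision, recall, f1)
```

Here precision is 2/4, recall is 2/3 and F1 is 4/7, which shows how a recall-favoring extractor can trade precision for coverage.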

In normal situations, these metrics can only be computed and trusted if a good ground truth is available to begin with. Typically this is done by a human annotator who samples some documents and annotates extractions within those documents. This is a laborious process, and much harder to accomplish in the case of sex advertisements, since techniques such as crowdsourcing cannot be effectively leveraged. One of the critical motivations behind this article is to investigate how we can measure (at least in a relative sense) the quality of systems’ extractions without access to ground truths. In support of this motivation, we now introduce the simple concept of an attribute extraction network (AEN). The AEN will serve as the central data structure on which empirical studies will be conducted for each attribute.

Definition: An attribute extraction network (AEN) \(N_{a}\) is an undirected graph (V,E) where the set of vertices V is defined such that there is one vertex for every id in the corpus, and an undirected edge \(e_{ij}=\{v_{i}, v_{j}\}\) between vertices \(v_{i}, v_{j} \in V\) exists iff a common instance was extracted (for attribute a) for the two documents with IDs corresponding to \(v_{i}\) and \(v_{j}\).

Figure 2 illustrates an example of how an AEN is defined for six documents (D1-D6) and the Name attribute. Each document is a vertex in this representation. The extractions obtained from a given IE system are noted next to the vertex. There is an edge connecting two vertices if their corresponding document representations share an extraction e.g., the extraction ‘Mayank’ is shared between documents D1 and D2. An important point to note is that a vertex can be a singleton for at least two reasons. First, it may be that no values for attribute a were extracted from the corresponding document. In the figure, document D6 is an example of such a vertex (note that D6 may have extractions for other attributes such as phone number; it just doesn’t have an extraction for the attribute Name over which this AEN was constructed). Second, it may be that the set of values that were extracted did not get extracted elsewhere (i.e. another document). Thus, by definition that vertex would not be connected to any other vertex in the network. Furthermore, since there is a bijective (1-1) mapping between vertices and documents, we henceforth refer to vertices (also, nodes) as documents for the sake of maintaining a uniform terminology.

Fig. 2

An example of an Attribute Extraction Network (AEN), assuming the attribute Name. Vertices are documents. D6 and D5 are singletons since D5 does not share extractions with any other document, and D6 does not have any extractions
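The AEN definition translates directly into code. The sketch below builds the vertex and edge sets for one attribute from a mapping of document ids to extracted instances, using an inverted index from instances to documents. The function name `build_aen` and the toy extractions are assumptions for illustration; only the ‘Mayank’ extraction shared by D1 and D2, and the singleton status of D5 and D6, follow the example described in the text.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def build_aen(extractions: Dict[str, Set[str]]) -> Tuple[Set[str], Set[FrozenSet[str]]]:
    """Build an AEN (vertices, edges) for a single attribute.

    extractions maps each document id to the instances extracted from it.
    Two documents are connected iff they share at least one instance.
    """
    vertices = set(extractions)  # one vertex per document id

    # Inverted index: instance -> ids of the documents it was extracted from.
    docs_by_instance: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, instances in extractions.items():
        for instance in instances:
            docs_by_instance[instance].add(doc_id)

    edges: Set[FrozenSet[str]] = set()
    for doc_ids in docs_by_instance.values():
        for u, v in combinations(sorted(doc_ids), 2):
            edges.add(frozenset((u, v)))  # undirected edge
    return vertices, edges


# Hypothetical Name extractions for six documents.
name_extractions = {
    "D1": {"Mayank"},
    "D2": {"Mayank", "Anna"},
    "D3": {"Anna"},
    "D4": {"Anna", "Maria"},
    "D5": {"Lily"},   # extraction not shared with any other document
    "D6": set(),      # no Name extraction at all
}

vertices, edges = build_aen(name_extractions)
connected = {doc_id for edge in edges for doc_id in edge}
singletons = vertices - connected
print(sorted(singletons))  # ['D5', 'D6']
```

Going through the inverted index avoids comparing every pair of documents directly, although a very frequent instance still produces a quadratic number of edges among the documents that share it.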

In this article, we refer to a structural network metric as a function that takes an AEN as input and returns either a single point (single-point metric) or a distribution. The only distribution that we will consider in this article is the degree distribution, due to its importance. The structural single-point metrics under consideration are noted in Table 1. Note that a structural network metric is completely agnostic to what the vertices and edges ‘mean’ (i.e. their underlying semantics) although, of course, the actual values that a structural network metric would return would depend intimately both on how the network is constructed, and its semantics. One of the goals of this article is to assess the empirical nature and extent of this dependence.
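To make the distinction between single-point metrics and a distribution concrete, the following sketch computes the degree distribution, the number of connected components, and the diameter of the largest component for a small graph stored as adjacency sets. The choice of these particular metrics, the function names and the toy graph are assumptions for illustration and are not meant to reproduce Table 1; algebraic connectivity is omitted because it needs an eigenvalue solver beyond the standard library.

```python
from collections import Counter, deque
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # vertex -> set of neighbouring vertices


def degree_distribution(graph: Graph) -> Counter:
    """Distributional metric: maps each degree to the number of vertices having it."""
    return Counter(len(neighbours) for neighbours in graph.values())


def connected_components(graph: Graph) -> List[Set[str]]:
    """Find connected components with breadth-first search."""
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def eccentricity(graph: Graph, source: str) -> int:
    """Longest shortest-path distance from source within its component."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())


def diameter_of_largest_component(graph: Graph) -> int:
    """Single-point metric: diameter of the largest connected component."""
    largest = max(connected_components(graph), key=len)
    return max(eccentricity(graph, node) for node in largest)


# Hypothetical six-document AEN with two singleton vertices.
aen: Graph = {
    "D1": {"D2"},
    "D2": {"D1", "D3", "D4"},
    "D3": {"D2", "D4"},
    "D4": {"D2", "D3"},
    "D5": set(),
    "D6": set(),
}

distribution = degree_distribution(aen)
num_components = len(connected_components(aen))
diameter = diameter_of_largest_component(aen)
print(dict(distribution), num_components, diameter)
```

None of these functions inspects what a vertex or an edge stands for, which is the sense in which a structural metric is agnostic to the semantics of the network.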

Table 1 Single point network-theoretic structural metrics considered in this article for some of the empirical studies

Empirical studies

Earlier in the introduction, we stated two research questions to study the relationship between the network-theoretic metrics presented earlier, and the traditional IE metrics (precision, recall and F-measure). Recall that the first of those questions was based on measurements and comparisons between single-point metrics and the IE metrics, while the second involves similar comparisons but uses an important distributional metric (the degree distribution) rather than the single-point metrics. In this section, we present more details on the data and the empirical methodology for exploring those questions on an online sex trafficking corpus, followed by a report on the results of the analysis.

Data

To validate whether network science can be used to assess changes in IE quality without access to a ground truth, we test our hypotheses on IE extractions for which a reference ground truth is available. Below, we describe these datasets and the ground-truth in more detail. All datasets were constructed over a large corpus of online sex advertisements that were crawled from the (now shut-down) backpage.com portal over the one-year period of March 2016-2017.

We note that the corpus was collected by an independent contractor funded under the DARPA MEMEX program (mentioned earlier in the Related Work), which minimizes chances of dataset bias. The ground-truths were constructed semi-automatically by an academic group of social and political science experts in human trafficking who were not affiliated with the program during ground-truth construction. This ground-truth construction procedure is described in more detail below. The raw HTML pages had to undergo multiple steps of preprocessing and extraction before networks could be constructed. Technical details on webpage preprocessing were provided in our earlier work on information extraction and indicator mining (Kejriwal and Szekely 2017a; Kejriwal et al. 2017). A succinct summary of the datasets is provided in Table 2; further details are provided below.

Table 2 IE datasets (constructed per attribute) used in the empirical studies in this article

Ground-truths

In total, the corpus under consideration consists of 11,530 webpages. Multiple domain-specific attributes were extracted from this corpus, including City, Name, Phone, Address, Service Type, and even physical attributes such as Hair Color and Eye Color. In other illicit domains that we have studied (including securities fraud, narcotics, illegal weapons sales online and counterfeit electronics sales), the first three of these were found to be always present in the domain-specific schema that investigators defined. In contrast, Address and Service Type were more rarely defined, while physical attributes seemed to be exclusive to the online sex trafficking domain. Name and City are also common in non-illicit domains subject to extraction pipelines e.g., both SpaCyFootnote 6 and Stanford NER (two influential open-source IEs tuned for non-illicit domains such as newswire) make available pre-trained modules for Location and Person, which can be re-normalized to City and Name as we have considered them in this article (Finkel and Manning 2009).

In keeping with these observations, and to ensure that our findings are relatively generalizable, we consider Name, City and Phone extractions obtained from the corpus. We were provided a ground-truth set of extractions for each document in this corpus, and for each attribute, by an independent group of domain experts and social scientists who had developed highly tuned rule-based extractors (for all three of these attributes) specifically for sex ads in backpage.com. Typically, such extractors try to encode domain knowledge using unions of regular expressions, followed by post-processing checks. For example, the phone extractor would try to match a sequence of either ten or eleven digits in the ad text, with multiple rules accounting for such obfuscations as word representations of numbers (e.g., one instead of 1), the substitution of o for 0, and so on. Because the focus was on extracting US phones, a post-processing step was to check that the extracted number either had 10 digits, or started with 1 if it had 11 digits. Another step was to remove leading 0s. Checks were also run for numbers that were known to be spammy and occasionally present in ads (e.g., a sequence of nine consecutive 9s or 1s). Similarly, extractors for names and cities also used rules, but additionally relied on glossaries such as GeoNames (Wick and Vatant 2012).
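The phone rules described above can be illustrated with a minimal sketch. This is not the domain experts' actual extractor: the function name `extract_us_phones`, the number-word table, the set of tolerated separators, and the order of the post-processing steps (leading zeros are stripped before the length check) are all assumptions made for illustration.

```python
import re

# Assumed obfuscation table: number words that stand in for digits.
WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
         "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
WORD_RE = re.compile(r"\b(" + "|".join(WORDS) + r")\b")
# The letter 'o' is treated as 0 only when it touches a digit (assumption).
O_RE = re.compile(r"(?<=\d)o|o(?=\d)")
# Ten or eleven digits, optionally separated by spaces, dots, dashes, parentheses.
CANDIDATE_RE = re.compile(r"(?<!\d)\d(?:[\s().-]*\d){9,10}(?!\d)")
# Known spammy patterns: nine consecutive 9s or 1s.
SPAM = ("999999999", "111111111")


def extract_us_phones(text):
    """Return candidate US phone numbers found in free text, as digit strings."""
    normalized = WORD_RE.sub(lambda m: WORDS[m.group(1)], text.lower())
    normalized = O_RE.sub("0", normalized)
    phones = []
    for match in CANDIDATE_RE.finditer(normalized):
        # Keep digits only, then drop leading zeros.
        digits = re.sub(r"\D", "", match.group()).lstrip("0")
        valid_length = len(digits) == 10 or (len(digits) == 11 and digits[0] == "1")
        if valid_length and not any(pattern in digits for pattern in SPAM):
            phones.append(digits)
    return phones
```

A union of several such rules, each targeting a different obfuscation style, would be closer to the extractors described in the text; the single pattern here only conveys the general shape of rule-plus-post-processing extraction.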

To verify quality, we randomly sampled a set of fifty web documents from the corpus and checked that misclassification rates were low for all attributes. Thus, this dataset can be used in lieu of an exhaustively labeled ground truth, which is not feasible to construct because of both its scale and its real-world qualities. In Table 2, we refer to this dataset as Set 1.

Extraction datasets

To test our hypotheses using measures of IE accuracy that are predominantly used in the community (especially precision and recall), we considered two different extraction systems, one of which is precision-favoring and the other of which is recall-favoring. In Table 2, the outputs from these systems are denoted as Sets 2 and 3 respectively.

In illicit domains, structured attributes such as name and phone number are not present in ‘infobox’ style layouts, but are typically embedded in the text, often in an obfuscated format. This is to avoid direct investigative lookup of an advertiser’s street name and contact details using a search engine such as Google. Therefore, in order to extract attributes such as the ones considered in this work, one must first extract the free text from the webpage, following which NLP-centric extraction techniques can be applied on the extracted (and pre-processed) free text. Text scraping from websites is itself a hard problem (due to presence of inserted ad markup, dynamic changes, link structures and variability). We used the Readability Text Extractor (RTE), currently available as the Mercury APIFootnote 7, to perform the text scraping. We tuned RTE in two different modes. The first mode, which is recall-friendly, is more aggressive and scrapes much of the relevant text, but may also scrape irrelevant text and markup with it. The second mode, which is precision-friendly, tends to be ‘cleaner’ in that almost all content is relevant, but may miss relevant sentences, especially if there are gaps or links between the relevant portions.

Next, we run extraction programs for all three attributes on the precision-friendly and recall-friendly RTE outputs. For City and Name, the program is identical across both outputs: a dictionary-based extractor that uses existing sets of popular cities and names from a manually curated subset of the GeoNames knowledge base (Wick and Vatant 2012). For phones, however, we used different programs for the precision-favoring and recall-favoring extractions, since our phone extractor (which has to deal with obfuscation) is based on rules. The precision-favoring phone extractor is applied to the precision-favoring RTE output, and the recall-favoring phone extractor to the recall-favoring RTE output.

Using the ground-truth dataset (Set 1) and the precision-favoring and recall-favoring datasets (Sets 2 and 3), we can construct other IE datasets expressing varying tradeoffs between noise and quality metrics. We create four new datasets by combining these existing datasets in various ways (e.g., by taking their union). Details on this construction (Sets 4-7) are succinctly formulated in Table 2.

Finally, we also created a synthetic dataset (Set 8) by adding random noise to the ground truth (Set 1) such that quality metrics coincided with those of Set 2. Using the precision and recall values from Set 2 and the number of actual extractions from Set 1, the desired true positive, false positive and false negative values were calculated.

Specifically, for creating more false-positives, a two-step procedure was iteratively employed: (1) a document was chosen at random, and (2) a false extraction, randomly chosen from the dictionary of all extractions observed in the corpus, was added to the extraction set for that document. Similarly, for creating more false-negatives, randomly chosen true extractions are removed from randomly chosen documents in an iterative two-step procedure. Iteration continues till the precision and recall of the constructed dataset equal those of Set 2. Since the number of true positives is fixed, and we are able to precisely control the numbers of false-positives and false-negatives, precision and recall can both be decreased in a controlled and unbiased way. The reason for constructing this dataset is that it proves especially important in assessing some of the results against a reference of random noise, since it allows us to consider whether our real-world IE systems exhibit similar characteristics.
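The noise-injection procedure can be sketched as follows, under stated assumptions. The text describes iterating until precision and recall match those of Set 2; this sketch instead derives the target counts in closed form (TP = R·N, FN = N − TP, FP = TP·(1 − P)/P, where N is the number of true extractions) and then loops until those counts are reached. The function names, the dict-of-sets representation (document id mapped to its set of extractions), and the fixed random seed are hypothetical.

```python
import random


def noise_targets(precision, recall, n_true):
    """Target (TP, FP, FN) counts for a desired precision and recall."""
    tp = round(recall * n_true)
    fn = n_true - tp
    fp = round(tp * (1 - precision) / precision)
    return tp, fp, fn


def add_random_noise(reference, vocabulary, precision, recall, seed=0):
    """Return a noisy copy of `reference` (doc id -> set of extractions)."""
    rng = random.Random(seed)
    noisy = {doc: set(values) for doc, values in reference.items()}
    n_true = sum(len(values) for values in reference.values())
    _, fp, fn = noise_targets(precision, recall, n_true)
    docs, vocab = sorted(noisy), sorted(vocabulary)

    # Guard against an impossible request (not enough room for false positives).
    capacity = sum(len(set(vocab) - reference[doc]) for doc in docs)
    if fp > capacity:
        raise ValueError("vocabulary too small for the requested precision")

    # False negatives: remove a random true extraction from a random document.
    removed = 0
    while removed < fn:
        doc = rng.choice(docs)
        remaining = sorted(noisy[doc] & reference[doc])
        if remaining:
            noisy[doc].remove(rng.choice(remaining))
            removed += 1

    # False positives: add a random wrong extraction to a random document.
    added = 0
    while added < fp:
        doc, value = rng.choice(docs), rng.choice(vocab)
        if value not in reference[doc] and value not in noisy[doc]:
            noisy[doc].add(value)
            added += 1
    return noisy
```

Because documents and values are drawn uniformly at random, the resulting errors carry no systematic structure, which is what makes such a set usable as a random-noise reference.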

Tables 3, 4 and 5 show some key statistics about all the constructed sets. Since Set 1 is our reference set, we consider it to have perfect quality (1.0 on all quality metrics). In keeping with our intuitions, we find that Set 2 tends to have higher precision than Set 3 (+18-39%), while Set 3 has higher recall (+0.45%-17.8%). However, the gain in recall for Set 3 is outweighed by its loss in precision, leading to considerably lower F-scores. Sets 4-8 express a range of tradeoffs; for example, Set 4, which considers the intersection of the ground-truth with an already precision-favoring Set 2, yields perfect precision but at the same level of recall as Set 2. These datasets allow us to counterfactually investigate the different effects of precision and recall on network-theoretic metrics, since they control for one metric.
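To make the Set 4 example concrete, the following sketch computes precision, recall and F-score against a reference set and shows why intersecting a system's output with the reference gives perfect precision at unchanged recall. The micro-averaging over (document, value) pairs, the function names, and the toy data are assumptions; the article does not specify how the scores in Tables 3 to 5 were aggregated.

```python
def to_pairs(extractions):
    """Flatten doc id -> set of values into a set of (doc, value) pairs."""
    return {(doc, value) for doc, values in extractions.items() for value in values}


def precision_recall_f(system, reference):
    """Micro-averaged precision, recall and F-score of `system` vs `reference`."""
    system_pairs, reference_pairs = to_pairs(system), to_pairs(reference)
    tp = len(system_pairs & reference_pairs)
    precision = tp / len(system_pairs) if system_pairs else 0.0
    recall = tp / len(reference_pairs) if reference_pairs else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def intersect(a, b):
    """Per-document intersection of two extraction sets."""
    return {doc: a.get(doc, set()) & b.get(doc, set()) for doc in set(a) | set(b)}


# Toy data (hypothetical): a reference set and a precision-favoring system.
set1 = {"d1": {"boston"}, "d2": {"dallas", "austin"}, "d3": {"miami"}}
set2 = {"d1": {"boston"}, "d2": {"dallas", "charlotte"}, "d3": set()}
set4 = intersect(set1, set2)

p2, r2, f2 = precision_recall_f(set2, set1)
p4, r4, f4 = precision_recall_f(set4, set1)
```

The intersection discards every false positive of the system while keeping all of its true positives, so only precision changes between the two sets.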

Table 3 Dataset characteristics for city extractions
Table 4 Dataset characteristics for name extractions
Table 5 Dataset characteristics for phone extractions

Finally, as explained earlier, Set 8 was synthetically created by adding random noise to Set 1, such that the quality metrics coincided with those of Set 2; hence, the two sets have expectedly near-identical quality metrics. This also illustrates that, even if a ground-truth were available (such as Set 1) to a practitioner, she would not be able to distinguish between Sets 2 and 8 based only on IE metrics. We show subsequently, however, that the structural properties of the extraction graphs of Sets 2 and 8 markedly differ.

Experiments and methods

To answer the first research question (RQ 1), we devised a set of quantitative experimental methods to record the variation in structural metrics across the eight datasets listed in Table 2. Note that structural metrics are unsupervised, requiring mechanical computations that depend only on the structure of the network. For each of the three attributes under consideration, we compute the individual Pearson correlation between each IE metric (precision, recall and F-score) and several well-known network-theoretic structural metrics, such as those described in Formalism, using eight single-point measurements for each correlation (one data point per metric per dataset in Table 2). We do not consider non-single-point metrics such as the degree distribution, since its investigation falls specifically within the purview of RQ 2. Because the eight datasets in Table 2 have varying qualities on the different IE metrics of interest to us (precision, recall and F-measure), consistent changes in structure across all three attributes enable us to take a principled approach to answering RQ 1.
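The correlation step can be sketched as follows. The eight-point vectors below are made-up placeholder values chosen only to exercise the code; they are not taken from Tables 3 to 11, and the names `pearson` and `correlate_with_quality` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlate_with_quality(quality, structural_metrics):
    """Correlate one IE metric (one value per dataset) with each structural metric."""
    return {name: pearson(quality, values)
            for name, values in structural_metrics.items()}


# Placeholder values, one entry per dataset (Sets 1-8); NOT the article's data.
precision_by_set = [1.00, 0.90, 0.60, 1.00, 0.85, 0.80, 0.55, 0.90]
structural_metrics = {
    "order": [100, 110, 160, 95, 120, 125, 170, 112],
    "connected_components": [40, 36, 20, 41, 33, 31, 18, 35],
}
correlations = correlate_with_quality(precision_by_set, structural_metrics)
```

With only eight data points per correlation, the coefficients are best read as indicating direction and rough strength rather than as precise estimates.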

Furthermore, with a view to assessing whether noise in real-world IE systems exhibits significantly non-random tendencies, we report the same structural metrics that we considered in the methodology above for each set and attribute in Table 2, and study the specific differences between Set 8 and Set 2, since Set 8 has the same accuracy as Set 2, but with noise inserted randomly.

The methodology for exploring RQ 2 is fairly straightforward: we compute and plot the degree distribution of the extraction networks underlying the sets in Table 2. We also refer to the power-law coefficient of each network, computed earlier in the data collected for answering RQ 1, to assess the extent to which each distribution obeys the power law. Finally, we study how the power law distribution for each network evolves for a given attribute as performance changes gradually (across the spectrum from Set 1 to Set 8).
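A minimal sketch of the two computations follows, assuming the extraction network is available as an undirected edge list without duplicate edges. The article does not state how its power-law coefficient was estimated; the estimator used here is a common approximate maximum-likelihood formula for discrete data, alpha ≈ 1 + n / Σ ln(d_i / (d_min − 0.5)), and is an assumption, as are the function names.

```python
from collections import Counter
from math import log


def node_degrees(edges):
    """Degree of every node that appears in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def degree_distribution(edges):
    """Empirical probability of each degree value (isolated nodes are not seen)."""
    degrees = node_degrees(edges)
    counts = Counter(degrees.values())
    return {k: count / len(degrees) for k, count in sorted(counts.items())}


def power_law_exponent(degrees, d_min=1):
    """Approximate maximum-likelihood exponent for degrees >= d_min."""
    tail = [d for d in degrees if d >= d_min]
    if not tail:
        raise ValueError("no degrees at or above d_min")
    return 1 + len(tail) / sum(log(d / (d_min - 0.5)) for d in tail)


# Toy star graph (hypothetical): one hub connected to three leaves.
edges = [("a", "b"), ("a", "c"), ("a", "d")]
distribution = degree_distribution(edges)
alpha = power_law_exponent(node_degrees(edges).values())
```

Plotting `distribution` on log-log axes gives the kind of curve described in the text; a roughly straight line suggests power-law behavior, while an erratic curve suggests a departure from it.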

Results

We report results for both research questions enumerated earlier. Each research question (RQ) is considered individually below.

Research question 1

Recall that the first research question involved detecting patterns in structural metrics’ changes with changes in IE quality. We note the primary observations that emerged from conducting the RQ 1 experiments, using the methodology described earlier, below:

  • First, precision was found to be strongly correlated with several structural metrics, as quantified in Table 6, which records the Pearson correlation coefficient using the 8-point precision vector of the eight datasets in Table 2, and the corresponding values of the single-point structural metrics computed over their respective extraction networks. In some cases, the correlations seem intuitive and even obvious. For example, the relatively strong negative correlation between Order and precision can be explained as follows. Since the order corresponds to the set of all entities (for an attribute) extracted over the entire set of documents, and since every unique entity has, in practice, some non-zero probability of being noise, networks with a higher order tend to have lower precision. This is especially true when an attribute in question contains entities from some pre-specified ‘universal’ set, which is true for namesFootnote 8 and cities. In contrast, phones, which are syntactically constrained, but tend to accommodate many more possible unique values, show a weaker (but still quite strong) negative correlation.

    Table 6 Pearson correlation coefficients between precision and network metrics
  • Second, more interestingly, the ‘erroneous’ edges in less precise extraction networks tend to serve as ‘weak ties’ that end up collapsing two or more connected components into a single connected component, reducing the number of connected components. In other words, less precise edges tend to straddle components (a rough definition of what would constitute a weak tie in network science). This suggests a potential line of attack in cleaning up noisy extractions, by exploiting hierarchical or agglomerative clustering algorithms (Murtagh and Legendre 2014) that may be able to detect such weak ties (e.g., by iteratively breaking up connected components into clusters using mechanisms such as betweenness centrality for assigning weights to edges). The empirical utility of such methods is an important agenda that we will pursue in future work.

  • Third, precision is positively correlated with the Clustering Coefficient of the extraction network, but the correlation is not as strong as between precision and the number of connected components. Together with the previous observation, this implies that cleaner extraction sets yield a larger number of more tightly knit groups (in the underlying extraction network) as compared with noisier extraction sets. In other words, in aggregate, the incorrect extraction edges tend to contribute to non-transitivity, since clustering coefficient is related to the number of triadic closures (indicating high transitivity) in the network. On average, therefore, given two links n1n2 and n1n3, all else being equal, a third link n2n3 introduced by a real-world extraction system is more likely to be correct than in the absence of either n1n2 or n1n3 (or both). This suggests another potential line of attack in trying to clean up noisy extractions (or selecting a system under the expectation of high precision without access to ground-truth) by systematically making use of global information.

  • Fourth, precision is also positively correlated with the Vertex and Edge Connectivity of the largest connected component. Adding more incorrect extraction edges yields a larger connected component with lower connectivity than the original, closely connected component. This offers a finer-grained ‘check’ on a system’s precision than the coarse-grained classification (of whether a given extraction set is more precise than another extraction set) offered by the previous observation, which would only check the size of the largest component.

  • Interestingly, in contrast with precision, recall is not heavily correlated with any of the metrics described above, whether positively or negatively. Table 7 shows the correlation between recall and the various metrics discussed above in the context of precision.

    Table 7 Pearson correlation coefficients between recall and network metrics
  • Finally, because F-score is the harmonic mean of both precision and recall, it was unsurprisingly found to be correlated positively with the number of connected components, and also the clustering coefficient, of the network. The correlations were not as strong as those of precision (Table 8).

    Table 8 Pearson correlation coefficients between F-Score and network metrics

Brief Summary. The results show that certain structural metrics are excellent predictors of the overall performance of an extraction system, especially if precision is of interest. In contrast, recall cannot be predicted very accurately. We note that there is also variance between the attributes, though not as strong as one might expect, given that they are very different from one another. In general, we found that, in terms of evaluating RQ1, the Name attribute tended to be more heterogeneous and less predictive compared to City and Phone attributes. Other possible limitations of the study are described in the “Discussion” section.
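The ‘weak tie’ and clustering observations from the list above can be illustrated on a toy network. This is a sketch only: the two-triangle graph and the single bridging edge are hypothetical, and the node and edge semantics of the extraction network follow the Formalism section, which is not reproduced here.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def count_components(adjacency):
    """Number of connected components, found by depth-first search."""
    seen, count = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
    return count


def average_clustering(adjacency):
    """Mean local clustering coefficient (nodes with degree < 2 count as 0)."""
    total = 0.0
    for neighbors in adjacency.values():
        k = len(neighbors)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adjacency) if adjacency else 0.0


# Two tightly knit groups, then one erroneous edge that straddles them.
clean_edges = [("a", "b"), ("b", "c"), ("a", "c"),
               ("x", "y"), ("y", "z"), ("x", "z")]
noisy_edges = clean_edges + [("c", "x")]

clean, noisy = build_adjacency(clean_edges), build_adjacency(noisy_edges)
```

Here the single straddling edge merges the two components into one and lowers the average clustering coefficient from 1.0 to 7/9, matching the direction of the correlations reported for precision.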

Random vs. non-random noise

Recall that one of our goals had been to study the properties of real-world extraction noise; namely, is the noise random? We proposed studying these properties by first introducing attribute-specific random noise in the ground-truth network (Set 8 in Table 2) till it had the same precision, recall and F-scores as Set 2 (Tables 3, 4 and 5) for that attribute. Using only the accuracy metrics, there is no difference, in aggregate, between Sets 2 and 8. However, our (subsequently described) observations show that there are considerable structural differences between the two networks, providing evidence that the noise incorporated in real-world extraction settings is indeed significantly non-random. Furthermore, in deviating from randomness, the noise exhibits some clear patterns, lending credence to the observations and summary in the previous section as well.

First, we illustrate (in Fig. 3) the degree distribution of the Set 8 (i.e., random noise) network and compare it to the Set 2 network (obtained from a real-world ‘high precision’ extraction system). The figure reveals that the number of lower-degree nodes tends to be higher in the Set 8 network, as would be expected with random noise, while the number of higher-degree nodes tends to be higher in the Set 2 network. In investigating RQ 1, we saw earlier that noise in real-world networks can have a ‘weak tie’ effect, in that noisy links end up connecting otherwise disconnected components more often than might be expected by chance. The figure agrees with this intuition, in that higher-degree nodes continue to increase in degree (thereby seeming to obey scale-free assumptions) with the addition of real-world noise, in contrast with random noise, which skews the degree distribution in the reverse direction.

Fig. 3

(Log-log) Degree distributions of Set 2 (Dotted Red) and Set 8 (Solid Blue) extraction networks for all three attributes. The Y-axis records the empirical probability of each degree value

Second, the clustering coefficient of the Set 8 network is consistently lower than that of the Set 2 network (see Tables 9, 10, and 11), providing more evidence that real-world erroneous extractions are more localized than would be true for random errors. Not only that, but the same errors seem to recur consistently across documents, which leads to their clustering by means of common (error-prone) extractions. We also note that the largest connected component of the Set 2 network has a smaller diameter for two out of the three attributes (Phone and Name) compared to the random network. As is true for other real-world networks exhibiting (approximately) power-law degree distributions (such as social networks), a real-world noise network also tends to exhibit ‘small world’ properties (in comparison with random networks).

Table 9 Single-point structural network metrics for City extraction datasets
Table 10 Single-point structural network metrics for Name extraction datasets
Table 11 Single-point structural network metrics for Phone extraction datasets

Brief Summary. Real-world extraction systems are not noisy in random ways, which (arguably) provides a compelling reason for using network science in the first place for studying their noise. More practically, it explains why active learning approaches lead to super-linear (with respect to labeling effort) gains when properly used, since the same errors seem to be occurring ‘independently’ in multiple documents (Thompson et al. 1999). If the noise had been truly random, active learning would be much less effective, since there would be a higher probability of sampling lower-degree nodes in Fig. 3.

Research question 2

Finally, in investigating the dependency of the degree distribution on the varying levels of noise and precision-recall tradeoff, we present degree distribution (log-log) plots in Figs. 4, 5 and 6. Contrary to social networks and other ‘natural’ networks, noise in the extraction networks does not follow a clear power law. However, with the exception of the phone extraction network’s degree distributions, which are fairly uniform across all sets (an artifact that may be the result of phone extractions being of generally higher quality than other extractionsFootnote 9), we note that the networks for Sets 3 and 7 are most erratic (compared to the Set 1 ground truth network). Considering the data in Tables 3 and 4, we find that these are the two sets with the lowest F-scores. In contrast, Sets 5 and 6 for both attributes have F-scores greater than 80%. Correspondingly, the degree distributions for these sets are much smoother. Although difficult to quantify, the results show that the degree distribution could potentially be used to diagnose whether the F-score of an extraction set is abnormally low.

Fig. 4

Degree distributions of the City extraction network for the eight datasets described in Table 3, in support of the findings in Research Question 2. The X-axis and Y-axis respectively plot the degree and its empirical frequency in the network. Scales are log-log

Fig. 5

Degree distributions of the Phone extraction network for the eight datasets described in Table 5, in support of the findings in Research Question 2. The X-axis and Y-axis respectively plot the degree and its empirical frequency in the network. Scales are log-log

Fig. 6

Degree distributions of the Name extraction network for the eight datasets described in Table 4, in support of the findings in Research Question 2. The X-axis and Y-axis respectively plot the degree and its empirical frequency in the network. Scales are log-log

Discussion

In exploring the research questions, we utilized a select set of network metrics on the AEN as an approach for evaluating the quality of rival IE systems without access to ground-truth, in an unusual domain such as human trafficking where ground-truths are difficult to acquire across the Web. These network metrics were found to be correlated with some IE accuracy metrics, particularly precision. It is not currently feasible to provide a mathematical justification for why this turned out to be the case (in future work, we are looking to develop a theoretical model explaining this finding), but we posit two intuitive reasons below:

  1. Noise in IE is non-random i.e. if a word or phrase got mis-extracted in one document, there is a higher-than-normal probability that it will get mis-extracted in another document. This occurs despite the observations that both documents were generated independently, the contexts surrounding the word are distinct in both documents, and the training data was sufficiently representative. Intuitively, this occurs because there is an extraneous property that leads to the noise. For example ‘Charlotte’ may be getting mis-extracted more often than ‘Los Angeles’ by a Named Entity Recognition system (Nadeau and Sekine 2007), despite representative training data, because Charlotte is also a common name. Charlotte is what a practitioner would define as a ‘difficult’ example, even though it is hard to formalize what makes one example more difficult than another.

  2. In aggregate, noise seems to have macroscopic structure that can be formally quantified using concepts from network science. One reason for this is that the universe of extractions (i.e. all possible extractions) tends to be bounded in practice. For example, there is a finite number of locations in the world, and although any string could potentially be a name, the number of names in a corpus tends to be bounded. Because extractions (and also mis-extractions) are repeated across documents, certain regular structures and patterns may emerge. Consider again the case of ‘Charlotte’ statistically: assuming it gets incorrectly extracted by a recall-friendly system, along with the true city extraction, it is statistically unlikely that the true extraction will also be a city in North Carolina. In the broader AI community, probabilistic techniques (such as Probabilistic Soft Logic or PSL) have exploited this observation to ingest extractions from multiple independent IEs, and identify true extractions by probabilistically reasoning about such patterns (Kimmig et al. 2012). However, PSL needs clear domain rules (which are beyond the reach of non-technical investigative experts), and knowledge graph identification systems that rely on PSL try to combine multiple IE and Entity Resolution systems to leverage such statistical knowledge.

We also note that the study is not without its limitations, which must be borne in mind before applying the findings to other HT datasets, or to datasets from similar illicit (or even non-illicit) domains. Importantly, while the systems and datasets considered in this article are real-world, the structural metrics are ‘global’, meaning that, in general, it is considerably more difficult to predict precisely when a given system is wrong (i.e. pinpoint individual wrong extractions or links). In the future, it may be possible to use the concepts developed in this work in a machine learning setting to make such microscopic predictions; at present, the network metrics are computed over the full network rather than on a per-node or per-edge basis. Nevertheless, because the network metrics capture an aggregate property of the performance of the underlying IE system (e.g., whether it is precision-favoring or not), they could be used to configure the IE system through hyperparameter optimization (Bergstra et al. 2011; Eggensperger et al. 2013). Intuitively, each set of hyperparameters yields a ‘different’ IE system, expressing a performance tradeoff typically captured through ROC curves (plotted using a validation set) in the machine learning literature. Unlike ROC curves, the network metrics are computed in an unsupervised fashion, and do not need labeled data.

Conclusion

In this article, we addressed the problem of assessing and profiling data quality in competing information extraction systems over domains that are unusual, have no ground truth annotations, but are consequential in the real world. We conducted a detailed empirical study using extractions covering three attributes, and different IE precision-recall tradeoffs, over a large corpus of webpages in the sex advertisement domain. The empirical studies illustrate some interesting aspects of noise in IE systems. For example, we found that, in real-world extraction systems, edges introduced in the attribute extraction network due to erroneous extractions tend to be of the ‘weak tie’ variety and lead to larger connected components. Recall was not found to exhibit strong dependencies on any structural metrics. Finally, noise distributions were found to exhibit non-random tendencies, with more predictable patterns emerging for lower levels of noise.

In ongoing research, we are looking to release a software package that is able to use regression analysis to predict precision, recall and F-Measure scores for different configurations of an IE system, given baseline scores with respect to a default configuration. This package is expected to serve a useful purpose both in active learning and for determining system improvement with small or no ground truths.