Spatiotemporal Information for the Web
- GEO/GEO Ambiguity
Many locations can share a single place name
- GEO/NON-GEO Ambiguity
A location name can also be used as another type of name
Global reference time
A site name or geographic scope mentioned in Web pages
Local reference time
Named entity recognition
- Primary Location
The most appropriate location associated with a Web page
- Primary Time
The most appropriate time associated with a Web page
- Search Engine
A popular tool for finding information on the Web
One or more units of chronons. It can be a time instant or a time period
This entry concerns the spatiotemporal information found on the Web, particularly in Web pages. Typical spatiotemporal information on the Web includes the locations and times mentioned in Web pages, the update dates of Web pages, and the locations of Web servers. Location and time are essential dimensions of information, including Web information, yet traditional keyword-based Web search engines usually ignore them.
Traditional search engines rely on keyword-based or content-based methods. Although many contributions have been made in both directions, users still find it difficult to express some search needs. For example, more than 70% of Web queries are related to time and location (Setzer and Gaizauskas 2002; Sanderson and Kohler 2004), but spatiotemporal Web queries such as “get the news about the Beijing Olympics from the last three days” or “get the sales information about Nike in Beijing this week” often return poor results in traditional search engines. One reason is that such queries are difficult to express as keywords. Moreover, traditional search engines lack the ability to process such spatiotemporal queries.
To improve the effectiveness and efficiency of spatiotemporal queries in search engines, many researchers have begun to study the spatiotemporal information on the Web. However, most previous research has treated time-based Web search (Nunes et al. 2008) and location-based Web search (Wang et al. 2005; Ding et al. 2000; Zhou et al. 2005; Markowetz et al. 2005) separately, and few works have considered the temporal information in the content of Web pages. In this entry, we describe the semantics of spatiotemporal information on the Web and present a framework for spatiotemporal information extraction in the Web context.
Spatiotemporal information has been studied in depth in the spatiotemporal database area, which concentrates on moving geographic objects. However, this line of work has seen little use in the Web context. At present, therefore, the main focus of work on spatiotemporal information on the Web is to integrate location and time information into the search process, including information extraction, indexing, querying, ranking, and visualization.
In this entry, we focus on the spatiotemporal semantics of Web information, mainly of Web pages, and present a framework to represent and extract the spatiotemporal information on the Web.
Identifier: The identifier of a Web page is usually the URL.
Locations: The spatial information of a Web page may consist of two types of locations, which are provider location and content locations (Wang et al. 2005). The provider location refers to the physical location of the provider who owns the Web resource. The content locations are the geographic locations that are described in the content of a Web page.
Time: The temporal information of a Web page is of two types: update time and content time. The update time is the time at which the Web page was last modified. The content time is the time indicated by the content of the Web page. The content time may include implicit times such as “today” and “three days ago.”
Non-spatiotemporal attributes: The non-spatiotemporal attributes of a Web page refer to the traditional keyword set of the Web page.
Most Web pages contain location and time information. Previous works regard the locations in a Web page as its geographic scope (Ding et al. 2000), which can be determined by analyzing the content and links of the page. The locations in a Web page usually have spatial containment relationships; for example, “China” contains “Beijing.” Zhou et al. (2005) present a classification framework for Web locations and propose an algorithm to extract the locations in Web pages. To support spatial computation, they use minimum bounding rectangles (MBRs) to represent the geographic scope of Web pages. Other representations of geographic scope have also been proposed, such as the raster-based representation (Markowetz et al. 2005), but the MBR-based method is the most widely used (Lee et al. 2003; Ma and Tanaka 2004). One problem with these works is that they treat the geographic scope of a Web page as exactly one MBR, which is imprecise for many Web pages.
Temporal information is also very common in Web pages, especially in news pages. Temporal information extraction first appeared in MUC-5, where the task was to extract from business news the time at which a joint venture took place. In MUC-6, some research addressed the extraction of absolute time information as part of the general named entity recognition task (Sundheim and Chinchor 1995). In MUC-7, the notion of temporal information extraction was expanded to include relative time in named entities (Chinchor 1998). MUC was thus the pioneer and prime driver of temporal information extraction research.
The temporal information of a Web page refers to the times related to it, e.g., the creation date of the page and the date of an event reported in the page. Temporal information takes many forms in Web pages, such as “yesterday,” “Christmas,” and “August 15, 2012.” In addition, many Web queries are time sensitive: fresh Web pages matter more when users search for news or sales information.
Proposed Solution and Methodology
In this section, we introduce our approach to capturing spatiotemporal information in Web search. The approach consists of three main components: (1) Semantic Modeling for Spatiotemporal Information in the Web, (2) Extracting Primary Locations from the Web, and (3) Extracting Primary Time from the Web.
Semantic Modeling for Spatiotemporal Information in the Web
From an object-oriented perspective, a Web page can be defined as follows:
Definition 1 A Web page is a quadruple O = <OID, LD, TD, AD>, where OID (Object IDentifier) is the identifier of the Web page, LD (Location Descriptor) describes the location information of the Web page, TD (Time Descriptor) describes the temporal information of the Web page, and AD (Attribute Descriptor) describes the non-spatiotemporal properties of the Web page.
The location descriptor represents the location information of a Web page. A Web page has a unique provider location, which is the geographic location of the Web server hosting the page. The locations described in the content of a Web page are called content locations. For example, for a company’s homepage, the provider location may be “Beijing,” since the Web server hosting the homepage is located in Beijing, while the content locations may include the address of the company and other locations. As many locations may appear in the content of a Web page, we define a primary location for the content of the page: the most appropriate location to describe the location information of the page. In the previous example, the primary location could be the address of the company. How to compute the primary location of a Web page, however, remains an open issue in location-based Web search.
Update Time. This is the update time of the file corresponding to a Web page. For a given Web page, the update time is unique and can be regarded as the timestamp of the page. Whenever a Web page is updated, its update time is renewed.
Content Time. This is the temporal information that appears in the text content of a Web page. Unlike the update time, which is unique and explicit for a specific Web page, the content time is a set of time instants or time periods, each of which may be explicit or implicit. For example, a news page may carry the explicit publication time “2008-1-24” in its title, while the news body may contain temporal keywords such as “three days ago” and “today.” Implicit content times should be translated into calendar times. Among the many time instants and periods described in the content of a Web page, we also need to define and compute the primary time of the page, which is the most appropriate time related to the page. In a time-based Web search engine, primary time and secondary time should be treated and searched in different ways.
The above classification of Web page time mainly considers the role of time in Web pages. From the viewpoint of time structure, there are two types of time: instant and period.
Instant. An instant is a specific point on the timeline. An instant may be a second, e.g., “2008-04-01 11:59:59.” It can also be a time point relative to the current time, e.g., “one hour ago” means the instant one hour before the current time.
Period. A period is a time duration. It consists of a pair of instants and represents the duration between them. For example, “[2000-09-01 00:00:00, 2003-02-01 00:00:00]” represents the duration from “2000-09-01 00:00:00” to “2003-02-01 00:00:00,” and “[2002-09-01 00:00:00, NOW]” indicates the duration since “2002-09-01 00:00:00.”
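The two time structures can be sketched in a few lines of Python. This is only an illustrative model, not part of the original framework: the class name `Period`, the helper `relative_instant`, and the convention that `end=None` stands for NOW are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Period:
    """A closed period [start, end]; end=None models the open bound NOW (assumed convention)."""
    start: datetime
    end: Optional[datetime] = None

    def resolved_end(self, now: datetime) -> datetime:
        # Replace the symbolic NOW by the caller-supplied current time.
        return self.end if self.end is not None else now

    def contains(self, instant: datetime, now: datetime) -> bool:
        # An instant lies in the period if it falls between the two bounds.
        return self.start <= instant <= self.resolved_end(now)


def relative_instant(now: datetime, hours_ago: int) -> datetime:
    """Resolve an instant such as 'one hour ago' against the current time."""
    return now - timedelta(hours=hours_ago)


now = datetime(2008, 4, 1, 11, 59, 59)
since_2002 = Period(datetime(2002, 9, 1))  # [2002-09-01 00:00:00, NOW]
print(since_2002.contains(datetime(2005, 1, 1), now))
print(relative_instant(now, 1))
```

An instant is represented directly as a `datetime`, so no separate class is needed for it in this sketch.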
Another issue in the temporal semantics of Web pages is the granularity of time. Different events in Web pages have different granularities, e.g., the founding of a company may use “day” as the granularity, while a news report about an earthquake may use “second.” How to set up a unified referential framework for temporal granularity is a critical issue in modeling the spatiotemporal information of Web pages.
The attribute descriptor holds the text keywords that best depict the content of a Web page. Generally, it consists of a set of keywords extracted from the page. Many traditional techniques can be used to construct the attribute descriptor, such as the word segmentation and keyword extraction used in commercial search engines.
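The Web page model of Definition 1 can be written down as a small data structure. The sketch below is one possible encoding under stated assumptions: the field names, the use of plain strings for locations and content times, and the example URL are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class LocationDescriptor:
    provider_location: str                       # where the hosting Web server is located
    content_locations: List[str] = field(default_factory=list)
    primary_location: Optional[str] = None       # filled in by primary location extraction


@dataclass
class TimeDescriptor:
    update_time: datetime                        # unique timestamp of the page
    content_times: List[str] = field(default_factory=list)  # raw explicit or implicit expressions
    primary_time: Optional[str] = None           # filled in by primary time extraction


@dataclass
class WebPage:
    oid: str                                     # OID: usually the URL
    ld: LocationDescriptor                       # LD
    td: TimeDescriptor                           # TD
    ad: List[str]                                # AD: keyword set


# Hypothetical company homepage hosted in Beijing.
page = WebPage(
    oid="http://www.example.com/",
    ld=LocationDescriptor("Beijing", ["Beijing", "Shanghai"]),
    td=TimeDescriptor(datetime(2008, 1, 24), ["2008-1-24", "three days ago", "today"]),
    ad=["company", "sales"],
)
print(page.ld.provider_location, page.td.update_time.date())
```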
Extracting Primary Locations from the Web
Most Web pages are associated with certain locations, e.g., news reports and retailer promotions. Therefore, how to extract locations from Web pages and use them in the Web search process has become an active and critical issue in Web search.
As a Web page usually contains two or more location words, it is necessary to find the primary locations of the page, that is, the most appropriate locations associated with its content. Generally, we assume that each Web page has several primary locations. The most difficult issue in determining primary locations is the presence of GEO/GEO and GEO/NON-GEO ambiguities in Web pages. GEO/GEO ambiguity means that many locations can share a single place name. For example, Washington names 41 cities and communities in the USA and 11 locations outside it. GEO/NON-GEO ambiguity means that a location name can also be used as another type of name, such as a person name. For example, Washington is a person name in “George Washington” and a location name in “Washington, D.C.” The work of Mark Sanderson et al. (2000) shows that an error rate of 20–30% in location name disambiguation is enough to worsen the performance of information retrieval methods. Because of these ambiguities, previous research failed to reach satisfactory performance in primary location extraction.
Moreover, it is hard to resolve the GEO/GEO and GEO/NON-GEO ambiguities, or to determine the primary locations of Web pages, with the widely studied named entity recognition (NER) approaches. Current NER tools for the Web aim at annotating named entities, including place names, in Web pages. Although NER tools can remove some of the GEO/NON-GEO ambiguities, GEO/GEO disambiguation remains a problem, and NER tools do not address the extraction of primary locations at all. What NER tools do provide is a set of place names extracted from a Web page, which can be processed further to resolve the GEO/GEO ambiguities as well as the remaining GEO/NON-GEO ones. We therefore concentrate not on NER itself but on the subsequent disambiguation and primary location determination, which differ considerably from traditional NER approaches.
The General Framework
As Fig. 3 shows, we obtain a set of geo-candidates before the disambiguation procedure. We assume that all geo-candidates are associated with locations in the Web page.
Suppose there are n geo-candidates in a Web page and N locations in total that those geo-candidates may refer to. The GEO/GEO disambiguation problem can then be formalized as follows:
Given a specific geo-candidate G, determine the most appropriate location among its possible locations.
In detail, a geo-candidate gives more evidence to geo-candidates near it in the Web page (text contribution), and a location gives more evidence to locations near it in the geographic context (geographic contribution). We therefore first construct a matrix M over all locations, with each location occupying one row and one column, whose entries are the scores that each location of each geo-candidate receives as votes from the locations of the other geo-candidates. This procedure resembles the voting process in PageRank, except that the items in M are locations rather than Web pages and the scoring policy is based on text contribution and geographic contribution rather than on Web links.
Rule 1: In the matrix M, if a location of a geo-candidate receives its score evenly from all locations of the other geo-candidates, it is not considered a location, because no possible location of any other geo-candidate gives evidence for the locations of this geo-candidate.
Rule 2: After the GEO/GEO ambiguity has been removed, if a non-country location does not share a country with any other location, it is not considered a location. This rule comes from our observation that a Web page is unlikely to mention a non-country location that does not share a country with any other location in the page.
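The voting idea and Rule 2 can be illustrated with a minimal Python sketch. The entry does not give the exact scoring formulas, so the weights used here are assumptions: the text contribution is taken as 1 / (1 + token distance), the geographic contribution as 1 / (1 + great-circle distance in thousands of km), and each location's score is the sum of their products over the locations of all other geo-candidates (one row sum of M). The names `Location`, `GeoCandidate`, `disambiguate`, and `apply_rule_2`, as well as the approximate coordinates, are hypothetical. Rule 1 is not implemented.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Location:
    name: str
    country: str
    lat: float
    lon: float
    is_country: bool = False


@dataclass
class GeoCandidate:
    text: str                  # place name as it appears in the page
    position: int              # token offset in the page
    locations: List[Location]  # possible referents from a gazetteer


def haversine_km(a: Location, b: Location) -> float:
    """Great-circle distance between two locations in kilometers."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dlat, dlon = p2 - p1, math.radians(b.lon - a.lon)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def disambiguate(candidates: List[GeoCandidate]) -> List[Optional[Location]]:
    """For each geo-candidate, pick the location with the highest vote total."""
    chosen: List[Optional[Location]] = []
    for i, cand in enumerate(candidates):
        best, best_score = None, -1.0
        for loc in cand.locations:
            score = 0.0
            for j, other in enumerate(candidates):
                if i == j:
                    continue  # votes come only from other geo-candidates
                text_w = 1.0 / (1 + abs(cand.position - other.position))
                for other_loc in other.locations:
                    geo_w = 1.0 / (1 + haversine_km(loc, other_loc) / 1000.0)
                    score += text_w * geo_w
            if score > best_score:
                best, best_score = loc, score
        chosen.append(best)
    return chosen


def apply_rule_2(locations: List[Location]) -> List[Location]:
    """Drop non-country locations that share a country with no other location."""
    kept = []
    for i, loc in enumerate(locations):
        shares = any(j != i and o.country == loc.country for j, o in enumerate(locations))
        if loc.is_country or shares:
            kept.append(loc)
    return kept


dc = Location("Washington, D.C.", "US", 38.9, -77.0)
wa_uk = Location("Washington, Tyne and Wear", "GB", 54.9, -1.5)
baltimore = Location("Baltimore", "US", 39.3, -76.6)
page = [GeoCandidate("Washington", 12, [dc, wa_uk]), GeoCandidate("Baltimore", 15, [baltimore])]
print([loc.name for loc in disambiguate(page)])
```

In this example the nearby mention of Baltimore pulls “Washington” toward Washington, D.C., because the two are geographically close.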
Determining Primary Locations
In this stage, we calculate the scores of all locations remaining after disambiguation and then return the focused ones for the Web page. We consider three aspects when computing the score of a location: term frequency, position, and geographic contribution (the contribution from locations that are geographically contained in the location). The motivation for the geographic contribution is that if a Web page mentions many states of the USA, the location “USA” should receive contributions from those states, as they are all geographically contained in the USA. Accordingly, we use an explicit score to represent the term frequency of a location name and an implicit score for the geographic contribution. The score of a location is determined by its explicit score and its implicit score.
For a location $D_i$, its explicit score, denoted as $ES(D_i)$, is defined as the term frequency of $D_i$ in the Web page.
If $D_i$ immediately follows another location $D_j$ and the two are related, say $D_j$ is contained in $D_i$, then the appearance of $D_i$ serves to emphasize $D_j$, so we take 0.5 away from $D_i$ and add it to $D_j$, i.e., $ES(D_i) = ES(D_i) - 0.5$ and $ES(D_j) = ES(D_j) + 0.5$.
If $D_i$ appears in the title of a Web page, then we add half of SUM to $ES(D_i)$ to emphasize this appearance, where SUM is the sum of all the ES values, as defined in formula (1):
$$SUM = \sum_{i} ES(D_i) \qquad (1)$$
Here, diff refers to the score difference among $S_1, S_2, \ldots, S_m$. The average value of $S_1, S_2, \ldots, S_m$ must be less than or equal to their maximum value, so $\mathit{diff} \le 1$. If $D_i$ contains no sub-locations, then $IS(D_i) = 0$.
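The explicit score rules can be sketched as follows. This is a hedged illustration only: the function name `explicit_scores`, the input encoding (mentions as name and token-position pairs, containment as parent and child pairs), the reading of “immediately follows” as a position difference of 1, and the choice to compute SUM after the 0.5 transfers are all assumptions. The implicit score is not covered.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def explicit_scores(
    mentions: List[Tuple[str, int]],   # (location name, token position) in the page body
    contains: Set[Tuple[str, str]],    # (parent, child) geographic containment pairs
    title_locations: Set[str],         # location names that appear in the title
) -> Dict[str, float]:
    # Start from the term frequency of each location.
    es = {name: float(tf) for name, tf in Counter(n for n, _ in mentions).items()}

    # If D_i directly follows D_j and D_i contains D_j, move 0.5 from D_i to D_j.
    ordered = sorted(mentions, key=lambda m: m[1])
    for (prev, prev_pos), (cur, cur_pos) in zip(ordered, ordered[1:]):
        if cur_pos - prev_pos == 1 and (cur, prev) in contains:
            es[cur] -= 0.5
            es[prev] += 0.5

    # Title bonus: add half of SUM (the sum of all ES values) to each title location.
    total = sum(es.values())
    for name in title_locations:
        if name in es:
            es[name] += 0.5 * total
    return es


scores = explicit_scores(
    mentions=[("Beijing", 10), ("China", 11), ("Beijing", 40), ("Shanghai", 75)],
    contains={("China", "Beijing"), ("China", "Shanghai")},
    title_locations={"Beijing"},
)
print(scores)
```

In the example, “Beijing, China” shifts 0.5 from China to Beijing, and the title mention then adds half of the total to Beijing.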
Based on a gazetteer, we build a hierarchical location tree. Starting from the leaf nodes, we compute the scores of all locations. We then sort the locations by score and partition them into three groups; the group with the highest scores is taken as the primary locations.
Extracting Primary Time from the Web
How can the right temporal information be determined for the implicit expressions in Web pages? Unlike explicit expressions, which can be located directly in a calendar, implicit expressions need a transformation process, which usually requires a referential time.
How can the primary time of a Web page be determined? A Web page may contain a lot of temporal information, but which times are the most appropriate ones to associate with the page? This question is very important to temporal-textual Web search engines, which support both term-based and time-based queries and aim at finding “the Web pages associated with the given terms and under the given temporal predicate.” For instance, to answer the query “find information about tourism during the National Day,” the search engine first has to determine which Web pages are most related to “the National Day.”
For the first issue, implicit time resolution, the difficult part is selecting the referential time used to resolve implicit expressions. For example, to determine the exact time of the implicit expression “yesterday” in a Web page, we must know the date of NOW in that context.
For the second issue, primary time determination, the difficult part is developing an effective scoring technique to measure the importance and relevance of the extracted temporal information. As there may be containment relationships among temporal information, the time ranking task has to consider both frequency and temporal containment. For instance, suppose “April, 2011” and “17 April, 2011” are two extracted time words, and “17 April, 2011” is contained in “April, 2011.” Then, even if “April, 2011” rarely appears in the Web page, it will still be the primary time of the page provided that a great number of the extracted time words are contained in “April, 2011.”
We propose a new dynamic approach to resolving the implicit temporal expressions in Web pages. We classify implicit expressions into global and local temporal expressions and use different methods to determine the referential time for each class.
We present a score model to determine the primary time of Web pages. The score model takes into account both the frequency of temporal information in Web pages and the containment relationships among temporal information.
Temporal Expressions Extraction
Explicit Temporal Expressions. These temporal expressions directly describe entries in some timeline, such as an exact date or year. For example, the token sequences “December 2004” or “September 12, 2005” in a document are explicit temporal expressions and can be mapped directly to chronons in a timeline.
Implicit Temporal Expressions. These temporal expressions represent temporal entities that can only be anchored in a timeline in reference to another explicit or implicit, already anchored temporal expression. For example, the expression “today” alone cannot be anchored in any timeline. However, it can be anchored if the document is known to have a publication date. This date can then be used as a reference for that expression, which can then be mapped to a chronon. There are many instances of implicit temporal expressions, such as the names of weekdays (e.g., “on Thursday”) or months (e.g., “in July”) or references to points in time such as “next week” or “last Friday.”
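Anchoring an implicit expression against a reference date can be illustrated with a short Python sketch. It covers only three patterns and is not how TempEx or GUTime work internally; the function name `anchor` and the chosen patterns are assumptions for this example, and chronons are taken to be days.

```python
from datetime import date, timedelta
from typing import Optional

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def anchor(expression: str, reference: Optional[date]) -> Optional[date]:
    """Map an implicit expression to a calendar day, or None if it cannot be anchored."""
    if reference is None:
        return None  # without a reference date, "today" cannot be placed on the timeline
    expr = expression.lower().strip()
    if expr == "today":
        return reference
    if expr == "yesterday":
        return reference - timedelta(days=1)
    if expr.startswith("last ") and expr[5:] in WEEKDAYS:
        target = WEEKDAYS.index(expr[5:])
        # Days back to the most recent such weekday strictly before the reference.
        delta = (reference.weekday() - target - 1) % 7 + 1
        return reference - timedelta(days=delta)
    return None  # pattern not covered by this sketch


publication_date = date(2009, 5, 6)  # a Wednesday
print(anchor("today", publication_date))
print(anchor("last Friday", publication_date))
print(anchor("today", None))
```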
Explicit temporal expressions can be recognized by many time annotation tools, such as TempEx and GUTime (GUTime 2012). The temporal expressions in the GUTime output are annotated with TIMEX3 tags, an extension of the ACE 2004 TIMEX2 annotation scheme (tern.mitre.org).
The biggest difference between recognizing explicit and implicit temporal expressions is that an implicit expression needs a reference time, so choosing the right reference is the key to resolving it. The reference time can be either the publication time or another temporal expression in the document. Although GUTime performs well in extracting explicit temporal expressions, it does not perform very well on implicit ones, especially when the document publication time is missing. To improve on GUTime, we need to improve its reference-choosing mechanism.
In this entry, we suppose that an implicit temporal expression consists of a temporal noun and a modifier that modifies it. For instance, consider the following news report:
“(Beijing, May 6, 2009) B company took over A company totally on March 8, 2000. After 1 week, B company was listed in Hong Kong and became the first listed company in that industry. However, owing to decision-making mistakes by the leadership and the company’s later poor management, B company ran up debts of several hundred million dollars and was forced to announce bankruptcy this Monday.”
As a general example, “ten days” is a temporal noun, while “ten days ago” is the expression obtained by adding the modifier “ago.” For the two temporal expressions in this text that hold reference relations, “after 1 week” and “this Monday,” the anchor direction is easily obtained from the modifiers through mapping rules, and the offsets can be interpreted directly by pattern matching. For the anchor points (referents), however, we must build context-dependent reference reasoning to trace them. The full temporal reference comes from two parts: the modifier reference and the temporal noun reference. Because the former is inferred from the latter, temporal noun reference reasoning plays the more important role in normalization. We notice that temporal nouns can be divided into two classes according to their reference attributes. One is Global Time (GT), whose temporal semantics is independent of the current context and which takes the report time or publication time as its referent. The other is Local Time (LT), which depends on the current context and refers to the narrative time in the preceding text.
Global Reference Time: The Global Reference Time (GRT) is the reference time referred to by a Global Time. Specifically, it is the report time or publication time of the document.
Local Reference Time: The Local Reference Time (LRT) is the reference time referred to by a Local Time. It is updated dynamically after each normalization.
Each class of time thus chooses its reference dynamically and automatically according to its class, instead of relying on a fixed value or a rigid rule as in a static mechanism. The reference time table is updated immediately after each normalization, which keeps the temporal state consistent with the dynamically changing context.
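The dynamic choice between GRT and LRT can be sketched on the news report quoted earlier. This is a simplified illustration under several assumptions: expressions arrive already classified as explicit, local offsets (LT), or global weekdays (GT); the LRT is initialized to the GRT; weeks start on Monday; and the names `ReferenceTable` and `normalize` are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Tuple, Union


class ReferenceTable:
    """Holds the two reference times used during normalization."""

    def __init__(self, publication_date: date) -> None:
        self.grt = publication_date  # Global Reference Time: fixed for the document
        self.lrt = publication_date  # Local Reference Time: assumed to start at the GRT


Expression = Tuple[str, Union[date, int]]


def normalize(expressions: List[Expression], publication_date: date) -> List[date]:
    refs = ReferenceTable(publication_date)
    resolved = []
    for kind, payload in expressions:
        if kind == "explicit":            # e.g. "March 8, 2000": already a calendar date
            value = payload
        elif kind == "local_offset":      # LT, e.g. "after 1 week": offset in days from the LRT
            value = refs.lrt + timedelta(days=payload)
        elif kind == "global_weekday":    # GT, e.g. "this Monday": weekday index in the GRT week
            week_start = refs.grt - timedelta(days=refs.grt.weekday())
            value = week_start + timedelta(days=payload)
        else:
            raise ValueError(f"unknown expression kind: {kind}")
        refs.lrt = value                  # the LRT is updated after each normalization
        resolved.append(value)
    return resolved


report = [
    ("explicit", date(2000, 3, 8)),  # "on March 8, 2000"
    ("local_offset", 7),             # "After 1 week"
    ("global_weekday", 0),           # "this Monday"
]
print(normalize(report, publication_date=date(2009, 5, 6)))
```

With the publication date May 6, 2009, “After 1 week” resolves against the narrative time March 8, 2000, while “this Monday” resolves against the publication date.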
Determining the Primary Time
We propose a score model to calculate the score of each temporal expression. We consider two aspects: the term frequency of the temporal expression and the relevance between temporal expressions. That term frequency affects the score is easy to understand, so here we focus on the relevance between temporal expressions. Suppose an article contains several temporal expressions, most of which refer to particular days in March. In this case, we tend to choose March as the primary time rather than any one of those days. On this view, a temporal expression contributes to its parent temporal expression. For example, the expression “March 7, 2012” contributes to its parent expression “March 2012,” and the expression “1983” contributes to its parent expression “1980s.”
We define the score of a temporal expression as the combination of an explicit score and an implicit score. The explicit score is related to the term frequency of the temporal expression, and the implicit score is related to the contributions made by all its child expressions. The score of $T_i$ is the sum of its explicit score, denoted as $ES(T_i)$, and its implicit score, denoted as $IS(T_i)$.
Here, $TF_{ete}(T_i)$ is the term frequency of the explicit temporal expressions that are recognized as $T_i$, and $TF_{ite}(T_i)$ is the term frequency of the implicit temporal expressions that are resolved to $T_i$. $d$ is the weighting factor: if $d$ is set to 1, explicit and implicit temporal expressions are given the same level of credibility; if $d$ is set to 0, implicit temporal expressions are ignored.
Finally, we compute the score of each temporal expression from its explicit and implicit scores and choose the Top-K temporal expressions as the primary time of the Web page.
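A minimal Python sketch of such a score model follows. The exact formulas are not reproduced in this entry, so two assumptions are made and should be read as such: the explicit score is taken to be ES(T) = TF_ete(T) + d · TF_ite(T), which matches the described behavior of d at 0 and 1, and the implicit score is taken to be a fraction `share` (a hypothetical parameter, default 1.0) of the summed scores of the child expressions. The containment hierarchy is assumed to be a tree.

```python
from typing import Dict, List


def explicit_score(tf_explicit: int, tf_implicit: int, d: float = 0.5) -> float:
    """Assumed form of ES: explicit frequency plus d-weighted implicit frequency."""
    return tf_explicit + d * tf_implicit


def total_scores(
    es: Dict[str, float],      # explicit score of each temporal expression
    parent: Dict[str, str],    # child expression -> containing (parent) expression
    share: float = 1.0,        # assumed fraction of a child's score passed to its parent
) -> Dict[str, float]:
    children: Dict[str, List[str]] = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    memo: Dict[str, float] = {}

    def score(t: str) -> float:
        # Score = explicit score + implicit score gathered from child expressions.
        if t not in memo:
            implicit = share * sum(score(c) for c in children.get(t, []))
            memo[t] = es.get(t, 0.0) + implicit
        return memo[t]

    nodes = set(es) | set(parent) | set(parent.values())
    return {t: score(t) for t in nodes}


def top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring expressions (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


es = {"2012-03-07": 2.0, "2012-03-09": 1.0, "2012-03-15": 1.0}
parent = {day: "2012-03" for day in es}
scores = total_scores(es, parent)
print(top_k(scores, 1))
```

With `share` = 1.0 a parent always scores at least as high as any of its children, so a smaller value may be preferable when the hierarchy is deep; that choice is left open here.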
TASE, a time-aware search engine (Lin et al. 2012), extracts the temporal expressions of each Web page and calculates the relevance score between the Web page and each temporal expression. Compared with traditional approaches, TASE uses a new approach that chooses the reference time dynamically to extract implicit temporal expressions in Web pages. In addition, it distinguishes temporal expressions by their relevance scores and takes the containment relationships among temporal expressions into consideration.
TASE combines temporal similarity and textual similarity to re-rank the search results. Our experiments demonstrate its effectiveness in dealing with time-sensitive Web queries.
Extract Candidate Documents. This module extracts the original Top-K documents from the search results, which are used as the candidate documents.
Extract Temporal Expressions. This module extracts all the temporal expressions in each candidate document, both explicit and implicit.
Calculate Relevance Score. This module calculates the relevance score between a temporal expression and a Web page.
Calculate Temporal Similarity. This module calculates the similarity between the temporal expressions in a query and those in a document.
Calculate Textual Similarity. TASE is built on Lucene, an open-source search engine. Therefore, we use the textual similarity determined by Lucene as the original textual similarity.
Re-ranking. This module uses the temporal similarity and the original textual similarity to determine the final relevance score of a document.
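The re-ranking step can be illustrated with a short sketch. How TASE actually combines the two similarities is not specified here, so the linear combination below, the weight `alpha`, and the assumption that both similarities are already normalized to [0, 1] are illustrative choices rather than the system's method.

```python
from typing import List, Tuple


def rerank(
    candidates: List[Tuple[str, float, float]],  # (doc id, textual similarity, temporal similarity)
    alpha: float = 0.5,                          # hypothetical weight of the temporal similarity
) -> List[str]:
    """Order candidate documents by a weighted sum of the two similarities."""
    scored = [
        (alpha * temporal + (1 - alpha) * textual, doc_id)
        for doc_id, textual, temporal in candidates
    ]
    # Sort by final score, highest first; the sort is stable, so ties keep their original order.
    scored.sort(key=lambda item: -item[0])
    return [doc_id for _, doc_id in scored]


top_docs = [("a", 0.9, 0.1), ("b", 0.6, 0.8)]
print(rerank(top_docs, alpha=0.5))
print(rerank(top_docs, alpha=0.0))
```

Setting `alpha` to 0 reproduces the original textual ranking, which gives a simple baseline for comparison.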
The spatiotemporal information presented here may be used in many applications. First, spatiotemporal information can be used in search engines to improve the quality of results. With spatiotemporal indexes and ranking algorithms, search engines can process time- and location-related Web queries effectively and efficiently. Second, our method is also useful in focused search engines, such as news search, product search, or stock search. In such applications, time and location information plays an important role, and our approach can offer better solutions to the search needs. Third, spatiotemporal information can be utilized in question answering and automatic summarization on the Web. Many questions on the Web are related to time and location, and they can be answered if we extract facts together with their associated times and locations. In automatic summarization, events are usually described along a timeline, so the extracted times of a given topic help to produce the summary.
This work can be extended to spatiotemporal analysis and mining of Web data, which may bring value to Web knowledge discovery. As the Web has come to be regarded as a major source of competitive intelligence, how to acquire competitive intelligence from the Web has become an active topic. Using spatiotemporal information, we can find historical information about competitors of interest and further anticipate their strategic plans for the near future. Spatiotemporal information can also be used to measure the credibility of Web information. With the rapid development of Web 2.0 and social network applications, there is a great deal of fake and false information on the Web, which introduces risks into decision making and other applications. Although information credibility involves many factors, spatiotemporal information can serve as one type of measurement for validating the credibility of specific information. For example, to determine the credibility of a piece of news reporting “Apple iPhone 5 has been released,” we can collect the Web pages or microblogs mentioning the news and perform spatiotemporal clustering to assess its credibility.
We would like to thank the University of Science and Technology of China for providing the environment where the study described in this entry was completed. The work involved in this article is partially supported by the National Science Foundation of Anhui Province (NO. 1208085MG117) and the USTC Youth Innovation Foundation.
- Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. In: Proceedings of WWW, Brisbane, pp 107–117
- Chinchor N (1998) MUC-7 information extraction task definition, version 5.1. In: Proceedings of the 7th message understanding conference (MUC-7), Fairfax
- Ding J, Gravano L, Shivakumar N (2000) Computing geographical scopes of web resources. In: Proceedings of VLDB, Cairo, pp 545–556
- GUTime (2012) http://www.timeml.org/site/tarsqi/modules/gutime/index.html. Accessed Aug 2012
- Lee R et al (2003) Optimization of geographic area to a web page for two-dimensional range query processing. In: Proceedings of fourth international conference on web information systems engineering workshops (WISEW 2003), Roma. IEEE Computer Society, pp 9–17
- Lin S, Jin P, Zhao X, Yue L (2012) TASE: a time-aware search engine. In: Proceedings of CIKM' 12, Maui. ACM
- Ma Q, Tanaka K (2004) Retrieving regional information from web by contents localness and user location. In: Proceedings of AIRS, Beijing, pp 301–312
- Markowetz A, Chen Y, Suel T, Long X, Seeger B (2005) Design and implementation of a geographic search engine. Technical report TR-CIS-2005-03, Polytechnic University, Brooklyn
- Nunes S, Ribeiro C, David G (2008) Use of temporal expressions in web search. In: Proceedings of ECIR' 08, Glasgow, pp 580–584
- Sanderson M, Kohler J (2004) Analyzing geographic queries. In: Proceedings of GIR' 04, Sheffield. ACM
- Setzer A, Gaizauskas R (2002) On the importance of annotating event-event temporal relations in text. In: Proceedings of LREC' 02, Paris
- Sundheim B, Chinchor N (1995) Named entity task definition, version 2.0. In: Proceedings of the 6th message understanding conference (MUC-6), Columbia. Morgan Kaufman, pp 319–332
- Wang C, Xie X et al (2005) Web resource geographic location classification and detection. In: Proceedings of WWW' 05, Chiba. ACM
- Zhou Y, Xie X, Wang C et al (2005) Hybrid index structures of location-based web search. In: Proceedings of CIKM' 05, Bremen