1 Introduction

Data from social media platforms is an attractive real-time resource for data analysts. It can be used for a wide range of use cases, such as monitoring fire (Paul et al. 2014) and flu outbreaks (Power et al. 2013), providing location-based recommendations (Ye et al. 2010), or conducting demographic analyses (Sloan et al. 2013). Although some platforms, such as Twitter, allow users to geolocate posts, Jurgens et al. (2015) reported that less than 3% of all Twitter posts are geotagged. This severely limits the use of social media data for such location-specific applications.

The location prediction task can be tackled either as a classification problem or as a multi-target regression problem. In the former case the goal is to predict a city label for a specific tweet, whereas in the latter case the goal is to predict latitude and longitude coordinates for a given tweet. Previous studies showed that text in combination with metadata can be used to predict user locations (Han et al. 2014). Liu and Inkpen (2015) presented a system based on stacked denoising auto-encoders (Vincent et al. 2008) for location prediction. State-of-the-art approaches, however, often rely on very specific, non-generalizing features based on website scraping, IP resolution, or external resources such as GeoNames. In contrast, we present an approach for geographical location prediction that achieves state-of-the-art results using neural networks trained solely on Twitter text and metadata. It does not require external knowledge sources and hence generalizes more easily to new domains and languages.

The remainder of this paper is organized as follows: First, we provide an overview of related work on Twitter location prediction. In Sect. 3 we describe the details of our neural network architecture. Results on the test set are presented in Sect. 4. Finally, we conclude the paper with some future directions in Sect. 5.

2 Related Work

For better comparability of our approach, we focus on the shared task presented at the 2nd Workshop on Noisy User-generated Text (WNUT’16) (Han et al. 2016). The organizers introduced a dataset to evaluate individual approaches for tweet- and user-level location prediction. For tweet-level prediction the goal is to predict the location of one specific message, while for user-level prediction the goal is to predict the user’s location based on a variable number of their messages. The organizers evaluate team submissions based on accuracy and error distance in kilometers. The latter metric accounts for predictions that are wrong but geographically close, for example, when the model predicts Vienna instead of Budapest.
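To make the distance metric concrete, the following is a minimal sketch of how such an error distance can be computed; the shared task only specifies distance in kilometers, so the choice of the haversine formula, the Earth radius of 6,371 km, the function name, and the (approximate) city coordinates are our assumptions.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Example: predicting Vienna (48.21, 16.37) for a tweet from
# Budapest (47.50, 19.04) is wrong, but only roughly 214 km off.
print(haversine_km(48.21, 16.37, 47.50, 19.04))
```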

We focus on the five teams who participated in the WNUT shared task. Official team results for tweet- and user-level predictions are shown in Table 1. Unfortunately, only three participants provided system descriptions, which we briefly summarize:

Table 1. Official WNUT’16 tweet- and user-level results ranked by tweet median error distance (in kilometers). Individual best results for all three criteria are highlighted in bold face.

Team FujiXerox (Miura et al. 2016) built a neural network using text, user-declared locations, timezone values, and user self-descriptions. For feature preprocessing the authors built several mapping services using external resources, such as GeoNames and time zone boundaries. Finally, they trained a neural network using the fastText n-gram model (Joulin et al. 2016) on post text, user location, user description, and user timezone.

Team csiro (Jayasinghe et al. 2016) used an ensemble learning method built on several information resources. First, the authors used post text, user location text, user time zone information, messenger source (e.g., Android or iPhone), and reverse country lookups for URL mentions to build a list of candidate cities contained in GeoNames. Furthermore, they scraped specific URL mentions and screened the website metadata for geographic coordinates. Second, a relationship network was built from tweets mentioning other users. Third, posts were used to find similar texts in the training data and to calculate class-label probabilities from the most similar tweets. Fourth, text was classified using the geotagging tool pigeo (Rahimi et al. 2016). The output of the individual stages was then combined in an ensemble learner.

Team cogeo (Chi et al. 2016) employed multinomial naïve Bayes and focused on textual features (i.e., location indicative words, GeoNames gazetteers, user mentions, and hashtags).

3 Methods

We used the WNUT’16 shared task data consisting of 12,827,165 tweet IDs, each assigned to a metropolitan city center from the GeoNames database using the strategy described in Han et al. (2012). As Twitter does not allow sharing individual tweets, posts must be retrieved via the Twitter API; we were able to retrieve 9,127,900 (71.2%) of them. The remaining tweets are no longer available, usually because users deleted these messages. In comparison, the winners of the WNUT’16 task (Miura et al. 2016) reported that they were able to retrieve 9,472,450 (73.8%) tweets. The overall training data comprises 3,362 individual class labels (i.e., city names); in our retrieved subset we observed only 3,315 different classes.
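Since the task distributes tweet IDs only, each participant has to "hydrate" them through the Twitter API. A minimal sketch using the tweepy library follows; the library choice, the bearer token, and the batching into chunks of 100 IDs are our assumptions and not part of the task setup (deleted tweets are simply missing from the response).

```python
import tweepy

# Hypothetical credentials; the task only distributes tweet IDs.
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

def hydrate(tweet_ids):
    """Fetch tweets in batches of 100 IDs (the per-call lookup limit)."""
    tweets = []
    for i in range(0, len(tweet_ids), 100):
        response = client.get_tweets(ids=tweet_ids[i:i + 100],
                                     tweet_fields=["text", "author_id"])
        tweets.extend(response.data or [])  # deleted tweets are skipped
    return tweets
```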

For text preprocessing, we use a simple whitespace tokenizer with lowercasing, without any domain-specific processing such as Unicode normalization (Davis et al. 2001) or lexical text normalization (see for instance Han and Baldwin (2011)). The text of tweets and the metadata fields containing text (user description, user location, user name, timezone) are converted to word embeddings (Mikolov et al. 2013), which are then forwarded to a Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber 1997). In our experiments we randomly initialized the embedding vectors. We use batch normalization (Ioffe and Szegedy 2015) to normalize layer inputs and thereby reduce internal covariate shift. The risk of overfitting through co-adapting units is reduced by applying dropout (Srivastava et al. 2014) between individual neural network layers. An example architecture for textual data is shown in Fig. 1a. Metadata fields with a finite set of elements (UTC offset, URL domains, user language, tweet publication time, and application source) are converted to one-hot encodings, which are forwarded to an internal embedding layer, as proposed by Guo and Berkhahn (2016). Again, batch normalization and dropout are applied to avoid overfitting. This architecture is shown in Fig. 1b; a sketch of both branches follows.
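The following Keras-style sketch illustrates the two branch types; the vocabulary size, embedding dimensions, sequence length, and dropout rate are placeholders for the values in Table 2, not the settings we actually used.

```python
from tensorflow.keras import layers

# Hypothetical sizes; the actual settings are listed in Table 2.
VOCAB_SIZE, EMB_DIM, LSTM_UNITS, MAX_LEN = 50_000, 100, 100, 30

def text_branch(name):
    """Textual branch (Fig. 1a): embedding -> LSTM -> batch norm -> dropout."""
    inp = layers.Input(shape=(MAX_LEN,), name=f"{name}_tokens")
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)  # randomly initialized
    x = layers.LSTM(LSTM_UNITS)(x)
    x = layers.BatchNormalization()(x)              # reduce internal covariate shift
    x = layers.Dropout(0.5)(x)                      # reduce co-adapting units
    return inp, x

def metadata_branch(name, n_values, emb_dim=10):
    """Metadata branch (Fig. 1b): one-hot index -> internal embedding
    (Guo and Berkhahn 2016) -> batch norm -> dropout."""
    inp = layers.Input(shape=(1,), name=f"{name}_id")
    x = layers.Embedding(n_values, emb_dim)(inp)
    x = layers.Flatten()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
    return inp, x
```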

Individual models are completed with a dense classification layer using a softmax activation function. We train with stochastic gradient descent over shuffled mini-batches using Adam (Kingma and Ba 2014) and cross-entropy loss as the objective function for classification. The parameters of our model are shown in Table 2.
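Continuing the sketch above, a single-branch classifier would be completed and trained as follows; the batch size and epoch count in the commented call are hypothetical stand-ins for the Table 2 values.

```python
from tensorflow.keras import Model, layers, optimizers

N_CITIES = 3362  # city labels in the full training data

inputs, features = text_branch("tweet_text")
outputs = layers.Dense(N_CITIES, activation="softmax")(features)
model = Model(inputs, outputs)

model.compile(optimizer=optimizers.Adam(),             # Kingma and Ba (2014)
              loss="sparse_categorical_crossentropy",  # cross-entropy objective
              metrics=["accuracy"])
# model.fit(token_ids, city_ids, batch_size=256, epochs=5, shuffle=True)
```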

Fig. 1. Architectures for city prediction.

Table 2. Selected parameter settings

The WNUT’16 task requires the model to predict both class labels and longitude/latitude pairs. To account for this, we predict the mean city longitude/latitude coordinates for the predicted class label. For user-level prediction, we classify all messages individually and predict the city label with the highest probability over all messages.
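A sketch of this decoding step follows; the lookup table `city_coords` (mean training coordinates per city label) and both function names are hypothetical.

```python
import numpy as np

def tweet_prediction(probs, city_coords):
    """Tweet level: argmax city label plus that city's mean (lat, lon)."""
    label = int(np.argmax(probs))
    return label, city_coords[label]

def user_prediction(per_tweet_probs, city_coords):
    """User level: classify each message individually and pick the city
    label with the highest probability over all of the user's messages."""
    stacked = np.stack(per_tweet_probs)  # shape: (n_tweets, n_cities)
    _, label = np.unravel_index(np.argmax(stacked), stacked.shape)
    return int(label), city_coords[int(label)]
```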

3.1 Model Combination

The internal representations of all different resources (i.e., text, user-description, user-location, user-name, user-timezone, links, UTC offset, user lang, tweet-time and source) are concatenated to build a final tweet representation. We then evaluate two training strategies: In the first regime, we train the combined model from scratch; the parameters of all word embeddings, as well as all network layers, are initialized randomly, and the parameters of the full model, including the softmax layer combining the outputs of the individual LSTM and metadata models, are learned jointly. In the second strategy, we first train each model separately and then keep their parameters fixed while training only the final softmax layer.
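A sketch of the combined model and of the second ("full-fixed") strategy is shown below; the branch builders and N_CITIES come from the earlier sketches, and the cardinalities of the categorical fields are placeholders.

```python
from tensorflow.keras import Model, layers

inputs, features = [], []
for field in ["text", "user_description", "user_location",
              "user_name", "user_timezone"]:
    inp, feat = text_branch(field)
    inputs.append(inp); features.append(feat)
for field, n_values in [("links", 1000), ("utc_offset", 40),
                        ("user_lang", 60), ("tweet_time", 24),
                        ("source", 500)]:  # hypothetical cardinalities
    inp, feat = metadata_branch(field, n_values)
    inputs.append(inp); features.append(feat)

merged = layers.concatenate(features)  # final tweet representation
outputs = layers.Dense(N_CITIES, activation="softmax")(merged)
full_model = Model(inputs, outputs)

# Strategy 1 ("full-scratch"): train full_model jointly from random init.
# Strategy 2 ("full-fixed"): load the pretrained branch weights, freeze
# everything except the final softmax layer, then train.
for layer in full_model.layers[:-1]:
    layer.trainable = False
```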

4 Results

The individual performance of our different models is shown in Table 3. As a simple baseline, we predict the city label most frequently observed in the training data (Jakarta in Indonesia). According to our bottom-up analysis, the user-location metadata is the most informative source for tweet- and user-level location prediction. Using the text alone, we can correctly predict the location of 19.5% of all tweets, with a median distance of 2,190 km to the correct location. Aggregating pretrained models also improves performance on all three evaluation metrics in comparison to training a model from scratch.

Table 3. Tweet-level results ranked by median error distance (in kilometers). Individual best results for all three criteria are highlighted in bold face. Full-scratch refers to the merged model trained from scratch, whereas the full-fixed model keeps the pretrained submodel weights fixed and retrains only the final softmax layer. The baseline predicts the location most frequently observed in the training data (Jakarta).

For tweet-level prediction, our best merged model outperforms the best submission (FujiXerox.2) in terms of accuracy, median, and mean distance by 2.1 percentage points, 21.9 km, and 613.1 km, respectively. The ensemble learning method (csiro) outperforms our best models in terms of accuracy by 0.6 percentage points, but our model performs considerably better on median and mean distance, by 27.1 km and 1,358.8 km respectively. Additionally, the approach of csiro requires several dedicated services, such as GeoNames gazetteers, time zone to GeoNames mappings, an IP-to-country resolver, and customized scrapers for social media websites. The authors describe custom link handling for FourSquare, Swarm, Path, Facebook, and Instagram. In our training data we observed that these websites account for 1,941,079 (87.5%) of all 2,217,267 shared links. It is therefore tempting to speculate that customized scrapers for these websites could further boost our location prediction results.

As team cogeo uses only the text of a tweet, the results of cogeo.1 are directly comparable with our text model. Our text model outperforms this approach in terms of accuracy, median, and mean distance to the gold standard by 4.9 percentage points, 1,234 km, and 866 km, respectively.

For user-level prediction, our method performs on a par with the individual best results collected from the three top team submissions (FujiXerox.2, csiro.1, and FujiXerox.1). A notable difference is the mean error distance, where our model outperforms the best competing model by 125.3 km.

5 Conclusion

We presented a neural network architecture for predicting city labels and geo-coordinates of tweets. We focus on the classification task and derive longitude/latitude information from the predicted city label. We evaluated models for the individual Twitter (meta)data fields in a bottom-up fashion and identified highly location-indicative fields. The proposed combination of individual models requires no customized text preprocessing, specific website crawlers, database lookups, or IP-to-country resolution, while achieving state-of-the-art performance on a publicly available dataset. For better comparability, our source code and pretrained models are freely available to the community.

As future work, we plan to incorporate images as another type of metadata for location prediction using the approach presented by Simonyan and Zisserman (2014).