Detecting social media users based on pedestrian networks and neighborhood attributes: an observational study
Abstract
This paper proposes a methodological approach to explore the ability to detect social media users based on pedestrian networks and neighborhood attributes. We propose the use of a detection function from Spatial Capture–Recapture (SCR), a powerful analytical approach for detecting and estimating the abundance of biological populations. To test our approach, we created a set of proxy measures for the importance of pedestrian streets as well as for neighborhood attributes. The importance of pedestrian streets was measured by centrality indicators, and proxy measures of neighborhood attributes were created using multivariate analysis of census data. A series of candidate models was tested to determine which attributes are most important for detecting social media users. The results of the analysis indicate which attributes of the city have promising potential for detecting social media users. Finally, the main results and findings, the limitations, and possible extended uses of the proposed methodological approach are discussed.
Keywords
Mexico City; Social media; Pedestrian networks; Socio-demographic attributes; User behaviour; Protest march; Mixed methods
Abbreviations
- AIC: Akaike information criterion
- Δ AIC: AIC differences
- SSP–CMDX: Centro de Información Vial de la Secretaría de Seguridad Pública de la Ciudad de México
- ICT: Information and communications technology
- IDW: Inverse distance weighting
- LGBTTTI: Lesbian, gay, bisexual, transgender, transvestite, transsexual and intersexual
- MMDM: Mean maximum distance moved
- PCA: Principal components analysis
- RMSE: Root mean squared error
- SCR: Spatial capture–recapture
Introduction
How can habitat elements in a given city contribute to detecting users of social media? As a constructive answer to this question, we propose a novel methodological approach to evaluate whether the centralities of pedestrian networks and/or the socio-demographic attributes of the neighborhood contribute to detecting social media users. Our example data stem from Mexico City and are portrayed in the context of a planned urban protest march.
Previous research has focused on individual and socio-demographic attributes to understand social media usage. One strand of research identifies, classifies, or predicts aspects of social media users from their personal attributes (Hiruta et al. 2012; Pratama and Sarno 2015), or shows that social media users have socio-demographic characteristics which are not representative of the general population (Malik et al. 2015; Li et al. 2013; Mislove et al. 2011). The underlying idea of this type of research is that the socio-demographic dimension plays an important role in explaining the behavior of the social media user. We could call this the socio-demographic hypothesis. It is based on the assumption that socio-demographic attributes have the potential to explain the use of social media in a particular context of place and time.
Other approaches draw more explicitly on spatial structures such as the street network of a city and use social network theory for modeling and analysis (Neal 2012; Porta et al. 2006; Crucitti et al. 2006). From this perspective, it has been found, for example, that street centralities are positively correlated with different types of land use (Rui and Ban 2014), or that the importance of street intersections, measured by betweenness centrality, is positively correlated with the flow of pedestrian movement (Bielik et al. 2018). Other authors have shown that the spatial distribution of serious outdoor violence can be explained by the configuration of the street network (Summers and Johnson 2016). These investigations support the general hypothesis that the centralities of street networks influence spatial human behavior. It follows that the idea of detecting social media users on a geographical plane using street centralities is an instance of this second general research program.
The proposed approach consists of the following steps:
- Define a geographic area and case study
- Collect data describing who, when and where a particular user was captured using a social networking site in the area of study
- Calculate the centrality of pedestrian networks
- Create socio–demographic neighborhood indicators
- Assess whether pedestrian streets or neighborhood attributes contribute to detecting social media users
The contribution of this study is an exploratory methodological approach for identifying the structural elements of the city that have the potential to detect social media users in geographical space. Social media research generally attempts to explain spatial behavior based on variables of the individual; in our approach, by contrast, we emphasize that a user uses social media in a given habitat and context of communication. We believe that our approach can contribute to the study of complexity in the city, especially if we consider that “the growing number of urban and network researchers (...) vary immensely in their research questions, scales of analysis, disciplinary perspective, and intended audiences” (Derudder and Neal 2019, p. 1). Thus, the proposed method is particularly relevant, or even necessary, given that there are multiple competing models that use different types of measurements and analytical units, which are complex to analyze together.
The rest of this paper is structured as follows: “Conceptual analytical approach” section introduces a conceptual analytic approach for detecting social media users; “Materials and Methods” section describes the proposed methodological workflow (see Fig. 2), the set of variables generated and the techniques of analysis applied; “Results” section sets out the results; and finally, “Discussion” section discusses the main findings, the limitations of this study and possible future extensions.
Conceptual analytical approach
We propose the use of computational methods developed in the field of population and landscape ecology, called Spatial Capture–Recapture (SCR) (Royle et al. 2013), to assess which attributes contribute to detecting social media users in the urban space.
SCR is an approach for inferring the density and detectability of biological populations in a given habitat. It has brought a new wave of research, because the traditional models used previously ignored the spatial dimension of the habitat of organisms (Royle et al. 2017). SCR samples organisms as they are captured or recaptured over time, and draws inferences about the detectability of organisms using a variety of live trapping devices distributed over space (i.e., in the study area) (Efford 2004). On the practical side, SCR can be understood as a non-invasive approach that has generated invaluable information for conservation programs.
If we take the same assumption, we can state that social media users have an unobservable activity center and that their activity takes place in an area of the city. In the same way as in SCR, we propose to sample users as they are captured or recaptured in the study area. More specifically, social media data were sampled on k = 1,...,K occasions through the use of traps (i.e., cells of a spatial grid) allocated in a given area of study. The traps are indexed j=1,...,J, and the location of trap j is denoted x_{j}.
In the detection function, the parameter p_{0}, with logit(p_{0}) =α_{0}, is the baseline encounter probability (i.e., the maximum probability of encountering an individual), the parameter σ describes the rate at which the probability of detection declines as a function of distance, and d(x_{j},s_{i}) corresponds to the Euclidean distance between the trap j and the activity center of social media user i. Therefore, the detection model of social media users requires the estimation of the parameters p_{0} and σ. The final model considers all the values observed in Eq. (3), plus corrections with respect to the total population of individuals under observation.
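To make the roles of p_{0} and σ concrete, the following minimal sketch evaluates the half-normal detection function that is commonly used in the SCR literature, p = p_{0} exp(−d²/(2σ²)). Whether this is exactly the form of Eq. (3) is an assumption on our part; the function names and the parameter values are hypothetical and serve only as an illustration.

```python
import math

def inv_logit(a):
    """Map a logit-scale value (e.g. alpha_0) back to a probability."""
    return 1.0 / (1.0 + math.exp(-a))

def half_normal_detection(alpha0, sigma, trap_xy, center_xy):
    """Encounter probability of one user at one trap.

    Assumed form (half-normal): p = p0 * exp(-d^2 / (2 * sigma^2)),
    with p0 = inv_logit(alpha0) and d the Euclidean distance between
    the trap location x_j and the activity center s_i.
    """
    p0 = inv_logit(alpha0)
    d = math.hypot(trap_xy[0] - center_xy[0], trap_xy[1] - center_xy[1])
    return p0 * math.exp(-(d ** 2) / (2.0 * sigma ** 2))

# Hypothetical values, chosen only for illustration.
alpha0, sigma = -2.0, 500.0   # logit-scale baseline; scale in meters
center = (0.0, 0.0)           # assumed activity center s_i
for trap in [(0.0, 0.0), (500.0, 0.0), (1500.0, 0.0)]:
    print(trap, round(half_normal_detection(alpha0, sigma, trap, center), 4))
```

Under this assumed form, the probability equals p_{0} when the trap coincides with the activity center and decays smoothly with distance, with a larger σ giving a slower decay.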
In Sutherland et al. (2019), the model is fitted by maximum likelihood, in a manner analogous to generalized linear models, which simultaneously estimates p_{0}, σ, and s (i.e., the activity centers) together with the weights of the covariates used as explanatory variables. The complexity of calculating the activity centers (s) is reduced by using Eq. (1) as a prior distribution. If we assume that activity centers are distributed uniformly, we can assume that the activity surfaces (of those centers) over a grid of the state space are uniform as well. When individuals are captured or recaptured, they affect the density on the surface of the activity centers. In the general model, this effect is considered to decline exponentially with the Euclidean distance to the activity center. The activity centers are estimated on the basis of these three ideas.
The detection function can be enriched by incorporating spatial covariates. In our methodological approach, we include as spatial covariates the centrality of pedestrian streets and the socio–demographic attributes of neighborhoods, among other related variables that will be described below. With these considerations regarding the basics of the method and the state of the art in mind, the next section describes the experimental method adopted for the present study.
Materials and Methods
The proposed method consists of several steps (see Fig. 2). The first step is to select a study area of the city. The second is to create a history of social media usage over the study area. The third and fourth steps consist of creating spatial covariates from the centrality of pedestrian networks as well as from socio-demographic indicators of the neighborhoods. Finally, the fifth step is to fit a set of candidate models and select the one that offers the best results. The methodological steps are discussed in detail below.
Define an area and case of study
The area of study is Cuauhtémoc, a borough of Mexico City. This area is of great interest because various social movements hold demonstrations there. Mexico City, like most Latin American cities, is a complex and socio–demographically diverse city with stratified traffic routes. In particular, we are interested in detecting social media users who used social media in the historic center of Mexico City. The historic center is located in Cuauhtémoc (area ∼32.44 km²), which is one of the 16 alcaldías (i.e., boroughs) into which Mexico City is divided.
In order to safeguard the participants of the march of the LGBTTTI community, as well as the general public, the Mexican Secretary of Public Safety and Security, together with the municipal and local authorities, implemented a plan to regulate the transit of vehicles and pedestrians during the demonstration. As a consequence of the planned march, most of the area under study was transformed into an almost exclusively walkable area.
Collection of data and generation of capture history
Social media data were collected using the Twitter API. Geotagged tweets were collected for 24 h within the following rectangular area: [lat ≥ 19.39, lat ≤ 19.46, long ≥ -99.18, long ≤ -99.12]. In ecology, a spatial grid is laid over the study area and physical trapping devices are placed to sample where organisms are captured or recaptured. In our methodology, we created a grid over the studied geographical area, in which each grid cell represents a trap where a social media user can be captured. Through this spatial grid, we can index who (i), when (k), and where (j) the users were captured or recaptured, where y_{i,j,k}= 1 denotes that individual i was captured in grid cell j on occasion k, and y_{i,j,k}= 0 means that the social media user was not captured. To represent i, we used the unique identifier provided by the Twitter API (i.e., User ID). To represent j, we generated an identifier for each cell of a hexagonal grid in which each cell has an area of 0.39497 km². Finally, to represent k, we created 24 identifiers corresponding to the 24 h of observation.
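The indexing of who (i), where (j) and when (k) can be sketched as follows. This is a minimal illustration, assuming that each tweet has already been assigned to a grid cell upstream; the function name, the record layout and the example identifiers are hypothetical and are not taken from the study data.

```python
def build_capture_history(records, trap_ids, n_occasions=24):
    """Return (y, user_ids), where y[i][j][k] is 1 if user i was captured
    in trap j on occasion k, and 0 otherwise.

    records  : list of (user_id, trap_id, hour) tuples,
               with hour in 0..n_occasions-1
    trap_ids : ordered list of all trap (grid cell) identifiers,
               including cells in which nobody was captured
    """
    user_ids = sorted({user for user, _, _ in records})
    user_index = {u: i for i, u in enumerate(user_ids)}
    trap_index = {t: j for j, t in enumerate(trap_ids)}
    # One binary occasion vector per (user, trap) pair, initialised to 0.
    y = [[[0] * n_occasions for _ in trap_ids] for _ in user_ids]
    for user, trap, hour in records:
        # Repeated tweets in the same cell and hour count as one capture.
        y[user_index[user]][trap_index[trap]][hour] = 1
    return y, user_ids

# Hypothetical records: (user ID, grid cell ID, hour of the day).
records = [("u1", "A", 0), ("u1", "B", 5), ("u2", "A", 0), ("u1", "A", 0)]
y, user_ids = build_capture_history(records, trap_ids=["A", "B", "C"])
print(user_ids, sum(sum(sum(cell) for cell in user) for user in y))
```

Keeping cells without captures in `trap_ids` matters, because non-detections (zeros) carry information in a capture history.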
Compute centrality of pedestrian networks
In the literature, there are various hypotheses on how the structure of street networks can support different explanatory mechanisms. A pedestrian network is a type of spatial network or geometric graph in which street intersections are represented by nodes, and the edges between pairs of nodes represent intersections that are connected by a street. Specifically, a pedestrian network is conveniently described as a graph G=(V,E), where the set V of vertices represents street intersections, and the set E of edges represents streets connecting pairs of intersection nodes. If the Euclidean length of the streets is added as the weight of the edges, we obtain a weighted graph known as a Euclidean graph.
For this paper, the centralities of the Mexico City pedestrian network were calculated using the centrality measures defined as follows. Let a(e) be a function representing the existence of an edge e in E: if a(e)=1, the edge e∈E exists, and if a(e)=0, it does not. Similarly, let ω be a weight function on the edges, where ω(e)>0 for weighted graphs. Let e_{v} denote an edge that has v∈V as one of its vertices.
Define a path from s∈V to t∈V as an alternating sequence of vertices and edges, from vertex s to t, so that each edge connects its preceding with its succeeding vertex. We use δ(s,t) to denote the distance between vertices s and t (i.e., the minimum length of all paths connecting s and t). By definition δ(s,s)=0 for every s∈V and δ(s,t)=δ(t,s) for s,t∈V.
Opsahl et al. (2010) define a distance measure called αω-weighted length, which is a generalization of ω-weighted length. Whereas in the unweighted case the length between two vertices connected by an edge e is 1, Opsahl et al. (2010) instead define the length as \(\frac {1}{\omega (e)^{\alpha }}\), with α≥0. In addition, Opsahl et al. (2010) define the αω-weighted distance δ^{αω}(s,t) between any pair of vertices s,t∈V based on the αω-weighted length. In the particular case α=1, the αω-weighted distance δ^{αω}(s,t) reduces to the ω-weighted distance δ^{ω}(s,t).
Let σ(s,t)=σ(t,s) denote the number of shortest paths from s∈V to t∈V, where σ(s,s)=1 by convention. Let σ(s,t|v) be the number of shortest paths from s to t on which v∈V lies. In addition, using the αω-weighted distance (Opsahl et al. 2010), we define σ^{αω}(s,t) and σ^{αω}(s,t|v) analogously. In the case α=1, we obtain σ^{ω}(s,t) and σ^{ω}(s,t|v), respectively.
Node centrality in weighted networks
Centrality measure | Notation | Definition |
---|---|---|
degree | \(C_{D}(v)\) | \(\sum \limits _{e_{v}\in E}{a(e_{v})}\) |
ω-weighted degree | \(C_{D}^{\omega }(v)\) | \(\sum \limits _{e_{v}\in E}{\omega (e_{v})}\) |
αω-weighted degree | \(C_{D}^{\alpha \omega }(v)\) | \(C_{D}(v)^{(1-\alpha)}C_{D}^{\omega }(v)^{\alpha }\,,\alpha >0\) |
betweenness | \(C_{B}(v)\) | \(\sum \limits _{s\neq v\neq t\in V}\frac {\sigma (s,t|v)}{\sigma (s,t)}\) |
ω-weighted betweenness | \(C_{B}^{\omega }(v)\) | \(\sum \limits _{s\neq v\neq t\in V}\frac {\sigma ^{\omega }(s,t|v)}{\sigma ^{\omega }(s,t)}\) |
αω-weighted betweenness | \(C_{B}^{\alpha \omega }(v)\) | \(\sum \limits _{s\neq v\neq t\in V}\frac {\sigma ^{\alpha \omega }(s,t|v)}{\sigma ^{\alpha \omega }(s,t)}\) |
closeness | \(C_{C}(v)\) | \(\frac {1}{\sum _{t\in V}{\delta (v,t)}}\) |
ω-weighted closeness | \(C_{C}^{\omega }(v)\) | \(\frac {1}{\sum _{t\in V}{\delta ^{\omega }(v,t)}}\) |
αω-weighted closeness | \(C_{C}^{\alpha \omega }(v)\) | \(\frac {1}{\sum _{t\in V}{\delta ^{\alpha \omega }(v,t)}}\) |
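The degree and closeness definitions in the table can be illustrated on a toy graph. The sketch below is a plain-Python reading of those definitions, with the edge length taken as 1/ω(e)^α and shortest paths found by Dijkstra's algorithm; it is not the software used in the study, and the function names and the three-node graph are hypothetical.

```python
import heapq

def degree_alpha(adj, v, alpha):
    """alpha-omega-weighted degree: C_D(v)^(1 - alpha) * C_D^omega(v)^alpha."""
    k = len(adj[v])              # C_D(v): number of incident edges
    s = sum(adj[v].values())     # C_D^omega(v): sum of incident edge weights
    return (k ** (1.0 - alpha)) * (s ** alpha)

def distances_alpha(adj, source, alpha):
    """Shortest alpha-omega-weighted distances from source (Dijkstra),
    where an edge e has length 1 / omega(e)^alpha."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue             # stale heap entry, a shorter path was found
        for w, omega in adj[u].items():
            nd = d + 1.0 / (omega ** alpha)
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

def closeness_alpha(adj, v, alpha):
    """alpha-omega-weighted closeness: 1 / sum of distances to reachable nodes."""
    dist = distances_alpha(adj, v, alpha)
    total = sum(d for node, d in dist.items() if node != v)
    return 1.0 / total if total > 0 else 0.0

# Hypothetical undirected toy graph: adj[u][v] = omega(e) for edge e = (u, v).
adj = {
    "a": {"b": 2.0, "c": 1.0},
    "b": {"a": 2.0, "c": 4.0},
    "c": {"a": 1.0, "b": 4.0},
}
for alpha in (0.0, 0.5, 1.0):
    print(alpha, degree_alpha(adj, "a", alpha), closeness_alpha(adj, "a", alpha))
```

With α=0 the weights are ignored and the measures reduce to the unweighted degree and closeness; with α=1 they reduce to the ω-weighted versions.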
In this context, and taking into account the previous definitions, the pedestrian network of Mexico City was retrieved using the approach developed by Boeing (2017), a flexible and powerful approach that allows data to be downloaded from OpenStreetMap using configurable user queries. Under this framework, a walk or pedestrian network includes all the public streets and paths that pedestrians can use. After preparing the database, we obtained a total of 112188 nodes and 164586 edges representing the pedestrian network of Mexico City. The edges had a mean length of 88.91 meters and a standard deviation of 128.12 meters.
Create socio–demographic neighborhood indicators
As indicated in Fig. 2, we also intend to explore the performance of centrality measures in the problem of detecting social media users. For this purpose, we use principal components analysis (PCA) (Lê et al. 2008; Husson et al. 2017) to create a series of indicators that characterize the neighborhoods of Mexico City. PCA is frequently used to create proxy measures. For example, it has been used to create socio-economic scales based on household assets (Townend et al. 2015), to construct socio-economic status indices (Vyas and Kumaranayake 2006), and it is commonly used to create poverty indicators in Latin America (Santos and Villatoro 2016).
The following proxy measures were created:
- Age composition: This proxy measure describes the age structure of the inhabitants of the census block (see Figure 8 in Appendix 2).
- Educational level: This proxy measure describes the stratification of census blocks according to the educational level of their inhabitants (see Figure 9 in Appendix 2).
- Dwelling: This proxy measure describes the dwelling assets existing in the census block (see Figure 10 in Appendix 2).
- Information and communications technology (ICT): This proxy measure describes the information and communications technology devices (i.e., the number of radios, TVs, computers, landline telephones, cell phones, and dwellings with internet access) existing in the census block (see Figure 11 in Appendix 2).
- Population density: This proxy measure describes the population density per census block, the population density per dwelling, and the population density per home (see Figure 12 in Appendix 2).
The visualization of principal component analysis results on census data is presented in Appendix 2.
Interpolate spatial data points
Both the spatial points of the centralities of the pedestrian networks and the socio–demographic attributes of the census blocks were interpolated over Mexico City. The details of the interpolation algorithm, model tuning, and validation are described below.
In inverse distance weighting (IDW) interpolation, \(\hat {Z}_{x}\) is the interpolated value at position x, z_{i} is the value of the sample at position x_{i}, and d(x,x_{i}) is the Euclidean distance from point x_{i} to x. Furthermore, n is the size of the population or the number of cases accepted as neighbors of point x, and p is an integer called the power factor.
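For readers who prefer code, the standard textbook form of IDW, \(\hat {Z}_{x} = \sum _{i=1}^{n} z_{i}\,d(x,x_{i})^{-p} / \sum _{i=1}^{n} d(x,x_{i})^{-p}\), can be sketched as below. We assume that this standard form is the one intended; the function name and the sample points are hypothetical, and all n samples are used as neighbors for simplicity.

```python
import math

def idw(x, samples, p=2):
    """Inverse distance weighted estimate at position x.

    x       : (x, y) coordinates of the target position
    samples : list of ((x_i, y_i), z_i) pairs
    p       : power factor controlling how quickly weights decay
    """
    num = 0.0
    den = 0.0
    for xi, zi in samples:
        d = math.hypot(x[0] - xi[0], x[1] - xi[1])
        if d == 0.0:
            return zi            # target coincides with a sample point
        w = d ** (-p)            # weight decays with distance
        num += w * zi
        den += w
    return num / den

# Hypothetical sample points with their observed values.
samples = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
print(idw((1.0, 0.0), samples))       # midpoint, equal weights
print(idw((0.5, 0.0), samples, p=2))  # closer to the first sample
```

A larger p concentrates the weight on the nearest samples, which is why the power factor is the natural tuning parameter of the method.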
Comparative research has shown that IDW is an efficient interpolation technique compared to more sophisticated geo–statistical techniques. In fact, IDW has been reported to perform slightly better than classical Kriging techniques (Gong et al. 2014). Other research reported that IDW was a better estimator across a variety of analyses and data treatments that aim to estimate values at peak points (Setianto and Triandini 2013).
The IDW was used in our approach for three main reasons. First, our purpose is to model how values decay from peak values. Because we have an extensive sample for both pedestrian networks and census block data, IDW is particularly suitable for this purpose. Second, the IDW method is an intuitive and deterministic method, which makes it possible to account for and interpret the results obtained. Third, we also selected it for practical reasons, because this algorithm achieves a good balance between predictive performance and computation time in large databases.
The interpolation was validated using the root mean squared error (RMSE), in which \(\hat {z}_{i}\) is the value estimated at point i by interpolating from the remaining n–1 points, z_{i} is the actual value at point i, and n is the number of data points. The RMSE was determined sequentially for each of the centrality measures of pedestrian networks as well as for the first two principal component scores of each socio-demographic indicator.
RMSE is not difficult to interpret because it represents the sample standard deviation of the differences between predicted and observed values. RMSE varies between 0 and infinity, and in our context, an IDW model that achieves an RMSE value close to 0 corresponds to a better interpolation. Using this procedure, the interpolated variables that generated the lowest RMSE values were selected and used for comparative purposes.
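The validation procedure described above can be sketched as a leave-one-out loop, using the usual definition RMSE = sqrt((1/n) Σ (ẑ_i − z_i)²). This is a minimal illustration under the assumption that standard IDW is used and that the candidate power factors are p = 1,...,5; the function names and the sample data are hypothetical.

```python
import math

def idw(x, samples, p):
    """Standard IDW estimate at x from ((x_i, y_i), z_i) samples."""
    num = den = 0.0
    for xi, zi in samples:
        d = math.hypot(x[0] - xi[0], x[1] - xi[1])
        if d == 0.0:
            return zi
        w = d ** (-p)
        num += w * zi
        den += w
    return num / den

def loo_rmse(samples, p):
    """Leave-one-out RMSE: each point is predicted from the other n - 1."""
    sq_err = 0.0
    for i, (xi, zi) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        sq_err += (idw(xi, rest, p) - zi) ** 2
    return math.sqrt(sq_err / len(samples))

def select_power(samples, powers=(1, 2, 3, 4, 5)):
    """Return the power factor with the lowest leave-one-out RMSE."""
    errors = {p: loo_rmse(samples, p) for p in powers}
    best = min(errors, key=errors.get)
    return best, errors

# Hypothetical sample: positions and values of one covariate.
samples = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0),
           ((0.0, 1.0), 2.5), ((1.0, 1.0), 4.0), ((2.0, 1.0), 5.0)]
best_p, errors = select_power(samples)
print(best_p, {p: round(e, 3) for p, e in errors.items()})
```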
Additional covariates
Two types of additional covariates were created to enrich the analysis.
Creating an individual covariate
We created an individual covariate (i.e., a covariate that applies at the individual level) denoting whether the individual is a supporter of the LGBTTTI march or a generic user. We named this individual covariate Type of User. To this end, a native Spanish speaker trained in qualitative data analysis classified whether a user published content associated with the planned march. Specifically, the text of each Tweet, as well as its URLs and associated hashtags, were read individually to look for evidence of support for the march. The qualitative coding scheme and inter–rater reliability are reported in Appendix 2.
Physical distance to the demonstration
Finally, we created two spatial covariates. The first measures the Haversine distance between the place where the social media user was captured and the starting point of the march (i.e., the Distance to the Ángel of Independence covariate); the second measures the same distance between the place where the social media user was captured and the ending point of the march (i.e., the Distance to the Zócalo covariate). With these two simple distance measures, we wanted to test whether the detectability of social media users decays with the distance to the place of the demonstration.
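The Haversine distance used for these two covariates can be computed as in the sketch below. The mean Earth radius of 6371 km and the coordinates of the two landmarks are approximate values that we assume for illustration; they are not taken from the study data, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points
    given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Approximate landmark coordinates (assumptions, for illustration only).
ANGEL = (19.4270, -99.1677)   # Ángel of Independence, start of the march
ZOCALO = (19.4326, -99.1332)  # Zócalo, end of the march

# Hypothetical trap centroid inside the bounding box of the study area.
trap = (19.4300, -99.1500)
print(round(haversine_m(*trap, *ANGEL)), round(haversine_m(*trap, *ZOCALO)))
```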
Model fitting
Several candidate models were fitted to assess the use of the proposed methodology. First, we created a null model in which the density of social media users, the probability of detection, and the scale parameter are constant over the plane. Then, we generated a series of models considering different types of covariates. For the social media user density model, we used the interpolated population density. To estimate the baseline probability of detection of social media users, we used the interpolated centralities of the pedestrian network and the socio–demographic indicators of neighborhoods at the centroid of the traps, as well as the individual covariate Type of User; we also tested whether detectability is constant. Finally, to model the scale parameter, we used the individual covariate Type of User to assess whether the probability of detection decays differently between those who protested and those who did not during the observation day. For this parameter, we also tested whether sigma is constant. In total, a set of 75 candidate model configurations was generated.
Maximum likelihood estimation was used to jointly estimate the parameters of the models, and to evaluate which candidate model has the best fit to our data. Specifically, we performed a likelihood analysis using the R package oSCR (Sutherland et al. 2016), which fits SCR models that can be thought of as a type of generalized linear mixed model. This approach is particularly flexible, as maximum likelihood methods allow the comparison of multiple competing models and spatial explanatory variables. To select the best model, the Akaike Information Criterion (AIC) values are reported for each candidate model and their differences (ΔAIC) are used to rank them. The model that obtains the lowest AIC value (ΔAIC = 0) is interpreted as the best explanatory model.
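The ranking rule can be made explicit with a short sketch based on the usual definition AIC = 2k − 2 log L. The example reuses the values reported for the best and the null model in the model selection table of the Results. It assumes that the logL column there holds the negative log-likelihood and that the two models have 7 and 4 parameters; both are our inferences from the reported AIC values rather than statements from the text, and the function names are hypothetical.

```python
def aic(neg_log_lik, n_params):
    """AIC = 2k - 2 log L, written in terms of the negative log-likelihood."""
    return 2.0 * neg_log_lik + 2.0 * n_params

def delta_aic(aic_values):
    """Difference of each model's AIC to the lowest AIC in the set."""
    best = min(aic_values.values())
    return {name: value - best for name, value in aic_values.items()}

# Negative log-likelihoods as reported in the Results; parameter counts
# (7 and 4) are inferred, not stated in the text.
models = {
    "best": aic(8247.87, 7),
    "null": aic(9476.20, 4),
}
for name, d in sorted(delta_aic(models).items(), key=lambda item: item[1]):
    print(name, round(models[name], 2), round(d, 2))
```

Up to rounding, this reproduces the reported AIC values of the two models and the gap between them.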
Results
The results are organized as follows. First, the results of the capture history are reported (see “Results of spatial capture history” section). Then, the results of the interpolation of the measures of centrality and attributes of the neighborhoods are reported (see “Results of the inverse distance interpolation” section). Finally, model fitting results are reported (see “Results of model fitting” section).
Results of spatial capture history
This section reports the results of the data collection process as well as the description of the capture history. Figure 4 shows the geotagged tweets collected during the 24 h of June 23, 2018. An initial inspection shows that Tweets are found in the main locations where the march was planned. This is a first indication of social media use in the areas where the march took place. However, computing a spatial correlation between the centrality of the pedestrian networks and the attributes of the neighborhood directly on these points does not make sense, because the activity centers of the users during the observed period must be assumed to be unknown. To obtain a different perspective on the data, it is necessary to build their history of encounters.
The capture history was constructed from the collected social media data. The aggregated spatial captures are shown and summarised in Fig. 5a. In this figure, each black dot represents the centroid of a trap, and the red lines connecting pairs of traps indicate that the same social media user was recaptured in both traps during the day of observation. The number of individuals captured in the area under study was 2051. The average number of captures was 1.36 and the Mean Maximum Distance Moved (MMDM) was 2308.10 meters.
The results of the spatial interpolation of the covariates are presented below.
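As a reading aid for the MMDM statistic, the sketch below computes it under a common convention: for each individual captured in at least two distinct traps, take the largest distance between any two of its capture locations, then average these maxima. Whether the study used exactly this convention is an assumption; the function name and the coordinates (trap centroids in meters) are hypothetical.

```python
import math
from itertools import combinations

def mmdm(captures):
    """Mean maximum distance moved.

    captures : dict mapping user_id -> list of (x, y) trap centroids
               (projected coordinates, e.g. meters) where the user was captured
    Individuals captured in fewer than two distinct traps are skipped.
    """
    max_moves = []
    for locations in captures.values():
        distinct = set(locations)
        if len(distinct) < 2:
            continue
        max_moves.append(max(
            math.hypot(a[0] - b[0], a[1] - b[1])
            for a, b in combinations(distinct, 2)
        ))
    return sum(max_moves) / len(max_moves) if max_moves else 0.0

# Hypothetical capture locations for three users.
captures = {
    "u1": [(0.0, 0.0), (3.0, 4.0)],
    "u2": [(0.0, 0.0), (0.0, 10.0), (0.0, 2.0)],
    "u3": [(1.0, 1.0)],
}
print(mmdm(captures))
```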
Results of the inverse distance interpolation
The analysis of the candidate models showed that the IDW algorithm achieved good performance. The comparison between candidate models indicated that the IDW algorithm was useful for interpolating the attributes of the neighborhoods and the centralities of the pedestrian networks.
Interpolation of neighborhood indicators
Interpolation errors for neighborhood attributes
Neighborhood attribute | RMSE (p=1) | RMSE (p=2) | RMSE (p=3) | RMSE (p=4) | RMSE (p=5) |
---|---|---|---|---|---|
Age-PC1 | 0.978 | 0.738* | 0.770 | 0.798 | 0.812 |
Age-PC2 | 0.610 | 0.555* | 0.581 | 0.609 | 0.631 |
ICT-PC1 | 1.34 | 0.986* | 0.992 | 1.02 | 1.04 |
ICT-PC2 | 0.227 | 0.195* | 0.208 | 0.219 | 0.225 |
Education-PC1 | 1.04 | 0.768* | 0.794 | 0.818 | 0.830 |
Education-PC2 | 0.609 | 0.475* | 0.483 | 0.500 | 0.516 |
Dwelling-PC1 | 1.04 | 0.949* | 0.981 | 1.04 | 1.09 |
Dwelling-PC2 | 0.903 | 0.872* | 0.980 | 1.03 | 1.07 |
Population density-PC1 | 1.04 | 0.949* | 0.981 | 1.04 | 1.09 |
Population density-PC2 | 0.562 | 0.414* | 0.431 | 0.447 | 0.455 |
Interpolation of centrality measures
Interpolation errors for centrality measures
Centrality measure | α | RMSE (p=1) | RMSE (p=2) | RMSE (p=3) | RMSE (p=4) | RMSE (p=5) |
---|---|---|---|---|---|---|
αω-weighted degree | 0 | 0.792 | 0.763* | 0.793 | 0.821 | 0.840 |
αω-weighted degree | 0.5 | 7.58 | 7.03 | 7.15 | 7.31 | 7.43 |
αω-weighted degree | 1 | 76.4 | 70.2 | 70.5 | 71.5 | 72.5 |
αω-weighted betweenness | 0 | 3.90 | 3.69* | 3.83 | 3.96 | 4.05 |
αω-weighted betweenness | 0.5 | 4.48 | 4.42 | 4.70 | 4.89 | 5.01 |
αω-weighted betweenness | 1 | 5.00 | 4.97 | 5.29 | 5.51 | 5.65 |
αω-weighted closeness | 0 | 459×10^{−6} | 203×10^{−6}* | 175×10^{−6} | 176×10^{−6} | 179×10^{−6} |
αω-weighted closeness | 0.5 | 34.7×10^{−6} | 16.9×10^{−6} | 14.8×10^{−6} | 14.9×10^{−6} | 15.0×10^{−6} |
αω-weighted closeness | 1 | 5.96×10^{−6} | 4.06×10^{−6} | 3.96×10^{−6} | 3.89×10^{−6} | 3.85×10^{−6} |
Results of model fitting
The results of the detection model obtained are presented below.
Selection of model and comparison
Summary of model fitting and selection
Model | d_{0} | p_{0} | σ | logL | AIC | ΔAIC |
---|---|---|---|---|---|---|
Best | ∼1 | ∼ICT-PC1 + Type of User | ∼Type of User | 8247.87 | 16509.73 | 0.00 |
Alternative | ∼1 | ∼αω-weighted degree | ∼Type of User | 9192.42 | 18396.83 | 1887.10 |
Alternative | ∼1 | ∼αω-weighted betweenness | ∼Type of User | 9280.70 | 18573.40 | 2063.67 |
Alternative | ∼1 | ∼αω-weighted closeness | ∼Type of User | 9372.17 | 18756.34 | 2246.61 |
Null | ∼1 | ∼1 | ∼1 | 9476.20 | 18960.40 | 2450.67 |
Alternative | ∼1 | ∼Distance to the Zócalo | ∼Type of User | 11071.88 | 22155.76 | 5646.76 |
Alternative | ∼1 | ∼Distance to the Ángel | ∼Type of User | 11251.07 | 22514.14 | 6004.41 |
The ranking of the models yields interesting observations when they are compared to the null model. As can be seen in Table 4, the best detection model for social media users is achieved using the neighborhood attribute ICT-PC1 and the individual covariate Type of User for the baseline detection probability, and the individual covariate Type of User for sigma (ΔAIC = 0).
Another interesting observation is that the models including centrality measures of pedestrian networks are better than the null model: among them, degree centrality best explains the detection of social media users, followed by betweenness and closeness. In other words, the results show that the centrality of pedestrian streets has the potential to detect social media users on the plane.
Finally, we can observe that the variables Distance to the Zócalo and Distance to the Ángel of Independence performed worse than the null model. This is noteworthy, since it suggests that the physical distance of the traps to the starting or ending point of the march does not contribute to the detection of social media users.
In general terms, the results show that detection models based on neighborhood attributes and the individual covariate performed better than the use of other types of variables.
Modeling variation in detectability
Model summary
Parameter | Estimate | SE | z | P(> ∣z∣) |
---|---|---|---|---|
p0.(Intercept) | -7.836 | 0.082 | -95.498 | 0.000 |
p0.Supporting | 0.431 | 0.104 | 4.131 | 0.000 |
t.beta.ICT–PC1 | -3.118 | 0.066 | -47.173 | 0.000 |
sig.(Intercept) | 6.635 | 0.031 | 215.284 | 0.000 |
sig.Supporting | 0.137 | 0.051 | 2.698 | 0.007 |
d0.(Intercept) | 4.672 | 0.044 | 105.927 | 0.000 |
psi.constant | -1.714 | 0.084 | -20.485 | 0.000 |
Based on Eq. 6, it is understood that the users who supported the march have a greater σ than those who did not, and therefore, their probability of detection was higher.
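To illustrate the size of this effect, the sketch below back-transforms the reported estimates for the two types of user. It assumes, as is common in SCR software, that σ is estimated on the log scale and p_{0} on the logit scale, and it holds ICT-PC1 at 0; these link-scale readings are our assumption and are not stated in the text.

```python
import math

def inv_logit(a):
    """Map a logit-scale value back to a probability."""
    return 1.0 / (1.0 + math.exp(-a))

# Estimates copied from the model summary table.
est = {
    "p0.(Intercept)": -7.836,
    "p0.Supporting": 0.431,
    "sig.(Intercept)": 6.635,
    "sig.Supporting": 0.137,
}

# Assumed log link for sigma (result in the units of the trap coordinates).
sigma_generic = math.exp(est["sig.(Intercept)"])
sigma_support = math.exp(est["sig.(Intercept)"] + est["sig.Supporting"])

# Assumed logit link for p0, with the ICT-PC1 covariate held at 0.
p0_generic = inv_logit(est["p0.(Intercept)"])
p0_support = inv_logit(est["p0.(Intercept)"] + est["p0.Supporting"])

print("sigma:", round(sigma_generic, 1), round(sigma_support, 1))
print("p0:", p0_generic, p0_support)
print("sigma ratio:", round(sigma_support / sigma_generic, 3))
```

Under these assumptions, both the scale parameter and the baseline detection probability are larger for supporters of the march than for generic users, which matches the direction of the effect described above.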
Discussion
This paper aimed to explore the contribution of pedestrian networks as well as the characteristics of the neighborhood in the detection of social media users. The present study has pointed out the existence of a relationship between ICTs in the neighborhood and the probability of being detected in a given region of Mexico City. Below we offer a series of observations about our results.
Comparison to previous research
To our knowledge, no previous research has jointly explored whether socio-demographic variables and pedestrian networks allow social media users to be detected. As mentioned in “Introduction” section, the literature has explored the contribution of social network measures to determining the flow of pedestrians in different contexts (Porta et al. 2006; Crucitti et al. 2006; Rui and Ban 2014; Bielik et al. 2018; Summers and Johnson 2016). On the other hand, the literature on social movements has emphasized different elements, such as socio-economic factors (e.g., inequality or grievances in a given city), that could help explain the willingness to support a given protest using social media tools. Among these elements, it has also been studied whether participation in protests declines with distance to the end point of the march (Traag et al. 2017). However, to the best of our knowledge, a combined exploration of all these types of variables (i.e., neighborhood attributes, pedestrian networks, individual covariates, and the physical distance to the demonstration march) in a subpopulation of social media users has not been previously carried out.
The relationship between situational factors and human behavior is not always easy to discern. First, the model obtained indicates that social media users who participate in the protest march are slightly more likely to be detected compared to generic users. One of the few investigations that show a result similar to ours is one conducted by (Zhang et al. 2016), researchers who analyzed geolocalized Twitter user panels. In this study, the authors found that geolocated users who were exposed to an event occurring in a city were slightly more likely to mention the event compared to a random sample of users. However, one mayor difference is that these authors indirectly assume that the physical distance to the event location explains the communication content posted on Twitter (Zhang 2016; Zhang et al. 2016). In our case, we include this piece of information at a more detailed level and taking two spatial points related to the event (i.e., the starting and ending point of the march), and we find that the measurements of pedestrian street networks as well as neighborhood indicators have greater explanatory potential for detecting social media users than the physical distance to the protest location.
Second, it makes sense to detect social media users in areas showing low ICTs values. Mobile information technologies are precisely designed for that purpose: to be used, for example, in places where it is not possible to have access to the information technology that is available at home, or when a user moves through the city. In other words, we believe that there is a situational user behavior. On the one hand, when users are at home, they can use non-mobile communication technologies, but when they leaves home they starts using mobile technologies and social media services such as Twitter. In this way, users are detected when they are physically distant from other types of non-mobile communication technologies usually found at home. On the other hand, in the case of people who support a protest, the use of Twitter has the purpose of creating and disseminating content about the user’s participation in a given place and time, which is related to a target–driven user behavior. That is to say, the coordinated social media events seems to reduce randomness in the movements and social media usage of individuals, which increases their detectability. This would be a reasonable theoretical conjecture, but very difficult –or even inappropriate– to prove by using separately social media, census, or network data, or by studying user behavior under an experimental or quasi-experimental design.
Third, the fact that user detectability is explained by lower ICTs values in the observed area may be a contradiction. We thought that this result is due to two main reasons. On the one hand, using common sense, we can expect there to be a positive linear relationship between the availability of ICTs in a given area and the detection of social media users. However, according to our interpretation of the data, the detection model is showing that users have been detected in mainly non–residential areas, where ICT density declines over space. The historical center of Mexico City can be characterized as a non-residential area, full of old and colonial buildings, cathedrals, museums, public and tourist areas, and downtown–squares such as Zócalo.
In this sense, our results show something important for future research: it cannot be assumed that there is always a positive relationship between the availability of ICTs in a given area and the detectability of social media users. Further research is required to understand the complex relationship between the structural factors of a city and user behavior activity patterns that contribute to explain the probability of detecting social media user over space.
Limitations, future research and practical applications
The research has three main limitations. First, in this study, we have created a detection model based on 24 h of observation. Our idea was to create a prototype to demonstrate that social media users can be detected using the proposed methodological design. However, more days of observation are required to improve the quality of the models. In theory, if users were observed over a longer period of time and in a larger geographical area, the method would provide better information about the activity centers of social media users. In other words, the explanatory variables would theoretically relate to the neighborhood where the social media user lives, and not only to the user's activity center during the observed day.
Second, the encounter probability model is based on Euclidean distance (see Eq. 3). This means that symmetrical home ranges are assumed, as in many of the current spatial capture–recapture models available in ecological research (Royle et al. 2014). To obtain a model with different assumptions, a different distance measure must be tested. In future research, the ecological distance proposed by Royle et al. (2013) and Efford (2019) could be used instead. This type of distance requires a cost surface based on some theoretically relevant, but still unknown, spatial covariate. In any case, to compare the activity center of the social media user over time (e.g., comparing the detectability or density of social media users before and after the protest event), a different design is required that includes more days of observation, a larger observation area, and possibly additional spatial covariates and candidate models to test.
Third, a deterministic algorithm was used to interpolate the data in this study. We observed that incorporating street length into the centrality measurements increased the RMSE. As a result, the limits of Euclidean graphs for detecting social media users were not evaluated. This means that the approach can still be used to evaluate more complex network measures available at the city level. We think that the use of non-deterministic models may allow these types of network attributes to be incorporated into the comparative analysis.
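As an illustration of how a deterministic interpolator can be scored with RMSE, the sketch below implements inverse distance weighting (IDW) and a leave-one-out error. It is a minimal sketch under stated assumptions: the sample points are hypothetical, the power parameter of 2 is a conventional default, and the leave-one-out scheme is one possible validation design rather than the procedure used in this study.

```python
import math


def idw(x, y, samples, power=2.0):
    """Inverse distance weighting estimate at (x, y) from (xi, yi, value) samples."""
    num = 0.0
    den = 0.0
    for xi, yi, vi in samples:
        d = math.hypot(x - xi, y - yi)
        if d == 0.0:
            return vi  # exact hit: return the observed value
        w = 1.0 / d ** power
        num += w * vi
        den += w
    return num / den


def loo_rmse(samples, power=2.0):
    """Leave-one-out RMSE: predict each sample from all the others."""
    sq_errors = []
    for i, (xi, yi, vi) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        sq_errors.append((idw(xi, yi, others, power) - vi) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Hypothetical centrality values observed at five street-node locations.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0),
          (1.0, 1.0, 3.0), (0.5, 0.5, 2.0)]
print(loo_rmse(points, power=2.0))
```

Comparing `loo_rmse` across differently weighted centrality surfaces would show which variant the interpolator reproduces with the smallest error.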
In addition, we recommend testing different detection functions in future investigations. Changing the properties of the detection function also makes it possible to create new detection functions. An attractive idea might be to develop a detection function that uses a network-based distance along a street network instead of the Euclidean distance. Such a detector would have the potential to detect objects located primarily on street networks. A comparative study of different detection functions is needed to evaluate their possible advantages and disadvantages in the task of detecting social media users.
Finally, beyond the particular case analyzed in this paper, the detection of social media users can be used for more practical purposes. For example, it can be of great value in urban planning to identify where to place public wifi spots, or which places in the city to prioritize for the development and application of augmented reality using social media. The methodological approach could also be adapted by international and non-governmental organizations focused on human rights to monitor how the detectability of opposition political groups increases or deteriorates in countries under authoritarian or dictatorial regimes.
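The idea of a network-based detection function can be sketched as follows. This is an illustrative sketch only: the half-normal form, the toy street graph, and the values of `sigma` and `p0` are all assumptions introduced here and are not taken from the fitted models. The distance is computed with Dijkstra's algorithm over street lengths and then passed to the same detection function that would otherwise receive the Euclidean distance.

```python
import heapq
import math


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm on a dict {node: {neighbor: street_length}}."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, length in graph[node].items():
            nd = d + length
            if nd < dist.get(neighbor, math.inf):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return dist


def detection_probability(distance, sigma, p0):
    """Half-normal detection function evaluated on an arbitrary distance."""
    return p0 * math.exp(-distance ** 2 / (2.0 * sigma ** 2))


# Hypothetical toy network: A and C are 100 m apart in a straight line,
# but the only walkable route runs through B (80 m + 80 m).
streets = {
    "A": {"B": 80.0},
    "B": {"A": 80.0, "C": 80.0},
    "C": {"B": 80.0},
}
euclidean_ac = 100.0
network_ac = shortest_path_lengths(streets, "A")["C"]

p_euclid = detection_probability(euclidean_ac, sigma=150.0, p0=0.5)
p_network = detection_probability(network_ac, sigma=150.0, p0=0.5)
print(network_ac, p_euclid, p_network)
```

Because the walkable route is longer than the straight line, the network-based variant assigns a lower detection probability to the same pair of locations.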
Conclusion
We conclude that the neighborhood socio-demographic indicators have a better capacity to detect social media users than the centralities of pedestrian networks tested in this study. We also conclude that social media users who supported the demonstration have a slightly higher probability of being detected during the day of observation, and that this probability decays differently from that of a generic social media user. Additionally, the observation that the tested centrality measures performed better than the null model leads us to think that other complex network measures could perform better in the task of detecting social media users in the city. Finally, the physical distance to key locations of the protest march performed worse than the null model in detecting social media users during the observed time and area of study.
The present study is observational and requires testing and comparing different detection functions and additional variables in future research. We hope that the interdisciplinary methodological approach proposed here, and its variants, help other researchers to explore how social network measures, compared to other types of explanatory factors, contribute to detecting social media users across the city.
Appendix 1: Visualization of principal component analysis
Appendix 2: Qualitative coding scheme and inter-rater reliability
In this section, we summarize how content analysis was performed to classify users based on the content of the Tweets.
Qualitative coding scheme:
Step 1: Identify whether the text of the Tweet contains hashtags related to the protest march. The process begins, as in most previous investigations, by exploring the data to identify the hashtags that a user posts. In previous investigations carried out in the context of social media networks, the posting of certain hashtags has been used as a proxy measure of support for a cause (e.g. #pride, #pride2018, #loveislove, #gaypride, #gaypride2018, #pride_cdmx, #lgbt, #lgbtpride, #lgbttti, #orgullogay, #orgullo2018, #marchaorgullogay2018, #marchadelorgullo, #instagay, #pridemonth, #pridemonth2018, #pridemexico, #pride2018cdmx, #diversidad, #rainbow, #marchalgbt, #marchagay, #happypride, #prideparade).
Step 2: Identify whether the text of the Tweet contains emojis related to the LGBTTTI movement. We do this because users can demonstrate their support for a social movement through iconographic communication. A flag with six rainbow colors, usually red, orange, yellow, green, blue and purple, is commonly used by the LGBTTTI movement as a gay pride flag, or simply as a pride flag. Additional icons related to the LGBTTTI protest march were also included.
Step 3: Open the URL and explore whether there is content (i.e. images, video, maps, or another type of media content) that denotes support for the LGBTTTI community march.
Step 4: The qualitative analyst reads the full text of the post. In this task, the analyst assesses whether there is evidence that the user, despite having posted a hashtag, emoji, or URL content related to the event, is posting content not related to the LGBTTTI march. For example, if a Tweet contains the gay pride flag but uses it to promote tickets to a nightclub, the Tweet serves a purpose other than supporting the protest march.
Step 5: Finally, the qualitative analyst codes the data, taking the previous steps into consideration. First, if the Tweet contains information related to the march, it is coded with the label Yes, and if it does not contain information about the march, it is coded with the label No. Second, users who have posted at least one Tweet supporting the march are considered users Supporting the march (coded 1). Otherwise, they are considered Generic users publishing different types of content (coded 0) during the day of observation. With this manual coding, it was possible to identify the Type of User, distinguishing those who supported the protest from those who posted other types of information.
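The aggregation rule in Step 5 can be sketched in a few lines. The user identifiers and labels below are hypothetical, and the function name `classify_users` is introduced here only for illustration; the manual coding of each Tweet (Steps 1 to 4) is assumed to have already produced the Yes/No labels.

```python
from collections import defaultdict


def classify_users(coded_tweets):
    """Map each user to 1 (Supporting) if any Tweet is coded "Yes", else 0 (Generic)."""
    user_type = defaultdict(int)
    for user_id, label in coded_tweets:
        supports = 1 if label == "Yes" else 0
        user_type[user_id] = max(user_type[user_id], supports)
    return dict(user_type)


# Hypothetical manually coded Tweets: (user id, Yes/No label from Step 5).
coded = [("u1", "No"), ("u1", "Yes"), ("u2", "No"), ("u3", "Yes"), ("u2", "No")]
print(classify_users(coded))
```

A single Tweet coded Yes is enough to classify a user as Supporting, which matches the "at least one Tweet" rule described above.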
Inter–rater reliability:
Agreement between the two raters was measured with the kappa coefficient:
κ = (p_{0} − p_{c}) / (1 − p_{c})
where p_{0} represents the actual observed agreement and p_{c} represents chance agreement, and where an outcome equal to 1 represents perfect agreement. The coefficient can be negative, which indicates agreement below the level expected by chance. As can be seen, the qualitative coding procedure achieved an excellent agreement between the two raters (κ=0.84).
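The computation of the agreement coefficient can be sketched as follows, assuming that the coefficient is Cohen's kappa for two raters, as the notation (p_{0}, p_{c}, κ) suggests. The label sequences are hypothetical and are not the study's data, so they are not expected to reproduce the reported value.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_c) / (1 - p_c) for two raters' labels."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty set of items")
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_c = sum(freq_a[k] * freq_b[k] for k in freq_a) / n ** 2
    if p_c == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_c) / (1.0 - p_c)


# Hypothetical Yes/No codes assigned to ten Tweets by two raters.
rater_1 = ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No"]
rater_2 = ["Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes", "No"]
print(cohen_kappa(rater_1, rater_2))
```

Two raters who disagree on every item of a balanced set obtain a negative value, which illustrates why the coefficient is not bounded below by zero.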
Footnotes
- 1.
For an extended overview of this approach, see (Royle et al. 2017).
- 2.
For additional details, see https://en.wikipedia.org/wiki/Historic_center_of_Mexico_City
- 3.
See https://en.wikipedia.org/wiki/Angel_of_Independence
- 4.
- 5.
We applied the transformation f(x)= log(x+1) for αω-weighted betweenness centrality. This transformation quickly reduced RMSE values.
- 6.
The “∼1” notation stands for null or intercept-only models, that is, models that have no covariate effects.
Notes
Acknowledgements
We thank the anonymous reviewers for their valuable suggestions and comments on this paper.
Authors’ contributions
VM designed the conceptual and methodological approach, studied the research domain, carried out the data collection, conducted the empirical tests and outcome reports, and wrote the draft manuscript. FC performed the interpolation of data. All authors read, checked and approved the manuscript.
Funding
This work is supported by the German Research Foundation (DFG) under grant No. GRK 2167, Research Training Group “User-Centred Social Media”. We acknowledge support by the Open Access Publication Fund of the University of Duisburg-Essen.
Competing interests
The authors declare that they have no competing interests.
References
- Beauchamp, MA (1965) An improved index of centrality. Behav Sci 10(2):161–163. Available from: https://doi.org/10.1002/bs.3830100205.
- Bielik, M, König R, Schneider S, Varoudis T (2018) Measuring the impact of street network configuration on the accessibility to people and walking attractors. Netw Spat Econ. Available from: https://doi.org/10.1007/s11067-018-9426-x.
- Boeing, G (2017) OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Comput Environ Urban Syst 65:126–139. Available from: https://doi.org/10.1016/j.compenvurbsys.2017.05.004.
- Burrough, P, McDonnell R (1998) Creating continuous surfaces from point data. In: Burrough P, Goodchild M, McDonnell R, Witzer PMW (eds) Principles of Geographic Information Systems. Oxford University Press, Oxford.
- Crucitti, P, Latora V, Porta S (2006) Centrality measures in spatial networks of urban streets. Phys Rev E 73(3). Available from: https://doi.org/10.1103/physreve.73.036125.
- Derudder, B, Neal Z (2019) Uncovering Links Between Urban Studies and Network Science. Netw Spat Econ. Available from: https://doi.org/10.1007/s11067-019-09453-w.
- Diestel, R (2017) Graph Theory. Springer, Berlin. Available from: https://doi.org/10.1007/978-3-662-53622-3.
- Efford, M (2004) Density estimation in live-trapping studies. Oikos 106(3):598–610. Available from: https://doi.org/10.1111/j.0030-1299.2004.13043.x.
- Efford, MG (2019) Non-circular home ranges and the estimation of population density. Ecology 100(2):e02580. Available from: https://doi.org/10.1002/ecy.2580.
- Freeman, LC (1977) A Set of Measures of Centrality Based on Betweenness. Sociometry 40(1):35. Available from: https://doi.org/10.2307/3033543.
- Gong, G, Mattevada S, O’Bryant SE (2014) Comparison of the accuracy of Kriging and IDW interpolations in estimating groundwater arsenic concentrations in Texas. Environ Res 130:59–69. Available from: https://doi.org/10.1016/j.envres.2013.12.005.
- Hiruta, S, Yonezawa T, Jurmu M, Tokuda H (2012) Detection, classification and visualization of place-triggered geotagged tweets. In: Proceedings of the 2012 ACM Conference on Ubiquitous Computing - UbiComp 12. ACM, 956–963. ACM Press. Available from: https://dl.acm.org/citation.cfm?doid=2370216.2370427. https://doi.org/10.1145/2370216.2370427.
- Husson, F, Lê S, Pagès J (2017) Exploratory multivariate analysis by example using R. Chapman and Hall/CRC. Available from: https://doi.org/10.1201/b21874.
- Japkowicz, N, Shah M (2009) Evaluating Learning Algorithms. Cambridge University Press. Available from: https://doi.org/10.1017/cbo9780511921803.
- Lê, S, Josse J, Husson F (2008) FactoMineR: An R Package for Multivariate Analysis. J Stat Soft 25(1). Available from: https://doi.org/10.18637/jss.v025.i01.
- Li, L, Goodchild MF, Xu B (2013) Spatial, temporal, and socioeconomic patterns in the use of Twitter and Flickr. Cartogr Geogr Inf Sci 40(2):61–77. Available from: https://doi.org/10.1080/15230406.2013.777139.
- Malik, MM, Lamba H, Nakos C, Pfeffer J (2015) Population bias in geotagged tweets. In: Ninth international AAAI conference on web and social media, 18–27. AAAI press, Oxford.
- Mislove, A, Lehmann S, Ahn YY, Onnela JP, Rosenquist JN (2011) Understanding the demographics of Twitter users. In: Fifth international AAAI conference on weblogs and social media, 554–557. AAAI, Palo Alto.
- Neal, ZP (2012) The Connected City: How Networks are Shaping the Modern Metropolis. In: The Metropolis and Modern Life. Routledge, New York and London.
- Opsahl, T, Agneessens F, Skvoretz J (2010) Node centrality in weighted networks: Generalizing degree and shortest paths. Soc Netw 32(3):245–251. Available from: https://doi.org/10.1016/j.socnet.2010.03.006.
- Porta, S, Crucitti P, Latora V (2006) The network analysis of urban streets: A primal approach. Environ Plan B Plan Des 33(5):705–725. Available from: https://doi.org/10.1068/b32045.
- Pratama, BY, Sarno R (2015) Personality classification based on Twitter text using Naive Bayes, KNN and SVM. In: 2015 International Conference on Data and Software Engineering (ICoDSE), 170–174. IEEE. Available from: https://doi.org/10.1109/icodse.2015.7436992.
- Royle, JA, Chandler RB, Gazenski KD, Graves TA (2013) Spatial capture–recapture models for jointly estimating population density and landscape connectivity. Ecology 94(2):287–294. Available from: https://doi.org/10.1890/12-0413.1.
- Royle, JA, Chandler RB, Sollmann R, Gardner B (2014) Spatial Capture-recapture. Elsevier, Academic Press, Waltham.
- Royle, JA, Fuller AK, Sutherland C (2017) Unifying population and landscape ecology with spatial capture-recapture. Ecography 41(3):444–456. Available from: https://doi.org/10.1111/ecog.03170.
- Rui, Y, Ban Y (2014) Exploring the relationship between street centrality and land use in Stockholm. Int J Geogr Inf Sci 28(7):1425–1438. Available from: https://doi.org/10.1080/13658816.2014.893347.
- Sabidussi, G (1966) The centrality index of a graph. Psychometrika 31(4):581–603. Available from: https://doi.org/10.1007/bf02289527.
- Santos, ME, Villatoro P (2016) A multidimensional poverty index for Latin America. Rev Income Wealth 64(1):52–82. Available from: https://doi.org/10.1111/roiw.12275.
- Setianto, A, Triandini T (2013) Comparison of Kriging and Inverse Distance Weighted (IDW) interpolation methods in lineament extraction and analysis. J Appl Geol 5(1):21–29.
- Shepard, D (1968) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference. ACM, 517–524. ACM Press. Available from: https://doi.org/10.1145/800186.810616.
- Summers, L, Johnson SD (2016) Does the configuration of the street network influence where outdoor serious violence takes place? Using space syntax to test crime pattern theory. J Quant Criminol 33(2):397–420. Available from: https://doi.org/10.1007/s10940-016-9306-9.
- Sutherland, C, Royle J, Linden D (2016) oSCR: Multisession sex-structured spatial capture–recapture models. Proc R Soc B 285(20172603):8. R package version 0.42.
- Sutherland, C, Royle JA, Linden DW (2019) oSCR: A Spatial Capture-Recapture R Package for Inference about Spatial Ecological Processes. Ecography 0(0). Available from: https://onlinelibrary.wiley.com/doi/abs/10.1111/ecog.04551.
- Townend, J, Minelli C, Harrabi I, Obaseki DO, El-Rhazi K, Patel J, et al. (2015) Development of an international scale of socio-economic position based on household assets. Emerg Themes Epidemiol 12(1):13. Available from: https://doi.org/10.1186/s12982-015-0035-6.
- Traag, VA, Quax R, Sloot PMA (2017) Modelling the distance impedance of protest attendance. Phys A Stat Mech Appl 468:171–182. Available from: https://doi.org/10.1016/j.physa.2016.10.054.
- Vyas, S, Kumaranayake L (2006) Constructing socio-economic status indices: How to use principal components analysis. Health Pol Plan 21(6):459–468. Available from: https://doi.org/10.1093/heapol/czl029.
- Willmott, CJ (1982) Some comments on the evaluation of model performance. Bull Am Meteorol Soc 63(11):1309–1313.
- Zhang, H (2016) Physical Exposures to Political Protests Impact Civic Engagement: Evidence from 13 Quasi-Experiments with Chinese Social Media. SSRN Electron J. Available from: https://doi.org/10.2139/ssrn.2647222.
- Zhang, H, Hill S, Rothschild D (2016) Geolocated Twitter Panels to Study the Impact of Events. In: 2016 AAAI Spring Symposium Series. AAAI press, Palo Alto.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.