The NetMob24 Dataset: Population Density and OD Matrices from Four LMIC Countries

Wenlan Zhang1,2    Miguel Nunez del Prado1    Vincent Gauthier3   
Sveta Milusheva1

1The World Bank
   USA
2University College London
   UK
3SAMOVAR
   Telecom SudParis    Institut Polytechnique de Paris    FR
(September 2024)
Abstract

The NetMob24 dataset offers a unique opportunity for researchers from a range of academic fields to access comprehensive spatiotemporal datasets spanning four countries (India, Mexico, Indonesia, and Colombia) over the course of two years (2019 and 2020). This dataset, developed in collaboration with Cuebiq (also referred to as Spectus; https://www.cuebiq.com/), comprises privacy-preserving aggregated datasets derived from mobile application (app) data collected from users who have voluntarily consented to anonymous data collection for research purposes. It is our hope that this reference dataset will foster the production of new research methods and the reproducibility of research outcomes.

1 Introduction

In light of the widespread adoption of mobile technology, the digital data generated from mobile devices and mobile applications provides us with the ability to examine a multitude of human behaviors, including mobility, at a previously unattainable scale. These data have the potential to facilitate the analysis of the habits of large populations at large scales, such as those of cities or countries. In certain instances, they can even supplant traditional sources of data, such as surveys in countries where data are scarce. The value of data for development and enhancement of public policies is largely untapped, as they can help fill data gaps and provide real-time and finer-scale insight (WB, 2021). While there are clear benefits to be gained from leveraging these data for the public good, there are also a number of challenges to be addressed. These include the need to ensure data privacy and security, as well as the need to develop new methodologies for data analysis. In the context of data for public good, several topics emerge as particularly relevant, including the use of mobile data for transportation planning, disaster recovery and epidemic response, socioeconomic analysis, and tourism. In this paper, we start out by providing an overview of the current state of research in these areas, highlighting the key findings and challenges that remain to be addressed. We then provide an overview of the Data Challenge dataset that has been made available to facilitate new research on these and other topics. We present some of the limitations of this data as well as some descriptive analysis. Finally, we highlight additional data sources that could be combined with the data to help in studying some of these topics.

With population growth in many developing countries and the concentration of resources and opportunities in urban areas, many cities around the world are facing transportation challenges. Mobile phone data can provide valuable insights into the movement of populations in urban areas, which can inform the development of transportation infrastructure and services. For example, González et al. (2008) use mobile phone data to study the movement of people in cities, with the aim of improving the efficiency of public transportation systems. Several types of mobile phone data have been used to study the movement of people in urban areas with the aim of improving the design of transportation networks, including GPS traces from mobile phones (Pappalardo et al., 2015), mobile phone traffic data (Furno et al., 2016), mobile network metadata (Khodabandelou et al., 2019), and other mobile phone data (Bachir et al., 2019; Vilella et al., 2020). Finally, Ucar et al. (2021) use mobile phone metadata to study the relationship between the use of mobile services and the development of transportation infrastructure in cities.

Disaster recovery is a critical domain in which targeted resource allocation and response during crises are of paramount importance. A dynamic view of the situation improves situational awareness and allows for a more targeted and effective response. In this context, mobile phone data of all kinds can provide valuable insights into the movement of populations during and after disasters, as well as the impact of these movements (Yabe et al., 2021; Lu et al., 2012; Wang and Taylor, 2014; Bagrow et al., 2011), for example, on the demand for health services or the pre-positioning of response operations (Wang et al., 2021). In addition to mobile data, other proxies for human mobility can also provide valuable information. For instance, Alatrista-Salas et al. (2021) and Eyre et al. (2020) analyze the impact of natural disasters on purchasing behavior, with a particular focus on shifts in healthcare demand to alternative facilities. For a more detailed examination of the role played by mobile phone location data in the disaster recovery process, please refer to Yabe et al. (2022).

In low-income countries, communicable diseases continue to have a significant impact on public health, including lower respiratory infections, diarrheal diseases, HIV/AIDS, malaria, and tuberculosis. In countries where data are limited, the use of new big data sources can inform public policy interventions aimed at reducing the mortality and morbidity rates associated with infectious diseases. As early as 2008, González et al. (2008) began exploring the potential of mobile phone data for measuring population mobility and its subsequent application to the study of epidemics, with other researchers exploring applications to different diseases including malaria, rubella, dengue and cholera (Wesolowski et al., 2012, 2015b, 2015a; Bengtsson et al., 2015; Milusheva, 2020).

More recently, the use of data to manage the spread of the Coronavirus Disease 2019 (Covid-19) pandemic has become standard practice in numerous countries. This encompasses the monitoring of individuals’ geographical locations with the objective of gaining insights into mobility patterns during periods of lockdown or to facilitate disease contact tracing. Call detail records (CDRs) were not originally designed with the intention of supporting public policy-making or enabling the government to monitor the movements of individuals (Milusheva et al., 2021b). However, they exemplify the reuse and repurposing of data for novel purposes. In this context, mobile phone data can provide valuable feedback in quantifying the effectiveness of policies, ranging from partial curfews to strict lockdowns (Oliver et al., 2020). Measures of population density, travel patterns, and population mixing estimated from mobile phone data can also be used to improve the predictions of epidemiological models for the number of cases and geographical spread. Although both private companies and government actors have produced mobile phone applications for contact tracing, their efficacy relative to more traditional forms of contact tracing has not yet been established (Servick, 2020).

Some economic development and well-being metrics are now derived at scale through the lens of mobility data. Various segregation-related issues (Gambetta et al., 2023; Gao et al., 2024a) are also heavily studied with the help of mobility data. More specifically, in the last decade mobile phone data has opened a new perspective by measuring and mapping poverty at country level. For instance, the work of Steele et al. (2017) produces accurate, high-resolution estimates of poverty distribution in Bangladesh. In Guatemala, Hernandez et al. (2017) use CDR data to overcome the limited fiscal and budgetary resources available for producing poverty estimates. Njuguna and McSharry (2017) combine mobile ownership per capita and call volume per phone with normalized satellite nightlight data and population density to estimate the multi-dimensional poverty index (MPI) in Rwanda. Along the same lines, Voukelatou et al. (2020) describe the advantages and limitations of calculating well-being indicators using CDR data. Pokhriyal et al. (2020) highlight the use of mobile phone data for cost-effective, recurring calculation of poverty indicators in Haiti. Aiken et al. (2022b) use survey data to train machine-learning algorithms to recognize patterns of poverty in mobile phone data in Togo. Gao et al. (2024b) employ mobile phone data for income estimation via mobility indicators, activity footprints, and travel graphs with machine learning models. Additionally, Aiken et al. (2022a) use mobile phone data as an input to machine learning models for identifying ultra-poor households in Afghanistan.

In recent studies of the tourism industry, various methodologies have been proposed to enhance the understanding of tourist behaviors and flows. Kovalcsik et al. (2022) introduced a methodology aimed at comprehending tourist flows by accounting for unobserved tourists. Similarly, Altin et al. (2022) utilized Call Detail Record (CDR) data to analyze the different types of visits to Estonia from 2006 to 2013. Expanding on the use of CDR data, the work of Xu et al. (2021) compares tourist mobility patterns across various cities in South Korea. Grassini et al. (2021) focused on analyzing the volume of tourist flows to Florence, Italy. Park et al. (2023) proposed a model for segmenting tourists based on their activities using mobile phone data. In Hungary, Michalkó et al. (2023) examined tourism dynamics between large cities and their surrounding areas using mobile phone data. Lastly, Sun et al. (2021) developed a methodology to differentiate tourists from locals within extensive mobile phone data sets. Together, these studies illustrate a growing trend in leveraging mobile and CDR data to refine our insights into tourist behaviors and patterns.

The use of mobile phone data has opened up new avenues for research and analysis in a variety of fields, as illustrated above. By leveraging these large-scale datasets, researchers can gain unprecedented insights into human behavior and mobility patterns. Despite the many benefits of using mobile phone data for research and analysis, there are also challenges that need to be addressed, such as ensuring the development of privacy-aware algorithms for robust methodologies in a big data context. As the field evolves, researchers and policymakers must work together to address these challenges and fully realize the potential of mobile phone data for the public good. Despite the considerable progress that has been made in the field of human mobility, a number of significant challenges remain. In their study, Pappalardo et al. (2023) identify several promising avenues for future research, including methods for mobility data that avoid bias, a better understanding of the diversity of travel modes and of the impact of algorithms on human mobility, and new computational models and AI for mobility modeling (Luca et al., 2021). The NetMob 2024 Data Challenge dataset can be used to address some of these and other research areas, helping to expand knowledge in this field.

This article is organized as follows. Section 2 reviews the data sources used to generate the NetMob24 dataset. Section 3 presents the aggregation methodology and ethical considerations used for the generation of this dataset. Section 4 provides a detailed description of the datasets and the possible anomalies present in them. Finally, we conclude with some additional resources in Section 5.

2 Data Source

The NetMob data challenge 2024 dataset, developed in collaboration with Cuebiq, consists of aggregated datasets that have been produced using mobile application (app) data collected from users who voluntarily provided informed consent for anonymous data collection for research purposes. Through their secure Spectus Data Clean Room platform, Cuebiq makes it possible to analyze a variety of datasets, including privacy-enhanced device location data, detected stop locations, and user trajectories across multiple countries.

Device location data capture the position of a device at a specific moment in time, recorded as individual observations. From these observations, device stop data—defined as locations where a device remained for a period—are derived using a clustering algorithm based on spatio-temporal proximity. Cuebiq also produces a trajectory dataset that includes observations on the path a device traveled between two consecutive stops within a single day. We use the device location and trajectory datasets from four countries—Mexico, Colombia, Indonesia, and India—to prepare the data challenge datasets for Population Density (PD) and Origin-Destination (OD) Matrices, as shown in Figure 1. Details on the datasets and their creation follow in the subsections. These countries were chosen due to the limited existing research with mobile phone data in the context of low- and middle-income countries and data availability on the platform. The dataset covers the years 2019 and 2020, allowing for cross-year comparisons. It is important to note that data collection in Colombia began only in late October 2019, resulting in data availability for only November and December of that year.

Figure 1: Data Processing Workflow

2.1 Device Location Data

The device location data records the location of a given device at specific times. This dataset includes details such as the event time, anonymous ID, coordinates, accuracy in meters, operating system name, device manufacturer name, timezone offset in seconds, and speed in meters per second, among other variables. It should be noted that in some cases the original latitude and longitude values are transformed to preserve the privacy of users. For example, home areas are re-assigned to the centroid of the corresponding Geohash 6 tile, and points falling within Sensitive Points of Interest are removed from the dataset for privacy purposes. Cuebiq first classifies the point as a recurring area, a whitelisted area (a point of interest (POI) included in Cuebiq's POI whitelist, such as commercial locations and other POIs that are permissible for use according to Cuebiq's privacy requirements), or an "Other" area. Any point that is in a recurring area is transformed to the coordinates of the centroid of the nearest geometry with 600+ households. Importantly, the point always remains in the same Geohash 6 cell where the original point was located. For generating the population dataset, we use only the event time, anonymous ID, and coordinates variables. Data points with latitude or longitude values recorded as errors or zeros were removed.

2.2 Trajectory

The trajectory data records the path traveled between two consecutive stops by a device within a given day, capturing the movement of users throughout the day. Each trajectory provides various details, including the anonymous ID, which uniquely identifies the device without revealing personal information, and the start and end coordinates, which indicate the geographic locations where the trip began and ended. The latitude and longitude related to home locations are transformed as described above, always keeping the points within the same Geohash 6 as the original coordinates. The trajectory WKT (Well-Known Text) offers a geometric representation of the path taken. Additional information includes the operating system name and device manufacturer, which help identify the type of device used, and the duration in minutes, which records the time spent traveling between the start and end point. The trip length in meters quantifies the distance covered, while the number of points in a trajectory indicates how many device location points were collected during the trip. It is important to note that the same user can have multiple trajectories per day, and trajectories that start on one day and end on another were filtered out in the Cuebiq trajectory data. To create the Data Challenge dataset, we focus on the anonymous ID, start and end coordinates, duration in minutes, trip length in meters, and the number of points for each trajectory.

3 Methodology

Data has been spatially encoded and temporally aggregated to preserve personal privacy. Spatially, Geohash (GH) and H3 have been used in order to provide different levels of spatial resolution. Geohash is a widely adopted system for encoding geographic coordinates into indexes. It consists of 12 levels, with each level providing a different level of spatial precision. Each character in a geohash string corresponds to a specific level of precision, where the number of characters in the string indicates the precision level. For example, Geohash 3 (GH3), which has 3 characters, represents an area of 156 km x 156 km, while Geohash 5 (GH5), with 5 characters, represents an area of 4.9 km x 4.9 km. H3 is another global grid system for indexing geographies into a hexagonal grid, developed at Uber (H3, 2024). Hexagons offer better spatial properties, such as more uniform distance between the center of the hexagon and its neighbors. This reduces edge effects and provides more accurate modeling of spatial relationships. The H3 index is represented as a 15-character (or 16-character) hexadecimal string, and the second character is a hexadecimal digit that encodes the resolution level. The selected H3 level 7 has an average edge length of 1.41km. GH3, GH5, and H37 have been selected to balance privacy concerns and maintain detailed information.
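To make the geohash encoding concrete, the following is a minimal sketch of the standard public geohash algorithm in plain Python. It is not the code used to build the dataset; the function names `geohash_encode` and `cell_size_deg` are illustrative choices, and the sample coordinate is arbitrary.

```python
# Geohash base-32 alphabet (the letters a, i, l and o are not used).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a latitude/longitude pair as a geohash string of given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits yield one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def cell_size_deg(precision: int) -> tuple:
    """Return (latitude height, longitude width) of a cell in degrees."""
    total_bits = 5 * precision
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit when odd
    lat_bits = total_bits // 2
    return 180 / 2 ** lat_bits, 360 / 2 ** lon_bits


gh5 = geohash_encode(57.64911, 10.40744, precision=5)
gh3 = geohash_encode(57.64911, 10.40744, precision=3)
print(gh3, gh5, cell_size_deg(3), cell_size_deg(5))
```

A GH3 cell spans 1.40625 degrees in each direction and a GH5 cell about 0.044 degrees. With roughly 111 km per degree at the equator, this corresponds to the approximate 156 km and 4.9 km cell sizes quoted above. Because a shorter geohash is a prefix of a longer one, GH5 cells nest inside GH3 cells.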

After encoding the data, all individual observations were aggregated by time interval, and the relevant features were calculated, as shown in Figure 1. Time aggregations include 3-hourly (3h), daily, weekly, and monthly, depending on the dataset. For the 3h aggregation, each day is divided into 8 intervals starting at midnight. For weekly data, it is important to note that week 53 of 2019 includes only 2 days (20191230-20191231) and week 1 of 2020 includes only 5 days (20200101-20200105).
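The temporal binning can be sketched as follows. This is an illustrative reconstruction, not the pipeline code: the function names are hypothetical, the label `24:00` for the end of the last interval is an assumption, and the week rule (Monday-based weeks that restart on 1 January) is inferred only from the two edge cases stated above.

```python
from datetime import datetime


def three_hour_label(ts: datetime) -> str:
    """Label a timestamp with its 3-hour interval, e.g. '20190115 12:00-15:00'."""
    start = (ts.hour // 3) * 3  # 0, 3, 6, ..., 21
    end = start + 3             # assumption: the last interval ends at '24:00'
    return f"{ts:%Y%m%d} {start:02d}:00-{end:02d}:00"


def week_index(ts: datetime) -> int:
    """Week number within the calendar year, with weeks starting on Monday.

    %W counts days before the first Monday as week 0, so adding 1 gives
    week 1 for 1 January. This rule is an assumption that reproduces the
    short weeks at the 2019/2020 boundary described in the text.
    """
    return int(ts.strftime("%W")) + 1


print(three_hour_label(datetime(2019, 1, 15, 14, 30)))
print(week_index(datetime(2019, 12, 30)), week_index(datetime(2020, 1, 5)))
```

Under this rule, 30 and 31 December 2019 fall in week 53 of 2019, and 1 to 5 January 2020 fall in week 1 of 2020, which matches the two short weeks noted above.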

3.1 Population Density Data

The Population Density (PD) dataset describes the presence of mobile app users, offering insights into the number of devices detected at specific locations. The dataset was generated by aggregating device location data at 3-hourly and daily intervals, using spatial units of GH3 and GH5. It includes several key attributes for analyzing spatial and temporal patterns. The geohash_5 or geohash_3 column gives the spatial index for each observation. The no_of_points column captures the total count of observations from the device location dataset recorded within each geohash unit during a given time interval, reflecting the density of data points within a specific spatial area. An observation is generated every time a device has a data activity logged in one of the apps that shares data with Cuebiq. Note that the same device can show up multiple times in the same geohash unit and time interval, and all of these observations are summed. Additionally, the no_of_unique_users column provides the count of distinct users, based on the unique anonymous device IDs in the device location dataset, associated with observations within each geohash unit for the same time interval. For this variable, a device is counted only once for a given geohash unit and interval, no matter how many device location observations it has. Finally, the event_time column records the time interval of each observation, formatted as either YYYYMMDD HH:00-HH:00 for 3-hourly intervals or YYYYMMDD for daily intervals, thus providing temporal context to the data. An example of the dataset is presented in Table 1.

geohash_5  local_date  no_of_unique_users  no_of_points
qqg7g      20190115    85                  892
t9rn6      20200102    10                  86
6rfyf      20191103    88                  940
9u8dq      20191230    95                  1023
Table 1: Sample Daily Population Density Data
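The difference between no_of_points and no_of_unique_users can be illustrated with a small aggregation sketch. The input tuple layout and the function name `aggregate_population_density` are assumptions made for illustration; the records are toy values, not dataset content.

```python
from collections import defaultdict


def aggregate_population_density(observations):
    """Aggregate (device_id, geohash, date) records into PD-style rows.

    no_of_points counts every observation; no_of_unique_users counts each
    device once per geohash unit and time interval.
    """
    points = defaultdict(int)
    users = defaultdict(set)
    for device_id, cell, date in observations:
        key = (cell, date)
        points[key] += 1
        users[key].add(device_id)
    return [
        {
            "geohash_5": cell,
            "local_date": date,
            "no_of_unique_users": len(users[(cell, date)]),
            "no_of_points": count,
        }
        for (cell, date), count in sorted(points.items())
    ]


# Toy records: device "a" is observed twice in the same cell and day.
toy_observations = [
    ("a", "qqg7g", "20190115"),
    ("a", "qqg7g", "20190115"),
    ("b", "qqg7g", "20190115"),
    ("a", "t9rn6", "20190115"),
]
pd_rows = aggregate_population_density(toy_observations)
for row in pd_rows:
    print(row)
```

In the toy input, the first cell receives three points from two distinct devices, so the two columns differ whenever a device is observed repeatedly in the same cell and interval.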

3.2 Origin-Destination Matrix

The Origin-Destination (OD) matrix dataset represents the flow of app users from a specific origin to a particular destination, providing information on the number of app user trips between different locations. The OD matrix was generated by aggregating trajectories within the same start and end spatial units of GH3, GH5, and H37 at 3-hourly, daily, weekly, and monthly intervals. The dataset includes various attributes essential for analyzing travel patterns and spatial interactions. The start_geohash and end_geohash columns represent the spatial units for the origin and destination of trips, encoded using GH3, GH5, or H37, offering a detailed spatial reference for each trip. The trip_count column aggregates the total number of trips between each start and end geohash/H3 pair for each time interval, providing insights into trip frequency. To characterize the temporal aspects of travel, the dataset includes m_duration_min, mdn_duration_min, and sd_duration_min, which respectively denote the mean, median, and standard deviation of trip durations in minutes between the start and end units. Similarly, the spatial dimensions of trips are captured through m_length_m, mdn_length_m, and sd_length_m, representing the mean, median, and standard deviation of trip lengths in meters for each time interval. The dataset also includes measures of observational density, with m_points_no, mdn_points_no, and sd_points_no indicating the mean, median, and standard deviation of recorded device location observations per trip between geohash pairs. As in the PD dataset, there are time columns local_time (formatted as YYYYMMDD HH:00-HH:00 for 3-hourly data) and local_date (formatted as YYYYMMDD for daily data), which record the time interval or date of the trips, providing temporal context for the observed travel patterns. An example OD matrix is presented in Table 2.

start_geohash  end_geohash  trip_count  m_duration_min  mdn_duration_min  sd_duration_min  m_length_m  mdn_length_m  sd_length_m  m_points_no  mdn_points_no  sd_points_no  local_date
6rf            6rf          30          142.48          48.49             183.83           17903.78    733.10        89495.21     4.73         4              2.79          20191101
d0r            d0r          141         78.03           32.15             129.87           2130.48     994.02        2837.16      5.34         4              4.26          20191101
abc            def          45          120.45          40.22             150.87           19500.56    750.89        89000.12     5.67         5              3.45          20191102
ghi            jkl          67          98.34           35.78             110.56           2500.78     1050.33       3000.45      6.23         5              4.12          20191103
Table 2: Sample Daily Origin-Destination Matrix Data (Geohash 3)
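The per-pair statistics in the OD matrix can be sketched as a grouped summary. This is a minimal illustration under stated assumptions: the input dictionary keys (`start`, `end`, `date`, `duration_min`, `length_m`, `points`) and the function name `aggregate_od` are hypothetical, and the sample standard deviation is used because the text does not say whether the sample or population form was applied.

```python
from collections import defaultdict
from statistics import mean, median, stdev


def aggregate_od(trips):
    """Summarise trips per (start, end, date) into OD-style rows."""
    groups = defaultdict(list)
    for trip in trips:
        groups[(trip["start"], trip["end"], trip["date"])].append(trip)

    rows = []
    for (start, end, date), group in sorted(groups.items()):
        row = {
            "start_geohash": start,
            "end_geohash": end,
            "local_date": date,
            "trip_count": len(group),
        }
        for field, suffix in (
            ("duration_min", "duration_min"),
            ("length_m", "length_m"),
            ("points", "points_no"),
        ):
            values = [trip[field] for trip in group]
            row[f"m_{suffix}"] = mean(values)
            row[f"mdn_{suffix}"] = median(values)
            # Assumption: sample standard deviation; 0.0 for a single trip.
            row[f"sd_{suffix}"] = stdev(values) if len(values) > 1 else 0.0
        rows.append(row)
    return rows


# Toy trips between one pair of GH3 cells on one day.
toy_trips = [
    {"start": "6rf", "end": "d0r", "date": "20191101",
     "duration_min": 10, "length_m": 100, "points": 2},
    {"start": "6rf", "end": "d0r", "date": "20191101",
     "duration_min": 20, "length_m": 200, "points": 4},
    {"start": "6rf", "end": "d0r", "date": "20191101",
     "duration_min": 30, "length_m": 300, "points": 6},
]
od_rows = aggregate_od(toy_trips)
print(od_rows[0])
```

The large gap between mean and median trip length in Table 2 (for example 17903.78 m versus 733.10 m) suggests heavily skewed distributions, so the median columns are often the safer summary of a typical trip.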

3.3 Ethics Considerations

To ensure a high level of privacy for individuals, we implemented several measures while handling the mobile app data. The data is sourced only from users who have consented to share their information through various mobile applications. Before the data was shared with us, all personally identifiable information was removed, and device IDs were anonymized by Cuebiq. Additionally, Cuebiq applied a privacy enhancement method that integrates classification and transformation in a sequential process. Initially, each location point is categorized based on specific criteria. Points that fall within a "whitelisted" point of interest (POI), as defined by Cuebiq's privacy-compliant POI whitelist, or those that do not meet other classification criteria, retain their original latitude and longitude values without modification. Conversely, points identified as being near a recurring area associated with the device, likely the home location, undergo a transformation. In these cases, the original latitude and longitude values are adjusted to the centroid of the nearest geometry containing 600 or more households. The transformation ensures that the adjusted point remains within the same geohash level 6 cell as the original point.

Upon receiving the privacy-enhanced data, the first step applied was spatial encoding at the GH3, GH5, or H37 level, followed by aggregation by 3-hourly, daily, weekly, or monthly time interval, so that no individual or device-level data are released. Only cells with 10 or more users (for OD pairs, 10 or more trips) were included, leading to the exclusion of some cells, as illustrated in Figure 2. The figure shows the cell observations in the available dataset against the total cell observations from the original aggregated data, to demonstrate what proportion of data is excluded from each dataset in order to maintain the minimum of 10 users per time/location interval. All data aggregation was conducted within the secure environment of the Spectus Data Clean Room. Only the aggregated and threshold-applied data were exported, with Cuebiq's consent.

(a) Proportion of PD Points with Minimum 10 Unique Users that are Kept Compared to Total PD Points
(b) Proportion of OD Pairs with Minimum 10 Trips that are Kept Compared to Total OD Pairs
Figure 2: Preserved Data for the 4 Countries and Different Temporal and Spatial Datasets, 2019
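The minimum-count threshold and the preserved proportion plotted in Figure 2 can be expressed in a few lines. The function name `apply_min_count` and the toy rows are illustrative assumptions; the same function would apply to OD rows by passing the trip count field instead.

```python
def apply_min_count(rows, count_field, minimum=10):
    """Keep rows whose count reaches the minimum; also return the kept share."""
    kept = [row for row in rows if row[count_field] >= minimum]
    share = len(kept) / len(rows) if rows else 0.0
    return kept, share


# Toy aggregated cells; only those with at least 10 unique users are kept.
toy_cells = [
    {"geohash_5": "qqg7g", "no_of_unique_users": 85},
    {"geohash_5": "t9rn6", "no_of_unique_users": 10},
    {"geohash_5": "6rfyf", "no_of_unique_users": 9},
    {"geohash_5": "9u8dq", "no_of_unique_users": 3},
]
kept_cells, kept_share = apply_min_count(toy_cells, "no_of_unique_users")
print(len(kept_cells), kept_share)
```

Because suppression is applied per cell and time interval, finer spatial or temporal resolutions are expected to lose a larger share of cells, which is worth keeping in mind when comparing totals across resolutions.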

3.4 Final dataset format

The NetMob 2024 data challenge dataset includes PD and OD data from four low- and middle-income countries (LMICs): Colombia (CO), Indonesia (ID), India (IN), and Mexico (MX), spanning the years 2019 and 2020. Data for CO are available only from November 2019 onward, so both PD and OD start then. The dataset is organized by country, spatial encoding index, and temporal interval. The hierarchical organization of files can be seen in Figure 3. The structure follows the order dataset, temporal interval, encoding, year, country.

Figure 3: Hierarchical Organization of Folder

4 Data Description

This section presents an exploratory analysis of the provided data challenge PD and OD datasets. The analysis describes the datasets in terms of completeness and of their spatial and temporal patterns, highlighting their strengths and weaknesses, to support participants in their research applications.

4.1 Data Completeness and Anomalies

Figure 4 presents the time series of the number of unique users from the PD dataset, the trip count from the OD dataset, and the ratio of trip count to unique users for 2019 and 2020. This visualization aims to highlight key feature changes and identify anomalies. Days with visible anomalies include May 10 to May 20, 2019, and October 22, 2019. We identify these anomalies visually, based on drops in the line charts in Figure 4(a) and Figure 4(b), but when conducting analysis with the data it may be advisable to identify all anomalies using a standard definition. An example of such a definition would be to flag observations that are more than 2 standard deviations from the average value within a 30-day window. It is important to note that for the OD data, certain days are missing for certain countries. The OD data are missing for all four countries on 20200501, and for India and Indonesia on 20191022 and 20191231. Additionally, there are some days where certain time intervals are missing (e.g., Colombia and Mexico on 20191231 for the 3-hour intervals starting at hours 0, 3, 6, and 9). In the 3h datasets, these observations are missing. In the daily datasets, there is an observation for the day because some time intervals have data, but the values are much lower since part of the day is missing. Again, these cases could be identified through anomaly detection methods.
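The example definition above (more than 2 standard deviations from the 30-day average) can be sketched as follows. The choice of a trailing window made of the 30 preceding observations, the requirement of a full window, and the function name `flag_anomalies` are assumptions; a centered window or a robust statistic would be equally defensible.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=30, n_sd=2.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than n_sd sample standard deviations.

    Only points with a full window of preceding observations are tested.
    """
    flagged = []
    for i in range(window, len(values)):
        reference = values[i - window:i]
        mu = mean(reference)
        sd = stdev(reference)
        if sd > 0 and abs(values[i] - mu) > n_sd * sd:
            flagged.append(i)
    return flagged


# Toy daily series: 30 stable days, one sharp drop, then a normal day.
toy_series = [100, 102, 98, 101, 99] * 6 + [40, 100]
anomalous_days = flag_anomalies(toy_series)
print(anomalous_days)
```

A trailing window lets a past anomaly inflate the reference standard deviation for the following 30 days, so anomalies that occur close together, such as a multi-day drop, may be only partly flagged. Excluding already flagged days from the reference window is one possible refinement.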

(a) Number of Unique Users from PD
(b) Number of Trip Counts from OD
(c) Ratio of Trip Count to Unique Users
Figure 4: Time Series PD, OD and OD/PD Highlighting Some Anomalies, 2019–2020

In addition to anomalies occurring on specific days or short periods of time, there are also important shifts in the data due to the data generating process. The data is coming from users providing their location on mobile apps. If new apps start to provide data to Cuebiq or else other apps stop providing this data, then it can lead to big shifts up or down in the number of users and the number of trips measured. We can see this happen with Indonesia in late 2019, when the number of unique users per day suddenly drops by more than half. We can also see across Mexico, India and Indonesia the ramp-up of unique users in early 2019 as more apps share their data. We do not have information on the specific apps from which the data is coming and when they start or stop sharing their data. It is important to factor in that these changes in the data generating process are occurring because otherwise they may be considered as representing real changes in the number of people in a location or the number of trips, when in fact the changes are due to a shift in the number of people providing their location.

When working with the trips dataset, one option for accounting for changes in the user base contributing data, or for other anomalies in the number of observations, is to examine the ratio of trips to users. Figure 4(c) demonstrates that while the trip count changes quite significantly at different times of the year, the ratio stays relatively stable across the year for each country once the number of users producing observations on a given day is accounted for. India seems to be the most steady, with a ratio of around 0.2 trips per unique user per day. Mexico is higher, with a ratio of around 0.3, though there is a large increase in the ratio around June 2019 that may therefore represent a real shift in people moving more (rather than a function of more users providing data). Indonesia has more variability during the year, and Colombia starts out much higher than the other countries but then levels out close to their values. This different pattern in Colombia when the data first start suggests that the first month of data may be less reliable, so excluding it or conducting robustness checks with and without it may be advisable.
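The normalization described above amounts to dividing the daily trip total by the daily user total. The sketch below assumes the two daily totals have already been computed per country; the function name `trips_per_user` and the numbers are illustrative only.

```python
def trips_per_user(trips_by_day, users_by_day):
    """Ratio of daily trip count to daily unique users, per date.

    Dates with missing or zero user counts are skipped.
    """
    return {
        day: trips_by_day[day] / users_by_day[day]
        for day in sorted(trips_by_day)
        if users_by_day.get(day)
    }


# Toy daily totals for one country.
toy_trips_by_day = {"20190101": 200, "20190102": 300, "20190103": 150}
toy_users_by_day = {"20190101": 1000, "20190102": 1000}
ratios = trips_per_user(toy_trips_by_day, toy_users_by_day)
print(ratios)
```

One caveat under this approach: if daily user totals are obtained by summing no_of_unique_users over cells, a device seen in several cells on the same day is counted more than once, so the ratio is best read as a relative indicator over time rather than an exact number of trips per person.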

So far the focus has been on aspects of the data generating process that may impact the data at a particular time, but another important area related to completeness is the representativeness of the data. To be part of the sample of users in the data, a person must have a smartphone and the resources to pay for mobile data to use the apps through which the data points are generated. In low- and middle-income countries, large portions of the population may not be able to afford a smartphone or to pay for mobile data on a regular basis, and therefore would not be represented in the dataset (Milusheva et al., 2021a). Additionally, the apps that different people use may differ based on demographic characteristics like gender or income, and as some apps become excluded or included in contributing to the data, the demographic make-up of those contributing their data could change, also affecting the behaviors seen in the data. This can also vary from country to country, as certain apps may be more popular in some countries than in others. It is not possible to correct for these aspects in the datasets provided for the Challenge, but they are important to consider when interpreting the data and results and should be analyzed further in the future.

4.2 Spatial and Temporal Analysis

4.2.1 Population Density Data

Figure 5: Mexico Unique Users from PD Dataset in Geohash 3 & 5, Avg Daily Value for November 2019

Understanding population density is critical for a wide range of applications, from national-level policy planning to urban management. As shown in Figure 5, the two levels of PD data provided can support both macro and micro-level studies (the figure uses the Jenks (1967) natural breaks method for grouping cells by population density). GH3, with its broad spatial coverage, is ideal for strategic planning and resource allocation on a macro scale, enabling policymakers to assess regional disparities or optimize large-scale infrastructure projects. In contrast, GH5 provides a finer resolution suited for city-level urban planning and management. It allows for detailed mapping of population concentrations within cities, guiding decisions on zoning, public service distribution, and disaster preparedness. Together, these geohash levels enable a comprehensive approach to population density analysis, supporting both high-level policy decisions and granular urban management. Example maps for all four countries are included in the appendix.

4.2.2 Origin-Destination

It is possible to visualize some of the patterns of movement between origin and destination locations. India and Mexico show very different patterns of movement between origin-destination pairs. Figure 6 (see also Figure 9 in the appendix) is a heatmap of movement between the top 30 geohash 5 pairs in terms of total trips, with the axes ordered so that cells that are closer together on an axis are closer together geographically. In India, the top 30 pairs are very concentrated geographically. Mexico also shows concentration (visible as more blue colors along the diagonal, which represents cells that are closer together geographically), but there is also movement to places further away.

In Figure 7, we see that the pattern of movement across the country also differs. In India, several major cities act as hubs, and the concentration seen in the OD heatmap reflects high levels of movement within those major city hubs. There is also movement between the big hubs, but much less than within the area immediately around each hub. In Mexico, on the other hand, Mexico City seems to act as the main central point from which movement across the country radiates. This pattern is also visible in the heatmap.

(a) India OD
(b) Mexico OD
Figure 6: Top 30 OD Movement Geohash Pairs Heat Map, December 2019

(a) India OD
(b) Mexico OD
Figure 7: Visualization of Movement Between Different Areas of the Country, December 2019

Using the 3-hour dataset, it is also possible to examine how patterns of movement change over the course of a day. Figure 8 shows the total trips in November 2019 across Mexico City for different time intervals. Most trips are detected during the middle of the day (12:00-15:00 and 15:00-18:00) and are concentrated in the northern part of the city. Combining datasets at different time intervals and spatial resolutions can help in learning more about mobility patterns across the four countries.
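A time-of-day profile of this kind can be sketched as a simple aggregation. The example below assumes a hypothetical row layout of (origin geohash 5, interval start hour, trip count); the actual column names and file format of the Challenge data may differ, and the geohash codes and counts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (origin_geohash5, interval_start_hour, trip_count).
rows = [
    ("9g3qx", 12, 40),
    ("9g3qx", 15, 35),
    ("9g3qr", 12, 10),
    ("9g3qr", 21, 5),
]


def trips_by_interval(rows):
    """Sum trip counts per 3-hour interval, labelled like '12:00-15:00'."""
    totals = defaultdict(int)
    for _origin, start_hour, count in rows:
        end_hour = (start_hour + 3) % 24
        label = f"{start_hour:02d}:00-{end_hour:02d}:00"
        totals[label] += count
    return dict(totals)


interval_totals = trips_by_interval(rows)
# Interval with the largest number of trips in this toy sample.
peak_interval = max(interval_totals, key=interval_totals.get)
```

Grouping by origin cell as well as by interval would give the per-cell values that a map such as Figure 8 displays.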

It is important to note again that the data generating process may affect the final datasets. In particular, Cuebiq removes any trip that starts on one day and ends on a different day. Trips that start late in the day and do not end until the following day are therefore removed, potentially decreasing the number of trips measured in the evening. Additionally, as already discussed, the data are a function of the users providing that information. If there are parts of the city or country with a much higher proportion of low-income individuals who do not have a smartphone, cannot afford to pay for data, or are less likely to use mobile apps that collect this data, much less data will be collected from these areas. This is especially true for more rural areas and for lower-income areas within cities.
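The effect of the same-day rule can be illustrated with a toy filter. This is only a sketch of the rule as described above, not Cuebiq's actual pipeline, and the timestamps are invented; it shows why a late-evening trip that crosses midnight disappears from the counts.

```python
from datetime import datetime

# Hypothetical trips as (start, end) timestamps.
trips = [
    (datetime(2019, 11, 5, 8, 0), datetime(2019, 11, 5, 8, 40)),
    (datetime(2019, 11, 5, 23, 30), datetime(2019, 11, 6, 0, 15)),
    (datetime(2019, 11, 5, 22, 0), datetime(2019, 11, 5, 22, 50)),
]


def same_day_trips(trips):
    """Keep only trips whose start and end fall on the same calendar day."""
    return [(start, end) for start, end in trips if start.date() == end.date()]


kept = same_day_trips(trips)
dropped = len(trips) - len(kept)
```

Here the 23:30 departure is dropped even though it lasts only 45 minutes, which is the mechanism that may depress evening trip counts.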

Figure 8: Trip Count by 3h Timely Origin in Mexico City, November 2019 at Geohash 5 Level

5 Additional Resources

Several complementary data sources have been identified that could supplement the mobile app data for the data challenge. These sources provide additional context and depth, enabling a more comprehensive analysis. They include demographic and health surveys, geospatial datasets, socio-economic indicators, and environmental data, all of which can help to validate, enrich, and cross-reference the mobile app data, offering a fuller understanding of the patterns and trends observed.

WorldEX (Solatorio et al., 2024) is a platform that utilizes H3 indexing to facilitate the discovery of geospatial data, particularly at the sub-national level, to support socio-economic research and policy-making. The website aggregates and summarizes publicly available datasets from a variety of reputable sources, including Climate Trace, Ember Climate, the Humanitarian Data Exchange (HDX), Source Cooperative, the Uppsala Conflict Data Program (UCDP), the United Nations High Commissioner for Refugees (UNHCR), the World Bank, and WorldPop. These datasets cover a wide range of topics on a global scale, offering insights into various levels of administrative boundaries, socio-economic features such as population demographics and age distribution, natural hazards, as well as environmental factors like forest coverage and building footprints. WorldEX provides direct links to these datasets, making it a valuable resource for researchers and policymakers seeking to access and analyze detailed geospatial information for informed decision-making (https://worldex.org/).

The Demographic and Health Surveys (DHS) Program, established by USAID, provides nationally representative data essential for health and population research in developing countries. With over 30 years of data collection across more than 90 countries, DHS covers topics such as fertility, maternal and child health, and HIV/AIDS. The data are collected at various spatial resolutions, typically linked to administrative units such as regions or districts. GPS coordinates are recorded at the cluster level (a cluster represents a small community or neighborhood) and are displaced by up to 2 kilometers in urban areas and 5 kilometers in rural areas for privacy. This level of spatial resolution allows for granular geographic analysis, which is particularly valuable when combined with mobile phone data. By integrating DHS data with mobile phone app data, researchers can gain deeper insights into population movement, health service access, and disease spread, enhancing the ability to track health trends, evaluate programs, and inform policy (https://dhsprogram.com/Data/).
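Because DHS cluster coordinates are displaced, linking a cluster to a single grid cell can be misleading. One possible approach, sketched below under stated assumptions, is to list every cell whose centroid lies within the displacement radius quoted above (2 km urban, 5 km rural). The cluster record, cell names, and centroid coordinates are hypothetical, and testing centroids only is a simplification: a cell can overlap the radius even when its centroid lies outside it.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_cells(cluster, cell_centroids):
    """Names of cells whose centroid is within the cluster's displacement radius."""
    radius_km = 2.0 if cluster["urban"] else 5.0
    return [
        name
        for name, (lat, lon) in cell_centroids.items()
        if haversine_km(cluster["lat"], cluster["lon"], lat, lon) <= radius_km
    ]


# Hypothetical cell centroids roughly 1.1 km, 3.3 km and 11 km north of the cluster.
cell_centroids = {
    "cell_a": (4.61, -74.08),
    "cell_b": (4.63, -74.08),
    "cell_c": (4.70, -74.08),
}
urban_cluster = {"lat": 4.60, "lon": -74.08, "urban": True}
rural_cluster = {"lat": 4.60, "lon": -74.08, "urban": False}

urban_matches = candidate_cells(urban_cluster, cell_centroids)
rural_matches = candidate_cells(rural_cluster, cell_centroids)
```

Survey values could then be compared against an average over the candidate cells rather than a single cell, which reflects the uncertainty introduced by the displacement.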

Climate data. Various sources provide data on precipitation and other weather-related indicators across the world, as well as on air pollution and other indicators, derived from satellite imagery and from modelling and assimilation projects that combine raw satellite data with other data sources. Some resources include NASA's Global Precipitation Measurement (https://gpm.nasa.gov/data), NASA's Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/), and the European Union's Copernicus Atmosphere Monitoring Service (CAMS) (https://atmosphere.copernicus.eu/).

World Bank Microdata Repository. In addition to making data available on indicators such as GDP, population, and other country statistics (https://data.worldbank.org/), the World Bank also maintains a Microdata Catalog, where many different microdatasets can be found for each country. Many of these are available openly or through a data request (https://microdata.worldbank.org/index.php/home).

6 Concluding Remarks

The NetMob 2024 Data Challenge dataset provides researchers with the opportunity to study mobility and population patterns for four countries across the world, with the goal of increasing knowledge and research that is especially relevant for low- and middle-income countries. Different levels of spatial and temporal aggregation can be used for different applications and studies. The data also raise important questions of their own, including how to tackle the challenges that can arise when working with passively collected data. This paper presents the dataset and some of these challenges. It is our hope that this reference dataset will foster the production of new research methods and the reproducibility of research outcomes.

Acknowledgements

We would like to thank Cuebiq for providing the data and secure platform that were used to produce the final aggregated datasets for the Data Challenge. In particular, we would like to thank Brennan Lake and Éadaoin Ilten for their support in working with the data to produce the datasets and in reviewing research proposals. We would also like to thank the Government of Spain for funding support provided to the Global Data Facility - Mobile Phone Data for Policy Program, which is helping to sponsor NetMob 2024.

References

  • Aiken et al. (2022a) Aiken, E., Bedoya, G., Blumenstock, J. E. and Coville, A. (2022a). Program Targeting with Machine Learning and Mobile Phone Data: Evidence from an Anti-Poverty Intervention in Afghanistan. The World Bank.
  • Aiken et al. (2022b) Aiken, E., Bellue, S., Karlan, D., Udry, C. and Blumenstock, J. E. (2022b). Machine learning and phone data can improve targeting of humanitarian aid. Nature, 603 (7903), 864–870.
  • Alatrista-Salas et al. (2021) Alatrista-Salas, H., Gauthier, V., Nunez-del Prado, M. and Becker, M. (2021). Impact of natural disasters on consumer behavior: Case of the 2017 El Niño phenomenon in Peru. PLOS ONE, 16 (1), e0244409.
  • Altin et al. (2022) Altin, L., Ahas, R., Silm, S. and Saluveer, E. (2022). Megastar concerts in tourism: a study using mobile phone data. Scandinavian Journal of Hospitality and Tourism, 22 (2), 161–180.
  • Bachir et al. (2019) Bachir, D., Khodabandelou, G., Gauthier, V., El Yacoubi, M. and Puchinger, J. (2019). Inferring dynamic origin-destination flows by transport mode using mobile phone data. Transportation Research Part C: Emerging Technologies, 101, 254–275.
  • Bagrow et al. (2011) Bagrow, J. P., Wang, D. and Barabási, A.-L. (2011). Collective response of human populations to large-scale emergencies. PLoS ONE, 6 (3), e17680.
  • Bengtsson et al. (2015) Bengtsson, L., Gaudart, J., Lu, X., Moore, S., Wetter, E., Sallah, K., Rebaudet, S. and Piarroux, R. (2015). Using mobile phone data to predict the spatial spread of cholera. Scientific Reports, 5 (1), 8923.
  • Eyre et al. (2020) Eyre, R., De Luca, F. and Simini, F. (2020). Social media usage reveals recovery of small businesses after natural hazard events. Nature Communications, 11 (1).
  • Furno et al. (2016) Furno, A., Fiore, M., Stanica, R., Ziemlicki, C. and Smoreda, Z. (2016). A tale of ten cities: Characterizing signatures of mobile traffic in urban areas. IEEE Transactions on Mobile Computing, 16 (10), 2682–2696.
  • Gambetta et al. (2023) Gambetta, D., Mauro, G. and Pappalardo, L. (2023). Mobility constraints in segregation models. Scientific Reports, 13.
  • Gao et al. (2024a) Gao, Q. L., Zhong, C. and Wang, Y. (2024a). Unpacking urban scaling and socio-spatial inequalities in mobility: Evidence from England. Environment and Planning B: Urban Analytics and City Science.
  • Gao et al. (2024b) Gao, Q. L., Zhong, C., Yue, Y., Cao, R. and Zhang, B. (2024b). Income estimation based on human mobility patterns and machine learning models. Applied Geography, 163, 103179.
  • González et al. (2008) González, M. C., Hidalgo, C. A. and Barabási, A.-L. (2008). Understanding individual human mobility patterns. Nature, 453 (7196), 779–782.
  • Grassini et al. (2021) Grassini, L., Dugheri, G. et al. (2021). Mobile phone data and tourism statistics: a broken promise. National Accounting Review, 3 (1), 50–68.
  • H3 (2024) H3 (2024). Tables of cell statistics across resolutions.
  • Hernandez et al. (2017) Hernandez, M., Hong, L., Frias-Martinez, V., Whitby, A. and Frias-Martinez, E. (2017). Estimating poverty using cell phone data: evidence from Guatemala. World Bank Policy Research Working Paper, (7969).
  • Jenks (1967) Jenks, G. F. (1967). The data model concept in statistical mapping. International Yearbook of Cartography, pp. 186–190.
  • Khodabandelou et al. (2019) Khodabandelou, G., Gauthier, V., Fiore, M. and El-Yacoubi, M. A. (2019). Estimation of static and dynamic urban populations with mobile network metadata. IEEE Transactions on Mobile Computing, 18 (9), 2034–2047.
  • Kovalcsik et al. (2022) Kovalcsik, T., Elekes, Á., Boros, L., Könnyid, L. and Kovács, Z. (2022). Capturing unobserved tourists: Challenges and opportunities of processing mobile positioning data in tourism research. Sustainability, 14 (21), 13826.
  • Lu et al. (2012) Lu, X., Bengtsson, L. and Holme, P. (2012). Predictability of population displacement after the 2010 Haiti earthquake. Proceedings of the National Academy of Sciences, 109 (29), 11576–11581.
  • Luca et al. (2021) Luca, M., Barlacchi, G., Lepri, B. and Pappalardo, L. (2021). A survey on deep learning for human mobility. ACM Comput. Surv., 55 (1).
  • Michalkó et al. (2023) Michalkó, G., Prorok, M., Kondor, A. C., Ilyés, N. and Szabó, T. (2023). Mobility patterns of satellite travellers based on mobile phone cellular data. Hungarian Geographical Bulletin, 72 (2), 163–178.
  • Milusheva (2020) Milusheva, S. (2020). Managing the spread of disease with mobile phone data. Journal of Development Economics, 147, 102559.
  • Milusheva et al. (2021a) Milusheva, S., Bjorkegren, D. and Viotti, L. (2021a). Assessing bias in smartphone mobility estimates in low income countries. In Proceedings of the 4th ACM SIGCAS Conference on Computing and Sustainable Societies, pp. 364–378.
  • Milusheva et al. (2021b) Milusheva, S., Lewin, A., Gomez, T. B., Matekenya, D. and Reid, K. (2021b). Challenges and opportunities in accessing mobile phone data for COVID-19 response in developing countries. Data & Policy, 3, e20.
  • Njuguna and McSharry (2017) Njuguna, C. and McSharry, P. (2017). Constructing spatiotemporal poverty indices from big data. Journal of Business Research, 70, 318–327.
  • Oliver et al. (2020) Oliver, N., Letouzé, E., Sterly, H., Delataille, S., Nadai, M. D., Lepri, B., Lambiotte, R., Benjamins, R., Cattuto, C., Colizza, V., de Cordes, N., Fraiberger, S. P., Koebe, T., Lehmann, S., Murillo, J., Pentland, A., Pham, P. N., Pivetta, F., Salah, A. A., Saramäki, J., Scarpino, S. V., Tizzoni, M., Verhulst, S. and Vinck, P. (2020). Mobile phone data for informing public health actions across the COVID-19 pandemic life cycle. Science Advances, 6 (23), eabc0764.
  • Pappalardo et al. (2023) Pappalardo, L., Manley, E., Sekara, V. and Alessandretti, L. (2023). Future directions in human mobility science. Nature Computational Science, 3 (7), 588–600.
  • Pappalardo et al. (2015) Pappalardo, L., Simini, F., Rinzivillo, S., Pedreschi, D., Giannotti, F. and Barabási, A.-L. (2015). Returners and explorers dichotomy in human mobility. Nature Communications, 6 (1).
  • Park et al. (2023) Park, S., Zu, J., Xu, Y., Zhang, F., Liu, Y. and Li, J. (2023). Analyzing travel mobility patterns in city destinations: Implications for destination design. Tourism Management, 96, 104718.
  • Pokhriyal et al. (2020) Pokhriyal, N., Zambrano, O., Linares, J. and Hernández, H. (2020). Estimating and forecasting income poverty and inequality in Haiti using satellite imagery and mobile phone data. publications.
  • Servick (2020) Servick, K. (2020). COVID-19 contact tracing apps are coming to a phone near you. How will we know whether they work?
  • Solatorio et al. (2024) Solatorio, A., Bongocan, R. G., Miclat, J. T. and Dupriez, O. (2024). WorldEx. (P180150) Indexing the World: Enabling the effective and efficient discovery of geospatial data for holistic and localized research, KCP IV - TF085237.
  • Steele et al. (2017) Steele, J. E., Sundsøy, P. R., Pezzulo, C., Alegana, V. A., Bird, T. J., Blumenstock, J., Bjelland, J., Engø-Monsen, K., De Montjoye, Y.-A., Iqbal, A. M. et al. (2017). Mapping poverty using mobile phone and satellite data. Journal of The Royal Society Interface, 14 (127), 20160690.
  • Sun et al. (2021) Sun, H., Chen, Y., Lai, J., Wang, Y. and Liu, X. (2021). Identifying tourists and locals by k-means clustering method from mobile phone signaling data. Journal of Transportation Engineering, Part A: Systems, 147 (10), 04021070.
  • Ucar et al. (2021) Ucar, I., Gramaglia, M., Fiore, M., Smoreda, Z. and Moro, E. (2021). News or social media? Socio-economic divide of mobile service consumption. Journal of The Royal Society Interface, 18 (185), 20210350.
  • Vilella et al. (2020) Vilella, S., Paolotti, D., Ruffo, G. and Ferres, L. (2020). News and the city: understanding online press consumption patterns through mobile data. EPJ Data Science, 9 (1).
  • Voukelatou et al. (2020) Voukelatou, V., Gabrielli, L., Miliou, I., Cresci, S., Sharma, R., Tesconi, M. and Pappalardo, L. (2020). Measuring objective and subjective well-being: dimensions and data sources. International Journal of Data Science and Analytics, 11 (4), 279–309.
  • Wang et al. (2021) Wang, J., Cai, J., Yue, X. and Suresh, N. C. (2021). Pre-positioning and real-time disaster response operations: Optimization with mobile phone location data. Transportation Research Part E: Logistics and Transportation Review, 150, 102344.
  • Wang and Taylor (2014) Wang, Q. and Taylor, J. E. (2014). Quantifying human mobility perturbation and resilience in Hurricane Sandy. PLoS ONE, 9 (11), e112608.
  • WB (2021) WB (2021). World development report 2021: data for better lives. World Bank Group.
  • Wesolowski et al. (2012) Wesolowski, A., Eagle, N., Tatem, A. J., Smith, D. L., Noor, A. M., Snow, R. W. and Buckee, C. O. (2012). Quantifying the impact of human mobility on malaria. Science, 338 (6104), 267–270.
  • Wesolowski et al. (2015a) Wesolowski, A., Metcalf, C., Eagle, N., Kombich, J., Grenfell, B. T., Bjørnstad, O. N., Lessler, J., Tatem, A. J. and Buckee, C. O. (2015a). Quantifying seasonal population fluxes driving rubella transmission dynamics using mobile phone data. Proceedings of the National Academy of Sciences, 112 (35), 11114–11119.
  • Wesolowski et al. (2015b) Wesolowski, A., Qureshi, T., Boni, M. F., Sundsøy, P. R., Johansson, M. A., Rasheed, S. B., Engø-Monsen, K. and Buckee, C. O. (2015b). Impact of human mobility on the emergence of dengue epidemics in Pakistan. Proceedings of the National Academy of Sciences, 112 (38), 11887–11892.
  • Xu et al. (2021) Xu, Y., Xue, J., Park, S. and Yue, Y. (2021). Towards a multidimensional view of tourist mobility patterns in cities: A mobile phone data perspective. Computers, Environment and Urban Systems, 86, 101593.
  • Yabe et al. (2022) Yabe, T., Jones, N. K., Rao, P. S. C., Gonzalez, M. C. and Ukkusuri, S. V. (2022). Mobile phone location data for disasters: A review from natural hazards and epidemics. Computers, Environment and Urban Systems, 94, 101777.
  • Yabe et al. (2021) Yabe, T., Rao, P. and Ukkusuri, S. V. (2021). Resilience of interdependent urban socio-physical systems using large-scale mobility data: Modeling recovery dynamics. Sustainable Cities and Society, 75, 103237.

7 Appendix

Figure 9 presents the OD matrix at geohash 5 level for the four countries. Trips within the same spatial unit have been removed from the figure because they would otherwise dominate it. Only the top 30 pairs with the most trips are included in the visualisation.
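The selection described above can be sketched as a short filter-and-sort. The example assumes the OD data have been loaded into a dictionary mapping (origin, destination) to a trip count; that layout, the function name top_od_pairs, and the cell labels are assumptions for illustration rather than the Challenge file format.

```python
def top_od_pairs(flows, n=30):
    """Return the n origin-destination pairs with the most trips.

    Pairs whose origin equals their destination are dropped first, so that
    within-cell trips do not crowd out movement between cells.
    """
    between_cells = {pair: trips for pair, trips in flows.items() if pair[0] != pair[1]}
    ranked = sorted(between_cells.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical flows keyed by (origin, destination) cell labels.
flows = {
    ("cell_a", "cell_a"): 500,
    ("cell_a", "cell_b"): 120,
    ("cell_b", "cell_a"): 90,
    ("cell_b", "cell_c"): 15,
}
top_two = top_od_pairs(flows, n=2)
```

The resulting pairs can be pivoted into an origin-by-destination grid for plotting, with the axis order chosen separately.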

(a) CO
(b) ID
(c) IN
(d) MX
Figure 9: Top 30 OD Heat Map Dec 2019

Figure 10 illustrates the distribution of unique users from the PD dataset at GH3 level for the four countries in the Data Challenge. The highest concentrations are observed in the capital areas of each country, with Mumbai in India also displaying a notably high population count.

Figure 10: Unique Users from Population Density Map in GH3, Average across days in Nov 2019