Satellite Imagery and Machine Learning – The dynamic duo to combat data gaps?

Accelerator Award Montreal

Tinotenda Matsika, Haoyi Qiu, Michelle Murphy, Raphaelle Tseng, Rayan Awad Alim

The Datallite team, generously supported by Concertation Montreal as winners of the AI4Good Lab Montreal Accelerator Award, has developed a model that uses satellite imagery to extract and measure socioeconomic and infrastructural indicators, all with a vision to fuel impactful interventions and bridge data gaps. Read on to find out how Datallite is providing data where it's needed most!

KEYWORDS:

Satellite imagery, machine learning, AI, data collection, data gaps, humanitarian efforts, education, demography, poverty

Very often, Artificial Intelligence (AI) is seen as a magical wand that humans can use to solve all the world’s problems. However, in reality, AI models are only as good as the data that is used to create them.

The key point is that complex models require large amounts of data. Unfortunately, developing regions have far less public data available than developed nations. Often, AI tools cannot even begin to be built until the issue of data availability has been addressed.

This lack of available public data was exactly the roadblock our team ran into when trying to come up with an AI tool to improve education and literacy rates in Africa. After more research, our team came to the realization that developing regions in Sub-Saharan Africa are in a data drought.

We knew it was time to reframe our problem…

With this in mind, we asked ourselves: how can we make an impact in the areas most in need of public data on education and literacy rates? From this question, Datallite’s mission became clear.

Let’s locate those data gaps and get data to where it’s needed most!

DATA GAPS AND EFFECTIVE ACTION

Data drives the decision-making processes in our world. Most decisions in our society are based on data analytics.

Stakeholders like policymakers and Non-Governmental Organizations (NGOs), who try to accurately monitor progress on sustainable development goals, are often driving in the dark.

As stated by the United Nations, the 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future.¹ At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries, developed and developing, in a global partnership.¹ The UN recognizes that ending poverty and other deprivations must go hand-in-hand with strategies that improve health and education, reduce inequality, and spur economic growth, all while tackling climate change and working to preserve our oceans and forests.¹

Since the adoption of the UN’s Agenda for Sustainable Development, creating effective, data-driven policies that maximize its impact has been a global challenge. Despite innovative technologies and growing demand for better data, data availability has not kept pace evenly across regions. The availability of public data in places like South America, Africa and parts of Asia is not getting better but worse, so much so that public data in developing countries has declined since 2015, most notably data on education and education inequality.

When the team looked at all of this information, we felt compelled to zero in on SDG #4, whose mandate is to ensure inclusive and equitable quality education and promote lifelong learning opportunities for all. However, to make progress toward our goal, the team needed to determine the region most in need of educational data.

THE PROBLEM WITH CURRENT DATA COLLECTION

The World Bank database contains minimal information on a majority of African countries due to the expensive and time-consuming nature of survey collection. Furthermore, most of the data collected by surveys is not accessible to the general public, making it difficult for third-party organizations to target humanitarian aid. These traditional data collection efforts demand so many resources that the data cannot be kept current and updated annually, and they require billions of dollars to implement and scale. Without the timely and reliable data collection processes used by developed countries, policies in Sub-Saharan countries can become misdirected and progress toward development is stunted.

MACHINE LEARNING AND DATA COLLECTION

Given these difficulties, our team set out to implement methods to first find and then fill these data gaps. We drew on the machine learning knowledge and AI modelling skills learned in the AI4Good Lab to make this possible. Our team decided to use publicly available satellite imagery to compile data analytics.

Therefore, we found it only appropriate that:

DATA plus SATELLITE gives to the world DATALLITE!

From here, the aim of our project became to help users easily track and assess social, economic and environmental conditions in developing nations. Our AI model was developed with the goal to transform the way countries collect data. No longer do regions need to rely on outdated census surveys that are expensive, inaccessible and often missing valuable information.

For our proof of concept, our team focused on training our model to predict education expenditure of different regions in Nigeria. The dataset containing education expenditures was obtained from the General Household Data 2015-2016 Nigeria, the World Bank - Living Standards Measurement Survey (LSMS). The model’s school geolocation dataset is from GRID3 Nigeria.² Our team examined both datasets for any missing information and we decided how to effectively use machine learning to fill these data gaps.

EXISTING METHODS/RESEARCH

The need for large public datasets has been referred to as the ‘new gold rush’. Private companies have refined their methods for mining and collecting large amounts of information. The same cannot be said of most global organizations in the public sphere trying to tackle social development. The United Nations Statistics Division has collected data since 1948 through a series of annual questionnaires. The World Bank still relies on surveys to collect household data. This can make it difficult to ensure consistency across collected datasets, as differences in timing and reporting practices may cause the information to differ across sources. This adds a layer of uncertainty when trying to combine any set of data. Furthermore, the collection of data using surveys is expensive and time-consuming, and as a result, information cannot be readily updated to reflect the actual current state of a population. These challenges in the methodology of collecting data result in large swaths of missing information.

In order to tackle social development, a more reliable method is required to understand the current state and challenges being faced on the ground. The Stanford Sustainability and Artificial Intelligence Lab wrote a paper titled ‘Combining Satellite Imagery and Machine Learning to Predict Poverty’. Their work led to the creation of the organization, Atlas AI, which aims to help people decide where to invest in Africa and South Asia by identifying new markets, providing insight into where organizations can grow most successfully, and informing where development capital can have the greatest socio-economic impact.

DATALLITE’S LAUNCHPAD PLATFORM

We created a platform called the Launchpad: a dashboard that gives users an easy, efficient way to interact with the data they need and to visualize key indicators and metrics across different regions. This interface can be customized depending on the needs of organizations and users, and the problems they are tackling.

In our first case study, we looked at data on education in Nigeria. We take satellite images that show infrastructure and cross-reference them with census data and other existing data to predict different socioeconomic indicators.

Below is a video demo of Datallite’s Launchpad Platform:

[Video: demo of Datallite’s Launchpad platform]

HOW DOES IT WORK?

The core of Datallite’s AI model extends the Stanford model by incorporating new metrics to predict educational inequality. Our model is a deep convolutional neural network that takes in approximately 10,500 satellite images and outputs educational indicators to produce interactive maps identifying areas that are most in need.

In our prototype, we processed the available survey datasets, which were collected by the World Bank. We were interested in household size and education expenditure as indicators for the Nigerian case study. Alongside the survey dataset, we sourced additional data through web scraping to obtain the school coordinates in Nigeria. In the preprocessing step, we generated clusters from the education expenditure, household size, and school coordinates. We used these clusters to generate a 10-by-10-kilometre bounding box, from which we randomly sampled 50 images to decrease bias in our model. We labelled our training data according to the density of schools: 0 for regions with few schools, 1 for regions with an average number of schools, and 2 for regions with many schools.
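
As a rough illustration of the bounding-box and sampling step, the Python sketch below builds a 10-by-10-kilometre box around a cluster centre and draws 50 random image locations inside it. The function names, the example coordinates, and the flat-earth approximation of about 111.32 km per degree of latitude are assumptions made for this example, not details of Datallite's actual pipeline.

```python
import math
import random

KM_PER_DEG_LAT = 111.32  # approximate kilometres per degree of latitude (assumption)


def bounding_box(lat, lon, side_km=10.0):
    """Return (min_lat, min_lon, max_lat, max_lon) for a square box centred on (lat, lon)."""
    half = side_km / 2.0
    dlat = half / KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = half / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)


def sample_points(box, n=50, seed=None):
    """Draw n random (lat, lon) image centres, uniform in degrees, inside the box."""
    rng = random.Random(seed)
    min_lat, min_lon, max_lat, max_lon = box
    return [
        (rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon))
        for _ in range(n)
    ]


# Hypothetical cluster centre, roughly at Abuja, Nigeria.
box = bounding_box(9.08, 7.40)
points = sample_points(box, n=50, seed=0)
```

Each sampled point could then serve as the centre of one satellite image request. Over a box this small, sampling uniformly in degrees is close to sampling uniformly in area.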

When it came to training our model, we divided the downloaded images into a training set and a validation set in an 80% to 20% ratio. The training labels were 0, 1 and 2, where 0 represents regions with a low number of schools, 1 regions with an average number of schools, and 2 regions with many schools. A Gaussian Mixture Model was used to find the thresholds between labels. We used the VGG16 model pre-trained on ImageNet and replaced the last layer with a linear layer. We achieved an accuracy of 80% on training and 79% on validation. We then removed the last layer (classifier) from our trained model and passed all validation images through it to obtain feature vectors.
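
To make the labelling idea concrete, here is a minimal sketch of how a three-component Gaussian mixture can turn a one-dimensional school-density value into the labels 0, 1 and 2. It uses a small hand-written EM loop in plain Python and synthetic data; the function names, the initialisation, and the data are assumptions for illustration, and the team's actual implementation is not described in this article.

```python
import math
import random


def fit_gmm_1d(values, k=3, iters=200):
    """Fit a k-component 1-D Gaussian mixture with EM.

    Returns a list of (weight, mean, variance) tuples sorted by mean.
    """
    xs = sorted(values)
    n = len(xs)
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    # Start the means at evenly spaced quantiles, with equal weights.
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [max(overall_var / k ** 2, 1e-6)] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate the weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-6)  # floor keeps the variance positive
            weights[j] = nj / n
    return sorted(zip(weights, means, variances), key=lambda c: c[1])


def density_label(x, components):
    """Index (0 = lowest mean) of the component most likely to have produced x."""
    scores = [
        w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(v)
        for w, m, v in components
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Synthetic school-density values drawn from three well-separated groups.
rng = random.Random(0)
densities = (
    [rng.gauss(2, 0.5) for _ in range(100)]
    + [rng.gauss(10, 1.0) for _ in range(100)]
    + [rng.gauss(25, 2.0) for _ in range(100)]
)
components = fit_gmm_1d(densities)
labels = [density_label(x, components) for x in (2, 10, 25)]
```

The implied thresholds are the density values at which the most likely component changes. A library implementation such as scikit-learn's GaussianMixture would be a more robust choice in practice.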

The basic architecture of the VGG16 model is depicted in the diagram below.

Figure 1: Image detailing the architecture of the VGG16 structure, a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”.³ The model achieves 92.7% top-5 test accuracy in ImageNet, which is a dataset of over 14 million images belonging to 1000 classes.³

The next step was to generate the feature vectors. This was accomplished by removing the last layer (classifier) from our trained model and passing all validation images through the model. The features were averaged on a cluster basis and used as a new feature to compare with actual education expenditures. Finally, we obtained the value of the predicted education expenditure per region as output. In order to limit access to sensitive information, the data was aggregated.
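
The cluster-level averaging can be illustrated with a few lines of plain Python. The sketch below uses toy two-dimensional vectors and a hypothetical function name; how the averaged features are then mapped to education expenditure is not specified in this article, so that step is not shown.

```python
from collections import defaultdict


def average_by_cluster(cluster_ids, features):
    """Average equal-length feature vectors over all images that share a cluster id."""
    sums = {}
    counts = defaultdict(int)
    for cid, vec in zip(cluster_ids, features):
        if cid not in sums:
            sums[cid] = [0.0] * len(vec)
        sums[cid] = [s + v for s, v in zip(sums[cid], vec)]
        counts[cid] += 1
    return {cid: [s / counts[cid] for s in total] for cid, total in sums.items()}


# Toy example: three images, two clusters, 2-d features instead of real CNN features.
cluster_ids = ["a", "a", "b"]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
cluster_features = average_by_cluster(cluster_ids, features)
```

Each cluster then has a single feature vector that can be lined up against the survey's education expenditure for the same cluster.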

FUTURE DIRECTIONS

Collecting educational datasets in Nigeria is obviously important, but for the Datallite project, it's just the start. We are able to expand Datallite’s Launchpad platform to evaluate other socioeconomic indicators in other regions of the world.

We look forward to applying the Datallite model to address data gaps in other regions. This would include adding datasets for satellite imagery specifying water sources and infrastructure to indicate water scarcity. Looking to the future, we would use Datallite’s model for risk assessment of areas affected by natural disasters such as floods or wildfires.

In the future, Datallite’s team looks to incorporate our vision through three main goals:

  1. Improve the image recognition features to better assess different infrastructures from the satellite images.
  2. Utilize natural language processing (NLP) to analyze the project descriptions that users input upon signing up to the Launchpad platform.
  3. Recommend datasets to be leveraged by users in their projects.

BOTTOM LINE

The importance of and need for fast, inexpensive public data continue to grow daily. It is our hope that Datallite is moving toward closing data gaps worldwide. Our prototype reached 79% validation accuracy at this early stage, and we aim to expand and adapt it to fill data gaps in other regions.

We are Datallite and we provide data where it's needed most!

ACKNOWLEDGMENTS

Datallite was made possible by the incredible support we have received throughout our journey. We wish to acknowledge the AI4Good Lab for providing the platform through which we were able to come together and work towards an impactful solution.

We would like to thank our TA, Mohammed Nevid Fekri, whose guidance throughout the AI4Good Lab program was invaluable. Thank you to our mentors, Flynn Strathearn (MNP Technology), Ella Wilson (Borealis AI), Ankit Anand (DeepMind), Deval Pandya (Vector Institute), all our amazing lecturers, workshop facilitators, and the AI4Good coordinators for their support and guidance.

References

  1. United Nations Sustainable Development Goals
  2. The GRID3 Nigeria project
  3. VGG16 – Convolutional Network for Classification and Detection

CONTACT:

Email: datallite.ai@gmail.com

LinkedIn: https://www.linkedin.com/company/datallite/