Project Details

Awards & Nominations

TGR has received the following awards and nominations.

Global Nominee

The Challenge | Chasers of the Lost Data

Help find ways to improve the performance of machine learning and predictive models by filling in gaps in the datasets prior to model training. This entails finding methods to computationally recover or approximate data that is missing due to sensor issues or signal noise that compromises experimental data collection. This work is inspired by data collection during additive manufacturing (AM) processes where sensors capture build characteristics in-situ, but it has applications across many NASA domains.

Landslider

Landslider is an AI-powered software solution that reconstructs lost landslide data and tries to predict where future landslides could happen.

TGR

THE TEAM


TGR consists of two senior students from the Electronic Systems school in Sofia (associated with the Technical University of Sofia). Our studies specialize in System Programming and Computer Networks, but our interests lie in algorithms and machine learning, so we used our personally acquired skills to tackle the Chasers of the Lost Data challenge.

Our roles in developing this project:

Evgeni Dimov - Researching Algorithms, Developing the Solution, Presenter
Boris Dermendzhiev - Researching Algorithms, Developing the Solution


Background


The Global Landslide Catalog (GLC) was developed with the goal of identifying rainfall-triggered landslide events around the world, regardless of size, impact, or location. The GLC considers all types of mass movements triggered by rainfall that have been reported in the media, disaster databases, scientific reports, or other sources. The GLC has been compiled since 2007 at NASA Goddard Space Flight Center.


OUR IDEA


In machine learning and predictive modeling, performance depends heavily on the data the algorithm is fed. We wanted a way both to show that our data reconstruction has a positive impact on machine learning performance and to put the data to a great cause. We came up with "Landslider" - a project that tries to predict future landslides.


SOLUTION


Since this is a machine-learning-driven problem, we took a machine learning approach to solving it.

We took the following steps:
  1. Data Analysis
  2. Data Cleanup
  3. Data Reconstruction
  4. Model Training
  5. Hyperparameter tuning

Data Analysis


We explored the dataset until we understood each and every part of it and had insight into which information is meaningful to us (i.e., usable for the predictive modeling of future landslides, or reconstructable) and which is not.
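
A minimal sketch of this exploration step with pandas; the file name and the landslide_size column are our assumptions based on the public GLC export, not fixed parts of the solution:

    import pandas as pd

    # Assumed file name for the GLC export downloaded from data.gov.
    df = pd.read_csv("Global_Landslide_Catalog.csv")

    print(df.shape)                                      # how many records and columns
    df.info()                                            # column names and types
    print(df.isna().sum().sort_values(ascending=False))  # which columns have missing values
    print(df["landslide_size"].value_counts())           # assumed column: class balance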


Data Cleanup


Since the dataset includes a lot of irrelevant information, we simply get rid of it. Examples include database IDs, links to articles about the landslides, information about the people who uploaded those articles, the date/time a record was created or last edited, information about the record's editor/creator, columns duplicating other columns in a different format, country codes, and so on.
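
A hedged sketch of the cleanup, with illustrative column names rather than the exact ones from the GLC export:

    # The names below are illustrative; the real export uses its own column names.
    irrelevant = [
        "event_id",          # database ID
        "source_link",       # link to the article reporting the landslide
        "submitted_by",      # who uploaded the article
        "created_date",      # when the record was created
        "last_edited_date",  # when the record was last edited
        "country_code",      # duplicates the country name in a different format
    ]
    df = df.drop(columns=[c for c in irrelevant if c in df.columns])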


Data Reconstruction


Traditional methods of imputing missing data rely on statistics. Simple strategies fill the gaps with mean/mode values or constants, but they are not very reliable and can introduce bias into both the resulting statistics and the predictive modeling performance. We instead used a machine learning approach to impute the missing values in the dataset: the K-Nearest Neighbours (KNN) algorithm for categorical values and regression for numeric values (see the sketch after the figures below).


KNN is a classification algorithm that treats data samples as points in space. To classify a new sample, we look at the classes of its k nearest points and assign the most common one.

[Figure: KNN visualization]

[Figure: Regression visualization]
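
A minimal sketch of both imputation strategies with scikit-learn; the helper function, the choice of k, and the column names are illustrative assumptions, not fixed parts of the solution:

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.neighbors import KNeighborsClassifier

    def impute_categorical(df, target, features, k=5):
        """Fill a categorical column's gaps with a KNN classifier trained on
        the rows where that column is present (features must be numeric and
        complete)."""
        known = df[df[target].notna()]
        missing = df[df[target].isna()]
        if not missing.empty:
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(known[features], known[target])
            df.loc[missing.index, target] = knn.predict(missing[features])
        return df

    # Assumed column names.
    df = impute_categorical(df, "landslide_size", ["latitude", "longitude"])

    # Numeric gaps: round-robin regression imputation over the numeric columns.
    numeric_cols = ["latitude", "longitude", "fatality_count"]  # assumed names
    df[numeric_cols] = IterativeImputer(random_state=0).fit_transform(df[numeric_cols])

IterativeImputer models each numeric column as a regression on the others, which matches the "regression for numeric values" strategy described above.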


Model Training & Hyper-parameter tuning


Landslider predicts the severity of the landslide, its type, and its location (latitude, longitude). For predicting latitude and longitude we used regression, and for severity and type we used Support Vector Machines (SVM). To find the best algorithm and hyperparameters, we used Stratified K-Fold validation and Grid Search Cross-Validation. Since the dataset is relatively small (tens of thousands of records), we did not use neural networks, even though they would be a more elegant solution.
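
A hedged sketch of the model-selection step for one of the classification targets; the parameter grid and the X, y variables are illustrative:

    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    # Illustrative grid, not the exact one we searched.
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"], "gamma": ["scale", "auto"]}

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="f1_macro")
    search.fit(X, y)  # X: cleaned/reconstructed features, y: e.g. severity labels

    print(search.best_params_, search.best_score_)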

We trained our models on both the reconstructed dataset and the original dataset (discarding the entries with missing information). We observed a slight improvement in our results, measured with several metrics: F1 score, R2 score, mean squared error, and the confusion matrix. Since the share of missing values in this particular dataset is relatively low, the improvement is small, but it certainly exists; applying the same algorithms to a lossier, noisier dataset should yield a much larger improvement.
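
The comparison itself boils down to computing the same metrics for a model trained on each variant of the dataset; a sketch, assuming y_test holds the held-out labels and pred_orig/pred_recon hold the two models' predictions:

    from sklearn.metrics import confusion_matrix, f1_score, mean_squared_error, r2_score

    # Classification targets (severity, type): F1 score and confusion matrix.
    for name, pred in [("original", pred_orig), ("reconstructed", pred_recon)]:
        print(name, "F1:", f1_score(y_test, pred, average="macro"))
        print(confusion_matrix(y_test, pred))

    # Regression targets (latitude, longitude): R2 score and mean squared error,
    # with reg_true/reg_pred assumed to hold regression labels and predictions.
    print("R2:", r2_score(reg_true, reg_pred))
    print("MSE:", mean_squared_error(reg_true, reg_pred))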


Future development


We would like to further enhance our machine learning models by feeding more data through them (e.g., they should take into consideration when and which landslide prevention measures were carried out). Furthermore, we would like to deploy our solution as a continuously running service, fed with real-time data, that raises an alarm as soon as a potential threat is predicted.


Resources & Links


Our Code Repository: https://github.com/GenchoBG/ChasersOfTheLostData
Dataset we used: https://catalog.data.gov/dataset/global-landslide-catalog
Information about algorithms we used: https://scikit-learn.org
Data operations (exploration, analysis, clean-up): https://pandas.pydata.org/