Project Details

The Challenge | Chasers of the Lost Data

Help find ways to improve the performance of machine learning and predictive models by filling in gaps in the datasets prior to model training. This entails finding methods to computationally recover or approximate data that is missing due to sensor issues or signal noise that compromises experimental data collection. This work is inspired by data collection during additive manufacturing (AM) processes where sensors capture build characteristics in-situ, but it has applications across many NASA domains.

HiatusFinder

Approximating the lost data in large datasets to improve the performance of machine learning predictive models, or the usage of data in general.

1. Introduction

Machine learning (ML) and artificial intelligence (AI) now reach into almost every field of our lives. They can open new horizons for how scientists and engineers make use of experimental data.

ML/AI can also help draw conclusions from very large datasets, which in turn can be used to validate physics-based modeling or run autonomy research to find patterns that were not detected before.

A comprehensive dataset is a very important component of ML and data-driven modeling. From it we can extract numerous features, possibly even hundreds, and the model learns by being trained to predict the next step from those features.

2. Problem

Most ML algorithms need large amounts of data. That data is most often collected with different kinds of sensors during experiments or modelling processes.

Nevertheless, errors occur and some of the data is lost during collection. This may happen because unmonitored sensors fail while measuring, because of limitations in the sensors themselves, or because environmental noise in the signals compromises the experimental data.

The consequence of losing data during collection is a dataset with gaps. Researchers would rather avoid this kind of dataset, because it may lead to misleading conclusions when the data is interpreted or when a machine learning model is applied to it.

3. Objective

The core of our work is to find an appropriate and effective method to approximate the values in those gaps. The method should be consistent, applicable to different types of data from different types of sensors, and usable on large datasets.

We also aim to build a machine learning model that evaluates the recovery method by comparing the model's performance before and after the method is applied to the dataset.
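One way to picture this before-and-after comparison is the sketch below. It is only an illustration under stated assumptions, not the project's evaluation code: the helper name `compare_before_after` is hypothetical, the synthetic data is made up for the demo, the "before" baseline is assumed to drop incomplete rows, and scikit-learn's `SimpleImputer` stands in for whatever recovery method is being evaluated.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def compare_before_after(X_train, y_train, X_test, y_test, impute):
    """Return (R^2 with incomplete rows dropped, R^2 with gaps filled by `impute`).

    `impute` is any callable mapping an array with NaNs to a filled array.
    """
    # "Before": assume the only option is to discard rows that contain a gap.
    complete = ~np.isnan(X_train).any(axis=1)
    before = LinearRegression().fit(X_train[complete], y_train[complete])
    # "After": keep every row, with the gaps filled by the recovery method.
    after = LinearRegression().fit(impute(X_train), y_train)
    return (
        r2_score(y_test, before.predict(X_test)),
        r2_score(y_test, after.predict(X_test)),
    )


# Synthetic stand-in data (an assumption for this demo, not project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)
X_train, X_test = X[:200].copy(), X[200:]
y_train, y_test = y[:200], y[200:]
X_train[rng.random(X_train.shape) < 0.2] = np.nan  # lose about 20% of training values

score_before, score_after = compare_before_after(
    X_train, y_train, X_test, y_test,
    SimpleImputer(strategy="mean").fit_transform,  # placeholder recovery method
)
print(f"R^2 before: {score_before:.3f}, after: {score_after:.3f}")
```

Whether the second score beats the first depends on the recovery method and on the data, which is exactly what such a comparison is meant to reveal.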

4. Methods

4.1. Machine Learning Model
Buck [1], in a paper published in 1960, described a linear regression method for estimating statistical parameters from multivariate data with missing values. We used this method and extended it into an iterative procedure: an open-ended linear regression loop in which each pass regresses on a variable chosen at random.
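The iterative regression idea described above might look roughly like the sketch below. This is one interpretation, not the code from the project repository: the function name `iterative_regression_impute`, the fixed `n_iter` cap in place of an unbounded loop, the column-mean starting values, and the choice to pick a random column that has gaps as the regression target are all assumptions made for illustration. It also assumes every column has at least some observed values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def iterative_regression_impute(X, n_iter=50, seed=0):
    """Fill NaN entries of a 2-D numeric array by repeated linear regression."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means so every regression sees complete inputs.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    cols_with_gaps = np.flatnonzero(missing.any(axis=0))
    if cols_with_gaps.size == 0:
        return filled
    rng = np.random.default_rng(seed)
    for _ in range(n_iter):
        # Pick one incomplete column at random and treat it as the target.
        j = rng.choice(cols_with_gaps)
        rows = missing[:, j]
        others = np.delete(filled, j, axis=1)
        # Fit on rows where the target was observed, then predict its gaps
        # from the current estimates of the other columns.
        model = LinearRegression().fit(others[~rows], filled[~rows, j])
        filled[rows, j] = model.predict(others[rows])
    return filled


# Small usage example: the second column is 2 * first + 1, with one gap.
demo = np.column_stack([np.arange(10.0), 2 * np.arange(10.0) + 1])
demo[3, 1] = np.nan
recovered = iterative_regression_impute(demo, n_iter=5)
```

Capping the number of passes keeps the sketch terminating; a convergence check on the change in the filled values would be another reasonable stopping rule.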

4.2. Tools and Packages
- scikit-learn and NumPy (Python) for the primary machine learning model.
- Matplotlib for data visualization.
- pandas for data munging and preparation.

5. Challenges Faced

Although the method can fill gaps in any numerical dataset, it is less accurate in some cases:
- non-linear numeric data, although the result may still be accurate enough;
- categorical datasets, because the method depends on numerical statistical analysis.
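The non-linear limitation can be illustrated with a toy comparison. The sketch below is an assumption-laden demo rather than a result from the project: the two synthetic signals, the "every fifth reading is lost" pattern, and the helper name `fill_error` are all invented for illustration. It fills the same gaps in a straight-line signal and in a curved one using plain linear regression, then measures the root-mean-square error against the true values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(-3, 3, 61).reshape(-1, 1)
y_linear = 2 * x.ravel() + 1   # a relationship linear regression can capture
y_curved = x.ravel() ** 2      # a relationship it cannot capture

hidden = np.zeros(len(x), dtype=bool)
hidden[::5] = True             # pretend every fifth reading was lost


def fill_error(y):
    """RMSE of a linear-regression fill on the hidden readings."""
    model = LinearRegression().fit(x[~hidden], y[~hidden])
    residual = model.predict(x[hidden]) - y[hidden]
    return np.sqrt(np.mean(residual ** 2))


err_linear = fill_error(y_linear)  # essentially zero
err_curved = fill_error(y_curved)  # large: the fitted line misses the curvature
```

On the curved signal the fitted line is nearly flat, so every gap receives about the same value regardless of where it falls.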

6. Source Code and Documentation

https://github.com/MKMousa/NasaSA

7. Future Improvements

We are developing a more sophisticated method for approximating lost data, using feed-forward neural networks and TensorFlow for deep learning, so that it can handle non-linear and categorical data.

8. Resources

[1] S. F. Buck, "A Method of Estimation of Missing Values in Multivariate Data Suitable for Use with an Electronic Computer," Journal of the Royal Statistical Society, Series B (Methodological), Vol. 22, No. 2 (1960), pp. 302-306.
[2] https://scikit-learn.org/stable/modules/generated/...
[3] https://scikit-learn.org/stable/modules/linear_mod...
[4] https://catalog.data.gov/dataset/meteorite-landing...
[5] https://ai.googleblog.com/2019/08/bi-tempered-logi...