Project Details

Awards & Nominations

NVG has received the following awards and nominations. Way to go!

Local Peoples' Choice Winner

The Challenge | Chasers of the Lost Data

Help find ways to improve the performance of machine learning and predictive models by filling in gaps in the datasets prior to model training. This entails finding methods to computationally recover or approximate data that is missing due to sensor issues or signal noise that compromises experimental data collection. This work is inspired by data collection during additive manufacturing (AM) processes where sensors capture build characteristics in-situ, but it has applications across many NASA domains.

Pie.py

We artificially multiplied the numerical data we had using approximation methods and by adding small amounts of noise. We then trained a neural network on the resulting data to fill in the gaps.

NVG

We are fond of data science and machine learning, which is why we chose the Challenge "Chasers of the Lost Data". This Challenge is a great opportunity to work with real-world data and face some of the difficulties that many data scientists meet. The missing-data problem is a real challenge not only for data science but for science in general, as it hurts statistics and statistical computing. In everyday life, GPS fully depends on the quantity and quality of its input data; at a scientific scale, in the NASA context, the same applies to areas such as astrostatistics and Geographic Information Systems. It can be crucial to predict the trajectory of a near-Earth meteor when sensors can only provide inaccurate or incomplete data.

What is our solution?

So, we had to find a way to recover lost data, and we thought: why couldn't a neural network do it for us? All we had to do was train it to recover these data. But to train a neural network we need a lot of data; for example, NASA's Near-Earth Comets dataset has only 160 rows, and some of the values are missing. First, we applied the KNN imputation algorithm. Then we cloned the data, added some noise to the copy, concatenated the arrays, and repeated this three times. After that, we applied imputation algorithms with the mode, mean, and median strategies, injected noise, and cloned again. Thus, from a 160-row dataset we ended up with one of about 22,000 rows. The key point is that the resulting rows are not simply the original 160 repeated: after this chain of operations we have many synthesized rows, yet their similarity to the originals keeps them close to real-world conditions. Of course, there is a small error, but these data are now ready to be fed to the neural network. All we want is for Pie to get some idea of what is going on in the dataset and which dependencies it contains, and that becomes possible once we have enough data.
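Below is a minimal sketch of how such an augmentation chain might look. It uses scikit-learn's KNNImputer and SimpleImputer as stand-ins for the impyute calls used in the repository; the file name, noise scale, and repetition counts are illustrative assumptions (the exact counts that produce the ~22,000 rows are not spelled out here).

```python
# Sketch of the augmentation chain: impute, clone with noise, concatenate.
# File name, noise scale, and repetition counts are assumptions.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

df = pd.read_csv("near_earth_comets.csv")            # CSV export of the NASA dataset
data = df.select_dtypes(include="number").to_numpy(dtype=float)

def noisy_clone(arr, scale=0.01):
    """Copy of `arr` with small relative Gaussian noise added."""
    return arr + rng.normal(0.0, scale, arr.shape) * np.abs(arr)

# Step 1: fill the original gaps with KNN imputation, then clone with
# noise and concatenate, repeated three times (each pass doubles the rows).
augmented = KNNImputer(n_neighbors=5).fit_transform(data)
for _ in range(3):
    augmented = np.concatenate([augmented, noisy_clone(augmented)])

# Step 2: impute the original data with mean / median / mode strategies,
# inject noise, clone, and append those rows as well.
for strategy in ("mean", "median", "most_frequent"):
    filled = SimpleImputer(strategy=strategy).fit_transform(data)
    for _ in range(3):
        filled = np.concatenate([filled, noisy_clone(filled)])
    augmented = np.concatenate([augmented, filled])

print(data.shape, "->", augmented.shape)
```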

The network is designed to fill in the missing data. One element of each input array is chosen at random and replaced with zero – these masked arrays are our features (the input), and the original values are our target. Although we train the network as if only one feature is missing, we could replace the last layer, which has a single output channel, with one that has two and, freezing the other layers, train only the weights of that new layer, which provides a tool for open-source transfer learning.
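The following Keras sketch illustrates one way to implement this setup, reading the description as: the input is a row with one randomly chosen element zeroed out, and the single output channel predicts that element's original value. The layer sizes, optimizer, epoch count, and the stand-in array for the augmented table are all assumptions rather than the exact configuration from the repository.

```python
# Sketch of the reconstruction network; layer sizes and training settings
# are assumptions, not the exact configuration from the repository.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(0)

# Stand-in for the augmented numeric table built in the previous sketch.
rows = rng.normal(size=(22000, 11)).astype("float32")

def make_masked_samples(rows):
    """Zero out one random element per row (features) and keep its
    original value as the target."""
    x = rows.copy()
    idx = rng.integers(0, rows.shape[1], size=rows.shape[0])
    y = rows[np.arange(rows.shape[0]), idx].copy()
    x[np.arange(rows.shape[0]), idx] = 0.0
    return x, y

x_train, y_train = make_masked_samples(rows)

inputs = keras.Input(shape=(rows.shape[1],))
h = layers.Dense(64, activation="relu")(inputs)
h = layers.Dense(64, activation="relu")(h)
outputs = layers.Dense(1)(h)          # single output channel: the recovered value
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=0.1)

# Transfer learning as described: freeze the trained layers and attach a
# new head with two outputs to recover two missing values at once.
for layer in model.layers:
    layer.trainable = False
two_head = layers.Dense(2)(h)
two_missing_model = keras.Model(inputs, two_head)
two_missing_model.compile(optimizer="adam", loss="mse")
```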

Unfortunately, it took us 12 hours to train the network, but its performance is 90%. It is important to note that the dependencies are learned for a particular set of features; in other words, you cannot use a model trained on the comets data to predict landslides. That is why a list of available networks trained on particular sensors should be provided.

We used the Near-Earth Comets dataset by NASA (https://data.nasa.gov/Space-Science/Near-Earth-Comets-Orbital-Elements/b67r-rgxc). We chose this particular dataset because its significant values are numerical, so we worked on the easiest case of our study to check whether the approach works and whether it can be applied to other, more complicated datasets.
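As a quick illustration (the file name is an assumption, referring to the CSV export of the dataset linked above), the numeric columns and the gaps they contain can be inspected with pandas before any augmentation:

```python
# Inspect the dataset: keep the numeric columns and count missing values.
import pandas as pd

df = pd.read_csv("Near-Earth_Comets_-_Orbital_Elements.csv")  # assumed file name
numeric = df.select_dtypes(include="number")
print(numeric.shape)            # about 160 rows of orbital elements
print(numeric.isna().sum())     # missing values per column
```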

The project could be improved by adding visualization of the data (plots) and support for more diverse types of data (text, images).

Built with: Python
Libraries used: numpy, pandas, matplotlib, impyute, sklearn, keras.
Software: Jupyter Notebook

The repository of our project: https://github.com/lonagi/space-app-19_10. It also contains the PowerPoint presentation of our project with plots and tables.

#python #comets #artificial_intelligence #machine_learning #missing_data