Project Details

The Challenge | Chasers of the Lost Data

Help find ways to improve the performance of machine learning and predictive models by filling in gaps in the datasets prior to model training. This entails finding methods to computationally recover or approximate data that is missing due to sensor issues or signal noise that compromises experimental data collection. This work is inspired by data collection during additive manufacturing (AM) processes where sensors capture build characteristics in-situ, but it has applications across many NASA domains.

A novel approach to missing data (re)discovery

The proposed project aims to build a pipeline for recovering various kinds of data – categorical, numerical, and textual – collected from multiple sources.

Inglorious Basterds

Datasets aggregated from different sources can have missing or incomplete values, which makes data analysis and research difficult. The proposed project aims to build a pipeline for recovering various kinds of data – categorical, numerical, and textual – collected from multiple sources. In this project we focus on incomplete data such as geographical locations (cities, places, highways, coordinates), information sources (news websites, TV channels, articles), and measured features of celestial objects (meteorite masses and types).

Project description


Background

Machine-learning enthusiasts and AI specialists rely on large datasets when building their models. Whether they are looking for patterns in the data or creating predictive models, complete and clean data is crucial for good results. We chose this challenge to help the scientific community, environmentalists, and space researchers computationally recover missing yet crucial data entries in a number of NASA datasets.

What it does

Our project consists of four Jupyter notebooks that describe the process of data recovery and cleaning for the NASA datasets: (i) different experiments on numerical correlations of features in the Meteorite Landings dataset; (ii) recovery of missing values (meteorite mass, location, and type) using best practices of data imputation; (iii) identifying information sources from URL links; and (iv) searching for exact geographical locations given word descriptions in the Global Landslide Catalog.

Details

1. The file Meteorite_Landing -- Experimental data proccessing.ipynb contains various data exploration techniques, such as distributions, inter-feature correlations, and pairplots (a minimal exploration sketch follows this list).
2. In the file Meteorite_Landing -- filling the data.ipynb we apply algorithms for missing-data imputation for a number of features (meteorite mass and type), such as mean/median/mode imputation, kNN, regression, and MICE (see the imputation sketch below).
3. The file Landslides -- URL2source.ipynb contains code that fills in the missing source name given a link to the source. For example, the function takes the link www.nytimes.com/2019/10/20/us/fort-worth-shooting-jefferson-dean.html as an argument and returns the name of the source, "The New York Times" (see the URL-to-source sketch below).
4. The file Landslides -- determine country by nearest place description.ipynb interprets a word description of a place, such as "rt 20, just west of Diablo rd, near Newhalem, WA", and returns the country of that geographical location ("United States of America") for the Global Landslide Catalog (see the geocoding sketch below).
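
To make item 1 concrete, here is a minimal sketch of the kind of exploration it describes. It assumes a local copy of the Meteorite Landings CSV and column names from the public schema ("mass (g)", "year", "reclat", "reclong"); the file path and column selection are illustrative, not the notebook's actual code.

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Hypothetical local copy of the Meteorite Landings CSV
    df = pd.read_csv("Meteorite_Landings.csv")
    numeric = df[["mass (g)", "year", "reclat", "reclong"]]

    # Distribution of meteorite masses (log scale exposes the heavy tail)
    plt.hist(np.log10(numeric["mass (g)"].dropna() + 1), bins=50)
    plt.xlabel("log10 mass (g)")
    plt.show()

    # Inter-feature correlations and pairwise scatter plots
    print(numeric.corr())
    sns.pairplot(numeric.dropna())
    plt.show()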
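
For item 2, the following sketch shows scikit-learn equivalents of the listed imputation strategies (mean/median/mode, kNN, and MICE-style iterative regression). The notebook itself may use different libraries or parameters; column names again follow the public Meteorite Landings schema.

    import pandas as pd
    from sklearn.impute import SimpleImputer, KNNImputer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.read_csv("Meteorite_Landings.csv")  # hypothetical local copy
    features = df[["mass (g)", "year", "reclat", "reclong"]]

    # 1. Simple statistics: replace gaps with the column mean / median / mode
    mean_filled = SimpleImputer(strategy="mean").fit_transform(features)
    mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(features)

    # 2. k-nearest neighbours: fill a gap from the k most similar complete rows
    knn_filled = KNNImputer(n_neighbors=5).fit_transform(features)

    # 3. MICE-style iterative regression: model each incomplete feature as a
    #    function of the others and refine the fills over several rounds
    mice_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(features)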
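
Item 3 boils down to extracting the domain from a URL and mapping it to a human-readable name. A minimal sketch is below; the dictionary and the helper name url_to_source are illustrative stand-ins for whatever mapping the notebook actually builds.

    from urllib.parse import urlparse

    # Illustrative domain-to-name mapping; the real notebook covers more outlets
    SOURCES = {
        "nytimes.com": "The New York Times",
        "bbc.co.uk": "BBC",
        "cnn.com": "CNN",
    }

    def url_to_source(url):
        """Return a readable source name for a news link, or the bare domain."""
        if "//" not in url:  # urlparse needs a scheme to detect the host
            url = "http://" + url
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        return SOURCES.get(host, host)

    print(url_to_source("www.nytimes.com/2019/10/20/us/fort-worth-shooting-jefferson-dean.html"))
    # -> The New York Times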
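
Item 4 can be approached with any geocoding service. The sketch below uses the geopy library with the OpenStreetMap Nominatim backend as one possible choice (the notebook may use a different geocoder); the progressive-trimming heuristic and the user-agent string are assumptions for illustration.

    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent="nasa-hackathon-demo")  # illustrative user agent

    def description_to_country(description):
        # Free-text descriptions rarely geocode verbatim, so drop the most
        # specific leading parts ("rt 20, just west of Diablo rd, ...") until
        # something resolves.
        parts = [p.strip() for p in description.split(",")]
        for i in range(len(parts)):
            query = ", ".join(parts[i:])
            hit = geolocator.geocode(query, addressdetails=True)
            if hit is not None:
                return hit.raw.get("address", {}).get("country")
        return None

    print(description_to_country("rt 20, just west of Diablo rd, near Newhalem, WA"))
    # -> United States (the exact spelling depends on the geocoder)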

NASA resources

1. Global Landslide Catalog (https://data.nasa.gov/api/views/9ns5-uuif/rows.csv...)
2. Meteorite Landings dataset (https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv...)

These datasets are available at data.nasa.gov and were chosen because they fit our requirements: (i) they are combined from various sources at different timescales, and (ii) they have many missing or incomplete data entries.

Future plans

Our project requires more time for completion, so we concentrated our effort on achieving the best possible performance for a number of features in the data, aiming for quality rather than breadth. Our plans include building a unified framework for this kind of data recovery.

Built with

Jupyter notebooks and Python 3.6 with various open-source libraries

Try it out

Our project is available on GitHub: https://github.com/klyshko/NASA-hackathon