Project Details

The Challenge | Chasers of the Lost Data

Help find ways to improve the performance of machine learning and predictive models by filling in gaps in the datasets prior to model training. This entails finding methods to computationally recover or approximate data that is missing due to sensor issues or signal noise that compromises experimental data collection. This work is inspired by data collection during additive manufacturing (AM) processes where sensors capture build characteristics in-situ, but it has applications across many NASA domains.

D-1000 Augementor

A Machine Learning approach to generate discrete as well as continuous data

Problem Statement


  • Current data imputation techniques rely heavily on statistical approaches and on predicting a single missing value at a time. We explore a multi-output predictor that fills in several missing values at once, and show that for features that depend on each other, which is most often the case, this is the most resilient approach.
  • We also open-source a package that bundles many established and recent data imputation techniques for the community.
  • The dataset used in this project is the NASA Facilities dataset. Any dataset, of any dimension and number of samples, with or without a target column, can be used with our package.
  • Open-source code

Machine Learning based approaches for missing value generation

Supervised Discrete value generation

  • If the missing value to be predicted is categorical, we build multi-input, multi-output classifiers from the available data: a support vector classifier, a random forest classifier, and a custom neural network classifier.
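
As a minimal sketch of the multi-output idea, the example below fills two categorical columns at once with scikit-learn's RandomForestClassifier, which accepts a two-dimensional target. The toy data, the column roles, and the choice of 100 trees are assumptions made for illustration; this is not the project's own code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy data (assumed): 3 observed numeric features, 2 categorical columns to recover.
X = rng.normal(size=(200, 3))
y_a = (X[:, 0] + X[:, 1] > 0).astype(int)                    # binary column
y_b = (X[:, 2] > 0).astype(int) + (X[:, 0] > 1).astype(int)  # column with classes 0, 1, 2
Y = np.column_stack([y_a, y_b])

# Simulate rows in which both categorical columns are missing.
missing = np.zeros(len(X), dtype=bool)
missing[:40] = True

# Fit on the complete rows only; the 2-D target makes this a multi-output classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[~missing], Y[~missing])

# Predict both missing columns in a single call and write them back.
Y_filled = Y.copy()
Y_filled[missing] = clf.predict(X[missing])

# Because the true values are known in this toy setup, accuracy can be checked per column.
accuracy_per_column = (Y_filled[missing] == Y[missing]).mean(axis=0)
```

Predicting the columns jointly keeps one model per group of related columns, instead of one model per column.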

Supervised Continuous value generation

  • If the missing value to be predicted is continuous, we build multi-input, multi-output regression models: support vector regression, a random forest regressor, and a custom neural network regressor whose last layer has a dimension equal to the number of missing values to predict.
  • The neural network architecture has 7 hidden layers.
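
The regressor described above can be approximated with scikit-learn's MLPRegressor, which sizes its output layer to the number of target columns. The sketch below uses 7 hidden layers as stated; the layer widths, the feature scaling step, and the toy data are assumptions, and the project's custom network may differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data (assumed): 4 observed features, 2 continuous columns to recover.
X = rng.normal(size=(300, 4))
Y = np.column_stack([
    X[:, 0] + 0.5 * X[:, 1],
    X[:, 2] - X[:, 3],
]) + 0.05 * rng.normal(size=(300, 2))

# Simulate rows in which both continuous columns are missing.
missing = np.zeros(len(X), dtype=bool)
missing[:50] = True

# Scale inputs using statistics from the complete rows only.
scaler = StandardScaler().fit(X[~missing])

# 7 hidden layers; the widths are illustrative guesses, not taken from the project.
model = MLPRegressor(
    hidden_layer_sizes=(64, 64, 32, 32, 16, 16, 8),
    max_iter=500,
    random_state=0,
)
model.fit(scaler.transform(X[~missing]), Y[~missing])

# One forward pass yields every missing value for each incomplete row.
Y_pred = model.predict(scaler.transform(X[missing]))
```

Here the final weight matrix has shape (8, 2): the last hidden layer feeds an output layer with one unit per missing column.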

Unsupervised data generation

  • Missing data is also generated using hierarchical clustering with a cosine similarity metric.
  • Missing data is also generated using Multiple Imputation by Chained Equations (MICE).
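
One way to sketch chained-equation imputation is with scikit-learn's IterativeImputer, which models each incomplete column as a function of the others in round-robin fashion. Drawing several imputations with sample_posterior=True and averaging them is an assumption about how the multiple imputations are combined; the toy data and the 15% missing rate are also illustrative.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Toy data (assumed): three strongly correlated columns.
base = rng.normal(size=(200, 1))
X_true = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    -base + 0.1 * rng.normal(size=(200, 1)),
])

# Knock out about 15% of the entries in the last two columns.
mask = rng.random(X_true.shape) < 0.15
mask[:, 0] = False
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Each run regresses every incomplete column on the others, cycling for max_iter rounds.
# sample_posterior=True samples from the predictive distribution, so different seeds
# give different plausible completions of the same table.
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(5)
]

# Pool the completed tables by averaging them.
X_imputed = np.mean(imputations, axis=0)
```

Observed entries pass through unchanged; only the positions marked in mask are replaced.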

Statistical approaches

  • Mean, median, and most-frequent imputation
  • Synthetic Minority Over-sampling Technique (SMOTE)
  • Adaptive Synthetic Sampling Approach (ADASYN)
  • Categorical and continuous columns are detected automatically. A pie chart is plotted for categorical data and a histogram for continuous data.
  • Plotting the distribution of the data is automated.
  • Class imbalance is detected automatically based on a threshold.
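
The simpler items in this list can be illustrated with the standard library alone. The sketch below is hypothetical: the function names, the max_unique cut-off used to separate categorical from continuous columns, and the 0.2 imbalance threshold are all assumptions, since the package's actual rules and threshold are not stated here.

```python
from collections import Counter
from statistics import mean, median


def is_categorical(values, max_unique=10):
    """Assumed heuristic: strings, or at most max_unique distinct values, are categorical."""
    observed = [v for v in values if v is not None]
    if any(isinstance(v, str) for v in observed):
        return True
    return len(set(observed)) <= max_unique


def impute(values, strategy):
    """Replace None entries with the mean, median, or most frequent observed value."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = Counter(observed).most_common(1)[0][0]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def is_imbalanced(labels, threshold=0.2):
    """Flag imbalance when the rarest class holds less than threshold of the labels."""
    counts = Counter(label for label in labels if label is not None)
    return min(counts.values()) / sum(counts.values()) < threshold


temperatures = [20.0, None, 22.0, 21.0, 30.0]
status = ["ok", "ok", None, "fail", "ok"]

# Pick the strategy from the detected column type.
filled_temperatures = impute(
    temperatures,
    "most_frequent" if is_categorical(temperatures, max_unique=3) else "mean",
)
filled_status = impute(status, "most_frequent" if is_categorical(status) else "mean")
```

With these inputs the temperature gap is filled with the observed mean, 23.25, and the status gap with "ok". The same type check could decide between a pie chart and a histogram when plotting.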