
A novel approach to missing data (re)discovery
Datasets aggregated from different sources often have missing or incomplete values, which complicates data analysis and research. The proposed project aims to build a pipeline for recovering categorical, numerical and textual data collected from multiple sources. We focus on incomplete data such as geographical locations (cities, places, highways, coordinates), information sources (news websites, TV channels, articles) and measured features of celestial objects (meteorite masses and types).
The project description
Machine learning enthusiasts and AI specialists rely on large data sets when building their models. Whether they are looking for patterns in the data or creating predictive models, complete and clean data is crucial for good results. We chose this challenge to help the scientific community, environmentalists and space researchers computationally recover missing but crucial data entries in a number of NASA datasets.
What it does
Our project consists of four Jupyter notebooks that describe the process of data recovery and cleaning for the NASA datasets: (i) experiments on numerical correlations between features in the Meteorite Landings dataset; (ii) recovery of missing values (meteorite mass, location and type) using best practices of data imputation; (iii) identification of information sources from URL links; and (iv) a search for exact geographical locations given word descriptions in the Global Landslide Catalog.
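Two of the steps above, identifying an information source from a URL and imputing a missing mass, can be illustrated with a minimal sketch. This is not the code from the notebooks. It assumes that the "source" of a link is its host name, and it uses a per-type median as a simple stand-in for whichever imputation method the notebooks apply. The function names and the dictionary keys "type" and "mass" are hypothetical and chosen only for illustration.

```python
from statistics import median
from urllib.parse import urlparse


def source_from_url(url):
    """Return the host name of a URL without a leading 'www.', or None.

    Assumption: the information source is identified by the host name alone.
    """
    host = urlparse(url).hostname  # None when the URL has no network location
    if host is None:
        return None
    return host[4:] if host.startswith("www.") else host


def impute_mass_by_type(rows):
    """Fill missing 'mass' values with the median mass of the same 'type'.

    'rows' is a list of dicts with the hypothetical keys 'type' and 'mass'.
    The input is not modified; a new list of dicts is returned.
    """
    known = {}
    for row in rows:
        if row["mass"] is not None:
            known.setdefault(row["type"], []).append(row["mass"])
    medians = {kind: median(values) for kind, values in known.items()}

    filled = []
    for row in rows:
        mass = row["mass"]
        if mass is None:
            # Stays None when no mass is known for this type.
            mass = medians.get(row["type"])
        filled.append({**row, "mass": mass})
    return filled
```

A per-type median is only one of many possible imputation strategies. It is used here because it is robust to the very heavy specimens that would skew a mean, and because it needs nothing beyond the standard library.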
Details
NASA resources
These data sets are available at data.nasa.gov and were chosen because they fit our requirements: (i) they are combined from various sources at different timescales and (ii) they have many missing or incomplete data entries.
Future plans
Our project needs more time to complete, so we are concentrating all our effort on getting the best performance for a number of features in the data. We are aiming for quality of performance rather than breadth. Our plans include making a unified framework with
Built with
Jupyter notebooks and Python 3.6 with various open-source libraries
Try it out
Our project is available on GitHub: https://github.com/klyshko/NASA-hackathon