Project Details

The Challenge | Chasers of the Lost Data

Help find ways to improve the performance of machine learning and predictive models by filling in gaps in the datasets prior to model training. This entails finding methods to computationally recover or approximate data that is missing due to sensor issues or signal noise that compromises experimental data collection. This work is inspired by data collection during additive manufacturing (AM) processes where sensors capture build characteristics in-situ, but it has applications across many NASA domains.

dta-Driver

dta-Driver aims to drive away the data gaps that inevitably appear in data-driven applications.

Introduction and motivation

Data recovery and imputation have traditionally been handled by centralized computing clusters. As the sensors we deploy grow in number and variety, however, purely centralized processing no longer works well. We therefore want to validate a new approach: let every edge device recover and fill in its own data before sending it back to the computing cluster. This greatly reduces the load on the cluster, and it also makes it possible to build a dedicated processing model tailored to the characteristics of each sensor's data.

Data loss is a ubiquitous issue in data-driven modeling. We are interested in data loss recovery in the context of planetary exploration, which usually employs edge devices for data collection. Traditionally, these devices do not handle data loss themselves: data is simply collected by them and uploaded to central devices, where the loss is taken care of. The very same scenario is also found in day-to-day data-driven applications.

Note that this scenario becomes inefficient, if not impractical, in the following situations:

  1. the amount of collected data overwhelms the computing power of the central devices;
  2. certain data pre-processing is needed immediately after collection on the edge devices. For example, data compression may be required if transmission bandwidth or memory capacity is limited.

A possible example of the first case is a spacecraft that uses a sensor swarm (e.g., the NASA project OpGrav). As for the second case, the Mars Reconnaissance Orbiter (MRO) provides a good example: while its high-resolution camera HiRISE can generate an image of 16.4 Gb, its memory is only 28 Gb. In such a case, image compression is necessary.
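The memory constraint is worth making concrete. A back-of-the-envelope check using only the figures quoted above:

```python
# Back-of-the-envelope check with the HiRISE figures quoted above (both in Gb).
image_gb = 16.4    # one raw HiRISE image
memory_gb = 28.0   # MRO onboard memory

raw_images_that_fit = memory_gb / image_gb   # ≈ 1.71
# Fewer than two raw images fit in memory, hence the need for onboard compression.
```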


Solution

We would like to verify the plausibility of each edge device dealing with data loss on its own. We expect such an approach will

  1. make the allocation of computing resources more reasonable and flexible, and
  2. allow each edge device to establish a processing model dedicated to the data it collects.


Problems and Methods

We consider two different cases: time series data loss and visual data loss.

Visual Data Loss

Many common types of image glitches can be considered visual data loss; this includes noise, blur, missing or deteriorated areas, etc.

We target image denoising with the following approach. We first downsample the data (algorithm: bilinear interpolation), which filters out the high-frequency noise; we then use super-resolution (algorithm: ESPCN) to upscale the data back to its original size. Even on a device with limited computing power, our method reaches a state-of-the-art PSNR (peak signal-to-noise ratio) in about 350 ms. We also verified that models trained on the recovered data achieve the same performance as models trained on the original data.
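A minimal NumPy sketch of the downsample-then-upscale round trip. The actual pipeline uses bilinear interpolation and an ESPCN network; here 2x2 block averaging stands in for the bilinear downsampling and nearest-neighbour upscaling stands in for the learned reconstruction, just to show how the round trip suppresses noise as measured by PSNR:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB, the metric quoted in the text."""
    mse = np.mean((ref - test) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def downsample2x(img):
    """Average each 2x2 block: a crude stand-in for bilinear downsampling
    that likewise filters out high-frequency noise."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2x(img):
    """Nearest-neighbour upscaling: a stand-in for the ESPCN reconstruction."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

rng = np.random.default_rng(0)
# A smooth synthetic image plus Gaussian sensor noise.
clean = np.add.outer(np.linspace(0, 255, 64), np.linspace(0, 255, 64)) / 2
noisy = clean + rng.normal(0, 25, clean.shape)

recovered = upsample2x(downsample2x(noisy))
assert psnr(clean, recovered) > psnr(clean, noisy)  # the round trip improves PSNR
```

The same shape of pipeline applies with the real algorithms: averaging merely approximates the low-pass effect of bilinear downsampling, while the ESPCN network learns a far better upscaling than nearest-neighbour repetition.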


Time Series Data Loss

We use a recurrent neural network with long short-term memory (RNN with LSTM). During training, we assemble a matrix from several signal columns ordered by time stamps [t1, tn] together with one target-signal column, and feed it to the network as input. After a number of epochs, we obtain an imputed target-signal column over [t1-k, tn-k], where the parameter k selects how many time stamps back the prediction reaches.

We simulate missing data in two ways and thus consider two data loss patterns: random missing values and consecutive (fiber) missing values. The latter better characterizes practical situations such as sensor issues and communication loss. Our experiments show that in both cases our method imputes the missing data with reliable accuracy.
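The two simulated loss patterns can be sketched as boolean masks over a series. The names, sizes, and loss rates below are illustrative, not the actual experimental setup, and simple linear interpolation stands in as a baseline where the project uses the RNN-with-LSTM model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
series = np.sin(np.linspace(0.0, 6.28, n))  # a toy sensor signal

# Pattern 1: random missing values -- each sample dropped independently.
random_mask = rng.random(n) < 0.2            # ~20% of samples lost

# Pattern 2: consecutive (fiber) missing values -- one contiguous block lost,
# as when a sensor fails or communication drops out for a while.
fiber_mask = np.zeros(n, dtype=bool)
start = int(rng.integers(0, n - 20))
fiber_mask[start:start + 20] = True          # 20 samples in a row lost

def interpolate(series, mask):
    """Baseline imputation: linear interpolation over the missing positions."""
    filled = series.copy()
    idx = np.arange(len(series))
    filled[mask] = np.interp(idx[mask], idx[~mask], series[~mask])
    return filled

filled = interpolate(series, fiber_mask)
assert np.allclose(filled[~fiber_mask], series[~fiber_mask])  # observed values untouched
```

A linear baseline degrades badly on long fiber gaps, which is exactly the regime where a learned sequence model pays off.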


Implementation

We adopted NeuroPilot, an edge AI solution designed by MediaTek, to verify that our methodology performs as expected. Please refer to our GitHub page for the experimental data.


References

1. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network
2. MediaTek NeuroPilot
3. OpenCV
4. Atmosphere data on Mars collected by Curiosity, https://atmos.nmsu.edu/PDS/data/mslrem_0001/DATA/


GitHub: https://github.com/Deadline-Driven/NASA2019_Project