Time series data, a sequence of measurements of the same variables across multiple points in time, is ubiquitous in the modern data world. Just as with tabular data, we often want to generate synthetic time series data to protect sensitive information or create more training data when real data is rare. Some applications for synthetic time series data include sensor readings, timestamped log messages, financial market prices, and medical records. The additional dimension of time, where trends and correlations across time are just as important as correlations between variables, creates added challenges for synthetic data.

At Gretel, we've previously published blogs on synthesizing time series data (financial data, time series basics), but are always looking at new models that can improve our synthetic data generation. We really liked the DoppelGANger model and associated paper (Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions by Lin et al.) and are in the process of incorporating this model into our APIs and console. As part of that work, we reimplemented the DoppelGANger model in PyTorch and are thrilled to release it as part of our open source gretel-synthetics library. In this article, we give a brief overview of the DoppelGANger model, provide sample usage of our PyTorch implementation, and demonstrate excellent synthetic data quality on a task synthesizing daily Wikipedia web traffic, with a ~40x runtime speedup compared to the TensorFlow 1 implementation.

DoppelGANger is based on a generative adversarial network (GAN) with some modifications to better fit the time series generation task. As a GAN, the model uses an adversarial training scheme to simultaneously optimize the discriminator (or critic) and generator networks by comparing synthetic and real data. Once trained, arbitrary amounts of synthetic time series data can be created by passing input noise to the generator network. Lin et al. review existing synthetic time series approaches and their own observations to identify limitations, and propose several specific improvements that make up DoppelGANger. These range from generic GAN improvements to time-series-specific tricks. A few of these key modifications are listed below:

- The generator contains an LSTM to produce sequence data, but with a batch setup where each LSTM cell outputs multiple time points to improve temporal correlations.
- Supports variable-length sequences in both training and generation (planned, but not yet implemented in our PyTorch version). For example, one model can use and create 10 or 15 seconds of sensor measurements.
- Supports fixed variables (attributes) that do not vary over time. This type of information is often found with time series data, for example, an industry or sector associated with each stock in financial price history data.
- Supports per-example scaling of continuous variables to handle data with a large dynamic range, for example, differences of several orders of magnitude in page views for popular versus rare Wikipedia pages.
- Uses Wasserstein loss with gradient penalty to reduce mode collapse and improve training (see the generic sketch at the end of this section).

A small note on terminology and data setup: DoppelGANger requires training data with multiple examples of time series. Each example consists of zero or more attribute values, fixed variables that do not vary over time, and one or more features that are observed at each time point. When combined into a training data set, the examples look like a 2D array of attributes (example x fixed variable) and a 3D array of features (example x time x time variable). Depending on the task and available data, this setup may require splitting a few long time sequences into shorter chunks that can be used as the examples for training, as in the sketch below.
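To make that layout concrete, here is a minimal NumPy sketch of the chunking step. The series counts, sequence lengths, and array names are illustrative assumptions for this sketch, not values from the paper or the library.

```python
import numpy as np

# Illustrative sizes (assumptions for this sketch): 100 long daily series,
# each split into examples of 50 time points.
n_pages, total_days, chunk_len = 100, 550, 50
n_chunks = total_days // chunk_len  # 11 examples per original series

rng = np.random.default_rng(seed=0)
page_views = rng.lognormal(mean=5.0, sigma=2.0, size=(n_pages, total_days))
categories = rng.integers(0, 3, size=n_pages)  # one fixed variable per series

# Split each long sequence into consecutive chunks; each chunk is one example.
features = page_views[:, : n_chunks * chunk_len].reshape(
    n_pages * n_chunks, chunk_len, 1
)
# Repeat the fixed variable so every chunk keeps its series' attribute.
attributes = np.repeat(categories, n_chunks).reshape(-1, 1)

print(attributes.shape)  # (1100, 1): example x fixed variable
print(features.shape)    # (1100, 50, 1): example x time x time variable
```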
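With arrays in that shape (reusing `attributes` and `features` from the sketch above), training and generation with our PyTorch implementation looks roughly like the following. Class, method, and parameter names here (`DGANConfig`, `train_numpy`, `generate_numpy`, `sample_len`, and so on) reflect our reading of the gretel-synthetics library and may differ between versions, so check the library documentation for the current interface.

```python
from gretel_synthetics.timeseries_dgan.config import DGANConfig
from gretel_synthetics.timeseries_dgan.dgan import DGAN

# sample_len is the number of time points each LSTM cell emits;
# it must evenly divide max_sequence_len.
config = DGANConfig(
    max_sequence_len=50,
    sample_len=10,
    batch_size=1000,
    epochs=100,
)

model = DGAN(config)
model.train_numpy(attributes=attributes, features=features)

# Generate as many synthetic examples as needed from input noise.
synthetic_attributes, synthetic_features = model.generate_numpy(1000)
```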
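Finally, the Wasserstein loss with gradient penalty mentioned in the modification list above is a standard GAN training technique (WGAN-GP, Gulrajani et al.). The sketch below is a generic PyTorch version of the penalty term to illustrate the idea; it is not the DoppelGANger or gretel-synthetics implementation.

```python
import torch

def gradient_penalty(critic, real, fake):
    """Generic WGAN-GP term: push the critic's gradient norm toward 1
    at random interpolations between real and synthetic batches."""
    batch_size = real.size(0)
    # One interpolation weight per example, broadcast over remaining dims.
    alpha = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=real.device)
    interpolated = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    scores = critic(interpolated)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interpolated,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # the penalty itself is backpropagated in training
    )[0]
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```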