There is a lot of good data available publicly, and plenty more sits in private workspaces, confined within an organization and accessible only inside that environment. Public datasets play a crucial role in building analytics and modeling solutions, but what if the intent is to demonstrate similar capabilities in public, using something that behaves like live production data?
What I’m getting at is the ability to demonstrate data visualization and monitoring capabilities without using the actual production data, by mimicking a scenario that plays out much as the production data would. This follows from the understanding that we simply can’t expose production data publicly due to compliance and other regulations.
I hope that clarifies the intent and sets the narrative straight for readers as to what this blog post will focus on. Yes, it is about generating synthetic data that mimics the production data, so that we can build analytics solutions around it just as we would have done with the original data.
What does the data look like?
The data we are talking about is website impressions recorded as a time series.
For each website, along with other demographic and categorical features, we record the number of impressions received on an hourly basis.
The data schema looks like this:
websiteName: Name of the website
category: IAB category of the website
country: Country of the impression
state: State of the impression
city: City of the impression
os: Operating system of the impression
device: Device of the impression
product: Type of ad product
impression_count: Number of impressions
It is important to note that this time series data is hierarchical: each unique group under the above schema becomes a distinct time series entity that the model has to learn from, as illustrated below.
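To make this concrete, here is what a couple of rows might look like under this schema (the values are invented for illustration; note the additional hourly timestamp column, called date here, which will later serve as the sequence index):

import pandas as pd

data = pd.DataFrame([
    {'date': '2022-01-01 00:00', 'websiteName': 'example.com', 'category': 'News',
     'country': 'US', 'state': 'CA', 'city': 'San Francisco', 'os': 'Android',
     'device': 'Mobile', 'product': 'Banner', 'impression_count': 132},
    {'date': '2022-01-01 01:00', 'websiteName': 'example.com', 'category': 'News',
     'country': 'US', 'state': 'CA', 'city': 'San Francisco', 'os': 'Android',
     'device': 'Mobile', 'product': 'Banner', 'impression_count': 118},
])
# parse the timestamp column so it can be used as a datetime index later
data['date'] = pd.to_datetime(data['date'])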
Approaches to synthetic data generation
Since we do not intend to build something from scratch and are more interested in the results of open-source packages, we will make use of a library called SDV - Synthetic Data Vault.
What is SDV?
The Synthetic Data Vault (SDV) is a Synthetic Data Generation ecosystem of libraries that allows users to generate new Synthetic Data that has the same format and statistical properties as the original dataset.
The library can cater to data of different natures:
1. single-table - Used to model single table datasets.
2. multi-table - Used to model relational datasets.
3. timeseries - Used to model time-series datasets.
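Each modality maps to its own module in the (pre-1.0) SDV API. As a sketch, representative model classes can be imported like this (the specific classes are just examples of each modality):

from sdv.tabular import GaussianCopula   # single-table
from sdv.relational import HMA1          # multi-table, relational
from sdv.timeseries import PAR           # timeseries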
Feature Highlights
Synthetic data generators for single tables with the following features:
Using Copulas and Deep Learning based models.
Handling of multiple data types and missing data with minimum user input.
Support for pre-defined and custom constraints and data validation.
Synthetic data generators for complex multi-table, relational datasets with the following features:
Definition of entire multi-table datasets metadata with a custom and flexible JSON schema.
Using Copulas and recursive modeling techniques.
Synthetic data generators for multi-type, multi-variate timeseries with the following features:
Using statistical, Autoregressive and Deep Learning models.
Conditional sampling based on contextual attributes.
Getting started with SDV
pip install sdv
Code
import pandas as pd
from sdv.timeseries import PAR
import altair as alt
Let's understand how we go about defining our autoregressive time series model. There are two key parameters that specifically define how the model is going to learn from our time series data.
Given the hierarchical nature of the data mentioned before, we define our PAR model parameters accordingly (see the sketch after this list):
entity_columns: This uniquely defines the entity for which we want to generate a time series. In our case the entity is not singular, i.e. we want a time series for each unique group comprising websiteName, category, country, state, city, os, device, product.
sequence_index: Datetime column for time series reference.
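Putting the two parameters together, the model definition and fitting step looks roughly like the sketch below, assuming the full raw data lives in a dataframe named data with the hourly timestamp in a date column (both names are assumptions carried over from the earlier illustration):

# the unique combination of these columns defines one time series entity
entity_columns = ['websiteName', 'category', 'country', 'state',
                  'city', 'os', 'device', 'product']

model = PAR(
    entity_columns=entity_columns,
    sequence_index='date'   # datetime column ordering each sequence
)
model.fit(data)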
CPU times: user 28.4 s, sys: 154 ms, total: 28.5 s
Wall time: 11 s
Conditional Sampling
The PAR model provides one key feature that we are going to use specifically for our problem: instead of generating random samples from the model based on the raw data, we can provide the specific context for which we want the samples.
This can be achieved by defining a dataframe consisting of the unique entities for which we want to generate a time series. One thing to note here: the raw data we provided to the model has 24 hours of data for each unique entity, so for each of the contexts we define in the next step, the model is going to output 24 rows. A sketch of this step follows.
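A minimal sketch, reusing the data and model names from the fitting step above; the context dataframe simply enumerates the unique entity combinations we want series for:

# one row per unique entity combination we want a synthetic series for
context = data[entity_columns].drop_duplicates().reset_index(drop=True)

# the model emits 24 rows per context row, matching the sequence
# length it saw during training
synthetic = model.sample(context=context)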
For the purpose of visualizing the raw and model-generated data side by side, we create unique group ids for both the raw (groupid_og) and synthetic (groupid_syn) data.
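One way to build such ids is sketched below; the helper name and the exact label format are just illustrative choices:

def add_group_id(df, colname):
    # concatenate the entity columns into a single readable label per group
    df = df.copy()
    df[colname] = df[entity_columns].astype(str).agg(' | '.join, axis=1)
    return df

raw_df = add_group_id(data, 'groupid_og')
syn_df = add_group_id(synthetic, 'groupid_syn')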
def plot_timeseries(df):
    # select a point for which to provide details-on-demand
    label = alt.selection_single(
        encodings=['x'],  # limit selection to x-axis value
        on='mouseover',   # select on mouseover events
        nearest=True,     # select data point nearest the cursor
        empty='none'      # empty selection includes no data points
    )

    # define our base line chart of impression counts
    base = alt.Chart(df).mark_line().encode(
        alt.X('date:T'),
        alt.Y('count:Q'),
        alt.Color('group_id:N')
    )

    return alt.layer(
        base,  # base line chart
        # add a rule mark to serve as a guide line
        alt.Chart().mark_rule(color='#aaa').encode(
            x='date:T'
        ).transform_filter(label),
        # add circle marks for selected time points, hide unselected points
        base.mark_circle().encode(
            opacity=alt.condition(label, alt.value(1), alt.value(0))
        ).add_selection(label),
        # add white stroked text to provide a legible background for labels
        base.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(
            text='count:Q'
        ).transform_filter(label),
        # add text labels for impression counts
        base.mark_text(align='left', dx=5, dy=-5).encode(
            text='count:Q'
        ).transform_filter(label),
        data=df
    ).properties(
        width=500,
        height=400
    )
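As a usage sketch, each frame can be renamed to the date, count, and group_id columns the encodings above expect (the rename mapping here assumes the column names from the earlier snippets):

plot_timeseries(raw_df.rename(columns={'impression_count': 'count',
                                       'groupid_og': 'group_id'}))
plot_timeseries(syn_df.rename(columns={'impression_count': 'count',
                                       'groupid_syn': 'group_id'}))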