The Synthetic Data Story

The Data You Know; The Story You Don’t

TL;DR

There is a lot of good data available publicly, and also in private workspaces that are confined to an organization and accessible only within that environment. Public datasets play a crucial role in building analytics- and modeling-driven solutions, but what if the intent is to demonstrate, in public, capabilities similar to those built on live production data?

What I’m getting at is the ability to demonstrate data visualization and monitoring capabilities without using the actual production data, by mimicking a similar scenario that plays out much like the production data would. This comes from the understanding that we simply can’t use production data directly in public due to compliance and other regulations.

I hope that clarifies the intent and sets the narrative straight for the reader as to what this blog post will focus on. Yes, it is about generating synthetic data that mimics the production data, so that we can go about building analytics solutions around it just as we would have done with the original data.

What does the data look like?

  • The data we are talking about is website impressions recorded as a time series.
  • For each website, along with other demographic and categorical features, we record the number of impressions received on an hourly basis.
  • The data schema looks like this:
    1. websiteName: Name of the website
    2. category: IAB category of the website
    3. country: Country where the impression was recorded
    4. state: State where the impression was recorded
    5. city: City where the impression was recorded
    6. os: Operating system of the impression
    7. device: Device type of the impression
    8. product: Type of ad product
    9. date: Hourly timestamp of the observation
    10. count: Number of impressions in that hour
  • It is important to note that this time-series data is hierarchical: each unique group from the above schema becomes a unique time-series entity which the model has to learn from, as the sketch below illustrates.
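To make the hierarchy concrete, here is a minimal sketch (assuming the data sits in the same CSV that is loaded later in this post) of how to count the distinct time-series entities:

import pandas as pd

# columns whose unique combinations define one time-series entity
group_cols = ["websiteName", "category", "country", "state",
              "city", "os", "device", "product"]

df = pd.read_csv("../data/synthetic.csv")
# each unique combination of these columns is one series the model must learn
print(df.groupby(group_cols).ngroups)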

Approaches to synthetic data generation

Since we do not intend to build something from scratch and are more interested in the results of open-source packages, we will make use of a library called SDV (Synthetic Data Vault).

What is SDV?

The Synthetic Data Vault (SDV) is a synthetic data generation ecosystem of libraries that allows users to generate new synthetic data that has the same format and statistical properties as the original dataset.

The library caters to different kinds of data:
  1. single-table: used to model single-table datasets.
  2. multi-table: used to model relational datasets.
  3. timeseries: used to model time-series datasets.
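For orientation, these three flavors map to separate modules in the pre-1.0 SDV package layout this post uses (newer releases reorganized them under sdv.single_table, sdv.multi_table, and sdv.sequential):

# pre-1.0 SDV module layout (illustrative)
from sdv.tabular import GaussianCopula   # single-table models
from sdv.relational import HMA1          # multi-table / relational models
from sdv.timeseries import PAR           # time-series models (used below)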

Feature Highlights

  • Synthetic data generators for single tables with the following features:
    • Using Copulas and Deep Learning based models.
    • Handling of multiple data types and missing data with minimum user input.
    • Support for pre-defined and custom constraints and data validation.
  • Synthetic data generators for complex multi-table, relational datasets with the following features:
    • Definition of entire multi-table datasets metadata with a custom and flexible JSON schema.
    • Using Copulas and recursive modeling techniques.
  • Synthetic data generators for multi-type, multi-variate timeseries with the following features:
    • Using statistical, Autoregressive and Deep Learning models.
    • Conditional sampling based on contextual attributes.

Getting started with SDV

pip install sdv
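Note that the sdv.timeseries.PAR API used below was removed in SDV 1.0 (which renamed the model to PARSynthesizer under sdv.sequential), so if you are following along you may need to pin an older release:

pip install "sdv<1.0"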

Code

import pandas as pd
from sdv.timeseries import PAR
import altair as alt

Load synthetic data

df = pd.read_csv("../data/synthetic.csv")
df.loc[:, "date"] = pd.to_datetime(df.date)
df.head()
websiteName category country state city os device product date count
0 okaz.com.sa society&culture saudi arabia ar riyad riyadh iOS mobile RotatingCube 2022-05-01 00:00:00 545
1 okaz.com.sa society&culture saudi arabia ar riyad riyadh iOS mobile RotatingCube 2022-05-01 01:00:00 884
2 okaz.com.sa society&culture saudi arabia ar riyad riyadh iOS mobile RotatingCube 2022-05-01 02:00:00 708
3 okaz.com.sa society&culture saudi arabia ar riyad riyadh iOS mobile RotatingCube 2022-05-01 03:00:00 550
4 okaz.com.sa society&culture saudi arabia ar riyad riyadh iOS mobile RotatingCube 2022-05-01 04:00:00 271
  • Let's understand how to define the auto-regressive time-series model that will learn our data. There are two key parameters that specifically define how the model learns from the time series.
  • Given the hierarchical nature of the data mentioned before, we define our PAR model parameters accordingly:
    1. entity_columns: Uniquely defines the entity for which we want to generate a time series. In our case the entity is not singular, i.e. we want a time series for each unique group comprising websiteName, category, country, state, city, os, device, product.
    2. sequence_index: The datetime column used as the time-series reference.
entity_columns = ['category', 'country', 'state', 'city', 'os', 'device', 'product', 'websiteName']
sequence_index = 'date'

model = PAR(
            entity_columns=entity_columns,
            sequence_index=sequence_index,
           )

Fit PAR model on the original data

%%time
model.fit(df)
CPU times: user 28.4 s, sys: 154 ms, total: 28.5 s
Wall time: 11 s
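Fitting can take a while on larger datasets, so it may be worth persisting the fitted model; SDV models expose save/load helpers (the file path here is purely illustrative):

# persist the fitted model so sampling can be rerun without refitting
model.save("par_model.pkl")

# ...and reload it later
model = PAR.load("par_model.pkl")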

Conditional Sampling

  • One key feature of the PAR model, which we will use specifically for our problem, is conditional sampling: instead of generating random samples based on the raw data, we can provide a specific context in which we want the samples.
  • This is achieved by defining a dataframe consisting of the unique entities for which we want to generate a time series. One thing to note here: the raw data we provided to the model has 24 hours of data for each unique entity, so for each of the contexts we define in the next step, the model will output 24 rows.
context_columns = ['category', 'country', 'state', 'city', 'os', 'device', 'product', 'websiteName']
df_freq = df.groupby(context_columns).date.count().sort_values(ascending=False).reset_index()
most_freq_group = df_freq[context_columns].to_dict(orient="records")

context = pd.DataFrame(
   most_freq_group
)
context
category country state city os device product websiteName
0 arts&entertainment saudi arabia ar riyad riyadh iOS mobile RotatingCube okaz.com.sa
1 arts&entertainment saudi arabia ar riyad riyadh iOS mobile vibe okaz.com.sa
2 society&culture saudi arabia makkah al mukarramah jeddah iOS mobile impulse okaz.com.sa
3 society&culture saudi arabia makkah al mukarramah jeddah iOS mobile RotatingCube okaz.com.sa
4 society&culture saudi arabia ar riyad riyadh iOS mobile vibe okaz.com.sa
5 society&culture saudi arabia ar riyad riyadh iOS mobile impulse okaz.com.sa
6 society&culture saudi arabia ar riyad riyadh iOS mobile RotatingCube okaz.com.sa
7 news saudi arabia makkah al mukarramah jeddah iOS mobile vibe okaz.com.sa
8 news saudi arabia makkah al mukarramah jeddah iOS mobile impulse okaz.com.sa
9 news saudi arabia makkah al mukarramah jeddah iOS mobile RotatingCube okaz.com.sa
10 news saudi arabia ar riyad riyadh iOS mobile vibe okaz.com.sa
11 news saudi arabia ar riyad riyadh iOS mobile impulse okaz.com.sa
12 news saudi arabia ar riyad riyadh iOS mobile RotatingCube okaz.com.sa
13 health&fitness indonesia jawa barat bekasi Android mobile impulse sajiansedap.grid.id
14 business&finance saudi arabia makkah al mukarramah jeddah iOS mobile RotatingCube okaz.com.sa
15 business&finance saudi arabia ar riyad riyadh iOS mobile vibe okaz.com.sa
16 business&finance saudi arabia ar riyad riyadh iOS mobile RotatingCube okaz.com.sa
17 arts&entertainment saudi arabia makkah al mukarramah jeddah iOS mobile vibe okaz.com.sa
18 arts&entertainment saudi arabia makkah al mukarramah jeddah iOS mobile RotatingCube okaz.com.sa
19 society&culture saudi arabia makkah al mukarramah jeddah iOS mobile vibe okaz.com.sa

Generate synthetic data

df_synthesized = model.sample(context=context)
df_synthesized.shape
(480, 10)
  • Since we passed 20 contexts to the model and expected 24 rows for each context, the synthesized data has 20 × 24 = 480 rows.
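As a quick sanity check (a small sketch reusing the entity_columns defined earlier), we can confirm that every sampled entity came back with exactly 24 hourly rows:

# every generated entity should have exactly 24 hourly rows
assert (df_synthesized.groupby(entity_columns).size() == 24).all()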
df_synthesized.loc[:, "group_id"] = df_synthesized.groupby(['category', 'country', 'state', 'city', 'os', 'device', 'product', 'websiteName']).ngroup().apply(lambda x: str(x) + "_syn")

df.loc[:, "group_id"] = df.groupby(['category', 'country', 'state', 'city', 'os', 'device', 'product', 'websiteName']).ngroup().apply(lambda x: str(x) + "_og")
  • For the purpose of visualizing the raw and model-generated data, we create unique group ids for both the raw (suffix _og) and synthetic (suffix _syn) data.
def plot_timeseries(df):
    # select a point for which to provide details-on-demand
    label = alt.selection_single(
        encodings=['x'], # limit selection to x-axis value
        on='mouseover',  # select on mouseover events
        nearest=True,    # select data point nearest the cursor
        empty='none'     # empty selection includes no data points
    )

    # define our base line chart of impression counts over time
    base = alt.Chart(df).mark_line().encode(
        alt.X('date:T'),
        alt.Y('count:Q'),
        alt.Color('group_id:N')
    )

    return alt.layer(
        base, # base line chart

        # add a rule mark to serve as a guide line
        alt.Chart().mark_rule(color='#aaa').encode(
            x='date:T'
        ).transform_filter(label),

        # add circle marks for selected time points, hide unselected points
        base.mark_circle().encode(
            opacity=alt.condition(label, alt.value(1), alt.value(0))
        ).add_selection(label),

        # add white stroked text to provide a legible background for labels
        base.mark_text(align='left', dx=5, dy=-5, stroke='white', strokeWidth=2).encode(
            text='count:Q'
        ).transform_filter(label),

        # add text labels for impression counts
        base.mark_text(align='left', dx=5, dy=-5).encode(
            text='count:Q'
        ).transform_filter(label),

        data=df
    ).properties(
        width=500,
        height=400
    )

Groupwise OG vs Synthetic time-series

# plot original vs. synthetic time series for the first five groups
for i in range(5):
    df_vis = pd.concat(
        [df[df.group_id == f"{i}_og"], df_synthesized[df_synthesized.group_id == f"{i}_syn"]],
        ignore_index=True,
    )
    plot_timeseries(df_vis).display()
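Eyeballing the charts is a good first pass; as a complementary, purely illustrative spot-check, we can also compare per-group summary statistics of the original and synthetic counts:

# compare per-group mean/std of impression counts, original vs. synthetic
og_stats = df.groupby("group_id")["count"].agg(["mean", "std"])
syn_stats = df_synthesized.groupby("group_id")["count"].agg(["mean", "std"])
print(og_stats.head())
print(syn_stats.head())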