
Generative AI & Synthetic Data: An Exclusive Interview with Leading AI Researcher Dr. Zilong Zhao

Leading AI researcher Dr. Zilong Zhao shares his personal journey and expertise on generative AI and synthetic data.

In this edition of Thought Leaders in Tech, leading AI researcher Dr. Zilong Zhao (PhD, research lead at Betterdata) shares his personal journey and expertise on generative AI and its specific application to synthetic data generation.


What's your story? How did you get involved in the fields of generative AI and synthetic data?

From a young age, I was passionate about math and had a natural talent for it. I grew up in Fuxin, a small, cold city in Liaoning Province, China, near Korea and Mongolia. As I approached university, I knew I wanted to study a math-related field, and I felt drawn to engineering for its practical applications. After considering several options, I ultimately chose computer science, enjoying how it combined mathematical concepts with real-world problem-solving. My interest in video games like Counter-Strike also made computer science particularly appealing at the time.

After earning my Bachelor’s in China, I moved to France for my Master’s and PhD, focusing on optimizing machine learning models, particularly improving model performance and convergence. Toward the end of my PhD, I met my research mentor, Lydia Chen, from IBM Zurich. She later became my supervisor at Delft University of Technology (TU Delft) in the Netherlands. We collaborated during the last year of my PhD, publishing two papers, and her move to TU Delft led to an opportunity for me to join her there.

At TU Delft, Lydia introduced me to a project involving generative models for tabular data, a solution for one of the largest insurance companies in the Netherlands. Due to privacy and security regulations, the insurance company's research teams couldn’t directly access user data, slowing their research and development. The idea was to use generative models to create synthetic data that preserved the structure of real data without violating privacy rules, allowing for faster, more secure testing. Synthetic data proved valuable for accelerating development in large organizations like insurance companies, where access to data can be heavily regulated. This has become my focus for the past 5–6 years. 

I’ve since moved to Singapore to continue my work. I’m currently working as an AI Research Manager at the National University of Singapore and as the Research Lead at Betterdata, a leading AI startup specializing in generative AI for synthetic data.

What is generative AI?

Generative AI is a challenging area to define concisely. In machine learning, there are generally two types of models: predictive models and generative models.

  • Predictive models take data and make predictions, like determining whether someone has cancer based on medical data, or classifying whether an email is spam. For example, in spam detection, the model extracts features from the email, such as the number of question marks, unusual words, or other patterns, and uses those features to decide whether it is spam.
  • Generative models focus on creating new content. For example, if you have a set of images, these images have certain patterns, like textures, colors, and shapes, which form a distribution. If you feed a model 100 images of cats, the model learns the distribution of features across these images—what makes a cat look like a cat. It can then generate new cat images that adhere to the same general characteristics. You can even manipulate certain aspects of the generated images, like changing the color or the shape of the eyes. This is what generative AI aims to achieve: the creation of new data that reflects the distribution of the original data.
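
To make the contrast concrete, here is a minimal Python sketch using entirely made-up numbers (the toy feature values, the class means, and the use of scikit-learn are illustrative assumptions, not something discussed in the interview). A logistic-regression classifier plays the role of the predictive model, while a simple Gaussian fitted to the spam class stands in for a generative model:

```python
# Toy contrast between a predictive model and a (very simple) generative model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical "email" features: [number of question marks, count of unusual words].
spam     = rng.normal(loc=[5.0, 8.0], scale=1.0, size=(100, 2))
not_spam = rng.normal(loc=[1.0, 1.0], scale=1.0, size=(100, 2))
X = np.vstack([spam, not_spam])
y = np.array([1] * 100 + [0] * 100)

# Predictive model: maps features to a label (spam or not spam).
clf = LogisticRegression().fit(X, y)
print(clf.predict([[4.5, 7.0]]))  # classifies this email as spam (1)

# Generative model (simplest possible): learn the distribution of the spam
# class and sample brand-new feature vectors that follow the same pattern.
mean, cov = spam.mean(axis=0), np.cov(spam, rowvar=False)
new_spam_like = rng.multivariate_normal(mean, cov, size=3)
print(new_spam_like)  # three new points that "look like" spam, but are not copies
```

Real generative models for images or text are far more expressive than a single Gaussian, but the principle is the same: learn the distribution of the training data, then sample new examples from it.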

Generative AI in areas like voice, video, and image generation is quite popular in the mainstream media, especially with the release of models like Stable Diffusion, which can generate images, or tools like Sora, which can generate videos. These applications are very impressive and useful for everyday consumers.

For most businesses, however, there isn’t a high demand for this type of generative AI because the majority of their data is structured tabular data stored in databases, not images or video. The need for generative AI in these sectors is often driven by privacy concerns, as businesses want to use synthetic data instead of real data to protect their users’ sensitive personal information.

How does generative AI work?

Generative AI is probably best understood through the transformer model, the most popular type of generative AI model today. This type of model powers many well-known AI applications, most notably OpenAI’s ChatGPT (Generative Pre-trained Transformer).

Transformers are just one type of machine learning model. They are especially suited for handling sequential data, like text or time-series data.  In generative AI, transformers help create new content—like text, images, or even music—by “learning” patterns in the data they were trained on.

The Transformer architecture processes data in a specific way: 

  1. Input: When data is fed into a Transformer model, it first passes through something called a tokenizer. Rather than the model working directly on the raw input, the data is broken down into smaller units, or “tokens,” that represent meaningful components, like words or parts of words in the case of text.
  2. Training: Once tokenized, the Transformer learns the relationships between these tokens. For example, in a sentence like “I am happy,” the model will recognize that the token “am” is likely to be followed by “happy” based on the relationships learned during training.
  3. Output: The Transformer will predict the next token based on the probability distribution of all possible tokens. This is why we can consider it a probabilistic model, as it predicts the most likely next token in a sequence based on the patterns it has learned.
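
As a concrete illustration of steps 1 and 3, here is a minimal Python sketch using the publicly available GPT-2 model through the Hugging Face transformers library (my choice for illustration; the interview does not reference any specific implementation). Step 2, training, is what produced the downloaded weights; the sketch simply tokenizes a prompt and inspects the probability distribution the model assigns to the next token:

```python
# Minimal next-token sketch, assuming the Hugging Face "transformers" and
# "torch" packages are installed (pip install transformers torch).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# 1. Input: the tokenizer turns raw text into a sequence of token IDs.
inputs = tokenizer("I am", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

# 3. Output: the model scores every token in its vocabulary; softmax turns
#    those scores into a probability distribution over the next token.
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the position after "I am"
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}  p={p.item():.3f}")
```

Sampling one of these high-probability tokens, appending it to the sequence, and repeating is how a transformer generates whole sentences.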

What is synthetic data?

Synthetic data is generated data that is based on real data, but it doesn’t directly contain any sensitive personal information. For instance, when banks want to use their data to train a model, they might need to remove names, social security numbers, or account details. However, they still want the synthetic data to retain the statistical properties of the original dataset so that their model can be trained effectively without using the real data.

To clarify, we don’t just replace sensitive information like names or addresses with pseudonyms. We also ensure that the generated data maintains the distribution of the original dataset, meaning that the relationships and patterns between variables (e.g., income, credit history, loan amounts) are preserved. For example, in a dataset of customer heights, we wouldn’t just swap out individual values but would ensure that the generated heights are similar in range and variation to the original dataset.
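
A minimal sketch of that idea with made-up numbers (the income and loan figures, the correlation between them, and the single multivariate Gaussian used as the generator are all illustrative assumptions; production tabular synthesizers use much richer models):

```python
# Synthetic rows are sampled from a fitted joint distribution, so the
# relationship between income and loan amount survives -- unlike simply
# swapping values for pseudonyms one column at a time.
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" data: income and loan amount are positively correlated.
income = rng.normal(60_000, 15_000, size=1_000)
loan = 0.4 * income + rng.normal(0, 5_000, size=1_000)
real = np.column_stack([income, loan])

# Fit a simple joint model and sample brand-new synthetic rows from it.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=1_000)

# The pattern between the columns is preserved, not just each column's average.
print(np.corrcoef(real, rowvar=False)[0, 1])       # correlation in the real data
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # a very similar correlation here
```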

How does synthetic data work?

Let’s say you have a dataset with people’s heights—perhaps from basketball players in the NBA. If you want to generate synthetic data based on this dataset, you don’t just copy the exact heights; instead, you generate values that fall within the same distribution. So, if you have players who are 2 meters, 1.9 meters, and 1.81 meters, the synthetic data might include players who are 2.1 meters, 1.95 meters, and so on. These new data points reflect the general distribution of heights but are not exact copies of the original data.

This process ensures that the synthetic data retains the statistical properties of the original dataset while avoiding the risk of exposing any private information. In other words, you can still use the synthetic data for analysis or model training, but you won’t be able to identify any individual from the dataset.
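
Here is a minimal sketch of the height example, using made-up heights and the simplest possible distribution model (a single Gaussian; real tabular generators learn far more complex distributions):

```python
# Generate synthetic heights that follow the distribution of the originals.
import numpy as np

rng = np.random.default_rng(7)

# Illustrative "real" heights in metres (not actual NBA data).
real_heights = np.array([2.00, 1.90, 1.81, 2.05, 1.95, 1.88])

# Model the distribution with a mean and a spread, then sample from it.
mu, sigma = real_heights.mean(), real_heights.std()
synthetic_heights = rng.normal(mu, sigma, size=1_000)

print(round(mu, 3), round(sigma, 3))                                          # real stats
print(round(synthetic_heights.mean(), 3), round(synthetic_heights.std(), 3))  # very close
# None of the 1,000 synthetic values is a copy of a real player's height.
```

Note that six original values became a thousand synthetic ones; the ability to generate as much data as you need is exactly the scalability advantage Dr. Zhao describes next.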

What are the advantages of synthetic data?

One key advantage of synthetic data generation, as opposed to traditional data anonymization (where identifying personal information is simply removed from a data set or obscured using pseudonyms), is that synthetic data can be scaled. In traditional anonymization, you’re limited by the size of the original dataset. If you have one million data points, you can anonymize them, but you’re still limited to those one million points. However, with synthetic data, you can generate as much data as you need, allowing you to scale datasets from one million to one billion data points. This scalability is particularly useful for applications like software testing, where you might need a much larger dataset than what’s available.

Another advantage is that synthetic data can be used in industries that are concerned with privacy regulations.  For example, financial institutions, healthcare providers, or any company handling sensitive information may have additional legal restrictions on how they can use customer data. By using synthetic data, they can still perform analysis, train models, or make predictions while avoiding regulatory risks associated with handling sensitive customer data.

What are the practical applications of using synthetic data?

Now, when we shift focus to the application of generative models for structured data, the use case becomes more business-oriented. The goal isn’t to generate a few rows of data just for the sake of having data. Rather, it’s to provide businesses with a way to generate synthetic datasets that closely resemble their actual real-world data while protecting sensitive information.

For example, banks may have millions of rows of customer data, but they cannot directly use it due to privacy concerns. However, they may still need to train a model to predict things like credit scores or detect fraud. In these cases, they can generate synthetic data that mimics their real customer data without using actual personal information. This synthetic data can then be used to train models for tasks like credit scoring or fraud detection, while ensuring privacy.

We recently worked with a luxury brand in France that wanted to predict which of their VIP customers were most likely to buy their products next month. Their problem was that they didn’t have enough data on VIP customers, as they only represented about 1.7% of their total dataset. To improve their prediction model, they needed to increase the volume of their data, so they turned to us. By generating more synthetic data that followed the same distribution as their original VIP customer data, we were able to enhance their predictive model, increasing their model’s precision by around 4%. This helped them target the right customers for future marketing campaigns.
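
The numbers below are entirely made up, and a simple Gaussian stands in for the tabular generative model actually used, but the workflow mirrors the case Dr. Zhao describes: train a predictor on imbalanced real data, then augment the rare class with synthetic rows and compare precision (the feature dimensions, class sizes, and the helper precision_of are all hypothetical):

```python
# Sketch of augmenting a rare class with synthetic rows before training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Hypothetical customer features; the "likely buyer" class is rare (~2% of rows).
n_major, n_minor = 9_800, 200
X_major = rng.normal(0.0, 1.0, size=(n_major, 5))
X_minor = rng.normal(0.8, 1.0, size=(n_minor, 5))
X = np.vstack([X_major, X_minor])
y = np.array([0] * n_major + [1] * n_minor)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

def precision_of(train_X, train_y):
    clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    return precision_score(y_te, clf.predict(X_te), zero_division=0)

print("precision, real data only:  ", round(precision_of(X_tr, y_tr), 3))

# Fit a simple generator to the rare class and draw extra synthetic rows from it.
minor_tr = X_tr[y_tr == 1]
mean, cov = minor_tr.mean(axis=0), np.cov(minor_tr, rowvar=False)
X_synth = rng.multivariate_normal(mean, cov, size=2_000)
X_aug = np.vstack([X_tr, X_synth])
y_aug = np.concatenate([y_tr, np.ones(2_000, dtype=int)])

print("precision, real + synthetic:", round(precision_of(X_aug, y_aug), 3))
```

Whether and how much the metric improves depends on the data and the quality of the generator; in the project described above, the gain in precision was around 4%.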

How will this transform industries in the future?

From a company’s perspective, the vision is that all data flows through every department in a synthetic form. This would eliminate concerns around data leakage, as every time data is accessed or used, it would be in a synthetic format, not tied to real, sensitive information. This way, the company can mitigate risks—even those posed by its own engineers—since, despite efforts to secure data, there’s always a potential risk of exposure. For instance, someone could accidentally take sensitive data home or lose it, which still presents a security threat. By converting all data to synthetic data before it flows through the company, they can eliminate this risk entirely.

Synthetic data generation will also accelerate the AI industry itself. For example, OpenAI has mentioned that when training models like ChatGPT, they used a vast amount of data sourced from the internet—pretty much every publicly available dataset they could access. But to continue improving these models, they’ll eventually need synthetic data to augment the real-world data they’ve already collected. As the need for more data grows, synthetic data will become an essential resource for training large models. This trend is already happening in multiple industries. For example, Tesla has been using synthetic data to train its autopilot systems, proving that synthetic data isn’t just a theoretical concept—it’s already being applied in critical, real-world systems.

What is the biggest challenge with this technology?

The biggest challenge, I’d say, is multifaceted. 

From a research perspective, one of the main issues is making our models more efficient and cost-effective. GPUs are essential for training these models, but they are extremely expensive. This drives up the cost of development, so finding ways to reduce resource consumption while maintaining model quality is a major challenge.

Another challenge is more market-oriented. Generative AI is still a relatively new field, and many businesses don’t yet fully understand how they can leverage this technology. For example, when we talk to non-tech business leaders, explaining the value and application of generative AI can be tough. With something like ChatGPT, most people have a basic understanding of what generative models are. But in our case, we often have to spend a lot of time educating our clients and helping them understand how this technology can benefit their business. Many businesses may not even realize they have a problem that generative AI can solve, so part of our job is to help them see that solution.

How might the technology itself develop in the future?

There is still a lot of room for improvement in this area. First and foremost, training generative models remains both time-consuming and expensive. For instance, training models like ChatGPT can cost millions of dollars—potentially up to $100 million (according to Sam Altman)—and requires enormous amounts of energy. In fact, as you may have seen in recent discussions, some companies have suggested that they may need to build nuclear power plants just to meet the energy demands of training large AI models.

Energy consumption is a significant challenge, and it’s something the AI industry will need to address as models become more complex and widely used. While I’m not training models on the same scale as something like ChatGPT, even the models I work with still require substantial computational resources, such as GPUs, to train effectively. So, even on a smaller scale, training time and resource usage are key concerns.