April 13, 2022

AI models are great but they suck without data

Caroline Zaborowski

Many data scientists and AI engineers are like cats. They are easily distracted by shiny new ideas but get bored quickly. As a result, they will often spend a huge amount of time and energy finding and experimenting with new approaches for creating AI models. That is no bad thing, but it starts to become problematic if they lose sight of the importance of data exploration and fidelity. So, how can you make sure your AI team is focusing on the things that actually matter? First, you need to understand something about how a new AI model is created.

A simple guide to building an AI solution

Despite its ubiquity, AI is not widely understood, even among the technical community. Most people know that AI involves taking some data, passing it through some AI black box, and getting some sort of prediction out. That may mean asking the black box to identify cats in photos. Or it may mean parsing your instruction “Hey Google, turn on the lights.” But how do you actually create that black box and train it to behave as required? The answer (usually) is machine learning. And for that you need data.

A simplified AI pipeline

Data discovery

The first step is to find all relevant data and convert it into a usable format. This is known as data discovery and often includes importing legacy data into the cloud. 

Data exploration 

Next, you need to analyze the data, come up with a hypothesis and explore which aspects of the data matter. These aspects (or features) may require feature engineering. 

Model selection

Once you have a solid dataset you are ready to choose a model. There are thousands of AI models available nowadays, each optimized to solve a different problem. 

Model training and validation

Now you need to train the model to recognize certain features and react in some way. Once trained, the model must be validated. Often, these two steps need to be repeated.
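
The train-then-validate loop can be sketched in a few lines of Python. This is only an illustration: the one-feature toy dataset, the trivial threshold "model", and the function names are all assumptions made for this example, not part of any real pipeline. The point is simply that the model is fitted on one part of the data and scored on a part it has never seen.

```python
import random

def split(rows, validation_share=0.25, seed=0):
    """Shuffle the rows and hold back a share of them for validation."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * validation_share))
    return rows[n_val:], rows[:n_val]

def accuracy(cut, rows):
    """Share of rows where 'feature >= cut' matches the true label."""
    return sum((x >= cut) == label for x, label in rows) / len(rows)

def train_threshold(rows):
    """'Train' a toy model: pick the cut-off that best fits the training rows."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in rows}):
        acc = accuracy(cut, rows)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut

# Hypothetical toy data: one numeric feature, label is True when it is 10 or more.
rows = [(x, x >= 10) for x in range(20)]
train, val = split(rows)
cut = train_threshold(train)
train_acc = accuracy(cut, train)
val_acc = accuracy(cut, val)
print(f"cut-off={cut} train accuracy={train_acc:.2f} validation accuracy={val_acc:.2f}")
```

If the validation accuracy is much lower than the training accuracy, that is usually the signal to go back, adjust the data or the model, and repeat both steps.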


Deployment

The final step is deploying the model, for instance by adding a stock prediction bot to your inventory management system or embedding a chatbot in your customer service site.

Why modeling gets all the attention

In AI, model training is typically seen as the sexy part of the process. After all, it’s the model that can achieve the apparently superhuman feats that we look for in AI. Moreover, this is where much of the most active research is happening. New models and approaches are being created all the time. As a result, there’s always some new approach to learn and try out. Often, these new approaches bring important benefits, for instance more accurate NLP (natural language processing) or more efficient models that can run on edge devices like mobile phones. But it’s important to remember that this is an active field of academic research. That means that many of these new models and approaches are more about academic interest than practical use.

Why the data is what really matters

If you take away just one key idea from this blog, it’s that data is king when it comes to AI. If you don’t have good data, you can never have a good AI model, despite what some people may claim about the power of synthetic data. To understand why the data is so important, consider what the AI model is actually doing. In essence, it takes all your data and looks for patterns within it. When you train the model, it’s learning how these patterns correlate to things that you are interested in. Imagine you want to teach it to recognize photos of cats. If you only have 100 face-on photos of tabby cats, it won’t be able to learn how to recognize a black cat from behind. In fact, to reliably recognize a cat of any color in any pose, you will need a far larger and more varied set of photos, potentially millions of them.
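
One rough way to spot this kind of gap before training is to count how many examples you have for each combination you care about. The sketch below assumes each photo comes with hypothetical "color" and "pose" labels. The field names and the `coverage_gaps` helper are invented for illustration.

```python
from collections import Counter
from itertools import product

def coverage_gaps(records, fields, expected_values, min_count=1):
    """Return every combination of expected values with fewer than min_count examples."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    combos = product(*(expected_values[f] for f in fields))
    return [combo for combo in combos if counts[combo] < min_count]

# Hypothetical dataset: 100 face-on photos of tabby cats and nothing else.
photos = [{"color": "tabby", "pose": "face-on"}] * 100
expected = {"color": ["tabby", "black"], "pose": ["face-on", "from behind"]}

gaps = coverage_gaps(photos, ["color", "pose"], expected)
for combo in gaps:
    print("no examples for:", combo)
```

Here the only combination that is covered is the face-on tabby cat. Everything else, including the black cat from behind, shows up as a gap.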

As a result, it’s vital that your team spends the lion’s share of its time on data discovery and exploration. They have to make sure the data is good quality, reliable, and (importantly) will also be available when the model is in production. Your best data scientists will also be able to estimate how robust the resulting model is likely to be, based on the data available. If you don’t have enough data, or if the data fails to support the intended business case, they will let you know. Their job is to ensure you don’t fall into the trap of “garbage in, garbage out”.
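
Two of these checks are easy to sketch: how much of each field is missing, and whether every field used in training will actually exist in production. The example below is a simplified illustration. The inventory-style records, field names, and helper functions are all assumptions, and real data checks would go much further.

```python
def missing_share(rows, fields):
    """Share of rows in which each field is absent, None, or an empty string."""
    total = len(rows)
    return {f: sum(1 for r in rows if r.get(f) in (None, "")) / total for f in fields}

def unavailable_in_production(training_fields, production_fields):
    """Fields the model would be trained on but that production cannot supply."""
    return sorted(set(training_fields) - set(production_fields))

# Hypothetical training records for an inventory use case.
rows = [
    {"sku": "A1", "stock_level": 12, "supplier": "X"},
    {"sku": "A2", "stock_level": None, "supplier": "Y"},
    {"sku": "A3", "stock_level": 7, "supplier": ""},
    {"sku": "A4", "stock_level": 3, "supplier": "X"},
]
training_fields = ["sku", "stock_level", "supplier"]
production_fields = ["sku", "stock_level"]  # assumed: the live feed has no supplier field

missing = missing_share(rows, training_fields)
not_available = unavailable_in_production(training_fields, production_fields)
print(missing)
print(not_available)
```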

At this point it’s important to point out a key source of bias in AI. Often, AI systems rely, at least in part, on publicly available data. Many systems that analyze images have become biased because they have been trained on the images that people post on social media. Almost by definition, such photos aren’t really representative of real life. As a result, AI systems can come to treat certain things as far more common than they are. It’s important to understand this sort of bias if you are going to augment your data with public datasets.

How can Sonasoft help?

Our aim is to provide complete end-to-end AI. We have a simple mantra: AI should be zero-effort. As a result, we have developed a 3-step approach to building new AI solutions.

The Sonasoft 3 step process

Feasibility study

One of the key problems your data science team faces is that they are often too close to the data and may find it hard to take a step back. At the start of every engagement we undertake a short (~1 month) feasibility study. During this study, our highly experienced data scientists will talk to you about your business problem. From this, they can establish the sort of AI solution that will help you. They will work with your team to find all the data you have available and see if it supports such a solution. The team isn't afraid to have honest conversations with you if there are any issues with the data. Moreover, they won’t come with any preconceptions about how your business works, what the data shows, etc. That is important as it allows them to provide unbiased opinions and to do robust data exploration. 


Proof of concept

Assuming that you have the data you need, our next step is to deliver a proof-of-concept (POC) for you. This will be built in SAIBRE, our bespoke deep learning AI platform. Our team will take your data and further refine it. They will test multiple AI models until they find one that delivers good results and solves your stated business problem. At the end of this, you will have a fully functioning AI model that is ready to be deployed.

Deploy and maintain

The final step is to deploy the new AI model into our production system. This is achieved with a single click but under the hood a lot of magic is happening. Once it is running live on SAIBRE, we refer to it as a bot. This is because it is a complete solution including all data connections, the AI model itself, and any data outputs. SAIBRE offers industry-leading smart monitoring of all running bots. This includes monitoring the health of all data sources as well as the accuracy of the model. Over time, all models start to drift in accuracy. If this happens, SAIBRE also offers you the ability to do zero-effort maintenance, retraining and updating the model with just a few clicks.
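
To make the idea of accuracy drift concrete, here is a generic sketch of a rolling accuracy check. It is not how SAIBRE works internally. The `DriftMonitor` class, the window size, and the tolerance are all assumptions chosen purely for illustration.

```python
from collections import deque

class DriftMonitor:
    """Track accuracy over the most recent predictions and flag a sustained drop."""

    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # True where the prediction was correct

    def record(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only judge once the window is full, to avoid reacting to a handful of results.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.rolling_accuracy() < self.baseline - self.tolerance

monitor = DriftMonitor(baseline_accuracy=0.9, window=10, tolerance=0.05)
for _ in range(10):
    monitor.record("cat", "cat")      # ten correct predictions
healthy = monitor.needs_retraining()
for _ in range(5):
    monitor.record("cat", "dog")      # five wrong predictions push out five correct ones
drifted = monitor.needs_retraining()
print(healthy, drifted, monitor.rolling_accuracy())
```

In this toy run the rolling accuracy falls from 1.0 to 0.5, which is below the assumed baseline of 0.9 minus the tolerance, so the monitor flags the model for retraining.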

Contact us today to learn more about how we can help you deliver robust AI solutions.
