Press "Enter" to skip to content

A cheat sheet to the best practices for data preparation for machine learning


Machine learning, or ML, is growing in importance for enterprises that want to use their data to improve their customer experience, develop better products and more. But before an enterprise can make good use of machine learning technology, it needs to ensure it has good data to feed into artificial intelligence and ML models.


What is data preparation?

Data preparation involves cleaning, transforming and structuring data to make it ready for further processing and analysis. Data doesn’t typically reach enterprises in a standardized format and thus needs to be prepared for enterprise use.

SEE: The machine learning master class bundle (TechRepublic Academy)

Before data scientists can run machine learning models to tease out insights, they first need to transform the data, reformatting or correcting it so it's in a consistent format that serves their needs. In fact, as much as 80% of a data scientist's time is spent on data preparation. Given how costly it can be to recruit and retain data science talent, that figure is an indication of just how important data preparation is to data science.

Why is data preparation important to machine learning?

ML models will always require specific data formats in order to function properly. Data preparation can fix missing or incomplete information, ensuring the models can be applied to good data.

Some of the data an enterprise collects in its data lake or elsewhere is structured, like customer names, addresses and product preferences, while most of it is almost certainly unstructured, like geospatial data, product reviews, mobile activity and tweets. Either way, this raw data is effectively useless to the company's data science team until it's formatted in standardized, consistent ways.

SEE: 4 steps to purging big data from unstructured data lakes (TechRepublic)

Talend, a company that provides tools to help enterprises manage data integrity, has suggested a few key benefits of data preparation, including the ability to fix errors quickly by “catch[ing] errors before processing” and the reduction of data management costs that balloon when bad data is fed into otherwise good ML models.

Best practices for data preparation in machine learning

For a broad overview, you can check out these top five tips for data preparation; these more general tips mostly apply to ML data preparation as well. However, there are some particular nuances for ML data preparation that are worth exploring.

Prepare your data according to a plan

You likely know in advance what you want your ML model to predict, so it pays to prepare accordingly. If you have a good sense of the outcome you’re hoping to achieve, you can better define the kinds of data you’ll want to collect and how you want to clean it up.

This also allows you to better respond to missing or incomplete data. A common approach to missing data is null value replacement. For example, if you’re an airline with passenger data, you might be comfortable inserting a null or default value into the field that tracks meal preferences.

But depending on your application, null value replacement might be a terrible approach. To continue the airline example, the carrier shouldn’t insert a null value for missing passenger nationality data, as that could create serious problems for a passenger’s travel experience. Knowing which data is critical and how you’ll deal with incomplete records is essential.
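As a minimal sketch of that selective approach, here's how it might look in pandas. The passenger table and its column names are purely illustrative: a missing meal preference gets a safe default, while records missing nationality are flagged for follow-up rather than guessed.

```python
import pandas as pd

# Hypothetical passenger records; column names are illustrative only.
passengers = pd.DataFrame({
    "passenger_id": [101, 102, 103, 104],
    "meal_preference": ["vegetarian", None, "kosher", None],
    "nationality": ["US", "FR", None, "JP"],
})

# Low-stakes field: replace missing meal preferences with a safe default.
passengers["meal_preference"] = passengers["meal_preference"].fillna("standard")

# Critical field: don't guess nationality -- flag those records for review instead.
needs_review = passengers[passengers["nationality"].isna()]
clean = passengers.dropna(subset=["nationality"])

print(clean)
print(f"{len(needs_review)} record(s) need manual review")
```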

SEE: Hiring kit: Data scientist (TechRepublic Premium)

Consider the people involved in data collection

Though you should consider investing in robotic process automation to handle simple, repetitive tasks, lest your employees get burdened with tedium, people will remain both your greatest asset and your biggest hurdle to good data preparation for ML. Enterprises are often riddled with data silos, even within the same department.

A news organization, for example, may understand a reader’s interests on the web but fail to personalize a mobile app that’s run by a different team with different underlying storage systems.

Helping employees become collectively data-driven means not only collecting and using data but also sharing it in useful ways across departments and roles. Shared data collection and usage processes are critical to ensuring better data for ML models.

Avoid target leakage

Google, a leader in data science and ML, offers some smart advice when it comes to target leakage in ML training data: “Target leakage happens when your training data includes predictive information that is not available when you ask for a prediction.”

Google’s experts went on to explain that target leakage can cause ML models that look strong on predictive evaluation metrics to perform badly against real data. The important task here is to make sure you have all of the historical data you need to make accurate predictions.
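Here's a hedged illustration of what that means in practice, using a hypothetical churn dataset with invented column names: one field is only recorded after the outcome has already happened, so it has to be dropped before training so the model sees only information it would actually have at prediction time.

```python
import pandas as pd

# Hypothetical churn records; column names are invented for this example.
df = pd.DataFrame({
    "monthly_spend":       [42.0, 15.5, 80.0, 22.0, 60.0, 10.0],
    "support_tickets":     [1, 4, 0, 2, 1, 5],
    "cancellation_reason": [None, "price", None, None, None, "service"],
    "churned":             [0, 1, 0, 0, 0, 1],
})

# "cancellation_reason" is only recorded after a customer has churned, so it
# leaks the target into the features. Drop anything that wouldn't exist at
# prediction time before training.
leaky_columns = ["cancellation_reason"]
X = df.drop(columns=leaky_columns + ["churned"])
y = df["churned"]

print(X.columns.tolist())  # ['monthly_spend', 'support_tickets']
```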

Break up your data

Deepchecks, a company that offers an open-source Python library for ML, suggests that companies should split their data into training, validation and test sets for better results.

By “develop[ing] insights from the training data, and then apply[ing] processing to all datasets,” you’ll get a good sense for how your model will perform against real-world data. Most often, it will make sense to have 80% of your data in the training set and 20% in the test set.
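A minimal sketch of that split, assuming scikit-learn and purely illustrative data: hold out 20% as the test set first, then carve a validation set out of the remaining training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative feature matrix and labels.
X = np.arange(1000).reshape(500, 2)
y = np.random.randint(0, 2, size=500)

# First split: hold out 20% of the data as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Second split: carve a validation set out of the remaining training data
# (0.25 of the 80% left over is 20% of the original dataset).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # roughly 300 / 100 / 100
```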

Beware of bias

Though we may assume that machines always yield unbiased, correct decisions, sometimes these machines are simply more efficient at conveying our own biases. Because of the potential for bias to creep into ML models, it’s essential to closely examine the data sources you use to train models.

Machine learning models are only as smart as the data that feeds them, and that data is shaped by the people who collect it. In turn, people are influenced by the output of those models and can drift ever further from the raw data. Taken together, this leaves us less and less capable of feeding good data to our models, because we’ve come to trust them so wholeheartedly.

A strong dose of humility and circumspection is critical to preparing data for ML so biases don’t proliferate through several generations of data and models. To ensure your data team is not only technically savvy but also aware of where problems can arise in machine learning data preparation, consider signing them up for a comprehensive machine learning course.

Make time for data exploration

It can be tempting to jump straight into model building without first laying a strong foundation through data exploration. Data exploration is an important first step because it allows you to examine individual variables’ data distributions or the relationships between variables. You can also check for things like collinearity, which can point to variables that move together. Data exploration is a great way to get a strong sense for where your data may be incomplete or where further transformation may help.
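As a quick sketch of what that first pass can look like with pandas, using synthetic data standing in for your own, a few lines cover distributions, missing values and a correlation check for collinearity.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; in practice you'd load your own data here.
rng = np.random.default_rng(0)
spend = rng.normal(100, 20, size=200)
df = pd.DataFrame({
    "monthly_spend": spend,
    "annual_spend": spend * 12 + rng.normal(0, 5, size=200),  # nearly collinear
    "support_tickets": rng.poisson(2, size=200),
})

# Distributions and missing values for each variable.
print(df.describe())
print(df.isna().sum())

# Pairwise correlations -- values near +/-1 off the diagonal flag collinearity.
print(df.corr())
```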

Disclosure: I work for MongoDB but the views expressed herein are mine.
