Data scientists want their AI data to be as clean and relevant as possible. Here are a few techniques for managing your data to get the best results.
Using data dropout to screen out unwanted data is just one of several ways that organizations can control which data, and how much of it, they feed to their artificial intelligence. It is a way to ensure that the data you’re using is relevant to the business problem you want your AI to address.
Data scientists use data dropout in AI to eliminate, up front, all data deemed extraneous to a particular AI process. For instance, if all you care about are the demographics of the state of Indiana, you can exclude the data that comes in from other states, because it is irrelevant to your study. Doing so reduces processing time, shortens the time to market for AI results, and improves the quality and value of the data that you input into your AI application.
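As a rough illustration of the Indiana example, here is a minimal Python sketch of an up-front filter. The record layout, the field names (`state`, `county`, `population`) and the values are made up for illustration; a real pipeline would read from its own sources.

```python
# Made-up sample records; field names and values are illustrative assumptions.
records = [
    {"state": "IN", "county": "County A", "population": 1000},
    {"state": "OH", "county": "County B", "population": 2000},
    {"state": "IN", "county": "County C", "population": 3000},
    {"state": "IL", "county": "County D", "population": 4000},
]


def drop_out_of_scope(rows, state="IN"):
    """Keep only rows for the state under study; all other rows are dropped."""
    return [row for row in rows if row.get("state") == state]


in_scope = drop_out_of_scope(records)
print(f"kept {len(in_scope)} of {len(records)} records")
```

Because the filter runs before any other processing, every later step only sees the rows that matter to the study.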
There are other techniques that IT and data scientists can use to maintain control of the data they admit into AI. Here are a few more:
Data source control
If you’re performing scientific research and you don’t see the value of some of the worldwide sources you’re pulling data from, you can eliminate those feeds. Data feeds are generally eliminated for one of two reasons: either you believe that the data source will not be relevant to your application, or you distrust the accuracy of the data or the data source.
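The two reasons for dropping a feed, relevance and trust, can be recorded explicitly so the decision is easy to review later. The sketch below is one possible way to do that in Python; the `Feed` class, its fields and the feed names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feed:
    name: str        # hypothetical feed identifier
    relevant: bool   # does the source bear on the application?
    trusted: bool    # do we trust the accuracy of the data and its source?


def select_feeds(feeds):
    """Split feeds into those to keep and those to eliminate, with a reason."""
    kept, dropped = [], {}
    for feed in feeds:
        if not feed.relevant:
            dropped[feed.name] = "not relevant"
        elif not feed.trusted:
            dropped[feed.name] = "accuracy not trusted"
        else:
            kept.append(feed.name)
    return kept, dropped


feeds = [
    Feed("lab_results_eu", relevant=True, trusted=True),
    Feed("social_media_scrape", relevant=True, trusted=False),
    Feed("weather_archive", relevant=False, trusted=True),
]
kept, dropped = select_feeds(feeds)
print("kept:", kept)
print("dropped:", dropped)
```

Keeping the reason alongside each dropped feed makes it simpler to revisit the decision if the application changes.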
Business use case control
One of the risks of processing too much data is that the AI can drift away from your original business case.
If your business use case is focused solely on monitoring the health of tracks throughout your municipal tram system, picking up excess Internet of Things data about traffic counts, engine component failures, etc., might not be necessary (although this data could be used in another business case).
Data elimination decisions should always be made with the primary business use case in mind. If other business use cases come up, they could be placed in a “parking lot” of future data analytics projects.
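For the tram example, the idea can be sketched as a filter that keeps only the event types tied to the primary use case and tallies everything else as candidates for the “parking lot.” The event type names and the event layout below are assumptions made for illustration, not a real IoT schema.

```python
from collections import Counter

# Hypothetical event types that serve the track-health use case.
PRIMARY_EVENT_TYPES = {"track_vibration", "rail_temperature"}


def split_by_use_case(events):
    """Return in-scope events plus a count of the event types set aside."""
    in_scope = []
    parked = Counter()
    for event in events:
        if event["type"] in PRIMARY_EVENT_TYPES:
            in_scope.append(event)
        else:
            parked[event["type"]] += 1
    return in_scope, parked


events = [
    {"type": "track_vibration", "value": 0.4},
    {"type": "traffic_count", "value": 120},
    {"type": "rail_temperature", "value": 31.5},
    {"type": "engine_component_failure", "value": 1},
    {"type": "traffic_count", "value": 98},
]
in_scope, parked = split_by_use_case(events)
print(f"{len(in_scope)} events kept for track health")
print("parking lot candidates:", dict(parked))
```

The tally of excluded types gives a simple record of what was left out and how often it showed up, which can inform future analytics projects.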
The 95% rule
When companies use AI for process automation, they strive to attain 95% accuracy or better. This means that the AI will perform the assigned task with at least 95% accuracy when compared with a similar manual or human process.
The only way organizations get to this 95% accuracy standard is by iteratively revising and testing their analytics algorithms until the results reach 95% accuracy. It is during this fine-tuning process that organizations might see the need to further pare down the data they are plugging into their algorithms.
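One simple way to check an algorithm against the 95% standard is to compare its outputs with those of the manual process on the same cases and measure how often they agree. The Python sketch below assumes that exact-match agreement is an acceptable definition of accuracy; the labels and sample data are invented for illustration.

```python
TARGET = 0.95  # the 95% standard described above


def agreement_rate(ai_results, human_results):
    """Fraction of cases where the AI output matches the human output."""
    if len(ai_results) != len(human_results):
        raise ValueError("both result lists must cover the same cases")
    if not ai_results:
        raise ValueError("at least one case is required")
    matches = sum(a == h for a, h in zip(ai_results, human_results))
    return matches / len(ai_results)


# Invented sample: 20 cases, with the AI disagreeing on the last one.
human = ["approve"] * 20
ai = ["approve"] * 19 + ["reject"]

rate = agreement_rate(ai, human)
meets_target = rate >= TARGET
print(f"agreement: {rate:.0%}, meets target: {meets_target}")
```

Rerunning a check like this after each revision of the algorithm, or after each change to the input data, shows whether the change moved the results closer to the target.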
The data balancing act
Choosing to exclude data for an AI process often is a necessary step, but it also carries risk.
Some years ago, a UK retailer wanted to know why its online sales were higher on Sunday afternoons. The retailer discovered that Sunday afternoons were when women’s husbands went away to soccer games. The women used their alone time at home to make online orders.
This was an unusual data discovery that a more straightforward AI analytics program could have missed if data deemed irrelevant had been excluded at the front end of the AI process. So, while it’s important to limit the amount of data that your AI must process, you also want to avoid making data cuts that are too extreme.
Balancing the elimination of junk data against the danger of excluding too much data is a central data management challenge that IT must address.