Developments in AI are a high priority for businesses and governments globally. Yet a fundamental aspect of AI remains neglected: data quality.
AI algorithms rely on high-quality data to generate optimal results – if the data is biased, incomplete, insufficient, or inaccurate, the consequences can be devastating.
AI systems that identify patient diseases are an excellent example of how poor data quality can lead to adverse outcomes. When fed insufficient data, these systems produce false diagnoses and inaccurate predictions, resulting in misdiagnoses and delayed treatments. For example, a University of Cambridge study of over 400 tools used for diagnosing Covid-19 found the AI-generated results entirely unusable, largely because of flawed datasets.
In other words, your AI initiatives will have devastating real-world consequences if your data isn’t good enough.
There is quite a debate on what ‘good enough’ data means. Some say good enough data doesn’t exist. Others say the need for good data causes analysis paralysis – while HBR states outright that if your information is terrible, your machine learning tools are useless.
At WinPure, we define good enough data as “complete, accurate, valid data that can be confidently used for business processes with acceptable risks, the level of which is subject to the individual objectives and circumstances of a business.”
Most companies struggle with data quality and governance more than they admit. Adding to the tension, they are overwhelmed and under immense pressure to deploy AI initiatives to stay competitive. Sadly, this means problems like dirty data don’t even reach boardroom discussions until they cause a project to fail.
Data quality issues arise at the very start of the process, when the algorithm feeds on training data to learn patterns. For example, if an AI algorithm is fed unfiltered social media data, it picks up abusive, racist, and misogynistic remarks, as seen with Microsoft’s AI bot. More recently, AI systems’ inability to detect dark-skinned people was also attributed to incomplete, unrepresentative training data.
How is this related to data quality?
The absence of data governance, a lack of data quality awareness, and siloed data views (in which such disparities might otherwise have been noticed) all lead to poor outcomes.
When businesses realize they’ve got a data quality problem, they panic about hiring. Consultants, engineers, and analysts are blindly hired to diagnose issues, clean up data, and resolve problems ASAP. Unfortunately, months pass before any progress is made, and despite spending millions on the workforce, the problems don’t disappear. A knee-jerk approach to a data quality problem is hardly helpful.
Real change starts at the grassroots level.
Here are three crucial steps to take if you want your AI/ML project to move in the right direction.
For starters, evaluate the quality of your data by building a culture of data literacy. Bill Schmarzo, a powerful voice in the industry, recommends using design thinking to create a culture where everyone understands and can contribute to an organization’s data goals and challenges.
In today’s business landscape, data and data quality are no longer the sole responsibility of IT or data teams. Business users must be aware of dirty-data problems such as inconsistent and duplicate records, among other issues.
So the first critical step: make data quality training an organizational effort and empower teams to recognize the attributes of poor data.
Here’s a checklist you can use to begin a conversation on the quality of your data.
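To make such a conversation concrete, even a few automated checks can surface completeness, uniqueness, and validity problems before they reach a model. Below is a minimal sketch in plain Python; the records, field names, and validity rules are hypothetical, purely for illustration:

```python
from collections import Counter

# Hypothetical customer records -- illustrative data, not from the article.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},                 # missing email
    {"id": 3, "email": "ana@example.com", "age": 34},  # duplicate of id 1
    {"id": 4, "email": "bob@example.com", "age": -5},  # invalid age
]

def profile(records):
    """Return simple completeness, uniqueness, and validity counts."""
    # Completeness: records with no email value at all.
    missing = sum(1 for r in records if not r["email"])
    # Uniqueness: same (email, age) pair appearing more than once.
    keys = Counter((r["email"], r["age"]) for r in records if r["email"])
    duplicates = sum(n - 1 for n in keys.values() if n > 1)
    # Validity: ages outside a plausible human range.
    invalid = sum(1 for r in records if not 0 <= r["age"] <= 120)
    return {"missing_email": missing, "duplicates": duplicates, "invalid_age": invalid}

print(profile(records))  # {'missing_email': 1, 'duplicates': 1, 'invalid_age': 1}
```

Checks like these won’t replace a governance strategy, but they give business and data teams a shared, measurable starting point for the discussion.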
Businesses often make the mistake of underestimating data quality problems. They hire data analysts to do mundane data cleaning tasks instead of focusing on planning and strategy work. Some businesses use data management tools to clean, de-dupe, merge, and purge data without a plan. Unfortunately, tools and talent cannot solve problems in isolation. You need a strategy that addresses the key data quality dimensions.
The strategy must address data collection, labeling, processing, and whether the data fits the AI/ML project. For instance, if an AI recruitment program only selects male candidates for a tech role, it’s obvious the training data for the project was biased, incomplete (since it did not include enough data on female candidates), and inaccurate. Such data cannot serve the true purpose of the AI project.
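A skew like the one in the recruitment example can often be caught before training with a simple representation check. The sketch below, with made-up labels purely for illustration, computes the ratio between the most- and least-represented groups in a training set:

```python
from collections import Counter

# Hypothetical training labels for a recruitment model -- illustrative only.
candidates = ["male"] * 95 + ["female"] * 5

def imbalance_ratio(labels):
    """Ratio of most- to least-represented group; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

ratio = imbalance_ratio(candidates)
print(f"imbalance ratio: {ratio:.1f}")  # 19.0 -- a red flag before training begins
```

A threshold on this ratio (what counts as ‘acceptable’ depends on the project) is one concrete way to turn a data quality strategy into a gate that a dataset must pass before model training starts.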
Data quality goes beyond the mundane tasks of cleanups and fixes. Setting up data integrity and governance standards before beginning the project is best. It saves a project from going kaput later!
There are no universal standards for ‘good enough’ data or data quality levels. Instead, it all depends on your business’s information management system, your guidelines for data governance (or the absence of them), and your team’s knowledge and business goals, among numerous other factors.
Here are a few questions to ask your team before kickstarting the project:
Ask the right questions, assign the right roles, implement data quality standards, and help your team address challenges before they become problems!
Data quality isn’t just about fixing typos or errors. It’s about ensuring AI systems aren’t discriminatory, misleading, or inaccurate. Before launching an AI project, it’s necessary to address the flaws in your data and tackle data quality challenges. Moreover, initiate organization-wide data literacy programs to connect every team to the overall objective.
Frontline employees who handle, process, and label the data need training on data quality to identify bias and errors in time.
Farah Kim is a human-centric marketing consultant with a knack for problem-solving and simplifying complex information into actionable insights for business leaders. She’s been involved in tech, B2B, and B2C since 2011.