Garbage in, garbage out. The importance of data quality in artificial intelligence.

Updated: Mar 25

The European Union Agency for Fundamental Rights, in its paper "Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights", has highlighted that "[a]lgorithms used in machine learning systems and artificial intelligence (AI) can only be as good as the data used for their development". If you use poor-quality data, you cannot expect good outcomes. The conclusions you draw from analysing that data, the correlations you find in it, and the decisions you base on it will have little value.

The most significant risk of poor-quality data is its adverse effect on your business. Remember the Apple and Goldman Sachs scandal, in which gender-biased data led Goldman Sachs to discriminate against women when setting credit limits? The credit granted to women was 10 to 20 times lower than that granted to men with similar or identical conditions and characteristics. This discrimination resulted from the bank using an Apple algorithm fed with biased data: garbage data. Goldman Sachs's defence was that the algorithm based its credit scoring on an assessment of customers' creditworthiness and not on factors such as gender, race, age or sexual orientation.

But what does count as a factor in determining such a score? As mathematician Cathy O'Neil told Slate in 2019, when companies choose to use algorithms, "[t]hey look at the upside—which is faster, scalable, quick decision-making—and they ignore the downside, which is that they're taking on many risks." Garbage data is one such risk, because data does not always reflect society as it is; it reflects only the intent and knowledge of whoever collected and prepared it. It is easy to imagine how damaging the situation was for both companies in terms of PR, trust, and reputation.

Since then, the European Union has been working on a proposal for a regulation on artificial intelligence that will impose a set of requirements on developers and users of so-called high-risk AI systems (such as those used for creditworthiness assessment or human resources). The Commission, which is working on the proposal, says it pays particular attention to data and data governance: "Training, validation, and testing data sets shall be relevant, representative, free of errors, and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used. These characteristics of the data sets may be met at the level of individual data sets or a combination thereof".

The same standard applies if you want to implement an AI system that is meant to be trustworthy. You should constantly assess data quality and eliminate any potential errors. Bias or discrimination may result in reputational damage and actual losses – administrative fines and an exodus of clients.

So, how can you ensure the excellent quality of your data?

First, you have to implement an organised data governance framework. No one expects you to have a sophisticated system if you are a small business. Each company will build a framework proportionate to its size. At Docmatic, we help our clients assess the extent of the changes they need to make to their current framework.

Then, verify that you actually possess the data you need. If your data does not fit your intended purpose, quality is not yet your problem. To make sure your objectives and your data are aligned, set the goals of your AI-based system and compare them with a sample of your data. You should also check whether the amount of data you possess is sufficient to build an accurate and practical model. Don't forget that data is the oil for any AI-based system. Once your system and data are aligned, you must check that the data you have chosen is representative and not biased. Having a good partner for such an assessment is essential.
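To make these checks more concrete, here is a minimal sketch in Python of what a first pass might look like. Everything in it is an assumption made for illustration: the function names, the toy records, the 50/50 reference shares, and the 10-point tolerance are all hypothetical, and a real assessment would need thresholds and reference figures that suit your own use case.

```python
from collections import Counter


def completeness(records, fields):
    """Return, per field, the share of records that have a non-empty value."""
    if not records:
        return {field: 0.0 for field in fields}
    total = len(records)
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


def representation_gaps(records, group_field, reference_shares, tolerance=0.10):
    """Return groups whose share in the data differs from a reference share
    by more than `tolerance` (hypothetical threshold, chosen for illustration)."""
    known = [r[group_field] for r in records if r.get(group_field) not in (None, "")]
    counts = Counter(known)
    total = len(known)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical toy data: five credit applicants, one with a missing income.
applicants = [
    {"gender": "F", "income": 52000},
    {"gender": "M", "income": 48000},
    {"gender": "M", "income": None},
    {"gender": "M", "income": 61000},
    {"gender": "M", "income": 45000},
]

print(completeness(applicants, ["gender", "income"]))
print(representation_gaps(applicants, "gender", {"F": 0.5, "M": 0.5}))
```

In this toy set, one income value is missing and women make up only a fifth of the records against an assumed reference share of one half, so both checks would flag the data before any model is trained on it.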

Finally, don't forget that having a human-in-the-loop approach is also desirable. You should therefore monitor and supervise your model and react if needed. Remember, effort at the beginning will bring benefits at the end.
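As a rough illustration of what such monitoring could involve, the Python sketch below compares approval rates across groups and flags large gaps for a person to review. The field names, the toy decision log, and the 0.8 ratio threshold are hypothetical choices made for this example; they are not a legal standard or a recommended setting.

```python
from collections import defaultdict


def approval_rates(decisions, group_field="group"):
    """Return the share of approved decisions for each group."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for decision in decisions:
        group = decision[group_field]
        totals[group] += 1
        if decision["approved"]:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def needs_human_review(rates, min_ratio=0.8):
    """Flag the model for review when the lowest approval rate falls below
    `min_ratio` times the highest one (illustrative threshold only)."""
    if not rates:
        return False
    highest = max(rates.values())
    if highest == 0:
        return False
    return min(rates.values()) / highest < min_ratio


# Hypothetical decision log produced by a model in production.
log = [
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": True},
    {"group": "A", "approved": False},
    {"group": "B", "approved": True},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
    {"group": "B", "approved": False},
]

rates = approval_rates(log)
print(rates)
print(needs_human_review(rates))
```

A flag from a check like this does not prove discrimination; it only tells the person supervising the model where to look first.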

