Garbage in, garbage out. The importance of data quality in artificial intelligence.

Updated: Oct 12


The European Union Agency for Fundamental Rights writes that:

"'Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights' has highlighted that '[a]lgorithms used in machine learning systems and artificial intelligence (AI) can only be as good as the data used for their development'."

If you use poor-quality data, you cannot expect good outcomes. The conclusions you draw from analysing it, the correlations you find in it, and the decisions you base on it will have little value.

The most significant risk of poor-quality data is the damage it can do to your business.

Remember the Apple Card and Goldman Sachs scandal, in which gender-biased data led Goldman Sachs to discriminate against women when setting credit limits? Women were reportedly granted credit limits 10 to 20 times lower than men with similar or identical conditions and characteristics. This discrimination resulted from the bank using an Apple Card algorithm fed with biased data, in other words garbage data.

Goldman Sachs's defence was that the algorithm based its credit scoring on an assessment of each customer's creditworthiness and not on factors such as gender, race, age or sexual orientation.


But what does count as a factor in determining such a score? As mathematician Cathy O'Neil told Slate in 2019, when companies choose to use algorithms,

"[t]hey look at the upside—which is faster, scalable, quick decision-making—and they ignore the downside, which is that they're taking on many risks."

Garbage data is a risk.


Data does not always reflect society. Often it reflects only the intent and knowledge of the people who built the system around it. It is easy to imagine how damaging the situation was for both companies in terms of PR, trust, and reputation.


Since then, the European Union has been working on a proposal for a regulation on artificial intelligence that will impose a set of requirements on developers and users of so-called high-risk AI systems (such as creditworthiness assessment and human resources...). The Commission's proposal pays particular attention to data and data governance:


"Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used. These characteristics of the data sets may be met at the level of individual data sets or a combination thereof".

The same requirements apply if you want to implement an AI system that is meant to be trustworthy. You should continuously assess data quality and eliminate any errors you find. Bias or discrimination may lead to reputational damage and actual losses: administrative fines and an exodus of clients.
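As a rough illustration of what assessing "free of errors" and "complete" can look like in practice, here is a minimal Python sketch. The field names (`applicant_id`, `income`, `age`), the sample records, and the plausible age range are all assumptions made up for this example; a real audit would use the rules that fit your own data.

```python
from collections import Counter

# Hypothetical schema: these field names are illustrative assumptions.
REQUIRED_FIELDS = ("applicant_id", "income", "age")


def audit_records(records):
    """Count missing values, duplicate IDs, and implausible ages."""
    missing = Counter()
    for row in records:
        for field in REQUIRED_FIELDS:
            if row.get(field) is None:
                missing[field] += 1

    id_counts = Counter(
        row["applicant_id"] for row in records
        if row.get("applicant_id") is not None
    )
    duplicate_ids = sorted(i for i, n in id_counts.items() if n > 1)

    # Assumed plausibility rule: an age outside 18..120 is treated as an error.
    out_of_range = [
        row for row in records
        if row.get("age") is not None and not 18 <= row["age"] <= 120
    ]

    return {
        "rows": len(records),
        "missing": dict(missing),
        "duplicate_ids": duplicate_ids,
        "out_of_range_age": len(out_of_range),
    }


# Made-up sample data, for illustration only.
records = [
    {"applicant_id": 1, "income": 42000, "age": 34},
    {"applicant_id": 2, "income": None, "age": 51},
    {"applicant_id": 2, "income": 38000, "age": 29},
    {"applicant_id": 3, "income": 55000, "age": 230},
]
report = audit_records(records)
print(report)
```

Running a report like this on a schedule, rather than once, is one simple way to make the assessment continuous.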


So, how can you ensure the excellent quality of your data?

First, you have to implement an organised data governance framework. No one expects a small business to have a sophisticated system. Each company should build a framework proportionate to its size.


At Docmatic, we help our clients assess the extent of the changes they need to make to their current framework.

Then, verify that you actually possess the data you need.

If your data does not fit your intended purpose, quality is not yet your problem. To make sure your objectives and your data match, define the goals of your AI-based system and compare them against the data you actually hold. You should also make sure you have enough data to build an accurate and practical model.
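One simple way to compare goals against data is to list the categories the system is supposed to handle and count how many examples you hold for each. The sketch below assumes a hypothetical document classifier; the class names and the minimum of 3 examples per class are arbitrary placeholders, not recommended values.

```python
from collections import Counter


def find_shortfalls(labels, expected_classes, min_per_class):
    """Return the expected classes that have fewer than min_per_class examples."""
    counts = Counter(labels)  # a Counter returns 0 for classes it has never seen
    return {
        cls: counts[cls]
        for cls in expected_classes
        if counts[cls] < min_per_class
    }


# Hypothetical goal: the system should recognise these four document types.
expected_classes = ["invoice", "contract", "receipt", "payslip"]

# Made-up labels for the data currently available.
labels = ["invoice", "invoice", "invoice", "contract", "contract", "receipt"]

shortfalls = find_shortfalls(labels, expected_classes, min_per_class=3)
print(shortfalls)
```

A class that appears in the goals but has no examples at all, like `payslip` here, is a sign that the data does not yet fit the purpose.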


Don't forget that data is the oil for any AI-based system.

Once your system and data are aligned, you must check whether the data you have chosen is representative and free of bias. Having a good partner for such an assessment is essential.
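A first, very rough check of representativeness and bias is to look at how much of the data each group accounts for and how often each group receives a positive outcome. The sketch below is only an illustration under stated assumptions: the `gender` and `approved` fields and the records themselves are invented, and a ratio of outcome rates is one simple indicator among many, not a complete fairness assessment.

```python
from collections import defaultdict


def group_rates(records, group_key, outcome_key):
    """Return each group's share of the data and its rate of positive outcomes."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        if row[outcome_key]:
            positives[group] += 1
    return {
        group: {
            "share": totals[group] / len(records),
            "positive_rate": positives[group] / totals[group],
        }
        for group in totals
    }


def rate_ratio(rates):
    """Ratio of the lowest to the highest positive rate (None if no positives)."""
    values = [r["positive_rate"] for r in rates.values()]
    highest = max(values)
    return min(values) / highest if highest > 0 else None


# Made-up credit decisions, for illustration only.
records = (
    [{"gender": "F", "approved": True}] * 2
    + [{"gender": "F", "approved": False}] * 2
    + [{"gender": "M", "approved": True}] * 5
    + [{"gender": "M", "approved": False}] * 1
)
rates = group_rates(records, "gender", "approved")
ratio = rate_ratio(rates)
print(rates)
print(ratio)
```

A ratio well below 1 means one group receives positive outcomes much less often than another, which is a reason to investigate the data before training on it.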


Finally, don't forget that having a human-in-the-loop approach is also desirable. You should therefore monitor and supervise your model and react if needed. Remember, effort at the beginning will bring benefits in the end.
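A human-in-the-loop setup can be as simple as letting the model decide only when it is confident and sending everything else to a person. The following sketch assumes the model outputs a label with a confidence score; the function name `route_prediction`, the 0.9 threshold, and the sample predictions are hypothetical choices for illustration.

```python
def route_prediction(label, confidence, threshold=0.9):
    """Accept confident predictions; send the rest to a human reviewer."""
    if confidence >= threshold:
        return {"label": label, "decided_by": "model"}
    # Below the threshold the model only suggests; a person makes the decision.
    return {"label": None, "decided_by": "human_review", "suggested": label}


# Made-up model outputs: (predicted label, confidence score).
predictions = [("approve", 0.97), ("reject", 0.62), ("approve", 0.91)]

decisions = [route_prediction(label, conf) for label, conf in predictions]
review_queue = [d for d in decisions if d["decided_by"] == "human_review"]
print(len(review_queue), "case(s) sent to human review")
```

Tracking how often cases end up in the review queue, and how often reviewers overrule the suggestion, gives you a practical signal for when the model or its data needs attention.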


If you are interested in learning more about Docmatic's innovative technology or you want to test our solutions live on your documents, please contact us at: office@docmatic.ai.