Why is data quality a major growth driver for companies? Sébastien Garcin, CEO of YZR, met Samir Amellal, CDO of La Redoute, to talk about this essential topic. A summary is available here (in French).
How can I ensure that my product data is reliable? How can I fully exploit its potential? How can I break down its silos to share it better? These major questions are on the agenda of more and more companies in 2021. However, answering them calls for an organizational transformation built on a *data-driven* culture, i.e. one that is centered around data. Indeed, the emergence of big data, artificial intelligence and machine learning has profoundly changed the way we do business in recent years. Until now, our algorithms were mainly used to compile and display data; they are now capable of much more. Marketing analysis, sales forecasting, process optimization: the use cases are numerous and allow companies to create value much faster than before.
However, one prerequisite for the effective use of these machine learning models has still not been fully addressed: data quality. As evidence, by a commonly cited estimate, data scientists spend almost 80% of their time preparing their data. Much of this work is data cleaning and what is commonly called feature engineering: transforming the training data so that it is error-free and consistently formatted before it is fed to the analysis algorithms. Although this work is crucial, it is very time-consuming and not very rewarding for experts with valuable and sought-after technical skills.
Moreover, given the huge amount of data to which companies have access, it is impossible to correct it all manually. The resulting poor data quality reduces the performance of the models, with a direct impact on growth.
What kind of data are we talking about? There are many types of data, but one category in particular remains under-exploited: product data.
In this article, we explain:
- Why data quality, and in particular product data, is an essential issue for any company if it wants to adopt a data-driven culture.
- Why current methods are not fully satisfactory.
- What we propose to address this issue.
It is obvious that customer data is extremely important for any company. Defining your target audience precisely and being able to segment it is the basis of any marketing strategy.
However, this data is increasingly sensitive and difficult to handle. Regulations on the protection of personal data, such as the GDPR, have become increasingly strict in recent years. It has even become complex to build a data lake containing customer data. At a time when companies are looking for more agility within their organization and want to be much more responsive in their decision-making, this situation becomes very problematic.
In reality, there are other types of data that are also strategic. This is the case for logistics, HR and especially product data. Product data in particular is comparatively simple to process and plays a central role in many departments (purchasing, marketing, ...) and verticals (geography, industry, ...) of companies.
The problem is that its volume keeps growing year after year, to the point that it is becoming harder and harder to manage. For example, until recently, Apple only had about 100 different product references. At that time, a small number of people could complete product files in a few hours and even conduct sales analyses. In 2021, this is simply unthinkable. The major retail companies have no fewer than 500,000 product references, and more than a million if you take into account their marketplace.
This exponential growth in data volumes has been very rapid, too rapid in fact. So much so that the definition and implementation of unique, shared product repositories is a subject that has never been fully addressed. We often find partial conventions defined by several types of people at various hierarchical levels, without any real agreement on a large scale. This heterogeneity generally leads to a degradation of data quality (read our detailed article on the subject here). The data is entered manually in different places by different people and can therefore contain textual errors, which in some cases prevent sales altogether. For example, a marketplace once listed a pair of pajamas at €19,000 per unit instead of €19 because of an input problem. Obviously, no purchase was recorded, but for a series of 100,000 items, the loss in terms of turnover can quickly become significant. Moreover, while this particular error is easy to spot, errors are far from always being so obvious. If these pajamas had been offered at €99, the gap would have been less visible, and the after-sales service would soon have had to deal with unhappy customers who felt they had been "scammed".
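An entry error like the €19,000 pajamas can be caught with a simple plausibility check. The sketch below is only an illustration, not a description of any particular tool: it assumes each product belongs to a category, and the function name `flag_suspicious_prices`, the SKUs and the default `ratio` of 5 are all hypothetical.

```python
from statistics import median


def flag_suspicious_prices(prices_by_sku, category_of, ratio=5.0):
    """Return the SKUs whose price is more than `ratio` times above or
    below the median price of their category (hypothetical rule)."""
    # Group prices by category.
    by_category = {}
    for sku, price in prices_by_sku.items():
        by_category.setdefault(category_of[sku], []).append(price)

    # The median is less sensitive to a single extreme value than the mean.
    medians = {cat: median(values) for cat, values in by_category.items()}

    flagged = []
    for sku, price in prices_by_sku.items():
        reference = medians[category_of[sku]]
        if reference > 0 and (price > ratio * reference or price < reference / ratio):
            flagged.append(sku)
    return flagged


# Invented sample data: one pajama price was typed with three extra zeros.
prices = {"PJ-1": 19.0, "PJ-2": 24.0, "PJ-3": 19000.0, "PJ-4": 21.0}
categories = {sku: "pajamas" for sku in prices}
print(flag_suspicious_prices(prices, categories))
```

With these sample values the €19,000 entry is flagged, whereas a €99 entry would pass unnoticed at this ratio, which illustrates why less extreme errors are harder to catch.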
In reality, in the age of big data, the challenge is not to have a large amount of data. With the emergence of APIs and open data, it is now quite easy to collect large volumes of data. It often happens that the legacy data of some companies exceeds 500 TB! The heart of the matter ultimately lies in the ability of organizations to exploit the data coming from customers, suppliers or business partners and to make it reliable. A huge amount of work on the quality of all this data still lies ahead. In other words, the more data there is to manage, the more important it is to industrialize its quality.
The main interest for companies is to be able to extract value from their product or performance data. Retailers are increasingly faced with powerful suppliers who demand that this data be shared with them free of charge. If the data is well managed and standardized, it becomes possible to monetize it and therefore to reap more profits.
So, while product data quality issues were until now considered minor, they have become a central concern.
However, several steps are necessary to solve them.
For a company with a marketplace, tracking sales performance is central. It is a matter of performing statistical analyses based on customer purchases to optimize the product offering. One of the traditional methods for this is to use the EAN codes assigned to each product. The problem is that companies use both internal and external EANs, which are likely to be confused, and these EANs are constantly changing during the sales cycle. It often happens that suppliers re-assign the same code to different products or that companies re-use identical codes when products are renewed. Consequently, if there is no long-term monitoring of the permanence of these EAN codes, there is a significant risk that the models will generate statistical errors.
In practice, tools are developed within companies to reconstruct the history of EANs when they change. But this work is very tedious and not always reliable. Above all, this type of solution is quickly confronted with problems of scale, as the number of codes to be tracked grows day by day.
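To make the risk concrete, here is a minimal sketch of the kind of check such an internal tool might run: it lists the codes that have been attached to more than one product over time. The record layout `(date, ean, product_id)`, the function name `find_reused_eans` and the placeholder codes are assumptions made for illustration.

```python
from collections import defaultdict


def find_reused_eans(records):
    """records: iterable of (iso_date, ean, product_id) tuples.

    Returns {ean: [product_ids in order of first appearance]} for every
    code that has been seen with more than one product.
    """
    products_by_ean = defaultdict(list)
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    for _date, ean, product_id in sorted(records):
        if product_id not in products_by_ean[ean]:
            products_by_ean[ean].append(product_id)
    return {ean: ids for ean, ids in products_by_ean.items() if len(ids) > 1}


# Invented history: the code "EAN-0001" is reassigned to a new product.
history = [
    ("2021-03-01", "EAN-0001", "tee-blue"),
    ("2020-01-15", "EAN-0001", "tee-red"),
    ("2020-02-01", "EAN-0002", "mug"),
]
print(find_reused_eans(history))
```

A check like this only reveals that a code has changed hands; deciding which sales belong to which product still requires the history to be rebuilt, which is where the scaling problem appears.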
One solution is to focus solely on product descriptions. Once the descriptions have been standardized, matching becomes possible: if two products have very similar descriptions, there is a strong chance that they are identical. This is what we propose at YZR (see below).
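As a rough illustration of the idea, and not of how YZR actually works, the sketch below normalizes two descriptions and compares them with Python's standard `difflib.SequenceMatcher`. The normalization rules, the function names and the 0.9 threshold are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher


def normalize(description):
    """Lowercase, drop punctuation and sort the words so that word order
    and formatting differences do not affect the comparison."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized descriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same_product(a, b, threshold=0.9):
    """Hypothetical decision rule: treat two descriptions as the same
    product when their similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(likely_same_product("Blue Cotton Pajamas, Size M", "pajamas blue cotton size m"))
print(likely_same_product("Blue Cotton Pajamas, Size M", "Stainless steel kettle 1.7 L"))
```

A purely character-based comparison like this one cannot tell that two different names refer to the same item, so real matching generally needs business knowledge on top of string similarity.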
Product data is processed throughout a chain that involves many actors. These intermediaries, who are generally people, alter the data and introduce errors that must be corrected. For example, when an e-commerce player makes a sale on its site, it must be able to manage the product data at different levels: storage warehouse, logistics partners, delivery point, website for order tracking and finally customer support. It is therefore essential that this data is synchronized and homogeneous to ensure optimal service.
To achieve this, two methods exist:
- Deploy safeguards to avoid human error, i.e. impose constraints that prevent staff from entering data freely. For example, this could be a drop-down menu with imposed choices for filling in a cell in an Excel spreadsheet. However, this solution alone is not satisfactory and can sometimes even be counterproductive. As human beings, we are specialists in getting around obstacles, especially if it saves us energy. If the drop-down lists in the Excel spreadsheet are too long, who would not be tempted to systematically choose the first option and put the real information in the free "comment" field? The time saving would be considerable! But it would come at the expense of data quality...
- Develop or obtain data correction solutions. This time, people are free to fill in the data and are helped a posteriori by data correction tools. They use them mainly in the event of a data entry error or, for a product reference, a description that is too terse or a blurred photo. This solution is relevant, but it is still necessary to be able to identify where the poor-quality data is located, and to have the right data quality tools with trained employees who are comfortable using them. Moreover, most of the time, these tools do not correct the data directly but estimate the probability of error. If a certain threshold is exceeded (e.g. 50%), an alert is triggered to indicate the need for correction. But this creates another problem, since the right person with sufficient expertise must then be found to correct the data. This is especially true in the pharmaceutical industry. Who, apart from a specialist, will know that "Singulair" and "Montelukast" are in fact the same drug, one being the generic of the other? Without that knowledge, a data scientist will have great difficulty developing a production optimization algorithm. In many other situations, data scientists find themselves just as helpless, not knowing where to turn to find the right information to improve their data. It is therefore important that these correction tools are specifically designed to be used by business specialists, whose operational vision makes them relevant for this type of task.
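The alerting logic described in the second method can be sketched as follows. This is a simplified illustration under stated assumptions: each record is assumed to already carry an `error_probability` computed by some tool, and the field names, the `experts_by_domain` mapping and the function name `build_alerts` are hypothetical.

```python
def build_alerts(scored_records, experts_by_domain, threshold=0.5):
    """Return (record_id, reviewer) pairs for every record whose estimated
    error probability exceeds the threshold.

    Records from a domain with no known expert are marked "unassigned",
    which makes the "who can fix this?" problem visible.
    """
    alerts = []
    for record in scored_records:
        if record["error_probability"] > threshold:
            reviewer = experts_by_domain.get(record["domain"], "unassigned")
            alerts.append((record["id"], reviewer))
    return alerts


# Invented sample records and expert mapping.
records = [
    {"id": "r1", "domain": "pharma", "error_probability": 0.72},
    {"id": "r2", "domain": "textile", "error_probability": 0.10},
    {"id": "r3", "domain": "toys", "error_probability": 0.55},
]
experts = {"pharma": "pharmacist", "textile": "buyer"}
print(build_alerts(records, experts))
```

Routing each alert to a business specialist rather than to a data scientist reflects the point made above: the person best placed to correct a record is the one who understands its operational context.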
The flow of data within companies is comparable to the circulation of blood in the human body. Just as blood connects the vital organs, data must circulate rapidly between the different entities of an organization. Moreover, each department must be able to take ownership of the data it processes. Marketing must be able to analyze its promotional campaigns, sales must be able to make its own sales forecasts, and logistics must be able to optimize its robotic picking arms. Everyone must be able to work with their own data! In fact, it is particularly important that data science does not become a completely centralized department, as IT once was, at the risk of making processes considerably more cumbersome, generating frustration and above all losing productivity. However, the autonomy of each department in data matters rests on a prerequisite: the data entrusted to it must be governed, monitored and of good quality. In other words, trust in the work of each individual can only be achieved if the data processed is sufficiently reliable.
Thus, a company can be considered fully data-driven as soon as its quality management and governance (which makes it possible to know who has the authorization to manipulate the data) are centralized and all the resulting analyses are distributed to the business experts. The operation is then optimal: the data circulates at an identical level of reliability and irrigates the entire organization of the company, ensuring the maintainability and rapid scalability of all business management systems.
YZR is a no-code artificial intelligence platform 100% dedicated to textual data normalization. Delivered as a plug-and-play tool, it is aimed at operational business people (product managers, buyers, etc.) and all those who fully understand the business context in which data is used, because we are convinced that their skills are much better spent exploiting the data than preparing it.
Our data quality tool is specially designed to solve your problems related to:
- The multiplicity of your data sources
- The absence of naming conventions
- Manual data correction
- Data governance and sharing
Our SaaS solution also integrates with your various tools (Product Information Management, Master Data Management, Data Science and Machine Learning, Business Intelligence), to enable you to achieve, among other things:
- Better customer knowledge
- Optimized sales forecasts
- Accelerated digitization of your offer
In other words, with YZR, you exploit the full potential of your data.
Want to know more? Would you like a demonstration of our product? Do not hesitate to contact us directly on our website or at firstname.lastname@example.org
Today, the development of artificial intelligence and machine learning offers companies new possibilities to analyze and model the huge amount of data they produce. Yet they still face a very important difficulty: the poor quality and heterogeneity of data at the source.
This is especially true in the retail and consumer products industries, where solving the problem of product data heterogeneity is a major growth driver.
To understand why and to learn more, feel free to download our white paper available here!