Data warehouse, data lake, data hub, data fabric: it can be difficult to find one's way through the semantics of data storage and processing solutions. Over the decades, major evolutions have revolutionized the way data is used within companies, yet confusion still reigns, preventing the full potential of these solutions from being exploited.
It is therefore important to know the type of data you are handling, and in particular the way it is organized, since this influences both the storage architecture to favor and the algorithms used to process the data. There are three types of data:
- Structured data (data rigorously organized in the form of tables, with rows and columns: spreadsheets, directories, etc.)
- Semi-structured data (weakly organized data with only tags or separators: tweets organized by hashtags, files organized in folders, etc.)
- Unstructured data (data with no particular form of organization: emails, images, videos, etc.)
While originally only structured data could be stored and processed, the explosion in the amount of data and the massive use now made of it have led to the development of new systems capable of integrating semi-structured and even unstructured data.
Here is a brief overview of these different solutions: how they work, their advantages and limitations, and how they differ and are complementary.
A database is a collection of highly structured information. Generally organized in the form of a table, each piece of data is stored in a predefined cell: each row designates an object and each column corresponds to one of its attributes. For example, in a company's sales database, each row represents a prospect and each column holds an attribute such as name, postal address or telephone number. A database is therefore the simplest way to store data, but also the most rigid: only structured data can be fed into it, and its schema cannot easily be changed once the fields have been defined. Another particularity is that these operational databases are used primarily for data exchanges (transactions); they are not designed for analysis tasks. Databases are handled by database management systems (DBMS) that use query languages to manipulate the objects they contain, the best known and most widely used being SQL (Structured Query Language).
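The row-and-column model described above can be illustrated with a few lines of SQL. Here is a minimal sketch using Python's built-in sqlite3 module; the prospects table and its sample values are invented for the example:

```python
import sqlite3

# In-memory database: each row is a prospect, each column an attribute.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prospects (name TEXT, postal_address TEXT, phone TEXT)"
)
conn.execute(
    "INSERT INTO prospects VALUES (?, ?, ?)",
    ("Acme Corp", "12 Main Street", "+33 1 23 45 67 89"),
)

# SQL is the query language used to manipulate the stored objects.
rows = conn.execute(
    "SELECT name, phone FROM prospects WHERE name LIKE 'Acme%'"
).fetchall()
print(rows)  # [('Acme Corp', '+33 1 23 45 67 89')]
conn.close()
```

Note how the schema (the three columns) is declared up front: any record that does not fit this structure simply cannot be inserted, which is the rigidity discussed above.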
Data warehouses are more advanced storage systems than databases. They are large storage places connected to several operational databases (marketing, sales, Enterprise Resource Planning, Customer Relationship Management, etc.). The goal of a data warehouse is to centralize all or part of this data so that it can serve as a reference for analysis tools. For example, to determine the effectiveness of a prospecting campaign, you need both sales data and prospect data. Rather than fetching data directly from the corresponding databases, which could alter them and complicate the solution's architecture, an algorithm processes the data directly in the data warehouse. The data warehouse is therefore a true reference repository at the heart of business intelligence.
Feeding a data warehouse therefore requires processing data from very diverse sources. This generic process is called ETL, for Extract, Transform & Load: data is extracted from the different databases, transformed so that it is in the right format and free of errors, and finally loaded into the data warehouse. Among these steps, the Transform part is crucial: business intelligence programs cannot be used if the data that feeds them is not of good quality. It is therefore necessary to standardize the data into a common format, deduplicate it to eliminate redundancies, and sort it. Although tools exist to support these ETL functions, they are often performed by manual coding or by correcting and modifying the data directly in Excel spreadsheets. Besides being extremely time-consuming, this kind of manual work poses maintenance and scaling problems as the volume of data increases. Having a data warehouse therefore necessarily implies putting a data quality management system in place in order to perform reliable analyses.
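The Extract, Transform and Load steps above can be sketched in a few lines. The following is a deliberately simplified illustration (the source records and field names are invented): it standardizes formats, deduplicates, and sorts before loading.

```python
# Extract: records pulled from two hypothetical source databases.
sales_db = [{"name": "  Dupont ", "city": "PARIS"}]
crm_db = [{"name": "Dupont", "city": "Paris"},
          {"name": "Martin", "city": "Lyon"}]

# Transform: standardize formats, deduplicate, sort.
def transform(records):
    cleaned = {
        (r["name"].strip().title(), r["city"].strip().title())
        for r in records  # a set removes exact duplicates after cleaning
    }
    return sorted(cleaned)

# Load: write the cleaned rows into the warehouse (a plain list here).
warehouse = transform(sales_db + crm_db)
print(warehouse)  # [('Dupont', 'Paris'), ('Martin', 'Lyon')]
```

The two "Dupont" records, which differ only in spacing and capitalization, collapse into one row once formats are standardized, which is exactly the kind of redundancy the Transform step exists to eliminate.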
Finally, it should be noted that a data warehouse can only contain structured data; unstructured data cannot be stored there.
So how can we exploit this type of data, which can be very abundant in some organizations? In this case, a third storage system must be implemented: the data lake.
Data lakes, like data warehouses, are storage spaces designed to hold very large volumes of data. Unlike data warehouses, however, data lakes are designed to be fed with structured, semi-structured and unstructured data alike. A data lake therefore mainly hosts data that has been minimally transformed and that can be in any format: video, text, images, etc. Like data warehouses, data lakes are entry points for analysis tools capable of processing a wide variety of data. Their usefulness is considerable when it comes to deploying machine learning projects, since they are suited to ingesting large volumes of data, even unstructured data. The data lake is thus a very flexible storage place from which many models get their data. Another advantage is that data lakes can be deployed on-premise or in the cloud, which makes deployment and connection with other cloud services for data analysis, visualization or processing much easier.
However, this agile architecture requires particular precautions to be fully operational. Among them:
- The need to implement data search tools. Data lakes can be very dense, with many different types of data mixed together. It is essential to be able to find your way around.
- The need to implement a governance system. The data that is stored there can come from various sources (local, regional, global) with varying degrees of sensitivity and strategic importance (especially with regard to customer data). Being able to control access at a fine level of granularity is essential.
- The need to integrate data preparation tools. The particularity of data lakes is that the data they contain is very disparate, which often leads to data quality problems: wrong formats, textual errors, duplicates. This can be very problematic when the data is used to feed artificial intelligence algorithms. Preparing data from data lakes is therefore of crucial importance.
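The quality problems listed above (wrong formats, textual errors, duplicates) can be flagged automatically before the data feeds any algorithm. A minimal sketch, with invented product records and field names, of a normalization-plus-audit pass over raw lake data:

```python
import re

# Hypothetical raw records landing in a data lake, in disparate formats.
raw = [
    {"sku": "A-001", "label": "USB cable 2m"},
    {"sku": "a001",  "label": "usb Cable 2 M"},  # near-duplicate of the first
    {"sku": "B-17",  "label": ""},               # missing label
]

def normalize(record):
    """Standardize SKU format and label casing before analysis."""
    sku = re.sub(r"[^0-9A-Z]", "", record["sku"].upper())
    label = " ".join(record["label"].lower().split())
    return {"sku": sku, "label": label}

def quality_issues(records):
    """Flag duplicates and empty fields, two typical lake problems."""
    seen, issues = set(), []
    for r in map(normalize, records):
        if not r["label"]:
            issues.append(("missing_label", r["sku"]))
        if r["sku"] in seen:
            issues.append(("duplicate_sku", r["sku"]))
        seen.add(r["sku"])
    return issues

print(quality_issues(raw))  # [('duplicate_sku', 'A001'), ('missing_label', 'B17')]
```

Note that the duplicate only becomes visible after normalization: "A-001" and "a001" look distinct in the raw lake, which is why preparation has to come before detection.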
We have now seen the three main storage systems available to companies. More or less elaborate, they all serve as a basis for the deployment of various data processing tools: preparation, analysis, visualization, etc. However, they are not sufficient by themselves: other classes of infrastructure are needed to manage the data flows that circulate within organizations.
Among them, we find data hubs. Data hubs are platforms whose purpose is to promote data sharing and governance. The added value of a data hub lies in its ability to connect storage systems to each other and to business applications (such as predictive sales models). Indeed, the problem most companies face is that their data is organized in silos, following the company's main branches of activity (marketing, sales, HR, logistics, etc.). Since each silo has its own storage system, it is impossible to get a global view of the company's activity. Yet, in order to efficiently analyze their data, solve concrete business problems or answer specific questions from their suppliers, customers or partners, companies need their data to be linked as much as possible across their storage infrastructures. The data hub acts as a central point that ensures this connection. The data stored there is kept to a minimum, and unlike data lakes and data warehouses, the hub is not directly the basis for analysis tools. It can nevertheless serve as an interface for many users to search, access or process their data. Finally, it can act as a governance body by controlling access to different types of data according to the user's profile.
Data fabrics correspond to the most advanced phase of data sharing within companies. A data fabric is a data management architecture that takes the form of a logical network of structured data. Its functioning is therefore similar to that of a human brain: the brain is a physical network that connects information from different areas to make decisions in real time, and in a similar way, a data fabric is a logical network that connects data between various business entities for both operational and strategic use cases. The possibilities offered are therefore far more numerous than those of other infrastructures such as data lakes. Two key elements distinguish data fabrics from other types of architecture:
- All data ingestion, integration, preparation and delivery processes are fully automated. Users, from the data scientist who wants to train their model to the decision-maker who wants to understand the reasons for a very local drop in turnover, thus have direct access to reliable, high-quality data. The use cases made possible by data fabrics are therefore extremely varied and complex: predictive analyses adjusted in real time, intelligent decision-support assistants, fine-tuning of processes, etc.
- The active use of metadata (information about the data itself: location, owner, date of update, etc.) enables governance at a very high level of granularity. In concrete terms, the data processed within the data fabric is updated in real time and access to it can be adjusted by the user for an entire data set, a specific file or even a cell in a spreadsheet. Furthermore, data is completely connected through graph-like models, which are structures made of nodes and links that allow separate elements to be connected very easily.
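To make the idea of metadata-driven granularity concrete, access control can be sketched as a rule lookup that checks the most specific scope first (a cell), then the file, then the whole data set. The scope names, roles and identifiers below are invented for the example:

```python
# Hypothetical access rules keyed by scope, from coarse to fine.
# A more specific scope (file, cell) overrides the data-set default.
rules = {
    ("dataset", "sales_2024"): {"analyst", "manager"},
    ("file", "sales_2024/france.csv"): {"manager"},
    ("cell", "sales_2024/france.csv!B12"): set(),  # locked for everyone
}

def can_access(role, dataset, file=None, cell=None):
    """Apply the finest-grained rule that matches the request."""
    for kind, key in (("cell", cell), ("file", file), ("dataset", dataset)):
        if key is not None and (kind, key) in rules:
            return role in rules[(kind, key)]
    return False  # no matching rule: deny by default

print(can_access("analyst", "sales_2024"))                           # True
print(can_access("analyst", "sales_2024", "sales_2024/france.csv"))  # False
```

The same analyst can read the data set as a whole but not a specific file within it, which illustrates governance adjusted "for an entire data set, a specific file or even a cell" as described above.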
Data fabrics are not actually software solutions in their own right, but rather architectures composed of several tools whose purpose is to share and govern data much more efficiently. They will not replace the infrastructures already in place, but rather unify and operationalize them on a larger scale. This is why we talk about data fabric design: efficiently connecting an organization's entire data ecosystem to develop complex data management models quickly and on a much larger scale than what is done today.
It appears that the storage systems available to companies are becoming increasingly complex. Originally, only small volumes of structured data could be processed, in databases; today it is possible to process large volumes of data in different formats, in data warehouses or even data lakes. However, the main problem organizations face is that their data is often hard to find and mostly organized in silos. For this reason, data hub infrastructures are being implemented more and more. An essential point is that data warehouses, data lakes and data hubs should not be set in opposition; on the contrary, they are complementary and must interact to be fully effective.
In the longer term, the data fabric design will ensure optimal data sharing, governance and quality, with a multitude of business applications at stake.
To achieve this, it is important to integrate the right tools within these architectures to deliver reliable, relevant and valuable data in real time. This is a real strategic challenge to become fully data-driven (see our linked article here).
YZR is a no-code artificial intelligence platform 100% dedicated to the normalization of textual data, one of the most important phases in the preparation of your data. As a plug-and-play tool, it is aimed at operational business people (product managers, buyers, etc.) and all those who fully understand the business context in which the data is embedded, because we are convinced that their skills are much better used exploiting the data than wasting time preparing it manually.
Our SaaS tool is specially designed to solve your problems related to:
- The multiplicity of your data sources
- The absence of naming conventions
- Manual data correction
- Data governance and sharing
It also integrates perfectly with your various tools (Product Information Management, Master Data Management, Data Science - Machine Learning, Business Intelligence), to enable you to achieve, among other things:
- A better customer knowledge
- Optimized sales forecasts
- Accelerated digitization of your offer.
In other words, with YZR, you exploit the full potential of your data.
Want to know more? Would you like a demonstration of our product? Do not hesitate to contact us directly on our website or at firstname.lastname@example.org
While the development of artificial intelligence and machine learning today offers companies new possibilities to analyze and model the huge amounts of data they produce, they still face a major difficulty: the poor quality and heterogeneity of data at the source.
This is especially true in the retail and consumer products industries, where solving the problem of product data heterogeneity is a major growth driver.
To understand why and learn more, feel free to download our white paper available here!
- Gartner; Data Hubs, Data Lakes and Data Warehouses: How They Are Different and Why They Are Better Together; Ted Friedman, Nick Heudecker; June 2, 2021.
- Gartner; What Is Data Fabric Design?; Robert Thanaraj, Mark Beyer, Ehtisham Zaidi; April 14, 2021.