I have been working for the last three months as a data scientist at Fresenius Medical Care, a German DAX 30 company specializing in kidney dialysis services. Even though I started as a data scientist, I am currently in the role of Big Data Solution Architect and jointly responsible for the technical architecture of a central data store.
At the moment, the data at Fresenius is isolated in silos. The data produced in the different regions and business units is inaccessible to other parts of the company. To allow for data-driven decision making, a centralized data store is needed. The requirements can be divided into two pillars: data infrastructure and data governance.
The data lake needs to be a scalable data store with rapid data retrieval that is agnostic to its clients. We abstracted these requirements into four layers: the source layer, the storage layer, the query layer, and the client layer.
The source layer summarizes the different data sources. At Fresenius, we have to deal with all kinds of data systems, from relational databases over customized SAP systems to simple Excel sheets. To make this data accessible, it has to be ingested by an Extract-Transform-Load (ETL) script, which takes care of data aggregation and cleansing. Data aggregation summarizes information. For example, the Europe, Middle East and Africa (EMEA) supply chain (SC) SAP Business Warehouse (BW) has more than 50,000 tables, which are incomprehensible without treatment. Furthermore, many data fields are misused or incompletely filled by the local data owners and have to be corrected or removed, which is referred to as data cleansing.
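A minimal sketch of what such a cleansing step could look like; the field names, the allowed-region list, and the sample rows are purely illustrative, not our actual SAP schema:

```python
import csv
import io

# Hypothetical cleansing rules: drop rows missing a mandatory field and
# normalize a misused free-text field into a controlled vocabulary.
ALLOWED_REGIONS = {"EMEA", "NA", "APAC", "LATAM"}

def cleanse(rows):
    """Yield only rows with a material number and a recognizable region."""
    for row in rows:
        if not row.get("material_no"):        # incomplete record -> remove
            continue
        region = (row.get("region") or "").strip().upper()
        if region not in ALLOWED_REGIONS:     # misused field -> remove
            continue
        row["region"] = region                # corrected, canonical value
        yield row

raw = io.StringIO(
    "material_no,region\n"
    "M-100,emea\n"
    ",EMEA\n"           # missing material number
    "M-200,Planet X\n"  # misused region field
)
clean = list(cleanse(csv.DictReader(raw)))
print(clean)  # only M-100 survives, with region normalized to "EMEA"
```

In the real pipeline this logic sits between the extract and the load step; the important part is that corrections (here the region normalization) and removals are applied before anything reaches the lake.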
The outgoing stream of the ETL is written as Apache Parquet files to S3 object storage.
S3 stands for Amazon Simple Storage Service, a scalable, distributed object store based on the Amazon Dynamo paper. At Fresenius, we use MinIO, an open-source, on-premise S3 alternative, since most of the company runs on Microsoft products, including the Azure cloud.
Apache Parquet is a columnar storage format based on the Dremel paper. Parquet restructures nested data into self-contained Parquet files; apart from the data itself, each file stores additional metadata about the data structure. Column-major order, as opposed to row-major order, is better suited for parallelized processing. Columnar data also compresses better, as a single column of the same data type can be laid out as one continuous piece of memory. Lastly, since users are most often only interested in a subset of all columns, a query over columnar storage does not need to read every row, but only the needed columns.
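The row-major versus column-major distinction can be made concrete with plain Python data structures (the table contents are made up):

```python
# Row-major: each record is stored together, as in a classic RDBMS page.
rows = [
    ("M-100", "EMEA", 4200.0),
    ("M-200", "NA",   1337.5),
    ("M-300", "EMEA",  815.0),
]

# Column-major (Parquet-style): each column is one contiguous run of a
# single type, which compresses well and can be processed in parallel.
columns = {
    "material_no": ["M-100", "M-200", "M-300"],
    "region":      ["EMEA", "NA", "EMEA"],
    "amount":      [4200.0, 1337.5, 815.0],
}

# A query touching only "amount" reads one contiguous column instead of
# scanning every field of every row.
total = sum(columns["amount"])
print(total)  # 6352.5
```

Parquet adds encoding, compression, and per-file metadata on top of this basic layout, but the access-pattern advantage is exactly the one shown here.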
To access the data, we implement a data virtualization layer using Presto. Presto is an open-source, distributed SQL query engine for running interactive analytic queries.
Presto can connect to a variety of data sources such as S3, MySQL, PostgreSQL, MongoDB, and many more. Presto integrates with a Hive Metastore, which can be used as a data catalog. As every SQL query passes through Presto, it can furthermore be extended via plugins to add authorization capabilities with Apache Ranger and data lineage creation with Uber's QueryParser. Authorization declares which identity is allowed to perform specific actions on which data objects. Data lineage gives information about the origin of a data object.
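Because every source appears as a catalog, one statement can join across systems. A sketch of such a federated query; the catalog, schema, and table names are hypothetical:

```python
# Hypothetical federated query: Presto joins a Parquet table exposed
# through the Hive/S3 connector with a reference table living in MySQL.
query = """
SELECT s.material_no, s.amount, m.material_name
FROM hive.supply_chain.shipments AS s
JOIN mysql.masterdata.materials AS m
  ON s.material_no = m.material_no
"""

# Both catalogs appear in one statement; Presto plans the join and pushes
# work down to each underlying source.
print("hive." in query and "mysql." in query)  # True
```

The same query would run unchanged from any SQL client connected to Presto, which is what makes the virtualization layer attractive.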
Presto was developed by Facebook in 2012. Unfortunately, the Presto community is split in two: prestosql and prestodb, where the latter is maintained by Facebook. I prefer the prestosql distribution, as it has more data connectors, is compatible with Apache Ranger (an authorization framework), and is not controlled by Facebook.
To consume data, we have to connect clients to the query layer. At Fresenius, we use a variety of Business Intelligence (BI) software such as Tableau, Qlik, and Power BI, all of which provide Presto connectivity. As Presto offers Open Database Connectivity (ODBC) out of the box, it can be connected to almost any client. In the future, further tools like Jupyter Notebook or Spark will be connected to the data lake for intensive data analysis.
Not all data should be accessible to everybody, available data should be easy to discover, and data should meet quality requirements. Consequently, data security, data accessibility, and data integrity need to be implemented.
Data security is about protecting data from unauthorized users. In our data lake architecture, it consists of two parts: authentication and authorization.
Authentication is the process of verifying your identity. For example, when signing in you enter your email and your password to authenticate yourself. The two most important protocols concerning authentication are the Lightweight Directory Access Protocol (LDAP) and Kerberos. Kerberos is more versatile, but due to limitations of the Tableau-Presto integration, we use LDAP. At Fresenius, most of the staff authenticate themselves via Azure Active Directory (AAD). As a result, we do not have to create another identity for our employees to identify themselves in the data lake but can use their current credentials for authentication.
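For orientation, LDAP password authentication in prestosql is configured on the coordinator roughly like this; the host and bind pattern below are placeholders, not our actual directory settings:

```properties
# etc/password-authenticator.properties (coordinator) — illustrative values
password-authenticator.name=ldap
ldap.url=ldaps://ldap.example.com:636
ldap.user-bind-pattern=${USER}@example.com
```

With this in place, clients such as Tableau pass a username and password, and Presto verifies them against the directory.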
Authorization, on the other hand, specifies the access rights of each user: it defines the actions a user can perform on a data object. To achieve authorization, we use a tool from the Hadoop platform: Apache Ranger. Apache Ranger can be plugged into Presto and uses LDAP for authentication. It contains a web GUI to grant permissions to users, groups, and roles down to the column level. For example, the SC needs access to all manufacturing data, but is not allowed to view the financial information of the sold products. With Ranger, we can create a supply_chain group, map all SC staff to the group, and then allow the group to execute the SELECT action only on tables containing manufacturing data while excluding every column that contains financial data!
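Ranger's own policy format is JSON behind a REST API, but the decision it makes can be modeled in a few lines. This is an illustrative model of column-level authorization, not Ranger's implementation; the group, table, and column names are hypothetical:

```python
# Group -> table -> policy; a group may SELECT a table, minus explicitly
# excluded columns (e.g. financial fields).
policies = {
    "supply_chain": {
        "manufacturing": {"allow": "SELECT",
                          "exclude_columns": {"unit_price", "margin"}},
    }
}

def allowed_columns(group, table, requested):
    """Return the subset of requested columns the group may read."""
    policy = policies.get(group, {}).get(table)
    if policy is None or policy["allow"] != "SELECT":
        return set()                      # no access at all
    return set(requested) - policy["exclude_columns"]

cols = allowed_columns("supply_chain", "manufacturing",
                       {"material_no", "quantity", "unit_price"})
print(sorted(cols))  # ['material_no', 'quantity']
```

Ranger evaluates essentially this kind of rule for every query that passes through the Presto plugin, so the SC analyst sees quantities but never prices.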
To discover available data objects, a data catalog is needed. A data catalog comes in the form of a search engine that can query databases, tables, columns, and their metadata. For example, the metadata of a data column contains its data type, a comment, and its relations to other data objects such as tables or BI reports. The top proprietary players are Collibra and Informatica, which are costly and would need to be extensively customized to fit into our Presto data virtualization layer. For our purposes, we have decided to develop a custom solution.
For data discovery, we will use the Hive Metastore that backs Presto. The Metastore has a REST API (WebHCat), which allows us to query databases, tables, and columns with their corresponding metadata, and to update that metadata. Currently, we are building a frontend connected to this REST API, based on Vue Element Admin. In the first phase, the query capabilities of the data catalog will still be limited. In the second phase, we plan to add a proper backend to the data catalog, which will be kept in sync with the Metastore (e.g. Ranger's audit capabilities require Apache Solr, which could be plugged into our data catalog). The plan is to add a business glossary that allows data citizens across the organization to collaboratively agree on the definitions, rules, and policies that define their data.
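The WebHCat DDL resources the frontend talks to follow a predictable URL scheme. A small sketch of the URL construction; the host, port, user, and database names are placeholders, and the actual HTTP call of course requires a running metastore service:

```python
from urllib.parse import urlencode

# WebHCat exposes Hive DDL metadata under /templeton/v1/ddl/...
BASE = "http://metastore.example.com:50111/templeton/v1"

def ddl_url(user, *parts):
    """Build a WebHCat DDL URL, e.g. for listing databases or tables."""
    path = "/".join(("ddl",) + parts)
    return f"{BASE}/{path}?{urlencode({'user.name': user})}"

list_dbs    = ddl_url("etl_bot", "database")
list_tables = ddl_url("etl_bot", "database", "supply_chain", "table")
print(list_dbs)
print(list_tables)
```

A GET on the first URL returns the known databases as JSON; the second lists the tables of one database, which is exactly the browsing hierarchy the catalog frontend needs.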
Data lineage maps the relations of data objects in a directed acyclic graph (DAG). The DAG can be visualized using a layered graph drawing (Sugiyama style). Knowing the data lineage, we can perform an impact analysis and a lineage analysis. Currently, we follow the Presto/Hive model of data object relations:
columns -> tables -> data source -> report
In an impact analysis, we track the relations between data objects to learn which parts are affected if a data object is removed. For example, if we remove a table, we would like to know which reports would become unavailable. In an impact analysis, we normally follow the relations from bottom to top (from columns over tables and data sources to the reports).
Performing a lineage analysis, we are interested in the composition of reports and data sources. For example, we want to know the data sources of a popular report so we can use them in our own report as well. In a lineage analysis, we follow the relations from top to bottom (from reports over data sources and tables to columns).
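Both analyses are plain graph traversals over the same DAG, one along the edges and one against them. A minimal sketch with made-up object names:

```python
# Edges follow the columns -> tables -> data source -> report chain:
# each data object points at the objects built on top of it.
edges = {
    "col:amount":      ["table:shipments"],
    "col:material_no": ["table:shipments"],
    "table:shipments": ["source:hive"],
    "source:hive":     ["report:sc_dashboard"],
}

# Reverse index for walking the graph top-down.
parents = {}
for src, dsts in edges.items():
    for dst in dsts:
        parents.setdefault(dst, []).append(src)

def reachable(node, adjacency):
    """All data objects reachable from `node` via `adjacency`."""
    found, stack = set(), [node]
    while stack:
        for nxt in adjacency.get(stack.pop(), []):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found

# Impact analysis (bottom-up): what breaks if the table is removed?
print(sorted(reachable("table:shipments", edges)))
# ['report:sc_dashboard', 'source:hive']

# Lineage analysis (top-down): what is the report composed of?
print(sorted(reachable("report:sc_dashboard", parents)))
# ['col:amount', 'col:material_no', 'source:hive', 'table:shipments']
```

Because the graph is acyclic, the traversal terminates without cycle bookkeeping; the Sugiyama layout then only concerns how this DAG is drawn, not how it is queried.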
To model the data lineage, we have to parse each query Presto executes. Luckily, Uber created QueryParser, an open-source solution written in Haskell. We create a Presto plugin that sends every query to QueryParser, which then models the data object relations. Furthermore, we plan to capture the relations to the BI reports using the Tableau Metadata API.
We need to make sure that a data quality threshold is met and that processes are in place which constantly improve and harmonize the different data sources. For this, we implement two stages: data cleansing and data control.
Data cleansing is the process of detecting and correcting corrupt or inaccurate data records. We will implement checks in our ETL that verify whether a data source fulfills certain metrics, e.g. that no more than 80% of all records in a column have a NULL value, or that an alert is triggered if a column that should contain numbers contains a string.
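The two checks just mentioned are simple to state in code. A sketch, with an illustrative column and thresholds:

```python
def null_ratio_ok(values, max_null_ratio=0.8):
    """Check that the share of NULLs stays below the threshold."""
    nulls = sum(1 for v in values if v is None)
    return (nulls / len(values)) <= max_null_ratio

def type_alerts(values, expected=(int, float)):
    """Collect non-NULL values that are not of the expected numeric type."""
    return [v for v in values if v is not None and not isinstance(v, expected)]

amounts = [4200.0, None, 815.0, "n/a", None]
print(null_ratio_ok(amounts))  # True: 2/5 = 40% NULLs, under the 80% limit
print(type_alerts(amounts))    # ['n/a'] would trigger an alert
```

In the ETL, a failed `null_ratio_ok` or a non-empty `type_alerts` result would stop the load and notify the data owner rather than silently propagating bad data into the lake.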
To improve data quality in aspects that (still) have to be detected by humans, we enhance our data catalog with a data control ticketing system. If a data citizen notices problems with a database, table, or column, they can directly create a ticket, which then has to be reviewed by the corresponding data owner.
Being in control of your data is a complex but important pillar of the digitalization of a company. Having an easily accessible and secure single source of truth is becoming a must-have to stay competitive. Many companies, like Fresenius Medical Care, are still struggling to manage their data. I hope this article gives valuable insights into how to architect a data lake for Big Data.