Understanding Data Storage: Lakes vs. Warehouses

Now more than ever, companies are looking for ways to build data analytics into their daily operations and use data-driven insights to improve business functions. They are increasingly turning to complex data for tasks like machine learning and artificial intelligence, which are becoming necessary to understand and reach customer segments across industries. A successful data strategy, however, depends on understanding data storage. The two most common storage approaches are data lakes and data warehouses, and each has benefits and pitfalls that organizations must understand in order to capitalize on them.

Data Lakes vs. Data Warehouses Use Cases

One effective way to better understand the different functions of data lakes and data warehouses is to focus on the end use case. Typically, there are three types of data consumers: farmers, explorers and executives.

Farmers need data and information to execute their day-to-day activities. They report on the key performance indicators (KPIs) tied to their jobs, so they need data delivered in a structured format.
Explorers are users who want to experiment with data, look at new types of data pools and gather specific insights. If data is presented in a fixed format, it hinders their progress, as they seek data that doesn’t have a pre-defined structure.
Executives are responsible for making business decisions, and often require information presented at different aggregates, with the ability to narrow in on data subsets as needed. However, they require information that is somewhat structured, and speed is critical for them.

If we look at these perspectives more broadly, they boil down to two key types of data usage, which are the exact needs that data warehouses and data lakes are designed for. Farmers and executives, who need timely and structured information, are largely served through data warehouses. Some of the main use cases for data warehouses include operational analytics, business intelligence and predictive analytics for data science and machine learning applications. Explorers are best served through data lakes, since the data formats are not predetermined. Data lakes are often most effective for use cases like a centralized data catalogue that lets organizations view all sources of data in a single place, or recent advances in AI application development that require the processing of unstructured data, such as text, images, video and audio.

Challenges and Convergence of Storage Formats

Despite the differences, most companies can use both data lakes and data warehouses for their various analytics needs. However, to capitalize on the benefits, it is also important to understand the challenges. In particular, data lakes present difficulties when it comes to parsing data. They were initially designed to capture data across the enterprise in its natural format, without enforcing schema, so that users could garner more insights, which is fundamental to creating a data-driven culture across the organization. However, without a predetermined use case, data lakes can quickly become data swamps. End users were historically unable to tell whether the data was stale, and ownership was not well established. This has been remedied by taking a use case-driven approach and only including data that has defined use cases or ownership.
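The use case-driven approach described above can be pictured as a simple admission check. The following is a minimal sketch in Python, not a real catalog API: the DatasetEntry record, the may_enter_lake and is_stale helpers, and the 30-day freshness window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one data set in the lake."""
    name: str
    owner: Optional[str] = None
    use_cases: List[str] = field(default_factory=list)
    last_updated: Optional[datetime] = None


def may_enter_lake(entry: DatasetEntry) -> bool:
    """Admit a data set only if it has an owner or a defined use case."""
    return bool(entry.owner) or bool(entry.use_cases)


def is_stale(entry: DatasetEntry, now: datetime,
             max_age: timedelta = timedelta(days=30)) -> bool:
    """Treat data as stale if it has no update time or is older than max_age.

    The 30-day default is an assumption, not a recommendation.
    """
    if entry.last_updated is None:
        return True
    return now - entry.last_updated > max_age


# Example records (made-up names and dates).
orders = DatasetEntry(
    name="orders",
    owner="sales-ops",
    use_cases=["weekly KPI report"],
    last_updated=datetime(2024, 1, 1),
)
misc_dump = DatasetEntry(name="misc_dump")
```

Here orders would be admitted because it has both an owner and a use case, while misc_dump would be rejected. Recording last_updated alongside the owner is what lets an end user answer the staleness question that early data lakes could not.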

Now that many of the issues surrounding data lakes have been resolved, the two storage formats must complement each other to meet the varying needs of the customer. This has contributed to blurred lines between data lakes and data warehouses. Data lakes are now capable of schema enforcement and answering rapid business intelligence queries, which were traditionally qualities of data warehouses. Data warehouses have separated compute and storage and can read directly from big data file systems, enabling users to read semi-structured data. As such, we are rapidly moving toward integrated data environments and the convergence of data lakes and data warehouses.
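To make "reading semi-structured data" concrete, here is a small Python sketch of the kind of work involved: turning a nested JSON record into a flat row of columns. The flatten helper, the dotted column naming and the sample record are assumptions made for illustration; real warehouses expose this capability through their own features, which are not shown here.

```python
import json
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested objects into one row, joining nested keys with dots.

    Lists and scalar values are kept as-is; only nested objects are expanded.
    """
    row: Dict[str, Any] = {}
    for key, value in obj.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{column}."))
        else:
            row[column] = value
    return row


# A made-up semi-structured record.
raw = '{"id": 7, "customer": {"segment": "retail", "region": "EU"}, "tags": ["new"]}'
row = flatten(json.loads(raw))
```

The nested customer object becomes the columns customer.segment and customer.region, which is the tabular shape that business intelligence queries expect.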

Supporting Data-Driven Initiatives

For initiatives like artificial intelligence and machine learning to succeed, data must be presented as an immutable entity that can be used for experimentation. With data lakes, it is crucial to separate the data into different zones and maintain a refined zone after transformation. Then, in the refined zone, companies can enforce schema and allow the schema to evolve to ensure the data is ready for machine learning and data science needs. Most importantly, they must catalog the data and enforce metadata management, data quality and governance. Data warehouses, on the other hand, excel in providing data sets ready for discovery and consumption. Companies should integrate these data sets with an interactive data catalog so they are discoverable – this is the most important step in making artificial intelligence and machine learning possible.
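As a rough illustration of enforcing schema in a refined zone while still letting it evolve, consider the Python sketch below. It assumes one possible policy, namely that evolution is additive only: new fields may be added, but the type of an existing field may not change. The conforms and evolve helpers and the field names are hypothetical and do not come from any particular data lake product.

```python
from typing import Any, Dict

Schema = Dict[str, type]


def conforms(record: Dict[str, Any], schema: Schema) -> bool:
    """Check that a record has exactly the schema's fields with matching types.

    A value of None is accepted as a null for any field.
    """
    if set(record) != set(schema):
        return False
    return all(
        record[name] is None or isinstance(record[name], expected)
        for name, expected in schema.items()
    )


def evolve(schema: Schema, additions: Schema) -> Schema:
    """Return a new schema with extra fields; refuse to retype existing ones."""
    for name, new_type in additions.items():
        if name in schema and schema[name] is not new_type:
            raise ValueError(f"cannot change type of existing field {name!r}")
    return {**schema, **additions}


# Hypothetical refined-zone schema and its next version.
schema_v1: Schema = {"customer_id": int, "segment": str}
schema_v2 = evolve(schema_v1, {"churn_score": float})

good = {"customer_id": 1, "segment": "retail"}
bad = {"customer_id": "1", "segment": "retail"}  # wrong type for customer_id
```

Under these assumptions, good passes the version 1 check and bad does not. Because evolve returns a new schema instead of modifying the old one, earlier versions stay available, which fits the idea of presenting data as an immutable entity for experimentation.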

If organizations can learn how to best capitalize on data warehouses and data lakes for their intended purposes, they will be well equipped to uncover key data-driven insights to guide their business strategy. This will enable the use of other advanced technologies and help transform them into data-driven companies.