    Data Warehouse and Data Lake
    DB/Nosql 2020. 7. 10. 12:12

    1. Overview

    1.1 Data Warehouse

    A data warehouse is a blend of technologies and components that enables the strategic use of data. It collects and manages data from varied sources to provide meaningful business insights.

    It is the electronic storage of a large amount of business information, designed for query and analysis rather than for transaction processing. In short, it is a process of transforming data into information.
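
    To make the "designed for query and analysis" and "transforming data into information" points concrete, here is a minimal schema-on-write sketch in Python. It uses the built-in sqlite3 module purely as a stand-in for a warehouse; the table, columns, and source records are invented for illustration.

```python
import sqlite3

# Schema-on-write: the table structure is defined before any data is loaded.
conn = sqlite3.connect(":memory:")  # stand-in for a warehouse database
conn.execute("""
    CREATE TABLE daily_sales (
        sale_date  TEXT,
        region     TEXT,
        revenue    REAL
    )
""")

# Hypothetical raw operational records, cleaned and transformed before loading (ETL).
raw_rows = [
    {"ts": "2020-07-10T09:15:00", "region": "EMEA ", "amount_cents": 129900},
    {"ts": "2020-07-10T11:42:00", "region": "apac",  "amount_cents": 45000},
]

def transform(row):
    # Reshape each record into exactly what the schema expects.
    return (row["ts"][:10], row["region"].strip().upper(), row["amount_cents"] / 100)

conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", map(transform, raw_rows))

# The warehouse side is optimized for queries and analysis, not for transactions.
for row in conn.execute("SELECT region, SUM(revenue) FROM daily_sales GROUP BY region"):
    print(row)
```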

    1.2 Data Lake

    A data lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limit on account or file size. Keeping data in quantity and in raw form improves analytic performance and allows native integration with other tools.

    A data lake is like a large container, much like a real lake fed by rivers. Just as a lake has multiple tributaries flowing in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in, often in real time.
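
    As a sketch of "every type of data in its native format", the snippet below drops structured, semi-structured, and unstructured objects into an S3 bucket used as a lake, via boto3. The bucket name, key layout, and local file name are hypothetical placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket acting as the lake

# Structured data: a CSV export, stored exactly as produced.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/crm/customers/2020-07-10.csv",
    Body=b"id,name,country\n1,Alice,DE\n2,Bob,KR\n",
)

# Semi-structured data: application logs as JSON lines.
events = [{"user": 42, "action": "login", "ts": "2020-07-10T12:12:00"}]
s3.put_object(
    Bucket=BUCKET,
    Key="raw/app-logs/2020/07/10/events.json",
    Body="\n".join(json.dumps(e) for e in events).encode(),
)

# Unstructured data: an image file (placeholder path), again stored untouched.
with open("product-123.jpg", "rb") as image:
    s3.put_object(Bucket=BUCKET, Key="raw/images/product-123.jpg", Body=image)
```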

    2. Difference between Data Warehouse and Data Lake

    For each attribute below, the data warehouse characteristic is listed first and the data lake characteristic second.

    · Schema: schema-on-write vs. schema-on-read (a code sketch follows this comparison)
    · Scale: scales to moderate-to-large volumes at moderate cost vs. scales to huge volumes at low cost
    · Access methods: accessed through standardized SQL and BI tools vs. accessed through SQL-like systems and programs created by developers, with support for big data analytics tools
    · Workload: supports batch processing as well as thousands of concurrent users performing interactive analytics vs. supports batch and stream processing, plus an improved capability over data warehouses to support big data inquiries from users
    · Data: cleansed vs. raw and refined
    · Data complexity: complex integrations vs. complex processing
    · Cost/efficiency: uses CPU/IO efficiently but with high storage and processing costs vs. uses storage and processing capabilities efficiently at a very low cost
    · Data purpose: determined vs. undetermined
    · Usage: business reporting vs. data analytics and exploration
    · Built for: the SQL Server ecosystem vs. the Hadoop ecosystem
    · Regulatory compliance: yes vs. no
    · Storage: expensive for large data volumes vs. designed for low-cost storage
    · Agility: less agile, with a fixed configuration vs. highly agile, configured and reconfigured as needed
    · Users: business professionals vs. data scientists et al.
    · Example: Panoply vs. Amazon Redshift Spectrum

    Benefits of a data warehouse:
    · Transform once, use many
    · Easy-to-consume data
    · Fast response times
    · Mature governance
    · Provides a single enterprise-wide view of data from multiple sources
    · Clean, safe, secure data
    · High concurrency
    · Operational integration

    Benefits of a data lake:
    · Transforms the economics of storing large amounts of data
    · Scales to execute on tens of thousands of servers
    · Allows use of any tool
    · Enables analysis to begin as soon as data arrives
    · Allows usage of structured and unstructured content from a single source
    · Supports Agile modeling by allowing users to change models, applications, and queries
    · Analytics and big data analytics

    Drawbacks of a data warehouse:
    · Time-consuming
    · Expensive
    · Difficult to conduct ad hoc and exploratory analytics
    · Only structured data

    Drawbacks of a data lake:
    · Complexity of the big data ecosystem
    · Lack of visibility if not managed and organized
    · Big data skills gap
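
    The schema row above is the key contrast, so here is a minimal schema-on-read sketch in plain Python: raw JSON lines sit in storage without any enforced structure, and a schema is only applied when the data is read for a particular analysis. The records and field names are made up for illustration.

```python
import json

# Raw, semi-structured records as they might sit in a data lake
# (nothing was validated or modelled when they were written).
raw_lines = [
    '{"user": 42, "action": "login", "ts": "2020-07-10T12:12:00"}',
    '{"user": 7, "action": "purchase", "amount": 19.9, "ts": "2020-07-10T12:30:00"}',
]

def read_with_schema(line, schema):
    # Schema-on-read: structure is imposed only at read/query time, so a record
    # missing a field yields None instead of being rejected at load time.
    record = json.loads(line)
    return {field: record.get(field) for field in schema}

# Different analyses can read the same raw data with different schemas.
purchase_schema = ("user", "action", "amount")
print([read_with_schema(line, purchase_schema) for line in raw_lines])
```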

    3. Example

    3.1 Data Lake: Amazon Redshift Spectrum

    This solution from Amazon extends the analytic capabilities of Redshift beyond the data stored on its local disks. It can query data in an Amazon S3 data lake directly, in data warehouse style, without having to load or transform it first. Redshift Spectrum optimizes queries on the fly and scales up processing transparently to return results quickly, regardless of the scale of data being processed.
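
    As a rough sketch of how this looks in practice (not taken from the article), the statements below register an external schema and an external table over files in S3 and then query them from Redshift, submitted here through boto3's Redshift Data API. The cluster name, database, user, IAM role, table definition, and S3 path are all placeholders.

```python
import boto3

# Placeholder identifiers, not from the article.
CLUSTER = "example-cluster"
DATABASE = "dev"
DB_USER = "analyst"
IAM_ROLE = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"

client = boto3.client("redshift-data")

def run(sql):
    # Submits a statement asynchronously; results are fetched later
    # with get_statement_result if needed.
    return client.execute_statement(
        ClusterIdentifier=CLUSTER, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )

# Register an external schema backed by the AWS Glue Data Catalog.
run(f"""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'spectrum_db'
    IAM_ROLE '{IAM_ROLE}'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# Define an external table over raw CSV files in S3; no data is loaded into Redshift.
run("""
    CREATE EXTERNAL TABLE spectrum.events (
        user_id BIGINT,
        action  VARCHAR(32),
        ts      TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://example-data-lake/raw/clickstream/'
""")

# Query the lake data warehouse-style, directly where it sits in S3.
run("SELECT action, COUNT(*) FROM spectrum.events GROUP BY action")
```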

    3.2 Data Warehouse: Panoply

    Panoply is a cloud-based data warehouse that integrates with S3 data lakes and many other data sources. Panoply is a pioneer of data warehouse automation, offering a self-optimizing architecture, which uses machine learning and natural language processing (NLP) to model the data journey from source to analysis.

    Panoply allows you to pull large volumes of data from a cloud-based data lake like S3 without having an ETL process in place. Once the data is in Panoply, it is automatically processed, prepared, and optimized for fast analysis, so you can immediately start running analytical queries.
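
    The "immediately start running analytical queries" step is just standard SQL against the warehouse. As an illustration only, the snippet below assumes the warehouse is reachable over a Postgres-compatible connection (as is common for cloud data warehouses); the host, credentials, and table name are placeholders and are not taken from Panoply's documentation.

```python
import psycopg2  # standard Postgres driver; assumes a Postgres-compatible endpoint

# Placeholder connection details, for illustration only.
conn = psycopg2.connect(
    host="db.example-warehouse-host.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="********",
)

# Once the lake data has been ingested and prepared, it is queried
# like any other warehouse table (hypothetical table name below).
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM s3_sales_events
        GROUP BY region
        ORDER BY total_revenue DESC
    """)
    for region, total in cur.fetchall():
        print(region, total)
```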

    4. Reference

    https://aws.amazon.com/big-data/what-is-hive/

    https://www.xenonstack.com/insights/what-is-a-modern-data-warehouse/

    https://www.guru99.com/data-lake-architecture.html

    https://aws.amazon.com/products/storage/data-lake-storage/

    https://panoply.io/data-warehouse-guide/data-warehouse-vs-data-lake/

    https://gr.pinterest.com/pin/734860864179519006/

    https://www.samsungsds.com/global/ko/support/insights/1209115_2284.html

    https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

    https://en.wikipedia.org/wiki/Data_lake
