Data Warehouse and Data Lake
2020. 7. 10. 12:12
1. Overview
1.1 Data Warehouse
A data warehouse is a blend of technologies and components that enables the strategic use of data. It collects and manages data from varied sources to provide meaningful business insights.
It is the electronic storage of a large amount of business information, designed for query and analysis rather than transaction processing. Building one is a process of transforming data into information.
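As a rough illustration of "designed for query and analysis", the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse engine. The `sales_fact` table, the two source record lists, and their fields are invented for this example; a real warehouse would use a dedicated engine and an ETL tool. The point is only that data from differently shaped sources is transformed into one fixed schema before anyone queries it.

```python
import sqlite3

# Hypothetical records from two source systems with different shapes.
web_orders = [{"region": "EU", "amount": "19.90"}, {"region": "US", "amount": "5.00"}]
store_orders = [("EU", 10.10), ("US", 7.50)]


def load_warehouse(conn):
    """Transform both sources into one fixed schema, then load them."""
    # The schema is defined before any data arrives.
    conn.execute(
        "CREATE TABLE sales_fact ("
        "source TEXT NOT NULL, region TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
    )
    # Amounts are normalized to integer cents so sums are exact.
    rows = [("web", o["region"], round(float(o["amount"]) * 100)) for o in web_orders]
    rows += [("store", region, round(amount * 100)) for region, amount in store_orders]
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)


def revenue_by_region(conn):
    """An analytical query: total revenue in cents per region."""
    cur = conn.execute(
        "SELECT region, SUM(amount_cents) FROM sales_fact "
        "GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
load_warehouse(conn)
print(revenue_by_region(conn))
```

Once loaded, every consumer sees the same cleaned columns, which is why reporting queries stay simple.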
1.2 Data Lake
A data lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It stores every type of data in its native format, with no fixed limits on account size or file size. It makes a large quantity of data available, which is meant to improve analytic performance and native integration.
A data lake is like a large container, much like a real lake fed by rivers. Just as a lake has multiple tributaries flowing in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in, in real time.
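Storing data in its native format means structure is applied only when someone reads it. Here is a minimal sketch of that idea, using an in-memory list of raw lines in place of real object storage. The record shapes and the `read_temperatures` helper are made up for illustration.

```python
import json

# Raw lines kept exactly as they arrived: JSON events, a CSV-style line, free text.
raw_lake = [
    '{"sensor": "a1", "temp_c": 21.5}',
    '{"sensor": "b2", "temp_c": 19.0, "battery": 0.8}',
    "a1,2020-07-10,22.5",
    "operator note: sensor b2 replaced",
]


def read_temperatures(lines):
    """Apply a schema at read time: yield (sensor, temp_c), skip lines that do not fit."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Not JSON: try the assumed "sensor,date,temp" CSV layout instead.
            parts = line.split(",")
            if len(parts) == 3:
                try:
                    yield parts[0], float(parts[2])
                except ValueError:
                    pass
            continue
        if isinstance(record, dict) and "sensor" in record and "temp_c" in record:
            yield record["sensor"], float(record["temp_c"])


print(list(read_temperatures(raw_lake)))
```

Nothing is rejected at write time; a different reader could later extract battery levels or operator notes from the same raw lines.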
2. Difference between Data Warehouse and Data Lake
Attribute | Data Warehouse | Data Lake
Schema | Schema-on-write | Schema-on-read
Scale | Scales to moderate to large volumes at moderate cost | Scales to huge volumes at low cost
Access Methods | Accessed through standardized SQL and BI tools | Accessed through SQL-like systems and programs created by developers; also supports big data analytics tools
Workload | Supports batch processing as well as thousands of concurrent users performing interactive analytics | Supports batch and stream processing, plus an improved capability over data warehouses to support big data inquiries from users
Data | Cleansed | Raw and refined
Data Complexity | Complex integrations | Complex processing
Cost/Efficiency | Efficiently uses CPU/IO but has high storage and processing costs | Efficiently uses storage and processing capabilities at a very low cost
Data Purpose | Determined | Undetermined
Usage | Business reporting | Data analytics and exploration
Built for | SQL Server ecosystem | Hadoop ecosystem
Regulatory compliance | Yes | No
Storage | Expensive for large data volumes | Designed for low-cost storage
Agility | Less agile, fixed configuration | Highly agile, configure and reconfigure as needed
Users | Business professionals | Data scientists et al.
Example | Panoply | Amazon Redshift Spectrum

Benefits (Data Warehouse)
· Transform once, use many
· Easy to consume data
· Fast response times
· Mature governance
· Provides a single enterprise-wide view of data from multiple sources
· Clean, safe, secure data
· High concurrency
· Operational integration

Benefits (Data Lake)
· Transforms the economics of storing large amounts of data
· Easy to consume data
· Fast response times
· Mature governance
· Provides a single enterprise-wide view of data
· Scales to execute on tens of thousands of servers
· Allows use of any tool
· Enables analysis to begin as soon as data arrives
· Allows usage of structured and unstructured content from a single source
· Supports Agile modeling by allowing users to change models, applications, and queries
· Analytics and big data analytics

Drawbacks (Data Warehouse)
· Time-consuming
· Expensive
· Difficult to conduct ad hoc and exploratory analytics
· Only structured data

Drawbacks (Data Lake)
· Complexity of big data ecosystem
· Lack of visibility if not managed and organized
· Big data skills gap

3. Example
3.1 Data Lake: Amazon Redshift Spectrum
This solution from Amazon extends the analytic capabilities of Redshift beyond the data stored on its local disks. It can directly query structured and semi-structured data in an Amazon S3 data lake, data warehouse-style, without having to load or transform it first. Redshift Spectrum optimizes queries on the fly and scales processing transparently so that results return quickly regardless of the scale of data being processed.
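To make "query without loading" more concrete, the sketch below only builds the SQL text that a Redshift Spectrum setup typically involves: an external schema, an external table pointing at an S3 prefix, and an ordinary SELECT against it. The schema, table, column, bucket, and IAM role names are all placeholders, and the Parquet file format is an assumption. Running the statements requires a real Redshift cluster and a database client, neither of which is shown here.

```python
def external_schema_sql(schema, catalog_db, iam_role_arn):
    """SQL that registers an external schema backed by a data catalog database."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(schema, table, columns, s3_location):
    """SQL that describes files under an S3 prefix as a table; no data is copied."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {cols}\n)\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


# All identifiers below are hypothetical placeholders.
schema_ddl = external_schema_sql(
    "lake", "lake_db", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
)
table_ddl = external_table_sql(
    "lake",
    "sales",
    [("sale_id", "INTEGER"), ("region", "VARCHAR(16)"), ("amount", "DECIMAL(10,2)")],
    "s3://example-bucket/sales/",
)
# The external table is then queried like any other table.
query = "SELECT region, SUM(amount) FROM lake.sales GROUP BY region;"

for statement in (schema_ddl, table_ddl, query):
    print(statement, end="\n\n")
```

The helpers interpolate identifiers directly into the SQL, so they are only suitable for trusted, hard-coded names like the ones in this sketch.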
3.2 Data Warehouse: Panoply
Panoply is a cloud-based data warehouse that integrates with S3 data lakes and many other data sources. Panoply is a pioneer of data warehouse automation, offering a self-optimizing architecture, which uses machine learning and natural language processing (NLP) to model the data journey from source to analysis.
Panoply allows you to pull large volumes of data from a cloud-based data lake like S3 without having an ETL process in place. Once the data is in Panoply, it is automatically treated, prepared, and optimized for fast analysis, so you can immediately start running analytical queries.
4. Reference
https://aws.amazon.com/big-data/what-is-hive/
https://www.xenonstack.com/insights/what-is-a-modern-data-warehouse/
https://aws.amazon.com/products/storage/data-lake-storage/
https://panoply.io/data-warehouse-guide/data-warehouse-vs-data-lake/
https://gr.pinterest.com/pin/734860864179519006/
https://www.samsungsds.com/global/ko/support/insights/1209115_2284.html
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/