    Data Warehouse and Data Lake
    DB/Nosql 2020. 7. 10. 12:12

    1. Overview

    1.1 Data Warehouse

    A data warehouse is a blend of technologies and components that enables the strategic use of data. It collects and manages data from varied sources to provide meaningful business insights.

    It is the electronic storage of a large amount of business information, designed for query and analysis rather than for transaction processing. In short, it is a process of transforming data into information.
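
    To make the "designed for query and analysis" and "transforming data into information" points concrete, here is a minimal schema-on-write sketch in Python. It uses the built-in sqlite3 module purely as a stand-in for a warehouse; the table, columns, and source records are invented for illustration.

```python
import sqlite3

# Schema-on-write: the table structure is defined before any data is loaded.
conn = sqlite3.connect(":memory:")  # stand-in for a warehouse database
conn.execute("""
    CREATE TABLE daily_sales (
        sale_date  TEXT,
        region     TEXT,
        revenue    REAL
    )
""")

# Hypothetical raw operational records, cleaned and transformed before loading (ETL).
raw_rows = [
    {"ts": "2020-07-10T09:15:00", "region": "EMEA ", "amount_cents": 129900},
    {"ts": "2020-07-10T11:42:00", "region": "apac",  "amount_cents": 45000},
]

def transform(row):
    # Reshape each record into exactly what the schema expects.
    return (row["ts"][:10], row["region"].strip().upper(), row["amount_cents"] / 100)

conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", map(transform, raw_rows))

# The warehouse side is optimized for queries and analysis, not for transactions.
for row in conn.execute("SELECT region, SUM(revenue) FROM daily_sales GROUP BY region"):
    print(row)
```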

    1.2 Data Lake

    A data lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It is a place to store every type of data in its native format, with no fixed limit on account or file size. Keeping data in quantity and in raw form improves analytic performance and allows native integration with other tools.

    A data lake is like a large container, much like a real lake fed by rivers. Just as a lake has multiple tributaries flowing in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in, often in real time.
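
    As a sketch of "every type of data in its native format", the snippet below drops structured, semi-structured, and unstructured objects into an S3 bucket used as a lake, via boto3. The bucket name, key layout, and local file name are hypothetical placeholders.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket acting as the lake

# Structured data: a CSV export, stored exactly as produced.
s3.put_object(
    Bucket=BUCKET,
    Key="raw/crm/customers/2020-07-10.csv",
    Body=b"id,name,country\n1,Alice,DE\n2,Bob,KR\n",
)

# Semi-structured data: application logs as JSON lines.
events = [{"user": 42, "action": "login", "ts": "2020-07-10T12:12:00"}]
s3.put_object(
    Bucket=BUCKET,
    Key="raw/app-logs/2020/07/10/events.json",
    Body="\n".join(json.dumps(e) for e in events).encode(),
)

# Unstructured data: an image file (placeholder path), again stored untouched.
with open("product-123.jpg", "rb") as image:
    s3.put_object(Bucket=BUCKET, Key="raw/images/product-123.jpg", Body=image)
```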

    2. Difference between Data Warehouse and Data Lake

    For each attribute below, the data warehouse characteristic is listed first and the data lake characteristic second.

    · Schema: schema-on-write vs. schema-on-read (a code sketch follows this comparison)
    · Scale: scales to moderate-to-large volumes at moderate cost vs. scales to huge volumes at low cost
    · Access methods: accessed through standardized SQL and BI tools vs. accessed through SQL-like systems and programs created by developers, with support for big data analytics tools
    · Workload: supports batch processing as well as thousands of concurrent users performing interactive analytics vs. supports batch and stream processing, plus an improved capability over data warehouses to support big data inquiries from users
    · Data: cleansed vs. raw and refined
    · Data complexity: complex integrations vs. complex processing
    · Cost/efficiency: uses CPU/IO efficiently but with high storage and processing costs vs. uses storage and processing capabilities efficiently at a very low cost
    · Data purpose: determined vs. undetermined
    · Usage: business reporting vs. data analytics and exploration
    · Built for: the SQL Server ecosystem vs. the Hadoop ecosystem
    · Regulatory compliance: yes vs. no
    · Storage: expensive for large data volumes vs. designed for low-cost storage
    · Agility: less agile, with a fixed configuration vs. highly agile, configured and reconfigured as needed
    · Users: business professionals vs. data scientists et al.
    · Example: Panoply vs. Amazon Redshift Spectrum

    Benefits of a data warehouse:
    · Transform once, use many
    · Easy-to-consume data
    · Fast response times
    · Mature governance
    · Provides a single enterprise-wide view of data from multiple sources
    · Clean, safe, secure data
    · High concurrency
    · Operational integration

    Benefits of a data lake:
    · Transforms the economics of storing large amounts of data
    · Scales to execute on tens of thousands of servers
    · Allows use of any tool
    · Enables analysis to begin as soon as data arrives
    · Allows usage of structured and unstructured content from a single source
    · Supports Agile modeling by allowing users to change models, applications, and queries
    · Analytics and big data analytics

    Drawbacks of a data warehouse:
    · Time-consuming
    · Expensive
    · Difficult to conduct ad hoc and exploratory analytics
    · Only structured data

    Drawbacks of a data lake:
    · Complexity of the big data ecosystem
    · Lack of visibility if not managed and organized
    · Big data skills gap
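
    The schema row above is the key contrast, so here is a minimal schema-on-read sketch in plain Python: raw JSON lines sit in storage without any enforced structure, and a schema is only applied when the data is read for a particular analysis. The records and field names are made up for illustration.

```python
import json

# Raw, semi-structured records as they might sit in a data lake
# (nothing was validated or modelled when they were written).
raw_lines = [
    '{"user": 42, "action": "login", "ts": "2020-07-10T12:12:00"}',
    '{"user": 7, "action": "purchase", "amount": 19.9, "ts": "2020-07-10T12:30:00"}',
]

def read_with_schema(line, schema):
    # Schema-on-read: structure is imposed only at read/query time, so a record
    # missing a field yields None instead of being rejected at load time.
    record = json.loads(line)
    return {field: record.get(field) for field in schema}

# Different analyses can read the same raw data with different schemas.
purchase_schema = ("user", "action", "amount")
print([read_with_schema(line, purchase_schema) for line in raw_lines])
```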

    3. Example

    3.1 Data Lake: Amazon Redshift Spectrum

    This solution from Amazon extends the analytic capabilities of Redshift beyond the data stored on its local disks. It can query data in an Amazon S3 data lake directly, in data warehouse style, without having to load or transform it first. Redshift Spectrum optimizes queries on the fly and scales up processing transparently to return results quickly, regardless of the scale of data being processed.
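
    As a rough sketch of how this looks in practice (not taken from the article), the statements below register an external schema and an external table over files in S3 and then query them from Redshift, submitted here through boto3's Redshift Data API. The cluster name, database, user, IAM role, table definition, and S3 path are all placeholders.

```python
import boto3

# Placeholder identifiers, not from the article.
CLUSTER = "example-cluster"
DATABASE = "dev"
DB_USER = "analyst"
IAM_ROLE = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"

client = boto3.client("redshift-data")

def run(sql):
    # Submits a statement asynchronously; results are fetched later
    # with get_statement_result if needed.
    return client.execute_statement(
        ClusterIdentifier=CLUSTER, Database=DATABASE, DbUser=DB_USER, Sql=sql
    )

# Register an external schema backed by the AWS Glue Data Catalog.
run(f"""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'spectrum_db'
    IAM_ROLE '{IAM_ROLE}'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
""")

# Define an external table over raw CSV files in S3; no data is loaded into Redshift.
run("""
    CREATE EXTERNAL TABLE spectrum.events (
        user_id BIGINT,
        action  VARCHAR(32),
        ts      TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 's3://example-data-lake/raw/clickstream/'
""")

# Query the lake data warehouse-style, directly where it sits in S3.
run("SELECT action, COUNT(*) FROM spectrum.events GROUP BY action")
```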

    3.2 Data Warehouse: Panoply

    Panoply is a cloud-based data warehouse that integrates with S3 data lakes and many other data sources. Panoply is a pioneer of data warehouse automation, offering a self-optimizing architecture, which uses machine learning and natural language processing (NLP) to model the data journey from source to analysis.

    Panoply allows you to pull large volumes of data from a cloud-based data lake like S3 without having an ETL process in place. Once the data is in Panoply, it is automatically processed, prepared, and optimized for fast analysis, so you can immediately start running analytical queries.
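
    The "immediately start running analytical queries" step is just standard SQL against the warehouse. As an illustration only, the snippet below assumes the warehouse is reachable over a Postgres-compatible connection (as is common for cloud data warehouses); the host, credentials, and table name are placeholders and are not taken from Panoply's documentation.

```python
import psycopg2  # standard Postgres driver; assumes a Postgres-compatible endpoint

# Placeholder connection details, for illustration only.
conn = psycopg2.connect(
    host="db.example-warehouse-host.com",
    port=5439,
    dbname="analytics",
    user="analyst",
    password="********",
)

# Once the lake data has been ingested and prepared, it is queried
# like any other warehouse table (hypothetical table name below).
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM s3_sales_events
        GROUP BY region
        ORDER BY total_revenue DESC
    """)
    for region, total in cur.fetchall():
        print(region, total)
```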

    4. Reference

    https://aws.amazon.com/big-data/what-is-hive/

    https://www.xenonstack.com/insights/what-is-a-modern-data-warehouse/

    https://www.guru99.com/data-lake-architecture.html

    https://aws.amazon.com/products/storage/data-lake-storage/

    https://panoply.io/data-warehouse-guide/data-warehouse-vs-data-lake/

    https://gr.pinterest.com/pin/734860864179519006/

    https://www.samsungsds.com/global/ko/support/insights/1209115_2284.html

    https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

    https://en.wikipedia.org/wiki/Data_lake
