To Nha Notes | July 1, 2022, 8:58 p.m.
High - because cloud data warehouses rely on proprietary data formats and come bundled as an end-to-end solution, the cost is high
Low - cloud data warehouses are optimized for BI/SQL-based scenarios; there is some support for data science and exploratory scenarios, but it is restricted by format constraints
Low - there are fewer moving parts, and you can get started almost immediately with an end-to-end solution
High - for SQL/BI scenarios, Low - for other scenarios
Low - the tools and ecosystem are largely well understood and ready to be consumed by organizations of any shape/size.
Medium - data preparation and historical data can be moved to the data lake at lower cost, but a cloud data warehouse, which is expensive, is still needed
Medium - the data lake supports a diverse ecosystem of tools and more exploratory scenarios, but correlating data in the warehouse with data in the data lake requires data copies
Medium - the data engineering team needs to ensure that the data lake design is efficient and scalable; plenty of guidance and considerations are available, including this book
Medium - the data preparation and data engineering ecosystem, such as Spark/Hadoop, has a higher maturity, but tuning for performance and scale is needed; High - for consumption via the data warehouse
Medium - the data platform team needs to be skilled up to understand the needs of the organization and to make design choices that, at the least, support those needs
Low - data lake storage acts as the unified repository with no data movement required, and compute engines are largely stateless and can be spun up and down on demand
High - flexibility to run more scenarios with a diverse ecosystem, enabling more exploratory analysis such as data science, and easier sharing of data between BI and data science teams
Medium to High - requires a careful choice of the right datasets and of the open data format needed to support the lakehouse architecture
Medium to High - some technologies such as Delta Lake have high maturity and adoption, while others such as Apache Iceberg are gaining strong adoption; thoughtful design is required
Medium to High - the data platform team needs to be skilled up to understand the needs of the organization and the technology choices that are still new
Medium - while the distributed design keeps cost lower, a lot of investment is required in automation, blueprints, and data governance solutions
High - flexibility to support different architectures and solutions in the same organization, with no bottleneck on a lean central organization
High - this relies on an end-to-end automated solution and an architecture that scales to 10x growth and supports sharing across architectures and cloud solutions
Low - relatively nascent in guidance and available toolsets
High - data platform team and product/domain teams need to be skilled up in data lakes.