The trade-off between cost and complexity across the different architectures

To Nha Notes | July 1, 2022, 8:58 p.m.

Each architecture is rated on five criteria, in this order: total cost of solution, flexibility of scenarios, complexity of development, maturity of ecosystem, and organizational maturity required.

Cloud data warehouse

Total cost of solution: High - cloud data warehouses rely on proprietary data formats and offer an end-to-end solution as a single package, so the cost is high.

Flexibility of scenarios: Low - cloud data warehouses are optimized for BI/SQL-based scenarios; there is some support for data science and exploratory scenarios, but it is restricted by format constraints.

Complexity of development: Low - there are fewer moving parts, and you can get started almost immediately with an end-to-end solution.

Maturity of ecosystem: High for SQL/BI scenarios, Low for other scenarios.

Organizational maturity required: Low - the tools and ecosystem are largely well understood and ready to be consumed by organizations of any shape or size.

Modern data warehouse

Total cost of solution: Medium - data preparation and historical data can be moved to the data lake at lower cost, but you still need a cloud data warehouse, which is expensive.

Flexibility of scenarios: Medium - the data lake supports a diverse ecosystem of tools and more exploratory scenarios, but correlating data in the warehouse with data in the data lake requires data copies.

Complexity of development: Medium - the data engineering team needs to ensure that the data lake design is efficient and scalable; plenty of guidance and considerations are available, including this book.

Maturity of ecosystem: Medium - the data preparation and data engineering ecosystem, such as Spark/Hadoop, has a higher maturity but needs tuning for performance and scale. High for consumption via the data warehouse.

Organizational maturity required: Medium - the data platform team needs to be skilled up to understand the needs of the organization and to make design choices that, at a minimum, support those needs.

Data lakehouse

Total cost of solution: Low - the data lake storage acts as the unified repository with no data movement required, and compute engines are largely stateless and can be spun up and down on demand.

Flexibility of scenarios: High - a diverse ecosystem lets you run more scenarios, including exploratory analysis such as data science, and makes it easy to share data between BI and data science teams.

Complexity of development: Medium to High - supporting the lakehouse architecture requires a careful choice of the right datasets and of the open data format.

Maturity of ecosystem: Medium to High - some technologies, such as Delta Lake, have high maturity and adoption, while others, such as Apache Iceberg, are gaining strong adoption; thoughtful design is required.

Organizational maturity required: Medium to High - the data platform team needs to be skilled up to understand the needs of the organization and the technology choices that are still new.

Data mesh

Total cost of solution: Medium - the distributed design keeps cost lower, but a lot of investment is required in automation, blueprint, and data governance solutions.

Flexibility of scenarios: High - different architectures and solutions can be supported in the same organization, with no bottleneck on a lean central organization.

Complexity of development: High - this relies on an end-to-end automated solution and an architecture that scales to 10x growth and supports sharing across architectures and cloud solutions.

Maturity of ecosystem: Low - guidance and available toolsets are relatively nascent.

Organizational maturity required: High - the data platform team and the product/domain teams need to be skilled up in data lakes.
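
One way to put this comparison to work is to encode the ratings as data and filter on the constraints that matter to your organization. The following is a minimal sketch, not a definitive selection method. The names LEVELS, Rating, RATINGS, and shortlist are hypothetical, the numeric scale is an assumption made only so ratings can be compared, and the ecosystem maturity criterion is left out because two architectures carry split ratings for it.

```python
from dataclasses import dataclass

# Assumed ordinal scale so that ratings can be compared; the numbers are
# illustrative and do not come from the comparison itself.
LEVELS = {"Low": 1.0, "Medium": 2.0, "Medium to High": 2.5, "High": 3.0}


@dataclass(frozen=True)
class Rating:
    cost: str          # total cost of solution
    flexibility: str   # flexibility of scenarios
    complexity: str    # complexity of development
    org_maturity: str  # organizational maturity required


# Ratings copied from the comparison. Maturity of ecosystem is omitted
# because it is split (for example, High for SQL/BI and Low otherwise).
RATINGS = {
    "Cloud data warehouse": Rating("High", "Low", "Low", "Low"),
    "Modern data warehouse": Rating("Medium", "Medium", "Medium", "Medium"),
    "Data lakehouse": Rating("Low", "High", "Medium to High", "Medium to High"),
    "Data mesh": Rating("Medium", "High", "High", "High"),
}


def shortlist(max_cost="High", min_flexibility="Low", max_complexity="High"):
    """Return the architectures that satisfy all three constraints."""
    return [
        name
        for name, r in RATINGS.items()
        if LEVELS[r.cost] <= LEVELS[max_cost]
        and LEVELS[r.flexibility] >= LEVELS[min_flexibility]
        and LEVELS[r.complexity] <= LEVELS[max_complexity]
    ]


if __name__ == "__main__":
    # Architectures with at most Medium cost and High flexibility.
    print(shortlist(max_cost="Medium", min_flexibility="High"))
```

A filter like this only narrows the field; the reasoning behind each rating still has to be weighed against your own scenarios.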