Chapter 7. Approaches to Design

This chapter explores design approaches to handling and organizing data and how these methods help build powerful, adaptable, and reliable systems for data management. In simple terms, we’ll learn about the strategies that help us decide how data should be stored, processed, and accessed to best enhance the speed and dependability of our data systems.

To clarify, let’s distinguish between data design and data modeling, which is the subject of Chapter 8. Think of data design as like building a city. It’s about deciding where the buildings go, which roads connect different parts of the city, and how traffic flows. On the other hand, data modeling is more like designing individual buildings—it’s about arranging rooms, deciding how they connect, and defining the purpose of each room.

In this chapter, we’ll look at different types of data designs. We’ll compare and contrast methods like OLTP and OLAP, which are essentially different ways of processing and analyzing data. We’ll also explore concepts like SMP and MPP, which are strategies to process data more efficiently. Then, we’ll learn about Lambda and Kappa architectures, which are blueprints for handling large amounts of data in real time. Lastly, we’ll talk about an approach called polyglot persistence, which allows us to use different types of data storage technologies within the same application.

My aim here isn’t to push one approach as the best solution for every situation but to help you understand the strengths and weaknesses of each. This way, you can choose or combine the right methods based on your specific needs. By the end of this chapter, you’ll have a stronger grasp of how to create efficient data systems that can adapt to changing needs and technological advancements.

Online Transaction Processing Versus Online Analytical Processing

Online transaction processing (OLTP) is a type of informational system or application that processes online create, read, update, and delete (CRUD) transactions (see Chapter 2) in a real-time environment. OLTP systems are designed to support high levels of concurrency, which means they can handle a large number of transactions at the same time. They typically use a relational model (see Chapter 8) and are optimized for low latency, which means they can process transactions very quickly. Examples include point-of-sale applications, ecommerce websites, and online banking solutions.

In an OLTP system, transactions are typically processed quickly and in a very specific way. For example, a customer’s purchase at a store would be considered a transaction. The transaction would involve the customer’s account being debited for the purchase amount, the store’s inventory count being reduced, and the store’s financial records being updated to reflect the sale. OLTP systems use a variety of DBMSs to store and manage the data, such as Microsoft SQL Server and Oracle.
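The store purchase described above can be sketched as a single atomic transaction. This is a minimal illustration using Python's built-in sqlite3 module rather than SQL Server or Oracle; the table names, column names, and the choice to store amounts in cents are assumptions made for the example.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory store database (hypothetical schema, amounts in cents)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE accounts (customer_id INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, price INTEGER, stock INTEGER);
        CREATE TABLE sales (customer_id INTEGER, item_id INTEGER, quantity INTEGER, total INTEGER);
        INSERT INTO accounts VALUES (1, 10000);
        INSERT INTO inventory VALUES (7, 2500, 3);
        """
    )
    return conn


def record_purchase(conn, customer_id, item_id, quantity):
    """Debit the customer, reduce inventory, and record the sale as one unit of work."""
    row = conn.execute(
        "SELECT price FROM inventory WHERE item_id = ?", (item_id,)
    ).fetchone()
    if row is None:
        raise ValueError("unknown item")
    total = row[0] * quantity

    # "with conn" commits if the block succeeds and rolls back if it raises,
    # so the three changes below are applied together or not at all.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE customer_id = ?",
            (total, customer_id),
        )
        updated = conn.execute(
            "UPDATE inventory SET stock = stock - ? WHERE item_id = ? AND stock >= ?",
            (quantity, item_id, quantity),
        )
        if updated.rowcount != 1:
            # Raising here also undoes the debit made a moment ago.
            raise ValueError("insufficient stock")
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?, ?)",
            (customer_id, item_id, quantity, total),
        )
    return total
```

Calling record_purchase(conn, 1, 7, 2) on the demo database applies all three changes together. Asking for more units than are in stock raises an error and leaves the account balance untouched, which is the all-or-nothing behavior an OLTP system depends on.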

OLTP systems are often contrasted with online analytical processing (OLAP) systems, which are used for data analysis and reporting to support business intelligence and decision making. OLAP systems are optimized for fast query performance, allowing end users to easily and quickly analyze data from multiple perspectives by slicing and dicing the data in reports and dashboards much faster than with an OLTP system. Think of an OLAP system as “write-once, read-many” (as opposed to CRUD). Often, multiple OLTP databases are used as the sources that feed into a data warehouse, which then feeds into an OLAP database, as shown in Figure 7-1. (However, sometimes you can also build an OLAP database directly from OLTP sources.)

Figure 7-1. OLAP database architecture

An OLAP database is typically composed of one or more OLAP cubes. In an OLAP cube, data is pre-aggregated, meaning that it has already been summarized and grouped by certain dimensions, so end users can quickly access it from multiple dimensions and levels of detail without having to wait for long-running queries to complete. Creating an OLAP cube generally involves using a multidimensional model, which uses a star or snowflake schema to represent the data. (Chapter 8 will discuss these schemas in more detail.)

For example, a retail corporation might use an OLAP database with OLAP cubes to quickly analyze sales data. This pre-aggregated data is organized by time, allowing for the analysis of sales trends, whether yearly or daily or anywhere in between. The data is also sorted by location, which facilitates geographic comparisons to look at, for example, product sales in different cities. Furthermore, the data is categorized by product, making it easier to track the performance of individual items. If a regional manager needs to quickly review how a particular product category performed in their region during the last holiday season, the OLAP cube has already prepared the data. Instead of waiting for a lengthy query to sort through each individual sale, the manager finds the relevant data—sorted by product, region, and time—is readily available, accelerating the decision-making process.
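To make the idea of pre-aggregation concrete, here is a small sketch that builds a toy "cube" from a handful of sales rows. It is not how any particular OLAP product stores its data; the dimension names, the sample rows, and the build_cube and query helpers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Dimensions the toy cube is grouped by (assumed for this example).
DIMENSIONS = ("year", "city", "product")

SALES = [
    {"year": 2023, "city": "Austin", "product": "shoes", "amount": 120},
    {"year": 2023, "city": "Austin", "product": "socks", "amount": 15},
    {"year": 2023, "city": "Boston", "product": "shoes", "amount": 80},
    {"year": 2024, "city": "Austin", "product": "shoes", "amount": 100},
]


def build_cube(sales):
    """Pre-compute the total amount for every combination of dimension values."""
    cube = defaultdict(int)
    for sale in sales:
        # Every subset of the dimensions gets its own running total,
        # from the grand total (no dimensions) down to the finest grain.
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((dim, sale[dim]) for dim in dims)
                cube[key] += sale["amount"]
    return dict(cube)


def query(cube, **filters):
    """Answer a question with one dictionary lookup instead of scanning every sale."""
    key = tuple((dim, filters[dim]) for dim in DIMENSIONS if dim in filters)
    return cube.get(key, 0)


CUBE = build_cube(SALES)
print(query(CUBE, product="shoes"))             # total shoe sales across all years and cities
print(query(CUBE, year=2023, city="Austin"))    # everything sold in Austin in 2023
```

The work of summarizing happens once, when the cube is built. Each later question, such as the regional manager's, is a lookup against totals that already exist.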

You can use another, more recent model, called a tabular data model, that gives query performance similar to that of an OLAP cube. Instead of a multidimensional model, a tabular data model uses a tabular structure, similar to a relational database table, to represent the data. This makes it more flexible and simpler than a multidimensional model. This book refers to the multidimensional model as an “OLAP cube” and the tabular model as a “tabular cube,” but please note that the term cube is used interchangeably in the industry to denote either of these models.

Table 7-1 compares the OLTP and OLAP design styles.

Table 7-1. Comparison of OLTP and OLAP systems
| | OLTP | OLAP/Tabular |
|---|---|---|
| Processing type | Transactional | Analytical |
| Data nature | Operational data | Consolidation data |
| Orientation | Application oriented | Subject oriented |
| Purpose | Works on ongoing business tasks | Helps in decision making |
| Transaction frequency | Frequent transactions | Occasional transactions |
| Operation types | Short online transactions: Insert, Update, Delete | Lots of read scans |
| Data design | Normalized database (3NF) | Denormalized database |
| Common usage | Used in retail sales and other financial transactions systems | Often used in data mining, sales, and marketing |
| Response time | Response time is instant | Response time varies from seconds to hours |
| Query complexity | Simple and instant queries | Complex queries |
| Usage pattern | Repetitive usage | Ad hoc usage |
| Transaction nature | Short, simple transactions | Complex queries |
| Database size | Gigabyte database size | Terabyte database size |

OLAP databases and data warehouses (see Chapter 4) are related but distinct concepts and are often used in conjunction. Through a multidimensional or tabular model, OLAP databases provide a way to analyze the data stored in the DW in a way that is more flexible and interactive than executing traditional SQL-based queries on a DW that contains large amounts of data. Multidimensional models and tabular models are often considered semantic layers or models, which means that they provide a level of abstraction over the schemas in the DW.

Operational and Analytical Data

Operational data is real-time data used to manage day-to-day operations and processes. It is captured, stored, and processed by OLTP systems. You can use it to take a “snapshot” of the current state of the business to ensure that operations are running smoothly and efficiently. Operational data tends to be high volume and helps in making decisions quickly.

Analytical data comes from collecting and transforming operational data. It is a historical view of the data maintained and used by OLAP/Tabular systems and DWs. With data analytics tools, you can perform analysis to understand trends, patterns, and relationships over time. Analytical data is optimized for creating reports and visualizations and for training machine learning models. It usually provides a view of data over a longer period than you get with operational data, often has a lower volume, and is generally consolidated and aggregated. Data is usually ingested in batches and requires more processing time than operational data.

In summary, operational data is used to monitor and control business processes in real time, while analytical data is used to gain insights and inform decision making over a longer period. Both types of data are essential for effective business management, and they complement each other to provide a complete view of an organization’s operations and performance.

Think of OLTP as the technology used to implement operational data, and of OLAP/Tabular and DWs as the technology used to implement analytical data.

Symmetric Multiprocessing and Massively Parallel Processing

Some of the first relational databases used a symmetric multiprocessing (SMP) design, where computer processing is done by multiple processors that share disk and memory, all in one server—think SQL Server and Oracle. (This is pictured on the left side of Figure 7-2.) To get more processing power for these systems, you “scale up” by increasing the processors and memory in the server. This works well for OLTP databases but not so well for the write-once, read-many environment of a DW.

As data warehouses grew in popularity in the 1990s and started to ingest huge amounts of data, performance became a big problem. To help with that, along came a new kind of database design. In a massively parallel processing (MPP) design, the database has multiple servers, each with multiple processors, and (unlike in SMP) each processor has its own memory and its own disk. This allows you to “scale out” (rather than up) by adding more servers.

MPP servers distribute a portion of the data from the database to the disk on each server (whereas SMP databases keep all the data on one disk). Queries are then sent to a control node (also called a name node) that splits each query into multiple subqueries that are sent to each server (called a compute node or worker node), as shown on the right in Figure 7-2. There, the subquery is executed and the results from each compute node are sent back to the control node, mashed together, and sent back to the user. This is how solutions such as Teradata and Netezza work.
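The scatter-and-gather flow described above can be sketched in a few lines. In this illustration, threads stand in for compute nodes and in-memory lists stand in for each node's local disk; the round-robin distribution, the function names, and the sample rows are assumptions for the example, not the behavior of Teradata or Netezza.

```python
from concurrent.futures import ThreadPoolExecutor

# Sample rows: even amounts belong to Austin, odd amounts to Boston.
ROWS = [
    {"city": "Austin" if i % 2 == 0 else "Boston", "amount": i}
    for i in range(1, 11)
]


def distribute(rows, node_count):
    """Spread rows across compute nodes (simple round-robin, assumed here)."""
    partitions = [[] for _ in range(node_count)]
    for index, row in enumerate(rows):
        partitions[index % node_count].append(row)
    return partitions


def subquery(partition, city):
    """Runs on one compute node and sees only that node's slice of the data."""
    amounts = [row["amount"] for row in partition if row["city"] == city]
    return sum(amounts), len(amounts)


def control_node_average(partitions, city):
    """Control node: send the subquery to every node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda part: subquery(part, city), partitions))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else None


PARTITIONS = distribute(ROWS, node_count=4)
print(control_node_average(PARTITIONS, "Austin"))
```

Each node returns a partial sum and a partial count rather than its own average. The control node needs both to compute a correct overall average, because averaging the per-node averages would give the wrong answer when nodes hold different numbers of matching rows.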

Figure 7-2. SMP and MPP database designs

By way of analogy, imagine your friend Fiona has a deck of 52 cards and is looking for the ace of hearts. It takes Fiona about 15 seconds, on average, to find the card. You can “scale up” by replacing Fiona with another friend, Max, who is faster. Using Max brings the average time down to 10 seconds—a limited improvement. This is how SMP databases work.

Now imagine you “scale out” instead of up, by replacing Fiona with 26 people, each of whom has only two cards. Now the average time to find the card is just 1 second. That’s how MPP databases work.

SMP and MPP databases started as on-prem solutions, and these are still prevalent today, but there are now many equivalent solutions in the cloud.

Lambda Architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by using both batch and real-time stream processing methods. The idea is to get comprehensive and accurate views of the batch data and to balance latency, throughput, scaling, and fault tolerance by using batch processing, while simultaneously using real-time stream processing to provide views of online data (such as data from IoT devices, Twitter feeds, or computer log files). You can join the two view outputs before the presentation/serving layer.

Lambda architecture bridges the gap between the historical “single source of truth” and the highly sought after “I want it now” real-time solution by combining traditional batch processing systems with stream consumption tools to meet both needs.

The Lambda architecture has three key principles:

Dual data model

The Lambda architecture uses one model for batch processing (batch layer) and another model for real-time processing (stream layer). This allows the system to handle both batch and real-time data and to perform both types of processing in scalable and fault-tolerant ways.

Single unified view

The Lambda architecture uses a single unified view (called the presentation layer) to present the results of both batch and real-time processing to end users. This allows the user to see a complete and up-to-date view of the data, even though it’s being processed by two different systems.

Decoupled processing layers

The Lambda architecture decouples the batch and real-time processing layers so that they can be scaled independently and developed and maintained separately, allowing for flexibility and ease of development.

Figure 7-3 depicts a high-level overview of the Lambda architecture.

On the left of Figure 7-3 is the data consumption layer. This is where you import the data from all source systems. Some sources may be streaming the data, while others only provide data daily or hourly.

In the top middle, you see the stream layer, also called the speed layer. It provides for incremental updating, making it the more complex of the two middle layers. It trades accuracy for low latency, looking at only recent data. The data in here may be only seconds behind, but the trade-off is that it might not be clean. Data in this layer is usually stored in a data lake.

Figure 7-3. Overview of Lambda architecture

Beneath that is the batch layer, which looks at all the data at once and eventually corrects the data that comes into the stream layer. It is the single source of truth, the trusted layer. Here there’s usually lots of ETL, and data is stored in a traditional data warehouse or data lake. This layer is built using a predefined schedule, usually daily or hourly, and includes importing the data currently stored in the stream layer.

At the right of Figure 7-3 is the presentation layer, also called the serving layer. Think of it as a mediator; when it accepts queries, it decides when to use the batch layer and when to use the speed layer. It generally defaults to the batch layer, since that has the trusted data, but if you ask it for up-to-the-second data (perhaps by setting alerts for certain log messages that indicate a server is down), it will pull from the stream layer. This layer has to balance retrieving the data you can trust with retrieving the data you want right now.
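The interplay of the three layers can be sketched as follows. This is a toy illustration with page-view events held in a Python list. The event fields, the cut-off timestamp, and the decision to add the speed view on top of the batch view for real-time requests are assumptions for the example, one plausible way a serving layer might combine the two.

```python
# Page-view events with a simple integer timestamp (hypothetical data).
EVENTS = [
    {"ts": 1, "page": "home"},
    {"ts": 2, "page": "cart"},
    {"ts": 3, "page": "home"},
    {"ts": 5, "page": "home"},
    {"ts": 6, "page": "cart"},
]


def batch_view(events, up_to):
    """Batch layer: recompute counts from all events up to the last scheduled run."""
    counts = {}
    for event in events:
        if event["ts"] <= up_to:
            counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts


def speed_view(events, after):
    """Speed layer: count only the events that arrived since the last batch run."""
    counts = {}
    for event in events:
        if event["ts"] > after:
            counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts


def serve(batch, speed, page, realtime=False):
    """Serving layer: default to the trusted batch view, add recent data on request."""
    total = batch.get(page, 0)
    if realtime:
        total += speed.get(page, 0)
    return total


LAST_BATCH_RUN = 4
BATCH = batch_view(EVENTS, up_to=LAST_BATCH_RUN)
SPEED = speed_view(EVENTS, after=LAST_BATCH_RUN)
print(serve(BATCH, SPEED, "home"))                  # trusted, but slightly stale
print(serve(BATCH, SPEED, "home", realtime=True))   # includes events since the last batch run
```

When the next batch run completes, its view absorbs the recent events and the speed view can be reset, which mirrors how the batch layer eventually corrects what the stream layer reported.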

The Lambda architecture is an excellent choice for building distributed systems that need to handle both batch and real-time data, like recommendation engines and fraud detection systems. However, that doesn’t mean it’s the best choice for every situation. Some potential drawbacks of the Lambda architecture include:

Complexity

The Lambda architecture includes a dual data model and a single unified view. That can be more complex to implement and maintain than other architectures.

Limited real-time processing

The Lambda architecture is designed for both batch and real-time processing, but it may not be as efficient at handling high volumes of real-time data as the Kappa architecture (discussed in the next section), which is specifically designed for real-time processing.

Limited support for stateful processing

The Lambda architecture is designed for stateless processing and may not be well suited for applications that require maintaining state across multiple events. For example, consider a retail store with a recommendation system that suggests products based on customers’ browsing and purchasing patterns. If this system used a Lambda architecture, which processes each event separately without maintaining state, it could miss the customer’s shopping journey and intent. If the customer browses for shoes, then socks, and then shoe polish, a stateless system might not correctly recommend related items like shoelaces or shoe storage, because it doesn’t consider the sequence of events. It might also recommend items that are already in the customer’s cart.

Overall, you should consider the Lambda architecture if you need to build a distributed system that can handle both batch and real-time data while providing a single unified view of the data. If you need stateful processing or to handle high volumes of real-time data, you may want to consider the Kappa architecture.

Kappa Architecture

As opposed to the Lambda architecture, which is designed to handle both real-time and batch data, Kappa is designed to handle just real-time data. And like the Lambda architecture, Kappa architecture is also designed to handle high levels of concurrency and high volumes of data. Figure 7-4 provides a high-level overview of the Kappa architecture.

The three key principles of the Kappa architecture are:

Real-time processing

The Kappa architecture is designed for real-time processing, which means that events are processed as soon as they are received rather than being batch processed later. This decreases latency and enables the system to respond quickly to changing conditions.

Single event stream

The Kappa architecture uses a single event stream to store all data that flows through the system. This allows for easy scalability and fault tolerance, since the data can be distributed easily across multiple nodes.

Stateless processing

In the Kappa architecture, all processing is stateless. This means that each event is processed independently, without relying on the state of previous events. This makes it easier to scale the system, because there is no need to maintain state across multiple nodes.
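The three principles above can be sketched with a small in-memory event log. The EventLog class, the sensor events, and the 80-degree threshold are hypothetical stand-ins. Reading from an earlier offset to reprocess events is a common companion idea that is assumed here, not something described in the text above.

```python
class EventLog:
    """Append-only list standing in for the single event stream."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Store the event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, from_offset=0):
        """Iterate over events starting at the given offset."""
        return iter(self._events[from_offset:])


def to_alert(event):
    """Stateless handler: the result depends on this one event and nothing else."""
    if event["temp_c"] > 80:
        return {"sensor": event["sensor"], "alert": "overheat"}
    return None


def process(log, handler, from_offset=0):
    """Apply the handler to each event independently and keep the non-empty results."""
    return [out for out in map(handler, log.read(from_offset)) if out is not None]


LOG = EventLog()
LOG.append({"sensor": "a", "temp_c": 70})
LOG.append({"sensor": "b", "temp_c": 95})
LOG.append({"sensor": "a", "temp_c": 85})

print(process(LOG, to_alert))
```

Because to_alert never looks at earlier events, any number of workers could each take a slice of the log without coordinating with one another, which is the scaling benefit that stateless processing is meant to provide.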

Figure 7-4. Overview of Kappa architecture

The layers in the Kappa architecture are exactly the same as in the Lambda architecture, except that the Kappa architecture does not have a batch layer.

Some potential drawbacks of the Kappa architecture include:

Complexity

The Kappa architecture involves a single event stream and stateless processing, which can be more complex to implement and maintain than other architectures.

Limited batch processing

The Kappa architecture is designed for real-time processing and does not easily support batch processing of historical data. If you need to perform batch processing, you may want to consider the Lambda architecture instead.

Limited support for ad hoc queries

Because the Kappa architecture is designed for real-time processing, it may not be well suited for ad hoc queries that need to process large amounts of historical data.

Overall, the Kappa architecture is an excellent choice for building distributed systems that need to handle large amounts of data in real time and that need to be scalable, fault tolerant, and have low latency. Examples include streaming platforms and financial trading systems. However, if you need to perform batch processing or support ad hoc queries, then the Lambda architecture may be a better choice.

Note that the Lambda and Kappa architectures are high-level design patterns that can be implemented within any of the data architectures described in Part III of this book. If you use one of those architectures to build a solution that supports both batch and real-time data, then that architecture supports the Lambda architecture; if you use one of those architectures to build a solution that supports only real-time data, that architecture supports the Kappa architecture.

Polyglot Persistence and Polyglot Data Stores

Polyglot persistence is a fancy term that means using multiple data storage technologies to store different types of data within a single application or system, based upon how the data will be used. Different kinds of data are best kept in different data stores. In short, polyglot persistence means picking the right tool for the right use case. It’s the same idea as the one behind polyglot programming, in which applications are written in a mix of languages to take advantage of different languages’ strengths in tackling different problems.

By contrast, a polyglot data store means using multiple data stores across an organization or enterprise. Each data store is optimized for a specific type of data or use case. This approach allows organizations to use different data stores for different projects or business units, rather than a one-size-fits-all approach for the entire organization.

For example, say you’re building an ecommerce platform that will deal with many types of data (shopping carts, inventory, completed orders, and so forth). Instead of trying to store all the different types of data in one database, which would require a lot of conversion, you could take a polyglot persistence approach and store each kind of data in the database best suited for it. So, an ecommerce platform might look like the diagram in Figure 7-5.

Figure 7-5. An ecommerce platform with a polyglot persistence design

This results in the best tool being used for each type of data. In Figure 7-5, you can see that the database uses a key-value store for shopping cart and session data (giving very fast retrieval), a document store for completed orders (making storing and retrieving order data fast and easy), an RDBMS for inventory and item prices (since those are best stored in a relational database due to the structured nature of the data), and a graph store for customer social graphs (since it’s very difficult to store graph data in a non-graph store).
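The division of labor described above can be sketched with simple stand-ins for each kind of store: a dictionary for the key-value store, JSON strings for the document store, an in-memory SQLite database for the RDBMS, and adjacency sets for the graph store. The class, method, and column names are hypothetical, and a real system would use a dedicated product for each role.

```python
import json
import sqlite3


class EcommerceStores:
    """One application, four storage styles (all in-memory stand-ins)."""

    def __init__(self):
        self.carts = {}      # key-value: session id -> {sku: quantity}
        self.orders = {}     # document: order id -> JSON document
        self.friends = {}    # graph: customer -> set of connected customers
        self.inventory = sqlite3.connect(":memory:")  # relational
        self.inventory.execute(
            "CREATE TABLE items (sku TEXT PRIMARY KEY, price_cents INTEGER, stock INTEGER)"
        )

    def add_item(self, sku, price_cents, stock):
        with self.inventory:
            self.inventory.execute(
                "INSERT INTO items VALUES (?, ?, ?)", (sku, price_cents, stock)
            )

    def add_to_cart(self, session_id, sku, quantity):
        cart = self.carts.setdefault(session_id, {})
        cart[sku] = cart.get(sku, 0) + quantity

    def checkout(self, session_id, order_id):
        """Turn the cart into an order document and adjust relational inventory."""
        cart = self.carts.pop(session_id)
        lines = []
        with self.inventory:
            for sku, quantity in cart.items():
                (price,) = self.inventory.execute(
                    "SELECT price_cents FROM items WHERE sku = ?", (sku,)
                ).fetchone()
                self.inventory.execute(
                    "UPDATE items SET stock = stock - ? WHERE sku = ?", (quantity, sku)
                )
                lines.append({"sku": sku, "qty": quantity, "price_cents": price})
        self.orders[order_id] = json.dumps({"order_id": order_id, "lines": lines})
        return json.loads(self.orders[order_id])

    def add_friendship(self, a, b):
        self.friends.setdefault(a, set()).add(b)
        self.friends.setdefault(b, set()).add(a)

    def friends_of_friends(self, person):
        """Graph-style traversal: people two hops away who are not already friends."""
        direct = self.friends.get(person, set())
        reachable = set()
        for friend in direct:
            reachable |= self.friends.get(friend, set())
        return reachable - direct - {person}
```

Each method touches only the store that suits its data, so the cart lookup stays a plain key access while the two-hop friend query stays a short traversal. The cost, as the sketch also shows, is that the application has to coordinate several stores during an operation like checkout.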

This will come at a cost in complexity, since each data storage solution means learning a new technology. But the benefits will be worth it. For instance, if you try to use relational databases for non-relational data, the design can significantly slow application development and performance; using the appropriate storage type pays off in speed.

Summary

This chapter explored the architectural concepts and design philosophies that form the basis of effective data systems.

First, you learned about the two primary types of data processing systems: online transaction processing (OLTP) and online analytical processing (OLAP). OLTP systems are designed for fast, reliable, short transactions, typically in the operational databases that power daily business operations. In contrast, OLAP systems support complex analytical queries, aggregations, and computations used for strategic decision making, typically in a data warehouse. You then learned about the differences between operational and analytical data.

You also learned the differences between symmetric multiprocessing (SMP) and massively parallel processing (MPP) architectures. We then delved into two modern big data–processing architectures: Lambda and Kappa. Last, we explored the concepts of polyglot persistence and polyglot data stores, which promote using the best-suited database technology for the specific needs and workload characteristics of the given data.

Starting with the next chapter, our focus will shift from data storage and processing to the principles and practices of data modeling: the crucial bridge between raw data and meaningful insights. The approaches to data modeling, such as relational and dimensional approaches and the common data model, serve as an underpinning structure that allows you to use and interpret data efficiently across diverse applications.

As we delve into these topics, you’ll see how data modeling lets you use the storage and processing solutions we’ve studied efficiently, serving as a blueprint for transforming raw data into actionable insights.