
尚硅谷大数据技术之高频面试题
High-Frequency Interview Questions on Big Data Technology from Shang Silicon Valley (Atguigu)

—————————————————————————————

尚硅谷大数据技术之高频面试题
High-Frequency Interview Questions on Big Data Technology from Shang Silicon Valley (Atguigu)

作者:尚硅谷研究院)
(Author: Shang Silicon Valley Research Institute)

版本:V9.2.0
Version: V9.2.0

目录

第1章 核心技术12
Chapter 1 Core Technologies 12

1.1 Linux&Shell12

1.1.1 Linux常用高级命令12
1.1.1 Common Linux Advanced Commands 12

1.1.2 Shell常用工具及写过的脚本12
1.1.2 Common Shell Tools and Scripts Written 12

1.1.3 Shell中单引号和双引号区别13
1.1.3 Differences between single and double quotation marks in shells 13

1.2 Hadoop13

1.2.1 Hadoop常用端口号13
1.2.1 Common Hadoop port number 13

1.2.2 HDFS读流程和写流程14
1.2.2 HDFS read and write flows 14

1.2.3 HDFS小文件处理15
1.2.3 HDFS Small File Handling 15

1.2.4 HDFS的NameNode内存15

1.2.5 Shuffle及优化16

1.2.6 Yarn工作机制17
1.2.6 How Yarn works 17

1.2.7 Yarn调度器17
1.2.7 Yarn scheduler 17

1.2.8 HDFS块大小18
1.2.8 HDFS block size 18

1.2.9 Hadoop脑裂原因及解决办法?19
1.2.9 What are the causes and solutions of Hadoop split-brain? 19

1.3 Zookeeper19

1.3.1 常用命令19
1.3.1 Common commands 19

1.3.2 选举机制19
1.3.2 Electoral mechanisms 19

1.3.3 Zookeeper符合法则中哪两个?21
1.3.3 Which two of the principles does Zookeeper satisfy? 21

1.3.4 Zookeeper脑裂21

1.3.5 Zookeeper用来干嘛了21
1.3.5 What is Zookeeper used for 21

1.4 Flume21

1.4.1 Flume组成,Put事务,Take事务21
1.4.1 Flume Composition, Put Transaction, Take Transaction 21

1.4.2 Flume拦截器23
1.4.2 Flume interceptor 23

1.4.3 Flume Channel选择器24

1.4.4 Flume监控器24
1.4.4 Flume Monitor 24

1.4.5 Flume采集数据会丢失吗?24
1.4.5 Will the data collected by Flume be lost? 24

1.4.6 Flume如何提高吞吐量24
1.4.6 How Flume can increase throughput 24

1.5 Kafka24

1.5.1 Kafka架构24

1.5.2 Kafka生产端分区分配策略26
1.5.2 Kafka Production-side Partition Allocation Policy 26

1.5.3 Kafka丢不丢数据27
1.5.3 Does Kafka lose data? 27

1.5.4 Kafka的ISR副本同步队列28
1.5.4 ISR replica synchronization queue for Kafka 28

1.5.5 Kafka数据重复28
1.5.5 Kafka data duplication 28

1.5.6 Kafka如何保证数据有序or怎么解决乱序29
1.5.6 How Kafka keeps data in order or solves disorder 29

1.5.7 Kafka分区Leader选举规则31
1.5.7 Kafka Divisional Leader Election Rule 31

1.5.8 Kafka中AR的顺序31
1.5.8 Order of AR in Kafka 31

1.5.9 Kafka日志保存时间32
1.5.9 Kafka log retention time 32

1.5.10 Kafka过期数据清理32
1.5.10 Kafka obsolete data cleanup 32

1.5.11 Kafka为什么能高效读写数据33
1.5.11 Why Kafka reads and writes data efficiently 33

1.5.12 自动创建主题34
1.5.12 Automatic topic creation 34

1.5.13 副本数设定34
1.5.13 Number of copies set 34

1.5.14 Kakfa分区数34
1.5.14 Number of Kakfa partitions 34

1.5.15 Kafka增加分区35
1.5.15 Kafka increases partition 35

1.5.16 Kafka中多少个Topic35

1.5.17 Kafka消费者是拉取数据还是推送数据35
1.5.17 Kafka Whether Consumers Pull or Push Data 35

1.5.18 Kafka消费端分区分配策略36
1.5.18 Kafka Consumer Partition Allocation Policy 36

1.5.19 消费者再平衡的条件37
1.5.19 Conditions for consumer rebalancing 37

1.5.20 指定Offset消费37
1.5.20 Specify Offset Consumption 37

1.5.21 指定时间消费38
1.5.21 Consumption at specified times 38

1.5.22 Kafka监控38

1.5.23 Kafka数据积压38
1.5.23 Kafka data backlog 38

1.5.24 如何提升吞吐量39
1.5.24 How to improve throughput 39

1.5.25 Kafka中数据量计算40
1.5.25 Calculation of data volume in Kafka 40

1.5.26 Kafka如何压测?40
1.5.26 How does Kafka pressure test? 40

1.5.27 磁盘选择42
1.5.27 Disk Selection 42

1.5.28 内存选择42
1.5.28 Memory Selection 42

1.5.29 CPU选择43

1.5.30 网络选择43
1.5.30 Network Selection 43

1.5.31 Kafka挂掉44

1.5.32 Kafka的机器数量44
1.5.32 Number of machines in Kafka 44

1.5.33 服役新节点退役旧节点44
1.5.33 Commissioning new nodes and decommissioning old nodes 44

1.5.34 Kafka单条日志传输大小44
1.5.34 Kafka Single Log Transfer Size 44

1.5.35 Kafka参数优化45
1.5.35 Kafka parameter optimization 45

1.6 Hive46

1.6.1 Hive的架构46
1.6.1 Hive architecture 46

1.6.2 HQL转换为MR流程46
1.6.2 HQL to MR Flow 46

1.6.3 Hive和数据库比较47
1.6.3 Hive vs. Database Comparison 47

1.6.4 内部表和外部表48
1.6.4 Internal and external tables 48

1.6.5 系统函数48
1.6.5 System functions 48

1.6.6 自定义UDF、UDTF函数49
1.6.6 Custom UDF, UDTF functions 49

1.6.7 窗口函数50
1.6.7 Window functions 50

1.6.8 Hive优化52

1.6.9 Hive解决数据倾斜方法59
1.6.9 Hive Solutions for Data Skew 59

1.6.10 Hive的数据中含有字段的分隔符怎么处理?64
1.6.10 How to handle field delimiters appearing in Hive data? 64

1.6.11 MySQL元数据备份64
1.6.11 MySQL Metadata Backup 64

1.6.12 如何创建二级分区表?65
1.6.12 How do I create a secondary partition table? 65

1.6.13 Union与Union all区别65

1.7 Datax65

1.7.1 DataX与Sqoop区别65

1.7.2 速度控制66
1.7.2 Speed control 66

1.7.3 内存调整66
1.7.3 Memory Adjustments 66

1.7.4 空值处理66
1.7.4 Handling of null values 66

1.7.5 配置文件生成脚本67
1.7.5 Profile generation scripts 67

1.7.6 DataX一天导入多少数据67
1.7.6 DataX How much data is imported in a day 67

1.7.7 Datax如何实现增量同步68
1.7.7 Datax How to achieve incremental synchronization 68

1.8 Maxwell68

1.8.1 Maxwell与Canal、FlinkCDC的对比68

1.8.2 Maxwell好处68

1.8.3 Maxwell底层原理69
1.8.3 Maxwell's underlying principles 69

1.8.4 全量同步速度如何69
1.8.4 How fast is full synchronization 69

1.8.5 Maxwell数据重复问题69
1.8.5 Maxwell Data Duplication Problem 69

1.9 DolphinScheduler调度器69

1.9.1 每天集群运行多少指标?69
1.9.1 How many metrics does the cluster run per day? 69

1.9.2 任务挂了怎么办?69
1.9.2 What if the mission fails? 69

1.9.3DS挂了怎么办?69
1.9.3 What happens when DS dies? 69

1.10 Spark Core & SQL70

1.10.1 Spark运行模式70
1.10.1 Spark Operating Mode 70

1.10.2 Spark常用端口号70
1.10.2 Spark Common Port Number 70

1.10.3 RDD五大属性70
1.10.3 RDD Five Attributes 70

1.10.4 RDD弹性体现在哪里71
1.10.4 Where is RDD Resilience Reflected 71

1.10.5 Spark的转换算子(8个)71
1.10.5 Spark's transformation operators (8) 71

1.10.6 Spark的行动算子(5个)72
1.10.6 Spark's Action Operators (5) 72

1.10.7 map和mapPartitions区别72

1.10.8 Repartition和Coalesce区别72

1.10.9 reduceByKey与groupByKey的区别73

1.10.10 Spark中的血缘73
1.10.10 Lineage in Spark 73

1.10.11 Spark任务的划分73
1.10.11 Division of Spark Tasks 73

1.10.12 SparkSQL中RDD、DataFrame、DataSet三者的转换及区别74
1.10.12 Conversion and Difference of RDD, DataFrame and DataSet in SparkSQL 74

1.10.13 Hive on Spark和Spark on Hive区别74

1.10.14 Spark内核源码(重点)74
1.10.14 Spark kernel source code (focus) 74

1.10.15 Spark统一内存模型76
1.10.15 Spark Unified Memory Model 76

1.10.16 Spark为什么比MR快?77
1.10.16 Why is Spark faster than MR? 77

1.10.17 Spark Shuffle和Hadoop Shuffle区别?78

1.10.18 Spark提交作业参数(重点)78
1.10.18 Spark Submission Job Parameters (Key) 78

1.10.19 Spark任务使用什么进行提交,JavaEE界面还是脚本79
1.10.19 What do Spark tasks use for submission, Java EE interface or script 79

1.10.20 请列举会引起Shuffle过程的Spark算子,并简述功能。79
1.10.20 List the Spark operators that give rise to Shuffle processes and describe their functions briefly. 79

1.10.21 Spark操作数据库时,如何减少Spark运行中的数据库连接数?79
1.10.21 How do I reduce the number of database connections Spark has running when I'm working with a database? 79

1.10.22 Spark数据倾斜79
1.10.22 Spark data skew 79

1.10.23 Spark3.0新特性79

1.12 Flink80

1.12.1 Flink基础架构组成?80
1.12.1 Flink Infrastructure Composition? 80

1.12.2Flink和Spark Streaming的区别?80

1.12.3 Flink提交作业流程及核心概念81
1.12.3 Flink Submission Workflow and Core Concepts 81

1.12.4 Flink的部署模式及区别?82
1.12.4 Flink deployment patterns and differences? 82

1.12.5 Flink任务的并行度优先级设置?资源一般如何配置?82
1.12.5 Parallel Priority Setting for Flink Tasks? How are resources generally allocated? 82

1.12.6 Flink的三种时间语义83
1.12.6 Flink's three temporal semantics 83

1.12.7 你对Watermark的认识83
1.12.7 What you know about Watermark 83

1.12.8 Watermark多并行度下的传递、生成原理83
1.12.8 Watermark Transmission and Generation Principle under Multi-parallelism 83

1.12.9 Flink怎么处理乱序和迟到数据?84
1.12.9 How does Flink handle out-of-order and late data? 84

1.12.10 说说Flink中的窗口(分类、生命周期、触发、划分)85
1.12.10 Talk about windows in Flink (classification, lifecycle, trigger, partition) 85

1.12.11 Flink的keyby怎么实现的分区?分区、分组的区别是什么?86
1.12.11 How does Flink's keyBy implement partitioning? What is the difference between partitioning and grouping? 86

1.12.12 Flink的Interval Join的实现原理?Join不上的怎么办?86
1.12.12 How does Flink's Interval Join work? What about records that fail to join? 86

1.12.13 介绍一下Flink的状态编程、状态机制?86
1.12.13 Introduce Flink's state programming and state mechanism? 86

1.12.14 Flink如何实现端到端一致性?87
1.12.14 How does Flink achieve end-to-end consistency? 87

1.12.15 分布式异步快照原理88
1.12.15 Principles of Distributed Asynchronous Snapshots 88

1.12.16 Checkpoint的参数怎么设置的?88
1.12.16 How are Checkpoint parameters set? 88

1.12.17 Barrier对齐和不对齐的区别88
1.12.17 Difference between Barrier alignment and misalignment 88

1.12.18 Flink内存模型(重点)89
1.12.18 Flink Memory Model (Key) 89

1.12.19 Flink常见的维表Join方案89
1.12.19 Flink Common Dimension Tables Join Scheme 89

1.12.20 Flink的上下文对象理解89
1.12.20 Context Object Understanding for Flink 89

1.12.21 Flink网络调优-缓冲消胀机制90
1.12.21 Flink Network Tuning-Buffering and Deinflation Mechanism 90

1.12.22 FlinkCDC锁表问题90
1.12.22 FlinkCDC lock table problem 90

1.13 HBase90

1.13.1 HBase存储结构90
1.13.1 HBase Storage Structure 90

1.13.2 HBase的写流程92
1.13.2 HBase writing flow 92

1.13.3 HBase的读流程93
1.13.3 HBase reading flow 93

1.13.4 HBase的合并94
1.13.4 Consolidation of HBase 94

1.13.5 RowKey设计原则94
1.13.5 RowKey Design Principles 94

1.13.6 RowKey如何设计94
1.13.6 How RowKey is designed 94

1.13.7 HBase二级索引原理95
1.13.7 HBase secondary indexing principle 95

1.14 Clickhouse95

1.14.1 Clickhouse的优势95

1.14.2 Clickhouse的引擎95

1.14.3 Flink写入Clickhouse怎么保证一致性?96
1.14.3 Flink writes Clickhouse How to guarantee consistency? 96

1.14.4 Clickhouse存储多少数据?几张表?96
1.14.4 How much data does Clickhouse store? How many tables? 96

1.14.5 Clickhouse使用本地表还是分布式表96
1.14.5 Clickhouse Using Local or Distributed Tables 96

1.14.6 Clickhouse的物化视图96
1.14.6 Materialized Views of Clickhouse 96

1.14.7 Clickhouse的优化97

1.14.8 Clickhouse的新特性Projection97

1.14.9 Cilckhouse的索引、底层存储97
1.14.9 Indexing and underlying storage for Cilckhouse 97

1.15 Doris99

1.15.1 Doris中的三种模型及对比?99
1.15.1 The three models in Doris and how they compare? 99

1.15.2 Doris的分区分桶怎么理解,怎么划分字段99
1.15.2 How to understand the partition and bucket of Doris, how to divide the field 99

1.15.3 生产中节点多少个,FE,BE 那个对于CPU和内存的消耗大99
1.15.3 How many nodes in production; which of FE and BE consumes more CPU and memory 99

1.15.4 Doris使用过程中遇到过哪些问题?99
1.15.4 What problems have you encountered with Doris? 99

1.15.5 Doris跨库查询,关联MySQL有使用过吗99
1.15.5 Have you used Doris cross-database queries joined with MySQL 99

1.15.6 Doris的roll up和物化视图区别99
1.15.6 Difference between roll up and materialized view of Doris 99

1.15.7 Doris的前缀索引99
1.15.7 Prefix index of Doris 99

1.16 可视化报表工具99
1.16 Visual reporting tools 99

1.17 JavaSE100

1.17.1 并发编程100
1.17.1 Concurrent programming 100

1.17.2 如何创建线程池100
1.17.2 How to create a thread pool 100

1.17.3 ThreadPoolExecutor构造函数参数解析101
1.17.3 Parsing ThreadPoolExecutor constructor parameters 101

1.17.4 线程的生命周期101
1.17.4 Life cycle of threads 101

1.17.5 notify和notifyall区别101

1.17.6 集合101
1.17.6 Collections 101

1.17.7 列举线程安全的Map集合101
1.17.7 Listing Thread-Safe Map Collections 101

1.17.8 StringBuffer和StringBuilder的区别101

1.17.9 HashMap和HashTable的区别101

1.17.10 HashMap的底层原理102
1.17.10 The underlying principle of HashMap 102

1.17.11 项目中使用过的设计模式103
1.17.11 Design patterns used in projects 103

1.18 MySQL104

1.18.1 SQL执行顺序104
1.18.1 SQL Execution Order 104

1.18.2 TRUNCATE 、DROP、DELETE区别104

1.18.3 MyISAM与InnoDB的区别104

1.18.4 MySQL四种索引104
1.18.4 MySQL Four Indexes 104

1.18.5 MySQL的事务105

1.18.6 MySQL事务隔离级别105
MySQL Transaction Isolation Level 105

1.18.7 MyISAM与InnoDB对比105

1.18.8 B树和B+树对比106
1.18.8 Comparison of B trees and B+ trees

1.19 Redis106

1.19.1 Redis缓存穿透、缓存雪崩、缓存击穿106
1.19.1 Redis cache penetration, cache avalanche, cache breakdown 106

1.19.2 Redis哨兵模式107
1.19.2 Redis Sentinel Mode 107

1.19.3 Redis数据类型107
1.19.3 Redis data types 107

1.19.4 热数据通过什么样的方式导入Redis108
1.19.4 How hot data is imported into Redis 108

1.19.5 Redis的存储模式RDB,AOF108
1.19.5 Redis storage mode RDB, AOF 108

1.19.6 Redis存储的是k-v类型,为什么还会有Hash?108
1.19.6 Redis stores k-v type, why is there a Hash? 108

1.20 JVM108

第2章 离线数仓项目108
Chapter 2 Offline Data Warehouse Project 108

2.1 提高自信108
2.1 Increased Confidence 108

2.2 为什么做这个项目109
2.2 Why did you do this project 109?

2.3 数仓概念109
2.3 Data Warehouse Concepts 109

2.4 项目架构110
2.4 Project Architecture 110

2.5 框架版本选型110
2.5 Frame Version Selection 110

2.6 服务器选型112
2.6 Server selection 112

2.7 集群规模112
2.7 Cluster Size 112

2.8 人员配置参考115
2.8 Staffing Reference 115

2.8.1 整体架构115
2.8.1 Overall structure 115

2.8.2 你的的职级等级及晋升规则115
2.8.2 Your rank and promotion rules 115

2.8.3 人员配置参考116
2.8.3 Staffing Reference 116

2.9 从0-1搭建项目,你需要做什么?117
2.9 Building a project from 0-1, what do you need to do? 117

2.10 数仓建模准备118
2.10 Warehouse modeling preparation 118

2.11 数仓建模119
2.11 Warehouse modeling 119

2.12 数仓每层做了哪些事122
2.12 What is done at each layer of the data warehouse 122

2.13 数据量124
2.13 Volume of data 124

2.14 项目中遇到哪些问题?(*****)125
2.14 What problems were encountered in the project?(*****) 125

2.15 离线---业务126
2.15 Offline---Business 126

2.15.1 SKU和SPU126

2.15.2 订单表跟订单详情表区别?126
2.15.2 What is the difference between an order form and an order details form? 126

2.15.3 上卷和下钻126
2.15.3 Roll-up and drill-down 126

2.15.4 TOB和TOC解释127

2.15.5 流转G复活指标127
2.15.5 Circulation G Revival Indicator 127

2.15.6 活动的话,数据量会增加多少?怎么解决?128
2.15.6 During a promotion, how much does the data volume increase? How to handle it? 128

2.15.7 哪个商品卖的好?128
2.15.7 Which product sells well? 128

2.15.8 数据仓库每天跑多少张表,大概什么时候运行,运行多久?128
2.15.8 How many tables does the data warehouse run per day, when and how long does it run? 128

2.15.9 哪张表数据量最大129
2.15.9 Which table has the largest amount of data 129

2.15.10 哪张表最费时间,有没有优化130
2.15.10 Which table is the most time-consuming and optimized? 130

2.15.11 并发峰值多少?大概哪个时间点?131
2.15.11 How many concurrent peaks? About what time? 131

2.15.12 分析过最难的指标131
2.15.12 The most difficult indicators analyzed 131

2.15.13 数仓中使用的哪种文件存储格式131
2.15.13 Which file storage format is used in Warehouse 131

2.15.14 数仓当中数据多久删除一次131
2.15.14 How often is data deleted in the warehouse 131

2.15.15 Mysql业务库中某张表发生变化,数仓中表需要做什么改变131
2.15.15 A table in Mysql business library has changed. What changes need to be made to the table in data warehouse 131

2.15.16 50多张表关联,如何进行性能调优131
2.15.16 Joining 50-plus tables: how to tune performance 131

2.15.17 拉链表的退链如何实现131
2.15.17 How to roll back records in a zipper (slowly changing) table 131

2.15.18 离线数仓如何补数132
2.15.18 How does an offline data warehouse backfill data 132

2.15.19 当ADS计算完,如何判断指标是正确的132
2.15.19 When ADS is calculated, how to determine whether the indicator is correct 132

2.15.20 ADS层指标计算错误,如何解决132
2.15.20 ADS layer index calculation error, how to solve 132

2.15.21 产品给新指标,该如何开发132
2.15.21 How to develop new indicators for products 132

2.15.22 新出指标,原有建模无法实现,如何操作133
2.15.22 New indicators, the original modeling can not be achieved, how to operate 133

2.15.23 和哪些部门沟通,以及沟通什么内容133
2.15.23 Which departments to communicate with and what to communicate

2.15.24 你们的需求和指标都是谁给的133
2.15.24 Who gave you your needs and indicators?

2.15.25 任务跑起来之后,整个集群资源占用比例133
2.15.25 After the task runs, the resource consumption ratio of the whole cluster is 133

2.15.26 业务场景:时间跨度比较大,数据模型的数据怎么更新的,例如:借款,使用一年,再还款,这个数据时间跨度大,在处理的时候怎么处理133
2.15.26 Business scenario: The time span is relatively large. How to update the data of the data model, for example: borrow money, use it for one year, and then repay it. The time span of this data is large. How to deal with it when processing 133

2.15.27 数据倾斜场景除了group by 和join外,还有哪些场景133
2.15.27 Besides group by and join, what other scenarios cause data skew 133

2.15.28 你的公司方向是电子商务,自营的还是提货平台?你们会有自己的商品吗?134
2.15.28 Is your company oriented towards e-commerce, self-employed or delivery platform? Do you have your own merchandise? 134

2.15.29 ods事实表中订单状态会发生变化,你们是通过什么方式去监测数据变化的134
2.15.29 ods fact table order status will change, how do you monitor the data change 134

2.15.30 用户域你们构建了哪些事实表?登录事实表有哪些核心字段和指标?用户交易域连接起来有哪些表?134
2.15.30 User Domain What fact tables have you constructed? What are the core fields and metrics of the login fact table? What tables are there for linking user transaction domains? 134

2.15.31 当天订单没有闭环结束的数据量?135
2.15.31 Data volume of orders not closed on the same day? 135

2.15.32 你们维度数据要做ETL吗?除了用户信息脱敏?没有做其他ETL吗136
2.15.32 Do you want ETL for dimensional data? Besides user information desensitization? No other ETL? 136

2.15.33 怎么做加密,加密数据要用怎么办,我讲的md5,他问我md5怎么做恢复136
2.15.33 How is encryption done, and what if the encrypted data needs to be used? I mentioned MD5 and was asked how MD5 can be recovered 136

2.15.34 真实项目流程137
2.15.34 Real Project Flow 137

2.15.35 指标的口径怎么统一的(离线这边口径变了,实时这边怎么去获取的口径)137
2.15.35 How to unify the caliber of indicators (the caliber changed offline, how to obtain the caliber in real time) 137

2.15.36 表生命周期管理怎么做的?138
2.15.36 How is life cycle management done? 138

2.15.37 如果上游数据链路非常的多,层级也非常的深,再知道处理链路和表的血缘的情况下,下游数据出现波动怎么处理?139
2.15.37 If there are many upstream data links with deep hierarchies, and the processing links and table lineage are known, how to handle fluctuations in downstream data? 139

2.15.38 十亿条数据要一次查询多行用什么数据库比较好?139
2.15.38 Billion pieces of data to query multiple rows at once What database is better? 139

2.16 埋点139
2.16 Event Tracking 139

第3章 实时数仓项目141
Chapter 3 Real-Time Data Warehouse Project 141

3.1 为什么做这个项目141
3.1 Why did you do this project 141

3.2 项目架构141
3.2 Project Architecture 141

3.3 框架版本选型141
3.3 Frame Version Selection 141

3.4 服务器选型141
3.4 Server selection 141

3.5 集群规模142
3.5 Cluster Size 142

3.6 项目建模142
3.6 Project Modeling142

3.7 数据量144
3.7 Volume of data 144

3.7.1 数据分层数据量144
3.7.1 Data Stratification Data Volume 144

3.7.2 实时组件存储数据量145
3.7.2 Real-time Component Storage Data Volume 145

3.7.3 实时QPS峰值数据量145
3.7.3 Real-time QPS peak data volume 145

3.8 项目中遇到哪些问题及如何解决?145
3.8 What problems were encountered in the project and how to solve them? 145

3.8.1 业务数据采集框架选型问题145
3.8.1 Selection of Business Data Acquisition Framework 145

3.8.2 项目中哪里用到状态编程,状态是如何存储的,怎么解决大状态问题146
3.8.2 Where state programming is used in the project, how state is stored, and how to solve large state problems

3.8.3 项目中哪里遇到了反压,造成的危害,定位解决(*重点*)146
3.8.3 Where the project encountered back pressure, resulting in harm, positioning solution (* key *) 146

3.8.4 数据倾斜问题如何解决(****重点***)147
3.8.4 How to solve the data skew problem (**** Focus **) 147

3.8.5 数据如何保证一致性问题148
3.8.5 How to ensure consistency of data

3.8.6 FlinkSQL性能比较慢如何优化148
3.8.6 FlinkSQL performance is slow how to optimize 148

3.8.7 Kafka分区动态增加,Flink监控不到新分区数据导致数据丢失148
3.8.7 Kafka partition dynamically increases, Flink fails to monitor new partition data resulting in data loss 148

3.8.9 Kafka某个分区没有数据,导致下游水位线无法抬升,窗口无法关闭计算148
3.8.9 Kafka a partition has no data, resulting in the downstream water mark can not be raised, the window can not be closed calculation 148

3.8.10 Hbase的rowkey设计不合理导致的数据热点问题148
3.8.10 Data Hotspots Caused by Irrational Rowkey Design of Hbase 148

3.8.11 Redis和HBase的数据不一致问题148
3.8.11 Data inconsistency between Redis and HBase

3.8.12 双流join关联不上如何解决149
3.8.12 How to solve the problem of double stream join

3.9 生产经验150
3.9 Production experience 150

3.9.1 Flink任务提交使用那种模式,为何选用这种模式150
3.9.1 Which mode does Flink task submission use and why? 150

3.9.2 Flink任务提交参数,JobManager和TaskManager分别给多少150
3.9.2 Flink task submission parameters: how much memory for JobManager and TaskManager 150

3.9.3 Flink任务并行度如何设置150
3.9.3 How to set Flink task parallelism 150

3.9.4 项目中Flink作业Checkpoint参数如何设置150
3.9.4 How to set Flink Checkpoint parameter in project 150

3.9.5 迟到数据如何解决150
3.9.5 How to solve the problem of late data 150

3.9.6 实时数仓延迟多少151
3.9.6 How much latency does the real-time data warehouse have 151

3.9.7 项目开发多久,维护了多久151
3.9.7 How long has the project been developed and maintained 151

3.9.8 如何处理缓存冷启动问题151
3.9.8 How to handle cache cold start problems

3.9.9 如何处理动态分流冷启动问题(主流数据先到,丢失数据怎么处理)151
3.9.9 How to handle the dynamic split-stream cold start problem (main-stream data arrives first; how to handle the lost data) 151

3.9.10 代码升级,修改代码,如何上线151
3.9.10 Code upgrade, code modification, how to go online 151

3.9.11 如果现在做了5个Checkpoint,Flink Job挂掉之后想恢复到第三次Checkpoint保存的状态上,如何操作151
3.9.11 If 5 Checkpoints are made now, Flink Job hangs and wants to restore to the state saved by the third Checkpoint, how to operate 151

3.9.12 需要使用flink记录一群人,从北京出发到上海,记录出发时间和到达时间,同时要显示每个人用时多久,需要实时显示,如果让你来做,你怎么设计?152
3.9.12 Need to use flink to record a group of people, from Beijing to Shanghai, record departure time and arrival time, at the same time to show how long each person takes, need real-time display, if let you do, how do you design? 152

3.9.13 flink内部的数据质量和数据的时效怎么把控的152
3.9.13 Flink internal data quality and data timeliness how to control 152

3.9.14 实时任务问题(延迟)怎么排查152
3.9.14 How to troubleshoot real-time task problems (delays) 152

3.9.15 维度数据查询并发量152
3.9.15 Dimension Data Query Concurrent Volume 152

3.9.16 Prometheus+Grafana是自己搭的吗,监控哪些指标152
3.9.16 Is Prometheus+Grafana built by itself and which indicators are monitored 152

3.9.17 怎样在不停止任务的情况下改flink参数153
3.9.17 How to change flink parameters without stopping the task 153

3.9.18 hbase中有表,里面的1月份到3月份的数据我不要了,我需要删除它(彻底删除),要怎么做153
3.9.18 There is a table in hbase. I don't want the data from January to March in it. I need to delete it (delete it completely). How to do 153

3.9.19 如果flink程序的数据倾斜是偶然出现的,可能白天可能晚上突然出现,然后几个月都没有出现,没办法复现,怎么解决?153
3.9.19 If the data skew of flink program appears accidentally, it may suddenly appear during the day or at night, and then it does not appear for several months. There is no way to reproduce it. How to solve it? 153

3.9.20 维度数据改变之后,如何保证新join的维度数据是正确的数据153
3.9.20 How to ensure that the dimension data of the new join is correct after the dimension data is changed

3.10 实时---业务153
3.10 Real-time--business 153

3.10.1 数据采集到ODS层153
3.10.1 Data Acquisition to ODS Layer 153

3.10.2 ODS层154

3.10.3 DWD+DIM层154

3.10.4 DWS层155

3.10.5 ADS层157

第4章 数据考评平台项目158
Chapter 4 Data Governance Assessment Platform Project 158

4.1项目背景158
4.1 Background of the project 158

4.1.1 为什么做数据治理158
4.1.1 Why Data Governance 158

4.1.2 数据治理概念158
4.1.2 Data Governance Concepts

4.1.3 数据治理考评平台做的是什么158
4.1.3 What does the Data Governance Assessment Platform do?

4.1.4 考评指标158
4.1.4 Evaluation indicators 158

4.2 技术架构159
4.2 Technical architecture 159

4.3 项目实现了哪些功能159
4.3 What functions are implemented by the project 159

4.3.1 元数据的加载与处理及各表数据的页面接口159
4.3.1 Loading and processing of metadata and page interfaces for table data 159

4.3.2 数据治理考评链路(**核心**)159
4.3.2 Data Governance Assessment Link (** Core **) 159

4.3.3 数据治理考评结果核算160
4.3.3 Data Governance Assessment Results Accounting 160

4.3.4 可视化治理考评提供数据接口160
4.3.4 Visual governance assessment provides data interface 160

4.4 项目中的问题/及优化161
4.4 Problems in the project/optimization 161

4.4.1 计算hdfs路径数据量大小、最后修改访问时间161
4.4.1 Calculate hdfs path data size, last modified access time 161

4.4.2 考评器作用是什么?161
4.4.2 What is the role of the evaluator? 161

4.4.3 稍微难度考评器实现思路161
4.4.3 Slightly difficult evaluator implementation ideas 161

4.4.4 利用多线程优化考评计算161
4.4.4 Using multithreading to optimize evaluation calculations 161

4.4.5 实现过哪些指标161
4.4.5 What indicators have been achieved

第4章 用户画像项目162
Chapter 4 User Portrait (Profile) Project 162

4.1 画像系统主要做了哪些事162
4.1 What does the portrait (user profile) system mainly do 162

4.2 项目整体架构162
4.2 Overall project structure 162

4.3 讲一下标签计算的调度过程163
4.3 Let's talk about the scheduling process of label calculation 163.

4.4 整个标签的批处理过程163
4.4 Batch process for whole label 163

4.5 你们的画像平台有哪些功能 ?163
4.5 What are the functions of your portrait platform? 163

4.6 是否做过Web应用开发,实现了什么功能163
4.6 Have you done Web application development and what functions have been implemented? 163

4.7 画像平台的上下游164
4.7 upstream and downstream of the portrait platform 164

4.8 BitMap原理,及为什么可以提高性能164
4.8 BitMap Principle and Why It Can Improve Performance 164

第5章 数据湖项目164
Chapter 5-The Data Lake Project

5.1 数据湖与数据仓库对比164
5.1 Data Lake vs. Data Warehouse

5.2 为什么做这个项目?解决了什么痛点?164
5.2 Why do this project? What pain points were solved? 164

5.3 项目架构165
5.3 Project Architecture 165

5.4 业务165
5.4 Business 165

5.5 优化or遇到的问题怎么解决165
5.5 Optimizations, or how problems encountered were solved 165

第6章 测试&上线流程166
Chapter 6 Testing & Release Process 166

6.1 测试相关166
6.1 Test related 166

6.1.1 公司有多少台测试服务器?166
6.1.1 How many test servers does the company have? 166

6.1.2 测试服务器配置?166
6.1.2 Test Server Configuration? 166

6.1.3 测试数据哪来的?167
6.1.3 Where did the test data come from? 167

6.1.4 如何保证写的SQL正确性(重点)167
6.1.4 How to ensure the correctness of SQL writing (emphasis) 167

6.1.5 测试之后如何上线?167
6.1.5 How to go online after testing? 167

6.1.6 A/B测试了解167
6.1.6 A/B Test Understanding 167

6.2 项目实际工作流程169
6.2 Project actual workflow 169

6.3 项目当前版本号是多少?多久升级一次版本171
6.3 What is the current version number of the project? How often do I update version 171?

6.4 项目中实现一个需求大概多长时间171
6.4 How long does it take to implement a requirement in a project? 171

6.5 项目开发中每天做什么事171
6.5 What do you do every day in project development? 171

第7章 数据治理172
Chapter 7: Data Governance

7.1 元数据管理172
7.1 Metadata management 172

7.2 数据质量监控173
7.2 Data quality monitoring 173

7.2.1 监控原则173
7.2.1 Principles of surveillance 173

7.2.2 数据质量实现174
7.2.2 Data Quality Realization 174

7.2.3 实现数据质量监控,你具体怎么做,详细说?174
7.2.3 Data quality monitoring, how do you do it, in detail? 174

7.3 权限管理(Ranger)175
7.3 Authority Management (Ranger) 175

7.4 用户认证(Kerberos)175
7.4 User authentication (Kerberos) 175

7.5 数据治理176
7.5 Data Governance 176

第8章 中台178
Chapter 8 Middle Platform 178

8.1 什么是中台?179
8.1 What is a middle platform? 179

8.2 各家中台180
8.2 Middle platforms at various companies 180

8.3 中台具体划分180
8.3 Detailed breakdown of the middle platform 180

8.4 中台使用场景181
8.4 Middle platform usage scenarios 181

8.5 中台的痛点182
8.5 Pain points of the middle platform 182

第9章 算法题(LeetCode)182
Chapter 9: LeetCode 182

9.1 时间复杂度、空间复杂度理解182
9.1 Time complexity, spatial complexity understanding 182

9.2 常见算法求解思想182
9.2 Common algorithmic problem-solving approaches 182

9.3 基本算法183
9.3 Basic algorithm 183

9.3.1 冒泡排序183
9.3.1 Bubble Sorting 183

9.3.2 快速排序183
9.3.2 Quick Sorting 183

9.3.3 归并排序184
9.3.3 Merged sorting 184

9.3.4 遍历二叉树185
9.3.4 Traversing Binary Trees 185

9.3.5 二分查找185
9.3.5 Binary search 185

9.4 小青蛙跳台阶186
9.4 Little Frog Jumping Steps 186

9.5 最长回文子串186
9.5 Longest palindrome substring 186

9.6 数字字符转化成IP186
9.6 Digital characters converted to IP 186

9.7 最大公约数187
9.7 Maximum common divisor 187

9.8 链表反转187
9.8 Linked list reversal 187

9.9 数组寻找峰值187
9.9 Array Looking for Peak 187

第10章 场景题187
Chapter 10 Scenario Questions 187

10.1 手写Flink的UV187

10.2 Flink的分组TopN187

10.3 Spark的分组TopN187

10.4 如何快速从40亿条数据中快速判断,数据123是否存在187
10.4 How to quickly determine from 4 billion pieces of data whether data 123 exists 187

10.5 给你100G数据,1G内存,如何排序?187
10.5 Give you 100 gigabytes of data, 1 gigabyte of memory, how to sort? 187

10.6 公平调度器容器集中在同一个服务器上?187
10.6 Fair scheduler containers centralized on the same server? 187

10.7 匹马赛跑,1个赛道,每次5匹进行比赛,无法对每次比赛计时,但知道每次比赛结果的先后顺序,最少赛多少次可以找出前三名?188
10.7 A horse race, 1 track, 5 horses at a time, unable to time each race, but know the order of each race results, at least how many races can find the top three? 188

10.8 给定一个点、一条线、一个三角形、一个有向无环图,请用java面向对象的思想进行建模188
10.8 Given a point, a line, a triangle, and a directed acyclic graph, model it using java object-oriented thinking.

10.9 现场出了一道sql题,让说出sql的优化,优化后效率提升了多少188
10.9 An SQL question was given on the spot: explain the SQL optimization and how much efficiency improved afterward 188

第11章 HQL场景题188
Chapter 11 HQL Scenario Questions 188

第12章 面试说明188
Chapter 12 Interview Notes 188

12.1 面试过程最关键的是什么?188
12.1 What is the most important part of the interview process? 188

12.2 面试时该怎么说?188
12.2 What should I say during the interview? 188

12.3 面试技巧189
12.3 Interview skills 189

12.3.1 六个常见问题189
12.3.1 Six common questions 189

12.3.2 两个注意事项190
12.3.2 Two considerations 190

12.3.3 自我介绍190
12.3.3 Self-introduction 190

第1章 核心技术
Chapter 1: The Core Technology

1.1 Linux&Shell

1.1.1 Linux常用高级命令
1.1.1 Linux Common Advanced Commands

序号

命令

命令解释
command interpretation

1

top

实时显示系统中各个进程的资源占用状况(CPU、内存和执行时间)
Real-time display of resource usage (CPU, memory, and execution time) of various processes in the system

2

jmap -heap 进程号
jmap -heap process number

查看某个进程内存
View a process memory

3

free -m

查看系统内存使用情况
View system memory usage

4

ps -ef

查看进程
viewing process

5

netstat -tunlp | grep 端口号

查看端口占用情况
View port occupancy

6

du -sh 路径*
du -sh path *

查看路径下的磁盘使用情况
View disk usage under path

例如:$ du -sh /opt/*
For example: $ du -sh /opt/*

7

df -h

查看磁盘存储情况
View disk storage

1.1.2 Shell常用工具及写过的脚本
1.1.2 Shell Common Tools and Written Scripts

1)awk、sed、cut、sort
1) awk, sed, cut, sort

2)用Shell写过哪些脚本
2) What scripts have been written in Shell?

(1)集群启动,分发脚本
(1) Cluster startup, distribution script

#!/bin/bash

case $1 in

"start")

for i in hadoop102 hadoop103 hadoop104

do

ssh $i "绝对路径"
ssh $i "absolute path"

done

;;

"stop")

;;

esac

(2)数仓层级内部的导入:ods->dwd->dws->ads
(2) Import within warehouse hierarchy: ods->dwd->dws ->ads

①#!/bin/bash

②定义变量 APP=gmall
② Define variable APP=gmall

③获取时间
③ Acquisition time

传入 按照传入时间
If a date is passed in, use the passed-in date

不传 T+1
If no date is passed in, default to T+1 (the previous day)

④sql="

先按照当前天 写sql => 遇到时间 $do_date 遇到表 {$APP}.
First write the SQL for the current day => wherever a date appears, use $do_date; wherever a table appears, prefix it with {$APP}.

自定义函数 UDF UDTF {$APP}.
UDF UDTF {$APP}.

"

⑤执行sql
⑤ Execute the SQL

1.1.3 Shell中单引号和双引号区别
1.1.3 Shell Single Quotation and Double Quotation Difference

1)在/home/atguigu/bin创建一个test.sh文件
1) Create a test.sh file at/home/atguigu/bin

[atguigu@hadoop102 bin]$ vim test.sh

文件中添加如下内容
Add the following to the file

#!/bin/bash

do_date=$1

echo '$do_date'

echo "$do_date"

echo "'$do_date'"

echo '"$do_date"'

echo `date`

2)查看执行结果
2) View execution results

[atguigu@hadoop102 bin]$ test.sh 2022-02-10

$do_date

2022-02-10

'2022-02-10'

"$do_date"

2022年 05月 02日 星期四 21:02:08 CST
Thursday May 2nd, 2022 21:02:08 CST

3)总结:
3) Summary:

(1)单引号不取变量值
(1) Single quotes do not expand variable values

(2)双引号取变量值
(2) Double quotes take variable values

(3)反引号`,执行引号中命令
(3) Back quotes `, execute the command in quotes

(4)双引号内部嵌套单引号,取出变量值
(4) Single quotes nested inside double quotes: the variable value is expanded

(5)单引号内部嵌套双引号,不取变量值
(5) Double quotes nested inside single quotes: the variable value is not expanded

1.2 Hadoop

1.2.1 Hadoop常用端口号
1.2.1 Common port numbers for Hadoop

hadoop2.x

hadoop3.x

访问HDFS端口
Access HDFS port

50070

9870

访问MR执行情况端口
Access MR Performance Port

8088

8088

历史服务器
history server

19888

19888

客户端访问集群端口
Client Access Cluster Port

9000

8020

1.2.2 HDFS读流程写流程
1.2.2 HDFS Read and Write Flow

注意:HDFS写入流程时候,某台dataNode挂掉如何运行?
Note: During the HDFS write flow, what happens if a DataNode goes down?

当DataNode突然挂掉了,客户端接收不到这个DataNode发送的ack确认,客户端会通知NameNode,NameNode检查并确认该块的副本与规定的不符,NameNode会通知闲置的DataNode去复制副本,并将挂掉的DataNode作下线处理。挂掉的DataNode节点恢复后, 删除该节点中曾经拷贝的不完整副本数据。
When a DataNode suddenly goes down, the client does not receive that DataNode's ack, so it notifies the NameNode. The NameNode checks and finds that the block's replica count no longer meets the requirement, so it instructs an idle DataNode to replicate the block and marks the failed DataNode as offline. After the failed DataNode recovers, the incomplete replica data it had copied is deleted.

1.2.3 HDFS小文件处理
1.2.3 HDFS Small File Handling

1)会有什么影响
1) What impact will it have?

(1)存储层面
(1) Storage level

1个文件块,占用NameNode内存约150字节
One file block occupies roughly 150 bytes of NameNode memory.

128G能存储多少文件块? 128G * 1024M * 1024KB * 1024Byte / 150字节 ≈ 9.1亿个文件块
How many file blocks can 128 GB hold? 128G * 1024M * 1024KB * 1024Byte / 150 bytes ≈ 910 million file blocks

(2)计算层面
(2) Calculation level

每个小文件都会起到一个MapTask,1个MapTask默认内存1G。浪费资源。
Each small file spawns its own MapTask, and each MapTask takes 1 GB of memory by default — a waste of resources.

2)怎么解决
2) How to solve

(1)采用har归档方式,将小文件归档
(1) Use har filing method to file small files

(2)采用CombineTextInputFormat
(2) Use CombineTextInputFormat (a driver sketch is given at the end of this subsection)

(3)自己写一个MR程序将产生的小文件合并成一个大文件。如果是Hive或者Spark有merge功能自动帮助我们合并。
(3) Write your own MR program to merge the small files generated into one large file. If it's Hive or Spark, there's a merge feature that automatically helps us merge.

(4)有小文件场景开启JVM重用;如果没有小文件,不要开启JVM重用,因为会一直占用使用到的Task卡槽,直到任务完成才释放
(4) Enable JVM reuse when there are many small files; if there are no small files, do not enable JVM reuse, because it keeps holding the task slots it has used and only releases them when the job completes.

JVM重用可以使得JVM实例在同一个job中重新使用N次,N的值可以在Hadoop的mapred-site.xml文件中进行配置。通常在10-20之间。
JVM reuse enables JVM instances to be reused N times in the same job, and the value of N can be configured in Hadoop's mapred-site.xml file. Usually between 10-20.

<property>

<name>mapreduce.job.jvm.numtasks</name>

<value>10</value>

<description>How many tasks to run per jvm,if set to -1 ,there is no limit</description>

</property>
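
For solution (2) above, here is a hedged driver-side sketch of how CombineTextInputFormat might be wired into an MR job (the class name and the 4 MB split size are illustrative assumptions; only the relevant settings are shown):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "combine-small-files");

        // let many small files share one MapTask instead of one MapTask per file
        job.setInputFormatClass(CombineTextInputFormat.class);
        // maximum size of a virtual split: 4 MB here (tune to the actual file sizes)
        CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);

        // mapper/reducer classes and input/output paths would be set here as usual
    }
}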

1.2.4 HDFS的NameNode内存

1)Hadoop2.x系列,配置NameNode默认2000m
1) Hadoop 2.x series, configuration NameNode default 2000m

2)Hadoop3.x系列,配置NameNode内存是动态分配的
2) Hadoop 3.x series, configuration NameNode memory is dynamically allocated

NameNode内存最小值1G,每增加100万个文件block,增加1G内存。
NameNode memory minimum of 1G, each increase of 1 million file blocks, an increase of 1G memory.

1.2.5 Shuffle及优化
1.2.5 Shuffle and optimization

1.2.6 Yarn工作机制
1.2.6 Yarn working mechanism

1.2.7 Yarn调度器
1.2.7 Yarn Scheduler

1)Hadoop调度器主要分为三类
1) Hadoop schedulers fall into three main categories

FIFO、Capacity Scheduler(容量调度器)和Fair Sceduler(公平调度器)。
FIFO, Capacity Scheduler, and Fair Scheduler.

Apache默认的资源调度器是容量调度器
Apache's default resource scheduler is Capacity Scheduler.

CDH默认的资源调度器是公平调度器。
The default resource scheduler for CDH is the fair scheduler.

2)区别
2) Difference

FIFO调度器:支持单队列、先进先出,生产环境不会用。
FIFO scheduler: single queue, first-in-first-out; not used in production.

容量调度器:支持多队列。队列资源分配,优先选择资源占用率最低的队列分配资源;作业资源分配,按照作业的优先级和提交时间顺序分配资源;容器资源分配,本地原则(同一节点/同一机架/不同节点不同机架)。
Capacity Scheduler: Supports multiple queues. Queue resource allocation: priority is given to the queue with the lowest resource occupancy rate; job resource allocation: resources are allocated according to job priority and submission time order; container resource allocation: local principle (same node/same rack/different nodes and different racks)

公平调度器:支持多队列,保证每个任务公平享有队列资源资源不够时可以按照缺额分配。
Fair scheduler: Support multiple queues to ensure that each task has equal access to queue resources. When resources are insufficient, they can be allocated according to the shortfall.

3)在生产环境下怎么选择?
3) How to choose in the production environment?

大厂:如果对并发度要求比较高,选择公平,要求服务器性能必须OK。
Large companies: if the concurrency requirement is high, choose the fair scheduler; this requires the servers to have adequate performance.

中小公司,集群服务器资源不太充裕选择容量。
Small and medium-sized companies, whose cluster resources are less abundant, choose the capacity scheduler.

4)在生产环境怎么创建队列?
4) How do you create queues in a production environment?

(1)调度器默认就1个default队列,不能满足生产要求。
(1) The scheduler defaults to one default queue, which cannot meet the production requirements.

2)按照部门:业务部门1、业务部门2。
(2) According to departments: business department 1, business department 2.

3)按照业务模块:登录注册、购物车、下单。
(3) According to the business module: login registration, shopping cart, order.

5)创建多队列的好处?
5) What are the benefits of creating multiple queues?

(1)因为担心员工不小心,写递归死循环代码,把所有资源全部耗尽。
(1) To guard against an employee carelessly writing recursive or infinite-loop code that exhausts all the resources.

(2)实现任务的降级使用,特殊时期保证重要的任务队列资源充足。
(2) Implement the degraded use of tasks, and ensure that important task queue resources are sufficient in special periods.

业务部门1(重要)=》业务部门2(比较重要)=》下单(一般)=》购物车(一般)=》登录注册(次要)
Business Department 1 (important) => Business Department 2 (fairly important) => Order placement (normal) => Shopping cart (normal) => Login/registration (minor)

1.2.8 HDFS块大小
1.2.8 HDFS Block Size

1)块大小
1) Block size

1.x 64m

2.x 3.x 128m

本地 32m
Local 32m

企业 128m 256m 512m
Enterprise: 128m, 256m, or 512m

2)块大小决定因素
2) Block Size Determinants

磁盘读写速度
disk read/write speed

普通的机械硬盘 100m/s => 128m
Ordinary mechanical hard disk 100m/s => 128m

固态硬盘普通的 300m/s => 256m
Solid state drive ordinary 300m/s => 256m

内存镜像 500-600m/s => 512m
Memory Mirrors 500-600m/s => 512m

1.2.9 Hadoop脑裂原因及解决办法?
1.2.9 Causes of and Solutions for Hadoop Split-Brain?

1)出现脑裂的原因
1) Causes of split-brain

Leader出现故障,系统开始改朝换代,当Follower完成全部工作并且成为Leader后,原Leader又复活了(它的故障可能是暂时断开或系统暂时变慢,不能及时响应,但其NameNode进程还在),并且由于某种原因它对应的ZKFC并没有把它设置为Standby,所以原Leader还认为自己是Leader,客户端向它发出的请求仍会响应,于是脑裂就发生了。
When the Leader fails, the system elects a new one. After a Follower finishes the takeover and becomes the Leader, the original Leader comes back to life (its failure may have been a temporary disconnect or a temporary slowdown that kept it from responding in time, while its NameNode process was still alive). If, for some reason, its ZKFC did not switch it to Standby, the original Leader still believes it is the Leader and keeps answering client requests, and split-brain occurs.

2)Hadoop通常不会出现脑裂。
2) Hadoop usually does not experience split-brain.

如果出现脑裂,意味着多个Namenode数据不一致,此时只能选择保留其中一个的数据。例如:现在有三台Namenode,分别为nn1、nn2、nn3,出现脑裂,想要保留nn1的数据,步骤为:
If split-brain occurs, it means the data of multiple NameNodes is inconsistent, and only one NameNode's data can be kept. For example, with three NameNodes nn1, nn2, and nn3, if split-brain occurs and you want to keep nn1's data, the steps are:

(1)关闭nn2和nn3
(1) Close nn2 and nn3

(2)在nn2和nn3节点重新执行数据同步命令:hdfs namenode -bootstrapStandby
(2) Re-execute the data synchronization command at nn2 and nn3 nodes: hdfs namenode -bootstrapStandby

(3)重新启动nn2和nn3
(3) Restart nn2 and nn3

1.3 Zookeeper

1.3.1 常用命令
1.3.1 Common commands

ls、get、create、delete、deleteall

1.3.2 选举机制
1.3.2 Electoral mechanisms

半数机制(过半机制):2n + 1,安装奇数台。
Majority (more-than-half) mechanism: 2n + 1, i.e., install an odd number of servers.

10服务器:3台。
10 servers: 3.

20台服务器:5台。
20 servers: 5.

100台服务器:11台。
100 servers: 11.

台数多,好处:提高可靠性;坏处:影响通信延时。
More numbers, advantages: improve reliability; disadvantages: affect communication delay.

1.3.3 Zookeeper符合法则中哪两个?
1.3.3 Which two of the principles does Zookeeper satisfy?

1.3.4 Zookeeper脑裂

Zookeeper采用过半选举机制,防止了脑裂。
Zookeeper uses a majority voting mechanism to prevent brain splitting.

1.3.5 Zookeeper用来干嘛了
1.3.5 What is Zookeeper Used For?

(1)作为HA的协调者:如 HDFS的HA、YARN的HA。
(1) As HA coordinator: such as HA of HDFS, HA of YARN.

(2)被组件依赖:如Kafka、HBase、CK。
(2) Dependent on components: such as Kafka, HBase, CK.

1.4 Flume

1.4.1 Flume组成,Put事务,Take事务
1.4.1 Flume Composition, Put Transaction, Take Transaction

1)Taildir Source

(1)断点续传、多目录
(1) Breakpoint continuation, multi-directory

(2)taildir底层原理
(2) taildir underlying principle

(3)Taildir挂了怎么办?
(3) What if Taildir crashes?

不会丢数:断点续传
No loss: breakpoint resume

重复数据:有可能
Duplicate data: Possible

(4)存在的问题及解决方案
(4) Existing problems and solutions

①问题:
1 Question:

新文件判断条件 = iNode值 + 绝对路径(包含文件名)
New file judgment condition = iNode value + absolute path (including file name)

日志框架凌晨修改了文件名称=》导致会再次重读一次昨天产生的数据
The logging framework renames the file around midnight => this causes yesterday's data to be re-read.

②解决:
② Solution:

方案一:建议生成的文件名称为带日期的。同时配置日志生成框架为不更名的;
Option 1: It is recommended that the generated file name be dated. At the same time, configure the log generation framework to be unrenamed;

方案二:修改TairDirSource源码,只按照iNode值去确定文件
Option 2: Modify the TairDirSource source code, only determine the file according to the iNode value

修改源码视频地址:
Modify source video address:

https://www.bilibili.com/video/BV1wf4y1G7EQ?p=14&vd_source=891aa1a363111d4914eb12ace2e039af

2)file channel /memory channel/kafka channel

(1)File Channel

数据存储于磁盘,优势:可靠性高;劣势:传输速度低
Data is stored on disk, advantages: high reliability; disadvantages: low transmission speed

默认容量:100万个event
Default capacity: 1 million events

注意:FileChannel可以通过配置dataDirs指向多个路径,每个路径对应不同的硬盘,增大Flume吞吐量。
Note: FileChannel can increase Flume throughput by configuring dataDirs to point to multiple paths, each path corresponding to a different hard disk.

(2)Memory Channel

数据存储于内存,优势:传输速度快;劣势:可靠性差
Data storage in memory, advantages: fast transmission speed; disadvantages: poor reliability

默认容量:100个event
Default capacity: 100 events

(3)Kafka Channel

数据存储于Kafka,基于磁盘;
Data stored in Kafka, disk-based;

优势:可靠性高;
Advantages: high reliability;

传输速度快 Kafka Channel 大于Memory Channel + Kafka Sink 原因省去了Sink阶段
Fast transmission: Kafka Channel is faster than Memory Channel + Kafka Sink because the Sink stage is eliminated.

4)生产环境如何选择
(4) How to choose the production environment

如果下一级是Kafka,优先选择Kafka Channel。
If the next level is Kafka, Kafka Channel is preferred.

如果是金融、对钱要求准确的公司,选择File Channel。
For finance or other companies that require exact accuracy for money, choose File Channel.

如果就是普通的日志,通常可以选择Memory Channel。
If it is a normal log, you can usually choose Memory Channel.

每天丢几百万数据 pb级 亿万富翁,掉1块钱会捡?
Losing a few million records a day out of PB-scale data is like a billionaire dropping one yuan — would he bother to pick it up?

3)HDFS Sink

(1)时间(半个小时) or 大小128m 且 设置Event个数等于0,该值默认10
(1) Time (half an hour) or Size 128m and set the number of Events equal to 0, which defaults to 10

具体参数:hdfs.rollInterval=1800,hdfs.rollSize=134217728 hdfs.rollCount=0

4)事务
4) Transactions

Source到Channel是Put事务
Source to Channel is the Put transaction

Channel到Sink是Take事务
Channel to Sink is the Take transaction

1.4.2 Flume拦截器
1.4.2 Flume interceptor

1)拦截器注意事项
1) Interceptor precautions

1)时间戳拦截器:主要是解决零点漂移问题
(1) Timestamp interceptor: mainly to solve the zero drift problem

2)自定义拦截器步骤
2) Custom interceptor steps

(1)实现 Interceptor

(2)重写四个方法
(2) Rewrite four methods

initialize 初始化
initialize

public Event intercept(Event event) 处理单个Event
public Event intercept(Event event): processes a single Event

public List<Event> intercept(List<Event> events) 处理多个Event,在这个方法中调用Event intercept(Event event)
public List<Event> intercept(List<Event> events): processes a batch of Events; this method calls intercept(Event event) for each one

close方法
close method

(3)静态内部类,实现Interceptor.Builder
(3) Static inner class implementing Interceptor.Builder (a sketch of the complete interceptor follows below)
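
A minimal sketch that follows the three steps above (the class name EtlInterceptor and the empty-body filtering rule are assumptions for illustration, not part of the original material):

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.Iterator;
import java.util.List;

public class EtlInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // nothing to initialize in this sketch
    }

    // Process a single Event: return null to discard it.
    @Override
    public Event intercept(Event event) {
        byte[] body = event.getBody();
        return (body == null || body.length == 0) ? null : event;
    }

    // Process a batch of Events by delegating to the single-Event method.
    @Override
    public List<Event> intercept(List<Event> events) {
        Iterator<Event> it = events.iterator();
        while (it.hasNext()) {
            if (intercept(it.next()) == null) {
                it.remove();
            }
        }
        return events;
    }

    @Override
    public void close() {
        // nothing to release
    }

    // Static inner class used by Flume to construct the interceptor.
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new EtlInterceptor();
        }

        @Override
        public void configure(Context context) {
            // read interceptor properties from the agent configuration if needed
        }
    }
}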

3)拦截器可以不用吗?
3) Can the interceptor be omitted?

时间戳拦截器建议使用。如果不用需要采用延迟15-20分钟处理数据的方式,比较麻烦。
The timestamp interceptor is recommended. Without it, you would have to delay processing the data by 15-20 minutes, which is more troublesome.

1.4.3 Flume Channel选择器

Replicating:默认选择器。功能:将数据发往下一级所有通道。
Replicating: Default selector. Function: Send data to all channels at the next level.

Multiplexing:选择性发往指定通道。
Multiplexing: selective routing to specified channels.

1.4.4 Flume监控器
1.4.4 Flume Monitor

1)监控到异常现象
1) Monitoring abnormal phenomenon

采用Ganglia监控器,监控到Flume尝试提交的次数远远大于最终成功的次数,说明Flume运行比较差。主要是内存不够导致的。
Using the Ganglia monitor, if the number of attempted commits is far greater than the number of eventual successes, Flume is running poorly, usually because of insufficient memory.

2)解决办法?
2) The solution?

(1)自身:默认内存是20m,考虑增加flume内存,在flume-env.sh配置文件中修改flume内存为 4-6g
(1) itself: default memory is 20m, consider increasing flume memory, modify flume memory to 4-6g in flume-env.sh configuration file

(2)找朋友:增加服务器台数
(2) Find friends: increase the number of servers

搞活动 618 =》增加服务器 =》用完再退掉
During promotions such as 618 => add servers => release them when finished

日志服务器配置:8-16g内存、磁盘8T
Log server configuration: 8-16g memory, 8T disk

1.4.5 Flume采集数据会丢失吗?
1.4.5 Can Flume lose collected data?

如果是kafka channel 或者FileChannel不会丢失数据,数据存储可以存储在磁盘中。
With Kafka Channel or File Channel, data will not be lost, because the data is persisted to disk.

如果是MemoryChannel有可能丢。
With Memory Channel, data may be lost.

1.4.6 Flume如何提高吞吐量
1.4.6 How Flume Increases Throughput

调整taildir sourcebatchSize大小可以控制吞吐量,默认大小100个Event。
Adjust the batchSize of taildir source to control throughput, the default size is 100 Events.

吞吐量的瓶颈一般是网络带宽。
The bottleneck in throughput is usually network bandwidth.

1.5 Kafka

1.5.1 Kafka架构

生产者、Broker、消费者、Zookeeper。
Producer, Broker, Consumer, Zookeeper.

注意:Zookeeper中保存Broker id和controller等信息,但是没有生产者信息。
Note: Zookeeper stores Broker id and controller information, but no producer information.

1.5.2 Kafka生产端分区分配策略
1.5.2 Kafka production partition allocation strategy

Kafka官方为我们实现了三种Partitioner(分区器),分别是DefaultPartitioner(当未指定分区器时候所使用的默认分区器)、UniformStickyPartitioner、RoundRobinPartitioner。
Kafka provides three built-in Partitioners: DefaultPartitioner (the default used when no partitioner is specified), UniformStickyPartitioner, and RoundRobinPartitioner.

1)DefaultPartitioner默认分区器
1) DefaultPartitioner (default partitioner)

下图说明了默认分区器的分区分配策略:
The following figure illustrates the partition allocation policy for the default partitioner:

2)UniformStickyPartitioner纯粹的粘性分区器
2) UniformStickyPartitioner Pure sticky partitioner

(1)如果指定了分区号,则会按照指定的分区号进行分配
(1) If the partition number is specified, it will be allocated according to the specified partition number.

(2)若没有指定分区号,则使用粘性分区器
(2) If no partition number is specified, the sticky partitioner is used

3)RoundRobinPartitioner轮询分区器
3) RoundRobinPartitioner (round-robin partitioner)

(1)如果在消息中指定了分区则使用指定分区。
(1) If a partition is specified in the message, the specified partition is used.

(2)如果未指定分区,都会将消息轮询每个分区,将数据平均分配到每个分区中。
(2) If no partition is specified, the message polls each partition, distributing the data evenly among each partition.

4)自定义分区器
4) Custom Partitioner

自定义分区策略:可以通过实现 org.apache.kafka.clients.producer.Partitioner 接口,重写 partition 方法来达到自定义分区效果。
Custom partitioning policies: Custom partitioning can be achieved by implementing the org.apache.kafka.clients.producer.Partitioner interface and overriding the partition method.

例如我们想要实现随机分配,只需要以下代码:
For example, if we want to implement random allocation, we only need the following code:

List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);

return ThreadLocalRandom.current().nextInt(partitions.size());

先计算出该主题总的分区数,然后随机地返回一个小于它的正整数。
Compute the total number of partitions for the topic and return a positive integer less than it randomly.
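
Wrapped in a full Partitioner implementation, the two lines above would look roughly like the following sketch (the class name RandomPartitioner is an assumption for illustration):

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

public class RandomPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // look up all partitions of the topic and pick one at random
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        return ThreadLocalRandom.current().nextInt(partitions.size());
    }

    @Override
    public void close() {
        // nothing to release
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }
}

The producer would then pick it up via the partitioner.class setting (ProducerConfig.PARTITIONER_CLASS_CONFIG).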

在项目中,如果希望把MySQL中某张表的数据发送到一个分区。可以以表名为key进行发送。
In a project, if you want all the data of a certain MySQL table to go to one partition, send the records with the table name as the key.

1.5.3 Kafka丢不丢数据
1.5.3 Does Kafka lose data?

1)Producer角度
1) From the Producer's perspective

acks=0,生产者发送过来数据就不管了,可靠性差,效率高;
acks=0: the producer does not wait for any acknowledgement after sending; poor reliability, high efficiency;

acks=1,生产者发送过来数据Leader应答,可靠性中等,效率中等;
acks=1: the Leader acknowledges the data after receiving it; medium reliability, medium efficiency;

acks=-1,生产者发送过来数据Leader和ISR队列里面所有Follower应答,可靠性高,效率低;
acks=-1: the Leader and all Followers in the ISR acknowledge the data; high reliability, low efficiency;

在生产环境中,acks=0很少使用;acks=1,一般用于传输普通日志,允许丢个别数据;acks=-1,一般用于传输和钱相关的数据,对可靠性要求比较高的场景。
In the production environment, acks=0 is rarely used;acks=1, generally used to transmit ordinary logs, allowing individual data to be lost;acks=-1, generally used to transmit data related to money, and for scenarios with high reliability requirements.

2)Broker角度
2) Broker angle

副本数大于等于2。
The number of copies is greater than or equal to 2.

min.insync.replicas大于等于2。
min.insync.replicas is greater than or equal to 2.

1.5.4 Kafka的ISR副本同步队列
1.5.4 ISR replica synchronization queue for Kafka

ISR(In-Sync Replicas),副本同步队列。如果Follower长时间未向Leader发送通信请求或同步数据,则该Follower将被踢出ISR。该时间阈值由replica.lag.time.max.ms参数设定,默认30s
ISR (In-Sync Replicas), replica synchronization queue. If a Follower does not send a communication request or synchronization data to the Leader for a long time, the Follower will be kicked out of the ISR. This time threshold is set by the replica.lag.time.max.ms parameter and defaults to 30s.

任意一个维度超过阈值都会把Follower剔除出ISR,存入OSR(Outof-Sync Replicas)列表,新加入的Follower也会先存放在OSR中。
Any dimension exceeding the threshold will remove the Followers from the ISR and store them in the OSR (Outof-Sync Replicas) list. The newly added Followers will also be stored in the OSR first.

Kafka分区中的所有副本统称为AR = ISR + OSR
All copies in the Kafka partition are collectively referred to as AR = ISR + OSR

1.5.5 Kafka数据重复
1.5.5 Duplicate Kafka data

去重 = 幂等性 + 事务
Deduplication = idempotent + transaction

1)幂等性原理
1) The idempotent principle

2)幂等性配置参数
2) Idempotent configuration parameters

参数名称
Parameter name

描述
Description

enable.idempotence

是否开启幂等性,默认true,表示开启幂等性。
Whether idempotent is enabled, default true, means idempotent is enabled.

max.in.flight.requests.per.connection

1.0.X版本前,需设置为1,1.0.X之后,小于等于5
Before version 1.0.X, it needs to be set to 1. After version 1.0.X, it should be less than or equal to 5.

retries

失败重试次数,需要大于0
Failed retry times, greater than 0

acks

需要设置为all
Need to be set to all
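
Here is a hedged producer-side sketch that sets the four parameters from the table above (the broker address hadoop102:9092 is taken from the other examples in this document; the class name is illustrative):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class IdempotentProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // parameters from the table above
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);           // enable.idempotence
        props.put(ProducerConfig.ACKS_CONFIG, "all");                        // acks = all
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);         // retries > 0
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);  // <= 5 after 1.0.X

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.close();
    }
}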

3)Kafka的事务一共有如下5个API
3) Kafka transactions provide the following five APIs

// 1初始化事务
// 1 Initialize transactions

void initTransactions();

// 2开启事务
// 2 Open transaction

void beginTransaction() throws ProducerFencedException;

// 3在事务内提交已经消费的偏移量(主要用于消费者)
// 3 Commit consumed offsets within transactions (primarily for consumers)

void sendOffsetsToTransaction(Map<TopicPartition, OffsetAndMetadata> offsets,

String consumerGroupId) throws ProducerFencedException;

// 4提交事务
// 4 Commit the transaction

void commitTransaction() throws ProducerFencedException;

// 5放弃事务(类似于回滚事务的操作)
// 5 Discard transaction (similar to rollback transaction)

void abortTransaction() throws ProducerFencedException;
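
A minimal usage sketch combining APIs 1, 2, 4, and 5 above (API 3, sendOffsetsToTransaction, is used in consume-transform-produce loops and is omitted here; the topic name "first" and the transactional.id are assumptions):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // transactions require a unique transactional.id (which also enables idempotence)
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "tx-demo-01");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();            // 1 initialize transactions
        try {
            producer.beginTransaction();        // 2 begin the transaction
            producer.send(new ProducerRecord<>("first", "hello"));
            producer.commitTransaction();       // 4 commit the transaction
        } catch (Exception e) {
            producer.abortTransaction();        // 5 abort (roll back) the transaction
        } finally {
            producer.close();
        }
    }
}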

4)总结
4) Summary

(1)生产者角度
1) Producer's perspective

acks设置为-1 (acks=-1)。
acks is set to-1 (acks=-1).

幂等性(enable.idempotence = true) + 事务
Idempotence (enable.idempotence = true) + transactions

(2)broker服务端角度
(2) Broker-side perspective

分区副本大于等于2(--replication-factor 2)。
The partition replication factor is greater than or equal to 2 (--replication-factor 2).

ISR里应答的最小副本数量大于等于2 (min.insync.replicas = 2)。
The minimum number of copies of a response in an ISR is greater than or equal to 2 (min.insync.replicas = 2).

(3)消费者
(3) Consumers

事务 + 手动提交offset(enable.auto.commit = false)。
Transaction + manual commit offset (enable.auto.commit = false).

消费者输出的目的地必须支持事务(MySQLKafka)。
The destination of consumer output must support transactions (MySQL, Kafka).
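
A hedged consumer-side sketch of the manual-offset-commit part (enable.auto.commit = false); the topic "first" and group id are assumptions, and the transactional write to the sink is only indicated by a comment:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualCommitConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // turn off auto commit so the offset is committed only after the sink write succeeds
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("first"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // write the record to a transactional sink (e.g. MySQL) here
            }
            consumer.commitSync();   // commit offsets only after processing succeeded
        }
    }
}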

1.5.6 Kafka如何保证数据有序or怎么解决乱序
1.5.6 Kafka How to keep data in order or how to solve disorder

1)Kafka 最多只保证单分区内的消息是有序的,所以如果要保证业务全局严格有序,就要设置 Topic 为单分区。
1) Kafka only ensures that messages in a single partition are ordered at most, so if you want to ensure strict order in the global business, you must set Topic to a single partition.

2)如何保证单分区内数据有序?
2) How to ensure that the data in a single partition is in order?

注:幂等机制保证数据有序的原理如下:
Note: The idempotent mechanism ensures that data is ordered as follows:

1.5.7 Kafka分区Leader选举规则
1.5.7 Kafka Divisional Leader Election Rules

ISR中存活为前提,按照AR中排在前面的优先。例如AR[1,0,2],ISR[1,0,2],那么Leader就会按照1,0,2的顺序轮询。
Being alive in the ISR is the prerequisite; priority follows the order in the AR. For example, with AR[1,0,2] and ISR[1,0,2], the Leader is chosen by polling in the order 1, 0, 2.

1.5.8 Kafka中AR的顺序
1.5.8 Order of AR in Kafka

如果Kafka服务器只有4个节点,那么设置Kafka的分区数大于服务器台数,在Kafka底层如何分配存储副本呢?
If Kafka server has only 4 nodes, then set the number of partitions of Kafka to be greater than the number of servers. How to allocate storage replicas at the bottom of Kafka?

1)创建16分区,3个副本
1) Create 16 partitions, 3 copies

(1)创建一个新的Topic,名称为second。
(1) Create a new Topic named second.

[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --create --partitions 16 --replication-factor 3 --topic second

(2)查看分区和副本情况。
(2) Check the partition and copy situation.

[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --describe --topic second

Topic: second4    Partition: 0    Leader: 0    Replicas: 0,1,2    Isr: 0,1,2

Topic: second4    Partition: 1    Leader: 1    Replicas: 1,2,3    Isr: 1,2,3

Topic: second4    Partition: 2    Leader: 2    Replicas: 2,3,0    Isr: 2,3,0

Topic: second4    Partition: 3    Leader: 3    Replicas: 3,0,1    Isr: 3,0,1

Topic: second4    Partition: 4    Leader: 0    Replicas: 0,2,3    Isr: 0,2,3

Topic: second4    Partition: 5    Leader: 1    Replicas: 1,3,0    Isr: 1,3,0

Topic: second4    Partition: 6    Leader: 2    Replicas: 2,0,1    Isr: 2,0,1

Topic: second4    Partition: 7    Leader: 3    Replicas: 3,1,2    Isr: 3,1,2

Topic: second4    Partition: 8    Leader: 0    Replicas: 0,3,1    Isr: 0,3,1

Topic: second4    Partition: 9    Leader: 1    Replicas: 1,0,2    Isr: 1,0,2

Topic: second4    Partition: 10    Leader: 2    Replicas: 2,1,3    Isr: 2,1,3

Topic: second4    Partition: 11    Leader: 3    Replicas: 3,2,0    Isr: 3,2,0

Topic: second4    Partition: 12    Leader: 0    Replicas: 0,1,2    Isr: 0,1,2

Topic: second4    Partition: 13    Leader: 1    Replicas: 1,2,3    Isr: 1,2,3

Topic: second4    Partition: 14    Leader: 2    Replicas: 2,3,0    Isr: 2,3,0

Topic: second4    Partition: 15    Leader: 3    Replicas: 3,0,1    Isr: 3,0,1

1.5.9 Kafka日志保存时间
1.5.9 Kafka log retention time

默认保存7天;生产环境建议3天。
7 days by default; 3 days recommended for production environment.

1.5.10 Kafka过期数据清理
1.5.10 Kafka obsolete data cleanup

日志清理的策略只有delete和compact两种
There are only two log cleaning strategies: delete and compact.

1)delete日志删除:将过期数据删除
1) delete log: delete expired data

log.cleanup.policy = delete ,所有数据启用删除策略
log.cleanup.policy = delete, all data enable delete policy

(1)基于时间:默认打开。以segment中所有记录中的最大时间戳作为该文件时间戳。
(1) Based on time: open by default. Take the largest timestamp of all records in the segment as the timestamp of the file.

(2)基于大小:默认关闭。超过设置的所有日志总大小,删除最早的segment
(2) Based on size: off by default. Exceeds the set total log size and deletes the oldest segment.

log.retention.bytes,默认等于-1,表示无穷大。
log.retention.bytes, which defaults to-1 for infinity.

思考:如果一个segment中有一部分数据过期,一部分没有过期,怎么处理?
Thinking: If part of the data in a segment is expired and part is not expired, what should be done?

2)compact日志压缩
2) Compact log compression

1.5.11 Kafka为什么能高效读写数据
1.5.11 Why Kafka reads and writes data efficiently

1)Kafka本身是分布式集群,可以采用分区技术,并行度高
1) Kafka itself is a distributed cluster, which can adopt partition technology with high parallelism.

2)读数据采用稀疏索引,可以快速定位要消费的数据
2) Read data using sparse index, you can quickly locate the data to be consumed

3)顺序写磁盘
3) Sequential write disk

Kafka的producer生产数据,要写入到log文件中,写的过程是一直追加到文件末端,为顺序写。官网有数据表明,同样的磁盘,顺序写能到600M/s,而随机写只有100K/s。这与磁盘的机械机构有关,顺序写之所以快,是因为其省去了大量磁头寻址的时间。
Data produced by the Kafka producer is written to log files by always appending at the end of the file, i.e., sequential writes. Data on the official website shows that on the same disk, sequential writes can reach 600 MB/s while random writes only reach 100 KB/s. This is related to the disk's mechanical structure: sequential writing is fast because it avoids a large amount of head-seek time.

4)页缓存 + 零拷贝技术
4) Page cache + zero-copy technology

1.5.12 自动创建主题
1.5.12 Automatic Topic Creation

如果Broker端配置参数auto.create.topics.enable设置为true(默认值是true),那么当生产者向一个未创建的主题发送消息时,会自动创建一个分区数为num.partitions(默认值为1)、副本因子为default.replication.factor(默认值为1)的主题。除此之外,当一个消费者开始从未知主题中读取消息时,或者当任意一个客户端向未知主题发送元数据请求时,都会自动创建一个相应主题。这种创建主题的方式是非预期的,增加了主题管理和维护的难度。生产环境建议将该参数设置为false。
If the broker-side parameter auto.create.topics.enable is set to true (the default), then when a producer sends a message to a topic that has not been created, a topic with num.partitions partitions (default 1) and a replication factor of default.replication.factor (default 1) is created automatically. In addition, a topic is also created automatically when a consumer starts reading from an unknown topic or when any client sends a metadata request for an unknown topic. Creating topics in this unplanned way makes topic management and maintenance harder, so production environments are advised to set this parameter to false.

(1)向一个没有提前创建five主题发送数据
(1) Send data to a topic named five that has not been created in advance

[atguigu@hadoop102 kafka]$ bin/kafka-console-producer.sh --bootstrap-server hadoop102:9092 --topic five

>hello world

(2)查看five主题的详情
(2) View the details of the topic five

[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --describe --topic five

1.5.13 副本设定
1.5.13 Number of copies set

一般我们设置成2个或3个,很多企业设置为2个
Usually we set it to 2 or 3, and many companies set it to 2.

副本的优势:提高可靠性;副本劣势:增加了网络IO传输。
Duplication advantage: improved reliability; duplication disadvantage: increased network IO transmission.

1.5.14 Kakfa分区数
1.5.14 Number of Kakfa partitions

(1)创建一个只有1个分区的Topic
(1) Create a Topic with only one partition.

(2)测试这个TopicProducer吞吐量和Consumer吞吐量。
(2) Test the Producer throughput and Consumer throughput of this Topic.

(3)假设他们的值分别是TpTc,单位可以是MB/s
(3) Assuming that their values are Tp and Tc, the unit can be MB/s.

(4)然后假设总的目标吞吐量是Tt,那么分区数 = Tt / minTpTc)。
(4) Then assuming that the total target throughput is Tt, then the number of partitions = Tt / min (Tp, Tc).

例如:Producer吞吐量 = 20m/sConsumer吞吐量 = 50m/s,期望吞吐量100m/s
For example: Producer throughput = 20m/s;Consumer throughput = 50m/s, expected throughput 100m/s;

分区数 = 100 / 20 = 5分区
Number of partitions = 100 / 20 = 5 partitions

分区数一般设置为:3-10
The number of partitions is generally set to 3-10

分区数不是越多越好,也不是越少越好,需要搭建完集群,进行压测,再灵活调整分区个数。
The number of partitions is not the more the better, nor the less the better. It is necessary to build a cluster, carry out pressure testing, and then flexibly adjust the number of partitions.

1.5.15 Kafka增加分区
1.5.15 Kafka adds partitions

1)可以通过命令行的方式增加分区,但是分区数只能增加,不能减少。
1) Partitions can be increased by command line, but the number of partitions can only be increased, not decreased.

2)为什么分区数只能增加,不能减少?
2) Why can the number of partitions only increase and not decrease?

(1)按照Kafka现有的代码逻辑而言,此功能完全可以实现,不过也会使得代码的复杂度急剧增大。
(1) According to Kafka's existing code logic, this function can be implemented completely, but it will also make the complexity of the code increase sharply.

(2)实现此功能需要考虑的因素很多,比如删除掉的分区中的消息该作何处理?
(2) There are many factors to consider to implement this function, such as how to deal with messages in deleted partitions.

如果随着分区一起消失则消息的可靠性得不到保障;
If the partition disappears with it, the reliability of the message is not guaranteed;

如果需要保留则又需要考虑如何保留,直接存储到现有分区的尾部,消息的时间戳就不会递增,如此对于Spark、Flink这类需要消息时间戳(事件时间)的组件将会受到影响;
If it needs to be retained, it needs to consider how to retain it, directly store it at the end of the existing partition, and the timestamp of the message will not increment. Therefore, components such as Spark and Flink that need message timestamp (event time) will be affected.

如果分散插入到现有的分区中,那么在消息量很大的时候,内部的数据复制会占用很大的资源,而且在复制期间,此主题的可用性又如何得到保障?
If you insert them scattered into existing partitions, then internal data replication can be resource-intensive when message volumes are high, and how can the availability of this topic be guaranteed during replication?

同时,顺序性问题、事务性问题、以及分区和副本的状态机切换问题都是不得不面对的。
At the same time, sequential problems, transactional problems, and state machine switching between partitions and replicas have to be faced.

(3)反观这个功能的收益点却是很低,如果真的需要实现此类的功能,完全可以重新创建一个分区数较小的主题,然后将现有主题中的消息按照既定的逻辑复制过去即可。
(3) In contrast, the benefit of such a feature is very low. If it is really needed, you can simply create a new topic with a smaller number of partitions and copy the messages from the existing topic into it according to whatever logic is required.

1.5.16 Kafka多少个Topic
1.5.16 How many topics in Kafka

ODS层:2个
ODS layer: 2

DWD层:20
DWD layers: 20

1.5.17 Kafka消费者是拉取数据还是推送数据
1.5.17 Kafka Does the consumer pull or push data

拉取数据。
Pull data.

1.5.18 Kafka消费端分区分配策略
1.5.18 Kafka Consumer Partition Allocation Policy

粘性分区:
Sticky partition assignment (sticky assignor):

该分区分配算法是最复杂的一种,可以通过 partition.assignment.strategy 参数去设置,从 0.11 版本开始引入,目的就是在执行新分配时,尽量在上一次分配结果上少做调整,其主要实现了以下2个目标:
This partition allocation algorithm is the most complex one, which can be set by partition.assignment.strategy parameter. It has been introduced since version 0.11. The purpose is to make as few adjustments as possible on the previous allocation result when executing new allocation. It mainly achieves the following two goals:

(1)Topic Partition 的分配要尽量均衡。
(1) The distribution of Topic Partition should be balanced as much as possible.

(2)当 Rebalance 发生时,尽量与上一次分配结果保持一致。
(2) When Rebalance occurs, try to keep consistent with the previous allocation result.

注意:当两个目标发生冲突的时候,优先保证第一个目标,这样可以使分配更加均匀,其中第一个目标是3种分配策略都尽量去尝试完成的,而第二个目标才是该算法的精髓所在。
Note: when the two goals conflict, the first goal takes priority, which keeps the assignment as even as possible. The first goal is something all three assignment strategies try to achieve; the second goal is the essence of this algorithm.

1.5.19 消费者再平衡的条件
1.5.19 Conditions for consumer rebalancing

1)Rebalance 的触发条件有三种
1) There are three trigger conditions for Rebalance.

(1)当Consumer Group 组成员数量发生变化(主动加入、主动离组或者故障下线等)。
(1) When the number of members of the Consumer Group changes (actively joining, actively leaving the group or failing to go offline, etc.).

(2)当订阅主题的数量或者分区发生变化。
(2) When the number of subscription topics or partitions changes.

2)消费者故障下线的情况
2) When a consumer goes offline due to a failure

参数名称
name of parameter

描述

session.timeout.ms

Kafka消费者和coordinator之间连接超时时间,默认45s。超过该值,该消费者被移除,消费者组执行再平衡。
Connection timeout between Kafka consumer and coordinator, default 45s. Beyond this value, the consumer is removed and the consumer group performs rebalancing.

max.poll.interval.ms

消费者处理消息的最大时长,默认是5分钟。超过该值,该消费者被移除,消费者组执行再平衡。
The maximum time a consumer can process a message is 5 minutes. Beyond this value, the consumer is removed and the consumer group performs rebalancing.

3)主动加入消费者组
3) Actively join consumer groups

在现有消费者组中增加消费者,也会触发Kafka再平衡。注意,如果下游是Flink,Flink会自己维护offset,不会触发Kafka再平衡。
Adding a consumer to an existing consumer group also triggers a Kafka rebalance. Note that if the downstream is Flink, Flink maintains offsets itself and will not trigger a Kafka rebalance.

1.5.20 指定Offset消费
1.5.20 Specify Offset Consumption

可以在任意offset处消费数据。
Data can be consumed at any offset.

kafkaConsumer.seek(topicPartition, 1000);  // seek() takes a TopicPartition object and a target offset, not a topic name
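
A fuller sketch with the Kafka Java client (the topic name "first", partition 0, target offset 1000, and the consumer group id are illustrative assumptions): assign the partition first, then seek to the desired offset before polling.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekOffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> kafkaConsumer = new KafkaConsumer<>(props)) {
            TopicPartition topicPartition = new TopicPartition("first", 0);
            kafkaConsumer.assign(Collections.singletonList(topicPartition)); // assign (not subscribe) so seek can be called immediately
            kafkaConsumer.seek(topicPartition, 1000L);                        // start consuming from offset 1000
            ConsumerRecords<String, String> records = kafkaConsumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.offset() + " -> " + r.value()));
        }
    }
}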

1.5.21 指定时间消费
1.5.21 Specified time consumption

可以通过时间来消费数据。
Data can be consumed starting from a specified point in time.

HashMap<TopicPartition, Long> timestampToSearch = new HashMap<>();

timestampToSearch.put(topicPartition, System.currentTimeMillis() - 1 * 24 * 3600 * 1000);

kafkaConsumer.offsetsForTimes(timestampToSearch);
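
A minimal continuation of the snippet above (it assumes kafkaConsumer has already been assigned the partitions contained in timestampToSearch): use the offsets returned by offsetsForTimes to seek each partition to the position one day ago before polling.

import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

Map<TopicPartition, OffsetAndTimestamp> offsets = kafkaConsumer.offsetsForTimes(timestampToSearch);
for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : offsets.entrySet()) {
    OffsetAndTimestamp offsetAndTimestamp = entry.getValue();
    if (offsetAndTimestamp != null) {                           // null means no message at or after that timestamp
        kafkaConsumer.seek(entry.getKey(), offsetAndTimestamp.offset());
    }
}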

1.5.22 Kafka监控

公司自己开发的监控器
The company developed its own monitor.

开源的监控器:KafkaManager、KafkaMonitor、KafkaEagle。
Open source monitors: KafkaManager, KafkaMonitor, KafkaEagle.

1.5.23 Kafka数据积压
1.5.23 Kafka data backlog

1)发现数据积压
1) Discovery of data backlog

通过Kafka的监控器Eagle,可以看到消费lag,就是积压情况:
Through Kafka's monitor Eagle, you can see the consumption lag, which is the backlog:

2)解决
2) Solutions

(1)消费者消费能力不足
(1) Insufficient consumption capacity on the consumer side

①可以考虑增加Topic的分区数,并且同时提升消费组的消费者数量,消费者数 = 分区数。(两者缺一不可)
① Consider increasing the number of partitions of the Topic and, at the same time, increasing the number of consumers in the consumer group, so that the number of consumers = the number of partitions. (Both changes are required; one without the other does not work.)

增加分区数;
Increase the number of partitions;

[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --alter --topic first --partitions 3

②提高每批次拉取的数量,提高单个消费者的消费能力。
② Increase the number of pulls per batch and improve the consumption ability of individual consumers.

参数名称
name of parameter

描述

fetch.max.bytes

默认值:52428800(50m)。消费者获取服务器端一批消息最大的字节数。如果服务器端一批次的数据大于该值(50m)仍然可以拉取回来这批数据,因此,这不是一个绝对最大值。一批次的大小受message.max.bytes(broker config)或max.message.bytes(topic config)影响。
Default: 52428800 (50 m). The consumer gets the maximum number of bytes in a batch of server-side messages. If a batch of data on the server side is larger than this value (50m), it can still be pulled back, so this is not an absolute maximum. The size of a batch is affected by message.max.bytes (broker config) or max.message.bytes (topic config).

max.poll.records

一次poll拉取数据返回消息的最大条数,默认是500条
The maximum number of messages returned by pulling data at one time. The default is 500.

(2)消费者处理能力不行
(2) Insufficient processing capacity on the consumer side

①消费者,调整fetch.max.bytes大小,默认是50m。
① Consumers, adjust the size of fetch.max.bytes, the default is 50m.

②消费者,调整max.poll.records大小,默认是500条。
② Consumers, adjust the size of max.poll.records, the default is 500.

如果下游是Spark、Flink等计算引擎,消费到数据之后还要进行计算分析处理,当处理能力跟不上消费能力时,会导致背压的出现,从而使消费的速率下降。
If the downstream is computing engines such as Spark and Flink, calculation and analysis processing shall be carried out after data consumption. When the processing capacity cannot keep up with the consumption capacity, backpressure will occur, thus reducing the consumption rate.

需要对计算性能进行调优(看Spark、Flink优化)。
Compute performance needs to be tuned (see Spark, Flink optimization).

(3)消息积压后如何处理
(3) How to deal with the backlog of messages

某时刻,突然开始积压消息且持续上涨。这种情况下需要你在短时间内找到消息积压的原因,迅速解决问题。
At some point, the backlog suddenly starts and continues to rise. This situation requires you to find the cause of the backlog in a short time and quickly solve the problem.

导致消息积压突然增加,只有两种:发送变快了或者消费变慢了。
A sudden increase in the message backlog can only have two causes: messages are being sent faster, or they are being consumed more slowly.

假如赶上大促或者抢购时,短时间内不太可能优化消费端的代码来提升消费性能,此时唯一的办法是通过扩容消费端的实例数来提升总体的消费能力。如果短时间内没有足够的服务器资源进行扩容,只能降级一些不重要的业务,减少发送方发送的数据量,最低限度让系统还能正常运转,保证重要业务服务正常。
If you catch up with big promotion or rush purchase, it is unlikely to optimize the code of consumer end to improve consumption performance in a short time. At this time, the only way is to improve the overall consumption ability by expanding the number of instances of consumer end. If there is not enough server resources for expansion in a short time, it can only downgrade some unimportant services, reduce the amount of data sent by the sender, and at least make the system work normally to ensure that important services are normal.

假如通过内部监控到消费变慢了,需要你检查消费实例,分析一下是什么原因导致消费变慢?
If internal monitoring shows that consumption has slowed down, you need to check the consumer instances and analyze what is causing the slowdown.

优先查看日志是否有大量的消费错误。
① Prioritize checking logs to see if there are a lot of consumption errors.

此时如果没有错误的话,可以通过打印堆栈信息,看一下你的消费线程卡在哪里「触发死锁或者卡在某些等待资源」。
② At this time, if there is no error, you can print the stack information to see where your consumption thread is stuck "triggering deadlock or stuck in some waiting resources".

1.5.24 如何提升吞吐量
1.5.24 How to improve throughput

如何提升吞吐量?
How to improve throughput?

1)提升生产吞吐量
1) Increase production throughput

(1)buffer.memory:发送消息的缓冲区大小,默认值是32m,可以增加到64m。
(1) buffer.memory: the size of the producer's send buffer. The default is 32m and it can be increased to 64m.

(2)batch.size:默认是16k。如果batch设置太小,会导致频繁网络请求,吞吐量下降;如果batch太大,会导致一条消息需要等待很久才能被发送出去,增加网络延时。
(2) batch.size: the default is 16k. If the batch is set too small, frequent network requests lower throughput; if it is too large, a message may wait a long time before being sent, increasing network latency.

(3)linger.ms:这个值默认是0,意思就是消息必须立即被发送。一般设置为5-100毫秒。如果linger.ms设置的太小,会导致频繁网络请求,吞吐量下降;如果linger.ms太长,会导致一条消息需要等待很久才能被发送出去,增加网络延时。
(3) linger.ms: the default is 0, meaning messages must be sent immediately. It is usually set to 5-100 ms. If linger.ms is too small, frequent network requests lower throughput; if it is too long, a message may wait a long time before being sent, increasing network latency.

(4)compression.type:默认是none,不压缩,但是也可以使用lz4压缩,效率还是不错的,压缩之后可以减小数据量,提升吞吐量,但是会加大producer端的CPU开销。
(4) compression.type: the default is none (no compression), but lz4 compression can be used with good efficiency. Compression reduces the data volume and improves throughput, at the cost of extra CPU overhead on the producer side.

2)增加分区
2) Increase the number of partitions

3)消费者提高吞吐量
3) Increase consumer throughput

(1)调整fetch.max.bytes大小,默认是50m。
(1) Adjust the size of fetch.max.bytes, the default is 50m.

(2)调整max.poll.records大小,默认是500条。
(2) Adjust the size of max.poll.records, the default is 500.
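
A sketch of where the parameters above are set on the client side (the values mirror the suggestions in this section; the broker address is the one used elsewhere in this document, and the exact numbers are assumptions that should be verified by stress testing):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

// Producer side
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
producerProps.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64 * 1024 * 1024L);   // buffer.memory: 32m -> 64m
producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);              // batch.size: raised from the 16k default (example value)
producerProps.put(ProducerConfig.LINGER_MS_CONFIG, 50);                      // linger.ms: somewhere in the 5-100 ms range
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");            // compression.type

// Consumer side
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
consumerProps.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 100 * 1024 * 1024); // fetch.max.bytes: 50m -> 100m
consumerProps.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);             // max.poll.records: 500 -> 1000 (example value)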

1.5.25 Kafka中数据量计算
1.5.25 Calculation of data volume in Kafka

每天总数据量100g,每天产生1亿条日志,1亿/24/60/60 ≈ 1150条/秒钟
Total data volume per day: about 100g, i.e., about 100 million log entries per day; 100,000,000 / 24 / 60 / 60 ≈ 1,150 entries per second

平均每秒钟:1150
Average per second: 1,150

低谷每秒钟:50条
Low point per second: 50

高峰每秒钟:1150条 *(2-20倍)= 2300条 - 23000条
Peak per second: 1150 *(2-20 times)= 2300 - 23000

每条日志大小:0.5k - 2k(取1k)
Size of each log: 0.5k - 2k (take 1k)

每秒多少数据量:2.0M - 20MB
How much data per second: 2.0M - 20MB

1.5.26 Kafka如何压测?
1.5.26 How does Kafka pressure test?

用Kafka官方自带的脚本,对Kafka进行压测。
Use the scripts that ship with Kafka to stress test it.

生产者压测:kafka-producer-perf-test.sh

消费者压测:kafka-consumer-perf-test.sh

1)Kafka Producer压力测试
1) Kafka Producer Pressure Test

(1)创建一个test Topic,设置为3个分区3个副本
(1) Create a test Topic, set to 3 partitions and 3 copies

[atguigu@hadoop102 kafka]$ bin/kafka-topics.sh --bootstrap-server hadoop102:9092 --create --replication-factor 3 --partitions 3 --topic test

(2)在/opt/module/kafka/bin目录下面有这两个文件。我们来测试一下
(2) These two files are located in the /opt/module/kafka/bin directory. Let's test with them.

[atguigu@hadoop105 kafka]$ bin/kafka-producer-perf-test.sh --topic test --record-size 1024 --num-records 1000000 --throughput 10000 --producer-props bootstrap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092 batch.size=16384 linger.ms=0

参数说明:
Parameter Description:

record-size是一条信息有多大,单位是字节,本次测试设置为1k。
Record-size is how big a piece of information is, in bytes, and this test is set to 1k.

num-records是总共发送多少条信息,本次测试设置为100万条。
num-records is the total number of messages sent, and this test is set to 1 million.

throughput 是每秒多少条信息,设成-1,表示不限流,尽可能快的生产数据,可测出生产者最大吞吐量。本次实验设置为每秒钟1万条。
throughput is the number of messages per second; setting it to -1 means no rate limit, producing data as fast as possible, which measures the producer's maximum throughput. In this test it is set to 10,000 messages per second.

producer-props 后面可以配置生产者相关参数,batch.size配置为16k
Producer-related parameters can be configured after producer-props, batch.size is configured to 16k.

输出结果:
Outputs:

ap.servers=hadoop102:9092,hadoop103:9092,hadoop104:9092 batch.size=16384 linger.ms=0

37021 records sent, 7401.2 records/sec (7.23 MB/sec), 1136.0 ms avg latency, 1453.0 ms max latency.

。。。 。。。

33570 records sent, 6714.0 records/sec (6.56 MB/sec), 4549.0 ms avg latency, 5049.0 ms max latency.

1000000 records sent, 9180.713158 records/sec (8.97 MB/sec), 1894.78 ms avg latency, 5049.00 ms max latency, 1335 ms 50th, 4128 ms 95th, 4719 ms 99th, 5030 ms 99.9th.

(3)调整batch.size大小
(3) Adjust batch.size

(4)调整linger.ms时间
(4) Adjust linger.ms

(5)调整压缩方式
(5) Adjust the compression type

(6)调整缓存大小
(6) Adjust the buffer size

2)Kafka Consumer压力测试
2) Kafka Consumer Pressure Test

(1)修改/opt/module/kafka/config/consumer.properties文件中的一次拉取条数为500
(1) Modify the number of items pulled at one time in the/opt/module/kafka/config/consumer.properties file to 500

max.poll.records=500

(2)消费100万条日志进行压测
(2) Consumption of 1 million logs for pressure testing

[atguigu@hadoop105 kafka]$ bin/kafka-consumer-perf-test.sh --bootstrap-server hadoop102:9092,hadoop103:9092,hadoop104:9092 --topic test --messages 1000000 --consumer.config config/consumer.properties

参数说明:
Parameter Description:

--bootstrap-server指定Kafka集群地址
--bootstrap-server Specifies the Kafka cluster address

--topic 指定topic的名称
--topic Specifies the name of the topic

--messages 总共要消费的消息个数。本次实验100万条。
--messages The total number of messages to consume; 1 million in this test.

输出结果:
Outputs:

start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec

2022-01-20 09:58:26:171, 2022-01-20 09:58:33:321, 977.0166, 136.6457, 1000465, 139925.1748, 415, 6735, 145.0656, 148547.1418

(3)一次拉取条数调整为2000
(3) Adjust the number of records pulled per poll to 2000

(4)调整fetch.max.bytes大小为100m
(4) Adjust fetch.max.bytes to 100m

1.5.27 磁盘选择
1.5.27 Disk Selection

kafka底层主要是顺序写,固态硬盘和机械硬盘的顺序写速度差不多。
The bottom layer of kafka is mainly sequential writing, and the sequential writing speed of solid state disk and mechanical hard disk is similar.

建议选择普通的机械硬盘。
It is recommended to choose ordinary mechanical hard disk.

每天总数据量:1亿条 * 1k ≈ 100g
Total data volume per day: 100 million * 1k ≈ 100g

100g * 副本2 * 保存时间3天 / 0.7 ≈ 1T
100g * copy 2 * storage time 3 days/ 0.7 ≈ 1T

建议三台服务器硬盘总大小,大于等于1T。
It is recommended that the total hard disk size of three servers be greater than or equal to 1T.

1.5.28 内存选择
1.5.28 Memory Selection

Kafka内存组成:堆内存 + 页缓存
Kafka memory composition: heap memory + page cache

1)Kafka堆内存建议每个节点:10g ~ 15g
1) Recommended Kafka heap memory per node: 10g ~ 15g

在kafka-server-start.sh中修改:
Modify in kafka-server-start.sh:

if [ "x$KAFKA_HEAP_OPTS" = "x" ]; then

export KAFKA_HEAP_OPTS="-Xmx10G -Xms10G"

fi

(1)查看Kafka进程号
(1) Check the Kafka process number

[atguigu@hadoop102 kafka]$ jps

2321 Kafka

5255 Jps

1931 QuorumPeerMain

(2)根据Kafka进程号,查看Kafka的GC情况
(2) According to Kafka process number, check Kafka GC situation

[atguigu@hadoop102 kafka]$ jstat -gc 2321 1s 10

S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT

0.0 7168.0 0.0 7168.0 103424.0 60416.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 60416.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 60416.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 60416.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 60416.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 61440.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 61440.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 61440.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 61440.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

0.0 7168.0 0.0 7168.0 103424.0 61440.0 1986560.0 148433.5 52092.0 46656.1 6780.0 6202.2 13 0.531 0 0.000 0.531

参数说明:
Parameter Description:

YGC:年轻代垃圾回收次数;
YGC: Young generation garbage collection times;

(3)根据Kafka进程号,查看Kafka的堆内存
(3) According to Kafka process number, check Kafka heap memory

[atguigu@hadoop102 kafka]$ jmap -heap 2321

… …

Heap Usage:

G1 Heap:

regions = 2048

capacity = 2147483648 (2048.0MB)

used = 246367744 (234.95458984375MB)

free = 1901115904 (1813.04541015625MB)

11.472392082214355% used

2)页缓存:
2) Page cache:

页缓存是Linux系统服务器的内存。我们只需要保证1个segment(1g)中25%的数据在内存中就好。
The page cache is the memory of the Linux system server. We only need to ensure that 25% of the data in 1 segment (1g) is in memory.

每个节点页缓存大小 =(分区数 * 1g * 25%)/ 节点数。例如10个分区,页缓存大小 =(10 * 1g * 25%)/ 3 ≈ 1g
Page cache size per node =(number of partitions * 1g * 25%)/number of nodes. For example, 10 partitions, page cache size =(10 * 1g * 25%)/ 3 ≈ 1g

建议服务器内存大于等于11G。
Recommended server memory is greater than or equal to 11G.

1.5.29 CPU选择
1.5.29 CPU Selection

1)默认配置
1) Default configuration

num.io.threads = 8 负责写磁盘的线程数。
num.io.threads = 8 Number of threads responsible for writing to disk.

num.replica.fetchers = 1 副本拉取线程数。
num.replica.fetchers = 1 Number of replica pull threads.

num.network.threads = 3 数据传输线程数。
num.network.threads = 3 Number of data transfer threads.

2)建议配置
2) Recommended configuration

此外还有后台的一些其他线程,比如清理数据线程,Controller负责感知和管控整个集群的线程等等,这样算,每个Broker都会有上百个线程存在。根据经验,4核CPU处理几十个线程在高峰期会打满,8核勉强够用,而且再考虑到集群上还要运行其他的服务,所以部署Kafka的服务器一般建议在16核以上可以应对一两百个线程的工作,如果条件允许,给到24核甚至32核就更好。
In addition, there are some other threads in the background, such as cleaning up the data thread, Controller is responsible for sensing and controlling the threads of the entire cluster, etc., so that each Broker will have hundreds of threads. According to experience, 4-core CPU processing dozens of threads will be full at peak times, 8 cores are barely enough, and considering that other services must be run on the cluster, so it is generally recommended that Kafka servers be deployed at more than 16 cores to handle one or two hundred threads of work. If conditions permit, it is better to give 24 cores or even 32 cores.

num.io.threads = 16 负责写磁盘的线程数。
num.io.threads = 16 Number of threads responsible for writing to disk.

num.replica.fetchers = 2 副本拉取线程数。
num.replica.fetchers = 2 Number of replica pull threads.

num.network.threads = 6 数据传输线程数。
num.network.threads = 6 Number of data transfer threads.

服务器建议购买 32核CPU
It is recommended to purchase servers with 32-core CPUs.

1.5.30 网络选择
1.5.30 Network Selection

网络带宽 = 峰值吞吐量 ≈ 20MB/s 选择千兆网卡即可。
Network bandwidth = peak throughput ≈ 20MB/s, select Gigabit NIC.

100Mbps单位是bit;10M/s单位是byte ; 1byte = 8bit,100Mbps/8 = 12.5M/s
100Mbps is measured in bits, 10M/s in bytes; 1 byte = 8 bits, so 100Mbps / 8 = 12.5M/s.

一般百兆的网卡(100Mbps=12.5m/s)、千兆的网卡(1000Mbps=125m/s)、万兆的网卡(1250m/s)。
General 100 Mbps network card (100Mbps=12.5m/s), Gigabit network card (1000Mbps=125m/s), 10 Gigabit network card (1250m/s).


通常选用千兆或者是万兆网卡。
Usually choose gigabit or 10 gigabit network card.

1.5.31 Kafka挂掉

在生产环境中,如果某个Kafka节点挂掉,正常处理办法如下:
In a production environment, if a Kafka node goes down, the normal way to handle it is as follows:

(1)先看日志,尝试重新启动一下,如果能启动正常,那直接解决。
(1) Look at the log first, try to restart it, if it can start normally, then solve it directly.

(2)如果重启不行,检查内存、CPU、网络带宽。先调优,调优不行再增加资源。
(2) If restarting does not work, check memory, CPU, and network bandwidth. Tune first; if tuning does not help, add resources.

(3)如果将Kafka整个节点误删除,如果副本数大于等于2,可以按照服役新节点的方式重新服役一个新节点,并执行负载均衡。
(3) If an entire Kafka node is deleted by mistake and the replication factor is at least 2, you can bring up a new node the same way a new node is commissioned, and then run partition reassignment to rebalance the load.

1.5.32 Kafka的机器数量
1.5.32 Number of machines in Kafka

1.5.33 服役新节点退役旧节点
1.5.33 Commissioning new nodes and decommissioning old nodes

可以通过bin/kafka-reassign-partitions.sh脚本服役和退役节点。
Nodes can be commissioned and decommissioned with the bin/kafka-reassign-partitions.sh script.

1.5.34 Kafka单条日志传输大小
1.5.34 Kafka Single Log Transfer Size

Kafka对于消息体的大小,默认单条最大值是1M。但是在我们的应用场景中,常常会出现一条消息大于1M的情况,如果不对Kafka进行配置,就会出现生产者无法将消息推送到Kafka,或消费者无法消费Kafka里面数据的情况。这时我们就要对Kafka进行以下配置:server.properties。
By default, Kafka limits a single message to 1M. In our scenario, messages larger than 1M often appear. If Kafka is not configured for this, producers may fail to push messages to Kafka, or consumers may fail to consume data from Kafka. In that case the following parameters need to be configured (server.properties):

参数名称
name of parameter

描述

message.max.bytes

默认1m,Broker端接收每个批次消息最大值。
Default 1m; the maximum size of a message batch that the Broker will accept.

max.request.size

默认1m,生产者发往Broker每个请求消息最大值。针对Topic级别设置消息体的大小。
Default 1m, maximum value of each request message sent by producer to Broker. Sets the size of the message body for the Topic level.

replica.fetch.max.bytes

默认1m,副本同步数据,每个批次消息最大值。
Default 1m; the maximum batch size for replica data synchronization.

fetch.max.bytes

默认值:52428800(50m)。消费者获取服务器端一批消息最大的字节数。如果服务器端一批次的数据大于该值(50m)仍然可以拉取回来这批数据,因此,这不是一个绝对最大值。一批次的大小受message.max.bytes(broker config)或max.message.bytes(topic config)影响。
Default: 52428800 (50 m). The consumer gets the maximum number of bytes in a batch of server-side messages. If a batch of data on the server side is larger than this value (50m), it can still be pulled back, so this is not an absolute maximum. The size of a batch is affected by message.max.bytes (broker config) or max.message.bytes (topic config).
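
A minimal producer-side sketch showing where max.request.size from the table above is set in code (the 10m value is an example and must stay consistent with the broker/topic-side parameters):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "hadoop102:9092");
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10 * 1024 * 1024);  // max.request.size: allow single messages up to about 10m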

1.5.35 Kafka参数优化
1.5.35 Kafka parameter optimization

重点调优参数:
Key tuning parameters:

(1)buffer.memory 32m

(2)batch.size:16k

(3)linger.ms默认0 调整 5-100ms

(4)compression.type采用压缩 snappy

5)消费者端调整fetch.max.bytes大小,默认是50m。
(5) The consumer adjusts the fetch.max.bytes size, which defaults to 50m.

6)消费者端调整max.poll.records大小,默认是500条。
(6) The consumer adjusts the size of max.poll.records, the default is 500.

(7)单条日志大小:message.max.bytes、max.request.size、replica.fetch.max.bytes适当调整2-10m

(8)Kafka堆内存建议每个节点:10g ~ 15g
(8) Kafka heap memory recommendations per node: 10g ~ 15g

kafka-server-start.sh中修改

if [ "x$KAFKA_HEAP_OPTS" = "x" ]; then

export KAFKA_HEAP_OPTS="-Xmx10G -Xms10G"

fi

9)增加CPU核数
(9) Increase CPU core count

num.io.threads = 8 负责写磁盘的线程数
num.io.threads = 8 Number of threads responsible for writing to disk

num.replica.fetchers = 1 副本拉取线程数
num.replica.fetchers = 1 Number of replica pull threads

num.network.threads = 3 数据传输线程数
num.network.threads = 3 Number of data transfer threads

(10)日志保存时间log.retention.hours 3
(10) Log retention time log.retention.hours 3 days

(11)副本数,调整为2
(11) Number of copies, adjusted to 2

1.6 Hive

1.6.1 Hive的架构
1.6.1 Hive's architecture

1.6.2 HQL转换为MR流程
1.6.2 HQL to MR Process

(1)解析器(SQLParser):将SQL字符串转换成抽象语法树(AST)
(1) Parser (SQLParser): converts the SQL string into an abstract syntax tree (AST)

(2)语义分析器(Semantic Analyzer):将AST进一步抽象为QueryBlock(可以理解为一个子查询划分成一个QueryBlock)
(2) Semantic Analyzer: further abstracts the AST into QueryBlocks (one subquery can be understood as one QueryBlock)

(3)逻辑计划生成器(Logical Plan Gen):由QueryBlock生成逻辑计划
(3) Logical Plan Generator (Logical Plan Gen): generates a logical plan from the QueryBlocks

(4)逻辑优化器(Logical Optimizer):对逻辑计划进行优化
(4) Logical Optimizer: optimizes the logical plan

(5)物理计划生成器(Physical Plan Gen):根据优化后的逻辑计划生成物理计划
(5) Physical Plan Generator (Physical Plan Gen): generates a physical plan from the optimized logical plan

(6)物理优化器(Physical Optimizer):对物理计划进行优化
(6) Physical Optimizer: optimizes the physical plan

(7)执行器(Execution):执行该计划,得到查询结果并返回给客户端
(7) Executor (Execution): executes the plan, obtains the query result, and returns it to the client

1.6.3 Hive和数据库比较
1.6.3 Hive vs. Database Comparison

Hive 和数据库除了拥有类似的查询语言,再无类似之处。
Hive and databases have nothing in common except a similar query language.

1)数据存储位置
1) Data storage location

Hive 存储在 HDFS 。数据库将数据保存在块设备或者本地文件系统中。
Hive is stored in HDFS. Databases store data in block devices or local file systems.

2)数据更新
2) Data update

Hive中不建议对数据的改写。而数据库中的数据通常是需要经常进行修改的
Hive does not recommend rewriting data. The data in the database usually needs to be modified frequently.

3)执行延迟
3) Implementation delay

Hive 执行延迟较高。数据库的执行延迟较低。当然,这个是有条件的,即数据规模较小,当数据规模大到超过数据库的处理能力的时候,Hive的并行计算显然能体现出优势。
Hive has higher execution latency. Database execution latency is low. Of course, this is conditional, that is, the data size is small, when the data size is large enough to exceed the processing capacity of the database, Hive's parallel computing can obviously show its advantages.

4)数据规模
4) Data size

Hive支持很大规模的数据计算;数据库可以支持的数据规模较小。
Hive supports very large-scale data computations; databases can support smaller data scales.

1.6.4 内部表和外部表
1.6.4 Internal and external tables

元数据、原始数据
Metadata, raw data

1)删除数据时
1) When deleting data

内部表:元数据、原始数据,全删除
Internal table: both the metadata and the raw data are deleted

外部表:只删除元数据
External table: only the metadata is deleted

2)在公司生产环境下,什么时候创建内部表,什么时候创建外部表?
2) In a company production environment, when are internal tables created and when are external tables created?

在公司中绝大多数场景都是外部表。
Most scenarios in a company are external tables.

自己使用的临时表,才会创建内部表;
Internal tables are created only when temporary tables are used by oneself;

1.6.5 系统函数
1.6.5 System functions

1)数值函数
1) numerical function

(1)round:四舍五入;(2)ceil:向上取整;(3)floor:向下取整
(1) round: rounding; (2) ceil: round up; (3) floor: round down

2)字符串函数
2) String functions

(1)substring:截取字符串;(2)replace:替换;(3)regexp_replace:正则替换
(1) substring: intercept string;(2) replace: replace;(3) regexp_replace: regular replacement

(4)regexp:正则匹配;(5)repeat:重复字符串;(6)split:字符串切割
(4) regexp: regular matching;(5) repeat: repeated string;(6) split: string cutting

(7)nvl:替换null值;(8)concat:拼接字符串;
(7) nvl: replace null value;(8) concat: concatenate string;

(9)concat_ws:以指定分隔符拼接字符串或者字符串数组;
(9) concat_ws: concatenates strings or arrays of strings with specified delimiters;

(10)get_json_object:解析JSON字符串
(10) get_json_object: Parse JSON string

3)日期函数
3) Date function

(1)unix_timestamp:返回当前或指定时间的时间戳
unix_timestamp: Returns the timestamp of the current or specified time

(2)from_unixtime:转化UNIX时间戳(从 1970-01-01 00:00:00 UTC 到指定时间的秒数)到当前时区的时间格式
(2) from_unixtime: converts the UNIX timestamp (seconds from 1970-01-01 00:00:00 UTC to the specified time) to the time format of the current time zone

(3)current_date:当前日期
(3) current_date: current date

(4)current_timestamp:当前的日期加时间,并且精确的毫秒
(4) current_timestamp: the current date plus time, and accurate milliseconds

(5)month:获取日期中的月;(6)day:获取日期中的日
(5) month: month of the acquisition date;(6) day: day of the acquisition date

(7)datediff:两个日期相差的天数(结束日期减去开始日期的天数)
datediff: the number of days between two dates (the end date minus the start date)

(8)date_add:日期加天数;(9)date_sub:日期减天数
(8) date_add: date plus days;(9) date_sub: date minus days

(10)date_format:将标准日期解析成指定格式字符串
(10) date_format: Parses a standard date into a string in a specified format

4)流程控制函数
4) Process control function

(1)case when:条件判断函数
(1) case when: conditional judgment function

(2)if:条件判断,类似于Java中三元运算符
(2) if: conditional judgment, similar to the ternary operator in Java

5)集合函数
5) Set function

(1)array:声明array集合
(1) array: declared array collection

(2)map:创建map集合
(2) Map: Create a map collection

(3)named_struct:声明struct的属性和值
(3) named_struct: Declares the attributes and values of struct

(4)size:集合中元素的个数
size: the number of elements in the collection

(5)map_keys:返回map中的key
(5) map_keys: returns the key in the map

(6)map_values:返回map中的value

(7)array_contains:判断array中是否包含某个元素
array_contains: determines whether an array contains an element

(8)sort_array:将array中的元素排序
sort_array: sort elements in an array

6)聚合函数
6) Aggregate function

(1)collect_list:收集并形成list集合,结果不去重
(1) collect_list: collect and form a list collection, the result is not repeated

(2)collect_set:收集并形成set集合,结果去重
(2) collect_set: collect and form a set, and remove the duplicate results

1.6.6 自定义UDF、UDTF函数
1.6.6 Custom UDF, UDTF Functions

1)在项目中是否自定义过UDF、UDTF函数以及用他们处理了什么问题及自定义步骤?
1) Have UDF and UDTF functions been customized in the project, and what problems have been solved with them, and custom steps?

(1)目前项目中逻辑不是特别复杂就没有用自定义UDF和UDTF
(1) At present, if the logic in the project is not particularly complex, custom UDF and UDTF are not used.

(2)自定义UDF:继承GenericUDF,重写核心方法evaluate
(2) Custom UDF: extend GenericUDF and override the core method evaluate

(3)自定义UDTF:继承自GenericUDTF,重写3个方法:initialize(自定义输出的列名和类型),process(将结果返回forward(result)),close
(3) Custom UDTF: inherit from GenericUDTF, rewrite 3 methods: initialize (customize the column name and type of output), process (return the result forward (result)), close
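
A minimal UDTF skeleton in Java illustrating the three methods above (the class name, the single output column "word", and the comma-split behavior are illustrative assumptions, not the project's actual function):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class ExplodeStringUDTF extends GenericUDTF {

    private final List<String> output = new ArrayList<>();

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Define the output column names and types
        List<String> fieldNames = new ArrayList<>();
        List<ObjectInspector> fieldOIs = new ArrayList<>();
        fieldNames.add("word");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Split one input row into multiple output rows and emit each with forward()
        String line = args[0].toString();
        for (String word : line.split(",")) {
            output.clear();
            output.add(word);
            forward(output);
        }
    }

    @Override
    public void close() throws HiveException {
        // No resources to clean up in this sketch
    }
}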

2)企业中一般什么场景下使用UDF/UDTF
2) Under what circumstances do UDF/UDTF usually be used in enterprises?

(1)因为自定义函数,可以将自定函数内部任意计算过程打印输出,方便调试。
(1) Because of the custom function, you can print out any calculation process inside the custom function for debugging.

(2)引入第三方jar包时,也需要。
(2) When introducing third-party jar packages, it is also required.

3)广告数仓中解析IP/UA使用hive的自定义函数
3) Analyze IP/UA in advertising warehouse using hive custom function

解析IP:可以调用接口/ 使用离线的IP2Region数据库
Resolving IP: interfaces can be invoked/offline IP2 Region databases used

解析UA:正则/ Hutool解析UA的工具类
UA Resolution: Regular/ Hutool Tools for UA Resolution

1.6.7 窗口函数
1.6.7 Window functions

一般在场景题中出现手写:分组TopN、行转列、列转行。
Handwritten SQL in scenario questions usually covers: grouped Top-N, rows to columns, and columns to rows.

按照功能,常用窗口可划分为如下几类:聚合函数、跨行取值函数、排名函数。
According to the function, common windows can be divided into the following categories: aggregation function, cross-line value function, ranking function.

1)聚合函数
1) aggregation function

max:最大值。
Max: Maximum value.

min:最小值。
min: Minimum value.

sum:求和。
sum: sum.

avg:平均值。
AVG: Average.

count:计数。
Count: Counting.

2)跨行取值函数
2) Cross-line value function

(1)lead和lag

注:lag和lead函数不支持自定义窗口。
Note: The lag and lead functions do not support custom windows.

(2)first_value和last_value

3)排名函数
3) Ranking function

注:rank 、dense_rank、row_number不支持自定义窗口。
Note: rank, dense_rank, row_number do not support custom windows.

1.6.8 Hive优化
1.6.8 Hive optimization

1.6.8.1 分组聚合
1.6.8.1 Group Aggregation

一个分组聚合的查询语句,默认是通过一个MapReduce Job完成的。Map端负责读取数据,并按照分组字段分区,通过Shuffle,将数据发往Reduce端,各组数据在Reduce端完成最终的聚合运算。
A grouped aggregate query statement, by default, is completed through a MapReduce Job. The Map terminal is responsible for reading data, partitioning the data according to grouping fields, sending the data to the Reduce terminal through Shuffle, and completing the final aggregation operation of each group of data at the Reduce terminal.

分组聚合的优化主要围绕着减少Shuffle数据量进行,具体做法是map-side聚合。所谓map-side聚合,就是在map端维护一个Hash Table,利用其完成部分的聚合,然后将部分聚合的结果,按照分组字段分区,发送至Reduce端,完成最终的聚合。
The optimization of grouping aggregation mainly revolves around reducing the amount of Shuffle data, and the specific method is map-side aggregation. Map-side aggregation is to maintain a Hash Table on the map side, use it to complete partial aggregation, and then send the results of partial aggregation to the Reduce side according to the grouping fields to complete the final aggregation.

相关参数如下:
The relevant parameters are as follows:

--启用map-side聚合,默认是true
--Enable map-side aggregation, default is true

set hive.map.aggr=true;

--用于检测源表数据是否适合进行map-side聚合。检测的方法是:先对若干条数据进行map-side聚合,若聚合后的条数和聚合前的条数比值小于该值,则认为该表适合进行map-side聚合;否则,认为该表数据不适合进行map-side聚合,后续数据便不再进行map-side聚合。
--Used to detect whether the source table data is suitable for map-side aggregation. The detection method is as follows: map-side aggregation is carried out on several pieces of data at first; if the ratio of the number of pieces after aggregation to the number of pieces before aggregation is less than the value, the table is considered suitable for map-side aggregation; otherwise, the table data is considered not suitable for map-side aggregation, and subsequent data will not be map-side aggregated.

set hive.map.aggr.hash.min.reduction=0.5;

--用于检测源表是否适合map-side聚合的条数。
--Number of entries used to detect whether the source table fits into map-side aggregation.

set hive.groupby.mapaggr.checkinterval=100000;

--map-side聚合所用的hash table,占用map task堆内存的最大比例,若超出该值,则会对hash table进行一次flush。
--map-side The hash table used for aggregation, occupying the maximum proportion of map task heap memory. If this value is exceeded, the hash table will be flushed once.

set hive.map.aggr.hash.force.flush.memory.threshold=0.9;

1.6.8.2 Map Join

Hive中默认最稳定的Join算法是Common Join。其通过一个MapReduce Job完成一个Join操作。Map端负责读取Join操作所需表的数据,并按照关联字段进行分区,通过Shuffle,将其发送到Reduce端,相同key的数据在Reduce端完成最终的Join操作。
The most stable Join algorithm in Hive is Common Join by default. It completes a Join operation with a MapReduce Job. The Map end is responsible for reading the data of the table required for the Join operation, partitioning it according to the associated fields, and sending it to the Reduce end through Shuffle. The data with the same key completes the final Join operation at the Reduce end.

优化Join的最为常用的手段就是Map Join,其可通过两个只有Map阶段的Job完成一个join操作。第一个Job会读取小表数据,将其制作为Hash Table,并上传至Hadoop分布式缓存(本质上是上传至HDFS)。第二个Job会先从分布式缓存中读取小表数据,并缓存在Map Task的内存中,然后扫描大表数据,这样在map端即可完成关联操作。
The most common way to optimize Join is Map Join, which can complete a join operation through two jobs with only Map phases. The first Job reads the small table data, makes it into a Hash Table, and uploads it to Hadoop distributed cache (essentially HDFS). The second Job reads the small table data from the distributed cache and caches it in the memory of the Map Task, then scans the large table data, so that the association operation can be completed on the map side.

注:由于Map Join需要缓存整个小表的数据,故只适用于大表Join小表的场景。
Note: Map Join needs to cache the data of the whole small table, so it is only applicable to the scene of large table Join small table.

相关参数如下:
The relevant parameters are as follows:

--启动Map Join自动转换
--Start Map Join automatic conversion

set hive.auto.convert.join=true;

--开启无条件转Map Join
--Open Unconditional Map Join

set hive.auto.convert.join.noconditionaltask=true;

--无条件转Map Join小表阈值,默认值10M,推荐设置为Map Task总内存的三分之一到二分之一
--Unconditional Map Join table threshold, default value 10M, recommended to set to 1/3 to 1/2 of the total memory of Map Task

set hive.auto.convert.join.noconditionaltask.size=10000000;

1.6.8.3 SMB Map Join

上节提到,Map Join只适用于大表Join小表的场景。若想提高大表Join大表的计算效率,可使用Sort Merge Bucket Map Join。
As mentioned in the previous section, Map Join is only applicable to scenarios where large tables Join small tables. To improve the computational efficiency of large table Join large table, use Sort Merge Bucket Map Join.

需要注意的是SMB Map Join有如下要求:
SMB Map Join has the following requirements:

(1)参与Join的表均为分桶表,且分桶字段为Join的关联字段。
(1) All tables participating in Join are bucket tables, and bucket fields are associated fields of Join.

(2)两表分桶数呈倍数关系。
(2) The bucket counts of the two tables are multiples of each other.

(3)数据在分桶内是按关联字段有序的。
(3) The data is ordered in buckets according to associated fields.

SMB Join的核心原理如下:只要保证了上述三点要求的前两点,就能保证参与Join的两张表的分桶之间具有明确的关联关系,因此就可以在两表的分桶间进行Join操作了。
The core principle of SMB Join is as follows: As long as the first two points of the above three requirements are guaranteed, there can be a clear association between the buckets of the two tables participating in Join, so you can perform Join operation between the buckets of the two tables.

若能保证第三点,也就是参与Join的数据是有序的,这样就能使用数据库中常用的Join算法之一——Sort Merge Join了,Merge Join原理如下:
If you can ensure that the third point, that is, the data participating in the Join is ordered, you can use one of the Join algorithms commonly used in the database-Sort Merge Join. The principle of Merge Join is as follows:

在满足了上述三点要求之后,就能使用SMB Map Join了。
After meeting the above three requirements, you can use SMB Map Join.

由于SMB Map Join无需构建Hash Table也无需缓存小表数据,故其对内存要求很低。适用于大表Join大表的场景。
SMB Map Join has low memory requirements because it needs neither a Hash Table nor cached small-table data, so it is suitable for scenarios where one large table joins another large table.

1.6.8.4 Reduce并行度
1.6.8.4 Reduce parallelism

Reduce端的并行度,也就是Reduce个数,可由用户自己指定,也可由Hive自行根据该MR Job输入的文件大小进行估算。
The parallelism of the Reduce side, that is, the number of Reduces, can be specified by the user himself, or Hive can estimate it according to the file size input by the MR Job.

Reduce端的并行度的相关参数如下:
The relevant parameters of parallelism on the Reduce side are as follows:

--指定Reduce端并行度,默认值为-1,表示用户未指定
--Specifies the reduce-side parallelism, the default value is-1, indicating that the user does not specify

set mapreduce.job.reduces;

--Reduce端并行度最大值
--Reduce maximum parallelism

set hive.exec.reducers.max;

--单个Reduce Task计算的数据量,用于估算Reduce并行度
--The amount of data for a single Reduce Task computation, used to estimate Reduce parallelism

set hive.exec.reducers.bytes.per.reducer;

Reduce端并行度的确定逻辑如下:
The logic for determining reduce-side parallelism is as follows:

若指定参数mapreduce.job.reduces的值为一个非负整数,则Reduce并行度为指定值。否则,Hive自行估算Reduce并行度,估算逻辑如下:
If the value of the specified parameter mapreduce.job.reduces is a non-negative integer, the Reduce parallelism is the specified value. Otherwise, Hive estimates the Reduce parallelism by itself. The estimation logic is as follows:

假设Job输入的文件大小为totalInputBytes,
Assume the total input size of the Job is totalInputBytes,

参数hive.exec.reducers.bytes.per.reducer的值为bytesPerReducer,
the value of the parameter hive.exec.reducers.bytes.per.reducer is bytesPerReducer,

参数hive.exec.reducers.max的值为maxReducers。
and the value of the parameter hive.exec.reducers.max is maxReducers.

则Reduce端的并行度为:
Then the parallelism of the Reduce side is:

min(ceil(totalInputBytes / bytesPerReducer), maxReducers)

根据上述描述,可以看出,Hive自行估算Reduce并行度时,是以整个MR Job输入的文件大小作为依据的。因此,在某些情况下其估计的并行度很可能并不准确,此时就需要用户根据实际情况来指定Reduce并行度了。
From the above description, it can be seen that Hive estimates the Reduce parallelism by itself based on the file size of the entire MR Job input. Therefore, in some cases, the estimated parallelism may not be accurate, and the user needs to specify the Reduce parallelism according to the actual situation.

需要说明的是:若使用Tez或者是Spark引擎,Hive可根据计算统计信息(Statistics)估算Reduce并行度,其估算的结果相对更加准确。
It should be noted that if Tez or Spark engine is used, Hive can estimate the Reduce parallelism based on computational statistics, and the estimated result is relatively accurate.
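
A worked example of the estimation formula above (the input size is illustrative, and the two parameters are assumed to sit at the commonly seen defaults of 256000000 bytes and 1009):

public class ReducerEstimate {
    public static void main(String[] args) {
        long totalInputBytes = 10_000_000_000L;   // assume the MR Job reads about 10 GB
        long bytesPerReducer = 256_000_000L;      // hive.exec.reducers.bytes.per.reducer (assumed default)
        int maxReducers = 1009;                   // hive.exec.reducers.max (assumed default)
        long estimated = (long) Math.ceil((double) totalInputBytes / bytesPerReducer);
        long reducers = Math.min(estimated, maxReducers);
        System.out.println(reducers);             // prints 40
    }
}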

1.6.8.5 小文件合并
1.6.8.5 Small file merge

若Hive的Reduce并行度设置不合理,或者估算不合理,就可能导致计算结果出现大量的小文件。该问题可由小文件合并任务解决。其原理是根据计算任务输出文件的平均大小进行判断,若符合条件,则单独启动一个额外的任务进行合并。
If Hive's Reduce parallelism setting is unreasonable, or the estimation is unreasonable, it may lead to a large number of small files in the calculation result. This problem can be solved by the small file merge task. The principle is to judge according to the average size of the output file of the calculation task. If the condition is met, an additional task will be started separately for merging.

相关参数为:
The relevant parameters are:

--开启合并map only任务输出的小文件
--Open merge map only task output small file

set hive.merge.mapfiles=true;

--开启合并map reduce任务输出的小文件
--Open small files for merge map reduce task output

set hive.merge.mapredfiles=true;

--合并后的文件大小
--Combined file size

set hive.merge.size.per.task=256000000;

--触发小文件合并任务的阈值,若某计算任务输出的文件平均大小低于该值,则触发合并
--Threshold value for triggering small file merge task. If the average file size output by a calculation task is lower than this value, merge will be triggered.

set hive.merge.smallfiles.avgsize=16000000;

1.6.8.6 谓词下推
1.6.8.6 Predicate Push Down

谓词下推(predicate pushdown)是指,尽量将过滤操作前移,以减少后续计算步骤的数据量。开启谓词下推优化后,无需调整SQL语句,Hive就会自动将过滤操作尽可能的前移动。
Predicate pushdown refers to moving the filtering operation forward as much as possible to reduce the amount of data in subsequent calculation steps. When predicate pushdown optimization is enabled, Hive automatically moves filtering operations as far forward as possible without adjusting SQL statements.

相关参数为:
The relevant parameters are:

--是否启动谓词下推(predicate pushdown)优化
--Whether to start predicate pushdown optimization

set hive.optimize.ppd = true;

1.6.8.7 并行执行
1.6.8.7 Parallel Execution

Hive会将一个SQL语句转化成一个或者多个Stage,每个Stage对应一个MR Job。默认情况下,Hive同时只会执行一个Stage。但是SQL语句可能会包含多个Stage,但这多个Stage可能并非完全互相依赖,也就是说有些Stage是可以并行执行的。此处提到的并行执行就是指这些Stage的并行执行。相关参数如下:
Hive transforms an SQL statement into one or more stages, one MR Job per Stage. By default, Hive performs only one Stage at a time. However, a SQL statement may contain multiple Stages, but these Stages may not be completely interdependent, meaning that some Stages can be executed in parallel. Parallel execution is referred to here as parallel execution of these stages. The relevant parameters are as follows:

--启用并行执行优化,默认是关闭的
--Enable parallel execution optimization, default is off

set hive.exec.parallel=true;  

--同一个sql允许最大并行度,默认为8
--Maximum parallelism allowed for the same sql, default is 8

set hive.exec.parallel.thread.number=8;

1.6.8.8 CBO优化
1.6.8.8 CBO Optimization

CBO是指Cost based Optimizer,即基于计算成本的优化。
CBO stands for Cost Based Optimizer, i.e. optimization based on computational costs.

在Hive中,计算成本模型考虑到了:数据的行数、CPU、本地IO、HDFS IO、网络IO等方面。Hive会计算同一SQL语句的不同执行计划的计算成本,并选出成本最低的执行计划。目前CBO在Hive的MR引擎下主要用于Join的优化,例如多表Join的Join顺序。
In Hive, the computational cost model takes into account: rows of data, CPU, local IO, HDFS IO, network IO, etc. Hive calculates the computational cost of different execution plans for the same SQL statement and selects the execution plan with the lowest cost. At present, CBO is mainly used for Join optimization under Hive's MR engine, such as the Join order of multi-table Join.

相关参数为:
The relevant parameters are:

--是否启用cbo优化
--whether to enable cbo optimization

set hive.cbo.enable=true;

1.6.8.9 列式存储
1.6.8.9 Column Storage

采用ORC列式存储加快查询速度。
ORC column storage is adopted to speed up query.

id name age

1 zs 18

2 lishi 19

行:1 zs 18 2 lishi 19

列:1 2 zs lishi 18 19

select name from user

1.6.8.10 压缩
1.6.8.10 Compression

压缩减少磁盘IO:因为Hive底层计算引擎默认是MR,可以在Map输出端采用Snappy压缩。
Compression reduces disk IO: Because Hive's underlying compute engine defaults to MR, Snappy compression can be used on the Map output.

Map(Snappy ) Reduce

1.6.8.11 分区和分桶
1.6.8.11 Partitions and buckets

(1)创建分区表 防止后续全表扫描
(1) Create partitioned tables to prevent subsequent full table scans

(2)创建分桶表 对未知的复杂的数据进行提前采样
(2) Create bucket tables to sample unknown complex data in advance

1.6.8.12 更换引擎
1.6.8.12 Engine Replacement

1)MR/Tez/Spark区别:
1) Differences between MR/Tez/Spark:

MR引擎:多Job串联,基于磁盘,落盘的地方比较多。虽然慢,但一定能跑出结果。一般处理,周、月、年指标。
MR engine: multi-job series, disk-based, more places to drop disk. Although slow, it will definitely run out of results. General treatment, weekly, monthly and annual indicators.

Spark引擎:虽然在Shuffle过程中也落盘,但是并不是所有算子都需要Shuffle,尤其是多算子过程,中间过程不落盘 DAG有向无环图。 兼顾了可靠性和效率。一般处理天指标。
Spark engine: it also spills to disk during Shuffle, but not every operator requires a Shuffle; especially in multi-operator pipelines, intermediate results stay off disk (DAG, directed acyclic graph). It balances reliability and efficiency, and is generally used for daily metrics.

2)Tez引擎的优点
2) Advantages of Tez Engine

(1)使用DAG描述任务,可以减少MR中不必要的中间节点,从而减少磁盘IO和网络IO。
(1) Using DAG to describe tasks reduces unnecessary intermediate nodes in MR, thereby reducing disk IO and network IO.

(2)可更好的利用集群资源,例如Container重用、根据集群资源计算初始任务的并行度等。
(2) Cluster resources can be better utilized, such as Container reuse, parallelism of initial tasks calculated according to cluster resources, etc.

(3)可在任务运行时,根据具体数据量,动态的调整后续任务的并行度。
(3) The parallelism of subsequent tasks can be dynamically adjusted according to the specific data amount when the task is running.

1.6.8.13 几十张表join 如何优化
1.6.8.13 Dozens of tables join How to optimize

(1)减少join的表数量:不影响业务前提,可以考虑将一些表进行预处理和合并,从而减少join操作。
(1) Reduce the number of join tables: Without affecting the business premise, you can consider preprocessing and merging some tables to reduce the join operation.

(2)使用Map Join:将小表加载到内存中,从而避免了Reduce操作,提高了性能。通过设置hive.auto.convert.join为true来启用自动Map Join。
(2) Use Map Join: Load small tables into memory, thus avoiding the Reduce operation and improving performance. Enable automatic Map Join by setting hive.auto.convert.join to true.

(3)使用Bucketed Map Join:通过设置hive.optimize.bucketmapjoin为true来启用Bucketed Map Join。

(4)使用Sort Merge Join:这种方式在Map阶段完成排序,从而减少了Reduce阶段的计算量。通过设置hive.auto.convert.sortmerge.join为true来启用。
(4) Sort Merge Join: This method completes the sorting in the Map phase, thus reducing the amount of computation in the Reduce phase. Enable by setting hive.auto.convert.sortmerge.join to true.

(5)控制Reduce任务数量:通过合理设置hive.exec.reducers.bytes.per.reducer和mapreduce.job.reduces参数来控制Reduce任务的数量。
(5) Control the number of Reduce tasks: Control the number of Reduce tasks by setting hive.exec.reducers.bytes.per.reducer and mapreduce.job.reduces.

(6)过滤不需要的数据:join操作之前,尽量过滤掉不需要的数据,从而提高性能。
(6) Filter unwanted data: Before joining operations, try to filter out unwanted data to improve performance.

(7)选择合适的join顺序:将小表放在前面可以减少中间结果的数据量,提高性能。
(7) Choose the right join order: Putting small tables in front can reduce the amount of data in the middle result and improve performance.

(8)使用分区:可以考虑使用分区技术。只需要读取与查询条件匹配的分区数据,从而减少数据量和计算量。
(8) Use zoning: Consider using zoning techniques. Only partition data matching the query criteria needs to be read, reducing the amount of data and computation.

(9)使用压缩:通过对数据进行压缩,可以减少磁盘和网络IO,提高性能。注意选择合适的压缩格式和压缩级别。
(9) Use compression: By compressing data, you can reduce disk and network IO and improve performance. Pay attention to choosing the appropriate compression format and compression level.

(10)调整Hive配置参数:根据集群的硬件资源和实际需求,合理调整Hive的配置参数,如内存、CPU、IO等,以提高性能。
(10) Adjust Hive configuration parameters: According to the hardware resources and actual needs of the cluster, reasonably adjust Hive configuration parameters, such as memory, CPU, IO, etc., to improve performance.

1.6.9 Hive解决数据倾斜方法
1.6.9 Hive Solutions for Data Skew

数据倾斜问题,通常是指参与计算的数据分布不均,即某个key或者某些key的数据量远超其他key,导致在shuffle阶段,大量相同key的数据被发往同一个Reduce,进而导致该Reduce所需的时间远超其他Reduce,成为整个任务的瓶颈。以下为生产环境中数据倾斜的现象:
The data skew problem usually refers to the uneven distribution of data involved in computation, that is, the data volume of a certain key or some keys far exceeds that of other keys, resulting in a large number of data of the same key being sent to the same Reduce in the shuffle stage, which in turn leads to the time required for the Reduce far exceeding that of other Reductions, becoming the bottleneck of the whole task. The following are examples of data skewing in production environments:

Hive中的数据倾斜常出现在分组聚合和join操作的场景中,下面分别介绍在上述两种场景下的优化思路。
Data skew in Hive often occurs in the scenarios of grouping aggregation and join operations. The optimization ideas in the above two scenarios are described below.

1)分组聚合导致的数据倾斜
1) Data skew caused by grouping aggregation

前文提到过,Hive中的分组聚合是由一个MapReduce Job完成的。Map端负责读取数据,并按照分组字段分区,通过Shuffle,将数据发往Reduce端,各组数据在Reduce端完成最终的聚合运算。若group by分组字段的值分布不均,就可能导致大量相同的key进入同一Reduce,从而导致数据倾斜。
As mentioned earlier, packet aggregation in Hive is done by a MapReduce Job. The Map terminal is responsible for reading data, partitioning the data according to grouping fields, sending the data to the Reduce terminal through Shuffle, and completing the final aggregation operation of each group of data at the Reduce terminal. If the values of the group by field are unevenly distributed, it may cause a large number of the same keys to enter the same Reduce, resulting in data skew.

由分组聚合导致的数据倾斜问题,有如下解决思路:
The data skew problem caused by grouping aggregation can be solved as follows:

(1)判断倾斜的值是否为null
(1) Determine whether the value of inclination is null

若倾斜的值为null,可考虑最终结果是否需要这部分数据,若不需要,只要提前将null过滤掉,就能解决问题。若需要保留这部分数据,考虑以下思路。
If the value of tilt is null, consider whether the final result needs this part of data. If not, as long as null is filtered out in advance, the problem can be solved. If you need to retain this data, consider the following ideas.

(2)Map-Side聚合
(2) Map-Side aggregation

开启Map-Side聚合后,数据会先在Map端完成部分聚合工作。这样一来即便原始数据是倾斜的,经过Map端的初步聚合后,发往Reduce的数据也就不再倾斜了。最佳状态下,Map端聚合能完全屏蔽数据倾斜问题。
With Map-Side aggregation enabled, the data is first partially aggregated on the Map side. Even if the original data is skewed, after this preliminary aggregation the data sent to Reduce is no longer skewed. In the best case, Map-side aggregation can completely mask the data skew problem.

相关参数如下:
The relevant parameters are as follows:

set hive.map.aggr=true;

set hive.map.aggr.hash.min.reduction=0.5;

set hive.groupby.mapaggr.checkinterval=100000;

set hive.map.aggr.hash.force.flush.memory.threshold=0.9;

(3)Skew-GroupBy优化
(3) Skew-GroupBy optimization

Skew-GroupBy是Hive提供的一个专门用来解决分组聚合导致的数据倾斜问题的方案。其原理是启动两个MR任务,第一个MR按照随机数分区,将数据分散发送到Reduce,并完成部分聚合,第二个MR按照分组字段分区,完成最终聚合。
Skew-GroupBy is a solution provided by Hive to solve the data skew problem caused by grouping aggregation. The principle is to start two MR tasks. The first MR is partitioned according to random numbers, sending data to Reduce and completing partial aggregation. The second MR is partitioned according to grouping fields to complete final aggregation.

相关参数如下:
The relevant parameters are as follows:

--启用分组聚合数据倾斜优化
--Enable group aggregation data skew optimization

set hive.groupby.skewindata=true;

2)Join导致的数据倾斜
2) Data skew caused by Join

若Join操作使用的是Common Join算法,就会通过一个MapReduce Job完成计算。Map端负责读取Join操作所需表的数据,并按照关联字段进行分区,通过Shuffle,将其发送到Reduce端,相同key的数据在Reduce端完成最终的Join操作。
If the Join operation uses the Common Join algorithm, the computation is done via a MapReduce Job. The Map end is responsible for reading the data of the table required for the Join operation, partitioning it according to the associated fields, and sending it to the Reduce end through Shuffle. The data with the same key completes the final Join operation at the Reduce end.

如果关联字段的值分布不均,就可能导致大量相同的key进入同一Reduce,从而导致数据倾斜问题。
If the values of the associated fields are unevenly distributed, it may cause a large number of the same keys to enter the same Reduce, resulting in data skew problems.

Join导致的数据倾斜问题,有如下解决思路:
The data skew problem caused by Join can be solved as follows:

(1)Map Join

使用Map Join算法,Join操作仅在Map端就能完成,没有Shuffle操作,没有Reduce阶段,自然不会产生Reduce端的数据倾斜。该方案适用于大表Join小表时发生数据倾斜的场景。
Using Map Join algorithm, Join operation can be completed only on Map side, there is no Shuffle operation, there is no Reduce stage, and naturally there will be no data skew on Reduce side. This scenario is suitable for scenarios where data skew occurs when a large table joins a small table.

相关参数如下:
The relevant parameters are as follows:

set hive.auto.convert.join=true;

set hive.auto.convert.join.noconditionaltask=true;

set hive.auto.convert.join.noconditionaltask.size=10000000;

(2)Skew Join

若参与Join的两表均为大表,Map Join就难以应对了。此时可考虑Skew Join,其核心原理是:为倾斜的大key单独启动一个Map Join任务进行计算,其余key进行正常的Common Join。原理图如下:
If both tables participating in the Join are large, Map Join can no longer cope. In that case consider Skew Join, whose core idea is to launch a separate Map Join task for the skewed large keys, while the remaining keys go through a normal Common Join. The schematic diagram is as follows:

相关参数如下:
The relevant parameters are as follows:

--启用skew join优化
--Enable skew join optimization

set hive.optimize.skewjoin=true;

--触发skew join的阈值,若某个key的行数超过该参数值,则触发
--Threshold for triggering skew join. Trigger if the number of rows in a key exceeds the parameter value.

set hive.skewjoin.key=100000;

(3)调整SQL语句
(3) Adjust the SQL statement

若参与Join的两表均为大表,其中一张表的数据是倾斜的,此时也可通过以下方式对SQL语句进行相应的调整。
If the two tables participating in the Join are both large tables, and the data of one of the tables is skewed, you can also adjust the SQL statement accordingly in the following ways.

假设原始SQL语句如下:A,B两表均为大表,且其中一张表的数据是倾斜的。
Suppose the original SQL statement is as follows: A, B are both large tables, and one of the tables has skewed data.

hive (default)>

select

*

from A

join B

on A.id=B.id;

Join过程如下:
The process of joining is as follows:

图中1001为倾斜的大key,可以看到,其被发往了同一个Reduce进行处理。
1001 in the figure is a large tilted key, which can be seen to be sent to the same Reduce for processing.

调整之后的SQL语句执行计划如下图所示:
The SQL statement execution plan after adjustment is shown in the following figure:

调整SQL语句如下:
Adjust SQL statements as follows:

hive (default)>

select

*

from(

select --打散操作
select --scatter operation

concat(id,'_',cast(rand()*2 as int)) id,

value

from A

)ta

join(

select --扩容操作
select --expansion operation

concat(id,'_',1) id,

value

from B

union all

select

concat(id,'_',2) id,

value

from B

)tb

on ta.id=tb.id;

1.6.10 Hive的数据中含有字段的分隔符怎么处理?
1.6.10 How to handle Hive data that contains the field delimiter?

Hive 默认的字段分隔符为Ascii码的控制符\001(^A),建表的时候用fields terminated by '\001'。注意:如果采用\t或者\001等为分隔符,需要要求前端埋点和JavaEE后台传递过来的数据必须不能出现该分隔符,通过代码规范约束
Hive's default field delimiter is the Ascii code control character\001 (^A), and fields terminated by '\001' is used when creating tables. Note: If\t or\001 is used as the delimiter, it is required that the data passed from the front-end buried point and JavaEE background must not appear the delimiter, which is constrained by the code specification.

一旦传输过来的数据含有分隔符,需要在前一级数据中转义或者替换(ETL)。通常采用Sqoop和DataX在同步数据时预处理。
Once the transmitted data contains delimiters, escape or substitution (ETL) is required in the previous level of data. Sqoop and DataX are usually used to preprocess data when synchronizing.

id name age

1 zs 18

2 li分隔符si 19
2 li separator si 19
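
A hedged sketch of the preprocessing idea (the method name and the replacement character are assumptions): before data is synchronized or written out, upstream ETL code can strip or replace the Hive field delimiter in each field, for example:

public class FieldSanitizer {
    // Replace the Hive default delimiter \001 and tab with a space so they never reach the data file
    public static String sanitizeField(String field) {
        if (field == null) {
            return null;
        }
        return field.replace("\u0001", " ").replace("\t", " ");
    }
}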

1.6.11 MySQL元数据备份
1.6.11 MySQL Metadata Backup

元数据备份(重点,如数据损坏,可能整个集群无法运行,至少要保证每日零点之后备份到其它服务器两个复本)。
Metadata backup (critical: if the metadata is corrupted, the whole cluster may stop working; at minimum, back up two copies to other servers after midnight every day).

(1)MySQL备份数据脚本(建议每天定时执行一次备份元数据)
(1) MySQL backup data script (it is recommended to backup metadata regularly once a day)

#!/bin/bash

#常量设置

MYSQL_HOST='hadoop102'

MYSQL_USER='root'

MYSQL_PASSWORD='000000'

# 备份目录,需提前创建
#Backup directory, need to be created in advance

BACKUP_DIR='/root/mysql-backup'

# 备份天数,超过这个值,最旧的备份会被删除
#backup days, beyond which the oldest backup will be deleted

FILE_ROLL_COUNT='7'

# 备份MySQL数据库
#Backup MySQL database

[ -d "${BACKUP_DIR}" ] || exit 1

mysqldump \

--all-databases \

--opt \

--single-transaction \

--source-data=2 \

--default-character-set=utf8 \

-h"${MYSQL_HOST}" \

-u"${MYSQL_USER}" \

-p"${MYSQL_PASSWORD}" | gzip > "${BACKUP_DIR}/$(date +%F).gz"

if [ "$(ls "${BACKUP_DIR}" | wc -l )" -gt "${FILE_ROLL_COUNT}" ]

then

ls "${BACKUP_DIR}" | sort |sed -n 1p | xargs -I {} -n1 rm -rf "${BACKUP_DIR}"/{}

fi

2)MySQL恢复数据脚本
(2) MySQL Recovery Data Script

#!/bin/bash

#常量设置

MYSQL_HOST='hadoop102'

MYSQL_USER='root'

MYSQL_PASSWORD='000000'

BACKUP_DIR='/root/mysql-backup'

# 恢复指定日期,不指定就恢复最新数据
#Restore specified date, restore latest data without specifying

RESTORE_DATE=''

[ "${RESTORE_DATE}" ] && BACKUP_FILE="${RESTORE_DATE}.gz" || BACKUP_FILE="$(ls ${BACKUP_DIR} | sort -r | sed -n 1p)"

gunzip "${BACKUP_DIR}/${BACKUP_FILE}" --stdout | mysql \

-h"${MYSQL_HOST}" \

-u"${MYSQL_USER}" \

-p"${MYSQL_PASSWORD}"

1.6.12 如何创建二级分区表?
1.6.12 How do I create a secondary partition table?

create table dept_partition2(

deptno int, -- 部门编号
deptno int, --department number

dname string -- 部门名称
dname string -- department name

)

partitioned by (day string, hour string)

row format delimited fields terminated by '\t';

1.6.13 Union和Union all区别
1.6.13 Difference between Union and Union all

(1)union会将联合的结果集去重
(1) union will deduplicate the result set of the union

(2)union all不会对结果集去重
(2) union all does not deduplicate the result set

1.7 Datax

1.7.1 DataX与Sqoop区别
1.7.1 Differences between DataX and Sqoop

1)DataX与Sqoop都是主要用于离线系统中批量同步数据处理场景。
1) DataX and Sqoop are mainly used for batch synchronous data processing scenarios in offline systems.

2)DataX和Sqoop区别如下:
2) DataX and Sqoop differ as follows:

(1)DataX底层是单进程多线程;Sqoop底层是4个Map;
(1) DataX bottom layer is single process multithread;Sqoop bottom layer is 4 maps;

(2)数据量大的场景优先考虑Sqoop分布式同步;数据量小的场景优先考虑DataX,完全基于内存;DataX数据量大,可以使用多个DataX实例,每个实例负责一部分(手动划分)。
(2) For scenarios with a large data volume, Sqoop's distributed synchronization is preferred; for small data volumes, DataX is preferred, as it is entirely memory-based. If DataX must handle a large volume, multiple DataX instances can be used, each responsible for part of the data (divided manually).

(3)Sqoop是为Hadoop而生的,对Hadoop相关组件兼容性比较好;Datax是插件化开发,支持的Source和Sink更多一些。
(3) Sqoop is born for Hadoop and has good compatibility with Hadoop-related components;Datax is plug-in development and supports more Source and Sink.

(4)Sqoop目前官方不再升级维护;DataX目前阿里在升级维护。
(4) Sqoop is no longer officially maintained; DataX is still being maintained and upgraded by Alibaba.

(5)关于运行日志与统计信息,DataX更丰富,Sqoop基于Yarn不容易采集
(5) About running logs and statistics, DataX is richer, Sqoop is not easy to collect based on Yarn

1.7.2 速度控制
1.7.2 Speed control

1)关键优化参数如下:
1) The key optimization parameters are as follows:

(1)job.setting.speed.channel:总并发数
(1) job.setting.speed.channel: total concurrency

(2)job.setting.speed.record:总record限速
(2) job.setting.speed.record: global record-rate limit

(3)job.setting.speed.byte:总byte限速
(3) job.setting.speed.byte: global byte-rate limit

(4)core.transport.channel.speed.record:单个channel的record限速,默认值为10000(10000条/s)
(4) core.transport.channel.speed.record: record-rate limit of a single channel, default 10000 (10,000 records/s)

(5)core.transport.channel.speed.byte:单个channel的byte限速,默认值1024*1024(1MB/s)
(5) core.transport.channel.speed.byte: byte-rate limit of a single channel, default 1024*1024 (1 MB/s)

2)生效优先级:
2) Priority:

(1)全局Byte限速 / 单个Channel的Byte限速
(1) Global byte limit / per-channel byte limit

(2)全局Record限速 / 单个Channel的Record限速
(2) Global record limit / per-channel record limit

两个都设置时,取计算结果较小的那个
If both are set, the smaller computed result wins

(3)上面都没设置时,总Channel数的设置生效
(3) If none of the above is set, the total channel count setting takes effect

3)项目配置
3) Project configuration

只设置 总channel数=5,基本可以跑满网卡带宽。
Just set the total number of channels =5, and you can basically run full network card bandwidth.

1.7.3 内存调整
1.7.3 Memory Adjustments

建议将内存设置为4G或者8G,这个也可以根据实际情况来调整。
It is recommended to set the memory to 4G or 8G, which can also be adjusted according to the actual situation.

调整JVM xms xmx参数的两种方式:一种是直接更改datax.py脚本;另一种是在启动的时候,加上对应的参数,如下:
There are two ways to adjust JVM xms xmx parameters: one is to directly change the datax.py script; the other is to add the corresponding parameters at startup, as follows:

python datax/bin/datax.py --jvm="-Xms8G -Xmx8G" /path/to/your/job.json

1.7.4 空值处理
1.7.4 Handling of null values

1)MySQL(null)=> Hive(\N):需要在Hive建表语句中处理
1) MySQL (null) => Hive (\N): needs to be handled in the Hive table definition

解决该问题的方案有两个:
There are two solutions to this problem:

(1)修改DataX HDFS Writer的源码,增加自定义null值存储格式的逻辑,可参考https://blog.csdn.net/u010834071/article/details/105506580
(1) Modify the source code of DataX HDFS Writer and add logic to customize null value storage format. Please refer to https://blog.csdn.net/u010834071/article/details/105506580.

(2)在Hive中建表时指定null值存储格式为空字符串(''),例如:
(2) Specify null values when building tables in Hive to be stored in an empty string (''), for example:

DROP TABLE IF EXISTS base_province;

CREATE EXTERNAL TABLE base_province

(

`id` STRING COMMENT '编号',

`name` STRING COMMENT '省份名称',

`region_id` STRING COMMENT '地区ID',

`area_code` STRING COMMENT '地区编码',

`iso_code` STRING COMMENT '旧版ISO-3166-2编码,供可视化使用',
`iso_code` STRING COMMENT 'Old ISO-3166-2 code for visualization',

`iso_3166_2` STRING COMMENT '新版IOS-3166-2编码,供可视化使用'
`iso_3166_2` STRING COMMENT 'New IOS-3166-2 code for visualization'

) COMMENT '省份表'
) COMMENT 'Province table'

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

NULL DEFINED AS ''

LOCATION '/base_province/';

2)Hive(\N)=> MySQL(null)
2) Hive (\N) => MySQL (null)

"reader": {

"name": "hdfsreader",

"parameter": {

"defaultFS": "hdfs://hadoop102:8020",

"path": "/base_province",

"column": [

"*"

],

"fileType": "text",

"compress": "gzip",

"encoding": "UTF-8",

"nullFormat": "\\N",

"fieldDelimiter": "\t",

}

}

1.7.5 配置文件生成脚本
1.7.5 Profile Generation Scripts

(1)一个表一个配置,如果有几千张表,怎么编写配置?可以用脚本批量生成配置文件。
(1) One configuration file per table — if there are thousands of tables, how do you write all the configurations? Generate them in bulk with a script.

(2)脚本使用说明
(2) Script usage

python gen_import_config.py -d database -t table

1.7.6 DataX一天导入多少数据
1.7.6 DataX How much data is imported in a day

1)全量同步的表如下
1) The full synchronization table is as follows

活动表、优惠规则表、优惠卷表、SKU平台属性表、SKU销售属性表
Activity table, Offer rule table, Coupon table, SKU platform attribute table, SKU sales attribute table

SPU商品表(1-2万)、SKU商品表(10-20万)、品牌表、商品一级分类、商品二级分类、商品三级分类
SPU table (10,000–20,000 rows), SKU table (100,000–200,000 rows), brand table, and the level-1/level-2/level-3 product category tables

省份表、地区表
Province table, region table

编码字典表
coding dictionary table

以上全部加一起30万条,约等于300m。
All of the above add up to about 300,000 rows, roughly 300 MB.

加购表(每天增量20万、全量100万 =》1g)
Cart-add table (200,000 new rows per day, 1,000,000 rows in total => about 1 GB)

所以Datax每天全量同步的数据1-2g左右。
So the data DataX fully synchronizes each day is about 1–2 GB.

注意:金融、保险(平安、民生银行)等行业,只是业务数据量会更大一些。
Note: only in industries such as finance and insurance (e.g. Ping An, Minsheng Bank) is the business data volume noticeably larger.

2)增量同步的表如下
2) The incremental synchronization table is as follows

加购表(20万)、订单表(10万)、订单详情表(15万)、订单状态表、支付表(9万)、退单表(1000)、退款表(1000
Additional purchase table (200,000), order table (100,000), order details table (150,000), order status table, payment table (90,000), return table (1000), refund table (1000)

订单明细优惠卷关联表、优惠卷领用表
Order Details Coupon Association Table, Coupon Collection Table

商品评论表、收藏表
Product review table, collection table

用户表、订单明细活动关联表
User Table, Order Details Activity Association Table

增量数据每天1-2g
Incremental data 1-2g per day

1.7.7 Datax如何实现增量同步
1.7.7 Datax How to achieve incremental synchronization

获取今天新增和变化的数据:通过SQL过滤,创建时间是今天或者操作时间等于今天。
Fetch today's new and changed rows by filtering in SQL: the creation time is today or the last-operation time is today.

1.8 Maxwell

1.8.1 Maxwell与Canal、FlinkCDC的对比

1)FlinkCDC、Maxwell、Canal都是主要用于实时系统中实时数据同步处理场景。
1) FlinkCDC, Maxwell and Canal are mainly used for real-time data synchronization processing scenarios in real-time systems.

(1)SQL与数据条数关系:FlinkCDC和Maxwell是SQL影响几条就输出几条;Canal一条SQL只输出一整条(后续可能需要炸开)。
(1) Relationship between one SQL statement and the number of records emitted: FlinkCDC and Maxwell emit one record per affected row; Canal emits one record for the whole statement (which may need to be exploded downstream).

(2)数据初始化功能(同步全量数据):FlinkCDC有(支持多库多表同时做);Maxwell有(单表);Canal无。
(2) Data initialization (full synchronization): FlinkCDC yes (multiple databases and tables at once); Maxwell yes (one table at a time); Canal no.

(3)断点续传功能:FlinkCDC有(保存在Checkpoint);Maxwell有(保存在MySQL);Canal有(保存在本地)。
(3) Resume from breakpoint: FlinkCDC yes (stored in the checkpoint); Maxwell yes (stored in MySQL); Canal yes (stored locally).

1.8.2 Maxwell好处

支持断点续传。
Support breakpoint resume.

全量初始化同步。
Full initialization synchronization.

自动根据库名和表名把数据发往Kafka的对应主题。
Automatically routes data to the corresponding Kafka topic based on the database name and table name.

1.8.3 Maxwell底层原理
1.8.3 Maxwell's underlying principles

MySQL主从复制。
MySQL master-slave replication.

1.8.4 全量同步速度如何
1.8.4 How about full synchronous speed

同步速度慢,全量同步建议采用Sqoop或者DataX
Slow synchronization speed, full synchronization is recommended to use Sqoop or DataX.

1.8.5 Maxwell数据重复问题
1.8.5 Maxwell data duplication problem

同步历史数据时,bootstrap会扫描所有数据。
When synchronizing historical data, bootstrap scans all data.

同时maxwell会监听binlog变化。
Maxwell also listens for binlog changes.

例如:用bootstrap同步历史数据库时,历史数据库中新插入一条数据,bootstrap扫描到了,maxwell进程也从binlog监控到了,这时就会出现数据重复问题。
For example, while bootstrap is synchronizing the historical database, a new row is inserted into it. The bootstrap scan picks it up and the Maxwell process also sees it in the binlog, so the row is emitted twice — a data duplication problem.

1.9 DolphinScheduler调度器

1.3.9版本,支持邮件、企业微信。
Version 1.3.9 supports alerts via email and WeCom (enterprise WeChat).

2.0.3版本,支持的报警信息更全一些,配置更容易。
Version 2.0.3 supports more complete alert information and is easier to configure.

3.0.0以上版本,支持数据质量监控。
Versions 3.0.0 and above support data quality monitoring.

1.9.1 每天集群运行多少指标?
1.9.1 How many metrics does the cluster run per day?

每天跑100多个指标,有活动时跑200个左右。
Run more than 100 indicators every day, and run about 200 when there is activity.

1.9.2 任务挂了怎么办?
1.9.2 What if the mission fails?

(1)运行成功或者失败都会发邮件、发钉钉、集成自动打电话。
(1) Whether a run succeeds or fails, we send email and DingTalk notifications and integrate automatic phone calls.

(2)最主要的解决方案就是,看日志,解决问题。
(2) The main solution is to look at the log and solve the problem.

(3)报警网站睿象云,http://www.onealert.com/
(3) Alerting platform OneAlert (睿象云), http://www.onealert.com/

(4)双11和618活动需要24小时值班
(4) Double 11 and 618 activities require 24-hour duty

1.9.3 DS挂了怎么办?
1.9.3 What if DolphinScheduler itself goes down?

看日志找报错原因:一般直接重启;如果是资源不够,就增加资源后再重启。
Check the logs for the cause of the error: usually just restart; if resources are insufficient, add resources and then restart.

1.10 Spark Core & SQL

1.10.1 Spark运行模式
1.10.1 Spark operating mode

(1)Local:运行在一台机器上,测试用。
(1) Local: runs on a single machine; used for testing.

(2)Standalone:是Spark自身的一个调度系统,对集群性能要求非常高时用。国内很少使用。
(2) Standalone: Spark's own scheduling system, used when cluster performance requirements are very high. Rarely used in China.

(3)Yarn:采用Hadoop的资源调度器,国内大量使用。
(3) Yarn: uses Hadoop's resource scheduler; widely used in China.

Yarn-client模式:Driver运行在Client上(不在AM里)
Yarn-client mode: Driver runs on Client (not AM)

Yarn-cluster模式:Driver在AM上
Yarn-cluster mode: the Driver runs inside the ApplicationMaster (AM)

(4)Mesos:国内很少使用
(4) Mesos: rarely used in the country.

(5)K8S:趋势,但是目前不成熟,需要的配置信息太多。
(5) K8S: Trend, but immature at present, too much configuration information is required.

1.10.2 Spark常用端口号
1.10.2 Common port numbers for Spark

(1)4040 spark-shell任务端口
(1) 4040 spark-shell task port

(2)7077 内部通讯端口。类比Hadoop的8020/9000
(2) 7077 internal communication port. Analogy of Hadoop 8020/9000

(3)8080 查看任务执行情况端口。 类比Hadoop的8088
(3) Port 8080 to view task execution status. Analogy of Hadoop 8088

(4)18080 历史服务器。类比Hadoop的19888
(4) 18080 History Server. Analogy of Hadoop 19888

注意:由于Spark只负责计算,所以并没有Hadoop中存储数据的端口9870/50070。
Note: since Spark is only responsible for computation, it has no counterpart to Hadoop's data-storage ports 9870/50070.

1.10.3 RDD五大属性
1.10.3 The five properties of an RDD

(1)分区列表;(2)作用在每个分区上的计算函数;(3)对其他RDD的依赖列表;(4)分区器(可选,针对K-V类型的RDD);(5)每个分区的优先计算位置(可选)。
(1) A list of partitions; (2) a compute function applied to each partition; (3) a list of dependencies on other RDDs; (4) an optional Partitioner for key-value RDDs; (5) an optional list of preferred locations for each partition.

1.10.4 RDD弹性体现在哪里
1.10.4 Where is RDD Resilience

主要表现为存储弹性、计算弹性、任务(Task、Stage)弹性、数据位置弹性,具体如下:
It is mainly manifested as storage elasticity, calculation elasticity, task elasticity and data location elasticity, as follows:

(1)自动进行内存和磁盘切换
(1) Automatic memory and disk switching

(2)基于Lineage(血缘)的高效容错
(2) Efficient fault tolerance based on lineage

(3)Task如果失败会特定次数的重试
(3) Task will retry a certain number of times if it fails

(4)Stage如果失败会自动进行特定次数的重试,而且只会只计算失败的分片
(4) Stage automatically retries a certain number of times if it fails, and only failed fragments will be calculated.

(5)Checkpoint【每次对RDD操作都会产生新的RDD,如果链条比较长,计算比较笨重,就把数据放在硬盘中】和persist 【内存或磁盘中对数据进行复用】(检查点、持久化)
(5) Checkpoint [every RDD operation will generate a new RDD, if the chain is long, the calculation is heavy, put the data in the hard disk] and persist [data multiplexing in memory or disk](checkpoint, persistence)

(6)数据调度弹性:DAG Task 和资源管理无关
(6) Data scheduling flexibility: DAG Task is independent of resource management

(7)数据分片的高度弹性(repartition)
(7) Highly elastic data partitioning (repartition)

1.10.5 Spark的转换算子(8个)
1.10.5 Conversion operators of Spark (8)

1)单Value

(1)map

(2)mapPartitions

(3)mapPartitionsWithIndex

(4)flatMap

(5)groupBy

(6)filter

(7)distinct

(8)coalesce

(9)repartition

(10)sortBy

2)双Value

(1)intersection

(2)union

(3)subtract

(4)zip

3)Key-Value

(1)partitionBy

(2)reduceByKey

(3)groupByKey

(4)sortByKey

(5)mapValues

(6)join

1.10.6 Spark的行动算子(5个)
1.10.6 Spark's Action Operators (5)

(1)reduce

(2)collect

(3)count

(4)first

(5)take

(6)save

(7)foreach

1.10.7 map和mapPartitions区别

(1)map:每次处理一条数据
(1) map: processes one record at a time

(2)mapPartitions:每次处理一个分区数据
(2) mapPartitions: process data one partition at a time
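A minimal Spark Java sketch of the difference (the toy data and local master are assumptions for illustration): map invokes the function once per element, while mapPartitions invokes it once per partition with an iterator, which is where per-partition setup such as opening a connection belongs.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MapVsMapPartitions {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("mapVsMapPartitions").setMaster("local[*]"));
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        // map: the function runs once per element
        JavaRDD<Integer> mapped = rdd.map(x -> x * 2);

        // mapPartitions: the function runs once per partition and receives
        // the whole partition as an iterator, so per-partition setup is done only once
        JavaRDD<Integer> mappedByPartition = rdd.mapPartitions(iter -> {
            List<Integer> out = new ArrayList<>();
            while (iter.hasNext()) {
                out.add(iter.next() * 2);
            }
            return out.iterator();
        });

        System.out.println(mapped.collect());
        System.out.println(mappedByPartition.collect());
        sc.stop();
    }
}
```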

1.10.8 Repartition和Coalesce区别

1)关系:
1) Relationship:

两者都是用来改变RDD的partition数量的,repartition底层调用的就是coalesce方法:coalesce(numPartitions, shuffle = true)。
Both are used to change the number of partitions of an RDD; under the hood repartition simply calls coalesce(numPartitions, shuffle = true).

2)区别:
2) Difference:

repartition一定会发生Shuffle,coalesce根据传入的参数来判断是否发生Shuffle。
repartition always performs a shuffle, while coalesce decides whether to shuffle based on the parameter passed in.

一般情况下增大rdd的partition数量使用repartition,减少partition数量时使用coalesce
In general, use repartition to increase the number of partitions in rdd, and use coalesce to decrease the number of partitions.
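A small sketch under the same assumptions (local Spark, toy data) showing that repartition always shuffles, while coalesce only merges partitions unless a shuffle is explicitly requested.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RepartitionVsCoalesce {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("repartitionVsCoalesce").setMaster("local[*]"));
        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 8);

        // repartition always shuffles; use it to increase the partition count
        JavaRDD<Integer> wider = rdd.repartition(16);

        // coalesce(n) merges partitions without a shuffle by default; use it to decrease the count
        JavaRDD<Integer> narrower = rdd.coalesce(2);

        // coalesce(n, true) forces a shuffle, which is what repartition(n) calls internally
        JavaRDD<Integer> narrowerShuffled = rdd.coalesce(2, true);

        System.out.println(wider.getNumPartitions());            // 16
        System.out.println(narrower.getNumPartitions());         // 2
        System.out.println(narrowerShuffled.getNumPartitions()); // 2
        sc.stop();
    }
}
```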

1.10.9 reduceByKey与groupByKey的区别

reduceByKey:具有预聚合操作
reduceByKey: Has pre-aggregation operations.

groupByKey:没有预聚合。
groupByKey: No preaggregation.

在不影响业务逻辑的前提下,优先采用reduceByKey。
On the premise of not affecting business logic, reduceByKey is preferred.
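A short word-count style sketch (toy data, local master assumed) showing the pre-aggregation difference: both produce the same result, but reduceByKey combines values on the map side first and therefore shuffles far less data.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class ReduceByKeyVsGroupByKey {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("reduceVsGroup").setMaster("local[*]"));
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("a", 1), new Tuple2<>("b", 1), new Tuple2<>("a", 1)));

        // reduceByKey: map-side combine happens before the shuffle
        JavaPairRDD<String, Integer> sums = pairs.reduceByKey(Integer::sum);

        // groupByKey: every (key, value) pair is shuffled, aggregation happens only afterwards
        JavaPairRDD<String, Integer> sumsViaGroup = pairs.groupByKey()
                .mapValues(values -> {
                    int total = 0;
                    for (int v : values) {
                        total += v;
                    }
                    return total;
                });

        System.out.println(sums.collectAsMap());         // {a=2, b=1}
        System.out.println(sumsViaGroup.collectAsMap()); // {a=2, b=1}
        sc.stop();
    }
}
```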

1.10.10 Spark中的血缘
1.10.10 Bloodlines in Spark

宽依赖和窄依赖。有Shuffle的是宽依赖。
Wide dependence and narrow dependence. Shuffle is wide dependence.

1.10.11 Spark任务的划分
1.10.11 Division of Spark Tasks

(1)Application:初始化一个SparkContext即生成一个Application;
(1) Application: Initialize a SparkContext to generate an Application;

(2)Job:一个Action算子就会生成一个Job;
(2) Job: An Action operator generates a Job;

(3)Stage:Stage等于宽依赖的个数加1;
(3) Stage: Stage equals the number of wide dependencies plus 1;

(4)Task:一个Stage阶段中,最后一个RDD的分区个数就是Task的个数。
(4) Task: In a Stage, the number of partitions in the last RDD is the number of tasks.

1.10.12 SparkSQL中RDD、DataFrame、DataSet三者的转换及区别
1.10.12 Conversion and Difference of RDD, DataFrame and DataSet in SparkSQL

DataFrame和DataSet的区别:DataFrame是Row类型的DataSet(弱类型)。
Difference between DataFrame and DataSet: a DataFrame is a DataSet of Row objects (untyped).

RDD和DataFrame及DataSet的区别:RDD没有字段和表(Schema)信息。
Difference between RDD and DataFrame/DataSet: an RDD carries no field or table (schema) information.

1.10.13 Hive on Spark和Spark on Hive区别
1.10.13 Difference between Hive on Spark and Spark on Hive

Hive on Spark:元数据存MySQL;执行引擎是RDD;语法是HQL;生态更加完善。
Hive on Spark: metadata in MySQL; execution engine works on RDDs; syntax is HQL; the surrounding ecosystem is more complete.

Spark on Hive(Spark SQL):元数据存MySQL;执行引擎是DataFrame/DataSet;语法是Spark SQL;生态有欠缺(权限管理、元数据管理)。
Spark on Hive (Spark SQL): metadata in MySQL; execution engine works on DataFrame/DataSet; syntax is Spark SQL; the ecosystem is weaker (permission management, metadata management).

元数据存储:内置Hive使用derby,外置Hive使用MySQL。
Metadata storage: the built-in Hive uses Derby; an external Hive uses MySQL.

1.10.14 Spark内核源码(重点)
1.10.14 Spark kernel source code (focus)

1)提交流程(重点)
1) Submission process (focus)

2)Shuffle流程(重点)
2) Shuffle process (key point)

(1)SortShuffle:减少了小文件。
(1) SortShuffle: produces fewer small files.

中间落盘写的是本地磁盘。
Intermediate spills are written to the local disk.

生成的文件数 = Task数量*2
Number of files generated = Number of Tasks *2

在溢写磁盘前,先根据key进行排序,排序过后的数据,会分批写入到磁盘文件中。默认批次为10000条,数据会以每批一万条写入到磁盘文件。写入磁盘文件通过缓冲区溢写的方式,每次溢写都会产生一个磁盘文件,也就是说一个Task过程会产生多个临时文件。最后在每个Task中,将所有的临时文件合并,这就是merge过程,此过程将所有临时文件读取出来,一次写入到最终文件。
Before overwriting the disk, sort by key, and the sorted data will be written to the disk file in batches. The default batch is 10000 pieces, and the data is written to disk files in batches of 10000 pieces. Writing to disk files occurs through buffer overwrites, each of which produces a disk file, meaning that a Task process produces multiple temporary files. Finally, in each Task, all temporary files are merged. This is the merge process, which reads all temporary files and writes them to the final file at once.

(2)bypassShuffle:减少了小文件,不排序,效率高。在不需要排序的场景使用。
(2) bypassShuffle: fewer small files, no sorting, more efficient; used in scenarios where sorting is not required.

1.10.15 Spark统一内存模型
1.10.15 Spark Unified Memory Model

1)统一内存管理的堆内内存结构如下图
1) The heap memory structure of unified memory management is shown in the following figure

2)统一内存管理的动态占用机制如下图
2) The dynamic occupancy mechanism of unified memory management is shown in the following figure

1.10.16 Spark为什么比MR快?
1.10.16 Why is Spark faster than MR?

1)内存&硬盘
1) Memory & Hard Disk

(1)MR在Map阶段会在溢写阶段将中间结果频繁的写入磁盘,在Reduce阶段再从磁盘拉取数据。频繁的磁盘IO消耗大量时间。
(1) MR frequently writes intermediate results to disk in the overflow phase in the Map phase, and pulls data from the disk in the Reduce phase. Frequent disk I/O consumes a lot of time.

(2)Spark不需要将计算的中间结果写入磁盘。这得益于Spark的RDD,在各个RDD的分区中,各自处理自己的中间结果即可。在迭代计算时,这一优势更为明显。
(2) Spark does not need to write the intermediate results of the computation to disk. This is thanks to Spark's RDD, which allows each RDD partition to process its own intermediate results. This advantage is even more pronounced when it comes to iterative calculations.

2)Spark DAG任务划分减少了不必要的Shuffle
2) Spark DAG task division reduces unnecessary Shuffle

(1)对MR来说,每一个Job的结果都会落盘。后续依赖于该Job结果的Job,需要从磁盘中读取数据再进行计算。
(1) For MR, the result of every job lands on disk, and any subsequent job that depends on it must read the data back from disk before computing.

(2)对于Spark来说,每一个Job的结果都可以保存到内存中,供后续Job使用。配合Spark的缓存机制,大大的减少了不必要的Shuffle。
(2) For Spark, the result of each job can be saved to memory for subsequent jobs. Combined with Spark's caching mechanism, unnecessary Shuffle is greatly reduced.

3)资源申请粒度:进程&线程
3) Resource application granularity: Process & Thread

开启和调度进程的代价一般情况下大于线程的代价。
The cost of starting and scheduling a process is generally greater than the cost of threads.

(1)MR任务以进程的方式运行在Yarn集群中。N个MapTask就要申请N个进程
(1) MR tasks run in the Yarn cluster as a process. N MapTasks require N processes

(2)Spark的任务是以线程的方式运行在进程中。N个MapTask就要申请N个线程。
(2) Spark tasks run in the process as threads. N MapTasks require N threads.

1.10.17 Spark Shuffle和Hadoop Shuffle区别?

(1)Hadoop不用等所有的MapTask都结束后开启ReduceTask;Spark必须等到父Stage都完成,才能去Fetch数据。
(1) Hadoop does not need to wait for all MapTasks to finish to open ReduceTask; Spark must wait until all the parent stages are completed before it can go to Fetch data.

(2)Hadoop的Shuffle是必须排序的,那么不管是Map的输出,还是Reduce的输出,都是分区内有序的,而Spark不要求这一点。
(2) Hadoop's Shuffle must be sorted, so whether it is the output of Map or Reduce, it is ordered within the partition, and Spark does not require this.

1.10.18 Spark提交作业参数(重点)
1.10.18 Spark Submission Job Parameters (Important)

参考答案:
Answer:

https://blog.csdn.net/gamer_gyt/article/details/79135118

1)在提交任务时的几个重要参数
1) Several important parameters when submitting a task

executor-cores —— 每个executor使用的内核数,默认为1,官方建议2-5个,我们企业用4个
executor-cores — number of cores per executor; default 1, officially recommended 2–5, we use 4 in production

num-executors —— 启动executors的数量,默认为2
num-executors — the number of executors started, default is 2

executor-memory —— executor内存大小,默认1G
executor-memory -- executor Memory size, default 1G

driver-cores —— driver使用内核数,默认为1
driver-cores - the number of cores used by driver, which is 1 by default

driver-memory —— driver内存大小,默认512M
driver-memory - the size of the driver's memory, which is 512 MB by default

2)给一个提交任务的样例
2) An example of a submit command

spark-submit \
--master local[5] \
--name "Spark Job Name" \
--driver-cores 2 \
--driver-memory 8g \
--executor-cores 4 \
--num-executors 10 \
--executor-memory 8g \
--class PackageName.ClassName \
XXXX.jar \
InputPath \
OutputPath

1.10.19 Spark任务使用什么进行提交,JavaEE界面还是脚本
1.10.19 What is the submission of a Spark task, whether it is a JavaEE interface or a script

Shell脚本。海豚调度器可以通过页面提交Spark任务。
Shell script. Dolphin Scheduler can submit Spark tasks via a page.

1.10.20 请列举会引起Shuffle过程的Spark算子,并简述功能。
1.10.20 List the Spark operators that give rise to Shuffle processes and describe their functions briefly.

reduceByKey:按key聚合,存在map端预聚合,会产生Shuffle。
reduceByKey: aggregates values by key with map-side pre-aggregation; triggers a shuffle.

groupByKey:按key分组,没有预聚合,会产生Shuffle。
groupByKey: groups values by key without pre-aggregation; triggers a shuffle.

其他…ByKey类算子(sortByKey、aggregateByKey、combineByKey等),以及repartition、distinct、join等也会产生Shuffle。
Other ...ByKey operators (sortByKey, aggregateByKey, combineByKey, etc.), as well as repartition, distinct and join, also trigger a shuffle.

1.10.21 Spark操作数据库时,如何减少Spark运行中的数据库连接数?
1.10.21 How do I reduce the number of database connections Spark has running when I'm working with a database?

使用foreachPartition代替foreach,在foreachPartition获取数据库的连接。
Use foreachPartition instead of foreach to get the database connection inside foreachPartition.
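A minimal sketch of the idea, assuming a JavaPairRDD<String, Integer> named result and a hypothetical MySQL table word_count(word, cnt); the JDBC URL and credentials are placeholders. One connection is opened per partition instead of one per record.

```java
result.foreachPartition(iter -> {
    // one connection per partition, reused for every record in the iterator
    java.sql.Connection conn = java.sql.DriverManager.getConnection(
            "jdbc:mysql://hadoop102:3306/gmall", "root", "000000");
    java.sql.PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO word_count(word, cnt) VALUES (?, ?)");
    while (iter.hasNext()) {
        scala.Tuple2<String, Integer> record = iter.next();
        ps.setString(1, record._1);
        ps.setInt(2, record._2);
        ps.addBatch();              // batch the writes inside the partition
    }
    ps.executeBatch();
    ps.close();
    conn.close();
});
```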

1.10.22 Spark数据倾斜
1.10.22 Spark data tilt

详见Hive on Spark数据倾斜讲解。
See Hive on Spark Data Tilt for more information.

1.10.23 Spark3.0新特性

动态优化:自适应查询执行、动态分区裁剪
Dynamic optimization: adaptive query execution, dynamic partition clipping

根据计算,动态设置reduce数量
Dynamically set reduce quantity based on calculation

根据表格数量,动态挑选合适的HashJoin,BroadCastJon或MergeJoin
Dynamically pick the right HashJoin, BroadCastJon or MergeJoin based on the number of tables

1.12 Flink

1.12.1 Flink基础架构组成?
1.12.1 Flink Infrastructure Composition?

Flink程序在运行时主要有TaskManager,JobManager,Client三种角色。
Flink program has three roles: TaskManager, JobManager and Client.

JobManager是集群的老大,负责接收Flink Job、协调检查点、Failover故障恢复等,同时管理TaskManager。包含:Dispatcher、ResourceManager、JobMaster。
The JobManager is the "boss" of the cluster: it receives Flink jobs, coordinates checkpoints, handles failover recovery, and manages the TaskManagers. It contains the Dispatcher, the ResourceManager and the JobMaster.

TaskManager是执行计算的节点,每个TaskManager负责管理其所在节点上的资源信息,如内存、磁盘、网络。内部划分slot隔离内存,不隔离cpu。同一个slot共享组的不同算子的subtask可以共享slot。
TaskManagers are nodes that perform calculations, and each TaskManager is responsible for managing resource information such as memory, disk, and network on its node. Internal partition slot isolation memory, not CPU isolation. Subtasks of different operators of the same slot sharing group can share slots.

Client是Flink程序提交的客户端,将Flink Job提交给JobManager。
Client is the client submitted by Flink program, submitting Flink Job to JobManager.

1.12.2 Flink和Spark Streaming的区别
1.12.2 Differences between Flink and Spark Streaming

(1)计算模型:Flink是流计算;Spark Streaming是微批次。
(1) Computing model: Flink is true streaming; Spark Streaming is micro-batch.

(2)时间语义:Flink有三种时间语义;Spark Streaming没有,只有处理时间。
(2) Time semantics: Flink has three; Spark Streaming has none, only processing time.

(3)乱序处理:Flink有(Watermark);Spark Streaming没有。
(3) Out-of-order handling: Flink yes (watermark); Spark Streaming no.

(4)窗口:Flink多、灵活;Spark Streaming少、不灵活(窗口长度必须是批次的整数倍)。
(4) Windows: Flink has many, flexible; Spark Streaming has few, inflexible (the window length must be an integer multiple of the batch interval).

(5)Checkpoint:Flink采用异步分界线快照。
(5) Checkpoint: Flink uses asynchronous barrier snapshots.

(6)状态:Flink有,而且丰富;Spark Streaming基本没有(仅updateStateByKey)。
(6) State: Flink yes, and rich; Spark Streaming essentially none (only updateStateByKey).

(7)流式SQL:Flink有;Spark Streaming没有。
(7) Streaming SQL: Flink yes; Spark Streaming no.

1.12.3 Flink提交作业流程及核心概念
1.12.3 Flink Submission Workflow and Core Concepts

1)Flink提交流程(Yarn-Per-Job)
1) Flink submission flow (Yarn per-job mode)

2)算子链路:Operator Chain
2) Operator Chain

Flink自动做的优化,要求算子间是One-to-one传输且并行度相同。
An optimization Flink applies automatically; it requires a one-to-one data exchange between the operators and the same parallelism.

代码disableOperatorChaining()禁用算子链。
The code disableOperatorChaining() disables operator chaining.

3)Graph生成与传递
3) Graph generation and transmission

各层Graph在哪里生成、传递给谁、做了什么:
Where each graph is generated, who it is passed to, and what it does:

(1)逻辑流图StreamGraph:在Client生成,供Client使用,是最初的DAG图。
(1) StreamGraph (logical graph): generated on the Client and used on the Client; the initial DAG.

(2)作业流图JobGraph:在Client生成,传递给JobManager,做了算子链路优化。
(2) JobGraph: generated on the Client and sent to the JobManager; operator-chain optimization is applied here.

(3)执行流图ExecutionGraph:在JobManager生成,供JobManager使用,细化了并行度。
(3) ExecutionGraph: generated and used on the JobManager; parallelism is expanded here.

(4)物理流图:Task部署到TaskManager后形成。
(4) Physical graph: formed after the tasks are deployed to the TaskManagers.

4)Task、Subtask的区别
4) Difference between Task and Subtask

Subtask:算子的一个并行实例。
Subtask: A parallel instance of the operator.

Task:Subtask运行起来之后,就叫Task。
Task: When a Subtask is running, it is called a Task.

5)并行度和Slot的关系
5) Relationship between parallelism and Slot

Slot是静态的概念,是指TaskMangaer具有的并发执行能力。
Slot is a static concept, referring to the concurrent execution capability of TaskMangaer.

并行度是动态的概念,指程序运行时实际使用的并发能力。
Parallelism is a dynamic concept that refers to the concurrency power actually used by a program at runtime.

设置合适的并行度能提高运算效率,太多太少都不合适。
Setting the appropriate parallelism can improve the computing efficiency, too much or too little is not appropriate.

6)Slot共享组了解吗?如何让算子独享Slot?
6) Do you know about slot sharing groups? How can an operator get a slot to itself?

默认共享组是default,同一共享组的Task可以共享Slot。
The default sharing group is "default"; tasks in the same sharing group can share slots.

通过slotSharingGroup()设置共享组(见下方示例)。
Set a different sharing group via slotSharingGroup(); see the sketch below.
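A minimal DataStream sketch (host, port and the operator logic are placeholders) showing both ways to break the operator chain and how slotSharingGroup() moves an operator into its own sharing group.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ChainAndSlotGroupDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // env.disableOperatorChaining();  // option 1: disable operator chaining for the whole job

        env.socketTextStream("hadoop102", 9999)
                .map(String::trim)
                .disableChaining()              // option 2: break the chain only around this operator
                .filter(s -> !s.isEmpty())
                .slotSharingGroup("heavy")      // this operator (and downstream ones, unless reset) use their own sharing group
                .print();

        env.execute("chain-and-slot-group");
    }
}
```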

1.12.4 Flink的部署模式及区别?
1.12.4 Flink deployment modes and their differences?

1)Local:本地模式,Flink作业在单个JVM进程中运行,适用于测试阶段
1) Local: Local mode, Flink jobs run in a single JVM process, suitable for the test phase

2)Standalone:Flink作业在一个专门的Flink集群上运行,独立模式不依赖于其他集群管理器(Yarn或者Kubernetes)
Standalone: Flink jobs run on a dedicated Flink cluster, independent of other cluster managers (Yarn or Kubernetes)

3)Yarn:

Per-job:独享资源,代码解析在Client
Per-job: dedicated resources; the user code is parsed on the Client

Application:独享资源,代码解析在JobMaster
Application: dedicated resources; the user code is parsed on the JobMaster

Session:共享资源,一套集群多个job
Session: Shared resources, a cluster of multiple jobs

4)K8s:支持云原生,未来的趋势
4) K8s: Support Cloud Native, Future Trends

5)Mesos:国外使用,仅作了解
5) Mesos: used abroad, for information only

1.12.5 Flink任务的并行度优先级设置?资源一般如何配置?
1.12.5 How is the parallelism of a Flink task prioritized? How are resources usually configured?

设置并行度有多种方式,优先级:算子 > 全局Env > 提交命令行 > 配置文件
There are many ways to set parallelism, priority: Operator> Global Env > Submit Command Line> Configuration File

1)并行度根据任务设置:
1) Parallelism is set according to the task:

(1)常规任务:Source,Transform,Sink算子都与Kafka分区保持一致
(1) Regular tasks: Source, Transform, Sink operators are consistent with Kafka partition

(2)计算量偏大的任务:Source、Sink算子与Kafka分区保持一致,Transform算子可设置成2的n次方,如64、128…
(2) Compute-heavy tasks: keep the Source and Sink operators consistent with the Kafka partitions; the Transform operators can be set to a power of two, e.g. 64, 128…

2)资源设置:通用经验 1CU = 1CPU + 4G内存
2) Resource settings: General experience 1CU = 1CPU + 4G memory

TaskManager的Slot数:1拖1(独享资源)、1拖N(节省资源,减少网络传输)
Slots per TaskManager: one slot per TaskManager (dedicated resources) or N slots per TaskManager (saves resources and reduces network transfer)

TaskManager的内存数:4~8G
TaskManager memory: 4~8 gigabytes

TaskManager的CPU:Flink默认一个Slot分配一个CPU
CPU of TaskManager: Flink assigns one CPU to one slot by default

JobManager的内存:2~4G
JobManager memory: 2–4 GB

JobManager的CPU:默认是1
CPU of JobManager: default is 1

3)资源是否足够:
3) Sufficient resources:

资源设置好后进行压测,看每个并行度的处理上限,是否会出现反压。
After sizing the resources, run a stress test to find the per-parallelism throughput ceiling and check whether backpressure appears.

例如:每个并行度处理5000条/s时开始出现反压,如果我们设置三个并行度,程序的处理上限就是15000条/s。
For example, if each parallel subtask starts to show backpressure at about 5,000 records/s and we set the parallelism to 3, the job's ceiling is about 15,000 records/s.

1.12.6 Flink的三种时间语义
1.12.6 Three temporal semantics of Flink

事件时间Event Time:是事件创建的时间。数据本身携带的时间。
Event Time: The time at which the event was created. The data itself carries the time.

进入时间Ingestion Time:是数据进入Flink的时间。
Ingestion Time: is the time at which data enters Flink.

处理时间Processing Time:是每一个执行基于时间操作的算子的本地系统时间,与机器相关,默认的时间属性就是Processing Time。
Processing Time: is the local system time for each operator performing time-based operations, machine-dependent, and the default time attribute is Processing Time.

1.12.7 你对Watermark的认识
1.12.7 What you know about Watermark

水位线是Flink流处理中保证结果正确性的核心机制,它往往会跟窗口一起配合,完成对乱序数据的正确处理。
Watermark is the core mechanism to ensure the correctness of results in Flink stream processing. It often cooperates with windows to complete the correct processing of out-of-order data.

水位线是插入到数据流中的一个标记,可以认为是一个特殊的数据
A watermark is a marker inserted into the data stream and can be thought of as a special piece of data

水位线主要的内容是一个时间戳,用来表示当前事件时间的进展
The main content of a water mark is a time stamp indicating the progress of the current event time

水位线是基于数据的时间戳生成的
Watermarks are generated based on timestamps of data

水位线的时间戳必须单调递增,以确保任务的事件时间时钟一直向前推进
The timestamp of the watermark must be monotonically increasing to ensure that the event time clock of the task advances all the way forward

水位线可以通过设置延迟,来保证正确处理乱序数据
Watermarks can be delayed to ensure that out-of-order data is handled correctly

一个水位线Watermark(t),表示当前流中事件时间已经到达时间戳t,即t之前的所有数据都已到齐,之后流中不应再出现时间戳t' ≤ t的数据。
A watermark Watermark(t) means the event time in the current stream has reached timestamp t: all data before t is assumed to have arrived, and no element with timestamp t' ≤ t should appear in the stream afterwards.

1.12.8 Watermark多并行度下的传递、生成原理
1.12.8 Watermark Transmission and Generation Principle under Multi-parallelism

1)分类:
1) Classification:

间歇性:来一条数据,更新一次Watermark。
Intermittent: one data, update Watermark once.

周期性:固定周期更新Watermark。
Periodicity: Watermark is updated periodically.

官方提供的API是基于周期的,默认200ms,因为间歇性会给系统带来压力。
The official API is cycle-based, with a default of 200ms, because intermittent can stress the system.

2)生成原理:
2) Principle of formation:

Watermark = 当前最大事件时间 - 乱序时间 - 1ms(见本节末尾的代码示意)
Watermark = current max event time − out-of-orderness − 1 ms (see the sketch at the end of this section)

3)传递:
3) Transmission:

Watermark是一条携带时间戳的特殊数据,从代码指定生成的位置,插入到流里面。
Watermark is a special piece of data with a timestamp inserted into the stream from the location specified by the code.

一对多:广播
One to many: broadcast.

多对一:取最小
Many to one: take the minimum

多对多:拆分来看,其实就是上面两种的结合。
Many to many: split, in fact, is the combination of the above two.
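A sketch of a periodic generator that follows the formula above; Flink's built-in WatermarkStrategy.forBoundedOutOfOrderness() already implements the same logic, and the 3-second bound here is just an example value.

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// watermark = max event time seen so far - allowed out-of-orderness - 1 ms
public class BoundedOutOfOrderGenerator<T> implements WatermarkGenerator<T> {

    private final long outOfOrdernessMillis = 3000L;
    // start low enough that the first subtraction does not underflow
    private long maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp); // track the largest event time
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // called periodically (200 ms by default, pipeline.auto-watermark-interval)
        output.emitWatermark(new Watermark(maxTimestamp - outOfOrdernessMillis - 1));
    }
}
```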

1.12.9 Flink怎么处理乱序和迟到数据?
1.12.9 How does Flink handle out-of-order and late data?

在Apache Flink中,迟到时间(lateness)和乱序时间(out-of-orderness)是两个与处理时间和事件时间相关的概念。它们在流处理过程中,尤其是在处理不按事件时间排序的数据时非常重要。
In Apache Flink, lateness and out-of-order are two concepts related to processing time and event time. They are important in stream processing, especially when dealing with data that is not sorted by event time.

(1)迟到时间(lateness):迟到时间可以影响窗口,在窗口计算完成后,仍然可以接收迟到的数据
(1) lateness: lateness can affect the window, after the window calculation is completed, you can still receive late data

迟到时间是指事件到达流处理系统的延迟时间,即事件的实际接收时间与其事件时间的差值。在某些场景下,由于网络延迟、系统故障等原因,事件可能会延迟到达。为了处理这些迟到的事件,Flink提供了一种机制,允许在窗口计算完成后仍然接受迟到的数据。设置迟到时间后,Flink会在窗口关闭之后再等待一段时间,以便接收并处理这些迟到的事件。
Latency is the delay time for an event to arrive at the stream processing system, i.e. the difference between the actual time of reception of the event and its event time. In some scenarios, events may arrive late due to network delays, system failures, etc. To handle these late events, Flink provides a mechanism that allows late data to be accepted even after the window computation is complete. When you set the lateness time, Flink waits a while after the window closes to receive and process these lateness events.

设置迟到时间的方法如下:
How to set the tardiness time is as follows:

在定义窗口时,使用`allowedLateness`方法设置迟到时间。例如,设置迟到时间为10分钟:
When defining the window, use the `allowedLateness` method to set the allowed lateness. For example, to allow 10 minutes of lateness:

```java

DataStream<T> input = ...;

input

.keyBy(<key selector>)

.window(<window assigner>)

.allowedLateness(Time.minutes(10))

.<window function>;

```

(2)乱序时间(out-of-orderness)
(2) Out-of-orderness

乱序时间是通过影响水印来影响数据的摄入,它表示的是数据的混乱程度。
Out-of-order time affects the data intake by affecting the watermark, which indicates the degree of chaos of the data.

乱序时间是指事件在流中不按照事件时间的顺序到达。在某些场景下,由于网络延迟或数据源的特性,事件可能会乱序到达。
Out-of-order timing means that events arrive in the stream out of the order of event timing. In some scenarios, events may arrive out of order due to network latency or the nature of the data source.

Flink提供了处理乱序事件的方法,即水位线(watermark)。
Flink provides a way to handle out-of-order events, namely watermarks.

水位线是一种表示事件时间进展的机制,它告诉系统当前处理到哪个事件时间。
Watermarks are a mechanism to indicate the temporal progression of events, telling the system which event time is currently being processed.

当水位线到达某个值时,说明所有时间戳小于该值的事件都已经处理完成。
When the watermark reaches a certain value, all events with timestamps less than that value have been processed.

为了处理乱序事件,可以为水位线设置一个固定的延迟。
To handle out-of-order events, you can set a fixed delay for the watermark.

设置乱序时间的方法如下:
The following is how to set the out-of-order time:

在定义数据源时,使用`assignTimestampsAndWatermarks`方法设置水位线策略。例如,设置水位线延迟为5秒:
When defining a data source, use the `assignTimestampsAndWatermarks` method to set the watermark policy. For example, set the watermark delay to 5 seconds:

```java

DataStream<T> input = env.addSource(<source>);

input

.assignTimestampsAndWatermarks(

WatermarkStrategy

.<T>forBoundedOutOfOrderness(Duration.ofSeconds(5))

.withTimestampAssigner(<timestamp assigner>))

.<other operations>;

```

1.12.10 说说Flink中的窗口(分类、生命周期、触发、划分)
1.12.10 Windows in Flink (classification, lifecycle, triggering, division)

1)窗口分类:
1) Window classification:

Keyed Window和Non-keyed Window

基于时间:滚动、滑动、会话。
Based on time: scrolling, swiping, conversation.

基于数量:滚动、滑动。
Based on quantity: scrolling, sliding.

2)Window口的4个相关重要组件:
2) The four important components of the Window:

assigner(分配器):如何将元素分配给窗口
assigner: How to assign elements to windows.

function(计算函数):为窗口定义的计算。其实是一个计算函数,完成窗口内容的计算
function: The calculation defined for the window. In fact, it is a calculation function that completes the calculation of the contents of the window.

triger(触发器):在什么条件下触发窗口的计算
trigger: Under what conditions triggers the calculation of the window.

可以使用自定义触发器,解决事件时间下没有新数据到达时窗口不触发计算的问题;还可以使用持续性触发器,实现一个窗口多次触发输出结果。问题演示与解决方案见下面两个链接。
A custom trigger can solve the problem that, under event time, a window never fires when no new data arrives; a continuous trigger can make one window emit results multiple times. See the two links below for a demo of the problem and its solution.

问题展示:https://www.bilibili.com/video/BV1Gv4y1H7F8/?spm_id_from=333.999.0.0&vd_source=891aa1a363111d4914eb12ace2e039af

问题解决:https://www.bilibili.com/video/BV1mM411N7uP/?spm_id_from=333.999.0.0&vd_source=891aa1a363111d4914eb12ace2e039af

evictor(退出器):定义从窗口中移除数据
evictor: defines how data is removed from the window

3)窗口的划分:如,基于事件时间的滚动窗口
3) window division: e.g., scrolling window based on event time

Start = 按照数据的事件时间向下取整到窗口长度的整数倍(见本节末尾的代码示意)。
start = the event timestamp rounded down to an integer multiple of the window size (see the sketch at the end of this section).

end = start + size

比如开了一个10s的滚动窗口,第一条数据是857s,那么它属于[850s,860s)。
For example, if you open a 10s scrolling window, the first data is 857s, then it belongs to [850s,860s).

4)窗口的创建:当属于某个窗口的第一个元素到达,Flink就会创建一个窗口,并且放入单例集合
4) Window creation: When the first element belonging to a window arrives, Flink will create a window and put it into the singleton collection.

5)窗口的销毁:时间进展 >= 窗口最大时间戳 + 窗口允许延迟时间
5) Destruction of window: time progression>= window maximum time stamp + window allowable delay time

(Flink保证只删除基于时间的窗口,而不能删除其他类型的窗口,例如全局窗口)。
(Flink guarantees that only time-based windows are deleted, and other types of windows, such as global windows, cannot be deleted.)

6)窗口为什么左闭右开:属于窗口的最大时间戳 = end - 1ms
6) Why the window is left closed and right open: the maximum timestamp belonging to the window = end - 1ms

7)窗口什么时候触发:如基于事件时间的窗口 watermark >= end - 1ms
7) When the window is triggered: such as window based on event time watermark >= end - 1ms
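A tiny sketch of the window-start arithmetic from point 3), mirroring what Flink's TimeWindow.getWindowStartWithOffset does (the offset and negative timestamps are ignored here for brevity).

```java
public class WindowStartDemo {

    // round the event time down to a multiple of the window size
    static long windowStart(long timestamp, long windowSize) {
        return timestamp - (timestamp % windowSize);
    }

    public static void main(String[] args) {
        long ts = 857_000L;      // an event at 857 s (in milliseconds)
        long size = 10_000L;     // a 10 s tumbling window
        long start = windowStart(ts, size);
        // prints [850000, 860000): the event falls into the window [850 s, 860 s)
        System.out.println("[" + start + ", " + (start + size) + ")");
    }
}
```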

1.12.11 Flink的keyBy怎么实现分区?分区、分组的区别是什么?
1.12.11 How does Flink's keyBy partition the data? What is the difference between partitioning and grouping?

分组和分区在 Flink 中具有不同的含义和作用:
Grouping and partitioning have different meanings and roles in Flink:

分区:分区(Partitioning)是将数据流划分为多个子集,这些子集可以在不同的任务实例上进行处理,以实现数据的并行处理。
Partitioning: Partitioning is the division of a data stream into subsets that can be processed on different task instances for parallel processing of data.

数据具体去往哪个分区,是先对指定的key取hashCode,再做一次murmurHash,然后用得到的值结合最大并行度和并行度计算出来的(见下方示意)。
Which partition a record goes to is computed by first taking the specified key's hashCode, then applying a murmur hash, and finally mapping that value through the max parallelism and the actual parallelism (see the sketch below).

分组:分组(Grouping)是将具有相同键值的数据元素归类到一起,以便进行后续操作(如聚合、窗口计算等)。key 值相同的数据将进入同一个分组中。
Grouping: Grouping is grouping data elements with the same key value together for subsequent operations (such as aggregation, window calculation, etc.). Data with the same key value will enter the same group.

注意:数据如果具有相同的 key 将一定去往同一个分组和分区,但是同一分区中的数据不一定属于同一组。
Note: Data with the same key must go to the same group and partition, but data in the same partition does not necessarily belong to the same group.
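A sketch of the routing described above. Flink's KeyGroupRangeAssignment follows the same two steps; the murmur() helper below is only a stand-in to keep the sketch runnable, not Flink's actual MurmurHash implementation.

```java
public class KeyByRoutingDemo {

    static int targetSubtask(Object key, int maxParallelism, int parallelism) {
        // step 1: hashCode, then a murmur-style hash, mapped into a key group
        int keyGroupId = (murmur(key.hashCode()) & Integer.MAX_VALUE) % maxParallelism;
        // step 2: map the key group to a parallel subtask index
        return keyGroupId * parallelism / maxParallelism;
    }

    // placeholder mixing function standing in for the real MurmurHash
    static int murmur(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        return code;
    }

    public static void main(String[] args) {
        // default maxParallelism is 128; assume 4 parallel subtasks
        System.out.println(targetSubtask("user_1001", 128, 4));
    }
}
```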

1.12.12 Flink的Interval Join的实现原理?Join不上的怎么办?
1.12.12 How is Flink's Interval Join implemented? What about records that fail to join?

底层调用的是keyby + connect ,处理逻辑:
The underlying call is keyby + connect, processing logic:

(1)判断是否迟到(迟到就不处理了,直接return)
(1) Judge whether you are late (if you are late, don't deal with it, return directly)

(2)每条流都存了一个Map类型的状态(key是时间戳,value是List存数据)
(2) Each stream stores a state of Map type (key is timestamp, value is List stored data)

(3)任意一条流来了一条数据,就遍历对方流的Map状态,能匹配上的就发往join方法。
(3) When a record arrives on either stream, it traverses the other stream's map state, and every match is sent to the join function.

(4)使用定时器,超过有效时间范围后删除对应Map中的数据(不是clear,是remove)。
(4) A timer removes entries from the map once they fall outside the valid time range (remove, not clear).

Interval join不会处理join不上的数据。如果需要未join上的数据,可以用coGroup + join算子实现,或者直接使用Flink SQL里的left join / right join语法。API用法见下方示意。
Interval join does not output records that fail to join. If you need them, implement the join with the coGroup + join operators, or use the LEFT JOIN / RIGHT JOIN syntax in Flink SQL. A minimal sketch of the API follows.
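A minimal sketch of the interval-join API, assuming two keyed streams orderKeyed and payKeyed of hypothetical OrderEvent / PayEvent types that already have event-time timestamps and watermarks assigned.

```java
// OrderEvent/PayEvent and their getters are assumed POJOs;
// imports: org.apache.flink.streaming.api.windowing.time.Time,
//          org.apache.flink.streaming.api.functions.co.ProcessJoinFunction,
//          org.apache.flink.util.Collector
orderKeyed
    .intervalJoin(payKeyed)
    // a payment matches an order if its event time lies in [orderTime - 5 min, orderTime + 15 min]
    .between(Time.minutes(-5), Time.minutes(15))
    .process(new ProcessJoinFunction<OrderEvent, PayEvent, String>() {
        @Override
        public void processElement(OrderEvent order, PayEvent pay, Context ctx, Collector<String> out) {
            out.collect(order.getOrderId() + " paid at " + pay.getTs());
        }
    });
```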

1.12.13 介绍一下Flink的状态编程、状态机制?
1.12.13 Describe Flink's state programming and state mechanism.

(1)算子状态:作用范围是算子,算子的多个并行实例各自维护一个状态
(1) Operator state: The scope of action is an operator, and multiple parallel instances of the operator maintain a state each.

(2)键控状态:每个分组维护一个状态
(2) Keying state: maintain one state per group

(3)状态后端:管两件事 => 本地状态存哪里、Checkpoint存哪里
(3) State backend: decides two things — where the local state lives and where the checkpoints are stored

1.13版本之前:
Before 1.13:

(1)内存:本地状态存TaskManager内存,Checkpoint存JobManager内存。
(1) Memory: local state in TaskManager memory, checkpoints in JobManager memory.

(2)文件:本地状态存TaskManager内存,Checkpoint存HDFS。
(2) File: local state in TaskManager memory, checkpoints on HDFS.

(3)RocksDB:本地状态存RocksDB,Checkpoint存HDFS。
(3) RocksDB: local state in RocksDB, checkpoints on HDFS.

1.13版本之后:
From 1.13 on:

(1)HashMap:本地状态存TaskManager内存。
(1) HashMap: local state in TaskManager memory.

(2)RocksDB:本地状态存RocksDB。
(2) RocksDB: local state in RocksDB.

Checkpoint的存储位置由参数单独指定。
Where checkpoints are stored is specified separately by a parameter.

(4)算子状态中List和UnionList区别?
(4) Difference between List and UnionList in operator state?

当算子并行度调整时
When the parallelism of operators is adjusted

List:恢复时整合成一个大列表,轮询分发到不同的并行子任务中。
List state: on restore the entries are merged into one big list and redistributed round-robin across the parallel subtasks.

UnionList:恢复时整合成一个大列表,每个并行子任务都拿到完整的列表。
Union list state: on restore the entries are merged into one big list and every parallel subtask receives the complete list.

1.12.14 Flink如何实现端到端一致性?
1.12.14 How does Flink achieve end-to-end consistency?

Source端可重放(如Kafka)+ Flink内部基于Checkpoint的精准一次 + Sink端两阶段提交或幂等写入。
A replayable source (such as Kafka) + exactly-once inside Flink via checkpoints + a two-phase-commit or idempotent sink.

1.12.15 分布式异步快照原理
1.12.15 Principles of distributed asynchronous snapshots

barriers在数据流源处被注入并行数据流中。快照n的barriers被插入的位置(我们称之为Sn)是快照所包含的数据在数据源中最大位置。
Barriers are injected into parallel data streams at the data stream source. The position where the barriers of snapshot n are inserted (we call it Sn) is the largest position in the data source where the snapshot contains data.

例如,在Kafka中,此位置将是分区中最后一条记录的偏移量。 将该位置Sn报告给checkpoint协调器(Flink的JobManager)。
For example, in Kafka, this location would be the offset of the last record in the partition. Report this location Sn to the checkpoint coordinator (Flink's JobManager).

然后barriers向下游流动。当一个中间操作算子从其所有输入流中收到快照n的barriers时,它会为快照n发出barriers进入其所有输出流中。
Barriers then flow downstream. When an intermediate operator receives barriers for snapshot n from all its input streams, it emits barriers for snapshot n into all its output streams.

一旦Sink操作算子(流式DAG的末端)从其所有输入流接收到barriers n,它就向Checkpoint协调器确认快照n完成。
Once the Sink operator (the end of the streaming DAG) receives barriers n from all its input streams, it acknowledges snapshot n completion to the Checkpoint coordinator.

在所有Sink确认快照后,意味快照着已完成。一旦完成快照n,Job将永远不再向数据源请求Sn之前的记录,因为此时这些记录(及其后续记录)将已经通过整个数据流拓扑,也即是已经被处理结束。
After all Sinks confirm the snapshot, the snapshot is meant to be complete. Once snapshot n is complete, Job will never ask the data source for records prior to Sn, because these records (and their successors) will have passed through the entire data flow topology, i.e., have been processed.

1.12.16 Checkpoint的参数怎么设置的?
1.12.16 How are Checkpoint parameters set?

(1)间隔:兼顾性能和延迟,一般任务设置分钟级(1~5min),要求延迟低的设置秒级(配置示例见本节末尾)。
(1) Interval: balance performance against latency; ordinary jobs use a minute-level interval (1–5 min), latency-sensitive jobs use seconds (see the sketch at the end of this section).

(2)Task重启策略(Failover):
(2) Task restart strategy (failover):

固定延迟重启策略:重试几次、每次间隔多久。
Fixed delay restart strategy: how many retries and how long between retries.

失败率重启策略:重试次数、统计的时间区间、重试间隔。
Failure-rate restart strategy: number of retries, the measuring time window, and the delay between retries.

无重启策略:一般在开发测试时使用。
No restart strategy: generally used during development testing.

Fallback重启策略:默认固定延迟重启策略。
Fallback restart policy: Default fixed delay restart policy.
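A minimal configuration sketch (the concrete numbers are examples, not recommendations) combining a checkpoint interval with a fixed-delay restart strategy.

```java
// imports: org.apache.flink.streaming.api.CheckpointingMode,
//          org.apache.flink.api.common.restartstrategy.RestartStrategies,
//          org.apache.flink.api.common.time.Time, java.util.concurrent.TimeUnit
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// checkpoint every 1 minute with exactly-once semantics
env.enableCheckpointing(60 * 1000L, CheckpointingMode.EXACTLY_ONCE);
// leave at least 30 s between the end of one checkpoint and the start of the next
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30 * 1000L);
// give up on a checkpoint that takes longer than 10 minutes
env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000L);

// fixed-delay restart: retry 3 times, waiting 10 s between attempts, then fail the job
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
```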

1.12.17 Barrier对齐和不对齐的区别
1.12.17 Difference Between Barrier Alignment and Misalignment

精准一次语义:默认Barrier对齐,也可以设置Barrier不对齐。
Exactly-once semantics: barriers are aligned by default; unaligned barriers can also be enabled.

Barrier对齐:先到达的Barrier会等待其它并行度的Barrier,数据会先缓存,等待对齐快照。
Barrier alignment: The Barrier that arrives first will wait for the barrier of the other parallelism, and the data will be cached first, waiting for the snapshot to be aligned.

Barrier不对齐:先到达的Barrier直接越过缓冲的数据继续下发,对算子状态和缓冲区中的数据一起做快照。
Unaligned barriers: the first barrier to arrive overtakes the buffered data and is forwarded immediately; the operator state and the in-flight buffer data are snapshotted together.

至少一次语义:只有barrier对齐。
At least once semantics: only barrier alignment.

Barrier先到达时,其他输入的数据不会被阻塞,继续发往下游计算,恢复时可能导致数据重复计算。
When a barrier arrives first, data on the other inputs is not blocked and keeps flowing downstream, so after a recovery some data may be computed twice.

1.12.18 Flink内存模型(重点)
1.12.18 Flink memory model (key point)

TaskManager进程内存 = Flink内存 + JVM Metaspace + JVM开销;Flink内存 = 框架堆内/堆外内存 + Task堆内/堆外内存 + 网络缓冲内存 + 托管内存(主要供RocksDB状态后端和批处理算子使用)。
TaskManager process memory = Flink memory + JVM metaspace + JVM overhead; Flink memory = framework heap/off-heap + task heap/off-heap + network buffer memory + managed memory (used mainly by the RocksDB state backend and batch operators).

1.12.19 Flink常见的维表Join方案
1.12.19 Flink Common Dimension Table Join Scheme

(1)预加载:在open()方法中查询维表并缓存,再配合定时刷新(见本节末尾的示例)。
(1) Preload: query the dimension table in open() and cache it, combined with a periodic refresh (see the sketch after this list).

(2)热存储:存在外部系统Redis、HBase等
(2) Thermal storage: there are external systems Redis, HBase, etc.

(3)广播维表
(3) Broadcast Dimension Table

(4)Lookup Join:外部存储,connector创建,SQL用法
(4) LookupJoin: external storage, connector creation, SQL usage
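A minimal sketch of option (1): load the dimension table once in open() and refresh it on a timer. The MySQL address, table and column names are placeholders borrowed from the examples above, not a fixed recipe.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// enriches a province id with its name from a preloaded dimension table
public class DimEnrichFunction extends RichMapFunction<String, String> {

    private final Map<String, String> dimCache = new ConcurrentHashMap<>();
    private transient ScheduledExecutorService refresher;

    @Override
    public void open(Configuration parameters) throws Exception {
        loadDim();  // initial load
        // refresh the cache every hour in a background thread
        refresher = Executors.newSingleThreadScheduledExecutor();
        refresher.scheduleAtFixedRate(() -> {
            try {
                loadDim();
            } catch (Exception e) {
                // keep serving the old cache if the refresh fails
            }
        }, 1, 1, TimeUnit.HOURS);
    }

    private void loadDim() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://hadoop102:3306/gmall", "root", "000000");
             PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM base_province");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                dimCache.put(rs.getString("id"), rs.getString("name"));
            }
        }
    }

    @Override
    public String map(String provinceId) {
        return provinceId + "," + dimCache.getOrDefault(provinceId, "unknown");
    }

    @Override
    public void close() {
        if (refresher != null) {
            refresher.shutdown();
        }
    }
}
```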

1.12.20 Flink的上下文对象理解
1.12.20 Context Object Understanding for Flink

上下文对象通常用于访问作业执行和数据处理相关的信息,帮助开发人员更好控制和理解作业行为,允许在作业执行期间访问相关信息,以便进行自定义操作和优化
Context objects are often used to access information about job execution and data processing, help developers better control and understand job behavior, and allow access to relevant information during job execution for custom actions and optimizations

RuntimeContext: 在 Flink 任务中,每个并行任务都有一个与之相关联的 RuntimeContext 对象。这个对象提供了任务的上下文信息,例如任务的名称、索引、并行度等。可以使用对象来访问运行时的配置和状态信息并执行一些有状态操作。
RuntimeContext: In Flink tasks, each parallel task has a RuntimeContext object associated with it. This object provides contextual information about the task, such as its name, index, parallelism, and so on. You can use objects to access runtime configuration and state information and perform stateful operations.

FunctionContext: Flink 作业中使用函数(如 MapFunction 或 KeyedProcessFunction),则可以使用 FunctionContext 来访问有关函数的上下文信息。这包括有关当前数据流记录、定时器、状态等的信息。
FunctionContext: FunctionContext can be used to access context information about functions used in Flink jobs, such as MapFunction or KeyedProcessFunction. This includes information about the current data stream record, timers, status, etc.

CheckpointContext:允许访问检查点相关的信息,例如检查点的 ID 和状态。
CheckpointContext: Allows access to checkpoint-related information, such as the checkpoint ID and status.

1.12.21 Flink网络调优——缓冲消胀机制
1.12.21 Flink network tuning: the buffer debloating mechanism

配置缓冲数据量的常规方法是指定缓冲区的数量和大小,但每次部署情况不同,很难配置出一组完美的参数。Flink 1.14新引入的缓冲消胀机制尝试通过自动把缓冲数据量调整到一个合理值来解决这个问题。
The usual way to configure the amount of in-flight buffered data is to specify the number and size of the buffers, but since every deployment differs it is hard to find a perfect set of values. The buffer debloating mechanism introduced in Flink 1.14 tries to solve this by automatically adjusting the amount of buffered data to a reasonable value.

https://nightlies.apache.org/flink/flink-docs-release-1.17/zh/docs/deployment/memory/network_mem_tuning/

1.12.22 FlinkCDC锁表问题
1.12.22 FlinkCDC lock table problem

(1)FlinkCDC 1.x同步历史数据时会锁表。
(1) FlinkCDC 1.x locks the table while synchronizing historical data.

可以设置参数不加锁,但这样只能保证至少一次。
A parameter can disable the lock, but then only at-least-once is guaranteed.

(2)2.x实现了无锁算法,同步历史数据的时候不会锁表。
(2) 2.x implements a lock-free algorithm and does not lock the table while synchronizing historical data.

2.x在全量同步阶段可以多并行子任务同步,在增量阶段只能单并行子任务同步。
In 2.x the full (snapshot) phase can run with multiple parallel subtasks, while the incremental phase runs with a single parallel subtask.

1.13 HBase

1.13.1 HBase存储结构
1.13.1 HBase Storage Structure

架构角色:
Architecture Roles:

1)Master

实现类为HMaster,负责监控集群中所有的 RegionServer 实例。主要作用如下:
The implementation class is HMaster, which is responsible for monitoring all RegionServer instances in the cluster. The main functions are as follows:

(1)管理元数据表格hbase:meta,接收用户对表格创建、修改、删除的命令并执行。
(1) Manages the metadata table hbase:meta and receives and executes the user's create/alter/drop table commands.

(2)监控region是否需要进行负载均衡,故障转移和region的拆分。
(2) Monitor whether the region needs to be load balanced, failover and region splitting.

通过启动多个后台线程监控实现上述功能:
The above is achieved by starting multiple background thread monitoring:

LoadBalancer负载均衡器
(1) LoadBalancer

周期性监控region分布在regionServer上面是否均衡,由参数hbase.balancer.period控制周期时间,默认5分钟。
Periodically monitor whether the distribution of regions on the regionServer is balanced, and the hbase.balancer.period parameter controls the cycle time, which is 5 minutes by default.

CatalogJanitor元数据管理器
(2) CatalogJanitor Metadata Manager

定期检查和清理HBase:meta中的数据。meta表内容在进阶中介绍。
Regularly check and clean up the data in HBase:meta. The contents of the meta table are described in Advanced.

MasterProcWAL Master预写日志处理器
(3) MasterProcWAL Master write-ahead log processor

Master需要执行的任务记录到预写日志WAL中,如果Master宕机,backupMaster读取日志继续执行。
The tasks the Master needs to perform are recorded in the write-ahead log (WAL); if the Master goes down, the backup Master reads the log and carries on.

2)Region Server

Region Server实现类为HRegionServer,主要作用如下:
The implementation class of the Region Server is HRegionServer, which functions as follows:

(1)负责数据cell的处理,例如写入数据put,查询数据get等
(1) Responsible for the processing of data cells, such as writing data put, querying data get, etc

(2)拆分合并Region的实际执行者,有Master监控,有regionServer执行。
(2) The actual executor of the split and merged region is monitored by the master and executed by the regionServer.

3)Zookeeper

HBase通过Zookeeper来做Master的高可用、记录RegionServer的部署信息并且存储有meta表的位置信息
HBase uses ZooKeeper to perform high availability of the master, record the deployment information of the RegionServer, and store the location information of meta tables.

HBase客户端进行数据读写操作时是直接访问Zookeeper获取meta表位置的。2.3版本推出Master Registry模式后,客户端可以改为直接访问Master。使用此功能会加大Master的压力,减轻Zookeeper的压力。
For data reads and writes the HBase client goes directly to Zookeeper to locate the meta table. Since version 2.3 the Master Registry mode lets the client ask the Master instead; using it increases the load on the Master and reduces the load on Zookeeper.

4)HDFS

HDFS为HBase提供最终的底层数据存储服务,同时为HBase提供高容错的支持。
HDFS provides the ultimate underlying data storage service for HBase and provides highly fault-tolerant support for HBase.

1.13.2 HBase的写流程
1.13.2 HBase Write Process

写流程:
Writing process:

写流程顺序正如API编写顺序,首先创建HBase的重量级连接
The order in which the writing process is written, just like the order in which the APIs are written, first creates a heavyweight connection to HBase

(1)读取本地缓存中的Meta表信息;(第一次启动客户端为空)
(1) Read the Meta table information in the local cache; (The first start of the client is empty)

(2)向ZK发起读取Meta表所在位置的请求;
(2) Initiate a request to ZK to read the location of the Meta table;

(3)ZK正常返回Meta表所在位置;
(3) ZK returns the location of the Meta table normally;

(4)向Meta表所在位置的RegionServer发起请求读取Meta表信息;
(4) Initiate a request to the RegionServer where the Meta table is located to read the Meta table information;

(5)读取到Meta表信息并将其缓存在本地;
(5) Read the Meta table information and cache it locally;

(6)向待写入表发起写数据请求;
(6) Initiate a write data request to the table to be written;

(7)先写WAL,再写MemStore,并向客户端返回写入数据成功。
(7) Write WAL first, then write MemStore, and return the successful data writing to the client.

1.13.3 HBase的读流程
1.13.3 Read flow of HBase

创建连接同写流程。
Create a connection and write process.

(1)读取本地缓存中的Meta表信息;(第一次启动客户端为空)
(1) Read the Meta table information in the local cache; (The first start of the client is empty)

(2)向ZK发起读取Meta表所在位置的请求;
(2) Initiate a request to ZK to read the location of the Meta table;

(3)ZK正常返回Meta表所在位置;
(3) ZK returns the location of the Meta table normally;

(4)向Meta表所在位置的RegionServer发起请求读取Meta表信息;
(4) Initiate a request to the RegionServer where the Meta table is located to read the Meta table information;

(5)读取到Meta表信息并将其缓存在本地;
(5) Read the Meta table information and cache it locally;

(6)MemStore、StoreFile、BlockCache

同时构建MemStore与StoreFile的扫描器,
Build a scanner for both MemStore and StoreFile,

MemStore:正常读
MemStore: normal read

StoreFile

根据索引确定待读取文件;
Determine the files to be read based on the index;

再根据BlockCache确定读取文件;
Then determine the read file according to BlockCache;

(7)合并多个位置读取到的数据,给用户返回最大版本的数据;如果最大版本的数据是删除标记,则不返回任何数据。
(7) Merge the data read from the different locations and return the newest version to the user; if the newest version is a delete marker, return nothing.

1.13.4 HBase的合并
1.13.4 Mergers of HBases

Compaction分为两种,分别是Minor Compaction和Major Compaction。
There are two kinds of compaction: Minor Compaction and Major Compaction.

1.13.5 RowKey设计原则
1.13.5 RowKey Design Principles

(1)rowkey长度原则
(1) Rowkey length principle

(2)rowkey散列原则
(2) Rowkey hashing principle

(3)rowkey唯一原则
(3) Rowkey sole principle

1.13.6 RowKey如何设计
1.13.6 How to design a RowKey

1)使用场景:
1) Usage Scenarios:

大量用户信息保存在HBase中。
A large amount of user information is stored in HBase.

2)热点问题:
2) Hot Issues:

由于用户的id是连续的,批量导入用户数据后,很有可能用户信息都集中在同一个region中。如果用户信息频繁访问,很有可能该region的节点成为热点。
Since user IDs are sequential, it is likely that all user information will be centralized in the same region after importing user data in batches. If user information is frequently accessed, nodes in the region are likely to become hotspots.

3)期望: 通过对Rowkey的设计,使用户数据能够分散到多个region中。
3) Expectation: Through the design of Rowkey, user data can be distributed across multiple regions.

4)步骤:
4) Steps:

(1)预分区
(1) Pre-partition

通过命令
from the command

create 'GMALL:DIM_USER_INFO','INFO',SPLITS=>['20','40','60','80']

把用户信息表(GMALL:DIM_USER_INFO) 分为5个region : [00-20), [20-40), [40-60), [60-80), [80-99]

(2)写入时反转ID
(2) Invert ID when writing

把用户ID左补零10位(根据最大用户数),然后反转顺序。
Fill the user ID with 10 left zeros (according to the maximum number of users) and reverse the order.

比如:用户id为1457,补零反转后变为7541000000,根据前两位分到region [60,80);
For example, user id 1457 becomes 7541000000 after padding and reversing, and its first two digits place it in region [60, 80);

用户id为1459,补零反转后变为9541000000,根据前两位分到region [80,99]。
user id 1459 becomes 9541000000 and its first two digits place it in region [80, 99].

这样连续的用户ID反转后由于Rowkey开头并不连续,会进入不同的region中。
After such consecutive user IDs are reversed, since Rowkey starts are not consecutive, they will enter different regions.

最终达到的效果可以通过Web UI进行观察:
The final result can be observed through the Web UI:

如上图,用户数据会分散到多个分区中。
As shown above, user data is scattered across multiple partitions.

注意:在查询时,也同样需要把ID补零并反转后再去查询(见下方示意)。
Note: when querying, the ID must be padded and reversed in the same way before the lookup (see the sketch below).
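A small sketch of step (2): left-pad the user id to 10 digits and reverse it, so consecutive ids spread across the pre-split regions (the same helper is used when building the query key).

```java
public class RowKeyDemo {

    static String buildRowKey(long userId) {
        String padded = String.format("%010d", userId);          // 1457 -> "0000001457"
        return new StringBuilder(padded).reverse().toString();   // -> "7541000000"
    }

    public static void main(String[] args) {
        System.out.println(buildRowKey(1457L)); // 7541000000 -> region [60, 80)
        System.out.println(buildRowKey(1459L)); // 9541000000 -> region [80, 99]
    }
}
```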

1.13.7 HBase二级索引原理
1.13.7 HBase secondary indexing principle

1)原理
1) Principle

协处理器:协助处理数据,可以在向原始表中写入数据之后向索引表中写入一条索引数据。
Coprocessor: Assists in processing data by writing an index entry to the index table after writing data to the original table.

2)种类及用法
2) Types and usage

(1)全局索引:适合读多写少的场景
(1) Global index: suited to read-heavy, write-light workloads

单独创建表专门用于存储索引,索引表数据量比原始表小,读取更快速。但是写操作会写两张表的数据,跨Region,需要多个连接。
A separate table is created specifically to store indexes, which are smaller and faster to read than the original table. However, write operations write data from two tables across Regions, requiring multiple joins.

(2)本地索引:适合写多读少的场景
(2) Local index: suited to write-heavy, read-light workloads

将索引数据与原表放在一起(Region),加在一起比原表数据量大,读取相对变慢,但是由于在一个Region,所以写操作两条数据用的是同一个连接。
The index data and the original table are put together (Region), which adds up to a larger amount of data than the original table, and the reading is relatively slow, but because it is in a Region, the same connection is used to write two pieces of data.

1.14 Clickhouse

1.14.1 Clickhouse的优势

快:提供了丰富的表引擎,每个表引擎 都做了尽可能的优化。
Fast: Rich table engines are provided, each optimized as much as possible.

为什么快?
Why fast?

(1)向量化
(1) vectorization

(2)列式存储
(2) Columnar storage

(3)尽可能使用本节点的内存+CPU,不依赖其他组件,比如Hadoop
(3) Uses the local node's memory + CPU as much as possible and does not depend on other components such as Hadoop

(4)提供了SQL化的语言
(4) Provides a SQL-like language

(5)支持自定义函数
(5) Support for custom functions

(6)提供了丰富的表引擎,引擎都经过了优化
(6) Rich table engines are provided, and the engines are optimized.

1.14.2 Clickhouse的引擎

(1)Log系列
(1) The Log family

(2)Special:Memory、Distributed等
(2) Special: Memory, Distributed, etc.

(3)MergeTree系列:ReplacingMergeTree、SummingMergeTree、ReplicatedMergeTree等
(3) The MergeTree family: ReplacingMergeTree, SummingMergeTree, ReplicatedMergeTree, etc.

(4)集成引擎:外部系统映射,如MySQL
(4) Integration engines: mappings to external systems, such as MySQL

1.14.3 Flink写入Clickhouse怎么保证一致性?
1.14.3 Flink writes Clickhouse How to guarantee consistency?

Clickhouse没有事务,Flink写入是至少一次语义。
Clickhouse has no transactions, Flink writes are at least once semantic.

利用Clickhouse的ReplacingMergeTree引擎会根据主键去重,但只能保证最终一致性。查询时加上final关键字可以保证查询结果的一致性。
Clickhouse's ReplacingMergeTree engine deduplicates based on the primary key, but can only guarantee final consistency. Adding the final keyword to the query ensures consistency of the query results.

1.14.4 Clickhouse存储多少数据?几张表?
1.14.4 How much data does Clickhouse store? How many forms?

10几张宽表,每天平均10来G,存储一年。
A dozen or so wide tables, about 10 GB per day on average, retained for one year.

需要磁盘 10G * 365天 * 2副本 / 0.7 ≈ 11T
Disk needed: 10 GB * 365 days * 2 replicas / 0.7 ≈ 11 TB

1.14.5 Clickhouse使用本地表还是分布式表
1.14.5 Clickhouse Using Local or Distributed Tables

1)我们用的本地表,2个副本
1) Local table we use, 2 copies

2)分布式表写入存在的问题:
2) Problems with distributed table writing:

假如现有一个2分片的集群,使用clickhouse插入分布式表。
If you have a 2-sharded cluster, insert distributed tables using clickhouse.

(1)资源消耗问题:在分片2的数据写入临时目录中会产生写放大现象,会大量消耗分片节点1的CPU和磁盘等资源。
(1) Resource consumption problem: Writing data to the temporary directory of fragment 2 will cause write amplification, which will consume a lot of CPU and disk resources of fragment node 1.

(2)数据准确性和一致性问题:在写入分片2的时候,节点1或节点2的异常都会导致数据问题(节点1挂了数据丢失;节点2挂了或节点2上的表被删了,节点1会无限制重试,占用资源)。
(2) Data accuracy and consistency: while writing to shard 2, a failure on node 1 or node 2 causes data problems (if node 1 dies the data is lost; if node 2 dies or its table is dropped, node 1 retries indefinitely and occupies resources).

(3)part过多问题:每个节点每秒收到一个Insert Query,N个节点,分发N-1次,一共就是每秒生成N×(N-1)个part目录。集群shard数越多,分发产生的小文件也越多(写本地表则相对集中),最终写入MergeTree的part数会特别多,拖垮整个文件系统。
(3) Too many parts: if every node receives one insert query per second, each of the N nodes distributes it N-1 times, so N×(N-1) part directories are created per second. The more shards the cluster has, the more small files the distribution produces (writing to local tables keeps them relatively concentrated); in the end the number of parts written to MergeTree becomes huge and drags down the whole file system.

1.14.6 Clickhouse的物化视图
1.14.6 Materialized Views of Clickhouse

一种查询结果的持久化,记录了查询语句和对应的查询结果。
A persistence of query results that records query statements and corresponding query results.

优点:查询速度快。如果把物化视图的预聚合规则写好,查询比直接查原始数据快很多,因为总行数变少,而且结果已经预计算好了。
Advantages: fast queries. With well-designed materialized-view rules, queries are much faster than querying the raw data, because there are far fewer rows and the results are pre-computed.

缺点:本质上是面向流式写入的累加式技术,用历史数据做去重、去核这类分析时并不好用,适用场景有限;而且一张表挂太多物化视图,写入这张表时会消耗很多机器资源,比如数据带宽占满、存储一下子增加很多。
Disadvantages: it is essentially an accumulate-on-ingest technique for streaming writes, so it is not well suited to deduplication or similar analyses over historical data, and its applicable scenarios are limited. Also, if a table has many materialized views attached, writing to it consumes a lot of machine resources, e.g. the bandwidth is saturated and storage grows sharply.

1.14.7 Clickhouse的优化

1)内存优化
1) Memory optimization

max_memory_usage: 单个查询的内存上限,128G内存的服务器==》 设为100G
max_memory_usage: Maximum memory limit for a single query, 128 GB memory server ==> Set to 100 GB

max_bytes_before_external_group_by:设为max_memory_usage的一半,即50G
max_bytes_before_external_group_by: set to half of max_memory_usage, i.e. 50 GB

max_bytes_before_external_sort:同样设为一半,50G
max_bytes_before_external_sort: likewise set to half, 50 GB

2)CPU

max_concurrent_queries: 默认 100/s ===> 300/s

3)存储
3) Storage

SSD更快
SSD is faster

4)物化视图
4) Materialized View

5)写入时攒批,避免写入过快导致 too many parts
5) Save batches when writing, avoid writing too fast and cause too many parts

1.14.8 Clickhouse的新特性Projection
1.14.8 Projection, a newer ClickHouse feature

Projection 意指一组列的组合,可以按照与原表不同的排序存储,并且支持聚合函数的查询。ClickHouse Projection 可以看做是一种更加智能的物化视图,它有如下特点:
Projection means a combination of columns that can be stored in a different order than the original table and supports queries for aggregate functions. ClickHouse Projection can be seen as a more intelligent materialized view, which has the following characteristics:

1)part-level存储

相比普通物化视图是一张独立的表,Projection 物化的数据就保存在原表的分区目录中,支持明细数据的普通Projection 和 预聚合Projection。
Compared with ordinary materialized view, which is an independent table, the materialized data of Projection is saved in the partition directory of the original table. It supports ordinary Projection and pre-aggregation Projection of detailed data.

2)无感使用,自动命中
2) Transparent to queries, hit automatically

可以对一张 MergeTree 创建多个 Projection ,当执行 Select 语句的时候,能根据查询范围,自动匹配最优的 Projection 提供查询加速。如果没有命中 Projection , 就直接查询底表。
Multiple projections can be created for a MergeTree, and when executing Select statements, the optimal Projection can be automatically matched according to the query scope to provide query acceleration. If you don't hit the Projection , query the bottom table directly.

3)数据同源、同生共死
3) Data homology, life and death together

因为物化的数据保存在原表的分区,所以数据的更新、合并都是同源的,也就不会出现不一致的情况了。
Because the materialized data is stored in the partition of the original table, the update and merge of the data are homologous, and there will be no inconsistency.

1.14.9 Clickhouse的索引、底层存储
1.14.9 ClickHouse indexes and underlying storage

1)索引
1) Index

(1)一级索引:稀疏索引(主键索引) 粒度8192
(1) Level 1 index: sparse index (primary index) granularity 8192

(2)二级索引:跳数索引 minmax、set、bloom_filter
(2) Secondary indexes: data-skipping indexes such as minmax, set, bloom_filter

2)底层存储
2) Underlying storage

Clickhouse默认数据目录在/var/lib/clickhouse/data目录中。所有的数据库都会在该目录中创建一个子文件夹。下表展示了Clickhouse对数据文件的组织。
ClickHouse's default data directory is /var/lib/clickhouse/data. Every database creates a subfolder in this directory. The table below shows how ClickHouse organizes data files.

202103_1_10_2 | 目录 | 分区目录,由分区+LSM生成
202103_1_10_2 | directory | A part (partition) directory, generated by partitioning + LSM merges

detached | 目录 | 通过DETACH语句卸载后的表分区存放位置
detached | directory | Where table partitions are stored after being detached with the DETACH statement

format_version.txt | 文本文件 | 纯文本,记录存储的格式
format_version.txt | text file | Plain text recording the storage format

分区目录命名 = 分区ID_最小数据块编号_最大数据块编号_层级构成。数据块编号从1开始自增,新创建的数据块最大和最小编号相同,当发生合并时会将其修改为合并的数据块编号。同时每次合并都会将层级增加1。
Partition Directory Name = Partition ID_Smallest Block Number_Largest Block Number_Hierarchy Composition. Block numbers increment from 1, the maximum and minimum block numbers for newly created blocks are the same, and when a merge occurs, it is modified to the merged block number. Each merge also increases the level by 1.
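For example, for the directory 202103_1_10_2 shown above: 202103 is the partition ID, 1 is the minimum data block number, 10 is the maximum data block number, and the trailing 2 is the level, i.e., the part has gone through two merges.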

1.15 Doris

1.15.1 Doris中的三种模型及对比
1.15.1 The three models in Doris and how they compare?

- Aggregate 将数据分为key和value,进行聚合
- Aggregate divides data into key and value for aggregation

- Uniq 数据添加主键
- Uniq: adds a unique key to the data; rows with the same key are deduplicated

- Duplicate 明细数据
- Duplicate detail data

1.15.2 Doris的分区分桶怎么理解怎么划分字段
1.15.2 How to understand the partition and bucket of Doris and how to divide the fields

Doris支持两层的数据划分。第一层是 Partition,支持 Range和List的划分方式。第二层是 Bucket(Tablet),仅支持Hash的划分方式。也可以仅使用一层分区。使用一层分区时,只支持Bucket划分。
Doris supports two-tier data partitioning. The first layer is Partition, which supports Range and List partitioning. The second layer is Bucket (Tablet), which only supports Hash partition. It is also possible to use only one level of partitioning. When using one-level partitioning, only Bucket partitioning is supported.
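A minimal sketch of the two-level division (Range partition + Hash bucket); the table, columns, and values are hypothetical:

CREATE TABLE user_event
(
    event_day DATE,
    user_id   BIGINT,
    city      VARCHAR(32),
    cost      BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY(event_day, user_id, city)
PARTITION BY RANGE(event_day)
(
    PARTITION p202405 VALUES LESS THAN ("2024-06-01"),
    PARTITION p202406 VALUES LESS THAN ("2024-07-01")
)
DISTRIBUTED BY HASH(user_id) BUCKETS 10
PROPERTIES ("replication_num" = "3");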

1.15.3 生产中节点多少个,FE,BE 那个对于CPU和内存的消耗大
1.15.3 How many nodes in production? Which of FE and BE consumes more CPU and memory?

独立部署,5台起步,BE消耗更大
FE and BE are deployed independently, starting from about 5 machines; BE consumes more.

1.15.4 Doris使用过程中遇到过哪些问题?
1.15.4 What problems have you encountered with Doris?

1)数据量大资源不足:增加机器台数,增加并发
1) Large data volume and insufficient resources: increase the number of machines and increase concurrency

2)Doris锁表问题
2) Doris table-lock problem

1.15.5 Doris跨库查询,关联MySQL有使用过吗
1.15.5 Doris cross-database queries: have you used it together with MySQL?

可以
Yes, it is supported.

1.15.6 Doris roll up和物化视图区别
1.15.6 Difference between roll up and materialized view of Doris

Roll up可以理解为物化视图的过渡版本,目前Doris物化视图覆盖roll up功能
Roll-up can be understood as an earlier, transitional form of materialized views; currently Doris materialized views cover the roll-up functionality.

1.15.7 Doris的前缀索引
1.15.7 Prefix index of Doris

Doris 不支持在任意列上创建索引,而前缀索引,即在排序的基础上,实现的一种根据给定前缀列,快速查询数据的索引方式
Doris does not support creating indexes on arbitrary columns. The prefix index is an index built on top of the table's sort order that quickly locates data by a given prefix of the sort columns.

例如:将一行数据的前 36 个字节 作为这行数据的前缀索引。当遇到 VARCHAR 类型时,前缀索引会直接截断。
For example, the first 36 bytes of a row are used as that row's prefix index. When a VARCHAR column is encountered, the prefix index is truncated at it.

表结构的前缀索引:user_id(8 Bytes) + age(4 Bytes) + message(prefix 20 Bytes)
Prefix index for this schema: user_id (8 bytes) + age (4 bytes) + message (prefix 20 bytes)

1.16 可视化报表工具
1.16 Visual Reporting Tools

开源:Echarts(百度)、Kibana、Superset(功能一般)
Open source: Echarts (Baidu), Kibana, Superset (general function)

收费:Tableau(功能强大)、QuickBI(阿里云面对实时)、DataV(阿里云面对实时)、Suga(百度实时)
Charges: Tableau (powerful), QuickBI (Aliyun facing real-time), DataV (Aliyun facing real-time), Suga (Baidu real-time)

1.17 JavaSE

1.17.1 并发编程
1.17.1 Concurrent programming

1)什么是多线程&多线程的优点
1) What is Multithreading & Advantages of Multithreading

多线程是指程序中包含多个执行流,即一个程序中可以同时运行多个不同的线程来执行不同的任务。
Multithreading refers to a program that contains multiple execution streams, that is, a program can run multiple different threads at the same time to perform different tasks.

优点:可以提高cpu的利用率。多线程中,一个线程必须等待的时候,cpu可以运行其它的线程而不是等待,这样大大提高了程序的效率。
Advantages: CPU utilization can be improved. In multithreading, when a thread has to wait, the cpu can run other threads instead of waiting, which greatly improves the efficiency of the program.

2)Java 3种常见创建多线程的方式
2) Three common ways to create threads in Java

(1)继承Thread类,重写run()方法
(1) Extend the Thread class and override the run() method

(2)实现Runnable接口,重写run()方法
(2) Implement the Runnable interface and override the run() method

(3)通过创建线程池实现
(3) By creating a thread pool

1.17.2 如何创建线程池
1.17.2 How to create a thread pool

Executors提供了线程工厂方法用于创建线程池,返回的线程池都实现了ExecutorService接口。
The Executors class provides factory methods for creating thread pools; the returned pools all implement the ExecutorService interface.

newSingleThreadExecutor、newFixedThreadPool、newCachedThreadPool、newScheduledThreadPool

虽然Java自带的工厂方法很便捷,但都有弊端,《阿里巴巴Java开发手册》中强制线程池不允许使用以上方法创建,而是通过ThreadPoolExecutor的方式,这样处理可以更加明确线程池运行规则,规避资源耗尽的风险,仅作了解。
Although Java's built-in factory methods are convenient, they all have drawbacks. The Alibaba Java Development Manual forbids creating thread pools with the methods above and requires using the ThreadPoolExecutor constructor instead; this makes the pool's behavior explicit and avoids the risk of resource exhaustion. Know this for reference.

1.17.3 ThreadPoolExecutor构造函数参数解析
1.17.3 Parsing the ThreadPoolExecutor constructor parameters

(1)corePoolSize 线程池的核心线程数量
(1) corePoolSize: the number of core threads in the pool

(2)maximumPoolSize 线程池的最大线程数
(2) maximumPoolSize: the maximum number of threads in the pool

(3)keepAliveTime 当线程数量大于corePoolSize,空闲的线程当空闲时间超过keepAliveTime时就会回收;
(3) keepAliveTime: when the thread count exceeds corePoolSize, idle threads are reclaimed once they have been idle longer than keepAliveTime

(4)unit keepAliveTime的时间单位
(4) unit: the time unit of keepAliveTime

(5)workQueue 保留任务的队列
(5) workQueue: the queue that holds pending tasks

1.17.4 线程的生命周期
1.17.4 Life cycle of threads

创建 运行 阻塞 等待 死亡
New, Runnable (running), Blocked, Waiting, Terminated (dead)

1.17.5 notify和notifyAll区别
1.17.5 Difference between notify and notifyAll

notify会选择一个等待在该对象监视器上的线程,然后唤醒该线程
notify selects a thread waiting on the object monitor and wakes it up

notifyAll 方法用于唤醒等待在对象监视器上的所有线程
The notifyAll method is used to wake up all threads waiting on the object monitor

1.17.6 集合
1.17.6 Collections

1)List和Set区别
1) Difference between List and Set

前者底层是数组,有序可以重复,有索引,查找快,删除添加慢
The former is backed by an array, is ordered, allows duplicates, and has indexes; lookups are fast while insertions and deletions are slow.

后者底层是HashMap,无序不重复,搜索慢,插入删除快
The latter is backed by a HashMap, is unordered and has no duplicates; searches are slower, insertions and deletions are fast.

2)LinkedList和ArrayList异同
2) Similarities and differences between LinkedList and ArrayList

ArrayList基于动态数组实现,查询快
ArrayList is implemented on top of a dynamic array; random access is fast

LinkedList基于链表实现,新增和删除更快,不支持高效访问
LinkedList is implemented as a linked list; insertions and deletions are faster, but it does not support efficient random access

两者都是线程不安全的
Both are thread-unsafe.

1.17.7 列举线程安全的Map集合
1.17.7 Listing Thread-Safe Map Collections

SynchronizedMap、ConcurrentHashMap

1.17.8 StringBuffer和StringBuilder的区别
1.17.8 Difference between StringBuffer and StringBuilder

StringBuffer中的方法大都采用synchronized关键字进行修饰,是线程安全的,效率低。
Most methods in StringBuffer are decorated with the synchronized keyword, which is thread-safe and inefficient.

StringBuilder是线程不安全的,效率高。 
StringBuilder is thread-unsafe and efficient.

1.17.9 HashMap和HashTable的区别
1.17.9 Difference between HashMap and Hashtable

HashMap是线程不安全的效率高,HashTable是线程安全的,效率低。
HashMap is thread-unsafe and efficient, HashTable is thread-safe and inefficient.

1.17.10 HashMap的底层原理
1.17.10 The underlying principle of HashMap

1)HashMap的实现原理
1) HashMap implementation principle

HashMap实际上是一个数组和链表的结合体,HashMap基于Hash算法实现的;
HashMap is actually a combination of array and linked list. HashMap is based on Hash algorithm.

(1)当我们向HashMap中Put元素时,利用key的hashCode重新计算出当前对象的元素在数组中的下标
(1) When we put an element into a HashMap, we recalculate the subscript of the element of the current object in the array using the hashCode of the key.

(2)写入时,如果出现Hash值相同的key,此时分类,如果key相同,则覆盖原始值;如果key不同,value则放入链表中
(2) On write, if a key with the same hash value already exists there are two cases: if the key is the same, the original value is overwritten; if the key is different, the new entry is appended to the linked list.

(3)读取时,直接找到hash值对应的下标,在进一步判断key是否相同,进而找到对应值
(3) On read, locate the bucket index from the hash value, then compare keys to find the matching entry and return its value.

2)HashMap在JDK1.7和JDK1.8中有哪些区别
What are the differences between HashMap in JDK 1.7 and JDK 1.8?

JDK1.7:数组 + 链表
JDK 1.7: array + linked list

JDK1.8:数组 + 链表 + 红黑树
JDK 1.8: array + linked list + red-black tree (a long bucket list is converted to a red-black tree)

3)HashMap的Put方法具体流程
3) HashMap Put method specific process

4)HashMap的扩容
4) HashMap expansion

HashMap中的键值对大于阈值或者初始化时,就调用resize()进行扩容。
When the key-value pairs in the HashMap are greater than the threshold or initialized, resize() is called to expand.

每次扩展的时候都是扩展2倍。
Each expansion doubles the capacity.

1.17.11 项目中使用过的设计模式
1.17.11 Design patterns used in projects

1)单例模式:确保某个类只有一个实例,实时项目中的线程池
1) Singleton pattern: ensures a class has only one instance, e.g., the thread pool in the real-time project

2)模板方法模式:一个抽象类公开定义执行它的方式/模板,考评平台的考评器,实时数仓dws层关联维表
2) Template method pattern: an abstract class publicly defines the template for how it is executed, e.g., the evaluator in the assessment platform, and joining dimension tables in the real-time warehouse DWS layer

1.18 MySQL

1.18.1 SQL执行顺序
1.18.1 SQL Execution Order

From、Where、Group By 、Having、Select、Order By、Limit
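A small illustration (the table and columns are hypothetical); the comments mark the logical evaluation order:

SELECT province, COUNT(*) AS cnt      -- 5. Select
FROM orders                           -- 1. From
WHERE order_status = 'PAID'           -- 2. Where
GROUP BY province                     -- 3. Group By
HAVING COUNT(*) > 100                 -- 4. Having
ORDER BY cnt DESC                     -- 6. Order By
LIMIT 10;                             -- 7. Limit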

1.18.2 TRUNCATE、DROP、DELETE区别
1.18.2 Differences among TRUNCATE, DROP, and DELETE

清空表数据 删除表 删除特定部分数据
TRUNCATE empties the table's data; DROP deletes the table itself; DELETE deletes specific rows.

1.18.3 MyISAM与InnoDB的区别
1.18.3 Differences between MyISAM and InnoDB

对比项 | MyISAM | InnoDB
Item | MyISAM | InnoDB

外键 | 不支持 | 支持
Foreign keys | Not supported | Supported

事务 | 不支持 | 支持
Transactions | Not supported | Supported

行表锁 | 表锁,即使操作一条记录也会锁住整个表,不适合高并发的操作 | 行锁,操作时只锁某一行,不对其它行有影响,适合高并发的操作
Row/table locks | Table locks: even operating on a single record locks the whole table, unsuitable for highly concurrent workloads | Row locks: only the affected row is locked, other rows are unaffected, suitable for highly concurrent workloads

缓存 | 只缓存索引,不缓存真实数据 | 不仅缓存索引还要缓存真实数据,对内存要求较高,而且内存大小对性能有决定性的影响
Caching | Caches only indexes, not real data | Caches both indexes and real data, so it needs more memory, and memory size has a decisive impact on performance

1.18.4 MySQL四种索引
1.18.4 MySQL Four Indexes

1)唯一索引
1) Unique index

主键索引是唯一的,通常以表的ID设置为主键索引,一个表只能有一个主键索引,这是他跟唯一索引的区别。
A primary key index is unique, usually with the ID of the table set as the primary key index, a table can only have one primary key index, which is the difference between it and a unique index.

2)聚簇索引
2) Clustered index

聚簇索引的叶子节点都包含主键值、事务 ID、用于事务 MVCC 的回滚指针以及所有的剩余列
The leaf nodes of the clustered index all contain the primary key value, the transaction ID, the rollback pointer for the transaction MVCC, and all remaining columns.

3)辅助索引(非聚簇索引|二级索引)
3) Secondary indexes (non-clustered indexes|secondary indexes)

辅助索引也叫非聚簇索引,二级索引等,其叶子节点存储的不是行指针而是主键值,得到主键值再要查询具体行数据的话,要去聚簇索引中再查找一次,也叫回表。这样的策略优势是减少了当出现行移动或者数据页分裂时二级索引的维护工作。
Secondary indexes (also called non-clustered or second-level indexes) store the primary key value in their leaf nodes instead of a row pointer. After getting the primary key, fetching the full row requires another lookup in the clustered index, which is called a back-to-table lookup (回表). The advantage of this strategy is that it reduces secondary-index maintenance when rows move or data pages split.

4)联合索引
4) Composite (multi-column) index

两个或两个以上字段联合组成一个索引。使用时需要注意满足最左匹配原则!
Two or more fields are combined to form a single index. When using it, you need to pay attention to satisfying the leftmost matching principle!
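A small illustration of the leftmost-prefix rule (hypothetical table and index):

-- composite index on (user_id, order_date)
CREATE INDEX idx_user_date ON orders (user_id, order_date);

-- can use the index (leftmost column is present)
SELECT * FROM orders WHERE user_id = 1001 AND order_date = '2024-06-01';
SELECT * FROM orders WHERE user_id = 1001;

-- cannot use the index (leftmost column is missing)
SELECT * FROM orders WHERE order_date = '2024-06-01';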

1.18.5 MySQL的事务
1.18.5 MySQL transactions

(1)事务的基本要素(ACID)
(1) Basic Elements of Transactions (ACID)

(2)事务的并发问题
(2) Concurrency problems of transactions

脏读:事务A读取了事务B更新的数据,然后B回滚操作,那么A读取到的数据是脏数据
Dirty read: Transaction A reads the data updated by transaction B, and then B rolls back the operation, then the data read by A is dirty data

不可重复读:事务 A 多次读取同一数据,事务 B 在事务A多次读取的过程中,对数据作了更新并提交,导致事务A多次读取同一数据时,结果 不一致
Non-repeatable read: transaction A reads the same data several times while transaction B updates and commits it in between, so A gets inconsistent results across its reads.

幻读:系统管理员A将数据库中所有学生的成绩从具体分数改为ABCDE等级,但是系统管理员B就在这个时候插入了一条具体分数的记录,当系统管理员A改结束后发现还有一条记录没有改过来,就好像发生了幻觉一样,这就叫幻读。
Phantom read: administrator A changes all students' grades from numeric scores to letter grades A-E, while administrator B inserts a new record with a numeric score at the same time; when A finishes, one record still has a numeric score, as if A had seen a phantom. This is called a phantom read.

小结:不可重复读的和幻读很容易混淆,不可重复读侧重于修改,幻读侧重于新增或删除。解决不可重复读的问题只需锁住满足条件的行,解决幻读需要锁表
Summary: It is easy to confuse non-repeatable reads with phantom reads, with non-repeatable reads focusing on modifications and phantom reads focusing on adding or deleting. To solve the problem of non-repeatable read, you only need to lock the rows that meet the conditions, and to solve the phantom read, you need to lock the table

1.18.6 MySQL事务隔离级别
1.18.6 MySQL Transaction Isolation Level

事务隔离级别 | 脏读 | 不可重复读 | 幻读
Isolation level | Dirty read | Non-repeatable read | Phantom read

读未提交(read-uncommitted) | 可能 | 可能 | 可能
Read uncommitted | possible | possible | possible

读已提交(read-committed) | 不可能 | 可能 | 可能
Read committed | prevented | possible | possible

可重复读(repeatable-read) | 不可能 | 不可能 | 可能
Repeatable read | prevented | prevented | possible

串行化(serializable) | 不可能 | 不可能 | 不可能
Serializable | prevented | prevented | prevented
1.18.7 MyISAM与InnoDB对比
1.18.7 MyISAM vs. InnoDB

(1)InnoDB的数据文件本身就是索引文件,而MyISAM索引文件和数据文件是分离的:
(1) The data file of InnoDB is an index file itself, while the index file of MyISAM is separated from the data file:

①InnoDB的表在磁盘上存储在以下文件中: .ibd(表结构、索引和数据都存在一起,MySQL5.7表结构放在.frm中)
(1) InnoDB tables are stored on disk in the following files: .ibd (the table structure, index, and data all exist together, and the MySQL 5.7 table structure is placed in .frm)

②MyISAM的表在磁盘上存储在以下文件中: *.sdi(描述表结构,MySQL5.7是.frm)、*.MYD(数据),*.MYI(索引)
MyISAM tables are stored on disk in the following files: *.sdi (describing table structure, MySQL 5.7 is.frm), *.MYD (data), *.MYI (index)

(2)InnoDB中主键索引是聚簇索引,叶子节点中存储完整的数据记录;其他索引是非聚簇索引,存储相应记录主键的值 。
(2) The primary key index in InnoDB is a clustered index, and the leaf node stores complete data records; other indexes are non-clustered indexes, and store the value of the primary key of the corresponding record.

(3)InnoDB要求表必须有主键 ( MyISAM可以没有 )。如果没有显式指定,则MySQL系统会自动选择一个可以非空且唯一标识数据记录的列作为主键。如果不存在这种列,则MySQL自动为InnoDB表生成一个隐含字段作为主键。
(3) InnoDB requires tables to have a primary key (MyISAM can have none). If not explicitly specified, MySQL automatically selects a column that can be non-null and uniquely identifies the data record as the primary key. If no such column exists, MySQL automatically generates an implied field as the primary key for the InnoDB table.

(4)MyISAM中无论是主键索引还是非主键索引都是非聚簇的,叶子节点记录的是数据的地址。
(4) In MyISAM, both primary key indexes and non-primary key indexes are non-clustered, and leaf nodes record the address of data.

(5)MyISAM的回表操作是十分快速的,因为是拿着地址偏移量直接到文件中取数据的,反观InnoDB是通过获取主键之后再去聚簇索引里找记录,虽然说也不慢,但还是比不上直接用地址去访问。
(5) MyISAM table back operation is very fast, because it takes the address offset directly to the file to get data, in contrast, InnoDB is to get the primary key and then go to the cluster index to find records, although it is not slow, but still not as fast as directly using the address to access.

1.18.8 B树和B+树对比
1.18.8 Comparison of B trees and B+ trees

1)B+ 树和 B 树的差异
1) Differences between B+ trees and B trees

(1)B+树中非叶子节点的关键字也会同时存在子节点中,并且是在子节点中所有关键字的最大值(或最小)。
(1) A key in a non-leaf node of a B+ tree also appears in its child node, where it is the maximum (or minimum) of all keys in that child.

(2)B+树中非叶子节点仅用于索引,不保存数据记录,跟记录有关的信息都放在叶子节点中。而B树中,非叶子节点既保存索引,也保存数据记录。
(2) Non-leaf nodes in a B+ tree are used only for indexing and do not store data records; everything related to records is kept in the leaf nodes. In a B tree, non-leaf nodes store both index entries and data records.

(3)B+树中所有关键字都在叶子节点出现,叶子节点构成一个有序链表,而且叶子节点本身按照关键字的大小从小到大顺序链接。
(3) All keys of a B+ tree appear in the leaf nodes, which form an ordered linked list, linked in ascending key order.

2)B+树为什么IO的次数会更少
2) Why B+ trees have fewer IO times

真实环境中一个页存放的记录数量是非常大的(默认16KB),假设指针与键值忽略不计(或看做10个字节),数据占 1 kb 的空间:
In a real system a page holds a very large number of records (a page is 16 KB by default). Suppose a pointer plus key takes about 10 bytes and a data record takes about 1 KB: one leaf page then holds about 16 records, and one non-leaf (directory) page holds about 16 KB / 10 B ≈ 1600 entries.

如果B+树只有1层,也就是只有1个用于存放用户记录的节点,最多能存放16条记录。
If the B+ tree has only one layer, that is, only one node for storing user records, it can store up to 16 records.

如果B+树有2层,最多能存放1600×16 = 25600条记录。
If the B+ tree has two layers, it can store up to 1600×16 = 25600 records.

如果B+树有3层,最多能存放1600×1600×16 = 40960000条记录。
If the B+ tree has three layers, it can store up to 1600×1600×16 = 40960000 records.

如果存储千万级别的数据,只需要三层就够了。
If you store tens of millions of levels of data, you only need three layers.

B+树的非叶子节点不存储用户记录,只存储目录记录,相对B树每个节点可以存储更多的记录,树的高度会更矮胖,IO次数也会更少。
The non-leaf nodes of the B+ tree do not store user records, only directory records. Compared to the B tree, each node can store more records, the height of the tree will be shorter, and the number of IO times will be less.

1.19 Redis

1.19.1 Redis缓存穿透、缓存雪崩、缓存击穿
1.19.1 Redis Cache Penetration, Cache Avalanche, Cache Breakdown

(1)缓存穿透是指查询一个一定不存在的数据。由于缓存命不中时会去查询数据库,查不到数据则不写入缓存,这将导致这个不存在的数据每次请求都要到数据库去查询,造成缓存穿透。
(1) Cache penetration means querying data that definitely does not exist. On a cache miss the database is queried; since nothing is found, nothing is written back to the cache, so every request for this non-existent data goes to the database. That is cache penetration.

解决方案:
Solution:

①是将空对象也缓存起来,并给它设置一个很短的过期时间,最长不超过5分钟。
① Cache the empty result as well, and give it a very short expiration time, at most 5 minutes.

②采用布隆过滤器,将所有可能存在的数据哈希到一个足够大的bitmap中,一个一定不存在的数据会被这个bitmap拦截掉,从而避免了对底层存储系统的查询压力。
② Bloom filter is used to hash all possible data into a bitmap large enough, and a certain non-existent data will be intercepted by this bitmap, thus avoiding the query pressure on the underlying storage system.

(2)如果缓存集中在一段时间内失效,发生大量的缓存穿透,所有的查询都落在数据库上,就会造成缓存雪崩。
(2) If many cached entries expire within the same period, a large number of cache misses occur and all queries hit the database; this is a cache avalanche.

解决方案:尽量让失效的时间点不分布在同一个时间点。
Solution: spread out the expiration times so entries do not all expire at the same moment.

(3)缓存击穿,是指一个key非常热点,在不停的扛着大并发当这个key在失效的瞬间,持续的大并发就穿破缓存,直接请求数据库,就像在一个屏障上凿开了一个洞
(3) Cache breakdown refers to one very hot key under constant heavy concurrency: the moment that key expires, the sustained concurrency punches through the cache and hits the database directly, like drilling a hole through a barrier.

解决方案:可以设置key永不过期。
Solution: You can set the key to never expire.

1.19.2 Redis哨兵模式
1.19.2 Redis Sentinel Mode

(1)主从复制中反客为主的自动版,如果主机Down掉,哨兵会从从机中选择一台作为主机,并将它设置为其他从机的主机,而且如果原来的主机再次启动的话也会成为从机。
(1) It is the automated version of manual failover in master-slave replication: if the master goes down, the sentinels pick one slave as the new master and point the other slaves to it; if the old master comes back, it becomes a slave.

(2)哨兵模式是一种特殊的模式,首先Redis提供了哨兵的命令,哨兵是一个独立的进程,作为进程,它独立运行。其原理是哨兵通过发送命令,等待Redis服务器响应,从而监控运行的多个Redis实例。
(2) Sentinel mode is a special mode, first Redis provides the sentinel command, sentinel is an independent process, as a process, it runs independently. The principle is that sentries monitor multiple running Redis instances by sending commands and waiting for Redis servers to respond.

(3)当哨兵监测到Redis主机宕机,会自动将Slave切换成Master,然后通过发布订阅模式通知其他服务器,修改配置文件,让他们换主机
(3) When the sentinel detects that the Redis host is down, it will automatically switch Slave to Master, and then notify other servers through publish subscription mode to modify the configuration file and let them change hosts.

(4)当一个哨兵进程对Redis服务器进行监控,可能会出现问题,为此可以使用哨兵进行监控, 各个哨兵之间还会进行监控,这就形成了多哨兵模式。
(4) A single sentinel process monitoring Redis can itself fail, so multiple sentinels are used; they also monitor each other, which forms the multi-sentinel mode.

1.19.3 Redis数据类型
1.19.3 Redis data types

String 字符串
String: string

List 可以重复的集合
List: an ordered collection that allows duplicates

Set 不可以重复的集合
Set: a collection that does not allow duplicates

Hash 类似于Map<String,String>
Hash: similar to Map<String,String>

Zset(sorted set) 分数的set
Zset (sorted set): a set ordered by score

1.19.4 热数据通过什么样的方式导入Redis
1.19.4 How hot data is imported into Redis

提供一种简单实现缓存失效的思路:LRU(最近最少使用淘汰)。
A simple way to implement cache eviction: LRU (Least Recently Used eviction).

即Redis的缓存每命中一次,就给命中的缓存增加一定TTL(过期时间)(根据具体情况来设定, 比如10分钟)。
That is, every time Redis cache hits, it adds a certain TTL (expiration time) to the hit cache (set according to the specific situation, such as 10 minutes).

一段时间后,热数据的TTL都会较大,不会自动失效,而冷数据基本上过了设定的TTL就马上失效了。
After a while, hot data ends up with a large TTL and does not expire, while cold data expires as soon as its TTL runs out.

1.19.5 Redis的存储模式RDB,AOF
1.19.5 Redis storage modes RDB, AOF

Redis 默认开启RDB持久化方式,在指定的时间间隔内,执行指定次数的写操作,则将内存中的数据写入到磁盘中。
Redis enables RDB persistence mode by default, and writes the data in memory to disk after executing the specified number of write operations within the specified time interval.

RDB 持久化适合大规模的数据恢复但它的数据一致性和完整性较差。
RDB persistence is suitable for large-scale data recovery but has poor data consistency and integrity.

Redis 需要手动开启AOF持久化方式,默认是每秒将写操作日志追加到AOF文件中。
Redis needs to manually enable the AOF persistence mode. The default is to append the write operation log to the AOF file every second.

AOF 的数据完整性比RDB高,但记录内容多了,会影响数据恢复的效率。
AOF has better data integrity than RDB, but because it records much more content, data recovery is slower.

Redis 针对 AOF文件大的问题,提供重写的瘦身机制。
Redis provides a slimming mechanism for rewriting large AOF files.

若只打算用Redis 做缓存,可以关闭持久化。
If you only plan to use Redis as a cache, you can turn persistence off.

若打算使用Redis的持久化,建议RDB和AOF都开启。其实RDB更适合做数据的备份,留一后手:AOF出问题了,还有RDB。
If you intend to use Redis persistence, it is recommended to enable both RDB and AOF. RDB is better suited for backups and serves as a fallback: if AOF has a problem, RDB is still there.

1.19.6 Redis存储的是k-v类型,为什么还会有Hash?
1.19.6 Redis stores k-v type, why is there a Hash?

Redis的hash数据结构是一个键值对(key-value)集合,它是一个String类型的field和value的映射表。Redis本身就是一个key-value类型的数据库,因此Hash数据结构相当于在原来的value上又套了一层key-value型数据。所以Redis的hash数据类型特别适合存储关系型对象。
Redis's hash data structure is a collection of key-value pairs: a mapping table of String fields to values. Redis itself is a key-value database, so the Hash structure effectively nests another layer of key-value data inside the value. That is why the hash type is particularly suitable for storing object-like data.

1.20 JVM

关注尚硅谷教育公众号,回复 java
Follow the Shang Silicon Valley Education official account and reply "java".

第2章 离线数仓项目
Chapter 2 Offline Data Warehouse Project

2.1 提高自信
2.1 improve self-confidence

云上数据仓库解决方案:https://www.aliyun.com/solution/datavexpo/datawarehouse

2.2 为什么做这个项目
2.2 Why do this project?

随着公司的发展,老板需要详细的了解公司的运营情况。比如,日活、新增、留存、转化率等。所以公司决定招聘大数据人才来做这个项目,目的是为老板做决策提供数据支持。
As the company grows, bosses need to understand its operations in detail. For example, daily activity, new addition, retention, conversion rate, etc. So the company decided to recruit big data talent to do this project, the purpose is to provide data support for the boss to make decisions.

2.3 数仓概念
2.3 Data Warehouse Concepts

数据仓库的输入数据源和输出系统分别是什么?
What are the input data sources and output systems of a data warehouse?

(1)输入系统前端埋点产生用户行为数据、JavaEE后台产生的业务数据个别公司有爬虫数据。
(1) Input systems: user behavior data produced by front-end tracking (埋点), business data produced by the JavaEE backend, and, in a few companies, crawler data.

(2)输出系统:报表系统、用户画像系统、推荐系统
(2) Output system: report system, user portrait system, recommendation system.

2.4 项目架构
2.4 project architecture

2.5 框架版本选型
2.5 Framework Version Selection

1Apache:运维麻烦,组件间兼容性需要自己调研。一般大厂使用,技术实力雄厚,有专业的运维人员)。
1) Apache: operations are troublesome and you must verify component compatibility yourself. (Generally used by big companies with strong technical teams and dedicated ops staff.)

2CDH6.3.2:国内使用最多的版本CDH和HDP合并后推出,CDP7.0。收费标准,10000美金一个节点每年。(不建议使用)
2) CDH 6.3.2: the most widely used distribution in China. After CDH and HDP merged, CDP 7.0 was released. Pricing is about USD 10,000 per node per year. (Not recommended.)

3HDP:开源,可以进行二次开发,但是没有CDH稳定国内使用较少
3) HDP: open source and can be customized, but less stable than CDH; rarely used in China.

4)云服务选择
4) Cloud service selection

(1)阿里云的EMR、MaxCompute、DataWorks
(1) Alibaba Cloud EMR, MaxCompute, DataWorks

2)腾讯云EMR、流计算Oceanus、数据开发治理平台WeData
(2) Tencent Cloud EMR, Stream Computing Oceanus, Data Development Governance Platform WeData

3)华为云EMR
(3) Huawei Cloud EMR

4)亚马逊云EMR
(4) Amazon Cloud EMR

星环国际、金蝶。。。神策、数梦
Xinghuan (Transwarp), Kingdee... Shence (Sensors Data), Shumeng (DtDream)

Apache框架各组件重要版本发版时间
The release time of major versions of each component of the Apache framework

框架:版本号(发版时间)
Framework: version (release date)

Hadoop: 2.7.2 (2017-06), 3.0.0 (2018-03), 3.1.3 (2020-07)

Zookeeper: 3.4.12 (2018-05), 3.4.14 (2019-04), 3.5.8 (2020-05), 3.7.0 (2021-03), 3.8.0 (2022-03)

Flume: 1.9.0 (2019-01), 1.10.0 (2022-03), 1.11.0 (2022-10)

Kafka: 1.0.0 (2017-11), 2.0.0 (2018-07), 2.3.0 (2019-03), 2.4.0 (2019-12), 2.7.0 (2020-12), 3.0.0 (2021-09)

Hive: 1.2.1 (2015-06), 2.0.0 (2016-02), 2.2.0 (2017-07), 3.0.0 (2018-05), 2.3.6 (2019-08), 3.1.2 (2019-08), 2.3.7 (2020-04), 3.1.3 (2022-04)

HBase: 1.2.0 (2016-02), 1.4.0 (2017-12), 1.5.0 (2019-10), 1.6.0 (2020-07), 2.0.0 (2018-05), 2.2.0 (2019-06), 2.4.0 (2020-12), 2.5.0 (2022-08)

Sqoop: 1.4.6 (2017-10), 1.4.7 (2020-07)

Spark: 1.6.0 (2016-01), 2.0.0 (2016-07), 2.2.0 (2018-05), 2.4.0 (2018-11), 3.0.0 (2020-06), 2.4.8 (2022-06), 3.2.0 (2021-10), 3.3.0 (2022-06)

Flink: 1.7.0 (2018-11), 1.8.0 (2019-04), 1.9.0 (2019-08), 1.10.0 (2020-02), 1.11.0 (2020-07), 1.12.0 (2020-12), 1.13.0 (2021-04), 1.13.6 (2022-02), 1.14.0 (2021-09), 1.15.0 (2022-05), 1.16.0 (2022-10)

DolphinScheduler: 1.2.0(最早/earliest, 2020-01), 1.3.9 (2021-10), 2.0.0 (2021-11), 3.0.0 (2022-08)

Doris: 0.13.0(最早/earliest, 2020-10), 0.14.0 (2021-05), 0.15.0 (2021-11), 1.1.0 (2022-07)

Hudi: 0.10.0 (2021-12), 0.11.0 (2022-03), 0.12.0 (2022-08)

Phoenix: 4.14.0(对应HBase 1.4, 2018-06), 4.16.1(对应HBase 1.3/1.4/1.5/1.6, 2021-05), 5.1.2(对应HBase 2.1/2.2/2.3/2.4, 2021-07)

*注:着重标出的为公司实际生产中的常用版本。
*Note: The highlighted versions are commonly used in the company's actual production.

2.6 服务器选型
2.6 Server Selection

服务器使用物理机还是云主机?
Is the server a physical machine or a cloud host?

1)机器成本考虑:
1) Machine cost consideration:

(1)物理机:以128G内存20核物理CPU,40线程,40THDD和80TSSD硬盘,单台报价4W出头,惠普品牌。一般物理机寿命5年左右
(1) Physical machine: 128 GB RAM, a 20-core physical CPU with 40 threads, 40 TB HDD and 80 TB SSD; a single HP-brand server is quoted at a bit over 40,000 RMB. A physical machine typically lasts about 5 years.

(2)云主机,以阿里云为例,差不多相同配置,每年5W。华为云、腾讯云、天翼云。
(2) Cloud host: taking Alibaba Cloud as an example, a roughly equivalent configuration costs about 50,000 RMB per year. Alternatives: Huawei Cloud, Tencent Cloud, China Telecom Cloud.

2)运维成本考虑:
2) O&M cost considerations:

(1)物理机:需要有专业的运维人员(1万 * 13个月)、电费(商业用户)、安装空调、场地。
(1) Physical machines: you need dedicated ops staff (about 10,000 RMB x 13 months per year), electricity (commercial rates), air conditioning, and floor space.

(2)云主机:很多运维工作都由阿里云已经完成,运维相对较轻松。
(2) Cloud hosts: much of the ops work is already done by Alibaba Cloud, so operations are relatively easy.

3)企业选择
3) Business selection

(1)金融有钱公司选择云产品(上海)。
(1) Wealthy financial companies choose cloud products (Shanghai).

(2)中小公司、为了融资上市,选择云产品,拉到融资后买物理机。
(2) Small and medium-sized companies, in order to raise funds and go public, choose cloud products, and buy physical machines after financing.

(3)有长期打算,资金比较足,选择物理机。
(3) Have a long-term plan, have sufficient funds, and choose a physical machine.

2.7 集群规模
2.7 Cluster size

1)硬盘方面考虑
1) Hard disk considerations

2)CPU方面考虑
2) CPU considerations

20核物理CPU 40线程 * 8 = 320线程 (指标 100-200
20 cores physical CPU 40 threads * 8 = 320 threads (index 100-200)

3)内存方面考虑
3) Memory considerations

内存128g * 8= 1024g (计算任务内存800g,其他安装框架需要内存)
Memory 128g * 8 = 1024g (800g for computing tasks, memory required for other installation frameworks)

128m =》 512M内存
128 MB of data => 512 MB of memory

100g数据 、800g内存
100g data, 800g memory

4)参考案例说明
4) Reference case description

根据数据规模搭建集群(在企业干了三年,通常服务器集群在5-20台之间)
Size the cluster according to the data volume (after about three years at a company, server clusters are usually between 5 and 20 machines)

(1)参考腾讯云EMR官方推荐部署
(1) Refer to Tencent Cloud EMR official recommended deployment

Master节点管理节点,保证集群的调度正常进行;主要部署NameNode、ResourceManager、HMaster 等进程;非 HA 模式下数量为1,HA 模式下数量为2
Master node: manages nodes to ensure normal scheduling of clusters; mainly deploys processes such as NameNode, ResourceManager, and HMaster; the number is 1 in non-HA mode and 2 in HA mode.

Core节点为计算及存储节点,我们在 HDFS 中的数据全部存储于 core 节点中,因此为了保证数据安全,扩容 core 节点后不允许缩容;主要部署 DataNode、NodeManager、RegionServer 等进程。非 HA 模式下数量≥2,HA 模式下数量≥3。
Core node: It is a computing and storage node. All data in HDFS is stored in the core node. Therefore, in order to ensure data security, capacity reduction is not allowed after expanding the core node. DataNode, NodeManager, RegionServer and other processes are mainly deployed.≥2 in non-HA mode and ≥3 in HA mode.

Common 节点为 HA 集群 Master 节点提供数据共享同步以及高可用容错服务;主要部署分布式协调器组件,如 ZooKeeper、JournalNode 等节点。非HA模式数量为0,HA 模式下数量≥3。
Common node: provides data sharing synchronization and high availability fault tolerance services for HA cluster Master nodes; mainly deploys distributed coordinator components, such as ZooKeeper, JournalNode and other nodes. 0 in non-HA mode and ≥3 in HA mode.

(2)数据传输数据比较紧密的放在一起(Kafka、clickhouse
(2) Components that exchange a lot of data should be placed close together (Kafka, ClickHouse).

(3)客户端尽量放在一到两台服务器上,方便外部访问
(3) The client should be placed on one or two servers as much as possible to facilitate external access

(4)有依赖关系的尽量放到同一台服务器(例如:Ds-worker和hive/spark;ClickHouse必须单独部署)
(4) Components with dependencies should go on the same server where possible (e.g., Ds-worker with hive/spark); ClickHouse must be deployed on its own.

Master | Master | core | core | core | common | common | common
nn | nn | dn | dn | dn | JournalNode | JournalNode | JournalNode
rm | rm | nm | nm | nm | zk | zk | zk
hive | hive | hive | kafka | kafka | kafka | spark | spark
spark | datax | datax | datax | Ds-master | Ds-master | Ds-worker | Ds-worker
Ds-worker | maxwell | superset | mysql | flume | flume | flink | flink
redis | hbase | | | | | |

2.8 人员配置参考
2.8 Staffing Reference

2.8.1 整体架构
2.8.1 Overall Architecture

大数据开发工程师 =》 大数据组组长 =》 项目经理=》部门经理=》技术总监CTO
Big Data Development Engineer =》 Big Data Team Leader =》 Project Manager =》Department Manager =》Technical Director CTO

=》 高级架构师 =》 资深架构师
=> Senior Architect => Principal Architect

2.8.2 你的的职级等级及晋升规则
2.8.2 Your rank and promotion rules

小公司:职级分初级,中级,高级。晋升规则不一定,看公司效益和职位空缺。
Small companies: ranks are junior, mid-level, and senior. Promotion rules vary and depend on company performance and openings.

大公司都有明确的职级:
Big companies have clear ranks:

2.8.3 人员配置参考
2.8.3 Staffing Reference

小型公司(1-3人左右):组长1人,剩余组员无明确分工,并且可能兼顾JavaEE和前端。
Small companies (1-3 people): 1 team leader, the rest of the team members have no clear division of labor, and may take care of Java EE and front end.

中小型公司(3~6人左右):组长1人,离线2人左右,实时1人左右(离线一般多于实时),组长兼顾和JavaEE、前端。
Small and medium-sized companies (about 3~6 people): 1 team leader, about 2 offline, about 1 real-time (offline is generally more than real-time), team leader and JavaEE, front end.

中型公司(5~10人左右):组长1人,离线3~5人左右(离线处理、数仓),实时2人左右,组长和技术大牛兼顾和JavaEE、前端。
Medium-sized company (about 5~10 people): 1 team leader, about 3~5 offline (offline processing, warehouse), about 2 real-time people, team leader and technical bull and JavaEE, front end.

中大型公司(10~20人左右):组长1人,离线5~10人(离线处理、数仓),实时5人左右,JavaEE1人左右(负责对接JavaEE业务),前端1人(有或者没有人单独负责前端)。(发展比较良好的中大型公司可能大数据部门已经细化拆分,分成多个大数据组,分别负责不同业务)
Medium and large companies (about 10~20 people): 1 person in the team leader, 5~10 people offline (offline processing, warehouse), about 5 people in real time, about 1 person JavaEE (responsible for connecting JavaEE business), 1 person in the front end (with or without a person responsible for the front end alone). (For medium and large companies with relatively good development, the big data department may have been divided into multiple big data groups, which are responsible for different businesses)

上面只是参考配置,因为公司之间差异很大,例如ofo大数据部门只有5个人左右,因此根据所选公司规模确定一个合理范围,在面试前必须将这个人员配置考虑清楚,回答时要非常确定。
The above is only a reference configuration, because there are great differences between companies, for example, ofo big data department only has about 5 people, so determine a reasonable range according to the selected company size, this staffing must be considered clearly before the interview, and the answer should be very certain.

咱们自己公司:大数据组组长:1个人;离线3-4个人;实时1-3个人。
Our own company: big data group leader: 1 person; offline 3-4 people; real-time 1-3 people.

IOS多少人?安卓多少人?前端多少人?JavaEE多少人?测试多少人?
How many iOS? How many Android? How many front-end? How many JavaEE? How many testers?

(IOS、安卓) 1-2个人 前端3个人; JavaEE一般是大数据的1-1.5倍,测试:有的有,1个左右,有的没有。 产品经理1个、产品助理1-2个,运营1-3个。
(iOS, Android) 1-2 people each; front-end 3 people; JavaEE is usually 1-1.5x the size of the big-data team; testers: some teams have about 1, some have none. 1 product manager, 1-2 product assistants, 1-3 operations staff.

公司划分:
Company Division:

0-50 小公司;50-500 中等;500-1000 大公司;1000以上 大厂 领军的存在。
0-50 people: small company; 50-500: medium; 500-1000: large; over 1000: a leading major company.

2.9 从0-1搭建项目,你需要做什么?
2.9 Building a project from 0-1, what do you need to do?

1)需要问项目经理的问题
1) Questions to ask the project manager

(1)数据量(增量、全量): 100g
(1) Data volume (increment, total): 100g

(2)预算: 50万
(2) Budget: 500,000

(3)数据存储多久: 1年
(3) How long the data is stored: 1 year

(4)云主机、物理机: 云主机
(4) Cloud host, physical machine: cloud host

(5)日活: 100万
(5) Daily active users: 1 million

(6)数据源: 用户行为数据(文件)、业务数据(MySQL)
(6) Data source: user behavior data (file), business data (MySQL)

(7)项目周期: 1个月-3个月
(7) Project cycle: 1 month-3 months

(8)团队多少人: 3-5
(8) How many people in the team: 3-5

(9)首批指标: 1-10个
(9) Initial indicators: 1-10

(10)未来的规划: 离线和实时 是否都要做
(10) Planning for the future: do you want to do both offline and real-time?

2)项目周期(2个月)
2) Project cycle (2 months)

(1)数据调研(2周) + 集群搭建
(1) Data survey (2 weeks)+ cluster building

(2)明确数据域(2天)
(2) Clarify the data domain (2 days)

(3)构建业务矩阵(3天)
(3) Build Business Matrix (3 days)

(4)建模 至下而上 (2周)
(4) Modeling bottom-up (2 weeks)

①ODS层 ②DWD层 ③DIM层

(5)指标体系建设 至上而下 (2周)
(5) Index system construction from top to bottom (2 weeks)

(6)处理bug 1周
(6) Bug handling 1 week

2.10 数仓建模准备
2.10 Warehouse modeling preparation

1)数据仓库建模的意义
1) Significance of data warehouse modeling

如果把数据看作图书馆里的书,我们希望看到它们在书架上分门别类地放置;
If we think of data as books in a library, we want to see them organized on shelves;

减少重复计算。
Reduce double counting.

快速查询所需要的数据。
Quickly query the data you need.

2)ER模型
2) ER model

如果对方问三范式问题。初步判断对方是一个java程序员,就不要和他深入聊,mysql高级、redis、多线程、JVM、SSM等框架了。
If the interviewer asks about the three normal forms, they are probably a Java programmer; don't go deep with them into advanced MySQL, Redis, multithreading, JVM, SSM and other such frameworks.

应该把话题转移到大数据技术。Spark、flink、海量数据如何处理、维度建模。
The topic should be shifted to Big Data technology. Spark, flink, how to process massive data, dimensional modeling.

3)维度建模
3) Dimensional modeling

星型模型:事实表周围一级维度 减少join => 大数据场景不适合频繁的join
Star schema: a single layer of dimension tables around the fact table, fewer joins => big-data scenarios are not suited to frequent joins

雪花模型:事实表周围多级维度
Snowflake schema: multiple layers of dimension tables around the fact table

星座:多个事实表
Fact constellation: multiple fact tables

4)事实表
4) Fact sheet

(1)如何判断一张表是事实表?
1) How do you know if a table is a fact table?

具有度量值的 可以累加的 个数、件数、金额、次数
It has measures, values that can be summed: counts, quantities, amounts, number of times.

(2)同步策略
(2) Synchronization strategy

数据量大 =》 通常增量 特殊的,加购 (周期快照事实表)
Large data volume => usually incremental sync; a special case is cart additions (periodic snapshot fact table).

(3)分类
(3) Classification

①事务型事实表
① Transaction type fact table

找原子操作。 例如:下单 加购 支付
Find the atomic operations, e.g., placing an order, adding to cart, paying.

①选择业务过程
① Select business process

②声明粒度
② Declare the grain

③确定维度
③ Determine dimension

④确定事实
④ Determine the facts

不足:
Shortcomings:

连续性指标,不好找原子操作。 例如,库存(周期快照事实表)
Metrics about continuous state, where atomic operations are hard to find, e.g., inventory (periodic snapshot fact table).

多事实表关联。 例如,统计加购到支付的平均使用时长 (累积型快照事实表)
Joins across multiple fact tables, e.g., computing the average time from adding to cart to payment (accumulating snapshot fact table).

②周期快照事实表
② Periodic snapshot fact table

③累积型快照事实表
③ Cumulative snapshot fact table

5)维度表
5) Dimension table

(1)如何判断一张表是维度表?
(1) How do you determine if a table is a dimension table?

没有度量值,都是描述信息。 身高 体重、年龄、性别
There are no metrics, just descriptive information. Height, weight, age, sex

(2)同步策略
(2) Synchronization strategy

数据量小 =》 通常 全量 特殊的 用户表
Small data volume => usually full sync; a special case is the user table.

(3)维度整合 减少Join操作
(3) Dimension integration reduces Join operation

①商品表、商品品类表、SPU、商品一级分类、二级分类、三级分类=》商品维度表
① Commodity table, commodity category table, SPU, commodity primary classification, secondary classification, tertiary classification = Commodity dimension table

②省份表、地区表 =》 地区维度表
② Province table, region table =》Region dimension table

③活动信息表、活动规则表 =》 活动维度表
Activity information table, activity rule table = Activity dimension table

(4)拉链表
(4) Zipper table

对用户表做了拉链。
The user table is maintained as a zipper table.

缓慢变化维 场景
Scenario: slowly changing dimensions.

6)建模工具是什么?
6) What are modeling tools?

PowerDesigner、EZDML
PowerDesigner, EZDML

2.11 数仓建模
2.11 warehouse modeling

1)数据调研
1) Data research

(1)先和Java人员要表,表中最好有字段的描述或者有表和字段的说明文档。(项目经理帮助协调) =》 快速熟悉表中业务。梳理清楚业务线,找到事实表和维度表。
(1) First get the tables from the Java developers, ideally with field descriptions or documentation of tables and fields (the project manager helps coordinate) => get familiar with the business in the tables quickly, sort out the business lines, and identify fact tables and dimension tables.

(2)和业务人员聊 =》 验证你猜测的是否正确
(2) Chat with business people to verify whether your guess is correct

(3)和产品经理聊
(3) Talk to the product manager

需求:派生指标、衍生指标
Requirements: derived metrics and compound metrics

派生指标 = 原子指标(业务过程 + 度量值 + 聚合逻辑) + 统计周期 + 统计粒度 + 业务限定
Derivative indicator = atomic indicator (business process + metric + aggregation logic)+ statistical period + statistical granularity + business limit

需求中的业务过程必须和实际的后台业务能对应上。
The business processes in the requirements must correspond to the actual background business.

2)明确数据域
2) Clarify the data domain

(1)用户域:登录、注册
(1) User domain: login, registration

(2)流量域:启动、页面、动作、故障、曝光
(2) Traffic domain: app launch, page views, actions, errors, exposures

(3)交易域:加购、下单、支付、物流、取消下单、取消支付
(3) Transaction domain: purchase, order, payment, logistics, cancel order, cancel payment

(4)工具域:领取优惠卷、使用优惠卷下单、使用优惠卷支付
(4) Tool domain: receive coupons, place orders with coupons, pay with coupons

(5)互动域:点赞、评论、收藏
(5) Interactive domain: likes, comments, favorites

3)构建业务矩阵
3) Build Business Matrix

用户、商品、活动、时间、地区、优惠卷
User, Product, Event, Time, Region, Coupon

(1)用户域:
(1) User domain:

登录、注册
Login, Register

(2)流量域: √
(2) Traffic domain: √

启动、页面、动作、故障、曝光
Start, page, action, fault, exposure

(3)交易域:
(3) Transaction domain:

加购、下单、支付、物流、取消下单、取消支付
Add purchase, order, payment, logistics, cancel order, cancel payment

(4)工具域:
(4) Toolfields:

领取优惠卷、使用优惠卷下单、使用优惠卷支付
Receive coupons, order with coupons, pay with coupons

(5)互动域:
(5) Interactive areas:

点赞、评论、收藏
Like, comment, collect

4)建模 至下而上
4) Modeling from bottom to top

(1)ODS层

①保持数据原貌不做任何修改 起到备份作用
① Keep the original data without any modification to play a backup role

②采用压缩 减少磁盘空间,采用Gzip压缩
② Use compression (gzip) to reduce disk space

③创建分区表 防止后续全表扫描
③ Create partition tables to prevent subsequent full table scans

(2)DWD层 事实表
(2) DWD layer fact table

①事务型事实表
① Transaction type fact table

找原子操作
Find the atomic operations

a)选择业务过程
a) Select Business Process

选择感兴趣的业务过程。 产品经理提出的指标中需要的。
Select the business process of interest. Required in the metrics proposed by the product manager.

b)声明粒度
b) Declaration granularity

粒度:一行信息代表什么含义。可以是一次下单、一周下单、一个月下单。
Granularity: what a line of information represents. It can be a single order, a week order, a month order.

如果是一个月的下单,就没有办法统计一次下单情况。保持最小粒度。
If a row represented a month of orders, you could no longer count individual orders, so keep the finest grain.

只要你自己不做聚合操作就可以。
As long as you don't aggregate yourself.

c)确定维度
c) Determination of dimensions

确定感兴趣的维度。 产品经理提出的指标中需要的。
Determine the dimension of interest. Required in the metrics proposed by the product manager.

例如:用户、商品、活动、时间、地区、优惠卷
For example: users, products, activities, time, region, coupons

d)确定事实
d) Determine the facts

确定事实表的度量值。 可以累加的值,例如,个数、件数、次数、金额。
Determine the measures of the fact table: values that can be summed, e.g., counts, quantities, number of times, amounts.

事务型事实表的不足:
Deficiencies in transactional fact tables:

连续性指标,不好找原子操作。 例如,库存(周期快照事实表)
Metrics about continuous state, where atomic operations are hard to find, e.g., inventory (periodic snapshot fact table).

多事实表关联。例如,统计加购到支付的平均使用时长(累积型快照事实表)
Joins across multiple fact tables, e.g., computing the average time from adding to cart to payment (accumulating snapshot fact table).

(2)周期快照事实表
(2) Periodic snapshot fact table

①选择业务过程
① Select business process

②声明粒度 =》 1天
② Declare the grain => 1 day

③确定维度
③ Determine dimension

④确定事实
④ Determine the facts

(3)累积型快照事实表
(3) Cumulative snapshot fact table

①选择业务过程
① Select business process

②声明粒度
② Declare the grain

③确定维度
③ Determine dimension

④确定事实 确定多个事实表度量值
④ Determine the facts: determine the measures from multiple fact tables

(3)DIM层 维度表
(3) DIM layer dimension table

①维度整合 减少join
① Dimension integration to reduce joins

a)商品表、商品品类表、spu、商品一级分类、二级分类、三级分类=》商品维度表
a) commodity table, commodity category table, spu, commodity first class classification, second class classification, third class classification = commodity dimension table

b)省份表、地区表 =》 地区维度表
b) Province Table, Region Table = Region Dimension Table

c)活动信息表、活动规则表 =》 活动维度表
c) Activity Information Table, Activity Rule Table = Activity Dimension Table

②拉链表
② Zipper table

对用户表做了拉链。
The user table is maintained as a zipper table.

缓慢变化维 场景。
Scenario: slowly changing dimensions.

5)指标体系建设 至上而下
5) Index system construction from top to bottom

(1)ADS层

需求、日活、新增、留存、转化率、GMV
Demand, Daily Activity, New, Retention, Conversion Rate, GMV

(2)DWS层 聚合层
(2) DWS Layer Aggregate Layer

需求:派生指标、衍生指标
Requirements: derived metrics and compound metrics

派生指标 = 原子指标(业务过程 + 度量值 + 聚合逻辑) + 统计周期 + 统计粒度 + 业务限定
Derivative indicator = atomic indicator (business process + metric + aggregation logic)+ statistical period + statistical granularity + business limit

例如,统计,每天各个省份手机品牌交易总额
For example: total daily transaction amount per province and phone brand.

交易总额 (下单 + 金额 + sum ) + 每天 + 省份 + 手机品牌
Total transaction amount (order + amount + sum)+ daily + province + mobile phone brand

找公共的:业务过程 + 统计周期 + 统计粒度 建宽表
Find the common parts (business process + statistical period + statistical granularity) and build wide tables on them.
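A sketch of the derived metric above in SQL; the DWD table and column names are hypothetical:

SELECT
    dt,                                  -- statistical period: day
    province_name,                       -- statistical granularity: province
    brand_name,                          -- statistical granularity: phone brand
    SUM(order_amount) AS total_amount    -- atomic metric: order + amount + sum
FROM dwd_trade_order_detail
GROUP BY dt, province_name, brand_name;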

2.12 数仓每层做了哪些事
2.12 What does each layer of the warehouse do?

1)ODS层做了哪些事?
1) What does the ODS layer do?

(1)保持数据原貌,不做任何修改
(1) Keep the data as it is and make no changes

(2)压缩采用gzip,压缩比是100g数据压缩完30g左右。
(2) Compression uses gzip; the compression ratio is roughly 100 GB of data down to about 30 GB.

(3)创建分区表
(3) Create partition table

2)DIM/DWD层做了哪些事?
2) What does DIM/DWD do?

建模里面的操作,正常写。
Just describe the modeling work above as you normally would.

(1)数据清洗的手段
(1) Means of data cleansing

HQL、MR、SparkSQL、Kettle、Python(项目采用SQL进行清洗)
HQL, MR, SparkSQL, Kettle, Python (this project uses SQL for cleaning)

(2)清洗规则
(2) Cleaning rules

金额必须都是数字,[0-9]、手机号、身份证、匹配网址URL
Amounts must be purely numeric ([0-9]); validate mobile-phone numbers, ID-card numbers, and URLs with matching rules.

解析数据、核心字段不能为空、过期数据删除、重复数据过滤
Parse data, core field cannot be blank, obsolete data deletion, duplicate data filtering

json => 很多字段 =》 一个一个判断 =》 取数,根据规则匹配
JSON => many fields => check them one by one => extract the values and match them against the rules (see the sketch below).
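A sketch of these cleaning rules expressed in SQL (Hive syntax; the staging table and columns are hypothetical):

SELECT *
FROM ods_order_info
WHERE total_amount RLIKE '^[0-9]+(\\.[0-9]+)?$'   -- amount must be numeric
  AND id IS NOT NULL                              -- core fields must not be null
  AND user_phone RLIKE '^1[0-9]{10}$';            -- simple mobile-number check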

(3)清洗掉多少数据算合理
(3) How much data is reasonable to clean

参考,1万条数据清洗掉1条
As a reference: about 1 record cleaned out per 10,000.

(4)脱敏
(4) Desensitization

对手机号、身份证号等敏感数据脱敏
Desensitization of sensitive data such as mobile phone number and ID card number.

①加*
① Mask with asterisks

135****0013 互联网公司经常采用
135****0013: commonly used by Internet companies

②加密算法 md5 需要用数据统计分析,还想保证安全
② Hash with MD5: used when the data is still needed for statistical analysis but must be kept secure

美团 滴滴 md5(12334354809)=》唯一值
Meituan Didi md5 (12334354809)=》Unique value

③加权限 需要正常使用 军工、银行、政府
③ Access control: for data that must be used as-is, e.g., defense, banking, government

(5)压缩snappy
(5) Compression: snappy

(6)orc列式存储
(6) ORC columnar storage

3)DWS层做了哪些事?
3) What does the DWS layer do?

指标体系建设里面的内容再来一遍。
Go through the metric-system building content again.

4)ADS层做了哪些事?
4) What does the ADS layer do?

一分钟至少说出30个指标。
Say at least 30 indicators a minute.

日活、月活、周活、留存、留存率、新增(日、周、年)、转化率、流失、回流、七天内连续3天登录(点赞、收藏、评价、购买、加购、下单、活动)、连续3周(月)登录、GMV、复购率、复购率排行、点赞、评论、收藏、领优惠卷人数、使用优惠卷人数、沉默、值不值得买、退款人数、退款率 topn 热门商品
DAU, MAU, WAU, retention, retention rate, new users (daily, weekly, yearly), conversion rate, churn, win-back, 3 consecutive active days within 7 days (likes, favorites, reviews, purchases, cart additions, orders, activities), 3 consecutive weeks (months) active, GMV, repurchase rate, repurchase-rate ranking, likes, comments, favorites, number of users claiming coupons, number of users using coupons, silent users, worth-buying analysis, number of refunds, refund rate, top-N popular products

产品经理最关心的:留转G复活
What product managers care about most: retention, conversion, GMV, repurchase, and active users (留转G复活).

2.13 数据量
2.13 Data Volume

数据量的描述都是压缩前的数据量。
The description of the data volume is the data volume before compression.

1)ODS层:

(1)用户行为数据(100g => 1亿条;1g => 100万条)
(1) User behavior data (100g => 100 million items;1g => 1 million items)

曝光(60g or 600万条)、页面(20g)、动作(10g)、故障 + 启动(10g)
Exposure (60g or 6 million), page (20g), action (10g), fault + start (10g)

(2)业务数据(1-2g => 100万-200万条)
(2) Business data (1-2g => 1 million-2 million items)

登录(20万)、注册(100-1000);
Login (200,000), registration (100-1000);

加购(每天增量20万、全量100万)、下单(10万)、支付(9万)、物流(9万)、取消下单(500)、退款(500);
Additional purchase (daily increment of 200,000, full amount of 1 million), order (100,000), payment (90,000), logistics (90,000), cancellation of order (500), refund (500);

领取优惠卷(5万)、使用优惠卷下单(4万)、使用优惠卷支付(3万);
Receive coupons (50,000), order with coupons (40,000), pay with coupons (30,000);

点赞(1000)、评论(1000)、收藏(1000);
Likes (1000), Comments (1000), Collections (1000);

用户(活跃用户100万、新增1000、总用户1千万)、商品SPU(1-2万)、商品SKU(10-20万)、活动(1000)、时间(忽略)、地区(忽略)
Users (1 million active users, 1000 new users, 10 million total users), Product SPU (10,000 - 20,000), Product SKU (100,000 - 200,000), Activity (1000), Time (ignore), Region (ignore)

2)DWD层 + DIM层:
2) DWD + DIM:

和ODS层几乎一致;
Almost the same as the ODS layer.

3)DWS层

轻度聚合后,20g-50g
After light aggregation, 20-50 GB.

4)ADS层

10-50m之间,可以忽略不计。
Between 10 and 50 MB; negligible.

2.14 项目中遇到哪些问题?(*****
2.14 What problems did you encounter in the project? (*****)

1)Flume零点漂移
1) Flume zero drift

2)Flume挂掉及优化
2) Flume hangs and optimizes

3)Datax空值、调优
3) Datax null value and tuning

4)HDFS小文件处理
4) HDFS small file processing

5)Kafka挂掉
5) Kafka hangs

6)Kafka丢失
6) Kafka is lost

7)Kafka数据重复
7) Kafka data is duplicated

8)Kafka消息数据积压
8) Kafka message data backlog

9)Kafka乱序
9) Kafka out of order

10)Kafka顺序
10) Kafka ordering

11)Kafka优化(提高吞吐量)
11) Kafka optimization (increase throughput)

12)Kafka底层怎么保证高效读写
12) How to ensure efficient reading and writing at the bottom of Kafka?

13)Kafka单条日志传输大小
13) The size of a single Kafka log transfer

14)Hive优化(Hive on Spark)
14) Hive optimization (Hive on Spark)

15)Hive解决数据倾斜方法
15) Hive solves the data skew method

19)疑难指标编写(7天内连续3次活跃、1 7 30指标、路径分析、用户留存率、最近7/30日各品牌复购率、最近30天发布的优惠券的补贴率、 同时在线人数)
19) Preparation of difficult indicators (3 consecutive active in 7 days, 1 7 30 indicators, path analysis, user retention rate, repurchase rate of each brand in the last 7/30 days, subsidy rate of coupons released in the last 30 days, number of people online at the same time)

20)DS任务挂了怎么办?
20) What should I do if the DS task hangs?

21)DS故障报警
21) DS fault alarm

2.15 离线---业务
2.15 Offline --- Services

2.15.1 SKU和SPU

SKU:一台银色、128G内存的、支持联通网络的iPhoneX。
SKU: a silver iPhone X with 128 GB storage that supports China Unicom's network.

SPU:iPhoneX。

Tm_id:品牌Id苹果,包括IPHONE,耳机,MAC等。
Tm_id: brand ID, e.g., Apple, which covers iPhone, earphones, Mac, etc.

2.15.2 订单表跟订单详情表区别?
2.15.2 What is the difference between the order form and the order details table?

订单表的订单状态会变化,订单详情表不会,因为没有订单状态。
The order status of the order table changes, but the order details table does not, because there is no order status.

订单表记录user_id,订单id,订单编号,订单总金额,order_status(订单状态),支付方式等。
The order table records user_id, order id, order number, total order amount, order_status (order status), payment method, etc.

订单详情表记录user_id,商品sku_id,具体的商品信息(商品名称sku_name,价格order_price,数量sku_num)
The order details table records user_id, product sku_id, and specific product information (product name sku_name, price order_price, quantity sku_num)

2.15.3 上卷和下钻
2.15.3 Roll up and down drill

上卷:上卷是沿着维度的层次向上聚集汇总数据。
Rollup: Rollup is the aggregation of summarized data along the hierarchy of dimensions.

下探(钻):下探是上卷的逆操作,它是沿着维的层次向下,查看更详细的数据。
Drill down: Drill down is the reverse of the scroll, which is down the hierarchy of dimensions to see more detailed data.

比如这个经典的数据立方体模型:
For example, this classic data cube model:

维度有产品、年度地区等,统计销售额。实际上,维度还可以更细粒度,如时间维可由年、季、月、日构成,地区也可以由国家、省份、市、区县构成等。
Dimensions include product, year, region, etc., and statistics on sales. In fact, the dimensions can be more fine-grained, for example, the time dimension can be composed of years, quarters, months, and days, and regions can also be composed of countries, provinces, cities, districts, and counties.

下钻可以理解为由粗粒度到细粒度来观察数据,比如对产品销售情况分析时,可以沿着时间维从年到月到日更细粒度的观察数据。
Drilling down can be understood as observing data from coarse-grained to fine-grained, for example, when analyzing product sales, you can observe data at a finer level from year to month to day along the time dimension.

增加维度粒度“月”。
That is, adding the dimension granularity "month".

上卷和下钻是相逆的操作,所以上卷可以理解为删掉维的某些粒度,由细粒度到粗粒度观察数据向上聚合汇总数据。
Scrolling up and drilling down are reversed operations, so scrolling up can be understood as deleting some granularity of dimensions, observing data from fine-grained to coarse-grained, and aggregating and summarizing data upwards.

2.15.4 TOBTOC解释
2.15.4 TOB and TOC Interpretation

TOB(toBusiness):表示面向的用户是企业
TOB (toBusiness): indicates that the user is an enterprise.

TOC(toConsumer):表示面向的用户是个人
TOC (toConsumer): indicates that the user to whom the target user is an individual.

2.15.5 流转G复活指标
2.15.5 The "retention, conversion, GMV, repurchase, active" (留转G复活) metrics

1)活跃
1) Active users

日活:100万 ;月活:是日活的2-3倍 300万
DAU: 1 million; MAU: 2-3x DAU, about 3 million

总注册的用户多少?1000万-3000万之间。
How many total registered users? Between 10 and 30 million.

渠道来源:app 公众号 抖音 百度 36 头条 地推
Channel sources: app, WeChat official account, Douyin, Baidu, 36Kr, Toutiao, offline promotion

2)GMV

GMV:每天 10万订单 (50 100元) 500万-1000万
GMV: 100,000 orders per day (50 - 100 yuan) 5 million-10 million

10%-20% 100万-200万(人员:程序员、人事、行政、财务、房租、收电费)
10%-20% 1 million-2 million (personnel: programmers, personnel, administration, finance, rent, electricity)

3)复购率
3) Repurchase rate

某日常商品复购;(手纸、面膜、牙膏)10%-20%
Re-purchase of a daily commodity;(toilet paper, mask, toothpaste) 10%-20%

电脑、显示器、手表 1%
Computers, monitors, watches 1%

4)转化率
4) Conversion rate

商品详情 =》 加购物车 =》下单 =》 支付
Item Details =》Add Cart =》Order =》Pay

1%-5% 50-60% 80%-95%

5)留存率
5) Retention rate

1/2/3-60日、周留存、月留存
Day 1/2/3...60 retention, weekly retention, monthly retention

搞活动: 10-20%
Activities: 10-20%

2.15.6 活动的话,数据量会增加多少?怎么解决?
2.15.6 How much more data will be added if the event is active? How?

日活增加50%,GMV增加多少20%。(留转G复活)情人节,促销手纸。
Daily active users rise by about 50% and GMV by about 20% (think in terms of 留转G复活). For example Valentine's Day, with promotions on toilet paper.

集群资源都留有预量。11.11、6.18,数据量过大,提前动态增加服务器。
Cluster resources keep some headroom. For big events like 11.11 and 6.18, when data volume gets too large, add servers dynamically in advance.

加多少机器:3-4
How many machines are added: 3-4 units

2.15.7 哪个商品卖的好?
2.15.7 Which product sells well?

面膜、手纸,每天销售5000个。下载APP,根据自身业务。
Face masks and toilet paper, about 5,000 sold per day. (Download the relevant app and adjust to your own business.)

2.15.8 数据仓库每天跑多少张表,大概什么时候运行,运行多久
2.15.8 How many tables does the data warehouse run per day, when and how long does it run?

基本一个项目建一个库,表格个数为初始的原始数据表格加上统计结果表格的总数。(一般70-100张表格)。
Basically, a database is built for one project, and the number of tables is the total number of initial original data tables plus statistical result tables. (Usually 70-100 forms).

用户行为5张;业务数据33张表 =》 ODS 34张 =》 DWD 32张 =》 DWS 22张宽表 =》 ADS 15张,共103张。
5 user-behavior tables and 33 business tables => 34 ODS tables => 32 DWD tables => 22 DWS wide tables => 15 ADS tables, 103 tables in total.

Datax:00:10 => 10-20分钟左右 第一次全量。
Datax: starts at 00:10 and takes about 10-20 minutes; the first run is a full load.

用户行为数据,每天0:30开始运行。=》ds =》 5-6个小时运行完指标。
User behavior data starts at 00:30 every day => DolphinScheduler => all metrics finish in about 5-6 hours.

所有离线数据报表控制在8小时之内
All offline data reports are controlled within 8 hours.

大数据实时处理部分控制在5分钟之内。(分钟级别、秒级别)
Big data real-time processing is controlled within 5 minutes. (minutes, seconds)

如果是实时推荐系统,需要秒级响应。
If it is a real-time recommendation system, it requires a second-level response.

2.15.9 哪张表数据量最大
2.15.9 Which table has the largest amount of data

1)用户行为数据
1) User behavior data

曝光(60g or 6000万条)、页面(20g)
Exposure (60g or 60 million), page (20g)

2)业务数据(1-2g => 100万-200万条)
2) Business data (1-2g => 1 million-2 million items)

登录(20万)、注册(100-1000);
Login (200,000), registration (100-1000);

加购(20万)、下单(10万)
Cart additions (200,000), orders (100,000)

用户(活跃用户100万、新增1000、总用户1千万)
Users (1 million active users, 1000 new users, 10 million total users)

商品SKU(10万-20万)
SKU (100,000 - 200,000)

2.15.10 哪张表最费时间,有没有优化
2.15.10 Which table is the most time-consuming and optimized?

最费时间,一般是发生数据倾斜时,会比较费时间。
The most time-consuming jobs are usually the ones where data skew occurs.

1)Group By

(1)统计各个省份对应的交易额
(1) Counting the transaction volume corresponding to each province

第一个统计完的指标和最后一个统计完是时间相差20倍
The time difference between the first statistic and the last statistic is 20 times

我们从Yarn上看到的
What we saw on Yarn

一共执行了多长时间 4-5小时
How long did it take? 4-5 hours.

你想:发生了数据倾斜 任务停止掉
You realize data skew has happened and kill the job.

(2)解决办法:
(2) Solution:

①开启map-side 预聚合
① Enable map-side pre-aggregation

②skewindata
② Enable skewindata (two-stage aggregation for skewed group-by keys)

解决后的效果怎么样 ?
What is the effect after solving?

30-50分钟内执行完了
It finished within 30-50 minutes.
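The two fixes above correspond to the following Hive settings:

set hive.map.aggr=true;            -- map-side pre-aggregation
set hive.groupby.skewindata=true;  -- two-stage aggregation for skewed GROUP BY keys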

2)Join

统计 事实表 和维度表join => mapjoin
Metrics that join the fact table with dimension tables => map join

(1)小表 大表 join mapjoin
(1) Small table big table join mapjoin

解决办法: mapjoin
Solution: Mapjoin

(2)大表 =》 大表 join
(2) Large table join large table

项目中什么出现 统计 加购到支付的平均使用时长
Where it appeared in the project: computing the average time from adding to cart to payment.

执行时间 4-5小时 yarn
Execution time on Yarn: 4-5 hours.

①:skewjoin
① Skew join

②:smbjoin 分桶有序join 使用的前提(分桶且有序)
② SMB join (sort-merge-bucket join); prerequisite: both tables are bucketed and sorted

③:左表随机 右表扩容
③ Salt the left table with random prefixes and expand (replicate) the right table

④:通过建模 规避 大表join大表
④ Avoid large-table-to-large-table joins through modeling

累积型快照事实表
cumulative snapshot fact table
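For reference, the Hive settings behind the map join and skew join fixes above (the threshold value is illustrative):

set hive.auto.convert.join=true;   -- convert small-table joins to map joins automatically
set hive.optimize.skewjoin=true;   -- handle skewed join keys at runtime
set hive.skewjoin.key=100000;      -- rows per key above which a key is treated as skewed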

2.15.11 并发峰值多少?大概哪个时间点?
2.15.11 How many concurrent peaks? About what time?

高峰期晚上7-12点。Kafka里面20m/s 2万/s 并发峰值在1-2万人
Peak hours are 7 p.m. to midnight. Kafka sees about 20 MB/s, roughly 20,000 records/s; peak concurrency is about 10,000-20,000 users.

2.15.12 分析过最难的指标
2.15.12 Analyzed the most difficult indicators

路径分析
path analysis

用户留存率
User Retention rate

最近7/30日各品牌复购率
Re-purchase rate of each brand in the last 7/30 days

7天内连续3天登录
Log in for 3 consecutive days within 7 days

每分钟同时在线人数
Number of simultaneous users per minute

自己扩展
Extend this list based on your own project (a sketch for one of these metrics follows below).
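A sketch (Hive syntax) for "active on 3 consecutive days within the last 7 days"; the table dws_user_login_1d(dt, user_id), with one row per user per active day, is hypothetical:

SELECT DISTINCT user_id
FROM
(
    SELECT
        user_id,
        -- consecutive days collapse to the same group value
        date_sub(dt, row_number() OVER (PARTITION BY user_id ORDER BY dt)) AS grp
    FROM dws_user_login_1d
    WHERE dt >= date_sub(current_date(), 6)
) t
GROUP BY user_id, grp
HAVING count(*) >= 3;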

2.15.13 数仓中使用的哪种文件存储格式
2.15.13 Which file storage format is used in Warehouse

常用的包括:textFile,ORC,Parquet,一般企业里使用ORC或者Parquet,因为是列式存储,且压缩比非常高,所以相比于textFile,查询速度快,占用硬盘空间少。
Commonly used formats include TextFile, ORC, and Parquet. Companies generally use ORC or Parquet: they are columnar formats with very high compression ratios, so compared with TextFile they are faster to query and take less disk space.

2.15.14 数仓当中数据多久删除一次
2.15.14 How often is data deleted in the warehouse

(1)部分公司永久不删
(1) Some companies do not delete permanently

(2)有一年、两年“删除”一次的,这里面说的删除是,先将超时数据压缩下载到单独安装的磁盘上。然后删除集群上数据。 很少有公司不备份数据,直接删除的。
(2) Some "delete" once every year or two. Deletion here means the expired data is first compressed and archived to separately mounted disks, and only then removed from the cluster. Very few companies delete data outright without a backup.

2.15.15 Mysql业务库中某张表发生变化,数仓中表需要做什么改变
2.15.15 When a table in Mysql business library changes, what changes need to be made to the table in warehouse

修改表结构,将新增字段放置最后!
Modify the table structure and place the newly added fields last!

2.15.16 50多张表关联,如何进行性能调优
2.15.16 How to tune performance when joining 50+ tables

详细看第一章Hive多表join优化手段
See chapter 1 Hive multi-table join optimization method for details

2.15.17 拉链表的退链如何实现
2.15.17 How to roll back a change in a zipper (SCD) table

拉链表用于记录维度表中的历史变化。在拉链表中,当某个维度属性发生变化时,会插入一条新的记录,同时将原记录的有效期设置为截至。退链是指讲一个已经生效的变更恢复到上一个状态。实现思路如下:
Zipper tables record historical changes of dimension tables. When a dimension attribute changes, a new record is inserted and the validity period of the previous record is closed. A rollback (退链) means reverting a change that has already taken effect back to the previous state. The idea is as follows:

(1)定位要退链的记录:例如找到用户最近一次信息更新
(1) Locate the record to roll back: for example, find the user's most recent information update

(2)查询上一条记录:查询这条记录之前的一条记录(主键相同,不同版本记录,而不是单指上一条)
(2) Query the previous version: find the prior record with the same primary key (the previous version of that key, not simply the physically preceding row)

(3)更新有效期:将当前记录的生效时间或者有效开始时间更新为无效,将上条记录截至日期改为最大值。
(3) Update the validity periods: invalidate (remove or close) the current latest record and reset the previous record's end date back to the maximum value, e.g. 9999-12-31 (a SQL sketch follows).
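A Hive SQL sketch of the rollback for a single key (table, columns and the key value '1001' are hypothetical; partitioning is ignored for brevity): drop the latest version row of that key and re-open the previous version by resetting its end date.

insert overwrite table dim_user_zip
select user_id, user_name, start_date,
       case when user_id = '1001' and rk = 2 then '9999-12-31' else end_date end as end_date
from (
  select user_id, user_name, start_date, end_date,
         row_number() over (partition by user_id order by start_date desc) as rk
  from dim_user_zip
) t
where not (user_id = '1001' and rk = 1);   -- discard the latest version of user 1001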

2.15.18 离线数仓如何补数
2.15.18 How to backfill data in the offline data warehouse

补数:指重新处理一段历史时间范围内的数据,以修复数据问题
Backfilling: reprocessing the data of a historical time range to fix data problems

利用调度框架,海豚调度器的补数功能进行补数
Use the scheduling framework: DolphinScheduler's built-in backfill feature.

2.15.19 当ADS计算完,如何判断指标是正确的
2.15.19 When ADS is calculated, how to judge whether the indicator is correct

(1)样本数据验证:从计算结果抽取部分样本数据,与业务部门实际数据对比。
(1) Sample data validation: extract some sample data from the calculation results and compare them with the actual data of the business department.

(2)逻辑验证:检查指标的计算sql是否正确
(2) Logical verification: check whether the calculation sql of the indicator is correct.

(3)指标间关系验证:比较不同指标间关系,检查它们是否符合预期。例如:某个指标是另外一个指标的累计值,那么这两个指标一定存在关系
(3) Verification of the relationship between indicators: Compare the relationship between different indicators and check whether they meet expectations. For example, if an indicator is the cumulative value of another indicator, then there must be a relationship between the two indicators.

(4)历史数据对比:将计算结果和过去数据进行对比,观察指标的变化趋势
(4) Historical data comparison: compare the calculation results with past data and observe the change trend of indicators.

(5)异常值检测:检查计算结果中是否存在异常值
(5) Outlier detection: Check whether there is an outlier in the calculation result.

(6)跨数据部门对比:可以将计算结果与其它部门或团队的数据进行对比,进一步验证
(6) Cross-data department comparison: The calculation results can be compared with the data of other departments or teams for further verification.

2.15.20 ADS层指标计算错误,如何解决
2.15.20 ADS layer index calculation error, how to solve

(1)确定错误范围:找出指标计算错误的时间范围,指标及相关维度,缩小排查范围
(1) Determine the error range: find out the time range, indicators and related dimensions of the index calculation error, and narrow the scope of investigation.

(2)检查数据处理逻辑:从ods层开始排查,找出可能导致计算错误的数据清洗,转换和聚合等步骤,确认处理逻辑不出错误
(2) Check the data processing logic: start from the ods layer, find out the steps such as data cleaning, conversion and aggregation that may cause calculation errors, and confirm that the processing logic does not have errors.

(3)审查数据质量:确保每层数据完整性,一致性,准确性和时效性
(3) Review data quality: ensure data integrity, consistency, accuracy and timeliness at each level

(4)重新计算指标:修复数据质量问题和处理逻辑后,重新计算
(4) Recalculate metrics: recalculate after fixing data quality problems and processing logic

2.15.21 产品给新指标,该如何开发
2.15.21 How to develop new indicators for products

(1)确定需求:与产品经理和业务部门沟通
(1) Determine requirements: communicate with product managers and business departments

(2)数据源分析:分析现有数据源确定是否可以满足新指标计算需求,如果不能支撑需要引入新的数据源或者扩展现有数据源
(2) Data source analysis: analyze whether the existing data sources can support the new indicator; if not, introduce new data sources or extend the existing ones.

(3)数据模型设计:根据新指标,设计模型
(3) Data model design: design model according to new index.

(4)数据处理流程设计:利用sql进行数据的提取、清洗、转化和加载,实现指标
(4) Data processing flow design: use sql to extract, clean, transform and load data to achieve indicators

2.15.22 新出指标,原有建模无法实现,如何操作
2.15.22 A new indicator cannot be produced with the existing model; what do you do

引入新的数据源或者扩展现有数据源
Introducing new data sources or extending existing ones

2.15.23 和哪些部门沟通,以及沟通什么内容
2.15.23 Which departments to communicate with and what to communicate

运营:新指标,数据报告,异常数据
Operations: new metrics, data reports, outliers

后端开发:数据源以及存储,数据格式,业务逻辑
Back-end development: data sources and storage, data formats, business logic

前端开发:数据可视化要求,用户体验
Front-end development: data visualization requirements, user experience

BI部门:报表和分析需求
BI Department: Reporting and Analysis Requirements

2.15.24 你们的需求和指标都是谁给的
2.15.24 Who gave you your needs and indicators?

运营部门,BI部门报表,产品或者产品经理
Operations Department, BI Department Reports, Product or Product Manager

2.15.25 任务跑起来之后,整个集群资源占用比例
2.15.25 After the task runs, the proportion of resources occupied by the whole cluster

数仓脚本是串行执行的,ods=》dim=》dwd=》dws=》ads
Warehouse scripts are executed serially, ods=> dim=> dwd=> dws=> ads

所以占用集群资源比例,仅为当前执行层的脚本转化成底层Sparkjob需要的资源
Therefore, the proportion of cluster resources occupied is only the resources required for the conversion of the script of the current execution layer into the underlying Sparkjob.

CPU与内存比 1:4
CPU to memory ratio 1:4

离线: 128M数据 512M内存
Offline: 128 MB data 512 MB memory

实时:并行度与Kafka分区一致,CPU与Slot比 1:3
Real-time: parallelism consistent with the Kafka partition count, CPU to Slot ratio 1:3

20M/s -> 3个分区 -> CPU与Slot比 1:3 -> 3个Slot -> Core数1-> CPU与内存比 1:4 -> TM 1 slot -> TM 4G资源
20 MB/s -> 3 partitions -> CPU to Slot ratio 1:3 -> 3 Slots -> 1 Core -> CPU to memory ratio 1:4 -> TM 1 slot -> TM 4G resources

JobManager 2G内存 1CPU

平均 一个Flink作业6G内存,2Core
Average Flink job 6 GB memory, 2 Core

2.15.26 业务场景:时间跨度比较大,数据模型的数据怎么更新的,例如:借款,使用一年,再还款,这个数据时间跨度大,在处理的时候怎么处理
2.15.26 Business scenario: The time span is relatively large. How to update the data of the data model, for example: borrow money, use it for one year, and then repay it. The time span of this data is large. How to deal with it when processing it

累积型快照事实表!
Cumulative snapshot fact table!

2.15.27 数据倾斜场景除了group by 和join外,还有哪些场景
2.15.27 Data Skew Scenarios What are the other scenarios besides group by and join?

(1)数据过滤:在数据过滤场景下,如果某些数据过滤条件导致某个Map任务处理的数据量比其他任务多,就可能出现数据倾斜问题。
(1) Data filtering: In a data filtering scenario, if certain data filtering conditions cause a Map task to process more data than other tasks, a data skew problem may occur.

(2)使用Snappy压缩,原始文件大小不等,Map阶段数据倾斜。
(2) Snappy compression is used, the original file size varies, and the Map stage data is skewed.

(3)over开窗,partition by导致数据倾斜。
(3) over windowing, partition by causes data to tilt.

2.15.28 你的公司方向是电子商务,自营的还是提货平台?你们会有自己的商品吗?
2.15.28 Your company is in e-commerce: is it self-operated or a third-party/pickup platform? Do you sell your own products?

我们的公司是自营的,有自己的商品,我们有自己经营的商品维度表,也有商品从采购到上架、发货、收货等一系列流程的业务数据和日志数据。
Our company is self-operated and has its own commodities. We have our own commodity dimension table, as well as business data and log data of a series of processes from procurement to shelf, delivery and receipt of commodities.

需要下载APP根据自身情况灵活回答!
Download the company's app and tailor your answer to your own situation.

2.15.29 ods事实表中订单状态会发生变化,你们是通过什么方式去监测数据变化的
2.15.29 ods fact table order status will change, how do you monitor data changes

订单状态的改变会引起状态码的变更,这对应了不同的业务过程,可以被提取为不同的事务事实表,无论退款发生在哪个结点,支付成功、发货、收货这些业务过程都可以被记录在对应的事务事实表中。
The change of order status will cause the change of status code, which corresponds to different business processes and can be extracted into different transaction fact tables. No matter which node the refund occurs in, the business processes of successful payment, delivery and receipt can be recorded in the corresponding transaction fact tables.

2.15.30 用户域你们构建了哪些事实表?登录事实表有哪些核心字段和指标?用户交易域连接起来有哪些表?
2.15.30 User Domain What fact tables have you constructed? What are the core fields and metrics of the login fact table? What tables are there for linking user transaction domains?

在离线数仓中,我们创建了多种事实表来存储不同类型的用户数据。
In the offline warehouse, we create a variety of fact tables to store different types of user data.

1)登录事实表(Login Fact Table)
1) Login Fact Table

核心字段可能包括:
Core fields may include:

–用户ID(user_id)

–登录日期(login_date)
- Login date (login_date)

–登录时间(login_time)
- login_time

–设备类型(device_type)
- Device type (device_type)

–IP地址(ip_address)

–会话ID(session_id)

–应用版本(app_version)
- App version (app_version)

核心指标可能包括:
Core metrics may include:

–每日活跃用户数(DAU)
– Daily Active Users (DAU)

–每月活跃用户数(MAU)
– Monthly Active Users (MAU)

–按设备类型分类的活跃用户数
– Number of active users by device type

–平均每日登录次数
– Average number of logins per day

2)用户交易事实表(Transaction Fact Table)
2) Transaction Fact Table

核心字段可能包括:
Core fields may include:

–用户ID(user_id)

–交易ID(transaction_id)

–交易日期(transaction_date)
– Transaction date (transaction_date)

–交易时间(transaction_time)
– Trading hours (transaction_time)

–交易类型(transaction_type)
– Transaction Type (transaction_type)

–交易金额(transaction_amount)

–交易状态(transaction_status)

–付款方式(payment_method)
– Payment Methods (payment_method)

核心指标可能包括:
Core metrics may include:

–总交易量
– Total trading volume

–平均交易金额
– Average transaction amount

–按交易类型分类的交易量
– Volume by transaction type

–按付款方式分类的交易量
– Transaction volume by payment method

为了将用户登录与交易连接起来,我们还构建了以下的维度表:
To connect user logins with transactions, we also built the following dimension tables:

(1)用户维度表(User Dimension Table):包含用户ID、用户名、邮箱、注册日期等用户信息。
(1) User Dimension Table: Contains user information such as user ID, user name, email address, registration date, etc.

(2)会话维度表(Session Dimension Table):包含会话ID、登录时间、登出时间、访问页面数等会话信息。
(2) Session Dimension Table: Contains session information such as session ID, login time, logout time, and number of pages visited.

(3)日期维度表(Date Dimension Table):包含日期ID、年、月、日、周等日期信息。
(3) Date Dimension Table: Contains date information such as date ID, year, month, day, week, etc.

(4)设备维度表(Device Dimension Table):包含设备类型、品牌、操作系统等设备信息。
(4) Device Dimension Table: Contains device information such as device type, brand, operating system, etc.

通过在事实表和维度表之间建立关联,可以方便地查询和分析用户登录与交易行为之间的关系。
By establishing associations between fact tables and dimension tables, you can easily query and analyze the relationship between user logins and transaction behavior.

2.15.31 当天订单没有闭环结束的数据量?
2.15.31 Data volume of orders without closed loop closure on the day?

在离线数仓中,当天订单没有闭环结束的数据量通常指以下几种数据:
In offline warehouse, the amount of data that the order has not closed loop on the day usually refers to the following data:

以当天10万订单为标准
Based on 100,000 orders of the day

未支付订单:指当天生成的订单,但客户还未完成支付行为的数据。
Unpaid orders: orders created that day for which the customer has not yet completed payment.

300-500

待发货订单:指当天已经收到付款的订单,但还未处理发货行为的数据。
Orders awaiting shipment: orders paid that day but not yet shipped.

6

退货订单:指当天客户发起的退货请求,但商家还未处理完退货流程的数据。
Return orders: return requests initiated by customers that day for which the merchant has not yet finished the return process.

200-300

2.15.32 你们维度数据要做ETL吗?除了用户信息脱敏?没有做其他ETL吗
2.15.32 Do you do ETL on dimension data? Anything besides desensitizing user information?

除了用户信息脱敏,我们还对用户埋点数据中的IP、UA等信息进行解析,同时我们对数据进行去重操作,还会对空值以及异常数据进行处理,对于部分维度数据我们将其合并
In addition to user information desensitization, we also parse IP, UA and other information in the user buried point data. At the same time, we deduplicate the data, process null values and abnormal data, and merge some dimensional data.

2.15.33 怎么做加密,加密数据要用怎么办,我讲的md5,他问我md5怎么做恢复
2.15.33 How is encryption done and how is encrypted data used? I mentioned MD5 and was asked how MD5 data can be recovered.

在实际应用中,可以使用各种加密算法对数据进行加密,以确保数据安全。
In practical applications, various encryption algorithms can be used to encrypt data to ensure data security.

(1)对称加密:对称加密使用相同的密钥进行加密和解密。常用的对称加密算法有AES、DES、3DES等。使用对称加密时,需要确保密钥的安全性,否则加密数据的安全性可能会受到威胁。
(1) Symmetric encryption: symmetric encryption uses the same key for encryption and decryption. Common symmetric algorithms include AES, DES, and 3DES. When symmetric encryption is used, the key must be kept secure, otherwise the security of the encrypted data may be compromised.

(2)非对称加密:非对称加密使用一对密钥进行加密和解密。公钥用于加密数据,私钥用于解密数据。常用的非对称加密算法有RSA、ECC等。
(2) Asymmetric encryption: Asymmetric encryption uses a pair of keys for encryption and decryption. The public key is used to encrypt data and the private key is used to decrypt data. Common asymmetric encryption algorithms are RSA, ECC and so on.

(3)散列算法:散列算法使用固定长度的输出代表输入数据。常用的散列算法有MD5、SHA-1、SHA-2等。使用散列算法时,无法恢复原始数据,只能通过再次计算散列值来验证数据的完整性。
(3) Hashing algorithm: Hashing algorithm uses fixed-length output to represent input data. Common hash algorithms are MD5, SHA-1, SHA-2, etc. When using hashing algorithms, the original data cannot be recovered, and the integrity of the data can only be verified by calculating the hash value again.

如果要使用加密后的数据,可以使用相应的解密算法对其进行解密。如果使用的是散列算法,直接使用加密后的散列值当做原字段处理。
If you want to use encrypted data, you can decrypt it using the corresponding decryption algorithm. If the hash algorithm is used, the encrypted hash value is directly used as the original field.

对于MD5加密算法,由于它是一种不可逆的散列算法,无法恢复原始数据。它通常用于验证数据的完整性和一致性,而不是对数据进行加密。如果需要加密数据并且需要能够恢复原始数据,则可以使用对称加密或非对称加密算法。
For MD5 encryption algorithm, because it is an irreversible hash algorithm, it cannot recover the original data. It is usually used to verify the integrity and consistency of data rather than encrypting it. If you need to encrypt the data and you need to be able to recover the original data, you can use symmetric encryption or asymmetric encryption algorithms.
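For illustration, Hive's built-in md5() function (available in recent Hive versions) shows why the hashed value cannot be used to recover the plain text but can still be grouped and joined on; the table and column names are hypothetical:

select md5(phone_number) as phone_md5,   -- one-way hash, the original phone number is not recoverable
       count(*)          as login_cnt
from dwd_user_login
group by md5(phone_number);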

2.15.34 真实项目流程
2.15.34 Real Project Flow

1)数仓搭建完成之前:
1) Before the warehouse is built:

(1)需求分析:与产品了解需求,明确数仓实现功能
(1) Requirements analysis: understand the requirements with the product, and clarify the functions of the warehouse.

(2)数据源分析:分析数据源是否充足,是否需要引进新数据源
(2) Data source analysis: analyze whether data sources are sufficient and whether new data sources need to be introduced.

(3)数据模型设计
(3) Data model design

(4)技术选型
(4) Technical selection

(5)数据处理流程设计:Hivesql
(5) Data processing flow design: Hivesql

2)数仓搭建完成之后:
2) After building the warehouse:

(1)开发与维护
(1) Development and maintenance

(2)数据监控与报警
(2) Data monitoring and alarm

(3)数据质量管理
(3) Data quality management

(4)性能优化
(4) Performance optimization

(5)新需求开发
(5) Development of new requirements

(6)报表与可视化
(6) Reporting and visualization

(7)文档编写与知识分享
(7) Documentation and knowledge sharing

2.15.35 指标的口径怎么统一的(离线这边口径变了,实时这边怎么去获取的口径)
2.15.35 How to unify the caliber of indicators (offline caliber changed, how to obtain the caliber in real time)

在离线和实时计算中统一指标口径可以从以下几个方面处理:
Unifying the indicator calibre in offline and real-time calculations can be handled from the following aspects:

(1)定义统一的指标标准:首先需要定义清晰、明确的指标标准,并确保所有团队成员理解并遵循这些标准。这包括指标的计算方法、数据源、时间范围等。
Define uniform metrics: First, define clear, unambiguous metrics and ensure that all team members understand and follow them. This includes the calculation method of the indicator, data sources, time frame, etc.

2)版本控制和文档:使用版本控制工具(如 Git)来管理指标计算代码,并确保所有更改都经过审查和测试。同时,为指标计算方法编写详细的文档,以便团队成员能够理解和维护这些方法。
Version control and documentation: Use version control tools such as Git to manage metric calculation code and ensure that all changes are reviewed and tested. At the same time, document metrics calculation methods in detail so team members can understand and maintain them.

3)定期检查和校验:定期检查和校验离线和实时计算的结果,以确保指标口径的一致性。可以编写自动化测试用例,对比离线和实时计算的结果,以确保它们在不同场景下都能保持一致。
(3) Regular inspection and verification: Regular inspection and verification of offline and real-time calculation results to ensure the consistency of indicator caliber. Automated test cases can be written to compare results from offline and real-time calculations to ensure they are consistent across scenarios.

4)监控和报警:建立监控和报警机制,以便在离线和实时计算的指标口径出现不一致时及时发现并解决问题。
(4) Monitoring and alarm: Establish monitoring and alarm mechanism to find and solve problems in time when the indicators calculated offline and real-time are inconsistent.

2.15.36 表生命周期管理怎么做的?
2.15.36 How is table lifecycle management done?

表的生命周期管理是数据仓库和数据库管理的一个重要方面,需要关注表的创建、使用、优化和删除等阶段。管理表的生命周期有如下几个阶段:
Table life cycle management is an important aspect of data warehouse and database management, which needs to pay attention to the stages of table creation, use, optimization and deletion. The lifecycle of management tables has several stages:

1)表设计
1) Table design

2)数据导入和更新
2) Data import and update

3)数据存储和备份
3) Data storage and backup

4)数据安全和权限管理
4) Data security and rights management

5)性能优化和监控
5) Performance optimization and monitoring

6)数据归档和删除
6) Data archiving and deletion

2.15.37 如果上游数据链路非常的多,层级也非常的深,再知道处理链路和表的血缘的情况下,下游数据出现波动怎么处理?
2.15.37 With many deep upstream data pipelines, and knowing the processing chain and table lineage, how do you handle fluctuations in downstream data?

备份思想,类似Spark缓存思想
Backup idea, similar to Spark cache idea

2.15.38 十亿条数据要一次查询多行用什么数据库比较好?
2.15.38 With billions of records, which database is better for querying many rows at once?

Hbase Elasticsearch

2.16 埋点
2.16 Event Tracking (埋点)

1)埋点选择
1) Choice of tracking solution

免费的埋点:上课演示。前端程序员自己埋点。
Free tracking: used for class demos; front-end developers implement the tracking themselves.

收费的埋点:神策、百度统计、友盟统计。
Paid tracking products: Sensors Data (神策), Baidu Analytics, Umeng (友盟) Analytics.

2)埋点方式主要有两种:
2) There are two main tracking approaches:

(1)按照页面埋点,有几个页面就创建几个表。
(1) By page: create one table per page.

(2)按照用户行为:页面数据事件数据、曝光数据、启动数据和错误数据。 咱们项目中采用的这种方式。
(2) By user behavior: page data, event data, exposure data, startup data, and error data. This is the approach used in our project.

3)埋点数据日志格式
3) Tracking data log format

为了减少网络上数据的传输,日志格式设计时都会有公共信息。
In order to reduce data transmission over the network, log formats are designed with common information.

{
  "common": {                   -- 公共信息 public information
    "ar": "230000",             -- 地区编码 area code
    "ba": "iPhone",             -- 手机品牌 phone brand
    "ch": "Appstore",           -- 渠道 channel
    "md": "iPhone 8",           -- 手机型号 phone model
    "mid": "YXfhjAYH6As2z9Iq",  -- 设备id device id
    "os": "iOS 13.2.9",         -- 操作系统 operating system
    "uid": "485",               -- 会员id member id
    "vc": "v2.1.134"            -- app版本号 app version
  },
  "actions": [                  -- 动作(事件) actions (events)
    {
      "action_id": "favor_add", -- 动作id action id
      "item": "3",              -- 目标id target id
      "item_type": "sku_id",    -- 目标类型 target type
      "ts": 1585744376605       -- 动作时间戳 action timestamp
    }
  ],
  "displays": [
    {
      "displayType": "query",   -- 曝光类型 display type
      "item": "3",              -- 曝光对象id displayed object id
      "item_type": "sku_id",    -- 曝光对象类型 displayed object type
      "order": 1                -- 出现顺序 display order
    },
    {
      "displayType": "promotion",
      "item": "6",
      "item_type": "sku_id",
      "order": 2
    },
    {
      "displayType": "promotion",
      "item": "9",
      "item_type": "sku_id",
      "order": 3
    }
  ],
  "page": {                     -- 页面信息 page info
    "during_time": 7648,        -- 持续时间毫秒 duration in ms
    "item": "3",                -- 目标id target id
    "item_type": "sku_id",      -- 目标类型 target type
    "last_page_id": "login",    -- 上页类型 previous page
    "page_id": "good_detail",   -- 页面ID page id
    "sourceType": "promotion"   -- 来源类型 source type
  },
  "err": {                      -- 错误 error
    "error_code": "1234",       -- 错误码 error code
    "msg": "***********"        -- 错误信息 error message
  },
  "ts": 1585744374423           -- 跳入时间戳 entry timestamp
}

第3章 实时数仓项目
Chapter 3 Real-time Data Warehouse Project

3.1 为什么做这个项目
3.1 Why did you do this project

随着公司不断业务不断发展,产品需求和内部决策对于数据实时性要求越来越迫切,传统离线数仓T+1模式已经不能满足,所以需要实时数仓的能力来赋能。
With the continuous development of the company's business, product requirements and internal decision-making requirements for real-time data are becoming more and more urgent, and the traditional offline data warehouse T+1 mode can no longer meet the requirements, so the ability of real-time data warehouse is needed to empower it.

3.2 项目架构
3.2 Project Structure

3.3 框架版本选型
3.3 Framework version selection

和离线保持一致。
Be consistent with offline.

3.4 服务器选型
3.4 Server Selection

和离线保持一致。
Be consistent with offline.

3.5 集群规模
3.5 Cluster size

1)生产集群规模、Flink集群规模(10台为例)
1) Scale of production clusters and Flink clusters (10 for example)

项目中方便作业提交,Flink作为客户端,部署在所有的Worker节点
Flink is deployed on all worker nodes as a client for job submission

举例:Job数量在20左右,需要10台服务器
For example, if the number of jobs is about 20, you need 10 servers

Clickhouse单独部署,服务器使用128G,64C
Clickhouse is deployed separately, and the server uses 128G and 64C

集群服务部署(10台节点,按服务数量汇总):
Cluster service layout (10 nodes, summarized by service count):

nn ×2、dn ×6、rm ×2、nm ×6、zk ×3、Kafka ×3、Flume ×3、Hive ×2、MySQL ×1、Spark ×2、DS ×2、Datax ×1、Maxwell ×1、Hbase ×3、Flink ×6、CK ×2

3.6 项目建模
3.6 Project Modeling

1)数据调研
1) Data research

(1)先和Java人员要表,表中最好有字段的描述或者有表和字段的说明文档。(项目经理帮助协调) =》 快速熟悉表中业务。梳理清楚业务线,找到事实表和维度表。
(1) First ask Java personnel for a table, and it is best to have a description of the fields in the table or a description of the table and fields. (Project manager helps coordinate) => Quickly familiarize yourself with the business in the table. Sort out the lines of business and find the fact table and dimension table.

(2)和业务人员聊 =》 验证你猜测的是否正确
(2) Talk to business staff => verify whether your guesses are correct

(3)和产品经理聊
(3) Talk to the product manager

需求:派生指标、衍生指标
Requirements: derived indicators and compound (derivative) indicators

派生指标 = 原子指标(业务过程 + 度量值 + 聚合逻辑) + 统计周期 + 统计粒度 + 业务限定
Derivative indicator = atomic indicator (business process + metric + aggregation logic)+ statistical period + statistical granularity + business limit

需求中的业务过程必须和实际的后台业务能对应上。
The business processes in the requirements must correspond to the actual background business.

2)明确数据域
2) Clarify the data domain

(1)用户域:登录、注册
(1) User domain: login, registration

(2)流量域:启动、页面、动作、故障、曝光
(2) Flow domain: start, page, action, fault, exposure

(3)交易域:加购、下单、支付、物流
(3) Transaction domain: purchase, order, payment, logistics

(4)工具域:领取优惠卷、使用优惠卷下单、使用优惠卷支付
(4) Tool domain: receive coupons, place orders with coupons, pay with coupons

(5)互动域:点赞、评论、收藏
(5) Interactive domain: likes, comments, favorites

3)构建业务矩阵
3) Build Business Matrix

用户、商品、活动、时间、地区、优惠卷
User, Product, Event, Time, Region, Coupon

(1)用户域:
(1) User domain:

登录、注册
Login, Register

(2)流量域: √
(2) Flow field: √

启动、页面、动作、故障、曝光
Start, page, action, fault, exposure

(3)交易域:
(3) Transaction domain:

加购、下单、支付、物流
Purchase, Order, Payment, Logistics

(4)工具域:
(4) Toolfields:

领取优惠卷、使用优惠卷下单、使用优惠卷支付
Receive coupons, order with coupons, pay with coupons

(5)互动域:
(5) Interactive areas:

点赞、评论、收藏
Like, comment, collect

4)建模 至下而上
4) Modeling from bottom to top

(1)ODS层

①存Kafka: topic_log\topic_db ,保持数据原貌不做处理
① Save Kafka: topic_log\topic_db , keep the data as it is and do not process it

(2)DWD层 事实表
(2) DWD layer fact table

①事务型事实表
① Transaction type fact table

找原子操作
atomic operation

a)选择业务过程
a) Select Business Process

选择感兴趣的业务过程。 产品经理提出的指标中需要的。
Select the business process of interest. Required in the metrics proposed by the product manager.

b)声明粒度
b) Declaration granularity

粒度:一行信息代表什么含义。可以是一次下单、一周下单、一个月下单。
Granularity: what a line of information represents. It can be a single order, a week order, a month order.

如果是一个月的下单,就没有办法统计一次下单情况。保持最小粒度。
If the grain is one month of orders, individual orders can no longer be analyzed. Keep the finest granularity.

只要你自己不做聚合操作就可以。
As long as you don't aggregate yourself.

c)确定维度
c) Determination of dimensions

确定感兴趣的维度。 产品经理提出的指标中需要的。
Determine the dimension of interest. Required in the metrics proposed by the product manager.

例如:用户、商品、活动、时间、地区、优惠卷
For example: users, products, activities, time, region, coupons

d)确定事实
d) Establishment of facts

确定事实表的度量值。 可以累加的值,例如,个数、件数、次数、金额。
Determine the measures of the fact table: values that can be accumulated, e.g. counts, item quantities, numbers of occurrences, amounts.

e)维度退化
e) Dimensional degradation

通过Lookupjoin 将字典表中字段退化到明细表中
Degenerate dictionary-table fields into the detail (fact) table via lookup join

(3)DIM层 维度表
(3) DIM layer dimension table

①维度数据存储Hbase,同时不做维度整合
① Dimension data is stored in Hbase without dimension integration.

5)指标体系建设 至上而下
5) Index system construction from top to bottom

(1)ADS层

需求、日活、新增、留存、转化率、GMV
Demand, Daily Activity, New, Retention, Conversion Rate, GMV

(2)DWS层 聚合层
(2) DWS Layer Aggregate Layer

需求:派生指标、衍生指标
Requirements: derived indicators and compound (derivative) indicators

派生指标 = 原子指标(业务过程 + 度量值 + 聚合逻辑) + 统计周期 + 统计粒度 + 业务限定
Derivative indicator = atomic indicator (business process + metric + aggregation logic)+ statistical period + statistical granularity + business limit

例如,统计,每天各个省份手机品牌交易总额
For example: compute the total daily transaction amount of each phone brand in each province

交易总额 (下单 + 金额 + sum ) + 每天 + 省份 + 手机品牌
Total transaction amount (order + amount + sum)+ daily + province + mobile phone brand

找公共的:业务过程 + 统计周期 + 统计粒度 建宽表
Find what is common (business process + statistical period + statistical granularity) and build a wide table on it (see the sketch below).
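A hedged Flink SQL sketch of such a DWS wide-table aggregation (the source table, columns, and the event-time attribute row_time are hypothetical; the 5-second window matches the light-aggregation window described later):

select window_start, window_end,
       province_id, brand,
       sum(split_total_amount) as order_amount
from table(
  tumble(table dwd_trade_order_detail, descriptor(row_time), interval '5' second))
group by window_start, window_end, province_id, brand;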

3.7 数据量
3.7 data size

3.7.1 数据分层数据量
3.7.1 Data Stratification Data Volume

1)ODS层

(1)用户行为数据(100g => 1亿条;1g => 100万条)
(1) User behavior data (100g => 100 million items;1g => 1 million items)

曝光(60g or 600万条)、页面(20g)、动作(10g)、故障 + 启动(10g)
Exposure (60g or 6 million), page (20g), action (10g), fault + start (10g)

(2)业务数据(1-2g => 100万-200万条)
(2) Business data (1-2g => 1 million-2 million items)

登录(20万)、注册(100-1000);
Login (200,000), registration (100-1000);

加购(每天增量20万、全量100万)、下单(10万)、支付(9万)、物流(9万)、取消下单(500)、退款(500);
Additional purchase (daily increment of 200,000, full amount of 1 million), order (100,000), payment (90,000), logistics (90,000), cancellation of order (500), refund (500);

领取优惠卷(5万)、使用优惠卷下单(4万)、使用优惠卷支付(3万);
Receive coupons (50,000), order with coupons (40,000), pay with coupons (30,000);

点赞(1000)、评论(1000)、收藏(1000);
Likes (1000), Comments (1000), Collections (1000);

用户(活跃用户100万、新增1000、总用户1千万)、商品SPU(1-2万)、商品SKU(10-20万)、活动(1000)、时间(忽略)、地区(忽略)
Users (1 million active users, 1,000 new users, 10 million total users), product SPU (1-20,000), product SKU (10-200,000), activity (1,000), time (ignore), region (ignore)

2)DWD层 + DIM层
2) DWD layer + DIM layer

和ODS层几乎一致;
Almost identical to the ODS layer;

3)DWS层

轻度聚合后,20g-50g
After light aggregation, 20-50 GB.

4)ADS层

10-50m之间,可以忽略不计。
Between 10-50m, negligible.

3.7.2 实时组件存储数据量
3.7.2 The amount of data stored by the real-time component

1)Kafka:

ods层和dwd层数据
ODS layer and DWD layer data

ods和dwd数据一致,每天约200G数据
The ods and dwd data are consistent, about 200G data per day

考虑kafka副本2个,保存三天,kafka存储400G数据
Consider that there are two copies of Kafka that are stored for three days, and Kafka stores 400 GB of data

2)Hbase:

存储 dim层数据,与离线一致
Stores DIM layer data, consistent with offline

3)Clickhouse:

存储dws层数据,每天约20~30G数据
Store data at the DWS layer, about 20~30G data per day

考虑dws层数据保存一年,Clickhouse三个副本
Consider that the data at the dws layer is stored for one year and three copies of Clickhouse

数据约15~20T
The data is about 15~20T

3.7.3 实时QPS峰值数据量
3.7.3 Real-time QPS peak data volume

QPS峰值:20000/s或者2M/s
Peak QPS: 20,000 pieces/s or 2 Mbit/s

3.8 项目中遇到哪些问题及如何解决?
3.8 What problems are encountered in the project and how to solve them?

3.8.1 业务数据采集框架选型问题
3.8.1 Selection of business data collection framework

详见第一章(FlinkCDC,Maxwell,Canal)对比
See Chapter 1 (FlinkCDC, Maxwell, Canal) for comparison

3.8.2 项目中哪里用到状态编程,状态是如何存储的,怎么解决大状态问题
3.8.2 Where state programming is used in the project, how state is stored, and how to solve large state problems

1)Dim动态分流使用广播状态,新老访客修复使用键控状态
1) Dim dynamic shunt uses broadcast state, new and old visitors repair uses keyed state

状态中数据少使用HashMap,状态中数据多的使用RocksDB
HashMap is used for less data in state, RocksDB is used for more data in state

2)大状态优化手段
2) Large state optimization means

(1)使用rocksdb

(2)开启增量检查点、本地恢复、设置多目录
(2) Enable incremental checkpoints, local recovery, and set up multiple directories

(3)设置预定义选项为 磁盘+内存 的策略,自动设定 writerbuffer、blockcache等
(3) Set the predefined options to a disk + memory strategy, which automatically tunes the write buffer, block cache, etc. (config sketch below)
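The corresponding configuration, shown here as Flink SQL client SET statements (the same keys usually live in flink-conf.yaml; the directory paths are illustrative):

SET 'state.backend' = 'rocksdb';
SET 'state.backend.incremental' = 'true';                  -- incremental checkpoints
SET 'state.backend.local-recovery' = 'true';               -- local recovery
SET 'state.backend.rocksdb.localdir' = '/data1/rocksdb,/data2/rocksdb';  -- multiple local directories
SET 'state.backend.rocksdb.predefined-options' = 'SPINNING_DISK_OPTIMIZED_HIGH_MEM';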

3.8.3 项目中哪里遇到了反压,造成的危害,定位解决(*重点*)
3.8.3 Where backpressure occurred in the project, the harm it causes, and how to locate and fix it (*key point*)

1)项目中反压造成的原因
1) Reasons for back pressure in the project

流量洪峰:不需要解决
Traffic spikes: no need to fix

频繁GC:比如代码中大量创建临时对象
Frequent GC: For example, temporary objects are created in large numbers in code

大状态:新老访客修复
Large state: new/returning visitor correction

关联外部数据库:从Hbase读取维度数据或将数据写入Clickhouse
Associating external databases: reading dimension data from Hbase or writing data to Clickhouse

数据倾斜:keyby之后不同分组数据量不一致
Data skew: uneven data volume across key groups after keyBy

2)反压的危害
2) Harm of back pressure

问题:Checkpoint超时失败导致job挂掉
Problem: Checkpoint timeout failure causes job to fail

内存压力变大导致的OOM导致job挂掉
OOM caused by increased memory pressure causes job to hang

时效性降低
timeliness reduction

3)定位反压
3) Positioning back pressure

(1)利用Web UI定位
(1) Using Web UI positioning

定位到造成反压的节点,排查的时候,先把operator chain禁用,方便定位到具体算子。
Locate the node causing back pressure. When checking, disable the operator chain first to facilitate locating the specific operator.

Flink 现在在UI上通过颜色和数值来展示繁忙和反压的程度。
Flink now displays the degree of busyness and backpressure in the UI by color and numeric values.

上游都是high,找到第一个为ok的节点就是瓶颈节点。
All upstream nodes show high backpressure; the first downstream node that shows OK is the bottleneck node.

(2)利用Metrics定位
(2) Positioning with Metrics

可以根据指标分析反压:buffer.inPoolUsage、buffer.outPoolUsage
Backpressure can be analyzed according to indicators: buffer.inPoolUsage, buffer.outPoolUsage

可以分析数据传输
Data transmission can be analyzed

4)处理反压
4) Dealing with back pressure

反压可能是暂时的,可能是由于负载高峰、CheckPoint 或作业重启引起的数据积压而导致反压。如果反压是暂时的,应该忽略它。
Backpressure may be temporary and may be caused by a data backlog caused by a load spike, CheckPoint, or job restart. If the backpressure is temporary, it should be ignored.

(1)查看是否数据倾斜
(1) Check whether the data is tilted

(2)使用火焰图分析看顶层的哪个函数占据的宽度最大。只要有"平顶"(plateaus),就表示该函数可能存在性能问题。
(2) Use flame diagram analysis to see which function at the top occupies the largest width. Any "plateaus" indicates that the function may have performance problems.

(3)分析GC日志,调整代码
(3) Analyze GC logs and adjust codes

(4)资源不合理造成的:调整资源
(4) Unreasonable resources: adjust resources

(5)与外部系统交互:
5) Interacting with external systems:

写MySQL、Clickhouse:攒批写入
Writing to MySQL/ClickHouse: batch the writes

读HBase:异步IO、旁路缓存
Read HBase: asynchronous IO, bypass cache

3.8.4 数据倾斜问题如何解决(***重点***)
3.8.4 How to solve data skew problems (***key point***)

1)数据倾斜现象:
1) Symptoms of data skew:

相同Task 的多个 Subtask 中,个别Subtask 接收到的数据量明显大于其他 Subtask 接收到的数据量,通过 Flink Web UI 可以精确地看到每个 Subtask 处理了多少数据,即可判断出 Flink 任务是否存在数据倾斜。通常,数据倾斜也会引起反压。
Among multiple Subtasks of the same Task, the amount of data received by individual Subtasks is significantly larger than that received by other Subtasks. Through Flink Web UI, you can accurately see how much data each Subtask has processed, and you can judge whether there is data skew in Flink tasks. Usually, data skewing also causes backpressure.

2)数据倾斜解决
2) Fixing data skew

(1)数据源倾斜
(1) Skewed data source

比如消费Kafka,但是Kafka的Topic的分区之间数据不均衡
For example, Kafka is consumed, but the data between the partitions of Kafka's Topic is unbalanced.

读进来之后调用重分区算子:rescale、rebalance、shuffle等
After reading in, call the repartition operator: rescale, rebalance, shuffle, etc.

(2)单表分组聚合(纯流式)倾斜
(2) Single-table group aggregation (pure streaming) skew

API:利用flatmap攒批、预聚合
DataStream API: use flatMap to batch records and pre-aggregate

SQL:开启MiniBatch+LocalGlobal

(3)单表分组开窗聚合倾斜
(3) Skew in single-table grouped windowed aggregation

第一阶段聚合:key拼接随机数前缀或后缀,进行keyby、开窗、聚合
The first stage of aggregation: key concatenate random number prefix or suffix to perform keyby, windowing, and aggregation

注意:聚合完不再是WindowedStream,要获取WindowEnd作为窗口标记作为第二阶段分组依据,避免不同窗口的结果聚合到一起)
Note: after the first-stage aggregation the stream is no longer a WindowedStream; take the WindowEnd as a window tag and include it in the second-stage grouping key so that results from different windows are not aggregated together.

第二阶段聚合:按照原来的key及windowEnd作keyby、聚合
The second stage of aggregation: keyby and aggregation based on the original key and windowEnd

在我们项目中,用到了Clickhouse,我们可以第一阶段打散聚合后,直接写入Click house,查clickhouse再处理第二阶段
In our project we use ClickHouse: after the first-stage (salted) aggregation the results are written directly to ClickHouse, and the second stage is done when querying ClickHouse (a batch-SQL sketch of the idea follows).
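The same salt-then-reaggregate idea, written as a batch Hive SQL sketch for readability (table and column names are hypothetical); in the streaming job the second stage additionally groups on the window end time:

with salted as (
  select province_id,
         cast(floor(rand() * 10) as int) as salt,
         order_amount
  from dwd_trade_order_detail
),
stage1 as (
  select province_id, salt, sum(order_amount) as partial_amount
  from salted
  group by province_id, salt
)
select province_id, sum(partial_amount) as total_amount
from stage1
group by province_id;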

3.8.5 数据如何保证一致性问题
3.8.5 How to ensure data consistency

上游:kafka保证offset可重发,kafka默认实现
Upstream: Kafka ensures that the offset can be retransmitted, and Kafka implements it by default

Flink:Checkpoint设置执行模式为Exactly_once
Flink: set Checkpoint to Exactly_once execution mode

下游:使用事务写入Kafka,使用幂等写入Clickhouse且查询使用final查询
Downstream: write to Kafka with transactions; write to ClickHouse idempotently and add FINAL when querying

3.8.6 FlinkSQL性能比较慢如何优化
3.8.6 How to optimize slow FlinkSQL jobs

(1)设置空闲状态保留时间
(1) Set the idle state retention time

(2)开启MiniBatch

(3)开启LocalGlobal

(4)开启Split Distinct

(5)多维Distinct使用Filter
(5) Use FILTER for multi-dimensional COUNT DISTINCT (settings sketch below)
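The corresponding Flink SQL client settings for items (1)-(4); the values shown are illustrative:

SET 'table.exec.state.ttl' = '1 h';                          -- (1) idle state retention time
SET 'table.exec.mini-batch.enabled' = 'true';                -- (2) MiniBatch
SET 'table.exec.mini-batch.allow-latency' = '5 s';
SET 'table.exec.mini-batch.size' = '20000';
SET 'table.optimizer.agg-phase-strategy' = 'TWO_PHASE';      -- (3) LocalGlobal two-phase aggregation
SET 'table.optimizer.distinct-agg.split.enabled' = 'true';   -- (4) Split Distinct
-- (5) e.g. COUNT(DISTINCT user_id) FILTER (WHERE os = 'Android') instead of CASE WHEN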

3.8.7 Kafka分区动态增加,Flink监控不到新分区数据导致数据丢失
3.8.7 Kafka partitions are dynamically increased, and Flink cannot monitor the data of the new partitions, resulting in data loss

设置Flink动态监控kafka分区的参数
Configure Flink's dynamic Kafka partition discovery parameter (connector option shown below).
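In Flink SQL the Kafka connector exposes this as a table option (the topic, field and server names below are hypothetical; the DataStream KafkaSource equivalent is the 'partition.discovery.interval.ms' property):

CREATE TABLE topic_log_source (
  `log` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'topic_log',
  'properties.bootstrap.servers' = 'hadoop102:9092',
  'properties.group.id' = 'dwd_log_split',
  'scan.startup.mode' = 'group-offsets',
  'scan.topic-partition-discovery.interval' = '10 s'   -- pick up newly created partitions
);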

3.8.9 Kafka某个分区没有数据,导致下游水位线无法抬升,窗口无法关闭计算
3.8.9 If there is no data in a Kafka partition, the downstream watermark cannot be raised and the calculation window cannot be closed

注入水位线时,设置最小等待时间
When assigning watermarks, set an idleness timeout (a minimum wait) so idle partitions do not hold back the watermark (see the setting below).
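The Flink SQL form of this fix is the source idle-timeout setting (60 s is an illustrative value); the DataStream equivalent is WatermarkStrategy ... .withIdleness(...):

SET 'table.exec.source.idle-timeout' = '60 s';   -- an idle partition no longer blocks the overall watermark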

3.8.10 Hbase的rowkey设计不合理导致的数据热点问题
3.8.10 Data hotspots caused by the unreasonable rowkey design of HBase

详见Hbase的rowkey设计原则
For details, see Hbase's rowkey design principles

3.8.11 Redis和HBase的数据不一致问题
3.8.11 Data inconsistencies between Redis and HBase

对Redis和数据库的操作有2种方案:
There are two ways to manipulate Redis and databases:

(1)先操作(删除)Redis,再操作数据库
(1) First operate (delete) Redis, then operate database

并发下可能产生数据一致性问题。
Concurrency can cause data consistency problems.

上面的图表示,Thread-1 是个更新流程,Thread-2 是个查询流程,CPU 执行顺序是:Thread-1 删除缓存成功,此时 Thread-2 获取到 CPU 执行查询缓存没有数据,然后查询数据库把数据库的值写入缓存,因为此时 Thread-1 更新数据库还没有执行,所以缓存里的值是一个旧值(old),最后 CPU 执行 Thread-1 更新数据库成功的代码,那么此时数据库的值是新增(new),这样就产生了数据不一致行的问题。
The diagram above shows that Thread-1 is an update flow and Thread-2 is a query flow. The CPU execution order is: Thread-1 deletes the cache successfully; Thread-2 then runs, finds nothing in the cache, queries the database, and writes the database value into the cache. Because Thread-1's database update has not executed yet, the value written to the cache is the old value. Finally Thread-1's database update succeeds, so the database holds the new value while the cache still holds the old one, which causes the data inconsistency.

解决上述问题的两种方案:
There are two ways to solve these problems:

①加锁,使线程顺序执行:如果一个服务部署到了多个机器,就变成了分布式锁,或者是分布式队列按顺序去操作数据库或者 Redis,带来的副作用就是:数据库本来是并发的,现在变成串行的了,加锁或者排队执行的方案降低了系统性能,所以这个方案看起来不太可行。
Locking, making threads execute sequentially: If a service is deployed to multiple machines, it becomes a distributed lock, or a distributed queue operates the database or Redis in order, and the side effect is that the database was originally concurrent, but now it becomes serial. The locking or queuing scheme reduces system performance, so this scheme does not seem feasible.

②采用双删:先删除缓存,再更新数据库,当更新数据后休眠一段时间再删除一次缓存。
② Double deletion: delete the cache first, then update the database, and delete the cache once after sleeping for a period of time after updating the data.

(2)先操作数据库,再操作(删除) Redis
(2) First operate the database, then operate (delete) Redis

我们如果更新数据库成功,删除 Redis 失败,那么 Redis 里存放的就是一个旧值,也就是删除缓存失败导致缓存和数据库的数据不一致了
If we update the database successfully and delete Redis fails, then Redis stores an old value, that is, the failure to delete the cache causes the cache and database data to be inconsistent.

上述二种方案,都希望数据操作要么都成功,要么都失败,也就是最好是一个原子操作,我们不希望看到一个失败,一个成功的结果,因为这样就产生了数据不一致的问题。
In both cases, we hope that the data operation will either succeed or fail, that is, it is best to be an atomic operation. We do not want to see a failure and a successful result, because this will create data inconsistency problems.

3.8.12 双流join关联不上如何解决
3.8.12 How to handle records that fail to match in a dual-stream join

(1)使用interval join调整上下限时间,但是依然会有迟到数据关联不上
(1) Use interval join to adjust the upper and lower limit time, but there will still be late data that is not associated.

(2)使用left join,带回撤关联
(2) Use left join, which emits retractions/updates when the matching record arrives later

(3)可以使用Cogroup+connect关联两条流
(3) You can associate two streams using Cogroup+Connect.

3.9 生产经验
3.9 production experience

3.9.1 Flink任务提交使用那种模式,为何选用这种模式
3.9.1 Which mode does Flink task submission use and why?

项目中提交使用的per-job模式,因为每个job资源隔离、故障隔离、独立调优
Per-job pattern used in project submission, because each job is resource isolated, fault isolated, and tuned independently

3.9.2 Flink任务提交参数,JobManager和TaskManager分别给多少
3.9.2 Flink Task Submission Parameters: How many JobManager and TaskManager are given respectively?

JobManager:内存默认1G,cpu默认1核
JobManager: Memory default 1G, CPU default 1 core

TaskManager:数据量多的job,例如:Topic_log分流的job可以给8G
TaskManager: jobs with large data volumes, e.g. the topic_log splitting job, can get 8 GB

数据量少的job,例如:Topic_db分流的job可以给4G
Jobs with less data, e.g. the topic_db splitting job, can get 4 GB

实时:并行度与Kafka分区一致,CPU与Slot比 1:3
Real-time: parallelism consistent with Kafka partition, CPU to Slot ratio 1:3

20M/s -> 3个分区 -> CPU与Slot比 1:3 -> 3个Slot -> Core数1-> CPU与内存比 1:4 -> TM 1 slot -> TM 4G资源
20 MB/s -> 3 partitions -> CPU to Slot ratio 1:3 -> 3 Slots -> 1 Core -> CPU to memory ratio 1:4 -> TM 1 slot -> TM 4G resources

JobManager 2G内存 1CPU

平均 一个Flink作业6G内存,2Core
Average Flink job 6 GB memory, 2 Core

3.9.3 Flink任务并行度如何设置
3.9.3 How to set Flink task parallelism

全局并行度设置和kafka分区数保持一致为5,Keyby后计算偏大的算子,单独指定。
The global parallelism is kept consistent with the number of Kafka partitions (5); operators that become heavy after keyBy have their parallelism set individually.

3.9.4 项目中Flink作业Checkpoint参数如何设置
3.9.4 How Do I Set the Checkpoint Parameters of a Flink Job in a Project?

Checkpoint间隔:作业多久触发一次Checkpoint,由job状态大小和恢复调整,一般建议3~5分钟,时效性要求高的可以设置s级别。
Checkpoint interval: how often the job triggers a checkpoint, tuned by state size and recovery time; 3-5 minutes is generally recommended, and second-level intervals can be used when timeliness requirements are high.

Checkpoint超时:限制Checkpoint的执行时间,超过此时间,Checkpoint被丢弃,建议10分钟。
Checkpoint timeout: limits how long a checkpoint may run; beyond this it is discarded. 10 minutes is recommended.

Checkpoint最小间隔:避免Checkpoint过于频繁,可以设置分钟级别。
Checkpoint Minimum Interval: To avoid checkpoints being too frequent, you can set the minute level.

Checkpoint的执行模式:Exactly_once或At_least_once,选择Exactly_once

Checkpoint的存储后端:一般存储HDFS
Checkpoint storage backend: usually HDFS (settings sketch below).
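The same parameters expressed as Flink SQL client settings (the values and the HDFS path are illustrative; the DataStream API sets them via env.enableCheckpointing and CheckpointConfig):

SET 'execution.checkpointing.interval' = '3 min';
SET 'execution.checkpointing.timeout' = '10 min';
SET 'execution.checkpointing.min-pause' = '1 min';
SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
SET 'state.checkpoints.dir' = 'hdfs://mycluster/flink/checkpoints';
SET 'execution.checkpointing.externalized-checkpoint-retention' = 'RETAIN_ON_CANCELLATION';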

3.9.5 迟到数据如何解决
3.9.5 How to solve late data

(1)设置乱序时间
(1) Set the disorder time

(2)窗口允许迟到时间
(2) Allowable window delay time

(3)侧输出流
(3) Side output flow

生产中侧输出流,需要Flink单独处理,在写入Clickhouse,通过接口再次计算
In production the side-output (late) stream is handled separately by Flink, written to ClickHouse, and recomputed through the query interface.

3.9.6 实时数仓延迟多少
3.9.6 How much latency does the real-time data warehouse have

反压,状态大小,资源偏少,机器性能,checkpoint时间都会影响数仓延迟。
Backpressure, state size, insufficient resources, machine performance, and checkpoint time all affect warehouse latency.

一般影响最大就是窗口大小,一般是5s。
Generally, the largest impact is the window size, usually 5s.

如果启用两阶段提交写入Kafka,下游设置读已提交,那么需要加上CheckPoint间隔时间。
If two-phase commit write Kafka is enabled and downstream settings read committed, then CheckPoint interval time needs to be added.

3.9.7 项目开发多久,维护了多久
3.9.7 How long has the project been developed and maintained?

开发周期半年,维护半年多
Development cycle half a year, maintenance more than half a year

3.9.8 如何处理缓存冷启动问题
3.9.8 How to handle cache cold start issues

初次启动,Redis没有缓存数据,大量读请求访问Habse,类似于缓存雪崩
On first startup Redis has no cached data, so a flood of read requests hits HBase, similar to a cache avalanche.

从离线统计热门维度数据,最近三天用户购买,活跃的sku,手动插入Redis。
Use offline statistics to find hot dimension data (users who bought in the last three days, active SKUs) and manually preload it into Redis.

3.9.9 如何处理动态分流冷启动问题(主流数据先到,丢失数据怎么处理)
3.9.9 How to handle the dynamic-routing cold-start problem (main-stream data arrives before the config data; how to avoid losing it)

在Open方法中预加载配置信息到HashMap以防止配置信息后到。
Preload the configuration table into a HashMap in the open() method so that main-stream records are not lost when the broadcast config arrives late.

3.9.10 代码升级,修改代码,如何上线
3.9.10 Code upgrade, code modification, how to go online

Savepoint停止程序,通过Savepoint恢复程序。
Savepoint stops the program and resumes the program via Savepoint.

代码改动较大,savepoint恢复不了怎么办,看历史数据要不要,要从头跑,不要就不适用savepoint恢复直接提交运行。
If the code change is too large for the savepoint to restore, decide whether the historical data is needed: if yes, rerun from scratch; if not, skip the savepoint and submit the job directly.

3.9.11 如果现在做了5个Checkpoint,Flink Job挂掉之后想恢复到第三次Checkpoint保存的状态上,如何操作
3.9.11 If 5 Checkpoints have been made now, how to restore the state saved by the third Checkpoint after Flink Job is suspended?

在Flink中,我们可以通过设置externalized-checkpoint来启用外部化检查点,要从特定的检查点(例如第三个检查点)恢复作业,我们需要手动指定要从哪个检查点(需要指定到chk-xx目录)恢复。
In Flink, we can enable externalized checkpoints by setting externalized-checkpoint, and to recover jobs from a specific checkpoint (e.g., the third checkpoint), we need to manually specify which checkpoint to recover from (we need to specify to the chk-xx directory).

3.9.12 需要使用flink记录一群人,从北京出发到上海,记录出发时间和到达时间,同时要显示每个人用时多久,需要实时显示,如果让你来做,你怎么设计?
3.9.12 Need to use flink to record a group of people, from Beijing to Shanghai, record departure time and arrival time, at the same time to show how long each person takes, need real-time display, if let you do, how do you design?

按照每个人KeyBy,将出发时间存入状态,当到达时使用到达时间减去出发时间。
KeyBy each person, store the departure time in keyed state, and on arrival emit the arrival time minus the departure time (a SQL sketch follows).
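A Flink SQL rendering of the same idea as a continuously updating result (the DataStream version keeps the departure time in keyed ValueState; the table and column names are hypothetical):

select person_id,
       max(case when event_type = 'depart' then event_ts end) as depart_ts,
       max(case when event_type = 'arrive' then event_ts end) as arrive_ts,
       timestampdiff(minute,
         max(case when event_type = 'depart' then event_ts end),
         max(case when event_type = 'arrive' then event_ts end)) as minutes_used
from trip_events
group by person_id;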

3.9.13 flink内部的数据质量和数据的时效怎么把控的
3.9.13 How to control the quality and timeliness of data inside flink

内部的数据质量:内部一致性检查点
Internal data quality: internal consistency checkpoints

数据的时效:结合3.9.5回答
Data timeliness: answer in conjunction with 3.9.5

3.9.14 实时任务问题(延迟)怎么排查
3.9.14 How to troubleshoot real-time task problems (delays)

实时任务出现延迟时,可以从以下几个方面进行排查:
When real-time tasks are delayed, they can be investigated from the following aspects:

(1)监控指标:看是否反压
(1) Monitoring indicators: see if there is back pressure

(2)日志信息:查看任务运行时的日志信息,定位潜在的问题和异常情况。例如,网络波动、硬件故障、不当的配置等等。
(2) Log information: View log information when the task is running to locate potential problems and exceptions. For example, network fluctuations, hardware failures, improper configuration, etc.

3)外部事件:如果延迟出现在大量的外部事件后,则可能需要考虑其他因素(如外部系统故障、网络波动等)。框架混部,资源争抢!
(3) External events: if the delay follows a burst of external events, consider other factors (external system failures, network fluctuations, etc.), for example co-located frameworks competing for resources.

3.9.15 维度数据查询并发量
3.9.15 Dimension Data Query Concurrent Volume

未做优化之前,有几千QPS,做完Redis的缓存优化,下降到几十
Before optimization, there are thousands of QPS, after Redis cache optimization, down to dozens

3.9.16 Prometheus+Grafana是自己搭的吗,监控哪些指标
3.9.16 Is Prometheus+Grafana built by itself? What indicators are monitored?

是我们自己搭建的,用来监控Flink任务和集群的相关指标
We built it ourselves to monitor Flink tasks and cluster metrics.

1)TaskManager Metrics:这些指标提供有关TaskManager的信息,例如CPU使用率、内存使用率、网络IO等。
TaskManager Metrics: These metrics provide information about TaskManager, such as CPU usage, memory usage, network IO, and more.

2)Task Metrics:这些指标提供有关任务的信息,例如任务的延迟时间、记录丢失数、输入输出速率等。
2) Task Metrics: These metrics provide information about the task, such as the delay time of the task, the number of records lost, the input and output rate, etc.

3)Checkpoint Metrics:这些指标提供有关检查点的信息,例如检查点的持续时间、成功/失败的检查点数量、检查点大小等。
Checkpoint Metrics: These metrics provide information about checkpoints, such as checkpoint duration, number of successful/failed checkpoints, checkpoint size, etc.

4)Operator Metrics:这些指标提供有关Flink操作符的信息,例如操作符的输入/输出记录数、处理时间、缓存大小等。
4) Operator Metrics: These metrics provide information about the Flink operator, such as the number of input/output records for the operator, processing time, cache size, etc.

3.9.17 怎样在不停止任务的情况下改flink参数
3.9.17 How to change the flink parameter without stopping the task

动态分流,其他没有做过!
Only through the dynamic-routing configuration table; we have not done anything else.

3.9.18 hbase中有表,里面的1月份到3月份的数据我不要了,我需要删除它(彻底删除),要怎么做
3.9.18 There is a table in hbase. I don't want the data from January to March in it. I need to delete it (delete it completely). How do I do it?

在HBase中彻底删除表中的数据,需要执行以下步骤:
To completely delete the data in the table in HBase, you need to perform the following steps:

(1)禁用表
(1) Disable the table

(2)创建一个新表
(2) Create a new table

(3)复制需要保留的数据,将需要保留的数据从旧表复制到新表。
(3) Copy the data that needs to be preserved, copying the data that needs to be preserved from the old table to the new table.

(4)删除旧表
(4) Delete the old table

(5)重命名新表
(5) Renaming the new table

在执行这些步骤之前,建议先进行数据备份以防止意外数据丢失。此外,如果旧表中的数据量非常大,复制数据到新表中的过程可能会需要很长时间。
Before performing these steps, it is recommended that you backup your data to prevent accidental data loss. In addition, if the amount of data in the old table is very large, the process of copying data into the new table can take a long time.

3.9.19 如果flink程序的数据倾斜是偶然出现的,可能白天可能晚上突然出现,然后几个月都没有出现,没办法复现,怎么解决?
3.9.19 If the data skew of flink program appears accidentally, it may suddenly appear during the day or at night, and then it does not appear for several months. There is no way to reproduce it. How to solve it?

Flink本身存在反压机制,短时间的数据倾斜问题可以自身消化掉,所以针对于这种偶然性数据倾斜,不做处理。
Flink has a built-in backpressure mechanism, and short-lived data skew is absorbed by it, so no special handling is done for this kind of occasional skew.

3.9.20 维度数据改变之后,如何保证新join的维度数据是正确的数据
3.9.20 After dimension data changes, how to ensure that newly joined dimension data is correct

(1)我们采用的是低延迟增量更新,本身就有延迟,没办法保证完全的正确数据。
(1) We are using low-latency incremental updates, which are inherently delayed and cannot guarantee complete correct data.

(2)如果必须要正确结果,只能直接读取MySQL数据,但是需要考虑并发,MySQL机器性能。
(2) If you must get the correct result, you can only read MySQL data directly, but you need to consider concurrency and MySQL machine performance.

3.10 实时---业务
3.10 Real-time--business

3.10.1 数据采集到ODS层
3.10.1 Data Acquisition to ODS Layer

1)前端埋点的行为数据为什么又采集一份?
1) Why is a separate copy of the front-end tracking behavior data collected?

时效性

Kafka保存3天,磁盘够:原来1T,现在2T,没压力
Kafka storage for 3 days, disk enough: original 1T, now 2T, no pressure

2)为什么选择Kafka?
2) Why choose Kafka?

实时写、实时读
Real-time writing, real-time reading

=》 消息队列适合,其他数据库受不了
=> A message queue fits; other databases cannot handle this load

3)为什么用Maxwell?历史数据同步怎么保证一致性?
3) Why Maxwell? How does historical data synchronization ensure consistency?

FlinkCDC在20年7月才发布
FlinkCDC was released in July of '20.

Canal与Maxwell区别:

Maxwell支持同步历史数据
Maxwell supports synchronization of historical data

Maxwell支持断点还原(存在元数据库)
Maxwell supports resuming from a breakpoint (positions are stored in its metadata database)

数据格式更轻量
Lighter data format

保证至少一次,不丢
Guarantees at-least-once delivery, so no data is lost.

4)Kafka保存多久?如果需要以前的数据怎么办?
4) How long does Kafka last? What if you need previous data?

跟离线项目保持一致:3天
Consistent with offline projects: 3 days

我们的项目不需要,如果需要的话可以去数据库或Hive现查,ClickHouse也有历史的宽表数据。
Our project doesn't need it, if you need it, you can go to the database or Hive. ClickHouse also has historical wide table data.

3.10.2 ODS层

1)存储原始数据
1) Storage of raw data

2个topic:埋点的行为数据 ods_base_log、业务数据 ods_base_db
2 topics: behavior data ods_base_log of buried points, business data ods_base_db

2)业务数据的有序性:
2) Order of business data:

maxwell配置,指定生产者分区的key为 table。
maxwell configuration, specifying that the producer partition key is table.

3.10.3 DWD+DIM

1)存储位置,为什么维度表存HBase?
1) Storage location, why dimension tables store HBase?

事实表存Kafka、维度表存HBase
Fact tables are stored in Kafka; dimension tables are stored in HBase

基于热存储加载维表的Join方案:
Join scheme based on hot-storage loaded dimension table:

随机查

长远考虑
long-term consideration

适合实时读写
Suitable for real-time reading and writing

2)埋点行为数据分流
2) Buried point behavior data diversion

(1)修复新老访客(选择性):以前是前端识别新老访客,不够准确
(1) Fix the new/returning visitor flag (optional): previously the front end identified new vs. returning visitors, which was not accurate enough.

(2)分流:侧输出流
(2) Split flow: side output flow

分了3个topic: 启动、页面、曝光
There are three topics: launch, page, exposure

(3)用户跳出、独立访客统计
(3) User pop-up, independent visitor statistics

3)业务数据处理
3) Business data processing

(1)动态分流:FlinkSQL读取topic_base_db数据,过滤出每张明细表写回kafka
(1) Dynamic distribution: FlinkSQL reads topic_base_db data, filters out each schedule and writes it back to kafka.

(2)订单预处理表设计:双流join,使用leftjoin
(2) Order preprocessing table design: double-stream join, left-join

(3)字典表维度退化
(3) Degeneration of dictionary table dimension

4)维度数据写入Hbase
4) Dimension data written to Hbase

(1)为了避免维度数据发生变化而重启任务,在mysql存一张配置表来动态配置。
(1) In order to avoid restarting tasks due to dimensional data changes, a configuration table is stored in mysql for dynamic configuration.

动态实现:通过广播状态
Dynamic implementation: by broadcasting status

=》 读取一张配置表 ===》 维护这张配置表
=> Read a configuration table ==> Maintain this configuration table

source来源 sink写到哪 操作类型 字段 主键 扩展
Columns: source table, sink target, operation type, fields, primary key, extension

=》实时获取配置表的变化 ==》CDC工具
=》Real-time access to configuration table changes ==》CDC tool

=》 FlinkCDC

=》 使用了sql的方式,去同步这张配置表
=》Use sql to synchronize this configuration table

=》sql的数据格式比较方便
=》SQL data format is more convenient

(2)怎么写HBase:借助phoenix
(2) How to write HBase: with the help of phoenix

没有做维度退化
No dimensional degradation is done

维表数据量小、变化频率慢
The data volume of the dimension table is small and the frequency of change is slow

(3)Hbase的rowkey怎么设计的?有没有数据热点问题?
(3) How is HBase's rowkey designed? Are there any data hotspots?

最大的维表:用户维表
Largest dimension table: User dimension table

=》百万日活,2000万注册用户为例,1条平均1k:2000万*1k=约20G
= 》Millions of daily active users, 20 million registered users as an example, 1 average 1k: 20 million * 1k = about 20G

使用Phoenix创建的盐表,避免数据热点问题
Use a salted table created via Phoenix to avoid data hotspots (a DDL sketch follows the reference link below).

https://developer.aliyun.com/article/532313
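A hedged Phoenix DDL sketch (the schema, table and columns are hypothetical; SALT_BUCKETS is the Phoenix option that prefixes rowkeys with a salt byte and pre-splits the table):

CREATE TABLE GMALL_REALTIME.DIM_USER_INFO (
  ID       VARCHAR PRIMARY KEY,
  NAME     VARCHAR,
  BIRTHDAY VARCHAR
) SALT_BUCKETS = 16;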

3.10.4 DWS层

1)为什么选择ClickHouse
1) Why ClickHouse

(1)适合大宽表、数据量多、聚合统计分析 =》 快
(1) Suitable for wide tables, large data volumes, and aggregate statistical analysis => fast

(2)宽表已经不再需要Join,很合适
(2) Wide tables no longer require Join, which is very suitable

2)关联维度数据
2) Correlation dimension data

(1)维度关联方案:预加载、读取外部数据库、双流Join、LookupJoin
(1) Dimension association scheme: preloading, reading external databases, dual-stream joining, and LookupJoining

(2)项目中读取Hbase中维度数据
(2) Read the dimension data in HBase in the project

(3)优化1:异步IO
(3) Optimization 1: Asynchronous I/O

异步查询实际上是把维表的查询操作托管给单独的线程池完成,这样不会因为某一个查询造成阻塞,单个并行可以连续发送多个请求,提高并发效率。
An asynchronous query actually hosts the query operation of the dimension table to a separate thread pool, so that it will not be blocked by a single query, and a single parallel can send multiple requests in a row, improving concurrency efficiency.

这种方式特别针对涉及网络IO的操作,减少因为请求等待带来的消耗。
This method is especially aimed at operations involving network I/O to reduce the cost of request waiting.

Flink在1.2中引入了Async I/O,在异步模式下,将IO操作异步化,单个并行可以连续发送多个请求,哪个请求先返回就先处理,从而在连续的请求间不需要阻塞式等待,大大提高了流处理效率。
In Flink 1.2, Async I/O is introduced, which asynchronously makes I/O operations asynchronous, so that a single parallel can send multiple requests in a row, and whichever request is returned first is processed first, so that there is no need for blocking waiting between consecutive requests, which greatly improves the efficiency of stream processing.

Async I/O 是阿里巴巴贡献给社区的一个呼声非常高的特性,解决与外部系统交互时网络延迟成为了系统瓶颈的问题。
Async I/O is a highly requested feature that Alibaba has contributed to the community, solving the problem of network latency becoming a bottleneck when interacting with external systems.

(4)优化2:旁路缓存
(4) Optimization 2: Bypass cache

旁路缓存模式是一种非常常见的按需分配缓存的模式。如图,任何请求优先访问缓存,缓存命中,直接获得数据返回请求。如果未命中则,查询数据库,同时把结果写入缓存以备后续请求使用。
The Cache Bypass pattern is a very common pattern in which caches are allocated on demand. As shown in the figure, any request preferentially accesses the cache, the cache hits, and the data return request is directly obtained. If it misses, the database is queried and the results are cached for subsequent requests.

(5)怎么保证缓存一致性
(5) How to ensure cache consistency

方案1:当我们获取到维表更新的数据,也就是拿到维度表操作类型为update时:
Solution 1: When we get the updated data of the dimension table, that is, when the operation type of the dimension table is update:

更新Hbase的同时,删除redis里对应的之前缓存的数据
When updating HBase, delete the previously cached data in Redis

Redis设置了过期时间:24小时
Redis sets an expiration time of 24 hours

方案2:双写
Scenario 2: Dual-write

3)轻度聚合
3) Light aggregation

(1)DWS层要应对很多实时查询,如果是完全的明细那么查询的压力是非常大的。将更多的实时数据以主题的方式组合起来便于管理,同时也能减少维度查询的次数。
(1) The DWS layer has to deal with a lot of real-time queries, and if it is a complete detail, the query pressure is very large. Combining more real-time data in a thematic way makes it easier to manage and reduces the number of dimension queries.

(2)开一个小窗口,5s的滚动窗口
(2) Open a small window, a 5s scrolling window

(3)同时减轻了写ClickHouse的压力,减少后续聚合的时间
(3) At the same time, it reduces the pressure of writing ClickHouse and reduces the time for subsequent aggregation

(4)几张表? 表名、字段
(4) How many tables? Table name and field

访客、商品、地区、关键词
Visitors, products, regions, keywords

3.10.5 ADS层

1)实现方案
1) Implement the solution

为可视化大屏服务,提供一个数据接口用来查询ClickHouse中的数据。
Provides a data API for querying data in ClickHouse for the large visualization screen service.

2)怎么保证ClickHouse的一致性?
2) How to ensure the consistency of ClickHouse?

ReplacingMergeTree只能保证最终一致性,查询时的sql语法加上去重逻辑。
ReplacingMergeTree only guarantees eventual consistency, so deduplication logic is added to the query-time SQL (see the sketch below).
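Two common query-time deduplication patterns for ReplacingMergeTree (table and column names are hypothetical):

-- force merge semantics at query time
SELECT * FROM dws_trade_province_order_window FINAL;

-- or keep only the latest version per key explicitly
SELECT province_id,
       argMax(order_amount, ts) AS order_amount
FROM dws_trade_province_order_window
GROUP BY province_id;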

3)Flink任务如何监控
3) How Flink tasks are monitored

Flink和ClickHouse都使用了Prometheus + Grafana

第4章 数据考评平台项目
Chapter 4: The Data Evaluation Platform Project

4.1项目背景
4.1 Project Background

4.1.1 为什么做数据治理
4.1.1 Why do we do data governance?

随着大数据技术普及,越来越多企业搭建数据仓库,但是由于企业多数据源复杂情况,指标前后期口径不同,开发规范不统一,企业人员流动等原因,导致数据仓库:
With the popularization of big data technology, more and more enterprises are building data warehouses, but due to the complexity of multiple data sources in enterprises, different calibers before and after indicators, inconsistent development specifications, and enterprise personnel flow, data warehouses are as follows:

数据计算不准确甚至错误,导致决策不可靠;各部门数据无法有效整合导致数据孤岛;数据缺乏保护措施,增加数据风险;规范不统一,在传输、存储、计算中增加理解难度,降低使用率; 数据难以创造价值,降低用户体验感等。
Inaccurate or even erroneous data calculations, resulting in unreliable decision-making; The data of various departments cannot be effectively integrated, resulting in data silos; Lack of data protection measures, increasing data risks; The specifications are not uniform, which increases the difficulty of understanding and reduces the utilization rate in transmission, storage, and computing. It is difficult for data to create value, and the user experience is reduced.

基于这种背景很多公司需要做数据治理。
Based on this background, many companies need to do data governance.

4.1.2 数据治理概念
4.1.2 Data Governance Concepts

数据治理是系统化方法,企业中能够提高数据质量,一致性,安全性和完整性的手段。
Data governance is a systematic approach to improving data quality, consistency, security, and integrity in an enterprise.

设计策略、流程、技术、工具。
Design strategies, processes, technologies, tools.

数据治理企业中落地一般依靠数据中台,数据中台是“一站式”的数据处理和治理平台,一般包含:数据接入集成、清洗转换、存储和管理、质量管理、元数据管理与血缘管理、安全、可视化等功能。
In enterprises, data governance usually lands on a data middle platform, a "one-stop" data processing and governance platform that typically includes data ingestion and integration, cleaning and transformation, storage and management, quality management, metadata and lineage management, security, visualization, and other functions.

4.1.3 数据治理考评平台做的是什么
4.1.3 What does the data governance evaluation platform do?

数据考评平台是更轻量级的Web平台,从规范、存储、计算、安全、质量角度对数据仓库的每张表进行打分,就像电脑的健康大师每日对数据仓库进行扫描,找出不符合规范的表格,进行调整,进而提升数据仓库的质量。
The data evaluation platform is a lighter-weight web platform that scores every table in the data warehouse from the perspectives of standards, storage, compute, security, and quality, much like a PC "health check" tool that scans the warehouse every day, finds tables that violate the standards, and drives adjustments to improve warehouse quality.

4.1.4 考评指标
4.1.4 Evaluation indicators

规范:表名是否规范、是否具有表备注、字段备注、是否有负责人
Specification: Whether the table name is standardized, whether it has table comments, field comments, and whether there is a responsible person

存储:存储指定生命周期、相似表、空表
Storage: whether a lifecycle is configured, similar (duplicate) tables, empty tables

计算:计算无产出、无访问、计算报错、简单加工
Compute: jobs with no output, tables with no access, job errors, trivial processing

安全:安全等级、目录文件访问权限
Security: security level, directory file access

质量:指标计算时长超过波动、指标计算结果超过波动
Quality: metric computation time fluctuates beyond the threshold, metric results fluctuate beyond the threshold

4.2 技术架构
4.2 Technical Architecture

4.3 项目实现了哪些功能
4.3 What functions does the project implement?

4.3.1 元数据的加载与处理及各表数据的页面接口
4.3.1 Loading and processing of metadata and page interface of each table data

(1)使用hiveClient对象调用getTable方法获取table表对象,进而获取元数据,写入到Mysql中
(1) Use hiveClient object to call getTable method to obtain table object, and then obtain metadata, and write it to Mysql.

(2)利用FileSystem类创建hdfs客户端对象,读取对应路径的元数据
(2) Create hdfs client object with FileSystem class and read metadata of corresponding path

(3)实现接口,在Web页面,人工补录部分表格信息,持久化到Mysql
(3) Implement the interface, manually supplement some table information on the Web page, and persist it to Mysql.

(4)表查询列表 、单表信息、辅助信息接口实现
(4) Interface implementation of table query list, single table information and auxiliary information

基本的增删改查
Basic additions, deletions, corrections, and searches

4.3.2 数据治理考评链路(**核心**)
4.3.2 Data Governance Assessment Pipeline (**Core**)

Mysql中:元数据表,权重表,指标类型表 -》 治理考评明细
In MySQL: metadata table, weight table, metric type table => governance assessment detail table

(1)通过Mybatis-plus的代码生成工具,以数据库表为基础生成最基础(bean,控制层,服务层,数据层)的代码。
(1) Generate the most basic (bean, control layer, service layer, data layer) code based on database tables through Mybatis-plus code generation tools.

(2)取得所有待考评表的列表 (List<TableMetaInfo>)
(2) Get a list of all tables to be evaluated (List<TableMetaInfo>)

(3)取得所有待考评的指标项列表(List<GovernanceMetric>)
(3) Obtain a list of all indicators to be evaluated (List<GovernanceMetric>)

(4)两层循环迭代两个列表,取得每个指标项的考评器,把考评器需要的参数传递给考评器,考评打分,得到各个表各个指标项的评分结果(List<GovernanceAssessDetail>)
(4) Iterate over the two lists in a nested loop; for each metric, obtain its assessor, pass it the parameters it needs, run the assessment and scoring, and collect the per-table, per-metric results (List<GovernanceAssessDetail>).

(5)把考评结果保存到mysql
(5) Save evaluation results to MySQL

4.3.3 数据治理考评结果核算
4.3.3 Accounting of data governance evaluation results

主要的计算方式就是利用sql group by 进行计算。
The main calculation method is to use SQL group by calculation.

计算核算到表时,要考虑考评指标的治理类型,不同治理类型对应不同的权重。要把分数乘以权重计算该治理类型的分数。
When rolling scores up to the table level, the governance type of each metric must be considered; different governance types have different weights, and each score is multiplied by its weight to get the score for that governance type.

(1)计算每张表的考评分
(1) Calculate the evaluation score of each table

(2)计算每个技术负责人的考评分
(2) Calculate the score of each technical owner

(3)计算全局的考评分
(3) Calculate the global evaluation score

(4)考评任务串联起来统一调度
(4) The evaluation tasks are connected in series and dispatched uniformly.

(5)利用springtask定时调度任务计算
(5) Use Spring Task to schedule the calculation periodically

优点:简单易用配置少。
Advantages: simple and easy to use, with little configuration.

缺点:不能做分布式调度。
Disadvantages: Distributed scheduling is not possible.

springboot启动程序上增加注解 @EnableScheduling
Add the @EnableScheduling annotation to the Spring Boot application (startup) class

方法中@Scheduled(参数)
Annotate the scheduled method with @Scheduled(parameters)
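A minimal sketch of the Spring Task setup described above, assuming a Spring Boot project; the class names and the cron expression are illustrative, not the project's actual code.

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@SpringBootApplication
@EnableScheduling          // enable Spring Task scheduling on the application class
public class AssessApplication {
    public static void main(String[] args) {
        SpringApplication.run(AssessApplication.class, args);
    }
}

@Component
class AssessJob {
    // Illustrative cron: run the daily governance assessment chain at 02:00.
    @Scheduled(cron = "0 0 2 * * ?")
    public void dailyAssess() {
        // call the assessment + accounting pipeline here
        System.out.println("run governance assessment");
    }
}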

4.3.4 可视化治理考评提供数据接口
4.3.4 Visual governance evaluation provides data interfaces

实现接口:各个治理类型问题个数、分组人员排行榜、手动触发评估
Implementation interfaces: the number of problems of each governance type, the ranking of grouped personnel, and the manual triggering of evaluation

4.4 项目中的问题/及优化
4.4 Problems encountered in the project and optimizations

4.4.1 计算hdfs路径数据量大小、最后修改访问时间
4.4.1 Calculate the size of the HDFS path data and the last modified access time

利用递归实现
Implemented with recursion over the directory tree.
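A sketch of the recursive calculation based on the standard Hadoop FileSystem API; the path in main is illustrative. (Hadoop also offers fs.getContentSummary() for total size, but the recursive version additionally tracks the latest modification time.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPathStat {
    // Returns {total size in bytes, latest modification time} under the given path.
    public static long[] sizeAndLastModified(FileSystem fs, Path path) throws Exception {
        long size = 0L;
        long lastModified = 0L;
        for (FileStatus status : fs.listStatus(path)) {
            if (status.isDirectory()) {
                long[] child = sizeAndLastModified(fs, status.getPath()); // recurse into subdirectory
                size += child[0];
                lastModified = Math.max(lastModified, child[1]);
            } else {
                size += status.getLen();
                lastModified = Math.max(lastModified, status.getModificationTime());
            }
        }
        return new long[]{size, lastModified};
    }

    public static void main(String[] args) throws Exception {
        // Illustrative path; in the project this comes from the table location in the metadata.
        FileSystem fs = FileSystem.get(new Configuration());
        long[] stat = sizeAndLastModified(fs, new Path("/warehouse/ods/ods_example"));
        System.out.println("bytes=" + stat[0] + ", lastModified=" + stat[1]);
    }
}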

4.4.2 考评器作用是什么?
4.4.2 What is the purpose of the evaluator?

模板模式设计:一个抽象类公开定义了执行它的方法的方式/模板。它的子类可以按需要重写方法实现,但调用将以抽象类中定义的方式进行。
Template method pattern: an abstract class defines the skeleton (template) of an operation; subclasses override specific steps as needed, but the overall flow is always executed the way the abstract class defines it.

特点:由父抽象类负责控制调用,由不同子类负责核心功能。
Features: The parent abstract class is responsible for controlling the call, and the different subclasses are responsible for the core function.

好处:符合开闭原则,即对修改封闭,对扩展开发。代码责任清洗易于维护。
Benefits: follows the open/closed principle, i.e. closed to modification and open to extension. Responsibilities in the code are clearly separated, making it easy to maintain.
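A minimal sketch of this template-method idea applied to the assessor: the abstract parent controls the flow, and each subclass only implements its own check. Class names, fields, and the regular expression below are hypothetical illustrations, not the project's actual code.

// Hypothetical assessor skeleton illustrating the template method pattern.
abstract class GovernanceAssessor {

    // Template method: fixed overall flow, declared final so subclasses cannot change it.
    public final AssessResult assess(TableMeta table) {
        AssessResult result = new AssessResult(table.name, 10.0, "");
        checkAndScore(table, result);   // the only step that varies per metric
        return result;
    }

    // Each concrete metric overrides only this step.
    protected abstract void checkAndScore(TableMeta table, AssessResult result);
}

// Example subclass: table-name compliance checked with a regular expression (pattern is illustrative).
class TableNameAssessor extends GovernanceAssessor {
    @Override
    protected void checkAndScore(TableMeta table, AssessResult result) {
        if (!table.name.matches("^(ods|dwd|dws|ads|dim)_[a-z0-9_]+$")) {
            result.score = 0.0;
            result.problem = "table name does not follow the naming convention";
        }
    }
}

class TableMeta {
    String name;
    TableMeta(String name) { this.name = name; }
}

class AssessResult {
    String tableName; double score; String problem;
    AssessResult(String tableName, double score, String problem) {
        this.tableName = tableName; this.score = score; this.problem = problem;
    }
}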

4.4.3 稍微难度考评器实现思路
4.4.3 Slightly difficult evaluator implementation ideas

表名是否合规:利用正则匹配判断表名
Whether the table name is compliant: Use regular matching to determine the table name

生命周期是否合理:
Is the life cycle reasonable:

是否相似表:
Similarity table:

表产出数据量监控:
Table Output Data Volume Monitoring:

目录文件数据访问超过权限建议值:
Directory/file access permissions exceed the recommended values:

是否简单加工:sql解析语法树,定义节点处理器
Trivial processing: parse the SQL into a syntax tree and define node processors

1)是否有复杂处理逻辑,查询了哪些表,过滤哪些字段
1) Whether there is complex processing logic, which tables are queried, and which fields are filtered

4.4.4 利用多线程优化考评计算
4.4.4 Using multithreading to optimize evaluation calculations

使用线程池+ CompletableFuture异步执行考评指标计算,提升效率
Use a thread pool plus CompletableFuture to run the metric assessments asynchronously and improve efficiency.
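A sketch of running the per-table assessments in parallel with a thread pool and CompletableFuture; the table names are illustrative and the assessor call is represented by a placeholder method.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelAssessDemo {
    public static void main(String[] args) {
        List<String> tables = List.of("ods_order_info", "dwd_trade_order_detail"); // illustrative
        ExecutorService pool = Executors.newFixedThreadPool(8);

        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String table : tables) {
            // Submit each table's assessment asynchronously to the pool.
            futures.add(CompletableFuture.supplyAsync(() -> assess(table), pool));
        }

        // Wait for all assessments, then collect the results (e.g., to batch-insert into MySQL).
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        futures.forEach(f -> System.out.println(f.join()));
        pool.shutdown();
    }

    // Placeholder for "run all assessors for one table".
    private static String assess(String table) {
        return table + " -> scored";
    }
}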

4.4.5 实现过哪些指标
4.4.5 What indicators have been achieved

简单指标:是否有业务owner、是否有表备注、是否缺失字段备注、是否空表、是否设置安全级别、长期未被访问、长期未产出表
Simple indicators: whether there is a business owner, whether there is a table comment, whether there is a missing field comment, whether the table is empty, whether the security level is set, whether the table has not been accessed for a long time, and whether the table has not been produced for a long time.

DS相关指标:当日任务报错、表产出时效监控、是否简单加工、sql中包含select*、数据倾斜检查
DS-related indicators: same-day task errors, table output timeliness monitoring, trivial processing, SELECT * in SQL, data skew check

第4章 用户画像项目
Chapter 4: The User Portrait Project

4.1 画像系统主要做了哪些事
4.1 What does the portrait system mainly do?

1)用户信息标签化
1) User information labeling

2)对标签化的数据的应用(分群、洞察分析)
2) Application of tagged data (clustering, insight analysis)

3)标签如何建模的,有哪些标签
3) How labels are modeled and what labels are there

根据用户需求,协调产品经理一起规划了四级标签。前两级是分类,第三级是标签,第四级是标签值。
Based on user requirements, we worked with the product manager to plan a four-level label hierarchy: the first two levels are categories, the third level is the label, and the fourth level is the label value.

4.2 项目整体架构
4.2 Overall project structure

4.3 讲一下标签计算的调度过程
4.3 Let's talk about the scheduling process of label calculation.

4.4 整个标签的处理过程
4.4 The overall label processing flow

四个任务:
Four tasks:

(1)通过根据每个标签的业务逻辑编写SQL,生产标签单表。
(1) Produce label sheet tables by writing SQL according to the business logic of each label.

(2)把标签单表合并为标签宽表。
(2) Merge label single tables into label wide tables.

(3)把标签宽表导出到Clickhouse中的标签宽表。
(3) Export the label wide table to the corresponding wide table in ClickHouse.

(4)把Clickhouse中的标签表转储为Bitmap表。
(4) Dump the label table in Clickhouse into a Bitmap table.

四个任务通过编写Spark程序完成。并通过画像平台调度,以后新增标签只需要在平台填写标签定义、SQL及相关参数即可。
The four tasks are implemented as Spark programs and scheduled through the portrait platform; to add a new label later, you only need to fill in the label definition, SQL, and related parameters on the platform.

4.5 你们的画像平台有哪些功能 ?
4.5 What are the functions of your portrait platform?

(1)标签定义
(1) Definition of labels

(2)标签任务设定
(2) Label task setting

(3)任务调度
(3) Task scheduling

(4)任务监控
(4) Task monitoring

(5)分群创建维护
(5) Cluster creation and maintenance

(6)人群洞察
(6) Crowd Insight

4.6 是否做过Web应用开发,实现了什么功能 
4.6 Have you done Web application development and what functions have been implemented?

(1)画像平台   分群
(1) Portrait platform group

(2)画像平台 其他功能(可选)
(2) Other functions of portrait platform (optional)

(3)实时数仓   数据接口 
(3) Real-time warehouse data interface

4.7 画像平台的上下游 
4.7 Upstream and downstream of the portrait platform

(1)上游:  数仓系统 
(1) Upstream: warehouse system

(2)下游:  写入到Redis中,由广告、运营系统访问。
(2) Downstream: written to Redis, accessed by advertising and operation systems.

4.8 BitMap原理,为什么可以提高性能
4.8 BitMap Principle and Why It Can Improve Performance

Bitmap是一个二进制集合,用0或1 标识某个值是否存在。
Bitmap is a binary set that identifies whether a value exists with a 0 or 1.

在求两个集合的交集运算时,不需要遍历两个集合,只要对位进行与运算即可。无论是比较次数的降低(从O(N^2) 到O(N) ,还是比较方式的改善(位运算),都给性能带来巨大的提升。
When computing the intersection of two sets, there is no need to traverse both sets; the corresponding bits are simply ANDed. Both the reduction in the number of comparisons (from O(N²) to O(N)) and the cheaper comparison operation itself (bitwise operations) bring a huge performance improvement.

业务场景:把每个标签的用户id集合放在一个Bitmap中,那多个标签求交集(比如:女性 + 90后)这种分群筛选时,就可以通过两个标签的Bitmap求交集运算即可。
Business scenario: Put the user ID set of each tag in a Bitmap, and when the multiple tags are intersected (for example, female + post-90s), you can use the Bitmap intersection operation of the two tags.
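A tiny illustration of the idea using java.util.BitSet as a stand-in for the bitmap (in production, ClickHouse bitmap functions or RoaringBitmap would typically be used): each label is a bitmap over user ids, and audience selection is one bitwise AND.

import java.util.BitSet;

public class BitmapDemo {
    public static void main(String[] args) {
        BitSet female = new BitSet();   // user ids tagged "female"
        BitSet post90s = new BitSet();  // user ids tagged "born after 1990"
        female.set(3); female.set(7); female.set(42);
        post90s.set(7); post90s.set(42); post90s.set(100);

        // Audience "female AND post-90s": one bitwise AND, no set traversal needed.
        BitSet audience = (BitSet) female.clone();
        audience.and(post90s);

        System.out.println("audience size = " + audience.cardinality()); // 2
        System.out.println("contains user 42: " + audience.get(42));     // true
    }
}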

第5章 数据湖项目
Chapter 5: Data Lake Project

5.1 数据湖与数据仓库对比
5.1 Data lakes vs. data warehouses

数据湖(Data Lake)是一个存储企业的各种各样原始数据的大型仓库,其中的数据可供存取、处理、分析及传输。
A data lake is a large warehouse that stores a wide variety of raw data for an enterprise, which can be accessed, processed, analyzed, and transferred.

Hudi、Iceberg、Delta Lake、Paimon

5.2 为什么做这个项目?解决了什么痛点?
5.2 Why did you do this project? What pain points have been solved?

(1)离线数仓痛点
(1) Pain points of offline data warehouses

时效性:T+1模式,时效性差
Timeliness: T+1 mode, poor timeliness

数据更新只能overwrite,耗费资源
Data updates can only be overwritten, which consumes resources

(2)实时数仓痛点
(2) Real-time data warehouse pain points

数据一致性问题:维护麻烦
Data consistency issues: Troublesome maintenance

历史数据修正:没有持久化明细数据,需要重跑,流程繁琐
Historical data correction: There is no persistent detailed data, which needs to be re-run, and the process is cumbersome

(3)传统的数仓发展方向
(3) The development direction of traditional data warehouses

流批一体:一套架构,一套代码,可以跑批也可以跑流
Unified stream and batch processing: one architecture and one codebase that can run both batch and streaming jobs

==》节省 资源 人力
==》Save resources and manpower

(4)Hudi数据湖的优势
(4) Advantages of Hudi Data Lake

将离线的时效性降低到了分钟级别(5~10分钟)
Reduced offline timeliness to minutes (5~10 minutes)

本身支持增量处理
Incremental processing is supported

数据更新支持upsert
UPSERT is supported for data updates

随着大数据技术发展趋势,公司对单一的数据湖和数据架构并不满意,想要去融合数据湖和数据仓库,构建在数据湖低成本的数据存储架构之上,又继承数据仓库的数据处理和管理功能。
With the development trend of big data technology, the company is not satisfied with a single data lake and data architecture, and wants to integrate data lake and data warehouse, build on the low-cost data storage architecture of the data lake, and inherit the data processing and management functions of the data warehouse.

5.3 项目架构
5.3 Project Structure

5.4 业务
5.4 Business

业务与实时数仓一致。
The business is the same as that of the real-time data warehouse.

5.5 优化or遇到的问题怎么解决
5.5 Optimizations and how problems encountered were solved

1)断点续传采集如何处理
1) How to handle resumable data collection

FlinkCDC分为全量和binlog,他们都是基于Flink state的能力,同步过程会将进度存储在state中,如果失败了,下一次会从state中恢复即可。
Flink CDC has a full-snapshot phase and a binlog phase, both built on Flink state: the synchronization progress is stored in state, so after a failure the job simply resumes from the last state.

2)写Hudi表数据倾斜问题
2) Write Hudi table data skew problem

FlinkCDC在全量阶段,读取完一张表后在读取下一张表,如果下游接了多个Sink,则只有一个Sink有数据写入。
In the full-snapshot phase, Flink CDC reads tables one after another, so if several sinks are attached downstream, only one sink receives data at a time.

使用多表混合读取方式解决。
Use the multi-table hybrid reading mode to solve the problem.

大状态:regular join + 无TTL
Large state: regular join with no TTL

rocksdb +增量
RocksDB state backend + incremental checkpoints

Hudi优化
Hudi optimizations

(1)MOR表,离线compaction(不跟写入过程绑定在一起)
(1) MOR table, offline compaction (not bound to the writing process)

(2)相关并发、内存
(2) Tune the related parallelism and memory settings

Compaction、write并发 =》4
Compaction and write parallelism => 4

内存:compaction =》1G
Memory: compaction => 1 GB

(3)大状态:rocksdb +增量
(3) Large state: rocksdb + increment

Hudi二期规划
Hudi Phase II Planning

(1)解决大状态问题=》不使用多留join,使用“部分列”更新方案
(1) Solve the large-state problem => avoid multi-stream joins and use a "partial column update" scheme instead

=》hudi<=0.12,官方没有提供,需要自定义payload实现类(大厂实现)
=> For Hudi <= 0.12 this is not provided out of the box; a custom payload implementation class is required (as done at large companies)

=》0.13.0,官方加入了 部分列更新的payload类
=> Since 0.13.0, an official partial-column-update payload class is available

(2)Dws层换成olap=》clickhouse,存储明细
(2) Replace the DWS layer with an OLAP engine (ClickHouse) that stores detail data

希望dws是一种明细,支持灵活的自助分析==》未来连实时项目也可以干掉
The goal is for DWS to keep detail-level data and support flexible self-service analysis => eventually even the separate real-time project could be retired.

6章 测试&上线流程
Chapter 6 Testing & Go-live Process

6.1 测试相关
6.1 Testing

6.1.1 公司有多少台测试服务器
6.1.1 How many test servers does the company have?

测试服务器一般三台。
There are generally three test servers.

6.1.2 测试服务器配置
6.1.2 Test Server Configuration?

有钱的公司和生产环境电脑配置一样。
Wealthy companies have the same computer configuration as the production environment.

一般公司测试环境的配置是生产的一半。
The configuration of the test environment in a typical company is half that of production.

6.1.3 测试数据哪来的?
6.1.3 Where does the test data come from?

一部分自己写Java程序自己造(更灵活),一部分从生产环境上取一部分(更真实)
Part of it is generated by Java programs we write ourselves (more flexible), and part is sampled from the production environment (more realistic).

6.1.4 如何保证写的SQL正确(重点)
6.1.4 How to Ensure the Correctness of Written SQL (Important)

先在MySQL的业务库里面把结果计算出来;在给你在ads层计算的结果进行比较;
First compute the expected results directly in the MySQL business database, then compare them with the results computed in the ADS layer.

需要造一些特定的测试数据,测试。
Create specific test data designed for the cases being tested.

从生产环境抓取一部分数据,数据有多少你是知道的,运算完毕应该符合你的预期。
Sample some data from production; since you know how much data went in, the computed results should match your expectations.

离线数据和实时数据分析的结果比较。(日活1万 实时10100),倾向取离线。
Compare the offline and real-time analysis results (for example, offline daily active users 10,000 vs. real-time 10,100); when they differ, the offline result is preferred.

算法异构
Algorithm heterogeneity (compute the same metric with two different implementations and compare)

实时数据质量监控(脚本、调度器、可视化、故障报警)
Real-time data quality monitoring (scripts, schedulers, visualizations, fault alarms)

6.1.5 测试之后如何上线?
6.1.5 How to go online after testing?

大公司:上线的时候,将脚本打包,提交git。先邮件抄送经理和总监运维运维负责上线。
Large companies: to go live, package the scripts and commit them to Git, then send an email CCing the manager and director; operations (O&M) is responsible for the actual deployment.

小公司:跟项目经理说一下,项目经理技术把关,项目经理通过了就可以上线了。风险意识。
Small companies: discuss it with the project manager, who does the technical review; once approved, it goes live. Keep risk awareness in mind.

所谓的上线就是编写脚本,并在DolphinScheduler中进行作业调度。
The so-called online is to write scripts and schedule jobs in Dolphin Scheduler.

6.1.6 A/B测试了解
6.1.6 A/B Test Understanding

1)什么是 A/B 测试?
1) What is A/B testing?

A / B测试本质上是一种实验,即随机向用户显示变量的两个或多个版本,并使用统计分析来确定哪个变量更适合给定的转化目标。
A/B testing is essentially an experiment in which two or more versions of a variable are randomly shown to users, and statistical analysis is used to determine which version performs better for a given conversion goal.

2)为什么要做A/B测试
2) Why do I need to do an A/B test?

举例:字节跳动有一款中视频产品叫西瓜视频,最早它叫做头条视频。为了提升产品的品牌辨识度,团队想给它起个更好的名字。经过一些内部调研和头脑风暴,征集到了西瓜视频、奇妙视频、筷子视频、阳光视频4个名字,于是团队就针对一共5个APP 名称进行了A/B实验。
For example: ByteDance has a medium video product called watermelon video, which was originally called headline video. In order to enhance the brand recognition of the product, the team wanted to give it a better name. After some internal research and brainstorming, four names were collected: watermelon video, wonderful video, chopsticks video and sunshine video, so the team conducted A/B experiments on a total of five APP names.

这个实验中唯一改变的是应用市场里该产品的名称和对应的logo,实验目的是为了验证哪一个应用名称能更好地提升头条视频APP在应用商店的点击率。最后西瓜视频和奇妙视频的点击率位列前二,但差距不显著,结合用户调性等因素的综合考量,最终决定头条视频正式更名为西瓜视频
The only change in this experiment was the name and logo of the product in the app market, and the purpose of the experiment was to verify which app name could better improve the click rate of the "Headline Video" APP in the app store. Finally, the click rate of watermelon video and wonderful video ranked in the top two, but the difference was not significant. Combined with the comprehensive consideration of factors such as user tonality, it was finally decided that the headline video was officially renamed watermelon video.

通过这个案例可以看到,A/B测试可以帮助业务做最终决策。结合案例的直观感受,我们可以这样来定义A/B 测试:在同一时间对目标受众做科学抽样、分组测试以评估效果。
As you can see from this example, A/B testing can help businesses make final decisions. Combined with the intuitive feel of the case, we can define A/B testing as follows: scientific sampling of the target audience at the same time, group testing to evaluate the effect.

以上图图示为例,假设我们有100万用户要进行A/B测试:
For example, suppose we have 1 million users who want to do A/B testing:

先选定目标受众,比如一线城市的用户。A/B测试不可能对所有用户都进行实验,所以要进行科学抽样,选择小部分流量进行实验。
First select the target audience, such as users in first-tier cities. A/B testing cannot be done on all users, so it is necessary to conduct scientific sampling and select a small number of traffic for experiments.

抽样之后需要对样本进行分组,比如A组保持现状,B组的某一个因素有所改变。
After sampling, the sample needs to be grouped, such as group A to maintain the status quo, and group B to change a certain factor.

分组之后在同一时间进行实验,就可以看到改变变量后用户行为的变化。
When you experiment at the same time after grouping, you can see the change in user behavior after changing variables.

再根据对应实验目标的指标,比如点击率的高低,来评估实验的结果。
The results of the experiment are then evaluated according to the indicators corresponding to the experimental objectives, such as the click-through rate.

做A/B测试主要原因有3点:
There are 3 main reasons to do A/B testing:

(1)风险控制:小流量实验可以避免直接上线效果不好造成损失。其次,实验迭代的过程中,决策都是有科学依据的,可以避免系统性的偏差。
(1) Risk control: Small flow experiments can avoid losses caused by poor direct online results. Secondly, in the process of experimental iteration, the decisions are scientifically based, which can avoid systematic bias.

(2)因果推断:我们相信A/B实验中的优化和改变最终能影响到线上数据以及用户的行为。在这个前提下,A/B测试就是最好的因果推断工具。
(2) Causal inference: We believe that the optimizations and changes in A/B experiments can ultimately affect online data and user behavior. Under this premise, A/B testing is the best tool for causal inference.

(3)复利效应:A/B测试是可以持续不断进行的实验,即使一次实验提升的效果不大,但是长期下来复利效应的积累会产生很大的变化和回报。
(3) Compound interest effect: A/B testing is an experiment that can be carried out continuously, even if the effect of an experiment is not large, but the accumulation of compound interest effect will produce great changes and returns in the long run.

3)哪个首页新UI版本更受欢迎
3) Which homepage new UI version is more popular

今日头条UI整体风格偏大龄被诟病已久,不利于年轻和女性用户泛化,历史上几次红头改灰头实验都对大盘数据显著负向。因此团队设计了A/B实验,目标是在可接受的负向范围内,改一版用户评价更好的UI。通过控制变量法,对以下变量分别开展数次A/B实验:
The overall UI style of Toutiao had long been criticized for skewing old, which hindered growth among young and female users, and several past experiments that changed the red header to a grey one had a significantly negative impact on overall metrics. The team therefore designed A/B experiments with the goal of shipping a UI with better user feedback while keeping the negative impact within an acceptable range. Using the control-variable method, several A/B experiments were run on each of the following variables:

头部色值饱和度字号字重上下间距左右间距底部 tab icon
Header color value, saturation, font size, font weight, vertical spacing, horizontal spacing, and bottom tab icons.

结合用户调研(结果显示:年轻用户和女性用户对新 UI 更偏好)
Combined with user research (results show that younger and female users prefer the new UI).

综合来看,效果最好的 UI 版本如下图所示,全量上线。
Overall, the best-performing UI version is shown in the figure below, and it is fully launched.

新 UI 上线后,Stay duration 显著负向从-0.38% 降至 -0.24%,图文类时长显著 +1.66%,搜索渗透显著 +1.47%,高频用户(占 71%)已逐渐适应新 UI。
After the new UI went live, the significant negative impact on stay duration narrowed from -0.38% to -0.24%, time spent on image-and-text content rose significantly by +1.66%, search penetration rose significantly by +1.47%, and high-frequency users (71% of users) gradually adapted to the new UI.

6.2 项目实际工作流程
6.2 Actual Project Workflow

以下是活跃用户需求的整体开发流程。
Here's the overall development process for active user demand.

产品经理负责收集需求:需求来源与客户反馈、老板的意见。
The product manager is responsible for collecting requirements; requirements come from customer feedback and from management.

第1步:确定指标的业务口径
Step 1: Determine the business caliber of the indicator

由产品经理主导,找到提出该指标的运营负责人沟通。首先要问清楚指标是怎么定义的,比如活跃用户是指启动过APP的用户。设备id 还是用户id
Led by the product manager, who talks to the operations owner who proposed the metric. First clarify how the metric is defined: for example, an active user is a user who has launched the app; is it counted by device ID or by user ID?

产品经理先编写需求文档并画原型图。=》需求不要口头说。
The product manager writes the requirements document and draws the prototype first; requirements must not be conveyed only verbally.

第2步:需求评审
Step 2: Requirements Review

由产品经理主导设计原型,对于活跃主题,我们最终要展示的是最近n天的活跃用户数变化趋势 ,效果如下图所示。此处大数据开发工程师、后端开发工程师、前端开发工程师一同参与,一起说明整个功能的价值和详细的操作流程,确保大家理解的一致。
Product managers lead the design of prototypes. For active topics, we will eventually show the trend of active users in the last n days, as shown in the following figure. Here, big data development engineers, back-end development engineers and front-end development engineers participate together to explain the value of the whole function and detailed operation process to ensure that everyone understands the same.

工期:
Schedule (effort estimate):

接口:数据格式、字段类型、责任人。
Interface: data format, field type, responsible person.

第3步:大数据开发
Step 3: Big Data Development

大数据开发工程师,通过数据同步的工具如Flume、Datax、Maxwell等将数据同步到ODS层,然后就是一层一层的通过SQL计算到DWD、DWS层,最后形成可为应用直接服务的数据填充到ADS层。
Big data development engineers synchronize data to ODS layer through data synchronization tools such as Flume, Datax, Maxwell, etc., and then calculate it layer by layer to DWD and DWS layer through SQL, and finally form data that can be directly served by application to fill ADS layer.

第4步:后端开发
Step 4: Backend Development

后端工程师负责,为大数据工程师提供业务数据接口。
Back-end engineers are responsible for providing business data interfaces to big data engineers.

同时还负责读取ADS层分析后,写入MySQL中的数据。
They are also responsible for reading the analysis results from the ADS layer and writing them into MySQL.

第5步:前端开发
Step 5: Front End Development

前端工程师负责,前端埋点。
Front-end engineers are responsible for front-end event tracking (埋点).

对分析后的结果数据进行可视化展示。
Visualize the result data after analysis.

第6步:联调
Step 6: Joint debugging

此时大数据开发工程师、前端开发工程师、后端开发工程师都要参与进来。此时会要求大数据开发工程师基于历史的数据执行计算任务,大数据开发工程师承担数据准确性的校验。前后端解决用户操作的相关BUG保证不出现低级的问题完成自测。
At this point big data engineers, front-end engineers, and back-end engineers all take part. The big data engineer is asked to run the computation on historical data and is responsible for verifying data accuracy; the front end and back end fix bugs related to user operations and complete self-testing to make sure no low-level issues remain.

第7步:测试
Step 7: Testing

测试工程师对整个大数据系统进行测试。测试的手段包括,边界值、等价类等。
Test engineers test the entire big data system. Test methods include boundary values, equivalence classes, etc.

提交测试异常的软件有:禅道(测试人员记录测试问题1.0,输入是什么,结果是什么,跟预期不一样->需要开发人员解释,是一个bug,下一个版本解决1.1->测试人员再测试。测试1.1ok->测试经理关闭bug)
Test issues are tracked in ZenTao (禅道): the tester records issue 1.0 (the input, the actual result, and how it differs from expectations) -> the developer confirms it is a bug and fixes it in version 1.1 -> the tester retests; if 1.1 passes, the test manager closes the bug.

1周开发写代码 =2周测试时间
Roughly 1 week of development corresponds to about 2 weeks of testing.

第8步:上线
Step 8: Go online

运维工程师会配合我们的前后端开发工程师更新最新的版本到服务器。此时产品经理要找到该指标的负责人长期跟进指标的准确性。重要的指标还要每过一个周期内部再次验证,从而保证数据的准确性。
The Ops Engineer will work with our backend development engineers to update the latest version to the server. At this point, the product manager needs to find the person responsible for the indicator to follow up on the accuracy of the indicator for a long time. Important indicators must be verified internally every cycle to ensure the accuracy of the data.

6.3 项目当前版本号是多少?多久升级一次版本
6.3 What is the current version number of the project? How often do you update your version?

敏捷开发(少量需求=>代码编写=>测试=>少量需求=>代码编写=>测试),又叫小步快跑。
Agile development (small requirements => code writing => testing => small requirements => code writing => testing…), also known as small steps.

差不多一个月会迭代一次。每月都有节日(元旦、春节、情人节、3.8妇女节、端午节、618、国庆、中秋、1111/6.1/5.1、生日、周末)新产品、新区域。
It iterates about once a month. Every month there are festivals (New Year's Day, Spring Festival, Valentine's Day, March 8 Women's Day, Dragon Boat Festival, June 18, National Day, Mid-Autumn Festival, 1111/6.1/5.1, birthday, weekend) New products, new areas.

产品或我们提出优化需求,然后评估时间。每周我们都会开会下周计划和本周总结。(日报、周报、月报、季度报、年报)需求1周的时间,周三一定完成。周四周五(帮同事写代码、自己学习工作额外的技术)。
The product team (or we ourselves) propose optimization requirements, and then the effort is estimated. Every week we hold a meeting to plan next week and summarize this week (daily, weekly, monthly, quarterly, and annual reports). A one-week requirement must be finished by Wednesday; Thursday and Friday are for helping colleagues with code and learning additional skills for the job.

5.1.2

5是大版本号:必须是重大升级
5 is a major version number: must be a major upgrade

1:一般是核心模块变动
1: Generally core module changes

2:一般版本变化
2: General Version Changes

6.4 项目实现一个需求大概多长时间
6.4 How long does it take to implement a requirement in a project?

(1)刚入职第一个需求大概需要7天左右。业务熟悉后,平均一天一需求。
(1) The first requirement for new employment takes about 7 days. After becoming familiar with the business, an average of one demand a day.

(2)影响时间的因素:对业务熟悉、开会讨论需求、表的权限申请、测试等。新员工培训(公司规章制度、代码规范)
(2) Factors affecting time: familiarity with business, meeting to discuss requirements, authorization application for forms, testing, etc. New employee training (company rules and regulations, code specifications)

6.5 项目开发中每天做什么事
6.5 What do you do every day during project development?

(1)新需求(活动、优化、新产品、新市场) 60%
(1) New requirements (activities, optimization, new products, new markets). 60%

(2)故障分析:数仓任何步骤出现问题,需要查看问题,比如日活,月活下降或快速上升等。20%
(2) Fault analysis: There is a problem in any step of the warehouse, and it is necessary to check the problem, such as daily activity, monthly activity decline or rapid rise, etc. 20%

(3)新技术的预言(比如湖仓一体 数据湖 Doris 实时数据质量监控)10%
(3) Research into new technologies (e.g., lakehouse, data lake, Doris, real-time data quality monitoring) 10%

(4)其临时任务 10%
(4) Other ad-hoc tasks 10%

(5)晨会-》10做操-》讨论中午吃什么-》12点出去吃1点-》睡到2点-》3点茶歇水果-》晚上吃啥-》吃加班餐-》开会-》晚上6点吃饭-》7点开始干活-10点-》11
(5) Morning meeting -> exercises at 10:00 -> discuss what to have for lunch -> go out to eat at 12:00, back by 1:00 -> nap until 2:00 -> tea break with fruit at 3:00 -> discuss dinner -> overtime meal -> meeting -> dinner at 6:00 pm -> start working at 7:00 pm until 10:00 or 11:00 pm

7章 数据治理
Chapter 7: Data Governance

7.1 元数据管理
7.1 metadata management

元数据管理目前开源的框架中,Atlas框架使用的较多。再就是采用自研的系统。
Among open-source metadata management frameworks, Atlas is the most widely used; the alternative is a self-developed system.

1)元数据管理底层实现原理
1) Metadata management underlying implementation principle

解析如下HQL,获取对应的原数据表和目标表直接的依赖关系。
Parse the following HQL to obtain the direct dependency relationship between the corresponding source data table and target table.

insert into table ads_user

select id, name from dws_user

依赖关系能够做到:表级别和字段级别 neo4j
Lineage can be captured at the table level and at the field level, and stored in Neo4j.

2)用处:作业执行失败,评估他的影响范围。主要用于表比较多的公司
2) Purpose: when a job fails, assess the scope of its downstream impact. Mainly useful for companies with a large number of tables.

atlas版本问题:
Atlas version issues:

0.84版本:2019-06-21

2.0版本:2019-05-13

框架版本:
Framework Version:

Apache 0.84 2.0 2.1

CDH 2.0

3)尚大自研的元数据管理
3) Atguigu's self-developed metadata management system

7.2 数据质量监控
7.2 Data Quality Monitoring

7.2.1 监控原则
7.2.1 Monitoring principles

1)单表数据量监控
1) Single-table data volume monitoring

一张表的记录数在一个已知的范围内,或者上下浮动不会超过某个阈值
The number of records in a table is within a known range, or does not fluctuate above or below a certain threshold

SQL结果:var 数据量 = select count(*) from 表 where 时间等过滤条件
SQL result: data_count = SELECT COUNT(*) FROM <table> WHERE <time and other filter conditions>

报警触发条件设置:如果数据量不在[数值下限, 数值上限], 则触发报警
Alarm trigger condition setting: If the data volume is not in [lower limit, upper limit], the alarm will be triggered.

同比增加:如果((本周的数据量 - 上周的数据量)/上周的数据量*100)不在 [比例下线,比例上限],则触发报警
Week-over-week check: if ((this week's volume - last week's volume) / last week's volume * 100) is not within [lower bound, upper bound], trigger an alert (a sketch of these checks appears at the end of this subsection).

环比增加:如果((今天的数据量 - 昨天的数据量)/昨天的数据量*100)不在 [比例下线,比例上限],则触发报警
Day-over-day check: if ((today's volume - yesterday's volume) / yesterday's volume * 100) is not within [lower bound, upper bound], trigger an alert.

报警触发条件设置一定要有。如果没有配置的阈值,不做监控
Alert trigger conditions must always be configured; metrics without configured thresholds are not monitored.

日活、周活、月活、留存(日周月)、转化率(日、周、月)GMV(日、周、月)
Daily/weekly/monthly active users, retention (daily/weekly/monthly), conversion rate (daily/weekly/monthly), GMV (daily/weekly/monthly)

复购率(日周月) 30%
Repurchase rate (day, week, month) 30%
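A sketch of the alert checks above (absolute threshold plus period-over-period percentage change); the thresholds and counts are illustrative.

public class VolumeMonitorDemo {
    // Absolute check: alert if the count is outside [lower, upper].
    static boolean outOfRange(long count, long lower, long upper) {
        return count < lower || count > upper;
    }

    // Relative check: alert if the percentage change vs. the previous period is outside [lowerPct, upperPct].
    static boolean changeOutOfRange(long current, long previous, double lowerPct, double upperPct) {
        double changePct = (current - previous) * 100.0 / previous;
        return changePct < lowerPct || changePct > upperPct;
    }

    public static void main(String[] args) {
        long thisWeek = 1_050_000L, lastWeek = 1_000_000L;            // illustrative counts
        if (outOfRange(thisWeek, 800_000L, 1_200_000L)) {
            System.out.println("alert: absolute volume out of range");
        }
        if (changeOutOfRange(thisWeek, lastWeek, -10.0, 10.0)) {      // allow +/-10% week over week
            System.out.println("alert: week-over-week change out of range");
        }
    }
}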

2)单表空值检测
2) Single table null value detection

某个字段为空的记录数在一个范围内,或者占总量的百分比在某个阈值范围内
The number of records with a field empty is within a range, or the percentage of the total is within a threshold

目标字段:选择要监控的字段,不能选“无”
Target field: select the field to be monitored. None cannot be selected.

SQL结果:var 异常数据量 = select count(*) from 表 where 目标字段 is null
SQL result: abnormal_count = SELECT COUNT(*) FROM <table> WHERE <target field> IS NULL

单次检测:如果异常数据量不在[数值下限, 数值上限],则触发报警
Single detection: if (abnormal data volume) is not in [lower limit, upper limit], trigger alarm

3)单表重复值检测
3) Single table duplicate value detection

一个或多个字段是否满足某些规则
Whether one or more fields satisfy certain rules

目标字段:第一步先正常统计条数;select count(*) form 表;
Target field: step 1, count all rows normally: SELECT COUNT(*) FROM <table>;

第二步,去重统计;select count(*) from group by 某个字段
Step 2, count after deduplication, e.g. SELECT COUNT(DISTINCT <field>) FROM <table>;

第一步的值和第二步不的值做减法,看是否在上下线阀值之内
Subtract the step-2 value from the step-1 value (total count minus deduplicated count) and check whether the difference is within the configured thresholds.

单次检测:如果异常数据量不在[数值下限, 数值上限], 则触发报警
Single detection: if (abnormal data volume) is not in [lower limit, upper limit], trigger alarm

4)单表值域检测
4) Single table range detection

一个或多个字段没有重复记录
Whether the values of one or more fields satisfy certain rules (value-domain constraints)

目标字段:选择要监控的字段,支持多选
Target field: select the field to monitor. Multiple selections are supported.

检测规则:填写“目标字段”要满足的条件。其中$1表示第一个目标字段,$2表示第二个目标字段,以此类推。上图中的“检测规则”经过渲染后变为“delivery_fee = delivery_fee_base+delivery_fee_extra”
Detection rule: fill in the conditions to be satisfied by the Target Field. Where $1 represents the first target field,$2 represents the second target field, and so on. The "detection rule" in the above image is rendered to "delivery_fee = delivery_fee_base+delivery_fee_extra"

阈值配置与“空值检测”相同
The threshold configuration is the same as null detection

5)跨表数据量对比
5) Comparison of data volume across tables

主要针对同步流程,监控两张表的数据量是否一致
Mainly for synchronization process, monitor whether the data volume of two tables is consistent

SQL结果:count本表 - count关联表
SQL result: COUNT(*) of this table minus COUNT(*) of the related table

阈值配置与“空值检测”相同
The threshold configuration is the same as null detection

7.2.2 数据质量实现
7.2.2 Data Quality Realization

7.2.3 实现数据质量监控,你具体怎么做,详细说?
7.2.3 Data quality monitoring, how do you do it, in detail?

实现数据质量检测的功能,我们需要首先明确数据质量的维度,例如准确性、完整性、唯一性、及时性和一致性。
To realize the function of data quality detection, we need to first clarify the dimensions of data quality, such as accuracy, completeness, uniqueness, timeliness and consistency.

1)确定数据源
1) Determine the data source

确定需要进行数据质量检测的数据源。这可能是数据库表、文件、API等。
Identify data sources that require data quality testing. This could be database tables, files, APIs, etc.

2)定义质量规则
2) Define quality rules

为每个数据质量维度定义具体的规则。例如:
Define specific rules for each data quality dimension. For example:

–准确性:检查数据是否符合预期的范围或分布
Accuracy: Check whether the data fits the expected range or distribution.

–完整性:检查数据是否存在缺失值或空值
Integrity: Check data for missing or null values.

–唯一性:检查数据中是否存在重复项
Uniqueness: Check data for duplicates.

–及时性:检查数据是否在预期的时间范围内更新。
Timeliness: Check if the data is updated within the expected time frame.

–一致性:检查数据是否符合预定义的格式或标准。
Consistency: Check whether data conforms to predefined formats or standards.

3)实现检测功能
3) Realize detection function

使用编程语言(如SQL)编写检测数据质量的函数。这些函数可以包括:
Write functions that check data quality using a programming language such as SQL. These functions may include:

–数据导入:从数据源导入数据。
- Data Import: Import data from a data source.

–数据清理:对数据进行预处理,如去除空格、转换数据类型等。
- Data cleaning: Pre-processing of data, such as removing spaces, converting data types, etc.

–应用质量规则:根据定义的质量规则,实现相应的检测函数。例如,检查缺失值、重复项或数据范围等。
- Apply quality rules: implement corresponding detection functions according to defined quality rules. For example, check for missing values, duplicates, or data ranges.

–输出报告:生成数据质量报告,如将检测结果汇总成表格或可视化图表。
Output reports: Generate data quality reports, such as summarizing test results into tables or visual charts.

4)自动化和监控
4) Automation and monitoring

将数据质量检测功能集成到数据管道或ETL过程中,以实现自动化检测。此外,可以设置监控和警报机制,以便在检测到数据质量问题时及时通知相关人员。
Integrate data quality inspection capabilities into data pipelines or ETL processes for automated inspection. In addition, monitoring and alert mechanisms can be set up to notify relevant personnel when data quality problems are detected.
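A sketch of step 3 as a single rule (a null-rate completeness check) executed over JDBC; the connection URL, credentials, table, and column below are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NullRateCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection and rule target.
        String url = "jdbc:mysql://localhost:3306/dw_meta?useSSL=false";
        String table = "ods_order_info";
        String column = "user_id";
        double maxNullRate = 0.01; // completeness rule: at most 1% nulls

        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT COUNT(*) AS total, SUM(CASE WHEN " + column + " IS NULL THEN 1 ELSE 0 END) AS nulls FROM " + table)) {
            if (rs.next()) {
                long total = rs.getLong("total");
                long nulls = rs.getLong("nulls");
                double rate = total == 0 ? 0.0 : (double) nulls / total;
                System.out.println(rate > maxNullRate
                        ? "ALERT: null rate " + rate + " exceeds " + maxNullRate
                        : "OK: null rate " + rate);
            }
        }
    }
}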

7.3 权限管理(Ranger)
7.3 Authority Management (Ranger)

7.4 用户认证(Kerberos)
7.4 User authentication (Kerberos)

7.5 数据治理
7.5 data governance

资产健康度量化模型
Quantitative model of asset health.

根据数据资产健康管理的关键因素,明确量化积分规则。根据数据基础信息完整度、数据存储和数据计算健康度、数据质量监控规则合理性等,完整计算数据资产健康分。
According to the key factors of data asset health management, define the quantitative points rules. According to the integrity of data basic information, data storage and data calculation health degree, rationality of data quality monitoring rules, etc., complete calculation of data asset health score.

1)资产健康分基础逻辑
1) Basic logic of asset health

(1)健康分基本设定原则:
(1) Basic principles for setting health scores:

健康分采用百分制,100最高,0分最低;
The health score uses a 100-point scale: 100 is the highest and 0 the lowest;

健康度以表为最细粒度,每个表都有一个健康分;
The table is the finest granularity of health degree, and each table has a health score.

个人、业务版块、团队、一级部门、以及集团的健康分以所属表的健康分加权平均;
Health scores of individuals, business segments, teams, first-level departments, and groups are weighted averages of health scores of the tables to which they belong;

数据表权重=表字节数 + 1再开立方根;空表的权重为1;
Table weight = cube root of (table size in bytes + 1); the weight of an empty table is 1;

(2)数据表资产健康分:
(2) Data sheet Asset health score:

数据表资产健康分score =(规范合规健康分*10% + 存储健康分*30% + 计算健康分*30% + 数据质量健康分*15% + 数据安全健康分 * 15%
Data table asset health score =(specification compliance health score *10% + storage health score *30% + calculation health score * 30% + data quality health score *15% + data safety health score * 15%);

2数据资产特征列表:
2) List of data asset characteristics:

资产健康类型
Asset Health Type

特征

特征分计算逻辑
Feature score calculation logic

规范 Specification

规范健康分= 100 * sum(特征分)/count(特征)
Normative health score = 100 * sum(feature score)/count(feature score)

有技术owner
Technical owner

0/1

有业务owner
Business owner

0/1

有分区信息
There is partition information

0/1

有归属部门
There is a department to which it belongs

0/1

表命名合规
Table naming compliance

0/1

数仓分层合规
Warehouse Layered Compliance

0/1

表有备注信息
The table has comments

0/1

字段有备注信息
Field has comment information

有备注字段数 / 总字段数
Number of fields with comments/total number of fields

存储 Storage
Storage

存储健康分= 100 * 完成度
Storage Health Score = 100 * Completion

生命周期合理性
life cycle rationality

永久保留表:不可再生源头表、白名单表、冷备表、最近93天有访问非分区表, 按100%完成度;
Permanently retained tables (non-reproducible source tables, whitelisted tables, cold-backup tables, and non-partitioned tables accessed within the last 93 days) count as 100% completion;

未管理表:分区表但未配置生命周期,按0%完成度;
Unmanaged table: partition table but no life cycle configured, 0% completion;

无访问表:在93天前创建,但最近 93 天无访问,按0%完成度;
No access table: created 93 days ago, but no access in the last 93 days, according to 0% completion;

新建表:创建时间小于93天,尚未积累访问数据和未配置合理生命周期,默认按80%完成度;
New table: created less than 93 days ago, access data has not been accumulated and reasonable life cycle has not been configured, default is 80% completion;

普通表:以上之外普通表,系统计算建议保留天数与当前生命周期比值作为完成度;
Ordinary table: In addition to the above ordinary tables, the system calculates the ratio of the recommended retention days to the current life cycle as the completion degree;

计算 Calc
Compute (Calc)

计算健康分= 100 * sum(特征分)/count(特征)
Calculate health score = 100 * sum(feature score)/count(feature score)

hdfs路径被删除
hdfs path deleted

被删除,0分;否则为1
Deleted, 0; otherwise 1

产出为空
Output is null

连续15天无数据产出,0分;否则为1;
No data output for 15 consecutive days, 0 points; otherwise, 1;

产出表未被读取
Output table not read

最近30天产出数据表无读取,0分;否则为1;
No reading of output data table in the last 30 days, 0 point; otherwise, 1 point;

运行出错
run error

最近3天任务运行有出错,0分;否则为1;
0 for errors in task operation in the last 3 days; otherwise 1;

重复表/相似表
Duplicate/Similar Tables

与其他数据表50%相似,则为0分,否则为1;
0 points for 50% similarity with other data tables, otherwise 1;

责任人不合理 ​
Invalid owner

对应的调度节点责任人已经离职,或调度节点责任人在职但与数据表责任人不一致 ,0分;否则为1;
The corresponding dispatching node responsible person has resigned, or the dispatching node responsible person is on-the-job but inconsistent with the data table responsible person, 0 points; otherwise, 1;

简单加工
simple processing

生产sql只简单 select字段出来,没有 join、group;where 条件只有分区字段 ;0分,否则为1;
The production SQL only selects fields, with no JOIN or GROUP BY, and the WHERE clause contains only the partition field: 0 points, otherwise 1;

暴力扫描
Brute-force scan

表中被查询分区大于 90 天,同时被查询分区的总存储量大于 100G ;0分,否则为1;
The queried partition in the table is greater than 90 days, and the total storage capacity of the queried partition is greater than 100G ;0 points, otherwise 1;

两侧类型不一致
Inconsistent type on both sides

类似这个例子:
Similar to this example:

select ... from table1 t1 join table2 t2 on t1.a_bigint=t2.a_string;

这种情况在 on 条件中两边都被 double了,这个其实不合理;是个大坑, 会导致行为和用户期待不一致
In this case both sides of the ON condition are implicitly cast to double, which is unreasonable: it is a big pitfall that makes the join behave differently from what users expect.

数据倾斜 ​
data skew

长尾运行实例耗费时高于平均值 20%,分数记为0,否则记为1;
If long-tail running instances take more than 20% longer than the average, score 0; otherwise 1;

需要列剪裁
Column pruning needed

判断 select 语句及后续使用逻辑,是否 select 出来的列都被使 用,用被使用的列数/总 select 列数计算使用率,低于 50% 就需要列剪裁,0分,否则为1;
Judge the select statement and subsequent use logic, whether all the selected columns are used, calculate the utilization rate by using the number of columns used/the total number of selected columns, if it is less than 50%, column pruning is required, 0 points, otherwise 1;

质量 Quality
Quality

质量健康分= 100 * sum(特征分)/count(特征)
Quality health score = 100 * sum(feature score)/count(feature score)

表产出时效监控
Table Output Timing Monitoring

qdc有定义产出时间预警或已经归属于某个生产基线; 0/1;
qdc has a defined output time warning or has been attributed to a production baseline; 0/1;

表内容监控
Table content monitoring

有配置表级规则; 0/1;
Configuration table level rule exists; 0/1;

字段内容监控
Field Content Monitoring

有配置字段级规则; 0/1;
There are configuration field-level rules; 0/1;

表产出SLA
Table Output SLA

X点及时性SLA测量函数:
X-point timeliness SLA measurement function:

realTime:数据实际产出时间
realTime: actual data output time

expectTime: 期望数据产出的时间点
expectTime: The point in time at which data is expected to be produced

n:数据产出周期(针对多次调度)
n: Data output cycle (for multiple scheduling)

表内容SLA
Table Content SLA

1-触发监控规则/总监控规则数
1 - (number of triggered monitoring rules / total number of monitoring rules)

字段内容SLA
Field Content SLA

1-触发监控规则/总监控规则数
1 - (number of triggered monitoring rules / total number of monitoring rules)

安全 Security

安全健康分= 100 * sum(特征分)/count(特征)
Safety and health score = 100 * sum(feature score)/count(feature score)

数据分类
data classification

有明确设置归属的“资产目录” ; 0/1;
There is an Asset Directory with clear attribution settings; 0/1;

资产分级
asset classification

有指定资产等级; 0/1;
There is a specified asset class; 0/1;

字段级安全等级
Field-level security level

有字段设置了安全等级; 0/1;
There are fields where security levels are set; 0/1;

第8章 中台
Chapter 8: The Middle Platform

https://mp.weixin.qq.com/s/nXI0nSSOneteIClA7dming

8.1 什么是中台?
8.1 What is a middle platform?

1)什么是前台?
1) What is the front office?

首先,这里所说的“前台”和“前端”并不是一回事。所谓前台即包括各种和用户直接交互的界面,比如web页面,手机app;也包括服务端各种实时响应用户请求的业务逻辑,比如商品查询、订单系统等等。
First of all, the "front office" here is not the same thing as the "front end". The front office includes all the interfaces that interact directly with users, such as web pages and mobile apps, as well as the server-side business logic that responds to user requests in real time, such as product search and the order system.

2)什么是后台?
2) What is the back office?

后台并不直接面向用户,而是面向运营人员的配置管理系统,比如商品管理、物流管理、结算管理。后台为前台提供了一些简单的配置。
The back office does not face users directly; it is the configuration and management system used by operations staff, such as product management, logistics management, and settlement management. It provides the front office with some simple configuration capabilities.

3)为什么要做中台
3) Why build a middle platform?

传统项目痛点:重复造轮子。
Pain point of traditional projects: reinventing the wheel over and over.

8.2 各家中台
8.2 Middle platforms at different companies

1)SuperCell公司
1) Supercell

2)阿里巴巴提出了“大中台,小前台”的战略
2) Alibaba proposed the strategy of "big middle platform, small front office"

3)华为提出了“平台炮火支撑精兵作战”的战略
3) Huawei proposed the strategy of "platform firepower supporting elite troops"

8.3 中台具体划分
8.3 Specific division of middle platform

1)业务中台 & 技术中台
1) Business middle platform & technical middle platform

图 业务中台  图 技术中台
Figure: business middle platform; Figure: technical middle platform

2)数据中台 & 算法中台
2) Data middle platform & algorithm middle platform

图 数据中台 图 算法中台
Figure: data middle platform; Figure: algorithm middle platform

8.4 中台使用场景
8.4 When to use a middle platform

1)从0到1的阶段,没有必要搭建中台。
1) In the 0-to-1 stage, there is no need to build a middle platform.

从0到1的创业型公司,首要目的是生存下去,以最快的速度打造出产品,证明自身的市场价值。
Start-up companies from 0 to 1, the primary goal is to survive, to build products as quickly as possible, to prove their market value.

这个时候,让项目野蛮生长才是最好的选择。如果不慌不忙地先去搭建中台,恐怕中台还没搭建好,公司早就饿死了。
At this stage, letting the project grow wild is the best choice. If you take your time building a middle platform first, the company may starve to death before the platform is ever finished.

2)从1到N的阶段,适合搭建中台。
2) The 1-to-N stage is suitable for building a middle platform.

当企业有了一定规模,产品得到了市场的认可,这时候公司的首要目的不再是活下去,而是活的更好。
When the enterprise has a certain scale and the product has been recognized by the market, the primary purpose of the company is no longer to survive, but to live better.

这个时候,趁着项目复杂度还不是特别高,可以考虑把各项目的通用部分下沉,组建中台,以方便后续新项目的尝试和旧项目的迭代。
At this time, while the complexity of the project is not particularly high, you can consider sinking the common parts of each project and forming a middle platform to facilitate subsequent attempts of new projects and iterations of old projects.

3)从N到N+1的阶段,搭建中台势在必行。
3) From N to N+1, it is imperative to build a middle platform.

当企业已经有了很大的规模,各种产品、服务、部门错综复杂,这时候做架构调整会比较痛苦。
When the enterprise has a large scale and various products, services and departments are complicated, it will be painful to make structural adjustments at this time.

但是长痛不如短痛,为了项目的长期发展,还是需要尽早调整架构,实现平台化,以免日后越来越难以维护。
However, long-term pain is better than short-term pain. For the long-term development of the project, it is still necessary to adjust the architecture as soon as possible to realize platformization, so as not to become more and more difficult to maintain in the future.

8.5 中台的痛点
8.5 Pain points of the middle platform

牵一发动全身,中台细小的改动,都需要严格测试。周期比较长。
Every small change to the middle platform affects everything built on it and must be tested rigorously, so the release cycle is long.

大厂一般有总的中台,也有部门级别的中台,保证效率。
Large companies usually have both a company-wide middle platform and department-level middle platforms to keep things efficient.

第9章 算法题(LeetCode)
Chapter 9: Algorithm Questions (LeetCode)

9.1 时间复杂度、空间复杂度理解
9.1 Understanding time complexity and space complexity

在计算机算法理论中,用时间复杂度和空间复杂度来分别从这两方面衡量算法的性能。
In computer algorithm theory, time complexity and space complexity are used to measure an algorithm's performance in these two respects.

1)时间复杂度(Time Complexity)
1) Time Complexity

算法的时间复杂度,是指执行算法所需要的计算工作量。
The time complexity of an algorithm refers to the amount of computational work required to execute the algorithm.

一般来说,计算机算法是问题规模n的函数f(n),算法的时间复杂度也因此记做:T(n) = O(f(n))。
In general, a computer algorithm is a function f(n) of the problem size n, and its time complexity is therefore written as T(n) = O(f(n)).

问题的规模n 越大,算法执行的时间的增长率与fn的增长率正相关,称作渐进时间复杂度(Asymptotic Time Complexity)。
The larger the size of the problem n, the growth rate of the algorithm execution time is positively correlated with the growth rate of f(n), which is called the Asymptotic Time Complexity.

2)空间复杂度
2) Space complexity

算法的空间复杂度,是指算法需要消耗的内存空间。有时候递归调用,还需要考虑调用栈所占用的空间。
The space complexity of an algorithm refers to the memory space it consumes. For recursive calls, the space taken by the call stack must also be considered.

其计算和表示方法与时间复杂度类似,一般都用复杂度的渐近性来表示。同时间复杂度相比,空间复杂度的分析要简单得多。
Its calculation and notation are similar to those of time complexity and are generally expressed asymptotically. Compared with time complexity, space complexity is much simpler to analyze.

所以,我们一般对程序复杂度的分析,重点都会放在时间复杂度上。
Therefore, our analysis of program complexity generally focuses on time complexity.

9.2 常见算法求解思想
9.2 Common algorithm solution ideas

1)暴力求解
1) Brute force

不推荐。
Not recommended.

2)动态规划
2) Dynamic programming

动态规划(Dynamic Programming,DP)是运筹学的一个分支,是求解决策过程最优化的过程。
Dynamic programming (DP) is a branch of operations research, which is the process of solving the optimization of decision-making process.

动态规划过程是:把原问题划分成多个“阶段”,依次来做“决策”,得到当前的局部解;每次决策,会依赖于当前“状态”,而且会随即引起状态的转移。
The dynamic programming process is to divide the original problem into multiple "stages" and make "decisions" in turn to obtain the current local solution; each decision depends on the current "state" and will immediately cause the state to transition.

这样,一个决策序列就是在变化的状态中,“动态”产生出来的,这种多阶段的、最优化决策,解决问题的过程就称为动态规划(Dynamic Programming,DP)
In this way, a decision sequence is "dynamically" generated in a changing state, and this multi-stage, optimal decision-making, problem-solving process is called dynamic programming (DP).

3)分支
3) Branch and bound

对于复杂的最优化问题,往往需要遍历搜索解空间树。最直观的策略,就是依次搜索当前节点的所有分支,进而搜索整个问题的解。为了加快搜索进程,我们可以加入一些限制条件计算优先值,得到优先搜索的分支,从而更快地找到最优解:这种策略被称为“分支限界法”。
For complex optimization problems, it is often necessary to traverse the search solution space tree. The most intuitive strategy is to search all branches of the current node in turn, and then search for the solution of the whole problem. In order to speed up the search process, we can add some constraints to calculate the priority value, get the branch of priority search, and thus find the optimal solution faster: this strategy is called "branch and bound method".

分支限界法常以广度优先(BFS)、或以最小耗费(最大效益)优先的方式,搜索问题的解空间树。
Branch-and-bound methods often search the solution space tree of the problem in a breadth-first (BFS) or least-cost (maximum-benefit)-first manner.

9.3 基本算法
9.3 basic algorithm

9.3.1 冒泡排序
9.3.1 Bubble Sort

冒泡排序是一种简单的排序算法。
Bubble sort is a simple sort algorithm.

的基本原理是:重复地扫描要排序的数列,一次比较两个元素,如果它们的大小顺序错误,就把它们交换过来。这样,一次扫描结束,我们可以确保最大(小)的值被移动到序列末尾。这个算法的名字由来,是因为越小的元素会经由交换,慢慢“浮”到数列的顶端。
The basic principle is to repeatedly scan the sequence to be sorted, comparing two elements at a time, and swapping them if they are in the wrong order of size. In this way, at the end of a scan, we can ensure that the largest (smallest) value is moved to the end of the sequence. The algorithm gets its name because smaller elements slowly float to the top of the sequence by swapping.

冒泡排序的时间复杂度为O(n2)。
The time complexity of bubble sort is O(n²).

public void bubbleSort(int[] nums) {
    int n = nums.length;
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (nums[j + 1] < nums[j]) {
                swap(nums, j, j + 1);
            }
        }
    }
}

// 交换数组中下标i和j的两个元素
// Swap the elements at indices i and j (helper used above)
private void swap(int[] nums, int i, int j) {
    int tmp = nums[i];
    nums[i] = nums[j];
    nums[j] = tmp;
}

9.3.2 快速排序
9.3.2 Quick Sorting

快速排序的基本思想:通过一趟排序,将待排记录分隔成独立的两部分,其中一部分记录的关键字均比另一部分的关键字小,则可分别对这两部分记录继续进行排序,以达到整个序列有序。
The basic idea of quick sorting: by sorting, the records to be sorted are separated into two independent parts, one of which has a smaller keyword than the other, and the two parts of the records can be sorted separately to achieve the order of the whole sequence.

快排应用了分治思想,一般会用递归来实现。
Quick sort applies the divide-and-conquer idea and is usually implemented with recursion.

快速排序的时间复杂度可以做到O(nlogn),在很多框架和数据结构设计中都有广泛的应用。
The time complexity of quick sort can be O (nlogn), and it is widely used in many frameworks and data structure designs.

public void qSort(int[] nums, int start, int end){

if (start >= end) return;

int mid = partition(nums, start, end);

qSort(nums, start, mid - 1);

qSort(nums, mid + 1, end);

}

// 定义分区方法,把数组按一个基准划分两部分,左侧元素一定小于基准,右侧大于基准
//Define the partition method, divide the array into two parts according to a benchmark, the left element must be smaller than the benchmark, and the right is larger than the benchmark

private static int partition( int[] nums, int start, int end ){

// 以当前数组起始元素为pivot
// Use the first element of the current range as the pivot

int pivot = nums[start];

int left = start;

int right = end;

while ( left < right ){

while ( left < right && nums[right] >= pivot )

right --;

nums[left] = nums[right];

while ( left < right && nums[left] <= pivot )

left ++;

nums[right] = nums[left];

}

nums[left] = pivot;

return left;

}

9.3.3 归并排序
9.3.3 Merge Sort

归并排序是建立在归并操作上的一种有效的排序算法。该算法是采用分治法(Divide and Conquer)的一个非常典型的应用。
Merge sort is an efficient sort algorithm based on merge operation. This algorithm is a very typical application of Divide and Conquer.

将已有序的子序列合并,得到完全有序的序列;即先使每个子序列有序,再使子序列段间有序。若将两个有序表合并成一个有序表,称为2-路归并。
The ordered subsequence is merged to obtain a completely ordered sequence, that is, each subsequence is ordered first, and then the subsequence segments are ordered. If two ordered tables are merged into one ordered table, it is called a 2-way merge.

归并排序的时间复杂度是O(nlogn)。代价是需要额外的内存空间。
The time complexity of merge sort is O(nlogn). The cost is extra memory space.

public void mergeSort(int[] nums, int start, int end){

if (start >= end ) return;

int mid = (start + end) / 2;

mergeSort(nums, start, mid);

mergeSort(nums, mid + 1, end);

merge(nums, start, mid, mid + 1, end);

}

private static void merge(int[] nums, int lstart, int lend, int rstart, int rend){

int[] result = new int[rend - lstart + 1];

int left = lstart;

int right = rstart;

int i = 0;

while (left <= lend && right <= rend){

if (nums[left] <= nums[right])

result[i++] = nums[left++];

else

result[i++] = nums[right++];

}

while (left <= lend)

result[i++] = nums[left++];

while (right <= rend)

result[i++] = nums[right++];

System.arraycopy(result, 0, nums, lstart, result.length);

}

9.3.4 遍历二叉树
9.3.4 Traversing Binary Trees

题目:求下面二叉树的各种遍历(前序、中序、后序、层次)
Problem: give the various traversals (pre-order, in-order, post-order, level-order) of the binary tree below.

中序遍历:即左-根-右遍历,对于给定的二叉树根,寻找其左子树;对于其左子树的根,再去寻找其左子树;递归遍历,直到寻找最左边的节点i,其必然为叶子,然后遍历i的父节点,再遍历i的兄弟节点。随着递归的逐渐出栈,最终完成遍历
In-order traversal: i.e. left-root-right traversal, for a given binary tree root, find its left subtree; for the root of its left subtree, then find its left subtree; recursively traverse until finding the leftmost node i, which must be a leaf, then traverse the parent node of i, and then traverse the sibling node of i. As the recursion gradually pops out of the stack, the traversal is finally completed

先序遍历:即根-左-右遍历
Pre-order traversal: root-left-right

后序遍历:即左-右-根遍历
Post-order traversal: left-right-root traversal

层序遍历:按照从上到下、从左到右的顺序,逐层遍历所有节点。
Level-order traversal: visit all nodes level by level, from top to bottom and left to right.
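A sketch of the four traversals on a simple node class; since the figure with the concrete tree is not reproduced here, the tree built in main is only illustrative.

import java.util.ArrayDeque;
import java.util.Queue;

public class TreeTraversal {
    static class Node {
        int val; Node left, right;
        Node(int val) { this.val = val; }
    }

    static void preOrder(Node n)  { if (n == null) return; System.out.print(n.val + " "); preOrder(n.left); preOrder(n.right); }
    static void inOrder(Node n)   { if (n == null) return; inOrder(n.left); System.out.print(n.val + " "); inOrder(n.right); }
    static void postOrder(Node n) { if (n == null) return; postOrder(n.left); postOrder(n.right); System.out.print(n.val + " "); }

    // Level order: visit nodes level by level using a queue.
    static void levelOrder(Node root) {
        if (root == null) return;
        Queue<Node> queue = new ArrayDeque<>();
        queue.offer(root);
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            System.out.print(n.val + " ");
            if (n.left != null) queue.offer(n.left);
            if (n.right != null) queue.offer(n.right);
        }
    }

    public static void main(String[] args) {
        Node root = new Node(1);
        root.left = new Node(2); root.right = new Node(3);
        root.left.left = new Node(4); root.left.right = new Node(5);
        preOrder(root);   System.out.println(); // 1 2 4 5 3
        inOrder(root);    System.out.println(); // 4 2 5 1 3
        postOrder(root);  System.out.println(); // 4 5 2 3 1
        levelOrder(root); System.out.println(); // 1 2 3 4 5
    }
}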

9.3.5 二分查找
9.3.5 Binary search

给定一个n个元素有序的(升序)整型数组nums和一个目标值target,写一个函数搜索nums中的target,如果目标值存在返回下标,否则返回-1
Given a sorted (ascending) integer array nums with n elements and a target value target, write a function that searches for target in nums and returns its index if it exists, or -1 otherwise.

二分查找也称折半查找(Binary Search),它是一种效率较高的查找方法,前提是数据结构必须先排好序,可以在对数时间复杂度内完成查找。
Binary search, also known as Binary Search, is a highly efficient search method, provided that the data structure must be sorted first and the search can be completed in logarithmic time complexity.

二分查找事实上采用的就是一种分治策略,它充分利用了元素间的次序关系,可在最坏的情况下用O(log n)完成搜索任务。
In fact, binary search is a divide-and-conquer strategy, which makes full use of the order relationship between elements and can complete the search task in O (log n) in the worst case.

/**

* @param a 要查找的有序int数组
* @param a Ordered int array to find

* @param key 要查找的数值元素
* @param key The numeric element to find

* @return 返回找到的元素下标;如果没有找到,返回-1
* @return Returns the index of the element found; if not found, returns-1

*/

public int binarySearch(int[] a, int key){

int low = 0;

int high = a.length - 1;

if ( key < a[low] || key > a[high] )

return -1;

while ( low <= high){

int mid = ( low + high ) / 2;

if( a[mid] < key)

low = mid + 1;

else if( a[mid] > key )

high = mid - 1;

else

return mid;

}

return -1;

}

9.4 小青蛙跳台阶
9.4 Frog jumping steps

题目:一只青蛙一次可以跳上1级台阶,也可以跳上2级台阶。求该青蛙上一个n级台阶总共有多少种跳法?
Problem: a frog can jump up either 1 step or 2 steps at a time. In how many different ways can the frog climb a staircase of n steps?
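The count follows the Fibonacci-style recurrence f(n) = f(n-1) + f(n-2) with f(1) = 1 and f(2) = 2, because the last jump is either 1 or 2 steps. A minimal iterative DP sketch:

public class FrogSteps {
    // Number of ways to climb n steps, jumping 1 or 2 steps at a time.
    public static long ways(int n) {
        if (n <= 2) return n;          // f(1)=1, f(2)=2
        long prev2 = 1, prev1 = 2;
        for (int i = 3; i <= n; i++) {
            long cur = prev1 + prev2;  // f(i) = f(i-1) + f(i-2)
            prev2 = prev1;
            prev1 = cur;
        }
        return prev1;
    }

    public static void main(String[] args) {
        System.out.println(ways(10)); // 89
    }
}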

9.5 最长回文子串
9.5 Longest Palindromic Substring

题目:给你一个字符串s,找到s中最长的回文子串。
Given a string s, find the longest palindrome substring in s.

实例:
Example:

输入:s = “babad”
Input: s = "babad"

输出:“bab”
Output: "bab"

解释:“aba”也是符合题意答案
Explanation: "aba" is also the answer to the question
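One common O(n²) approach is to expand around each of the 2n-1 possible centers; a sketch:

public class LongestPalindrome {
    public static String longestPalindrome(String s) {
        if (s == null || s.isEmpty()) return "";
        int start = 0, end = 0;
        for (int i = 0; i < s.length(); i++) {
            int len1 = expand(s, i, i);       // odd-length palindromes centered at i
            int len2 = expand(s, i, i + 1);   // even-length palindromes centered between i and i+1
            int len = Math.max(len1, len2);
            if (len > end - start + 1) {
                start = i - (len - 1) / 2;
                end = i + len / 2;
            }
        }
        return s.substring(start, end + 1);
    }

    // Expand outward while the two ends match; return the palindrome length.
    private static int expand(String s, int left, int right) {
        while (left >= 0 && right < s.length() && s.charAt(left) == s.charAt(right)) {
            left--; right++;
        }
        return right - left - 1;
    }

    public static void main(String[] args) {
        System.out.println(longestPalindrome("babad")); // "bab" ("aba" is also valid)
    }
}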

9.6 数字字符转化成IP
9.6 Converting a Digit String into IP Addresses

题目:现在有一个只包含数字的字符串,将该字符串转化成IP地址的形式,返回所有可能的情况。
Problem: given a string containing only digits, convert it into all possible valid IP addresses and return them.

例如:
For example:

给出的字符串为“25525511135
The given string is "25525511135"

返回["255.255.11.135", "255.255.111.35"](顺序没有关系)
Return to ["255.255.11.135", "255.255.111.35"](order doesn't matter)
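A backtracking sketch: split the string into 4 segments, each 1-3 digits, with value 0-255 and no leading zeros.

import java.util.ArrayList;
import java.util.List;

public class RestoreIp {
    public static List<String> restoreIpAddresses(String s) {
        List<String> result = new ArrayList<>();
        if (s.length() >= 4 && s.length() <= 12) {
            backtrack(s, 0, new ArrayList<>(), result);
        }
        return result;
    }

    private static void backtrack(String s, int start, List<String> parts, List<String> result) {
        if (parts.size() == 4) {
            if (start == s.length()) result.add(String.join(".", parts)); // used exactly all digits
            return;
        }
        for (int len = 1; len <= 3 && start + len <= s.length(); len++) {
            String seg = s.substring(start, start + len);
            if (seg.length() > 1 && seg.charAt(0) == '0') break;   // no leading zeros
            if (Integer.parseInt(seg) > 255) break;                // each segment must be 0..255
            parts.add(seg);
            backtrack(s, start + len, parts, result);
            parts.remove(parts.size() - 1);                        // undo the choice
        }
    }

    public static void main(String[] args) {
        System.out.println(restoreIpAddresses("25525511135")); // [255.255.11.135, 255.255.111.35]
    }
}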

9.7 最大公约数
9.7 Greatest Common Divisor
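The typical answer is Euclid's algorithm, gcd(a, b) = gcd(b, a mod b); a minimal sketch:

public class Gcd {
    // Euclidean algorithm: repeatedly replace (a, b) with (b, a % b) until b is 0.
    public static int gcd(int a, int b) {
        while (b != 0) {
            int t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(gcd(24, 36)); // 12
        // The LCM can then be derived as a / gcd(a, b) * b.
    }
}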

9.8 链表反转
9.8 Reversing a Linked List
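The typical answer is iterative reversal with three pointers, redirecting each next link; a sketch (the node class is defined locally for illustration):

public class ReverseList {
    static class ListNode {
        int val; ListNode next;
        ListNode(int val) { this.val = val; }
    }

    // Iterative reversal: O(n) time, O(1) extra space.
    public static ListNode reverse(ListNode head) {
        ListNode prev = null;
        ListNode cur = head;
        while (cur != null) {
            ListNode next = cur.next; // remember the rest of the list
            cur.next = prev;          // point the current node backwards
            prev = cur;
            cur = next;
        }
        return prev;                  // prev is the new head
    }

    public static void main(String[] args) {
        ListNode head = new ListNode(1);
        head.next = new ListNode(2);
        head.next.next = new ListNode(3);
        for (ListNode n = reverse(head); n != null; n = n.next) System.out.print(n.val + " "); // 3 2 1
    }
}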

9.9 数组寻找峰值
9.9 Finding a Peak Element in an Array
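A common answer (under LeetCode 162 semantics, where the elements outside the array are treated as negative infinity) is a binary search that always moves toward the rising side, finding some peak in O(log n); a sketch:

public class FindPeak {
    // Returns the index of a peak element (an element greater than its neighbors).
    public static int findPeak(int[] nums) {
        int left = 0, right = nums.length - 1;
        while (left < right) {
            int mid = left + (right - left) / 2;
            if (nums[mid] < nums[mid + 1]) {
                left = mid + 1;   // a peak must exist on the rising (right) side
            } else {
                right = mid;      // mid itself could be the peak
            }
        }
        return left;
    }

    public static void main(String[] args) {
        System.out.println(findPeak(new int[]{1, 2, 1, 3, 5, 6, 4})); // 5 (index of 6); index 1 is also a valid peak
    }
}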

第10章 场景题
Chapter 10: Scenario Questions

10.1 手写Flink的UV
10.1 Handwritten Flink UV

10.2 Flink的分组TopN
10.2 Grouped TopN in Flink

10.3 Spark的分组TopN
10.3 Grouped TopN in Spark

1)方法1:
Method 1:

(1)按照key对数据进行聚合(groupByKey)
(1) Aggregate data by key (groupByKey)

(2)将value转换为数组,利用scala的sortBy或者sortWith进行排序(mapValues)数据量太大,会OOM。
(2) Convert the values to an array and sort them with Scala's sortBy or sortWith inside mapValues; if a single key has too much data, this will cause OOM (see the sketch after method 3).

2)方法2:
Method 2:

(1)取出所有的key
(1) Collect all the distinct keys

(2)对key进行迭代,每次取出一个key利用spark的排序算子进行排序
(2) Iterative over keys, taking out one key at a time and sorting using spark's sorting operator

方法3:
Method 3:

(1)自定义分区器,按照key进行分区,使不同的key进到不同的分区
(1) Custom partition, partition according to key, so that different keys enter different partitions

(2)对每个分区运用spark的排序算子进行排序
(2) Sort each partition using Spark's sort operator
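A minimal Java sketch of method 1 above (the original describes it in Scala terms; here Java's Comparator stands in for sortBy/sortWith, and the key/value types and N are illustrative assumptions):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;

// groupByKey, then sort each key's values in memory and keep the top N.
// Works only if one key's values fit in executor memory; otherwise it will OOM, as noted above.
public static JavaPairRDD<String, List<Integer>> groupTopN(JavaPairRDD<String, Integer> pairs, int n) {
    return pairs
            .groupByKey()
            .mapValues(values -> {
                List<Integer> list = new ArrayList<>();
                values.forEach(list::add);
                list.sort(Comparator.reverseOrder());  // descending order
                return new ArrayList<>(list.subList(0, Math.min(n, list.size())));
            });
}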

10.4 如何快速从40亿条数据中快速判断,数据123是否存在
10.4 How to quickly determine whether the value 123 exists among 4 billion records

10.5 给你100G数据,1G内存,如何排序?
10.5 Given 100 GB of data and 1 GB of memory, how do you sort it?

10.6 公平调度器容器集中在同一个服务器上?
10.6 Fair Scheduler containers concentrated on the same server?

10.7 匹马赛跑,1个赛道,每次5匹进行比赛,无法对每次比赛计时,但知道每次比赛结果的先后顺序,最少赛多少次可以找出前三名?
10.7 A horse race: one track, five horses per race; the races cannot be timed, but the finishing order of each race is known. What is the minimum number of races needed to find the top three horses?

10.8 给定一个点、一条线、一个三角形、一个有向无环图,请用java面向对象的思想进行建模
10.8 Given a point, a line, a triangle, and a directed acyclic graph, model them using Java object-oriented design.
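A minimal Java sketch of one possible modeling (class names, fields, and the generic DAG are illustrative choices, not from the original):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A point in the plane
class Point {
    final double x;
    final double y;
    Point(double x, double y) { this.x = x; this.y = y; }
}

// A line segment defined by two points
class Line {
    final Point start;
    final Point end;
    Line(Point start, Point end) { this.start = start; this.end = end; }
}

// A triangle composed of three points (equivalently, three lines)
class Triangle {
    final Point a, b, c;
    Triangle(Point a, Point b, Point c) { this.a = a; this.b = b; this.c = c; }
}

// A directed acyclic graph stored as an adjacency list
class Dag<T> {
    private final Map<T, List<T>> adjacency = new HashMap<>();

    void addVertex(T v) {
        adjacency.putIfAbsent(v, new ArrayList<>());
    }

    // The caller must not introduce cycles (or should validate with a topological sort)
    void addEdge(T from, T to) {
        addVertex(from);
        addVertex(to);
        adjacency.get(from).add(to);
    }
}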

10.9 现场出了一道sql题,让说出sql的优化,优化后效率提升了多少
10.9 An SQL question was given on the spot: explain how you would optimize the SQL and how much the optimization improved efficiency.

SELECT 2d FROM t_order WHERE 2d IN (SELECT 2d FROM t_order_f)

对于这条 SQL 语句,可以使用内连接(INNER JOIN)来代替子查询(IN)。这通常可以提高查询性能,因为内连接在大多数数据库系统中的性能优化更为成熟。以下是优化后的 SQL 语句:
For this SQL statement, an inner join (INNER JOIN) can be used instead of the subquery (IN). This usually improves query performance, because inner-join optimization is more mature in most database systems. The optimized SQL statement is:

SELECT t1.2d
FROM t_order t1
INNER JOIN t_order_f t2 ON t1.2d = t2.2d

优化后,如何判断效率提升了多少?
After the optimization, how do you judge how much efficiency has improved?

查看执行时间:执行优化前后的 SQL 语句,比较它们的执行时间。执行时间的减少表示性能得到了提升。
Compare execution times: run the SQL statement before and after the optimization and compare how long each takes; a shorter execution time indicates improved performance.

11HQL场景题
Chapter 11: The Last Day

尚大自研刷题网站的网址:http://forum.atguigu.cn/interview.html
URL of Shang Silicon Valley's self-developed practice website: http://forum.atguigu.cn/interview.html

HQL刷题模块,刷分到1000分以上。
In the HQL practice module, aim for a score above 1000 points.

第12章 面试说明
Chapter 12 Interview Notes

12.1 面试过程最关键的是什么?
12.1 What is the most important part of the interview process?

(1)大大方方的聊,放松
(1) Talk openly and confidently; stay relaxed

(2)体现优势,避免劣势
(2) Highlight your strengths and avoid exposing your weaknesses

12.2 面试时该怎么说?
12.2 What should I say during the interview?

1)语言表达清楚
1) Express yourself clearly

(1)思维逻辑清晰,表达流畅
(1) Think logically and speak fluently

(2)一二三层次表达
(2) Structure your answers in clear points (first, second, third)

2)所述内容不犯错
2) Don't make mistakes in what you say

(1)不说前东家或者自己的坏话
(1) Do not speak ill of your former employer or yourself

(2)往自己擅长的方面说
(2) Say what you are good at

(3)实质:对考官来说,内容听过,就是自我肯定;没听过,那就是个学习的过程。
(3) In essence, from the interviewer's perspective: if they have heard the content before, it reaffirms what they know; if not, it is a learning process for them.

12.3 面试技巧
12.3 Interview techniques

12.3.1 六个常见问题
12.3.1 Six Frequently Asked Questions

1)你的优点是什么?
1) What are your strengths?

大胆的说出自己各个方面的优势和特长
Confidently state your strengths and specialties in every respect

2)你的缺点是什么?
2) What are your weaknesses?

不要谈自己真实问题;用“缺点”衬托自己的优点
Don't reveal your real problems; use a "weakness" that actually sets off your strengths.

3)你的离职原因是什么?
3) What is your reason for leaving?

不说前东家坏话,哪怕被伤过
Don't speak ill of your former employer, even if you've been hurt

合情合理合法
Keep the reason reasonable, sensible, and legitimate

不要说超过1个以上的原因
Don't give more than one reason.

4)您对薪资的期望是多少?
4) What are your salary expectations?

不深谈薪资
Don't go into detail about salary

只说区间,不说具体数字
Give only a range, not a specific number.

底线是不低于当前薪资
The bottom line is no less than the current salary

非要具体数字,区间取中间值,或者当前薪资的+20%
If a specific number is required, take the midpoint of the range, or your current salary plus 20%

5)您还有什么想问的问题?
5) Do you have any other questions?

这是体现个人眼界和层次的问题
This question reflects your personal vision and level.

问题本身不在于面试官想得到什么样的答案,而在于你跟别的应聘者的对比
The point is not what answer the interviewer wants, but how you compare with the other candidates.

标准答案:
Standard answer:

公司希望我入职后的3-6个月内,给公司解决什么样的问题?
What kinds of problems does the company expect me to solve within 3 to 6 months of joining?

公司(或者这个部门)未来的战略规划是什么样子的?
What is the future strategic plan for the company (or for this department)?

你现在对我的了解,您觉得我需要多长时间融入公司?
Based on what you know of me now, how long do you think it will take me to integrate into the company?

6)最快多长时间能入职?
6) How soon can you start?

一周左右,如果公司需要,可以适当提前
About a week; if the company needs me sooner, I can start earlier.

12.3.2 两个注意事项
12.3.2 Two considerations

(1)职业化的语言
(1) Professional language

(2)职业化的形象
(2) Professional image

12.3.3 自我介绍
12.3.3 Self-introduction

1)个人基本信息
1) Basic personal information

2)工作履历
2) Work experience

时间、公司名称、任职岗位、主要工作内容、工作业绩、离职原因
Duration, company name, position, main responsibilities, achievements, and reason for leaving.