Chapter 1 Introduction ············ 1
1.1 Basic Concepts of Data Mining ············ 1
1.1.1 The Concept of Data Mining ············ 1
1.1.2 Data Mining in the Big Data Environment ············ 2
1.1.3 Characteristics of Data Mining ············ 3
1.1.4 The Data Mining Process ············ 3
1.2 Origins and History of Data Mining ············ 4
1.3 Common Data Mining Tools ············ 7
1.3.1 Commercial Tools ············ 7
1.3.2 Open-Source Tools ············ 8
1.4 Application Scenarios of Data Mining ············ 10
Exercises ············ 12
References ············ 13
Chapter 2 Data Preprocessing and Similarity ············ 14
2.1 Data Types ············ 14
2.1.1 Attributes and Measurement ············ 14
2.1.2 Types of Data Sets ············ 15
2.2 Data Preprocessing ············ 16
2.2.1 Data Cleaning ············ 16
2.2.2 Data Integration ············ 18
2.2.3 Data Normalization ············ 19
2.2.4 Data Reduction ············ 20
2.2.5 Data Discretization ············ 22
2.3 Data Similarity ············ 23
2.3.1 Similarity Measures for Numeric Attributes ············ 23
2.3.2 Similarity Measures for Nominal Attributes ············ 26
2.3.3 Similarity Measures for Combined Heterogeneous Attributes ············ 27
2.3.4 Document Similarity Measures ············ 28
2.3.5 Similarity Measures for Discrete Sequences ············ 30
Exercises ············ 31
References ············ 32
Chapter 3 Classification ············ 33
3.1 Basic Concepts of Classification, the Classification Process, and Classifier Performance Evaluation ············ 33
3.1.1 Basic Concepts of Classification ············ 33
3.1.2 The Classification Process ············ 33
3.1.3 Methods for Evaluating Classifier Performance ············ 34
3.2 Decision Trees ············ 35
3.2.1 Overview of Decision Trees ············ 35
3.2.2 Uses and Characteristics of Decision Trees ············ 35
3.2.3 How Decision Trees Work ············ 36
3.2.4 Steps for Building a Decision Tree ············ 37
3.2.5 Principles of Decision Tree Algorithms ············ 38
3.3 Bayesian Classification ············ 47
3.3.1 Bayes' Theorem ············ 47
3.3.2 Principle and Workflow of Naive Bayes Classification ············ 48
3.3.3 Bayesian Analysis ············ 51
3.3.4 Bayesian Decision ············ 52
3.4 Support Vector Machines ············ 52
3.4.1 Main Ideas of Support Vector Machines ············ 53
3.4.2 Basic Theory of Support Vector Machines ············ 53
3.4.3 Principles of Support Vector Machines ············ 58
3.5 Hands-On: Implementing Decision Tree Algorithms in Weka ············ 62
3.5.1 The Weka Explorer Graphical User Interface ············ 62
3.5.2 Implementation of Decision Tree Algorithms in Weka ············ 62
3.5.3 A Concrete Usage Example ············ 65
Exercises ············ 66
References ············ 67
Chapter 4 Regression ············ 69
4.1 Basic Concepts of Regression ············ 69
4.1.1 Definition of Regression Analysis ············ 69
4.1.2 Steps of Regression Analysis ············ 70
4.1.3 Issues to Note in Regression Analysis ············ 70
4.2 Simple Regression Analysis ············ 71
4.2.1 Model Specification for Simple Regression Analysis ············ 71
4.2.2 Parameter Estimation for the Simple Linear Regression Model ············ 73
4.2.3 Statistical Properties of OLS Estimators under the Basic Assumptions ············ 74
4.2.4 Estimating the Error Variance ············ 75
4.2.5 Testing Regression Coefficients (t-Test) ············ 76
4.2.6 Goodness of Fit and Model Testing (F-Test) ············ 77
4.3 Multiple Linear Regression Analysis ············ 78
4.3.1 The Multiple Linear Regression Model ············ 78
4.3.2 Assumptions of the Multiple Linear Regression Model ············ 79
4.3.3 Parameter Estimation for the Multiple Linear Regression Model ············ 80
4.3.4 Significance Tests ············ 82
4.3.5 Regression Variable Selection and Stepwise Regression ············ 84
4.4 Logistic Regression Analysis ············ 86
4.4.1 The Logistic Regression Model ············ 86
4.4.2 The Logit Transformation ············ 87
4.4.3 The Logistic Distribution ············ 88
4.4.4 Logistic Regression Models for Contingency Tables ············ 88
4.5 Other Regression Analyses ············ 89
4.5.1 Polynomial Regression ············ 89
4.5.2 Stepwise Regression ············ 90
4.5.3 Ridge Regression ············ 90
4.5.4 Lasso Regression ············ 91
4.5.5 Elastic Net (ElasticNet) ············ 92
4.6 Hands-On: Pricing Your Own House with Regression Analysis ············ 92
4.6.1 Building a Data Set for Weka ············ 92
4.6.2 Loading the Data into Weka ············ 93
4.6.3 Creating a Regression Model with Weka ············ 94
4.6.4 Analysis of Results ············ 95
Exercises ············ 96
References ············ 97
Chapter 5 Clustering ············ 98
5.1 Basic Concepts of Clustering ············ 98
5.2 Partitioning Methods ············ 100
5.2.1 The k-Means Algorithm ············ 101
5.2.2 The k-Medoids Algorithm ············ 103
5.3 Hierarchical Methods ············ 106
5.3.1 Categories of Hierarchical Methods ············ 106
5.3.2 The BIRCH Algorithm ············ 109
5.4 Density-Based Methods ············ 112
5.5 Hands-On: Cluster Analysis ············ 115
5.5.1 Background and Clustering Goals ············ 115
5.5.2 The Clustering Process ············ 116
5.5.3 Analysis of Clustering Results ············ 120
Exercises ············ 122
References ············ 123
Chapter 6 Association Rules ············ 124
6.1 Basic Concepts ············ 124
6.1.1 Market Basket Analysis: The Classic Beer-and-Diapers Case ············ 124
6.1.2 The Concept of Association Rules ············ 124
6.1.3 Frequent Itemset Generation ············ 128
6.2 The Apriori Algorithm: Finding Frequent Itemsets by Restricting Candidate Generation ············ 128
6.2.1 Frequent Itemset Generation in the Apriori Algorithm ············ 128
6.2.2 Description of the Apriori Algorithm ············ 131
6.3 The FP-growth Algorithm ············ 134
6.3.1 Constructing the FP-Tree ············ 134
6.3.2 Mining the FP-Tree ············ 136
6.3.3 The FP-Tree Algorithm ············ 138
6.4 Other Association Rule Algorithms ············ 139
6.4.1 Constraint-Based Association Rule Algorithms ············ 139
6.4.2 Incremental Association Rule Algorithms ············ 140
6.4.3 Multilevel Association Rule Algorithms ············ 141
6.5 Hands-On: Mining Association Rules for Personal Credit ············ 143
6.5.1 Background and Mining Goals ············ 143
6.5.2 Analysis Methods and Process ············ 144
6.5.3 Summary ············ 148
Exercises ············ 148
References ············ 149
Chapter 7 Optimizing and Improving Common Big Data Mining Algorithms ············ 151
7.1 Classification Algorithms ············ 151
7.1.1 Parallelizing Classification Algorithms ············ 151
7.1.2 Optimizing Parallel Decision Tree Algorithms ············ 154
7.1.3 A New Method for Improving Naive Bayes ············ 158
7.1.4 Parallel Optimization of Support Vector Machines ············ 160
7.2 Clustering Algorithms ············ 161
7.2.1 Main Research Topics and Algorithm Applications in Cluster Analysis ············ 162
7.2.2 Parallel Clustering Techniques, Algorithm Architectures, and Models ············ 163
7.2.3 An Improved k-means Clustering Method ············ 164
7.2.4 Design and Implementation of Parallel k-means on Spark ············ 166
7.2.5 Parallelizing the Improved k-means Algorithm on Spark ············ 168
7.2.6 Parallelizing Clustering Algorithms with MapReduce ············ 170
7.2.7 Parallelizing Spectral Clustering ············ 171
7.3 Association Rules ············ 173
7.3.1 An Improved Apriori Method ············ 173
7.3.2 A Distributed Implementation of Apriori on Spark ············ 176
7.3.3 Research on Parallel FP-growth Association Rule Algorithms ············ 177
7.3.4 A Parallel Implementation of FP-growth on Spark ············ 179
Exercises ············ 183
References ············ 183
Chapter 8 Recommender Systems ············ 186
8.1 Recommender System Concepts ············ 186
8.1.1 Basic Concepts ············ 186
8.1.2 History ············ 187
8.1.3 Evaluation Metrics for Recommender Systems ············ 188
8.2 Content-Based Recommendation ············ 192
8.2.1 Item Representation ············ 193
8.2.2 Item Similarity ············ 196
8.2.3 User Ratings of Items ············ 197
8.2.4 Recommendation Based on the Vector Space Model ············ 198
8.3 Collaborative Filtering ············ 201
8.3.1 Basic Concepts of Collaborative Filtering ············ 201
8.3.2 User-Based Collaborative Filtering ············ 205
8.3.3 Item-Based Collaborative Filtering ············ 207
8.3.4 Latent Factor Models and Matrix Factorization Models ············ 209
8.4 Other Recommendation Techniques ············ 217
8.5 Hands-On: Recommending Movies with Collaborative Filtering ············ 220
8.5.1 Data Preparation and Import ············ 221
8.5.2 Building the Matrix Factorization Model ············ 223
8.5.3 Recommendation Prediction and Validation ············ 225
Exercises ············ 227
References ············ 228
Chapter 9 Internet Data Mining ············ 232
9.1 Link Analysis and Web Page Ranking ············ 232
9.1.1 PageRank ············ 232
9.1.2 Fast Computation of PageRank ············ 238
9.1.3 Topic-Oriented PageRank ············ 239
9.1.4 Time Series Analysis ············ 239
9.2 Internet Information Extraction ············ 241
9.2.1 Overview ············ 241
9.2.2 Building Typical Application Models ············ 242
9.2.3 Analysis of Mining, Storage, and Network Technologies ············ 243
9.2.4 Data Collection Management ············ 243
9.2.5 Information Extraction Methods and Knowledge Discovery ············ 244
9.2.6 Industry Case Studies ············ 247
9.3 Log Mining and Query Analysis ············ 248
9.3.1 Overview ············ 248
9.3.2 Comparison of Common Mining and Analysis Methods and Tools ············ 249
9.3.3 Presenting and Analyzing the Massive-Data Mining Process ············ 250
9.3.4 Industry Application Examples ············ 251
Exercises ············ 252
References ············ 253
Appendix A Weka ············ 255
A.1 Introduction to Weka ············ 255
A.1.1 Overview ············ 255
A.1.2 Weka Data Format ············ 256
A.2 The Explorer Interface ············ 259
A.2.1 Data Preparation ············ 260
A.2.2 Data Loading ············ 260
A.2.3 Training and Model Evaluation ············ 261
A.2.4 Attribute Selection or Filtering ············ 264
A.2.5 Visualization ············ 271
A.3 The Knowledge Flow Interface ············ 273
A.3.1 Analysis of Interface Components ············ 273
A.3.2 Configuring and Connecting Components ············ 273
A.3.3 A Knowledge Flow Interface Example ············ 274
A.4 The Experimenter Interface ············ 276
A.4.1 An Experimenter Interface Example ············ 276
A.4.2 Simple Setup ············ 278
A.4.3 Advanced Setup ············ 280
A.4.4 Analyzing Experiment Results ············ 281
Exercises ············ 283
References ············ 284
Appendix B Spark MLlib ············ 285
B.1 Introduction to Spark ············ 285
B.1.1 The Spark Ecosystem ············ 285
B.1.2 Spark Cluster Architecture ············ 287
B.1.3 Spark Job Scheduling ············ 287
B.2 Spark RDD ············ 288
B.2.1 RDD Design Philosophy ············ 289
B.2.2 RDD Programming Interface ············ 290
B.2.3 RDD Operations ············ 292
B.3 Overview of Spark MLlib ············ 294
B.4 Spark MLlib Data Types ············ 295
B.4.1 Local Vectors ············ 295
B.4.2 Labeled Points ············ 296
B.4.3 Local Matrices ············ 297
B.5 Spark MLlib Algorithm Library ············ 298
B.5.1 Machine Learning Pipelines ············ 298
B.5.2 Feature Extraction and Transformation ············ 303
B.5.3 Classification and Regression ············ 309
B.5.4 Clustering ············ 312
B.5.5 Collaborative Filtering ············ 314
B.5.6 Model Selection and Tuning ············ 316
Exercises ············ 318
References ············ 319
Appendix C Artificial Intelligence and Big Data Lab Environment ············ 320