### Basic Information

- Original title: Data Mining: Concepts and Techniques, Third Edition
- Original publisher: Morgan Kaufmann

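- Authors: **Jiawei Han (USA)**, **Micheline Kamber (Canada)**, **Jian Pei**
- Series: **Classic Original Editions Library (经典原版书库)**
- Publisher: China Machine Press (机械工业出版社)
- ISBN: **9787111374312**
- Listed: 2012-3-27
- Publication date: March 2012
- Format: 16开 (16mo)
- Pages: 703
- Edition: 3-1
- Category: Computers > Databases > Database Storage and Management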

Textbook

### Editor's Recommendation

The most complete and comprehensive account of the important knowledge and technical innovations in the field of data mining

Essential reading for researchers and developers working in data mining and knowledge discovery

### About the Book

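The explosive growth of data in contemporary business and science demands more sophisticated and refined tools for data analysis, processing, and mining. Although recent advances in data mining technology have made large-scale data collection ever easier, the technology still struggles to keep pace with explosive data growth and the processing demands that come with it. We therefore need new techniques and automated tools more urgently than ever to help turn this data into useful information and knowledge.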

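The previous edition of this book was voted the most popular data mining book by KDnuggets readers, and it is a highly readable textbook. From a database perspective, it gives a comprehensive and systematic introduction to the concepts, methods, and techniques of data mining and to research progress in the field, with particular attention to important recent topics: data warehouse and data cube technology, stream data mining, social network mining, and the mining of spatial, multimedia, and other complex data. Each chapter is a stand-alone guide to a key topic, presenting the best algorithms along with field-tested rules of thumb for applying the techniques in real work. If you want to master and apply today's most powerful data mining techniques, this book is the resource you need. *Data Mining: Concepts and Techniques (English Edition, 3rd Edition)* is required reading for all teachers, researchers, developers, and users in the field of data mining and knowledge discovery.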

Features of *Data Mining: Concepts and Techniques (English Edition, 3rd Edition)*:

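Introduces many algorithms and implementation examples, all written in easy-to-understand pseudocode and suitable for real-world, large-scale data mining projects.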

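Discusses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in other domains.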

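Gives comprehensive, practical coverage of the concepts and techniques for extracting as much information as possible from massive data sets.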

### About the Authors

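Micheline Kamber holds a master's degree in computer science from Concordia University, Canada. She is an NSERC Scholar and currently conducts research at McGill University and Simon Fraser University in Canada, and in Switzerland.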

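Jian Pei (裴健) is currently an associate professor in the School of Computing Science at Simon Fraser University, Canada. In 2002 he received his Ph.D. from Simon Fraser University under the supervision of Professor Jiawei Han.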

### Table of Contents

Foreword to Second Edition

Preface

Acknowledgments

About the Authors

Chapter 1 Introduction

1.1 Why Data Mining?

1.2 What Is Data Mining?

1.3 What Kinds of Data Can Be Mined?

1.4 What Kinds of Patterns Can Be Mined?

1.5 Which Technologies Are Used?

1.6 Which Kinds of Applications Are Targeted?

1.7 Major Issues in Data Mining

1.8 Summary

1.9 Exercises

1.10 Bibliographic Notes

Chapter 2 Getting to Know Your Data

2.1 Data Objects and Attribute Types

2.2 Basic Statistical Descriptions of Data

2.3 Data Visualization

### Preface

This book explores the concepts and techniques of knowledge discovery and data mining. As a multidisciplinary field, data mining draws on work from areas including statistics, machine learning, pattern recognition, database technology, information retrieval, network science, knowledge-based systems, artificial intelligence, high-performance computing, and data visualization. We focus on issues relating to the feasibility, usefulness, effectiveness, and scalability of techniques for the discovery of patterns hidden in large data sets. As a result, this book is not intended as an introduction to statistics, machine learning, database systems, or other such areas, although we do provide some background knowledge to facilitate the reader's comprehension of their respective roles in data mining. Rather, the book is a comprehensive introduction to data mining. It is useful for computing science students, application developers, and business professionals, as well as researchers involved in any of the disciplines previously listed.

Data mining emerged during the late 1980s, made great strides during the 1990s, and continues to flourish into the new millennium. This book presents an overall picture of the field, introducing interesting data mining techniques and systems and discussing applications and research directions. An important motivation for writing this book was the need to build an organized framework for the study of data mining--a challenging task, owing to the extensive multidisciplinary nature of this fast-developing field. We hope that this book will encourage people with different backgrounds and experiences to exchange their views regarding data mining so as to contribute toward the further promotion and shaping of this exciting and dynamic field.

Organization of the Book

Since the publication of the first two editions of this book, great progress has been made in the field of data mining. Many new data mining methodologies, systems, and applications have been developed, especially for handling new kinds of data, including information networks, graphs, complex structures, and data streams, as well as text, Web, multimedia, time-series, and spatiotemporal data. Such fast development and rich, new technical contents make it difficult to cover the full spectrum of the field in a single book. Instead of continuously expanding the coverage of this book, we have decided to cover the core material in sufficient scope and depth, and leave the handling of complex data types to a separate forthcoming book.

The third edition substantially revises the first two editions of the book, with numerous enhancements and a reorganization of the technical contents. The core technical material, which handles mining on general data types, is expanded and substantially enhanced. Several individual chapters for topics from the second edition (e.g., data preprocessing, frequent pattern mining, classification, and clustering) are now augmented and each split into two chapters for this new edition. For these topics, one chapter encapsulates the basic concepts and techniques while the other presents advanced concepts and methods.

Chapters from the second edition on mining complex data types (e.g., stream data, sequence data, graph-structured data, social network data, and multirelational data, as well as text, Web, multimedia, and spatiotemporal data) are now reserved for a new book that will be dedicated to advanced topics in data mining. Still, to support readers in learning such advanced topics, we have placed an electronic version of the relevant chapters from the second edition onto the book's web site as companion material for the third edition.

The chapters of the third edition are described briefly as follows, with emphasis on the new material.

Chapter 1 provides an introduction to the multidisciplinary field of data mining. It discusses the evolutionary path of information technology, which has led to the need for data mining, and the importance of its applications. It examines the data types to be mined, including relational, transactional, and data warehouse data, as well as complex data types such as time-series, sequences, data streams, spatiotemporal data, multimedia data, text data, graphs, social networks, and Web data. The chapter presents a general classification of data mining tasks, based on the kinds of knowledge to be mined, the kinds of technologies used, and the kinds of applications that are targeted. Finally, major challenges in the field are discussed.

Chapter 2 introduces the general data features. It first discusses data objects and attribute types and then introduces typical measures for basic statistical data descriptions. It overviews data visualization techniques for various kinds of data. In addition to methods of numeric data visualization, methods for visualizing text, tags, graphs, and multidimensional data are introduced. Chapter 2 also introduces ways to measure similarity and dissimilarity for various kinds of data.

Chapter 3 introduces techniques for data preprocessing. It first introduces the concept of data quality and then discusses methods for data cleaning, data integration, data reduction, data transformation, and data discretization.

Chapters 4 and 5 provide a solid introduction to data warehouses, OLAP (online analytical processing), and data cube technology. Chapter 4 introduces the basic concepts, modeling, design architectures, and general implementations of data warehouses and OLAP, as well as the relationship between data warehousing and other data generalization methods. Chapter 5 takes an in-depth look at data cube technology, presenting a detailed study of methods of data cube computation, including Star-Cubing and high-dimensional OLAP methods. Further explorations of data cube and OLAP technologies are discussed, such as sampling cubes, ranking cubes, prediction cubes, multifeature cubes for complex analysis queries, and discovery-driven cube exploration.

Chapters 6 and 7 present methods for mining frequent patterns, associations, and correlations in large data sets. Chapter 6 introduces fundamental concepts, such as market basket analysis, with many techniques for frequent itemset mining presented in an organized way. These range from the basic Apriori algorithm and its variations to more advanced methods that improve efficiency, including the frequent pattern growth approach, frequent pattern mining with vertical data format, and mining closed and max frequent itemsets. The chapter also discusses pattern evaluation methods and introduces measures for mining correlated patterns. Chapter 7 is on advanced pattern mining methods. It discusses methods for pattern mining in multilevel and multidimensional space, mining rare and negative patterns, mining colossal patterns and high-dimensional data, constraint-based pattern mining, and mining compressed or approximate patterns. It also introduces methods for pattern exploration and application, including semantic annotation of frequent patterns.
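To give a flavor of the frequent itemset mining mentioned above, the level-wise idea behind Apriori can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the book (which presents its algorithms in pseudocode): the function name `apriori`, the `min_support` parameter (taken here as an absolute transaction count), and the sample baskets are all assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for every frequent itemset.

    min_support is an absolute count of transactions (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count support of the surviving candidates with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data, made up for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

The prune step relies on the property that every subset of a frequent itemset must also be frequent, so a candidate with any infrequent subset can be discarded before its support is counted.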

Chapters 8 and 9 describe methods for data classification. Due to the importance and diversity of classification methods, the contents are partitioned into two chapters. Chapter 8 introduces basic concepts and methods for classification, including decision tree induction, Bayes classification, and rule-based classification. It also discusses model evaluation and selection methods and methods for improving classification accuracy, including ensemble methods and how to handle imbalanced data. Chapter 9 discusses advanced methods for classification, including Bayesian belief networks, the neural network technique of backpropagation, support vector machines, classification using frequent patterns, k-nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough set theory, and fuzzy set approaches. Additional topics include multiclass classification, semi-supervised classification, active learning, and transfer learning.

Cluster analysis forms the topic of Chapters 10 and 11. Chapter 10 introduces the basic concepts and methods for data clustering, including an overview of basic cluster analysis methods, partitioning methods, hierarchical methods, density-based methods, and grid-based methods. It also introduces methods for the evaluation of clustering. Chapter 11 discusses advanced methods for clustering, including probabilistic model-based clustering, clustering high-dimensional data, clustering graph and network data, and clustering with constraints.

Chapter 12 is dedicated to outlier detection. It introduces the basic concepts of outliers and outlier analysis and discusses various outlier detection methods from the view of degree of supervision (i.e., supervised, semi-supervised, and unsupervised methods), as well as from the view of approaches (i.e., statistical methods, proximity-based methods, clustering-based methods, and classification-based methods). It also discusses methods for mining contextual and collective outliers, and for outlier detection in high-dimensional data.

Finally, in Chapter 13, we discuss trends, applications, and research frontiers in data mining. We briefly cover mining complex data types, including mining sequence data (e.g., time series, symbolic sequences, and biological sequences), mining graphs and networks, and mining spatial, multimedia, text, and Web data. In-depth treatment of data mining methods for such data is left to a book on advanced topics in data mining, the writing of which is in progress. The chapter then moves ahead to cover other data mining methodologies, including statistical data mining, foundations of data mining, visual and audio data mining, as well as data mining applications. It discusses data mining for financial data analysis, for industries like retail and telecommunication, for use in science and engineering, and for intrusion detection and prevention. It also discusses the relationship between data mining and recommender systems. Because data mining is present in many aspects of daily life, we discuss issues regarding data mining and society, including ubiquitous and invisible data mining, as well as privacy, security, and the social impacts of data mining. We conclude our study by looking at data mining trends.

Throughout the text, italic font is used to emphasize terms that are defined, while bold font is used to highlight or summarize main ideas. Sans serif font is used for reserved words. Bold italic font is used to represent multidimensional quantities.

This book has several strong features that set it apart from other texts on data mining. It presents a very broad yet in-depth coverage of the principles of data mining. The chapters are written to be as self-contained as possible, so they may be read in order of interest by the reader. Advanced chapters offer a larger-scale view and may be considered optional for interested readers. All of the major methods of data mining are presented. The book presents important topics in data mining regarding multidimensional OLAP analysis, which is often overlooked or minimally treated in other data mining books. The book also maintains web sites with a number of online resources to aid instructors, students, and professionals in the field. These are described further in the following.

To the Instructor

This book is designed to give a broad, yet detailed overview of the data mining field. It can be used to teach an introductory course on data mining at an advanced undergraduate level or at the first-year graduate level. Sample course syllabi are provided on the book's web sites (www.cs.uiuc.edu/~hanj/bk3 and www.booksite.mkp.com/datamining3e) in addition to extensive teaching resources such as lecture slides, instructors' manuals, and reading lists (see p. xiv).

### Media Reviews

-- Gregory Piatetsky-Shapiro, President of KDnuggets

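The textbook by Jiawei, Micheline, and Jian gives a panoramic treatment of all the relevant methods in data mining, from the classic topics of clustering and classification, to database methods (association rules, data cubes), to newer and more advanced topics (SVD/PCA, wavelets, support vector machines), and more. Overall, it is an excellent book that covers both classic data mining methods and a wide range of contemporary techniques. It serves well as a textbook for teaching and learning, and it is a valuable reference for professionals.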

-- From the foreword to this book by Professor Christos Faloutsos, Carnegie Mellon University