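Hadoop: The Definitive Guide (English Reprint Edition) (Second Edition, Revised)
Basic Information
- Original title: Hadoop: The Definitive Guide
- Original publisher: O'Reilly Media
Editor's Recommendation
Discover how Apache Hadoop can unleash the power of your data.
The Hadoop framework is an open source implementation of the MapReduce algorithm, an important cornerstone on which Google built its empire.
Programmers can explore how to analyze massive datasets, and administrators can learn how to set up and run Hadoop clusters.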
Description
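Discover how Apache Hadoop can unleash the power of your data. This comprehensive book shows how to build and maintain reliable, scalable distributed systems with the Hadoop framework. The Hadoop framework is an open source implementation of the MapReduce algorithm, an important cornerstone on which Google built its empire. Programmers can explore how to analyze massive datasets, and administrators can learn how to set up and run Hadoop clusters.
Hadoop: The Definitive Guide (English Reprint Edition) (Second Edition, Revised) covers recent changes to Hadoop, including new features such as Hive, Sqoop, and Avro. It also provides case studies that show how Hadoop is used to solve specific problems. Looking to get the most out of your data? This is your book.
·Use the Hadoop Distributed File System (HDFS) to store massive datasets, and run distributed computations over those datasets with MapReduce
·Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
·Discover common pitfalls and advanced features for writing real-world MapReduce programs
·Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
·Use Pig, a high-level query language for large-scale data processing
·Use Hive, Hadoop's data warehouse system, to analyze datasets
·Take advantage of HBase, Hadoop's database for structured and semi-structured data
·Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems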
About the Author
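Tom White has been an Apache Hadoop committer since 2007. He is a member of the Apache Software Foundation and an engineer at Cloudera. Tom has written for oreilly.com, java.net, and IBM's developerWorks, and speaks at industry conferences.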
Table of Contents
Hadoop: The Definitive Guide (English Reprint Edition) (Second Edition, Revised)
Foreword
Preface
1. Meet Hadoop
Data!
Data Storage and Analysis
Comparison with Other Systems
RDBMS
Grid Computing
Volunteer Computing
A Brief History of Hadoop
Apache Hadoop and the Hadoop Ecosystem
2. MapReduce
A Weather Dataset
Data Format
Analyzing the Data with Unix Tools
Analyzing the Data with Hadoop
Map and Reduce
Java MapReduce
Scaling Out
Preface
Martin Gardner, the mathematics and science writer, once said in an interview: "Beyond calculus, I am lost. That was the secret of my column's success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand."
In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.
But it doesn't need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems--for data storage, data analysis, and coordination--are simple. If there's a common theme, it is about raising the level of abstraction--to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don't have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.
With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.
The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I'm looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.
Administrative Notes
During discussion of a particular Java class in the text, I often omit its package name, to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop's Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/. Or if you're using an IDE, it can help using its auto-complete mechanism.
Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).
The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com/. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.
What's in This Book?
The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop and sketches the history of the project. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.
The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining data.
Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.
Later chapters are dedicated to projects that build on Hadoop or are related to it. Chapters 11 and 12 present Pig and Hive, which are analytics platforms built on HDFS and MapReduce, whereas Chapters 13, 14, and 15 cover HBase, ZooKeeper, and Sqoop, respectively.
Finally, Chapter 16 is a collection of case studies contributed by members of the Apache Hadoop community.
What's New in the Second Edition?
The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a new section covering Avro (in Chapter 4), an introduction to the new security features in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs using Hadoop (in Chapter 16).
This edition continues to describe the 0.20 release series of Apache Hadoop, since this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced in.
Conventions Used in This Book
The following typographical conventions are used in this book:
Foreword
Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They'd devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web's massive scale, we'd need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.
Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.
In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he'd written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.
From the beginning, Tom's contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.
Initially, Tom specialized in making Hadoop run well on Amazon's EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.
Tom is now a respected senior member of the Hadoop developer community. Though he's an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.
Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master--not only of the technology, but also of common sense and plain talk.
--Doug Cutting
Shed in the Yard, California
Media Reviews
"Now you have the opportunity to learn about Hadoop from a master--not only of the technology, but also of common sense and plain talk."
--Doug Cutting, Cloudera