Introduction to Information Retrieval (written jointly by three leading experts in information retrieval) (English edition)
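Basic information
- Authors: (US) Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze
- Series: Turing Computer Science (图灵计算机科学)
- Publisher: Posts & Telecom Press (人民邮电出版社)
- ISBN: 9787115218247
- Listed on: 2010-1-8
- Publication date: January 2010
- Format: 16mo
- Pages: 482
- Edition: 1st edition, 1st printing
- Category: Computers > Information Systems > General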
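Editor's recommendations
Written jointly by three leading experts in information retrieval
Aims to present a modern approach to information retrieval from a computer science perspective
Fresh content and a distinctive selection of material, with a lively account of the fundamentals of information retrieval and where the field is heading
Every important idea in the book is explained with examples and illustrated with figures
Combines solid theoretical foundations with state-of-the-art technology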
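Synopsis
This book is a textbook on information retrieval that aims to present a modern approach to the subject from a computer science perspective. Starting from basic concepts, it covers web search, text classification, and text clustering, and gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents, of methods for evaluating such systems, and of the use of machine learning methods on text collections.
Every important idea in the book is explained with examples and illustrated with figures. The book is well suited as an introductory textbook for "information retrieval" courses for advanced undergraduates and graduate students in computer science and related fields, and it is equally suitable for researchers and professionals.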
About the authors
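Christopher D. Manning holds a PhD in linguistics from Stanford University and is currently an associate professor of computer science and linguistics at Stanford University. His main research areas are statistical natural language processing, information extraction and representation, text understanding, and text mining.
Prabhakar Raghavan holds a PhD from the University of California, Berkeley, and is currently head of Yahoo! Labs and a consulting professor in the Computer Science Department at Stanford University. He is a Fellow of the ACM and the IEEE. His main research interests are text and Web data mining and algorithm design. He was previously CTO of Verity and held management positions at IBM Research.
Hinrich Schütze holds a PhD from Stanford University and is currently in theoretical computational linguistics at the Institute for Natural Language Processing, University of Stuttgart ...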
Table of contents
table of notation
preface
1 boolean retrieval
1.1 an example information retrieval problem
1.2 a first take at building an inverted index
1.3 processing boolean queries
1.4 the extended boolean model versus ranked retrieval
1.5 references and further reading
2 the term vocabulary and postings lists
2.1 document delineation and character sequence decoding
2.2 determining the vocabulary of terms
2.3 faster postings list intersection via skip pointers
2.4 positional postings and phrase queries
2.5 references and further reading
3 dictionaries and tolerant retrieval
3.1 search structures for dictionaries
3.2 wildcard queries
3.3 spelling correction
3.4 phonetic correction
Preface
As recently as the 1990s, studies showed that most people preferred getting information from other people rather than from information retrieval (IR) systems. Of course, in that time period, most people also used human travel agents to book their travel. However, during the last decade, relentless optimization of information retrieval effectiveness has driven web search engines to new quality levels at which most people are satisfied most of the time, and web search has become a standard and often preferred source of information finding. For example, the 2004 Pew Internet Survey (Fallows 2004) found that "92% of Internet users say the Internet is a good place to go for getting everyday information." To the surprise of many, the field of information retrieval has moved from being a primarily academic discipline to being the basis underlying most people's preferred means of information access. This book presents the scientific underpinnings of this field, at a level accessible to graduate students as well as advanced undergraduates.
Information retrieval did not begin with the Web. In response to various challenges of providing information access, the field of IR evolved to give principled approaches to searching various forms of content. The field began with scientific publications and library records but soon spread to other forms of content, particularly those of information professionals, such as journalists, lawyers, and doctors. Much of the scientific research on IR has occurred in these contexts, and much of the continued practice of IR deals with providing access to unstructured information in various corporate and governmental domains, and this work forms much of the foundation of our book.
Nevertheless, in recent years, a principal driver of innovation has been the World Wide Web, unleashing publication at the scale of tens of millions of content creators. This explosion of published information would be moot if the information could not be found, annotated, and analyzed so that each user can quickly find information that is both relevant and comprehensive for their needs. By the late 1990s, many people felt that continuing to index the whole Web would rapidly become impossible, due to the Web's exponential growth in size. But major scientific innovations, superb engineering, the rapidly declining price of computer hardware, and the rise of a commercial underpinning for web search have all conspired to power today's major search engines, which are able to provide high-quality results within subsecond response times for hundreds of millions of searches a day over billions of web pages.
Book organization and course development
This book is the result of a series of courses we have taught at Stanford University and at the University of Stuttgart, in a range of durations including a single quarter, one semester, and two quarters. These courses were aimed at early stage graduate students in computer science, but we have also had enrollment from upper-class computer science undergraduates, as well as students from law, medical informatics, statistics, linguistics, and various engineering disciplines. The key design principle for this book, therefore, was to cover what we believe to be important in a one-term graduate course on IR. An additional principle is to build each chapter around material that we believe can be covered in a single lecture of 75 to 90 minutes.
The first eight chapters of the book are devoted to the basics of information retrieval and in particular the heart of search engines; we consider this material to be core to any course on information retrieval. Chapter 1 introduces inverted indexes and shows how simple Boolean queries can be processed using such indexes. Chapter 2 builds on this introduction by detailing the manner in which documents are preprocessed before indexing and by discussing how inverted indexes are augmented in various ways for functionality and speed. Chapter 3 discusses search structures for dictionaries and how to process queries that have spelling errors and other imprecise matches to the vocabulary in the document collection being searched. Chapter 4 describes a number of algorithms for constructing the inverted index from a text collection with particular attention to highly scalable and distributed algorithms that can be applied to very large collections. Chapter 5 covers techniques for compressing dictionaries and inverted indexes. These techniques are critical for achieving subsecond response times to user queries in large search engines. The indexes and queries considered in Chapters 1 through 5 only deal with Boolean retrieval, in which a document either matches a query or does not. A desire to measure the extent to which a document matches a query, or the score of a document for a query, motivates the development of term weighting and the computation of scores in Chapters 6 and 7, leading to the idea of a list of documents that are rank-ordered for a query. Chapter 8 focuses on the evaluation of an information retrieval system based on the relevance of the documents it retrieves, allowing us to compare the relative performances of different systems on benchmark document collections and queries.
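To make the Chapter 1 material concrete, here is a minimal Python sketch of an inverted index and a Boolean AND query answered by merging sorted postings lists. It is an illustration under stated assumptions rather than the book's own code: the function names (`build_inverted_index`, `intersect`, `boolean_and`), the whitespace tokenizer, and the four toy documents are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted postings list of document IDs.

    `docs` is a dict of {doc_id: text}. Tokenization is a naive
    lowercase whitespace split (an assumption for this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists, keeping IDs present in both."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def boolean_and(index, *terms):
    """Answer `t1 AND t2 AND ...`, intersecting the shortest lists first."""
    postings = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result


# Toy collection, purely illustrative.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}
index = build_inverted_index(docs)
print(boolean_and(index, "home", "july"))  # [2, 3, 4]
```

Keeping postings sorted by document ID is what lets the merge run in time linear in the combined length of the two lists. A document either appears in the result or does not, which is the all-or-nothing behavior that the later chapters on term weighting and scoring relax.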
Chapters 9 through 21 build on the foundation of the first eight chapters to cover a variety of more advanced topics. Chapter 9 discusses methods by which retrieval can be enhanced through the use of techniques like relevance feedback and query expansion, which aim at increasing the likelihood of retrieving relevant documents. Chapter 10 considers IR from documents that are structured with markup languages like XML and HTML. We treat structured retrieval by reducing it to the vector space scoring methods developed in Chapter 6. Chapters 11 and 12 invoke probability theory to compute scores for documents on queries. Chapter 11 develops traditional probabilistic IR, which provides a framework for computing the probability of relevance of a document, given a set of query terms. This probability may then be used as a score in ranking. Chapter 12 illustrates an alternative, wherein, for each document in a collection, we build a language model from which one can estimate a probability that the language model generates a given query. This probability is another quantity with which we can rank-order documents.
Chapters 13 through 18 give a treatment of various forms of machine learning and numerical methods in information retrieval. Chapters 13 through 15 treat the problem of classifying documents into a set of known categories, given a set of documents along with the classes they belong to. Chapter 13 motivates statistical classification as one of the key technologies needed for a successful search engine; introduces Naive Bayes, a conceptually simple and efficient text classification method; and outlines the standard methodology for evaluating text classifiers. Chapter 14 employs the vector space model from Chapter 6 and introduces two classification methods, Rocchio and k nearest neighbor (kNN), that operate on document vectors. It also presents the bias-variance tradeoff as an important characterization of learning problems that provides criteria for selecting an appropriate method for a text classification problem. Chapter 15 introduces support vector machines, which many researchers currently view as the most effective text classification method. We also develop connections in this chapter between the problem of classification and seemingly disparate topics such as the induction of scoring functions from a set of training examples.
Chapters 16, 17, and 18 consider the problem of inducing clusters of related documents from a collection. In Chapter 16, we first give an overview of a number of important applications of clustering in IR. We then describe two flat clustering algorithms: the K-means algorithm, an efficient and widely used document clustering method, and the expectation-maximization algorithm, which is computationally more expensive, but also more flexible.
Chapter 17 motivates the need for hierarchically structured clusterings (instead of flat clusterings) in many applications in IR and introduces a number of clustering algorithms that produce a hierarchy of clusters. The chapter also addresses the difficult problem of automatically computing labels for clusters. Chapter 18 develops methods from linear algebra that constitute an extension of clustering and also offer intriguing prospects for algebraic methods in IR, which have been pursued in the approach of latent semantic indexing.
Chapters 19 through 21 treat the problem of web search. We give in Chapter 19 a summary of the basic challenges in web search, together with a set of techniques that are pervasive in web information retrieval. Next, Chapter 20 describes the architecture and requirements of a basic web crawler. Finally, Chapter 21 considers the power of link analysis in web search, using in the process several methods from linear algebra and advanced probability theory.
This book is not comprehensive in covering all topics related to IR. We have put aside a number of topics, which we deemed outside the scope of what we wished to cover in an introduction to IR class. Nevertheless, for people interested in these topics, we provide the following pointers to mainly textbook coverage:
Cross-language IR: Grossman and Frieder 2004, ch. 4, and Oard and Dorr 1996.
Image and multimedia IR: Grossman and Frieder 2004, ch. 4; Baeza-Yates and Ribeiro-Neto 1999, ch. 6; Baeza-Yates and Ribeiro-Neto 1999, ch. 11; Baeza-Yates and Ribeiro-Neto 1999, ch. 12; del Bimbo 1999; Lew 2001; and Smeulders et al. 2000.
Speech retrieval: Coden et al. 2002.
Music retrieval: Downie 2006 and http://www.ismir.net/.
User interfaces for IR: Baeza-Yates and Ribeiro-Neto 1999, ch. 10.
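Media reviews
"How do you rank SVMs, XML, DNS, and LSI? What are spam, cloaked pages, and doorway pages in information retrieval? How do MapReduce and other approaches to parallel computing make the leap from megabytes (MB) to petabytes (PB)? You will find the answers to all of these questions in this book, which for the first time presents the complex process of building a web search engine as a clear, complete picture."
-- Peter Norvig, Director of Research, Google
"This book gives a comprehensive, fresh, and accurate introduction to information retrieval, a field that is both highly important and rapidly developing. We badly need a textbook like this."
-- Raymond J. Mooney, Professor, University of Texas at Austin
"This book offers fresh content and a distinctive selection of material, with a lively account of the fundamentals of information retrieval and where the field is heading."
-- Jon Kleinberg, Professor, Cornell University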