Basic Information
- Original title: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data
- Original publisher: Cambridge University Press
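Editor's Recommendation
This book is an excellent introduction to text mining, written by leading figures in the field.
It combines the theory and practice of text mining well, making it suitable both for researchers in the field and as a reference for practitioners.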
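Synopsis
Authors and Translators
James Sanger is a venture capitalist and a recognized industry expert in business data solutions, Internet applications, and IT security products. He co-founded ABS Ventures in 1982. Before that, he was a managing director at DB Capital in New York. He received his undergraduate degree from the University of Pennsylvania and pursued graduate studies at the University of Oxford and the University of Liverpool. He is a member of the IEEE and the American Association for Artificial Intelligence (AAAI). ...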
Table of Contents
I.1 Defining Text Mining
I.2 General Architecture of Text Mining Systems
II. Core Text Mining Operations
II.1 Core Text Mining Operations
II.2 Using Background Knowledge for Text Mining
II.3 Text Mining Query Languages
III. Text Mining Preprocessing Techniques
III.1 Task-Oriented Approaches
III.2 Further Reading
IV. Categorization
IV.1 Applications of Text Categorization
IV.2 Definition of the Problem
IV.3 Document Representation
IV.4 Knowledge Engineering Approach to TC
IV.5 Machine Learning Approach to TC
IV.6 Using Unlabeled Data to Improve Classification
IV.7 Evaluation of Text Classifiers
IV.8 Citations and Notes
V. Clustering
Preface
Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.
This book presents a general theory of text mining along with the main techniques behind it. We offer a generalized architecture for text mining and outline the algorithms and data structures typically used by text mining systems.
The book is aimed at the advanced undergraduate students, graduate students, academic researchers, and professional practitioners interested in complete coverage of the text mining field. We have included all the topics critical to people who plan to develop text mining systems or to use them. In particular, we have covered preprocessing techniques such as text categorization, text clustering, and information extraction and analysis techniques such as association rules and link analysis.
The book tries to blend together theory and practice; we have attempted to provide many real-life scenarios that show how the different techniques are used in practice. When writing the book we tried to make it as self-contained as possible and have compiled a comprehensive bibliography for each topic so that the reader can expand his or her knowledge accordingly.
BOOK OVERVIEW
The book starts with a gentle introduction to text mining that presents the basic definitions and prepares the reader for the next chapters. In the second chapter we describe the core text mining operations in detail while providing examples for each operation. The third chapter serves as an introduction to text mining preprocessing techniques. We provide a taxonomy of the operations and set the ground for Chapters IV through VII. Chapter IV offers a comprehensive description of the text categorization problem and outlines the major algorithms for performing text categorization.
Chapter V introduces another important text preprocessing task called text clustering, and we again provide a concrete definition of the problem and outline the major algorithms for performing text clustering. Chapter VI addresses what is probably the most important text preprocessing technique for text mining – namely, information extraction. We describe the general problem of information extraction and supply the relevant definitions. Several examples of the output of information extraction in several domains are also presented.
In Chapter VII, we discuss several state-of-the-art probabilistic models for information extraction, and Chapter VIII describes several preprocessing applications that either use the probabilistic models of Chapter VII or are based on hybrid approaches incorporating several models. The presentation layer of a typical text mining system is considered in Chapter IX. We focus mainly on aspects related to browsing large document collections and on issues related to query refinement.
Chapter X surveys the common visualization techniques used either to visualize the document collection or the results obtained from the text mining operations. Chapter XI introduces the fascinating area of link analysis. We present link analysis as an analytical step based on the foundation of the text preprocessing techniques discussed in the previous chapters, most specifically information extraction. The chapter begins with basic definitions from graph theory and moves to common techniques for analyzing large networks of entities.
Finally, in Chapter XII, three real-world applications of text mining are considered. We begin by describing an application for articles posted in BioWorld magazine. This application identifies major biological entities such as genes and proteins and enables visualization of relationships between those entities. We then proceed to the GeneWays application, which is based on analysis of PubMed articles. The next application is based on analysis of U.S. patents and enables monitoring trends and visualizing relationships between inventors, assignees, and technology terms.
The appendix explains the DIAL language, which is a dedicated information extraction language. We outline the structure of the language and describe its exact syntax. We also offer several code examples that show how DIAL can be used to extract a variety of entities and relationships. A detailed bibliography concludes the book.
ACKNOWLEDGMENTS
This book would not have been possible without the help of many individuals. In addition to acknowledgments made throughout the book, we feel it important to take the time to offer special thanks to an important few. Among these we would like to mention especially Benjamin Rosenfeld, who devoted many hours to revising the categorization and clustering chapters. The people at ClearForest Corporation also provided help in obtaining screen shots of applications using ClearForest technologies – most notably in Chapter XII. In particular, we would like to mention the assistance we received from Rafi Vesserman, Yonatan Aumann, Jonathan Schler, Yair Liberzon, Felix Harmatz, and Yizhar Regev. Their support meant a great deal to us in the completion of this project.
Adding to this list, we would also like to thank Ian Bonner and Kathy Bentaieb of Inxight Software for the screen shots used in Chapter X. Also, we would like to extend our appreciation to Andrey Rzhetsky for his personal screen shots of the GeneWays application.
A book written on a subject such as text mining is inevitably a culmination of many years of work. As such, our gratitude is extended to both Haym Hirsh and Oren Etzioni, early collaborators in the field.
In addition, we would like to thank Lauren Cowles of Cambridge University Press for reading our drafts and patiently making numerous comments on how to improve the structure of the book and its readability. Appreciation is also owed to Jessica Farris for help in keeping two very busy coauthors on track.
Finally it brings us great pleasure to thank those dearest to us – our children Yael, Hadar, Yair, Neta and Frithjof – for leaving us undisturbed in our rooms while we were writing. We hope that, now that the book is finished, we will have more time to devote to you and to enjoy your growth. We are also greatly indebted to our dear wives Hedva and Lauren for bearing with our long hours on the computer, doing research, and writing the endless drafts. Without your help, confidence, and support we would never have completed this book. Thank you for everything. We love you! ...
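Media Reviews
-- L. Venkata Subramaniam, IBM India Research Lab
"An introduction to text mining written by the most prominent experts in the field. The book is very well written and perfectly combines the theory and practice of text mining, making it suitable for both researchers and practitioners ... Highly recommended for anyone who has no background in computational linguistics and wants to delve into the field of text mining."
-- Rada Mihalcea, University of North Texas...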