- Mehmed Kantardzic
Data Mining
Table of Contents
Cover
Series page
Title page
Copyright page
DEDICATION
PREFACE TO THE SECOND EDITION
PREFACE TO THE FIRST EDITION
1 DATA-MINING CONCEPTS
1.1 INTRODUCTION
1.2 DATA-MINING ROOTS
1.3 DATA-MINING PROCESS
1.4 LARGE DATA SETS
1.5 DATA WAREHOUSES FOR DATA MINING
1.6 BUSINESS ASPECTS OF DATA MINING: WHY A DATA-MINING PROJECT FAILS
1.7 ORGANIZATION OF THIS BOOK
2 PREPARING THE DATA
2.1 REPRESENTATION OF RAW DATA
2.2 CHARACTERISTICS OF RAW DATA
2.3 TRANSFORMATION OF RAW DATA
2.4 MISSING DATA
2.5 TIME-DEPENDENT DATA
2.6 OUTLIER ANALYSIS
3 DATA REDUCTION
3.1 DIMENSIONS OF LARGE DATA SETS
3.2 FEATURE REDUCTION
3.3 RELIEF ALGORITHM
3.4 ENTROPY MEASURE FOR RANKING FEATURES
3.5 PCA
3.6 VALUE REDUCTION
3.7 FEATURE DISCRETIZATION: CHIMERGE TECHNIQUE
3.8 CASE REDUCTION
4 LEARNING FROM DATA
4.1 LEARNING MACHINE
4.2 SLT
4.3 TYPES OF LEARNING METHODS
4.4 COMMON LEARNING TASKS
4.5 SVMs
4.6 KNN: NEAREST NEIGHBOR CLASSIFIER
4.7 MODEL SELECTION VERSUS GENERALIZATION
4.8 MODEL ESTIMATION
4.9 90% ACCURACY: NOW WHAT?
5 STATISTICAL METHODS
5.1 STATISTICAL INFERENCE
5.2 ASSESSING DIFFERENCES IN DATA SETS
5.3 BAYESIAN INFERENCE
5.4 PREDICTIVE REGRESSION
5.5 ANOVA
5.6 LOGISTIC REGRESSION
5.7 LOG-LINEAR MODELS
5.8 LDA
6 DECISION TREES AND DECISION RULES
6.1 DECISION TREES
6.2 C4.5 ALGORITHM: GENERATING A DECISION TREE
6.3 UNKNOWN ATTRIBUTE VALUES
6.4 PRUNING DECISION TREES
6.5 C4.5 ALGORITHM: GENERATING DECISION RULES
6.6 CART ALGORITHM & GINI INDEX
6.7 LIMITATIONS OF DECISION TREES AND DECISION RULES
7 ARTIFICIAL NEURAL NETWORKS
7.1 MODEL OF AN ARTIFICIAL NEURON
7.2 ARCHITECTURES OF ANNS
7.3 LEARNING PROCESS
7.4 LEARNING TASKS USING ANNS
7.5 MULTILAYER PERCEPTRONS (MLPs)
7.6 COMPETITIVE NETWORKS AND COMPETITIVE LEARNING
7.7 SOMs
8 ENSEMBLE LEARNING
8.1 ENSEMBLE-LEARNING METHODOLOGIES
8.2 COMBINATION SCHEMES FOR MULTIPLE LEARNERS
8.3 BAGGING AND BOOSTING
8.4 ADABOOST
9 CLUSTER ANALYSIS
9.1 CLUSTERING CONCEPTS
9.2 SIMILARITY MEASURES
9.3 AGGLOMERATIVE HIERARCHICAL CLUSTERING
9.4 PARTITIONAL CLUSTERING
9.5 INCREMENTAL CLUSTERING
9.6 DBSCAN ALGORITHM
9.7 BIRCH ALGORITHM
9.8 CLUSTERING VALIDATION
10 ASSOCIATION RULES
10.1 MARKET-BASKET ANALYSIS
10.2 ALGORITHM APRIORI
10.3 FROM FREQUENT ITEMSETS TO ASSOCIATION RULES
10.4 IMPROVING THE EFFICIENCY OF THE APRIORI ALGORITHM
10.5 FP GROWTH METHOD
10.6 ASSOCIATIVE-CLASSIFICATION METHOD
10.7 MULTIDIMENSIONAL ASSOCIATION–RULES MINING
11 WEB MINING AND TEXT MINING
11.1 WEB MINING
11.2 WEB CONTENT, STRUCTURE, AND USAGE MINING
11.3 HITS AND LOGSOM ALGORITHMS
11.4 MINING PATH–TRAVERSAL PATTERNS
11.5 PAGERANK ALGORITHM
11.6 TEXT MINING
11.7 LATENT SEMANTIC ANALYSIS (LSA)
12 ADVANCES IN DATA MINING
12.1 GRAPH MINING
12.2 TEMPORAL DATA MINING
12.3 SPATIAL DATA MINING (SDM)
12.4 DISTRIBUTED DATA MINING (DDM)
12.5 CORRELATION DOES NOT IMPLY CAUSALITY
12.6 PRIVACY, SECURITY, AND LEGAL ASPECTS OF DATA MINING
13 GENETIC ALGORITHMS
13.1 FUNDAMENTALS OF GAs
13.2 OPTIMIZATION USING GAs
13.3 A SIMPLE ILLUSTRATION OF A GA
13.4 SCHEMATA
13.5 TSP
13.6 MACHINE LEARNING USING GAs
13.7 GAS FOR CLUSTERING
14 FUZZY SETS AND FUZZY LOGIC
14.1 FUZZY SETS
14.2 FUZZY-SET OPERATIONS
14.3 EXTENSION PRINCIPLE AND FUZZY RELATIONS
14.4 FUZZY LOGIC AND FUZZY INFERENCE SYSTEMS
14.5 MULTIFACTORIAL EVALUATION
14.6 EXTRACTING FUZZY MODELS FROM DATA
14.7 DATA MINING AND FUZZY SETS
15 VISUALIZATION METHODS
15.1 PERCEPTION AND VISUALIZATION
15.2 SCIENTIFIC VISUALIZATION AND INFORMATION VISUALIZATION
15.3 PARALLEL COORDINATES
15.4 RADIAL VISUALIZATION
15.5 VISUALIZATION USING SELF-ORGANIZING MAPS (SOMs)
15.6 VISUALIZATION SYSTEMS FOR DATA MINING
APPENDIX A
A.1 DATA-MINING JOURNALS
A.2 DATA-MINING CONFERENCES
A.3 DATA-MINING FORUMS/BLOGS
A.4 DATA SETS
A.5 COMMERCIALLY AND PUBLICLY AVAILABLE TOOLS
A.6 WEB SITE LINKS
APPENDIX B: DATA-MINING APPLICATIONS
B.1 DATA MINING FOR FINANCIAL DATA ANALYSIS
B.2 DATA MINING FOR THE TELECOMMUNICATIONS INDUSTRY
B.3 DATA MINING FOR THE RETAIL INDUSTRY
B.4 DATA MINING IN HEALTH CARE AND BIOMEDICAL RESEARCH
B.5 DATA MINING IN SCIENCE AND ENGINEERING
B.6 PITFALLS OF DATA MINING
BIBLIOGRAPHY
Index
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
IEEE Press Editorial Board
Lajos Hanzo, Editor in Chief
R. Abhari, M. El-Hawary, O. P. Malik
J. Anderson, B-M. Haemmerli, S. Nahavandi
G. W. Arnold, M. Lanzerotti, T. Samad
F. Canavero, D. Jacobson, G. Zobrist
Kenneth Moore, Director of IEEE Book and Information Services (BIS)
Technical Reviewers
Mariofanna Milanova, Professor
Computer Science Department
University of Arkansas at Little Rock
Little Rock, Arkansas, USA
Jozef Zurada, Ph.D.
Professor of Computer Information Systems
College of Business
University of Louisville
Louisville, Kentucky, USA
Witold Pedrycz
Department of ECE
University of Alberta
Edmonton, Alberta, Canada
Copyright © 2011 by Institute of Electrical and Electronics Engineers. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey.
Published simultaneously in Canada.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Kantardzic, Mehmed.
Data mining : concepts, models, methods, and algorithms / Mehmed Kantardzic. – 2nd ed.
p. cm.
ISBN 978-0-470-89045-5 (cloth)
1. Data mining. I. Title.
QA76.9.D343K36 2011
006.3'12–dc22
2011002190
oBook ISBN: 978-1-118-02914-5
ePDF ISBN: 978-1-118-02912-1
ePub ISBN: 978-1-118-02913-8
To Belma and Nermin
PREFACE TO THE SECOND EDITION
In the seven years that have passed since the publication of the first edition of this book, the field of data mining has made good progress both in developing new methodologies and in extending the spectrum of new applications. These changes motivated me to update my data-mining book with a second edition. Although the core of the material in this edition remains the same, the new version of the book attempts to summarize recent developments in our fast-changing field, presenting the state of the art in data mining, both in academic research and in deployment in commercial applications. The most notable changes from the first edition are the addition of
new topics such as ensemble learning, graph mining, temporal, spatial, distributed, and privacy preserving data mining;
new algorithms such as Classification and Regression Trees (CART), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Balanced and Iterative Reducing and Clustering Using Hierarchies (BIRCH), PageRank, AdaBoost, support vector machines (SVM), Kohonen self-organizing maps (SOM), and latent semantic indexing (LSI);
more details on practical aspects and business understanding of a data-mining process, discussing important problems of validation, deployment, data understanding, causality, security, and privacy; and
some quantitative measures and methods for comparison of data-mining models such as ROC curve, lift chart, ROI chart, McNemar’s test, and K-fold cross validation paired t-test.
Keeping in mind the educational aspect of the book, many new exercises have been added. The bibliography and appendices have been updated to include work that has appeared in the last few years, as well as to reflect the change in emphasis when a new topic gained importance.
I would like to thank all my colleagues all over the world who used the first edition of the book for their classes and who sent me support, encouragement, and suggestions to put together this revised version. My sincere thanks are due to all my colleagues and students in the Data Mining Lab and Computer Science Department for their reviews of this edition and their numerous helpful suggestions. Special thanks go to graduate students Brent Wenerstrom, Chamila Walgampaya, and Wael Emara for their patience in proofreading this new edition and for useful discussions about the content of new chapters, numerous corrections, and additions. To Dr. Joung Woo Ryu, who helped me enormously in the preparation of the final version of the text and all additional figures and tables, I would like to express my deepest gratitude.
I believe this book can serve as a valuable guide to the field for undergraduate and graduate students, researchers, and practitioners. I hope that the wide range of topics covered will allow readers to appreciate the extent of the impact of data mining on modern business, science, and even society as a whole.
MEHMED KANTARDZIC
Louisville
July 2011
PREFACE TO THE FIRST EDITION
The modern technologies of computers, networks, and sensors have made data collection and organization an almost effortless task. However, the captured data need to be converted into information and knowledge to become useful. Traditionally, the task of extracting useful information from recorded data has been performed by analysts; however, the increasing volume of data in modern businesses and sciences calls for computer-based methods for this task. As data sets have grown in size and complexity, there has been an inevitable shift away from direct hands-on data analysis toward indirect, automatic data analysis in which the analyst works via more complex and sophisticated tools. The entire process of applying computer-based methodology, including new techniques for knowledge discovery from data, is often called data mining.
The importance of data mining arises from the fact that the modern world is a data-driven world. We are surrounded by data, numerical and otherwise, which must be analyzed and processed to convert it into information that informs, instructs, answers, or otherwise aids understanding and decision making. In the age of the Internet, intranets, data warehouses, and data marts, the fundamental paradigms of classical data analysis are ripe for change. Very large collections of data (millions or even hundreds of millions of individual records) are now being stored in centralized data warehouses, allowing analysts to make use of powerful data-mining methods to examine data more comprehensively. The quantity of such data is huge and growing, the number of sources is effectively unlimited, and the range of areas covered is vast: industrial, commercial, financial, and scientific activities are all generating such data.
The new discipline of data mining has developed especially to extract valuable information from such huge data sets. In recent years there has been an explosive growth of methods for discovering new knowledge from raw data. This is not surprising given the proliferation of low-cost computers (for implementing such methods in software), low-cost sensors, communications, and database technology (for collecting and storing data), and highly computer-literate application experts who can pose “interesting” and “useful” application problems.
Data-mining technology is currently a favorite of decision makers because it can provide valuable hidden business and scientific “intelligence” from large amounts of historical data. It should be remembered, however, that fundamentally, data mining is not a new technology. Extracting information and discovering knowledge from recorded data is a well-established concept in scientific and medical studies. What is new is the convergence of several disciplines and corresponding technologies that have created a unique opportunity for data mining in the scientific and corporate world.
The origin of this book was a wish to have a single introductory source to which we could direct students, rather than having to direct them to multiple sources. However, it soon became apparent that a wide interest existed, and potential readers other than our students would appreciate a compilation of some of the most important methods, tools, and algorithms in data mining. Such readers include people from a wide variety of backgrounds and positions, who find themselves confronted by the need to make sense of large amounts of raw data.
This book can be used by a wide range of readers, from students wishing to learn about basic processes and techniques in data mining to analysts and programmers who will be engaged directly in interdisciplinary teams for selected data-mining applications. This book reviews state-of-the-art techniques for analyzing enormous quantities of raw data in high-dimensional data spaces to extract new information useful in decision-making processes. Most of the definitions, classifications, and explanations of the techniques covered in this book are not new, and they are presented in references at the end of the book. One of the author’s main goals was to concentrate on a systematic and balanced approach to all phases of a data-mining process, and to present them with sufficient illustrative examples. We expect that carefully prepared examples should give the reader additional arguments and guidelines in the selection and structuring of techniques and tools for his or her own data-mining applications. A better understanding of the implementation details for most of the introduced techniques will help challenge the reader to build his or her own tools or to improve applied methods and techniques.
Teaching data mining has to emphasize the concepts and properties of the applied methods, rather than the mechanical details of how to apply different data-mining tools. Despite all of their attractive “bells and whistles,” computer-based tools alone will never provide the entire solution. There will always be the need for the practitioner to make important decisions regarding how the whole process will be designed, and how and which tools will be employed. Obtaining a deeper understanding of the methods and models, how they behave, and why they behave the way they do is a prerequisite for efficient and successful application of data-mining technology. The premise of this book is that there are just a handful of important principles and issues in the field of data mining. Any researcher or practitioner in this field needs to be aware of these issues in order to successfully apply a particular methodology, to understand a method’s limitations, or to develop new techniques. This book is an attempt to present and discuss such issues and principles and then describe representative and popular methods originating from statistics, machine learning, computer graphics, databases, information retrieval, neural networks, fuzzy logic, and evolutionary computation.
In this book, we describe how best to prepare environments for performing data mining and discuss approaches that have proven to be critical in revealing important patterns, trends, and models in large data sets. It is our expectation that once a reader has completed this text, he or she will be able to initiate and perform basic activities in all phases of a data-mining process successfully and effectively. Although it is easy to focus on the technologies, as you read through the book keep in mind that technology alone does not provide the entire solution. One of our goals in writing this book was to minimize the hype associated with data mining. Rather than making false promises that overstep the bounds of what can reasonably be expected from data mining, we have tried to take a more objective approach. We describe in sufficient detail the processes and algorithms that are necessary to produce reliable and useful results in data-mining applications. We do not advocate the use of any particular product or technique over another; the designer of a data-mining process has to have enough background to select appropriate methodologies and software tools.