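Product Details
Title: 大数据分析基础:概念、技术、方法和商务(英文版) (Fundamentals of Big Data Analytics: Concepts, Technologies, Methods and Business, English edition)
Former price: 219.00
Publisher: 科学出版社 (Science Press)
Edition: 1
Publication date: September 2018
Format: 16mo
Authors: Hui,Xue,Zeng,Jiajia
Binding: Paperback
Pages: 614
Word count: 0
ISBN: 9787030581488
Description
None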
Contents
Part One Basics and Concepts
Chapter 1 Introduction 3
1.1 What Is Big Data Analytics? 3
1.1.1 Big Data Analytics Requires Data-Driven Business Culture 4
1.1.2 Big Data Analytics Requires High-Performance Analyses 4
1.2 Why Big Data Analytics? 4
1.2.1 History and Evolution of Big Data Analytics 5
1.2.2 The Drivers of Big Data Analytics 6
1.2.3 Why Is Big Data Analytics Important? 6
1.2.4 The Challenges of Big Data Analytics 8
1.2.5 How Big Data Analytics Is Used Today? 10
1.3 Big Data Analytics Applications 10
1.3.1 Industries Where Big Data Analytics Are Successful 11
1.3.2 Four Powerful Big Data Analytics Application Examples 13
1.4 The Big Data Analytics Market 14
1.5 Big Data Analytics Future Trends 15
1.5.1 Predictive Analytics Will Dominate 15
1.5.2 Refocusing on the Human Decision-Making 15
1.5.3 Market Segmentation in Data Analysis Platforms 16
1.5.4 Open Source Software Tools 16
1.5.5 Plug-in AI Technologies 16
1.6 The Contents of Big Data Analytics 17
1.7 References 19
1.8 Review Questions and Exercises 20
Chapter 2 Data and Big Data 21
2.1 Data as a Basic Entity in the DIKW Framework 21
2.1.1 DIKW Framework 21
2.1.2 Data Object, Data Attribute and Data Set 23
2.1.3 Data Attribute Types 25
2.2 Big Data 28
2.2.1 Big Data Definition 29
2.2.2 Big Data Types 33
2.3 Quality of Data and Big Data 37
2.3.1 Definition of Data Quality 37
2.3.2 Data Measurement and Data Collection 38
2.3.3 Errors in Measurement and Collection 39
2.3.4 Data Accuracy 40
2.4 Basic Measurement of Dataset 41
2.5 Summary 42
2.6 References 44
2.7 Review Questions 45
Chapter 3 Big Data Analytics Process 47
3.1 The Process of Data Mining and Knowledge Discovery 47
3.1.1 CRISP-DM Framework 47
3.1.2 KDD Process 49
3.2 Process of Big Data Analytics 51
3.2.1 Acquisition 51
3.2.2 Understanding 51
3.2.3 Preprocess 52
3.2.4 Analysis 52
3.2.5 Reporting 52
3.2.6 Action 52
3.3 Data Preprocess 53
3.3.1 Data Cleaning 54
3.3.2 Data Integration 54
3.3.3 Data Reduction 54
3.3.4 Data Transformation 55
3.4 Big Data Analysis 56
3.4.1 Analysis 56
3.4.2 Types of Big Data Analysis 57
3.4.3 Descriptive Analysis 60
3.4.4 Explorative Analysis 61
3.4.5 Predictive Data Analysis 62
3.5 Summary 66
3.6 References 68
3.7 Questions and Exercises 68
Part Two Technologies and Tools
Chapter 4 Supporting Infrastructure 73
4.1 Cloud Computing 73
4.1.1 Essential Characteristics of Cloud Computing 75
4.1.2 Services Provided by Cloud Computing 75
4.2 Distributed Computing 77
4.2.1 Characteristics of Distributed Systems 78
4.2.2 Distributed Systems Composition 78
4.2.3 Distributed State 81
4.2.4 The CAP Theorem 83
4.3 Big Data Systems 86
4.3.1 Requirements for a Big Data System 86
4.3.2 The Problems with Fully Incremental Architectures 87
4.3.3 Lambda Architecture 90
4.4 Summary 96
4.5 References 96
4.6 Questions and Exercises 97
Chapter 5 Hadoop and MapReduce 98
5.1 Computer Cluster 98
5.1.1 Concept of Computer Cluster 99
5.1.2 Attributes of Clusters 100
5.2 Apache Hadoop in a Nutshell 101
5.2.1 History and Overview of Hadoop 101
5.2.2 What Is Hadoop? 102
5.2.3 Components of Hadoop 103
5.2.4 The Hadoop Ecosystem 110
5.2.5 Hadoop Limitations 111
5.3 How Do Hadoop and MapReduce Work? 113
5.3.1 Big Example (WordCount) 113
5.3.2 Scaling WordCount in MapReduce 117
5.3.3 The Driver Method 120
5.4 MapReduce Data Flow 121
5.5 Other Hadoop Usages 123
5.5.1 Chaining Jobs 123
5.5.2 Listing and Killing Jobs 124
5.5.3 Pipes 124
5.5.4 Hadoop Streaming 126
5.5.5 Example of Hadoop Streaming Using Python 127
5.6 Summary 129
5.7 References 129
5.8 Review Questions and Exercises 130
5.9 Practical Tasks (lab tasks) 130
Chapter 6 Apache Spark 132
6.1 Spark in a Nutshell 132
6.1.1 Spark’s Stack 132
6.1.2 Spark’s Usage 134
6.1.3 Spark’s Advantages 135
6.1.4 Fast Application Support 135
6.2 Spark High-level Architecture 136
6.2.1 How Does a Spark Application Work? 137
6.2.2 Application Programming Interface (API) 138
6.3 Programming with RDDs 139
6.3.1 Steps for Programming with RDDs 140
6.3.2 Spark Shell 140
6.3.3 RDD Creation 141
6.3.4 RDD Operations 142
6.3.5 Actions 144
6.3.6 Checking the Output 145
6.4 Spark Application Development and Deployment 145
6.4.1 Spark Jobs 146
6.4.2 Shared Variables 146
6.4.3 General Steps for Creating a Spark Application 148
6.5 Summary 150
6.6 References 150
6.7 Questions and Exercises 150
6.8 Practical Tasks (lab tasks) 151
Chapter 7 NoSQL and MongoDB 152
7.1 NoSQL in a Nutshell 152
7.1.1 What Is NoSQL? 152
7.1.2 Why NoSQL? 153
7.1.3 The CAP Principle 154
7.1.4 ACID Rules 154
7.1.5 BASE Rules 155
7.1.6 Benefits of NoSQL 156
7.1.7 Types of NoSQL Databases 158
7.2 NoSQL and Hadoop Integration in Big Data Analytics 161
7.2.1 OLTP vs OLAP 161
7.2.2 Operational vs Analytical View of NoSQL 162
7.2.3 NoSQL Integration with Hadoop 162
7.3 MongoDB 163
7.3.1 MongoDB Basics 164
7.3.2 MongoDB Architecture 165
7.3.3 MongoDB Data Modelling 167
7.3.4 MongoDB Data Representation 173
7.3.5 MongoDB CRUD Operations 175
7.4 Big Data Analysis with MongoDB 179
7.4.1 Aggregation 179
7.4.2 MongoDB with MapReduce 181
7.4.3 MongoDB with Hadoop 184
7.5 Summary 186
7.6 References 188
7.7 Questions and Exercises 189
7.8 Practical Tasks (lab tasks) 190
Part Three Methods and Algorithms
Chapter 8 Data Preparation 195
8.1 What is Big Data Preparation? 195
8.2 Data Cleaning 196
8.2.1 Fill in Missing Values 196
8.2.2 Identify Outliers and Smooth Out Noisy Data 197
8.2.3 Correct Inconsistent Data 199
8.3 Data Integration 201
8.3.1 Entity Identification Problem 201
8.3.2 Redundancy Identification 202
8.3.3 Data Deduplication 207
8.4 Data Reduction 208
8.4.1 Overview of Data Reduction Strategies 208
8.4.2 Reducing the Number of Data Records 209
8.4.3 Reducing the Number of Attributes 215
8.4.4 Reducing the Number of Attribute Values 223
8.5 Data Transformation 228
8.5.1 Data Transformation Strategies Overview 228
8.5.2 Normalisation 229
8.5.3 Generalisation 232
8.6 Data Discretisation and Binarisation 234
8.6.1 Binarisation 235
8.6.2 Discretisation 236
8.7 Summary 242
8.8 References 243
8.9 Questions and Exercises 244
Chapter 9 Descriptive Data Analysis 248
9.1 Descriptive Data Analysis 248
9.2 Univariate Descriptive Analyses 250
9.2.1 Simple Data Summary 251
9.2.2 Location Measures 252
9.2.3 Percentiles 255
9.2.4 Dispersion Measures 256
9.2.5 Distribution or Shape Measures 257
9.3 Multivariate Descriptive Analyses 261
9.3.1 Contingency Table for Categorical Data 261
9.3.2 Multivariate Statistics on Categorical and Continuous Variables 262
9.3.3 Multivariate Summary on Quantitative Variables 262
9.3.4 Covariance and Correlation Matrices 263
9.4 Descriptive Analysis between Data Objects 264
9.4.1 Definitions of Similarity, Dissimilarity and Proximity 265
9.4.2 Proximity between Data Objects with Single Attribute 267
9.4.3 Proximity between Data Objects with Multiple Attributes 268
9.4.4 Proximity Analyses Issues and Selections 279
9.5 Association Analysis 282
9.5.1 Problem Definition 284
9.5.2 Frequent Itemset Generation 286
9.5.3 Association Rules Generation 301
9.5.4 Alternative Association Analysis 304
9.5.5 Evaluation of Association Patterns 316
9.5.6 Applications of Association Analysis 316
9.6 Summary 317
9.7 References 319
9.8 Questions and Exercises 320
Chapter 10 Explorative Data Analysis 326
10.1 Explorative Analysis Approach 326
10.1.1 Motivations for EDA 328
10.1.2 Definition of Exploratory Data Analysis 330
10.2 Univariate Graphical EDA 333
10.2.1 Stem and Leaf Plot 334
10.2.2 Histograms 334
10.2.3 Box Plot 338
10.2.4 Pie Chart 340
10.2.5 Bar Chart 341
10.2.6 Percentile Plots 342
10.2.7 Scatter Plots 344
10.2.8 Quantile-Normal Plots 345
10.3 Multivariate Graphical EDA 349
10.3.1 Generic Approaches for Multivariate 349
10.3.2 Extending 2-Dimensional and 3-Dimensional Plots 351
10.4 Data Visualisation 353
10.4.1 Pixel-Oriented Visualisation Techniques 353
10.4.2 Geometric Projection Visualisation Techniques 354
10.4.3 Icon-Based Visualisation Techniques 356
10.4.4 Visualising Spatio-Temporal Data 358
10.4.5 Animation 361
10.4.6 Do’s and Don’ts of Visualising Data 361
10.5 Multidimensional Data Analysis (OLAP) 362
10.5.1 Data Cube: A Multidimensional Data Model 363
10.5.2 Typical OLAP Operations 367
10.5.3 General Procedure Using Data Cubes and OLAP 371
10.6 Data Clustering 371
10.6.1 What Is Clustering? 372
10.6.2 Basic Clustering Techniques 379
10.6.3 Partitioning Clustering Methods 381
10.6.4 Hierarchical Clustering Methods 391
10.6.5 Density-Based Methods 411
10.6.6 Clustering with Mixed Methods 418
10.6.7 Clustering Evaluation 423
10.7 Summary 432
10.8 References 436
10.9 Questions and Exercises 437
Chapter 11 Predictive Data Analysis 443
11.1 Introduction to Predictive Data Analysis 443
11.1.1 What Is Predictive Data Analysis? 443
11.1.2 Predictive Data Analysis History and Its Applications 444
11.1.3 The Predictive Analytics Process 446
11.1.4 Tools and Software 450
11.2 Process of Building Predictive Models 452
11.3 Predictive Models 457
11.3.1 Predictive Model Types 457
11.3.2 Regression Models 459
11.3.3 Rule Based Models 467
11.3.4 Machine Learning Techniques 477
11.4 Predictive Models Evaluation 496
11.4.1 Confusion Matrix 496
11.4.2 Gain and Lift Charts 498
11.4.3 K-S Chart 501
11.4.4 Area Under the ROC Curve (AUC – ROC) 503
11.4.5 Gini Coefficient 505
11.4.6 Cross Validation 505
11.4.7 Root Mean Squared Error (RMSE) 506
11.5 Classification Problem 507
11.5.1 Basic Concepts 507
11.5.2 Decision Tree Induction 508
11.5.3 Overfitting and Tree Pruning 521
11.5.4 Evaluating the Performance of a Classifier 528
11.5.5 Comparing the Performance of Two Classifiers 530
11.6 Recent Applications of Predictive Data Analytics 537
11.6.1 Customer Relationship Management (CRM) 537
11.6.2 Risk Management and Fraud Detection 539
11.6.3 Clinical Decision Support Systems (CDSS) 540
11.6.4 Future and High-Level Economy Prediction 541
11.7 Summary 541
11.8 References 545
11.9 Questions and Exercises 547
Part Four Social, Ethical and Organisational Issues
Chapter 12 Ethics, Governance and Security of Big Data 559
12.1 12 V’s of Big Data 559
12.2 Ethics of Big Data 561
12.2.1 Relevancy of Ethics in a Big Data World 561
12.2.2 Big Data Analytics Ethical Awareness Framework 563
12.2.3 Big Data Ethics in Practice 564
12.3 Governance of Big Data 566
12.3.1 The Definition 567
12.3.2 Big Data Governance Framework 567
12.4 Big Data Privacy and Security 570
12.4.1 Big Data Privacy: The Great Fear 571
12.4.2 Data Collection: Understanding Privacy’s First Frontier 572
12.4.3 Big Data Security: The Foundation of Privacy 575
12.5 Case Studies 577
12.5.1 Google Street View Wifi: Inadvertent Over-Collection of Data 577
12.5.2 iPhone Location Database 578
12.5.3 A Chinese Case 578
12.6 Summary 579
12.7 References 580
12.8 Questions and Exercises 582
Chapter 13 Building Data-Driven Business Organisations 583
13.1 What Is a Data-Driven Organisation? 583
13.1.1 Definition of Data-Driven Organisation 584
13.1.2 Prerequisites of Data-Driven Organisations 584
13.1.3 Activities a Data-Driven Organisation Ought to Do 585
13.2 Organisational Big Data Analytics Maturity Models 586
13.2.1 SAS’ Eight Levels of Analytics Maturity Model 586
13.2.2 TDWI Data Governance Maturity Model 587
13.2.3 Analytics Business Maturity Model 588
13.2.4 DataFlux Data Governance Maturity Model 589
13.2.5 Gartner Enterprise Information Management Maturity Model 589
13.2.6 IBM Big Data Analytics Maturity Model 590
13.3 How to Build a Data-Driven Organisation? 591
13.3.1 Understand the Business 591
13.3.2 Aligning Big Data Initiatives to Business Goals and Strategy 593
13.3.3 Decision Making Based On Data Evidence 595
13.3.4 Build the Big Data Team 599
13.3.5 Adopt Best Practices with Big Data 601
13.3.6 Top 10 Priorities for Big Data Management (Russom 2013) 603
13.4 Big Data Analytics Innovation Examples 605
13.4.1 DeepGlint 606
13.4.2 Essentia Analytics 606
13.4.3 Catapult 607
13.4.4 Next Big Sound 607
13.4.5 Mark43 608
13.4.6 Netflix 608
13.4.7 Poshly 609
13.4.8 Ayasdi 609
13.4.9 Frost Data Capital 609
13.4.10 Splunk 610
13.4.11 Sumall 610
13.5 Summary 611
13.6 References 612
13.7 Questions and Exercises 613
Online Preview
Chapter 1 Introduction
Following Grid computing, Cloud computing and the Internet of Things, Big Data Analytics (BDA) has become another hot topic for academic research and business discussion (Manyika and Chui 2011). This is because data is being generated at an unprecedented rate from a wide range of sources, thanks to advanced sensor technology, digitalised smart equipment and widely used mobile devices. Advanced computation platforms and tools make it possible to process huge quantities of data in a short time, while business competition demands a fast understanding of both the market and users’ requirements. Big Data Analytics therefore reflects scientific and technological advancement in many fields, such as business management, data science and computing.
1.1 What is Big Data Analytics?
Definition 1.1 Big Data Analytics. Big Data Analytics is a field of study on how to derive value out of Big Data to help business organisations achieve their goals.
Figure 1.1 is a term cloud (also called a word cloud) of Big Data Analytics, which shows related terms and their relevance. It gives a rough idea of what Big Data Analytics is about. The above definition, however, exposes two important aspects of Big Data Analytics: deriving value out of Big Data, and using that value to help achieve business goals. These two aspects are studied by two major scientific fields, Computer Science and Management Science respectively. It should therefore not be surprising that Big Data Analytics is taught in both Computer Science and Management Science. Generally speaking, Computer Science focuses mainly on how to derive value out of Big Data. This is sometimes called Big Data Analysis or Data Mining: the process of collecting, organising and analysing large sets of data, called Big Data, to discover patterns and other useful information. Similarly, Big Data Analytics in Management Science focuses mainly on how to respond to market changes when facing the challenge of Big Data, and on how to obtain competitive advantages that increase business profit. In this sense it is a sub-field of Business Intelligence (BI) (Dhiraj et al. 2013).
1.1.1 Big Data Analytics Requires Data-Driven Business Culture
Businesses drive the study and development of Big Data Analytics, so its ultimate goal is to better serve business needs. When facing unprecedented amounts of data, a business needs a new vision, a new strategy, a new data-driven decision-making culture, and even new teams whose members have data-handling skills and expertise. Most importantly, a business needs a better understanding of its data in order to make correct decisions. All of this aims at more effective business decisions and marketing, new revenue opportunities, better customer service, improved operational efficiency, competitive advantages over rival organisations and other business benefits.
1.1.2 Big Data Analytics Requires High-Performance Analyses
Analysing large volumes of data requires new processes, methods and algorithms. It also requires advanced technology that can analyse Big Data and get answers from it sufficiently quickly in situations where traditional technology, methods and algorithms cannot. Big Data Analytics is typically performed using specialised software tools and applications for predictive analytics, data mining, text mining, data forecasting and optimisation. These processes are separate, but they are combined into highly integrated functions of high-performance analysis. Big Data tools and software enable an organisation to process the extremely large volumes of data it has collected, determine which data is relevant, and analyse that data to drive better business decisions in the future.
The overall purpose of Big Data Analytics is to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information from Big Data to support business decisions.
1.2 Why Big Data Analytics?
Big Data Analytics has not come into existence overnight. It has a long history starting when data was first created. It is a natural advancement of data usage.
1.2.1 History and Evolution of Big Data Analytics
The concept of Data Analytics has been around for years. Now, most business organisations understand that if they capture all the data that streams into their businesses, they can apply analytics and get significant value from it. However, even in the 1950s, decades before the term Big Data was coined, businesses were using basic analytics, essentially numbers in a spreadsheet that were manually examined, to uncover insights and trends.
A timeline of important milestones helps in understanding how Big Data Analytics developed. Figure 1.2 shows such a timeline of recent technology developments and the related business advancements.
From Figure 1.2, it can be seen that neither Big Data nor Data Analytics is a new concept. On the one hand, there are companies that have dealt with billions of transactions for many years. Enterprise resource planning (ERP) systems were widely used in the 1980s to manage core business processes through the collection, storage, management and interpretation of data from many business activities (Harreld 2016, Turbar et al. 2008). Customer relationship management (CRM) gained popularity in the 1990s as a way to manage and analyse customer interactions and data throughout the customer lifecycle, with the goal of improving business relationships with customers, assisting in customer retention and driving sales growth (Robert 1991, Reichheld 2001). After the dot-com storm around the millennium, electronic commerce became the dominant driving force of business transactions, and going online became a must for every business organisation. With the massive use of the Internet and social networks, business organisations face an unprecedented amount of data and increasingly need to find actionable insights in it in order to boost sales, increase efficiency, and improve operations, customer service and risk management. On the other hand, academic research, particularly in data science, has long been trying to find advanced data processing models, methods and algorithms. Research results have been used by software companies such as IBM, Oracle and SAS, and many companies have been developing Big Data crunching software for over two decades.
The timeline in Figure 1.2 can be viewed as three stages based on the business relations with information technologies (IT):
Dependent. This reflects the early days, when data systems were still fairly new and users didn’t quite know what they wanted. IT worked on the assumption of “Build it and they shall come.” Businesses were heavily dependent on IT systems and technologies.
Independent. After several decades of using IT systems, users understood what an analytical platform was, and worked together with IT systems to define the business needs and approach for deriving insights for their business goals.
Interdependent. This is the stage in which Big Data becomes part of the daily business routine. The so-called Big Data Era means that companies have to live together, tied to one another by various strands of data. Social collaboration beyond individual company walls becomes inevitable.
1.2.2 The Drivers of Big Data Analytics
Big Data Analytics has many deep roots. The three major drivers behind today’s widespread Big Data Analytics mentality are listed below:
1. Data exploration. Data from legacy systems, sensor networks and the web, together with newly created user data, have arrived in unprecedented ways, speeds, formats and quantities. This makes them extremely complex and difficult to handle with traditional data management and analytics technology and practices.
2. Computing advancement. Big Data Analytics is the natural result of four major computational advancements:
1) Cheaper and more capable hardware. Moore’s Law states that the number of transistors in an integrated circuit doubles roughly every two years, so the cost of the technology falls while its capability rises (Moore 2006).
2) Mobile computing. Smart phones and mobile tablets are widely used, and their usage covers more and more areas.
3) Social networking. Services such as Facebook, Twitter and WeChat have created virtual communities beyond physical reach.
4) Cloud computing. It enables users to rent or lease hardware or software far beyond what a conventional PC can accommodate.
3. Convergence of software advancement. Traditional data management, analytics software and open-source technology are merging to create new alternatives for businesses to address Big Data Analytics in different ways.
Thanks to the above three drivers, the global economy now generates unprecedented quantities of data at a scale and speed never seen before, with no sign of this acceleration stopping.
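To make the hardware driver above concrete, the following is a minimal sketch of the compounding implied by Moore’s Law. It assumes the common reading of the law as a doubling of transistor count every two years. The function name and the default doubling period are illustrative assumptions, not taken from the book.

```python
def transistor_growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth in transistor count after `years`.

    Assumes an idealised Moore's Law: one doubling per `doubling_period_years`.
    Real hardware trends only approximate this.
    """
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Show how quickly the idealised doubling compounds over time.
    for years in (2, 10, 20):
        factor = transistor_growth_factor(years)
        print(f"after {years:>2} years: x{factor:,.0f}")
```

Under this assumption, ten years gives a 32-fold increase and twenty years roughly a thousand-fold increase, which is one way to see why hardware became cheap enough for large-scale data processing.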
1.2.3 Why Is Big Data Analytics Important?
Big Data Analytics helps organisations harness their data and use it to identify new