Data Mining
Data mining is a set of methods and algorithms for exploring and analyzing large volumes of data
Goal: to find patterns in data that are
- valid: with some certainty
- e.g. everybody speaks English in Blois - not true
- novel: non-obvious to a human
- e.g. everybody speaks French in Blois - obvious
- useful: the extracted knowledge can be acted upon
- understandable for humans
What is DM
What is NOT Data Mining:
- look up a phone number in a phone directory
- compute the number of customers who bought an iPad in August
- can use SQL for that
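The counting task above is a plain lookup that a single query can answer, with no pattern to discover. A minimal sketch using Python's built-in sqlite3 module; the purchases table, its column names, the year and the rows are all made up for illustration:

```python
import sqlite3

# Hypothetical purchases table; schema and rows are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, product TEXT, purchase_date TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "iPad", "2013-08-03"),
        (2, "iPad", "2013-08-17"),
        (2, "iPad", "2013-08-29"),    # same customer buying again
        (3, "iPhone", "2013-08-10"),  # different product
        (4, "iPad", "2013-09-01"),    # outside August
    ],
)


def count_august_ipad_buyers(conn):
    """Count distinct customers who bought an iPad in August (of the assumed year)."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT customer_id) FROM purchases "
        "WHERE product = 'iPad' "
        "AND purchase_date BETWEEN '2013-08-01' AND '2013-08-31'"
    ).fetchone()
    return row[0]


print(count_august_ipad_buyers(conn))
```

The query only aggregates facts already stored in the table. The questions listed under "What is Data Mining" ask for something the table does not contain directly, such as a profile or a prediction.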
What is Data Mining:
- What is the profile of the customers who bought an iPad?
- Which customers will buy the new iPhone?
- Which customers will buy which products?
Origins
DM is a discipline with roots in
- Artificial Intelligence
- Statistics
- Machine Learning
- Pattern Recognition
- Cognitive Science
- Database Systems
Main Focuses
DM is mostly used in
- Customer Relationship Management (CRM)
- churn scoring - predict whether a customer will leave for a competitor
- direct marketing - show ads only to those who are interested
- credit scoring
- sales forecasting
- etc
- website/search optimization
- supply chain optimization
- many others
Types of Data Mining
Rule Mining
Sequence Mining
Graph Mining
- Social Network Mining
Others
- Cluster Analysis
- Web Mining
- Text Mining - part of Natural Language Processing and Information Retrieval
- Stream Mining
- Tree Mining
- Preference Mining
Data Mining Process
CRISP-DM (CRoss-Industry Standard Process for Data Mining)
Business Understanding
- Define the success criteria
- How to integrate the output with existing technologies?
Data Understanding
- Collect the data from Data Sources
- Summarizing Data: First Look at the Data
- Exploratory Data Analysis
- Univariate Analysis - to analyze how the values of a single variable behave in isolation
- Bivariate Analysis - to analyze how two variables interact
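The two kinds of exploratory analysis can be sketched with the Python standard library. This is only an illustration: the variables (customer age and monthly spend) and their values are made up, and the Pearson correlation is just one of several ways to look at how two variables interact.

```python
import math
import statistics

# Made-up sample: customer age and monthly spend (assumptions for this sketch).
ages = [23, 35, 41, 29, 52, 47]
spend = [120.0, 180.0, 210.0, 150.0, 260.0, 240.0]


def univariate_summary(values):
    """Univariate analysis: describe one variable in isolation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Bivariate analysis: Pearson correlation between two variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(univariate_summary(ages))
print(pearson(ages, spend))
```

A correlation close to 1 or -1 suggests a strong linear relationship between the two variables; a value near 0 suggests none, although a non-linear relationship may still exist.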
Data Preparation
- The data must be prepared so that models can process it
- Data Cleaning
- Data Transformation
- Data Reduction
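The three preparation steps above can be sketched as follows. Each function picks one simple technique as a stand-in for the step (dropping incomplete records, min-max scaling, removing constant fields); these choices, the field names and the records are assumptions made for illustration, not the only way to perform each step.

```python
# Made-up records; None marks a missing value (assumption for this sketch).
raw = [
    {"age": 23, "income": 30000, "country": "FR"},
    {"age": None, "income": 42000, "country": "FR"},
    {"age": 41, "income": 58000, "country": "FR"},
    {"age": 35, "income": 46000, "country": "FR"},
]


def clean(records):
    """Data cleaning: drop records that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def min_max_scale(records, field):
    """Data transformation: rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        {**r, field: (r[field] - lo) / span if span else 0.0}
        for r in records
    ]


def drop_constant_fields(records):
    """Data reduction: remove fields that take a single value in every record."""
    constant = {k for k in records[0] if len({r[k] for r in records}) == 1}
    return [{k: v for k, v in r.items() if k not in constant} for r in records]


cleaned = clean(raw)
scaled = min_max_scale(min_max_scale(cleaned, "age"), "income")
prepared = drop_constant_fields(scaled)
print(prepared)
```

Dropping incomplete records is the bluntest form of cleaning; imputing the missing values is a common alternative when data is scarce.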
Data Modeling
Evaluation
Links
- http://en.wikipedia.org/wiki/Data_mining
- nice DM&ML slides [http://www.evernote.com/shard/s344/sh/284d7df3-ef98-41d3-9de5-9cbc4ad4b800/77713ac8ce6e2d4b52e2b5c63e7fe2f5]
- Data Mining syllabus in Boston College [http://www.evernote.com/shard/s344/sh/da3d2ca3-390f-4a0b-b443-b1773c7c24d4/9ad3c26bd0ef9e637d8bdce2011db309]
- Data Mining map by Saed Sayad [http://www.saedsayad.com/data_mining_map.htm]