Data Mining Process
CRISP-DM [http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining]
- CRISP-DM (CRoss Industry Standard Process for Data Mining)
- it has 6 phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
CRISP-DM: four levels of abstraction
- Phases
- Example: Data Preparation
- Generic Tasks
- A stable, general and complete set of tasks
- Example: Data Cleaning
- Specialized Tasks
- A specific task that belongs to a generic task
- Example: Missing Value Handling
- Process Instances
- How a specialized task is actually carried out
- Example: replace missing values with the mean for numeric attributes and with the most frequent value for categorical attributes
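The process instance above (mean for numeric attributes, most frequent value for categorical ones) can be sketched as follows. This is a minimal illustration using only the standard library; the function name `impute_column` and the convention that a missing value is represented by `None` are assumptions made for the example, not part of CRISP-DM.

```python
from collections import Counter
from statistics import mean


def impute_column(values):
    """Fill None entries: mean for numeric columns, most frequent value otherwise.

    Hypothetical helper; missing values are assumed to be encoded as None.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # no observed values to derive a fill value from
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)  # numeric attribute: use the mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # categorical: use the mode
    return [fill if v is None else v for v in values]


ages = [20, None, 40]
colors = ["red", "blue", None, "red"]
print(impute_column(ages))
print(impute_column(colors))
```

In practice the fill values would be computed on the training data only and then reused for new data, so that the evaluation is not influenced by the test set.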
Business Understanding
Main Objectives
- Define the success criteria
- What form should the output take?
- How to integrate the output with existing technologies?
Data Understanding
Main Objectives
- Collect the data
- What are the data sources?
- see Data Sources for a list of links
- Summarizing Data: First Look at the Data
- Exploratory Data Analysis
- building simple plots of the data (histograms, etc.)
- to help understand the distribution of the data
- Univariate Analysis - to analyze how variable values behave in isolation
- Bivariate Analysis - to analyze how two variables interact
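A first look at the data along these lines might be sketched as below: a summary and a text histogram for one variable (univariate), and a Pearson correlation between two variables (bivariate). The helper names `histogram` and `pearson` and the sample values are made up for illustration.

```python
from collections import Counter
from statistics import mean, median, pstdev


def histogram(values, bin_width):
    """Count values per equal-width bin; keys are the lower edges of the bins."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither variable is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Illustrative sample data (not from any real data set)
heights = [150, 160, 165, 170, 180]
weights = [50, 58, 63, 68, 80]

# Univariate analysis: how one variable behaves in isolation
print("mean:", mean(heights), "median:", median(heights), "std:", pstdev(heights))
for edge, count in histogram(heights, 10).items():
    print(f"{edge:>4} | {'#' * count}")

# Bivariate analysis: how two variables interact
print("correlation:", pearson(heights, weights))
```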
Data Preparation
The data must be prepared so that it can be processed by models
- Data Cleaning - Handling Noise, Anomaly Detection, Duplicate Detection, etc.
- Data Transformation - Data Normalization, Data Discretization
- Data Reduction
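Some of the preparation steps listed above can be sketched in a few lines. The example below covers duplicate detection, min-max normalization and equal-width discretization; these are just one possible choice for each step, and the function names are hypothetical. Data reduction is not shown.

```python
def drop_duplicates(rows):
    """Data cleaning: keep the first occurrence of each row (rows must be hashable)."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, n_bins):
    """Data transformation: equal-width binning into bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


rows = [(1, "a"), (2, "b"), (1, "a")]
print(drop_duplicates(rows))
print(min_max_normalize([10, 20, 30]))
print(discretize([0, 5, 10], 2))
```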
Modeling
Prediction Tasks
- models to predict unknown or future values
- Classification Models: predict a categorical value
- Regression Models: predict a continuous value
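The difference between the two kinds of prediction can be illustrated with two deliberately simple models: a 1-nearest-neighbor classifier (returns a category) and a least-squares line (returns a number). Both are only examples of each family, chosen for brevity; the function names and the toy data are assumptions.

```python
def nearest_neighbor_classify(train, x):
    """Classification: return the label of the training example closest to x.

    train is a list of (feature, label) pairs with a single numeric feature.
    """
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(xs, ys):
    """Regression: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


# Classification: the predicted value is categorical
train = [(1.0, "small"), (10.0, "large")]
print(nearest_neighbor_classify(train, 2.5))

# Regression: the predicted value is continuous
slope, intercept = fit_line([0, 1, 2], [1, 3, 5])
print(slope * 3 + intercept)
```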
Description Tasks
- Goal: find patterns / clusters that describe a data set
- Cluster Analysis: find clusters in data
- Extraction of local patterns: find local properties in a data set
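As an illustration of a description task, the sketch below runs a basic k-means clustering on one-dimensional data. K-means is only one of many clustering methods; the function name `kmeans_1d`, the fixed number of iterations and the explicitly supplied initial centroids are simplifying assumptions for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given initial centroids.

    Returns (centroids, clusters), where clusters[i] holds the points
    assigned to centroids[i] in the last iteration.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]  # keep centroid of an empty cluster
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
print(centroids)
print(clusters)
```

No target attribute is involved here: the result describes the structure of the data rather than predicting a value.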
Evaluation
Main Questions
- How to evaluate a method? - Error Analysis
- How to compare different models that solve the same problem? - Cross-Validation and
Objective Measures:
- Error rate of a classifier - Error Metrics
- Confidence of association rules
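The measures above can be sketched as follows: the error rate of a classifier, index splits for k-fold cross-validation, and the confidence of an association rule, taken here as the fraction of transactions containing the antecedent that also contain the consequent. All function names and the toy data are made up for the example; the folds are contiguous and unshuffled for simplicity.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def rule_confidence(transactions, antecedent, consequent):
    """confidence(A -> B) = count(A and B) / count(A)."""
    a, b = set(antecedent), set(consequent)
    with_a = [t for t in transactions if a <= set(t)]
    if not with_a:
        return 0.0  # the rule never applies in this data
    return sum(1 for t in with_a if b <= set(t)) / len(with_a)


print(error_rate(["a", "b", "a", "b"], ["a", "a", "a", "b"]))

for train, test in k_fold_indices(5, 2):
    print("train:", train, "test:", test)

baskets = [{"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "milk", "eggs"}]
print(rule_confidence(baskets, {"bread"}, {"milk"}))
```

To compare two models with cross-validation, each model would be trained on every `train` split and scored with `error_rate` on the matching `test` split, and the average errors compared.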
Subjective Measures: