Data - values of qualitative or quantitative variables belonging to a set of items
- set of items - subjects
- variables - measurements
- raw data is hard to use
- and often comes in a complex format
Want to have Pre-Processed Data
- ready for analysis
- each variable forms a column
- each observation forms a row
- each file stores data about one kind of observation
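The layout described above can be illustrated with a small sketch. The records, the column names (`subject`, `visit`, `score`) and the helper `to_tidy` are all hypothetical, chosen only to show one way of turning a "wide" table into one where each variable is a column and each observation is a row.

```python
# Hypothetical "wide" records: one row per subject, one column per visit.
# Here the visit is really a variable hidden in the column names.
wide = [
    {"subject": "s1", "visit_1": 5.1, "visit_2": 5.4},
    {"subject": "s2", "visit_1": 4.8, "visit_2": 4.9},
]


def to_tidy(records, id_key, value_name):
    """Return one row per (subject, visit) observation."""
    tidy = []
    for rec in records:
        for key, value in rec.items():
            if key == id_key:
                continue  # the identifier is repeated on every output row
            tidy.append({id_key: rec[id_key], "visit": key, value_name: value})
    return tidy


tidy = to_tidy(wide, "subject", "score")
for row in tidy:
    print(row)
```

After the reshaping, every row holds exactly one observation (one subject at one visit) and every key is one variable, which is the form most analysis tools expect.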
Types of Data Analysis
- Descriptive
- Goal: to describe a set of data
- commonly applied to census data
- Exploratory
- Goal: to find relationships you didn't know about
- and to generate ideas for follow-up studies
- Exploratory analyses alone should not be used for generalizing/predicting
- Inferential
- Goal: to use a small data sample to say something about the bigger population
- Predictive
- Goal: to use data on some objects to predict values for another object
- Causal
- Goal: to find out what happens to one variable if another one changes
- Mechanistic
- Goal: to understand the exact changes in variables that lead to changes in other variables
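To make the difference between describing data and generalizing from it more concrete, here is a minimal sketch using only the Python standard library. The population is simulated, and the normal-approximation interval (z = 1.96) is an assumption made for illustration, not a method prescribed by these notes.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated measurements.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Describing: summarize only the data you actually hold.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Generalizing: use the small sample to say something about the bigger
# population, here an approximate 95% confidence interval for its mean.
margin = 1.96 * sample_sd / math.sqrt(len(sample))
ci = (sample_mean - margin, sample_mean + margin)

print(f"sample mean = {sample_mean:.2f}")
print(f"approx. 95% CI for the population mean = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The first two numbers only describe the 100 sampled values; the interval is a statement about all 10,000, and it comes with an explicit measure of uncertainty.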
Structure of Data Analysis
- Define the question (business/scientific)
- Start with some general question
- "Can I automatically detect messages that are SPAM?"
- Make it concrete
- "Can I use quantitative characteristics of emails to classify them?"
- Obtain the data
- What data can you access?
- A lot of data can be obtained from existing Data Sources
- you may also buy or generate data
- Clean the data - so you can analyze it
- Is the data you found good enough?
- Most often it is not, so you'll have to transform the data
- you may have to use ETL processes for that and load the data into a Data Warehouse
- Exploratory Data Analysis
- Statistical prediction/modeling
- To answer the question you asked
- Should be informed by the result of the previous phase
- Methods may depend on the questions
- Typically Data Mining and Machine Learning algorithms are used for this
- Report all measures of uncertainty, e.g. the number of errors made on the test set
- Interpret results
- What do the results mean, in plain natural language?
- Challenge results
- What are potential failings?
- Challenge all the steps
- The question: was it right? could you have made it more specific/general?
- Data Sources
- was it the right data? did you get the right samples? the right population?
- were the variables identified correctly?
- The model: did we pick the right one? could the results be better with another model?
- Synthesize/write up results
- In plain language - using the data to answer the question
- should read like a story
- Create reproducible code
- so you can share your analysis with other people
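The steps above can be sketched end to end for the SPAM question. Everything in this example is hypothetical: the eight messages are made up, the single feature (share of upper-case letters) is just one possible quantitative characteristic, and the threshold rule stands in for a real Data Mining or Machine Learning algorithm.

```python
import random

random.seed(42)  # fixed seed: rerunning the script gives the same split

# Hypothetical labelled messages (text, is_spam); real data would be
# obtained and cleaned in the earlier steps.
messages = [
    ("WIN $$$ NOW, claim your FREE prize", True),
    ("Lunch at noon tomorrow?", False),
    ("FREE $$ offer, ACT NOW", True),
    ("Please review the attached report", False),
    ("CHEAP meds $9.99 BUY NOW", True),
    ("Meeting moved to room 4", False),
    ("You WON $1000, reply NOW", True),
    ("Can you send me the slides?", False),
]


def upper_fraction(text):
    """Quantitative characteristic: share of letters that are upper-case."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def errors(data, threshold):
    """Count messages misclassified by the rule 'spam if fraction > threshold'."""
    return sum((upper_fraction(text) > threshold) != label for text, label in data)


# Split into a training set and a held-out test set.
random.shuffle(messages)
train, test = messages[:5], messages[5:]

# "Model": the threshold that makes the fewest mistakes on the training set.
candidates = [i / 20 for i in range(21)]
best = min(candidates, key=lambda t: errors(train, t))

# Report uncertainty: mistakes on messages the model has not seen.
test_errors = errors(test, best)
print(f"threshold = {best:.2f}, test errors = {test_errors} of {len(test)}")
```

Because the seed is fixed and the whole analysis lives in one script, anyone you share it with can rerun it and get the same split, the same threshold and the same error count. With so few messages the reported error is only a rough indication, which is itself something to mention when challenging the results.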