Computing for Data Analysis (coursera)

coursera notes programming

Курс Computing for Data Analysis об основах языка программирования R

svn директория [http://code.google.com/p/stolzen/source/browse/trunk/courses/coursera/Computing+for+Data+Analysis/]
mindmap [http://stolzen.googlecode.com/svn/trunk/courses/coursera/Computing%20for%20Data%20Analysis/summary.xmind]
конспекты
- неделя 1 [http://stolzen.googlecode.com/svn/trunk/courses/coursera/Computing%20for%20Data%20Analysis/week1/week1%20summary.pdf]
- неделя 2 [http://stolzen.googlecode.com/svn/trunk/courses/coursera/Computing%20for%20Data%20Analysis/week2/week2_summary.pdf]
- неделя 3 [http://stolzen.googlecode.com/svn/trunk/courses/coursera/Computing%20for%20Data%20Analysis/week3/week3_notes.pdf]
- неделя 4 [http://stolzen.googlecode.com/svn/trunk/courses/coursera/Computing%20for%20Data%20Analysis/week4/week4_notes.pdf]

Week1

Syntax

x <- assigment # comment x <- 1:20 - sequence

Data types

atomic classes

logic
characters
numeric
integer
complex
numbers
doubles by default
if integer is needed - use L
Inf = 1/ 0
-Inf
NaN - not a number
vector
everything of the same class
creating a vector
- vector() creates a new vector
- x <- vector(“numeric”, length = 10)
  - creates a vector filled with NA
- c() concatenate - other function for creating vector
- c(0.5, 0.6) numeric
- c(TRUE, FALSE); c(T, F) logical
- c(“a”, “b”) character
- 9:29 integer
- c(1+0i, 2+4i) complex
  matrix
vector with dimension attributes
- nrow
- ncol
creating
- m <- matrix(nrow = 2, ncol = 3)
- m <- matrix(1:6, nrow = 2, ncol = 3)
  - from upper left down
  - 1 3 5 2 4 6
- column-binding
  - x <- 1:3; y <- 10:12
  - cbind(x, y)
    - 1 10 2 11 3 12
  - rbind(x, y)
    - 1 2 3 10 11 12
      list
may contain different types
x <- list(1, “a”, T, 1+4i)
attributes
names (dimnames)
dimensions
class
length
other user defined data
attributes() - general function
factor
categorical data vector
- ordered
- unordered
integer vector with each integer having a label
self-describing
x <- factor(c(“yes”, “yes”, “no”, “yes”, “no”)
table(x) => frequency
unclass(x)
- 2 2 1 2 2 1 #(2=yes, 1=no)
gl - generate factors
- f = gl(3, 10) # 1..1 x10, 2..2 x10, 3..3 x10
  data frame
for storing tabular data
special type of list
special attribute: row.names
read.table() read.csv() return data frame
can be converted to matrix
- but it will be coerced
x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
nrow(x)
ncol(x)
convertion
coercion
- finds “last common denom”
- y <- c(1.7, “a”): 1.7 -> “1.7”, “a” (character)
- c(T, 2) -> 1, 2 (TRUE becomes 1)
- c(“a”, TRUE) -> “a”, “TRUE” (T becomes “TRUE”)
explicit
- as.* functions
- as.numeric()
- as.logical()
- as.character()
- as.complex()
- if no convertion possible, returns NA
  missing values
NaN
NA
is.na()
- is.na(NA) => T
- is.na(NaN) => T
is.nan()
- is.nan(NA) => F
- is.nan(NaN) = T
is.na(c(1, NA, 2)) => F T F
naming
if named, will be printed with names
vectors
- x <- 1:3 names(x) => NULL names(x) <- c(“foo”, “bar”, “n”) names(x) => given names
lists
- x <- list(a=1, b=2, c=3)
matrices
- m <- matrix(1:4, nrow=2, ncol=2) dimnames(m) <- list(c(“a”, “b”), c(“c”, “d”))
- m => c d a 1 3 b 2 4

Read/Write

tabular

read

read.table
- arguments:
  - file*
  - header
    - logical - variable names in the first row?
  - sep
    - separator (“,” etc)
  - colClasses
    - types of classes in each column
    - is not specified, R figures out the types
  - nrows
    - number of rows to read
    - default is all the file
  - comment
    - char, after which nothing is read
    - default is #
  - skip
    - number of lines to skip
- data <- read.table(“foo.txt”)
read.csv
- same as read.table, but default separator is “,”

for large files

RTFM

- make a rough calculation of RAM needed

- 1.5 mln rows * 120 cols of numeric data

1.5 mln * 120 * 8 bytes = 1.34 GB

consider overhead

1.43 * 2 ~ 2.7 GB

- comment.char = “” - faster

- colClasses set explicitly - much faster

initial <- read.table(“db.txt”, nrows=100) classes <- sapply(initial, class) #automatically figured out classes head(initial) #just shows the fist 6 lines tabAll <- read.table(“db.txt”, colClasses=classes)

write
- write.table
- write.csv
  R objects
metadata, as R source
- editable - good for VCS - but not very space efficient
read
- source
- dget
write
- dump
- dput
- for one variable y <- data.frame(a=1, b=”a”) dput(y) => to the console dput(y, file=”y.R”) new.y <- dget(“y.R”)
- for several variables x <- smth, y <- smth dump(c(“x”, “y”), file=”data.R”) rm(x, y) # removes x y source(“data.R”)
Subtopic 5
character
readLines
writeLines
serialize/unserialize

File connections
file
- plain file
gzfile
- gzip
bzfile
- bzip2
url
- to a web page
con <- file(“foo.txt”, “r”) data <- read.csv(con) close(con)
1. the same as read.csv(“foo.txt”)
con <- gzfile(“w.gz”) x <- readLines(con, 10) #first 10 lines
con <- url(“http://…”, “r”) x <- readLines(con)

Accessing subsets

[

returns the same type
starts from 1
[[
for lists and data frames
$
from lists and data frames, but using name
vectors
x <- c(“a”, “b”, “c”, “c”, “d”, “a”) x[1] => a x[1:4] => a b c d x[x > “a”] => b c c d
1. the same: u <- x > “a” => F T T T T F x[u] => b c c d
2. only trues
  matrix
m[row, col]
x <- matrix(1:6, 2, 3) x[1, 2] => 3 x[2, 1] => 2
m[1, ] => first row
m[, 2] => secod col
return VECTOR| | - x[1, 2, drop=F] => matrix of one element | - x[1, ,drop=F] => a matrix of one row
lists
x <- list(foo=1:4, bar=0.6)
x[1] => sublist $foo [1] 1 2 3 4
x1 => (element) 1 2 3 4
x$bar <=> x“bar” => 0.6
x[“bar”] => sublist $bar [1] 0.6
x[c(1, 3)] returns several cols
name <- “bar” x[name] => smth x$name => NULL| (“name” doesn’t exist) |- xc(1, 3) => 1st list, 3th el | - the same: x1 3
partial matching
- x <- list(longname=1:5) x$lo => returns $longname x“lo” => NULL x“lo”, exact=F => $longname
  removing NAs
bad <- is.na(x) x[| bad] #note the inversion! |- good <- complete.cases(x, y) |x[good]; y[good]
r <- matrix good <- complete.cases(r) r[good, ]
vectorized operations
vector
- no need for looping - x <- 1:4; y <- 6:9 - x + y
- x > 2
- x >= 2
- y == 8
- x * y
- x / y
matrix
- x <- matrix(1:4, 2, 2) y <- matrix(rep(10, 4)), 2, 2)
- x * y => element-wise
- x / y => element-wise
- x %*% y - true matrix multiplication

Week2

Control Structures

conditions

if (condition) {
smth

} else if (cond2) {

smth

} else {

smth

}

blocks return result
y = if (x > 3) { 10 } else { 0 }
loops
while
- while (condition) { #do smth }
repeat
- infinite loop
- repeat {
  smth
  
  if (smth) { break; } }
for
- for (i in 1:10) { print(i) }
- for (i in seq_lenght(nrow(x))) { print(x[i]) }
next
- like continue
break

Functions

high-level objects

may take function as parameters, or return functions
declaration
f = function(arg.x, arg.y, def.val = NULL) { 4 # no need for return }
f = function(arg.x, …) {
… - any other arguments

another.function(10, 20, …) }

scoping
lexical scoping
environments
- env - name-value pairs
- env can have a parent and multiple children
- env of a function - its closure
- global env
  - seach() - list of packages
- local function environment
  - consider make.power function
    - make.power = function(n) { pow = function(x) { x ^ n } }
  - cube = make.power(3)
  - ls(environment(cube)) #lists variables
  - get(“n”, environment(cube)) #returns variable n
look up
- local env
- if not in the function env, look up in the parental
- until the top one reached
- global
  args(f) returns list of arguments

Loop function

lapply

lapply(X, FUN, …)
X -list, returns a list
if X not a list, it will be coerced
x = list(a = 1:5, b=rnorm(10)) lapply(x, mean)
sapply
same as lapply, but simplifies the result
if each el contains one element, result coerced to vector
if each el of the same len, result coerced to matrix
otherwise a list is returned
apply
evaluates the function over the margin of array
apply(X, MARGIN, FUN, …)
MARGIN: 1 for row, 2 for col
- rowSum = apply(x, 1, sum)
- rowMean = apply(x, 1, mean)
- colSum = apply(x, 2, sum)
- colMean = apply(x, 2, mean)
apply(x, 1, quantile, probs=c(0.25, 0.75))
- apply for each row: qualtile(row, probs=c(0.25, 0.75)
  tapply
appliers for a group
tapply(X, INDEX, FUN, …)
example
- x = c(rnorm(10), runif(10), rnorm(10, 1))
- f = gl(3, 10) # 1x10, 2x10, 3x10
- tapply(x, f, mean) - returns 3 means
  split
takes a vector and a factor
splits it into groups determined by factors
apply(X, F, …)
example
- x = c(rnorm(10), runif(10), rnorm(10, 1))
- f = gl(3, 10) # 1x10, 2x10, 3x10
- will return 3 groups
  mapply
multivariable apply
mapply(FUN, …)
mapply(rep, 1:4, 4:1)
- rep(1, 4)
- rep(2, 3)
- rep(3, 2)
- rep(4, 1)

Debugging

traceback

prints out stacktrace
debug
you can step through
browser
suspends execution for debugging
trace
allows to insert debug code into existent functions
recover
gets the control back

Week3

Simulations

generating random numbers

common prefixes

d - density
r - random number
p - cumulative distribution
q - quantile
normal
rnorm
dnorm
unified
runif
dunif
set.seed(n)
ensures reproducibility
examples
linear model
- y = B_0 + B_1 * X + E
- E ~ N(0, 4), X ~ N(0, 1), B_0=0.5, B_1=2
- set.seed(20)
- x = rnorm(100)
- e = rnorm(100, 0, 2)
- y = 0.5 + 2 * x + e
- plot(x, y)
  sampling
sample(1:10, 4) # 4 random numbers from 1:10
sample(1:10) # permutation
sample(letters, 5)
sample(1:10, replace=TRUE) # el can be used multiple times

Plotting

base

plot(x, y)
params
- global
  - par - global parameters (?par)
  - example (margins)
    - x = rnorm(100)
    - y = rnorm(100)
    - par(mar=c(2, 2, 2, 2))
    - plot(x, y)
- plot(x, y, pch=20) # solid circles
adding
- title(“name”) adds a title
- legend(“topleft”, legend=”Data”, pch=20)
- adding a line
  - fit = lm(y ~ x)
  - abline(fit)
  - abline(fit, lwd=3, color=”blue”) # thick
example(points) - built-in demos
example
- plot(x, y, xlab=”weight”, ylab=”height”, main=”Scatterplot”, pch=20)
- several plots
  - par(mfraw=c(2, 1))
  - plot(x, y, pch=20)
  - plot(x, z, pch=19)
annotations
- expression
  - plot(0, 0, main=expression(theta==0), ylab=expression(hat(gamma)==0), xlab=expression(sum(x[i]*y[i], i==1,n))
- substitute
  - xlab=substitute(bar(X)==k, list(k=mean(x))
  - replaces x in the exp onto mean
    lattice
generally tale a formula
y ~ x f * g - y ~ x - variables
- f * g - conditional variables
returns an object
example
- x = rnorm(100)
- y = x + rnorm(100, sd=0.5)
- f = gl(2, 50, labels=c(“g1”, “g2”))
- xyplot(x ~ y | f) # x vs y, split conditioned on f |
  Sorting
  
  sort(read)
returns a sorted array
order(read)
sorts and returns a vector with ordered indexes
- read[order(read)] == sort(read)
order(read, prog) # 2 variables
order(prog, -read) # reverse order
order(prog, na.last=F) #na goes first
order(prog, la.last=NA) # na omitted

Week4

Dates and time

Date class - only dates

Time

POSIXct
- based on integer
POSIXlt
- based on list
x = Sys.time()
p = as.POSIXlt(x)
p$sec
Convertion
strptime
- dates = c(“January 10, 2012 10:40”, “December 9, 2011 10:50”)
- x = strptime(dates, “%B %d, %Y %H:%M”)
  Operations
many operations are allowed: < > + - etc
leap years, seconds etc are considered
- y = as.Date(“2012-03-01”)
- x = as.Date(“2012-02-28”)
- y - x # returns 2
and time zones
- x = as.POSIXct(“2012-10-25 01:00:00”)
- y = as.POSIXct(“2012-10-25 06:00:00”, tz=”GMT”)
- x - y = 1 hour

Reg exps

grep

grep
- returns indexes of matches
grepl
- returns T/F vector
  regexpr
index where match begins, only first
gregexpr - all matches
replacing
sub
- sub(pattern, replacement)
- replaces found pattern on given string
- only first occurence
gsub
- replaces everything
  regmatches function
for extracting results returned by regexpr
r = regexpr(pattern, h)
regexpr(h, r)
regexec
regexec(pattern, d)
returns list with matches
if groups are used, also returned
example
r = regexec(“<dd>[Ff]ound on (.*?)</dd>”, homicides)
m = regmatches(homicides, r)
dates = sapply(m, function(x) x[2])
dates = as.Date(dates, “%B %d, %Y”)
hist(dates, “year”, freq=TRUE)

Classes and methods

class - description

object - instance of a class

method - operates on a certain class

generic function - for all objets

don’t do anything
only dispatches to appropriate function
more information
?Classes, ?Methods, ?setClass, ?setMethod
S3 classes
“old-style”
mean
- UseMethod(“mean”)
- methods(“mean”) lists methods for mean
  S4 classes
“new-style”
show
- StandartGeneric(“show”)
- ShowMethod(“show”)
  dispatching
gets the class
searches for appropriate method
if method exists, it’s called
otherwise default is called
if no default, exception is thrown
example (S4)
setClass - creates a new class
slots - data stored there
- accessed via @
setClass(“polygon”, representation(x=”numeric”, y=”numeric”))
setMethod(“plot”, “polygon”, function(x, y, …) { plot(x@x, x@y, type=n, …) xp = c(x@x, x@x[1]) yp = x(x@y, x@y[1]) lines(xp, yp) } })
p = new(“polygon”, x=c(1, 2, 3, 4), y=c(1, 2, 3, 1))
plot(p)

Misc

Errors

if (condition) { stop(“error message”) }

Proxy

Sys.setenv(http_proxy=”http://username:password@proxyaddress:8080”)

Sets operations

setdiff(i, j) - elements present in i, not present in j

✏️ Edit on GitHub

Computing for Data Analysis (coursera)

Week1

Syntax

Data types

atomic classes

numbers

vector

matrix

list

attributes

factor

data frame

convertion

missing values

naming

Read/Write

tabular

R objects

character

serialize/unserialize

File connections

Accessing subsets

[

[[

$

vectors

matrix

lists

removing NAs

vectorized operations

Week2

Control Structures

conditions

smth

smth

smth

blocks return result

loops

smth

Functions

high-level objects

declaration

… - any other arguments

scoping

args(f) returns list of arguments

Loop function

lapply

sapply

apply

tapply

split

mapply

Debugging

traceback

debug

browser

trace

recover

Week3

Simulations

generating random numbers

common prefixes

normal

unified

set.seed(n)

examples

sampling

Plotting

base

lattice

Sorting

sort(read)

order(read)

Week4

Dates and time

Date class - only dates

Time

Convertion

Operations

Reg exps