Notes on the Computing for Data Analysis course, which covers the basics of the R programming language

## Week1

### Syntax

 x <- value # assignment
 # comment
 x <- 1:20 # sequence from 1 to 20


### Data types

• logical
• character
• numeric
• integer
• complex

#### numbers

• doubles by default
• if an integer is needed, use the L suffix (e.g. 1L)
• Inf = 1 / 0
• -Inf = -1 / 0
• NaN - not a number (e.g. 0 / 0)

#### vector

• everything of the same class
• creating a vector
• vector() creates a new vector
• x <- vector("numeric", length = 10)
• creates a vector filled with the type's default value (0 for numeric)
• c() (concatenate) - another function for creating vectors
• c(0.5, 0.6) numeric
• c(TRUE, FALSE); c(T, F) logical
• c("a", "b") character
• 9:29 integer
• c(1+0i, 2+4i) complex
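A minimal sketch of the vector constructors listed above, checking the class each one produces. The variable names are illustrative only.

```r
# Create one vector per atomic class and inspect the result
v_empty <- vector("numeric", length = 3)  # numeric vectors start out as zeros
v_num   <- c(0.5, 0.6)
v_lgl   <- c(TRUE, FALSE)
v_chr   <- c("a", "b")
v_int   <- 9:29
v_cplx  <- c(1+0i, 2+4i)

print(v_empty)
classes <- sapply(list(v_num, v_lgl, v_chr, v_int, v_cplx), class)
print(classes)
```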

#### matrix

• vector with dimension attributes
• nrow
• ncol
• creating
• m <- matrix(nrow = 2, ncol = 3)
• m <- matrix(1:6, nrow = 2, ncol = 3)
• filled column-wise, starting from the upper left and going down
• 1 3 5
  2 4 6

• column-binding
• x <- 1:3; y <- 10:12
• cbind(x, y)
• 1 10
  2 11
  3 12

• rbind(x, y)
• 1 2 3
  10 11 12
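The matrix notes above can be tried out as follows. This is a small sketch: it also shows that a matrix is just a vector with a `dim` attribute, by building the same matrix from a plain vector.

```r
m <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column
print(m)

x <- 1:3
y <- 10:12
cb <- cbind(x, y)  # x and y become columns: 3 rows, 2 columns
rb <- rbind(x, y)  # x and y become rows: 2 rows, 3 columns
print(dim(cb))
print(dim(rb))

# A matrix is a vector plus a dim attribute
v <- 1:6
dim(v) <- c(2, 3)
print(all(v == m))
```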

#### list

• may contain different types
• x <- list(1, "a", T, 1+4i)

#### attributes

• names (dimnames)
• dimensions
• class
• length
• other user defined data
• attributes() - general function

#### factor

• categorical data vector
• ordered
• unordered
• integer vector with each integer having a label
• self-describing
• x <- factor(c("yes", "yes", "no", "yes", "no"))
• table(x) => frequency
• unclass(x)
• 2 2 1 2 1 #(2=yes, 1=no)
• gl - generate factors
• f = gl(3, 10) # 1..1 x10, 2..2 x10, 3..3 x10
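A short sketch of the factor behaviour described above. The `levels` argument in the second call is not in the notes; it is included to show how the integer codes depend on level order.

```r
x <- factor(c("yes", "yes", "no", "yes", "no"))
print(levels(x))   # levels are sorted alphabetically by default
print(table(x))    # frequency of each level
codes <- as.integer(unclass(x))  # the underlying integer codes
print(codes)

# Setting the level order explicitly changes the codes
x2 <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
codes2 <- as.integer(x2)

f <- gl(3, 10)  # 3 levels, each repeated 10 times
```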

#### data frame

• for storing tabular data
• special type of list
• special attribute: row.names
• can be converted to matrix
• but it will be coerced
• x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
• nrow(x)
• ncol(x)

#### conversion

• coercion
• finds the "least common denominator" class that can represent every element
• y <- c(1.7, "a"): 1.7 -> "1.7", "a" (character)
• c(T, 2) -> 1, 2 (TRUE becomes 1)
• c("a", TRUE) -> "a", "TRUE" (T becomes "TRUE")
• explicit
• as.* functions
• as.numeric()
• as.logical()
• as.character()
• as.complex()
• if no conversion is possible, returns NA (with a warning)
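The coercion rules above can be checked directly. A minimal sketch; `suppressWarnings` is used only to keep the output quiet.

```r
# Implicit coercion when mixing types in c()
y1 <- c(1.7, "a")   # numeric + character -> character
y2 <- c(TRUE, 2)    # logical + numeric   -> numeric
y3 <- c("a", TRUE)  # character + logical -> character
print(class(y1)); print(class(y2)); print(class(y3))

# Explicit coercion with as.* functions
ok  <- as.numeric(c("1", "2.5"))
lgl <- as.logical(0:2)                        # 0 is FALSE, non-zero is TRUE
bad <- suppressWarnings(as.numeric("abc"))    # no sensible conversion -> NA
print(bad)
```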

#### missing values

• NaN
• NA
• is.na()
• is.na(NA) => T
• is.na(NaN) => T
• is.nan()
• is.nan(NA) => F
• is.nan(NaN) => T
• is.na(c(1, NA, 2)) => F T F

#### naming

• if named, will be printed with names
• vectors
• x <- 1:3
• names(x) => NULL
• names(x) <- c("foo", "bar", "n")
• names(x) => the given names

• lists
• x <- list(a=1, b=2, c=3)
• matrices
• m <- matrix(1:4, nrow=2, ncol=2)
• dimnames(m) <- list(c("a", "b"), c("c", "d"))
• m =>
    c d
  a 1 3
  b 2 4

#### tabular

• read.table
• arguments:
• file*
• header
• logical - variable names in the first row?
• sep
• separator ("," etc)
• colClasses
• class of each column
• if not specified, R figures out the types
• nrows
• number of rows to read
• default is the whole file
• comment.char
• char, after which nothing is read
• default is #
• skip
• number of lines to skip
• read.csv
• same as read.table, but default separator is ","
• for large files
• RTFM!
• make a rough calculation of RAM needed
• 1.5 mln rows * 120 cols of numeric data
• 1.5 mln * 120 * 8 bytes = 1.44e9 bytes ~ 1.34 GB
• consider overhead! 1.34 * 2 ~ 2.7 GB
• comment.char = "" - faster
• colClasses set explicitly - much faster

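• initial <- read.table("db.txt", nrows=100) # read a small sample first (100 rows is an arbitrary choice)
• head(initial) # just shows the first 6 lines
• classes <- sapply(initial, class) # automatically figured out classes
• tabAll <- read.table("db.txt", colClasses=classes)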
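A sketch of the memory estimate and the two-pass read described above. The temporary file and its two columns (`id`, `score`) are made up for illustration; a real data set would be read from its own path.

```r
# Rough RAM estimate: rows * cols * 8 bytes per numeric value
rows <- 1.5e6
cols <- 120
gib <- rows * cols * 8 / 2^30
print(round(gib, 2))      # raw size in GB
print(round(2 * gib, 1))  # with roughly 2x overhead

# Two-pass read: infer classes from a small sample, then read everything
tmp <- tempfile(fileext = ".txt")   # hypothetical data file
write.table(data.frame(id = 1:5, score = c(1.5, 2, 2.5, 3, 3.5)),
            tmp, row.names = FALSE)
initial <- read.table(tmp, header = TRUE, nrows = 3)
classes <- sapply(initial, class)
tabAll  <- read.table(tmp, header = TRUE, colClasses = classes,
                      comment.char = "")
unlink(tmp)
print(classes)
```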

• write
• write.table
• write.csv

#### R objects

• editable!
• good for VCS
• but not very space efficient
• read
• source
• dget
• write
• dump
• dput
• for one variable: dput / dget
• y <- data.frame(a=1, b="a")
• dput(y) => prints to the console
• dput(y, file="y.R")
• new.y <- dget("y.R")
• for several variables: dump / source
• x <- smth; y <- smth
• dump(c("x", "y"), file="data.R")
• rm(x, y) # removes x and y
• source("data.R") # restores x and y

• writeLines
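The dput/dget and dump/source round trips can be sketched like this. Temporary files stand in for `y.R` and `data.R`, and `local = TRUE` is an addition so that `source` restores the objects into the calling environment.

```r
# One object: dput writes it as R code, dget reads it back
y <- data.frame(a = 1, b = "a")
f1 <- tempfile(fileext = ".R")
dput(y, file = f1)
new.y <- dget(f1)

# Several objects: dump writes assignments, source re-runs them
x <- "foo"
z <- 1:3
f2 <- tempfile(fileext = ".R")
dump(c("x", "z"), file = f2)
rm(x, z)                   # objects are gone ...
source(f2, local = TRUE)   # ... and restored from the file
unlink(c(f1, f2))

print(all.equal(y, new.y))
```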

#### File connections

• file
• plain file
• gzfile
• gzip
• bzfile
• bzip2
• url
• to a web page
• con <- file("foo.txt", "r")

• con <- gzfile("w.gz")

x <- readLines(con, 10) #first 10 lines
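A self-contained sketch of a gzip connection: it writes a small compressed file first (a temporary file, used here in place of `w.gz`) and then reads the first 10 lines back. Closing the connection is not in the notes but is good practice.

```r
path <- tempfile(fileext = ".gz")

con <- gzfile(path, "w")                 # open for writing
writeLines(paste("line", 1:20), con)
close(con)

con <- gzfile(path, "r")                 # open for reading
first <- readLines(con, 10)              # first 10 lines only
close(con)
unlink(path)

print(first)
```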

### Accessing subsets

#### [

• returns the same type
• starts from 1

#### [[

• for lists and data frames

#### $

• for lists and data frames, but using a name

#### vectors

• x <- c("a", "b", "c", "c", "d", "a")
• x[1] => "a"
• x[1:4] => "a" "b" "c" "c"
• x[x > "a"] => "b" "c" "c" "d"
• the same in two steps:
• u <- x > "a" => F T T T T F
• x[u] => "b" "c" "c" "d" (only the TRUE positions)

#### matrix

• m[row, col]
• x <- matrix(1:6, 2, 3)
• x[1, 2] => 3
• x[2, 1] => 2
• x[1, ] => first row
• x[, 2] => second col
• these return a VECTOR!
• x[1, 2, drop=F] => matrix of one element
• x[1, , drop=F] => a matrix of one row

#### lists

• x <- list(foo=1:4, bar=0.6)
• x[1] => sublist: $foo [1] 1 2 3 4
• x[[1]] => (element) 1 2 3 4
• x$bar <=> x[["bar"]] => 0.6
• x["bar"] => sublist: $bar [1] 0.6
• x[c(1, 3)] returns several elements as a sublist (for a list with at least 3 elements)
• name <- "bar"
• x[name] => sublist $bar
• x$name => NULL! (no element is called "name")
• x[[c(1, 3)]] => 3rd element of the 1st element
• the same: x[[1]][[3]]
• partial matching
• x <- list(longname=1:5)
• x$lo => returns $longname
• x[["lo"]] => NULL
• x[["lo", exact=F]] => $longname
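A sketch contrasting `[`, `[[` and `$` on a list and showing the effect of `drop` on a matrix. The third list element `baz` is an assumption, added so that index 3 exists.

```r
x <- list(foo = 1:4, bar = 0.6, baz = "hello")  # baz is illustrative
print(class(x[1]))    # [ keeps the container: a list
print(class(x[[1]]))  # [[ extracts the element: an integer vector

name <- "bar"
by_var  <- x[[name]]  # computed index works with [[
by_name <- x$name     # $ looks for an element literally called "name"
nested  <- x[[c(1, 3)]]  # 3rd element of the 1st element

y <- list(longname = 1:5)
partial_dollar <- y$lo                      # $ does partial matching
partial_strict <- y[["lo"]]                 # [[ is exact by default
partial_loose  <- y[["lo", exact = FALSE]]

m <- matrix(1:6, 2, 3)
row_vec <- m[1, ]                # dimensions dropped: a vector
row_mat <- m[1, , drop = FALSE]  # dimensions kept: a 1 x 3 matrix
```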

#### removing NAs

• good <- complete.cases(x, y)
• x[good]; y[good]
• r <- matrix
• good <- complete.cases(r)
• r[good, ]
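A small sketch of `complete.cases` on two parallel vectors and on a matrix. The sample values are made up.

```r
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)   # TRUE where neither x nor y is missing
x_clean <- x[good]
y_clean <- y[good]

# For a matrix or data frame, a row is complete if it has no NA at all
r <- matrix(c(1, NA, 3, 4, 5, 6), nrow = 3)
r_clean <- r[complete.cases(r), , drop = FALSE]
print(r_clean)
```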

#### vectorized operations

• vector
• no need for looping!
• x <- 1:4; y <- 6:9
• x + y
• x > 2
• x >= 2
• y == 8
• x * y
• x / y
• matrix
• x <- matrix(1:4, 2, 2)
• y <- matrix(rep(10, 4), 2, 2)

• x * y => element-wise
• x / y => element-wise
• x %*% y - true matrix multiplication
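The difference between element-wise and true matrix multiplication, using the same `x` and `y` as in the notes:

```r
x <- matrix(1:4, 2, 2)          # columns: (1, 2) and (3, 4)
y <- matrix(rep(10, 4), 2, 2)   # all entries are 10

elementwise <- x * y     # each entry multiplied by its counterpart
matprod     <- x %*% y   # rows of x times columns of y

print(elementwise)
print(matprod)
```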

## Week2

### Control Structures

#### conditions

• if (condition) {
    # smth
  } else if (cond2) {
    # smth
  } else {
    # smth
  }

#### blocks return result

• y = if (x > 3) {
    10
  } else {
    0
  }

#### loops

• while
• while (condition) {
    # do smth
  }

• repeat
• infinite loop
• repeat {
    # smth
    if (smth) {
      break
    }
  }

• for
• for (i in 1:10) { print(i) }
• for (i in seq_len(nrow(x))) {
    print(x[i, ]) # i-th row
  }

• next
• like continue
• break

### Functions

#### first-class objects

• may take function as parameters, or return functions

#### declaration

• f = function(arg.x, arg.y, def.val = NULL) {
    4 # no need for return
  }

• f = function(arg.x, ...) {
    # ... - any other arguments
    another.function(10, 20, ...)
  }

#### scoping

• lexical scoping
• environments
• env - name-value pairs
• env can have a parent and multiple children
• a function together with its environment is a closure
• global env
• search() - lists the attached packages (the search path)
• local function environment
• consider make.power function
• make.power = function(n) {
    pow = function(x) {
      x ^ n
    }
    pow # return the inner function
  }

• cube = make.power(3)
• ls(environment(cube)) #lists variables
• get("n", environment(cube)) #returns variable n
• look up
• local env
• if not found in the function env, look it up in the parent env
• until the top one reached
• global
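The `make.power` example above, written out as a runnable sketch. `square` is an extra illustration showing that each closure keeps its own `n`.

```r
make.power <- function(n) {
  pow <- function(x) {
    x ^ n   # n is not defined in pow; it is found in the enclosing env
  }
  pow
}

cube   <- make.power(3)
square <- make.power(2)

print(cube(3))
print(square(3))
print(ls(environment(cube)))        # variables in cube's environment
print(get("n", environment(cube)))  # the n captured by cube
```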

### Loop functions

#### lapply

• lapply(X, FUN, ...)
• X - a list; the result is always a list
• if X is not a list, it is coerced to one with as.list
• x = list(a = 1:5, b=rnorm(10))

lapply(x, mean)

#### sapply

• same as lapply, but simplifies the result
• if each result has length 1, a vector is returned
• if every result has the same length (> 1), a matrix is returned
• otherwise a list is returned
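A sketch of the three simplification cases of `sapply` next to `lapply`. The seed and the extra inputs are arbitrary.

```r
set.seed(1)
x <- list(a = 1:5, b = rnorm(10))

l <- lapply(x, mean)   # always a list
s <- sapply(x, mean)   # every result has length 1 -> named vector
r <- sapply(x, range)  # every result has length 2 -> 2-row matrix
q <- sapply(list(1:2, 1:3), rev)  # lengths differ -> stays a list

print(l)
print(s)
print(r)
```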

#### apply

• evaluates the function over the margin of array
• apply(X, MARGIN, FUN, ...)
• MARGIN: 1 for row, 2 for col
• rowSum = apply(x, 1, sum)
• rowMean = apply(x, 1, mean)
• colSum = apply(x, 2, sum)
• colMean = apply(x, 2, mean)
• apply(x, 1, quantile, probs=c(0.25, 0.75))
• applied to each row: quantile(row, probs=c(0.25, 0.75))
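A small sketch of `apply` over rows and columns. `rowSums` and `colMeans` are not mentioned in the notes; they are base R shortcuts for the same computations.

```r
x <- matrix(1:6, nrow = 2)   # rows: (1, 3, 5) and (2, 4, 6)

row_sum  <- apply(x, 1, sum)    # one value per row
col_mean <- apply(x, 2, mean)   # one value per column

# Extra arguments after FUN are passed on to it
q <- apply(x, 1, quantile, probs = c(0.25, 0.75))  # one column per row of x

print(row_sum)
print(col_mean)
print(q)
print(all(row_sum == rowSums(x)))
```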

#### tapply

• applies a function to each group (subset) of a vector
• tapply(X, INDEX, FUN, ...)
• example
• x = c(rnorm(10), runif(10), rnorm(10, 1))
• f = gl(3, 10) # 1x10, 2x10, 3x10
• tapply(x, f, mean) - returns 3 means

#### split

• takes a vector and a factor
• splits it into groups determined by the factor levels
• split(x, f, ...)
• example
• x = c(rnorm(10), runif(10), rnorm(10, 1))
• f = gl(3, 10) # 1x10, 2x10, 3x10
• will return 3 groups
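The `tapply` and `split` examples share the same data, so one sketch covers both: splitting and then applying a function gives the same group means as `tapply`. The seed is arbitrary.

```r
set.seed(10)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)   # group labels: 1 x10, 2 x10, 3 x10

means  <- tapply(x, f, mean)   # one mean per group
groups <- split(x, f)          # list of 3 vectors, one per level
means2 <- sapply(groups, mean) # split + sapply is equivalent

print(means)
print(lengths(groups))
```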

#### mapply

• multivariate apply: iterates over several arguments in parallel
• mapply(FUN, ...)
• mapply(rep, 1:4, 4:1)
• rep(1, 4)
• rep(2, 3)
• rep(3, 2)
• rep(4, 1)
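A sketch of the `mapply` call above next to its hand-written equivalent. The second call, with an anonymous function, is an extra illustration of how results of length 1 are simplified to a vector.

```r
res <- mapply(rep, 1:4, 4:1)   # rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1)
manual <- list(rep(1L, 4), rep(2L, 3), rep(3L, 2), rep(4L, 1))
print(res)

# Results of length 1 are simplified to a vector
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
print(sums)
```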

### Debugging

#### traceback

• prints out stacktrace

#### debug

• you can step through

#### browser

• suspends execution for debugging

#### trace

• allows inserting debugging code into existing functions

#### recover

• changes the error behaviour so that you can browse the call stack after an error: options(error = recover)

## Week3

### Simulations

#### common prefixes

• d - density
• r - random number
• p - cumulative distribution
• q - quantile

• rnorm
• dnorm

• runif
• dunif

#### set.seed(n)

• ensures reproducibility

#### examples

• linear model
• y = B_0 + B_1 * X + E
• E ~ N(0, 4), X ~ N(0, 1), B_0 = 0.5, B_1 = 2

• set.seed(20)
• x = rnorm(100)
• e = rnorm(100, 0, 2)
• y = 0.5 + 2 * x + e
• plot(x, y)

#### sampling

• sample(1:10, 4) # 4 random numbers from 1:10
• sample(1:10) # permutation
• sample(letters, 5)
• sample(1:10, replace=TRUE) # el can be used multiple times
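A sketch showing how `set.seed` makes sampling reproducible, followed by the linear-model simulation from the notes with a fit to recover the coefficients. Fitting with `lm` at this point is an addition; the estimates should land near 0.5 and 2 but are not exact.

```r
set.seed(20)
a <- sample(1:10, 4)
set.seed(20)
b <- sample(1:10, 4)          # same seed -> same draw
print(identical(a, b))

perm <- sample(1:10)                  # a permutation of 1:10
boot <- sample(1:10, replace = TRUE)  # values may repeat

# Simulate y = 0.5 + 2 * x + e with e ~ N(0, sd = 2)
set.seed(20)
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
est <- coef(lm(y ~ x))
print(est)
```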

### Plotting

#### base

• plot(x, y)
• params
• global
• par - global parameters (?par)
• example (margins)
• x = rnorm(100)
• y = rnorm(100)
• par(mar=c(2, 2, 2, 2))
• plot(x, y)
• plot(x, y, pch=20) # solid circles
• legend("topleft", legend="Data", pch=20)
• fit = lm(y ~ x)
• abline(fit)
• abline(fit, lwd=3, col="blue") # thick blue line
• example(points) - built-in demos
• example
• plot(x, y, xlab="weight", ylab="height", main="Scatterplot", pch=20)

• several plots
• par(mfrow=c(2, 1)) # 2 rows, 1 column of plots
• z = rnorm(100)
• plot(x, y, pch=20)
• plot(x, z, pch=19)
• annotations
• expression
• plot(0, 0, main=expression(theta==0), ylab=expression(hat(gamma)==0), xlab=expression(sum(x[i]*y[i], i==1, n)))
• substitute
• xlab=substitute(bar(X)==k, list(k=mean(x)))
• replaces k in the expression with the value of mean(x)

#### lattice

• functions generally take a formula
• y ~ x | f * g
• y ~ x - variables
• f * g - conditional variables
• returns an object
• example
• x = rnorm(100)
• y = x + rnorm(100, sd=0.5)
• f = gl(2, 50, labels=c("g1", "g2"))
• xyplot(x ~ y | f) # x vs y, split conditioned on f

### Sorting

• sort
• returns a sorted vector

• order
• returns the vector of indexes that would sort the data
• order(read, prog) # sort by 2 variables
• order(prog, -read) # prog ascending, read descending
• order(prog, na.last=F) # NAs go first
• order(prog, na.last=NA) # NAs omitted
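A sketch of `sort` versus `order` on two small vectors. The values of `prog` and `read` are made up for illustration.

```r
prog <- c(2, 1, NA, 1, 3)      # hypothetical data
read <- c(50, 70, 60, 40, 55)

sorted   <- sort(prog)                    # values; NA is dropped by default
idx      <- order(prog)                   # indexes; NA goes last by default
by_two   <- order(prog, read)             # ties in prog broken by read
desc_tie <- order(prog, -read)            # ties broken by read, descending
na_first <- order(prog, na.last = FALSE)
na_drop  <- order(prog, na.last = NA)

# order() is typically used to reorder a data frame by a column
df <- data.frame(prog, read)
df_sorted <- df[order(df$prog, df$read), ]
print(df_sorted)
```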

## Week4

### Dates and time

#### Time

• POSIXct
• based on integer
• POSIXlt
• based on list
• x = Sys.time()
• p = as.POSIXlt(x)
• p$sec

#### Conversion

• strptime
• dates = c("January 10, 2012 10:40", "December 9, 2011 10:50")
• x = strptime(dates, "%B %d, %Y %H:%M")

#### Operations

• many operations are allowed: < > + - etc
• leap years, seconds etc are considered
• y = as.Date("2012-03-01")
• x = as.Date("2012-02-28")
• y - x # returns 2
• and time zones
• x = as.POSIXct("2012-10-25 01:00:00")
• y = as.POSIXct("2012-10-25 06:00:00", tz="GMT")
• y - x => 1 hour when the local time zone is US Eastern (x is parsed in the local time zone)
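A sketch of date arithmetic with leap years and time zones. The time zone of the first timestamp is set explicitly to "America/New_York" (an assumption) so the result does not depend on the machine's local settings.

```r
# Leap years are handled: 2012 has a Feb 29, 2013 does not
leap     <- as.numeric(as.Date("2012-03-01") - as.Date("2012-02-28"))
non_leap <- as.numeric(as.Date("2013-03-01") - as.Date("2013-02-28"))
print(c(leap, non_leap))

# Time zones are taken into account when subtracting
a <- as.POSIXct("2012-10-25 01:00:00", tz = "America/New_York")
b <- as.POSIXct("2012-10-25 06:00:00", tz = "GMT")
gap <- as.numeric(difftime(b, a, units = "hours"))
print(gap)
```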

### Reg exps

#### grep

• grep
• returns indexes of matches
• grepl
• returns T/F vector

#### regexpr

• index where match begins, only first
• gregexpr - all matches

#### replacing

• sub
• sub(pattern, replacement, x)
• replaces the matched pattern with the given string
• only the first occurrence
• gsub
• replaces all occurrences

#### regmatches function

• for extracting the matches found by regexpr
• r = regexpr(pattern, h)
• regmatches(h, r)

#### regexec

• regexec(pattern, d)
• returns list with matches
• if groups are used, also returned

#### example

• r = regexec("<dd>[Ff]ound on (.*?)</dd>", homicides)
• m = regmatches(homicides, r)
• dates = sapply(m, function(x) x[2])
• dates = as.Date(dates, "%B %d, %Y")
• hist(dates, "year", freq=TRUE)
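A self-contained sketch of the regex functions on made-up records. The three `homicides` strings and the `<dd>` wrapper around each field are assumptions for illustration, not the course data.

```r
homicides <- c(   # hypothetical records
  "<dd>Found on January 1, 2007</dd><dd>Victim died at hospital</dd>",
  "<dd>found on March 3, 2008</dd>",
  "no date here"
)

hits  <- grep("ound on", homicides)    # indexes of matching elements
flags <- grepl("ound on", homicides)   # TRUE/FALSE per element

# regexec + regmatches: full match first, then the captured group
r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides)
m <- regmatches(homicides, r)
dates <- sapply(m, function(x) x[2])   # NA where there was no match
print(dates)

first_only <- sub("a", "A", "banana")   # first occurrence only
all_occ    <- gsub("a", "A", "banana")  # every occurrence
```

The captured strings can then be passed to `as.Date(dates, "%B %d, %Y")`; note that `%B` matches month names in the current locale.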

### Classes and methods

#### generic function - for all objects

• doesn't do any computation itself
• only dispatches to appropriate function

• ?Classes, ?Methods, ?setClass, ?setMethod

#### S3 classes

• "old-style"
• mean
• UseMethod("mean")
• methods("mean") lists methods for mean

#### S4 classes

• "new-style"
• show
• standardGeneric("show")
• showMethods("show") lists methods for show

#### dispatching

• gets the class
• searches for appropriate method
• if method exists, it's called
• otherwise default is called
• if no default, exception is thrown
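The dispatch steps above can be sketched with a small S3 generic. The generic `area` and the classes `circle` and `square` are hypothetical names invented for this example.

```r
# Generic: does no work itself, only dispatches on the class of shape
area <- function(shape, ...) UseMethod("area")

# Class-specific methods
area.circle <- function(shape, ...) pi * shape$r^2
area.square <- function(shape, ...) shape$side^2

# Fallback used when no method matches the class
area.default <- function(shape, ...) stop("no area method for this class")

circ <- structure(list(r = 1), class = "circle")
sq   <- structure(list(side = 2), class = "square")

print(area(circ))
print(area(sq))
msg <- tryCatch(area(42), error = function(e) conditionMessage(e))
print(msg)
```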

#### example (S4)

• setClass - creates a new class
• slots - data stored there
• accessed via @
• setClass("polygon", representation(x="numeric", y="numeric"))
• setMethod("plot", "polygon",
    function(x, y, ...) {
      plot(x@x, x@y, type="n", ...)
      xp = c(x@x, x@x[1])
      yp = c(x@y, x@y[1])
      lines(xp, yp)
    })

• p = new("polygon", x=c(1, 2, 3, 4), y=c(1, 2, 3, 1))
• plot(p)
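A non-graphical variant of the S4 example, so it can run without a plotting device. The generic `perimeter` is a hypothetical addition; the class definition matches the one in the notes.

```r
library(methods)

setClass("polygon", representation(x = "numeric", y = "numeric"))

# A new S4 generic and a method for the polygon class
setGeneric("perimeter", function(obj) standardGeneric("perimeter"))
setMethod("perimeter", "polygon", function(obj) {
  xp <- c(obj@x, obj@x[1])   # close the polygon by repeating the first point
  yp <- c(obj@y, obj@y[1])
  sum(sqrt(diff(xp)^2 + diff(yp)^2))
})

p <- new("polygon", x = c(0, 1, 1, 0), y = c(0, 0, 1, 1))  # unit square
print(p@x)             # slots are accessed with @
print(perimeter(p))
```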

## Misc

### Errors

 if (condition) {
stop("error message")
}
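A sketch of raising an error with `stop` and handling it with `tryCatch`. The function `safe_sqrt` is a made-up example, and `tryCatch` is not covered in the notes above.

```r
safe_sqrt <- function(x) {
  if (x < 0) {
    stop("x must be non-negative")
  }
  sqrt(x)
}

ok  <- safe_sqrt(4)
err <- tryCatch(safe_sqrt(-1),
                error = function(e) conditionMessage(e))  # capture the message
print(ok)
print(err)
```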


### Proxy

 Sys.setenv(http_proxy="http://username:password@proxyaddress:8080")


### Sets operations

 setdiff(i, j) - elements present in i, not present in j