R stuff – Keep updating

ggplot2

Resources

Elegant codes [samples from here and there]

Task one – subset based on frequency

  • I have a clean data.frame with observations and variables. one of the variables is source
  • I want to make a table with source (a factor variable) and frequency of each factors.
  • I want this table only show frequency > 100.
  • Workflow
    • count frequency for each level of source (use table function. Check the structure of object returned. See this post. )
    • Based on the frequency information subset original data frame.
    • Build another table on this subset.

Finally code –
table(subset(trumpTweets$source, table(trumpTweets$source)[trumpTweets$source]>100))

Task two – %>% operator 

Transform the previous codes as

ivankaTweets$source %>% subset(., ivankaTweets$source %>% table(.)[.] >100) %>% table(.)

mutate(timestamp = ymd_hms(created_at))

mutate_each()

Plotting 

transformation ggplot2 scare_log_10 — take log of axises

What happens if you have negative values but want to reduce the range of numbers? use sign(x)*log(abs(x)) — this is the logic of transformation

geom_jitter — A geom that draws a point defined by an x and y coordinate, like geom_point, but jitters the points.Used when visualizing a large number of individual observations. Compare this to geom_point and I get the differences.

 

Introductory stuff

R vs SAS

In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.

Text manipulation

paste() function: concatenates text. Can be used in editing filenames..

paste(c("X","Y"), 1:10, sep = ""))

Exercises and tips

Index Vector! Note it’s on vector not other data types. Use square bracket to select elements based on index… Very interesting properties.

    1. Select all non-missing values…
y <- x[!is.na(x)]
    1. Select all non-missing and positive values in (x+1), note how R deals it diff. from sql
y <- (x+1)[!is.na(x) & x>0]
    1. A vector of negative integral quantities. Select all but the first five elements in a vector?
y <- x[-(1:5)]

Note if use x[- 1:5] without () there will be a bug because R interpret this as (-1:5) and that’s not legal. — Sequence of operators – > :

  1. A vector of positive integral quantities. Select elements by a more complex index where the same element is selected multiple times.
    y <- c("x","y")[rep(c(1, 2, 2, 1), times = 4)]
    
  2. A vector of character strings. Using names to as index vector to select elements. Advantage is that character strings are easier to remember than indexes, especially in connecting data frames… note name is a function.
    > fruit <- c(5, 10, 1, 20) > names(fruit) <- c("orange", "banana", "apple", "peach") > fruit[c("apple", "banana")]
    
  3. A vector of character strings. Using names to as index vector to select elements. Advantage is that character strings are easier to remember than indexes, especially in connecting data frames… note name is a function.
    > fruit <- c(5, 10, 1, 20) > names(fruit) <- c("orange", "banana", "apple", "peach") > fruit[c("apple", "banana")]
    

R Data structures and manipulation methods 

Objects: What R operates on. eg. vector, list, function, data frame…

Mode: data type?

For vectors, modes including numeric1, complex, logical, character and raw, also NA (several types) A vector can be empty and still have a mode. For example the empty character string vector is listed as character(0) and the empty numeric vector as numeric(0). Also atomic because their components can only be of one value.

For lists, mode is list. List is recusive because the element of list can itself be a list.

Vector: a R object. All components in vector are of the same mode.

Arrays: An array can be considered as a multiply subscripted collection of data entries, for example numeric. Matrix is also a kind of array. Different from c++, more like the basis of linear space..

Crete an array: 1) by vector. A vector can be see as an array as long as it has a dim attribute.

> dim(z) <- c(3,5,100)

2) by function matrix()
3) by function array()

Matrix: a two-dimensional array. Important because R has many functions used just for matrix. e.g. t(X) transposes a matrix.

Lists: An R list is an object consisting of an ordered collection of objects known as its components. Components need not be of the same modes.
Cite the components of a list by index — Lst[[1]]
Cite the components of a list by name

 

Aug 10 2017 Update

When writing a script now compared to March, feel much more familiar.

  • typeof vs class
  • as.factor
  • as.Date —- dates are internally represented in R as type double, but class “Date”. Link here

A lesson: When doing data analysis, estimate the time needed correctly… There will bugs, you need to familarize with the language… You will get distracted…

Aug 11 2017

Dates formats in R:

  • Excel import dates types are numeric… v.s Character.
  • %y — lowercase y only reads two digits… Now you need %Y for that.

Aug 14, 2017

A package for data science in R…

Tidyverse

R for Data Science

Share this post

Leave a Reply

Your email address will not be published. Required fields are marked *