ggplot2
Resources
Elegant code [samples from here and there]
Task one – subset based on frequency
 I have a clean data.frame of observations and variables. One of the variables is source.
 I want to make a table of source (a factor variable) with the frequency of each level.
 I want this table to show only the levels with frequency > 100.
 Workflow
 Count the frequency of each level of source (use the table function, and check the structure of the object it returns. See this post.)
 Based on the frequency information, subset the original data frame.
 Build another table on this subset.
Final code:
table(subset(trumpTweets$source, table(trumpTweets$source)[trumpTweets$source]>100))
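A minimal sketch of the same workflow on made-up data. The tweets data frame and its counts are hypothetical, not the real tweet data. It assumes source is a factor and adds droplevels(), a step that is not in the one-liner above, so that the filtered-out levels do not reappear in the final table with a count of 0.

```r
# Hypothetical toy data standing in for the real tweets data frame
tweets <- data.frame(
  source = factor(c(rep("iPhone", 150), rep("Android", 120), rep("Web", 30)))
)

# Step 1: frequency of each level
counts <- table(tweets$source)

# Step 2: look up each row's level count by name and keep rows whose level is frequent
keep <- as.vector(counts[as.character(tweets$source)] > 100)

# Step 3: drop the now-empty levels, then tabulate the subset
frequent <- droplevels(tweets$source[keep])
result <- table(frequent)
print(result)
```

Without droplevels(), the "Web" level would still be listed in the result, with a count of 0.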
Task two – %>% operator
Rewrite the previous code with the pipe as:
ivankaTweets$source %>% subset(., ivankaTweets$source %>% table(.)[.] >100) %>% table(.)
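mutate(timestamp = ymd_hms(created_at)): add a timestamp column by parsing created_at as a date-time (ymd_hms comes from lubridate).
mutate_each(): apply a function to several columns at once. Note that recent dplyr versions deprecate it in favor of across().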
Plotting
Transformation: ggplot2's scale_x_log10() and scale_y_log10() put an axis on a log10 scale.
What happens if you have negative values but want to reduce the range of numbers? Use sign(x)*log(abs(x)): take the log of the magnitude and put the sign back. This is the logic of the transformation. Note that it is undefined at x = 0.
geom_jitter: a geom that draws a point at each x and y coordinate, like geom_point, but adds a small random offset to each point. Used when visualizing a large number of individual observations that would otherwise overlap. Plot the same data with geom_point to see the difference.
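A small sketch of the signed-log idea in base R. The function names signed_log and signed_log1p are made up for illustration. The log1p variant is an assumption that is not in the notes above; it is one common way to keep the transform defined at zero.

```r
# The transformation from the notes: log of the magnitude, sign restored
signed_log <- function(x) sign(x) * log(abs(x))

x <- c(-1000, -10, 10, 1000)
signed_log(x)        # symmetric around zero, much smaller range than x

# At zero, sign(0) * log(0) is 0 * -Inf, which is NaN
signed_log(0)

# Assumed variant: log1p(abs(x)) = log(1 + |x|) maps 0 to 0
signed_log1p <- function(x) sign(x) * log1p(abs(x))
signed_log1p(c(-1000, 0, 1000))
```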
Introductory stuff
R vs SAS
In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.
Text manipulation
paste() function: concatenates text. Can be used for building or editing filenames.
paste(c("X","Y"), 1:10, sep = "")
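A short sketch of what the call above produces, plus a filename use case. The file name pattern data_1.csv is hypothetical. The shorter vector c("X", "Y") is recycled to match the length of 1:10.

```r
# c("X", "Y") is recycled against 1:10: X1, Y2, X3, ..., Y10
labels <- paste(c("X", "Y"), 1:10, sep = "")
print(labels)

# paste0() is paste() with sep = ""; handy for building file names
files <- paste0("data_", 1:3, ".csv")
print(files)
```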
Exercises and tips
Index vectors! Note that these apply to vectors, not to other data types. Use square brackets to select elements by index. They have some very interesting properties.

 A logical vector. Select all non-missing values:
y <- x[!is.na(x)]

 Select the values of (x+1) where x is non-missing and positive. Note how R handles this differently from SQL: the index is computed on x but applied to x+1.
y <- (x+1)[!is.na(x) & x>0]

 A vector of negative integral quantities. Select all but the first five elements of a vector:
y <- x[-(1:5)]
Note: writing x[-1:5] without the parentheses is a bug, because R interprets -1:5 as (-1):5, which mixes negative and positive indices and is not legal. Operator precedence: unary minus binds tighter than the : operator.
 A vector of positive integral quantities. Select elements by a more complex index, where the same element is selected multiple times:
y <- c("x","y")[rep(c(1, 2, 2, 1), times = 4)]
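The logical, negative and positive index vectors above can be tried on a small made-up vector. The values in x are hypothetical and chosen only so that each case selects something different.

```r
x <- c(3, NA, -2, 7, 0, 5, NA, 1)

# Logical index: keep the non-missing values
y1 <- x[!is.na(x)]                  # 3 -2 7 0 5 1

# Logical index computed on x, applied to x + 1
y2 <- (x + 1)[!is.na(x) & x > 0]    # 4 8 6 2

# Negative index: everything except the first five elements
y3 <- x[-(1:5)]                     # 5 NA 1

# Without parentheses, -1:5 is (-1):5 and the subscript is rejected
bad <- tryCatch(x[-1:5], error = function(e) "error")

# Positive index with repeats: the same element can be selected many times
y4 <- c("x", "y")[rep(c(1, 2, 2, 1), times = 4)]
```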
 A vector of character strings. Use names as the index vector to select elements. The advantage is that character strings are easier to remember than numeric indexes, especially when connecting data frames. Note that names() is a function.
> fruit <- c(5, 10, 1, 20)
> names(fruit) <- c("orange", "banana", "apple", "peach")
> fruit[c("apple", "banana")]
R Data structures and manipulation methods
Objects: what R operates on, e.g. vector, list, function, data frame.
Mode: roughly, the basic type of an object's contents.
For vectors, the modes are numeric, complex, logical, character and raw. NA is a missing-value marker rather than a mode, and it comes in several types. A vector can be empty and still have a mode. For example, the empty character string vector is listed as character(0) and the empty numeric vector as numeric(0). Vectors are also called atomic because all their components have the same mode.
For lists, the mode is list. A list is recursive because an element of a list can itself be a list.
Vector: an R object. All components of a vector are of the same mode.
Arrays: an array can be considered a multiply subscripted collection of data entries, for example numeric. A matrix is also a kind of array. Different from a C++ array, more like the basis of a linear space.
Create an array: 1) from a vector. A vector can be seen as an array as long as it has a dim attribute.
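A quick check of the modes listed above, using mode() on a few made-up values. The variable names modes and nested are only for illustration.

```r
# mode() of one example value per kind of vector, plus a list
modes <- sapply(
  list(c(1.5, 2), 2i, TRUE, "a", as.raw(1), character(0), NA, list(1, "a")),
  mode
)
print(modes)

# A list can contain another list, so it is recursive; a plain vector is atomic
nested <- list(1, list("a", TRUE))
is.recursive(nested)   # TRUE
is.atomic(1:3)         # TRUE
```

A plain NA reports mode "logical", and an empty vector such as character(0) still reports its mode.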
> dim(z) <- c(3,5,100)
2) by function matrix()
3) by function array()
Matrix: a two-dimensional array. Important because R has many functions that work only on matrices, e.g. t(X) transposes a matrix.
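The three ways of creating an array, and the transpose, can be sketched as follows. The contents (1:1500, 1:6, 1:24) are arbitrary fill values chosen so the shapes work out.

```r
# 1) A vector becomes an array once it has a dim attribute
z <- 1:1500
dim(z) <- c(3, 5, 100)      # 3 * 5 * 100 = 1500 entries

# 2) matrix(): a two-dimensional array, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# 3) array(): any number of dimensions
a <- array(1:24, dim = c(2, 3, 4))

# t() swaps rows and columns of a matrix
tm <- t(m)
dim(tm)                     # 3 2
```

Values are filled with the first subscript moving fastest, so z[2, 1, 1] is 2 and z[1, 2, 1] is 4.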
Lists: An R list is an object consisting of an ordered collection of objects known as its components. Components need not be of the same modes.
Access a component of a list by index: Lst[[1]]
Access a component of a list by name: Lst$name or Lst[["name"]]
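A small sketch of both access styles. The list contents are made up for illustration. Note the difference between double brackets, which return the component itself, and single brackets, which return a shorter list.

```r
# Components of different modes in one list
Lst <- list(name = "Fred", wife = "Mary", no.children = 3, child.ages = c(4, 7, 9))

first     <- Lst[[1]]             # by index: the component itself, "Fred"
by_name   <- Lst$name             # by name with $
by_string <- Lst[["child.ages"]]  # by name with [[ ]]

sub <- Lst[1]                     # single bracket: a list of length 1, not a string
```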
Aug 10 2017 Update
Writing a script now feels much more familiar than it did in March.
 typeof vs class
 as.factor
 as.Date — dates are internally represented in R as type double, but class “Date”. Link here
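The three items above can be seen side by side in a short sketch. The example date and factor values are arbitrary.

```r
# A Date is stored as a double (days since 1970-01-01) with class "Date"
d <- as.Date("2017-08-10")
typeof(d)       # "double"
class(d)        # "Date"
as.numeric(d)   # the underlying day count

# A factor is stored as integer codes with class "factor"
f <- as.factor(c("b", "a", "b"))
typeof(f)       # "integer"
class(f)        # "factor"
levels(f)       # "a" "b"
as.integer(f)   # 2 1 2
```

typeof() reports how the object is stored, while class() reports how R functions treat it.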
A lesson: when doing data analysis, estimate the time needed correctly. There will be bugs, you need to familiarize yourself with the language, and you will get distracted.
Aug 11 2017
 Dates imported from Excel may come in as numeric (serial numbers) rather than character.
 %y: lowercase y reads only a two-digit year. Use %Y for a four-digit year.
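A sketch of both date pitfalls with made-up input values. The origin "1899-12-30" is the one documented in ?as.Date for Excel on Windows; treat it as an assumption and check it against how the file was produced.

```r
# %Y reads a four-digit year, %y a two-digit year
full  <- as.Date("08/11/2017", format = "%m/%d/%Y")
short <- as.Date("08/11/17",   format = "%m/%d/%y")

# Using %y on a four-digit year reads only the first two digits,
# so the result is not the intended date
mismatch <- as.Date("08/11/2017", format = "%m/%d/%y")

# Excel serial number (hypothetical cell value) converted with an explicit origin
excel_serial <- 42958
from_excel <- as.Date(excel_serial, origin = "1899-12-30")

print(c(full, short, mismatch, from_excel))
```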
Aug 14, 2017
A package for data science in R…