Monthly Archives: April 2017

R stuff – Keep updating

ggplot2

Resources

Elegant codes [samples from here and there]

Task one – subset based on frequency

  • I have a clean data.frame with observations and variables. one of the variables is source
  • I want to make a table with source (a factor variable) and frequency of each factors.
  • I want this table only show frequency > 100.
  • Workflow
    • count frequency for each level of source (use table function. Check the structure of object returned. See this post. )
    • Based on the frequency information subset original data frame.
    • Build another table on this subset.

Finally code –
table(subset(trumpTweets$source, table(trumpTweets$source)[trumpTweets$source]>100))

Task two – %>% operator 

Transform the previous codes as

ivankaTweets$source %>% subset(., ivankaTweets$source %>% table(.)[.] >100) %>% table(.)

mutate(timestamp = ymd_hms(created_at))

mutate_each()

Plotting 

transformation ggplot2 scare_log_10 — take log of axises

What happens if you have negative values but want to reduce the range of numbers? use sign(x)*log(abs(x)) — this is the logic of transformation

geom_jitter — A geom that draws a point defined by an x and y coordinate, like geom_point, but jitters the points.Used when visualizing a large number of individual observations. Compare this to geom_point and I get the differences.

 

Introductory stuff

R vs SAS

In S a statistical analysis is normally done as a series of steps, with intermediate results being stored in objects. Thus whereas SAS and SPSS will give copious output from a regression or discriminant analysis, R will give minimal output and store the results in a fit object for subsequent interrogation by further R functions.

Text manipulation

paste() function: concatenates text. Can be used in editing filenames..

paste(c("X","Y"), 1:10, sep = ""))

Exercises and tips

Index Vector! Note it’s on vector not other data types. Use square bracket to select elements based on index… Very interesting properties.

    1. Select all non-missing values…
y <- x[!is.na(x)]
    1. Select all non-missing and positive values in (x+1), note how R deals it diff. from sql
y <- (x+1)[!is.na(x) & x>0]
    1. A vector of negative integral quantities. Select all but the first five elements in a vector?
y <- x[-(1:5)]

Note if use x[- 1:5] without () there will be a bug because R interpret this as (-1:5) and that’s not legal. — Sequence of operators – > :

  1. A vector of positive integral quantities. Select elements by a more complex index where the same element is selected multiple times.
    y <- c("x","y")[rep(c(1, 2, 2, 1), times = 4)]
    
  2. A vector of character strings. Using names to as index vector to select elements. Advantage is that character strings are easier to remember than indexes, especially in connecting data frames… note name is a function.
    > fruit <- c(5, 10, 1, 20) > names(fruit) <- c("orange", "banana", "apple", "peach") > fruit[c("apple", "banana")]
    
  3. A vector of character strings. Using names to as index vector to select elements. Advantage is that character strings are easier to remember than indexes, especially in connecting data frames… note name is a function.
    > fruit <- c(5, 10, 1, 20) > names(fruit) <- c("orange", "banana", "apple", "peach") > fruit[c("apple", "banana")]
    

R Data structures and manipulation methods 

Objects: What R operates on. eg. vector, list, function, data frame…

Mode: data type?

For vectors, modes including numeric1, complex, logical, character and raw, also NA (several types) A vector can be empty and still have a mode. For example the empty character string vector is listed as character(0) and the empty numeric vector as numeric(0). Also atomic because their components can only be of one value.

For lists, mode is list. List is recusive because the element of list can itself be a list.

Vector: a R object. All components in vector are of the same mode.

Arrays: An array can be considered as a multiply subscripted collection of data entries, for example numeric. Matrix is also a kind of array. Different from c++, more like the basis of linear space..

Crete an array: 1) by vector. A vector can be see as an array as long as it has a dim attribute.

> dim(z) <- c(3,5,100)

2) by function matrix()
3) by function array()

Matrix: a two-dimensional array. Important because R has many functions used just for matrix. e.g. t(X) transposes a matrix.

Lists: An R list is an object consisting of an ordered collection of objects known as its components. Components need not be of the same modes.
Cite the components of a list by index — Lst[[1]]
Cite the components of a list by name

 

Aug 10 2017 Update

When writing a script now compared to March, feel much more familiar.

  • typeof vs class
  • as.factor
  • as.Date —- dates are internally represented in R as type double, but class “Date”. Link here

A lesson: When doing data analysis, estimate the time needed correctly… There will bugs, you need to familarize with the language… You will get distracted…

Aug 11 2017

Dates formats in R:

  • Excel import dates types are numeric… v.s Character.
  • %y — lowercase y only reads two digits… Now you need %Y for that.

Aug 14, 2017

A package for data science in R…

Tidyverse

R for Data Science

SQL Review – Updating…

This is yet another thread for technical reviews… I remember having some problems dealing with my data management courses. Now I think back what’s missing was an understanding of the relations and structure of databases. This weakness was reflected later in unfamiliarity with data manipulation and SQL queries. Worth practicing…

Note this is also related to knowledge under the label of “database” in computer science? Probably worth going over the LeetCode exercises…

Apr 23

Concepts 

Relational databases relational database is a database that organizes information into one or more tables.

A table is a collection of data organized into rows and columns. Tables are sometimes referred to as relations.

Properties

  1. a SQL statement always ends with semicolon ;

Sample code 

Create a table

CREATE TABLE table_name(
column_1 data_type,
column_2 data_type,
column_3 data_type
);

 

Insert a row. Note: unlike python sql does not take row break seriously. Statements end with “;”

insert into celebs (id, name, age)
values (1, 'Justin Biber', 21);

Select from tables using column names. * is a special wildcard character that we have been using. It allows you to select every column in a table without having to name each one individually.

select * from celebs;

Update table. update updates the value of existing records.

update celebs 
set age = 22
where id = 1;

select * from celebs;

Add column to a table

alter table celebs add column 
twitter_handle text;

No… CodeAcademy is too slow for this. Another recource here  a one-page guide. I should do this by working on problems.
table aliasing: like creating a pointer or something..

select * from table_sample as s;

Select from two tables using LEFT JOIN. Note in the FROM clause there is only table 1!

SELECT Person.FirstName, Person.LastName, Address.City, Address.State
FROM Person 
LEFT JOIN Address
ON Person.PersonID = Address.PersonID

Using alias

# Write your MySQL query statement below
SELECT p.FirstName, p.LastName, a.City, a.State
FROM Person AS p
LEFT JOIN Address AS a
ON p.PersonID = a.PersonID

select distinct genre

SELECT DISTINCT genre FROM movies;

select second highest salary. Note alising column name in “select” here!

SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee 
WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee);

Select nth highest salary – this one is tricky!
Note the use of keywork “limit” in MySQL
If do not use “limit”, need to learn algorithms and databases… Resources here. Also Stanford course. Algorithms + databases two full courses will take (12+12)*10 = 240 hrs = 20 full days to learn. Need to somehow find that time in summer…

CREATE FUNCTION getNthHighestSalary(N INT) RETURNS INT
BEGIN
DECLARE M INT;
SET M = N-1;
  RETURN (
      # Write your MySQL query statement below.
      SELECT DISTINCT Salary FROM Employee ORDER BY Salary DESC LIMIT M, 1
  );
END

2017 Apr 30

Installing and initializing MySQL on mac

  • install using .dmg
  • create user using directory utility in here for Mac 10.12 /System/Library/CoreServices/Applications     command line: open ‘Directory Utility.app’
  • This questions changes the PATH to mysql
  • Use this command to access mysql  ‘sudo mysql -u root -p’ — enter password that were gave to you when installing mysql through .dmg
  • This question solves the reset password problem
  • Accessing mysql databases
  • This command line prompt seems to identify my host ‘ALTER USER ‘root’@’localhost’ IDENTIFIED BY ‘*****’;

Use Pandas to connect to mysql: For a client program to be able to connect to the MySQL server, it must use the proper connection parameters, such as the name of the host where the server is running and the user name and password of your MySQL account. Each connection parameter has a default value, but you can override them as necessary using program options specified either on the command line or in an option file.

Next step: connect using SQLAlchemy engine… What is that?

Nov 23, 2017

SQL Join: combine rows from two or more tables based on a related column between them