# Two approaches for logistic regression

Finally, a post in this blog that actually gets a little bit technical …

This post discusses two approaches to understanding logistic regression: the empirical risk minimization approach and the probabilistic approach.

Empirical Risk Minimizer

Empirical risk minimization frames a problem in terms of the following components:

• Input space $X = R^d$. Corresponds to observations, features, predictors, etc.
• Outcome space $Y = \Omega$. Corresponds to target variables.
• Action space $A_R = R$. A function that maps inputs to actions is called a decision function, predictor, or hypothesis.
• Hypothesis space: the set of such functions we search over, e.g. all linear transformations.
• Loss function: $l(\hat{y}, y)$, a loss defined on the predicted values and observed values.

The goal of the whole problem is to select a function $f$ in the hypothesis space that minimizes the total loss on the sample. This is achieved by choosing the parameters of $f$ that minimize the empirical loss on the training set. We also tune hyperparameters on the validation set in order to prevent overfitting.

For logistic regression, these components are:

• Input space $X = R^d$.
• Outcome space $Y = \{-1, 1\}$. Binary target values, coded by sign so that the margin below makes sense.
• Action space $A_R = R$. The hypothesis space: all linear score functions
• $F_{score} = \{x \mapsto x^Tw \mid w \in R^d\}$
• Loss function: $l(\hat{y}, y) = l_{logistic}(m) = \log (1 + e^{-m})$
• This is a kind of margin-based loss, thus the $m$ here.
• Margin is defined as $m = \hat{y} y$, which has an interpretation in binary classification tasks. Consider:
• If $m = \hat{y} y > 0$, our prediction and the true value have the same sign. Thus, in binary classification, we already get the correct result, and a loss that only counts mistakes would be 0 for $m > 0$.
• If $m = \hat{y} y < 0$, our prediction and the true value have different signs. Thus, in binary classification, we are wrong, and the loss should be positive.
• In SVM, we define the hinge loss $l(m) = \max(0, 1-m)$, which is a “maximum-margin” based loss (more on this in the next post, which will cover the derivation of SVM and kernel methods). For this loss, there is no loss when $m \geq 1$ and a positive loss when $m < 1$. We can interpret $m$ as the “confidence” of our prediction. $m < 1$ means low confidence, so we still penalize!
• With this intuition, how do we understand logistic loss? We know:
• This loss is always > 0, and equals $\log 2$ at $m = 0$.
• When $m$ is negative (i.e. wrong prediction), we have greater loss!
• When $m$ is positive (i.e. correct prediction), we have less loss…
• Note also that for the same change in $m$ around 0, the “reward” for a correct prediction is smaller than the penalty for a wrong prediction.
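
To make the comparison concrete, here is a minimal sketch that evaluates the logistic loss and the hinge loss at a few margins. The function names and the chosen margin values are illustrative assumptions, not taken from any library.

```python
import math


def logistic_loss(m):
    """Logistic loss log(1 + e^{-m}) as a function of the margin m."""
    return math.log1p(math.exp(-m))


def hinge_loss(m):
    """Hinge loss max(0, 1 - m), shown for comparison."""
    return max(0.0, 1.0 - m)


# Illustrative margins: negative = wrong sign, positive = correct sign.
margins = [-2.0, -1.0, 0.0, 1.0, 2.0]
for m in margins:
    print(f"m={m:+.1f}  logistic={logistic_loss(m):.4f}  hinge={hinge_loss(m):.4f}")
```

The printed values show that the logistic loss never reaches exactly zero, and that moving from $m = 0$ to $m = -1$ raises the loss by more than moving from $m = 0$ to $m = 1$ lowers it.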

Bernoulli regression with logistic transfer function

• Input space $X = R^d$
• Outcome space $Y = \{0, 1\}$
• Action space $A = [0, 1]$. An action is the probability that an outcome is 1.

Define the standard logistic function as $\phi (\eta) = 1 / (1 + e^{-\eta})$

• Hypothesis space: $F = \{x \mapsto \phi (w^Tx) \mid w \in R^d\}$
• A sigmoid function is any function that has an “S” shape. The logistic function is one simple example! Sigmoids are used in neural networks as activation / transfer functions, where the purpose is to add non-linearity to the network.

Now we need to do a re-labeling for $y_i$ in the dataset.

• For every $y_i = 1$, we define $y' = 1$
• For every $y_i = -1$, we define $y' = 0$

Can we do this? Doesn’t this change the values of the $y$-s? The answer is yes, we can: in binary classification (or in any classification), the label values themselves do not matter. This trick just makes the equivalence much easier to show.

Then, the negative log likelihood objective function, given this $F$ and dataset $D$, is:

• $NLL(w) = \sum_{i=1}^n \left[ -y_i' \log \phi (w^T x_i) + (y_i' - 1) \log (1 - \phi (w^T x_i)) \right]$

How to understand this approach? Think about a neural network…

• Input $x$
• First linear layer: transform $x$ into $w^Tx$
• Next non-linear activation function. $\phi (\eta) = 1 / (1 + e^{-\eta})$.
• The output is interpreted as a probability of positive classes.
• Think about multi-class problems, the second layer is a softmax — and we get a vector of probabilities!

With some calculation, we can show that the NLL equals the sum of the logistic losses over the training set, i.e. the empirical loss from the first approach.
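
The calculation can also be checked numerically. For a label of $1$ the NLL term is $-\log \phi(s) = \log(1 + e^{-s})$, and for a label of $-1$ it is $-\log(1 - \phi(s)) = \log(1 + e^{s})$, where $s = w^T x_i$; both are the logistic loss at margin $m = y_i s$. The sketch below assumes labels coded as $-1$/$1$ for the margin-based loss and relabeled to $0$/$1$ for the likelihood. The random data and helper names are made up for illustration.

```python
import math
import random


def sigmoid(eta):
    """Standard logistic function phi(eta) = 1 / (1 + e^{-eta})."""
    return 1.0 / (1.0 + math.exp(-eta))


def logistic_loss(m):
    """Logistic loss log(1 + e^{-m}) of a margin m."""
    return math.log1p(math.exp(-m))


random.seed(0)
d, n = 3, 5
w = [random.gauss(0, 1) for _ in range(d)]                      # hypothetical weights
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # hypothetical inputs
y = [random.choice([-1, 1]) for _ in range(n)]                  # labels in {-1, 1}

scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]

# Empirical risk view: sum of logistic losses of the margins y_i * w^T x_i.
erm_total = sum(logistic_loss(yi * s) for yi, s in zip(y, scores))

# Probabilistic view: relabel -1 -> 0, 1 -> 1, then compute the NLL.
y_prime = [(yi + 1) // 2 for yi in y]
nll = sum(
    -yp * math.log(sigmoid(s)) + (yp - 1) * math.log(1.0 - sigmoid(s))
    for yp, s in zip(y_prime, scores)
)

print(erm_total, nll)
```

The two printed numbers agree up to floating point error.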

# Encoding categorical features: likelihood, one-hot, and feature selection

This post describes techniques used to encode high cardinality categorical features in a supervised learning problem.

In particular, since these values cannot be ordered, the features are nominal. Specifically, I am working with a Kaggle competition dataset. The problem with this dataset is that some features (e.g. types of cell phone operating systems) are categorical and have hundreds of values.

The problem is how to fit these features into our model. Nominal features work fine with decision trees (random forests) and Naive Bayes (which uses counts to estimate the pmf). But for other models, e.g. neural networks and logistic regression, the input needs to be numbers.

Below, I introduce likelihood encoding and then go over other methods for handling such situations.

Likelihood encoding

Likelihood encoding is a way of representing the values according to their relationships with the target variable. The goal is finding a meaningful numeric encoding for a categorical feature. Meaningful in this case means as much related to the output/target as possible.

How do we do this? A simple way is to 1) group the training set by this particular categorical feature and 2) represent each value by the within-group mean of the target. For example, a categorical feature might be gender. Suppose the target is height. Then, we might find that the average height for males is 1.70m, while the average height for females is 1.60m. We then change ‘male’ to 1.70 and ‘female’ to 1.60.

Perhaps we should also add some noise to this mean to prevent overfitting to training data. This can be done by :

• Add Gaussian noise to the mean. Credit to Owen Zhang.
• Use the idea of “cross-validation”. So here, instead of using the grand group mean, we use the cross-validation mean. (Not very clear on this point at the moment. Need to examine the idea of cross-validation. Will write in the next post.) Some people on Kaggle propose using two levels of cross-validation: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
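
A minimal sketch of the group-mean encoding with optional Gaussian noise, using only the standard library. The function names, the toy data and the noise level are assumptions for illustration; this is not Owen Zhang's code, and it does not implement the cross-validation variant.

```python
import random
from collections import defaultdict


def fit_likelihood_encoding(categories, targets):
    """Return {category: mean target within that category} from training data."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}


def encode(categories, means, noise_std=0.0, rng=None):
    """Replace each category by its group mean, plus optional Gaussian noise."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    return [means[c] + rng.gauss(0.0, noise_std) for c in categories]


# Toy data matching the gender / height example (values are made up).
gender = ["male", "female", "male", "female"]
height = [1.72, 1.58, 1.68, 1.62]

means = fit_likelihood_encoding(gender, height)
encoded = encode(gender, means)                 # plain group means
noisy = encode(gender, means, noise_std=0.01)   # group means with noise
print(means, encoded, noisy)
```

Categories that never appear in the training data raise a `KeyError` in this sketch; a real pipeline would need a fallback such as the overall mean.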

One hot vector

This idea is similar to the dummy variable in statistics. Basically, each possible value is transformed into its own column. Each of these columns is 1 if the original feature equals this value, or 0 if it does not.
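
As a small sketch of this idea, assuming a toy list of operating system names (the values and the function name `one_hot` are made up for illustration):

```python
def one_hot(values):
    """Return (vocabulary, rows), where each row has a single 1 in the column of its value."""
    vocabulary = sorted(set(values))
    index = {v: i for i, v in enumerate(vocabulary)}
    rows = []
    for v in values:
        row = [0] * len(vocabulary)
        row[index[v]] = 1
        rows.append(row)
    return vocabulary, rows


vocabulary, rows = one_hot(["android", "ios", "android", "windows"])
print(vocabulary)
print(rows)
```

With hundreds of distinct values, each row would have hundreds of columns and a single 1, which is why the methods below try to compress or prune this representation.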

An example is natural language processing models, where the first steps are usually to 1) tokenize the sentence, 2) construct a vocabulary and 3) map every token to an index (a number, or a nominal value, basically). After that, we do 4) one-hot encoding and 5) a linear transformation in the form of a linear layer in a neural network (basically transforming high-dimensional one-hot vectors into low-dimensional vectors). In this way, we are representing every symbol as a low-dimensional vector. The exact form of the vector is learned. What is happening here is actually dimension reduction. So, after learning the weight matrix, other methods, like PCA, can potentially work here as well!

Hashing

A classmate named Lee Tanenbaum told me about this idea. This is an extension of the one-hot encoding idea. Suppose the feature can take $n$ values. Basically, we use two hash functions to hash the possible values into two variables. The first hashes all values into $\sqrt{n}$ buckets, so that each bucket holds about $\sqrt{n}$ feature values. All feature values in the same bucket get the same value for variable A. Then, we use a second hash function that carefully hashes the values into another bucket variable B. We want to make sure the combination of A and B can fully represent every possible value of the feature. We then learn a low-dimensional representation for both A and B, and concatenate them together.

My opinion on this is, this is still one-hot encoding + weight learning. However, we are forcing certain structures onto the weight matrix.
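
A rough sketch of the two-variable idea. As a stand-in for the two carefully chosen hash functions, it uses plain index arithmetic (integer division and remainder by $\lceil \sqrt{n} \rceil$), which is an assumption of this sketch and not necessarily what was originally proposed; the function name and the toy values are hypothetical as well.

```python
import math


def two_bucket_codes(values):
    """Map each distinct value to a pair (A, B) with A, B < ceil(sqrt(n))."""
    vocabulary = sorted(set(values))
    n = len(vocabulary)             # assumes at least one value
    k = math.isqrt(n - 1) + 1       # ceil(sqrt(n)) for n >= 1
    codes = {}
    for i, v in enumerate(vocabulary):
        codes[v] = (i // k, i % k)  # A = bucket index, B = position inside the bucket
    return k, codes


values = [f"os_{i}" for i in range(10)]   # 10 made-up category names
k, codes = two_bucket_codes(values)
print(k, codes)
```

Each of A and B takes at most $\lceil \sqrt{n} \rceil$ distinct values, so two small embedding tables replace one table with $n$ rows, while every pair (A, B) still identifies a unique original value.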

Feature selection

Still based on one-hot encoding. However, instead of compressing everything into a tiny low-dimensional vector, we discard some dummy variables based on their importance. In fact, LASSO is used for exactly this! The L1 penalty usually drives the coefficients of some features to exactly zero, due to the diamond shape of the constraint region. Sources:

• On why L1 gives sparsity: video here: https://www.youtube.com/watch?v=14MKVkhvMus&feature=youtu.be
• Course here: https://onlinecourses.science.psu.edu/stat857/book/export/html/137
• Stack Exchange answer here: https://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection
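
To see the sparsity effect in the simplest possible setting, consider one coefficient at a time: minimizing $\frac{1}{2}(b - z)^2 + \lambda |b|$ gives the soft-thresholding rule, while the L2 analogue $\frac{1}{2}(b - z)^2 + \frac{\lambda}{2} b^2$ only shrinks. The sketch below compares the two. It is a one-dimensional illustration (it matches the LASSO solution coordinate by coordinate only under an orthonormal design, which is assumed here), and the numbers are made up.

```python
def soft_threshold(z, lam):
    """Minimizer of 0.5 * (b - z)**2 + lam * abs(b) over b (L1 penalty)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer of 0.5 * (b - z)**2 + 0.5 * lam * b**2 over b (L2 penalty)."""
    return z / (1.0 + lam)


unpenalized = [3.0, 0.4, -0.2, -1.5]   # hypothetical unpenalized coefficients
lam = 0.5
l1 = [soft_threshold(z, lam) for z in unpenalized]
l2 = [ridge_shrink(z, lam) for z in unpenalized]
print(l1)
print(l2)
```

The L1 version sets the two small coefficients to exactly zero, which corresponds to discarding those dummy variables; the L2 version keeps all of them, only smaller.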

Domain specific methods

These models exploit the relationship between these symbols in the training set, and learn a vector representation of every symbol (e.g. word2vec). This is certainly another way of vectorizing the words. Probably I will write more about this after learning more on representation learning!

# Python generators

This short post describes what a generator is in Python.

A function with yield in it is still a function. However, when the function is called, it returns a generator object. Generators allow you to pause a function and return an intermediate result. The function saves its execution context and can be resumed later if necessary.

def fibonacci():
    # Infinite generator of Fibonacci numbers: 1, 1, 2, 3, 5, ...
    a, b = 0, 1
    while True:
        yield b  # pause here and hand b to the caller
        a, b = b, a + b

g = fibonacci()

[next(g) for _ in range(10)]


This will return [1, 1, 2, 3, 5, 8, 13, 21, 34, 55].

When the list comprehension is evaluated again, it will return:

[next(g) for _ in range(10)]


[89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]

Here, note the generator is like a machine that can generate what you want. It also remembers its last state. This is different from a function that returns a list, which does not remember its last state.

Python generators can also interact with the code that calls next. yield becomes an expression, and a value can be passed in through a method called send. Here is an example piece of code:

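def psychologist():
    while True:
        answer = (yield)  # note the usage of yield here: it evaluates to the value passed to send
        if answer is None:  # a plain next() sends None, so there is nothing to answer
            continue
        if answer.endswith("?"):  # note it's endSwith, the s there
            print("Don't ask yourself too many questions.")
        elif "good" in answer:
            print("A that's good, go on.")
        elif "bad" in answer:
            print("Don't be so negative.")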


This defines a function that can return a generator.

free = psychologist()

type(free)


This will return “generator”

next(free)


free.send("what?")


This will print “Don’t ask yourself too many questions.”

free.send("I'm feeling bad.")


This will print “Don’t be so negative.”

free.send("But she is feeling good!")


This will print “A that’s good, go on.”

send acts like next, but makes yield return the value passed. The function can, therefore, change its behavior depending on the client code.

# Should I use an IDE, or should I use Vim?

This problem has been bugging me for a while, so I decided to write it out even though it’s just a short piece. This post compares tools for Python programming:

• Jupyter Notebook
• IDEs like PyCharm
• Text editors like Vim

Jupyter Notebook:

• The pro is it’s easily visualizable. When you want a graph, you can see a graph immediately. The comments are also beautifully formatted.
• Another pro is it can be connected to a Linux server like Dumbo.
• The con is, it’s not a program. A notebook is its own file, and although it can be downloaded as a .py file, the file is usually too long, with lots of comments like typesetting parameters.

When to use it?

• Data exploration, because of its visualization and analysis nature.
• Big data. Because it can be connected to a server, running on large amounts of data becomes possible.

PyCharm

• The pro is it’s suited for Python development. I have not learnt all of its functionality, but e.g. search and replace is easy in PyCharm. Debugging also seems to be easy?
• The con is it’s usually not available on a server.
• Another con is that switching from the terminal to PyCharm needs extra finger movement.

When to use it?

• Debugging complicated programs. e.g. NLP programs.
• No need to run on a server.

Vim

• The pro is it’s everywhere. Thus, whether you write on your own machine or on the server, it feels the same.
• Another pro is it can be used for anything, like Python, C++, markdown, bash… So there is no need to switch to other tools when you ssh to the server.
• The con is it’s not that easy to use, e.g. search and replace is hard to do. Adjusting tabs? Also not immediately doable.
• Another con is it’s not that easy to debug. You have to manually print out variables… This makes it particularly difficult when the program is large.

When to use it?

• When you need to connect to a server, e.g. because of big data size.

# Chi-square and two-sample t-test

This post explains a basic question I encountered, and the statistical concepts behind it.

The real-life problem

Someone asks me to construct data to prove that a treatment is useful for 1) kindergarten and 2) elementary school kids in preventing winter cold.

Chi-square and student’s t test

First, decide how to formulate the problem using statistical tests. This includes deciding the quantity and statistic to compare.

Basically, I need to compare two groups. Two tests come to mind: Pearson’s chi-square test and the two-sample t-test. This article summarizes the main differences between the two tests, in terms of Null Hypothesis, Types of Data, Variations and Conclusions. The following section is largely based on that article.

Null Hypothesis

• Pearson’s chi-square test: tests whether two categorical variables are related, i.e. whether one has an effect on the other. E.g. men and women are equally likely to vote for Republican, Democrat, Others, or Not at all. Here the two variables are “gender” and “voting choice”. The null is “gender does not affect voting choice”.
• Two-sample t-test: tests whether the two populations the samples are drawn from have the same mean. Mathematically, this means $\mu_1 = \mu_2$ or $\mu_1 - \mu_2 = 0$. E.g. boys and girls have the same height.

Types of Data

• Pearson’s chi-square test: usually requires two variables. Each is categorical and can have any number of levels. E.g. one variable is “gender”, the other is “voting choice”.
• Two-sample t-test: requires two variables. One variable has exactly two levels (hence two-sample), the other is quantitative. E.g. in the example above, one variable is gender, the other is height.

Variations

• Pearson’s chi-square test: variations can be when the two variables are ordinal instead of categorical.
• two-sample t-test: variations can be that the two samples are paired instead of independent.

Transform the real-life problem into a statistical problem

Using chi-square test

Variable 1 is “using treatment”. Two levels: use or not.

Variable 2 is “getting winter cold”. Two levels: get cold or not.

For kindergarten kids and for elementary school kids, I thus have two 2 * 2 tables.

(question: can I do a chi-square test on three variables? The third one being “age”.)

Using two-sample t-test

Variable 1 is “using treatment”. Two levels: use or not

Variable 2 is supposed to be a numerical variable; here, use the disease rate. But then there are not enough samples.

Thus, I conclude that the chi-square test should be used here.
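
As a sketch of what the test looks like for one 2 * 2 table, the code below computes Pearson's chi-square statistic (without continuity correction) and its p-value for one degree of freedom. The counts are entirely hypothetical, and the function name is made up; the p-value uses the identity that a chi-square variable with one degree of freedom is a squared standard normal.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (df = 1) for a 2 x 2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(statistic / 2.0))       # P(chi2 with df=1 > statistic)
    return statistic, p_value


# Hypothetical counts. Rows: treated / untreated. Columns: got a cold / no cold.
kindergarten = [[12, 38], [25, 25]]
stat, p = chi_square_2x2(kindergarten)
print(stat, p)
```

The same function would be applied to the second table for the other age group.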

# Brief explanation of statistical learning framework

This post explains what a statistical learning framework is, and some common results under this framework.

Problem

We have a random variable X, another random variable Y. Now we want to determine the relationship between X and Y.

We define the relationship by a prediction function f(x). For each x, this function produces an “action” a in the action space.

Now how do we get the prediction function f? We use a loss function l(a, y) that, for each a and y, produces a “loss”. Note that since X is a random variable and f(x) is a transformation of it, a is a random variable, too.

Also, l(a, y) is a transformation of (a, y), so l(a, y) is a random variable too. Its distribution is based on both X and Y.

We then calculate f by minimizing the expectation of the loss, which is called “risk”. Since the distribution of l(a, y) is based both on the distribution of X and Y, to get this expectation, we need to do integration both on X and on Y. In the case of discrete variables, we do summation based on the pmf of (x, y).

The above is about theoretical properties of Y, X, the loss function and the prediction function. But we usually do not know the distribution of (X, Y). Thus, we choose to minimize the empirical risk instead. We calculate the empirical risk by summing the losses on all m observed samples and dividing by m. (q: does this resemble Monte Carlo method? is this about computational statistics? Need a review.)
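
A minimal sketch of the empirical risk as a sample average of losses. The data, the prediction function and the function names are made up for illustration.

```python
def empirical_risk(f, loss, xs, ys):
    """Average loss of prediction function f over the m observed pairs (x, y)."""
    return sum(loss(f(x), y) for x, y in zip(xs, ys)) / len(xs)


def square_loss(a, y):
    """Square loss between action a and outcome y."""
    return (a - y) ** 2


# Hypothetical sample of m = 4 observations.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]


def f(x):
    """A candidate prediction function: predict y = x."""
    return x


risk = empirical_risk(f, square_loss, xs, ys)
print(risk)
```

The empirical risk is an average over observed samples standing in for an expectation over the unknown distribution of (X, Y), which is the same basic idea as a Monte Carlo estimate.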

Results

In the case of square-loss, we have the result, a = E(y|x).

In the case of 0-1 loss, we have the result, a = arg max P(y|x)
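
A short derivation of the square-loss result, writing $\mu(x) = E(y|x)$ (a symbol introduced here only for convenience):

$$E[(a - y)^2 \mid x] = E[(y - \mu(x))^2 \mid x] + (a - \mu(x))^2$$

The cross term vanishes because $E[y - \mu(x) \mid x] = 0$. The first term on the right does not depend on $a$, so the conditional risk is smallest when $a = \mu(x) = E(y|x)$. For 0-1 loss, the conditional risk of action $a$ is $1 - P(y = a \mid x)$, which is smallest for the most probable $y$.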

Example:

We want to predict a student’s mid-term grade (Y). We want to know the relationship between predicted value, and whether she is hard-working (X).

We use square-loss for this continuous variable Y. We know that to minimize square loss, for any random variable we should predict the mean value of the variable (c.f. regression analysis: in the OLS scenario we calculate the MSE, but this needs further connection to this framework).

Now we just observed unfortunately the student is not hard-working.

We know for a not-hardworking student the expectation of mid-term grade is 40.

We then predict the grade to be 40, as a way to minimize square-loss.
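
A quick numerical check of this example, assuming a made-up list of mid-term grades for not-hardworking students whose mean is 40. It searches over all integer predictions from 0 to 100 and confirms that the mean gives the smallest average square loss.

```python
def average_square_loss(a, ys):
    """Average square loss of predicting the constant a for every y in ys."""
    return sum((a - y) ** 2 for y in ys) / len(ys)


# Hypothetical grades of not-hardworking students (mean is 40).
grades = [30, 35, 40, 45, 50]
mean_grade = sum(grades) / len(grades)

# Try every integer prediction and keep the one with the smallest loss.
best = min(range(0, 101), key=lambda a: average_square_loss(a, grades))
print(mean_grade, best)
```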

# Probability, statistics, frequentist and Bayesian

This post is a review of basic concepts in probability and statistics.

Useful reference: https://cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15/notes.html

https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/

Probability

It’s a tool to mathematically measure uncertainty.

Formal definition involving a $\sigma$-algebra:

A probability space is a triple $(\Omega, F, P)$ consisting of:

• A sample space $\Omega$
• A set of events F, which will be a $\sigma$-algebra
• A probability measure P that assigns probabilities to the events in F.

Example: We have a fair coin. Now we toss it 1000 times, what’s the probability of getting 600 heads or more?
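
This example can be answered exactly by summing binomial probabilities. A minimal sketch for a fair coin, with a hypothetical helper name:

```python
from math import comb


def fair_coin_tail(n, k):
    """P(at least k heads in n tosses of a fair coin), computed exactly with integers."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))  # number of outcomes with >= k heads
    return favourable / 2 ** n                             # all 2**n outcomes are equally likely


p_600 = fair_coin_tail(1000, 600)
print(p_600)
```

The result is tiny, far below one in a million, even though 600 heads is only 100 more than the expected 500.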

Statistics

The goal of statistics is to 1) draw conclusions from data (e.g. reject the null hypothesis) and 2) evaluate the uncertainty of these conclusions (e.g. p-value, confidence interval, or posterior distribution).

At bottom, a statistical statement is also about probability, because statistics applies probability to draw conclusions from data.

Example: We would like to know whether the probability of raining tomorrow is 0.99. Then tomorrow comes, and it does not rain. Do we conclude that P(rain) = 0.99 is true?

Example 2: We would like to decide if a coin is fair. (Data) Toss the coin 1000 times, and 809 times it’s a head. Do we conclude the coin is fair?

Note: probability is logically self-contained. There are a few rules, and the answers follow from the rules. Statistics can be messy, because it involves drawing conclusions from data, which is more art than science.

Frequentist vs Bayesian

Two schools of statistics. They differ in their interpretation of probability.

Frequentists interpret probability as the long-run frequency of events in repeated experiments. E.g. P(head) = 0.6 means that if we toss a coin 1000 times, we expect about 600 heads.

Bayesians interpret probability as a state of knowledge, or a state of belief, about a proposition. E.g. P(head) = 0.6 means we are fairly certain (around 60% certain!) that a coin toss will come up heads.

In practice though, Bayesians seldom use a single value to characterize such a belief. Rather, they use a distribution.

Frequentist methods are used in social science, biology, medicine, public health. We see two-sample t-tests, p-values. Bayesian methods are used in computer science, “big data”.

Core difference between Frequentists and Bayesian

Bayesian considers the results from previous experiments, in the form of a prior.

See this comic for an illustration.

What does it mean?

A frequentist and a Bayesian are making a bet about whether the sun has exploded.

It’s night, so they can not observe.

They ask some expert whether the sun has gone Nova.

They also know that this expert will roll two dice. If both come up 6, she will lie. Else, she won’t. (Data generation process)

Now they ask the expert, who tells them yes, the sun has gone Nova.

The frequentist concludes that since the probability of getting two 6’s is 1/36 ≈ 0.028 < 0.05 (p < 0.05), it’s very unlikely the expert has lied. Thus, she concludes the expert did not lie, and that the sun has exploded.

Bayesian, however, has a strong belief that the sun has not exploded (or else they will be dead already). The prior distribution is

• P(sun has not exploded) = 0.99999999999999999,
• P(sun has exploded) = 0.00000000000000001.

Now the data generation process is essentially the following distribution:

• P(expert says sun exploded |Sun not exploded) =  1/36.
• P(expert says sun exploded |Sun exploded) =  35/36.
• P(expert says sun not exploded |Sun exploded) =  1/36.
• P(expert says sun not exploded |Sun not exploded) =  35/36.

The observed data is “expert says sun exploded”. We want to know

• P( Sun exploded | expert says sun exploded ) = P( expert says sun exploded | Sun exploded) * P( Sun exploded) / P(expert says sun exploded)

Since P(Sun exploded) is extremely small compared to other probabilities, P( Sun exploded | expert says sun exploded ) is also extremely small.

Thus although the expert is unlikely to lie (p ≈ 0.028), the sun is much more unlikely to have exploded. Thus, the expert most likely lied, and the sun has not exploded.
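
The posterior can be computed exactly with the prior and the lying probabilities listed above. The sketch uses exact fractions so that the very small prior is not lost to floating point rounding; the variable names are made up.

```python
from fractions import Fraction

# Prior from the text: P(sun has exploded) = 0.00000000000000001 = 1e-17.
prior_exploded = Fraction(1, 10 ** 17)
prior_not_exploded = 1 - prior_exploded

# Data generation process: the expert lies only when both dice show 6.
says_yes_given_exploded = Fraction(35, 36)
says_yes_given_not_exploded = Fraction(1, 36)

# Bayes' rule: P(exploded | expert says exploded).
evidence = (says_yes_given_exploded * prior_exploded
            + says_yes_given_not_exploded * prior_not_exploded)
posterior_exploded = says_yes_given_exploded * prior_exploded / evidence

print(float(posterior_exploded))
```

The posterior is about 35 times the prior, which is still vanishingly small, so the Bayesian keeps believing the sun has not exploded.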

# Literature Review: Enough is enough

Another 6 days passed since I updated my blog – I’m still working on my MPhil thesis.

The problem? I started out too broad. After sending an (overdue) partial draft to the supervisor, she suggested I stop reviewing new literature. I then began wrapping things up.

After drawing the limit of the literature, writing suddenly became much easier.

I in fact write faster.

I also read faster. On papers on attitude change, it became easier to identify key arguments and let go of minor ones. On news about the background and history of protests in Hong Kong, it became easier to focus on what and how much is needed for my case. I briefly discussed types of movement histories in Hong Kong, without going deeper into SMO strategies.

Thus, a lesson might be that drawing boundaries is a hard but crucial step.

The lesson might also be that having a clear deliverable improves efficiency.

For example, this PhD spent 10+ years in his program… And it seemed he had a similar problem. On the surface, it might be procrastination. One level down there is anxiety, shame, guilt and low self-esteem. One more level down, this is because of unclear goals and priorities.

What could I have done better to finish this quicker?

• Talk to experienced people more often. Drawing boundaries is hard and there is no clearly defined rule. Thus the only way is to learn from experience, and let them judge if this is enough! (Tacit knowledge / uninstitutionalized knowledge)

# Literature Review: deleting part of your writing might be the solution

I got stuck writing the second half of my literature review in the past few days. This post describes a solution to it.

My initial literature review has something on

• 1) belief structure — in cultural sociology and cognitive sociology
• 2) attitude change  — in social psychology
• 3) political socialization — in political psychology

For a while, I was stuck and did not know why I was stuck. I tried to write on agents of political socialization (family, peers, school, reference group…).

But then I found some literature on undergraduate political socialization, and that leads me to another rather broad field. I then became idle.

As with programming, when you find yourself thinking instead of writing, something is wrong.

After talking to a PhD student, I realized my problem was I was trying to say too many things.

“Belief” / “opinion” / “attitude” / “understandings” are not merely words in social science. They are concepts. So, for each of them, the literature is vast.

I cannot possibly write about both belief and attitude in my literature review, because that would be too much. Moreover, they are not the same thing, so they do not hold together.

After deleting all writings about belief structure and cognitive sociology, the literature review becomes much clearer.

A broad lesson is it takes practice to recognize the scope of literature, the theories and what is useful. A good idea is a clear idea.

# I will probably be applying for psychology grad programs. And I will write my first chapter of literature like this

This post describes 1) my past few days’ work since Christmas (not very much), 2) my change of future plans and the reasons, and 3) my progress on the MPhil thesis literature review chapter.

My last few days work

After the Christmas break my productivity decreased (from ~ 11 hours per day to ~ 8, then to ~ 5). This seems to validate the hypothesis that a structured environment and external accountability are important.

After coming back to my old office in Hong Kong, my productivity immediately decreased by at least 70%. Truly amazing. Possible explanations: 1) jet lag. I woke up at 4am the day before, slept at 4pm, then woke up at 10pm that day. In-between I experienced periods of alternating hallucination (no kidding) and absent-mindedness. I sang along to Youtube videos for 2 hours early this morning (3am – 5am). Weird.

Explanation 2) this environment aroused certain negative emotions associated with past experience. Whenever I sat at my old desk I found it hard to concentrate, falling back to old patterns. In a new place where I had not worked before I felt much better.

My changed future plans and why

On another note, to the readers of this blog, I very likely will apply for PhD programs in psychology next year. Continuing to work on my master’s thesis (will talk about it shortly) made me realize my past struggle started from a misalignment of interest with my supervisor, my department, ethnography, and sociology in general.

The question I was interested in was in fact studied more by psychologists, and I am much more satisfied with their approach to the question (belief formation, political attitude formation, etc.)

This is the reason I struggled for so long in my study. I first read about social movement when I was at UW-Madison. Then in my first few months in MPhil in 2015, I found some literature in public opinion research close to what I would like to know, although they are not directly on protests. Protest scholars on the other hand are not that interested in opinions per se – they are interested in opinions (grievances) as an independent variable in explaining emergence of protests and participation.

That’s the first time I was confused. I treated the literature review process as finding answers, but I found that the answer did not exist. Mistake 1: For research, this is supposed to be a good thing (research gap), but I did not realize that at the time. Because of the habit of an undergraduate, I just thought if I did not find an answer that meant I didn’t work hard! (Two years later, in 2017, I wrote to Pam Oliver, then a social movement mailing list she was in, then John Josh, asking them for literature, and indeed there is none from the angle I would like to know.)

Back to Nov 2015. With my supervisor’s pointer I went to the methodological literature (Grounded theory, Methods of Discovery, two books by Howard Becker; I found them not that helpful). I disliked my ethnography class from week 1. I complained to my supervisors but decided to hang on to it (while secretly hating it). Then during Chinese New Year 2016 I read about the necessity of narratives / contingencies, which led to some reading in the sociology of science. Combined with a previous RA task on tacit knowledge, this was basically literature justifying why interpretative sociology / narrative methods in historical sociology are useful. I did not like the arguments, because I found the concepts vague and the logic shabby. Mistake 2: did not take action immediately due to fear of authority.

Then it’s March 2016 or so. I went to sit in on Xiang Biao’s anthropology class. I read some beautiful anthropology, but it was not relevant to my research. Foucault writes about biopolitics and power / knowledge, which led to some efforts in vain. Probably around the same time I was still trying to make sense of ethnography, so I researched Alice Goffman’s On the Run and Sida Liu’s doctoral thesis. Not much progress on the thesis.

In the meantime, I read into cultural sociology (cognitive sociology?) but did not find a fit. Cultural sociology concerns culture / action, but 1) I’m not interested in culture and 2) I’m not interested in action.

In the meantime I read Andreas Glaeser’s book Political Epistemics, recommended by my supervisor. We seemed to agree on using this framework. Reading this book further led to phenomenology and social ontology. Nobody cited the book back then. In 2017 my supervisor published a paper using this book. Now that I think back, I do not really like or believe the theory. I tried to read it so many times, and I will use the theory in my thesis for graduation. But if I’m being utterly honest, I don’t understand why the theory is good and I do not feel comfortable using this theory. I don’t know why.

By this time I’m basically disillusioned and disenchanted with sociology and academia. Around May or Jun I began seriously considering a career change. Mistake 3: In facing a problem, not thinking about solutions but want to run away!

For around 12 months I did not seriously work on my thesis. In Nov and Dec I did a whole round of data coding, but not much writing. Thinking back, this is typical procrastination. In 2017 I started to learn about machine learning and coding. Read briefly into cognitive sociology, ideology… Late April to July, internship. Mistake 4: poor time management skills and no priority.

Reflecting on this experience: the root problem is that my literature review always felt uncomfortable, so I could not formulate a research question using any of the directions I had searched. I felt uncomfortable because belief formation is a problem in social psychology, not sociology, which I realized pretty late.

Thus I will probably be applying for grad schools in psychology next year.

I will write my literature review draft this way

The literature review will have two chapters: why do people have different understandings of the same social movement, and how do they form these understandings.

The why question is tricky. I imagine myself talking to sociology scholars who work on protests. Thus I should cite them. But they have not studied the problem directly. The closest I can find is 1) relative deprivation theory, 2) framing, 3) social identity and 4) system justification. The end goal of all these theories is to explain participation. But I can use them to say: because some have feelings of relative deprivation while others don’t.

Then a natural question is, what leads to perceptions of relative deprivation? Here, sociology stops and is fine with describing the understandings themselves (meaning-making). Psychologists answer this: one theory is frustration-aggression, the other is cognitive dissonance. These two theories are what I really wanted to know.

I will do the same for framing, social identity and system justification. The part of them that describes the psychological process is useful, while the other part, on how the psychological process leads to participation, is not useful. Lesson 1: how to selectively use a theory when it is not a good fit but it’s the only thing I can find. (I might not be doing this right though.)

How do they form these understandings? This part will likely be in political socialization, persuasion, attitude change. Still quite broad. Lesson 2: When reading about a new field for a question, do not be carried away by their arguments. One paper leads to another and this needs to be controlled.

I developed a standard workflow that I currently find helpful:

• Read one highly cited document. Any paper containing the key words. Better if it is a review paper.
• Use the Harvard thesis guide literature review template to trace the theory’s history. Note key sources.
• Write down the bibliographic information of 5 ~ 7 key sources.
• Download these papers. Do not download anything more than these 5 ~ 7. Keep the most important papers! (Dr. Tian had once taught me to find the citation list. Use those with the highest citations.)
• Take notes and read these 5 ~ 7 papers using the Harvard thesis guide template.
• Print out the reading notes. Select the top priority papers again.

This is like preventing over-fitting in machine learning, because both deal with the trade-off between generalization and specific problems. The best strategy is thus early stopping: leave some of the specific data points unvisited for better generalization.

Updates 2018 Mar 13 : I gave up with this idea after taking a course in psychology. Reason is that I am not really interested in their debates.