Monthly Archives: March 2018

Encoding categorical features: likelihood, one-hot, and feature selection

This post describes techniques used to encode high cardinality categorical features in a supervised learning problem.

Since the values cannot be ordered, these features are nominal. Specifically, I am working on a Kaggle competition whose dataset has features (e.g. the type of cell phone operating system) that are categorical and have hundreds of distinct values.

The question is how to feed these features into our model. Nominal features work fine with decision trees (and random forests) and with Naive Bayes (counts estimate the pmf), but other models, e.g. neural networks and logistic regression, need numeric inputs.

Below I introduce likelihood encoding first, and then go over other methods for handling such situations.

Likelihood encoding 

Likelihood encoding is a way of representing the values according to their relationship with the target variable. The goal is to find a meaningful numeric encoding for a categorical feature, where meaningful means as related to the output/target as possible.

How do we do this? A simple way is to 1) group the training set by this particular categorical feature and 2) represent each value by the within-group mean of the target. For example, suppose the feature is gender and the target is height. If the average height of males is 1.70m and the average height of females is 1.60m, we replace ‘male’ with 1.70 and ‘female’ with 1.60.

Perhaps we should also add some noise to this mean to prevent overfitting to the training data. This can be done by:

  • adding Gaussian noise to the mean (an idea credited to Owen Zhang);
  • using the idea of “cross-validation”: instead of the grand group mean computed on all of the training data, use out-of-fold means. (I am not very clear on this point at the moment; I need to examine the idea of cross-validation and will write about it in the next post.) Some people on Kaggle propose using two levels of cross-validation for this.
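The group-mean idea with optional Gaussian noise can be sketched as follows. This is a minimal illustration of the technique, not code from the competition; the helper name and the gender/height data are my own.

```python
import random
from collections import defaultdict

def likelihood_encode(categories, targets, noise_std=0.0, seed=0):
    """Replace each categorical value with the mean target of its group,
    optionally adding Gaussian noise to reduce overfitting."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    rng = random.Random(seed)
    return [means[c] + rng.gauss(0, noise_std) for c in categories]

# The gender/height example from the text:
genders = ["male", "male", "female", "female"]
heights = [1.72, 1.68, 1.58, 1.62]
print(likelihood_encode(genders, heights))  # roughly [1.70, 1.70, 1.60, 1.60]
```

With `noise_std > 0`, each row gets its own slightly perturbed copy of the group mean, which is the overfitting safeguard mentioned above.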

One hot vector

This idea is similar to dummy variables in statistics. Basically, each possible value is transformed into its own column, which is 1 if the original feature equals that value and 0 otherwise.

An example is natural language processing models, where the first steps are usually to 1) tokenize the sentence, 2) construct a vocabulary, and 3) map every token to an index (a number, i.e. a nominal value). After that, we do 4) one hot encoding and 5) a linear transformation in the form of a linear layer in a neural network, which transforms the high-dimensional one-hot vectors into low-dimensional vectors. In this way, we are representing every symbol by a low-dimensional vector whose exact form is learned. What is happening here is really dimension reduction, so after learning the weight matrix, other methods, like PCA, could potentially work here as well!
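A bare-bones one-hot encoder looks like this (the vocabulary and value names are illustrative, not from any particular dataset):

```python
def one_hot(values):
    """Turn each value into a vector with a single 1 in that value's column."""
    vocab = sorted(set(values))               # one column per distinct value
    index = {v: i for i, v in enumerate(vocab)}
    vectors = [[1 if index[v] == j else 0 for j in range(len(vocab))]
               for v in values]
    return vectors, vocab

vectors, vocab = one_hot(["ios", "android", "ios", "windows"])
print(vocab)       # ['android', 'ios', 'windows']
print(vectors[0])  # [0, 1, 0] -- the 'ios' column is hot
```

Multiplying such a one-hot vector by a weight matrix simply selects one row of that matrix, which is why a linear layer on top of one-hot inputs amounts to a learned lookup table of low-dimensional vectors.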


Double hashing

A classmate named Lee Tanenbaum told me about this idea, an extension of one-hot encoding. Suppose the feature can take n values. Basically, we use two hash functions to map each value into a pair of variables, A and B. The first hash function maps all values into \sqrt(n) baskets, so A takes \sqrt(n) possible values, and all feature values in the same basket get the same A. Then we use a second hash function, carefully chosen so that the combination of A and B can fully represent every possible value of the original feature. We then learn a low-dimensional representation for both A and B and concatenate them together.

My opinion on this is that it is still one-hot encoding plus weight learning; we are just forcing a certain structure onto the weight matrix.

Feature selection 

Still based on one-hot encoding, but instead of compressing everything into a tiny low-dimensional vector, we discard some dummy variables based on their importance. In fact, LASSO does exactly this! The L1 penalty usually drives the coefficients of some features to zero, due to the diamond shape of the constraint region.
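To see the sparsity concretely: in the special case of an orthonormal design, the LASSO solution is just the least-squares coefficients passed through soft-thresholding, so small coefficients become exactly zero. The coefficient values below are hypothetical, purely for illustration.

```python
def soft_threshold(beta, lam):
    """LASSO solution for one coefficient under an orthonormal design:
    shrink the least-squares coefficient toward zero by lam, and clip
    anything smaller than lam in magnitude to exactly 0."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

ols_coefs = [2.5, -0.3, 0.05, 1.1]   # hypothetical least-squares coefficients
lasso_coefs = [soft_threshold(b, lam=0.5) for b in ols_coefs]
print(lasso_coefs)  # two of the four coefficients are driven exactly to zero
```

The dummy variables whose coefficients land on zero are exactly the ones we discard.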

Domain specific methods

These models exploit the relationships between symbols in the training set and learn a vector representation of every symbol (e.g. word2vec). This is certainly another way of vectorizing the words. I will probably write more about this after learning more about representation learning!

Python generators

This short post describes what a generator is in Python.

A function with yield in it is still a function; however, calling it returns a generator object. Generators allow you to pause a function and return an intermediate result: the function saves its execution context and can be resumed later if necessary.

def fibonacci():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

g = fibonacci()

[next(g) for i in range(10)]

This will return [1, 1, 2, 3, 5, 8, 13, 21, 34, 55].

When we evaluate the list comprehension again, it will return:

[next(g) for i in range(10)]

[89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]

Here, note that the generator is like a machine that produces what you want and also remembers its last state. This is different from a function that returns a list, which does not remember its state between calls.
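For contrast, here is a list-returning version of the same Fibonacci routine; calling it twice just starts over, whereas the generator above resumes where it left off.

```python
def fib_list(n):
    """List-returning version: recomputes from scratch on every call."""
    out, a, b = [], 0, 1
    for _ in range(n):
        out.append(b)
        a, b = b, a + b
    return out

print(fib_list(5))  # [1, 1, 2, 3, 5]
print(fib_list(5))  # [1, 1, 2, 3, 5] -- starts over; no state is kept
```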

Python generators can also interact with the calling code. yield becomes an expression, and a value can be passed in along with a new method called send. Here is an example piece of code:

def psychologist():
    print("Please tell me your problem")
    while True:
        answer = (yield) # note the usage of yield here 
        if answer is not None:
            if answer.endswith("?"): # note the lowercase endswith
                print("Don't ask yourself too many questions.")
            elif "good" in answer:
                print("Ah that's good, go on.")
            elif "bad" in answer:
                print("Don't be so negative.")

This defines a function that returns a generator.

free = psychologist()

type(free)

This will return the class “generator”.

next(free)

This will print “Please tell me your problem”.

free.send("Should I stay or should I go?")

This will print “Don’t ask yourself too many questions.” (The exact question is just an illustration; any string ending in “?” triggers this reply.)

free.send("I'm feeling bad.")

This will print “Don’t be so negative.”

free.send("But she is feeling good!")

This will print “Ah that’s good, go on.”

send acts like next, but makes yield return the value passed in. The function can therefore change its behavior depending on the client code.

Should I use an IDE, or should I use Vim?

This problem has been bugging me for a while, so I decided to write it out even though it’s just a short piece. This post compares tools for Python programming:

  • Jupyter Notebook
  • IDEs like PyCharm
  • Text editors like Vim

Jupyter Notebook:

  • The pro is that it’s easy to visualize things. When you want a graph, you see the graph immediately. The comments are also beautifully formatted.
  • Another pro is that it can be connected to a Linux server like Dumbo.
  • The con is that it’s not a program. A notebook is its own file, and although it can be downloaded as a .py file, that file is usually too long, with lots of comments like typesetting parameters.

When to use it?

  • Data exploration, because of its visualization and analysis nature.
  • Big data. Because it can be connected to a server, running a large amount of data is possible.


PyCharm:

  • The pro is that it’s well suited for Python development. I have not learned all its functionality yet, but, e.g., search and replace is easily doable in PyCharm. Debugging also seems to be easy.
  • The con is that it’s usually not available on a server.
  • Another con is the extra finger movement needed when switching from the terminal to PyCharm.

When to use it?

  • Debugging complicated programs, e.g. NLP programs.
  • When there is no need to run on a server.


Vim:

  • The pro is that it’s everywhere. Whether you write on your own machine or on the server, it feels the same.
  • Another pro is that it can be used for anything: Python, C++, Markdown, Bash… so there is no need to switch tools when you ssh to the server.
  • The con is that it’s not that easy to use; e.g., search and replace is hard to do, and adjusting tabs is not immediately doable either.
  • Another con is that it’s not that easy to debug with; you have to manually print out variables, which makes things particularly difficult when the program is large.

When to use it?

  • When you need to connect to a server, e.g. for big data.