Encoding categorical features: likelihood, one-hot, and feature selection

This post describes techniques used to encode high cardinality categorical features in a supervised learning problem.

Since their values cannot be ordered, these features are nominal. Specifically, I am working with the Kaggle competition here. The problem with this dataset is that some features (e.g. types of cell phone operating systems) are categorical and have hundreds of values.

The problem is how to feed these features into our model. Nominal features work fine with decision trees (random forests) and Naive Bayes (use counts to estimate the pmf). But for other models, e.g. neural networks and logistic regression, the inputs need to be numbers.

Below, I introduce likelihood encoding first, then go over other methods for handling such situations.

Likelihood encoding

Likelihood encoding is a way of representing the values according to their relationships with the target variable. The goal is finding a meaningful numeric encoding for a categorical feature. Meaningful in this case means as much related to the output/target as possible.

How do we do this? A simple way is to 1) group the training set by this particular categorical feature and 2) represent each value by the within-group mean of the target. For example, suppose the categorical feature is gender and the target is height. If the average height for males is 1.70m and the average height for females is 1.60m, we then change ‘male’ to 1.70 and ‘female’ to 1.60.

Perhaps we should also add some noise to this mean to prevent overfitting to the training data. This can be done in two ways:

• Add Gaussian noise to the mean (credit to Owen Zhang).
• Use the idea of “cross-validation”: instead of using the grand group mean, use a cross-validation mean. (Not very clear on this point at the moment. Need to examine the idea of cross-validation. Will write in the next post.) Some people on Kaggle propose using two levels of cross-validation: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
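As a concrete sketch of steps 1) and 2) above, plus the Gaussian-noise idea, here is a minimal version assuming pandas and a toy gender/height table (column names and numbers are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy training data: a nominal feature and a numeric target.
train = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "male"],
    "height": [1.72, 1.58, 1.68, 1.62, 1.70],
})

# 1) Group by the categorical feature; 2) take the within-group target mean.
group_means = train.groupby("gender")["height"].mean()

# Replace each category with its group mean...
encoded = train["gender"].map(group_means)

# ...optionally adding a little Gaussian noise against overfitting.
encoded_noisy = encoded + rng.normal(0, 0.01, size=len(encoded))
```

In practice the means would be computed on training folds only, per the cross-validation idea above.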

One hot vector

This idea is similar to the dummy variable in statistics. Basically, each possible value is being transformed into its own columns. Each of these columns will be a 1 if the original feature equals this value, or 0 if the original feature does not equal this value.

An example is natural language processing models, where the first steps are usually to 1) tokenize the sentence, 2) construct a vocabulary, and 3) map every token to an index (a number, or a nominal value, basically). After that, we do 4) one-hot encoding and 5) a linear transformation in the form of a linear layer in a neural network (basically transforming high-dim one-hot vectors into low-dim vectors). In this way, we are basically representing every symbol as a low-dimensional vector. The exact form of the vector is learned. What is happening here is actually dimension reduction. So, apart from learning the weighting matrix, other methods, like PCA, can potentially work here as well!
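A minimal sketch of the one-hot step itself, using pandas (the `os` column is a hypothetical operating-system feature):

```python
import pandas as pd

# Hypothetical nominal feature: phone operating system.
df = pd.DataFrame({"os": ["android", "ios", "android", "other"]})

# Each possible value becomes its own 0/1 column.
one_hot = pd.get_dummies(df["os"], prefix="os")
```

With hundreds of values, this produces hundreds of columns, which is why the follow-up dimension reduction matters.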

Hashing

A classmate named Lee Tanenbaum told me about this idea. This is an extension of the one-hot encoding idea. Suppose the feature can take n values. Basically, we use two hash functions to map the values into two new variables. The first hashes all values into roughly \sqrt(n) buckets, so each bucket holds about \sqrt(n) of the feature values. All feature values in the same bucket get the same value for variable A. Then we use a second hash function that carefully hashes the values into another bucket variable, B. We want to make sure the combination of A and B can fully represent every possible value of the original feature. We then learn a low-dim representation for both A and B, and concatenate them together.

My opinion on this is, this is still one-hot encoding + weight learning. However, we are forcing certain structures onto the weight matrix.
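A rough sketch of the two-hash idea in plain Python, using md5 for deterministic hashing (the salts and bucket count are my own illustrative choices; a real scheme would verify that the (A, B) pairs are unique across values):

```python
import hashlib
import math

def bucket(value, salt, n_buckets):
    """Deterministically hash a string value into one of n_buckets."""
    digest = hashlib.md5((salt + value).encode()).hexdigest()
    return int(digest, 16) % n_buckets

values = [f"os_{i}" for i in range(100)]   # n = 100 distinct feature values
n_buckets = math.isqrt(len(values))        # ~sqrt(n) buckets per hash

# Two independent hashes (different salts); the pair (A, B) stands in
# for the original value, and A and B each get their own embedding.
encoded = {v: (bucket(v, "a", n_buckets), bucket(v, "b", n_buckets))
           for v in values}
```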

Feature selection

Still based on one-hot encoding. However, instead of compressing everything into a tiny low-dim vector, we discard some dummy variables based on their importance. In fact, LASSO does exactly this! The L1 penalty usually drives the coefficients of some features to zero, due to the diamond shape of the constraint region. Sources:

• On why L1 gives sparsity, a video here: https://www.youtube.com/watch?v=14MKVkhvMus&feature=youtu.be
• A course here: https://onlinecourses.science.psu.edu/stat857/book/export/html/137
• A CrossValidated answer here: https://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection
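A small sketch of this with scikit-learn’s Lasso on synthetic binary dummy variables (the data is made up; only the first two features carry signal), showing the L1 penalty zeroing out coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# 50 samples, 10 binary dummy features; only the first two matter.
X = rng.integers(0, 2, size=(50, 10)).astype(float)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=50)

model = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty drives some coefficients exactly to zero,
# effectively discarding those dummy variables.
kept = np.flatnonzero(model.coef_)
```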

Domain specific methods

These models exploit the relationships between symbols in the training set and learn a vector representation of every symbol (e.g. word2vec). This is certainly another way of vectorizing the words. I will probably write more about this after learning more about representation learning!

Python generators

This short post describes what a generator is in Python.

A function with yield in it is still a function. However, when called, it returns a generator object. Generators allow you to pause a function and return an intermediate result. The function saves its execution context and can be resumed later if necessary.

def fibonacci():
    a, b = 0, 1
    while True:
        yield b
        a, b = b, a + b

g = fibonacci()

[next(g) for i in range(10)]


This will return [1, 1, 2, 3, 5, 8, 13, 21, 34, 55].

When we run the list comprehension again, it returns:

[next(g) for i in range(10)]


[89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765]

Here, note the function is like a machine that can generate what you want. It also remembers its last state. This is different from a function that returns a list, which does not remember its last state.

Python generators can also interact with the calling code. yield becomes an expression, and a value can be passed along with a new method called send. Here is an example piece of code:

def psychologist():
    print("Please tell me your problem.")
    while True:
        answer = (yield)  # note the usage of yield as an expression here
        if answer is not None:
            if answer.endswith("?"):
                print("Don't ask yourself too many questions.")
            elif "good" in answer:
                print("Ah, that's good, go on.")
            elif "bad" in answer:
                print("Don't be so negative.")


This defines a function that can return a generator.

free = psychologist()

type(free)


This will return “generator”

next(free)


This will print “Please tell me your problem.”

free.send("what?")


This will print “Don’t ask yourself too many questions.”

free.send("I'm feeling bad.")


This will print “Don’t be so negative.”

free.send("But she is feeling good!")


This will print “Ah, that’s good, go on.”

send acts like next, but makes yield return the value passed in. The function can therefore change its behavior depending on the client code.

Should I use an IDE, or should I use Vim?

This problem has been bugging me for a while, so I decided to write it out even though it’s just a short piece. This post compares tools for Python programming:

• Jupyter Notebook
• IDEs like PyCharm
• Text editors like Vim

Jupyter Notebook:

• The pro is it’s easily visualizable. When you want a graph, you can see a graph immediately. The comments are also beautifully formatted.
• Another pro is it can be connected to a Linux server like Dumbo.
• The con is, it’s not a program. A notebook is its own file, and although it can be downloaded as a .py file, the file is usually too long, with lots of comments like typesetting parameters.

When to use it?

• Data exploration, because of its visualization and analysis nature.
• Big data. Because it can be connected to a server, running large amounts of data is possible.

PyCharm

• The pro is it’s suited for Python development. I have not learnt all the functionalities yet, but e.g. search and replace is easily doable in PyCharm. Debugging also seems to be easy.
• The con is it’s usually not available on a server.
• Another con is it needs extra finger movement when switching from the terminal to PyCharm.

When to use it?

• Debugging complicated programs, e.g. NLP programs.
• When there is no need to run on a server.

Vim

• The pro is it’s everywhere. Whether you write on your own machine or on the server, it feels the same.
• Another pro is it can be used for anything: Python, C++, Markdown, bash… So there is no need to switch tools when ssh-ing to the server.
• The con is it’s not that easy to use. e.g. search and replace is hard to do. Adjusting tabs? Also not immediately doable.
• Another con is it’s not that easy to debug: you have to manually print out variables… This makes it particularly difficult when the program is large.

When to use it?

• When you need to connect to a server, e.g. for big data.

Chi-square and two-sample t-test

This post explains a basic question I encountered, and the statistical concepts behind it.

The real-life problem

Someone asked me to construct data to show that a treatment is useful for 1) kindergarten and 2) elementary school kids in preventing winter colds.

Chi-square and Student’s t test

First, decide how to formulate the problem using statistical tests. This includes deciding the quantity and statistic to compare.

Basically, I need to compare two groups. Two tests come to mind: Pearson’s chi-square test and the two-sample t-test. This article summarizes the main differences between the two tests, in terms of null hypothesis, types of data, variations, and conclusions. The following section is largely based on that article.

Null Hypothesis

• Pearson’s chi-square test: tests the relationship between two variables, i.e. whether one variable has an effect on the other. e.g. men and women are equally likely to vote Republican, Democrat, other, or not at all. Here the two variables are “gender” and “voting choice”. The null is “gender does not affect voting choice”.
• Two-sample t-test: tests whether two samples have the same mean. Mathematically, this means $\mu_1 = \mu_2$, or $\mu_1 - \mu_2 = 0$. e.g. boys and girls have the same height.

Types of Data

• Pearson’s chi-square test: usually requires two variables. Each is categorical and can have any number of levels. e.g. one variable is “gender”, the other is “voting choice”.
• Two-sample t-test: also requires two variables. One variable has exactly two levels (the two samples); the other is quantitative. e.g. in the example above, one variable is gender, the other is height.

Variations

• Pearson’s chi-square test: variations can be when the two variables are ordinal instead of categorical.
• two-sample t-test: variations can be that the two samples are paired instead of independent.

Transform the real-life problem into a statistical problem

Using chi-square test

Variable 1 is “using treatment”. Two levels: use or not.

Variable 2 is “getting winter cold”. Two levels: get cold or not.

For kindergarten kids and for elementary school kids, I thus have two 2 × 2 tables.

(question: can I do a chi-square test on three variables? The third one being “age”.)

Using two-sample t-test

Variable 1 is “using treatment”. Two levels: use or not

Variable 2 is supposed to be a numerical variable, here the disease rate. But then there are not enough samples.

Thus, I conclude that the chi-square test should be used here.
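As a sketch, the 2 × 2 test can be run with scipy; the counts below are made up purely for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts for one age group:
# rows = treatment / no treatment, columns = got cold / no cold.
table = [[10, 90],
         [30, 70]]

chi2, p, dof, expected = chi2_contingency(table)
```

For a 2 × 2 table, dof = 1, and a small p suggests treatment and getting a cold are not independent.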

Brief explanation of statistical learning framework

This post explains what is a statistical learning framework, and common results under this framework.

Problem

We have a random variable X, another random variable Y. Now we want to determine the relationship between X and Y.

We define the relationship by a prediction function f(x). For each x, this function produces an “action” a in the action space.

Now how do we get the prediction function f? We use a loss function l(a, y) that, for each a and y, produces a “loss”. Note that since X is a random variable and f is a transformation of it, a = f(X) is a random variable, too.

Also, l(a, y) is a transformation of (a, y), so l(a, y) is a random variable too. Its distribution depends on both X and Y.

We then find f by minimizing the expectation of the loss, which is called the “risk”. Since the distribution of l(a, y) depends on the distributions of both X and Y, to get this expectation we integrate over both X and Y. In the case of discrete variables, we sum over the pmf of (X, Y).

The above concerns the theoretical properties of X, Y, the loss function, and the prediction function. But we usually do not know the distribution of (X, Y). Thus, we minimize the empirical risk instead: we sum the losses over the m training samples and divide by m. (Q: does this resemble the Monte Carlo method? Is this about computational statistics? Need a review.)

Results

In the case of square loss, we have the result a = E(Y | X = x).

In the case of 0-1 loss, we have the result a = arg max_y P(y | x).

Example:

We want to predict a student’s mid-term grade (Y). We want to know the relationship between predicted value, and whether she is hard-working (X).

We use square loss for this continuous variable Y. We know that to minimize square loss, for any random variable we should predict its mean value (c.f. regression analysis; in the OLS scenario we minimize the MSE, but this needs a further connection to this framework).

Now suppose we observe that, unfortunately, the student is not hard-working.

We know for a not-hardworking student the expectation of mid-term grade is 40.

We then predict the grade to be 40, as a way to minimize square-loss.
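The example above can be checked numerically: under square loss, the empirical risk of predicting a constant is minimized (up to grid resolution) at the sample mean. The grades below are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated grades of not-hardworking students, centered around 40.
y = rng.normal(40, 10, size=10_000)

def empirical_risk(a):
    """Average square loss of predicting the constant a."""
    return np.mean((a - y) ** 2)

# Search over a grid of constant predictions; the minimizer should
# sit at (approximately) the sample mean.
grid = np.linspace(20, 60, 401)
best = grid[np.argmin([empirical_risk(a) for a in grid])]
```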

Probability, statistics, frequentist and Bayesian

This post is a review of basic concepts in probability and statistics.

Useful reference: https://cims.nyu.edu/~cfgranda/pages/DSGA1002_fall15/notes.html

https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/

Probability

It’s a tool to mathematically measure uncertainty.

Formal definition involving a $\sigma$-algebra:

A probability space is a triple $(\Omega, F, P)$ consisting of:

• A sample space $\Omega$
• A set of events F, which will be a $\sigma$-algebra
• A probability measure P that assigns probabilities to the events in F.

Example: We have a fair coin. Now we toss it 1000 times; what’s the probability of getting 600 heads or more?
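This example can be answered directly with the binomial distribution, e.g. using scipy:

```python
from scipy.stats import binom

# X ~ Binomial(1000, 0.5); P(X >= 600) = P(X > 599) = sf(599).
p = binom.sf(599, 1000, 0.5)
```

The probability is tiny, which is exactly what the later coin-fairness example leans on.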

Statistics

The goals of statistics are to 1) draw conclusions from data (e.g. reject the null hypothesis) and 2) evaluate the uncertainty of this information (e.g. p-value, confidence interval, or posterior distribution).

At bottom, a statistical statement is also about probability, because it applies probability to draw conclusions from data.

Example: We would like to know whether the probability of raining tomorrow is 0.99. Then tomorrow comes, and it does not rain. Do we conclude that P(rain) = 0.99 is true?

Example 2: We would like to decide if a coin is fair. (Data) Toss the coin 1000 times, and 809 times it’s a head. Do we conclude the coin is fair?

Note: probability is logically self-contained. There are a few rules, and the answers follow from the rules. Statistics can be messy, because it involves drawing conclusions from data, which is as much art as science.

Frequentist vs Bayesian

Two schools of statistics. They are different in their interpretation of probability.

Frequentists interpret probability as the frequency of events in repeated experiments. E.g. if P(head) = 0.6, then if we toss a coin 1000 times, we expect about 600 heads.

Bayesians interpret probability as a state of knowledge, or a state of belief, about a proposition. E.g. P(head) = 0.6 means we are fairly certain (around 60% certain!) that a coin toss will come up heads.

In practice, though, Bayesians seldom use a single value to characterize such a belief. Rather, they use a distribution.

Frequentist methods are used in social science, biology, medicine, and public health: we see two-sample t-tests and p-values. Bayesian methods are used in computer science and “big data”.

Core difference between Frequentists and Bayesian

Bayesians consider the results of previous experiments, in the form of a prior.

See this comic for an illustration.

What does it mean?

A frequentist and a Bayesian are making a bet about whether the sun has exploded.

It’s night, so they can not observe.

They ask some expert whether the sun has gone Nova.

They also know that this expert will roll two dice. If both come up 6, she will lie. Otherwise, she won’t. (This is the data-generating process.)

Now they ask the expert, who tells them yes, the sun has gone Nova.

The frequentist concludes that since the probability of rolling two 6’s is 1/36 ≈ 0.028 < 0.05 (p < 0.05), it’s very unlikely the expert has lied. Thus, she concludes the expert did not lie, and therefore that the sun has exploded.

The Bayesian, however, has a strong prior belief that the sun has not exploded (or else they would be dead already). The prior distribution is

• P(sun has not exploded) = 0.99999999999999999,
• P(sun has exploded) = 0.00000000000000001.

Now the data generation process is essentially the following distribution:

• P(expert says sun exploded |Sun not exploded) =  1/36.
• P(expert says sun exploded |Sun exploded) =  35/36.
• P(expert says sun not exploded |Sun exploded) =  1/36.
• P(expert says sun not exploded |Sun not exploded) =  35/36.

The observed data is “expert says sun exploded”. We want to know

• P( Sun exploded | expert says sun exploded ) = P( expert says sun exploded | Sun exploded) * P( Sun exploded) / P(expert says sun exploded)

Since P(Sun exploded) is extremely small compared to other probabilities, P( Sun exploded | expert says sun exploded ) is also extremely small.

Thus, although the expert is unlikely to lie (p ≈ 0.028), the sun is much more unlikely to have exploded. Thus, the expert most likely lied, and the sun has not exploded.
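The Bayesian computation above, written out in a few lines (using 1e-17 as the tiny prior from the list above):

```python
# Bayes' rule for P(exploded | expert says exploded).
prior_exploded = 1e-17
prior_not_exploded = 1.0 - prior_exploded

# Likelihoods from the data-generating process (expert lies w.p. 1/36).
p_says_given_exploded = 35 / 36
p_says_given_not = 1 / 36

numerator = p_says_given_exploded * prior_exploded
evidence = numerator + p_says_given_not * prior_not_exploded
posterior = numerator / evidence
```

The posterior stays astronomically small: the 1/36 likelihood ratio cannot overcome the prior.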

Debugging larger pyspark ML programs

This post describes my learning experience in developing larger programs, especially those that:

• Take a long time to run, due to big data sets and computationally intensive algorithms
• Require developing both locally and on HPC, i.e. cannot be handled entirely within a Python IDE

The takeaway is:

• To save time, try writing scripts in one place only.
• Do not develop interactively and then paste everything into an editor!

The problem

I found myself spending excessive time (~4 days) developing a program that should be simple in its logic. Basically, I just needed to call PySpark API functions to do classification on the 20 Newsgroups dataset. It should be a simple program!

How did I spend my time?

• First day, I found a script doing a similar task. When I tried to use it, I ran into the following problems:
• Reading files on HDFS. Spark only works with files on HDFS, not the local file system. This mistake took me some time, as I thought the problem was with the syntax for reading nested folders.
• The script does not clean the text: removing headers, numbers, and punctuation. To do this, I had to understand the str.translate function. This also works differently in Python 2 and Python 3, which took me some time to realize.
• The script uses PySpark SQL, so it took some time to learn DataFrames as well.
• By day 2 the data was read into a Spark DataFrame.
• I then had problems calling functions from MLlib, because I didn’t realize ML and MLlib are two different libraries with different data structures. I ran into problems when using an MLlib function on ML data.
• I also tried to convert the data structure back and forth, from RDD to LabeledPoint to ML.
• To inspect what is in the data, I also spent time calling the wrong functions, or transforming everything into an RDD and calling map functions.
• By day 3 I intended to use spark-submit on HPC. The main task was to learn to use an editor.
• Because someone told me I should be using an editor instead of debugging interactively, or else I cannot see the code structure, I began to learn Vim. That took a morning or so (!).
• By day 4, I was trying to clean up the code and write functions.
• This creates another level of complexity. One of the bugs was that I forgot to update a function argument.

Trouble-shooting:

• I typed every line of code three to four times: first on my workstation, then in the server’s pyspark shell, and finally copying the code into a script. This not only creates room for mistakes, but is also inefficient.
• I did this because I am not yet comfortable writing scripts on HPC.
• Also because I am not comfortable debugging with a script and an editor, without an IDE.
• I input every line of code at least three to four times:
• First in interactive mode.
• Then, copying the line into my editor. Since the project spans days, every time I start again I need to re-read the files!
• Then I create a script to run on a small dataset on HPC.
• After that, I run the script again on a larger dataset.
• Since I am debugging in multiple places, I need to do version control as well.
• I remember scp-ing files back and forth. Whenever I edited something, I removed the older version of the program at the other place.
• I am not familiar with the data structures and functions in PySpark. This also wastes time.
• Interactive mode in Spark is slow. I once forgot to cut the dataset down, and running one command (e.g. transform) on the whole dataset took 10 minutes! Twenty commands like this would take ~3 hours.

Solution

• The above problems can be summarized as:
• I need to write scripts in multiple places: the server and my local machine.
• Solution: learn to use an editor on the server, e.g. Vim.
• Submitting the program interactively vs in batch. I do not know how to debug with a script, so I have to use interactive mode to make sure I know what I am doing. But debugging interactively means double the amount of typing, because every command needs to go through both the terminal and the editor. That also means running the program twice.
• Solution: learn to run the program from the script. The con is needing to load the data multiple times; to ease this, use a smaller dataset so data loading is not a pain.
• Time spent on start-up. The data file is huge, so re-loading it takes time. If every load takes 3 minutes, loading it 20 times takes an hour.
• Solution: use a smaller sample dataset for development.

Moral: try to write only one set of programs, in a single place.

• For PySpark, I can only use HPC, so I just write on HPC.
• PySpark also cannot use pdb.
• For Python, I should test everything locally first. There are two final programs:
• A script that can run locally and on HPC.
• A script to be submitted to the cluster.

Debugging philosophy

• “Bugs” are not bugs but errors. The responsibility lies with the programmer. A program that has errors is simply wrong, because the programmer is not really familiar with the rules and grammar of the library: using the wrong data structure, calling the wrong function, etc.
• Debugging is a learning experience. Why does debugging take lots of time? Because the programmer is learning something new and needs time to try and make mistakes. There seems to be a wishful thinking that a perfect program will magically begin working by itself, and that the time will be well spent. It won’t, because learning always takes time!

Pyspark ML vs MLLib

A post that summarizes the main differences between PySpark ML and MLlib. This is based on Spark 2.2.0 and Python 3.

Data Structure

• pyspark.mllib is the older machine learning library. It can only use RDDs of LabeledPoint, but it has more features than ML.
• pyspark.ml is the newer machine learning library. It can only use SQL DataFrames, but it is easier to construct an ML pipeline with it.
• It is recommended to use the DataFrame-based APIs (i.e. ML), because they are more stable, and newer.

Features:

The two libraries seem to be similar in terms of feature selection APIs.

Models

• pyspark.ml implements the Pipeline workflow, which is based on DataFrames and can be used to “quickly assemble and configure practical machine learning pipelines”.

Classification algorithms provided by MLlib:

Classification algorithm by ML:

It seems that ML has MultilayerPerceptronClassifier, which is not in MLlib.

Evaluation

• pyspark.ml implements pyspark.ml.tuning.CrossValidator, while pyspark.mllib does not.
• However, PySpark ML’s evaluation functions are difficult to use, and the docs are not clear. Thus, I changed back to MLlib for evaluation.

When to use which?

• Efficiency: MLlib uses LabeledPoint RDDs, while ML uses structured DataFrames. Thus, “if the data is already structured DataFrames, then ML confers some performance benefits over RDDs; this is apparently drastic as the complexity of your operations increases.”
• Resources: DataFrames consume far less memory when caching than RDDs. Thus RDDs are great for lower-level operations, but for high-level operations, viewing, and typing with other APIs, use DataFrames. (The above quotes Stack Overflow user Grr.)
• It is recommended to use ML and DataFrames. However, if you want to use grid search, it seems you still need to revert to LabeledPoint!

Reference :

A stackoverflow question: https://stackoverflow.com/questions/43240539/pyspark-mllib-versus-pyspark-ml-packages

MLlib doc: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

ML doc: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html

Comparison of data structures: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

https://www.quora.com/Why-are-there-two-ML-implementations-in-Spark-ML-and-MLlib-and-what-are-their-different-features

Clean Text in Python3

A pain in the ass. This post summarizes “best” approaches to clean text data in Python3.

It will not cover deprecated syntax from Python 2. For example, string.maketrans has a different usage in Python 2; it is not discussed here.

Is that a string or a Unicode?

Reference here.

When you see a string in Python 2, there are two possibilities:

• ASCII strings: every character in the string is one byte. Look up the hexadecimal value in the ASCII table.
• Unicode strings: every character in the string is one or more bytes. Look up the hexadecimal value in a Unicode encoding table; there are many of them, and the most popular is UTF-8.
• Example: 猫 is represented by three bytes in UTF-8. When Python 2 reads this, it gets it wrong: it thinks there are three characters, when there is in fact just one, because Python 2 decodes as ASCII by default.
• To produce the correct representation, use x.decode(‘utf-8’).

String vs sequence of bytes in Python2

• A string is a sequence of Unicode code points… they are abstract concepts and cannot be stored on disk. They are a sequence of characters.
• Bytes are actual numbers… they can be stored on disk.
• Anything has to be mapped to bytes to be stored on a computer.
• To map code points to bytes, use a Unicode encoding.
• To convert bytes back to a string, use decoding.

String vs sequences of bytes in Python3

• In sum, in Python 3, str represents a Unicode string, while the bytes type is a sequence of bytes. This naming scheme makes a lot more intuitive sense.

Encode vs Decode

• Representing a Unicode string as a string of bytes is known as encoding.
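A tiny Python 3 example of the str/bytes distinction, reusing the 猫 example from earlier:

```python
s = "猫"                  # str: one Unicode code point
b = s.encode("utf-8")     # encoding: str -> bytes

# One character, but three bytes in UTF-8.
lengths = (len(s), len(b))

# Decoding maps the bytes back to the original string.
roundtrip = b.decode("utf-8")
```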

Remove punctuation

The best answer (I think) from Stack Overflow:


import string

# Thanks to Martijn Pieters for this improved version

# This uses the 3-argument version of str.maketrans
# with arguments (x, y, z) where 'x' and 'y'
# must be equal-length strings and characters in 'x'
# are replaced by characters in 'y'. 'z'
# is a string (string.punctuation here)
# where each character in the string is mapped
# to None
translator = str.maketrans('', '', string.punctuation)

# This is an alternative that creates a dictionary mapping
# of every character from string.punctuation to None (this will
# also work)
#translator = str.maketrans(dict.fromkeys(string.punctuation))

s = 'string with "punctuation" inside of it! Does this work? I hope so.'

# pass the translator to the string's translate method.
print(s.translate(translator))



The code above removes punctuation by simply deleting it.

Replace punctuation with a blank

This method uses a regular expression… I find it better than using a translator!

import re

# text is any str, e.g.:
text = 'string with "punctuation" inside of it!'
re.sub(r'[^\w]', ' ', text)


Dealing with all kinds of blanks

Again borrowing from this awesome Stack Overflow post…

# Remove leading and trailing spaces, use str.strip():
sentence = ' hello  apple'
sentence.strip()
>>>'hello  apple'

# Remove all spaces, use str.replace():

sentence = ' hello  apple'
sentence.replace(" ", "")
>>> 'helloapple'

# Remove duplicated spaces, use str.split(), then join the words together

sentence = ' hello  apple'
" ".join(sentence.split())
>>>'hello apple'



Summary of workflow:

1. Decide whether you have a str or a sequence of bytes. This decides whether you need to decode or not. Ultimately we want a string, i.e. the “str” type.
2. Import the re module. IMHO this is the most convenient one for removing punctuation.
   1. re.sub can replace all numbers and punctuation with blanks. Thus, it will not join two unrelated words together when they are connected by punctuation.
3. Use s.lower() to change the string to lower case… Note here s is a str, not a bytes object.
4. Use s.strip() to strip out the excessive blanks!
5. Use " ".join(...) to join the list of words together with only one blank between them, if needed.
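The workflow above can be collected into one hypothetical helper (the function name and exact regexes are my own choices; strip is subsumed by the final split/join):

```python
import re

def clean(text):
    """Hypothetical helper chaining the workflow steps above."""
    text = re.sub(r"[^\w]", " ", text)   # step 2: punctuation -> blanks
    text = re.sub(r"\d+", " ", text)     # ...and numbers too
    text = text.lower()                  # step 3: lower-case
    return " ".join(text.split())        # steps 4-5: collapse blanks, join

result = clean('  Hello, "World"!  It is 2018... ')
```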

Machine Learning in Spark

This post describes Spark’s architecture for building and running machine learning algos.

Machine learning algorithms can be found in Spark’s MLlib and ML APIs; this post uses the DataFrame-based ML API.

The data structure used is the DataFrame. With regard to the last post, this means we need to import the Spark SQL context to use it as well.

There are five key concepts:

• DataFrame: the data structure. Each column can be text, labels, features, or numbers…
• Transformer: an object that transforms one DataFrame into another. A trained machine learning model is represented as a transformer that takes an input DataFrame with features and outputs a DataFrame with the predictions. A transformer is invoked by calling its transform() method on a DataFrame.
• Estimator: an algorithm that can be fit to a DataFrame to produce a transformer. e.g. an ML algorithm can be fit to the training set to produce a model. An estimator is invoked by calling its fit() method on a DataFrame.
• Pipeline: a chain of transformers and estimators that together specify an ML workflow; fitting it produces a trained model.
• Parameters: parameters of the transformer and the estimator.

Now let’s go over the example on the Spark website to understand how to use a machine learning Pipeline.

Step one: read in data

First create the Spark SQL context, and use it to create a Spark DataFrame. This assumes a SparkContext has already been created as sc.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

training = sqlContext.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])



The results of this step:

training.show()
+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



Step two: preprocess + training = ML pipeline

Now, preprocess data by constructing a ML pipeline.

• Tokenizer: tokenizes every text into words.
• HashingTF: this one is trickier. As per the documentation,

Our implementation of term frequency utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e., the number of buckets of the hash table. The default feature dimension is 2^20 = 1,048,576.

My understanding of HashingTF (hashing term frequency) is:

• A hash function maps data of arbitrary size to a fixed-size range. It maps every word to a (hopefully) unique position; there will be collisions, which we simply accept.
• We then count the number of words mapped to each position.
• This saves the step of storing a word-to-index map; instead we only need to store the hash function.
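The steps above can be sketched in a few lines of plain Python. This is only an illustration of the hashing trick: Spark's HashingTF uses MurmurHash3 and a much larger default dimension, while here Python's built-in hash() stands in for the hash function and the dimension is tiny.

```python
# Minimal sketch of the hashing trick (illustration only; Spark's HashingTF
# uses MurmurHash3, not Python's built-in hash()).
def hashing_tf(words, num_features=16):
    counts = [0] * num_features
    for w in words:
        idx = hash(w) % num_features  # word -> bucket index; collisions possible
        counts[idx] += 1              # term frequency accumulated at that bucket
    return counts

vec = hashing_tf("spark hadoop spark".split())
print(len(vec), sum(vec))  # -> 16 3  (16 buckets; 3 tokens counted in total)
```

Note that no word-to-index dictionary is ever built; two different words may land in the same bucket, and their counts simply merge.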

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)



Now let's go over the output of each stage. First, check the results of the tokenizer:

p1 = Pipeline(stages = [tokenizer])
p1.fit(training).transform(training).show()

+---+----------------+-----+--------------------+
| id|            text|label|               words|
+---+----------------+-----+--------------------+
|  0| a b c d e spark|  1.0|[a, b, c, d, e, s...|
|  1|             b d|  0.0|              [b, d]|
|  2|     spark f g h|  1.0|    [spark, f, g, h]|
|  3|hadoop mapreduce|  0.0| [hadoop, mapreduce]|
+---+----------------+-----+--------------------+


Then, check the results of the HashingTF:

hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
p2 = Pipeline(stages = [tokenizer, hashingTF])
p2.fit(training).transform(training).show()

+---+----------------+-----+--------------------+--------------------+
| id|            text|label|               words|            features|
+---+----------------+-----+--------------------+--------------------+
|  0| a b c d e spark|  1.0|[a, b, c, d, e, s...|(262144,[97,98,99...|
|  1|             b d|  0.0|              [b, d]|(262144,[98,100],...|
|  2|     spark f g h|  1.0|    [spark, f, g, h]|(262144,[102,103,...|
+---+----------------+-----+--------------------+--------------------+


Third, check the output of the logistic regression model:

lr = LogisticRegression(maxIter=10, regParam=0.001)
p3 = Pipeline(stages=[tokenizer, hashingTF, lr])
p3.fit(training).transform(training).columns

['id', 'text', 'label', 'words', 'features', 'rawPrediction', 'probability', 'prediction']


In our original data, we only had id, text, and label.

• 'words' is added by the Tokenizer;
• 'features' is added by the HashingTF;
• 'rawPrediction' is produced by the logistic regression. It is the "score" of each class label, and its exact meaning varies from algorithm to algorithm. In a neural network, for example, it would correspond to the last layer's output before the softmax.
• 'prediction': the predicted class label, i.e. the argmax of rawPrediction.
• 'probability': the probability of each class, derived from rawPrediction. For logistic regression this amounts to passing rawPrediction through a softmax (equivalently, a sigmoid in the binary case).
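The rawPrediction-to-probability relationship for logistic regression can be sketched as follows. The raw scores below are hypothetical values, chosen only to show the mechanics; Spark computes them from the learned coefficients.

```python
import math

# Sketch: probability = softmax(rawPrediction); prediction = argmax(rawPrediction).
def softmax(raw):
    shifted = [r - max(raw) for r in raw]   # shift by max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

raw_prediction = [-1.66, 1.66]              # hypothetical per-class scores
probability = softmax(raw_prediction)       # entries sum to 1
prediction = max(range(len(raw_prediction)), key=lambda i: raw_prediction[i])
print(prediction)                           # -> 1 (the higher-scoring class)
```

In the binary case softmax([-m, m]) reduces to the familiar sigmoid of the margin 2m, which is why the two columns always agree on the winning class.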

Step three: making predictions

After fitting the model, we can make predictions on a toy test set.

First, build the test set. It is similar to the training set, except without labels.

test = sqlContext.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop"),
], ["id", "text"])

test.show()
+---+------------------+
| id|              text|
+---+------------------+
|  4|       spark i j k|
|  5|             l m n|
|  6|spark hadoop spark|
|  7|     apache hadoop|
+---+------------------+


Make predictions…

prediction = model.transform(test)

DataFrame[id: bigint, text: string, words: array<string>, features: vector, rawPrediction: vector, probability: vector, prediction: double]


Here prediction has the same schema as the transformed training set. This is because applying a fitted model is itself just a transform: it takes the test DataFrame and appends the prediction columns.

Finally, print out the results we care about: the predictions.

selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

(4, spark i j k) --> prob=[0.159640773879,0.840359226121], prediction=1.000000
(5, l m n) --> prob=[0.837832568548,0.162167431452], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.0692663313298,0.93073366867], prediction=1.000000
(7, apache hadoop) --> prob=[0.982157533344,0.0178424666556], prediction=0.000000


Question: what will the output be if some of the words in the test set never appear in the training set?
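A hedged guess at the answer, illustrated with a sketch: a word never seen in training hashes to a bucket whose logistic-regression weight was never updated, so (barring a hash collision with a training word) it contributes nothing to the score. The weights dictionary below is hypothetical, standing in for the learned per-bucket coefficients.

```python
# Sketch (assumptions: no hash collision; buckets unseen in training keep weight 0).
weights = {"spark": 1.2, "hadoop": -0.8}  # hypothetical learned weights per bucket

def score(words):
    # Unseen words look up a missing bucket and contribute 0 to the margin.
    return sum(weights.get(w, 0.0) for w in words)

print(score(["spark", "newword"]) == score(["spark"]))  # -> True
```

This is consistent with row 7 above: "apache" was never in the training data, so the prediction for "apache hadoop" is driven by "hadoop" alone.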

Reference: https://spark.apache.org/docs/latest/ml-pipeline.html#main-concepts-in-pipelines