This post describes 1) my past few days’ work since Christmas (not very much), 2) my change of future plans and the reasons for it, and 3) my progress on the literature review chapter of my MPhil thesis.

My last few days’ work

After Christmas break productivity… Read more...

All about sociology

This post describes my learning experience in developing larger programs, especially those that:

  • Take a long time to run – due to big data sets and computationally intensive algorithms
  • Require developing locally and on HPC. That is cannot
Read more...

Techy

A post that summarizes the main differences between PySpark ML and MLlib. This is based on Spark 2.2.0 and Python 3.

Data Structure

  • pyspark.mllib is the older library for machine learning. It can only use RDDs of labeled points. But then more features
Read more...

Techy

A pain in the ass. This post summarizes “best” approaches to cleaning text data in Python 3.

It will not cover deprecated syntax from Python 2. For example, string.maketrans has a different usage in Python 2; it is not discussed… Read more...
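As a rough illustration of the Python 3 style mentioned above, here is a minimal sketch that strips punctuation with `str.maketrans` and `str.translate`. The helper name `clean_text` and the choice to lowercase and collapse whitespace are assumptions made for this example, not necessarily what the full post does.

```python
import string


def clean_text(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation, and collapse whitespace.

    `clean_text` is a hypothetical helper written for this sketch.
    """
    # In Python 3, maketrans is a static method on str. The third
    # argument lists the characters that translate() should delete.
    table = str.maketrans("", "", string.punctuation)
    stripped = text.lower().translate(table)
    # split() with no arguments splits on any run of whitespace.
    return " ".join(stripped.split())


print(clean_text("Hello, World!  It's   2018..."))  # hello world its 2018
```

Building the translation table once and reusing it is usually cheaper than chaining many `replace` calls when there is a lot of text to clean.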

Techy

This post describes Spark’s architecture for building and running machine learning algos.

Machine learning algorithms are available through Spark’s MLlib API.

The data structure used is the DataFrame. With regard to the last post,… Read more...

Techy

This simple post describes what Spark SQL is and how to use it. The basic concepts and operations are not difficult to grasp.

What is SparkSQL?

  • Spark’s interface to work with structured and semistructured data.

How to use SparkSQL?

  • Use it inside
Read more...

Techy

This post explains how to build a recommendation system based on matrix factorization. In practice, however, the engineering aspects are more challenging when the dataset is huge – these are not covered here.

Recommendation system: Read more...
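To give a feel for what matrix factorization means here, the following is a toy sketch in plain Python: it learns user and item factor vectors by stochastic gradient descent on a handful of observed ratings. The function names, hyperparameters (`k`, `lr`, `reg`, `steps`) and the tiny ratings list are all assumptions made for illustration; the full post may well use a different algorithm or library.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user u's and item i's factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root-mean-square error over the observed ratings only."""
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


def factorize(ratings, n_users, n_items, k=2, steps=500,
              lr=0.01, reg=0.02, seed=0):
    """Fit factor matrices P (users x k) and Q (items x k) with SGD.

    All default values are illustrative assumptions, not tuned settings.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Hypothetical (user, item, rating) triples; unobserved pairs are simply absent.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = factorize(ratings, n_users=3, n_items=3)
print("training RMSE:", rmse(ratings, P, Q))
print("predicted rating for user 0, item 2:", predict(P, Q, 0, 2))
```

The last line scores a pair that never appears in the training data, which is the whole point of a recommender: the learned factors fill in the missing entries of the rating matrix.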

Techy

After a few months’ study I noticed that the challenge of big data is less an intellectual one than an engineering one. There are algorithms designed specifically for sampling huge data streams (e.g. this Stanford course), but still in … Read more...

Techy

This post explains how to use the function “combineByKey”, as the PySpark documentation does not seem very clear on this.

When to use combineByKey?

  • You have an RDD of (key, value) pairs – a pair RDD
    • In Python, such an RDD is constructed
Read more...
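PySpark's `rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)` takes three functions, and it can help to see what each one does without a cluster. The sketch below is a pure-Python emulation, not Spark itself: the helper `combine_by_key` and the hand-made list of partitions are assumptions for illustration. It computes a per-key mean by carrying a (sum, count) pair as the combiner.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Emulate combineByKey over a list of partitions of (key, value) pairs.

    This is a hypothetical stand-in for Spark's behaviour, for learning only.
    """
    per_partition = []
    for part in partitions:
        combined = {}
        for key, value in part:
            if key in combined:
                # Key already seen in this partition: fold the value in.
                combined[key] = merge_value(combined[key], value)
            else:
                # First time this key appears in this partition.
                combined[key] = create_combiner(value)
        per_partition.append(combined)

    result = {}
    for combined in per_partition:
        for key, comb in combined.items():
            if key in result:
                # Same key came from another partition: merge the combiners.
                result[key] = merge_combiners(result[key], comb)
            else:
                result[key] = comb
    return result


# Two made-up partitions of a pair RDD.
partitions = [[("a", 1), ("b", 4), ("a", 3)], [("a", 5), ("b", 2)]]

sums_counts = combine_by_key(
    partitions,
    lambda v: (v, 1),                          # createCombiner
    lambda c, v: (c[0] + v, c[1] + 1),         # mergeValue
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),  # mergeCombiners
)
means = {k: s / n for k, (s, n) in sums_counts.items()}
print(means)  # {'a': 3.0, 'b': 3.0}
```

Note that the combiner type, here a (sum, count) tuple, differs from the value type, which is the main reason to reach for combineByKey rather than reduceByKey.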

Techy

This post explains the concept of soft classifiers (in their simplest form) and offers examples in sklearn.

Soft classifiers

In classification problems, hard classifiers give the exact predicted class.

But soft classifiers give a probability… Read more...
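The hard versus soft distinction can be sketched without any library. In the toy example below, a real-valued score is turned into a probability with the logistic function (the soft output) and then thresholded into a class label (the hard output). The function names and the 0.5 threshold are assumptions for illustration; in sklearn, the analogous pair on many classifiers is `predict` for the label and `predict_proba` for the probabilities.

```python
import math


def soft_predict(score: float) -> float:
    """Return an estimate of P(class = 1) from a real-valued score."""
    return 1.0 / (1.0 + math.exp(-score))


def hard_predict(score: float, threshold: float = 0.5) -> int:
    """Return the class label by thresholding the soft prediction."""
    return int(soft_predict(score) >= threshold)


# Made-up scores, e.g. the output of a linear model before the sigmoid.
for score in (-2.0, 0.1, 3.0):
    print(score, soft_predict(score), hard_predict(score))
```

The score 0.1 shows why the soft output is useful: the hard label is 1, but the probability is barely above one half, which a downstream decision might want to treat differently from a confident prediction.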

Techy