Debugging larger PySpark ML programs

This post describes my learning experience in developing larger programs, especially those that:

  • Take a long time to run – due to big data sets and computationally intensive algorithms
  • Require developing both locally and on HPC, i.e., cannot be handled entirely inside a Python IDE

The takeaway is:

  • To save time, try writing scripts in one place only.
  • Do not develop interactively and then paste everything into an editor!

The problem

I found myself spending excessive time (~4 days) developing a program that should be simple in its logic. Basically, I just needed to call PySpark API functions to do classification on the 20 Newsgroups dataset. It should be a simple program!

How did I spend my time?

  • First day, I found a script doing a similar task. When I tried to use it, I ran into the following problems:
    • Reading files on HDFS. On the cluster, Spark reads from HDFS, not the local file system. This mistake took me some time, as I thought the problem was with the syntax for reading nested folders.
    • The script does not clean the text – remove headers, numbers, punctuation. To do this, I had to understand str.translate, which also works differently in Python 2 and Python 3; that took me some time to realize. (See the first sketch after this list.)
    • The script uses PySpark SQL, so I also had to spend time learning DataFrames.
  • By day 2, the data was read into a Spark DataFrame.
    • I then had problems calling functions from MLlib, because I didn't realize that ML and MLlib are two different libraries with different data structures. I ran into trouble when applying an MLlib function to ML data. (See the second sketch after this list.)
    • I also tried converting the data structure back and forth, from RDD to LabeledPoint to the ML format.
    • To inspect what is in the data, I also spent time calling the wrong functions, or transforming everything into an RDD and calling map functions.
  • By day 3, I intended to use spark-submit on HPC. The main task was learning to use an editor.
    • Because someone told me I should be using an editor instead of debugging interactively (or else I could not see the code structure), I began to learn Vim. That took a morning or so (!).
  • By day 4, I was trying to clean up the code and write functions.
    • This added another level of complexity. One of the bugs was that I forgot to update a function's arguments.
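
As a concrete reference for the read-and-clean step above, here is a minimal sketch. The HDFS path and column names are placeholders, and it assumes Python 3 (where str.translate takes a table built by str.maketrans, unlike Python 2):

    import string
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("20news-clean").getOrCreate()

    # wholeTextFiles returns (path, content) pairs; on the cluster a bare
    # path resolves to HDFS, while local files need an explicit file:// prefix.
    raw = spark.sparkContext.wholeTextFiles("hdfs:///user/me/20news/*/*")

    # Python 3: build a translation table that drops punctuation and digits.
    drop_chars = str.maketrans("", "", string.punctuation + string.digits)

    def clean(doc):
        # crude header removal: keep everything after the first blank line
        body = doc.split("\n\n", 1)[-1]
        return body.translate(drop_chars).lower()

    docs = raw.map(lambda kv: (kv[0], clean(kv[1]))).toDF(["path", "text"])
    docs.show(3, truncate=60)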
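
The ML vs. MLlib confusion from day 2, in code: spark.ml works on DataFrames and Pipelines, while the older spark.mllib expects RDDs of LabeledPoint, so feeding one library's data structure to the other fails. A minimal spark.ml sketch, where train_df and test_df are assumed to be DataFrames with text and numeric label columns:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.classification import LogisticRegression

    # spark.ml: everything stays a DataFrame flowing through a Pipeline
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="rawFeatures")
    idf = IDF(inputCol="rawFeatures", outputCol="features")
    lr = LogisticRegression(labelCol="label", featuresCol="features")

    pipeline = Pipeline(stages=[tokenizer, tf, idf, lr])
    model = pipeline.fit(train_df)            # train_df: DataFrame(text, label)
    predictions = model.transform(test_df)

    # spark.mllib, by contrast, works on RDDs of
    # pyspark.mllib.regression.LabeledPoint, so passing a DataFrame straight
    # into an mllib function is exactly the mismatch described above.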

Troubleshooting:

  • I typed every line of code at least three to four times, which not only creates room for mistakes but is also inefficient.
    • First in interactive mode: on my workstation, then in the server's pyspark shell.
    • Then, copying the line into my editor. Since the project spanned several days, every time I started again I had to re-read the data files!
    • Next, I created a script for a small dataset on HPC and ran that.
    • After that, I ran the script again on the larger dataset.
    • I worked this way because I was not yet comfortable writing scripts on HPC, nor with debugging from a script and an editor rather than an IDE.
  • Since I was debugging in multiple places, I also needed to do version control.
    • I remember scp-ing files back and forth. Whenever I edited something, I removed the older version of the program at the other end.
  • I was not familiar with PySpark's data structures and functions, which also wasted time.
  • Spark's interactive mode is slow. I once forgot to cut the dataset down, and running a single command (e.g. transform) on the whole dataset took 10 minutes! Twenty commands like that would be ~3 hours. (See the sampling sketch below.)
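
A small, cached sample keeps those interactive transforms fast. A sketch, where the 1% fraction and the docs DataFrame from the earlier sketch are just placeholders:

    # Take a small, reproducible sample once and keep it in memory, so
    # repeated interactive transforms do not touch the full dataset.
    small = docs.sample(fraction=0.01, seed=42).cache()
    small.count()   # force the sample to be materialized and cached

    # Develop against small; switch back to docs only for the real run.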

Solution

  • The above problems can be summarized as:
    • I needed to write the script in multiple places: on the server and locally.
      • Solution: learn to use an editor on the server, e.g. Vim.
    • Submitting the program interactively vs. in batch. I did not know how to debug from a script, so I had to use interactive mode to make sure I knew what I was doing. But debugging interactively means double the typing, because every command has to go through both the terminal and the editor, and it means running the program twice.
      • Solution: learn to run the program from the script (see the sketch after this list). The con is that the data has to be loaded multiple times, so also use a smaller dataset so that data loading will not be a pain.
    • Time spent on start-up. The data file is huge, so re-loading it takes time. If each load of the dataset takes 3 minutes, loading it 20 times is 60 minutes.
      • Solution: use a smaller sample dataset for development.
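
One possible shape for "run the program from the script on a smaller dataset": a single script with a made-up --fraction argument, submitted with spark-submit instead of being pasted line by line into the shell. The file name and flags here are assumptions, not part of the original program:

    # classify_20news.py -- hypothetical skeleton
    import argparse
    from pyspark.sql import SparkSession

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", required=True)         # HDFS path to the data
        parser.add_argument("--fraction", type=float, default=1.0)
        args = parser.parse_args()

        spark = SparkSession.builder.appName("20news").getOrCreate()
        df = spark.read.text(args.input)
        if args.fraction < 1.0:               # small run while developing
            df = df.sample(fraction=args.fraction, seed=42)
        # ... cleaning, feature extraction and classification go here ...
        spark.stop()

    if __name__ == "__main__":
        main()

During development this would be submitted as, say, spark-submit classify_20news.py --input hdfs:///user/me/20news --fraction 0.01, and the --fraction flag is simply dropped for the full run.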

Moral: try to write only one set of programs in a single place.

  • For PySpark, I can only use HPC, so I just write on HPC.
  • PySpark also cannot be used with pdb.
  • For Python, I should test everything locally first. There are two final programs (see the sketch after this list):
    • A script that can run locally and on HPC.
    • A script to be submitted to the cluster.
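
One way to keep the pure-Python parts testable locally, while the Spark glue lives only in the script submitted to the cluster, is to factor helpers such as the cleaning function from the earlier sketch into their own module. A hedged sketch with made-up file names:

    # clean_text.py -- pure Python, testable locally without any Spark
    import string

    _DROP = str.maketrans("", "", string.punctuation + string.digits)

    def clean(doc):
        # drop headers (everything before the first blank line),
        # then punctuation and digits
        body = doc.split("\n\n", 1)[-1]
        return body.translate(_DROP).lower()

    if __name__ == "__main__":
        # quick local check, no cluster needed
        print(clean("Subject: test\n\nHello, World 123!"))

The cluster script then just imports clean and maps it over the data, so mistakes in the cleaning logic are caught locally instead of during a slow spark-submit run.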

Debugging philosophy

  • “Bugs” are not really bugs but errors; the responsibility lies with the programmer. A program that has errors is simply wrong, usually because the programmer is not really familiar with the rules and grammar of the library: using the wrong data structure, calling the wrong function, etc.
  • Debugging is a learning experience. Why does debugging take so much time? Because the programmer is learning something new and needs time to try things and make mistakes. There seems to be a wishful thinking that a perfect program will magically begin working by itself, and that the time will therefore be well spent. It won't, because learning always takes time!