Pyspark ML vs MLLib

A post that summarizes main difference between Pyspakr ML and MLlib. This is based on Spark 2.2.0 and Python 3.

Data Structure

  • pyspark.mllib is the older library for machine learning. It can only use RDD labeled point. But then more features than ML
  • pyspark.ml newer library for machine learning. It can only use sql DataFrame. But easier to construct a ML pipeline.
  • It is recommended to use DataFrame APIs (i.e. ML) because it is more stable. Also this is newer.

Features:

The two libraries seem to be similar in terms of feature selection APIs.

Models

  • pyspark.ml implements Pipeline workflow, which is based on dataFrame and can be used to “quickly assemble and configure practical machine learning pipelines”.

Classification algorithms provided by MLlib:

Classification algorithm by ML:

Seems that ML has multilayerPerceptron which is not in MLLib.

Evaluation

  • pyspark ml implements pyspark.ml.tuning.CrossValidator while pyspark.mllib does not
  • However Pyspark ML’s evaluation function is difficult to use.  The docs are not clear. Thus, changing back to MLlib to use this.

When to use which?

  • Efficiency: MLLib uses Labelpoint RDD, while ML uses structured DataFrame. Thus, “if the data is already structured DataFrames, then ML confer some performance benefits over RDD’s, this is apparently drastic as the complexity of your operations increase.”
  • Resource: DataFrames consume far less memory when caching than RDDs. Thus lower level operations RDD’s are great but high level operations, viewing and typing with other API’s use DataFrames. (Above all quote Stackoverflow user Grr)
  • It is recommended to use ML and DataFrame. However if want to use Gridsearch, seems still need to revert to Labeled point!

Reference :

A stackoverflow question: https://stackoverflow.com/questions/43240539/pyspark-mllib-versus-pyspark-ml-packages

MLlib doc: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html

ML doc: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html

Comparison of data structures: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

https://www.quora.com/Why-are-there-two-ML-implementations-in-Spark-ML-and-MLlib-and-what-are-their-different-features

 

Share this post

2 thoughts on “Pyspark ML vs MLLib

  1. XY

    DataFrame的转化其实很方便的,我记得Spark支持绝大多数热门结构直接转成DataFrame
    pyspark.ml主要的问题还是目前能用的算法不多,很多pyspark.mllib支持的算法都不支持
    不过pyspark.mllib在spark2.0后已经停止更新了,希望很多算法能移植到pyspark.ml库里吧

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *