A post that summarizes the main differences between PySpark ML and MLlib. This is based on Spark 2.2.0 and Python 3.
- pyspark.mllib is the older machine learning library. It operates only on RDDs of LabeledPoint, but it covers some features that ML does not.
- pyspark.ml is the newer machine learning library. It operates only on SQL DataFrames, but it is easier to construct an ML pipeline with it.
- Using the DataFrame-based API (i.e. ML) is recommended because it is the newer of the two and considered more stable.
The two libraries seem to be similar in terms of feature selection APIs.
- pyspark.ml implements the Pipeline workflow, which is built on DataFrames and can be used to “quickly assemble and configure practical machine learning pipelines”.
Classification algorithms:
- Both MLlib and ML cover the standard classification and regression algorithms.
- ML has MultilayerPerceptronClassifier, which seems to have no MLlib counterpart.
- pyspark.ml implements pyspark.ml.tuning.CrossValidator, while pyspark.mllib has no equivalent.
- However, PySpark ML’s evaluation functions are difficult to use and their docs are unclear, so I switched back to MLlib for evaluation.
When to use which?
- Efficiency: MLlib uses RDDs of LabeledPoint, while ML uses structured DataFrames. Thus, “if the data is already structured DataFrames, then ML confer some performance benefits over RDD’s, this is apparently drastic as the complexity of your operations increase.”
- Resources: “DataFrames consume far less memory when caching than RDDs.” Thus, for lower-level operations RDDs are great, but for high-level operations, viewing, and interoperating with other APIs, use DataFrames. (Both quotes are from Stack Overflow user Grr.)
- Overall, ML with DataFrames is the recommended combination. However, if you want to use grid search, it seems you still need to revert to LabeledPoint!
A Stack Overflow question: https://stackoverflow.com/questions/43240539/pyspark-mllib-versus-pyspark-ml-packages
MLlib doc: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
ML doc: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html
Comparison of data structures: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html