This short post describes what Spark SQL is and how to use it. The basic operations and concepts are not difficult to grasp.
What is SparkSQL?
- Spark’s interface for working with structured and semi-structured data.
How to use SparkSQL?
- Use it inside a Spark application. In PySpark this means creating a SQLContext from the SparkContext. Code is:
sqlContext = SQLContext(sc)
This is the initialization step.
What are the objects in Spark SQL?
- DataFrame (previously SchemaRDD)
What is the difference between a DataFrame and a regular RDD?
- A DataFrame is backed by an RDD of Row objects, so you can still drop down to map and reduce methods (via its .rdd attribute).
- The elements of a DataFrame are Rows.
- A Row is essentially a fixed-length array, wrapped with some extra methods.
- A Row represents a record.
- A Row knows its schema.
- A DataFrame carries schema information, i.e., the structure of the data fields.
- A DataFrame provides new methods to run SQL queries on it.
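The idea of a Row as a fixed-length record that knows its own field names can be illustrated with a pure-Python analogy (this is not Spark code; the field names mirror the tweets example later in the post):

```python
# Pure-Python analogy for a Spark SQL Row: a fixed-length record that can be
# accessed both by position (like an array) and by field name (its "schema").
from collections import namedtuple

# Hypothetical two-field schema for illustration.
TweetRow = namedtuple("TweetRow", ["text", "retweetCount"])
row = TweetRow(text="hello spark", retweetCount=42)

# Access by position — it behaves like a fixed-length sequence...
print(row[1])              # 42
# ...or by field name — it knows its schema.
print(row.retweetCount)    # 42
print(TweetRow._fields)    # ('text', 'retweetCount')
```

Spark's actual pyspark.sql.Row type works similarly, supporting both row[i] and row.fieldName access.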
How to run query on SparkSQL?
- First, initialize the SQL context.
- Then create a DataFrame.
- Then register the DataFrame as a temporary table so it can be queried. Code in Python:
input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
topTweets = sqlContext.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10""")
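The SQL that Spark executes here is ordinary SQL. As a Spark-free sketch of the same ORDER BY ... LIMIT semantics, the query behaves the same way against an in-memory SQLite table (the sample rows are made up for illustration):

```python
# Spark-free illustration of the query's SQL semantics using SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, retweetCount INTEGER)")
conn.executemany("INSERT INTO tweets VALUES (?, ?)",
                 [("a", 3), ("b", 10), ("c", 7)])

top = conn.execute(
    "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10"
).fetchall()
# ORDER BY is ascending by default, so the lowest retweet count comes first.
print(top)  # [('a', 3), ('c', 7), ('b', 10)]
```

Note that ORDER BY sorts ascending by default, so a query for the most-retweeted tweets would normally add DESC.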
- Registering a table has new APIs in Spark 2.0: registerTempTable is deprecated in favor of createOrReplaceTempView, and SparkSession (with spark.read.json and spark.sql) supersedes SQLContext.
Here is a good guide to working with SparkSQL using 20newsgroup data: https://github.com/kundan-git/pyspark-text-classification/blob/master/Spark-Text-Classification.ipynb