Spark SQL quick intro

This short post describes what Spark SQL is and how to use it. The basic concepts are not difficult to grasp.

What is Spark SQL?

  • Spark's interface for working with structured and semi-structured data.

How to use Spark SQL?

  • Use it inside a Spark application. In PySpark this translates to creating a SQLContext from the SparkContext. The code is:
    sqlContext = SQLContext(sc)
    

This is the initialization step.
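For a standalone script, a minimal initialization sketch might look like this (in the pyspark shell, sc already exists; the app name here is an arbitrary choice):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="spark-sql-intro")  # skip this line in the pyspark shell
    sqlContext = SQLContext(sc)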

What are the objects in Spark SQL?

  • DataFrame (previously SchemaRDD)

What is the difference between a DataFrame and a regular RDD?

  • A DataFrame is also an RDD (of Row objects), so you can still use map and reduce methods on it.
  • The elements of a DataFrame are Rows (see the sketch after this list).
    • A row is simply a fixed-length array, wrapped with some extra methods.
    • A row represents a record.
    • A row knows its schema.
    • A DataFrame carries schema information: the structure of the data fields.
    • A DataFrame provides new methods for running SQL queries on it.
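A minimal sketch of working with Rows, assuming the sqlContext created above (the field names are just examples):

    from pyspark.sql import Row

    row = Row(text="hello spark", retweetCount=42)
    print(row.text)             # access a field by name
    print(row["retweetCount"])  # item-style access by name also works
    print(row.asDict())         # the whole record as a plain Python dict

    # A DataFrame built from rows carries the schema.
    df = sqlContext.createDataFrame([row])
    df.printSchema()            # prints each field's name and type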

How to run a query with Spark SQL?

  • First initialize the SQL context.
  • Then create a DataFrame.
  • Then register the DataFrame as a temporary table and query it. Code in Python:
    • tweets = sqlContext.jsonFile(inputFile)  # Spark 1.x API; inputFile is a path to a JSON file
      tweets.registerTempTable("tweets")
      topTweets = sqlContext.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 10""")
      
  • Registering a table changed in Spark 2.0: the entry point is a SparkSession rather than a SQLContext, and createOrReplaceTempView replaces registerTempTable. See the sketch below.
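A minimal Spark 2.0+ sketch of the same query (inputFile again stands for a path to a JSON file of tweets):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tweets-sql").getOrCreate()

    tweets = spark.read.json(inputFile)       # read.json replaces jsonFile
    tweets.createOrReplaceTempView("tweets")  # replaces registerTempTable
    topTweets = spark.sql("""SELECT text, retweetCount FROM tweets ORDER BY retweetCount DESC LIMIT 10""")
    topTweets.show()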

Here is a good guide to working with Spark SQL on the 20 Newsgroups data: https://github.com/kundan-git/pyspark-text-classification/blob/master/Spark-Text-Classification.ipynb