PySpark in Google Colab
Ever want to test out Apache Spark without spinning up a Linux box? Try out Colab!
The entire Colab notebook runs in a cloud VM. Let's investigate the VM. You will see that the current Colab notebook is running on top of Ubuntu 18.04.6 LTS (at the time of this writing).
!cat /etc/*release
Spark runs on the JVM, so install a Java 8 JDK first:
!sudo apt-get -y install openjdk-8-jdk-headless
To install Apache Spark on Ubuntu, go to the Apache Spark download site, find the "Download Apache Spark" section, and click the download link in item 3. This takes you to a page of mirror URLs; copy the link from one of the mirror sites.
!wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
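If you like, you can confirm the archive was extracted (the directory name matches the tarball above):
!ls /content/spark-3.3.0-bin-hadoop3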
!pip install -q findspark
The os module in Python provides functions for interacting with the operating system. It is part of Python's standard library and offers a portable way of using operating-system-dependent functionality.
os.environ is a mapping object that represents the user's environment variables: each variable name is a key and its value is the corresponding value. It behaves like a Python dictionary, so common dictionary operations such as get and set work on it. You can also modify os.environ, but any change is effective only for the current process; it does not change the value permanently.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"
PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.
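For illustration only, a rough sketch of what that runtime approach looks like by hand is shown below; the exact py4j zip name inside Spark's python/lib directory varies by release, so the glob pattern here is an assumption:
import os, sys, glob

# Roughly what findspark automates: put Spark's Python bindings on sys.path
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j ships inside Spark; its version number in the filename varies by release
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])
In practice you don't need to do this yourself; findspark handles it, as shown next.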
To initialize PySpark, just call
import findspark
findspark.init()
To verify the automatically detected location, call
findspark.find()
Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark. You can give a name to the session using appName() and add some configurations with config() if you wish.
from pyspark.sql import SparkSession
spark = (SparkSession
         .builder
         .master("local")
         .appName("my_colab_spark_app")
         .config('spark.ui.port', '4050')
         .getOrCreate())
Finally, print the SparkSession variable.
spark
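As a quick smoke test (the tiny DataFrame below is just an illustrative example), you can create a DataFrame and show it to confirm the session works:
# Build a small DataFrame and display it to confirm Spark is working
df = spark.createDataFrame([(1, "spark"), (2, "colab")], ["id", "name"])
df.show()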