PySpark in Google Colab
Ever want to test out Apache Spark without spinning up a Linux box? Try out Colab!
The entire Colab notebook runs in a cloud VM. Let's investigate the VM. You will see that the current Colab notebook is running on top of Ubuntu 18.04.6 LTS (at the time of this writing).
!cat /etc/*release
Spark runs on the JVM, so install a Java 8 JDK first:
!sudo apt-get -y install openjdk-8-jdk-headless
To install Apache Spark on Ubuntu, go to the Apache Spark download site, find the "Download Apache Spark" section, and click the download link in item 3. This takes you to a page of mirror URLs; copy the link from one of the mirror sites.
!wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz
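If you like, you can confirm the archive was extracted (the directory name matches the tarball above):
!ls /content/spark-3.3.0-bin-hadoop3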
!pip install -q findspark
The os module in Python provides functions for interacting with the operating system. It is part of Python's standard library and offers a portable way of using operating-system-dependent functionality.
os.environ is a mapping object that represents the user's environment variables: each variable name is a key and its value is the corresponding value. It behaves like a Python dictionary, so common dictionary operations such as get and set work on it. You can also modify os.environ, but any change is effective only for the current process; it does not change the value permanently.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"
PySpark isn't on sys.path by default, but that doesn't mean it can't be used as a regular library. You can address this by either symlinking pyspark into your site-packages, or adding pyspark to sys.path at runtime. findspark does the latter.
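For illustration only, a rough sketch of what that runtime approach looks like by hand is shown below; the exact py4j zip name inside Spark's python/lib directory varies by release, so the glob pattern here is an assumption:
import os, sys, glob

# Roughly what findspark automates: put Spark's Python bindings on sys.path
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
# py4j ships inside Spark; its version number in the filename varies by release
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))[0])
In practice you don't need to do this yourself; findspark handles it, as shown next.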
To initialize PySpark, just call
import findspark
findspark.init()
To verify the automatically detected location, call
findspark.find()
Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark. You can give a name to the session using appName() and add some configurations with config() if you wish.
from pyspark.sql import SparkSession
spark = (SparkSession
         .builder
         .master("local")
         .appName("my_colab_spark_app")
         .config('spark.ui.port', '4050')
         .getOrCreate())
Finally, print the SparkSession variable.
spark
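As a quick smoke test (the tiny DataFrame below is just an illustrative example), you can create a DataFrame and show it to confirm the session works:
# Build a small DataFrame and display it to confirm Spark is working
df = spark.createDataFrame([(1, "spark"), (2, "colab")], ["id", "name"])
df.show()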