Upgrading Spark Standalone Cluster in WSL2

Bonus: a workaround for the pyspark and Python version mismatch

Table of contents

  1. Overview
  2. Steps
  3. Conclusion

Overview

Since 2019 I've been using PySpark 2.4.4, and my system's Python has been drifting from 3.6 towards 3.8. That version drift finally gave me a bit of a headache by throwing the following error whenever I tried to import SparkSession from the pyspark.sql module in my code.

TypeError: an integer is required (got type bytes)
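For the record, the failure can be reproduced with nothing more than the import itself (assuming python3 resolves to the 3.8 interpreter):

$ python3 -c "from pyspark.sql import SparkSession"   # raises the TypeError above under pyspark 2.4.4 + Python 3.8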

This error occurs because pyspark 2.4.4 does not support Python 3.8. Most recommendations are either to downgrade to Python 3.7 to work around the issue, or to upgrade pyspark to a later version, along the lines of:

pip3 install --upgrade pyspark

I am running a Spark standalone cluster locally, installed the "from source" way, and the above command did nothing to my pyspark installation: the version stayed at 2.4.4. More steps were needed.
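A quick way to see why the pip upgrade made no difference is to compare what pip thinks is installed with what Python actually imports (a hedged sketch; paths will differ per machine):

$ pip3 show pyspark
$ python3 -c "import pyspark; print(pyspark.__version__, pyspark.__file__)"
$ pyspark --version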

Furthermore, my WSL2 setup is a spaghetti maze of binaries and distributions. Some challenges:

  1. Some binaries were installed by brew and some came via sudo apt-get
  2. I forgot how I installed pyspark and apache-spark in the first place

Obviously, brew uninstall pyspark and apt-get remove apache-spark did nothing. Starting from a clean slate is not as easy as uninstall-and-reinstall.
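If you are in a similarly tangled state, a few checks will at least reveal which package manager, if any, claims ownership of the installation (a rough sketch, not exhaustive):

$ brew list | grep -i spark
$ apt list --installed 2>/dev/null | grep -i spark
$ pip3 list 2>/dev/null | grep -i pyspark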

Steps

First, I need to locate the installation paths of both pyspark and spark-shell:

$ which pyspark && which spark-shell
/home/linuxbrew/.linuxbrew/bin/pyspark
/home/linuxbrew/.linuxbrew/bin/spark-shell
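These paths point to thin wrappers rather than real binaries, which is easy to confirm before opening them:

$ file $(which pyspark)
$ head -n 20 /home/linuxbrew/.linuxbrew/bin/pyspark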

Those files are actually bash scripts that brew uses to launch the application. Looking at the code of /home/linuxbrew/.linuxbrew/bin/pyspark, we can see the following PYTHONPATH definition:

# Add the PySpark classes to the Python path:
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.2-src.zip:$PYTHONPATH"
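Note that the py4j zip name is hardcoded in that wrapper, so after any upgrade it is worth confirming it matches what actually ships under $SPARK_HOME (a quick, hedged check):

$ ls "$SPARK_HOME"/python/lib/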

Then, I need to locate which path $SPARK_HOME resolves to:

$ echo $SPARK_HOME
/opt/spark
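If you have forgotten where SPARK_HOME gets defined in the first place (I had), grepping the usual shell startup files is a reasonable first guess - your setup may keep it elsewhere:

$ grep -n "SPARK_HOME" ~/.bashrc ~/.profile ~/.zshrc 2>/dev/null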

I went to /opt and voila! That's where it was installed. What has to be done next to "upgrade" the Spark version is to delete the folder containing all the Spark-related binaries and replace it with the newer one (make sure the Hadoop version tallies with your installation - I was still on 2.7 at the time of writing).

cd /opt                                   # SPARK_HOME points here
sudo rm -rf /opt/spark                    # drop the old 2.4.4 installation
sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
sudo tar xvzf spark-3.2.1-bin-hadoop2.7.tgz
sudo mv spark-3.2.1-bin-hadoop2.7 spark   # keep the same /opt/spark path so $SPARK_HOME still resolves
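A quick sanity check that the new build landed where $SPARK_HOME expects it (paths assume the layout above):

$ ls /opt/spark/bin
$ /opt/spark/bin/spark-submit --version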

Once the above steps are done, check your pyspark version:

$ pyspark --version
22/03/26 15:51:40 WARN Utils: Your hostname, PMIIDIDNL13144 resolves to a loopback address: 127.0.1.1; using 172.17.50.37 instead (on interface eth0)
22/03/26 15:51:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user hgao on 2022-01-20T20:15:47Z
Revision 4f25b3f71238a00508a356591553f2dfa89f8290
Url https://github.com/apache/spark
Type --help for more information.
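As a last check, the import that started all of this should now work. A minimal smoke test, assuming PYTHONPATH still picks up $SPARK_HOME/python as in the wrapper script above:

$ python3 -c "from pyspark.sql import SparkSession; s = SparkSession.builder.master('local[1]').getOrCreate(); print(s.version); s.stop()"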

Conclusion

So that's how I managed to fix the SparkSession import issue. The root cause turned out to be a plain version mismatch between pyspark and Python.

Key takeaways:

  1. Always be consistent with your package management. Stick to brew or apt-get, but don't mix both.
  2. Try to use venv to manage Python packages per project. Admittedly, this is advice I've heard many times, but at this point it feels too late to retrofit into my current setup; a minimal sketch is included below anyway.
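A minimal sketch of that venv-per-project approach, with the pyspark version pinned explicitly (3.2.1 here is just an example):

python3 -m venv .venv
source .venv/bin/activate
pip install pyspark==3.2.1   # pin whatever version the project actually needs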