Upgrading Spark Standalone Cluster in WSL2
Bonus: a workaround for the pyspark and Python version mismatch
Overview
Since 2019 I've been using PySpark 2.4.4, and my system's Python has moved from 3.6 to 3.8. This version drift finally caused me a headache: the following error is thrown whenever I try to load `SparkSession` from the `pyspark.sql` module in my code.
```
TypeError: an integer is required (got type bytes)
```
This error occurs because `pyspark==2.4.4` does not support Python 3.8. Most recommendations are either to downgrade to Python 3.7 to work around the issue, or to upgrade pyspark to a later version à la:

```shell
pip3 install --upgrade pyspark
```

However, I am running a Spark standalone cluster locally, i.e. installed the "from source" way, and the above command did nothing to my `pyspark` installation: the version stayed at 2.4.4. More steps need to be taken.
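The constraint behind the error can be sketched as a tiny check. This is purely illustrative (the function name is mine), but the `(3, 7)` ceiling reflects the fact that PySpark 2.4.x only supports Python up to 3.7:

```python
import sys

# Illustrative compatibility check: PySpark 2.4.x supports Python up to 3.7,
# which is why it breaks under Python 3.8 with the TypeError above.
def pyspark_24_supports(py_version):
    """Return True if a (major, minor) version tuple works with PySpark 2.4.x."""
    major, minor = py_version[:2]
    return (major, minor) <= (3, 7)

print(pyspark_24_supports((3, 7)))  # True
print(pyspark_24_supports((3, 8)))  # False -> upgrade Spark or downgrade Python
print(pyspark_24_supports(sys.version_info))  # check the running interpreter
```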
Furthermore, my WSL2 is a spaghetti maze of binaries and distributions. Some challenges:

- Some binaries were installed by `brew`, and some came by means of `sudo apt-get`
- I forgot how I installed `pyspark` and `apache-spark` in the first place

Unsurprisingly, `brew uninstall pyspark` and `apt-get remove apache-spark` did nothing. Starting from a clean slate is not as easy as uninstall-and-reinstall.
Steps
First, I need to locate the installation paths of both `pyspark` and `spark-shell`:

```
$ which pyspark && which spark-shell
/home/linuxbrew/.linuxbrew/bin/pyspark
/home/linuxbrew/.linuxbrew/bin/spark-shell
```
Those files are actually bash scripts used by brew to launch the application. Looking at the code of `/home/linuxbrew/.linuxbrew/bin/pyspark`, we can see the following `PYTHONPATH` definition:

```shell
# Add the PySpark classes to the Python path:
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.2-src.zip:$PYTHONPATH"
```
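Note that the py4j zip file name is hardcoded, and its version changes between Spark releases. One way to make this robust is to resolve the zip with a glob instead. The sketch below fakes a Spark layout in a temp directory purely so it is self-contained; in real life `SPARK_HOME` would be your actual install path:

```shell
# Sketch: resolve the bundled py4j zip with a glob instead of hardcoding
# its version. A fake Spark layout is created here for demonstration only.
SPARK_HOME="$(mktemp -d)"
mkdir -p "${SPARK_HOME}/python/lib"
touch "${SPARK_HOME}/python/lib/py4j-0.10.9.3-src.zip"

# Pick up whatever py4j version ships with the current Spark install
PY4J_ZIP="$(ls "${SPARK_HOME}"/python/lib/py4j-*-src.zip | head -n 1)"
export PYTHONPATH="${SPARK_HOME}/python/:${PY4J_ZIP}:${PYTHONPATH}"
echo "${PY4J_ZIP##*/}"
```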
Then, I need to locate which path `$SPARK_HOME` resolves to:

```
$ echo $SPARK_HOME
/opt/spark
```
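If `echo $SPARK_HOME` prints nothing on your machine, the variable is usually exported from a shell profile. A typical setup (illustrative values, matching the `/opt/spark` layout used here) would be a couple of lines in `~/.bashrc`:

```shell
# Illustrative ~/.bashrc lines: point SPARK_HOME at the standalone install
# and put its launcher scripts on PATH.
export SPARK_HOME=/opt/spark
export PATH="${SPARK_HOME}/bin:${PATH}"
echo "${SPARK_HOME}"
```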
I went to `/opt` and voilà! That's where it was installed. What has to be done next to "upgrade" the Spark version is to delete the folder containing all Spark-related binaries and replace it with the newer one (make sure the `hadoop` version tallies with your installation; I am still using 2.7 at the time of writing).
```shell
sudo rm -rf /opt/spark
cd /opt
sudo wget https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop2.7.tgz
sudo tar xvzf spark-3.2.1-bin-hadoop2.7.tgz
sudo mv spark-3.2.1-bin-hadoop2.7 spark
```

Note that the download must be the actual `.tgz` archive (e.g. from the Apache archive as above), not the `closer.lua` mirror-picker page, which returns HTML.
After the above steps are done, check your `pyspark` version:
```
$ pyspark --version
22/03/26 15:51:40 WARN Utils: Your hostname, PMIIDIDNL13144 resolves to a loopback address: 127.0.1.1; using 172.17.50.37 instead (on interface eth0)
22/03/26 15:51:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.1
      /_/

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user hgao on 2022-01-20T20:15:47Z
Revision 4f25b3f71238a00508a356591553f2dfa89f8290
Url https://github.com/apache/spark
Type --help for more information.
```
Conclusion
So that's how I managed to fix the `SparkSession` loading issue. It turned out that a version mismatch was the root cause.

Key takeaways:

- Be consistent with your package management. Stick to `brew` or `apt-get`, but don't mix both.
- Use `venv` to manage Python packages for different projects. This is advice I've heard many times, but at the moment it feels too late to implement in my current setup.
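For what it's worth, the `venv` approach is a few commands. This is a minimal sketch (the directory name is arbitrary, and `--without-pip` is only used here to keep the demo dependency-free; normally you would omit it and then `pip install` a pinned pyspark inside the environment):

```shell
# Minimal venv sketch: each project gets its own interpreter and packages,
# so a system Python upgrade no longer breaks a pinned pyspark install.
python3 -m venv --without-pip .venv-spark
. .venv-spark/bin/activate
python3 -c 'import sys; print(sys.prefix)'   # prints the venv path, not /usr
deactivate
```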