How to install and configure PySpark 2.3.0-cloudera4?

Walid Gara
Mar 26, 2021

If you would like to use PySpark in client mode with Spark 2.3.0-cloudera4, you might be tempted to install the open-source version from PyPI. Unfortunately, that version is not compatible with the Spark 2.3.0-cloudera4 installed on the cluster.

To get a compatible version, we need to build a PySpark package from Cloudera's Spark distribution. You are in the right place, because I have already done it.

Installation

Download and install pyspark-cdh. This package works well in production with Cloudera Enterprise 5.16.x.

$ wget https://github.com/garawalid/spark/releases/download/v.2.3.0-cloudera4/pyspark-cdh-2.3.0.tar.gz
$ pip install pyspark-cdh-2.3.0.tar.gz

Configuration

We need to set SPARK_HOME, PYSPARK_PYTHON, and PYSPARK_DRIVER_PYTHON. Here is an example:

export SPARK_HOME=/product/cloudera/parcels/SPARK2-2.3.0.cloudera4.cdh5.13/lib/spark2
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-4.4.0/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda-4.4.0/bin/python
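If you start PySpark from a Python script or a notebook rather than a login shell, the same variables can be set programmatically, as long as this happens before the first `import pyspark`. The paths below simply mirror the exports above; adjust them to your cluster's parcel layout. A minimal sketch:

```python
import os

# Same paths as the exports above; adjust to your cluster's parcel layout.
os.environ.setdefault(
    "SPARK_HOME",
    "/product/cloudera/parcels/SPARK2-2.3.0.cloudera4.cdh5.13/lib/spark2",
)
os.environ.setdefault(
    "PYSPARK_PYTHON", "/opt/cloudera/parcels/Anaconda-4.4.0/bin/python"
)
os.environ.setdefault(
    "PYSPARK_DRIVER_PYTHON", "/opt/cloudera/parcels/Anaconda-4.4.0/bin/python"
)

# These must be in place before the first `import pyspark`,
# because pyspark reads them at import/launch time.
print(os.environ["SPARK_HOME"])
```

Using `setdefault` keeps any values already exported in the shell, so the script works both with and without the exports above.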

Usage

Let’s create example.py that contains the following code.

from pyspark.sql import SparkSession

# Initialize Kerberos

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2, 3, 7], [4, 5, 8, 7]], ["col0", "col1", "col2", "col3"])
df.show()

Then we run example.py as follows:

$ python example.py
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
| 1| 2| 3| 7|
| 4| 5| 8| 7|
+----+----+----+----+
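The same script can also be launched through spark-submit, which picks up SPARK_HOME and the cluster configuration. The flags below match the client-mode setup described at the start of this post; this is a sketch that assumes your cluster runs YARN:

```shell
# Run example.py in client mode on a YARN cluster.
# spark-submit ships with the Spark installation under $SPARK_HOME/bin.
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  example.py
```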
