How to install and configure Pyspark 2.3.0-cloudera4?
If you want to use PySpark in client mode with Spark 2.3.0-cloudera4, you would probably install the open-source version from PyPI. Unfortunately, that version is not compatible with the Spark 2.3.0-cloudera4 installed on the cluster.
To install the right version, we need to build a matching Spark package. Well, you are in the right place, because I already did it.
Installation
Download and install pyspark-cdh. This package works well in production with Cloudera Enterprise 5.16.x.
$ wget https://github.com/garawalid/spark/releases/download/v.2.3.0-cloudera4/pyspark-cdh-2.3.0.tar.gz
$ pip install pyspark-cdh-2.3.0.tar.gz
Configuration
We need to set SPARK_HOME, PYSPARK_PYTHON, and PYSPARK_DRIVER_PYTHON. Here is an example:
export SPARK_HOME=/product/cloudera/parcels/SPARK2-2.3.0.cloudera4.cdh5.13/lib/spark2
export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda-4.4.0/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda-4.4.0/bin/python
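If you prefer to keep this configuration inside Python instead of your shell profile, the same variables can be set with os.environ before pyspark is imported. This is a minimal sketch: the parcel paths below are the ones from the export example and will almost certainly differ on your cluster.

```python
import os

# Same variables as the shell exports above; adjust the parcel paths
# to match your own cluster layout.
os.environ["SPARK_HOME"] = (
    "/product/cloudera/parcels/SPARK2-2.3.0.cloudera4.cdh5.13/lib/spark2"
)
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/Anaconda-4.4.0/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/cloudera/parcels/Anaconda-4.4.0/bin/python"

# These assignments must run before `import pyspark`, because pyspark
# reads the variables when the session starts.
```

Note that PYSPARK_PYTHON should point to the same Python version on the driver and on every worker node; otherwise the executors fail with a Python version mismatch.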
Usage
Let’s create example.py with the following code:
from pyspark.sql import SparkSession
# Initialize Kerberos
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2, 3, 7], [4, 5, 8, 7]], ["col0", "col1", "col2", "col3"])
df.show()
Then we run example.py as follows:
$ python example.py
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
| 1| 2| 3| 7|
| 4| 5| 8| 7|
+----+----+----+----+