Running PySpark in a Jupyter Notebook on Google Cloud

Create a cluster using the Google Cloud Dataproc console or the command line:

gcloud dataproc clusters create {cluster name} \
    --region europe-west1 \
    --subnet default \
    --zone "" \
    --master-machine-type n1-standard-8 \
    --master-boot-disk-size 500 \
    --num-workers 5 \
    --worker-machine-type n1-standard-16 \
    --worker-boot-disk-size 500 \
    --image-version 1.2 \
    --project {project name}

Use SSH port forwarding to log in to the master instance:
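A sketch of the forwarding command, assuming the master node is named `{cluster name}-m`, the zone is `europe-west1-b`, and Jupyter will listen on port 8888 (adjust all three to your setup):

```shell
# SSH to the cluster's master node, forwarding local port 8888
# to port 8888 on the instance. Everything after "--" is passed
# straight to ssh.
gcloud compute ssh {cluster name}-m \
    --zone europe-west1-b \
    -- -L 8888:localhost:8888
```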

Install pip and other packages required on Debian:
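For example, on the Debian image the following installs pip and Jupyter (package names are for the Python 2 stack shipped with image version 1.2; newer images use `python3-pip`):

```shell
# Refresh the package index, then install pip and Jupyter.
sudo apt-get update
sudo apt-get install -y python-pip
sudo pip install jupyter
```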

Create new environment variables:
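PySpark honours two environment variables that redirect its driver into Jupyter. A minimal sketch, assuming port 8888:

```shell
# Launch the PySpark driver through Jupyter Notebook instead of
# the plain interactive shell.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888"
```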

Start a Jupyter session:
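With the environment variables from the previous step in place, starting the PySpark shell brings up a Jupyter server with a `SparkContext` already available in new notebooks:

```shell
# Starts Jupyter Notebook on port 8888; the token to log in is
# printed in this terminal.
pyspark
```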

In a browser on your local machine, open http://localhost:8888 (the port forwarded earlier).

Enter the token shown in the terminal when prompted.