Create a cluster using the Google Cloud Dataproc console or the command line:
gcloud dataproc \
--region europe-west1 \
clusters create {cluster name} \
--subnet default \
--zone "" \
--master-machine-type n1-standard-8 \
--master-boot-disk-size 500 \
--num-workers 5 \
--worker-machine-type n1-standard-16 \
--worker-boot-disk-size 500 \
--image-version 1.2 \
--project {project name}
Use port forwarding to login to the instance:
ssh -i <key pair location> -L 8000:localhost:8888 <username>@<IPv4 Public IP>ssh -i ~/.ssh/google_compute_engine -L 8000:localhost:8888 imran@31.210.59.152Install pip and other packages required on Debian:
yes | sudo apt install python-pip build-essential python-devsudo python -m pip install jupyterCreate new environment variables:
export PYSPARK_DRIVER_PYTHON=jupyterexport PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=8888'Start a Jupyter session:
pysparkIn a browser:
localhost:8000Enter the token shown in the terminal.