1. When running a job on a Dataproc cluster, it reported:

java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY.
The reason is that I hadn't added the jar file for the BigQuery connector. After adding it to the properties in the cluster-creation template:

properties:
  spark:spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.18.1.jar

the job starts to read data from BigQuery tables (a minimal PySpark read sketch follows after this item).
Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar,
because it will hang your job when you read BigQuery tables. It seems even Google makes significant mistakes in their cloud platform :p
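For reference, here is a minimal sketch of reading a BigQuery table from PySpark once the connector jar is on the cluster; the table name is a public sample used purely for illustration:

```python
from pyspark.sql import SparkSession

# On Dataproc the connector is already on the classpath via the
# spark:spark.jars property set in the template above.
spark = SparkSession.builder.appName("bq-read-demo").getOrCreate()

# The connector registers the "bigquery" data source.
df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data.samples.shakespeare")
      .load())

df.printSchema()
df.show(5)
```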
2. If a PySpark job needs some additional packages on the Dataproc cluster, what should we do?
Add a few more items to the template so that it pip-installs the packages:
clusterName: robin
config:
  gceClusterConfig:
    metadata:
      enable-cloud-sql-proxy-on-workers: 'false'
      use-cloud-sql-private-ip: 'false'
      PIP_PACKAGES: 'google-cloud-storage google-api-python-client google-auth'
  initializationActions:
  - executableFile: gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh
    executionTimeout: 600s
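After the pip-install init action runs, job code can import those packages directly. A small sketch using google-cloud-storage, where the bucket name is a hypothetical placeholder:

```python
from google.cloud import storage

# On Dataproc, credentials are picked up from the cluster's
# service account automatically.
client = storage.Client()

# List the first few objects in a bucket (placeholder name).
for blob in client.list_blobs("my-example-bucket", max_results=5):
    print(blob.name, blob.size)
```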
3. To see how a Hive table was created:
show create table <table>;
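The same statement also works from a PySpark session with Hive support enabled; a minimal sketch with a placeholder table name:

```python
from pyspark.sql import SparkSession

# enableHiveSupport makes the Hive metastore visible to Spark SQL.
spark = (SparkSession.builder
         .appName("show-create-table")
         .enableHiveSupport()
         .getOrCreate())

# Prints the full CREATE TABLE statement for an existing table.
spark.sql("SHOW CREATE TABLE my_db.my_table").show(truncate=False)
```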