1. When running a job on a Dataproc cluster, it reported:

java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: Failed to find data source: BIGQUERY.
The reason is that I hadn't added the jar file for the BigQuery connector. After adding it to the properties in the cluster-creation template:

properties:
  spark:spark.jars: gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.11-0.18.1.jar

the job starts to read data from BigQuery tables (a minimal PySpark read sketch follows after this item).
Remember not to use gs://spark-lib/bigquery/spark-bigquery-latest.jar,
because it will hang your job when you read BigQuery tables. It seems even Google makes significant mistakes in their cloud platform :p
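For reference, here is a minimal sketch of reading a BigQuery table from PySpark once the connector jar is on the cluster; the table name is a public sample used purely for illustration:

```python
from pyspark.sql import SparkSession

# On Dataproc the connector is already on the classpath via the
# spark:spark.jars property set in the template above.
spark = SparkSession.builder.appName("bq-read-demo").getOrCreate()

# The connector registers the "bigquery" data source.
df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data.samples.shakespeare")
      .load())

df.printSchema()
df.show(5)
```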
2. If a PySpark job needs some additional packages on the Dataproc cluster, what should we do?
Add a few more items to the template so that it pip-installs the packages:
clusterName: robin
config:
  gceClusterConfig:
    metadata:
      enable-cloud-sql-proxy-on-workers: 'false'
      use-cloud-sql-private-ip: 'false'
      PIP_PACKAGES: 'google-cloud-storage google-api-python-client google-auth'
  initializationActions:
  - executableFile: gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh
    executionTimeout: 600s
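After the pip-install init action runs, job code can import those packages directly. A small sketch using google-cloud-storage, where the bucket name is a hypothetical placeholder:

```python
from google.cloud import storage

# On Dataproc, credentials are picked up from the cluster's
# service account automatically.
client = storage.Client()

# List the first few objects in a bucket (placeholder name).
for blob in client.list_blobs("my-example-bucket", max_results=5):
    print(blob.name, blob.size)
```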
3. To see how a Hive table was created:
show create table <table>;
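The same statement also works from a PySpark session with Hive support enabled; a minimal sketch with a placeholder table name:

```python
from pyspark.sql import SparkSession

# enableHiveSupport makes the Hive metastore visible to Spark SQL.
spark = (SparkSession.builder
         .appName("show-create-table")
         .enableHiveSupport()
         .getOrCreate())

# Prints the full CREATE TABLE statement for an existing table.
spark.sql("SHOW CREATE TABLE my_db.my_table").show(truncate=False)
```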