Hi there
I'm using the package io.github.spark-redshift-community:spark-redshift_2.12:4.2.0 as a dependency in an AWS EMR job that tries to save a DataFrame to Redshift.
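For context, the dependency is pulled in roughly like this (a sketch; my_job.py and the exact EMR step configuration stand in for my actual setup):

spark-submit \
    --packages io.github.spark-redshift-community:spark-redshift_2.12:4.2.0 \
    my_job.py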
Sadly the attempt fails with the following stack trace:
https://gist.github.com/marek-babic/0110160bdd0ba11533b6f425559d2f1c
I know the DataFrame is in a healthy state: show() and printSchema() output what I expect, and the schema matches the Redshift table's.
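For reference, the sanity checks were along these lines:

df.printSchema()  # column names and types line up with the Redshift table
df.show(5)        # sample rows look as expected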
The write looks like this (the uppercase variables are set appropriately):
df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", "jdbc:redshift://" + HOST_URL + ":5439/" + DATABASE_NAME) \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .option("dbtable", TABLE_NAME) \
    .option("aws_region", REGION) \
    .option("aws_iam_role", IAM_ROLE) \
    .option("tempdir", TMP_PATH) \
    .option("tempformat", "CSV") \
    .mode("overwrite") \
    .save()
To sanity-check the S3 side, I tried saving the DataFrame to S3 directly by running:
df.write.format("csv").save(TMP_PATH + "/test1")
which worked, so the EMR job's S3 write permissions look correct.
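In case it helps reproduce, here is a minimal self-contained version of the failing write; every connection value below is a placeholder, not my real one:

from pyspark.sql import SparkSession

# Placeholder connection values (my real ones are set the same way).
HOST_URL = "my-cluster.abc123.eu-west-1.redshift.amazonaws.com"
DATABASE_NAME = "dev"
USERNAME = "awsuser"
PASSWORD = "placeholder"
TABLE_NAME = "public.my_table"
REGION = "eu-west-1"
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"
TMP_PATH = "s3://my-bucket/tmp/redshift"

spark = SparkSession.builder.appName("redshift-write-repro").getOrCreate()

# Tiny two-column DataFrame standing in for the real one.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", "jdbc:redshift://" + HOST_URL + ":5439/" + DATABASE_NAME) \
    .option("user", USERNAME) \
    .option("password", PASSWORD) \
    .option("dbtable", TABLE_NAME) \
    .option("aws_region", REGION) \
    .option("aws_iam_role", IAM_ROLE) \
    .option("tempdir", TMP_PATH) \
    .option("tempformat", "CSV") \
    .mode("overwrite") \
    .save()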
Any ideas why this could be happening?
Thanks
Marek