Databricks read from Redshift: first selected column returns wrong value if quotes are used  #467

@HansHerrlich

Description

Hi folks,
when issuing a query against Redshift from Databricks, there seems to be an issue with how quotes inside an SQL string are handled.

The following setup on Redshift:

CREATE TABLE IF NOT EXISTS tmp_test
(
	my_new_varchar VARCHAR(128)   ENCODE lzo
	,some_text VARCHAR(256)   ENCODE lzo
	,some_decimal NUMERIC(12,6)   ENCODE az64
	,int_col INTEGER   ENCODE az64
	,real_col REAL   ENCODE RAW
	,double_col DOUBLE PRECISION   ENCODE RAW
	,smal_int_col SMALLINT   ENCODE az64
	,t_with_timezone TIMESTAMP WITH TIME ZONE   ENCODE az64
	,t_without_timezone TIMESTAMP WITHOUT TIME ZONE   ENCODE az64
	,dwh_created_at TIMESTAMP WITHOUT TIME ZONE  DEFAULT now() ENCODE az64
)
DISTSTYLE AUTO;


INSERT INTO tmp_test
(my_new_varchar, some_text, some_decimal, int_col, real_col, double_col, smal_int_col, t_with_timezone, t_without_timezone, dwh_created_at)
VALUES('first_val', 'second_val', 123.123, 123456, 12.12, 0.233234, 123, null, null, null);

commit;

On Databricks, I then run the following command:

val query = """
 Select max('my_new_varchar') as column_name, max(coalesce(length(my_new_varchar),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test  union 
 Select max('some_text') as column_name, max(coalesce(length(some_text),0)) as length_or_numeric_precision, max(0) as numeric_scale FROM tmp_test  union 
 Select max('some_decimal') as column_name ,max(length(cast(some_decimal as varchar))) as length_or_numeric_precision, max(position('.' in some_decimal))  as numeric_scale FROM tmp_test  union 
 Select max('int_col') as column_name ,max(length(cast(int_col as varchar))) as length_or_numeric_precision, max(position('.' in int_col))  as numeric_scale FROM tmp_test  union 
 Select max('real_col') as column_name ,max(length(cast(real_col as varchar))) as length_or_numeric_precision, max(position('.' in real_col))  as numeric_scale FROM tmp_test  union 
 Select max('double_col') as column_name ,max(length(cast(double_col as varchar))) as length_or_numeric_precision, max(position('.' in double_col))  as numeric_scale FROM tmp_test  union 
 Select max('smal_int_col') as column_name ,max(length(cast(smal_int_col as varchar))) as length_or_numeric_precision, max(position('.' in smal_int_col))  as numeric_scale FROM tmp_test
""".stripMargin

val maxRedDataDF = spark.read
       .format("com.databricks.spark.redshift")
       .option("url", redUrl)
       .option("tempdir", tempdir)
       .option("forward_spark_s3_credentials", "true")
       .option("autoenablessl", "false")
       .option("query", query)
       .load()

display(maxRedDataDF)

As you can see, for the column my_new_varchar the query does not return the column_name literal as it does for the rest of the columns; instead it returns the max value inside that column, e.g. first_val.

[screenshot: the displayed result, where the my_new_varchar row shows first_val instead of the column name]
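
For what it's worth, one plausible place for the quotes to get lost is the UNLOAD statement the connector builds around the user query when transferring data via S3: the whole query ends up inside a single-quoted string, so every single-quoted literal inside it has to be escaped. The sketch below uses hypothetical helper names, not the connector's actual code; it only illustrates that wrapping and why a mis-escaped literal in the first SELECT is exactly where the statement would break.

// Hypothetical sketch, not the connector's actual code: Redshift UNLOAD
// takes the query as a single-quoted string, so quotes inside the query
// must be escaped before wrapping.
object UnloadSketch {
  // Escape backslashes and single quotes for use inside UNLOAD('...');
  // an unescaped literal like 'my_new_varchar' would otherwise terminate
  // the string early.
  def escapeQuery(q: String): String =
    q.replace("\\", "\\\\").replace("'", "\\'")

  def buildUnloadStmt(userQuery: String, s3Path: String): String =
    s"UNLOAD ('${escapeQuery(userQuery)}') TO '$s3Path' ESCAPE"

  def main(args: Array[String]): Unit = {
    val q = "Select max('my_new_varchar') as column_name FROM tmp_test"
    // If this escaping were skipped (or applied twice), the first quoted
    // literal in the query is the first place the statement would corrupt.
    println(buildUnloadStmt(q, "s3://my-bucket/tmp/"))
  }
}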

A quick-and-dirty fix is to modify the query by prepending the following line:

Select max('''tester''') as column_name, max(0) as length_or_numeric_precision, max(0) as numeric_scale  union

Now the result for the column my_new_varchar is returned as expected. However, the column_name tester
is now returned as tester' (note the single quotation mark at the end of the string). If I use single
quotation marks around tester instead of triple ones, Redshift reports a column-not-found error; with double quotation marks, Databricks throws an SQL exception.
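
Below is a minimal sketch of how the workaround can be applied end-to-end, reusing the query and connection values from above and filtering the sacrificial row back out on the Spark side. Note the filter matches tester' with the stray trailing quote described above.

import org.apache.spark.sql.functions.col

// Prepend the sacrificial row so the quoted literal in the first real
// SELECT is no longer the first one affected by the bug.
val patchedQuery = s"""
 Select max('''tester''') as column_name, max(0) as length_or_numeric_precision, max(0) as numeric_scale  union
 $query
"""

val cleanedDF = spark.read
       .format("com.databricks.spark.redshift")
       .option("url", redUrl)
       .option("tempdir", tempdir)
       .option("forward_spark_s3_credentials", "true")
       .option("autoenablessl", "false")
       .option("query", patchedQuery)
       .load()
       // drop the dummy row; the bug appends a stray quote to "tester"
       .filter(col("column_name") =!= "tester'")

display(cleanedDF)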

I hope the explanation is detailed enough; if not, please let me know.
Regards,
Hans
