Work in progress: DbType2 #1632
base: master
Conversation
…ableColumnMetadata` in `generateTypeInformation()`. This contains potential pre- and post-processing logic for any type
…recursive preprocessing
/** the name of the class of the DuckDB JDBC driver */
override val driverClassName: String = "org.duckdb.DuckDBDriver"

override fun generateTypeInformation(tableColumnMetadata: TableColumnMetadata): AnyTypeInformation =
Please consider: ColumnValuesConverter, JdbcColumnConverter, ...
I think it's important to convey that we, most importantly, want to convert values from JDBC classes to something Kotlin-friendly - providing KType / ColumnSchema is just a side effect, right? ValuePreprocessor and ColumnPostprocessor are good in this regard
Actually, no, the main use case of this function is building the ColumnSchema, aka the type information for DataFrame, given some JDBC TableColumnMetadata, so we know what kind of DataFrame will pop out when some SQL is read.
Actually, most values can directly go from JDBC into a ValueColumn, no conversion needed. However, in the cases where conversion needs to be done (like Date -> Kotlin Date, Struct -> DataRow), this conversion can be provided in the TypeInformation class that's returned, either in the form of a value-preprocessor, or, if you need all values to be gathered first, in the form of a column-postprocessor. Still, they are strictly tied to the TableColumnMetadata.
Does that make sense? :) I've already tried several names for this TypeInformation class (DbTypeInformation, DbColumnType...), but none really fit or they become too long. I'd like to convey that it does not have to convert. If you do want to convert, you can create a TypeInformation instance with typeInformationWithPreprocessingFor(...).
We could also split up providing the type information and actually converting the values. However, I liked the idea of providing the type information and converting in one place, because it forces you to write the JDBC type, the conversion, and the resulting schema at once, keeping the logic together.
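To make this concrete, here is a minimal, self-contained sketch of the shape I'm describing (`TypeInformation` and `typeInformationWithPreprocessingFor` are the names used in this PR; the `TableColumnMetadata` below is a simplified stand-in, not the actual dataframe-jdbc class):

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Simplified stand-in for illustration; the real TableColumnMetadata lives in dataframe-jdbc.
data class TableColumnMetadata(val name: String, val sqlTypeName: String, val javaClassName: String)

// A TypeInformation bundles the KType of the values coming from JDBC with the target
// KType and an optional value-preprocessor that converts raw JDBC values.
class TypeInformation<J, D>(
    val jdbcType: KType,
    val targetType: KType,
    val preprocess: ((J) -> D)? = null,
)

// Helper in the spirit of typeInformationWithPreprocessingFor(...): the JDBC type,
// the target type, and the conversion are stated in one place.
inline fun <reified J, reified D> typeInformationWithPreprocessingFor(
    noinline preprocess: (J) -> D,
): TypeInformation<J, D> = TypeInformation(typeOf<J>(), typeOf<D>(), preprocess)

// Example: map SQL DATE columns to java.time.LocalDate; leave everything else untouched.
fun generateTypeInformation(meta: TableColumnMetadata): TypeInformation<*, *> =
    when (meta.sqlTypeName) {
        "DATE" -> typeInformationWithPreprocessingFor<java.sql.Date, java.time.LocalDate> { it.toLocalDate() }
        else -> TypeInformation<Any?, Any?>(typeOf<Any?>(), typeOf<Any?>())
    }
```

The conversion stays optional: plain columns just state their KType and skip the preprocessor.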
If you'd ignore what I made thus far, how would you make an API like this where you can define how JDBC types are mapped (and potentially converted) to column types?
OK, if I'm thinking from scratch: it looks like you're referring to ResultSet (a kind of Row) -> value extraction, which can apparently also be database-specific?
So people might need to specify what class their JDBC driver will return to our "generic value extractor" of some sort. We're not converting anything at this step, just calling rs.getObject, and we need to know what to expect.
If my description is accurate, can we call it JdbcDriverSpec? Can it be a separate entity from "converters"?
ExpectedTypes?
Yes, it can definitely be a separate step.
However, the converters will still need access to the original TableColumnMetadata to function properly, and they might do some duplicate logic:
So, we have a couple cases:
- The first case is the easiest: we just want to read a column from JDBC, no conversion necessary. The Db implementation can simply give a KType representing the type of the values coming from that column, and a `ValueColumn<ThatType>` will be created.
- The second case is a bit more complicated: we want to read a column of values from JDBC, but they need to be converted, like Dates. In this case, the implementation can again give a KType representing the type of the values coming from that column. A preprocessor can then take those values and KType and convert them (and return a new KType) before a `DataColumn<NewType>` is created. The preprocessor might need the original `TableColumnMetadata` though, as it may hold information that cannot be represented by just the KType.
- The next case is comparable to the previous one, but now we want to post-process the column, like a column of arrays where we want to infer their common type. So the implementation will give a KType of `java.sql.Array`, it may be preprocessed, a `ValueColumn` is created, and then the postprocessor can do its magic, converting the `ValueColumn` with values to any sort of column it likes. It might need the original `TableColumnMetadata` and `KType`, and it returns the new column and `ColumnSchema`.
- The final case is where a structured column needs to be created: we read a `STRUCT` column from JDBC; the first step returns the KType `java.sql.Struct`. ~~The preprocessor can convert each value to a `DataRow<*>` based on the KType and `TableColumnMetadata`, so a `DataColumn<DataRow<*>>`, aka a `ColumnGroup`, can be created. Though, we still need to report the new `ColumnSchema.Group` somewhere so we can do `TableColumnMetadata -> ColumnSchema` without reading actual data. Maybe in the postprocessor?~~ The preprocessor should do nothing, making a `ValueColumn<Struct>`; the post-processor should then turn it into a `ColumnGroup`, returning the resulting `ColumnSchema.Group` as well.
In the final case, we can see the TableColumnMetadata needing to be parsed up to 3 separate times: in the first step to generate a KType we're not even using (as it's just typeOf<AnyRow>()), in the second step to convert each value to a correct DataRow, and in the final step to produce the right ColumnSchema.Group. This can be quite tedious, as TableColumnMetadata can contain recursive types as well... It's also hard to track the logic of a single type across multiple separate functions...
But maybe we can find a hybrid between these two approaches, allowing the logic to be grouped but the functions to be separate?
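For illustration only, a minimal sketch of one possible hybrid (all names here are hypothetical, not this PR's API): a per-column handler that keeps the expected JDBC type, an optional value-preprocessor, and an optional column-postprocessor together, so the logic for one SQL type stays in one place while the reader still calls each hook as a separate step:

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical hybrid: one handler per column type, grouping the three hooks discussed above.
interface ColumnHandler {
    /** Case 1: the KType of the raw values JDBC will hand us. */
    val expectedJdbcType: KType

    /** Case 2: optional per-value conversion (e.g. java.sql.Date -> java.time.LocalDate). */
    fun preprocessValue(value: Any?): Any? = value

    /** Cases 3 and 4: optional whole-column conversion (e.g. ValueColumn<Struct> -> ColumnGroup). */
    fun postprocessColumn(values: List<Any?>): List<Any?> = values
}

// Example handler for a DATE column: per-value preprocessing, no column postprocessing.
class DateColumnHandler : ColumnHandler {
    override val expectedJdbcType: KType = typeOf<java.sql.Date?>()
    override fun preprocessValue(value: Any?): Any? = (value as? java.sql.Date)?.toLocalDate()
}
```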
The final case is where a structured column needs to be created: We read a STRUCT column from JDBC; the first step returns the KType java.sql.Struct. The preprocessor can convert each value to a DataRow<*> based on the KType and TableColumnMetadata, so a DataColumn<DataRow<*>>, aka a ColumnGroup, can be created. Though, we still need to report the new ColumnSchema.Group somewhere so we can do TableColumnMetadata -> ColumnSchema without reading actual data. Maybe in the postprocessor?
As a side note, in the recent Map.toDataFrame PR I noticed that map.toDataRow, which uses type inference, hits a very obvious bottleneck: reflective type inference is called for each row and each value individually. In that case, column-based creation of the ColumnGroup reduces the time from 17s to 1.5s!
So I think the efficient transformation of a ResultSet with Struct values should maybe be done in the ColumnPostprocessor? Like DataColumn -> ColumnGroup.
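A tiny sketch of what that column-based post-processing could look like (illustration only, simplified types, not the PR's implementation): the struct values are split into one list per field, so the ColumnGroup can be built per column rather than per row:

```kotlin
import java.sql.Struct

// Illustration only: build the group column-by-column instead of row-by-row,
// so reflective type inference is not triggered per value.
fun structsToFieldColumns(structs: List<Struct>, fieldNames: List<String>): Map<String, List<Any?>> {
    val columns = fieldNames.associateWith { mutableListOf<Any?>() }
    for (struct in structs) {
        val attributes = struct.attributes // one JDBC call per value, not per field
        fieldNames.forEachIndexed { i, name -> columns.getValue(name).add(attributes[i]) }
    }
    return columns
}
```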
However, the converters will still need access to the original TableColumnMetadata to function properly, and they might do some duplicate logic:
Makes sense
:o
Yes, it makes sense to postpone it to the post-processing step indeed! A DataRow<*> is a DataFrame with one row after all, so forming a DF with 1000 rows will create 1000 intermediate DFs with type inference. We'd need #1541 to be able to do this efficiently.
But this just shows it's good to have both pre- and post-processing :) we need both.
When a FrameColumn is created, though, it does make sense to create the DataFrames in the preprocessing step.
Force-pushed from 6b39cc6 to 9c9a699
…s, that use JdbcTypeMapping
…some types by default
dataframe-jdbc/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/readJdbc.kt (outdated)
val columnKTypes = buildColumnKTypes(tableColumns, dbType)
val columnData = readAllRowsFromResultSet(rs, tableColumns, columnKTypes, dbType, limit)
val dataFrame = buildDataFrameFromColumnData(columnData, tableColumns, columnKTypes, dbType, inferNullability)
val expectedJdbcTypes = getExpectedJdbcTypes(
ExpectedJdbcTypes could be a more complex structure, but it's better to keep the column order index-based for debugging rather than name-based; it could be an edge object with fields index, name, and KType, for example.
That's certainly possible! Though that would look a bit more like AdvancedDbType, which, for each column, requires you to provide an AnyJdbcToDataFrameConverter containing all the information needed to read and convert that column. Maybe the concepts could be merged.
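For illustration, the "edge object" suggested above could be as small as this (hypothetical name and fields):

```kotlin
import kotlin.reflect.KType

// Hypothetical sketch of the suggested edge object: keeps columns in index order
// (handy for debugging) while still carrying the name and the expected KType.
data class ExpectedJdbcColumn(
    val index: Int, // 1-based column index, as in a JDBC ResultSet
    val name: String,
    val expectedType: KType,
)
```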
dbType = dbType,
tableColumns = tableColumns,
)
val preprocessedValueTypes = getPreprocessedValueTypes(
Describe the processes somewhere:
Expected -> Preprocessed (what's the difference, why we need this step, and what it gives us)
): DataFrameSchema {
val determinedDbType = dbType ?: extractDBTypeFromConnection(connection)

// TODO don't need to read 1 row, take it just from TableColumnMetadatas
It's a very safe and cheap way for any database. I moved away from taking the info from TableColumnMetadata; it also leads to less error-prone code in our codebase and makes it more flexible.
Except for empty databases that just have a schema and no data. When building just the schema, we shouldn't have to look at the actual database contents. It's up to the DbType implementor to provide a watertight getExpectedJdbcType(), getPreprocessedValueType(), and getTargetColumnSchema() from just TableColumnMetadata. So if we manage to construct TableColumnMetadatas from connection, tableName and dbType, we can simply call those functions to get a schema without accessing any data.
...and for databases you may have no query access to
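A rough sketch of that data-free path (simplified stand-in types; the function name is modeled on getTargetColumnSchema() mentioned above, everything else is hypothetical):

```kotlin
// Simplified stand-ins for illustration only.
data class TableColumnMetadata(val name: String, val sqlTypeName: String)
interface ColumnSchema
interface DbType2 {
    fun getTargetColumnSchema(meta: TableColumnMetadata): ColumnSchema
}

// Build the schema purely from metadata: no SELECT is issued, so this also works for
// empty tables and for databases the user has no query access to.
fun buildSchema(tableColumns: List<TableColumnMetadata>, dbType: DbType2): Map<String, ColumnSchema> =
    tableColumns.associate { it.name to dbType.getTargetColumnSchema(it) }
```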
get() = "com.mysql.jdbc.Driver"

override fun convertSqlTypeToColumnSchemaValue(tableColumnMetadata: TableColumnMetadata): ColumnSchema? {
override fun getExpectedJdbcType(tableColumnMetadata: TableColumnMetadata): KType {
I want to change this to keep the naming à la convertJdbcTypeToKType
There's no converting happening at this stage yet. As a user you're supposed to provide the KTypes of the values coming directly from JDBC
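A minimal sketch of what that looks like from the implementor's side (simplified TableColumnMetadata stand-in; the mapping below is illustrative, not the actual MySQL implementation):

```kotlin
import java.sql.Types
import kotlin.reflect.KType
import kotlin.reflect.full.withNullability
import kotlin.reflect.typeOf

// Illustration only: getExpectedJdbcType() reports what rs.getObject() is expected
// to return for this column; nothing is converted at this stage yet.
data class TableColumnMetadata(val name: String, val jdbcType: Int, val isNullable: Boolean)

fun getExpectedJdbcType(meta: TableColumnMetadata): KType {
    val type = when (meta.jdbcType) {
        Types.INTEGER -> typeOf<Int>()
        Types.VARCHAR, Types.LONGVARCHAR -> typeOf<String>()
        Types.TIMESTAMP -> typeOf<java.sql.Timestamp>()
        Types.ARRAY -> typeOf<java.sql.Array>()
        else -> typeOf<Any>()
    }
    return if (meta.isNullable) type.withNullability(true) else type
}
```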
tables.getString("table_cat"),
)

override fun convertSqlTypeToKType(tableColumnMetadata: TableColumnMetadata): KType? {
Great refactoring
columnIndex: Int,
tableColumnMetadata: TableColumnMetadata,
expectedJdbcType: KType,
): J? =
J?
J-DBC. It's clearer in the context of AdvancedDbType:
- J-DBC type
- D-ataFrame type
- P-ost-processed type
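Sketched out (hypothetical signatures, just to show where the three type parameters live):

```kotlin
import kotlin.reflect.KType

// Hypothetical shape: J = what JDBC gives us, D = the DataFrame value type after
// per-value preprocessing, P = the value type after whole-column post-processing.
interface JdbcToDataFrameConverter<J, D, P> {
    val expectedJdbcType: KType
    fun preprocessValue(value: J?): D?
    fun postprocessColumn(values: List<D?>): List<P?>
}
```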
expectedPreprocessedValueType: KType,
): D? =
when (tableColumnMetadata.jdbcType) {
Types.TIMESTAMP if tableColumnMetadata.javaClassName == "java.time.LocalDateTime" ->
Why are Time types processed at this stage?
Types.ARRAY to Array::class,
Types.BLOB to ByteArray::class,
Types.CLOB to Clob::class,
Types.REF to Ref::class,
Is it kept as is or changed somewhere?
As is. These are the expected types coming from JDBC, but I use KTypes now instead of ::class because it's more efficient and flexible.
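For illustration, the same defaults expressed with KTypes (assuming java.sql.Array is meant here; illustrative, not the exact map in the PR):

```kotlin
import java.sql.Clob
import java.sql.Ref
import java.sql.Types
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Same defaults as KTypes instead of KClasses: a KType can also carry nullability
// and type arguments, which a KClass cannot.
val defaultExpectedJdbcTypes: Map<Int, KType> = mapOf(
    Types.ARRAY to typeOf<java.sql.Array>(),
    Types.BLOB to typeOf<ByteArray>(),
    Types.CLOB to typeOf<Clob>(),
    Types.REF to typeOf<Ref>(),
)
```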
…ed. Added `getDataFrameCompatibleColumnNames()` function to handle missing or duplicate names, as they can apparently appear from sql but will break DF
…pe already handled names. Simplified column name clashes by using ColumnNameGenerator like the rest of DataFrame.
Fixes #1273
Fixes #1587
Fixes #461
Helps #387
Might fix #462 ?
Follows up on #1266 and #462
Work in progress, so more information will come when the design is finalized.