Skip to content

Consider flattening table schemas #25

@jake-low

Description

@jake-low

Currently, all Layercake tables have the following schema:

column_name column_type
type VARCHAR
id BIGINT
tags STRUCT(...)
bbox STRUCT(xmin FLOAT, ymin FLOAT, xmax FLOAT, ymax FLOAT)
geometry GEOMETRY

The tags field is the only thing that varies between layers; it is a struct and contains all of the layer-specific columns as fields.

I am considering flattening the contents of the tags column. This would for example mean that the new settlements layer (added in #19) would have the following schema:

column_name column_type
type VARCHAR
id BIGINT
place VARCHAR
name VARCHAR
names MAP(VARCHAR, VARCHAR)
alt_name VARCHAR
alt_names MAP(VARCHAR, VARCHAR)
official_name VARCHAR
official_names MAP(VARCHAR, VARCHAR)
wikidata VARCHAR
wikipedia VARCHAR
population UBIGINT
bbox STRUCT(xmin FLOAT, ymin FLOAT, xmax FLOAT, ymax FLOAT)
geometry GEOMETRY

Advantages:

  • Flatter schemas are more familiar to most people (SQL tables, Excel spreadsheets, CSV files, etc are typically flat) and easier to visualize (a two-dimensional data table is easy to show on paper or on a screen).
  • Flatter schemas are also more straightforward to convert to other formats (e.g. export from Layercake data to Shapefile or CSV), since column remapping isn't necessary.
  • DuckDB's describe table and the gpq describe output are more useful. Currently, both commands output a table that looks much like the first table above (i.e. most of the interesting bits are hidden).
  • Many queries become simpler (e.g. where name = 'New York' instead of where tags.name = 'New York').
  • Quoting tag names that have special characters (most commonly :) is easier. Is it tags."building:levels" or "tags.building:levels"? No need to remember anymore, just quote the whole name: "building:levels".
  • tags is a bad name for the struct column since it is no longer a strict 1:1 representation of a feature's OSM tags (see discussion in Support more fine-grained schema definition (non-string types) #7); columns or attributes would be less wrong, but bland; flattening the struct sidesteps having to name it.

Disadvantages:

  • type, id, bbox, and geometry columns can potentially collide with tag keys.
  • Breaks backwards compatibility for existing Layercake users.

I think the advantages outweigh the disadvantages, but I'd like to hear from current users. Please comment if you have an opinion or alternative suggestions.

Note: This should have no effect on dataset sizes; under the hood struct fields in Parquet files get flattened anyways (each individual field becomes its own column and is stored in a contiguous block within each row group). So this is mainly about ergonomics.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions