@mdearos
@mdearos mdearos commented Dec 3, 2025

Overview

This commit adds a fix for the slow performance of full counts in PostgreSQL (Issue #1969).

To achieve this, two new PostgreSQL provider settings have been added (postgresql_pseudo_count_enabled and postgresql_pseudo_count_start), allowing the use of pseudo counts to be configured individually for each use of the PostgreSQL provider.

The fix then uses the PostgreSQL EXPLAIN command to "guess" the number of rows that will be returned by a given request. This does not affect all queries equally, because pseudo counts cannot be run on the following types of query:

  • Requests with a Result Type of Hits.
  • Requests with a CQL filter.
  • Requests with a BBOX filter.
  • Requests with a Temporal filter.
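
The eligibility rule and the estimate lookup described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function names `can_use_pseudo_count` and `estimated_rows_from_explain` and their arguments are hypothetical, and the sketch assumes the plan was obtained with `EXPLAIN (FORMAT JSON)`, whose top-level plan node carries a `Plan Rows` estimate.

```python
import json


def can_use_pseudo_count(resulttype='results', cql_filter=None,
                         bbox=None, datetime_=None):
    """Return True when none of the excluded request types apply.

    Hypothetical helper: the argument names are illustrative and are
    not the provider's real signature.
    """
    if resulttype == 'hits':
        return False
    # Any CQL, BBOX, or temporal filter rules out the pseudo count
    return not (cql_filter or bbox or datetime_)


def estimated_rows_from_explain(explain_json):
    """Extract the planner's row estimate from EXPLAIN (FORMAT JSON) text."""
    plan = json.loads(explain_json)
    # The output is a one-element array whose "Plan" object is the root node
    return int(plan[0]['Plan']['Plan Rows'])


# Sample planner output (made-up values, for illustration only)
sample = '[{"Plan": {"Node Type": "Seq Scan", "Plan Rows": 125000}}]'

print(can_use_pseudo_count())                            # True
print(can_use_pseudo_count(bbox=[-180, -90, 180, 90]))   # False
print(estimated_rows_from_explain(sample))               # 125000
```

The estimate comes from the planner's table statistics, so it can drift from the true count until the table is next analyzed.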

Also, you can use the postgresql_pseudo_count_start setting to tell the system to run a full count if the row estimate is too small, meaning there is enough time for a full count to be run.
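
A minimal sketch of that threshold decision is shown here. The function name `choose_count`, the callable `run_full_count`, and the strict less-than comparison are all assumptions made for illustration; the actual comparison used in this PR may differ.

```python
def choose_count(estimate, pseudo_count_start, run_full_count):
    """Return (count, is_estimate) for a request.

    estimate: row estimate taken from the query planner
    pseudo_count_start: value of postgresql_pseudo_count_start
    run_full_count: hypothetical callable that runs the exact COUNT query
    """
    if estimate < pseudo_count_start:
        # Small result: an exact count is cheap enough to run
        return run_full_count(), False
    # Large result: keep the planner estimate and skip the full count
    return estimate, True


print(choose_count(50, 1000, lambda: 47))      # (47, False)
print(choose_count(250000, 1000, lambda: 0))   # (250000, True)
```

Passing the full count in as a callable means the expensive query only runs when the estimate falls below the threshold.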

This commit also adds the required documentation and PostgreSQL provider test changes, including adding building_type and datetime columns to the dummy_data.sql file.

Additional information

Dependency policy (RFC2)

  • I have ensured that this PR meets RFC2 requirements

Updates to public demo

Contributions and licensing

(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)

  • I'd like to contribute bugfix Slow Query Performance in Postgres Provider Due to Full count on Large Tables to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution
  • I have already previously agreed to the pygeoapi Contributions and Licensing

@webb-ben
Member

webb-ben commented Dec 6, 2025

I wonder if a Postgres-specific addition to the provider block is the correct approach if we want to maintain a rigid pygeoapi config schema. This appears to port some of the logic implemented in a pseudocount-specific pygeoapi Postgres provider.

I wonder if there is a configuration option / solution that could be used across all pygeoapi providers, given that numberMatched is not required by the specification. Is it better to have no count, or an incorrect one?

@mikemahoney218-usgs
Contributor

mikemahoney218-usgs commented Dec 8, 2025

Sharing my two cents: for our deployment, we decided that no count was the better option, so none of our postgresql-backed endpoints have counts. We evaluated the same pseudocount implementation and decided that it didn't give enough benefits (both in performance and in allowing users to predict result sizes -- because the count isn't accurate, so you'll likely still need to incrementally grow your result set) versus removing numberMatched in line with the specification. So our deployment has a flag in the providers block of each resource, number_matched_for_results_enabled, controlling if /items queries return numberMatched -- as well as a matching number_matched_for_hits_enabled controlling if resulttype=hits is allowed at all.

Edit to add: Though unlike Ben's comment above, I do prefer controlling this on a resource level, rather than at the server level; we have other data providers that don't have the same trouble counting records (particularly for small tables) and it's nice to enable numberMatched there. We haven't had anyone comment on the inconsistency yet. We also enable resulttype=hits for most resources, and only disable it for particularly large tables.
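
For illustration, a per-resource flag like the one described in this comment could gate numberMatched roughly as below. Only the flag name number_matched_for_results_enabled comes from the comment; the function `build_items_response`, the `count_rows` callable, and the default of True are hypothetical and do not reflect that deployment's actual code.

```python
def build_items_response(features, provider_def, count_rows):
    """Assemble an /items response, adding numberMatched only if enabled.

    provider_def: the resource's provider block as a dict
    count_rows: hypothetical callable that runs the (possibly slow) count
    """
    response = {
        'type': 'FeatureCollection',
        'features': features,
        'numberReturned': len(features),
    }
    # Assumed default: counts stay on unless the resource opts out
    if provider_def.get('number_matched_for_results_enabled', True):
        response['numberMatched'] = count_rows()
    return response


large_table = {'name': 'PostgreSQL', 'number_matched_for_results_enabled': False}
result = build_items_response([{'type': 'Feature'}], large_table, lambda: 10)
print('numberMatched' in result)   # False
```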
