Class GeoParquetDialect


public class GeoParquetDialect extends DuckDBDialect
SQL Dialect for GeoParquet format.

This dialect extends the base DuckDB dialect with GeoParquet-specific functionality:

  • Parsing and utilizing GeoParquet metadata from the "geo" field
  • Setting up appropriate SQL views for GeoParquet files
  • Optimizing spatial operations and bounds computations
  • Handling both local and remote (HTTP, S3) GeoParquet data access

The dialect reads the metadata defined by the GeoParquet specification to speed up operations such as bounds computation and feature access. It supports both the released GeoParquet format (1.1.0) and development versions (1.2.0-dev).

The dialect uses several performance optimizations:

  • Extracting bounds from GeoParquet metadata rather than computing them
  • Creating SQL views for consistent access to partitioned datasets
  • Using DuckDB's spatial functions for efficient querying
  • Maintaining a cache of metadata to avoid repeated parsing

It works in conjunction with GeoParquetViewManager to handle Hive-partitioned datasets, exposing each partition as a separate feature type.

  • Constructor Details

    • GeoParquetDialect

      public GeoParquetDialect(JDBCDataStore dataStore)
      Creates a new GeoParquetDialect.
      Parameters:
      dataStore - The JDBC datastore this dialect will work with
  • Method Details

    • ensureViewExists

      public void ensureViewExists(String viewName) throws IOException
      Ensures that a database view exists for the specified feature type.

      This method is called before any operation that requires access to a feature type's schema or data, implementing the lazy initialization pattern. If the view already exists, this method has no effect.
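
      The lazy initialization pattern described above can be sketched as follows. This is a minimal illustration, not the dialect's actual code: the class LazyViewRegistry and its viewCreator callback are hypothetical names, and the real implementation delegates to GeoParquetViewManager and may throw IOException.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical stand-in for lazy, idempotent view creation. */
final class LazyViewRegistry {
    private final Set<String> created = new HashSet<>();
    private final Consumer<String> viewCreator; // assumed callback that issues the CREATE VIEW

    LazyViewRegistry(Consumer<String> viewCreator) {
        this.viewCreator = viewCreator;
    }

    /** Creates the view on first request; later calls for the same name have no effect. */
    synchronized void ensureViewExists(String viewName) {
        if (created.contains(viewName)) {
            return; // already created, nothing to do
        }
        viewCreator.accept(viewName);
        // Record the name only after creation succeeded, so a failure can be retried
        created.add(viewName);
    }
}
```

      Calling ensureViewExists twice with the same name invokes the callback once, which is the behavior the description above promises to callers.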

      Parameters:
      viewName - The name of the view/feature type to ensure exists
      Throws:
      IOException - If there is an error creating the view
    • getTypeNames

      public List<String> getTypeNames() throws IOException
      Returns a list of all available feature type names.

      This method queries the view manager to get the names of all registered views, which correspond to available feature types in the GeoParquet dataset.

      Returns:
      A list of feature type names
      Throws:
      IOException - If there is an error retrieving the names
    • createFilterToSQL

      public FilterToSQL createFilterToSQL()
      Creates a specialized filter-to-SQL converter for GeoParquet.
      Overrides:
      createFilterToSQL in class DuckDBDialect
      Returns:
      A new GeoParquetFilterToSQL instance
    • getDatabaseInitSql

      public List<String> getDatabaseInitSql()
      Provides SQL statements to initialize the DuckDB database for GeoParquet access.

      This installs and loads required DuckDB extensions:

      • httpfs - For HTTP/S3 access to remote GeoParquet files
      • parquet - For reading Parquet file format
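
      As a rough sketch of what such an initialization list might look like, the snippet below builds one INSTALL and one LOAD statement per extension, both of which are standard DuckDB statements. The class and method names are hypothetical, and the exact statements and ordering returned by the dialect are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; the statements actually returned by the dialect may differ. */
final class InitSqlSketch {
    /** Builds an INSTALL and a LOAD statement for each named DuckDB extension. */
    static List<String> extensionInitSql(List<String> extensions) {
        List<String> sql = new ArrayList<>();
        for (String extension : extensions) {
            sql.add("INSTALL " + extension); // downloads the extension if it is missing
            sql.add("LOAD " + extension);    // activates it for the current database
        }
        return sql;
    }
}
```

      For the two extensions listed above, extensionInitSql(List.of("httpfs", "parquet")) yields four statements, one INSTALL/LOAD pair per extension.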
      Overrides:
      getDatabaseInitSql in class DuckDBDialect
      Returns:
      List of SQL statements to initialize the database
    • initialize

      public void initialize(GeoParquetConfig config) throws IOException
      Registers SQL views for GeoParquet data partitions.

      This method is called by GeoParquetDataStoreFactory#setupDataStore(JDBCDataStore, Map) to initialize the dialect with the provided configuration. It:

      1. Clears any cached metadata
      2. Initializes the view manager with the new configuration
      Parameters:
      config - The GeoParquet configuration
      Throws:
      IOException - If there's an error registering the views
    • getGeoparquetMetadata

      public GeoparquetDatasetMetadata getGeoparquetMetadata(String typeName) throws IOException
      Gets the GeoParquet metadata for a feature type.

      This is a convenience method that creates a connection and delegates to getGeoparquetMetadata(String, Connection) if the metadata for typeName is not already cached.
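
      The cache-then-load behavior can be illustrated with a small, self-contained sketch. MetadataCache and its loader function are hypothetical names; the real method opens a JDBC connection and returns a GeoparquetDatasetMetadata, which this sketch replaces with a generic type parameter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical per-type metadata cache; M stands in for the metadata type. */
final class MetadataCache<M> {
    private final Map<String, M> cache = new ConcurrentHashMap<>();
    private final Function<String, M> loader; // assumed to perform the expensive load

    MetadataCache(Function<String, M> loader) {
        this.loader = loader;
    }

    /** Returns the cached metadata, invoking the loader only on a cache miss. */
    M get(String typeName) {
        return cache.computeIfAbsent(typeName, loader);
    }

    /** Drops all cached entries, as initialize(GeoParquetConfig) is documented to do. */
    void clear() {
        cache.clear();
    }
}
```

      Repeated calls to get for the same type name reuse the first result until clear is called.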

      Parameters:
      typeName - The feature type to get metadata for
      Returns:
      The GeoParquet metadata for the feature type
      Throws:
      IOException - If there is an error retrieving the metadata
    • loadGeoparquetMetadata

      public GeoparquetDatasetMetadata loadGeoparquetMetadata(String viewName, Connection cx)
      Loads GeoParquet metadata for a specific view.

      This method:

      1. Retrieves the URI for the view
      2. Queries the Parquet key-value metadata to extract the 'geo' field
      3. Parses the metadata for each file in the dataset
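
      Step 2 might be expressed with DuckDB's parquet_kv_metadata table function, which exposes each file's key-value metadata as binary key and value columns. The sketch below only builds the query text. The class and method names are hypothetical, and the query the dialect actually issues is not documented here, so treat this as an assumption.

```java
/** Hypothetical query builder for reading the 'geo' key-value entry of Parquet files. */
final class GeoMetadataQuerySketch {
    /** Returns a DuckDB query selecting the 'geo' metadata value for every file under the URI. */
    static String geoMetadataQuery(String uri) {
        // Double any single quotes so the URI is safe inside a SQL string literal
        String escaped = uri.replace("'", "''");
        // key and value are BLOB columns; decode() converts them to text
        return "SELECT file_name, decode(value) AS geo FROM parquet_kv_metadata('"
                + escaped + "') WHERE decode(key) = 'geo'";
    }
}
```

      Each returned row would then carry the JSON document to parse in step 3, one row per file.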
      Parameters:
      viewName - The name of the view to load metadata for
      cx - Database connection to use for querying
      Returns:
      The combined dataset metadata
    • getPrimaryKeyFinder

      public PrimaryKeyFinder getPrimaryKeyFinder()
      Provides a PrimaryKeyFinder that identifies the 'id' column as the primary key.

      This is a helper for GeoParquetDataStoreFactory to establish the feature ID column in GeoParquet datasets. It always identifies the 'id' column as a String primary key, which is a common convention in GeoParquet files.

      Returns:
      A PrimaryKeyFinder for GeoParquet datasets
    • getOptimizedBounds

      public List<ReferencedEnvelope> getOptimizedBounds(String schema, SimpleFeatureType featureType, Connection cx) throws SQLException, IOException
      Returns optimized bounds for a feature type by using GeoParquet metadata.

      This method uses a multi-stage approach to efficiently determine dataset bounds:

      1. First tries to extract bounds from the GeoParquet 'geo' metadata field
      2. If 'geo' metadata is not available, checks for a 'bbox' column and uses aggregate functions on its components (common in datasets like OvertureMaps)
      3. Finally falls back to the generic DuckDB bounds computation using spatial functions

      Each stage is more computationally expensive than the previous one, so the stages are tried in order from cheapest to most expensive.
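
      The staged fallback can be sketched as an ordered list of lazily evaluated strategies where the first non-empty result wins. The names below are hypothetical. The bbox query assumes a struct column with xmin, ymin, xmax and ymax fields, as found in Overture Maps data, and is not necessarily the SQL the dialect generates.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical illustration of trying cheap strategies before expensive ones. */
final class BoundsFallbackSketch {
    /** Evaluates strategies in order and returns the first non-empty result. */
    static <T> Optional<T> firstAvailable(List<Supplier<Optional<T>>> strategies) {
        for (Supplier<Optional<T>> strategy : strategies) {
            Optional<T> result = strategy.get();
            if (result.isPresent()) {
                return result; // later, more expensive strategies are never run
            }
        }
        return Optional.empty();
    }

    /** Stage 2 query: aggregates the fields of an assumed 'bbox' struct column. */
    static String bboxAggregateSql(String viewName) {
        String quoted = "\"" + viewName.replace("\"", "\"\"") + "\"";
        return "SELECT min(bbox.xmin), min(bbox.ymin), max(bbox.xmax), max(bbox.ymax) FROM "
                + quoted;
    }
}
```

      Because each strategy is a Supplier, the spatial-function fallback runs only when both the metadata lookup and the bbox aggregation return nothing.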

      Overrides:
      getOptimizedBounds in class DuckDBDialect
      Parameters:
      schema - The database schema (unused in GeoParquet)
      featureType - The feature type to get bounds for
      cx - Database connection to use for querying
      Returns:
      A list containing a single ReferencedEnvelope representing the dataset bounds
      Throws:
      SQLException - If there's an error executing SQL
      IOException - If there's an error accessing the data
    • getGeometrySRID

      public Integer getGeometrySRID(String schemaName, String tableName, String columnName, Connection cx)
      Gets the SRID (Spatial Reference ID) for a geometry column.

      This method attempts to extract the SRID from the GeoParquet metadata's CRS information:

      1. First tries to get the CRS from the GeoParquet metadata for the specified column
      2. If available, extracts the SRID from the CRS definition using the PROJJSON representation
      3. Falls back to trying the primary geometry column if the specific column CRS is not found
      4. Falls back to EPSG:4326 (WGS84) if the CRS information is not available or does not contain an SRID

      The CRS information is extracted from the GeoParquet 'geo' metadata field, which follows the PROJJSON v0.7 schema as defined by the OGC GeoParquet specification. This includes proper handling of CRS identifiers with authority and code properties.

      The implementation supports strongly-typed CRS objects, converting between the PROJJSON format used in GeoParquet files and GeoTools CoordinateReferenceSystem objects.
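
      The lookup order can be illustrated with a simplified sketch that treats an already parsed PROJJSON object as a plain Map. PROJJSON identifies a CRS through an "id" object with "authority" and "code" members. Everything else here is an assumption: the class and method names are hypothetical, only EPSG identifiers are accepted, and the real implementation works with typed CRS objects rather than maps.

```java
import java.util.Map;

/** Hypothetical SRID resolution over PROJJSON documents parsed into Maps. */
final class ProjJsonSridSketch {
    static final int DEFAULT_SRID = 4326; // WGS84 fallback

    /** Reads the EPSG code from the "id" member, or returns the default. */
    static int sridOf(Map<String, Object> crs) {
        if (crs == null || !(crs.get("id") instanceof Map)) {
            return DEFAULT_SRID;
        }
        Map<?, ?> id = (Map<?, ?>) crs.get("id");
        Object code = id.get("code");
        if (!"EPSG".equals(id.get("authority")) || code == null) {
            return DEFAULT_SRID;
        }
        try {
            return Integer.parseInt(code.toString()); // code may be a number or a string
        } catch (NumberFormatException e) {
            return DEFAULT_SRID;
        }
    }

    /** Uses the column's own CRS, then the primary geometry column's CRS, then the default. */
    static int resolve(Map<String, Map<String, Object>> crsByColumn,
                       String column, String primaryColumn) {
        Map<String, Object> crs = crsByColumn.get(column);
        if (crs == null) {
            crs = crsByColumn.get(primaryColumn);
        }
        return sridOf(crs);
    }
}
```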

      Overrides:
      getGeometrySRID in class DuckDBDialect
      Parameters:
      schemaName - The database schema (unused in GeoParquet)
      tableName - The table/view name
      columnName - The geometry column name
      cx - Database connection
      Returns:
      The SRID of the geometry column (from metadata or 4326 as default)
    • createCRS

      public CoordinateReferenceSystem createCRS(int srid, Connection cx) throws SQLException
      Overrides the base implementation to apply the axis order provided by the GeoParquet metadata on a per-FeatureType basis. By contrast, SQLDialect.createCRS(int, java.sql.Connection) applies the SQLDialect.forceLongitudeFirst flag as a constant for all feature types.
      Overrides:
      createCRS in class SQLDialect
      Throws:
      SQLException
    • encodeGeometryColumn

      public void encodeGeometryColumn(GeometryDescriptor gatt, String prefix, int srid, Hints hints, StringBuffer sql)
      Encodes a geometry column for a SQL query with awareness of geometry types.

      This overridden method enhances the base DuckDB dialect implementation by checking if multi-geometry encoding should be enforced for the current feature type. It uses the CURRENT_TYPENAME thread-local variable to determine the appropriate behavior based on the GeoParquet metadata.

      For example, if the geometry column is a MultiPolygon according to the GeoParquet metadata, this method will add ST_Multi() to the SQL encoding to ensure proper handling of collection geometries. This is crucial because the JDBCDataStore calls this method without providing full feature type context.
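
      A simplified sketch of the thread-local approach is shown below. Only the CURRENT_TYPENAME name and the use of ST_Multi() come from the description above. The class, its constructor argument and the plain quoted-column output are hypothetical, and the real method also takes a GeometryDescriptor, prefix, SRID and hints and may wrap the column in further functions.

```java
import java.util.Set;

/** Hypothetical illustration of type-aware geometry column encoding. */
final class MultiGeometryEncodingSketch {
    /** Carries the feature type name to code that is called without that context. */
    static final ThreadLocal<String> CURRENT_TYPENAME = new ThreadLocal<>();

    private final Set<String> typesRequiringMulti; // assumed to be derived from metadata

    MultiGeometryEncodingSketch(Set<String> typesRequiringMulti) {
        this.typesRequiringMulti = typesRequiringMulti;
    }

    /** Appends the quoted column, wrapped in ST_Multi() when the current type needs it. */
    void encodeGeometryColumn(String columnName, StringBuffer sql) {
        String column = "\"" + columnName.replace("\"", "\"\"") + "\"";
        String typeName = CURRENT_TYPENAME.get();
        if (typeName != null && typesRequiringMulti.contains(typeName)) {
            sql.append("ST_Multi(").append(column).append(")");
        } else {
            sql.append(column);
        }
    }
}
```

      A caller would set CURRENT_TYPENAME before building the query and remove it afterwards, typically in a finally block, so the value does not leak to other work on the same thread.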

      Overrides:
      encodeGeometryColumn in class DuckDBDialect
      Parameters:
      gatt - The geometry descriptor to encode
      prefix - Column prefix to use
      srid - The spatial reference ID
      hints - Rendering hints that may affect encoding
      sql - The SQL buffer to append to
    • fixGeometryTypes

      public SimpleFeatureType fixGeometryTypes(SimpleFeatureType schema) throws IOException
      Creates a new feature type with more specific geometry types based on GeoParquet metadata, with results cached for performance.

      This method processes a feature type to enhance its geometry descriptors with more specific geometry types derived from the GeoParquet metadata. It:

      1. Ensures the database view exists for the feature type
      2. Delegates to the GeoParquetViewManager to check for a cached version of the enhanced schema
      3. If no cached version exists, creates a new schema with correct geometry type bindings
      4. Caches the result for future use

      This is essential because DuckDB only reports a generic GEOMETRY type, while the GeoParquet metadata contains information about the actual geometry types (Point, LineString, etc.).

      The caching mechanism improves performance by avoiding repeated metadata lookups and feature type construction while maintaining thread safety through the GeoParquetViewManager.
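
      One plausible way to derive a specific binding from the geometry_types list in the 'geo' metadata is sketched below. The narrowing rule is an assumption made for illustration, not the documented behavior: a single declared type is used as-is, a simple type paired with its Multi counterpart narrows to the Multi type, and anything else (including an empty list, which the GeoParquet specification defines as "any type") stays generic. Type names with a " Z" suffix are not handled.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical narrowing rule from declared geometry type names to one binding name. */
final class GeometryTypeNarrowingSketch {
    static String narrow(Set<String> geometryTypes) {
        if (geometryTypes == null || geometryTypes.isEmpty()) {
            return "Geometry"; // nothing declared: keep the generic type
        }
        if (geometryTypes.size() == 1) {
            return geometryTypes.iterator().next();
        }
        if (geometryTypes.size() == 2) {
            for (String single : List.of("Point", "LineString", "Polygon")) {
                String multi = "Multi" + single;
                if (geometryTypes.contains(single) && geometryTypes.contains(multi)) {
                    return multi; // e.g. Polygon + MultiPolygon becomes MultiPolygon
                }
            }
        }
        return "Geometry"; // unrelated mix of types
    }
}
```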

      Parameters:
      schema - The original feature type with generic geometry types
      Returns:
      A new feature type with more specific geometry types, either freshly built or from cache
      Throws:
      IOException - If there is an error accessing the GeoParquet metadata