Class GeoParquetDialect


  • public class GeoParquetDialect
    extends DuckDBDialect
    SQL Dialect for GeoParquet format.

    This dialect extends the base DuckDB dialect with GeoParquet-specific functionality:

    • Parsing and utilizing GeoParquet metadata from the "geo" field
    • Setting up appropriate SQL views for GeoParquet files
    • Optimizing spatial operations and bounds computations
    • Handling both local and remote (HTTP, S3) GeoParquet data access

    The dialect reads the GeoParquet specification metadata to speed up operations such as bounds computation and feature access. It supports both the released GeoParquet format (1.1.0) and development versions (1.2.0-dev).

    The dialect uses several performance optimizations:

    • Extracting bounds from GeoParquet metadata rather than computing them
    • Creating SQL views for consistent access to partitioned datasets
    • Using DuckDB's spatial functions for efficient querying
    • Maintaining a cache of metadata to avoid repeated parsing

    It works in conjunction with GeoParquetViewManager to handle Hive-partitioned datasets, exposing each partition as a separate feature type.

    • Constructor Detail

      • GeoParquetDialect

        public GeoParquetDialect(JDBCDataStore dataStore)
        Creates a new GeoParquetDialect.
        Parameters:
        dataStore - The JDBC datastore this dialect will work with
    • Method Detail

      • ensureViewExists

        public void ensureViewExists(String viewName)
                              throws IOException
        Ensures that a database view exists for the specified feature type.

        This method is called before any operation that requires access to a feature type's schema or data, so views are created lazily on first use. If the view already exists, this method has no effect.

        Parameters:
        viewName - The name of the view/feature type to ensure exists
        Throws:
        IOException - If there is an error creating the view
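        The lazy initialization described above can be sketched with a concurrent map that remembers which views have been created. This is a minimal illustration, not the dialect's actual code: the class LazyViewRegistry and the injected viewCreator callback are hypothetical names, and the callback stands in for whatever SQL the view manager would execute.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical stand-in for lazy view registration; not part of the GeoParquet API. */
class LazyViewRegistry {
    // Names of views that have already been created.
    private final ConcurrentHashMap<String, Boolean> created = new ConcurrentHashMap<>();
    // Assumed callback that would run the CREATE VIEW statement; injected so the sketch runs without a database.
    private final Consumer<String> viewCreator;

    LazyViewRegistry(Consumer<String> viewCreator) {
        this.viewCreator = viewCreator;
    }

    /** Creates the view on the first request; later calls for the same name have no effect. */
    void ensureViewExists(String viewName) {
        // computeIfAbsent runs the creator at most once per name. If the creator
        // throws, no entry is stored, so a later call will retry.
        created.computeIfAbsent(viewName, name -> {
            viewCreator.accept(name);
            return Boolean.TRUE;
        });
    }
}
```

        Using computeIfAbsent rather than a plain contains/add check means concurrent callers asking for the same view wait for the first creation to finish instead of racing past it.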
      • getTypeNames

        public List<String> getTypeNames()
                                  throws IOException
        Returns a list of all available feature type names.

        This method queries the view manager to get the names of all registered views, which correspond to available feature types in the GeoParquet dataset.

        Returns:
        A list of feature type names
        Throws:
        IOException - If there is an error retrieving the names
      • createFilterToSQL

        public FilterToSQL createFilterToSQL()
        Creates a specialized filter-to-SQL converter for GeoParquet.
        Overrides:
        createFilterToSQL in class DuckDBDialect
        Returns:
        A new GeoParquetFilterToSQL instance
      • getDatabaseInitSql

        public List<String> getDatabaseInitSql()
        Provides SQL statements to initialize the DuckDB database for GeoParquet access.

        This installs and loads required DuckDB extensions:

        • httpfs - For HTTP/S3 access to remote GeoParquet files
        • parquet - For reading Parquet file format
        Overrides:
        getDatabaseInitSql in class DuckDBDialect
        Returns:
        List of SQL statements to initialize the database
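        As an illustration of what such an initialization list might contain, the sketch below returns DuckDB INSTALL and LOAD statements for the two extensions named above. The exact statements the dialect returns, their order, and whether statements inherited from DuckDBDialect are included are assumptions; InitSqlSketch is a hypothetical class.

```java
import java.util.List;

/** Hypothetical sketch of an init-SQL list; the real method may return different statements. */
class InitSqlSketch {
    static List<String> databaseInitSql() {
        return List.of(
                // httpfs enables reading from HTTP(S) and S3 locations.
                "INSTALL httpfs",
                "LOAD httpfs",
                // parquet provides the Parquet reader.
                "INSTALL parquet",
                "LOAD parquet");
    }
}
```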
      • initialize

        public void initialize(GeoParquetConfig config)
                        throws IOException
        Registers SQL views for GeoParquet data partitions.

        This method is called by GeoParquetDataStoreFactory#setupDataStore(JDBCDataStore, Map) to initialize the dialect with the provided configuration. It:

        1. Clears any cached metadata
        2. Initializes the view manager with the new configuration
        Parameters:
        config - The GeoParquet configuration
        Throws:
        IOException - If there's an error registering the views
      • getGeoparquetMetadata

        public GeoparquetDatasetMetadata getGeoparquetMetadata(String typeName)
                                                        throws IOException
        Gets the GeoParquet metadata for a feature type.

        This is a convenience method. If the metadata for typeName is already cached, it is returned directly; otherwise the method opens a connection and delegates to getGeoparquetMetadata(String, Connection).

        Parameters:
        typeName - The feature type to get metadata for
        Returns:
        The GeoParquet metadata for the feature type
        Throws:
        IOException - If there is an error retrieving the metadata
      • loadGeoparquetMetadata

        public GeoparquetDatasetMetadata loadGeoparquetMetadata(String viewName,
                                                                Connection cx)
        Loads GeoParquet metadata for a specific view.

        This method:

        1. Retrieves the URI for the view
        2. Queries the Parquet key-value metadata to extract the 'geo' field
        3. Parses the metadata for each file in the dataset
        Parameters:
        viewName - The name of the view to load metadata for
        cx - Database connection to use for querying
        Returns:
        The combined dataset metadata
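        Step 2 above could be expressed with DuckDB's parquet_kv_metadata table function, which returns one row per key-value entry with the key and value as BLOBs. The sketch below only builds the SQL text; whether the dialect issues this exact query is an assumption, and GeoMetadataQuery and geoMetadataSql are hypothetical names.

```java
/** Hypothetical helper that builds a query for the 'geo' key-value entry of each Parquet file. */
class GeoMetadataQuery {
    static String geoMetadataSql(String uri) {
        // Escape single quotes so the URI is a valid SQL string literal.
        String literal = "'" + uri.replace("'", "''") + "'";
        // decode() converts the BLOB key and value columns to VARCHAR.
        return "SELECT file_name, decode(value) AS geo "
                + "FROM parquet_kv_metadata(" + literal + ") "
                + "WHERE decode(key) = 'geo'";
    }
}
```

        Each returned row would carry the JSON text of one file's 'geo' entry, which can then be parsed and merged into dataset-level metadata.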
      • getPrimaryKeyFinder

        public PrimaryKeyFinder getPrimaryKeyFinder()
        Provides a PrimaryKeyFinder that identifies the 'id' column as the primary key.

        This is a helper for GeoParquetDataStoreFactory to establish the feature ID column in GeoParquet datasets. It always identifies the 'id' column as a String primary key, following a common convention in GeoParquet files.

        Returns:
        A PrimaryKeyFinder for GeoParquet datasets
      • getOptimizedBounds

        public List<ReferencedEnvelope> getOptimizedBounds(String schema,
                                                           SimpleFeatureType featureType,
                                                           Connection cx)
                                                    throws SQLException,
                                                           IOException
        Returns optimized bounds for a feature type by using GeoParquet metadata.

        This method uses a multi-stage approach to efficiently determine dataset bounds:

        1. First tries to extract bounds from the GeoParquet 'geo' metadata field
        2. If 'geo' metadata is not available, checks for a 'bbox' column and uses aggregate functions on its components (common in datasets such as Overture Maps)
        3. Finally falls back to the generic DuckDB bounds computation using spatial functions

        Each stage is more computationally expensive than the one before it, so the stages are tried in order of increasing cost and the first one that yields bounds wins.

        Overrides:
        getOptimizedBounds in class DuckDBDialect
        Parameters:
        schema - The database schema (unused in GeoParquet)
        featureType - The feature type to get bounds for
        cx - Database connection to use for querying
        Returns:
        A list containing a single ReferencedEnvelope representing the dataset bounds
        Throws:
        SQLException - If there's an error executing SQL
        IOException - If there's an error accessing the data
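        The staged fallback above can be sketched as a list of suppliers evaluated in order, stopping at the first that produces a result. This is an illustration under stated assumptions: BoundsFallback is a hypothetical class, envelopes are simplified to double arrays of {minX, minY, maxX, maxY}, and the bbox field names (xmin, ymin, xmax, ymax) follow the Overture Maps layout and may differ in other datasets.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch of a cheapest-first bounds lookup. */
class BoundsFallback {
    /** Returns the result of the first stage that yields bounds; later stages are never evaluated. */
    static Optional<double[]> firstAvailable(List<Supplier<Optional<double[]>>> stages) {
        for (Supplier<Optional<double[]>> stage : stages) {
            Optional<double[]> bounds = stage.get();
            if (bounds.isPresent()) {
                return bounds;
            }
        }
        return Optional.empty();
    }

    /** SQL for the second stage: aggregate the components of a 'bbox' struct column (assumed field names). */
    static String bboxAggregateSql(String viewName) {
        String quoted = "\"" + viewName.replace("\"", "\"\"") + "\"";
        return "SELECT min(bbox.xmin), min(bbox.ymin), max(bbox.xmax), max(bbox.ymax) FROM " + quoted;
    }
}
```

        Wrapping each stage in a Supplier keeps the expensive spatial aggregate from running when metadata or the bbox column already answers the question.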
      • getGeometrySRID

        public Integer getGeometrySRID(String schemaName,
                                       String tableName,
                                       String columnName,
                                       Connection cx)
        Gets the SRID (Spatial Reference ID) for a geometry column.

        This method attempts to extract the SRID from the GeoParquet metadata's CRS information:

        1. First tries to get the CRS from the GeoParquet metadata for the specified column
        2. If available, extracts the SRID from the CRS definition using the PROJJSON representation
        3. Falls back to trying the primary geometry column if the specific column CRS is not found
        4. Falls back to EPSG:4326 (WGS84) if the CRS information is not available or does not contain an SRID

        The CRS information is extracted from the GeoParquet 'geo' metadata field, which follows the PROJJSON v0.7 schema as defined by the OGC GeoParquet specification. This includes proper handling of CRS identifiers with authority and code properties.

        The implementation supports strongly-typed CRS objects, converting between the PROJJSON format used in GeoParquet files and GeoTools CoordinateReferenceSystem objects.

        Overrides:
        getGeometrySRID in class DuckDBDialect
        Parameters:
        schemaName - The database schema (unused in GeoParquet)
        tableName - The table/view name
        columnName - The geometry column name
        cx - Database connection
        Returns:
        The SRID of the geometry column (from metadata or 4326 as default)
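        The lookup order above can be sketched as follows. To stay self-contained, the example represents an already-parsed PROJJSON object as a plain Map rather than the strongly-typed CRS objects the implementation uses; SridFromProjJson and its methods are hypothetical, and only EPSG identifiers are handled here.

```java
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical sketch: resolve an SRID from PROJJSON "id" objects with a 4326 default. */
class SridFromProjJson {
    static final int DEFAULT_SRID = 4326;

    /** Reads {"id": {"authority": "EPSG", "code": ...}} from a parsed PROJJSON map, if present. */
    static OptionalInt sridOf(Map<String, ?> crs) {
        if (crs == null || !(crs.get("id") instanceof Map)) {
            return OptionalInt.empty();
        }
        Map<?, ?> id = (Map<?, ?>) crs.get("id");
        if (!"EPSG".equals(id.get("authority"))) {
            return OptionalInt.empty();
        }
        Object code = id.get("code");
        if (code instanceof Number) {
            return OptionalInt.of(((Number) code).intValue());
        }
        if (code instanceof String) {
            // PROJJSON allows the code to be a string as well as an integer.
            try {
                return OptionalInt.of(Integer.parseInt((String) code));
            } catch (NumberFormatException e) {
                return OptionalInt.empty();
            }
        }
        return OptionalInt.empty();
    }

    /** Tries the requested column's CRS, then the primary geometry column's CRS, then the default. */
    static int resolve(Map<String, ?> columnCrs, Map<String, ?> primaryColumnCrs) {
        OptionalInt srid = sridOf(columnCrs);
        if (!srid.isPresent()) {
            srid = sridOf(primaryColumnCrs);
        }
        return srid.orElse(DEFAULT_SRID);
    }
}
```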
      • encodeGeometryColumn

        public void encodeGeometryColumn(GeometryDescriptor gatt,
                                         String prefix,
                                         int srid,
                                         Hints hints,
                                         StringBuffer sql)
        Encodes a geometry column for a SQL query with awareness of geometry types.

        This overridden method enhances the base DuckDB dialect implementation by checking if multi-geometry encoding should be enforced for the current feature type. It uses the CURRENT_TYPENAME thread-local variable to determine the appropriate behavior based on the GeoParquet metadata.

        For example, if the geometry column is a MultiPolygon according to the GeoParquet metadata, this method adds ST_Multi() to the SQL encoding so that every geometry is returned as a collection type. The thread-local is needed because JDBCDataStore calls this method without providing the full feature type context.

        Overrides:
        encodeGeometryColumn in class DuckDBDialect
        Parameters:
        gatt - The geometry descriptor to encode
        prefix - Column prefix to use
        srid - The spatial reference ID
        hints - Rendering hints that may affect encoding
        sql - The SQL buffer to append to
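        A simplified sketch of this thread-local approach is shown below. It is not the dialect's code: MultiGeometryEncoding is a hypothetical class, the set of type names requiring multi-geometry encoding is passed in directly instead of being derived from GeoParquet metadata, and the column is written bare rather than through the base dialect's encoding.

```java
import java.util.Set;

/** Hypothetical sketch of wrapping a geometry column in ST_Multi based on a thread-local type name. */
class MultiGeometryEncoding {
    // Set by the caller before SQL generation, since the encode method receives no feature type.
    static final ThreadLocal<String> CURRENT_TYPENAME = new ThreadLocal<>();

    // Assumed input: feature types whose metadata declares a multi-geometry column.
    private final Set<String> multiGeometryTypes;

    MultiGeometryEncoding(Set<String> multiGeometryTypes) {
        this.multiGeometryTypes = multiGeometryTypes;
    }

    void encodeGeometryColumn(String columnName, StringBuffer sql) {
        String column = "\"" + columnName.replace("\"", "\"\"") + "\"";
        String typeName = CURRENT_TYPENAME.get();
        if (typeName != null && multiGeometryTypes.contains(typeName)) {
            // Promote single geometries so every row matches the declared multi type.
            sql.append("ST_Multi(").append(column).append(")");
        } else {
            sql.append(column);
        }
    }
}
```

        A caller would set CURRENT_TYPENAME before building the query and call remove() afterwards, so that a pooled thread does not carry a stale type name into the next request.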
      • fixGeometryTypes

        public SimpleFeatureType fixGeometryTypes(SimpleFeatureType schema)
                                           throws IOException
        Creates a new feature type with more specific geometry types based on GeoParquet metadata, with results cached for performance.

        This method processes a feature type to enhance its geometry descriptors with more specific geometry types derived from the GeoParquet metadata. It:

        1. Ensures the database view exists for the feature type
        2. Delegates to the GeoParquetViewManager to check for a cached version of the enhanced schema
        3. If no cached version exists, creates a new schema with correct geometry type bindings
        4. Caches the result for future use

        This is essential because DuckDB only reports a generic GEOMETRY type, while the GeoParquet metadata contains information about the actual geometry types (Point, LineString, etc.).

        The caching mechanism improves performance by avoiding repeated metadata lookups and feature type construction while maintaining thread safety through the GeoParquetViewManager.

        Parameters:
        schema - The original feature type with generic geometry types
        Returns:
        A new feature type with more specific geometry types, either freshly built or from cache
        Throws:
        IOException - If there is an error accessing the GeoParquet metadata
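        One way the geometry type narrowing could work is to map the geometry_types list from the GeoParquet column metadata to a single binding name. The rules below are an assumption made for illustration, not the documented behavior: a single declared type is used as is, a type mixed with its Multi variant is promoted to the Multi variant, and anything else stays a generic Geometry. GeometryTypeNarrowing is a hypothetical class and returns type names as strings rather than JTS classes.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: choose the most specific geometry type name from GeoParquet geometry_types. */
class GeometryTypeNarrowing {
    static String narrow(List<String> geometryTypes) {
        Set<String> distinct = new TreeSet<>();
        for (String type : geometryTypes) {
            // GeoParquet marks 3D geometries with a " Z" suffix; dimensionality is ignored here.
            distinct.add(type.endsWith(" Z") ? type.substring(0, type.length() - 2) : type);
        }
        if (distinct.size() == 1) {
            return distinct.iterator().next();
        }
        if (distinct.size() == 2) {
            for (String type : distinct) {
                // e.g. {Polygon, MultiPolygon} can be represented uniformly as MultiPolygon.
                if (!type.startsWith("Multi") && distinct.contains("Multi" + type)) {
                    return "Multi" + type;
                }
            }
        }
        // An empty list (unknown) or unrelated mixed types fall back to the generic type.
        return "Geometry";
    }
}
```

        Under these assumed rules, a column declaring both Polygon and MultiPolygon narrows to MultiPolygon, which is the case where wrapping values in ST_Multi() keeps every row consistent with the declared binding.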