ParquetDst
class polars_cloud.ParquetDst(
    uri: str | Path | PartitioningScheme,
    *,
    compression: ParquetCompression = 'zstd',
    compression_level: int | None = None,
    statistics: bool | str | dict[str, bool] = True,
    row_group_size: int | None = None,
    data_page_size: int | None = None,
    maintain_order: bool = True,
    storage_options: dict[str, Any] | None = None,
    credential_provider: CredentialProviderFunction | Literal['auto'] | None = 'auto',
    metadata: ParquetMetadata | None = None,
    field_overwrites: ParquetFieldOverwrites | Sequence[ParquetFieldOverwrites] | Mapping[str, ParquetFieldOverwrites] | None = None,
)
Parquet destination arguments.
Parameters:
- uri
Path to which the output should be written. Must be a URI to an accessible object store location. If set to "local", the query is executed locally. If None, the result will be written to a temporary location; this is useful for intermediate query results.
- compression : {'lz4', 'uncompressed', 'snappy', 'gzip', 'lzo', 'brotli', 'zstd'}
Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression. Choose "snappy" for more backwards-compatibility guarantees when dealing with older Parquet readers.
- compression_level
The level of compression to use. Higher compression means smaller files on disk.
“gzip” : min-level: 0, max-level: 10.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.
- statistics
Write statistics to the Parquet headers. This is the default behavior.
Possible values:
  - True: enable the default set of statistics (default). Some statistics may be disabled.
  - False: disable all statistics.
  - "full": calculate and write all available statistics. Cannot be combined with use_pyarrow.
  - { "statistic-key": True / False, ... }: enable or disable individual statistics. Cannot be combined with use_pyarrow (this form appears in the sketch after this parameter list). Available keys:
    - "min": column minimum value (default: True)
    - "max": column maximum value (default: True)
    - "distinct_count": number of unique column values (default: False)
    - "null_count": number of null values in column (default: True)
- row_group_size
Size of the row groups in number of rows. Defaults to 512^2 rows.
- data_page_size
Size of the data page in bytes. Defaults to 1024^2 bytes.
- maintain_order
Maintain the order in which data is processed. Setting this to False can be much faster.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- storage_options
Options that indicate how to connect to a cloud provider.
The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  - Hugging Face (hf://): accepts an API key under the token parameter, {'token': '...'}, or via the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
- credential_provider
Provide a function that can be called to provide cloud storage credentials. The function is expected to return a dictionary of credential keys along with an optional credential expiry time (a sketch is shown after this parameter list).
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
- metadata
A dictionary or callback to add key-values to the file-level Parquet metadata (the dictionary form appears in the sketch after this parameter list).
Warning
This functionality is considered experimental. It may be removed or changed at any point without it being considered a breaking change.
- field_overwrites
Property overwrites for individual Parquet fields.
This allows more control over the writing process to the granularity of a Parquet field.
Warning
This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.
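
As a rough illustration of these parameters, the sketch below constructs a ParquetDst with an explicit compression level, a per-statistic dictionary, and file-level metadata. This is a minimal sketch, not a definitive recipe: the S3 URI is a placeholder, and the plain string key-value shape of the metadata dictionary is an assumption based on the description above; the parameter names and defaults come from the signature at the top of this page.

```python
from polars_cloud import ParquetDst

dst = ParquetDst(
    "s3://my-bucket/output/",    # hypothetical object store location
    compression="zstd",          # good compression performance
    compression_level=3,         # zstd accepts levels 1-22
    statistics={                 # per-statistic toggles; keys as listed above
        "min": True,
        "max": True,
        "distinct_count": False,
        "null_count": True,
    },
    row_group_size=512**2,       # the documented default, made explicit
    metadata={"pipeline": "nightly-etl"},  # hypothetical file-level key-value
)
```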
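And a sketch of a custom credential_provider, assuming the Polars convention that the callable returns a tuple of a credential dictionary and an optional expiry as a Unix timestamp; the AWS key names and the secret-store lookup are hypothetical.

```python
import time

from polars_cloud import ParquetDst

def fetch_credentials():
    # Hypothetical: obtain short-lived credentials from your own secret store.
    creds = {
        "aws_access_key_id": "...",      # placeholder
        "aws_secret_access_key": "...",  # placeholder
        "aws_session_token": "...",      # placeholder
    }
    # Optional expiry (Unix timestamp); None would mean "no expiry".
    expiry = int(time.time()) + 3600
    return creds, expiry

dst = ParquetDst(
    "s3://my-bucket/output/",  # placeholder URI
    credential_provider=fetch_credentials,
)
```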
Attributes:
- compression: Compression algorithm
- compression_level: Compression level
- credential_provider: Credential provider
- data_page_size: Data page size
- row_group_size: Size of the row groups
- compression: ParquetCompression
Compression algorithm
- credential_provider: CredentialProviderFunction | Literal['auto'] | None
Credential provider
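
Since the configured values are exposed back as attributes, a quick sanity check on the hypothetical dst from the first sketch above might look like:

```python
assert dst.compression == "zstd"
assert dst.compression_level == 3
assert dst.row_group_size == 512**2
```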