Parquet uploads
Overview
Parquet files can be ingested into Sift in several ways depending on your workflow. You can upload them directly with the Python client, send them through the REST API using cURL, or point Sift to a file hosted on HTTP or S3. All methods attach the data to an Asset in Sift so it can be explored, queried, and reviewed.
In most cases you provide a small configuration that tells Sift how to interpret your file, including which column to use for time and how to map data columns to Channels. Sift also requires information from the Parquet footer so it can read the file correctly. If you want to skip writing the full configuration, the `detect-config` endpoint can generate one for you automatically, requiring only minimal edits.
Prerequisites
All ingestion methods in this guide require a few values that are specific to your Sift account and project. Replace the placeholders used in the examples with your own:
- API key: required to authenticate every request. See Create an API key.
- Base URL: the Sift endpoint for your deployment. See Obtain the base URL.
  - Use the gRPC API URL with the Python client.
  - Use the REST API URL with cURL or URL ingestion.
Ingest methods
Ingest with Python
Use the `sift-stack-py` client library to upload a Parquet file directly into Sift. This method interacts with the gRPC API.
Install the client
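Install the latest release from PyPI:

```bash
pip install sift-stack-py
```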
Upload Parquet file
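The exact upload interface varies across sift-stack-py releases, so treat the following as a sketch rather than the definitive API: it assumes a hypothetical `ParquetUploadService` analogous to the client's CSV upload service, driven by the `time_path` and `time_format` parameters described below. Consult the Parquet ingestion examples linked below for the current interface.

```python
from sift_py.data_import.time_format import TimeFormatType
from sift_py.rest import SiftRestConfig

# Hypothetical import path: check the sift-stack-py examples for the
# actual module and class name of the Parquet upload service.
from sift_py.data_import.parquet import ParquetUploadService

rest_config = SiftRestConfig(
    uri="YOUR_GRPC_API_URL",  # gRPC API URL (see Prerequisites)
    apikey="YOUR_API_KEY",
)

upload_service = ParquetUploadService(rest_config)

# time_path selects the timestamp column; time_format must be one of
# the supported formats listed under "Time formats" below.
import_service = upload_service.upload(
    "data.parquet",
    asset_name="my-asset",
    time_path="timestamp",
    time_format=TimeFormatType.ABSOLUTE_DATETIME,
)

# Optionally block until Sift reports the import complete
# (see "Parquet upload status" below).
import_service.wait_until_complete()
```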
Time: `time_path` must be a timestamp column; `time_format` must be supported (see Time formats).
Examples: For additional usage patterns, see the Parquet ingestion examples in the Sift public repository.
Ingest with cURL
cURL lets you interact with the Sift REST API directly. This is useful if you want to test ingestion quickly from the command line, or if you are in an environment where installing the Python client isn't practical. The process involves three steps: create a JSON config, request an upload URL, and then upload your Parquet file.
Before making the request, create a JSON file (for example `my-upload-config.json`) that describes how Sift should interpret your Parquet file (see Parquet ingestion configuration for details):
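A minimal sketch of what `my-upload-config.json` could look like. The field names follow the conventions used later in this guide (`timeColumn`, `channelConfig`, `footerOffset`, `footerLength`); the `assetName` field, channel names, and values shown here are placeholders, and the footer values must be extracted from your own file (see Parquet footer below):

```json
{
  "parquetConfig": {
    "assetName": "my-asset",
    "timeColumn": {
      "path": "timestamp",
      "format": "TIME_FORMAT_ABSOLUTE_DATETIME"
    },
    "channelConfig": [
      {
        "path": "velocity",
        "dataType": "CHANNEL_DATA_TYPE_DOUBLE"
      }
    ],
    "complexTypesImportMode": "PARQUET_COMPLEX_TYPES_IMPORT_MODE_BOTH",
    "footerOffset": 123456,
    "footerLength": 7890
  }
}
```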
Send the configuration JSON (`my-upload-config.json`) to the REST API by POSTing to `/api/v2/data-imports:upload`. If the request is valid, Sift returns a temporary upload URL.
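For example, assuming `$BASE_URL` holds your REST API URL, `$SIFT_API_KEY` holds your API key, and the API accepts a bearer-token authorization header (adjust to match how you normally authenticate against Sift):

```bash
curl -X POST "$BASE_URL/api/v2/data-imports:upload" \
  -H "Authorization: Bearer $SIFT_API_KEY" \
  -H "Content-Type: application/json" \
  -d @my-upload-config.json
```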
Use the `uploadUrl` returned in the previous step to upload your Parquet file.
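A sketch of the upload step, assuming the temporary URL accepts the raw Parquet bytes as a POST body (`$UPLOAD_URL` stands for the `uploadUrl` value from the previous response):

```bash
curl -X POST "$UPLOAD_URL" \
  -H "Authorization: Bearer $SIFT_API_KEY" \
  --data-binary @data.parquet
```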
GZIP: If your Parquet file is GZIP-compressed, include the header `-H "content-encoding: gzip"` in the upload request.
Ingest from a URL
Instead of uploading a local file, you can tell Sift to fetch and ingest a Parquet file directly from a remote location. Both HTTP and S3 sources are supported. This method requires a configuration JSON that specifies the file URL and how to interpret the Parquet data.
Create a JSON file (for example `my-url-config.json`) with the remote file URL and a Parquet configuration (see Parquet ingestion configuration for details):
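A sketch of `my-url-config.json`: the top-level `url` field name is an assumption (verify it against the API reference), and the nested `parquetConfig` takes the same shape as in the cURL example above:

```json
{
  "url": "https://example.com/path/to/data.parquet",
  "parquetConfig": {
    "assetName": "my-asset",
    "timeColumn": {
      "path": "timestamp",
      "format": "TIME_FORMAT_ABSOLUTE_RFC3339"
    },
    "channelConfig": [
      {
        "path": "velocity",
        "dataType": "CHANNEL_DATA_TYPE_DOUBLE"
      }
    ],
    "footerOffset": 123456,
    "footerLength": 7890
  }
}
```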
Start ingestion by POSTing the configuration to `/api/v2/data-imports:url`. If the request is valid, the endpoint returns a `200` response code, and Sift fetches the file from the given URL and begins ingestion.
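For example, under the same placeholder and authorization assumptions as in the cURL section above:

```bash
curl -X POST "$BASE_URL/api/v2/data-imports:url" \
  -H "Authorization: Bearer $SIFT_API_KEY" \
  -H "Content-Type: application/json" \
  -d @my-url-config.json
```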
GZIP: If the remote file is GZIP-compressed, make sure the server sets the `content-encoding: gzip` header.
Parquet ingestion configuration
The JSON configuration defines how Sift should interpret and ingest your Parquet file. This configuration is required when ingesting with cURL or URL ingestion (see the example under Ingest with cURL above). The sections below describe the time formats, data types, and complex-type options it can contain.
Parquet upload status
When you submit a Parquet ingestion request, the response (`200` for Python or URL ingestion, `OK` for cURL) only confirms that the request was accepted, not that the file has been fully ingested. To verify ingestion, use the following data-import endpoints to check the upload status:
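For example, a status check could look like the following; the exact route is an assumption (consult the data-import API reference), and `$DATA_IMPORT_ID` is the import ID returned when your request was accepted:

```bash
curl "$BASE_URL/api/v1/data-imports/$DATA_IMPORT_ID" \
  -H "Authorization: Bearer $SIFT_API_KEY"
```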
Time formats
When ingesting Parquet data, you must specify how timestamps are interpreted. These formats apply both to the `timeColumn.format` field in the JSON configuration (for cURL and URL ingestion) and to the `time_format` parameter when using the Python client. Supported formats fall into two groups: absolute and relative.
Absolute (timestamps tied to a fixed point in time)
Constant | Meaning | Example |
---|---|---|
TIME_FORMAT_ABSOLUTE_RFC3339 | RFC3339 timestamp | 2023-01-02T15:04:05Z |
TIME_FORMAT_ABSOLUTE_DATETIME | Datetime string | 2023-01-02 15:04:05 |
TIME_FORMAT_ABSOLUTE_UNIX_SECONDS | Seconds since Unix epoch | 1704067200 |
TIME_FORMAT_ABSOLUTE_UNIX_MILLISECONDS | Milliseconds since Unix epoch | 1704067200000 |
TIME_FORMAT_ABSOLUTE_UNIX_MICROSECONDS | Microseconds since Unix epoch | 1704067200000000 |
TIME_FORMAT_ABSOLUTE_UNIX_NANOSECONDS | Nanoseconds since Unix epoch | 1704067200000000000 |
Relative (timestamps relative to a start time)
Constant | Units |
---|---|
TIME_FORMAT_RELATIVE_NANOSECONDS | Nanoseconds |
TIME_FORMAT_RELATIVE_MICROSECONDS | Microseconds |
TIME_FORMAT_RELATIVE_MILLISECONDS | Milliseconds |
TIME_FORMAT_RELATIVE_SECONDS | Seconds |
TIME_FORMAT_RELATIVE_MINUTES | Minutes |
TIME_FORMAT_RELATIVE_HOURS | Hours |
Data types
Data types are specified per Channel in the `channelConfig.dataType` field of your JSON configuration. You can use the following options:
Constant | Meaning |
---|---|
CHANNEL_DATA_TYPE_DOUBLE | Double-precision floating-point number |
CHANNEL_DATA_TYPE_FLOAT | Single-precision floating-point number |
CHANNEL_DATA_TYPE_STRING | String value |
CHANNEL_DATA_TYPE_BOOL | Boolean value (true or false) |
CHANNEL_DATA_TYPE_INT_32 | 32-bit signed integer |
CHANNEL_DATA_TYPE_INT_64 | 64-bit signed integer |
CHANNEL_DATA_TYPE_UINT_32 | 32-bit unsigned integer |
CHANNEL_DATA_TYPE_UINT_64 | 64-bit unsigned integer |
CHANNEL_DATA_TYPE_ENUM | Enumerated value |
CHANNEL_DATA_TYPE_BIT_FIELD | Bit field value |
Complex types
Parquet files may contain lists or maps. The `complexTypesImportMode` field in the Parquet configuration (JSON file) controls how these are ingested. The table below lists the available options:
Mode | Behavior |
---|---|
PARQUET_COMPLEX_TYPES_IMPORT_MODE_IGNORE | Skip complex types and do not ingest them. |
PARQUET_COMPLEX_TYPES_IMPORT_MODE_BOTH | Import as both Arrow bytes and JSON strings. |
PARQUET_COMPLEX_TYPES_IMPORT_MODE_STRING | Import only as JSON strings. |
PARQUET_COMPLEX_TYPES_IMPORT_MODE_BYTES | Import only as Arrow bytes. |
Default value: If the `complexTypesImportMode` field is not included in the Parquet configuration JSON file, Sift will use `PARQUET_COMPLEX_TYPES_IMPORT_MODE_BOTH` by default.
Nested columns
To specify a nested column in your Parquet configuration, use the `|` character as a separator in the `path` field. For example, if your Parquet file contains a struct column named `location` with a nested field `lat`, set the path to `"location|lat"`. This tells Sift to ingest the nested `lat` field inside the `location` struct.
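In the configuration JSON, the corresponding channel entry could look like this (the `dataType` here is illustrative):

```json
{
  "path": "location|lat",
  "dataType": "CHANNEL_DATA_TYPE_DOUBLE"
}
```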
Parquet configuration detection
If you do not need to customize how your data is imported, you can use the `detect-config` endpoint to automatically generate a Parquet configuration. This endpoint analyzes your Parquet file footer and returns a configuration object with most fields pre-filled.
You will only need to manually set the timestamp information (`timeColumn`) and the footer information (`footerOffset` and `footerLength`) before uploading. To use this endpoint:
1. Send a POST request with your Parquet file footer to the `/api/v0/data-imports:detect-config` endpoint (note that this is a `v0` endpoint), as sketched after this list.
2. The response contains a `parquetConfig` object generated from the file's footer.
3. Edit the returned configuration to specify the correct time column (`timeColumn`) and the footer values (`footerOffset` and `footerLength`).
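A sketch of step 1, assuming the footer bytes are sent as the raw request body (`footer.bin` is a hypothetical file containing the extracted footer; see Parquet footer below for how to locate it):

```bash
curl -X POST "$BASE_URL/api/v0/data-imports:detect-config" \
  -H "Authorization: Bearer $SIFT_API_KEY" \
  --data-binary @footer.bin
```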
Parquet footer
The Parquet footer is a metadata block at the end of every Parquet file. It contains schema information, row group metadata, and other details required for reading the file. When uploading Parquet files to Sift, you must specify the `footerOffset` (the byte position where the footer starts) and `footerLength` (the size of the footer in bytes) in your configuration.
To populate `footerOffset` and `footerLength` in your configuration JSON, you need to extract them from the Parquet file itself. The example below shows one way to do this:
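The following Python snippet (standard library only) reads the eight trailing bytes of the file, which hold the 4-byte little-endian footer length followed by the `PAR1` magic number, and derives both values:

```python
import struct

def parquet_footer_info(path: str) -> tuple[int, int]:
    """Return (footer_offset, footer_length) for a Parquet file.

    A Parquet file ends with: <footer> <4-byte little-endian
    footer length> <4-byte magic "PAR1">.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)           # jump to the end of the file
        file_size = f.tell()
        f.seek(file_size - 8)  # last 8 bytes: footer length + magic
        tail = f.read(8)
        if tail[4:] != b"PAR1":
            raise ValueError("not a Parquet file (missing PAR1 magic)")
        footer_length = struct.unpack("<I", tail[:4])[0]
        footer_offset = file_size - 8 - footer_length
        return footer_offset, footer_length

offset, length = parquet_footer_info("data.parquet")
print(f"footerOffset={offset} footerLength={length}")
```

The same arithmetic also gives you the footer bytes themselves, which the detect-config endpoint needs: seek to `footer_offset` and read `footer_length` bytes.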