Parquet uploads

Overview

Parquet files can be ingested into Sift in several ways depending on your workflow. You can upload them directly with the Python client, send them through the REST API using cURL, or point Sift to a file hosted on HTTP or S3. All methods attach the data to an Asset in Sift so it can be explored, queried, and reviewed.

In most cases you provide a small configuration that tells Sift how to interpret your file, including which column to use for time and how to map data columns to Channels. Sift also requires information from the Parquet footer so it can read the file correctly. If you want to skip writing the full configuration, the detect-config endpoint can generate one for you automatically, leaving only a few fields for you to fill in.

Prerequisites

All ingestion methods in this guide require a few values that are specific to your Sift account and project. Replace the placeholders used in the examples with your own:

  • API key: required to authenticate every request. See Create an API key.
  • Base URL: the Sift endpoint for your deployment. See Obtain the base URL.
    • Use the gRPC API URL with the Python client.
    • Use the REST API URL with cURL or URL ingestion.

Ingest methods

Ingest with Python

Use the sift-stack-py client library to upload a Parquet file directly into Sift. This method interacts with the gRPC API.

Install the client

pip install sift-stack-py

Upload Parquet file

from sift_py.data_import.parquet import ParquetUploadService
from sift_py.data_import.time_format import TimeFormatType

# sift_uri, apikey, and asset_name are placeholders for your gRPC API URL,
# API key, and target Asset name.

# Initialize the Parquet upload service
parquet_upload_service = ParquetUploadService({
    "uri": sift_uri,
    "apikey": apikey,
})
 
# Ingest the Parquet file into Sift
import_service = parquet_upload_service.flat_dataset_upload(
    asset_name,
    "sample_data.parquet",
    time_path="timestamp",
    time_format=TimeFormatType.ABSOLUTE_UNIX_NANOSECONDS,
)
 
# Wait for ingestion to finish and save the resulting file metadata in 'uploaded_file'
uploaded_file = import_service.wait_until_complete()

Time: time_path must be a timestamp column; time_format must be supported (see Time formats).

Examples: For additional usage patterns, see the Parquet ingestion examples in the Sift public repository.

Ingest with cURL

cURL lets you interact with the Sift REST API directly. This is useful if you want to test ingestion quickly from the command line, or if you are in an environment where installing the Python client isn't practical. The process involves three steps: create a JSON config, request an upload URL, and then upload your Parquet file.

Before making the request, create a JSON file (for example my-upload-config.json) that describes how Sift should interpret your Parquet file (see Parquet ingestion configuration for details):

{
  "parquet_config": { ... }
}
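
For illustration, a hypothetical my-upload-config.json for a flat dataset with a timestamp column and a single velocity channel might look like the following; the asset name, channel details, and footer values are placeholders to replace with your own (see the "Parquet footer" section for how to obtain the footer values):

{
  "parquet_config": {
    "assetName": "MyAsset",
    "flatDataset": {
      "timeColumn": {
        "path": "timestamp",
        "format": "TIME_FORMAT_ABSOLUTE_UNIX_NANOSECONDS"
      },
      "dataColumns": [
        {
          "path": "velocity",
          "channelConfig": {
            "name": "velocity",
            "dataType": "CHANNEL_DATA_TYPE_DOUBLE",
            "units": "m/s"
          }
        }
      ]
    },
    "footerOffset": 4096,
    "footerLength": 512
  }
}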

Send the configuration JSON (my-upload-config.json) to the REST API by POSTing to /api/v2/data-imports:upload. If the request is valid, Sift returns a temporary upload URL.

$ curl "$SIFT_API_HOST/api/v2/data-imports:upload" \
  --data-binary @my-upload-config.json \
  -H "authorization: Bearer $SIFT_API_KEY"
{"uploadUrl":"http://$SIFT_REST_URL/api/v2/data-imports:upload/<UPLOAD_ID>"}

Use the uploadUrl returned in the previous step to upload your Parquet file.

curl "$SIFT_API_HOST/api/v2/data-imports/<UPLOAD_ID>" \
  --data-binary @my-data.parquet \
  -H "authorization: Bearer $SIFT_API_KEY"
GZIP: To upload a GZIP-compressed file, add the header: -H "content-encoding: gzip".
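
If jq is installed, you can capture the returned URL in a shell variable and chain the two steps; a small sketch under that assumption:

UPLOAD_URL=$(curl -s "$SIFT_API_HOST/api/v2/data-imports:upload" \
  --data-binary @my-upload-config.json \
  -H "authorization: Bearer $SIFT_API_KEY" | jq -r .uploadUrl)

curl "$UPLOAD_URL" \
  --data-binary @my-data.parquet \
  -H "authorization: Bearer $SIFT_API_KEY"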

Ingest from a URL

Instead of uploading a local file, you can tell Sift to fetch and ingest a Parquet file directly from a remote location. Both HTTP and S3 sources are supported. This method requires a configuration JSON that specifies the file URL and how to interpret the Parquet data.

Create a JSON file (for example my-url-config.json) with the remote file URL and a Parquet configuration (see Parquet ingestion configuration for details):

{
  "url": string,
  "parquet_config": { ... }
}
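
For example, a hypothetical my-url-config.json pointing at an HTTP-hosted file (the URL and configuration values are placeholders; the full parquet_config structure is described under "Parquet ingestion configuration"):

{
  "url": "https://example.com/data/sample_data.parquet",
  "parquet_config": {
    "assetName": "MyAsset",
    "flatDataset": { ... },
    "footerOffset": 4096,
    "footerLength": 512
  }
}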

Start ingestion. POST the configuration to /api/v2/data-imports:url. If the request is valid, the endpoint will return a 200 response code and Sift will fetch the file from the given URL and begin ingestion.

curl "$SIFT_API_HOST/api/v2/data-imports:url" \
  --data-binary @my-url-config.json \
  -H "authorization: Bearer $SIFT_API_KEY"

GZIP: If the remote file is GZIP-compressed, make sure the server sets the content-encoding: gzip header.

Parquet ingestion configuration

The JSON configuration defines how Sift should interpret and ingest your Parquet file. This configuration is required when ingesting with cURL or URL ingestion.

{
  "url": string,             // Required only for URL ingestion. The remote file location (HTTP or S3).
  "parquet_config": {        // Required. Defines how to ingest the Parquet file.
    "assetName": string,     // Required. Name of the Asset to attach this data to.
    "runName": string,       // Optional. Name of the Run to create for this data.
    "runId": string,         // Optional. ID of the Run to add this data to.
                             // If set, "runName" is ignored.
    "flatDataset": {         // Required. ParquetFlatDatasetConfig object.
      "timeColumn": {
        "path": string,           // Required. Path to the time column.
        "format": string,         // Required. See the "Time formats" section for options.
        "relativeStartTime": string // Optional. RFC3339 format. Used for relative time formats.
      },
      "dataColumns": [
        {
          "path": string,         // Required. Path to the data column. See the "Nested columns"
                                  // section for how to specify the path for nested columns.
          "channelConfig": {
            "name": string,        // Required. Name of the Channel.
            "dataType": string,    // Required. See the "Data types" section for options.
            "units": string,       // Optional. Channel units (defaults to empty).
            "description": string, // Optional. Channel description.
            "enumTypes": [         // Optional. Only valid if dataType is ENUM.
              {
                "key": number,     // Raw enum value.
                "name": string     // Display value for the enum.
              }
            ],
            "bitFieldElements": [  // Optional. Only valid if dataType is BIT_FIELD.
              {
                "index": number,   // Starting index of the bit field.
                "name": string,    // Name of the bit field element.
                "bitCount": number // Number of bits in the element.
              }
            ]
          }
        }
      ]
    },
    "footerOffset": number,  // Required. Byte position where the Parquet footer starts.
    "footerLength": number,  // Required. Length of the Parquet footer in bytes.
    "complexTypesImportMode": string // Optional. See the "Complex types" section for options.
  }
}

Columns: Columns not specified in the configuration are not ingested.

Parquet upload status

When you submit a Parquet ingestion request, a successful response only confirms that the request was accepted, not that the file has been fully ingested. To verify ingestion, check the upload status with the data-imports endpoint:

curl -s "$SIFT_API_HOST/api/v2/data-imports" -H "authorization: Bearer $SIFT_API_KEY"

Time formats

When ingesting Parquet data, you must specify how timestamps are interpreted. These formats apply both to the timeColumn.format field in the JSON configuration (for cURL and URL ingestion) and to the time_format parameter in the Python client. Supported formats fall into two groups: absolute and relative.

Absolute (timestamps tied to a fixed point in time)

Constant                                  Meaning                          Example
TIME_FORMAT_ABSOLUTE_RFC3339              RFC3339 timestamp                2023-01-02T15:04:05Z
TIME_FORMAT_ABSOLUTE_DATETIME             Datetime string                  2023-01-02 15:04:05
TIME_FORMAT_ABSOLUTE_UNIX_SECONDS         Seconds since Unix epoch         1704067200
TIME_FORMAT_ABSOLUTE_UNIX_MILLISECONDS    Milliseconds since Unix epoch    1704067200000
TIME_FORMAT_ABSOLUTE_UNIX_MICROSECONDS    Microseconds since Unix epoch    1704067200000000
TIME_FORMAT_ABSOLUTE_UNIX_NANOSECONDS     Nanoseconds since Unix epoch     1704067200000000000

Relative (timestamps relative to a start time)

Constant                             Units
TIME_FORMAT_RELATIVE_NANOSECONDS     Nanoseconds
TIME_FORMAT_RELATIVE_MICROSECONDS    Microseconds
TIME_FORMAT_RELATIVE_MILLISECONDS    Milliseconds
TIME_FORMAT_RELATIVE_SECONDS         Seconds
TIME_FORMAT_RELATIVE_MINUTES         Minutes
TIME_FORMAT_RELATIVE_HOURS           Hours
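
When using a relative format, pair it with relativeStartTime so Sift can anchor the offsets to an absolute time; a hypothetical timeColumn block (the column name is illustrative):

"timeColumn": {
  "path": "elapsed_seconds",
  "format": "TIME_FORMAT_RELATIVE_SECONDS",
  "relativeStartTime": "2023-01-02T15:04:05Z"
}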

Data types

Data types are specified per Channel in the channelConfig.dataType field of your JSON configuration. You can use the following options:

Constant                       Meaning
CHANNEL_DATA_TYPE_DOUBLE       Double-precision floating-point number
CHANNEL_DATA_TYPE_FLOAT        Single-precision floating-point number
CHANNEL_DATA_TYPE_STRING       String value
CHANNEL_DATA_TYPE_BOOL         Boolean value (true or false)
CHANNEL_DATA_TYPE_INT_32       32-bit signed integer
CHANNEL_DATA_TYPE_INT_64       64-bit signed integer
CHANNEL_DATA_TYPE_UINT_32      32-bit unsigned integer
CHANNEL_DATA_TYPE_UINT_64      64-bit unsigned integer
CHANNEL_DATA_TYPE_ENUM         Enumerated value
CHANNEL_DATA_TYPE_BIT_FIELD    Bit field value
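
ENUM and BIT_FIELD channels also require the enumTypes or bitFieldElements arrays from the configuration reference; hypothetical channelConfig objects for each (names and values are illustrative):

"channelConfig": {
  "name": "mode",
  "dataType": "CHANNEL_DATA_TYPE_ENUM",
  "enumTypes": [
    { "key": 0, "name": "IDLE" },
    { "key": 1, "name": "ACTIVE" }
  ]
}

"channelConfig": {
  "name": "status_flags",
  "dataType": "CHANNEL_DATA_TYPE_BIT_FIELD",
  "bitFieldElements": [
    { "index": 0, "name": "armed", "bitCount": 1 },
    { "index": 1, "name": "error_code", "bitCount": 3 }
  ]
}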

Complex types

Parquet files may contain lists or maps. The complexTypesImportMode field in the Parquet configuration (JSON file) controls how these are ingested. The table below lists the available options:

Mode                                        Behavior
PARQUET_COMPLEX_TYPES_IMPORT_MODE_IGNORE    Skip complex types and do not ingest them.
PARQUET_COMPLEX_TYPES_IMPORT_MODE_BOTH      Import as both Arrow bytes and JSON strings.
PARQUET_COMPLEX_TYPES_IMPORT_MODE_STRING    Import only as JSON strings.
PARQUET_COMPLEX_TYPES_IMPORT_MODE_BYTES     Import only as Arrow bytes.

Default value: If the complexTypesImportMode field is not included in the Parquet configuration JSON file, Sift will use PARQUET_COMPLEX_TYPES_IMPORT_MODE_BOTH by default.
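
For example, to ingest complex columns only as JSON strings, set the field at the top level of parquet_config:

"parquet_config": {
  ...,
  "complexTypesImportMode": "PARQUET_COMPLEX_TYPES_IMPORT_MODE_STRING"
}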

Nested columns

To specify a nested column in your Parquet configuration, use the | character as a separator in the path field. For example, if your Parquet file contains a struct column named location with a nested field lat, set the path as "location|lat". This tells Sift to ingest the nested lat field inside the location struct.
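
For instance, a dataColumns entry for that nested field might look like the following (the channel name is illustrative):

{
  "path": "location|lat",
  "channelConfig": {
    "name": "location.lat",
    "dataType": "CHANNEL_DATA_TYPE_DOUBLE"
  }
}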

Parquet configuration detection

If you do not need to customize how your data is imported, you can use the detect-config endpoint to generate a Parquet configuration automatically. This endpoint analyzes your Parquet file footer and returns a configuration object with most fields pre-filled; you only need to set the time column (timeColumn) and the footer information (footerOffset and footerLength) before uploading. To use this endpoint:

1. Send a POST request with your Parquet file footer to the /api/v0/data-imports:detect-config endpoint (note that this is a v0 endpoint).

2. Review the response, which includes a suggested parquetConfig object generated from the file's footer.

3. Edit the returned configuration to specify the correct time column (timeColumn) and the footer values (footerOffset and footerLength).

4. Use the updated configuration to upload your Parquet data into Sift.

# The following example shows how to extract the footer bytes from a Parquet file and submit
# them to the `detect-config` endpoint to generate a configuration.
import base64
import json
import os
import requests

# sift_uri, apikey, and asset_name are placeholders for your REST API URL,
# API key, and target Asset name.

# Calculate Parquet footer offset and length
file_path = "sample_asset.parquet"
with open(file_path, "rb") as f:
    # extract_footer_information is defined in the "Parquet footer" section below.
    footer_len, footer_offset = extract_footer_information(file_path)
    # Seek to the start of the footer, just before the trailing length + magic bytes
    f.seek(-(footer_len + 8), os.SEEK_END)
    footer_bytes = f.read(footer_len)
 
# Encode footer for detect-config
encoded_data = base64.b64encode(footer_bytes).decode("utf-8")
request_data = json.dumps({
    "data": encoded_data,
    "type": "DATA_TYPE_KEY_PARQUET_FLATDATASET",
})
 
# POST footer to detect-config endpoint
detect_config_url = f"{sift_uri}/api/v0/data-imports:detect-config"
headers = {"authorization": f"Bearer {apikey}"}
response = requests.post(detect_config_url, data=request_data, headers=headers)
response.raise_for_status()
config_info = response.json()
parquet_config = config_info["parquetConfig"]
 
# Update detected config with required fields
parquet_config["assetName"] = asset_name
parquet_config["flatDataset"]["timeColumn"]["path"] = "timestamp"
parquet_config["flatDataset"]["timeColumn"]["format"] = "TIME_FORMAT_ABSOLUTE_DATETIME"
parquet_config["footerOffset"] = footer_offset
parquet_config["footerLength"] = footer_len
 
# Use the updated parquet_config in your upload request to complete ingestion
# ....

Parquet footer

The Parquet footer is a metadata block at the end of every Parquet file. It contains schema information, row group metadata, and other details required for reading the file. When uploading Parquet files to Sift, you must specify the footerOffset (the byte position where the footer starts) and footerLength (the size of the footer in bytes) in your configuration.

[ "PAR1" (4 bytes) ][ row groups ... ][ footer (metadata) ][ footer length (4 bytes) ][ "PAR1" (4 bytes) ]

To populate footerOffset and footerLength in your configuration JSON, you need to extract them from the Parquet file itself. The following example shows one common method in Python; a second approach using pyarrow is sketched after it:

import os
import struct

def extract_footer_information(file_path):
    """Return (footer_len, footer_offset) for a Parquet file."""
    with open(file_path, "rb") as f:
        # Get file size
        f.seek(0, os.SEEK_END)
        file_size = f.tell()

        # Read last 8 bytes: footer length + magic
        f.seek(-8, os.SEEK_END)
        footer_info = f.read(8)

    # First 4 bytes = footer length (little-endian uint32)
    footer_len = struct.unpack("<I", footer_info[:4])[0]

    # Last 4 bytes = magic "PAR1"
    magic = footer_info[4:]
    if magic != b"PAR1":
        raise ValueError("Not a valid Parquet file: missing PAR1 magic bytes")

    # Footer offset = file size - footer_len - 8 (length + magic)
    footer_offset = file_size - footer_len - 8
    return footer_len, footer_offset

footer_len, footer_offset = extract_footer_information("example.parquet")
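
Alternatively, if pyarrow is installed, you can derive the same values from the file metadata; a sketch under the assumption that pyarrow's FileMetaData.serialized_size matches the footer length recorded in the file:

import os
import pyarrow.parquet as pq

file_path = "example.parquet"

# serialized_size is the size in bytes of the thrift-encoded footer metadata
footer_len = pq.ParquetFile(file_path).metadata.serialized_size

# The footer sits just before the trailing 4-byte length and 4-byte "PAR1" magic
footer_offset = os.path.getsize(file_path) - footer_len - 8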