Introduction

Features

The DataBridge Engine is a lightweight ETL engine. It has built-in support for:

  • File ingestion through SFTP, with support for the following file formats:
    • CSV
    • Excel
    • JSON
  • Source connectors
  • Data Transformations
    • Configured through SQL statements
    • Configured through dedicated transformation functions
  • Data Validation
    • Failed records are delivered to a dead-letter queue, from which they can be returned to the user
  • Destination connectors

Data Frame

A core concept of the DataBridge Engine is the Data Frame. An Excel file, a CSV file, a JSON file or data from a source system is converted to a Data Frame in the first stage of the pipeline. A concise definition is:

Data Frame

Two-dimensional, tabular data structure (columns and rows), similar to a spreadsheet or a SQL table

A Data Frame name must be 1 to 63 characters long and may contain only the following characters:

  • a-z
  • 0-9
  • _

For example, my-data-frame, my data frame and MY_DATA_FRAME are illegal Data Frame names, whereas my_data_frame is a legal one. Files whose names follow these rules are handled transparently by the pipeline, so it is recommended to name your files accordingly (e.g. my_data.xlsx is automatically converted to the my_data Data Frame).
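The naming rule can be expressed as a single regular expression. The following is a minimal sketch of such a check, not the engine's own implementation; the function name is_valid_data_frame_name is a hypothetical name chosen for illustration.

```python
import re

# Naming rule as stated above: 1 to 63 characters drawn from a-z, 0-9 and "_".
DATA_FRAME_NAME = re.compile(r"[a-z0-9_]{1,63}")


def is_valid_data_frame_name(name: str) -> bool:
    """Return True if `name` satisfies the Data Frame naming rule."""
    # fullmatch requires the whole string to match, so hyphens, spaces and
    # uppercase letters anywhere in the name cause a rejection.
    return DATA_FRAME_NAME.fullmatch(name) is not None


# Hypothetical usage with the four names discussed above.
for candidate in ("my_data_frame", "my-data-frame", "my data frame", "MY_DATA_FRAME"):
    print(candidate, is_valid_data_frame_name(candidate))
```

Only my_data_frame passes; the other three fail because of the hyphen, the space and the uppercase letters respectively.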

Data Frames can be generated with or without a dedicated configuration, with the respective Data Frame types defined as follows:

Transparent Data Frame

A Data Frame generated from a source file whose name complies with the Data Frame naming rules and which is a CSV file, a JSON file or a single-sheet, unencrypted Excel file.

In simple terms, the file my_data_frame.xlsx is automatically converted to the my_data_frame Data Frame for later steps in the data pipeline.
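The mapping from file name to Transparent Data Frame name can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: the function name transparent_data_frame_name is hypothetical, the .csv and .json extensions are assumed (only .xlsx appears in the text), and the extension match is assumed to be case-sensitive. The sketch inspects the file name only; it cannot tell whether an Excel file is single-sheet and unencrypted.

```python
import re
from pathlib import PurePath
from typing import Optional

# Assumed extensions for the three supported formats; only .xlsx is confirmed by the text.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".xlsx"}
NAME_RULE = re.compile(r"[a-z0-9_]{1,63}")


def transparent_data_frame_name(filename: str) -> Optional[str]:
    """Return the Data Frame name implied by `filename`, or None if the
    file name alone rules out a Transparent Data Frame."""
    path = PurePath(filename)
    if path.suffix not in SUPPORTED_EXTENSIONS:
        return None
    # The stem is the file name without its final extension.
    return path.stem if NAME_RULE.fullmatch(path.stem) else None


print(transparent_data_frame_name("my_data_frame.xlsx"))
print(transparent_data_frame_name("Workers Full.xlsx"))
```

The first call yields my_data_frame; the second yields None because the stem contains a space and uppercase letters.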

Configured Data Frame

A Data Frame generated with the assistance of a configuration section in drop_zone_files. Reasons for using a Configured Data Frame may include:

  • Limited control over the naming of input data files (e.g. the input data file is named Workers Full.xlsx)
  • Input data file is encrypted (Excel password or PGP encryption)
  • Input data file is an Excel file with multiple sheets
  • Input data file is an Excel file where the header row is in row 2 (instead of row 1 as expected)

See drop_zone_files for a full list of configuration options.

Files that don't conform to the Data Frame naming rules require an entry in the drop_zone_files configuration.
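The reasons listed above for needing a Configured Data Frame can be collected into one check. The sketch below is illustrative only: the SourceFile descriptor and its field names are hypothetical and are not drop_zone_files options, whose actual keys are documented in the drop_zone_files reference.

```python
import re
from dataclasses import dataclass
from typing import List

NAME_RULE = re.compile(r"[a-z0-9_]{1,63}")


@dataclass
class SourceFile:
    # Hypothetical descriptor; field names are illustrative, not engine options.
    stem: str                 # file name without its extension
    is_excel: bool = False
    encrypted: bool = False   # Excel password or PGP encryption
    sheet_count: int = 1
    header_row: int = 1       # 1-based row number of the header


def configuration_reasons(f: SourceFile) -> List[str]:
    """Return the reasons why `f` would need a drop_zone_files entry.
    An empty list means the file qualifies as a Transparent Data Frame."""
    reasons = []
    if not NAME_RULE.fullmatch(f.stem):
        reasons.append("file name does not follow the Data Frame naming rules")
    if f.encrypted:
        reasons.append("file is encrypted")
    if f.is_excel and f.sheet_count > 1:
        reasons.append("Excel file has multiple sheets")
    if f.is_excel and f.header_row != 1:
        reasons.append("header row is not row 1")
    return reasons


print(configuration_reasons(SourceFile(stem="Workers Full", is_excel=True, header_row=2)))
```

For Workers Full.xlsx with its header in row 2, the check reports two reasons: the file name and the header row.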

Operation

The DataBridge Engine continuously monitors SFTP areas for a file named config.json. Once a file with that name is copied to the SFTP server, a DataBridge Engine Pipeline is executed according to instructions from the configuration file.
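The trigger condition can be sketched as a scan for files named exactly config.json. This is a rough illustration, not the engine's monitoring code: a local directory stands in for the SFTP areas, the function name find_trigger_files is hypothetical, and whether the engine searches subdirectories recursively is an assumption.

```python
from pathlib import Path
from typing import List

TRIGGER_NAME = "config.json"  # file name the engine watches for, per the text above


def find_trigger_files(root: Path) -> List[Path]:
    """Return every file named config.json under `root`, sorted for stable output."""
    # rglob matches the pattern against the full file name at any depth,
    # so a file such as my_config.json is not picked up.
    return sorted(p for p in root.rglob(TRIGGER_NAME) if p.is_file())
```

A monitoring loop would call such a function periodically and start one pipeline run per configuration file found; the polling interval and how processed files are marked are left open here.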

The operation is described with the diagram below:

[Figure: DataBridge Engine Operation]