
Sentinel - Concepts

  • Writer: Hung SK
  • May 1
  • 2 min read

The goal of data-sentinel is to perform data ingestion and/or transformation for any context.

That is, any of the following:

  • Any process that applies logic to modify an existing dataset

  • Any process that applies logic to create a new dataset from an existing dataset

  • Any process that imports or exports data to/from different locations (i.e. ingestion)


The application has two main features:

  • enable the use of Apache Spark APIs

  • enable the use of Rule Engine - Drools


Apache Spark

Following Spark's own concepts, its APIs can be broadly categorized as follows (a brief Java sketch follows the list):

  1. Transformations

    1. Any API that builds a transformation plan and returns either a Dataset or a DataFrame

  2. Actions

    1. Any API that triggers the execution of a transformation plan and returns an output

    2. Depending on the type of Action API, the output can be:

      1. logged to the console

      2. saved to memory for later use

      3. written to an external location

        • a file (e.g. CSV, JSON, Parquet, Avro)

        • a database (e.g. Oracle)
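
The distinction can be illustrated with a minimal Java sketch; the SparkSession settings, input path, and column name below are illustrative assumptions, not part of data-sentinel:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkApiSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-api-sketch")
                .master("local[*]")
                .getOrCreate();

        // Transformations: only build a plan, nothing is executed yet
        Dataset<Row> customers = spark.read().parquet("/tmp/customers.parquet"); // hypothetical input
        Dataset<Row> active = customers.filter("status = 'ACTIVE'");             // hypothetical column

        // Actions: trigger execution of the plan and produce an output
        active.show();                                                  // logged to the console
        List<Row> inMemory = active.collectAsList();                    // saved to memory for later use
        active.write().mode("overwrite").csv("/tmp/active_customers"); // written to an external location

        spark.stop();
    }
}
```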


Rule Engine - Drools

For use cases with complex data transformation requirements that cannot be fulfilled by Spark APIs, the application offers the following options:

  1. build customized transformation logic organized as re-usable rules

  2. build custom Spark functions that invoke the rules from option #1


The difference between options #1 and #2 lies in how the rules are invoked (see the sketch after this list).

  • In option #1, rules are invoked directly by the application, so data transformation (i.e. execution of the customized transformation logic) happens sequentially.

  • In option #2, rules are invoked by custom Spark functions, so data transformation can happen in parallel across the available compute resources.
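
As a rough illustration of that difference, here is a minimal sketch assuming a hypothetical Transaction fact class and Drools rules already available on the classpath; the class and method names are illustrative, not data-sentinel's actual API:

```java
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class RuleInvocationSketch {

    // Option #1: the application drives the rules itself, so the logic runs sequentially
    static void applyRulesSequentially(List<Transaction> transactions) {
        KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
        KieSession session = container.newKieSession();
        try {
            transactions.forEach(session::insert);
            session.fireAllRules(); // all transformation logic fires in one sequential pass
        } finally {
            session.dispose();
        }
    }

    // Option #2: the rules run inside a custom Spark function, so Spark distributes the work
    static Dataset<Transaction> applyRulesInParallel(Dataset<Transaction> transactions) {
        return transactions.map((MapFunction<Transaction, Transaction>) tx -> {
            // runs on the executors; one session per record keeps the sketch simple,
            // a real implementation would likely reuse a session per partition
            KieSession session = KieServices.Factory.get()
                    .getKieClasspathContainer()
                    .newKieSession();
            try {
                session.insert(tx);
                session.fireAllRules();
                return tx;
            } finally {
                session.dispose();
            }
        }, Encoders.bean(Transaction.class));
    }

    // Hypothetical fact class used only for illustration
    public static class Transaction implements java.io.Serializable {
        private String id;
        private double amount;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public double getAmount() { return amount; }
        public void setAmount(double amount) { this.amount = amount; }
    }
}
```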


Application Run is Configurable

All parameters required to execute Spark APIs and Drools rules in a single application run are configurable.


Each application run uses a single YAML configuration file, with configs organized in the following hierarchy:

  • Transformation Definitions

    • Data Containers

    • Data Writers


Each Transformation Definition can contain a combination of Data Containers and Data Writers (a binding sketch follows at the end of this section).


Each Data Container contains configs for:

  • Spark functions categorized under Transformation

  • Custom Spark functions as plugins

  • Rules as plugins


Each Data Writer contains configs for:

  • Spark functions categorized under Action

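As an illustration of how this hierarchy could map onto configuration classes in a Spring Boot application, here is a minimal sketch; the property names and the data-sentinel prefix are assumptions for illustration, not the application's actual configuration schema:

```java
import java.util.List;

import org.springframework.boot.context.properties.ConfigurationProperties;

// Hypothetical binding for a YAML file shaped like:
//
// data-sentinel:
//   transformation-definitions:
//     - name: customer-cleanup
//       data-containers:
//         - transformations: [filter, withColumn]
//           plugins: [customRowNormalizer]
//           rules: [customerDedupRules]
//       data-writers:
//         - action: write
//           format: parquet
//           path: /data/out/customers
//
// (register via @EnableConfigurationProperties(DataSentinelProperties.class))
@ConfigurationProperties(prefix = "data-sentinel")
public record DataSentinelProperties(List<TransformationDefinition> transformationDefinitions) {

    // Each Transformation Definition combines Data Containers and Data Writers
    public record TransformationDefinition(
            String name,
            List<DataContainer> dataContainers,
            List<DataWriter> dataWriters) {}

    // Data Container: Spark Transformation functions, plus custom Spark functions and rules as plugins
    public record DataContainer(
            List<String> transformations,
            List<String> plugins,
            List<String> rules) {}

    // Data Writer: Spark functions categorized under Action
    public record DataWriter(
            String action,
            String format,
            String path) {}
}
```
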

List of things to think about (post-MVP):

  • Spark

    • how to systematically expand support for more Spark APIs? (i.e. read/write/transform functions, conf, handling Java options with SparkSession, etc.)

    • how to override Spark's internal classes that handle JDBC connections with an Oracle DataSource created by external libraries?

    • how to fetch and execute plugin classes that implement custom Spark functions?

    • how does data processing differ when running in cluster mode on a platform such as CDP or Databricks?

  • Rules

    • what is the best way to implement Drools as the rule engine?

    • what is the best way to organize rules for different use cases?

    • how to set up unit tests to trigger rules individually?

  • Configurations

    • what is the best way of managing app configurations in a Spring Boot application?

    • configurations provide instructions and parameters for selectively instantiating the classes that execute Spark functions - what is the best way to build this?

    • how to set up exceptions to handle invalid/missing configurations?

    • how to properly document all possible configurations?

  • Legacy Spring Boot Applications

    • how can older applications also opt in to using Spark for data processing without having to migrate to data-sentinel (i.e. migrate and build the extended application as a plugin)?




