Sentinel - Concepts
- Hung SK
- May 1
The goal of data-sentinel is to perform data ingestion and/or transformation in any context, i.e.:
Any process that applies logic to modify an existing dataset
Any process that applies logic to create a new dataset from an existing dataset
Any process that imports / exports data to or from different locations (i.e. ingestion)
The application has two main features:
enable the use of Apache Spark APIs
enable the use of a rule engine (Drools)
Apache Spark
Following Spark's own concepts, its APIs can be broadly categorized into the following (a short sketch follows this list):
Transformations
Any API that builds a transformation plan and returns either a Dataset or a DataFrame
Actions
Any API that triggers the execution of a transformation plan and returns an output
Depending on the type of Action API, the output can be:
logged to the console
saved to memory for later use
written to an external location
file (e.g. CSV, JSON, Parquet, Avro)
db (e.g. Oracle)
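A minimal Java sketch of the distinction, assuming a local SparkSession and a hypothetical CSV input with id and status columns (paths, column names, and the output location are illustrative only):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformationVsActionDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sentinel-concepts-demo")
                .master("local[*]")              // assumption: local run for illustration
                .getOrCreate();

        // Transformations: lazily build a plan, nothing executes yet
        Dataset<Row> source = spark.read()
                .option("header", "true")
                .csv("input.csv");               // hypothetical input path
        Dataset<Row> active = source
                .filter("status = 'ACTIVE'")     // transformation
                .select("id", "status");         // transformation

        // Actions: trigger execution of the plan and produce an output
        active.show();                           // logged to the console
        active.cache();                          // marks the result to be saved to memory on the next action
        long count = active.count();             // action returning a value to the driver
        System.out.println("active rows: " + count);

        // Action writing to an external location (file)
        active.write().mode("overwrite").parquet("output/active");

        spark.stop();
    }
}
```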
Rule Engine - Drools
For use cases with complex data transformation requirements that cannot be fulfilled by Spark APIs alone, the application offers the following options (sketched below):
build customized transformation logic organized as re-usable rules
build custom Spark functions that execute the rules from the first option
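A minimal sketch of the second option, assuming Drools rules are packaged on the classpath with a default KieSession and a hypothetical Trade fact class (class, field, and session names are illustrative, not data-sentinel's actual API). The KieContainer is created inside the partition function so it is instantiated on each executor rather than serialized from the driver:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class RuleTransform {

    // Hypothetical fact class that the rules reason over
    public static class Trade implements Serializable {
        private String id;
        private double amount;
        private boolean flagged;

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public double getAmount() { return amount; }
        public void setAmount(double amount) { this.amount = amount; }
        public boolean isFlagged() { return flagged; }
        public void setFlagged(boolean flagged) { this.flagged = flagged; }
    }

    // Custom Spark function that applies Drools rules to every record in a partition
    public static Dataset<Trade> applyRules(Dataset<Trade> trades) {
        return trades.mapPartitions((MapPartitionsFunction<Trade, Trade>) partition -> {
            KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
            KieSession session = container.newKieSession();   // default session from kmodule.xml
            List<Trade> out = new ArrayList<>();
            while (partition.hasNext()) {
                Trade trade = partition.next();
                session.insert(trade);       // hand the record to the rule engine as a fact
                session.fireAllRules();      // rules may mutate the fact, e.g. set flagged = true
                out.add(trade);
            }
            session.dispose();
            return out.iterator();
        }, Encoders.bean(Trade.class));
    }
}
```

The rules themselves would live in .drl files on the classpath, which is what keeps them re-usable across different custom Spark functions.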
Application Run is Configurable
All parameters required to execute Spark APIs and Drools rules in a single application run are configurable.
Each application run uses one configuration YAML file, with configs organized in the following hierarchy (a sketch of one possible property model follows this list):
Transformation Definitions
Data Containers
Data Writers
Each Transformation Definition can contain a combination of Data Containers and Data Writers.
Each Data Container contains configs for:
Spark functions categorized under Transformation
Custom Spark functions as Plugin
Rules as Plugin
Each Data Writer contains configs for:
Spark functions categorized under Action
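A hedged sketch of how this hierarchy could be bound with Spring Boot configuration properties, assuming Spring Boot 3.x (record-based constructor binding) and illustrative property names that may not match data-sentinel's actual keys:

```java
import java.util.List;
import java.util.Map;

import org.springframework.boot.context.properties.ConfigurationProperties;

// Register via @ConfigurationPropertiesScan or @EnableConfigurationProperties(SentinelProperties.class)
@ConfigurationProperties(prefix = "sentinel")
public record SentinelProperties(List<TransformationDefinition> transformationDefinitions) {

    // Each Transformation Definition combines Data Containers and Data Writers
    public record TransformationDefinition(
            String name,
            List<DataContainer> dataContainers,
            List<DataWriter> dataWriters) {
    }

    // Data Container: Spark functions under Transformation, plus custom functions and rules as plugins
    public record DataContainer(
            List<Map<String, Object>> transformations,
            List<String> pluginFunctions,
            List<String> rulePlugins) {
    }

    // Data Writer: a Spark function under Action (console, memory, file, db)
    public record DataWriter(Map<String, Object> action) {
    }
}
```

With Spring's relaxed binding, a YAML key such as sentinel.transformation-definitions would map onto the transformationDefinitions list above.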
List of things to think about (post-MVP):
Spark
how to systematically expand the set of Spark APIs that are enabled? (e.g. read/write/transform functions, conf, handling Java options with SparkSession, etc.)
how to override Spark's internal classes for handling JDBC connections with an Oracle DataSource created by external libraries?
how to fetch and execute plugin classes that implement custom Spark functions?
how is data processing different when running in cluster mode on a CDP (e.g. Databricks)?
Rules
what is the best way to implement the use of Drools as the rule engine?
what is the best way to organize rules for different use cases?
how to set up unit tests to trigger rules individually?
Configurations
what is the best way of managing app configurations in a Spring Boot application?
configurations provide instructions and parameters to selectively instantiate classes that execute Spark functions - what is the best way to build this?
how to set up exceptions to manage invalid/missing configurations?
how to document all possible configurations properly?
Legacy Spring Boot Applications
how can older applications also opt in to using Spark for data processing without having to migrate to data-sentinel? (i.e. migrate and build an extended application as a plugin)
