
Sentinel - Concepts

  • Writer: Hung SK
  • May 1
  • 2 min read

The goal of data-sentinel is to perform data ingestion and/or transformation for any context.

That is, any of the following:

  • Any process that applies logic to modify an existing dataset

  • Any process that applies logic to create a new dataset from an existing dataset

  • Any process that imports or exports data to/from different locations (i.e. ingestion)


The application has two main features:

  • enable the use of Apache Spark APIs

  • enable the use of Rule Engine - Drools


Apache Spark

Following Spark's own concepts, its APIs can be broadly categorized as follows (a brief Java sketch follows the list):

  1. Transformations

    1. Any API that builds a transformation plan and returns either a Dataset or a DataFrame

  2. Actions

    1. Any API that triggers the execution of a transformation plan and returns an output

    2. Depending on the type of Action API, the output can be:

      1. logged to the console

      2. saved to memory for later use

      3. written to an external location

        • a file (e.g. CSV, JSON, Parquet, Avro)

        • a database (e.g. Oracle)
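
The distinction can be illustrated with a minimal Java sketch; the SparkSession settings, input path, and column name below are illustrative assumptions, not part of data-sentinel:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkApiSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-api-sketch")
                .master("local[*]")
                .getOrCreate();

        // Transformations: only build a plan, nothing is executed yet
        Dataset<Row> customers = spark.read().parquet("/tmp/customers.parquet"); // hypothetical input
        Dataset<Row> active = customers.filter("status = 'ACTIVE'");             // hypothetical column

        // Actions: trigger execution of the plan and produce an output
        active.show();                                                  // logged to the console
        List<Row> inMemory = active.collectAsList();                    // saved to memory for later use
        active.write().mode("overwrite").csv("/tmp/active_customers"); // written to an external location

        spark.stop();
    }
}
```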


Rule Engine - Drools

For use cases with complex data transformation requirements that cannot be fulfilled by Spark APIs, the application offers the following options:

  1. build customized transformation logic organized as re-usable rules

  2. build custom Spark functions that invoke the rules from option #1


The difference between options #1 and #2 lies in how the rules are invoked (see the sketch after this list).

  • In option #1, rules are invoked directly by the application, so data transformation (i.e. execution of the customized transformation logic) happens sequentially.

  • In option #2, rules are invoked by custom Spark functions, so data transformation can happen in parallel across the available compute resources.
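
As a rough illustration of that difference, here is a minimal sketch assuming a hypothetical Transaction fact class and Drools rules already available on the classpath; the class and method names are illustrative, not data-sentinel's actual API:

```java
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

public class RuleInvocationSketch {

    // Option #1: the application drives the rules itself, so the logic runs sequentially
    static void applyRulesSequentially(List<Transaction> transactions) {
        KieContainer container = KieServices.Factory.get().getKieClasspathContainer();
        KieSession session = container.newKieSession();
        try {
            transactions.forEach(session::insert);
            session.fireAllRules(); // all transformation logic fires in one sequential pass
        } finally {
            session.dispose();
        }
    }

    // Option #2: the rules run inside a custom Spark function, so Spark distributes the work
    static Dataset<Transaction> applyRulesInParallel(Dataset<Transaction> transactions) {
        return transactions.map((MapFunction<Transaction, Transaction>) tx -> {
            // runs on the executors; one session per record keeps the sketch simple,
            // a real implementation would likely reuse a session per partition
            KieSession session = KieServices.Factory.get()
                    .getKieClasspathContainer()
                    .newKieSession();
            try {
                session.insert(tx);
                session.fireAllRules();
                return tx;
            } finally {
                session.dispose();
            }
        }, Encoders.bean(Transaction.class));
    }

    // Hypothetical fact class used only for illustration
    public static class Transaction implements java.io.Serializable {
        private String id;
        private double amount;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public double getAmount() { return amount; }
        public void setAmount(double amount) { this.amount = amount; }
    }
}
```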


Application Run is Configurable

All parameters required to execute Spark APIs and Drools rules in a single application run are configurable.


Each application run uses a single YAML configuration file, with configs organized in the following hierarchy:

  • Transformation Definitions

    • Data Containers

    • Data Writers


Each Transformation Definition can contain a combination of Data Containers and Data Writers (a binding sketch follows at the end of this section).


Each Data Container contains configs for:

  • Spark functions categorized under Transformation

  • Custom Spark functions as plugins

  • Rules as plugins


Each Data Writer contains configs for:

  • Spark functions categorized under Action

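As an illustration of how this hierarchy could map onto configuration classes in a Spring Boot application, here is a minimal sketch; the property names and the data-sentinel prefix are assumptions for illustration, not the application's actual configuration schema:

```java
import java.util.List;

import org.springframework.boot.context.properties.ConfigurationProperties;

// Hypothetical binding for a YAML file shaped like:
//
// data-sentinel:
//   transformation-definitions:
//     - name: customer-cleanup
//       data-containers:
//         - transformations: [filter, withColumn]
//           plugins: [customRowNormalizer]
//           rules: [customerDedupRules]
//       data-writers:
//         - action: write
//           format: parquet
//           path: /data/out/customers
//
// (register via @EnableConfigurationProperties(DataSentinelProperties.class))
@ConfigurationProperties(prefix = "data-sentinel")
public record DataSentinelProperties(List<TransformationDefinition> transformationDefinitions) {

    // Each Transformation Definition combines Data Containers and Data Writers
    public record TransformationDefinition(
            String name,
            List<DataContainer> dataContainers,
            List<DataWriter> dataWriters) {}

    // Data Container: Spark Transformation functions, plus custom Spark functions and rules as plugins
    public record DataContainer(
            List<String> transformations,
            List<String> plugins,
            List<String> rules) {}

    // Data Writer: Spark functions categorized under Action
    public record DataWriter(
            String action,
            String format,
            String path) {}
}
```
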

List of things to think about (post-MVP):

  • Spark

    • how to systematically expand support for more Spark APIs? (i.e. read/write/transform functions, conf, handling Java options with SparkSession, etc.)

    • how to override Spark's internal classes that handle JDBC connections with an Oracle DataSource created by external libraries?

    • how to fetch and execute plugin classes that implement custom Spark functions?

    • how does data processing differ when running in cluster mode on a platform such as CDP or Databricks?

  • Rules

    • what is the best way to implement Drools as the rule engine?

    • what is the best way to organize rules for different use cases?

    • how to set up unit tests to trigger rules individually?

  • Configurations

    • what is the best way of managing app configurations in a Spring Boot application?

    • configurations provide instructions and parameters for selectively instantiating the classes that execute Spark functions - what is the best way to build this?

    • how to set up exceptions to handle invalid/missing configurations?

    • how to properly document all possible configurations?

  • Legacy Spring Boot Applications

    • how can older applications also opt in to using Spark for data processing without having to migrate to data-sentinel (i.e. migrate and build the extended application as a plugin)?




