
   ___   _      __        _  __         __           __
  / _ \ (_)___ / /_ ____ (_)/ /  __ __ / /_ ___  ___/ /
 / // // /(_-</ __// __// // _ \/ // // __// -_)/ _  / 
/____//_//___/\__//_/ _/_//_.__/\_,_/ \__/ \__/ \_,_/  
  / _ \ ____ ___  ___/ /__ __ ____ ___                 
 / ___// __// _ \/ _  // // // __// -_)                
/_/   /_/   \___/\_,_/ \_,_/ \__/ \__/                 
by Joshua E. Jodesty                                                       
                                       

## What?: Description

Distributed Produce (distroduce) is a message simulation and throughput benchmarking framework / cadCAD execution mode that leverages Apache Spark and the Apache Kafka Producer for optimizing Kafka cluster configurations and debugging real-time data transformations. distroduce uses cadCAD's user-defined event simulation template and framework to simulate messages sent to Kafka clusters, enabling rapid, iterative design, debugging, and message-publish benchmarking of Kafka clusters and of real-time data processing built on Kafka Streams and Spark (Structured) Streaming.
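The building block distroduce distributes to Spark workers is a kafka-python producer. A minimal, self-contained sketch of that building block; the broker address, topic name, and message payload below are placeholders matching the local quickstart, not fixed project values:

    from kafka import KafkaProducer

    # Minimal kafka-python producer; 'localhost:9092' and the 'test' topic are
    # placeholders for the local quickstart setup described below.
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    producer.send('test', b'{"user": "a", "action": "send_message"}')
    producer.flush()  # block until the message is actually published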

## How?: A Tale of Two Clusters

Distributed Produce is a Spark application, used as a cadCAD execution mode, that distributes Kafka producers, message simulation, and message publishing across the worker nodes of an EMR cluster. Messages published from these workers are sent from the Spark-bootstrapped EMR cluster to topics on a separate Kafka cluster.

## Why?: Use Cases

  • IoT Event / Device Simulation: competes with AWS IoT Device Simulator and Azure IoT Solution Acceleration: Device Simulation. Unlike these products, Distributed Produce enables user-defined state updates and agent actions, as well as message-publish benchmarking
  • Development Environment for Real-Time Data Processing / Routing

## Get Started

0. Set Up Local Development Environment: see Kafka Quickstart

a. Install pyspark

pip3 install pyspark

b. Install & Unzip Kafka, Create Kafka test topic, and Start Consumer

sh distroduce/configuration/launch_local_kafka.sh

c. Run Simulation locally

zip -rq distroduce/dist/distroduce.zip distroduce/
spark-submit --py-files distroduce/dist/distroduce.zip  distroduce/local_messaging_sim.py `hostname | xargs`

1. Write cadCAD Simulation:

  • Simulation Description: To demonstrate Distributed Produce, I implemented a simulation of two users interacting over a messaging service.
  • Resources
  • Terminology:
    • Initial Conditions - State Variables and their initial values (the starting state of the simulation)

      initial_conditions = {
          'state_variable_1': 0,
          'state_variable_2': 0,
          'state_variable_3': 1.5,
          'timestamp': '2019-01-01 00:00:00'
      }
      
    • Policy Functions - compute one or more signals to be passed to State Update Functions

      def policy_function_A(_params, substep, sH, s, kafkaConfig):
          ...
          return {'signal_1': signal_value}
      

      Parameters:

      • _params : dict - System parameters
      • substep : int - Current substep
      • sH : list[list[dict]] - Historical values of all state variables for the simulation. See Historical State Access for details
      • s : dict - Current state of the system, where the dict_keys are the names of the state variables and the dict_values are their current values.
      • kafkaConfig: kafka.KafkaProducer - Configuration for kafka-python Producer
    • State Update Functions - describe how state variables change over time

      def state_update_function_A(_params, substep, sH, s, actions, kafkaConfig):
          ...
          return 'state_variable_name', new_value
      

      Parameters:

      • _params : dict - System parameters
      • substep : int - Current substep
      • sH : list[list[dict]] - Historical values of all state variables for the simulation. See Historical State Access for details
      • s : dict - Current state of the system, where the dict_keys are the names of the state variables and the dict_values are their current values.
      • actions : dict - Aggregation of the signals of all policy functions in the current substep
      • kafkaConfig: kafka.KafkaProducer - Configuration for kafka-python Producer
    • Partial State Update Block (PSUB) - a set of Policy Functions and State Update Functions that together update the state record at each substep (see the sketch after the note below)

Note: in distroduce, State Update and Policy Functions take an additional, currently undocumented parameter, kafkaConfig.
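A PSUB is expressed as one dict per substep, pairing the policies whose signals are computed first with the state variables those signals update. A minimal sketch of the structure, assuming the standard cadCAD 'policies' / 'variables' keys and the example functions named above:

    partial_state_update_blocks = [
        {
            'policies': {
                'policy_A': policy_function_A                  # signals computed first
            },
            'variables': {
                'state_variable_1': state_update_function_A    # then states updated
            }
        }
    ]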

a. Define Policy Functions:

  • Example: two users interacting on separate chat clients and entering / exiting chat, as sketched below
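A hedged sketch of one such policy; the function, state variable, and signal names are illustrative placeholders rather than the repository's actual implementation:

    def send_message_policy(_params, substep, sH, s, kafkaConfig):
        # If user A is already in the chat, emit a 'send_message' signal;
        # otherwise signal that user A enters the chat.
        # 'user_a_online' is an assumed state variable, defined in step c below.
        if s['user_a_online']:
            return {'action': 'send_message', 'sender': 'user_a', 'body': 'hello'}
        return {'action': 'enter_chat', 'sender': 'user_a', 'body': None}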

b. Define State Update Functions:

  • Example: used for logging user actions defined by policies and maintaining the resulting state, as sketched below
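A hedged sketch of a matching state update function that logs the aggregated policy signals and publishes them as a Kafka message. It assumes kafkaConfig behaves as a kafka-python KafkaProducer (per the parameter list above), that the 'test' topic from the local quickstart exists, and that a 'message_log' state variable is initialized in step c:

    import json

    def update_message_log(_params, substep, sH, s, actions, kafkaConfig):
        # Record the aggregated policy signals and publish the event to Kafka.
        event = {'timestamp': s['timestamp'], **actions}
        kafkaConfig.send('test', json.dumps(event).encode('utf-8'))
        return 'message_log', s['message_log'] + [event]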

c. Define Initial Conditions & Partial State Update Block:

  • Initial Conditions: Example
  • Partial State Update Block (PSUB): Example
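Tying the sketches from steps a and b together, a hedged example of initial conditions and a single PSUB for the messaging simulation (state variable and function names are the same assumptions as above; the sketch omits the update functions that would toggle 'user_a_online' and advance 'timestamp'):

    initial_conditions = {
        'user_a_online': False,
        'message_log': [],
        'timestamp': '2019-01-01 00:00:00'
    }

    partial_state_update_blocks = [
        {
            'policies': {'user_a': send_message_policy},
            'variables': {'message_log': update_message_log}
        }
    ]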

d. Create Simulation Executor: Used for running a simulation
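A hedged sketch of wiring these pieces into an executor, based on the cadCAD 0.3-era API (append_configs / ExecutionContext / Executor). The distroduce-specific execution mode and how the Kafka configuration is supplied are assumptions; see local_messaging_sim.py for the authoritative version.

    from cadCAD.configuration import append_configs
    from cadCAD.configuration.utils import config_sim
    from cadCAD.engine import ExecutionMode, ExecutionContext, Executor
    from cadCAD import configs

    # One simulation run of 10 timesteps (no parameter sweep).
    sim_config = config_sim({'N': 1, 'T': range(10)})

    append_configs(
        sim_configs=sim_config,
        initial_state=initial_conditions,
        partial_state_update_blocks=partial_state_update_blocks
    )

    # ExecutionMode().multi_proc is the stock cadCAD mode; distroduce presumably
    # substitutes its distributed execution mode here (name not confirmed).
    exec_context = ExecutionContext(ExecutionMode().multi_proc)
    simulation = Executor(exec_context, configs)
    raw_result, tensor_field = simulation.execute()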

2. Configure EMR Cluster
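The names passed to launch_cluster mirror the options of aws emr create-cluster. A hedged sketch of the kind of values distroduce/configuration/cluster.py provides; the key pair, subnet, script path, and instance sizing below are placeholders, not the repository's defaults:

    # Placeholder shapes modeled on EMR's create-cluster options; substitute
    # your own key pair, subnet, bootstrap script, and instance sizing.
    ec2_attributes = {
        'KeyName': 'my-ec2-keypair',
        'SubnetId': 'subnet-0123456789abcdef0'
    }

    bootstrap_actions = [
        {
            'Name': 'install_dependencies',
            'ScriptBootstrapAction': {'Path': 's3://my-bucket/bootstrap.sh', 'Args': []}
        }
    ]

    instance_groups = [
        {'Name': 'Master', 'InstanceRole': 'MASTER', 'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
        {'Name': 'Core',   'InstanceRole': 'CORE',   'InstanceType': 'm5.xlarge', 'InstanceCount': 2}
    ]

    configurations = [
        {'Classification': 'spark', 'Properties': {'maximizeResourceAllocation': 'true'}}
    ]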

3. Launch EMR Cluster:

Option A: Preconfigured Launch

python3 distroduce/emr/launch.py

Option B: Custom Launch - Example

from distroduce.emr.launch import launch_cluster
from distroduce.configuration.cluster import ec2_attributes, bootstrap_actions, instance_groups, configurations
region = 'us-east-1'
cluster_name = 'distributed_produce'
launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)

4. Execute Benchmark(s):

  • Step 1: SSH onto the master node
  • Step 2: Spark Submit
    spark-submit --master yarn --py-files distroduce.zip messaging_sim.py `hostname | xargs`