
   ___   _      __        _  __         __           __
  / _ \ (_)___ / /_ ____ (_)/ /  __ __ / /_ ___  ___/ /
 / // // /(_-</ __// __// // _ \/ // // __// -_)/ _  / 
/____//_//___/\__//_/ _/_//_.__/\_,_/ \__/ \__/ \_,_/  
  / _ \ ____ ___  ___/ /__ __ ____ ___                 
 / ___// __// _ \/ _  // // // __// -_)                
/_/   /_/   \___/\_,_/ \_,_/ \__/ \__/                 
by Joshua E. Jodesty                                                       
                                       

## What?: Description

Distributed Produce (distroduce) is a distributed message-simulation and throughput-benchmarking framework and cadCAD execution mode that leverages Apache Spark and the Apache Kafka Producer API to optimize Kafka cluster configurations and to debug real-time data transformations. distroduce uses cadCAD's user-defined event-simulation template and framework to simulate messages sent to Kafka clusters. This enables rapid, iterative design, debugging, and message-publish benchmarking of Kafka clusters and of real-time data processing built on Kafka Streams and Spark (Structured) Streaming.

## How?: A Tail of Two Clusters

Distributed Produce is a Spark application, used as a cadCAD execution mode, that distributes Kafka Producers, message simulation, and message publishing across the worker nodes of an AWS EMR cluster. Messages published from these workers on the Spark-bootstrapped EMR cluster are sent to Kafka topics on a separate Kafka cluster.

## Why?: Use Cases

  • IoT Event / Device Simulation: Competes with AWS IoT Device Simulator and Azure IoT Solution Accelerator: Device Simulation. Unlike these products, Distributed Produce enables user-defined state updates and agent actions, as well as message-publish benchmarking.
  • Development Environment for Real-Time Data Processing / Routing

## Get Started:

0. Set Up Local Development Environment: see Kafka Quickstart

a. Install pyspark

pip3 install pyspark

b. Install & Unzip Kafka, Create Kafka test topic, and Start Consumer

sh distroduce/configuration/launch_local_kafka.sh

c. Run Simulation locally

zip -rq distroduce/dist/distroduce.zip distroduce/
spark-submit --py-files distroduce/dist/distroduce.zip  distroduce/local_messaging_sim.py `hostname | xargs`

1. Write cadCAD Simulation:

  • Simulation Description: To demonstrate Distributed Produce, I implemented a simulation of two users interacting over a messaging service.
  • cadCAD Resources:
  • Terminology:
    • Initial Conditions - State Variables and their initial values (the state of the system at the start of the simulation)

      initial_conditions = {
          'state_variable_1': 0,
          'state_variable_2': 0,
          'state_variable_3': 1.5,
          'timestamp': '2019-01-01 00:00:00'
      }
      
    • Policy Functions - compute one or more signals to be passed to State Update Functions

      def policy_function_A(_params, substep, sH, s, kafkaConfig):
          ...
          return {'signal_1': signal_value}

      Parameters:

      • _params : dict - System parameters
      • substep : int - Current substep
      • sH : list[list[dict]] - Historical values of all state variables for the simulation. See Historical State Access for details
      • s : dict - Current state of the system, where the dict_keys are the names of the state variables and the dict_values are their current values.
      • kafkaConfig: kafka.KafkaProducer - Configuration for kafka-python Producer
    • State Update Functions - update State Variables over time

      def state_update_function_A(_params, substep, sH, s, actions, kafkaConfig):
          ...
          return 'state_variable_name', new_value

      Parameters:

      • _params : dict - System parameters
      • substep : int - Current substep
      • sH : list[list[dict]] - Historical values of all state variables for the simulation. See Historical State Access for details
      • s : dict - Current state of the system, where the dict_keys are the names of the state variables and the dict_values are their current values.
      • actions : dict - Aggregation of the signals of all Policy Functions in the current substep
      • kafkaConfig: kafka.KafkaProducer - Configuration for kafka-python Producer
    • Partial State Update Block (PSUB) - a set of Policy Functions and State Update Functions that update state records

Note: State Update and Policy Functions take an additional, currently undocumented parameter, kafkaConfig. A usage sketch follows.
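For example, assuming kafkaConfig is the kafka-python KafkaProducer instance described above, a function can publish each simulated message as a side effect. The topic name and message fields below are illustrative, not distroduce's actual message schema:

import json

# A minimal sketch, assuming `kafkaConfig` is a kafka-python
# KafkaProducer. The topic ('test') and message fields are
# placeholders, not part of distroduce's actual message schema.
def send_message_policy(_params, substep, sH, s, kafkaConfig):
    message = {'user': 'user_1', 'text': 'hello', 'ts': s['timestamp']}
    # KafkaProducer.send publishes asynchronously; values must be bytes
    # unless a value_serializer was configured on the producer.
    kafkaConfig.send('test', json.dumps(message).encode('utf-8'))
    return {'messages_sent': 1}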

a. Define Policy Functions:

  • Example: Two users interacting on separate chat clients and entering / exiting chat
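A minimal sketch of what one such policy might look like; the user_1_status state variable and user_1_action signal are assumptions for illustration, and the real implementations live in action_policies.py:

# Hypothetical policy for the two-user chat example; names are
# illustrative, see distroduce/action_policies.py for the real ones.
def user_1_policy(_params, substep, sH, s, kafkaConfig):
    if s['user_1_status'] == 'offline':
        return {'user_1_action': 'enter'}    # join the chat
    return {'user_1_action': 'message'}      # already online: send a message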

b. Define State Update Functions:

  • Example: Used for logging and maintaining state of user actions defined by policies
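A companion sketch to the hypothetical policy above: the state update reads the aggregated actions dict and returns the new value of a single state variable (names remain illustrative; see state_updates.py for the actual implementations):

# Hypothetical state update maintaining user_1's status from the
# aggregated policy signals.
def update_user_1_status(_params, substep, sH, s, actions, kafkaConfig):
    if actions.get('user_1_action') == 'enter':
        return 'user_1_status', 'online'
    if actions.get('user_1_action') == 'exit':
        return 'user_1_status', 'offline'
    return 'user_1_status', s['user_1_status']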

c. Define Initial Conditions & Partial State Update Block:

  • Initial Conditions: Example
  • Partial State Update Block (PSUB): Example
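Wired together, the hypothetical pieces sketched in steps a and b would look roughly like this (cadCAD expects a list of PSUB dicts with 'policies' and 'variables' keys):

# Illustrative initial conditions and a single PSUB wiring together
# the hypothetical functions sketched above.
initial_conditions = {
    'user_1_status': 'offline',
    'timestamp': '2019-01-01 00:00:00'
}

partial_state_update_blocks = [
    {
        'policies': {'user_1': user_1_policy},
        'variables': {'user_1_status': update_user_1_status}
    }
]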

d. Create Simulation Executor: Used for running a simulation
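distroduce provides its own Spark-backed execution mode (see simulation.py and local_messaging_sim.py). For orientation only, this is roughly how the pieces above would be wired with cadCAD's stock 2019-era engine; exact names and return shapes vary across cadCAD versions and differ from distroduce's distributed mode:

# Orientation-only sketch using cadCAD's stock engine API (circa 2019);
# distroduce swaps in its own distributed execution mode.
from cadCAD.configuration import append_configs
from cadCAD.configuration.utils import config_sim
from cadCAD.engine import ExecutionMode, ExecutionContext, Executor
from cadCAD import configs

append_configs(
    sim_configs=config_sim({'N': 1, 'T': range(10)}),  # 1 run, 10 timesteps
    initial_state=initial_conditions,
    partial_state_update_blocks=partial_state_update_blocks
)

exec_context = ExecutionContext(ExecutionMode().single_proc)
raw_result, tensor_field = Executor(exec_context, configs).execute()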

2. Configure EMR Cluster
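The launch step below imports ec2_attributes, bootstrap_actions, instance_groups, and configurations from distroduce/configuration/cluster.py. As a rough sketch of their shape only, here are illustrative values in the style of the EMR API; the actual settings live in that module:

# Illustrative EMR settings in the shape accepted by EMR's APIs;
# all values here are placeholders, not distroduce's actual defaults.
ec2_attributes = {
    'KeyName': 'my-key-pair',               # assumption: your EC2 key pair
    'InstanceProfile': 'EMR_EC2_DefaultRole'
}
instance_groups = [
    {'Name': 'Master', 'InstanceRole': 'MASTER',
     'InstanceType': 'm4.large', 'InstanceCount': 1},
    {'Name': 'Workers', 'InstanceRole': 'CORE',
     'InstanceType': 'm4.large', 'InstanceCount': 2}
]
bootstrap_actions = [
    {'Name': 'install_dependencies',
     'ScriptBootstrapAction': {'Path': 's3://my-bucket/bootstrap.sh'}}
]
configurations = [
    {'Classification': 'spark-env', 'Configurations': [], 'Properties': {}}
]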

3. Launch EMR Cluster:

Option A: Preconfigured Launch

python3 distroduce/emr/launch.py

Option B: Custom Launch - Example

from distroduce.emr.launch import launch_cluster
from distroduce.configuration.cluster import ec2_attributes, bootstrap_actions, instance_groups, configurations
region = 'us-east-1'
cluster_name = 'distributed_produce'
launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)

4. Execute Benchmark(s) on EMR:

  • Step 1: Zip the distroduce package and copy it to the master node
    zip -rq distroduce/dist/distroduce.zip distroduce/
  • Step 2: ssh onto the master node
  • Step 3: Spark submit
    spark-submit --master yarn --py-files distroduce.zip messaging_sim.py `hostname | xargs`