readme pt. 3: new build
parent 8cf3b09582, commit ea5df50659
@@ -9,20 +9,24 @@

by Joshua E. Jodesty

## What?: *Description*

***Distributed Produce*** (**[distroduce](distroduce)**) is a message simulation and throughput benchmarking framework /
cadCAD execution mode that leverages Apache Spark and the Kafka Producer API for optimizing Kafka cluster configurations
and debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
framework to simulate messages sent to Kafka clusters. This enables rapid and iterative design, debugging, and message
publish benchmarking of Kafka clusters and real-time data processing using Kafka Streams and Spark (Structured)
Streaming.
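As a concrete illustration of the kind of event simulation described above, the sketch below generates chat-style events serialized as JSON payloads, the sort of message values a Kafka Producer would publish. The `simulate_messages` helper, the payload shape, and the user/room names are hypothetical illustrations, not distroduce's actual API or schema.

```python
import json

def simulate_messages(users, rooms, steps):
    """Generate chat-style events (hypothetical payload shape, not distroduce's schema)."""
    events = []
    for t in range(steps):
        for i, user in enumerate(users):
            room = rooms[(t + i) % len(rooms)]
            # Each JSON string is what a Kafka Producer would publish as a message value.
            events.append(json.dumps(
                {'ts': t, 'user': user, 'room': room, 'action': 'send_message'}))
    return events

msgs = simulate_messages(['user_a', 'user_b'], ['room_1', 'room_2'], steps=3)
print(len(msgs))  # 6 messages: 2 users x 3 timesteps
```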
## How?: *A Tail of Two Clusters*

***Distributed Produce*** is a Spark Application used as a cadCAD Execution Mode that distributes Kafka Producers,
message simulation, and message publishing to worker nodes of an EMR cluster. Messages published from these workers are
sent to Kafka topics on a Kafka cluster from a Spark-bootstrapped EMR cluster.
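The fan-out described above is done by Spark in distroduce, with each EMR worker task creating its own Kafka Producer for its partition of simulated messages. As a rough local analogue of that pattern only, the sketch below uses threads in place of Spark executors, and `publish_partition` is a hypothetical stand-in that counts messages instead of publishing them.

```python
from concurrent.futures import ThreadPoolExecutor

def publish_partition(batch):
    # Stand-in for a Spark task: in distroduce each worker would create a
    # Kafka Producer and publish its batch to a topic; here we only count.
    return len(batch)

messages = [f'msg-{i}' for i in range(10)]
# Split the simulated messages into per-worker partitions (2 workers).
partitions = [messages[0::2], messages[1::2]]
with ThreadPoolExecutor(max_workers=2) as pool:
    published = sum(pool.map(publish_partition, partitions))
print(published)  # 10
```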
## Why?: *Use Cases*

* **IoT Event / Device Simulation:** Competes with *AWS IoT Device Simulator* and *Azure IoT Solution Acceleration: Device Simulation*. Unlike these products, *Distributed Produce* enables user-defined state updates and agent actions, as well as message publish benchmarking.
* **Development Environment for Real-Time Data Processing / Routing**

## Get Started:
@@ -40,29 +44,32 @@

```bash
sh distroduce/configuration/launch_local_kafka.sh
python3 distroduce/local_messaging_sim.py
```

### 1. Write cadCAD Simulation:
* [Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
* **Terminology:**
    * ***Initial Conditions*** - State Variables and their initial values (the start event of a simulation)
    * ***[Policy Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#Policy-Functions)*** - compute one or more signals to be passed to State Update Functions
    * ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions)*** - update state variables over time
    * ***[Partial State Update Block](https://github.com/BlockScience/cadCAD/tree/master/documentation#State-Variables) (PSUB)*** - a set of State Update Functions and Policy Functions that update state records

![](images/sim.png)
**Note:** State Update and Policy Functions now have the additional / undocumented parameter `kafkaConfig`
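To make the terminology concrete, here is a minimal sketch of one PSUB that also threads through the extra `kafkaConfig` parameter mentioned in the note. The function signatures, driver loop, and config contents below are simplified illustrations, not cadCAD's actual engine or distroduce's real `kafkaConfig`.

```python
# Hypothetical, simplified cadCAD-style PSUB (not the real engine).
initial_state = {'messages_sent': 0}                     # Initial Conditions
kafka_config = {'bootstrap_servers': 'localhost:9092'}   # placeholder contents

def send_policy(params, substep, history, prev_state, kafkaConfig):
    # Policy Function: computes a signal passed to State Update Functions.
    return {'to_send': 2}

def update_sent(params, substep, history, prev_state, signal, kafkaConfig):
    # State Update Function: returns the updated (key, value) pair.
    return 'messages_sent', prev_state['messages_sent'] + signal['to_send']

def run_psub(state, timesteps):
    # One PSUB applied per timestep: policy first, then state update.
    history = []
    for t in range(timesteps):
        signal = send_policy({}, 0, history, state, kafka_config)
        key, value = update_sent({}, 0, history, state, signal, kafka_config)
        state = {**state, key: value}
        history.append(state)
    return state

final = run_psub(initial_state, timesteps=3)
print(final['messages_sent'])  # 6
```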
**a.** **Define Policy Functions:**
* [Example:](distroduce/action_policies.py) Two users interacting on separate chat clients and entering / exiting chat rooms
**b.** **Define State Update Functions:**
* [Example:](distroduce/state_updates.py) Used for logging and maintaining state of user actions defined by policies
**c.** **Define Initial Conditions & Partial State Update Block:**
* **Initial Conditions:** [Example](distroduce/messaging_sim.py)
* **Partial State Update Block (PSUB):** [Example](distroduce/simulation.py)
**d.** **Create Simulation Executor:** Used for running a simulation
* [Local](distroduce/local_messaging_sim.py)
@@ -71,6 +78,11 @@

### 2. [Configure EMR Cluster](distroduce/configuration/cluster.py)

### 3. Launch EMR Cluster:
**Option A:** Preconfigured Launch
```bash
python3 distroduce/emr/launch.py
```
**Option B:** Custom Launch - [Example](distroduce/emr/launch.py)
```python
from distroduce.emr.launch import launch_cluster
from distroduce.configuration.cluster import ec2_attributes, bootstrap_actions, instance_groups, configurations

region = 'us-east-1'
cluster_name = 'distibuted_produce'
launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)
```
Binary file not shown.
@@ -0,0 +1,39 @@

```python
import os, json, boto3
from typing import List


def launch_cluster(name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations):
    def log_uri(name, region):
        return f's3n://{name}-{region}/elasticmapreduce/'

    os.system(f"""
    aws emr create-cluster \
    --applications Name=Hadoop Name=Hive Name=Spark \
    --ec2-attributes '{json.dumps(ec2_attributes)}' \
    --release-label emr-5.26.0 \
    --log-uri '{str(log_uri(name, region))}' \
    --instance-groups '{json.dumps(instance_groups)}' \
    --configurations '{json.dumps(configurations)}' \
    --auto-scaling-role EMR_AutoScaling_DefaultRole \
    --bootstrap-actions '{json.dumps(bootstrap_actions)}' \
    --ebs-root-volume-size 10 \
    --service-role EMR_DefaultRole \
    --enable-debugging \
    --name '{name}' \
    --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
    --region {region}
    """)


# def benchmark(names: List[str], region, ec2_attributes, bootstrap_actions, instance_groups, configurations):
#     current_dir = os.path.dirname(__file__)
#     s3 = boto3.client('s3')
#     bucket = 'insightde'
#
#     file = 'distroduce.sh'
#     abs_path = os.path.join(current_dir, file)
#     key = f'emr/bootstraps/{file}'
#
#     s3.upload_file(abs_path, bucket, key)
#     for name in names:
#         launch_cluster(name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)
```
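As a usage note, the inner `log_uri` helper derives the EMR log path from the cluster name and region, which implies an S3 bucket named `{name}-{region}` must already exist. Reproduced standalone below to show the convention (`distroduce-bench` is a hypothetical cluster name):

```python
def log_uri(name, region):
    # Same convention as the helper inside launch_cluster.
    return f's3n://{name}-{region}/elasticmapreduce/'

# 'distroduce-bench' is a hypothetical cluster name.
print(log_uri('distroduce-bench', 'us-east-1'))
# s3n://distroduce-bench-us-east-1/elasticmapreduce/
```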
@@ -1,39 +1,5 @@

```python
from distroduce.emr import launch_cluster
from distroduce.configuration.cluster import ec2_attributes, bootstrap_actions, instance_groups, configurations

region = 'us-east-1'
cluster_name = 'distibuted_produce'
launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)
```