semi-final to pr

This commit is contained in:
Joshua E. Jodesty 2019-10-03 15:31:31 -04:00
parent 8d5657cb6f
commit c7c68a1abe
6 changed files with 218 additions and 60 deletions

README.md

@ -1,50 +1,141 @@
```
__________ ____
________ __ _____/ ____/ | / __ \
/ ___/ __` / __ / / / /| | / / / /
/ /__/ /_/ / /_/ / /___/ ___ |/ /_/ /
\___/\__,_/\__,_/\____/_/ |_/_____/
by BlockScience
======================================
Complex Adaptive Dynamics
o i e
m d s
p e i
u d g
t n
e
r
___ _ __ _ __ __ __
/ _ \ (_)___ / /_ ____ (_)/ / __ __ / /_ ___ ___/ /
/ // // /(_-</ __// __// // _ \/ // // __// -_)/ _ /
/____//_//___/\__//_/ _/_//_.__/\_,_/ \__/ \__/ \_,_/
/ _ \ ____ ___ ___/ /__ __ ____ ___
/ ___// __// _ \/ _ // // // __// -_)
/_/ /_/ \___/\_,_/ \_,_/ \__/ \__/
by Joshua E. Jodesty
```
***cadCAD*** is a Python package that assists in the processes of designing, testing and validating complex systems
through simulation, with support for Monte Carlo methods, A/B testing and parameter sweeping.
## What?: *Description*
***Distributed Produce*** (**[distroduce](distroduce)**) is a message simulation and throughput benchmarking framework /
[cadCAD](https://cadcad.org) execution mode that leverages [Apache Spark](https://spark.apache.org/) and
[Apache Kafka Producer](https://kafka.apache.org/documentation/#producerapi) for optimizing Kafka cluster configurations
and debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
framework to simulate messages sent to Kafka clusters. This enables rapid and iterative design, debugging, and message
publish benchmarking of Kafka clusters and real-time data processing using Kafka Streams and Spark (Structured)
Streaming.
# Getting Started
## 1. Installation:
Requires [Python 3](https://www.python.org/downloads/)
## How?: *A Tail of Two Clusters*
***Distributed Produce*** is a Spark Application used as a cadCAD Execution Mode that distributes Kafka Producers,
message simulation, and message publishing to worker nodes of an EMR cluster. Messages published from these workers are
sent to Kafka topics on a Kafka cluster from a Spark bootstrapped EMR cluster.
**Option A: Install Using [pip](https://pypi.org/project/cadCAD/)**
## Why?: *Use Case*
* **IoT Event / Device Simulation:** Competes with *AWS IoT Device Simulator* and *Azure IoT Solution Acceleration:
Device Simulation*. Unlike these products, *Distributed Produce* enables user-defined state updates and agent actions,
as well as message-publish benchmarking
* **Development Environment for Real-Time Data Processing / Routing:**
## Get Started:
### 0. Set Up Local Development Environment: see [Kafka Quickstart](https://kafka.apache.org/quickstart)
**a.** Install `pyspark`
```bash
pip3 install cadCAD
pip3 install pyspark
```
**b.** Install & Unzip Kafka, Create Kafka `test` topic, and Start Consumer
```bash
sh distroduce/configuration/launch_local_kafka.sh
```
**c.** Run Simulation locally
```bash
zip -rq distroduce/dist/distroduce.zip distroduce/
spark-submit --py-files distroduce/dist/distroduce.zip distroduce/local_messaging_sim.py `hostname | xargs`
```
**Option B:** Build From Source
### 1. Write cadCAD Simulation:
* **Simulation Description:**
To demonstrate *Distributed Produce*, I implemented a simulation of two users interacting over a messaging service.
* **Resources**
* [cadCAD Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [cadCAD Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
* **Terminology:**
* ***[Initial Conditions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-variables)*** - State Variables and their initial values (Start event of Simulation)
```python
initial_conditions = {
'state_variable_1': 0,
'state_variable_2': 0,
'state_variable_3': 1.5,
'timestamp': '2019-01-01 00:00:00'
}
```
* ***[Policy Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#Policy-Functions):*** -
compute one or more signals to be passed to State Update Functions
```python
def policy_function_A(_params, substep, sH, s, kafkaConfig):
    ...
    return {'signal_name': signal_value}
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions):*** -
define how state variables change over time
```python
def state_update_function_A(_params, substep, sH, s, actions, kafkaConfig):
...
return 'state_variable_name', new_value
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **actions** : `dict` - Aggregation of the signals of all Policy Functions in the current substep
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[Partial State Update Block](https://github.com/BlockScience/cadCAD/tree/master/documentation#State-Variables) (PSUB):*** -
a set of State Update Functions and Policy Functions that update state records
![](https://i.imgur.com/9rlX9TG.png)
**Note:** State Update and Policy Functions take an additional, currently undocumented parameter, `kafkaConfig`
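To make the `kafkaConfig` parameter concrete, here is a minimal sketch of a policy that publishes a chat message as a side effect and reports the send as a signal. It assumes the worker has already turned the `producer_config` dict into a live kafka-python `KafkaProducer` stored under a hypothetical `producer` key; the message schema and signal names are illustrative, not the repository's actual ones:

```python
import json
from datetime import datetime

def send_message_policy(_params, substep, sH, s, kafkaConfig):
    """Illustrative policy: emit one chat message per substep and
    report the send as a signal for the State Update Functions."""
    message = {
        'user': 'client_a',
        'text': 'hello',
        'sent_at': datetime.utcnow().isoformat(),
    }
    # Assumption: kafkaConfig carries a ready kafka-python KafkaProducer
    # under 'producer' alongside the 'send_topic' shown in the configs below.
    producer = kafkaConfig.get('producer') if isinstance(kafkaConfig, dict) else None
    if producer is not None:
        producer.send(kafkaConfig['send_topic'], json.dumps(message).encode('utf-8'))
    # Policies return a dict of signals, aggregated into the `actions` dict
    return {'msg_count': 1, 'last_message': message}
```

Passing `kafkaConfig=None` skips the publish, which makes the policy easy to exercise locally without a broker.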
**a.** **Define Policy Functions:**
* [Example:](distroduce/action_policies.py) Two users interacting on separate chat clients and entering / exiting chat
**b.** **Define State Update Functions:**
* [Example:](distroduce/state_updates.py) Used for logging and maintaining state of user actions defined by policies
**c.** **Define Initial Conditions & Partial State Update Block:**
* **Initial Conditions:** [Example](distroduce/messaging_sim.py)
* **Partial State Update Block (PSUB):** [Example](distroduce/simulation.py)
**d.** **Create Simulation Executor:** Used for running a simulation
* [Local](distroduce/local_messaging_sim.py)
* [EMR](distroduce/messaging_sim.py)
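Tying steps a–c together, a Partial State Update Block is a list of dicts pairing Policy Functions with the state variables they drive. A minimal sketch in cadCAD's configuration shape (the function and variable names are illustrative; only the `policies`/`variables` keys follow cadCAD's convention):

```python
def message_policy(_params, substep, sH, s, kafkaConfig):
    # Signal: one message sent this substep
    return {'msg_count': 1}

def update_total_msg_count(_params, substep, sH, s, actions, kafkaConfig):
    # Fold the aggregated policy signal into the running total
    return 'total_msg_count', s['total_msg_count'] + actions['msg_count']

partial_state_update_blocks = [
    {
        'policies': {'messaging': message_policy},
        'variables': {'total_msg_count': update_total_msg_count},
    }
]
```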
### 2. [Configure EMR Cluster](distroduce/configuration/cluster.py)
### 3. Launch EMR Cluster:
**Option A:** Preconfigured Launch
```bash
python3 distroduce/emr/launch.py
```
```bash
pip3 install -r requirements.txt
python3 setup.py sdist bdist_wheel
pip3 install dist/*.whl
```
**Option B:** Custom Launch - [Example](distroduce/emr/launch.py)
```python
from distroduce.emr.launch import launch_cluster
from distroduce.configuration.cluster import ec2_attributes, bootstrap_actions, instance_groups, configurations
region = 'us-east-1'
cluster_name = 'distributed_produce'
launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)
```
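`launch_cluster` is presumably a thin wrapper over EMR's `run_job_flow` API. A hedged sketch of how such a launcher could assemble the request with boto3 (the helper name, release label, and role defaults are assumptions; only the argument names mirror the call above):

```python
def build_run_job_flow_request(cluster_name, ec2_attributes, bootstrap_actions,
                               instance_groups, configurations,
                               release_label='emr-5.27.0'):
    """Assemble kwargs for boto3's EMR run_job_flow call (a sketch)."""
    return {
        'Name': cluster_name,
        'ReleaseLabel': release_label,
        'Applications': [{'Name': 'Spark'}],
        'Instances': {
            'InstanceGroups': instance_groups,
            'Ec2KeyName': ec2_attributes.get('KeyName', ''),
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        'BootstrapActions': bootstrap_actions,
        'Configurations': configurations,
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole',
    }

# The actual launch would then be roughly:
#   import boto3
#   emr = boto3.client('emr', region_name='us-east-1')
#   emr.run_job_flow(**build_run_job_flow_request(...))
```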
## 2. Learn the basics
**Tutorials:** available both as [Jupyter Notebooks](tutorials) and [videos](https://www.youtube.com/watch?v=uJEiYHRWA9g&list=PLmWm8ksQq4YKtdRV-SoinhV6LbQMgX1we)
Familiarize yourself with some system modelling concepts and cadCAD terminology.
## 3. Documentation:
* [System Model Configuration](documentation/Simulation_Configuration.md)
* [System Simulation Execution](documentation/Simulation_Execution.md)
* [Policy Aggregation](documentation/Policy_Aggregation.md)
* [System Model Parameter Sweep](documentation/System_Model_Parameter_Sweep.md)
## 4. Connect
Find other cadCAD users at our [Discourse](https://community.cadcad.org/). We are a small but rapidly growing community.
### 4. Execute Benchmark(s):
* **Step 1:** ssh onto the master node
* **Step 2:** Spark Submit
```bash
spark-submit --master yarn --py-files distroduce.zip messaging_sim.py `hostname | xargs`
```
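The benchmark summary the run prints boils down to messages per second; a one-line sketch of that metric, using the column names from the simulation output (`total_msg_count`, `total_send_time`):

```python
def msgs_per_sec(total_msg_count, total_send_time):
    """Throughput metric reported by the benchmark: messages / seconds."""
    if total_send_time == 0:
        raise ValueError("no send time recorded")
    return total_msg_count / total_send_time

# e.g. 12000 messages sent over 4.0 seconds -> 3000.0 msgs/sec
```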


@ -11,8 +11,9 @@ by Joshua E. Jodesty
```
## What?: *Description*
***Distributed Produce*** (**[distroduce](distroduce)**) is a message simulation and throughput benchmarking framework /
cadCAD execution mode that leverages Apache Spark and Kafka Producer for optimizing Kafka cluster configurations and
debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
[cadCAD](https://cadcad.org) execution mode that leverages [Apache Spark](https://spark.apache.org/) and
[Apache Kafka Producer](https://kafka.apache.org/documentation/#producerapi) for optimizing Kafka cluster configurations
and debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
framework to simulate messages sent to Kafka clusters. This enables rapid and iterative design, debugging, and message
publish benchmarking of Kafka clusters and real-time data processing using Kafka Streams and Spark (Structured)
Streaming.
@ -41,24 +42,65 @@ sh distroduce/configuration/launch_local_kafka.sh
```
**c.** Run Simulation locally
```bash
python3 distroduce/local_messaging_sim.py
zip -rq distroduce/dist/distroduce.zip distroduce/
spark-submit --py-files distroduce/dist/distroduce.zip distroduce/local_messaging_sim.py `hostname | xargs`
```
### 1. Write cadCAD Simulation:
* [Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
### 1. Write cadCAD Simulation:
* **Simulation Description:**
To demonstrate *Distributed Produce*, I implemented a simulation of two users interacting over a messaging service.
* **Resources**
* [cadCAD Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [cadCAD Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
* **Terminology:**
* ***Initial Conditions*** - State Variables and their initial values (Start event of Simulation)
* ***[Initial Conditions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-variables)*** - State Variables and their initial values (Start event of Simulation)
```python
initial_conditions = {
'state_variable_1': 0,
'state_variable_2': 0,
'state_variable_3': 1.5,
'timestamp': '2019-01-01 00:00:00'
}
```
* ***[Policy Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#Policy-Functions):*** -
compute one or more signals to be passed to State Update Functions
* ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions):*** -
```python
def policy_function_A(_params, substep, sH, s, kafkaConfig):
    ...
    return {'signal_name': signal_value}
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions):*** -
define how state variables change over time
```python
def state_update_function_A(_params, substep, sH, s, actions, kafkaConfig):
...
return 'state_variable_name', new_value
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **actions** : `dict` - Aggregation of the signals of all Policy Functions in the current substep
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[Partial State Update Block](https://github.com/BlockScience/cadCAD/tree/master/documentation#State-Variables) (PSUB):*** -
a set of State Update Functions and Policy Functions that update state records
![](https://i.imgur.com/9rlX9TG.png)
**Note:** State Update and Policy Functions take an additional, currently undocumented parameter, `kafkaConfig`
**a.** **Define Policy Functions:**
@ -95,6 +137,5 @@ launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance
* **Step 1:** ssh onto the master node
* **Step 2:** Spark Submit
```
PRIVATE_IP=`hostname -I | xargs`
spark-submit --master yarn --py-files distroduce.zip messaging_sim.py $PRIVATE_IP
spark-submit --master yarn --py-files distroduce.zip messaging_sim.py `hostname | xargs`
```

Binary file not shown.


@ -1,3 +1,4 @@
import sys
from datetime import datetime
from cadCAD import configs
@ -30,8 +31,15 @@ if __name__ == "__main__":
}
)
# Configuration for Kafka Producer
kafkaConfig = {
'send_topic': 'test',
'producer_config': {
'bootstrap_servers': f'{sys.argv[1]}:9092',
'acks': 'all'
}
}
exec_mode = ExecutionMode()
kafkaConfig = {'send_topic': 'test', 'producer_config': {'bootstrap_servers': f'localhost:9092', 'acks': 'all'}}
dist_proc_ctx = ExecutionContext(context=exec_mode.dist_proc, method=distributed_produce, kafka_config=kafkaConfig)
run = Executor(exec_context=dist_proc_ctx, configs=configs, spark_context=sc)


@ -1,3 +1,4 @@
import sys
from datetime import datetime
from cadCAD import configs
@ -30,8 +31,15 @@ if __name__ == "__main__":
}
)
# Configuration for Kafka Producer
kafkaConfig = {
'send_topic': 'test',
'producer_config': {
'bootstrap_servers': f'{sys.argv[1]}:9092',
'acks': 'all'
}
}
exec_mode = ExecutionMode()
kafkaConfig = {'send_topic': 'test', 'producer_config': {'bootstrap_servers': f'{sys.argv[1]}:9092', 'acks': 'all'}}
dist_proc_ctx = ExecutionContext(context=exec_mode.dist_proc, method=distributed_produce, kafka_config=kafkaConfig)
run = Executor(exec_context=dist_proc_ctx, configs=configs, spark_context=sc)


@ -57,21 +57,31 @@ def main(executor, sim_config, intitial_conditions, sim_composition):
i = 0
for raw_result, tensor_field in executor.execute():
result = arrange_cols(pd.DataFrame(raw_result), False)[
result = arrange_cols(pd.DataFrame(raw_result), False)
metrics_result = result[
[
'run_id', 'timestep', 'substep',
'record_creation', 'total_msg_count', 'total_send_time'
]
]
msgs_result = result[
[
'run_id', 'timestep', 'substep',
'record_creation',
'client_a', 'client_b'
]
]
print()
if i == 0:
print(tabulate(tensor_field, headers='keys', tablefmt='psql'))
last = result.tail(1)
last = metrics_result.tail(1)
last['msg_per_sec'] = last['total_msg_count'] / last['total_send_time']
print("Output: Head")
print(tabulate(result.head(5), headers='keys', tablefmt='psql'))
print("Output: Tail")
print(tabulate(result.tail(5), headers='keys', tablefmt='psql'))
print("Messages Output: Head")
print(tabulate(msgs_result.head(5), headers='keys', tablefmt='psql'))
print("Metrics Output: Head")
print(tabulate(metrics_result.head(5), headers='keys', tablefmt='psql'))
print("Metrics Output: Tail")
print(tabulate(metrics_result.tail(5), headers='keys', tablefmt='psql'))
print(tabulate(last, headers='keys', tablefmt='psql'))
print()
i += 1