semi-final to pr

This commit is contained in:
Joshua E. Jodesty 2019-10-03 15:31:31 -04:00
parent 8d5657cb6f
commit c7c68a1abe
6 changed files with 218 additions and 60 deletions

README.md

@ -1,50 +1,141 @@
```
__________ ____
________ __ _____/ ____/ | / __ \
/ ___/ __` / __ / / / /| | / / / /
/ /__/ /_/ / /_/ / /___/ ___ |/ /_/ /
\___/\__,_/\__,_/\____/_/ |_/_____/
by BlockScience
======================================
Complex Adaptive Dynamics
o i e
m d s
p e i
u d g
t n
e
r
___ _ __ _ __ __ __
/ _ \ (_)___ / /_ ____ (_)/ / __ __ / /_ ___ ___/ /
/ // // /(_-</ __// __// // _ \/ // // __// -_)/ _ /
/____//_//___/\__//_/ _/_//_.__/\_,_/ \__/ \__/ \_,_/
/ _ \ ____ ___ ___/ /__ __ ____ ___
/ ___// __// _ \/ _ // // // __// -_)
/_/ /_/ \___/\_,_/ \_,_/ \__/ \__/
by Joshua E. Jodesty
```
***cadCAD*** is a Python package that assists in the processes of designing, testing and validating complex systems
through simulation, with support for Monte Carlo methods, A/B testing and parameter sweeping.
## What?: *Description*
***Distributed Produce*** (**[distroduce](distroduce)**) is a message simulation and throughput benchmarking framework /
[cadCAD](https://cadcad.org) execution mode that leverages [Apache Spark](https://spark.apache.org/) and
[Apache Kafka Producer](https://kafka.apache.org/documentation/#producerapi) for optimizing Kafka cluster configurations
and debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
framework to simulate messages sent to Kafka clusters. This enables rapid and iterative design, debugging, and message
publish benchmarking of Kafka clusters and real-time data processing using Kafka Streams and Spark (Structured)
Streaming.
# Getting Started
## 1. Installation:
Requires [Python 3](https://www.python.org/downloads/)
## How?: *A Tail of Two Clusters*
***Distributed Produce*** is a Spark Application used as a cadCAD Execution Mode that distributes Kafka Producers,
message simulation, and message publishing to worker nodes of an EMR cluster. Messages published from these workers are
sent to Kafka topics on a Kafka cluster from a Spark bootstrapped EMR cluster.
**Option A: Install Using [pip](https://pypi.org/project/cadCAD/)**
## Why?: *Use Case*
* **IoT Event / Device Simulation:** Competes with *AWS IoT Device Simulator* and *Azure IoT Solution Acceleration:
Device Simulation*. Unlike these products, *Distributed Produce* enables user-defined state updates and agent actions,
as well as message-publish benchmarking
* **Development Environment for Real-Time Data Processing / Routing:**
## Get Started:
### 0. Set Up Local Development Environment: see [Kafka Quickstart](https://kafka.apache.org/quickstart)
**a.** Install `pyspark`
```bash
pip3 install cadCAD
pip3 install pyspark
```
**b.** Install & Unzip Kafka, Create Kafka `test` topic, and Start Consumer
```bash
sh distroduce/configuration/launch_local_kafka.sh
```
**c.** Run Simulation locally
```bash
zip -rq distroduce/dist/distroduce.zip distroduce/
spark-submit --py-files distroduce/dist/distroduce.zip distroduce/local_messaging_sim.py `hostname | xargs`
```
**Option B:** Build From Source
### 1. Write cadCAD Simulation:
* **Simulation Description:**
To demonstrate *Distributed Produce*, I implemented a simulation of two users interacting over a messaging service.
* **Resources**
* [cadCAD Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [cadCAD Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
* **Terminology:**
* ***[Initial Conditions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-variables)*** - State Variables and their initial values (Start event of Simulation)
```python
initial_conditions = {
'state_variable_1': 0,
'state_variable_2': 0,
'state_variable_3': 1.5,
'timestamp': '2019-01-01 00:00:00'
}
```
* ***[Policy Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#Policy-Functions):*** -
compute one or more signals to be passed to State Update Functions
```python
def policy_function_A(_params, substep, sH, s, kafkaConfig):
    ...
    return {'signal_name': signal_value}
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions):*** -
define how state variables change over time
```python
def state_update_function_A(_params, substep, sH, s, actions, kafkaConfig):
...
return 'state_variable_name', new_value
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **actions** : `dict` - Aggregation of the signals of all Policy Functions in the current substep
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[Partial State Update Block](https://github.com/BlockScience/cadCAD/tree/master/documentation#State-Variables) (PSUB):*** -
a set of State Update Functions and Policy Functions that update state records
![](https://i.imgur.com/9rlX9TG.png)
**Note:** State Update and Policy Functions take an additional, currently undocumented parameter, `kafkaConfig`
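To make the `kafkaConfig` parameter concrete, here is a minimal sketch of a policy that publishes a chat message as a side effect and reports the send as a signal. It assumes the worker has already turned the `producer_config` dict into a live kafka-python `KafkaProducer` stored under a hypothetical `producer` key; the message schema and signal names are illustrative, not the repository's actual ones:

```python
import json
from datetime import datetime

def send_message_policy(_params, substep, sH, s, kafkaConfig):
    """Illustrative policy: emit one chat message per substep and
    report the send as a signal for the State Update Functions."""
    message = {
        'user': 'client_a',
        'text': 'hello',
        'sent_at': datetime.utcnow().isoformat(),
    }
    # Assumption: kafkaConfig carries a ready kafka-python KafkaProducer
    # under 'producer' alongside the 'send_topic' shown in the configs below.
    producer = kafkaConfig.get('producer') if isinstance(kafkaConfig, dict) else None
    if producer is not None:
        producer.send(kafkaConfig['send_topic'], json.dumps(message).encode('utf-8'))
    # Policies return a dict of signals, aggregated into the `actions` dict
    return {'msg_count': 1, 'last_message': message}
```

Passing `kafkaConfig=None` skips the publish, which makes the policy easy to exercise locally without a broker.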
**a.** **Define Policy Functions:**
* [Example:](distroduce/action_policies.py) Two users interacting on separate chat clients and entering / exiting chat
**b.** **Define State Update Functions:**
* [Example:](distroduce/state_updates.py) Used for logging and maintaining state of user actions defined by policies
**c.** **Define Initial Conditions & Partial State Update Block:**
* **Initial Conditions:** [Example](distroduce/messaging_sim.py)
* **Partial State Update Block (PSUB):** [Example](distroduce/simulation.py)
**d.** **Create Simulation Executor:** Used for running a simulation
* [Local](distroduce/local_messaging_sim.py)
* [EMR](distroduce/messaging_sim.py)
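Tying steps a–c together, a Partial State Update Block is a list of dicts pairing Policy Functions with the state variables they drive. A minimal sketch in cadCAD's configuration shape (the function and variable names are illustrative; only the `policies`/`variables` keys follow cadCAD's convention):

```python
def message_policy(_params, substep, sH, s, kafkaConfig):
    # Signal: one message sent this substep
    return {'msg_count': 1}

def update_total_msg_count(_params, substep, sH, s, actions, kafkaConfig):
    # Fold the aggregated policy signal into the running total
    return 'total_msg_count', s['total_msg_count'] + actions['msg_count']

partial_state_update_blocks = [
    {
        'policies': {'messaging': message_policy},
        'variables': {'total_msg_count': update_total_msg_count},
    }
]
```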
### 2. [Configure EMR Cluster](distroduce/configuration/cluster.py)
### 3. Launch EMR Cluster:
**Option A:** Preconfigured Launch
```bash
python3 distroduce/emr/launch.py
```
```bash
pip3 install -r requirements.txt
python3 setup.py sdist bdist_wheel
pip3 install dist/*.whl
```
**Option B:** Custom Launch - [Example](distroduce/emr/launch.py)
```python
from distroduce.emr.launch import launch_cluster
from distroduce.configuration.cluster import ec2_attributes, bootstrap_actions, instance_groups, configurations
region = 'us-east-1'
cluster_name = 'distributed_produce'
launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance_groups, configurations)
```
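`launch_cluster` is presumably a thin wrapper over EMR's `run_job_flow` API. A hedged sketch of how such a launcher could assemble the request with boto3 (the helper name, release label, and role defaults are assumptions; only the argument names mirror the call above):

```python
def build_run_job_flow_request(cluster_name, ec2_attributes, bootstrap_actions,
                               instance_groups, configurations,
                               release_label='emr-5.27.0'):
    """Assemble kwargs for boto3's EMR run_job_flow call (a sketch)."""
    return {
        'Name': cluster_name,
        'ReleaseLabel': release_label,
        'Applications': [{'Name': 'Spark'}],
        'Instances': {
            'InstanceGroups': instance_groups,
            'Ec2KeyName': ec2_attributes.get('KeyName', ''),
            'KeepJobFlowAliveWhenNoSteps': True,
        },
        'BootstrapActions': bootstrap_actions,
        'Configurations': configurations,
        'JobFlowRole': 'EMR_EC2_DefaultRole',
        'ServiceRole': 'EMR_DefaultRole',
    }

# The actual launch would then be roughly:
#   import boto3
#   emr = boto3.client('emr', region_name='us-east-1')
#   emr.run_job_flow(**build_run_job_flow_request(...))
```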
## 2. Learn the basics
**Tutorials:** available both as [Jupyter Notebooks](tutorials) and [videos](https://www.youtube.com/watch?v=uJEiYHRWA9g&list=PLmWm8ksQq4YKtdRV-SoinhV6LbQMgX1we)
Familiarize yourself with some system modelling concepts and cadCAD terminology.
## 3. Documentation:
* [System Model Configuration](documentation/Simulation_Configuration.md)
* [System Simulation Execution](documentation/Simulation_Execution.md)
* [Policy Aggregation](documentation/Policy_Aggregation.md)
* [System Model Parameter Sweep](documentation/System_Model_Parameter_Sweep.md)
## 4. Connect
Find other cadCAD users at our [Discourse](https://community.cadcad.org/). We are a small but rapidly growing community.
### 4. Execute Benchmark(s):
* **Step 1:** ssh onto the master node
* **Step 2:** Spark Submit
```bash
spark-submit --master yarn --py-files distroduce.zip messaging_sim.py `hostname | xargs`
```
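The benchmark summary the run prints boils down to messages per second; a one-line sketch of that metric, using the column names from the simulation output (`total_msg_count`, `total_send_time`):

```python
def msgs_per_sec(total_msg_count, total_send_time):
    """Throughput metric reported by the benchmark: messages / seconds."""
    if total_send_time == 0:
        raise ValueError("no send time recorded")
    return total_msg_count / total_send_time

# e.g. 12000 messages sent over 4.0 seconds -> 3000.0 msgs/sec
```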


@ -11,8 +11,9 @@ by Joshua E. Jodesty
```
## What?: *Description*
***Distributed Produce*** (**[distroduce](distroduce)**) is a message simulation and throughput benchmarking framework /
cadCAD execution mode that leverages Apache Spark and Kafka Producer for optimizing Kafka cluster configurations and
debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
[cadCAD](https://cadcad.org) execution mode that leverages [Apache Spark](https://spark.apache.org/) and
[Apache Kafka Producer](https://kafka.apache.org/documentation/#producerapi) for optimizing Kafka cluster configurations
and debugging real-time data transformations. *distroduce* leverages cadCAD's user-defined event simulation template and
framework to simulate messages sent to Kafka clusters. This enables rapid and iterative design, debugging, and message
publish benchmarking of Kafka clusters and real-time data processing using Kafka Streams and Spark (Structured)
Streaming.
@ -41,24 +42,65 @@ sh distroduce/configuration/launch_local_kafka.sh
```
**c.** Run Simulation locally
```bash
python3 distroduce/local_messaging_sim.py
zip -rq distroduce/dist/distroduce.zip distroduce/
spark-submit --py-files distroduce/dist/distroduce.zip distroduce/local_messaging_sim.py `hostname | xargs`
```
### 1. Write cadCAD Simulation:
* [Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
### 1. Write cadCAD Simulation:
* **Simulation Description:**
To demonstrate *Distributed Produce*, I implemented a simulation of two users interacting over a messaging service.
* **Resources**
* [cadCAD Documentation](https://github.com/BlockScience/cadCAD/tree/master/documentation)
* [cadCAD Tutorials](https://github.com/BlockScience/cadCAD/tree/master/tutorials)
* **Terminology:**
* ***Initial Conditions*** - State Variables and their initial values (Start event of Simulation)
* ***[Initial Conditions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-variables)*** - State Variables and their initial values (Start event of Simulation)
```python
initial_conditions = {
'state_variable_1': 0,
'state_variable_2': 0,
'state_variable_3': 1.5,
'timestamp': '2019-01-01 00:00:00'
}
```
* ***[Policy Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#Policy-Functions):*** -
compute one or more signals to be passed to State Update Functions
* ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions):*** -
```python
def policy_function_A(_params, substep, sH, s, kafkaConfig):
    ...
    return {'signal_name': signal_value}
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[State Update Functions](https://github.com/BlockScience/cadCAD/tree/master/documentation#state-update-functions):*** -
define how state variables change over time
```python
def state_update_function_A(_params, substep, sH, s, actions, kafkaConfig):
...
return 'state_variable_name', new_value
```
Parameters:
* **_params** : `dict` - [System parameters](https://github.com/BlockScience/cadCAD/blob/master/documentation/System_Model_Parameter_Sweep.md)
* **substep** : `int` - Current [substep](https://github.com/BlockScience/cadCAD/tree/master/documentation#Substep)
* **sH** : `list[list[dict]]` - Historical values of all state variables for the simulation. See
[Historical State Access](https://github.com/BlockScience/cadCAD/blob/master/documentation/Historically_State_Access.md) for details
* **s** : `dict` - Current state of the system, where the `dict_keys` are the names of the state variables and the
`dict_values` are their current values.
* **actions** : `dict` - Aggregation of the signals of all Policy Functions in the current substep
* **kafkaConfig:** `kafka.KafkaProducer` - Configuration for `kafka-python`
[Producer](https://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html)
* ***[Partial State Update Block](https://github.com/BlockScience/cadCAD/tree/master/documentation#State-Variables) (PSUB):*** -
a set of State Update Functions and Policy Functions that update state records
![](https://i.imgur.com/9rlX9TG.png)
**Note:** State Update and Policy Functions take an additional, currently undocumented parameter, `kafkaConfig`
**a.** **Define Policy Functions:**
@ -95,6 +137,5 @@ launch_cluster(cluster_name, region, ec2_attributes, bootstrap_actions, instance
* **Step 1:** ssh onto the master node
* **Step 2:** Spark Submit
```
PRIVATE_IP=`hostname -I | xargs`
spark-submit --master yarn --py-files distroduce.zip messaging_sim.py $PRIVATE_IP
spark-submit --master yarn --py-files distroduce.zip messaging_sim.py `hostname | xargs`
```

Binary file not shown.


@ -1,3 +1,4 @@
import sys
from datetime import datetime
from cadCAD import configs
@ -30,8 +31,15 @@ if __name__ == "__main__":
}
)
# Configuration for Kafka Producer
kafkaConfig = {
'send_topic': 'test',
'producer_config': {
'bootstrap_servers': f'{sys.argv[1]}:9092',
'acks': 'all'
}
}
exec_mode = ExecutionMode()
kafkaConfig = {'send_topic': 'test', 'producer_config': {'bootstrap_servers': f'localhost:9092', 'acks': 'all'}}
dist_proc_ctx = ExecutionContext(context=exec_mode.dist_proc, method=distributed_produce, kafka_config=kafkaConfig)
run = Executor(exec_context=dist_proc_ctx, configs=configs, spark_context=sc)


@ -1,3 +1,4 @@
import sys
from datetime import datetime
from cadCAD import configs
@ -30,8 +31,15 @@ if __name__ == "__main__":
}
)
# Configuration for Kafka Producer
kafkaConfig = {
'send_topic': 'test',
'producer_config': {
'bootstrap_servers': f'{sys.argv[1]}:9092',
'acks': 'all'
}
}
exec_mode = ExecutionMode()
kafkaConfig = {'send_topic': 'test', 'producer_config': {'bootstrap_servers': f'{sys.argv[1]}:9092', 'acks': 'all'}}
dist_proc_ctx = ExecutionContext(context=exec_mode.dist_proc, method=distributed_produce, kafka_config=kafkaConfig)
run = Executor(exec_context=dist_proc_ctx, configs=configs, spark_context=sc)


@ -57,21 +57,31 @@ def main(executor, sim_config, intitial_conditions, sim_composition):
i = 0
for raw_result, tensor_field in executor.execute():
result = arrange_cols(pd.DataFrame(raw_result), False)[
result = arrange_cols(pd.DataFrame(raw_result), False)
metrics_result = result[
[
'run_id', 'timestep', 'substep',
'record_creation', 'total_msg_count', 'total_send_time'
]
]
msgs_result = result[
[
'run_id', 'timestep', 'substep',
'record_creation',
'client_a', 'client_b'
]
]
print()
if i == 0:
print(tabulate(tensor_field, headers='keys', tablefmt='psql'))
last = result.tail(1)
last = metrics_result.tail(1)
last['msg_per_sec'] = last['total_msg_count'] / last['total_send_time']
print("Output: Head")
print(tabulate(result.head(5), headers='keys', tablefmt='psql'))
print("Output: Tail")
print(tabulate(result.tail(5), headers='keys', tablefmt='psql'))
print("Messages Output: Head")
print(tabulate(msgs_result.head(5), headers='keys', tablefmt='psql'))
print("Metrics Output: Head")
print(tabulate(metrics_result.head(5), headers='keys', tablefmt='psql'))
print("Metrics Output: Tail")
print(tabulate(metrics_result.tail(5), headers='keys', tablefmt='psql'))
print(tabulate(last, headers='keys', tablefmt='psql'))
print()
i += 1