In this activity we use the workflow depicted in Figure 1. It consists of 20 computationally intensive, independent tasks, followed by a less intensive final task that depends on all 20 of them. The term “independent” means that these tasks may all be executed concurrently. Additionally, each task now has a RAM requirement: the amount of RAM a task uses is equal to the sum of its input and output file sizes.
The structure of this workflow “joins” multiple tasks into one, so this structure is typically called a “join” (note that this has nothing to do with a database join).
Figure 2 depicts this activity’s cyberinfrastructure. We build upon the basic cyberinfrastructure that was introduced in the previous activity. The Compute Service (CS) now has a configurable number of resources on which it can execute tasks. It has access to several physical machines, or “compute nodes”, equipped with dual-core processors and 80 GB of RAM. When the CS receives a job containing multiple tasks, it may execute these tasks concurrently across multiple compute nodes. The CS in this activity is what is known as a “cluster” and can be decomposed into two parts: a “frontend node” and “compute node(s)”. The frontend node handles job requests from the Workflow Management System (WMS) and dispatches work to the compute node(s) according to some decision-making algorithm. In this activity, our WMS submits all workflow tasks to the CS at once, specifying for each task on which compute node it should run, trying to utilize the available compute nodes as much as possible.
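The frontend node's decision-making algorithm is not spelled out here, but a minimal sketch of the kind of greedy load balancing described above is shown below. This is a hypothetical illustration, not the actual WMS/CS scheduling code used in the simulator:

```python
# Hypothetical sketch of a frontend node's dispatch policy: assign each
# task to the compute node that currently has the fewest tasks, so that
# available compute nodes are utilized as much as possible.

def assign_tasks(num_tasks, num_nodes):
    """Map each task index to a node index, balancing load greedily."""
    assignment = {}
    load = [0] * num_nodes  # number of tasks assigned to each node
    for task in range(num_tasks):
        node = load.index(min(load))  # pick the least-loaded node
        assignment[task] = node
        load[node] += 1
    return assignment

print(assign_tasks(5, 2))  # {0: 0, 1: 1, 2: 0, 3: 1, 4: 0}
```

With 5 tasks and 2 nodes, the tasks alternate between nodes, leaving one node with 3 tasks and the other with 2.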
Connecting the CS’s frontend node and compute nodes are high-bandwidth, low-latency network links going from each machine to a centralized switch, which also serves as the gateway for network traffic entering and exiting the cluster’s local area network. This means that a file transferred from the Remote Storage Service at storage_db.edu to the CS at hpc.edu must travel through two links: first the link between storage_db.edu and the switch, then the link between the switch and the frontend node at hpc.edu/node_0. Say the file is 3000 MB; based on what we learned from the primer on file transfer times, we expect the duration of this file transfer to be as follows:
Furthermore, when a compute node reads a file from scratch space, the file will travel through the two links separating the compute node and the frontend node. For example, if the compute node at hpc.edu/node_1 reads a 3000 MB file from the CS’s scratch space at hpc.edu/node_0, the expected duration for that read operation is as follows:
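As a rough sketch of such two-link transfer estimates, the snippet below uses a common single-stream model: the sum of the per-link latencies plus the file size divided by the bottleneck bandwidth. The latency and bandwidth numbers are placeholders, not the actual values from this activity's platform description:

```python
# Estimate the duration of a file transfer traversing several network
# links in sequence: total latency + size / bottleneck bandwidth.
# (Simple single-stream model; link values below are assumed, not taken
# from the activity's platform.)

def transfer_time(file_size_mb, links):
    """links: list of (latency_seconds, bandwidth_mb_per_s) tuples."""
    total_latency = sum(lat for lat, _ in links)
    bottleneck = min(bw for _, bw in links)
    return total_latency + file_size_mb / bottleneck

# A 3000 MB file over two links, each with 10 us latency and 100 MB/s:
t = transfer_time(3000, [(10e-6, 100), (10e-6, 100)])
print(f"{t:.2f} s")  # 30.00 s
```

The same model applies to a compute node reading from scratch space, since that read also traverses two links.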
You will start this activity by using a CS with only a single compute node. You will then augment the CS with more cores and more nodes to see how the execution of our workflow is affected. Individual processor speed and RAM capacity of compute nodes are kept constant throughout this entire activity.
The WMS implementation in this activity submits tasks to the CS using the following scheme for file operations: read the initial input files from the Remote Storage Service, write the final output file to the Remote Storage Service, and read and write all other files using the CS’s scratch space. “Scratch space” is another name for the temporary, non-volatile storage that compute services have access to while jobs are being executed. Having a scratch space on the CS is key to enabling data locality, which is itself key to better performance, as we learned in the previous activity (Activity 1). This scheme is depicted in Figure 3.
Now we explore two fundamental concepts: parallelism and utilization.
Parallelism. Multi-core architectures enable computers to execute multiple instructions at the same time, and in our case, multiple independent tasks. For example, given a workflow of two independent, identical tasks and a CS with a dual-core compute node, it may be possible to execute the workflow in the same amount of time it would take to execute a workflow with only a single such task. One can take advantage of such parallelism to get things done faster. Having more cores, however, does not mean they can always be used to the fullest extent, because the amount of achievable parallelism can be limited by the structure of the application and/or the available compute resources. To avoid being wasteful with our resources, it is crucial to understand how well (or how badly) we are utilizing them.
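A back-of-the-envelope way to see this limit: for N identical, independent tasks on a p-core node (ignoring I/O), only min(N, p) tasks run at once, so the compute phase takes ceil(N / p) "rounds". The sketch below illustrates this for a 20-task workflow like ours, with an assumed per-task compute time of 100 s:

```python
import math

# Idealized makespan of N identical, independent tasks on p cores,
# ignoring I/O: the tasks execute in ceil(N / p) sequential rounds.

def makespan(num_tasks, num_cores, task_time_s):
    return math.ceil(num_tasks / num_cores) * task_time_s

# 20 independent tasks of 100 s each (assumed figure for illustration):
for cores in (1, 2, 10, 20, 40):
    print(cores, "cores:", makespan(20, cores, 100), "s")
```

Note that going from 20 to 40 cores gains nothing: 20 cores already run every task concurrently, so the extra 20 cores would sit idle, which is exactly where utilization enters the picture.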
Utilization. The utilization of a core while executing a given workload is defined as (compute time) / (compute time + idle time). The utilization of a multi-core host, then, is defined as the average of its core utilizations. For instance, consider a dual-core host that executes a workflow for 1 hour. The first core computes for 30 minutes and is then idle for 30 minutes. The second core is idle for 15 minutes and then computes for 45 minutes.
The first core’s utilization is 50%, and the second core’s utilization is 75%. The overall utilization is thus 62.5%.
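This calculation can be written out directly; the snippet below reproduces the dual-core example above:

```python
# Host utilization as defined above: each core's compute time divided by
# the total wall-clock time, averaged over all cores.

def utilization(compute_times, wall_time):
    return sum(t / wall_time for t in compute_times) / len(compute_times)

# Dual-core host, 60-minute execution: core 1 computes for 30 min,
# core 2 computes for 45 min.
u = utilization([30, 45], 60)
print(f"{u:.1%}")  # 62.5%
```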
Figure 4 illustrates the concept of utilization. The area of the container rectangle (in this example a 2-by-60 rectangle) represents the total amount of time all cores could have computed for. The colored area represents how much time all cores actually computed for. The proportion of the colored area to the total area tells us the utilization of this host for a given workload. Optimizing for utilization means maximizing the colored area.
For the remainder of this activity, we will be using the visualization tool. In the terminal, run the following commands:

```shell
docker pull wrenchproject/wrench-pedagogic-modules:ics332-activity-visualization
docker container run -p 3000:3000 -d wrenchproject/wrench-pedagogic-modules:ics332-activity-visualization
```
Activity 2: Parallelism
Answer these questions
Assuming the cluster has a single 20-core compute node:
In Step 1, each workflow task had a RAM requirement such that its RAM usage was equal to the sum of its input and output file sizes. What about RAM required for the task itself? That is, real-world workflow tasks (and programs in general) usually require some amount of RAM for storing program instructions, possibly large temporary data-structures, the runtime stack, etc. In this step we thus introduce an additional 12 GB RAM requirement for each task (Figure 9). For example, task0 previously required 4 GB of RAM, whereas in this step it requires 16 GB of RAM.
If a compute node does not have enough RAM to execute a task, the CS defers that task’s execution until the required amount of RAM becomes available on the compute node (in other words, we do not allow swapping; see your OS course!). Since our compute nodes have 80 GB of RAM, at most 5 such tasks can run concurrently on a node (because 6 × 16 GB = 96 GB, which is greater than 80 GB). The following questions reveal how this requirement forces us to find another means of exploiting parallelism and increasing workflow execution performance.
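The concurrency limit above can be checked with a one-line calculation. The sketch below assumes, for simplicity, that every task needs 16 GB (as task0 does); in the actual workflow, per-task RAM varies with file sizes:

```python
# Maximum number of tasks that can run concurrently on one compute node
# given its RAM capacity, under the no-swapping rule: tasks only start
# when their full RAM requirement fits.

def max_concurrent_tasks(node_ram_gb, task_ram_gb):
    return node_ram_gb // task_ram_gb

# 80 GB node; 16 GB per task (4 GB of file data + 12 GB fixed requirement):
print(max_concurrent_tasks(80, 16))  # 5
```

So even a 20-core node is limited to 5 concurrent tasks by RAM, which is what motivates scaling out to more nodes rather than adding cores.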
Answer these questions
When you are finished using the visualization tool, run:

```shell
docker kill $(docker ps -a -q --filter ancestor=wrenchproject/wrench-pedagogic-modules:ics332-activity-visualization)
```
In this activity, you were presented with a workflow containing a set of tasks that could be executed in parallel. To achieve better performance, you first attempted to “vertically” scale the compute service by adding more cores to a compute node (in practice, vertically scaling can also mean purchasing a faster processor). Your results showed that parallel execution of tasks did in fact increase workflow execution performance to a certain degree. Next, you calculated utilization when using a different number of cores. These results should have demonstrated to you that one cannot always execute workflows as fast as possible while achieving 100% utilization.
After introducing parallelism and utilization, you added additional RAM requirements to the workflow in order to simulate a situation more relevant to actual practice. Under those circumstances, the workflow execution performance collapsed when running this workflow on a single node CS. Rather than simply adding more cores to the single compute node, you grew the cluster “horizontally” by adding more compute nodes. Using this strategy, you were able to increase workflow execution performance up to a certain point.
Sometimes adding more cores or more machines does nothing but decrease utilization. Most importantly, though, both strategies, when used judiciously, can be beneficial. It is therefore crucial to be cognizant of the hardware resources a workflow demands, in addition to the dependencies present in its structure.
Head over to the next section, Activity 3: The Right Tool for the Job, where we present you with a case study on purchasing hardware infrastructure to improve the performance of a specific workflow execution.