Nomad
Oversubscribe memory
Job authors must set a memory limit for each task. If the memory limit is too low, the task may exceed it and stop running. If the memory limit is too high, the cluster is left underutilized and resources are wasted. Job authors usually set limits based on the task's typical memory usage, plus an extra safety margin to handle unexpected load spikes or uncommon scenarios. Cumulatively, this can leave a significant amount of cluster memory reserved but unused.
To help prevent this, Nomad 1.1 now provides job authors with two separate memory limits:
- A reserve limit that represents the task's typical memory usage. The Nomad scheduler uses this value to reserve memory and place the task.
- A max limit, which is the largest amount of memory the task may burst up to.
If another process competes for the client's memory, or the client's available memory becomes too low, Nomad uses operating system primitives to recover. On Linux, Nomad uses cgroups to reclaim memory by pushing tasks back down to their reserved memory limits. It may also reschedule tasks to other clients.
Memory oversubscription is not enabled by default. You enable it by sending a payload with the appropriate options to the scheduler configuration API endpoint. In this tutorial, you will enable the oversubscription feature and observe the memory utilization of a service job. You will then change the job's memory parameters and observe how its behavior changes.
Requirements
- Linux or macOS host
- Docker
- 2GB+ RAM
- jq, which this tutorial uses to filter and rewrite JSON
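Before you begin, you can optionally confirm that the required tooling is available on your host. These checks are illustrative rather than part of the tutorial; running the tutorial on your own machine also assumes a Nomad binary, since memory oversubscription requires Nomad 1.1 or later.
$ docker --version
$ jq --version
$ nomad version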
Configure your learning environment
Fetch the tutorial content
This tutorial uses content provided in the hashicorp/learn-nomad-features repository on GitHub. You can download a ZIP archive directly or use git to clone the repository.
$ wget https://github.com/hashicorp/learn-nomad-features/archive/memory-oversubscription.zip
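Alternatively, if you prefer git, you can clone the repository instead. This sketch assumes memory-oversubscription is a branch or tag in the repository, which the archive URL above suggests. If you clone, skip the unzip step and note that the cloned directory is named learn-nomad-features rather than learn-nomad-features-memory-oversubscription, so adjust the paths below accordingly.
$ git clone --branch memory-oversubscription https://github.com/hashicorp/learn-nomad-features.git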
Unarchive the downloaded release.
$ unzip memory-oversubscription.zip
The unzipping process creates the learn-nomad-features-memory-oversubscription directory, which contains the memory-oversubscription directory you will use in this tutorial. Change to the tutorial directory.
$ cd learn-nomad-features-memory-oversubscription/memory-oversubscription
Start the tutorial environment
This tutorial includes a Nomad job specification that starts a monitoring application, and a job that allocates memory in a bursty fashion, which makes its memory resource needs difficult to determine.
Start a Nomad agent
Open another terminal session in the same folder, and run a Nomad dev agent with the following command.
$ sudo nomad agent -dev -config=config/nomad.hcl
Switch back to the first terminal session so that you can run the commands in the rest of the tutorial.
Manage environment variables
Because this tutorial uses a local Nomad dev agent, you need to unset the NOMAD_ADDR and NOMAD_TOKEN variables if they are set in your current shell environment.
$ unset NOMAD_ADDR NOMAD_TOKEN
The shell does not provide any feedback when you run this command.
To simplify the curl commands you will be running, create a NOMAD_ADDR environment variable that points to your running Nomad dev agent.
$ export NOMAD_ADDR=http://127.0.0.1:4646
As before, the shell does not provide any feedback when you run this command.
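At this point, you can optionally confirm that the CLI can reach the dev agent. The nomad node status command should list a single node in the ready state.
$ nomad node status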
View the current scheduler configuration
$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq .
The response is a SchedulerConfig JSON object and information about its last modified index.
{
"SchedulerConfig": {
"SchedulerAlgorithm": "binpack",
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": false
},
"MemoryOversubscriptionEnabled": false,
"CreateIndex": 5,
"ModifyIndex": 5
},
"Index": 5,
"LastContact": 0,
"KnownLeader": true
}
If you don't receive a JSON response from Nomad, make sure that the NOMAD_ADDR environment variable is set correctly and that your Nomad dev agent is running.
Using the jq command, you can filter the response down to the SchedulerConfig object itself.
$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq '.SchedulerConfig'
{
"CreateIndex": 5,
"MemoryOversubscriptionEnabled": false,
"ModifyIndex": 5,
"PreemptionConfig": {
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": false,
"SystemSchedulerEnabled": true
},
"SchedulerAlgorithm": "binpack"
}
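If you only want the oversubscription flag itself, you can narrow the jq filter further. This extra filter is purely illustrative and not a required tutorial step.
$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq '.SchedulerConfig.MemoryOversubscriptionEnabled'
false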
Run the monitoring job
The monitoring.nomad job includes an ephemeral monitoring environment for use with the tutorial. Use the nomad job run command to run the monitoring.nomad job.
$ nomad job run monitoring.nomad
==> 2021-07-01T14:34:00-04:00: Monitoring evaluation "b5186f1d"
2021-07-01T14:34:00-04:00: Evaluation triggered by job "monitoring"
==> 2021-07-01T14:34:01-04:00: Monitoring evaluation "b5186f1d"
2021-07-01T14:34:01-04:00: Evaluation within deployment: "f7ec09be"
2021-07-01T14:34:01-04:00: Allocation "6c125a03" created: node "54308685", group "metrics"
2021-07-01T14:34:01-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:34:01-04:00: Evaluation "b5186f1d" finished with status "complete"
==> 2021-07-01T14:34:01-04:00: Monitoring deployment "f7ec09be"
⠧ Deployment "f7ec09be" in progress...
2021-07-01T14:34:09-04:00
ID = f7ec09be
Job ID = monitoring
Job Version = 0
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
metrics 1 1 0 0 2021-07-01T14:44:00-04:00
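Once the deployment completes, you can optionally confirm that the monitoring job is running with the nomad job status command.
$ nomad job status monitoring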
Observe the sample application
The tutorial repository includes a sample application that uses memory in a predictable way. The code for the sample application and a Dockerfile to build it yourself are included in the memory-wave folder at the root of the learn-nomad-features repository.
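If you would like to build the image yourself, a build command along the following lines should work. This is a sketch: the image tag is your choice, and the path assumes you are still in the memory-oversubscription directory with memory-wave one level up.
$ docker build -t <your-docker-hub-id>/wave:v5 ../memory-wave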
Run the sample job
The wave.nomad job runs a prebuilt instance of the container from Docker Hub. You will need to update the image in the job specification if you would like to use your own image.
Use the nomad job run command to run the wave.nomad job.
$ nomad job run wave.nomad
==> 2021-07-01T14:35:00-04:00: Monitoring evaluation "450db87b"
2021-07-01T14:35:00-04:00: Evaluation triggered by job "wave"
2021-07-01T14:35:00-04:00: Evaluation within deployment: "404b29b2"
2021-07-01T14:35:00-04:00: Allocation "97fcb825" created: node "54308685", group "wave"
2021-07-01T14:35:00-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:35:00-04:00: Evaluation "450db87b" finished with status "complete"
==> 2021-07-01T14:35:00-04:00: Monitoring deployment "404b29b2"
⠏ Deployment "404b29b2" in progress...
2021-07-01T14:35:02-04:00
ID = 404b29b2
Job ID = wave
Job Version = 0
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
wave 1 1 0 0 2021-07-01T14:45:00-04:00
Open the Influx UI
Use the Nomad UI to determine the IP address and port of the Influx instance. In your browser, navigate to http://localhost:4646/ui/jobs/monitoring/metrics.
Next, select the running allocation by its ID. Nomad shows the allocation detail page.
If the Ports section is not visible, scroll down to it.
Click the hyperlinked Host Address to open the Influx UI in a new browser tab.
Sign in to the Influx UI using the username admin and the password password. The Getting Started page opens.
Click the Dashboards option in the sidebar.
Select the Wave Dashboard to open the tutorial's dashboard.
Change the timeframe to Past 5m.
Change the refresh rate to 5s.
As the application runs, the Memory Stats graph should show regular peaks and valleys.
The Max Memory Usage for Period statistic shows that the wave task uses up to 499 MiB in the 5-minute period captured in the metrics view. The Average Memory Usage for Period statistic shows that the job uses around 400 MiB.
Without memory oversubscription, the job needs to reserve enough memory to cover the task's peak usage for the entire life of the task. If the task uses more memory than the job reserves, Docker forcibly stops the task because it is out of memory (an "OOM kill").
Once you have enabled memory oversubscription, your job can reserve an amount closer to the actual average usage of the application.
Enable memory oversubscription
To enable memory oversubscription, you must set MemoryOversubscriptionEnabled to true. The general process is:
- Fetch the current SchedulerConfig.
- Update MemoryOversubscriptionEnabled to true.
- POST the updated value back to Nomad.
You can save the SchedulerConfig contents to your filesystem and edit them, or you can use the jq command in a shell-command pipeline to update the value without writing it to disk. This tutorial demonstrates the pipeline method.
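For reference, the file-based approach might look like the following sketch; scheduler.json is just a filename chosen for this example.
$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq '.SchedulerConfig' > scheduler.json
Edit scheduler.json to set MemoryOversubscriptionEnabled to true, then submit it back to Nomad.
$ curl -X PUT $NOMAD_ADDR/v1/operator/scheduler/configuration -d @scheduler.json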
Update the MemoryOversubscriptionEnabled value
Run this single-line command to get the current scheduler configuration with curl, update the value with jq, and send it back to Nomad using a curl PUT request.
$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | \
jq '.SchedulerConfig | .MemoryOversubscriptionEnabled=true' | \
curl -X PUT $NOMAD_ADDR/v1/operator/scheduler/configuration -d @-
Nomad's response shows that the value was updated and provides you with the change index.
{ "Updated": true, "Index": 40 }
Verify the value
Verify the cluster's MemoryOversubscriptionEnabled value by running the curl command to query the /v1/operator/scheduler/configuration endpoint again.
$ curl -s $NOMAD_ADDR/v1/operator/scheduler/configuration | jq .
{
"SchedulerConfig": {
"SchedulerAlgorithm": "binpack",
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": false
},
"MemoryOversubscriptionEnabled": true,
"CreateIndex": 5,
"ModifyIndex": 40
},
"Index": 40,
"LastContact": 0,
"KnownLeader": true
}
Update the job to use oversubscription
Open wave.nomad in a text editor and scroll down to the resources stanza. Reduce the memory value from 520 to the observed average value of 400. Add a memory_max value to inform Nomad of how much extra memory the job can use; set it to 520.
Once complete, your wave.nomad job specification should look like the following.
wave.nomad
job "wave" {
datacenters = ["dc1"]
group "wave" {
task "wave" {
driver = "docker"
config {
image = "voiselle/wave:v5"
args = [ "300", "200", "15", "64", "4" ]
}
resources {
memory = 400
memory_max = 520
}
}
}
}
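Optionally, you can preview how Nomad will handle the change before submitting it. The nomad job plan command shows the planned update without applying it.
$ nomad job plan wave.nomad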
Re-run the job
Run the job to update the configuration.
$ nomad job run wave.nomad
==> 2021-07-01T14:43:30-04:00: Monitoring evaluation "d6f822a1"
2021-07-01T14:43:30-04:00: Evaluation triggered by job "wave"
==> 2021-07-01T14:43:31-04:00: Monitoring evaluation "d6f822a1"
2021-07-01T14:43:31-04:00: Evaluation within deployment: "692890d2"
2021-07-01T14:43:31-04:00: Allocation "c78818af" created: node "54308685", group "wave"
2021-07-01T14:43:31-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:43:31-04:00: Evaluation "d6f822a1" finished with status "complete"
==> 2021-07-01T14:43:31-04:00: Monitoring deployment "692890d2"
⠙ Deployment "692890d2" in progress...
2021-07-01T14:43:33-04:00
ID = 692890d2
Job ID = wave
Job Version = 1
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
wave 1 1 0 0 2021-07-01T14:53:30-04:00
Examine the dashboard
Switch back to the Influx browser tab and watch the Wave Dashboard.
Observe that the running allocation uses more than the allocated value of 400 MiB without being OOM-killed.
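You can also check memory usage from the command line. Find the running allocation's ID with nomad job status, then inspect it with nomad alloc status; the allocation ID below is a placeholder for the one in your output, and the resource section should report the task's memory usage against its 400 MiB reservation.
$ nomad job status wave
$ nomad alloc status <alloc-id>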
Validate memory_max setting
Open wave.nomad in a text editor and scroll down to the args value in the config stanza. Update the second value in the list to 300.
Once complete, your wave.nomad job specification should look like the following.
wave.nomad
job "wave" {
datacenters = ["dc1"]
group "wave" {
task "wave" {
driver = "docker"
config {
image = "voiselle/wave:v5"
args = [ "300", "300", "15", "64", "4" ]
}
resources {
memory = 400
memory_max = 520
}
}
}
}
This change causes the wave application to use approximately 600 MiB of RAM at its peak usage. Because the job's memory_max value is 520 MiB, Docker will OOM-kill the task once it exceeds that value.
Run the job to update the configuration.
$ nomad job run wave.nomad
==> 2021-07-01T14:46:33-04:00: Monitoring evaluation "f54c7610"
2021-07-01T14:46:33-04:00: Evaluation triggered by job "wave"
2021-07-01T14:46:33-04:00: Allocation "f51cd2c2" created: node "54308685", group "wave"
==> 2021-07-01T14:46:34-04:00: Monitoring evaluation "f54c7610"
2021-07-01T14:46:34-04:00: Evaluation within deployment: "c76d1227"
2021-07-01T14:46:34-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-07-01T14:46:34-04:00: Evaluation "f54c7610" finished with status "complete"
==> 2021-07-01T14:46:34-04:00: Monitoring deployment "c76d1227"
⠸ Deployment "c76d1227" in progress...
2021-07-01T14:46:39-04:00
ID = c76d1227
Job ID = wave
Job Version = 2
Status = running
Description = Deployment is running
Deployed
Task Group Desired Placed Healthy Unhealthy Progress Deadline
wave 1 1 0 0 2021-07-01T14:56:33-04:00
Examine the dashboard
Switch back to the Influx browser tab and watch the Wave Dashboard.
Observe that once the application uses more than the specified memory_max value of 520 MiB, the container is OOM-killed.
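If you want to confirm the OOM kill from the command line, inspect the failed allocation's events with nomad alloc status; the allocation ID is again a placeholder for the one reported by nomad job status wave, and the recent task events should show an out-of-memory termination.
$ nomad alloc status <alloc-id>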
Clean up
Now that you have configured memory oversubscription in your local Nomad dev instance, you can clean up the running containers and Docker images.
Stop Nomad jobs
Use the nomad job stop command to stop the wave job.
$ nomad job stop wave
==> Monitoring evaluation "85b1c158"
Evaluation triggered by job "wave"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "85b1c158" finished with status "complete"
Use the nomad job stop command to stop the monitoring job.
$ nomad job stop monitoring
==> Monitoring evaluation "dd37cc88"
Evaluation triggered by job "monitoring"
==> Monitoring evaluation "dd37cc88"
Evaluation within deployment: "aeff4677"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "dd37cc88" finished with status "complete"
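If you also want to remove the stopped jobs from Nomad's job list, you can stop them with the -purge flag. This step is optional and not required for the tutorial.
$ nomad job stop -purge wave
$ nomad job stop -purge monitoring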
Stop the Nomad dev agent
Switch to the terminal running your Nomad dev agent and stop it by pressing Ctrl-C. You can now close this terminal session.
Remove tutorial Docker images (optional)
The tutorial pulls three Docker images that are cached by your local Docker daemon. Once you are completely done with the tutorial, run the following command to remove them if you wish.
$ docker image rm voiselle/wave:v5 influxdb:2.0.7 telegraf:1.19.0
Next steps
In this tutorial, you learned how to update the SchedulerConfig value to enable memory oversubscription in your Nomad cluster, how to configure the memory attribute for oversubscription, and how to use the memory_max attribute to prevent a misbehaving workload from depleting the available memory on your Nomad client nodes.
Read more about memory oversubscription in the Nomad documentation. To learn more about advanced scheduling configuration, visit the Define Application Placement Preferences collection.