BASH script to monitor subprocess and throttle it for CPU temperature control
$begingroup$
I need to run CPU-intensive tasks on a very old machine with overheating issues. The script below will monitor temperature and pause the jobs when it gets too high, continuing them when it's back to normal.
The actual commands run are, of course, not included, since they are irrelevant to the question.
I am looking for hidden traps I may have set in my code (listed at the bottom), and for anything else I have done incorrectly. Special characters in the commands and arguments are not a concern, since those are written by hand and I can control that risk. Beyond that, what traps or "gotchas" have I unknowingly built into the code? How could it be made more error-proof, or otherwise improved?
For the timing function I know I could have used the
time { command ...; command ...; }
construct, but I was more interested in the time spent by the machine (and previously, by me in the chair) than in the CPU time involved.
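For comparison, here is a minimal sketch of measuring wall-clock time with bash's SECONDS variable instead of calling date twice. The helper name format_elapsed is made up for this illustration, and sleep stands in for the real command. Unlike `date -u -d @N +%T`, which prints a time of day and therefore wraps after 24 hours, the arithmetic below keeps counting hours.

```shell
#!/bin/bash
# format_elapsed SECS: print a duration as HH:MM:SS using shell arithmetic
# only (hypothetical helper, not part of the script under review).
format_elapsed() {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# SECONDS counts wall-clock seconds since it was last assigned, so
# resetting it to 0 starts a fresh timer.
SECONDS=0
sleep 1     # stand-in for the real long-running command
echo " ******* Processing time: $(format_elapsed "$SECONDS")"
```

Because SECONDS tracks elapsed real time, any time the job spends paused by SIGSTOP is included, which matches the "time spent by the machine" being measured here.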
The Script:
The code comments should explain what it does, as well as why I did some of it the way I did.
#!/bin/bash

# Build my time reporting function
function report {
    # Get the current time, do the math, report the results.
    end_time=$(date +%s);
    # The time used for the last run process
    proc_time=$(echo "$end_time"-"$start_time" | bc);
    echo " ******* Processing time: $(date -u -d @${proc_time} +%T)";
    # The cumulative time for all processes so far
    run_time=$(echo "$end_time"-"$launch_time" | bc);
    echo " ******* Running time: $(date -u -d @${run_time} +%T)";
}

# The high and low temperatures to monitor for. Processing is paused
# once the high temp is reached, and will not resume again until the
# low temp is reached.
# My system recovers to 60°C reasonably quickly (idle is around 45°C)
temp_lo=60;
# My system dies at about 115°C - since 100°C is normal, suggests my
# sensors are not accurate, but I work with what I have.
# 20°C margin allows for delay in the detection of the high temp, and
# delay in the process pausing, while still keeping temp under danger
# zone. Also allows for when Core0 is rising faster than Core1. They
# seem to take turns being the leader, but seldom more than 5-10°C
# difference.
temp_hi=95;

# The routine to read the CPU temp with lm sensors. Could be coded
# inline in the watch_child function, but that means placing it in
# three places, and if the grep/sed needs adjusting, then I have to
# remember to change _all_ three, and not make any typos. This cuts
# my chance of errors to a third.
function get_temp {
    # the grep and/or sed may need changing for other sensor output
    # on different systems
    sensors | grep 'Core1' | sed -e 's/.*: \+\([+-][0-9.]\+\)°C.*$/0\1/'
}

# Routine to monitor the CPU temp, pausing the processing as needed
# to remain in the 'safe' range for processor temperature.
function watch_child {
    # argument should be the PID of the backgrounded process
    childd=$1;
    # pre-load the CPU temp
    temp=$(get_temp);
    # As long as the backgrounded process is still running
    while [ -e /proc/$childd ]; do
        # Monitor the process, for still running, and the temp, still
        # safe
        while [ -e /proc/$childd ] && [ $(echo "$temp < $temp_hi" | bc) = 1 ]; do
            # wait a spell
            sleep 5;
            # re-load the temp for a re-check
            temp=$(get_temp);
        done
        # If the process is still running, then it was over-temp that
        # caused the while loop to end
        if [ -e /proc/$childd ]; then
            # Tell the process to take a break
            kill -SIGSTOP "$childd";
        fi
        # Drops through here if the process has ended, otherwise,
        # monitor the temp for a restart
        while [ -e /proc/$childd ] && [ $(echo "$temp > $temp_lo" | bc) = 1 ]; do
            # wait a spell
            sleep 5;
            # re-load the temp for a re-check
            temp=$(get_temp);
        done
        # Drop through here if the process has ended.
        if [ -e /proc/$childd ]; then
            # Otherwise, tell the process that the break is over.
            kill -SIGCONT "$childd";
        fi
    done
    # Only get this far once the process has ended.
    # In the rare case of the process never waking up, the outer while
    # loop will run infinitely!
    # Human monitoring still required!
}

# Start the timer for cumulative run time reports
launch_time=$(date +%s);

echo "********* The step to perform.";
# Start the timer for this process
start_time=$(date +%s);
# Launch the dangerous process in the background
my_long_running_command arg1 arg2 &
# Capture its PID
child=$!;
# Block, with temp throttling, until this process is done
watch_child $child;
report;

echo "********* The next step to perform.";
# Start the timer for the next process
start_time=$(date +%s);
# Launch the dangerous process in the background
another_long_running_command arg1 arg2 &
# Capture its PID
child=$!;
# Block, with temp throttling, until this process is done
watch_child $child;
report;
bash
$endgroup$
migrated from unix.stackexchange.com Jan 11 '17 at 10:04
This question came from our site for users of Linux, FreeBSD and other Un*x-like operating systems.
$begingroup$
Would fit well at Code Review as well.
$endgroup$
– phk
Dec 29 '16 at 12:49
$begingroup$
@phk Since cross-posting is out, how do I move this there?
$endgroup$
– Gypsy Spellweaver
Jan 10 '17 at 9:12
$begingroup$
Actually, I am not entirely sure. Either deleting it here and posting it there then or flagging this thread for moderator attention. If you don't find any info on this at help center, Code Review Meta or Meta Stack Exchange then ask at Code Review Meta.
$endgroup$
– phk
Jan 10 '17 at 10:23
asked Dec 29 '16 at 5:59
Gypsy Spellweaver
1 Answer
$begingroup$
Although unrelated to the code, I'll mention that it is unusual for a CPU to overheat, especially a dual-core CPU, except at very high ambient temperatures. I'd suggest removing the heat sink and re-applying thermal paste. Any number of YouTube videos can provide step-by-step instructions.
Moving on to the code:
- terminal semicolons aren't needed
- configuration should go at the top
- `kill -0 PID` is a portable alternative to `-e /proc/$pid`
- bash builtins `let` and `[[ x -gt y ]]` can replace `bc` for these purposes
- `[[ .. ]]` is a builtin alternative to `[ .. ]`
- `date +%s` can be replaced by builtin `printf`
- `gawk` can extract the temperature more flexibly than `grep`+`sed`
- your time/run/report pattern can be factored into a function
- the monitoring loop can be simplified by moving `sleep` to the end
- no real harm in monitoring more aggressively, since the loop is not going to use a lot of CPU
- can save a couple of forks by reading temp directly from /sys
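As a rough illustration of the builtin replacements in that list, here is a small sketch. The function names is_running, at_least and now are invented for this example, and the `printf '%(%s)T'` form assumes bash 4.2 or newer. The `sleep 30` job is only a stand-in for a real worker process.

```shell
#!/bin/bash
# is_running PID: succeeds while the process exists and can be signalled.
# Signal 0 runs the existence and permission checks without delivering
# anything to the process.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# at_least A B: integer comparison done by the shell itself, no bc.
# [[ -ge ]] handles integers only, so a reading such as 95.5 has to be
# truncated or scaled before it gets here.
at_least() {
    [[ $1 -ge $2 ]]
}

# now: epoch seconds from the printf builtin instead of date +%s.
now() {
    printf '%(%s)T\n' -1
}

sleep 30 &
pid=$!
is_running "$pid" && echo "child $pid is running"
kill "$pid"
wait "$pid" 2>/dev/null || true
is_running "$pid" || echo "child $pid has ended"

at_least 96 95 && echo "96 is at or above the threshold of 95"
echo "epoch seconds: $(now)"
```

The integer-only limitation is the main trade-off against `bc`: it is fine for whole-degree thresholds like 60 and 95, but the temperature source has to deliver (or be reduced to) whole numbers.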
Putting it all together:
#!/bin/bash

temp_lo=60
temp_hi=95
temp_label=$( grep -l ^Core /sys/bus/platform/devices/coretemp.*/hwmon/hwmon*/temp*_label | head -1 )
temp_source=${temp_label%_label}_input

# Epoch seconds from the printf builtin (bash 4.2+). Defined as a function
# because bash does not expand aliases in non-interactive scripts unless
# expand_aliases is set.
function now {
    printf '%(%s)T\n' -1
}

function watch_child {
    childd=$1
    # kill -0 only tests that the process exists; silence the error it
    # prints once the child has gone
    while kill -0 "$childd" 2>/dev/null; do
        # /sys reports millidegrees Celsius
        temp=$(( $(<"$temp_source") / 1000 ))
        [[ $temp -ge $temp_hi ]] && kill -SIGSTOP "$childd"
        [[ $temp -le $temp_lo ]] && kill -SIGCONT "$childd"
        sleep 1
    done
}

function elapsed {
    echo " ******* $1 time: $(date -u -d @$(( ${3:-$(now)}-$2 )) +%T)"
}

function monitor {
    launch_time=${launch_time:-$(now)}
    start_time=$(now)
    echo "********* $1"
    shift
    "$@" &
    watch_child $!
    elapsed Processing $start_time
    elapsed Running $launch_time
}

monitor "The step to perform." my_long_running_command arg1 arg2
monitor "The next step to perform." another_long_running_command arg1 arg2
$endgroup$
$begingroup$
Thanks for the review. Agreed on the root cause, but all remedies failed. It was a 12+ yr old core with a hard life. Semicolons are a personal style and convenience. `kill` is portable, yet `[[ .. ]]` isn't as much so. `gawk` over `grep`+`sed` is a great call, reducing CPU load as well (I think). Refactoring time/run/report is a good one too. Not so sure about the increased aggressiveness: the objective is not only to know when the core is cool, but also to allow it to cool as fast as possible.
$endgroup$
– Gypsy Spellweaver
1 hour ago
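To make the awk-for-grep+sed idea concrete, here is a minimal sketch. It uses only POSIX awk features, so plain awk works as well as gawk. The function name parse_core_temp and the sample text are assumptions for illustration; real `sensors` output varies between systems, so the label and layout may need adjusting.

```shell
#!/bin/bash
# parse_core_temp LABEL: read sensors-style text on stdin and print the
# whole-degree temperature from the first line whose label equals LABEL.
parse_core_temp() {
    awk -v label="$1" -F': *' '
        $1 == label {
            # Field 2 looks like "+61.5°C  (high = ...)". Adding 0 makes
            # awk convert the leading number and ignore the rest.
            printf "%d\n", $2 + 0
            exit
        }'
}

# Sample text in a layout lm-sensors commonly prints (assumed here).
sample='Core 0:       +58.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:       +61.5°C  (high = +100.0°C, crit = +100.0°C)'

printf '%s\n' "$sample" | parse_core_temp 'Core 1'
```

In the monitoring script this would be called as `sensors | parse_core_temp 'Core 1'`, replacing two processes (grep and sed) with one. The result is already an integer, so it can be compared without `bc`.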
$begingroup$
One issue: as long as the temp is not between the threshold values, additional, unneeded `kill` commands will be issued. `-SIGSTOP` will be reissued every second until the core is below temp_hi. Once the temp goes below temp_lo, `-SIGCONT` will be reissued every second. Wouldn't a single `kill` per threshold crossing be better at conserving CPU resources?
$endgroup$
– Gypsy Spellweaver
1 hour ago
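One way to send a signal only when a threshold is crossed is to remember whether the child is currently running or stopped. The sketch below is only an illustration of that idea: next_action and the `action` and `state` variables are hypothetical names, and the temperature series is invented so the logic can be tried without a real sensor or child process.

```shell
#!/bin/bash
temp_lo=60
temp_hi=95

# next_action STATE TEMP: set $action to STOP, CONT, or empty.
# STATE is "running" or "stopped"; TEMP is in whole degrees.
# The result is returned in a variable rather than on stdout so that the
# caller does not need a command substitution.
next_action() {
    action=
    if [ "$1" = running ] && [ "$2" -ge "$temp_hi" ]; then
        action=STOP
    elif [ "$1" = stopped ] && [ "$2" -le "$temp_lo" ]; then
        action=CONT
    fi
}

# Dry run over an invented temperature series: only crossings are reported.
state=running
for t in 70 96 97 80 61 60 59 96; do
    next_action "$state" "$t"
    case $action in
        STOP) state=stopped; echo "$t: SIGSTOP" ;;
        CONT) state=running; echo "$t: SIGCONT" ;;
    esac
done
```

The dry run reports three events (96: SIGSTOP, 60: SIGCONT, 96: SIGSTOP) for eight readings. In a real loop the echo lines would become `kill -SIGSTOP "$childd"` and `kill -SIGCONT "$childd"`.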
$begingroup$
There are really no CPU resources used: `kill` is a builtin that invokes a single syscall. On my system, ten million invocations is 2.5s of CPU time, or ~250ns each. Compare the `sensors` command line, also run once per loop, at ~8ms each, or 32000 times longer. `[[ ]]` is "portable" to any other bash and definitely more efficient than forking `bc`.
$endgroup$
– Oh My Goodness
49 mins ago
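A measurement of this kind can be repeated with something like the sketch below. The function name signal_self and the iteration count are arbitrary choices for illustration, and the timings will differ from machine to machine.

```shell
#!/bin/bash
# Check that kill resolves to the shell builtin rather than an external
# program such as /bin/kill.
type -t kill

# signal_self N: send the null signal to this shell N times. The builtin
# handles each call inside the shell, so no process is created per
# iteration.
signal_self() {
    local i
    for (( i = 0; i < $1; i++ )); do
        kill -0 "$$"
    done
}

# The time keyword prints real, user and sys time on stderr.
time signal_self 100000
```

Dividing the reported user plus sys time by the iteration count gives a per-call cost that can be set against the cost of one `sensors` invocation.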
$begingroup$
To give an idea of the cost of forks (`bc` and `[ .. ]`) I ran both versions of `watch_child` with sleeps disabled and the same gawk-based `get_temp`. Based on loops executed per 5 seconds, the modified version is about 50% faster.
$endgroup$
– Oh My Goodness
35 mins ago
$begingroup$
edit: you can cut the use of `sensors`/`gawk` altogether; see edits to my code
$endgroup$
– Oh My Goodness
21 mins ago
$begingroup$
Although unrelated to the code, I'll mention that for a CPU to overheat, especially a dual-core CPU, is not usual except with very high ambient temps. I'd suggest removing the heat sink and re-applying thermal paste. Any number of youtube videos can provide step-by-step instructions.
Moving on to the code:
- terminal semicolons aren't needed
- configuration should go at the top
- `kill -0 PID` is a portable alternative to `-e /proc/$pid`
- bash builtins `let` and `[[ x -gt y ]]` can replace `bc` for these purposes
- `[[ .. ]]` is a builtin alternative to `[ .. ]`
- `date +%s` can be replaced by builtin `printf`
- `gawk` can extract the temperature more flexibly than `grep`+`sed`
- your time/run/report pattern can be factored into a function
- the monitoring loop can be simplified by moving `sleep` to the end
- no real harm in monitoring more aggressively, since the loop is not going to use a lot of CPU
- can save a couple of forks by reading temp directly from /sys
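
A few of the builtin replacements listed above can be sketched in isolation. The helper names (`is_alive`, `now_epoch`, `is_hot`) are hypothetical and only serve the illustration; `printf '%(%s)T'` assumes bash 4.2 or later.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical.

# True if the PID exists and we are allowed to signal it.
# Signal 0 delivers nothing; note it also fails on a permission error.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Epoch seconds from the printf builtin instead of forking date (bash 4.2+).
now_epoch() {
    printf '%(%s)T\n' -1
}

# Integer comparison with shell arithmetic instead of piping to bc.
is_hot() {
    (( $1 >= $2 ))
}
```

For example, `is_hot "$temp" "$temp_hi" && echo "too hot"` replaces a `bc` pipeline with a single builtin test.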
Putting it all together:
    #!/bin/bash
    temp_lo=60
    temp_hi=95
    # Find the first coretemp sensor labelled "Core ..." and derive its input file.
    temp_label=$( grep -l ^Core /sys/bus/platform/devices/coretemp.*/hwmon/hwmon*/temp*_label | head -n 1 )
    temp_source=${temp_label%_label}_input

    # Epoch seconds via the printf builtin (bash 4.2+). A function is used
    # because aliases are not expanded in non-interactive scripts by default.
    function now {
        printf '%(%s)T\n' -1
    }

    function watch_child {
        childd=$1
        # kill -0 sends no signal; it only tests that the process still exists.
        while kill -0 "$childd" 2>/dev/null; do
            # The kernel reports millidegrees Celsius.
            temp=$(( $(<"$temp_source") / 1000 ))
            [[ $temp -ge $temp_hi ]] && kill -SIGSTOP "$childd"
            [[ $temp -le $temp_lo ]] && kill -SIGCONT "$childd"
            sleep 1
        done
    }

    function elapsed {
        echo " ******* $1 time: $(date -u -d "@$(( ${3:-$(now)}-$2 ))" +%T)"
    }

    function monitor {
        launch_time=${launch_time:-$(now)}
        start_time=$(now)
        echo "********* $1"
        shift
        "$@" &
        watch_child $!
        elapsed Processing "$start_time"
        elapsed Running "$launch_time"
    }

    monitor "The step to perform." my_long_running_command arg1 arg2
    monitor "The next step to perform." another_long_running_command arg1 arg2
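
The `elapsed` function above formats a duration with `date -u -d @N +%T`, which relies on GNU `date` and wraps around after 24 hours. If either matters, the formatting could be done with shell arithmetic instead. This is a sketch under that assumption; `fmt_hms` is a hypothetical name, not part of the answer's code.

```shell
#!/bin/bash
# Sketch only: fmt_hms is a hypothetical helper.

# Format a non-negative number of seconds as HH:MM:SS using only builtins.
# Hours are not wrapped at 24, so long runs keep counting upward.
fmt_hms() {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

fmt_hms 3661    # prints 01:01:01
```

Inside `elapsed`, the `$(date ...)` substitution could then be swapped for `$(fmt_hms $(( end - start )))`, with `end` and `start` standing for the two epoch values.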
$endgroup$
edited 22 mins ago
answered 2 hours ago
Oh My Goodness
$begingroup$
Would fit well at Code Review as well.
$endgroup$
– phk
Dec 29 '16 at 12:49
$begingroup$
@phk Since cross-posting is out, how do I move this there?
$endgroup$
– Gypsy Spellweaver
Jan 10 '17 at 9:12
$begingroup$
Actually, I am not entirely sure. Either delete it here and post it there, or flag this thread for moderator attention. If you don't find any info on this at the help center, Code Review Meta, or Meta Stack Exchange, then ask at Code Review Meta.
$endgroup$
– phk
Jan 10 '17 at 10:23