BASH script to monitor subprocess and throttle it for CPU temperature control
























$begingroup$


I need to run CPU-intensive tasks on a very old machine with overheating issues. The script below will monitor temperature and pause the jobs when it gets too high, continuing them when it's back to normal.



The actual commands run are, of course, not included, since they are irrelevant to the question.



I am looking for hidden traps I may have set in my code (listed at the bottom), and for other things I have done incorrectly. Aside from special characters in the commands and arguments that are run, which are hand-created so I can control that risk, what traps or "gotchas" have I unknowingly built into the code? What ways are there to make this more error-proof, or better in other ways?



For the timing function I know I could have used the



time { command ...; command ...; }


construct, but I was more interested in the time spent by the machine (and previously, by me in the chair) than in the CPU time involved.
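For reference, the `time` keyword also prints a `real` line, which is wall-clock time. If only the elapsed wall-clock time is wanted, bash's `SECONDS` variable is one option. The following is a minimal sketch, not part of the original script: `fmt_hms` is a hypothetical helper name, and `sleep 1` stands in for the real command. It formats with shell arithmetic because `date -u -d @N +%T` wraps around once a run exceeds 24 hours.

```shell
#!/bin/bash
# Format a count of seconds as HH:MM:SS using only shell arithmetic.
# Unlike `date -u -d @N +%T`, the hour field keeps counting past 24.
fmt_hms() {
    local total=$1
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# SECONDS counts seconds since the shell started (or since it was last
# assigned), so it measures wall-clock time, not CPU time.
SECONDS=0
sleep 1                 # stand-in for the real long-running command
fmt_hms "$SECONDS"
```

Here `fmt_hms 90000` prints `25:00:00`, where the `date` form would print `01:00:00`.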



The Script:



The code comments should explain what it does, as well as why I did some of it the way I did.



#!/bin/bash

# Build my time reporting function
function report {
    # Get the current time, do the math, report the results.
    end_time=$(date +%s);

    # The time used for the last run process
    proc_time=$(echo "$end_time"-"$start_time" | bc);
    echo " ******* Processing time: $(date -u -d @${proc_time} +%T)";

    # The cumulative time for all processes so far
    run_time=$(echo "$end_time"-"$launch_time" | bc);
    echo " ******* Running time: $(date -u -d @${run_time} +%T)";
}

# The high and low temperatures to monitor for. Processing is paused
# once the high temp is reached, and will not resume again until the
# low temp is reached.

# My system recovers to 60°C reasonably quick (idle is around 45°C)
temp_lo=60;

# My system dies at about 115°C - since 100°C is normal, suggests my
# sensors are not accurate, but I work with what I have.
# 20°C margin allows for delay in the detection of the high temp, and
# delay in the process pausing, while still keeping temp under danger
# zone. Also allows for when Core0 is rising faster than Core1. They
# seem to take turns being the leader, but seldom more than 5-10°C
# difference.
temp_hi=95;

# The routine to read the CPU temp with lm sensors. Could be coded
# inline in the watch_child function, but that means placing it in
# three places, and if the grep/sed needs adjusting, then I have to
# remember to change _all_ three, and not make any typos. This cuts
# my chance of errors to a third.
function get_temp {
    # the grep and/or sed may need changing for other sensor output
    # on different systems
    sensors | grep 'Core1' | sed -e 's/.*: \+\([+-][0-9.]\+\)°C.*$/\1/'
}

# Routine to monitor the CPU temp, pausing the processing as needed
# to remain in the 'safe' range for processor temperature.
function watch_child {
    # argument should be the PID of the backgrounded process
    childd=$1;
    # pre-load the CPU temp
    temp=$(get_temp);
    # As long as the backgrounded process is still running
    while [ -e /proc/$childd ]; do
        # Monitor the process, for still running, and the temp, still
        # safe
        while [ -e /proc/$childd ] && [ $(echo "$temp < $temp_hi" | bc) = 1 ]; do
            # wait a spell
            sleep 5;
            # re-load the temp for a re-check
            temp=$(get_temp);
        done
        # If the process is still running, then it was over-temp that
        # caused the while loop to end
        if [ -e /proc/$childd ]; then
            # Tell the process to take a break
            kill -SIGSTOP "$childd";
        fi
        # Drops through here if the process has ended, otherwise,
        # monitor the temp for a restart
        while [ -e /proc/$childd ] && [ $(echo "$temp > $temp_lo" | bc) = 1 ]; do
            # wait a spell
            sleep 5;
            # re-load the temp for a re-check
            temp=$(get_temp);
        done
        # Drop through here if the process has ended.
        if [ -e /proc/$childd ]; then
            # Otherwise, tell the process that the break is over.
            kill -SIGCONT "$childd";
        fi
    done
    # Only get this far once the process has ended.
    # In the rare case of the process never waking up, the outer while
    # loop will run infinitely!
    # Human monitoring still required!
}

# Start the timer for cumulative run time reports
launch_time=$(date +%s);

echo "********* The step to perform.";
# Start the timer for this process
start_time=$(date +%s);
# Launch the dangerous process in the background
my_long_running_command arg1 arg2 &
# Capture its PID
child=$!;
# Block, with temp throttling, until this process is done
watch_child $child;
report;

echo "********* The next step to perform.";
# Start the timer for the next process
start_time=$(date +%s);
# Launch the dangerous process in the background
another_long_running_command arg1 arg2 &
# Capture its PID
child=$!;
# Block, with temp throttling, until this process is done
watch_child $child;
report;


















$endgroup$



migrated from unix.stackexchange.com Jan 11 '17 at 10:04


This question came from our site for users of Linux, FreeBSD and other Un*x-like operating systems.


















  • $begingroup$
    Would fit well at Code Review as well.
    $endgroup$
    – phk
    Dec 29 '16 at 12:49










  • $begingroup$
    @phk Since cross-posting is out, how do I move this there?
    $endgroup$
    – Gypsy Spellweaver
    Jan 10 '17 at 9:12










  • $begingroup$
    Actually, I am not entirely sure. Either deleting it here and posting it there then or flagging this thread for moderator attention. If you don't find any info on this at help center, Code Review Meta or Meta Stack Exchange then ask at Code Review Meta.
    $endgroup$
    – phk
    Jan 10 '17 at 10:23








































bash
















asked Dec 29 '16 at 5:59









Gypsy Spellweaver

























1 Answer






























$begingroup$

Although unrelated to the code, I'll mention that it is unusual for a CPU to overheat, especially a dual-core CPU, except at very high ambient temperatures. I'd suggest removing the heat sink and re-applying thermal paste. Any number of YouTube videos can provide step-by-step instructions.



Moving on to the code:




  • terminal semicolons aren't needed
  • configuration should go at the top
  • kill -0 PID is a portable alternative to -e /proc/$pid
  • bash builtins let and [[ x -gt y ]] can replace bc for these purposes
  • [[ .. ]] is a builtin alternative to [ .. ]
  • date +%s can be replaced by builtin printf
  • gawk can extract the temperature more flexibly than grep+sed
  • your time/run/report pattern can be factored into a function
  • the monitoring loop can be simplified by moving sleep to the end
  • no real harm in monitoring more aggressively, since the loop is not going to use a lot of CPU
  • can save a couple of forks by reading temp directly from /sys
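As a rough illustration of the awk and `[[ ]]` points above, here is a minimal sketch. It assumes `sensors` prints lines shaped like the sample text, with a `Core 1` label; both the layout and the label are assumptions to check against your own output. `core_temp` is a hypothetical helper name. Because `[[ x -ge y ]]` compares integers only, the fractional part of the reading is dropped before the comparison.

```shell
#!/bin/bash
# Print the integer temperature for one label from sensors-style text
# on stdin. The label and line layout are assumptions; adjust to taste.
core_temp() {
    awk -F: -v label="$1" '
        $1 == label {
            sub(/^[ \t]*\+?/, "", $2)   # drop leading blanks and a leading "+"
            print int($2)               # int() keeps the leading number only
            exit
        }'
}

# Made-up sample standing in for the output of `sensors`.
sample='Core 0:        +52.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:        +97.0°C  (high = +80.0°C, crit = +100.0°C)'

temp=$(core_temp "Core 1" <<< "$sample")
temp_hi=95
# Integer comparison with the [[ ]] builtin; no bc process is forked.
if [[ $temp -ge $temp_hi ]]; then
    echo "over limit: $temp"
fi
```

In a real script the here-string would be replaced by a pipe from `sensors`.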


Putting it all together:



#!/bin/bash
temp_lo=60
temp_hi=95

temp_label=$( grep -l ^Core /sys/bus/platform/devices/coretemp.*/hwmon/hwmon*/temp*_label | head -1 )
temp_source=${temp_label%_label}_input

# A function rather than an alias: bash does not expand aliases in
# non-interactive scripts unless the expand_aliases option is set.
function now {
    printf '%(%s)T\n' -1
}

function watch_child {
    childd=$1
    # kill -0 sends no signal; it only tests that the PID still exists.
    # Its error message is silenced for when the child has exited.
    while kill -0 "$childd" 2>/dev/null; do
        # The /sys value is in millidegrees Celsius
        temp=$(( $(<"$temp_source") / 1000 ))
        [[ $temp -ge $temp_hi ]] && kill -SIGSTOP "$childd"
        [[ $temp -le $temp_lo ]] && kill -SIGCONT "$childd"
        sleep 1
    done
}

function elapsed {
    echo " ******* $1 time: $(date -u -d @$(( ${3:-$(now)}-$2 )) +%T)"
}

function monitor {
    launch_time=${launch_time:-$(now)}
    start_time=$(now)
    echo "********* $1"
    shift
    "$@" &
    watch_child $!
    elapsed Processing $start_time
    elapsed Running $launch_time
}

monitor "The step to perform." my_long_running_command arg1 arg2
monitor "The next step to perform." another_long_running_command arg1 arg2
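One thing neither version checks is whether the temperature reading is usable at all. If the sensor source is missing or returns something unexpected, the comparisons run on an empty or garbage value. A minimal sketch of a guarded read follows, assuming a hwmon-style file that holds millidegrees Celsius; `read_temp_c` is a hypothetical helper name, and the temporary file stands in for the real `/sys` path.

```shell
#!/bin/bash
# Read a file holding millidegrees Celsius and print whole degrees.
# Returns non-zero if the file is unreadable or does not hold an integer,
# so the caller can pick a safe default instead of comparing garbage.
read_temp_c() {
    local raw
    [[ -r $1 ]] || return 1
    raw=$(<"$1")
    [[ $raw =~ ^-?[0-9]+$ ]] || return 1
    echo $(( raw / 1000 ))
}

# Demonstration with a temporary file in place of the real sensor file.
tmp=$(mktemp)
echo 61000 > "$tmp"
if temp=$(read_temp_c "$tmp"); then
    echo "temp is $temp"
else
    echo "no reading; treat as too hot"
fi
rm -f "$tmp"
```

Treating a failed read as "too hot" is a design choice in this sketch: it pauses the job rather than letting it run unmonitored.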
















$endgroup$













  • $begingroup$
    Thanks for the review. Agreed on the root cause, but all remedies failed. It was a 12+ yr old core with a hard life. Semicolons are a personal style and convenience. kill is portable, yet [[ .. ]] isn't as much so. gawk over grep+sed is a great call, reducing CPU load as well (I think). Refactoring time/run/report is a good one too. Not so sure about the increased aggressiveness: the objective is not only to know when the core is cool, but also to allow it to cool as fast as possible.
    $endgroup$
    – Gypsy Spellweaver
    1 hour ago










  • $begingroup$
    One issue: whenever the temp is outside the band between the two thresholds, additional, unneeded kill commands will be issued. -SIGSTOP will be reissued every second for as long as the core is at or above temp_hi. Once the temp drops to temp_lo or below, -SIGCONT will be reissued every second. Would a single kill per threshold crossing not be better at conserving CPU resources?
    $endgroup$
    – Gypsy Spellweaver
    1 hour ago
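A signal-per-crossing variant can be sketched by tracking whether the child is currently running or stopped. This is an illustration only, not code from either post: `signal_for` is a hypothetical helper, and the list of readings is made up.

```shell
#!/bin/bash
temp_lo=60
temp_hi=95

# Given the current state (running|stopped) and a temperature, print the
# signal to send (STOP or CONT), or nothing if no threshold was crossed.
signal_for() {
    local state=$1 temp=$2
    if [[ $state == running && $temp -ge $temp_hi ]]; then
        echo STOP
    elif [[ $state == stopped && $temp -le $temp_lo ]]; then
        echo CONT
    fi
}

# Walk a made-up series of readings to show when signals would be sent.
state=running
for temp in 70 96 97 80 59 58; do
    sig=$(signal_for "$state" "$temp")
    case $sig in
        STOP) state=stopped ;;
        CONT) state=running ;;
    esac
    echo "$temp ${sig:-none}"
done
```

With these readings a signal is produced only at 96 (STOP) and at 59 (CONT). In a monitoring loop the non-empty result could be passed on as `kill -s "$sig" "$pid"`.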










  • $begingroup$
    There are really no CPU resources used: kill is a builtin that invokes a single syscall. On my system, ten million invocations is 2.5s of CPU time, or ~250ns each. Compare the sensors command line, also run once per loop, at ~8ms each, or 32000 times longer. [[ ]] is "portable" to any other bash and definitely more efficient than forking bc.
    $endgroup$
    – Oh My Goodness
    49 mins ago
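For readers unfamiliar with the `kill -0` idiom mentioned here and in the answer, a minimal sketch follows. `is_running` is a hypothetical wrapper name, and the short `sleep` stands in for a real job.

```shell
#!/bin/bash
# True while the PID exists and this user may signal it.
# kill -0 performs the checks but delivers no signal.
is_running() {
    kill -0 "$1" 2>/dev/null
}

sleep 0.2 &             # stand-in for the real background job
pid=$!
if is_running "$pid"; then
    echo "child $pid is running"
fi
wait "$pid"
if ! is_running "$pid"; then
    echo "child $pid has exited"
fi
```

Note that `kill -0` also fails for a live process owned by another user, so it suits monitoring your own children, which is the case in this script.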












  • $begingroup$
    To give an idea of the cost of forks (bc and [ .. ]) I ran both versions of watch_child with sleeps disabled and the same gawk-based get_temp. Based on loops executed per 5 seconds, the modified version is about 50% faster.
    $endgroup$
    – Oh My Goodness
    35 mins ago












  • $begingroup$
    edit: you can cut the use of sensors/gawk altogether; see edits to my code
    $endgroup$
    – Oh My Goodness
    21 mins ago




































1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









1












$begingroup$

Although unrelated to the code, I'll mention that for a CPU to overheat, especially a dual-core CPU, is not usual except with very high ambient temps. I'd suggest removing the heat sink and re-applying thermal paste. Any number of youtube videos can provide step-by-step instructions.



Moving on to the code:




  • terminal semicolons aren't needed

  • configuration should go at the top


  • kill -0 PID is a portable alternative to -e /proc/$pid

  • bash builtins let and [[ x -gt y ]] can replace bc for these purposes


  • [[ .. ]] is a builtin alternative to [ .. ]


  • date +%s can be replaced by builtin printf


  • gawk can extract the temperature more flexibly than grep+sed

  • your time/run/report pattern can be factored into a function

  • the monitoring loop can be simplified by moving sleep to the end

  • no real harm in monitoring more aggressively, since the loop is not going to use a lot of CPU

  • can save a couple of forks by reading temp directly from /sys


Putting it all together:



#!/bin/bash
temp_lo=60
temp_hi=95

temp_label=$( grep -l ^Core /sys/bus/platform/devices/coretemp.*/hwmon/hwmon*/temp*_label |head -1 )
temp_source=${temp_label%_label}_input
alias now="printf '%(%s)Tn' -1"

function watch_child {
childd=$1
while kill -0 $childd; do
temp=$(( $(<$temp_source) / 1000 ))
[[ $temp -ge $temp_hi ]] && kill -SIGSTOP $childd
[[ $temp -le $temp_lo ]] && kill -SIGCONT $childd
sleep 1
done
}

function elapsed {
echo " ******* $1 time: $(date -u -d @$(( ${3:-$(now)}-$2 )) +%T)"
}

function monitor {
launch_time=${launch_time:-$(now)}
start_time=$(now)
echo "********* $1"
shift
"$@" &
watch_child $!
elapsed Processing $start_time
elapsed Running $launch_time
}

monitor "The step to perform." my_long_running_command arg1 arg2
monitor "The next step to perform." another_long_running_command arg1 arg2





share|improve this answer











$endgroup$













  • $begingroup$
    Thanks for the review. Agreed on the root cause, but all remedies failed. It was a 12+ yr old core with a hard life. Semicolons are a personal style and convenience. kill is portable, yet [[ .. ]] isn't as much so. gawd over grep+sed is a great call, reducing CPU load as well (I think). Refactoring time/run/report is a good one too. Not so sure about the increased aggressiveness, the objective is to not only know when the core is cool, but also allow it to cool as fast as possible.
    $endgroup$
    – Gypsy Spellweaver
    1 hour ago










  • $begingroup$
    One issue: as long as the temp is not between the threshold values, additional, unneeded kill commands will be issued. -SIGSTOP will be reissued every second while the core is at or above temp_hi. Once the temp drops to temp_lo or below, -SIGCONT will be reissued every second. Would not only one kill per threshold crossing be better at conserving CPU resources?
    $endgroup$
    – Gypsy Spellweaver
    1 hour ago










  • $begingroup$
    There are really no CPU resources used: kill is a builtin that invokes a single syscall. On my system, ten million invocations is 2.5s of CPU time, or ~250ns each. Compare the sensors command line, also run once per loop, at ~8ms each, or 32000 times longer. [[ ]] is "portable" to any other bash and definitely more efficient than forking bc.
    $endgroup$
    – Oh My Goodness
    49 mins ago












  • $begingroup$
    To give an idea of the cost of forks (bc and [ .. ]) I ran both versions of watch_child with sleeps disabled and the same gawk-based get_temp. Based on loops executed per 5 seconds, the modified version is about 50% faster.
    $endgroup$
    – Oh My Goodness
    35 mins ago












  • $begingroup$
    edit: you can cut the use of sensors/gawk altogether; see edits to my code
    $endgroup$
    – Oh My Goodness
    21 mins ago
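For anyone who still prefers one signal per threshold crossing, as raised in the comments above, here is a rough sketch of how the state could be tracked. The names throttle_action, state and action are hypothetical and the temperature series is made up. The helper sets a variable instead of echoing so that it does not need a command substitution, which would itself cost a fork.

```bash
#!/bin/bash
temp_lo=60
temp_hi=95

# Hypothetical helper (not in the original answer): given the current state
# ("running" or "stopped") and a temperature, set $action to the signal to
# send, or to the empty string when nothing needs to change.
throttle_action() {
    local state=$1 temp=$2
    action=
    if [[ $state == running && $temp -ge $temp_hi ]]; then
        action=SIGSTOP
    elif [[ $state == stopped && $temp -le $temp_lo ]]; then
        action=SIGCONT
    fi
}

# Walk through a made-up series of readings: one signal per crossing.
state=running
for temp in 80 96 97 70 59 58; do
    throttle_action "$state" "$temp"
    case $action in
        SIGSTOP) state=stopped ;;
        SIGCONT) state=running ;;
    esac
    echo "$temp ${action:-none}"
done
```

Only the readings 96 and 59 produce a signal here; the rest print "none". In watch_child the echo would become something like kill -"$action" "$childd", guarded by a check that action is not empty. Given the timings quoted above, the saving is likely too small to measure, so this is mostly a matter of taste.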
























answered 2 hours ago by Oh My Goodness (edited 22 mins ago)





























