How to keep the GPU (and CPU) cooler
Latest revision as of 22:50, 5 January 2014
The nouveau errors co-occur with high temperatures (as seen in the journalctl output), so, as an experiment, I want to try to keep the GPU cool and see whether the errors stop occurring.
My graphical errors lead to almost complete freezing of X (say, during Skype video calls).
My system is ALT p7 with linux 3.10.21-std-def-alt1 on an HP laptop.
 # lspci | fgrep -i vga
 02:00.0 VGA compatible controller: NVIDIA Corporation C77 [GeForce 8200M G] (rev a2)
 #
Keep the Nvidia GPU cooler
By reading documentation on the web and by experimenting, I have found out that setting the threshold temperatures (in sysfs) to lower values does cause the GPU to become cooler!
Current temperatures (and the threshold values, except for "fanboost") are shown by the sensors command.
I use the sysfsutils package to set the values in sysfs on boot.[1] (service sysfs restart will apply new values later, when needed, but service sysfs start will not!)
 #
 # /etc/sysfs.conf - Configuration file for setting sysfs attributes.
 #

 # (temperature is in millidegrees)

 # "fanboost"; default: 90 degrees C
 class/drm/card0/device/hwmon/hwmon0/temp1_auto_point1_temp = 85000
 # "downclock"; default: 95
 class/drm/card0/device/hwmon/hwmon0/temp1_max = 83000
 # "critical"; default: 105
 class/drm/card0/device/hwmon/hwmon0/temp1_crit = 95000
 # "emergency"; default: 135
 # class/drm/card0/device/hwmon/hwmon0/temp1_emergency = 135000
"downclock" seems to have the most effect. Because I want the fan to be silent sometimes, and "fanboost" turned out not to be a really effective way to cool down the GPU, I raised the "fanboost" threshold above the "downclock" threshold (unlike in the default settings).
resume after suspend
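It has been observed (http://forum.altlinux.org/index.php/topic,12229.msg154855.html#msg154855) that the sysfsutils settings are not reapplied when coming back from a suspend. Therefore additional care must be taken to set the parameters again on resume!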
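One possible way to reapply the settings on resume, sketched here and not tested on this laptop: a small shell function that re-reads the "attribute = value" lines of /etc/sysfs.conf and writes them into sysfs, called from a systemd sleep hook. The function names apply_sysfs_conf and sleep_hook are made up for this sketch, the hook directory (/usr/lib/systemd/system-sleep/ on many systems) is an assumption for ALT, and only plain "attribute = value" lines are handled.

```shell
# Strip leading and trailing whitespace from the first argument.
trim() {
    printf '%s\n' "$1" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}

# Apply "attribute = value" lines from a sysfs.conf-style file.
# $1: config file; $2: sysfs root (defaults to /sys).
# Comment lines, blank lines and attributes that are not writable are skipped.
apply_sysfs_conf() {
    conf=$1
    root=${2:-/sys}
    while IFS= read -r line || [ -n "$line" ]; do
        line=$(trim "$line")
        case $line in
            ''|'#'*) continue ;;   # blank line or comment
            *=*) ;;                # looks like an assignment
            *) continue ;;
        esac
        attr=$(trim "${line%%=*}")
        value=$(trim "${line#*=}")
        if [ -w "$root/$attr" ]; then
            printf '%s\n' "$value" > "$root/$attr"
        fi
    done < "$conf"
}

# systemd passes "pre" or "post" as the first argument to sleep hooks;
# only act after resume.
sleep_hook() {
    case ${1:-} in
        post) apply_sysfs_conf /etc/sysfs.conf /sys ;;
    esac
}
```

An executable hook file containing these functions would end with the line sleep_hook "$@". Since service sysfs restart already reapplies the values, calling that from the hook instead may be the simpler choice.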
Monitoring output examples
sensors output example
 # sensors
 nouveau-pci-0200
 Adapter: PCI adapter
 temp1: +81.0°C (high = +83.0°C, hyst = +3.0°C)
                (crit = +95.0°C, hyst = +5.0°C)
                (emerg = +135.0°C, hyst = +5.0°C)

 k10temp-pci-00c3
 Adapter: PCI adapter
 temp1: +53.6°C (high = +70.0°C)

 #
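The same readings can be taken straight from sysfs, where they are stored in millidegrees. A minimal sketch, assuming the hwmon directory is passed in by the caller (on the system above it would be /sys/class/drm/card0/device/hwmon/hwmon0); the function names millideg_to_c and show_temps are made up for this example.

```shell
# Convert a millidegree reading (as stored in sysfs) to degrees Celsius
# with one decimal place, using integer arithmetic only.
# Assumes a non-negative reading.
millideg_to_c() {
    m=$1
    printf '%d.%d\n' "$((m / 1000))" "$(((m % 1000) / 100))"
}

# Print "<file>: <degrees> C" for every temp*_input file in a hwmon directory.
# $1: hwmon directory (an assumption; pass the one that exists on your system).
show_temps() {
    dir=$1
    for f in "$dir"/temp*_input; do
        [ -r "$f" ] || continue
        printf '%s: %s C\n' "$f" "$(millideg_to_c "$(cat "$f")")"
    done
}
```

For example, show_temps /sys/class/drm/card0/device/hwmon/hwmon0 would print the GPU temperature that sensors reports as temp1.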
example of temperature threshold hits from journalctl
 # journalctl | fgrep nouveau | fgrep -10 critical
 дек 15 23:28:07 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:07 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:08 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:08 host-17.localdomain kernel[1506]: [ 6914.386641] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:08 host-17.localdomain kernel[1506]: [ 6914.387656] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:08 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:10 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:10 host-17.localdomain kernel[1506]: [ 6916.421280] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:10 host-17.localdomain kernel[1506]: [ 6916.422291] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:10 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:28:40 host-17.localdomain kernel: nouveau [ PTHERM][0000:02:00.0] temperature (106 C) hit the 'critical' threshold
 дек 15 23:28:40 host-17.localdomain kernel[1506]: [ 6945.541057] nouveau [ PTHERM][0000:02:00.0] temperature (106 C) hit the 'critical' threshold
 дек 15 23:29:16 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:16 host-17.localdomain kernel[1506]: [ 6981.487662] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:16 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:16 host-17.localdomain kernel[1506]: [ 6981.488677] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:17 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:17 host-17.localdomain kernel[1506]: [ 6982.722574] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:17 host-17.localdomain kernel[1506]: [ 6982.723589] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:17 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:17 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 15 23:29:17 host-17.localdomain kernel[1506]: [ 6983.152643] nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 --
 дек 18 18:32:25 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:25 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:25 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:25 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:25 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:25 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:27 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:27 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:27 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:27 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:32:53 host-17.localdomain kernel: nouveau [ PTHERM][0000:02:00.0] temperature (106 C) hit the 'critical' threshold
 дек 18 18:45:10 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
 дек 18 18:45:10 host-17.localdomain kernel: nouveau E[ PTHERM][0000:02:00.0] unhandled intr 0x00000001
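To see at a glance how often the threshold is hit, the journal can be summarized per day. A minimal sketch, assuming the log lines have the shape shown above (month and day as the first two fields); the function name count_critical_hits is made up for this example.

```shell
# Read log lines on stdin and print, for each day, how many times nouveau
# reported hitting the 'critical' threshold.
# The day is taken from the first two fields (month and day of month).
count_critical_hits() {
    grep -F nouveau | grep -F "hit the 'critical' threshold" |
        awk '{ n[$1 " " $2]++ } END { for (d in n) print d, n[d] }' | sort
}
```

It would be used as journalctl | count_critical_hits. Note that the same event can show up twice in the journal (once as kernel: and once as kernel[1506]:, as on December 15 above), so the counts may be doubled.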
And keep the CPU cooler
Heat from the CPU reaches the GPU too, so cooling down the GPU without also bringing down really high CPU temperatures is not effective.
I do not yet know of a similar mechanism for the CPU that would set a downclock policy based on temperature thresholds, so I simply use the brute force of a governor.
I'm unhappy that it will be active all the time though, of course. Usually I'm happy with the "ondemand" governor, and only when the temperature rises do I want to scale down...
TODO: find out how to do this! (Have a look at this, via http://forum.altlinux.org/index.php/topic,30120.msg213519.html#msg213519: "Additional control for modern Intel CPUs is available with the Linux Thermal Daemon (available as thermald in the AUR), which proactively controls thermal using P-states, T-states, and the Intel power clamp driver." (quoted from https://wiki.archlinux.org/index.php/CPU_Frequency_Scaling) And what about AMD?)
 #
 # /etc/sysfs.conf - Configuration file for setting sysfs attributes.
 #

 devices/system/cpu/cpu0/cpufreq/scaling_governor = powersave
 devices/system/cpu/cpu1/cpufreq/scaling_governor = powersave
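As a possible answer to the TODO above, here is an untested sketch of switching the governor only when the CPU gets hot: "powersave" above an upper threshold, back to "ondemand" below a lower one, and no change in between (so the governor does not flap). The function names and the lower threshold are assumptions made up for this example; the temperature file would be the k10temp temp1_input attribute found in sysfs on this machine.

```shell
# Decide which governor to use.
# $1: temperature in millidegrees; $2: current governor;
# $3: upper threshold; $4: lower threshold (both in millidegrees).
pick_governor() {
    temp=$1 current=$2 high=$3 low=$4
    if [ "$temp" -ge "$high" ]; then
        echo powersave
    elif [ "$temp" -le "$low" ]; then
        echo ondemand
    else
        echo "$current"    # between the thresholds: keep what we have
    fi
}

# One polling step: read the temperature and update every CPU's governor.
# $1: temperature file; $2: CPU directory (normally /sys/devices/system/cpu);
# $3, $4: upper and lower thresholds in millidegrees.
govern_once() {
    temp_file=$1 cpu_root=$2 high=$3 low=$4
    temp=$(cat "$temp_file") || return 1
    for g in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$g" ] || continue
        now=$(cat "$g")
        want=$(pick_governor "$temp" "$now" "$high" "$low")
        [ "$want" = "$now" ] || echo "$want" > "$g"
    done
}
```

Run as root in a loop, for example: while sleep 10; do govern_once /sys/devices/pci0000:00/0000:00:18.3/temp1_input /sys/devices/system/cpu 70000 60000; done. The 70000 matches the "high" value that sensors shows for k10temp; the 60000 is a guess. The fixed powersave lines in /etc/sysfs.conf would have to be dropped for this to make sense.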
FYI: more temperature attributes in sysfs
Perhaps we can make some use of them.
 # find /sys/ -name '*temp*'
 /sys/bus/pci/drivers/k10temp
 /sys/devices/pci0000:00/0000:00:18.3/temp1_input
 /sys/devices/pci0000:00/0000:00:18.3/temp1_max
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_crit
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_crit_hyst
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_auto_point1_pwm
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_max_hyst
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_auto_point1_temp
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_emergency_hyst
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_auto_point1_temp_hyst
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_emergency
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_input
 /sys/devices/pci0000:00/0000:00:0b.0/0000:02:00.0/hwmon/hwmon0/temp1_max
 /sys/devices/virtual/thermal/thermal_zone0/temp
 /sys/devices/virtual/thermal/thermal_zone0/emul_temp
 /sys/devices/virtual/thermal/thermal_zone0/trip_point_0_temp
 /sys/devices/virtual/thermal/thermal_zone0/trip_point_1_temp
 /sys/devices/virtual/thermal/thermal_zone0/trip_point_2_temp
 /sys/devices/virtual/thermal/thermal_zone1/temp
 /sys/devices/virtual/thermal/thermal_zone1/emul_temp
 /sys/devices/virtual/thermal/thermal_zone1/trip_point_0_temp
 /sys/devices/virtual/thermal/thermal_zone1/trip_point_1_temp
 /sys/devices/platform/hp-wmi/hddtemp
 /sys/module/hwmon/holders/k10temp
 /sys/module/k10temp
 /sys/module/k10temp/drivers/pci:k10temp
See also
- ru:cpufreq
Notes and refs
- [1] After the advice from ru:cpufreq.