
...making Linux just a little more fun!

Load average vs CPUs

Mike Orr [sluggoster at gmail.com]


Mon, 23 Aug 2010 13:15:26 -0700

Hello Answer Gang. I was raised to believe that the load average reported by 'uptime' and 'top' should not go above 1.0 except for short-term transitory cases, and if it persistently goes to 2 or 3 then either you seriously need more hardware or you're trying to run too much. But recently I heard that the target number is actually equal to the number of CPUs. My server has 4 CPUs according to /proc/cpuinfo, so does that mean I should let the load average go up to 4 before being concerned?

In fact, my load average has not been that high. It was 1.5 or 1.7 when I noticed the server bogging down noticeably, and that was probably because of a large backup rsync, and a webapp that was being unusually memory-grubbing. Still, my basic question remains, is 1.5 or 2 normal for a multi-CPU system?

One other issue is, I'm not sure if it really has four full CPUs. I think it might have two of those "dual-core" thingies that have two processors but share the same I/O and cache or something. Here's a bit of the 'dmesg' output in case it's meaningful:

[    0.000000] Initializing CPU#0
[    0.000000] Checking aperture...
[    0.000000] No AGP bridge found
[    0.000000] Calgary: detecting Calgary via BIOS EBDA area
[    0.000000] Calgary: Unable to locate Rio Grande table in EBDA - bailing!
[    0.000000] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    0.000000] Placing 64MB software IO TLB between ffff8800091de000 - ffff88000d1de000
[    0.000000] software IO TLB at phys 0x91de000 - 0xd1de000
[    0.000000] Memory: 8185276k/8912896k available (5499k kernel code, 524948k absent, 202672k reserved, 3081k data, 796k init)
...
[    0.020000] Initializing CPU#1
[    0.020000] CPU: Trace cache: 12K uops, L1 D cache: 16K
[    0.020000] CPU: L2 cache: 1024K
[    0.020000] CPU 1/0x6 -> Node 0
[    0.020000] CPU: Physical Processor ID: 3
[    0.020000] CPU: Processor Core ID: 0
[    0.020000] CPU1: Thermal monitoring enabled (TM1)
[    0.300088] CPU1: Intel(R) Xeon(TM) CPU 2.80GHz stepping 09
[    0.300099] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
[    0.310146] Booting processor 2 APIC 0x1 ip 0x6000
[    0.020000] Initializing CPU#2
[    0.020000] CPU: Trace cache: 12K uops, L1 D cache: 16K
[    0.020000] CPU: L2 cache: 1024K
[    0.020000] CPU 2/0x1 -> Node 0
[    0.020000] CPU: Physical Processor ID: 0
[    0.020000] CPU: Processor Core ID: 0
[    0.020000] CPU2: Thermal monitoring enabled (TM1)
[    0.470108] CPU2: Intel(R) Xeon(TM) CPU 2.80GHz stepping 09
[    0.470120] checking TSC synchronization [CPU#0 -> CPU#2]: passed.
[    0.480175] Booting processor 3 APIC 0x7 ip 0x6000
[    0.020000] Initializing CPU#3
[    0.020000] CPU: Trace cache: 12K uops, L1 D cache: 16K
[    0.020000] CPU: L2 cache: 1024K
[    0.020000] CPU 3/0x7 -> Node 0
[    0.020000] CPU: Physical Processor ID: 3
[    0.020000] CPU: Processor Core ID: 0
[    0.020000] CPU3: Thermal monitoring enabled (TM1)
[    0.640119] CPU3: Intel(R) Xeon(TM) CPU 2.80GHz stepping 09
[    0.640128] checking TSC synchronization [CPU#0 -> CPU#3]: passed.
[    0.650032] Brought up 4 CPUs
[    0.650037] Total of 4 processors activated (22401.29 BogoMIPS).
-- 
Mike Orr <sluggoster at gmail.com>




afsilva at gmail.com [afsilva at gmail.com]


Mon, 23 Aug 2010 18:54:29 -0400

On Mon, Aug 23, 2010 at 4:15 PM, Mike Orr <sluggoster at gmail.com> wrote:

> Hello Answer Gang. I was raised to believe that the load average
> reported by 'uptime' and 'top' should not go above 1.0 except for
> short-term transitory cases, and if it persistently goes to 2 or 3
> then either you seriously need more hardware or you're trying to run
> too much. But recently I heard that the target number is actually
> equal to the number of CPUs. My server has 4 CPUs according to
> /proc/cpuinfo, so does that mean I should let the load average go up
> to 4 before being concerned?
>
> In fact, my load average has not been that high. It was 1.5 or 1.7
> when I noticed the server bogging down noticeably, and that was
> probably because of a large backup rsync, and a webapp that was being
> unusually memory-grubbing.  Still, my basic question remains, is 1.5
> or 2 normal for a multi-CPU system?
>
> One other issue is, I'm not sure if it really has four full CPUs. I
> think it might have two of those "dual core" thingys that have two
> processors but share the same I/O and cache or something. Here's a bit
> of the 'dmesg' output in case it's meaningful:
>
> [dmesg output snipped; quoted in full in the original message above]

That is correct. Regardless of whether they are true CPUs or cores, you should be OK up to a load of 1.0 times the number of CPUs listed in your /proc/cpuinfo.

Take a look at this article; it is a great explanation of load averages: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
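A quick way to check this on the box itself (a minimal sketch; 'nproc' comes with coreutils, and /proc/loadavg holds the 1-, 5- and 15-minute averages):

$ nproc                     # logical CPUs the kernel sees
$ cat /proc/loadavg         # 1-, 5- and 15-minute load averages
$ awk -v n="$(nproc)" '{ printf "per-CPU load: %.2f\n", $1 / n }' /proc/loadavg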

AS

--
http://www.the-silvas.com





Ben Okopnik [ben at okopnik.com]


Mon, 23 Aug 2010 20:27:39 -0400

Hey, Mike -

On Mon, Aug 23, 2010 at 01:15:26PM -0700, Mike Orr wrote:

> Hello Answer Gang. I was raised to believe that the load average
> reported by 'uptime' and 'top' should not go above 1.0 except for
> short-term transitory cases, and if it persistently goes to 2 or 3
> then either you seriously need more hardware or you're trying to run
> too much. But recently I heard that the target number is actually
> equal to the number of CPUs. My server has 4 CPUs according to
> /proc/cpuinfo, so does that mean I should let the load average go up
> to 4 before being concerned?

Well, it is true that load averages should be read per core: total load divided by the number of cores. This is certainly the way it's done in Solaris, and I don't doubt that it's the same for other Unices as well:

http://www.princeton.edu/~unix/Solaris/troubleshoot/cpuload.html

Note the following part, by the way:

Load averages above 1 per CPU indicate that the CPUs are fully utilized.
Depending on the type of load and the I/O requirements, user-visible
performance may not be affected until levels of 2 per CPU are reached. A
general rule of thumb is that load averages that are persistently above
4 times the number of CPUs will result in sluggish performance.
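
(For a 4-CPU box like Mike's, those per-CPU figures work out to overall load averages of about 4, 8, and 16.)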

I don't know how well this part applies to Linux, though. I've seen a system bog quite badly when the webserver was being DOSed; the load average was barely above 2 (single core machine), but even typing at the command line showed considerable latency, and the mailserver couldn't spawn any new threads. As always, YMMV.

Which reminds me: I have a love/hate relationship with Solaris; certain things about it are as joyful as discovering half a worm in your apple, but there are others that are just stellar. 'sar' is one of the latter, and unfortunately the Linux version isn't quite as slick and featureful. I wish they'd just emulate the Solaris version, option for option - although I don't know that the Linux kernel reports quite as much, and with enough granularity, to provide the same type of info. I guess if I really cared, I'd become a kernel geek... :)

> In fact, my load average has not been that high. It was 1.5 or 1.7
> when I noticed the server bogging down noticeably, and that was
> probably because of a large backup rsync, and a webapp that was being
> unusually memory-grubbing.  Still, my basic question remains, is 1.5
> or 2 normal for a multi-CPU system?

Sure. If you're on a four-core system, that's equivalent to a per-core load of about 0.5, which is perfectly normal.

> One other issue is, I'm not sure if it really has four full CPUs. I
> think it might have two of those "dual core" thingys that have two
> processors but share the same I/O and cache or something.

I'm not familiar enough with "those dual-core thingies" to guess at that, but 'iostat' and 'vmstat' ought to tell you whether you're getting I/O-bound or memory-bound, respectively (that's my usual guess when a system bogs down, BTW; CPU overloading is relatively rare by comparison).
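
For reference, a minimal sketch of those checks (on most distributions vmstat ships with procps and iostat with the sysstat package):

$ vmstat 5          # watch 'r' (run queue), 'si'/'so' (swap in/out) and 'wa' (I/O wait)
$ iostat -x 5       # per-device stats; a disk sitting near 100 in the %util column is the bottleneck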

> Here's a bit
> of the 'dmesg' output in case it's meaningful:

[snippage+greppage]

> [    0.000000] Initializing CPU#0
> [    0.020000] Initializing CPU#1
> [    0.020000] CPU: Physical Processor ID: 3
> [    0.020000] CPU: Processor Core ID: 0
> [    0.020000] Initializing CPU#2
> [    0.020000] CPU: Physical Processor ID: 0
> [    0.020000] CPU: Processor Core ID: 0
> [    0.020000] Initializing CPU#3
> [    0.020000] CPU: Physical Processor ID: 3
> [    0.020000] CPU: Processor Core ID: 0
> [    0.650032] Brought up 4 CPUs
> [    0.650037] Total of 4 processors activated (22401.29 BogoMIPS).

It looks like two physical CPUs with four cores. Yeah, hit up vmstat and iostat and see where it's getting maxed.

-- 
                       OKOPNIK CONSULTING
        Custom Computing Solutions For Your Business
Expert-led Training | Dynamic, vital websites | Custom programming
               443-250-7895    http://okopnik.com




Henry Grebler [henrygrebler at optusnet.com.au]


Tue, 24 Aug 2010 22:59:45 +1000

-->It looks like two physical CPUs with four cores. Yeah, hit up vmstat and
-->iostat and see where it's getting maxed.

If you have 2 cores with hyperthreading, it will register as 4 CPUs to Linux (and probably other OSes) but give the grunt of 2 + ~20%, i.e. about 2.4 CPUs' worth.
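
A quick way to check (a sketch, assuming the kernel exposes the 'siblings' and 'cpu cores' fields in /proc/cpuinfo, as current kernels do):

$ grep -E 'physical id|siblings|cpu cores' /proc/cpuinfo | sort -u
# one 'physical id' per package; if siblings > cpu cores,
# the extra logical CPUs are hyperthreads rather than real cores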

I guess if you look up "hyperthreading" you can get chapter and verse.

In any case, hyperthreading will probably produce verse performance than you might have expected :-)




Mulyadi Santosa [mulyadi.santosa at gmail.com]


Wed, 25 Aug 2010 00:56:05 +0700

Hi...

On Tue, Aug 24, 2010 at 03:15, Mike Orr <sluggoster at gmail.com> wrote:

> Hello Answer Gang. I was raised to believe that the load average
> reported by 'uptime' and 'top' should not go above 1.0 except for
> short-term transitory cases, and if it persistently goes to 2 or 3
> then either you seriously need more hardware or you're trying to run
> too much.

I understand why you thought so. Even I... once thought that 1.0 means 100%, on all CPUs :D heheheheheh

> But recently I heard that the target number is actually
> equal to the number of CPUs. My server has 4 CPUs according to
> /proc/cpuinfo,

Ehm, maybe the command below would help:

$ egrep 'core id|physical id' /proc/cpuinfo

Sample output from my Core Duo machine:

physical id : 0
core id     : 0
physical id : 0
core id     : 1

Different core id, same physical id --> dual core :D

>so does that mean I should let the load average go up
> to 4 before being concerned?

"4", to me, in this case, it's like "OK, I can max it up to 100%...on every CPU"...but, do you only concern about "CPU load" when it comes to determine your overall machine load?

> In fact, my load average has not been that high. It was 1.5 or 1.7
> when I noticed the server bogging down noticeably, and that was
> probably because of a large backup rsync, and a webapp that was being
> unusually memory-grubbing.

Precisely :) So sometimes a load "< n" (where n is the number of your cores) can already mean "I am maxed out" :D

We need to look at other metrics here...

>Still, my basic question remains, is 1.5
> or 2 normal for a multi-CPU system?

In your case, without seeing the other aspects: normal.

--
regards,

Mulyadi Santosa
Freelance Linux trainer and consultant

blog: the-hydra.blogspot.com
training: mulyaditraining.blogspot.com




Mike Orr [sluggoster at gmail.com]


Tue, 24 Aug 2010 11:51:14 -0700

On Tue, Aug 24, 2010 at 10:56 AM, Mulyadi Santosa <mulyadi.santosa at gmail.com> wrote:

>>so does that mean I should let the load average go up
>> to 4 before being concerned?
>
> "4", to me, in this case, it's like "OK, I can max it up to 100%...on
> every CPU"...but, do you only concern about "CPU load" when it comes
> to determine your overall machine load?

Well, of course. My programs are not CPU-intensive number crunching or image transformation. The bottleneck is usually disk I/O or memory grubbing. But even these make the load average rise indirectly, because programs have to wait on the run queue while the kernel swaps memory or seeks the disk. So the load average is still a useful indicator even if the culprit isn't CPU intensity.

I generally find that a load average of 1.5 is associated with bog-down, even on a 4-core server. (At least on a typical web+database server; other applications may be different.) So that tells me it's time to look around and see if something needs to be killed.

In this case, I killed a webapp that had grubbed 4 GB of resident memory for no legitimate reason and was stuck, and killed an rsync that I wasn't sure was legitimate (it turned out it was), and responsiveness came back and the load average went down to 0.5.
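
For what it's worth, on Linux the load average also counts processes in uninterruptible sleep (state 'D', usually waiting on disk), which is why an I/O- or memory-bound box can show a high load without much actual CPU use. A quick sketch for spotting the culprits:

$ ps -eo state,pid,user,comm | awk '$1 == "R" || $1 == "D"'   # runnable, or stuck waiting on I/O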

-- 
Mike Orr <sluggoster at gmail.com>

