So uh...how do you know which pin_cpus to use?

Never really understood this:

"The correct pin_cpus setting to use is dependent on your CPU architecture. Intel and AMD systems enumerate processors in different ways. Using the wrong pin_cpus setting can cause poor performance."

Is there a magical formula? Any advice would help, thanks.

James

The best thing to do is to install the hwloc package and use the lstopo or lstopo-no-graphics tool to render a big ASCII art image of the system.

On CentOS 7 this works:

lstopo-no-graphics --of txt
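
If hwloc isn't installed yet, it's in the stock repos; something like this should do it (assuming CentOS 7 with yum, adjust for your distro):

# install hwloc (provides lstopo-no-graphics, hwloc-ps, hwloc-calc, ...)
sudo yum install -y hwloc
# same rendering, with PCI devices hidden if the picture gets too busy
lstopo-no-graphics --no-io --of txt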

You'll get something that looks like this:

https://www.open-mpi.org/projects/hwloc/lstopo/images/2XeonE5v2+2cuda+1display_v1.11.png

or

https://www.open-mpi.org/projects/hwloc/lstopo/images/4Opteron6200.v1.11.png

The numbers towards the bottom are the cpu ids. So you can see that using something like

1,3,5,7,9,11,13,15,17,19,21,23,25

on an Intel CPU would be the worst thing you could do, since 21, 23, and 25 are on the same physical cores as 1, 3, and 5.

Oh, I should add: "...on that particular system." On some of our NUMA machines the allocation is different and 1,3,5,7,9 would be the right CPUs to use!
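
For context, this setting lives per worker in broctl's node.cfg. A rough sketch of an entry using that NUMA example, where host, interface, and lb_method are placeholders you'd replace with your own:

[worker-1]
type=worker
host=localhost
interface=eth0
lb_method=pf_ring
lb_procs=5
pin_cpus=1,3,5,7,9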

Ok cool thanks Justin...so basically I wanna stagger these out so I don't have several processes on the same core ya?

cat /proc/cpuinfo | egrep "processor|core id"
processor : 0
core id : 0
processor : 1
core id : 0
processor : 2
core id : 1
processor : 3
core id : 1
processor : 4
core id : 2
processor : 5
core id : 2
processor : 6
core id : 3
processor : 7
core id : 3
processor : 8
core id : 4
processor : 9
core id : 4
processor : 10
core id : 5
processor : 11
core id : 5
processor : 12
core id : 0
processor : 13
core id : 0
processor : 14
core id : 1
processor : 15
core id : 1
processor : 16
core id : 2
processor : 17
core id : 2
processor : 18
core id : 3
processor : 19
core id : 3
processor : 20
core id : 4
processor : 21
core id : 4
processor : 22
core id : 5
processor : 23
core id : 5
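
One wrinkle: core id repeats per physical package, so on a multi-socket box it's worth pulling in physical id as well, or letting lscpu (from util-linux) line everything up:

egrep "processor|physical id|core id" /proc/cpuinfo
lscpu -e=CPU,NODE,SOCKET,CORE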

1,3,5,7,9,11 seem to be the best ones here. Thanks...that's super helpful!

James

Possibly... I'd check it against what hwloc says. I think just turning off hyper-threading makes this even easier, since that completely removes the possibility of accidentally pinning two workers to the same core.
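
A quick way to see whether HT is on at all:

# "Thread(s) per core: 2" means hyper-threading is enabled, 1 means it is off
lscpu | egrep 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'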

Sweet...thanks Justin...hwloc is a cool app!

James

Yeah... it can be a bit confusing though since it has both a 'logical' (-l) and a 'physical' (-p) view.

I _think_ that the CPU ids in the physical view match what taskset uses via broctl.

Fortunately, you can run hwloc-ps -p and compare which PIDs are mapped to which CPUs to verify it is working right.
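
For example:

# physical (OS) indexes, which should match what taskset uses
lstopo-no-graphics -p --no-io --of console
# hwloc's logical numbering, for comparison
lstopo-no-graphics -l --no-io --of console
# processes that have an explicit cpu binding, shown with physical indexes
hwloc-ps -p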

2.6 kernels on Linux enumerate hyper-threads in a different way than 3.x and 4.x do:

2.6:

Core 0, thread 0
Core 0, thread 1

etc.

3.x / 4.x:

Cores 0-N on CPU 0 (the first thread of each core),
then the same on CPU 1,
then the second threads on CPU 0,
then the second threads on CPU 1.
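
Whichever kernel you're on, the sysfs topology files give the authoritative mapping, so you can always check directly (standard Linux sysfs paths):

# logical cpus that share cpu0's physical core (i.e. its HT siblings)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# the core and package behind cpu0
cat /sys/devices/system/cpu/cpu0/topology/core_id
cat /sys/devices/system/cpu/cpu0/topology/physical_package_id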

Results for HT vs. cross-NUMA are about to be published soon :wink:
I don't like cache misses when CPU 1 is reaching for data on node 0, though. It is not about cross-NUMA bandwidth; it's the fact that in the worst case you have 67 ns to process the smallest packet at 10Gbit. And an L3 hit on Ivy Bridge is at least 15 ns.
A miss is 5x that.
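
(For reference, the 67 ns is just the wire time of a minimum-size frame, counting preamble and inter-frame gap:)

# 64 B frame + 8 B preamble + 12 B inter-frame gap = 84 B = 672 bits on the wire
# 672 bits / 10 Gbit/s = 67.2 ns per packet
awk 'BEGIN { printf "%.1f ns\n", 84 * 8 / 10e9 * 1e9 }'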

Ah! That explains a lot. I wonder if the NUMA allocation changed too. We just upgraded some machines from CentOS 6 to 7 and I was wondering why the meticulously written node.cfg we had been using for months now appeared completely wrong.

I wonder if broctl should support hwloc for CPU pinning instead of taskset. I wouldn't mind having an 'auto' mode that just does the right thing.
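
A rough manual approximation, sketched with a placeholder command since broctl doesn't do this today, would be to launch a worker under hwloc-bind instead of taskset:

# bind a process to core 2 (hwloc location syntax; indexes are hwloc-logical unless you pass -p)
hwloc-bind core:2 -- some-worker-command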

It looks like on our dual-socket NUMA box we should be using

0,2,4,6,8,10,12,14 for one 10G card and
1,3,5,7,9,11,13,15 for the other 10G card.

0-19 are the physical cores and 20-39 are the HT siblings, but using 0,1,2,3 flips between NUMA nodes, which is not what anyone wants.
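
To double-check which CPUs live on which node, and which node each NIC actually hangs off (assuming numactl is installed; the interface names are placeholders):

# cpu-to-node map and per-node memory
numactl --hardware
# NUMA node a NIC's PCI slot is attached to (-1 means no affinity reported)
cat /sys/class/net/eth0/device/numa_node
cat /sys/class/net/eth1/device/numa_node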

Lesson learned for me: never answer from a phone, especially when trying to cover NUMA allocation for 56 threads on 4 inches of screen :wink:

I take back what I said. Here is how it looks; I'm in front of a server with 2x NICs. I have E5-2697 v3 CPUs here, 14 physical cores per CPU, HT enabled, kernel 4.4.something.

0-13 - NUMA node 0, CPU 0, first hardware thread of each core

14-27 - NUMA node 1, CPU 1, first hardware thread of each core

28-41 - NUMA node 0, CPU 0, second hardware thread of each core

42-55 - NUMA node 1, CPU 1, second hardware thread of each core

1st card should use virtual cores (AKA threads) 0-13 + 28-41

2nd card should use 14-27 + 42-55
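
An easy sanity check for that split is the NUMA summary in lscpu; on a box laid out like this it should report node0 as 0-13,28-41 and node1 as 14-27,42-55:

lscpu | grep 'NUMA node'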