[ACCEPTED]-multithreaded blas in python/numpy-blas
I already posted this in another thread but I think it fits better in this one:
UPDATE (30.07.2014):
I re-ran the benchmark on our new HPC. Both the hardware and the software stack changed from the setup in the original answer.
I put the results in a Google spreadsheet (which also contains the results from the original answer).
Hardware
Our HPC has two different node types, one with Intel Sandy Bridge CPUs and one with the newer Ivy Bridge CPUs:
Sandy (MKL, OpenBLAS, ATLAS):
- CPU: 2 x 16 Intel(R) Xeon(R) E2560 Sandy Bridge @ 2.00GHz (16 Cores)
- RAM: 64 GB
Ivy (MKL, OpenBLAS, ATLAS):
- CPU: 2 x 20 Intel(R) Xeon(R) E2680 V2 Ivy Bridge @ 2.80GHz (20 Cores, with HT = 40 Cores)
- RAM: 256 GB
Software
The software stack is the same on both nodes. OpenBLAS is used instead of GotoBLAS2, and there is also a multi-threaded ATLAS BLAS that is set to 8 threads (hardcoded).
- OS: Suse
- Intel Compiler: ictce-5.3.0
- Numpy: 1.8.0
- OpenBLAS: 0.2.6
- ATLAS: 3.8.4
Dot-Product Benchmark
The benchmark code is the same as below. However, for the new machines I also ran the benchmark for matrix sizes 5000 and 8000.
The table below includes the benchmark results from the original answer (renamed: MKL --> Nehalem MKL, Netlib Blas --> Nehalem Netlib BLAS, etc.)
Single threaded performance:
Multi threaded performance (8 threads):
Threads vs Matrix size (Ivy Bridge MKL):
Benchmark Suite
Single threaded performance:
Multi threaded (8 threads) performance:
Conclusion
The new benchmark results are similar to the ones in the original answer. OpenBLAS and MKL perform on the same level, with the exception of the eigenvalue test. The eigenvalue test performs only reasonably well on OpenBLAS in single-threaded mode. In multi-threaded mode the performance is worse.
The "Matrix size vs threads" chart also shows that although MKL as well as OpenBLAS generally scale well with the number of cores/threads, it depends on the size of the matrix. For small matrices, adding more cores won't improve performance very much.
There is also an approximately 30% performance increase from Sandy Bridge to Ivy Bridge, which might be due to the higher clock rate (+0.8 GHz) and/or the better architecture.
Original Answer (04.10.2011):
Some time ago I had to optimize some linear algebra calculations/algorithms which were written in Python using numpy and BLAS, so I benchmarked/tested different numpy/BLAS configurations.
Specifically I tested:
- Numpy with ATLAS
- Numpy with GotoBlas2 (1.13)
- Numpy with MKL (11.1/073)
- Numpy with Accelerate Framework (Mac OS X)
I ran two different benchmarks:
- simple dot product of matrices with different sizes
- Benchmark suite which can be found here.
Here are my results:
Machines
Linux (MKL, ATLAS, No-MKL, GotoBlas2):
- OS: Ubuntu Lucid 10.04 64 Bit.
- CPU: 2 x 4 Intel(R) Xeon(R) E5504 @ 2.00GHz (8 Cores)
- RAM: 24 GB
- Intel Compiler: 11.1/073
- Scipy: 0.8
- Numpy: 1.5
Mac Book Pro (Accelerate Framework):
- OS: Mac OS X Snow Leopard (10.6)
- CPU: 1 Intel Core 2 Duo 2.93 Ghz (2 Cores)
- RAM: 4 GB
- Scipy: 0.7
- Numpy: 1.3
Mac Server (Accelerate Framework):
- OS: Mac OS X Snow Leopard Server (10.6)
- CPU: 4 X Intel(R) Xeon(R) E5520 @ 2.26 Ghz (8 Cores)
- RAM: 4 GB
- Scipy: 0.8
- Numpy: 1.5.1
Dot product benchmark
Code:
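import numpy as np
size = 1000  # matrix dimension; the results below use 1000, 2000 and 3000
a = np.random.random_sample((size, size))
b = np.random.random_sample((size, size))
# %timeit is an IPython magic, so run this line in an IPython session
%timeit np.dot(a, b)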
Results:
System             | size = 1000 | size = 2000 | size = 3000
netlib BLAS        | 1350 ms     | 10900 ms    | 39200 ms
ATLAS (1 CPU)      | 314 ms      | 2560 ms     | 8700 ms
MKL (1 CPU)        | 268 ms      | 2110 ms     | 7120 ms
MKL (2 CPUs)       | -           | -           | 3660 ms
MKL (8 CPUs)       | 39 ms       | 319 ms      | 1000 ms
GotoBlas2 (1 CPU)  | 266 ms      | 2100 ms     | 7280 ms
GotoBlas2 (2 CPUs) | 139 ms      | 1009 ms     | 3690 ms
GotoBlas2 (8 CPUs) | 54 ms       | 389 ms      | 1250 ms
Mac OS X (1 CPU)   | 143 ms      | 1060 ms     | 3605 ms
Mac Server (1 CPU) | 92 ms       | 714 ms      | 2130 ms
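The dot-product benchmark above relies on IPython's %timeit. As a rough sketch of the same measurement in plain Python, the standard timeit module can be used instead. The helper names time_dot and run_benchmark are made up for this illustration and are not part of the original benchmark; the sizes are the ones from the table.

```python
import timeit

import numpy as np


def time_dot(size, repeat=3):
    """Return the best wall-clock time in seconds for one np.dot call
    on two random size x size matrices (hypothetical helper)."""
    a = np.random.random_sample((size, size))
    b = np.random.random_sample((size, size))
    # number=1: each measurement is a single matrix product;
    # taking the minimum reduces noise from other processes.
    return min(timeit.repeat(lambda: np.dot(a, b), number=1, repeat=repeat))


def run_benchmark(sizes=(1000, 2000, 3000)):
    """Print one timing per matrix size, in milliseconds."""
    for size in sizes:
        print(f"size = {size}: {time_dot(size) * 1000:.0f} ms")
```

Calling run_benchmark() prints one line per matrix size. The absolute numbers will differ from the table, since they depend on the machine and on the BLAS library numpy is linked against.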
Benchmark Suite
Code:
For additional information about the benchmark suite see here.
Results:
System             | eigenvalues | svd      | det    | inv     | dot
netlib BLAS        | 1688 ms     | 13102 ms | 438 ms | 2155 ms | 3522 ms
ATLAS (1 CPU)      | 1210 ms     | 5897 ms  | 170 ms | 560 ms  | 893 ms
MKL (1 CPU)        | 691 ms      | 4475 ms  | 141 ms | 450 ms  | 736 ms
MKL (2 CPUs)       | 552 ms      | 2718 ms  | 96 ms  | 267 ms  | 423 ms
MKL (8 CPUs)       | 525 ms      | 1679 ms  | 60 ms  | 137 ms  | 197 ms
GotoBlas2 (1 CPU)  | 2124 ms     | 4636 ms  | 147 ms | 456 ms  | 743 ms
GotoBlas2 (2 CPUs) | 1560 ms     | 3278 ms  | 116 ms | 295 ms  | 460 ms
GotoBlas2 (8 CPUs) | 741 ms      | 2914 ms  | 82 ms  | 262 ms  | 192 ms
Mac OS X (1 CPU)   | 948 ms      | 4339 ms  | 151 ms | 318 ms  | 566 ms
Mac Server (1 CPU) | 1033 ms     | 3645 ms  | 99 ms  | 232 ms  | 342 ms
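To make the scaling in the tables above easier to read, the timings can be turned into speedup and parallel efficiency. This is only a small sketch: the numbers are copied from the tables, the helper names speedup and efficiency are made up for illustration, and it assumes that "8 CPUs" means 8 threads.

```python
# Timings in ms copied from the tables above: the dot-product benchmark
# at size = 3000 and the "dot" column of the benchmark suite.
timings_ms = {
    "MKL, dot product (size 3000)": {1: 7120, 8: 1000},
    "GotoBlas2, dot product (size 3000)": {1: 7280, 8: 1250},
    "MKL, suite dot": {1: 736, 8: 197},
    "GotoBlas2, suite dot": {1: 743, 8: 192},
}


def speedup(t_single, t_multi):
    """How many times faster the multi-threaded run is."""
    return t_single / t_multi


def efficiency(t_single, t_multi, threads):
    """Speedup divided by thread count (1.0 would be perfect scaling)."""
    return speedup(t_single, t_multi) / threads


for name, t in timings_ms.items():
    s = speedup(t[1], t[8])
    e = efficiency(t[1], t[8], 8)
    print(f"{name}: speedup {s:.2f}x, efficiency {e:.0%}")
```

The same calculation applied to the eigenvalue column shows much weaker scaling, which matches the remark about the eigenvalue test in the conclusions.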
Installation
Installing MKL meant installing the complete Intel Compiler Suite, which is pretty straightforward. However, because of some bugs/issues, configuring and compiling numpy with MKL support was a bit of a hassle.
GotoBlas2 is a small package which can be easily compiled as a shared library. However, because of a bug you have to re-create the shared library after building it in order to use it with numpy.
In addition, building it for multiple target platforms didn't work for some reason. So I had to create an .so file for each platform for which I wanted an optimized libgoto2.so file.
If you install numpy from Ubuntu's repository, it will automatically install and configure numpy to use ATLAS. Installing ATLAS from source can take some time and requires some additional steps (Fortran, etc.).
If you install numpy on a Mac OS X machine with Fink or MacPorts, it will configure numpy to use either ATLAS or Apple's Accelerate Framework. You can check by either running ldd on the numpy.core._dotblas file or calling numpy.show_config().
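As a small sketch of the numpy.show_config() check mentioned above: the function prints the build configuration, so the output can be captured and searched for a library name. The helper name blas_config_text is made up for this example, and the exact output format differs between numpy versions, so treat the search as a rough hint only.

```python
import contextlib
import io

import numpy as np


def blas_config_text():
    """Capture what numpy.show_config() prints and return it as a string
    (hypothetical helper, not part of numpy)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()


config = blas_config_text()
print(config)

# Rough hint only: look for well-known library names in the output.
for name in ("mkl", "openblas", "atlas", "accelerate"):
    if name in config.lower():
        print("found a reference to:", name)
```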
Conclusions
MKL performs best, closely followed by GotoBlas2.
In the eigenvalue test GotoBlas2 performs surprisingly poorly. I am not sure why this is the case.
Apple's Accelerate Framework performs really well, especially in single-threaded mode (compared to the other BLAS implementations).
Both GotoBlas2 and MKL scale very well with the number of threads. So if you have to deal with big matrices, running it on multiple threads will help a lot.
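The answer does not show how the thread count was set for the single- and multi-threaded runs. One common way, sketched here as an assumption rather than as the method used above, is to set the thread-count environment variables before numpy is imported. Which variable is honoured depends on the BLAS library and how it was built; the helper name limit_blas_threads is made up for this example.

```python
import os


def limit_blas_threads(n):
    """Set the usual thread-count environment variables to n
    (hypothetical helper; only the variable that the linked
    BLAS library actually reads has any effect)."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS",
                "OPENBLAS_NUM_THREADS", "GOTO_NUM_THREADS"):
        os.environ[var] = str(n)


# Assumption: this runs before numpy is imported for the first time,
# because the libraries read these variables when they are loaded.
limit_blas_threads(1)

import numpy as np

a = np.random.random_sample((500, 500))
b = np.random.random_sample((500, 500))
c = np.dot(a, b)  # runs on a single BLAS thread if the variable was honoured
print(c.shape)
```

A BLAS build with a hardcoded thread count, as described for the ATLAS build in the update above, would not be affected by these variables.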
In any case, don't use the default netlib BLAS implementation, because it is way too slow for any serious computational work.
On our cluster I also installed AMD's ACML, and performance was similar to MKL and GotoBlas2. I don't have any numbers though.
I personally would recommend using GotoBlas2 because it's easier to install and it's free.
If you want to code in C++/C, also check out Eigen3, which is supposed to outperform MKL/GotoBlas2 in some cases and is also pretty easy to use.
Not all of NumPy uses BLAS, only some functions: specifically dot(), vdot(), and innerproduct(), plus several functions from the numpy.linalg module. Also note that many NumPy operations are limited by memory bandwidth for large arrays, so an optimised implementation is unlikely to give any improvement. Whether multi-threading can give better performance if you are limited by memory bandwidth depends heavily on your hardware.
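A back-of-the-envelope way to see which operations are likely to be limited by memory bandwidth is to count floating-point operations per byte moved. The sketch below is a simplified model, not a measurement: it assumes float64 data and that every array is read or written exactly once, and the function names are made up for illustration.

```python
def elementwise_intensity(itemsize=8):
    """FLOPs per byte for c = a + b: one addition per element,
    with two reads and one write of itemsize bytes each."""
    return 1.0 / (3 * itemsize)


def matmul_intensity(n, itemsize=8):
    """FLOPs per byte for an n x n matrix product, assuming each of the
    three matrices crosses the memory bus exactly once."""
    flops = 2.0 * n ** 3                 # about n multiplies and n adds per output element
    bytes_moved = 3 * n ** 2 * itemsize  # read a, read b, write c
    return flops / bytes_moved


for n in (1000, 2000, 3000):
    print(n, matmul_intensity(n), elementwise_intensity())
```

Under this model the work per byte stays constant for element-wise operations but grows with n for a matrix product, provided the implementation reuses data in cache. That would be one possible explanation for why dot() benefits from an optimised, multi-threaded BLAS while simple array arithmetic does not.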
It's possible that, because matrix x matrix multiplication is memory constrained, adding extra cores on the same memory hierarchy doesn't give you much. Of course, if you're seeing a substantial speedup when you switch to your Fortran implementation, then I might be incorrect.
My understanding is that proper caching is far more important for these sorts of problems than compute power. Presumably BLAS does this for you.
For a simple test, you could try installing Enthought's Python distribution for comparison. They link against Intel's Math Kernel Library, which I believe harnesses multiple cores if available.
Have you heard of MAGMA? Matrix Algebra on GPU and Multicore Architecture: http://icl.cs.utk.edu/magma/
The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current "Multicore+GPU" systems.