Fast 4x4 Matrix Multiplication in C with NEON (accepted answer)
No, the Strassen or Coppersmith-Winograd algorithm wouldn't make much difference here. They only start to pay off for larger matrices.
If your matrix multiplication really is a bottleneck you could rewrite the algorithm using NEON SIMD instructions. That would only help on ARMv7 though, as ARMv6 does not have this extension.
I'd expect a factor 3 speedup over the compiled C code for your case.
EDIT: You can find a nice implementation in ARM NEON here: http://code.google.com/p/math-neon/
For your C code there are two things you could do to speed up the code:
Don't inline the function. Your matrix multiplication generates quite a bit of code as it's unrolled, and the ARM only has a very tiny instruction cache. Excessive inlining can make your code slower because the CPU will be busy loading code into the cache instead of executing it.
Use the restrict keyword to tell the compiler that the source and destination pointers don't overlap in memory. Currently the compiler is forced to reload every source value from memory whenever a result is written, because it has to assume that source and destination may overlap or even point to the same memory.
Just nitpicking. I wonder why people still obfuscate their code voluntarily? C is already difficult to read, no need to add to it.
static inline void Matrix4x4MultiplyBy4x4 (float src1[4][4], float src2[4][4], float dest[4][4])
{
dest[0][0] = src1[0][0] * src2[0][0] + src1[0][1] * src2[1][0] + src1[0][2] * src2[2][0] + src1[0][3] * src2[3][0];
dest[0][1] = src1[0][0] * src2[0][1] + src1[0][1] * src2[1][1] + src1[0][2] * src2[2][1] + src1[0][3] * src2[3][1];
dest[0][2] = src1[0][0] * src2[0][2] + src1[0][1] * src2[1][2] + src1[0][2] * src2[2][2] + src1[0][3] * src2[3][2];
dest[0][3] = src1[0][0] * src2[0][3] + src1[0][1] * src2[1][3] + src1[0][2] * src2[2][3] + src1[0][3] * src2[3][3];
dest[1][0] = src1[1][0] * src2[0][0] + src1[1][1] * src2[1][0] + src1[1][2] * src2[2][0] + src1[1][3] * src2[3][0];
dest[1][1] = src1[1][0] * src2[0][1] + src1[1][1] * src2[1][1] + src1[1][2] * src2[2][1] + src1[1][3] * src2[3][1];
dest[1][2] = src1[1][0] * src2[0][2] + src1[1][1] * src2[1][2] + src1[1][2] * src2[2][2] + src1[1][3] * src2[3][2];
dest[1][3] = src1[1][0] * src2[0][3] + src1[1][1] * src2[1][3] + src1[1][2] * src2[2][3] + src1[1][3] * src2[3][3];
dest[2][0] = src1[2][0] * src2[0][0] + src1[2][1] * src2[1][0] + src1[2][2] * src2[2][0] + src1[2][3] * src2[3][0];
dest[2][1] = src1[2][0] * src2[0][1] + src1[2][1] * src2[1][1] + src1[2][2] * src2[2][1] + src1[2][3] * src2[3][1];
dest[2][2] = src1[2][0] * src2[0][2] + src1[2][1] * src2[1][2] + src1[2][2] * src2[2][2] + src1[2][3] * src2[3][2];
dest[2][3] = src1[2][0] * src2[0][3] + src1[2][1] * src2[1][3] + src1[2][2] * src2[2][3] + src1[2][3] * src2[3][3];
dest[3][0] = src1[3][0] * src2[0][0] + src1[3][1] * src2[1][0] + src1[3][2] * src2[2][0] + src1[3][3] * src2[3][0];
dest[3][1] = src1[3][0] * src2[0][1] + src1[3][1] * src2[1][1] + src1[3][2] * src2[2][1] + src1[3][3] * src2[3][1];
dest[3][2] = src1[3][0] * src2[0][2] + src1[3][1] * src2[1][2] + src1[3][2] * src2[2][2] + src1[3][3] * src2[3][2];
dest[3][3] = src1[3][0] * src2[0][3] + src1[3][1] * src2[1][3] + src1[3][2] * src2[2][3] + src1[3][3] * src2[3][3];
}
Are you sure that your unrolled code is faster than the explicit loop-based approach? Mind that compilers are usually better than humans at performing optimizations!
In fact, I'd bet there are more chances for a compiler to automatically emit SIMD instructions from a well-written loop than from a series of "unrelated" statements...
You could also specify the matrix sizes in the argument declaration. Then you could use the normal bracket syntax to access the elements, and it could also be a good hint for the compiler to make its optimisations too.
Your completely unrolled traditional product is likely pretty fast.
Your matrix is too small to overcome the overhead of managing a Strassen multiplication in its traditional form, with explicit indexes and partitioning code; you'd likely lose any optimization gains to that overhead.
But if you want fast, I'd use SIMD instructions if they are available. I'd be surprised if ARM chips these days don't have them. If they do, you can handle all the products in a row/column in a single instruction; if the SIMD is 8 wide, you might manage two row/column multiplies in a single instruction. Setting the operands up for that instruction might require some dancing around; SIMD instructions will easily pick up your rows (adjacent values), but will not pick up the columns (non-contiguous). And it may take some effort to compute the sum of the products in a row/column.
Are these arbitrary matrices, or do they have any symmetries? If so, those symmetries can often be exploited for improved performance (for example with rotation matrices).
Also, I agree with fortran above, and would run some timing tests to verify that your hand-unrolled code is faster than what an optimizing compiler can create. At the least, you may be able to simplify your code.
Paul