Computing the null space of a matrix as fast as possible (CUDA)

Accepted answer
Score: 14

To answer your question directly... yes! QR decomposition!

Let A be an m-by-n matrix with rank n. QR decomposition finds an orthonormal m-by-m matrix Q and an upper triangular m-by-n matrix R such that A = QR. If we define Q = [Q1 Q2], where Q1 is m-by-n and Q2 is m-by-(m-n), then the columns of Q2 form the null space of A^T.
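
For instance, a minimal sketch using Eigen's HouseholderQR (assuming A has full column rank n; Eigen is not used elsewhere in this thread):

#include <Eigen/Dense>
#include <iostream>

int main() {
    const int m = 4, n = 3;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);  // assumed to have rank n

    Eigen::HouseholderQR<Eigen::MatrixXd> qr(A);
    Eigen::MatrixXd Q  = qr.householderQ();             // full m-by-m Q
    Eigen::MatrixXd Q2 = Q.rightCols(m - n);            // columns span null(A^T)

    // Sanity check: A^T * Q2 should be numerically zero
    std::cout << "||A^T Q2|| = " << (A.transpose() * Q2).norm() << std::endl;
    return 0;
}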

QR decomposition is computed either by Gram-Schmidt, Givens rotations, or Householder reflections. They have different stability properties and operation counts.

You are right: SVD is expensive! I can't speak for what state-of-the-art stuff uses, but when I hear "compute null space" (EDIT: in a way that is simple for me to understand), I think QR.

Score: 4

I don't think the above proposed method always gives the whole null space. To recap: "A = QR, where Q = [Q1 Q2], and Q1 is m-by-n and Q2 is m-by-(m-n). Then the columns of Q2 form the null space of A^T."

Indeed, this may only give a subspace of the null space. A simple counter-example is A = 0, in which case the null space of A^T is the whole of R^m.

Therefore, it is necessary to check R too. Based on my experience with Matlab, if a row of R is straight 0, then the corresponding column in Q should also be a basis vector of the null space of A^T. Clearly this observation is heuristic and hinges on the particular algorithm used for QR decomposition.
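
A rank-revealing QR with column pivoting makes this check systematic; here is a minimal sketch using Eigen's ColPivHouseholderQR (the 1e-12 threshold is an arbitrary choice for illustration):

#include <Eigen/Dense>
#include <iostream>

int main() {
    const int m = 5, n = 3;
    Eigen::MatrixXd A = Eigen::MatrixXd::Zero(m, n);    // rank-deficient example (A = 0)

    Eigen::ColPivHouseholderQR<Eigen::MatrixXd> qr(A);
    qr.setThreshold(1e-12);
    const int r = qr.rank();                            // numerical rank from the diagonal of R

    Eigen::MatrixXd Q = qr.householderQ();
    Eigen::MatrixXd N = Q.rightCols(m - r);             // last m - r columns span null(A^T)
    std::cout << "dim null(A^T) = " << m - r << std::endl;  // prints 5 here
    return 0;
}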

Score: 3

Gaussian elimination is plenty fast for 4x3 matrices. IIRC I've done about 5 million per second with Java without parallelism. With such a small problem, your best bet is to code the routine (row reduce etc.) yourself; otherwise you'll waste most of the time putting the data into the right format for the external routine.
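
As a sketch of what "code it yourself" might look like (this is not the answerer's Java code): Gauss-Jordan with partial pivoting on the 3x4 transpose M^T recovers the left null vector of a full-rank 4x3 matrix M.

#include <math.h>
#include <stdio.h>

// Gauss-Jordan with partial pivoting on a 3x4 matrix a (here: M^T);
// writes a null vector of a into x[4], assuming rank(a) == 3.
void null_vector_3x4(double a[3][4], double x[4])
{
    int pivot_col[3], r = 0;
    for (int c = 0; c < 4 && r < 3; c++) {
        int p = r;                                     // partial pivoting
        for (int i = r + 1; i < 3; i++)
            if (fabs(a[i][c]) > fabs(a[p][c])) p = i;
        if (fabs(a[p][c]) < 1e-12) continue;           // numerically zero column
        for (int j = 0; j < 4; j++) { double t = a[r][j]; a[r][j] = a[p][j]; a[p][j] = t; }
        for (int i = 0; i < 3; i++) {                  // eliminate above and below the pivot
            if (i == r) continue;
            double f = a[i][c] / a[r][c];
            for (int j = 0; j < 4; j++) a[i][j] -= f * a[r][j];
        }
        pivot_col[r++] = c;
    }
    int is_pivot[4] = {0, 0, 0, 0};
    for (int i = 0; i < r; i++) is_pivot[pivot_col[i]] = 1;
    int free_col = 3;
    for (int c = 0; c < 4; c++) if (!is_pivot[c]) free_col = c;
    for (int c = 0; c < 4; c++) x[c] = 0.0;
    x[free_col] = 1.0;                                 // set the free variable to 1
    for (int i = 0; i < r; i++)                        // back-substitute the pivot variables
        x[pivot_col[i]] = -a[i][free_col] / a[i][pivot_col[i]];
}

int main() {
    double M_T[3][4] = { {1, 0, 0, 1},
                         {0, 1, 0, 1},
                         {0, 0, 1, 1} };   // M^T for an example 4x3 matrix M
    double x[4];
    null_vector_3x4(M_T, x);
    printf("null vector: %g %g %g %g\n", x[0], x[1], x[2], x[3]);  // -1 -1 -1 1
    return 0;
}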

Score: 2

In the answers above, it has already been pointed out how the null space of a matrix can be calculated by using the QR or the SVD approach. SVD should be preferred when accuracy is required; see also Null-space of a rectangular dense matrix.

As of February 2015, CUDA 7 (now in release candidate) makes SVD available through its new cuSOLVER library. Below I report an example of how to use cuSOLVER's SVD to calculate the null space of a matrix.

Be aware that the problem you are focusing on concerns the calculation of several small matrices, so you should adapt the example I'm providing below by using streams for it to make sense for your case. To associate a stream to each task you can use

cudaStreamCreate()

and

cusolverDnSetStream()
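
For example, a minimal sketch of the association (NUM_MATRICES and the per-task handles are illustrative, not part of the code below):

const int NUM_MATRICES = 16;                 // illustrative number of small problems
cudaStream_t       streams[NUM_MATRICES];
cusolverDnHandle_t handles[NUM_MATRICES];
for (int k = 0; k < NUM_MATRICES; k++) {
    gpuErrchk(cudaStreamCreate(&streams[k]));
    cusolveSafeCall(cusolverDnCreate(&handles[k]));
    cusolveSafeCall(cusolverDnSetStream(handles[k], streams[k]));
}
// ... launch one gesvd per handle/stream so the independent SVDs can overlap ...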

kernel.cu

#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"

#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<math.h>

#include <cusolverDn.h>
#include <cuda_runtime_api.h>

#include "Utilities.cuh"

/********/
/* MAIN */
/********/
int main(){

    // --- gesvd only supports Nrows >= Ncols
    // --- column major memory ordering

    const int Nrows = 7;
    const int Ncols = 5;

    // --- cuSOLVE input/output parameters/arrays
    int work_size = 0;
    int *devInfo;           gpuErrchk(cudaMalloc(&devInfo,          sizeof(int)));

    // --- CUDA solver initialization
    cusolverDnHandle_t solver_handle;
    cusolveSafeCall(cusolverDnCreate(&solver_handle));

    // --- Singular values threshold
    double threshold = 1e-12;

    // --- Setting the host, Nrows x Ncols matrix
    double *h_A = (double *)malloc(Nrows * Ncols * sizeof(double));
    for(int j = 0; j < Nrows; j++)
        for(int i = 0; i < Ncols; i++)
            h_A[j + i*Nrows] = (i + j*j) * sqrt((double)(i + j));

    // --- Setting the device matrix and moving the host matrix to the device
    double *d_A;            gpuErrchk(cudaMalloc(&d_A,      Nrows * Ncols * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));

    // --- host side SVD results space
    double *h_U = (double *)malloc(Nrows * Nrows     * sizeof(double));
    double *h_V = (double *)malloc(Ncols * Ncols     * sizeof(double));
    double *h_S = (double *)malloc(min(Nrows, Ncols) * sizeof(double));

    // --- device side SVD workspace and matrices
    double *d_U;            gpuErrchk(cudaMalloc(&d_U,  Nrows * Nrows     * sizeof(double)));
    double *d_V;            gpuErrchk(cudaMalloc(&d_V,  Ncols * Ncols     * sizeof(double)));
    double *d_S;            gpuErrchk(cudaMalloc(&d_S,  min(Nrows, Ncols) * sizeof(double)));

    // --- CUDA SVD initialization
    cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));
    double *work;   gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));

    // --- CUDA SVD execution (note: gesvd returns V^T, not V, in the d_V argument)
    cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));
    int devInfo_h = 0;  gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
    if (devInfo_h != 0) std::cout   << "Unsuccessful SVD execution\n\n";

    // --- Moving the results from device to host
    gpuErrchk(cudaMemcpy(h_S, d_S, min(Nrows, Ncols) * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_U, d_U, Nrows * Nrows     * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_V, d_V, Ncols * Ncols     * sizeof(double), cudaMemcpyDeviceToHost));

    for(int i = 0; i < min(Nrows, Ncols); i++) 
        std::cout << "d_S["<<i<<"] = " << std::setprecision(15) << h_S[i] << std::endl;

    printf("\n\n");

    // --- Count the singular values above the threshold; the remaining
    //     right singular vectors span the null space of A
    int count = 0;
    while ((count < min(Nrows, Ncols)) && (h_S[count] > threshold)) count++;
    printf("The null space of A has dimension %i\n\n", min(Ncols, Nrows) - count);

    // --- gesvd returns V^T (stored column-major) in d_V, so the null space of A
    //     is spanned by the last rows of V^T, i.e. by h_V[j + i*Ncols] for j >= count
    for(int j = count; j < Ncols; j++) {
        printf("Basis vector nr. %i\n", j - count);
        for(int i = 0; i < Ncols; i++)
            std::cout << "v[" << i << "] = " << std::setprecision(15) << h_V[j + i*Ncols] << std::endl;
        printf("\n");
    }

    cusolverDnDestroy(solver_handle);

    return 0;

}

Utilities.cuh

#ifndef UTILITIES_CUH
#define UTILITIES_CUH

#include <cuda_runtime.h>
#include <cusolverDn.h>

extern "C" int iDivUp(int, int);
extern "C" void gpuErrchk(cudaError_t);
extern "C" void cusolveSafeCall(cusolverStatus_t);

#endif

Utilities.cu

#include <stdio.h>
#include <assert.h>

#include "cuda_runtime.h"
#include <cuda.h>

#include <cusolverDn.h>

/*******************/
/* iDivUp FUNCTION */
/*******************/
extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) { exit(code); }
   }
}

extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }

/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
{
    switch (error)
    {
        case CUSOLVER_STATUS_SUCCESS:
            return "CUSOLVER_SUCCESS";

        case CUSOLVER_STATUS_NOT_INITIALIZED:
            return "CUSOLVER_STATUS_NOT_INITIALIZED";

        case CUSOLVER_STATUS_ALLOC_FAILED:
            return "CUSOLVER_STATUS_ALLOC_FAILED";

        case CUSOLVER_STATUS_INVALID_VALUE:
            return "CUSOLVER_STATUS_INVALID_VALUE";

        case CUSOLVER_STATUS_ARCH_MISMATCH:
            return "CUSOLVER_STATUS_ARCH_MISMATCH";

        case CUSOLVER_STATUS_EXECUTION_FAILED:
            return "CUSOLVER_STATUS_EXECUTION_FAILED";

        case CUSOLVER_STATUS_INTERNAL_ERROR:
            return "CUSOLVER_STATUS_INTERNAL_ERROR";

        case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
            return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";

    }

    return "<unknown>";
}

inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
    if(CUSOLVER_STATUS_SUCCESS != err) {
        fprintf(stderr, "CUSOLVE error in file '%s', line %d\nerror %d: %s\nterminating!\n",
                file, line, err, _cudaGetErrorEnum(err));
        cudaDeviceReset(); assert(0);
    }
}

extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }
Score: 1

"seems very expensive" - what 7 data do you have that supports this?

Maybe Block Lanczos is the answer you seek.

Or maybe this.

Both JAMA and Apache Commons Math have SVD implementations in Java. Why not take those and try them out? Get some real data for your case instead of impressions. It won't cost you much, since the code is already written and tested.

Score: 1

I think the most important thing for CUDA is to find an algorithm that doesn't depend on conditional branching (which is quite slow on graphics hardware). Simple if statements that can be optimized into conditional assignment are much better (or you can use the ?: operator).

If necessary, you should be able to do some form of pivoting using conditional assignment. It might actually be harder to determine how to store your result: if your matrix is rank-deficient, what do you want your CUDA program to do about it?
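
For instance, a sketch of what conditional-assignment "pivoting" can look like on the device (pivot2 is a made-up helper, not from this thread):

// Keep the larger-magnitude value in a without an if-statement; the ?:
// operators compile to selects/predication rather than divergent branches.
__device__ void pivot2(double &a, double &b)
{
    bool swap = fabs(b) > fabs(a);
    double hi = swap ? b : a;
    double lo = swap ? a : b;
    a = hi;
    b = lo;
}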

If you assume your 4x3 matrix is not actually rank-deficient, you can find your (single) null-space vector without any conditionals at all: the matrix is small enough that you can use Cramer's rule efficiently.

Actually, since you don't care about the scale of your null vector, you don't have to divide by the determinant -- you can just take the determinants of the minors:

    x1 x2 x3
M = y1 y2 y3
    z1 z2 z3
    w1 w2 w3

         |y1 y2 y3|        |x1 x2 x3|       |x1 x2 x3|        |x1 x2 x3|
->  x0 = |z1 z2 z3|  y0 = -|z1 z2 z3|  z0 = |y1 y2 y3|  w0 = -|y1 y2 y3|
         |w1 w2 w3|        |w1 w2 w3|       |w1 w2 w3|        |z1 z2 z3|

Note that these 3x3 determinants are just triple products; you can save computation by reusing the cross products.
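
Spelled out as code, the four minors reduce to dot and cross products, with z x w shared between the first two (a sketch; Vec3 and left_null_vector are illustrative names):

struct Vec3 { double a, b, c; };

__host__ __device__ Vec3 cross(const Vec3 &u, const Vec3 &v) {
    return { u.b * v.c - u.c * v.b, u.c * v.a - u.a * v.c, u.a * v.b - u.b * v.a };
}
__host__ __device__ double dot(const Vec3 &u, const Vec3 &v) {
    return u.a * v.a + u.b * v.b + u.c * v.c;
}

// Rows of the 4x3 matrix M are x, y, z, w; n is the (unscaled) null vector of M^T.
__host__ __device__ void left_null_vector(const Vec3 &x, const Vec3 &y,
                                          const Vec3 &z, const Vec3 &w, double n[4]) {
    Vec3 zw = cross(z, w);           // reused by the first two minors
    n[0] =  dot(y, zw);              //  |y; z; w|
    n[1] = -dot(x, zw);              // -|x; z; w|
    n[2] =  dot(x, cross(y, w));     //  |x; y; w|
    n[3] = -dot(x, cross(y, z));     // -|x; y; z|
}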

Score: 1

I wondered if the matrices are related rather than just being random, so that the null spaces you are seeking can be considered to be like 1-dimensional tangents to a curve in N-space (N = 9). If so, you may be able to speed things up by using Newton's method to solve successive instances of the system of quadratic equations Ax = 0, |x|^2 = 1, starting from a previous null space vector. Newton's method uses first derivatives to converge to a solution, and so would use Gaussian elimination to solve 9x9 systems. Using this technique would require that you be able to make small steps from matrix to matrix by, say, varying a parameter.

So the idea is that you initialize using SVD on the first matrix, but thereafter you step from matrix to matrix, using the null space vector of one as the starting point for the iteration for the next one. You need one or two iterations to get convergence. If you don't get convergence, you use SVD to restart. If this situation is what you have, it is much faster than starting fresh on each matrix.
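
In code, a single Newton step for this scheme might look like the following sketch (using Eigen; A is assumed to be 8x9 so that the Jacobian [A; x^T] is square, matching the 9x9 systems mentioned above):

#include <Eigen/Dense>

// One Newton step for F(x) = [A x; (|x|^2 - 1)/2] = 0, warm-started from the
// null vector of the previous, nearby matrix.
Eigen::VectorXd newton_step(const Eigen::MatrixXd &A, const Eigen::VectorXd &x)
{
    const int n = x.size();                 // n = 9 in the setting above
    Eigen::MatrixXd J(n, n);
    J.topRows(n - 1) = A;                   // derivative of A x
    J.bottomRows(1)  = x.transpose();       // derivative of |x|^2 / 2
    Eigen::VectorXd F(n);
    F.head(n - 1) = A * x;
    F(n - 1)      = 0.5 * (x.squaredNorm() - 1.0);
    return x - J.partialPivLu().solve(F);   // Gaussian elimination on a 9x9 system
}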

I used this a long time ago to map contours in the solutions of sets of 50 x 50 quadratic equations associated with the behavior of electric power systems.
