A NumPy-compatible matrix library accelerated by CUDA

CuPy is an open-source matrix library accelerated with NVIDIA CUDA. It also uses CUDA-related libraries, including cuBLAS, cuDNN, cuRAND, cuSOLVER, cuSPARSE, cuFFT and NCCL, to make full use of the GPU architecture.

CuPy's interface is highly compatible with NumPy; in most cases it can be used as a drop-in replacement.
All you need to do is replace `numpy` with `cupy` in your Python code.
It supports various methods, indexing, data types, broadcasting and more.

```
>>> import cupy as cp
>>> x = cp.arange(6).reshape(2, 3).astype('f')
>>> x
array([[ 0., 1., 2.],
       [ 3., 4., 5.]], dtype=float32)
>>> x.sum(axis=1)
array([ 3., 12.], dtype=float32)
```
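
As a rough illustration of the drop-in idea, the sketch below writes one function against a module alias `xp` that points at CuPy when a CUDA device is usable and at NumPy otherwise. The helper `row_normalize` and the fallback logic are assumptions made for this example, not part of CuPy itself.

```python
import numpy as np

try:
    import cupy as cp
    # Treat "installed but no usable GPU" the same as "not installed".
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device")
    xp = cp
except Exception:
    xp = np  # fall back to NumPy so the sketch still runs on CPU-only machines


def row_normalize(a):
    """Scale each row so that it sums to 1 (hypothetical helper)."""
    return a / a.sum(axis=1, keepdims=True)


x = xp.arange(1, 7, dtype='f').reshape(2, 3)
result = row_normalize(x)

# Copy the result back to host memory when it lives on the GPU.
host = cp.asnumpy(result) if xp is not np else result
print(host)
```

The function body contains no library-specific calls, so the same code serves both backends; only the array creation and the final copy to host memory depend on which module is in use.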

The easiest way to install CuPy is with pip. CuPy provides wheels (precompiled binary packages) for the recommended environments; these packages include cuDNN and NCCL. CuPy can also be installed from source code: the install script automatically detects the versions of CUDA, cuDNN and NCCL installed in your environment.

```
(For CUDA 8.0)
% pip install cupy-cuda80
(For CUDA 9.0)
% pip install cupy-cuda90
(For CUDA 9.1)
% pip install cupy-cuda91
(For CUDA 9.2)
% pip install cupy-cuda92
(For CUDA 10.0)
% pip install cupy-cuda100
(For CUDA 10.1)
% pip install cupy-cuda101
(Install CuPy from source)
% pip install cupy
```
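
After installing, a quick check along the following lines can confirm that the package imports and can see a GPU. This is a minimal sketch; the helper name `describe_cupy` and its messages are made up for this example.

```python
import importlib.util


def describe_cupy():
    """Return a short, human-readable status string (hypothetical helper)."""
    if importlib.util.find_spec("cupy") is None:
        return "CuPy is not installed"
    try:
        import cupy as cp
        count = cp.cuda.runtime.getDeviceCount()
        return f"CuPy {cp.__version__}, {count} CUDA device(s) visible"
    except Exception as exc:
        # Import or driver problems end up here, e.g. a CUDA version mismatch.
        return f"CuPy is installed but unusable: {exc}"


status = describe_cupy()
print(status)
```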

If you want your code to run faster, you can easily write a custom CUDA kernel; it requires only a small snippet of C++. CuPy automatically wraps and compiles the snippet into a CUDA binary. Compiled binaries are cached and reused in subsequent runs.

```
>>> x = cp.arange(6, dtype='f').reshape(2, 3)
>>> y = cp.arange(3, dtype='f')
>>> kernel = cp.ElementwiseKernel(
...     'float32 x, float32 y', 'float32 z',
...     '''if (x - 2 > y) {
...       z = x * y;
...     } else {
...       z = x + y;
...     }''', 'my_kernel')
>>> kernel(x, y)
array([[ 0., 2., 4.],
       [ 0., 4., 10.]], dtype=float32)
```
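
One way to gain confidence in a custom kernel is to compare it against a plain NumPy version of the same logic. The sketch below does that for the kernel above; the `reference` function, the kernel name `my_kernel_check`, and the choice to skip the GPU path when no device is usable are all assumptions made for this example.

```python
import numpy as np


def reference(x, y):
    """NumPy version of the kernel body: multiply where x - 2 > y, else add."""
    return np.where(x - 2 > y, x * y, x + y)


x_host = np.arange(6, dtype='f').reshape(2, 3)
y_host = np.arange(3, dtype='f')
expected = reference(x_host, y_host)

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() == 0:
        raise RuntimeError("no CUDA device")
    kernel = cp.ElementwiseKernel(
        'float32 x, float32 y', 'float32 z',
        'z = (x - 2 > y) ? x * y : x + y;',
        'my_kernel_check')
    got = cp.asnumpy(kernel(cp.asarray(x_host), cp.asarray(y_host)))
    matches = bool(np.array_equal(got, expected))
except Exception:
    # No usable GPU (or CuPy missing): only the CPU reference is available.
    matches = None

print(expected)
print("GPU result matches reference:", matches)
```

The kernel body here uses a ternary expression instead of `if`/`else`; both forms compute the same element-wise result, and `y` is broadcast across the rows of `x` in either backend.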