# gpt-j
Local GPT-J inference on your computer using C/C++

No video card required. You just need to have 16 GB of RAM.

For example, you can run this on a 16 GB MacBook M1.

## Motivation
The GPT-J 6B model is the open-source alternative to OpenAI's GPT-3. It's basically a neural network that
allows you to generate coherent, human-like text given a certain context (prompt).

The GPT-J model is quite big - the compact version of the model uses 16-bit floating point representation
of the weights and is still 12 GB in size. This means that in order to run inference on your computer, you
would need a video card with at least 12 GB of video RAM. Alternatively, you can try to run the
Python implementations on the CPU, but that would probably not be very efficient as they are primarily
optimized for running on a GPU (or at least this is my guess - I don't have much experience with Python).

Looking on the internet, I couldn't find a dedicated CPU implementation that would allow me to run the model
without a high-end video card. So I decided to write my own inference code using a custom-built tensor library.
The tensor library (called [ggml](https://github.com/ggerganov/ggml), written in C) is in an early development
stage, but it already allows me to run the GPT-J model.

On my MacBook M1 Pro, I achieve an inference speed of about `125 ms/token` or about 2-3 words per second.

Here is a sample run with prompt `int main(int argc, char ** argv) {`:
```
$ time ./bin/gpt-j -p "int main(int argc, char ** argv) {"
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 1
gptj_model_load: ggml ctx size = 13334.86 MB
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 11542.79 MB / num tensors = 285
main: number of tokens in prompt = 13
int main(int argc, char ** argv) {
    (void)argc;
    (void)argv;
    {
        struct sockaddr_in addr;
        int addrlen;
        char * ip = "192.168.1.4";
        int i;
        if ( (addrlen = sizeof(addr)) == -1 )
            return -1;
        for (i = 0; i < 10; ++i) {
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = inet_addr(ip);
main: mem per token = 16430420 bytes
main: load time = 6211.48 ms
main: sample time = 13.74 ms
main: predict time = 26420.34 ms / 124.62 ms per token
main: total time = 33035.37 ms
real 0m33.171s
user 3m32.269s
sys 0m3.686s
$
```
It took ~6.2 seconds to load the model into memory. After that, it took ~26.4 seconds to generate 200
tokens of what looks like the beginning of a networking program in C. Pretty cool!
## Implementation details
The high-level implementation of the model is contained in the [main.cpp](main.cpp) file. The core
computations are performed by the `ggml` library.

The most performance-critical part of the implementation is of course the matrix multiplication routine.
99% of the time is spent here, so it is important to optimize this as much as possible.
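To see why, note that during generation these multiplications are essentially matrix-vector products: every output element is one dot product between a row of the weight matrix and the input vector. A plain scalar reference version (illustrative only, not the actual ggml routine) looks like this:

```c
// Scalar reference for a matrix-vector product - illustrative only, not the
// routine used by ggml. Practically all the time is spent in the inner
// dot-product loop, which is exactly what gets vectorized below.
void matvec_ref(const float * w, const float * x, float * y, int nrows, int ncols) {
    for (int r = 0; r < nrows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < ncols; ++c) {
            sum += w[r*ncols + c] * x[c];   // the hot inner loop
        }
        y[r] = sum;
    }
}
```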
On Arm64, I utilize the 128-bit NEON intrinsics for 16-bit floating point operations:

https://github.com/ggerganov/ggml/blob/1548ac6743c594cc920ccb3503444b0e2bdf4d56/src/ggml.c#L187-L243

These instructions allow each core to operate simultaneously on 64 floating point numbers. I'm no expert
in SIMD, but after quite a few trials this was the most efficient dot-product code that I could come up
with. Combined with the parallel computation on 8 CPU threads, I think I got close to the maximum performance
that one could possibly get on the M1 CPU. Still, I'm curious to know if there is a more efficient way to
implement this.
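For readers who want to follow the linked code, a stripped-down sketch of such an fp16 dot product looks roughly like this. It is not the actual ggml kernel (which, judging by the 64-numbers figure above, keeps several such registers in flight per iteration), and it assumes a toolchain with the fp16 vector extensions enabled (e.g. `-march=armv8.2-a+fp16`) and `n` divisible by 8:

```c
#include <arm_neon.h>

// Simplified sketch of a NEON fp16 dot product - not the actual ggml routine.
// Requires __ARM_FEATURE_FP16_VECTOR_ARITHMETIC and n being a multiple of 8.
float dot_f16(const float16_t * x, const float16_t * y, int n) {
    float16x8_t sum = vdupq_n_f16(0.0f);

    for (int i = 0; i < n; i += 8) {
        // fused multiply-add on 8 half-precision values per instruction
        sum = vfmaq_f16(sum, vld1q_f16(x + i), vld1q_f16(y + i));
    }

    // widen the 8 partial sums to fp32 and reduce them horizontally
    return vaddvq_f32(vcvt_f32_f16(vget_low_f16 (sum))) +
           vaddvq_f32(vcvt_f32_f16(vget_high_f16(sum)));
}
```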
One interesting property of the GPT-J transformer architecture is that it allows you to perform part
of the inference in parallel - i.e. the Feed-forward layer can be computed in parallel to the Self-Attention
layer:

https://github.com/ggerganov/ggml/blob/1548ac6743c594cc920ccb3503444b0e2bdf4d56/examples/gpt-j/main.cpp#L507-L531

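Schematically, and leaving out all of the real details, a GPT-J block therefore looks roughly like this; `norm`, `attn` and `ff` are hypothetical stand-ins for the actual routines:

```c
// Schematic sketch of a GPT-J transformer block - not the actual ggml code.
// norm(), attn() and ff() are hypothetical stand-ins for the real routines;
// scratch is caller-provided work memory holding 3*n floats.
// The key point: both branches consume the same normalized input, so they do
// not depend on each other and can be evaluated concurrently.
void gptj_block(const float * x, float * out, float * scratch, int n,
                void (*norm)(const float *, float *, int),
                void (*attn)(const float *, float *, int),
                void (*ff)  (const float *, float *, int)) {
    float * cur = scratch;        // normalized input, shared by both branches
    float * a   = scratch + n;    // self-attention output
    float * f   = scratch + 2*n;  // feed-forward output

    norm(x, cur, n);
    attn(cur, a, n);   // branch 1: self-attention
    ff  (cur, f, n);   // branch 2: feed-forward, independent of branch 1

    for (int i = 0; i < n; ++i) {
        out[i] = x[i] + a[i] + f[i];   // x + Attn(norm(x)) + FF(norm(x))
    }
}
```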
So I thought: why not bring in the M1 GPU to compute half of the neural network in parallel with the CPU.
Thanks to the shared memory model, it was relatively easy to offload half of the computation to the GPU
using [Metal Performance Shaders](https://developer.apple.com/documentation/metalperformanceshaders).
However, to my surprise, I did not get any performance improvement at all. My conclusion was that the
8-thread NEON CPU computation is basically saturating the memory bandwidth of the M1, and since the CPU
and the GPU on the MacBook share that bandwidth, it does not help to offload the computation to the
GPU. Another observation was that the MPS GPU matrix multiplication using 16-bit floats had the same
performance as the 8-thread NEON CPU implementation. Again, I explain this with a saturated memory channel.
But of course, I could be totally wrong and somehow my implementation wasn't utilizing the resources
correctly.

Another property of my implementation is that it does not perform any memory allocations once the model
is loaded into memory. All required memory is allocated at the start of the program.
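The pattern is roughly the following: one large buffer is reserved up front and all intermediate results are carved out of it, so nothing is allocated on the token-generation path. A toy version of the idea (the real book-keeping lives inside the `ggml` context; these names are made up for the example):

```c
#include <stddef.h>
#include <stdlib.h>

// Toy bump allocator illustrating the pattern - not the actual ggml structures.
struct scratch {
    char * base;   // one big allocation made at startup
    size_t size;   // total capacity in bytes
    size_t used;   // bytes handed out so far
};

int scratch_init(struct scratch * s, size_t size) {
    s->base = malloc(size);
    s->size = size;
    s->used = 0;
    return s->base != NULL;
}

// all per-token buffers come from here - no system allocations during inference
void * scratch_alloc(struct scratch * s, size_t nbytes) {
    if (s->used + nbytes > s->size) {
        return NULL;   // the up-front estimate was too small
    }
    void * ptr = s->base + s->used;
    s->used += nbytes;
    return ptr;
}
```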
## Usage
If you want to give this a try and you are on Linux or macOS, simply follow these instructions:
```bash
# Clone the ggml library and build the gpt-j example
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j4 gpt-j
# Download the ggml-compatible GPT-J 6B model (requires 12GB disk space)
../examples/gpt-j/download-ggml-model.sh 6B
# Run the inference (requires 16GB of CPU RAM)
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"
```
To run the `gpt-j` tool, you need the 12GB `ggml-model.bin` file which contains the GPT-J model in
[ggml](https://github.com/ggerganov/ggml) format. In the instructions above, I download the binary file
directly from one of my servers, using the [download-ggml-model.sh](download-ggml-model.sh) script.

---

Alternatively, you can perform the conversion yourself.

First, you need to download the full GPT-J model from here: https://huggingface.co/EleutherAI/gpt-j-6B

Note that the full model is quite big - about 72 GB. After you download it, you need to perform the
conversion using the [convert-h5-to-ggml.py](convert-h5-to-ggml.py) script. This will generate the
`ggml-model.bin` file, which you can then use with the `gpt-j` program.
## GPT-2
I have also implemented a tool for CPU inference using the smaller GPT-2 models. They have worse
quality compared to GPT-J, but are much faster to execute.

Check out the GPT-2 example here: [gpt-2](https://github.com/ggerganov/ggml/tree/master/examples/gpt-2)