|
|
|
# gpt-j
|
|
|
|
|
|
|
|
Local GPT-J inference on your computer using C/C++
|
|
|
|
|
|
|
|
No video card required. You just need to have 16 GB of RAM.
|
|
|
|
|
|
|
|
## Motivation
|
|
|
|
|
|
|
|
The GPT-J 6B model is an open-source alternative to OpenAI's GPT-3. It's basically a neural network that generates coherent, human-like text given a certain context (prompt).
|
|
|
|
|
|
|
|
The GPT-J model is quite big - even the compact version, which uses a 16-bit floating-point representation of the weights, is still 12 GB in size. This means that in order to run inference on your computer, you would need a video card with at least 12 GB of video RAM. Alternatively, you can try to run the Python implementations on the CPU, but that would probably not be very efficient, as they are primarily optimized for running on a GPU (or at least this is my guess - I don't have much experience with Python).
|
|
|
|
|
|
|
|
I wanted to try running the model on my MacBook, so I decided to implement the model inference from scratch using my own custom-built tensor library. The tensor library (called [ggml](https://github.com/ggerganov/ggml), written in C) is in an early development stage, but it already allows me to run the GPT-J model.
|
|
|
|
|
|
|
|
On my 32 GB MacBook M1 Pro, I achieve an inference speed of about `125 ms/token` (i.e. 8 tokens per second), or roughly 6 words per second (1 word typically consists of 1 or 2 tokens).
|
|
|
|
|
|
|
|
Here is a sample run with prompt `int main(int argc, char ** argv) {`:
|
|
|
|
|
|
|
|
```
|
|
|
|
$ time ./bin/gpt-j -p "int main(int argc, char ** argv) {"
|
|
|
|
|
|
|
|
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model.bin' - please wait ...
|
|
|
|
gptj_model_load: n_vocab = 50400
|
|
|
|
gptj_model_load: n_ctx = 2048
|
|
|
|
gptj_model_load: n_embd = 4096
|
|
|
|
gptj_model_load: n_head = 16
|
|
|
|
gptj_model_load: n_layer = 28
|
|
|
|
gptj_model_load: n_rot = 64
|
|
|
|
gptj_model_load: f16 = 1
|
|
|
|
gptj_model_load: ggml ctx size = 13334.86 MB
|
|
|
|
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
|
|
|
|
gptj_model_load: ................................... done
|
|
|
|
gptj_model_load: model size = 11542.79 MB / num tensors = 285
|
|
|
|
main: number of tokens in prompt = 13
|
|
|
|
|
|
|
|
int main(int argc, char ** argv) {
|
|
|
|
(void)argc;
|
|
|
|
(void)argv;
|
|
|
|
|
|
|
|
{
|
|
|
|
struct sockaddr_in addr;
|
|
|
|
int addrlen;
|
|
|
|
char * ip = "192.168.1.4";
|
|
|
|
int i;
|
|
|
|
|
|
|
|
if ( (addrlen = sizeof(addr)) == -1 )
|
|
|
|
return -1;
|
|
|
|
|
|
|
|
for (i = 0; i < 10; ++i) {
|
|
|
|
addr.sin_family = AF_INET;
|
|
|
|
addr.sin_addr.s_addr = inet_addr(ip);
|
|
|
|
|
|
|
|
main: mem per token = 16430420 bytes
|
|
|
|
main: load time = 6211.48 ms
|
|
|
|
main: sample time = 13.74 ms
|
|
|
|
main: predict time = 26420.34 ms / 124.62 ms per token
|
|
|
|
main: total time = 33035.37 ms
|
|
|
|
|
|
|
|
real 0m33.171s
|
|
|
|
user 3m32.269s
|
|
|
|
sys 0m3.686s
|
|
|
|
|
|
|
|
$
|
|
|
|
```
|
|
|
|
|
|
|
|
It took ~6.2 seconds to load the model into memory. After that, it took ~26.4 seconds to generate 200 tokens of what looks like the beginning of a networking program in C. Pretty cool!
|
|
|
|
|
|
|
|
Here is another run, just for fun:
|
|
|
|
|
|
|
|
```
|
|
|
|
time ./bin/gpt-j -n 500 -t 8 -p "Ask HN: Inherited the worst code and tech team I have ever seen. How to fix it?
|
|
|
|
"
|
|
|
|
|
|
|
|
gptj_model_load: loading model from 'models/gpt-j-6B/ggml-model.bin' - please wait ...
|
|
|
|
gptj_model_load: n_vocab = 50400
|
|
|
|
gptj_model_load: n_ctx = 2048
|
|
|
|
gptj_model_load: n_embd = 4096
|
|
|
|
gptj_model_load: n_head = 16
|
|
|
|
gptj_model_load: n_layer = 28
|
|
|
|
gptj_model_load: n_rot = 64
|
|
|
|
gptj_model_load: f16 = 1
|
|
|
|
gptj_model_load: ggml ctx size = 13334.86 MB
|
|
|
|
gptj_model_load: memory_size = 1792.00 MB, n_mem = 57344
|
|
|
|
gptj_model_load: ................................... done
|
|
|
|
gptj_model_load: model size = 11542.79 MB / num tensors = 285
|
|
|
|
main: number of tokens in prompt = 24
|
|
|
|
|
|
|
|
Ask HN: Inherited the worst code and tech team I have ever seen. How to fix it?
|
|
|
|
|
|
|
|
I've inherited a team with some very strange and un-documented practices, one of them is that they use an old custom
|
|
|
|
application with a very slow tech stack written in Python that the team doesn't want to touch but also doesn't want to
|
|
|
|
throw away as it has some "legacy" code in it.
|
|
|
|
|
|
|
|
The problem is, the tech stack is very very slow.
|
|
|
|
|
|
|
|
They have a single web server on a VM that is slow.
|
|
|
|
The server is a little bit busy (not very busy though) and they have a lot of processes (30+ that are constantly being
|
|
|
|
spawned by the application)
|
|
|
|
They have an application that is single threaded and was written in Python and the team don't want to touch this, and
|
|
|
|
the application is very slow.
|
|
|
|
|
|
|
|
My task as a new member of the team is to fix this.
|
|
|
|
|
|
|
|
I'm a senior dev on the team (3 years on the project) and have been told that I will take the lead on this task. I know
|
|
|
|
next to nothing about Python. So here is what I have so far.
|
|
|
|
|
|
|
|
What I have done is I've been trying to debug the processes with the "ps" command. This way I can see what is running
|
|
|
|
and where. From what I see, the application spawns 10 processes a minute and some of them are used for nothing.
|
|
|
|
|
|
|
|
I have also started to look for the code. The application source is not in GitHub or any other repository, it is only on
|
|
|
|
our internal GitLab.
|
|
|
|
|
|
|
|
What I've found so far:
|
|
|
|
|
|
|
|
The application uses a custom SQLAlchemy implementation to interact with the data. I've looked at the source, it looks
|
|
|
|
like an object cache or something like that. But from what I've seen, the cache gets full every 20 minutes and then gets
|
|
|
|
cleared with a special command.
|
|
|
|
|
|
|
|
Another strange thing is that the application creates a file for every entry in the database (even if the entry already
|
|
|
|
exists). I've looked at the file to see if it contains something, but it seems to be a JSON file with lots of records.
|
|
|
|
|
|
|
|
The other strange thing is that I can only find the database tables in the GitLab repository and not the code. So I
|
|
|
|
can't really understand how the application is supposed to interact with the database.
|
|
|
|
|
|
|
|
I also found a "log" directory, but the code is encrypted with AES. From what I've found, it is in
|
|
|
|
|
|
|
|
main: mem per token = 16430420 bytes
|
|
|
|
main: load time = 3900.10 ms
|
|
|
|
main: sample time = 32.58 ms
|
|
|
|
main: predict time = 68049.91 ms / 130.11 ms per token
|
|
|
|
main: total time = 73020.05 ms
|
|
|
|
|
|
|
|
real 1m13.156s
|
|
|
|
user 9m1.328s
|
|
|
|
sys 0m7.103s
|
|
|
|
```
|
|
|
|
|
|
|
|
## Implementation details
|
|
|
|
|
|
|
|
The high-level implementation of the model is contained in the [main.cpp](main.cpp) file. The core computations are performed by the [ggml](https://github.com/ggerganov/ggml/blob/master/include/ggml/ggml.h) library.
|
|
|
|
|
|
|
|
|
|
|
|
### Matrix multiplication
|
|
|
|
|
|
|
|
The most performance-critical part of the implementation is of course the matrix multiplication routine. About 99% of the time is spent here, so it was important to optimize it as much as possible.
|
|
|
|
|
|
|
|
On Arm64, I utilize the 128-bit NEON intrinsics for 16-bit floating point operations:
|
|
|
|
|
|
|
|
https://github.com/ggerganov/ggml/blob/fb558f78d905f85c54813602649ddd628ffe0f3a/src/ggml.c#L187-L243
|
|
|
|
|
|
|
|
These instructions allow each core to operate simultaneously on 64 16-bit floats. I'm no expert in SIMD, but after quite a few trials, this was the most efficient code for the dot product of a row and a column that I could come up with. Combined with the parallel computation on 8 CPU threads, I believe I'm close to the maximum performance that one could possibly get on the M1 CPU. Still, I'm curious to know if there is a more efficient way to implement this.
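
For illustration, here is a simplified sketch of what such a dot product can look like with NEON intrinsics. This is **not** the actual ggml kernel (the real one is unrolled much more aggressively and processes more elements per iteration); for simplicity, this version widens the 16-bit values to 32-bit floats before accumulating:

```c
#include <arm_neon.h>

// Simplified fp16 dot product: processes 16 elements per iteration using
// 4 independent fp32 accumulators. Assumes n is a multiple of 16.
static float dot_f16(const float16_t * x, const float16_t * y, int n) {
    float32x4_t sum0 = vdupq_n_f32(0.0f);
    float32x4_t sum1 = vdupq_n_f32(0.0f);
    float32x4_t sum2 = vdupq_n_f32(0.0f);
    float32x4_t sum3 = vdupq_n_f32(0.0f);

    for (int i = 0; i < n; i += 16) {
        // load 4x4 fp16 values from each input and widen them to fp32
        float32x4_t x0 = vcvt_f32_f16(vld1_f16(x + i +  0));
        float32x4_t x1 = vcvt_f32_f16(vld1_f16(x + i +  4));
        float32x4_t x2 = vcvt_f32_f16(vld1_f16(x + i +  8));
        float32x4_t x3 = vcvt_f32_f16(vld1_f16(x + i + 12));

        float32x4_t y0 = vcvt_f32_f16(vld1_f16(y + i +  0));
        float32x4_t y1 = vcvt_f32_f16(vld1_f16(y + i +  4));
        float32x4_t y2 = vcvt_f32_f16(vld1_f16(y + i +  8));
        float32x4_t y3 = vcvt_f32_f16(vld1_f16(y + i + 12));

        // fused multiply-add into the independent accumulators
        sum0 = vfmaq_f32(sum0, x0, y0);
        sum1 = vfmaq_f32(sum1, x1, y1);
        sum2 = vfmaq_f32(sum2, x2, y2);
        sum3 = vfmaq_f32(sum3, x3, y3);
    }

    // horizontal reduction of the 4 accumulators
    return vaddvq_f32(vaddq_f32(vaddq_f32(sum0, sum1), vaddq_f32(sum2, sum3)));
}
```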
|
|
|
|
|
|
|
|
|
|
|
|
### Attempt to use the M1 GPU
|
|
|
|
|
|
|
|
One interesting property of the GPT-J transformer architecture is that it allows you to perform part of the inference in parallel - i.e. the Feed-forward network can be computed in parallel with the Self-attention layer:
|
|
|
|
|
|
|
|
https://github.com/ggerganov/ggml/blob/fb558f78d905f85c54813602649ddd628ffe0f3a/examples/gpt-j/main.cpp#L507-L531
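
Roughly speaking, instead of the usual sequential block, GPT-J uses a "parallel" residual formulation in which both branches read the same normalized input:

$$x_{\text{out}} = x + \mathrm{SelfAttention}(\mathrm{LN}(x)) + \mathrm{FeedForward}(\mathrm{LN}(x))$$

compared to the standard block

$$x' = x + \mathrm{SelfAttention}(\mathrm{LN}_1(x)), \qquad x_{\text{out}} = x' + \mathrm{FeedForward}(\mathrm{LN}_2(x'))$$

so the attention and feed-forward terms can be evaluated independently and summed at the end.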
|
|
|
|
|
|
|
|
So I thought, why not bring in the M1 GPU to compute half of the neural network in parallel with the CPU and potentially gain some extra performance? Thanks to the M1's shared memory model, it was relatively easy to offload part of the computation to the GPU using Apple's [Metal Performance Shaders](https://developer.apple.com/documentation/metalperformanceshaders). The GPU shares the host memory, so there is no need to copy the data back and forth as you would normally do with CUDA or OpenCL. The weight matrices are directly available to be used by the GPU.
|
|
|
|
|
|
|
|
However, to my surprise, using MPS together with the CPU did not lead to any performance improvement at all. My conclusion was that the 8-thread NEON CPU computation already saturates the memory bandwidth of the M1, and since the CPU and the GPU on the MacBook share that bandwidth, it does not help to offload the computation to the GPU. Another observation was that the MPS GPU matrix multiplication using 16-bit floats had the same performance as the 8-thread NEON CPU implementation. Again, I attribute this to the saturated memory bandwidth. But of course, my explanation could be totally wrong and somehow the implementation wasn't utilizing the resources correctly.
|
|
|
|
|
|
|
|
In the end, I decided not to use MPS or the GPU at all.
|
|
|
|
|
|
|
|
### Zero memory allocations
|
|
|
|
|
|
|
|
Another property of my implementation is that it does not perform any memory allocations once the model is loaded into
|
|
|
|
memory. All required memory is allocated at the start of the program with a single `malloc` (technically 2 calls, but
|
|
|
|
that is not important).
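
The pattern is essentially a bump ("arena") allocator: one big buffer is requested up front and all tensors are carved out of it by advancing an offset. Here is a minimal sketch of the idea (this is an illustration, not the actual ggml context API):

```c
#include <stdlib.h>

// Minimal bump allocator: a single malloc up front, then pointer bumping only.
struct arena {
    char * data;
    size_t size;
    size_t used;
};

static int arena_init(struct arena * a, size_t size) {
    a->data = malloc(size);  // the single allocation for the whole program
    a->size = size;
    a->used = 0;
    return a->data != NULL;
}

static void * arena_alloc(struct arena * a, size_t n) {
    n = (n + 15) & ~(size_t) 15;     // keep 16-byte alignment
    if (a->used + n > a->size) {
        return NULL;                 // out of pre-allocated memory
    }
    void * ptr = a->data + a->used;
    a->used += n;
    return ptr;
}

static void arena_free(struct arena * a) {
    free(a->data);
    a->data = NULL;
}
```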
|
|
|
|
|
|
|
|
## Usage
|
|
|
|
|
|
|
|
If you want to give this a try and you are on Linux or macOS, simply follow these instructions:
|
|
|
|
|
|
|
|
```bash
|
|
|
|
# Clone the ggml library and build the gpt-j example
|
|
|
|
git clone https://github.com/ggerganov/ggml
|
|
|
|
cd ggml
|
|
|
|
mkdir build && cd build
|
|
|
|
cmake ..
|
|
|
|
make -j4 gpt-j
|
|
|
|
|
|
|
|
# Download the ggml-compatible GPT-J 6B model (requires 12GB disk space)
|
|
|
|
../examples/gpt-j/download-ggml-model.sh 6B
|
|
|
|
|
|
|
|
# Run the inference (requires 16GB of CPU RAM)
|
|
|
|
./bin/gpt-j -m models/gpt-j-6B/ggml-model.bin -p "This is an example"
|
|
|
|
```
|
|
|
|
|
|
|
|
To run the `gpt-j` tool, you need the 12GB `ggml-model.bin` file, which contains the GPT-J model in [ggml](https://github.com/ggerganov/ggml)-compatible format. In the instructions above, the binary file is downloaded from my repository on Hugging Face using the [download-ggml-model.sh](download-ggml-model.sh) script. You can also download the file manually from this link:
|
|
|
|
|
|
|
|
https://huggingface.co/datasets/ggerganov/ggml/tree/main
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
Alternatively, if you don't want to download the 12GB ggml model file, you can perform the conversion yourself using Python.
|
|
|
|
|
|
|
|
First, you need to download the full GPT-J model from here: https://huggingface.co/EleutherAI/gpt-j-6B
|
|
|
|
|
|
|
|
Note that the full model is quite big - about 72 GB. After you download it, you need to convert it to ggml format using
|
|
|
|
the [convert-h5-to-ggml.py](convert-h5-to-ggml.py) script. This will generate the `ggml-model.bin` file, which you can
|
|
|
|
then use with the `gpt-j` program.
|
|
|
|
|
|
|
|
|
|
|
|
## GPT-2
|
|
|
|
|
|
|
|
I also implemented a tool for CPU inference using the smaller GPT-2 models. Their output quality is worse than GPT-J's, but they are much faster to run.
|
|
|
|
|
|
|
|
For example, the Small GPT-2 model is only 240 MB in size, and the inference speed on my MacBook is about 200 tokens/sec.
|
|
|
|
|
|
|
|
For more details, check out the GPT-2 example here: [gpt-2](https://github.com/ggerganov/ggml/tree/master/examples/gpt-2)
|