@@ -59,8 +59,8 @@ For a quick demo, simply run `make base.en`:
 ```java
 $ make base.en

-cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c
-c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp
+cc -I. -O3 -std=c11 -pthread -DGGML_USE_ACCELERATE -c ggml.c -o ggml.o
+c++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
 c++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp whisper.o ggml.o -o main -framework Accelerate
 ./main -h

@@ -70,13 +70,17 @@ options:
   -h,       --help           show this help message and exit
   -s SEED,  --seed SEED      RNG seed (default: -1)
   -t N,     --threads N      number of threads to use during computation (default: 4)
+  -p N,     --processors N   number of processors to use during computation (default: 1)
   -ot N,    --offset-t N     time offset in milliseconds (default: 0)
   -on N,    --offset-n N     segment index offset (default: 0)
+  -mc N,    --max-context N  maximum number of text context tokens to store (default: max)
+  -wt N,    --word-thold N   word timestamp probability threshold (default: 0.010000)
   -v,       --verbose        verbose output
             --translate      translate from source language to english
   -otxt,    --output-txt     output result in a text file
   -ovtt,    --output-vtt     output result in a vtt file
   -osrt,    --output-srt     output result in a srt file
+  -owts,    --output-words   output word-level timestamps to a text file
   -ps,      --print_special  print special tokens
   -pc,      --print_colors   print colors
   -nt,      --no_timestamps  do not print timestamps
@@ -114,23 +118,26 @@ whisper_model_load: n_text_layer = 6
 whisper_model_load: n_mels = 80
 whisper_model_load: f16 = 1
 whisper_model_load: type = 2
-whisper_model_load: mem_required = 505.00 MB
+whisper_model_load: mem_required = 670.00 MB
 whisper_model_load: adding 1607 extra tokens
-whisper_model_load: ggml ctx size = 163.43 MB
+whisper_model_load: ggml ctx size = 140.60 MB
 whisper_model_load: memory size = 22.83 MB
 whisper_model_load: model size = 140.54 MB

-main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, lang = en, task = transcribe, timestamps = 1 ...
+system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |

-[00:00.000 --> 00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
+main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...


-whisper_print_timings: load time = 87.21 ms
-whisper_print_timings: mel time = 24.26 ms
-whisper_print_timings: sample time = 3.87 ms
-whisper_print_timings: encode time = 323.67 ms / 53.94 ms per layer
-whisper_print_timings: decode time = 83.25 ms / 13.87 ms per layer
-whisper_print_timings: total time = 522.66 ms
+[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
+
+
+whisper_print_timings: load time = 105.91 ms
+whisper_print_timings: mel time = 24.62 ms
+whisper_print_timings: sample time = 3.63 ms
+whisper_print_timings: encode time = 324.71 ms / 54.12 ms per layer
+whisper_print_timings: decode time = 83.58 ms / 13.93 ms per layer
+whisper_print_timings: total time = 542.81 ms
 ```

 The command downloads the `base.en` model converted to custom `ggml` format and runs the inference on all `.wav` samples in the folder `samples`.
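
The timing lines in the sample output can be sanity-checked with a little arithmetic. The sketch below, written in Python purely for illustration, recomputes the per-layer figures and a real-time factor from the numbers printed in the newer run. The layer count of 6 is an assumption: it is taken from `n_text_layer = 6` in the log and assumed to hold for the encoder as well, which is consistent with the printed per-layer values. The helper names are hypothetical and are not part of whisper.cpp.

```python
# Figures copied from the newer sample run shown in the log above.
AUDIO_SECONDS = 11.0
STAGE_MS = {
    "load": 105.91,
    "mel": 24.62,
    "sample": 3.63,
    "encode": 324.71,
    "decode": 83.58,
}
TOTAL_MS = 542.81

# Assumption: 6 layers, from "n_text_layer = 6"; the encoder depth is
# assumed to be the same because the printed per-layer values match.
N_LAYERS = 6


def per_layer_ms(stage_ms: float, n_layers: int) -> float:
    """Average time spent per layer for one stage, in milliseconds."""
    return stage_ms / n_layers


def real_time_factor(total_ms: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return (total_ms / 1000.0) / audio_seconds


if __name__ == "__main__":
    print(f"encode per layer: {per_layer_ms(STAGE_MS['encode'], N_LAYERS):.2f} ms")
    print(f"decode per layer: {per_layer_ms(STAGE_MS['decode'], N_LAYERS):.2f} ms")
    print(f"real-time factor: {real_time_factor(TOTAL_MS, AUDIO_SECONDS):.3f}")
```

With these figures, the 11-second clip is processed in roughly 5% of its own duration. The listed stages sum to slightly less than the reported total, so the total presumably includes a small amount of time not attributed to any stage.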
@@ -172,8 +179,8 @@ make large

 | Model | Disk | Mem | SHA |
 | --- | --- | --- | --- |
-| tiny | 75 MB | ~280 MB | `bd577a113a864445d4c299885e0cb97d4ba92b5f` |
-| base | 142 MB | ~430 MB | `465707469ff3a37a2b9b8d8f89f2f99de7299dac` |
+| tiny | 75 MB | ~390 MB | `bd577a113a864445d4c299885e0cb97d4ba92b5f` |
+| base | 142 MB | ~500 MB | `465707469ff3a37a2b9b8d8f89f2f99de7299dac` |
 | small | 466 MB | ~1.0 GB | `55356645c2b361a969dfd0ef2c5a50d530afd8d5` |
 | medium | 1.5 GB | ~2.6 GB | `fd9727b6e1217c2f614f9b698455c4ffd82463b4` |
 | large | 2.9 GB | ~4.7 GB | `b1caaf735c4cc1429223d5a74f0f4d0b9b59a299` |
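
The values in the SHA column are 40 hexadecimal digits long, which is the length of a SHA-1 digest. Assuming they are SHA-1 hashes of the downloaded model files, a download could be checked along the lines of the Python sketch below. The function names and the example path `models/ggml-base.bin` are hypothetical; adjust them to wherever the model was actually saved.

```python
import hashlib

# Digests copied from the SHA column of the table above.
# Assumption: they are SHA-1 hashes of the corresponding model files.
EXPECTED_SHA1 = {
    "tiny": "bd577a113a864445d4c299885e0cb97d4ba92b5f",
    "base": "465707469ff3a37a2b9b8d8f89f2f99de7299dac",
    "small": "55356645c2b361a969dfd0ef2c5a50d530afd8d5",
    "medium": "fd9727b6e1217c2f614f9b698455c4ffd82463b4",
    "large": "b1caaf735c4cc1429223d5a74f0f4d0b9b59a299",
}


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-1 hex digest of a file, reading it in chunks
    so that multi-gigabyte models do not have to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(name: str, path: str) -> bool:
    """True if the file at `path` matches the table entry for `name`."""
    return sha1_of_file(path) == EXPECTED_SHA1[name]


if __name__ == "__main__":
    # Hypothetical location of the downloaded model file.
    model_path = "models/ggml-base.bin"
    print("base model OK:", verify_model("base", model_path))
```

Reading in fixed-size chunks keeps memory use flat regardless of model size, which matters for the larger entries in the table.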
@@ -315,7 +322,7 @@ https://user-images.githubusercontent.com/1991296/199337538-b7b0c7a3-2753-4a88-a
 ## Implementation details

 - The core tensor operations are implemented in C ([ggml.h](ggml.h) / [ggml.c](ggml.c))
-- The high-level C-style API is implemented in C++ ([whisper.h](whisper.h) / [whisper.cpp](whisper.cpp))
+- The transformer model and the high-level C-style API are implemented in C++ ([whisper.h](whisper.h) / [whisper.cpp](whisper.cpp))
 - Sample usage is demonstrated in [main.cpp](examples/main)
 - Sample real-time audio transcription from the microphone is demonstrated in [stream.cpp](examples/stream)
 - Various other examples are available in the [examples](examples) folder