From 0e689f83d8049f2d532f9778827d61b7db2a2c77 Mon Sep 17 00:00:00 2001
From: Georgi Gerganov
Date: Wed, 2 Nov 2022 22:03:27 +0200
Subject: [PATCH] Update README.md

---
 README.md | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 68 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index a888880..84d49d9 100644
--- a/README.md
+++ b/README.md
@@ -308,12 +308,76 @@ to highlight words with high or low confidence:
 
 image
 
-## Word-level timestamps (experimental)
+## Controlling the length of the generated text segments (experimental)
 
-The [main](examples/main) example has experimental support for word-level timestamp generation. The accuracy
-is not great, but might be improved in the future.
+For example, to limit the line length to a maximum of 16 characters, simply add `-ml 16`:
 
-To use it, simply add the `-owts` command-line argument. There is a free parameter `-wt` that should be around `0.01`.
+```java
+./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 16
+
+whisper_model_load: loading model from './models/ggml-base.en.bin'
+...
+system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
+
+main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
+
+[00:00:00.000 --> 00:00:00.850] And so my
+[00:00:00.850 --> 00:00:01.590] fellow
+[00:00:01.590 --> 00:00:04.140] Americans, ask
+[00:00:04.140 --> 00:00:05.660] not what your
+[00:00:05.660 --> 00:00:06.840] country can do
+[00:00:06.840 --> 00:00:08.430] for you, ask
+[00:00:08.430 --> 00:00:09.440] what you can do
+[00:00:09.440 --> 00:00:10.020] for your
+[00:00:10.020 --> 00:00:11.000] country.
+```
+
+## Word-level timestamp
+
+The `--max-len` argument can be used to obtain word-level timestamps. Simply use `-ml 1`:
+
+```java
+./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav -ml 1
+
+whisper_model_load: loading model from './models/ggml-base.en.bin'
+...
+system_info: n_threads = 4 / 10 | AVX2 = 0 | AVX512 = 0 | NEON = 1 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 |
+
+main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
+
+[00:00:00.000 --> 00:00:00.320]
+[00:00:00.320 --> 00:00:00.370] And
+[00:00:00.370 --> 00:00:00.690] so
+[00:00:00.690 --> 00:00:00.850] my
+[00:00:00.850 --> 00:00:01.590] fellow
+[00:00:01.590 --> 00:00:02.850] Americans
+[00:00:02.850 --> 00:00:03.300] ,
+[00:00:03.300 --> 00:00:04.140] ask
+[00:00:04.140 --> 00:00:04.990] not
+[00:00:04.990 --> 00:00:05.410] what
+[00:00:05.410 --> 00:00:05.660] your
+[00:00:05.660 --> 00:00:06.260] country
+[00:00:06.260 --> 00:00:06.600] can
+[00:00:06.600 --> 00:00:06.840] do
+[00:00:06.840 --> 00:00:07.010] for
+[00:00:07.010 --> 00:00:08.170] you
+[00:00:08.170 --> 00:00:08.190] ,
+[00:00:08.190 --> 00:00:08.430] ask
+[00:00:08.430 --> 00:00:08.910] what
+[00:00:08.910 --> 00:00:09.040] you
+[00:00:09.040 --> 00:00:09.320] can
+[00:00:09.320 --> 00:00:09.440] do
+[00:00:09.440 --> 00:00:09.760] for
+[00:00:09.760 --> 00:00:10.020] your
+[00:00:10.020 --> 00:00:10.510] country
+[00:00:10.510 --> 00:00:11.000] .
+```
+
+## Karaoke-style movie generation (experimental)
+
+The [main](examples/main) example provides support for output of karaoke-style movies, where the
+currently pronounced word is highlighted. Use the `-wts` argument and run the generated bash script.
+This requires to have `ffmpeg` installed.
 
 Here are a few *"typical"* examples: