To check that we do not match words that are absent from a media file containing
speech, we use the English-language Harvard sentences as audio, search for the
words of the classic French pangram "Portez ce vieux whisky au juge blond qui
fume" [1], and ensure no Features are produced.
[1]: "Take this old whisky to the blond judge who is smoking"
Found out that Whisper throws a hissy fit in the form of a RuntimeError if
there is no speech in the audio. We should consider catching this.
> RuntimeError: stack expects a non-empty TensorList
> stdout: "No active speech found in audio"
For the moment we can check that audio without speech raises this error, and
leave catching it as a TODO.
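A minimal sketch of what such an interim check might look like. The function
transcribe_silent_audio() is a hypothetical stand-in that only raises the error
quoted above; it is not Whisper's API, and the real test would run the
extractor on a silent sample instead.

```python
import unittest


def transcribe_silent_audio():
    """Hypothetical stand-in for running Whisper on audio with no speech."""
    raise RuntimeError("stack expects a non-empty TensorList")


class TestNoSpeech(unittest.TestCase):
    def test_no_speech_raises(self):
        # TODO: once the error is caught, assert that zero Features are returned
        with self.assertRaisesRegex(RuntimeError, "non-empty TensorList"):
            transcribe_silent_audio()
```

Pinning the error message as well as the type means the test should fail
loudly if the underlying behaviour changes, which is the cue to revisit the TODO.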
Functional tests for WordFeatureExtractor consist of making sure it can find
words known in advance. The Harvard Sentences [1] are a useful means of doing
that. These are 'standard sentences' that are used for speech quality
measurements, and so would be decent candidates for assessing word recognition.
The Open Speech Repository [2] has samples of sentences to download.
In testing, the Whisper medium model had trouble with a few words:
- glue
- well
- punch
- truck
I'm not sure why. Even when I recorded myself speaking the Harvard sentences in
higher quality (OSR files are 8kHz range) it would still not recognise these
words. A separate functional test of only those words was added as a result.
This would perhaps be worth exploring in more detail if there was time.
[1]: See e.g. https://www.cs.columbia.edu/~hgs/audio/harvard.html
[2]: https://www.voiptroubleshooter.com/open_speech/index.html
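The word check described above could be sketched roughly as follows. The helper
missing_words() is a hypothetical name, not part of WordFeatureExtractor, and
the transcript is hard-coded here rather than produced by Whisper.

```python
import re


def missing_words(transcript: str, expected: list[str]) -> list[str]:
    """Return the expected words that do not occur in the transcript."""
    # Lower-case and split into words so punctuation does not block a match
    found = set(re.findall(r"[a-z']+", transcript.lower()))
    return [w for w in expected if w.lower() not in found]


# A Harvard sentence used as a stand-in transcript
transcript = "The birch canoe slid on the smooth planks."
print(missing_words(transcript, ["birch", "canoe", "glue"]))  # → ['glue']
```

Reporting the missing words, rather than a bare pass/fail, makes it easier to
spot repeat offenders such as the four words listed above.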
Calls pulled out relate to setup and working of Whisper:
- _whispermodel()
- _batched_inference_pipeline()
- _transcribe()
Defaults defined: model, device, compute type, beam size, batch size, pipeline type
Tests:
- basic init
- init with no media
- run() with no words (early exit 0 Features)
- run() with mocked transcribe
NOTE: these are unit tests and do not exercise Whisper
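A rough sketch of the two run() cases, using unittest.mock so that Whisper is
never loaded. WordExtractorSketch is a hypothetical stand-in written for this
example; its constructor, the segment dictionaries and the return values are
assumptions, and only the _transcribe() name comes from the list above.

```python
import unittest
from unittest import mock


class WordExtractorSketch:
    """Hypothetical stand-in; not the real WordFeatureExtractor."""

    def __init__(self, words=None):
        self.words = words or []

    def _transcribe(self, path):
        raise NotImplementedError("the real method would call Whisper here")

    def run(self, path):
        if not self.words:
            return []  # early exit: nothing to search for
        segments = self._transcribe(path)
        return [s for s in segments if s["word"] in self.words]


class TestWordExtractorSketch(unittest.TestCase):
    def test_run_no_words(self):
        # No words configured, so _transcribe() must never be reached
        self.assertEqual(WordExtractorSketch().run("x.mp4"), [])

    def test_run_mocked_transcribe(self):
        fe = WordExtractorSketch(words=["glue"])
        fake = [{"word": "glue", "start": 1.0}, {"word": "sheet", "start": 1.5}]
        with mock.patch.object(fe, "_transcribe", return_value=fake) as m:
            result = fe.run("x.mp4")
        m.assert_called_once_with("x.mp4")
        self.assertEqual(result, [fake[0]])
```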
Uses a manually-crafted video with laughter between 15 and 20 s.
Test takes LaughFE's internal Feature time adjustment into account (see related
commit).
Note: very slow test
@see: df3c559
Functional tests for LoudAudioFeatureExtractor
Currently uses one manually-generated video with blank audio except between
15 and 20 s, where 1-2 sine tones are present.
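For illustration, an audio track of this shape could be generated with the
standard library alone. This is only a sketch: write_tone_wav() and its
parameters are assumptions, it writes a WAV file rather than a video, and it is
not how the sample above was actually produced.

```python
import math
import struct
import wave


def write_tone_wav(path, duration=25.0, tone_start=15.0, tone_end=20.0,
                   freq=440.0, rate=8000):
    """Write mono 16-bit audio: silence, with a sine tone in [tone_start, tone_end)."""
    frames = bytearray()
    for n in range(int(duration * rate)):
        t = n / rate
        if tone_start <= t < tone_end:
            # 80% of full scale leaves headroom and avoids clipping
            sample = int(0.8 * 32767 * math.sin(2 * math.pi * freq * t))
        else:
            sample = 0
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))
```

A second tone could be added by summing two sine terms at half the amplitude
each before packing.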
Only has a single property at present: SAMPLE_DIR for the path to where
sample videos are stored
TestVideoActivityFEFunctional now inherits from this instead of unittest.TestCase
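A minimal sketch of the arrangement, assuming a base class called
FunctionalTestBase and a samples directory under tests/; both the class name
and the path are placeholders, since neither is stated here.

```python
import unittest
from pathlib import Path


class FunctionalTestBase(unittest.TestCase):
    """Hypothetical base class name; holds settings shared by functional tests."""

    SAMPLE_DIR = Path("tests") / "samples"  # assumed location of sample videos


class TestVideoActivityFEFunctional(FunctionalTestBase):
    def test_sample_dir_inherited(self):
        # Subclasses read the shared path instead of hard-coding their own
        self.assertEqual(self.SAMPLE_DIR, Path("tests") / "samples")
```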