The problem I had overlooked until now (!!), due to only using small
exemplar videos for testing, is that VideoActivityFeatureExtractor keeps
a fraction of small (0.5s) Features. This is not so problematic for a short
exemplar video, but results in a lot of jumpy Features in an actual source
video.
Fix approach:
As with LoudnessFE, keep a specific number of Features and expand their
duration (the expansion is something I think I will do for LoudnessFE too),
dropping any Features that would be consumed by that expanded range.
Tries to drop the lowest-scoring Features until the target time (range) is
reached. This is not optimised and is a relatively naïve approach; there are
many inputs for which it would produce a non-ideal pruning.
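A minimal sketch of that naive pruning, assuming hypothetical `score` and
`duration` attributes on Feature:

```python
def _drop_lowest_scoring(features, target_max):
    """Drop the lowest-scoring Features until their total duration is
    no more than target_max seconds (attribute names are assumptions)."""
    kept = sorted(features, key=lambda f: f.score, reverse=True)
    while kept and sum(f.duration for f in kept) > target_max:
        kept.pop()  # the last element is the current lowest scorer
    # No backtracking: some inputs will land well below the target
    # range -- the "non-ideal pruning" mentioned above.
    return kept
```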
TargetTimeAdjuster will adjust a list of Features until their total duration
is within an optional margin of a target duration.
Helper functions:
- _determine_margin() :: work out the max and min cutoff times, taking the
  margin and margin strategy (percent / absolute) into account
- _features_total_time() :: basic sum of the durations of a list of Features
TODO: rename to TargetDurationAdjuster ? rename 'strategy' ??
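A sketch of how those helpers might look; the strategy names and the
percent-as-number convention are assumptions:

```python
def _determine_margin(target, margin, strategy="absolute"):
    """Return (min_cutoff, max_cutoff) in seconds around target.
    'percent' treats margin as a percentage of target (eg 10 for 10%);
    'absolute' treats it as a fixed number of seconds."""
    if strategy == "percent":
        delta = target * (margin / 100.0)
    elif strategy == "absolute":
        delta = margin
    else:
        raise ValueError(f"unknown margin strategy: {strategy}")
    return (target - delta, target + delta)

def _features_total_time(features):
    """Basic sum of the durations of a list of Features."""
    return sum(f.duration for f in features)
```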
Adjusters will be used to modify a list of Features. This could either be:
- to modify the overall set (eg to target a time)
- to modify individual Features
The most important Adjuster will be one that targets an overall time, eg:
"modify this list of Features such that their times add up to 1 minute (either ±
a % or a hard limit)"
@see: feature_extractors.py::FeatureExtractor
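As a rough sketch of the interface (hypothetical; the shape mirrors
FeatureExtractor):

```python
from abc import ABC, abstractmethod
from typing import List

class Adjuster(ABC):
    """An Adjuster takes a list of Features and returns a modified list,
    whether that means reshaping the overall set or each member."""

    @abstractmethod
    def adjust(self, features: List) -> List:
        ...

class TargetTimeAdjuster(Adjuster):
    """Eg: adjust Features so their durations sum to ~60s, within a margin."""
    def __init__(self, target=60.0, margin=5.0, strategy="absolute"):
        self.target, self.margin, self.strategy = target, margin, strategy

    def adjust(self, features: List) -> List:
        # drop lowest-scoring Features until within margin (see sketch above)
        ...
```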
To check that we do not match words which are not present in a media file
containing speech, we use the English-language Harvard Sentences for the
audio, search for the words of the classic French pangram "Portez ce vieux
whisky au juge blond qui fume" [1], and ensure that no Features are produced.
[1]: "Take this old whisky to the blond judge who is smoking"
Found out that Whisper throws a hissy fit in the form of a RuntimeError if
there is no speech in the audio. We should consider catching this.
> RuntimeError: stack expects a non-empty TensorList
> stdout: "No active speech found in audio"
For the moment we can check that audio with no speech throws an error, and
leave catching it as a TODO.
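If/when we do catch it, a guard along these lines might be enough (sketch;
_safe_transcribe and the message check are assumptions):

```python
def _safe_transcribe(self, *args, **kwargs):
    """Treat Whisper's no-speech RuntimeError as 'no segments' rather
    than letting it propagate (sketch for the TODO above)."""
    try:
        return self._transcribe(*args, **kwargs)
    except RuntimeError as e:
        if "non-empty TensorList" in str(e):
            return []  # no active speech found in audio
        raise
```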
Functional tests for WordFeatureExtractor consist of making sure it can find
words known in advance. The Harvard Sentences [1] are a useful means of doing
that. These are 'standard sentences' that are used for speech quality
measurements, and so would be decent candidates for assessing word recognition.
The Open Speech Repository [2] has samples of the sentences to download.
In testing, the Whisper medium model had trouble with a few words:
- glue
- well
- punch
- truck
I'm not sure why. Even when I recorded myself speaking the Harvard sentences
at higher quality (the OSR files are in the 8kHz range), it would still not
recognise these words. A separate functional test covering only those words
was added as a result. This would perhaps be worth exploring in more detail
if there were time.
[1]: See eg https://www.cs.columbia.edu/~hgs/audio/harvard.html
[2]: https://www.voiptroubleshooter.com/open_speech/index.html
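The shape of those functional tests, sketched with assumed names ("The birch
canoe slid on the smooth planks" is the first sentence of Harvard list 1):

```python
KNOWN_WORDS = ["birch", "canoe", "smooth", "planks"]

def test_finds_known_words(osr_sample_path):
    """The extractor should produce a Feature for each word we know is
    spoken in the sample (fixture/attribute names are assumptions)."""
    extractor = WordFeatureExtractor(osr_sample_path, target_words=KNOWN_WORDS)
    extractor.run()
    matched = {f.word for f in extractor.features}
    assert matched == set(KNOWN_WORDS)
```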
The calls pulled out relate to the setup and operation of Whisper:
- _whispermodel()
- _batched_inference_pipeline()
- _transcribe()
Defaults defined: model, device, compute type, beam size, batch size, pipeline type
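Assuming faster-whisper (the batched inference pipeline naming suggests it),
the wiring looks roughly like this; the specific default values here are
illustrative, not the project's:

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cpu", compute_type="int8")
pipeline = BatchedInferencePipeline(model=model)
segments, info = pipeline.transcribe(
    "media.mp4",
    beam_size=5,
    batch_size=16,
    word_timestamps=True,  # needed to turn matched words into Features
)
```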
Tests:
- basic init
- init with no media
- run() with no words (early exit 0 Features)
- run() with mocked transcribe
NOTE: these are unit tests and do not exercise Whisper
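Eg the mocked-transcribe case, sketched with assumed names so that Whisper
itself is never loaded:

```python
from unittest.mock import patch

def test_run_with_mocked_transcribe():
    fake_words = [("whisky", 1.0, 1.5)]  # (word, start, end) -- assumed shape
    extractor = WordFeatureExtractor("dummy.mp4", target_words=["whisky"])
    with patch.object(WordFeatureExtractor, "_transcribe", return_value=fake_words):
        extractor.run()
    assert len(extractor.features) == 1
```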
BREAKING CHANGE: passing no words to WFE is no longer an error; it raises a notice instead
WordFeatureExtractor is not fast; even the import is slow. However, it
processes files and returns Features corresponding to matched words.
WhisperFE will be slightly different from other FEs in that there are
specific target words to be searched for. Not specifying these could be an
error (this commit treats it as such), but a better approach may be to
downgrade that to a (logging) notice and simply match nothing / exit early.
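The softer behaviour could be as simple as this (sketch; attribute and
message are assumptions):

```python
import logging

logger = logging.getLogger(__name__)

class WordFeatureExtractor:
    def run(self):
        if not self.target_words:
            logger.info("no target words given; matching nothing")
            return  # early exit: zero Features
        ...  # normal transcribe-and-match path
```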
Uses a manually-crafted video with laughter between 15 and 20s.
Test takes LaughFE's internal Feature time adjustment into account (see related
commit).
Note: very slow test
@see: df3c559
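The assertion, roughly, allowing for that adjustment (sketch; names assumed):

```python
def test_laugh_feature_overlaps_known_window(crafted_video_path):
    extractor = LaughFeatureExtractor(crafted_video_path)
    extractor.run()
    assert extractor.features, "expected at least one laughter Feature"
    f = extractor.features[0]
    # LaughFE adjusts Feature times internally, so assert overlap with
    # the known 15-20s window rather than exact bounds
    assert f.start <= 20 and f.end >= 15
```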