TargetTimeAdjuster will adjust a list of Features until it is within an optional
margin of a target total duration.
Helper functions:
- _determine_margin() :: figure out the max and min cutoff times, considering
margin and margin strategy (percent / absolute)
- _features_total_time() :: basic sum of list of Features' durations
TODO: rename to TargetDurationAdjuster ? rename 'strategy' ??
Adjusters will be used to modify a list of Features. This could either be:
- to modify the overall set (eg to target a time)
- to modify individual Features
The most important Adjuster will be one that targets an overall time, eg:
"modify this list of Features such that their times add up to 1 minute (either ±
a % or a hard limit)"
@see: feature_extractors.py::FeatureExtractor
To check we do not match any words that are not present in a media file with
speech, we use the English-language Harvard sentences for audio, and compare to
the classic French pangram: "Portez ce vieux whisky au juge blond qui fume" [1]
and ensure no Features are produced
[1]: "Take this old whisky to the blond judge who is smoking"
Found out that Whisper throws a hissy fit in the form of a RuntimeError if the
there is no speech in the audio. We should consider catching this.
> RuntimeError: stack expects a non-empty TensorList
> stdout: "No active speech found in audio"
For the moment we can check that no audio throws an error and leave this as a TODO
Functional tests for WordFeatureExtractor consist of making sure it can find
words known in advance. The Harvard Sentences [1] are a useful means of doing
that. These are 'standard sentences' that are used for speech quality
measurements, and so would be decent candidates for assessing word recognition.
The Open Speech REpository [2] has samples of sentences to download.
In testing, the Whisper medium model had trouble with a few words:
- glue
- well
- punch
- truck
I'm not sure why. Even when I recorded myself speaking the Harvard sentences in
higher quality (OSR files are 8kHz range) it would still not recognise these
words. A separate functional test of only those words was added as a result.
This would perhaps be worth exploring in more detail if there was time.
[1]: See eg https://www.cs.columbia.edu/~hgs/audio/harvard.html
[2]: https://www.voiptroubleshooter.com/open_speech/index.html
Calls pulled out relate to setup and working of Whisper:
- _whispermodel()
- _batched_inference_pipeline()
- _transcribe()
Defaults defined: model, device, compute type, beamsize, batchsize, pipeline type
Tests:
- basic init
- init with no media
- run() with no words (early exit 0 Features)
- run() with mocked transcribe
NOTE: these are unit tests and do not exercise Whisper
BREAKING CHANGE: no words to WFE are no longer an error, they raise a notice
WordFeatureExtractor is not fast- even the import is slow. However, it processes
files and returns Features corresponding to matched words.
WhisperFE will be slightly different to other FEs in that there is/are specific
target words to be searched for. Not specifying these could be an error (this
commit specifies this as such) but a better approach may be to downgrade that to
a (logging) notice, and simply match nothing / early exit.
Uses a manually-crafted video with laughters between 15-20s.
Test takes LaughFE's internal Feature time adjustment into account (see related
commit).
Note: very slow test
@see: df3c559
To help functional testing, LaughFE's internal adjustment times are exposed.
Recap: when a laugh is detected by LaughFE, the time of the laugh itself it not
used directly; instead, the resulting Feature has some time prepended to try to
capture the thing that caused the laugh.
When functional testing the FEs we set up specially-crafted videos with features
at known points, so to make sure the LaughterFE is tested correctly we adjust
the tests by the amount of time the FE adjusts by so that it properly tests the
intended behaviour.
@see:
- feature_extractors.py::LaughterFeatureExtractor
Functional tests for LoudAudioFeatureExtractor
Currently uses one manually-generated video with blank audio except between
15-20s where 1-2 sine tones are present
Only has a single property at present: SAMPLE_DIR for the path to where
sample videos are stored
TestVideoActivityFEFunctional now inherits from this instead of unittest.TestCase
Problems fixed:
- Feature repr did not include feature_extractor since that API was changed
- Intervals that were equivalent were not equal, so Features were not properly
sorted or equal
This was done so the collection of loudnesses from an audio file could be mocked
in testing, but improves readability
TODO: review number of params and consider further refactoring