
Highlight Generator Pipeline Planning

Overview

https://roberthallam.com/files/highlightgeneration-process.svg

Pipeline `API'

Input

User-driven selection of input videos.

files

user-selected list of ≥1 input files to be processed

(optional) time restriction

start and end time (or: start time + duration?) of the file to restrict highlight generation to (format: seconds & H:M:S?) (see the Time Restriction note below)

(optional, stretch) feature extractor mapping

a map of files to feature extractors, eg:

video1:
    path: /video/directory/video1.mkv
    feature_extractors:
      - laughter-detection
      - loud-moments

video2:
    path: /video/directory/video2.mp4
    start: 10:00
    end: 50:00
    feature_extractors:
      - word-recognition
Time Restriction

To properly operate on a restricted range, the pipeline can create a temporary media file (using eg ffmpeg) for the Feature Selection step to operate on.

Creating a temporary file can be avoided by ensuring each Feature Selector respects a custom duration, but since some are third-party, their implementations may need to be updated.

Discussion point: pros and cons of updating 3P feature selectors
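
For illustration, a minimal Python sketch of the temporary-file approach, assuming ffmpeg is on the PATH; the output directory, filename and stream-copy choice are placeholders, not decisions:

import subprocess
from pathlib import Path

def make_restricted_copy(src, start, end, tmp_dir="/tmp/videohighlights"):
    """Cut the [start, end] portion (eg "10:00" to "50:00") of src into a temporary file."""
    Path(tmp_dir).mkdir(parents=True, exist_ok=True)
    out = Path(tmp_dir) / f"restricted-{Path(src).name}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ss", str(start),   # restriction start
         "-to", str(end),     # restriction end
         "-c", "copy",        # stream copy: fast, no re-encode, but cuts land on keyframes
         str(out)],
        check=True,
    )
    return out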

Output

Conceptually, a list of files (either the original path, or the path to a temporary time-restricted version) and associated options for each. This will be either a language-specific object or the equivalent JSON.

Example:

{
    "source_videos": [
        { "path": "/video/directory/video1.mkv",
          "feature_extractors": [
              "laughter-detection",
              "loud-moments"
              ]
        },
        { "path": "/tmp/videohighlights/inputclip0001.mkv",
          "feature_extractors": [
              "word-recognition"
              ]
        }
    ]
}
Further Considerations
  • time specification formats – start & end? start & duration? either? negative times for specifying distance from the end of the file? (one possible parser is sketched below)
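
One possible parser for the above, purely illustrative; the accepted formats (plain seconds, M:S / H:M:S, a leading `-' meaning distance from the end of the file) are assumptions rather than decisions:

def parse_timespec(spec, file_duration):
    """Return an absolute position in seconds; a leading '-' counts back from the end."""
    negative = spec.startswith("-")
    if negative:
        spec = spec[1:]
    seconds = 0.0
    for part in spec.split(":"):   # handles "90", "1:30" and "0:01:30" alike
        seconds = seconds * 60 + float(part)
    return file_duration - seconds if negative else seconds

eg parse_timespec("-5:00", 3600) gives 3300.0, ie five minutes before the end of an hour-long file.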

Source / Feature Selection

A Source is an automation-driven method of figuring out what bits of an input video to keep.

Input

A list of input videos, as in Input above

Options
  • Source-specific options (eg min duration, threshold etc for laughter-detection),

  • working directory

  • minimum duration (see Further Considerations; one possible options structure is sketched after this list)
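
A sketch of one possible options structure; the field names and defaults are assumptions:

from dataclasses import dataclass, field

@dataclass
class FeatureSelectionOptions:
    working_dir: str = "/tmp/videohighlights"
    min_duration: float = 1.0   # seconds; see Further Considerations
    # per-extractor options, eg {"laughter-detection": {"threshold": 0.8}}
    extractor_options: dict = field(default_factory=dict)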

Output

A set of ≥0 Feature-type objects or equivalent JSON.
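
The exact shape of a Feature is not yet fixed; one hypothetical sketch, with assumed field names:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Feature:
    path: str                      # source video the feature was found in
    source: str                    # extractor that produced it, eg "laughter-detection"
    time: float                    # point timestamp in seconds...
    end: Optional[float] = None    # ...or, if set, the feature covers [time, end]
    score: Optional[float] = None  # extractor-specific confidence, if any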

Further Considerations

At the time of writing, the feature selection drivers output point timestamps rather than durations, but conceptually durations make more sense. It may be worthwhile to automatically promote any `point' timestamps to a duration.

Pros: makes the next step in the pipeline more uniform. Cons: will probably over-sample.

Consequent consideration: does that mean we should let the user adjust the pre-consolidation times too? Probably, but doing that in a UX-friendly way will potentially take some doing.
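
A sketch of the promotion step under that assumption, using the hypothetical Feature shape sketched above; the padding values are placeholders:

def promote_to_duration(feature, pad_before=2.0, pad_after=5.0):
    """Turn a point timestamp into a (start, end) range; existing ranges pass through unchanged."""
    if feature.end is not None:
        return (feature.time, feature.end)
    return (max(0.0, feature.time - pad_before), feature.time + pad_after)

How generous pad_before / pad_after are is exactly the over-sampling trade-off noted above.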

Consolidation

The consolidation stage takes a list of timestamps (across videos) and consolidates / merges proximal times to create clip definitions across sources.

Input

A list of video files with associated timestamps or time ranges (and their sources), eg in JSON:

{ "videos": [ "/path/to/videos/video1.mp4": [ { "time": 180, "source": "laughter-detect" },
                                              { "time": 187, "source": "laughter-detect" },
                                              { "time": 295, "source": "loud-detect" },
                                              { "time": 332, "source": "laughter-detect" }
                                              ],
              "/path/to/videos/video2.mp4": [ { "start": 45, "end": 130, "source": "segmenter" } ],
              ]
  }
Approach

The input list of feature times goes through a comparison process: if a feature overlaps with another (that is, it starts or ends within the time period of another feature), the two features are consolidated into one. The comparison can be done with a delta, a small amount of time (eg 15s) within which `nearby' intervals are treated as overlapping.
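
A sketch of that comparison pass over (start, end) ranges; `delta' and `max_duration' correspond to the options listed below, and source tracking is omitted for brevity:

def consolidate(ranges, delta=15.0, max_duration=60.0):
    """Merge ranges that overlap or fall within `delta` seconds of each other."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + delta:
            candidate = (merged[-1][0], max(merged[-1][1], end))
            if candidate[1] - candidate[0] <= max_duration:
                merged[-1] = candidate
                continue
        merged.append((start, end))
    return merged

With the example input above (after promotion to ranges), the 180s and 187s laughter features merge into one clip while the 295s and 332s features remain separate.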

Options
  • maximum delta between clips to be consolidated (default: 15s [rationale: Stetson-Harrison approach¹])

  • maximum duration of a consolidated clip (default: 60s [rationale: max duration of YT shorts?])

  • maximum number of consolidated clips to output (default: unlimited)

Refinement

User-driven process of selecting clips and applying operators to them before final output.

1. Selection

User choice of which clips to keep.

2. Process

User applies (video) Processes to the clip(s):

duration

select start and end time (possibly before/after generated clip's boundaries)

join

further join clips which were not joined at consolidation stage

filters

eg sharpen / slomo / (de)saturate etc [stretch]

split

Note: need to be careful not to reimplement an NLE here!

Highlights

Ultimate output.
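
For the final assembly, a sketch using ffmpeg's concat demuxer, assuming the refined clips arrive as (path, start, end) triples; filenames and encode settings are placeholders:

import subprocess
from pathlib import Path

def render_highlights(clips, out_path="highlights.mkv", tmp_dir="/tmp/videohighlights"):
    """Cut each (path, start, end) clip, then concatenate the cuts into one file."""
    Path(tmp_dir).mkdir(parents=True, exist_ok=True)
    list_file = Path(tmp_dir) / "concat.txt"
    with open(list_file, "w") as f:
        for i, (path, start, end) in enumerate(clips):
            clip_out = Path(tmp_dir) / f"clip{i:04d}.mkv"
            subprocess.run(
                ["ffmpeg", "-y", "-i", path,
                 "-ss", str(start), "-to", str(end),
                 str(clip_out)],   # re-encode so cuts are frame-accurate
                check=True,
            )
            f.write(f"file '{clip_out}'\n")
    # the concat demuxer assumes the cut clips share codecs/resolution/framerate;
    # mixed sources would need a re-encode or the concat filter instead
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", str(list_file), "-c", "copy", str(out_path)],
        check=True,
    )
    return out_path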


¹ The Stetson-Harrison approach