Long-Term Feature Banks for Detailed Video Understanding

The Long-Term Feature Bank (LTFB) extends the Short-Term Feature Bank to augment state-of-the-art models for detailed video understanding. It stores rich, time-indexed representations computed over the entire movie, covering objects, actions, and past scenes. The bank is a flexible set of features that a model can draw on to learn the semantics of complex videos. It is an alternative to earlier techniques that either extract features from individual frames and feed them into an RNN, or rely on aggressive subsampling to fit the full movie into a GPU’s memory.

The Long-Term Feature Bank is constructed from a sequence of short clips. The features of the clip at time $t$ have dimensions $N_t \times d$, where $N_t$ is the number of actors detected in that clip and $d$ is the feature dimension of each actor. The long-term features are then computed by avg/max pooling over the time dimension. This yields a 2048-dimensional long-term feature, which is concatenated with the short-term feature.
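The pooling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the original implementation: the function name `pool_feature_bank`, the random stand-in features, and the actor counts are all assumptions made for the example. It assumes each clip contributes an array of shape (N_t, d), with one row per detected actor.

```python
import numpy as np

def pool_feature_bank(bank, mode="avg"):
    """Pool a list of per-clip actor features over time.

    bank: list of arrays, one per time step, each of shape (N_t, d).
    Returns a single d-dimensional vector.
    """
    # Stack every actor feature from every time step: shape (sum of N_t, d).
    stacked = np.concatenate(bank, axis=0)
    if mode == "avg":
        return stacked.mean(axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode}")

# Hypothetical data: four clips with a varying number of detected actors.
rng = np.random.default_rng(0)
d = 2048
actors_per_clip = [2, 0, 3, 1]  # a clip may contain no actors at all
bank = [rng.standard_normal((n, d)) for n in actors_per_clip]

short_term = rng.standard_normal(d)              # stand-in short-term feature
long_term = pool_feature_bank(bank, mode="max")  # pooled long-term feature
combined = np.concatenate([short_term, long_term])
print(combined.shape)  # (4096,)
```

Because the clips are stored as a list, each time step can hold a different number of actors. A bank in which no clip contains any actor would need separate handling, since pooling over zero rows is undefined.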

To compute these features, a 3D CNN takes a short video clip of two to five seconds. It computes a feature map and, using region proposals, extracts RoI features for each actor. The 3D CNN itself captures only short-term information; the full model then combines the short-term and long-term features. The long-term features are built from the short-term clip features, and the interaction between the two is computed through a non-local block.
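A non-local block of this kind behaves like attention: each short-term actor feature queries the long-term features and adds a weighted summary of them back to itself. The sketch below is a simplified, assumed form using scaled dot-product attention with a residual connection. The weight matrices are random placeholders, and any normalization, dropout, or stacking that a full implementation might use is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(short_term, long_term, w_q, w_k, w_v, w_out):
    """short_term: (N_s, d), long_term: (N_l, d). Returns (N_s, d)."""
    q = short_term @ w_q   # queries from short-term features, (N_s, d_inner)
    k = long_term @ w_k    # keys from long-term features,     (N_l, d_inner)
    v = long_term @ w_v    # values from long-term features,   (N_l, d_inner)
    # Each row holds one short-term actor's weights over the long-term features.
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)
    attended = attn @ v    # (N_s, d_inner)
    # Residual connection: short-term features plus long-term context.
    return short_term + attended @ w_out

# Hypothetical sizes, kept small for illustration.
rng = np.random.default_rng(0)
d, d_inner = 8, 4
S = rng.standard_normal((2, d))   # 2 actors in the current clip
L = rng.standard_normal((5, d))   # 5 actor features from the bank
w_q, w_k, w_v = (rng.standard_normal((d, d_inner)) for _ in range(3))
w_out = rng.standard_normal((d_inner, d))

out = nonlocal_block(S, L, w_q, w_k, w_v, w_out)
print(out.shape)  # (2, 8)
```

The output keeps the shape of the short-term input, so blocks like this can be stacked or followed by a classifier without reshaping.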

The long-term features are obtained by avg/max pooling the per-clip features stored in the Long-Term Feature Bank. Each entry comes from a short clip of two to five seconds, in which RoIs identify the actors. The Long-Term Feature Bank (LTFB) is thus a list of features that can be pooled by avg/max over the time dimension.

The Long-Term Feature Bank is assembled from a list of short-term clips that together span a long stretch of video. Each entry has shape $N_t \times d$, where $N_t$ is the number of actors detected at time $t$ and $d$ is the per-actor feature dimension. The feature bank operator (FBO) can be implemented as avg/max pooling over the time dimension. Compared with the short-term features, the FBO works over a much larger set of features.

The long-term feature bank is a list of features extracted from short-term video clips. The entry at time $t$ has dimensions $N_t \times d$, where $N_t$ is the number of actors detected at that time. The short-term branch, in contrast, takes only the current clip as input. After avg/max pooling, the long-term feature bank is reduced to a set of 2048 features.

The long-term feature bank is a list of features extracted from short-term video clips, with each clip contributing the RoI features of its actors. The FBO combines the two sources: the short-term features and the long-term feature bank. Each input clip is a window of two to five seconds, and the combination is computed with an avg/max-based method.

The long-term feature bank is created from the short-term features. The pooled result consists of 2048 long-term features and is computed from a list of 32 short-term clips. The bank itself is a list of short-term features with two dimensions, a time dimension and a spatial dimension. The FBO is used to detect actors in a video, but it is not suitable for long-term pixel-level analysis.

For a 3D CNN, the short-term feature map has size $16 \times 14 \times 14$. The long-term feature bank contains one or more actors for each video, so its length is determined by the list of these actors. The FBO also carries out local and regional concatenation. LTFB-based features are a good candidate for training 3D CNNs.
