
Alignment key for the A/V features in the .npy/.hdf5 files #20

Closed
amanchadha opened this issue Apr 20, 2021 · 3 comments

@amanchadha commented Apr 20, 2021

Hi Vladimir,

Long time no talk :) I was wondering if you could share the code that converts the .npy features (from your VGGish and I3D feature extractors), which you made available to me in the middle of last year, into the .hdf5 files referenced in the MDVC README (Usage). In particular, I am interested in understanding how you "align" the audio and video features (based on the code below).

Questions:

  1. Are the audio and video features aligned by time in the .hdf5 file? Is that what T_audio/T_video stands for?
  2. Is D_audio/D_video simply the feature dimension?
import torch  # the audio fallback below creates a torch tensor

def load_multimodal_features_from_h5(feat_h5_video, feat_h5_audio, feature_names_list,
                                     video_id, start, end, duration, get_full_feat=False, cs=True):
    supported_feature_names = {'i3d_features', 'c3d_features', 'vggish_features'}
    assert isinstance(feature_names_list, list)
    assert len(feature_names_list) > 0
    assert set(feature_names_list).issubset(supported_feature_names)

    if 'vggish_features' in feature_names_list:
        audio_stack = feat_h5_audio.get(f'{video_id}/vggish_features')

        # some videos don't have audio
        if audio_stack is None:
            print(f'audio_stack is None @ {video_id}')
            audio_stack = torch.empty((0, 128)).float()

        T_audio, D_audio = audio_stack.shape

    if 'i3d_features' in feature_names_list:
        video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
        video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')

        assert video_stack_rgb.shape == video_stack_flow.shape
        T_video, D_video = video_stack_rgb.shape

        # align the two modalities by trimming the longer sequence
        if T_video > T_audio:
            video_stack_rgb = video_stack_rgb[:T_audio, :]
            video_stack_flow = video_stack_flow[:T_audio, :]
            T = T_audio
        elif T_video < T_audio:
            audio_stack = audio_stack[:T_video, :]
            T = T_video
        else:
            # T_video == T_audio, so either one works
            T = T_video

        # at this point both modalities should have the same temporal length
        assert audio_stack.shape[0] == video_stack_rgb.shape[0]

Thanks again for your help!

@amanchadha amanchadha changed the title .npy to .hdf5 for A/V features? Aligning the A/V features in the .npy/.hdf5 files? Apr 20, 2021
@amanchadha amanchadha changed the title Aligning the A/V features in the .npy/.hdf5 files? Alignment key for the A/V features in the .npy/.hdf5 files Apr 20, 2021
@v-iashin (Owner) commented
Hi 👋 ! Indeed!

I am afraid I don't have the exact snippet that does .mp4 -> .npy -> .hdf5. However, the procedure is quite straightforward. Someone asked about it before and I wrote one from memory: #11 (comment). The answer to the first question there is what you need. I didn't run it back then; I just wrote it down in my comment, so please make sure it doesn't fail with errors (check the follow-up comments for possible bugs). Overall, just extract features with the video_features code (repo) and run that script on top of the output.
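Since the exact snippet is lost, here is a minimal sketch of the .npy -> .hdf5 packing step. It assumes one .npy file per video per modality (the directory layout, file naming, and the `pack_npy_to_hdf5` helper are my illustration, not the repo's code); the dataset paths follow the `'{video_id}/vggish_features'` convention used by the loader in the question:

```python
import glob
import os

import h5py
import numpy as np


def pack_npy_to_hdf5(npy_dir, out_path, feature_name='vggish_features'):
    """Pack per-video .npy feature files (shape (T, D)) into one HDF5 file.

    Each video ends up under the dataset path '{video_id}/{feature_name}',
    matching what feat_h5_audio.get(...) expects in the loader above.
    """
    with h5py.File(out_path, 'w') as h5:
        for npy_path in sorted(glob.glob(os.path.join(npy_dir, '*.npy'))):
            # assume the file is named '<video_id>.npy'
            video_id = os.path.splitext(os.path.basename(npy_path))[0]
            feats = np.load(npy_path)
            h5.create_dataset(f'{video_id}/{feature_name}', data=feats)
```

For I3D you would do the same twice, writing the RGB and flow stacks under `'{video_id}/i3d_features/rgb'` and `'{video_id}/i3d_features/flow'`.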

Are the audio and video features aligned by time in the hdf5 file? Is that what T_audio/T_video stands for?

I am not sure what you mean here. T_audio and T_video are the temporal dimensions of the features. Each I3D feature is extracted from 24 frames of a video sampled at 25 fps and thus temporally spans 0.96 (24/25) seconds, with no overlap. At the same time, the VGGish features are extracted from 0.96 s audio segments. As you can see, both sequences should have the same temporal length if they are extracted from the same video; therefore, they are aligned. However, sometimes (I don't remember how rarely) the lengths are not equal. In that case, we just trim the longer one to match the shorter one (see the I3D part in the snippet you provided).
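The trimming rule can be boiled down to a couple of lines (a simplified illustration with NumPy arrays, not the repo's exact code): since both modalities use the same 0.96 s step, whenever the lengths disagree the longer stack is cut to the shorter one.

```python
import numpy as np


def align_by_trimming(audio_stack, video_stack):
    """Trim the longer of the two feature stacks so both have T rows."""
    T = min(audio_stack.shape[0], video_stack.shape[0])
    return audio_stack[:T], video_stack[:T]
```

This is exactly what the `T_video > T_audio` / `T_video < T_audio` branches in the question's snippet do, except that there the RGB and flow stacks are trimmed together.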

Is the D_audio/D_video simply the feature dimension?

Yes, it is.

@tkbadamdorj commented

Hi Vladimir,

Do the audio features for each video cover the entire video? Did you filter out the audio segments that are not inside event proposals?

Thank you!

@v-iashin (Owner) commented

Yes, similar to the visual features and speech, the audio covers the entire video. And yes, we trim each modality to the event segment, as shown here:

audio_stack = audio_stack[start_idx:end_idx, :]
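The conversion from a proposal's start/end times (in seconds) to `start_idx`/`end_idx` is not shown in this issue; a plausible sketch, assuming the 0.96 s feature step described above (the `segment_to_indices` helper and its rounding are my guess, not necessarily the repo's exact arithmetic):

```python
import math

# each feature row covers this many seconds (24 frames at 25 fps)
FEATURE_STEP_SEC = 0.96


def segment_to_indices(start_sec, end_sec, num_feats):
    """Map a [start_sec, end_sec] segment to feature row indices."""
    start_idx = max(0, int(start_sec / FEATURE_STEP_SEC))
    end_idx = min(num_feats, math.ceil(end_sec / FEATURE_STEP_SEC))
    return start_idx, end_idx
```

With these indices, `audio_stack[start_idx:end_idx, :]` keeps only the rows overlapping the event proposal.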
