
Alignment key for the A/V features in the .npy/.hdf5 files #20

Closed
amanchadha opened this issue Apr 20, 2021 · 3 comments

@amanchadha commented Apr 20, 2021

Hi Vladimir,

Long time no talk :) I was wondering if you could share the code that converts the .npy features (from your VGGish and I3D feature extractors), which you made available to me in the middle of last year, into the .hdf5 files referenced in the MDVC README (Usage). In particular, I am interested in understanding how you "align" the audio and video features (based on the code below).

Questions:

  1. Are the audio and video features aligned by time in the .hdf5 file? Is that what T_audio/T_video stands for?
  2. Is D_audio/D_video simply the feature dimension?
import torch  # the audio fallback below creates a torch tensor

def load_multimodal_features_from_h5(feat_h5_video, feat_h5_audio, feature_names_list,
                                     video_id, start, end, duration, get_full_feat=False, cs=True):
    supported_feature_names = {'i3d_features', 'c3d_features', 'vggish_features'}
    assert isinstance(feature_names_list, list)
    assert len(feature_names_list) > 0
    assert set(feature_names_list).issubset(supported_feature_names)

    if 'vggish_features' in feature_names_list:
        audio_stack = feat_h5_audio.get(f'{video_id}/vggish_features')

        # some videos don't have audio
        if audio_stack is None:
            print(f'audio_stack is None @ {video_id}')
            audio_stack = torch.empty((0, 128)).float()

        T_audio, D_audio = audio_stack.shape

    if 'i3d_features' in feature_names_list:
        video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
        video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')

        assert video_stack_rgb.shape == video_stack_flow.shape
        T_video, D_video = video_stack_rgb.shape

        # align the two modalities by trimming the longer sequence
        if T_video > T_audio:
            video_stack_rgb = video_stack_rgb[:T_audio, :]
            video_stack_flow = video_stack_flow[:T_audio, :]
            T = T_audio
        elif T_video < T_audio:
            audio_stack = audio_stack[:T_video, :]
            T = T_video
        else:
            # T_video == T_audio, so either one works
            T = T_video

        # at this point both modalities should have the same temporal length
        assert audio_stack.shape[0] == video_stack_rgb.shape[0]

Thanks again for your help!

@amanchadha amanchadha changed the title .npy to .hdf5 for A/V features? Aligning the A/V features in the .npy/.hdf5 files? Apr 20, 2021
@amanchadha amanchadha changed the title Aligning the A/V features in the .npy/.hdf5 files? Alignment key for the A/V features in the .npy/.hdf5 files Apr 20, 2021
@v-iashin (Owner) commented
Hi 👋 ! Indeed!

I am afraid I don't have the exact snippet that does .mp4 -> .npy -> .hdf5. However, the procedure is quite straightforward. Someone asked about it before and I wrote one from memory: #11 (comment). The answer to the first question there is what you need. I didn't run it back then; I just wrote it down in my comment, so please make sure it doesn't fail with errors (check the follow-up comments for possible bugs). Overall, just extract features with the video_features code (repo) and run that script on top of the output.
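Since the exact snippet is lost, here is a minimal sketch of the .npy -> .hdf5 packing step. It assumes one .npy file per video per modality (the directory layout, file naming, and the `pack_npy_to_hdf5` helper are my illustration, not the repo's code); the dataset paths follow the `'{video_id}/vggish_features'` convention used by the loader in the question:

```python
import glob
import os

import h5py
import numpy as np


def pack_npy_to_hdf5(npy_dir, out_path, feature_name='vggish_features'):
    """Pack per-video .npy feature files (shape (T, D)) into one HDF5 file.

    Each video ends up under the dataset path '{video_id}/{feature_name}',
    matching what feat_h5_audio.get(...) expects in the loader above.
    """
    with h5py.File(out_path, 'w') as h5:
        for npy_path in sorted(glob.glob(os.path.join(npy_dir, '*.npy'))):
            # assume the file is named '<video_id>.npy'
            video_id = os.path.splitext(os.path.basename(npy_path))[0]
            feats = np.load(npy_path)
            h5.create_dataset(f'{video_id}/{feature_name}', data=feats)
```

For I3D you would do the same twice, writing the RGB and flow stacks under `'{video_id}/i3d_features/rgb'` and `'{video_id}/i3d_features/flow'`.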

Are the audio and video features aligned by time in the hdf5 file? Is that what T_audio/T_video stands for?

I am not sure what you mean here. T_audio and T_video are the temporal dimensions of the features. Each I3D feature is extracted from 24 frames of a video sampled at 25 fps and thus temporally spans 0.96 (24/25) seconds, with no overlap. At the same time, the VGGish features are extracted from 0.96 s audio segments. As you can see, both sequences should have the same temporal length if they are extracted from the same video; therefore, they are aligned. However, sometimes (I don't remember how rarely) the lengths are not equal. In that case, we just trim the longer one to match the shorter one (see the I3D part in the snippet you provided).
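The trimming rule can be boiled down to a couple of lines (a simplified illustration with NumPy arrays, not the repo's exact code): since both modalities use the same 0.96 s step, whenever the lengths disagree the longer stack is cut to the shorter one.

```python
import numpy as np


def align_by_trimming(audio_stack, video_stack):
    """Trim the longer of the two feature stacks so both have T rows."""
    T = min(audio_stack.shape[0], video_stack.shape[0])
    return audio_stack[:T], video_stack[:T]
```

This is exactly what the `T_video > T_audio` / `T_video < T_audio` branches in the question's snippet do, except that there the RGB and flow stacks are trimmed together.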

Is the D_audio/D_video simply the feature dimension?

Yes, it is.

@tkbadamdorj commented

Hi Vladimir,

Do the audio features for each video cover the entire video? Did you filter out the audio segments that are not inside event proposals?

Thank you!

@v-iashin (Owner) commented

Yes, similar to the visual features and speech, the audio covers the entire video. And yes, we trim each modality to the event segment, as shown here:

audio_stack = audio_stack[start_idx:end_idx, :]
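The conversion from a proposal's start/end times (in seconds) to `start_idx`/`end_idx` is not shown in this issue; a plausible sketch, assuming the 0.96 s feature step described above (the `segment_to_indices` helper and its rounding are my guess, not necessarily the repo's exact arithmetic):

```python
import math

# each feature row covers this many seconds (24 frames at 25 fps)
FEATURE_STEP_SEC = 0.96


def segment_to_indices(start_sec, end_sec, num_feats):
    """Map a [start_sec, end_sec] segment to feature row indices."""
    start_idx = max(0, int(start_sec / FEATURE_STEP_SEC))
    end_idx = min(num_feats, math.ceil(end_sec / FEATURE_STEP_SEC))
    return start_idx, end_idx
```

With these indices, `audio_stack[start_idx:end_idx, :]` keeps only the rows overlapping the event proposal.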
