Validation error: array lengths

Hi!
I am experiencing an issue when running my test submission script.
I am getting this error:

main.DataValidationError: Arrays lengths for query do not match. video_ids: 8295; timestamps: 682608; features: 341304.

This obviously comes from the /opt/validation function:

def validate_lengths(dataset: str, features_npz):
    n_video_ids = len(features_npz["video_ids"])
    n_timestamps = len(features_npz["timestamps"])
    n_features = len(features_npz["features"])
    if not (n_video_ids == n_timestamps == n_features):
        raise DataValidationError(
            f"Arrays lengths for {dataset} do not match. "
            f"video_ids: {n_video_ids}; "
            f"timestamps: {n_timestamps}; "
            f"features: {n_features}. "
        )
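
For reference, this is how I reproduce the same counts locally (the file name is my own; the script just prints the length of each array):

import numpy as np

# Quick local reproduction of the validation check on the file my script produces.
data = np.load("query_descriptors.npz")
for key in ("video_ids", "timestamps", "features"):
    print(key, len(data[key]))
# video_ids 8295
# timestamps 682608
# features 341304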

However, when reading the Code Submission Format page, I understood that the timestamps array is described like this:

timestamps is a 1D or 2D array of timestamps indicating the start and end times in seconds that the descriptor describes.
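
So, for example, with three descriptors across two videos I would expect something like this (toy numbers, just my interpretation of the format):

import numpy as np

# Toy example of my reading of the format: one row per descriptor,
# all three arrays aligned along axis 0.
video_ids = np.array(["Q100001", "Q100001", "Q100002"])
timestamps = np.array([[0.0, 1.0],
                       [1.0, 2.0],
                       [0.0, 1.0]], dtype=np.float32)  # shape (3, 2): start, end
features = np.zeros((3, 146), dtype=np.float32)        # shape (3, 146)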

The code in my main.py looks like this:

def generate_query_descriptors(query_video_ids) -> np.ndarray:
    # Initialize return values
    video_ids = []
    timestamps = []
    descriptors = []

    # Generate descriptors for each video

    for i in tqdm.tqdm(range(query_video_ids.shape[0])):
        try:
            video_id = query_video_ids[i]
            video_file = f'{QRY_VIDEOS_DIRECTORY}/{video_id}.mp4'  
            start_timestamps, end_timestamps, qry_descriptor = extract_descriptor(video_file)
            descriptors.append(qry_descriptor)

            timestamps.append(np.hstack([start_timestamps, end_timestamps]))
            video_ids.append(video_id)
        except Exception as e:
            print(query_video_ids[i], e)

    descriptors = np.concatenate(descriptors).astype(np.float32)
    timestamps = np.concatenate(timestamps).astype(np.float32)

    return video_ids, descriptors, timestamps

Where the start and end timestamps come from:

    start_timestamps = np.array(tuple(start_timestamps.values()), dtype=np.float32)
    end_timestamps = np.array(tuple(end_timestamps.values()), dtype=np.float32)

Any hints at what I might be doing wrong?

Thanks! :slight_smile:

Hey @ndujar-

My guess is that np.concatenate is not doing what you want. You might try logging the shape of your arrays to ensure that you get the dimensionality you expect.
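
For example, something like this would show where the lengths diverge (using the names from your snippet, with toy values):

import numpy as np

# Toy shapes for one video with 3 descriptors:
start_timestamps = np.array([0.0, 1.0, 2.0], dtype=np.float32)
end_timestamps = np.array([1.0, 2.0, 3.0], dtype=np.float32)

print(np.hstack([start_timestamps, end_timestamps]).shape)         # (6,)   twice as many entries
print(np.stack([start_timestamps, end_timestamps], axis=1).shape)  # (3, 2) one (start, end) pair per row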

Hi @chrisk-dd ,

sorry for the delay in replying.

Thank you very much for pointing me in the right direction.
Indeed, my arrays were wrongly assembled and they didn’t match.

I believe I am a bit closer to the solution, but now I am facing the following error:

ValueError: Object arrays cannot be loaded when allow_pickle=False

Doing a quick test, I have explored the contents of my arrays:

import numpy as np

b = np.load('reference_descriptors.npz', allow_pickle=True)
for key in b.keys():
    print(key)
    print(b[key].shape)

Which now outputs:

video_ids
(40318,)
features
(40318,)
timestamps
(40318,)

A closer look at a couple of specific items in the features ndarray:

print(b['video_ids'][1], b['features'][1].shape)
print(b['video_ids'][27], b['features'][27].shape)

R204414 (59, 146)
R210266 (18, 146)

Which I believe should be correct, as the features now contain one array of 146 floats per second of their respective video.

Given that different videos have different lengths, the ndarrays have to be saved as arrays of objects.
Or perhaps I am wrong?

Thanks!

I don’t think that the ndarrays need to contain objects. I might recommend that you take a look at how the quick start solution submission is structured and try to get your submission to match the way data is formatted there.
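
Roughly speaking (just a sketch, not the quickstart's actual code): if you currently have one 2D array per video, you can flatten them and repeat the video_id once per row, so no object arrays are needed:

import numpy as np

# Sketch only -- not the quickstart's actual code. Flatten per-video
# feature arrays into one 2D array with the video_id repeated per row.
per_video = {
    "R204414": np.zeros((59, 146), dtype=np.float32),
    "R210266": np.zeros((18, 146), dtype=np.float32),
}

video_ids = np.concatenate([np.full(len(feats), vid) for vid, feats in per_video.items()])
features = np.concatenate(list(per_video.values()))

print(video_ids.shape, features.shape)  # (77,) (77, 146)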

Thanks again for the clarification @chrisk-dd. I had actually misunderstood the quickstart code and wrongly assumed that there should be one row per video_id.
Following your suggestion, I re-ran submission_quickstart/main.py and checked the internals of the descriptor .npz file it generates:

import pandas as pd
import numpy as np

b = np.load('../submission_quickstart/reference_descriptors.npz')
query_df = pd.DataFrame.from_dict({item: b[item] for item in b.files}, orient='index')

query_df.T

This outputs:
[image: table view of the transposed DataFrame]

So, basically, there are multiple repeated entries in the video_ids “column”: one for each interval from which a descriptor has been extracted (with start_timestamp and end_timestamp defining that interval).
Now I understand.
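
For anyone else landing here, this is roughly how I ended up assembling the arrays before saving (simplified sketch of my script, using the same extract_descriptor helper as above):

import numpy as np

# Simplified sketch of my assembly code: one row per descriptor/segment.
all_video_ids, all_timestamps, all_features = [], [], []

for video_id in query_video_ids:
    starts, ends, feats = extract_descriptor(f"{QRY_VIDEOS_DIRECTORY}/{video_id}.mp4")
    all_video_ids.append(np.full(len(feats), video_id))      # repeat the id once per segment
    all_timestamps.append(np.stack([starts, ends], axis=1))  # (n_segments, 2)
    all_features.append(feats)                               # (n_segments, 146)

video_ids = np.concatenate(all_video_ids)
timestamps = np.concatenate(all_timestamps).astype(np.float32)
features = np.concatenate(all_features).astype(np.float32)

# All three arrays now share the same first dimension, so no object arrays are needed.
np.savez("query_descriptors.npz",
         video_ids=video_ids, timestamps=timestamps, features=features)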

I hope this helps clarify things for others :slight_smile:

Thanks!
