Validation error: array lengths

Hi!
I am experiencing an issue when running my test submission script.
I am getting this error:

main.DataValidationError: Arrays lengths for query do not match. video_ids: 8295; timestamps: 682608; features: 341304.

This obviously comes from the /opt/validation function:

def validate_lengths(dataset: str, features_npz):
    n_video_ids = len(features_npz["video_ids"])
    n_timestamps = len(features_npz["timestamps"])
    n_features = len(features_npz["features"])
    if not (n_video_ids == n_timestamps == n_features):
        raise DataValidationError(
            f"Arrays lengths for {dataset} do not match. "
            f"video_ids: {n_video_ids}; "
            f"timestamps: {n_timestamps}; "
            f"features: {n_features}. "
        )
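
For reference, this is how I reproduce the same counts locally (the file name is my own; the script just prints the length of each array):

import numpy as np

# Quick local reproduction of the validation check on the file my script produces.
data = np.load("query_descriptors.npz")
for key in ("video_ids", "timestamps", "features"):
    print(key, len(data[key]))
# video_ids 8295
# timestamps 682608
# features 341304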

However, when reading the Code Submission Format page, I understood that the timestamps array is described like this:

timestamps is a 1D or 2D array of timestamps indicating the start and end times in seconds that the descriptor describes.
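
So, for example, with three descriptors across two videos I would expect something like this (toy numbers, just my interpretation of the format):

import numpy as np

# Toy example of my reading of the format: one row per descriptor,
# all three arrays aligned along axis 0.
video_ids = np.array(["Q100001", "Q100001", "Q100002"])
timestamps = np.array([[0.0, 1.0],
                       [1.0, 2.0],
                       [0.0, 1.0]], dtype=np.float32)  # shape (3, 2): start, end
features = np.zeros((3, 146), dtype=np.float32)        # shape (3, 146)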

The code in my main.py looks like this:

def generate_query_descriptors(query_video_ids) -> np.ndarray:
    # Initialize return values
    video_ids = []
    timestamps = []
    descriptors = []

    # Generate descriptors for each video

    for i in tqdm.tqdm(range(query_video_ids.shape[0])):
        try:
            video_id = query_video_ids[i]
            video_file = f'{QRY_VIDEOS_DIRECTORY}/{video_id}.mp4'  
            start_timestamps, end_timestamps, qry_descriptor = extract_descriptor(video_file)
            descriptors.append(qry_descriptor)

            timestamps.append(np.hstack([start_timestamps, end_timestamps]))
            video_ids.append(video_id)
        except Exception as e:
            print(query_video_ids[i], e)

    descriptors = np.concatenate(descriptors).astype(np.float32)
    timestamps = np.concatenate(timestamps).astype(np.float32)

    return video_ids, descriptors, timestamps

Where the start and end timestamps come from:

    start_timestamps = np.array(tuple(start_timestamps.values()), dtype=np.float32)
    end_timestamps = np.array(tuple(end_timestamps.values()), dtype=np.float32)

Any hints at what I might be doing wrong?

Thanks! :slight_smile:

Hey @ndujar-

My guess is that np.concatenate is not doing what you want. You might try logging the shape of your arrays to ensure that you get the dimensionality you expect.
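
For example, something like this would show where the lengths diverge (using the names from your snippet, with toy values):

import numpy as np

# Toy shapes for one video with 3 descriptors:
start_timestamps = np.array([0.0, 1.0, 2.0], dtype=np.float32)
end_timestamps = np.array([1.0, 2.0, 3.0], dtype=np.float32)

print(np.hstack([start_timestamps, end_timestamps]).shape)         # (6,)   twice as many entries
print(np.stack([start_timestamps, end_timestamps], axis=1).shape)  # (3, 2) one (start, end) pair per row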

Hi @chrisk-dd ,

sorry for the delay in replying.

Thank you very much for pointing me in the right direction.
Indeed, my arrays were wrongly assembled and they didn’t match.

I believe I am a bit closer to the solution, but now I am facing the following error:

ValueError: Object arrays cannot be loaded when allow_pickle=False

Doing a quick test, I have explored the contents of my arrays:

import numpy as np

b = np.load('reference_descriptors.npz', allow_pickle=True)
for key in b.keys():
    print(key)
    print(b[key].shape)

Which now outputs:

video_ids
(40318,)
features
(40318,)
timestamps
(40318,)

A closer look at a couple of specific items in the features ndarray:

print(b['video_ids'][1], b['features'][1].shape)
print(b['video_ids'][27], b['features'][27].shape)

R204414 (59, 146)
R210266 (18, 146)

Which I believe should be correct, as the features now contain one array of 146 floats per second of their respective video.

Given that different videos have different lengths, the ndarrays have to be saved as arrays of objects.
Or perhaps I am wrong?

Thanks!

I don’t think that the ndarrays need to contain objects. I might recommend that you take a look at how the quick start solution submission is structured and try to get your submission to match the way data is formatted there.
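
Roughly speaking (just a sketch, not the quickstart's actual code): if you currently have one 2D array per video, you can flatten them and repeat the video_id once per row, so no object arrays are needed:

import numpy as np

# Sketch only -- not the quickstart's actual code. Flatten per-video
# feature arrays into one 2D array with the video_id repeated per row.
per_video = {
    "R204414": np.zeros((59, 146), dtype=np.float32),
    "R210266": np.zeros((18, 146), dtype=np.float32),
}

video_ids = np.concatenate([np.full(len(feats), vid) for vid, feats in per_video.items()])
features = np.concatenate(list(per_video.values()))

print(video_ids.shape, features.shape)  # (77,) (77, 146)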

Thanks again for the clarification @chrisk-dd. I had actually misunderstood the quickstart code and wrongly assumed that there should be one row per video_id.
Following your suggestion, I re-ran submission_quickstart/main.py and checked the internals of the descriptor .npz file it generates:

import pandas as pd
import numpy as np

b = np.load('../submission_quickstart/reference_descriptors.npz')
query_df = pd.DataFrame.from_dict({item: b[item] for item in b.files}, orient='index')

query_df.T

This outputs:
[image: table view of the transposed DataFrame]

So, basically, there are multiple repeated entries in the video_ids “column”: one for each interval from which a descriptor has been extracted (with start_timestamp and end_timestamp defining that interval).
Now I understand.
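
For anyone else landing here, this is roughly how I ended up assembling the arrays before saving (simplified sketch of my script, using the same extract_descriptor helper as above):

import numpy as np

# Simplified sketch of my assembly code: one row per descriptor/segment.
all_video_ids, all_timestamps, all_features = [], [], []

for video_id in query_video_ids:
    starts, ends, feats = extract_descriptor(f"{QRY_VIDEOS_DIRECTORY}/{video_id}.mp4")
    all_video_ids.append(np.full(len(feats), video_id))      # repeat the id once per segment
    all_timestamps.append(np.stack([starts, ends], axis=1))  # (n_segments, 2)
    all_features.append(feats)                               # (n_segments, 146)

video_ids = np.concatenate(all_video_ids)
timestamps = np.concatenate(all_timestamps).astype(np.float32)
features = np.concatenate(all_features).astype(np.float32)

# All three arrays now share the same first dimension, so no object arrays are needed.
np.savez("query_descriptors.npz",
         video_ids=video_ids, timestamps=timestamps, features=features)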

I hope this helps clarify things for others :slight_smile:

Thanks!
