Query_ids and reference_ids fields in hdf5

evgenity · June 30, 2021, 5:17pm

Could you please provide more info on the “query_ids” and “reference_ids” fields in the HDF5 file? They don’t seem to be filled in the benchmark examples. The contest documentation says these fields should contain the IDs of the corresponding images, but it also says the IDs should be sorted. Do we need to fill these fields? If so, could you please specify the format of these IDs (e.g. string or int), or provide an example of how to fill them for the random case below?

import h5py
import numpy as np

M_ref = np.random.rand(1_000_000, 256).astype('float32')
M_query = np.random.rand(50_000, 256).astype('float32')

out = OUT_DIR / "fb-isc-submission.h5"
with h5py.File(out, "w") as f:
    f.create_dataset("query", data=M_query)
    f.create_dataset("reference", data=M_ref)
    f.create_dataset('query_ids', data=qry_ids) #<-------
    f.create_dataset('reference_ids', data=ref_ids) #<-------

mike-dd · June 30, 2021, 8:20pm

Welcome, evgenity! Good question – we should have included a couple more lines in that example.

You do need to include these ID datasets in the submitted HDF5 file. The IDs are strings, since each one includes a ‘Q’ or ‘R’ prefix.

Here are two lines that generate valid ID arrays for the example above:

qry_ids = ['Q' + str(x).zfill(5) for x in range(50_000)]
ref_ids = ['R' + str(x).zfill(6) for x in range(1_000_000)]

Thanks for pointing this out. I’ll also add the above lines to the website so it will be clearer for the next person.

jayqi · July 1, 2021, 12:14am

Hi @evgenity, to add some more detail in response to your question:

The code that @mike-dd provided does indeed generate valid ID arrays. (In fact, we use code like that in various places for competition administration.) However, we recommend against directly using that code when creating your HDF5 submission file. Instead, we recommend you generate your ID arrays using a process that will correspond to the order in which you load and/or process the image files. The purpose of having competitors include the ID arrays is as a check that your descriptor vectors are in the correct order. That check doesn’t work if you generate the ID arrays independently of how you process your data.

As an example of one way you might generate the ID array for the query images:

from pathlib import Path

query_image_paths = sorted(Path("data/raw/query_images").iterdir())
query_image_paths[:3]
#> [ PosixPath('data/raw/query_images/Q00000.jpg'),
#>   PosixPath('data/raw/query_images/Q00001.jpg'),
#>   PosixPath('data/raw/query_images/Q00002.jpg')]

query_ids = [path.stem for path in query_image_paths]
query_ids[:3]
#> ['Q00000', 'Q00001', 'Q00002']

Created at 2021-06-30 17:58:34 MDT by reprexlite v0.4.2

In this example, the file paths in query_image_paths correspond directly to query_ids. If you process your images in the order in which they appear in query_image_paths, you’ll know that your descriptor vectors and your query IDs are in the same, correct order. This is a useful way to catch errors. For example, if you left out sorted in the above code snippet, iterdir() by itself would yield paths in arbitrary order. Your query_ids would then be in that same unsorted order, and we’d be able to catch that when you submit your file to the competition.
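One way to keep the IDs and descriptors in lockstep is to build both in a single loop over the sorted paths. The following is only a minimal sketch, not the competition's reference code: compute_descriptor is a hypothetical placeholder for your own model, describe_directory is a name made up for illustration, and the temporary directory of empty files merely stands in for the real image folder.

```python
import tempfile
from pathlib import Path


def compute_descriptor(path):
    """Hypothetical placeholder: replace with your own feature extractor."""
    # Returns a dummy 2-dimensional vector derived from the file name only.
    return [float(len(path.stem)), float(int(path.stem[1:]))]


def describe_directory(image_dir):
    """Return (ids, descriptors) built in one pass so they share one order."""
    ids, descriptors = [], []
    for path in sorted(Path(image_dir).iterdir()):
        ids.append(path.stem)  # e.g. "Q00000" from "Q00000.jpg"
        descriptors.append(compute_descriptor(path))
    return ids, descriptors


# Stand-in for the real query image folder (assumption for illustration).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["Q00002.jpg", "Q00000.jpg", "Q00001.jpg"]:
        (Path(tmp) / name).touch()
    demo_ids, demo_descriptors = describe_directory(tmp)
```

Because each ID is appended in the same iteration as its descriptor, row i of the descriptors always belongs to ids[i], whatever order the directory listing happens to produce.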

The code using zfill in the previous response can still be quite useful if you want to write your own validation step to check whether your HDF5 submission has the correct ID values in the correct order.
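Such a validation step could look roughly like the sketch below. It assumes the counts and zero-padding widths from the example earlier in this thread (50,000 query IDs padded to 5 digits, 1,000,000 reference IDs padded to 6 digits); expected_ids and first_mismatch are hypothetical helper names, not part of any competition tooling.

```python
def expected_ids(prefix, count, width):
    """Build the expected ID sequence, e.g. Q00000, Q00001, ... for ('Q', n, 5)."""
    return [prefix + str(i).zfill(width) for i in range(count)]


def first_mismatch(ids, prefix, count, width):
    """Return None if ids match the expected sequence, else describe the first problem."""
    # IDs read back from an HDF5 file may arrive as bytes; normalise to str.
    ids = [x.decode() if isinstance(x, bytes) else str(x) for x in ids]
    if len(ids) != count:
        return f"expected {count} IDs, found {len(ids)}"
    for position, (got, want) in enumerate(zip(ids, expected_ids(prefix, count, width))):
        if got != want:
            return f"position {position}: expected {want}, found {got}"
    return None
```

For the query side you would call something like first_mismatch(query_ids, "Q", 50_000, 5) on the IDs you are about to write (or have read back from your file), and treat any non-None result as a reason not to submit yet.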
