Query_ids and reference_ids fields in hdf5

Could you please provide more info on the “query_ids” and “reference_ids” fields in the HDF5 file? It seems they are not being filled in the benchmark examples. The documentation for the contest says that these fields should contain the IDs of the corresponding images, but it also says the IDs should be sorted. Do we need to fill these fields? If so, could you please specify the format of these IDs (e.g. string or int), or provide an example of how to fill them for the random case below?

import h5py
import numpy as np

M_ref = np.random.rand(1_000_000, 256).astype('float32')
M_query = np.random.rand(50_000, 256).astype('float32')

out = OUT_DIR / "fb-isc-submission.h5"
with h5py.File(out, "w") as f:
    f.create_dataset("query", data=M_query)
    f.create_dataset("reference", data=M_ref)
    f.create_dataset('query_ids', data=qry_ids) #<-------
    f.create_dataset('reference_ids', data=ref_ids) #<-------

Welcome, evgenity! Good question. That example should have included a couple more lines.

You do need to include these ID datasets in the submitted hdf5 file. The data type will be strings, since the IDs include a ‘Q’ or ‘R’ prefix.

Here are two lines that would create a valid submission:

qry_ids = ['Q' + str(x).zfill(5) for x in range(50_000)]
ref_ids = ['R' + str(x).zfill(6) for x in range(1_000_000)]

Thanks for pointing this out. I’ll also add the above lines to the website so it will be clearer for the next person.

Hi @evgenity, to add some more detail in response to your question:

The code that @dd-mike provided does indeed generate valid ID arrays. (In fact, we use code like that in various places for competition administration.) However, we recommend against directly using that code when creating your HDF5 submission file. Instead, we recommend you generate your ID arrays using a process that will correspond to the order in which you load and/or process the image files. The purpose of having competitors include the ID arrays is as a check that your descriptor vectors are in the correct order. That check doesn’t work if you generate the ID arrays independently of how you process your data.

Here is one way you might generate the ID array for the query images:

from pathlib import Path

query_image_paths = sorted(Path("data/raw/query_images").iterdir())
#> [ PosixPath('data/raw/query_images/Q00000.jpg'),
#>   PosixPath('data/raw/query_images/Q00001.jpg'),
#>   PosixPath('data/raw/query_images/Q00002.jpg')]

query_ids = [path.stem for path in query_image_paths]
#> ['Q00000', 'Q00001', 'Q00002']

Created at 2021-06-30 17:58:34 MDT by reprexlite v0.4.2

In this example, the file paths in query_image_paths correspond directly with query_ids. If you process your images in the order that they appear in query_image_paths, then you’ll know that your descriptor vectors and your query IDs are in the same, correct order. This is a useful way to catch errors. For example, if you left out sorted in the above code snippet, iterdir() by itself would yield paths in arbitrary order. Your query_ids would then be in that same unsorted order, and we’d be able to catch that when you submit your file to the competition.
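To make the pairing concrete, here is a minimal sketch in which the IDs and the descriptors are collected in the same loop, so they cannot drift out of order. Note that compute_descriptor and describe_images are hypothetical names used only for illustration; the placeholder returns a dummy vector where your real model would go.

```python
from pathlib import Path


def compute_descriptor(image_path):
    """Hypothetical placeholder for a real feature extractor.

    Returns a dummy 4-dimensional vector; a real model would read the image.
    """
    return [float(len(image_path.name))] * 4


def describe_images(image_dir):
    """Return (ids, descriptors) built together from one sorted pass."""
    ids, descriptors = [], []
    for path in sorted(Path(image_dir).iterdir()):
        ids.append(path.stem)  # e.g. 'Q00000' from 'Q00000.jpg'
        descriptors.append(compute_descriptor(path))
    return ids, descriptors
```

Because each ID is appended right next to its descriptor, any mistake in the file ordering shows up in the IDs as well, which is what makes the check meaningful.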

The code using zfill in the previous response can still be quite useful if you want to write your own validation step to check whether your HDF5 submission has the correct ID values in the correct order.
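As a rough sketch of such a validation step, the function below compares a list of IDs against the zfill-generated sequence and reports the first problem it finds. The names expected_ids and find_first_mismatch are hypothetical, and the bytes handling is an assumption about how the IDs might come back when read from a file; the counts and widths are the ones used earlier in this thread.

```python
def expected_ids(prefix, count, width):
    """Build the reference sequence, e.g. 'Q00000', 'Q00001', ..."""
    return [prefix + str(i).zfill(width) for i in range(count)]


def find_first_mismatch(ids, prefix, count, width):
    """Return None if ids match the expected sequence, else a description."""
    expected = expected_ids(prefix, count, width)
    if len(ids) != len(expected):
        return f"expected {len(expected)} IDs, got {len(ids)}"
    for position, (got, want) in enumerate(zip(ids, expected)):
        # Assumption: IDs read back from a file may be bytes, so decode them.
        if isinstance(got, bytes):
            got = got.decode("utf-8")
        if got != want:
            return f"position {position}: expected {want!r}, got {got!r}"
    return None
```

You could call it as find_first_mismatch(query_ids, "Q", 50_000, 5) for the query set and find_first_mismatch(reference_ids, "R", 1_000_000, 6) for the reference set, and treat any non-None result as a failed check.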