Questions about Data and Study Design

The paper describes scripted and non-scripted data, along with a link to the script, but the linked page cannot be found!
  • Which files are scripted and which are not? What is the script?
  • How many people participated in the study? Is the data provided from one person or from many?
  • What are the test conditions (were these people monitored for a day, a week, …)?

Some of these questions matter less for the data analysis itself and more for thoroughness, but it is very important to include these details in a publication. I am surprised that the information is not available.

If anyone can answer these questions, please do.

Hi rmroast01,

Thanks for your question. The script can be found here:
http://irc-sphere.ac.uk/lib/tinymce/plugins/moxiemanager/data/files/data_collection_script.pdf

Due to time constraints, all of the data is scripted - we found that the dataset would be more valuable with more annotations.

In total, 10 people participated, and there are multiple repetitions of the script per person. Some people appear in both train and test, while some are only in train and others only in test. This allocation was selected randomly.

Test conditions are also scripted.

Hi ntwomey,

Thanks for providing that additional information.

While we’re on the topic of missing scripts, would it be possible to post the Python script that creates the targets.csv files from the annotation*.csv files?

That script is referred to on the problem description page ("The target files are generated with this Python script."), but clicking through on the link returns a 404 page from GitHub.

Thanks again!

Oh yes. I forgot to put that up.

I will upload it this afternoon and update you here when it’s done.

Niall

Hi!

Thanks for the previous answers. I also have a couple of questions concerning the data and study design.

I understand that 10 people took part in the study (some in train, some in test, and others in both), and that the train and test data are both scripted.

Was the same script used for train and test?

Also, from what I understood, a certain number of people followed the test script: each participant therefore generated a "long" test dataset (approximately 30 minutes, if it is the same script as the train script). However, in the challenge we only have many "small" test datasets (10-30 seconds each).

Here are my questions to the organizers of the challenge:
  • For each participant, how did you convert the "long" test dataset (corresponding to one script) into many "small" test datasets? To be more precise, did you keep each participant's entire "long" dataset (i.e. the entire script) and split it into many "small" test datasets, or did you keep only part of the "long" dataset (i.e. only part of the script) before splitting it?
  • In either case, how did you split a long dataset (whether the entire script or part of it) into multiple small datasets? Did you divide each long dataset into N small datasets of fixed size, into a random number of small datasets of random size, or something else?
  • Is there an order to the 820 small test datasets?
  • Why don't we have an ID for the people who took part in the test?

Thank you very much for your answers!

Interesting questions. Any comments on them, @ntwomey?

Dear all,

I have been away and sick for the past week. Apologies for the delay in replying.

Was the same script used for train and test?

Yes

How did you convert the "long" test dataset (corresponding to one script) into many "small" test datasets?

We randomly sampled each long sequence into smaller subsequences. Durations are uniformly distributed between 10 and 30 seconds, and similar parameters define the gaps between consecutive intervals. We did this so that it would be harder to use the script to inform the predictions - we wanted to produce data that looks more like free living, and using short sequences with gaps achieves this.
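
As a rough illustration (this is not our actual code, and the gap range is an assumption based on the "similar parameters" comment above), the sampling looks something like this in Python:

```python
import random

# Hypothetical sketch: cut one long recording into short subsequences whose
# durations, and the gaps between them, are drawn uniformly at random.
MIN_DURATION, MAX_DURATION = 10.0, 30.0  # seconds per subsequence (stated above)
MIN_GAP, MAX_GAP = 10.0, 30.0            # assumed gap range ("similar parameters")


def split_long_sequence(total_duration):
    """Return (start, end) times, in seconds, of the sampled subsequences."""
    intervals = []
    t = random.uniform(MIN_GAP, MAX_GAP)  # random offset before the first cut
    while t + MIN_DURATION <= total_duration:
        end = min(t + random.uniform(MIN_DURATION, MAX_DURATION), total_duration)
        intervals.append((t, end))
        t = end + random.uniform(MIN_GAP, MAX_GAP)  # gap before the next cut
    return intervals


# Example: one roughly 30-minute scripted recording.
chunks = split_long_sequence(30 * 60)

# The resulting records are then shuffled, so consecutive record IDs are
# unlikely to come from the same long sequence (see below).
random.shuffle(chunks)
```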

Is there an order to the 820 small datasets?

The 820 are randomly permuted, so records 123 and 124 are unlikely to be from the same sequence.

Why don't we have an ID for the people that took part in the test?

We decided to remove it to make the reconstruction of the long sequence from the shorter sequences more difficult. In other words, we are trying to promote anonymous activity recognition as much as possible.

It is worth saying that there is some information about the identities in the meta.json files, i.e. the IDs of the annotators. If you want to try to reverse-engineer the user IDs, you may (or may not!) be able to do so. :)
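
For anyone who wants to try, a minimal (unverified) starting point would be to group the test records by which annotators appear in each meta.json. The test/ directory layout and the "annotators" field name below are illustrative assumptions, so check the actual files:

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical sketch: group test records by the set of annotator IDs found in
# each record's meta.json. The directory layout and the "annotators" key are
# assumptions - check the real files for the actual field names.
TEST_DIR = Path("test")

groups = defaultdict(list)
for meta_path in sorted(TEST_DIR.glob("*/meta.json")):
    meta = json.loads(meta_path.read_text())
    annotators = tuple(sorted(meta.get("annotators", [])))
    groups[annotators].append(meta_path.parent.name)

# Records sharing the same annotator set are candidates for having come from
# the same participant's long sequence.
for annotators, records in groups.items():
    print(annotators, "->", len(records), "records")
```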

Good luck!

Niall