While processing the transcript data, we found a few points needing clarification:
In some files, many rows in the ‘Transcript’ column lack a speaker ID. Most utterances include a speaker ID like “Teacher (##:##)” or “Student1 (##:##)” but others do not. Could you clarify whether this is intentional? If not, does a missing speaker ID imply it shares the same speaker as the previous row? Some cases don’t seem to follow that pattern, so we wanted to confirm.
For example:
(1) 210.018_ELA1_Year1_Part1, rows 10 and 11: the speaker ID is empty, and these rows appear to share the previous row's speaker.
(2) 210.041_MathIC2_Year3, rows 281, 282, 286, and 287: the speaker ID is empty, and we are not sure these rows share the previous row's speaker.
The transcript files are at the utterance level, with each row representing one utterance and including a timestamp. In contrast, the train_gt.csv file is structured by seconds, with each row representing one second. Since aligning these two files is critical, could you explain how you generated train_gt.csv based on the transcript files? Specifically, is there a rule or process for converting timestamps to per-second labels? This is important because when we convert transcript timestamps to seconds, 2%-7% of the labels don’t match.
For example, an instance marked as a “closed question” in the transcript file may not be marked in the corresponding second in train_gt.csv. Any clarification on your alignment process would help us resolve this discrepancy.
In general, the absence of a speaker ID indicates that the speaker is the same as the previous speaker. It is possible that there are some errors in the labels here.
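A minimal sketch of the carry-forward rule described above, for anyone preprocessing the transcripts. The regular expression and the function name `fill_speakers` are assumptions for illustration; the only format taken from this thread is the "Teacher (##:##)" style prefix. Rows that inherit a speaker are flagged so that doubtful cases can be reviewed by hand.

```python
import re

# Assumed prefix format: "Speaker (MM:SS)" at the start of the cell.
# This pattern is a guess based on the examples quoted in this thread.
SPEAKER_RE = re.compile(r"^\s*([A-Za-z][\w ]*?)\s*\((\d+:\d{2})\)")


def fill_speakers(rows):
    """Return one (speaker, inherited) pair per transcript row.

    inherited is True when the row had no speaker prefix and the
    previous speaker was carried forward.
    """
    result = []
    current = None  # stays None until the first explicit speaker appears
    for text in rows:
        match = SPEAKER_RE.match(text or "")
        if match:
            current = match.group(1)
            result.append((current, False))
        else:
            result.append((current, True))
    return result


# Hypothetical rows, not taken from the dataset.
rows = [
    "Teacher (00:11) Open your books.",
    "Turn to page five.",
    "Student1 (00:22) Which page?",
    "",
]
filled = fill_speakers(rows)
```

Keeping the `inherited` flag separate makes it easy to list the rows where the carried-forward speaker may be wrong.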
The ground truth for training was derived from the transcript files by determining, for each second, the set of transcript utterances that should be included. This is either all of the utterances whose timestamps fall in that second or, if there are none, the most recent utterance.
For example, given utterances at (00:11), (00:22), and (00:33), the utterance at (00:11) labels seconds 00011 to 00021, the utterance at (00:22) labels seconds 00022 to 00032, and so on.
For overlapping utterances, the second should be labeled with the union of all utterance labels. In addition, the last utterance in the transcript is assumed to last until the final second of the video / transcript.
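The rule above can be sketched as follows. This is one reading of the description, not the organizers' actual script: seconds in which utterances start get the union of their labels, later seconds carry the most recent utterance's labels, and seconds before the first utterance are left empty. The function names and every label name except "closed question" are hypothetical.

```python
def to_seconds(stamp):
    """Convert an 'MM:SS' timestamp to a whole number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def per_second_labels(utterances, total_seconds):
    """Expand utterance-level labels to one label set per second.

    utterances: list of (stamp, labels) pairs in transcript order.
    total_seconds: assumed length of the video in seconds.
    """
    starts = {}
    for stamp, labels in utterances:
        starts.setdefault(to_seconds(stamp), []).append(set(labels))

    timeline = []
    latest = set()  # labels of the most recent utterance seen so far
    for second in range(total_seconds):
        if second in starts:
            # Several utterances in the same second: take the union.
            timeline.append(set().union(*starts[second]))
            latest = starts[second][-1]
        else:
            # No new utterance: the most recent one is still active.
            timeline.append(set(latest))
    return timeline


# Hypothetical example mirroring the (00:11), (00:22), (00:33) sequence.
utterances = [
    ("00:11", {"closed_question"}),
    ("00:22", {"closed_question"}),
    ("00:22", {"open_question"}),
    ("00:33", set()),
]
timeline = per_second_labels(utterances, total_seconds=40)
```

Comparing such a timeline with train_gt.csv second by second should show whether mismatches cluster at utterance boundaries, which would point to an off-by-one in the conversion rather than a labeling error.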
I hope this helps. Please let me know if there are other questions. If there are any inconsistencies between my described alignment of the ground truth and what you observe, please let me know as soon as possible, as I agree that this is critically important! Please provide an example of the filename and the timestamp where you observe the discrepancy (though not any of the text itself).
Following up on the alignment issue between train_gt.csv and the transcript Excel files, we identified a clear example of a discrepancy.
All rows with a clip_id starting with ‘220.044_MATH1_Year1_20170206_Part1’ in train_gt.csv contain only 0s, while the corresponding Excel file 220.044_Math1_Year1_20170206_Part1.xlsx has valid values.
This appears to be caused by a case mismatch in the clip_id between the CSV and the Excel file.
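As a workaround on the participant side, clip IDs can be matched to transcript files without regard to case. A minimal sketch, assuming a clip_id is the transcript file stem followed by some suffix; the suffix `_0001` below and the helper name `match_transcript` are hypothetical.

```python
from pathlib import PurePath


def match_transcript(clip_id, transcript_files):
    """Return the transcript file whose stem prefixes clip_id, ignoring case.

    Returns None when no file matches. If several stems match, the
    longest file name is preferred as the most specific candidate.
    """
    clip_key = clip_id.casefold()
    candidates = [
        name
        for name in transcript_files
        if clip_key.startswith(PurePath(name).stem.casefold())
    ]
    return max(candidates, key=len) if candidates else None


files = ["220.044_Math1_Year1_20170206_Part1.xlsx"]
# The "_0001" suffix is made up for illustration.
clip_id = "220.044_MATH1_Year1_20170206_Part1_0001"

matched = match_transcript(clip_id, files)
# A case-sensitive comparison misses the same file.
exact_hit = clip_id.startswith(PurePath(files[0]).stem)
```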
Thank you for your clear response to our previous request. We hope this information helps improve the dataset.
Thanks @aiwei! You are correct that the training labels are incorrectly zeroed for the audio for this clip. I am in the process of making corrections and will post the updated training ground truth on Globus and make an announcement here and via the platform.
As an additional note, it appears to me at this moment that only the training ground truth was affected by this issue, so there are no changes to the Phase 1 or Phase 2 labels.
An updated training label file, train_gt_corrected20250731.csv, has been uploaded to Globus. Please see the announcement and the problem description page for further details.
Thank you again for bringing this error to my attention! Please reach out with any other discrepancies you notice, and I’ll do my best to address them.