I would like to know, if possible, where the plasmid DNA sequences are from; 63017 training + 18816 testing is a very large number of engineered DNA sequences. Are they actual sequences from iGEM? If this cannot be shared now, can it be made known at the end of the challenge?
A paper was released a couple of weeks back titled “Attribution of genetic engineering: A practical and accurate machine-learning toolkit for biosecurity” (https://www.biorxiv.org/content/10.1101/2020.08.22.262576v1.full.pdf). It doesn’t acknowledge the competition, but given that it’s affiliated with AltLabs, the feature set is the same, and the number of classes matches ours, it seems pretty likely that the dataset discussed is the same one we were provided. I think the paper answers your question; see the section “Processing the Addgene Dataset”.