Question on Data Sources

dwinterscsu · September 2, 2020, 6:46pm

I would like to know, if possible, where the plasmid DNA sequences are from; 63017 training + 18816 testing is a very large number of engineered DNA sequences. Are they actual sequences from iGEM? If this cannot be shared now, can this be made known at the end of the challenge?

KieranLitschel · September 6, 2020, 11:31pm

A paper was released a couple of weeks back titled “Attribution of genetic engineering: A practical and accurate machine-learning toolkit for biosecurity” (https://www.biorxiv.org/content/10.1101/2020.08.22.262576v1.full.pdf). It doesn’t acknowledge the competition, but given that it’s affiliated with AltLabs, the feature set is the same, and we have the same number of classes, it seems pretty likely that the dataset discussed is the same one we were provided. I think the paper answers your question; see the section “Processing the Addgene Dataset” in the paper.
