Baseline model
Assuming this disease is seasonal, so that the number of cases roughly repeats itself every year, I did a simple extrapolation by fitting the following formula to each of the two cities separately:
# of cases = a * sin(pi/26 * (# of weeks since Jan 1, 1990) + b) + c
The result has an MAE (mean absolute error) of 25.9567.
(The values of a, b, and c are left for you to figure out.)
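One possible way to find a, b, and c is sketched below. It is not necessarily how the values behind the score above were obtained: it uses ordinary least squares (which minimizes squared error, not MAE) and relies on the identity a sin(wt + b) = (a cos b) sin(wt) + (a sin b) cos(wt), which makes the fit linear in three unknowns. The function names are made up for illustration.

```python
import math

OMEGA = math.pi / 26  # angular frequency for a 52-week period


def solve_3x3(m, v):
    """Solve the 3x3 system m x = v by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [rhs] for row, rhs in zip(m, v)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_sine(weeks, cases):
    """Least-squares fit of a*sin(OMEGA*t + b) + c. Returns (a, b, c)."""
    # Design matrix columns: sin(wt), cos(wt), 1
    rows = [(math.sin(OMEGA * t), math.cos(OMEGA * t), 1.0) for t in weeks]
    # Normal equations: (X^T X) p = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, cases)) for i in range(3)]
    p, q, c = solve_3x3(xtx, xty)
    # p = a*cos(b), q = a*sin(b)
    return math.hypot(p, q), math.atan2(q, p), c


def predict(t, a, b, c):
    """Evaluate the baseline formula at week t."""
    return a * math.sin(OMEGA * t + b) + c


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all weeks."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)
```

Call fit_sine once per city, with weeks as the number of weeks since Jan 1, 1990 and cases as the matching case counts, then pass the predictions to mean_absolute_error to see how close you get.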
Reasoning behind the equation
a - scaling factor (amplitude) that stretches the sine's range of [-1, 1] to the range of the case counts.
sin - the first continuous periodic function that popped into my head.
pi/26 - the period of sin is 2 pi, so a coefficient of pi/26 makes the period 52 weeks; for simplicity, one year is taken to be 52 weeks.
b - phase offset to align the peak of the sine with the peak of the data.
c - vertical offset to keep the function positive.
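The roles of the parameters can be checked numerically. The sketch below uses made-up values for a, b, and c (not the fitted ones) and assumes a > 0: it locates the week where the sine peaks, and shows that the curve stays between c - a and c + a, so it is positive only when c is larger than a.

```python
import math


def model(t, a, b, c):
    """The baseline formula, with t in weeks since Jan 1, 1990."""
    return a * math.sin(math.pi / 26 * t + b) + c


def peak_week(b):
    """First week t >= 0 where the sine peaks: pi/26 * t + b = pi/2 (mod 2*pi)."""
    return (26 / math.pi * (math.pi / 2 - b)) % 52


def value_range(a, c):
    """Smallest and largest values the model can take."""
    return c - abs(a), c + abs(a)


# Hypothetical parameter values, for illustration only
a, b, c = 20.0, 0.7, 30.0

# Shifting t by 52 weeks changes the sine's argument by exactly 2*pi
repeats_yearly = all(
    math.isclose(model(t, a, b, c), model(t + 52, a, b, c)) for t in range(52)
)
low, high = value_range(a, c)
print(repeats_yearly, peak_week(b), low, high)
```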
Reasoning for a good baseline
You should be able to build this model only with dengue_labels_train.csv and generate the submission result only with submission_format.csv.
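A rough sketch of the submission step follows. It assumes submission_format.csv has the columns city, year, weekofyear, and total_cases, that the city codes are sj and iq, and that every year counts as exactly 52 weeks when converting to weeks since Jan 1, 1990. The function names and the rounding to non-negative integers are my own choices, so check them against the actual files.

```python
import csv
import math


def weeks_since_1990(year, weekofyear):
    # Assumption: each year is treated as exactly 52 weeks, matching the model's period
    return (year - 1990) * 52 + (weekofyear - 1)


def fill_submission(src, dst, params):
    """Copy rows from src to dst, replacing total_cases with model predictions.

    src and dst are open text files; params maps a city code to its (a, b, c).
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        a, b, c = params[row["city"]]
        t = weeks_since_1990(int(row["year"]), int(row["weekofyear"]))
        pred = a * math.sin(math.pi / 26 * t + b) + c
        # Case counts are whole numbers and cannot be negative
        row["total_cases"] = max(0, round(pred))
        writer.writerow(row)


# Usage (sj_params and iq_params are placeholders for your own fitted tuples):
# with open("submission_format.csv", newline="") as src, \
#         open("submission.csv", "w", newline="") as dst:
#     fill_submission(src, dst, {"sj": sj_params, "iq": iq_params})
```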
If you use any of the features (stored in the other files), your model should get a better result than this one, since the features contain more information and are therefore supposed to improve accuracy.