I saw a separate post regarding the incorrect week 53s in the label data, but I’m confused by the week 53s that exist in the features data. There are 5 instances of these, and each of these rows contains absolutely no features data. At first thought, you might think to drop these rows, but then your features jump from the week starting Dec 23rd straight to the week starting Jan 8th (i.e. no data for the week starting Jan 1st of these years), and you obviously end up with 5 fewer rows of feature data than outcome data.
I actually took a look at the NOAA site (where this data originated) to try to understand this and I noticed that week 53 is indeed incorrect, and that these rows should be dropped. It then looks like the data provided for the week starting Jan 8th in the DrivenData info is actually for Jan 1st (according to the NOAA data), which means that all the dates are wrong in the DrivenData features file.
Thanks for your quick reply bull - really appreciate it!
Just one question then… This means the features data for the week starting Jan 1 in these cases is empty, which isn’t an insurmountable problem. However, why is the data then out of sync with the original data on the NOAA web site? To provide a specific example…
In the DrivenData dataset, the ndvi figures for San Juan for the weeks commencing Jan 1st and Jan 8th 1993 are: null for Jan 1st, and 0.02835, 0.04366667, 0.07865714, 0.04645714 for Jan 8th. In the original data on the NOAA site, those same figures are given for the week commencing Jan 1st, hence my confusion.
So I now understand that the weekofyear field is calculated based on the ISO standard (which is essentially driven by the number of Thursdays in a year), and hence a year has either 52 or 53 weeks. If this is correct:
a) does anyone know why 1995, 2000 and 2006 have 51 weeks (it looks like these years should all have 52 weeks?)
b) is the week_start_date field therefore incorrect? E.g. the 1990 W52 start date is 24/12/90, and the start date of the following week is 1/01/91. Should this be 31/12/1990 instead, or do we assume that all statistics for the week starting Dec 24 1990 have actually been captured or scaled to represent an 8-day week, as a result of every new year starting on Jan 1 regardless of date?
c) why is there no feature data for weeks 53 of 1993, 1998, 2004 and 2010? Note that we do appear to have dengue case counts for these weeks.
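For anyone who wants to check questions a) and b) against the ISO rules themselves, here is a minimal sketch using only Python's standard library. The helper names are my own and are not part of the DrivenData files, and it assumes weekofyear really is meant to follow ISO 8601, which the thread has not confirmed.

```python
from datetime import date, timedelta

def iso_year_week(d):
    # isocalendar() returns (ISO year, ISO week number, ISO weekday).
    iso = d.isocalendar()
    return iso[0], iso[1]

def iso_weeks_in_year(year):
    # December 28 always falls in the last ISO week of its year,
    # so its week number is the number of ISO weeks in that year.
    return date(year, 12, 28).isocalendar()[1]

def iso_week_monday(d):
    # weekday() is 0 for Monday, so this steps back to the Monday
    # that starts the ISO week containing d.
    return d - timedelta(days=d.weekday())

# a) How many ISO weeks do the years asked about have?
for year in (1995, 2000, 2006):
    print(year, iso_weeks_in_year(year))

# b) The two week_start_date values quoted in the question.
w52_start = date(1990, 12, 24)
next_start = date(1991, 1, 1)
print(iso_week_monday(next_start))    # Monday of the ISO week containing 1 Jan 1991
print((next_start - w52_start).days)  # gap between the two quoted start dates

# Which ISO week does 1 Jan 1993 belong to?
print(iso_year_week(date(1993, 1, 1)))
```

Under ISO rules, 1 Jan 1993 belongs to week 53 of ISO year 1992, so a row dated Jan 1 can legitimately carry week number 53 even though it sits at the start of a calendar year. That may be part of what is going on here, but it does not by itself explain the empty feature rows.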
I know that this is a very late reply, but I’m adding my observations in case they help.
If you look at the data grouped by city and year, then except for the boundary years (i.e. 2000 and 2013 for Iquitos and 1990 and 2013 for San Juan), the count and the distinct count of weekofyear are both 52. As you have rightly pointed out, there are cases where weekofyear is 52 or 53 for weeks in January.
Even in the years that have a weekofyear of 53, the count and the distinct count of weekofyear are always 52. So I plan to engineer a new column, weekofyear_new: for all years where weekofyear is 52 or 53 in January, increment the original weekofyear, and wherever weekofyear_new then comes out greater than 52, set it to 1.
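A rough sketch of that renumbering in plain Python, as I read the description. The function name renumber_weeks is hypothetical, and it assumes the weekofyear values for one city and one year are passed in date order, so that the first value is the early-January row.

```python
def renumber_weeks(weeks):
    """Return weekofyear_new for one city-year.

    weeks: weekofyear values for a single city and year, in date order.
    """
    # Years whose first (January) row is not labelled 52 or 53 are left alone.
    if not weeks or weeks[0] not in (52, 53):
        return list(weeks)
    # Increment every week, then wrap anything above 52 back to 1.
    shifted = [w + 1 for w in weeks]
    return [1 if w > 52 else w for w in shifted]

# A year whose first row is labelled 53, followed by weeks 1 to 51.
example = [53] + list(range(1, 52))
print(renumber_weeks(example)[:3], renumber_weeks(example)[-1])
```

With 52 rows per year, this turns a sequence like 53, 1, 2, ..., 51 into 1, 2, 3, ..., 52, so the new column runs in date order within each year. The same rule could be applied per (city, year) group in whatever dataframe library you are using.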