
This dataset presumes a solution (Heart Disease)

This isn’t really a very good study, because it presumes that heart disease has a cause other than one that is autocorrelated. For instance, if one were to take a specific set of generations and record whether heart disease was present (1) or not present (0) in each, that would be much more likely to provide a prior indicator of whether someone is likely to have heart disease. It would also allow a data scientist to establish that there is a prior pattern far more indicative of the cause than any of the independent variables.

Accordingly, this isn’t likely to result in any meaningful solution because heart disease is something that comes with a prior probability. Simply look at the autocorrelation of heart disease across organisms with much faster procreation rates (like insects).

People like using y = f(x) assumptions (cause and effect) because they are convenient, but in terms of predicting heart disease, it could simply be that y[n+1] = f(y[n]), i.e. the outcome in one generation depends on the outcome in the previous one. That’s much more likely than heart disease being solved by 90 data points across 12 preselected independent variables.
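To make the autocorrelation idea concrete (an outcome that depends on the previous outcome rather than on independent variables), here is a minimal Python sketch. It runs on simulated 0/1 sequences only, not on the competition data. The function names and the 0.9 and 0.5 probabilities are made up for illustration.

```python
import random

def simulate_chain(n, p_stay, seed=0):
    """Simulated 0/1 sequence: each value repeats the previous one with probability p_stay."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1)]
    for _ in range(n - 1):
        prev = y[-1]
        y.append(prev if rng.random() < p_stay else 1 - prev)
    return y

def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation: covariance of y[n] and y[n+1] over the variance of y."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y)
    if var == 0:
        return 0.0  # a constant sequence has no defined correlation; report 0
    cov = sum((y[i] - mean) * (y[i + 1] - mean) for i in range(n - 1))
    return cov / var

# Hypothetical settings: 0.9 means strong dependence on the previous value,
# 0.5 means the previous value carries no information.
dependent = simulate_chain(5000, p_stay=0.9)
independent = simulate_chain(5000, p_stay=0.5)

print("lag-1 autocorrelation, dependent:  ", lag1_autocorrelation(dependent))
print("lag-1 autocorrelation, independent:", lag1_autocorrelation(independent))
```

If the first number comes out well above zero and the second near zero, the previous outcome alone is a useful predictor in the first sequence. Checking this on real data would need outcomes recorded across generations, which the competition dataset does not appear to include.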

While the data collection could perhaps have been done better (it’s from 1988), it is clear that there is predictive signal within the dataset, given the performance achieved by ML algorithms (be it through autocorrelation or confounding factors).

Hi, thank you for your responses. How do you know the data is from 1988? Does anyone know how to get hold of the original data? (With 75+ variables)

The DrivenData description of the problem notes that they are using the UC Irvine Heart Disease dataset:

https://archive.ics.uci.edu/ml/datasets/Heart+Disease