Interpreting amount_tsh

I notice that some of the values in amount_tsh are 0, yet the well is still functional. I find it hard to understand how a well can be functional when there is no water available to the well. Can someone explain what is going on here? Or is this a clerical error?

@lhz1029 You are correct that certain values here are suspect. As we’re sure you know, this is one of the joys of working with Real World Data™!

Part of doing well in this competition is coming up with a strategy for managing missing and erroneous values in the data.

Good luck!

It says that amount_tsh is the total static head and that it means the amount of water available to a waterpoint. One definition of total static head is from Wartsila which describes it as: “The vertical height of a stationary column of liquid produced by a pump, measured from the suction level.”

Could a pump produce water which it stores while no longer working? I think so, as it just means it won’t be able to pump new water but still has previously pumped water in its storage.

The amount_tsh feature (total static head) ranges from 0 to 350,000, and about 70% of its values are zero. Not a very informative feature.
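For anyone who wants to check figures like these on their own copy of the data, here is a minimal sketch. The list below is made-up toy data, not the competition file, and `summarize` is just a hypothetical helper name.

```python
def summarize(values):
    """Return (minimum, maximum, share of exact zeros) for a list of numbers."""
    zero_share = sum(1 for v in values if v == 0) / len(values)
    return min(values), max(values), zero_share

# Toy values invented for illustration; 7 of the 10 entries are zero.
amount_tsh = [0, 0, 0, 0, 0, 0, 0, 20, 50, 350000]

lowest, highest, zero_share = summarize(amount_tsh)
print(lowest, highest, zero_share)
```

A high share of zeros on its own does not tell you whether they are real measurements or placeholders for missing values, so it is probably worth looking at them next to the target label before deciding.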

Yes, but given what it means (‘Total Static Head’), it may certainly be significant.

what-is-head

But there are, for example, 240 examples in which the total static head exceeds 8,848 meters (the height of Mt. Everest), and one at 350,000. On the other hand, the third quartile is 20 meters. Are those values legitimate, or are they outliers? How can one tell?
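One possible way to approach the "how can one tell" question is to combine a physical sanity check (nothing should exceed the height of Everest) with a statistical rule of thumb. The sketch below uses the common 1.5 × IQR fence on the non-zero values. That rule is my assumption, not something the competition prescribes, and the numbers are toy data.

```python
from statistics import quantiles

EVEREST_M = 8848  # physical sanity bound mentioned above

def iqr_outliers(values, k=1.5):
    """Return non-zero values above Q3 + k * IQR.

    Only the upper fence is checked, since the values are non-negative
    and the concern here is implausibly large readings.
    """
    nonzero = sorted(v for v in values if v > 0)
    q1, _, q3 = quantiles(nonzero, n=4)  # three quartile cut points
    upper_fence = q3 + k * (q3 - q1)
    return [v for v in nonzero if v > upper_fence]

# Toy values invented for illustration, not the competition data.
amount_tsh = [0, 0, 5, 10, 20, 25, 30, 50, 100, 350000]

too_high = [v for v in amount_tsh if v > EVEREST_M]
flagged = iqr_outliers(amount_tsh)
print(too_high, flagged)
```

Neither check proves a value is wrong. They only give you a shortlist to inspect, and the IQR fence can be far too aggressive on a heavily skewed column like this one.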

Not sure if the zeros indicate this or are just missing values in this data set, but open tanks and closed circulation water systems will have a total static head of zero.

I replaced the “0” values with the mean and removed the outliers (I treated the highest values as outliers), but I’m not sure whether this is workable. Maybe I should just keep them as “0”?

I am still trying to figure out what to do about this too. But I think it would be best to remove the outliers first, if you decide to remove them, before computing the mean, because they greatly influence the mean if included in its calculation. I am also thinking of replacing both the zeros and the outliers with that mean.
What do you think about that?
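To make the order of operations concrete, here is a rough sketch of that idea: drop the outliers first, compute the mean from what is left, then fill both the zeros and the outliers with it. The cutoff of 1000 and the helper name are hypothetical choices for illustration, and the values are toy data.

```python
from statistics import mean

def impute_with_trimmed_mean(values, upper):
    """Replace zeros and values above `upper` with the mean of the rest.

    Raises statistics.StatisticsError if no value lies in (0, upper].
    """
    kept = [v for v in values if 0 < v <= upper]
    fill = mean(kept)
    cleaned = [v if 0 < v <= upper else fill for v in values]
    return cleaned, fill

# Toy values invented for illustration, not the competition data.
amount_tsh = [0, 0, 10, 20, 30, 350000]

naive_mean = mean(amount_tsh)  # dominated by the single huge value
cleaned, fill = impute_with_trimmed_mean(amount_tsh, upper=1000)
print(naive_mean, fill, cleaned)
```

The gap between `naive_mean` and `fill` shows why the outliers should come out before the mean is taken. Whether mean imputation actually helps a model here is a separate question that only validation scores can answer.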