Question about PERWT column

The PERWT column is described as:

PERWT (float) — Indicates how many persons in the U.S. population are represented by a given person in an IPUMS sample.

I have a couple of concerns about this:

  1. It does not seem like a property of an individual (and so perhaps not sensitive information in its own right)
  2. I think part of the reason for this column is to compress the dataset, so that redundant rows can be excluded, saving space. If each row actually represented PERWT distinct people, then the actual dataset size would be much larger, potentially changing the problem and best approaches.

I’m wondering whether this column deserves special treatment, or maybe it should simply be removed from the dataset. Any clarification would be very helpful.

So, the PERWT column is a little counter-intuitive, and that short description doesn’t quite do it justice. Here’s basically how it works:

The American Community Survey should, ideally, be a uniform random 2.5% sample of the US population. The PUMS data they release publicly is then a 40% subsample of that (i.e., a 1% sample of the US population).

The tricky bit is that not everyone who’s handed a survey actually completes and returns it, and not everyone is reachable to begin with. Whether a survey was successfully gathered isn’t strictly random either; it’s strongly correlated with demographics and geography. To correct for under-sampling some segments of the population (and over-sampling others), there is the PERWT variable (and, for households, the HHWT variable). PERWT essentially gives the ratio between how many people in a certain demographic/geographic slice should have been sampled (based on other data on the US population) and how many actually were successfully sampled. When this data set is used by social scientists or policy makers to test hypotheses about people in the US, they can compute statistics with individual records weighted by PERWT to ensure their conclusions aren’t biased by variations in sampling rates across the population. So there aren’t exact duplicate rows in the collected survey data, but to make up for under-sampling you can, in effect, duplicate the rows that were collected.
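To make the weighting concrete, here is a minimal sketch of a PERWT-weighted statistic. The income and weight values are made up for illustration; only the idea of weighting records by PERWT comes from the description above.

```python
import numpy as np

# Hypothetical toy sample: three survey records with incomes and person
# weights. PERWT says how many people in the full population each record
# stands for (values here are invented, not real ACS data).
income = np.array([30_000.0, 55_000.0, 90_000.0])
perwt = np.array([120.0, 80.0, 40.0])

# Unweighted mean treats every collected record equally.
unweighted = income.mean()

# PERWT-weighted mean re-weights records to correct for uneven sampling
# rates, as if each row were duplicated PERWT times.
weighted = np.average(income, weights=perwt)
```

Here the high-income record was (hypothetically) over-sampled relative to its weight, so the weighted mean comes out lower than the unweighted one.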

In effect, PERWT is a record-level feature that depends on demographic and geographic information (and so it is correlated with other features, much like income or education). We’re asking you to address it because it’s a small example of the complexities of working in a real-world data environment. There are many ways the weighting problem could be approached for your privatized synthetic data, at varying levels of complexity and fidelity, but for the challenge we’re only asking you to take the simplest approach: just treat PERWT like you would any other feature in the data, and attempt to preserve its distributional correlations with other features within each map segment.
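One way to read "preserve its distributional correlations" is to measure the joint distribution of binned PERWT with another feature inside a segment, and aim for the synthetic data to match it. The sketch below uses random placeholder data, an invented `educ` column, and illustrative bin edges (not the challenge's official BINS).

```python
import numpy as np

# Hypothetical sketch: joint distribution of binned PERWT with a made-up
# categorical feature, for the records in one map segment.
rng = np.random.default_rng(0)
perwt = rng.uniform(10, 300, size=1000)   # placeholder weights
educ = rng.integers(0, 4, size=1000)      # placeholder 4-category feature

perwt_edges = np.array([0, 50, 100, 200, 400])  # illustrative bins
educ_edges = np.arange(5)                       # edges 0..4 -> 4 categories

# Normalized joint histogram: the target a synthesizer could try to match.
joint, _, _ = np.histogram2d(perwt, educ, bins=[perwt_edges, educ_edges])
joint = joint / joint.sum()
```

Comparing this table between the real and synthetic records of each segment (e.g., by total variation distance) gives one simple fidelity check for PERWT.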

Thanks for the clarification!

One followup: Is it guaranteed that “BINS” will be the same in the final scoring (in scripts/metric.py)?

> Is it guaranteed that “BINS” will be the same in the final scoring (in scripts/metric.py)?

We have not made that guarantee, no. That being said, they may very well be.

Okay that’s fine but will the BINS used for final scoring be provided to us at some point? Since it’s part of how we are evaluated, I think it’s important to know what they are.

Also, I noticed a likely mistake with BINS for DEPARTS and ARRIVES. The range of values seems to be 0-2400 but the bins only cover 0-24

> Will the BINS used for final scoring be provided to us at some point?

For all practical purposes, you can assume they will be exactly the same or substantially the same. I’m just confirming that has not been guaranteed.

> I noticed a likely mistake with BINS for DEPARTS and ARRIVES. The range of values seems to be 0-2400 but the bins only cover 0-24

Good catch. This had already been updated on the backend, but thanks for pointing out that metric.py still only had bins covering 0-24. A fixed version has been posted on the data page.

Update: Thanks again to @rmckenna - the bin widths have been finalized to 15-minute intervals after consultation with SMEs.