This was given in response to a post below, but I’m copying it to the top level for ease of reference. In the tips/tricks for our taxi problem this sprint, we mention that queries that count taxis have far lower sensitivity than queries that count trips. To help illustrate that idea, here’s a sampling of queries that have sensitivity 1 on the Chicago Taxi Data:
- “How many taxis dropped off in community area 17?”
- “How many taxis did at least 15 trips on Friday night?”
- “How many taxis did fewer than 5 trips every morning and afternoon?”
- “How many taxis picked up between 5 and 10 trips in community area 23?”
- “How many taxis picked up in areas 10, 12 or 16 and dropped off in community area 50?”
Hi Christine, I just looked at the problem description. To better understand the problem: what you meant is “How many distinct taxi ids…?” Is that correct?
Yep, that is correct. As we mentioned in another thread, you can intuitively think of taxi_id as the ‘name’ for the individuals in this data set.
Let’s say I’m doing a survey on the computing devices people own. With a survey, when I collect Alice’s data I might label all of her responses with her name. I want to query “the number of individuals in my survey who have at least one computing device”. I can search through the data and find all of the devices labeled with Alice’s name, and let’s say it turns out that Alice has two laptops and one old desktop computer. Alice will contribute exactly +1 to that counting query, because she is an individual who has at least one computing device… even though she has three devices in total. The query is counting individuals, rather than devices. Since one individual can contribute at most 1 to that query, its sensitivity is 1.
If instead I query the total number of laptops in the data set, then Alice will be contributing 2 to the count, and the overall sensitivity of the query would be max_laptops_per_individual. In that case, we’re counting devices rather than individuals.
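To make the Alice example concrete, here is a small sketch. The survey rows, the second respondent "Bob", and the helper names are all made up for illustration; the point is only to compare how much each query changes when Alice's rows are removed.

```python
# Hypothetical survey rows: (owner_name, device_type). Illustration only.
survey = [
    ("Alice", "laptop"),
    ("Alice", "laptop"),
    ("Alice", "desktop"),
    ("Bob", "laptop"),
]

def count_individuals(rows):
    """Number of distinct people who own at least one device."""
    return len({name for name, _ in rows})

def count_laptops(rows):
    """Total number of laptop rows, regardless of owner."""
    return sum(1 for _, device in rows if device == "laptop")

def without_person(rows, name):
    """The same survey with one person's rows removed."""
    return [row for row in rows if row[0] != name]

no_alice = without_person(survey, "Alice")

# Counting individuals: Alice moves the answer by 1.
individuals_change = count_individuals(survey) - count_individuals(no_alice)

# Counting laptops: Alice moves the answer by her number of laptops (2 here).
laptops_change = count_laptops(survey) - count_laptops(no_alice)

print(individuals_change, laptops_change)
```

The first difference can never exceed 1 no matter how many devices Alice owns, while the second grows with her laptop count, up to max_laptops_per_individual.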
If I want to estimate the total number of laptops in the data set using queries over individuals, I can potentially use a histogram:
L1 = “How many individuals have exactly 1 laptop”,
L2 = “How many individuals have exactly 2 laptops”,
L3 = “How many individuals have 3 or 4 laptops”, etc…
And then take something like: est_total_laptop = L1 + 2 x L2 + 3.2 x L3, where 3.2 is a guess at the average number of laptops among the people in the 3-4 bin.
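Here is a rough sketch of that histogram estimate. The per-person laptop counts are invented, the function names are hypothetical, and the 3.2 weight is just the guess from the formula above, not a recommended value. No noise is added here; this only shows how the bin counts combine.

```python
# Hypothetical data: number of laptops owned by each surveyed individual.
laptops_per_person = {"Alice": 2, "Bob": 1, "Carol": 4, "Dave": 3, "Erin": 0}

def histogram_counts(per_person):
    """Answer the three bin queries; each one counts individuals."""
    counts = list(per_person.values())
    l1 = sum(1 for n in counts if n == 1)        # exactly 1 laptop
    l2 = sum(1 for n in counts if n == 2)        # exactly 2 laptops
    l3 = sum(1 for n in counts if 3 <= n <= 4)   # 3 or 4 laptops
    return l1, l2, l3

def estimate_total_laptops(l1, l2, l3, bin_3_4_weight=3.2):
    """Weighted sum of bin counts; the 3-4 bin uses an assumed average."""
    return l1 + 2 * l2 + bin_3_4_weight * l3

l1, l2, l3 = histogram_counts(laptops_per_person)
est_total_laptop = estimate_total_laptops(l1, l2, l3)
true_total = sum(laptops_per_person.values())

print("bins:", (l1, l2, l3))
print("estimate:", est_total_laptop, "true total:", true_total)
```

The estimate will not match the true total exactly because the 3-4 bin is collapsed to a single weight, but every query used to build it counts individuals rather than devices.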
The taxi data is a bit weird to think about because one taxi driver can have up to 200 trips (whereas we might cut off the maximum number of laptops per person at 10). But mechanically, it’s the same thing. The taxi_id is what you use to find all of the trips that belong to one individual taxi driver. If you do a query counting the number of taxi drivers (i.e., distinct taxi_id values) that satisfy some property, that query will have a sensitivity of 1 (as in the examples above). If you do a simple query counting the number of taxi trips that satisfy some property, the query will have a sensitivity of 200, because a single driver can contribute up to 200 trips to the count.
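The same comparison on taxi-shaped data might look like the sketch below. The trip rows are invented, each row is reduced to a (taxi_id, dropoff community area) pair as a simplifying assumption, and the helper names are hypothetical. One taxi is given the full 200 trips to show the worst case.

```python
from collections import Counter

# Hypothetical trips: (taxi_id, dropoff_community_area). Illustration only.
trips = [("taxi_A", 17)] * 200 + [("taxi_B", 17), ("taxi_B", 23), ("taxi_C", 23)]

def count_trips_dropping_off(rows, area):
    """Trip-level query: number of trips that dropped off in `area`."""
    return sum(1 for _, a in rows if a == area)

def count_taxis_dropping_off(rows, area):
    """Taxi-level query: number of distinct taxi_id values with a drop-off in `area`."""
    return len({taxi for taxi, a in rows if a == area})

def count_taxis_with_at_least(rows, min_trips):
    """Taxi-level query: number of distinct taxis with at least `min_trips` trips."""
    trips_per_taxi = Counter(taxi for taxi, _ in rows)
    return sum(1 for n in trips_per_taxi.values() if n >= min_trips)

def drop_taxi(rows, taxi_id):
    """The same data with every trip of one taxi removed."""
    return [row for row in rows if row[0] != taxi_id]

without_a = drop_taxi(trips, "taxi_A")

# Removing one taxi shifts the trip count by all of its trips...
trip_change = count_trips_dropping_off(trips, 17) - count_trips_dropping_off(without_a, 17)
# ...but shifts each taxi-level count by at most 1.
taxi_change = count_taxis_dropping_off(trips, 17) - count_taxis_dropping_off(without_a, 17)
busy_change = count_taxis_with_at_least(trips, 15) - count_taxis_with_at_least(without_a, 15)

print(trip_change, taxi_change, busy_change)
```

Grouping by taxi_id before counting is what keeps the taxi-level queries at sensitivity 1, even when the property itself (such as "at least 15 trips") depends on many trips.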