Lars Vilhuber
2016-05-18
or rather…
Big data is:
Representative big data is:
| | Surveys | Administrative | Organic Data |
|---|---|---|---|
| Aim | Informational | Administer programs | … something else (Twitter?) |
| Who | Trained professionals designing, fielding, analyzing surveys | Trained professionals running a bureaucracy, collecting necessary data | Trained professionals optimizing revenue |
| Core | Well established science, defining population, frame | Definition of population, frame critical, but ex-post | Population and frame often unclear |
| Stats | Primary purpose is to create statistics | Statistics about populations is secondary purpose | Public statistics at best incidental, possibly self-serving |
“Respondent load should always be considered when planning a statistical collection and there should be policies and practices in place to manage relationships with respondents. The aim should always be to keep reporting load to the minimum and to maintain the high quality of collections.”
Incentives
Survey
Administrative
Eliminating discrepancies:
“In order to reduce the number of questions […] Statistics Canada will use your income data […]” *
What about using organic data?
I want some!
You can.
A Metadata Question
8 February 2017
As of 2016-05-17, 11:18
Google Consumer Survey
Live download from https://www.googleapis.com/consumersurveys/v2/surveys/qht3vffpx6peusl5jopzboksqm/results
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Login Required",
        "locationType": "header",
        "location": "Authorization"
      }
    ],
    "code": 401,
    "message": "Login Required"
  }
}
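The response above is what a client without credentials gets back instead of survey results. A minimal sketch of how a replication script might detect this case rather than treat the payload as data. It assumes the body always has the shape shown (an `error` object holding a list of `errors`). The function name `describe_api_error` and the reading of `"reason": "required"` as "a credential is missing" are illustrative assumptions, not part of the Google API.

```python
import json

# The error body exactly as shown in the live download above.
RESPONSE_BODY = """
{
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "required",
        "message": "Login Required",
        "locationType": "header",
        "location": "Authorization"
      }
    ],
    "code": 401,
    "message": "Login Required"
  }
}
"""


def describe_api_error(body):
    """Return a small summary dict if the body is an error payload, else None.

    Hypothetical helper: it only understands the payload shape shown above.
    """
    payload = json.loads(body)
    error = payload.get("error")
    if error is None:
        return None  # no error object: treat the body as actual results
    # Assumption: reason == "required" means the named item was not supplied.
    missing = [
        entry["location"]
        for entry in error.get("errors", [])
        if entry.get("reason") == "required"
    ]
    return {
        "code": error.get("code"),
        "message": error.get("message"),
        "missing": missing,
    }


summary = describe_api_error(RESPONSE_BODY)
print(summary)
```

A script that checks for this summary before parsing results fails loudly when access to the source has changed, which is the situation the slide illustrates.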
Two challenges:
“
● A billion hours ago, modern homo sapiens emerged.
● A billion minutes ago, Christianity began.
● A billion seconds ago, the IBM PC was released.
● A billion Google searches ago … was this morning.
”
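The orders of magnitude in the quote can be checked with a short unit conversion. This is a sketch only: it assumes a year of 365.25 days, and the helper name `years_ago` is hypothetical.

```python
# Assumption: one year = 365.25 days (Julian year).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
BILLION = 1_000_000_000


def years_ago(count, seconds_per_unit):
    """Convert `count` time units of `seconds_per_unit` seconds into years."""
    return count * seconds_per_unit / SECONDS_PER_YEAR


billion_seconds_in_years = years_ago(BILLION, 1)
billion_minutes_in_years = years_ago(BILLION, 60)
billion_hours_in_years = years_ago(BILLION, 60 * 60)

print(f"A billion seconds: {billion_seconds_in_years:,.1f} years")
print(f"A billion minutes: {billion_minutes_in_years:,.1f} years")
print(f"A billion hours:   {billion_hours_in_years:,.1f} years")
```

The results come to roughly 32 years, 1,900 years, and 114,000 years respectively, each step a factor of 60 apart.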
See later
Based on quarterly wage reports from 50 states and DC:
Clearly explained in the article:
600 million collisions every second = 1 PB/s
1 in a million of interest
HARDWARE pre-selects (0.01%) - throws away 99.99% of data! Forever!
SOFTWARE selects (1%) using 15,000 processors
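The figures above fit together as a two-stage filter, and the arithmetic can be sketched as follows. Assumptions that are not in the slides: 1 PB is taken as 10^15 bytes, and the stored data volume is taken to shrink in proportion to the number of collisions kept (uniform event size). The names `trigger_cascade`, `HARDWARE_KEEP`, and `SOFTWARE_KEEP` are illustrative.

```python
from fractions import Fraction

COLLISIONS_PER_SECOND = 600_000_000       # from the slide
RAW_BYTES_PER_SECOND = 10**15             # 1 PB/s, assuming decimal petabytes
HARDWARE_KEEP = Fraction(1, 10_000)       # hardware pre-selects 0.01%
SOFTWARE_KEEP = Fraction(1, 100)          # software selects 1% of what remains


def trigger_cascade(rate, keep_fractions):
    """Apply each selection stage in turn; return the rate after every stage."""
    rates = []
    for keep in keep_fractions:
        rate = rate * keep
        rates.append(rate)
    return rates


after_hardware, after_software = trigger_cascade(
    COLLISIONS_PER_SECOND, [HARDWARE_KEEP, SOFTWARE_KEEP]
)
overall_keep = HARDWARE_KEEP * SOFTWARE_KEEP

# Assumption: bytes kept scale with collisions kept.
stored_bytes_per_second = RAW_BYTES_PER_SECOND * overall_keep

print(f"After hardware: {int(after_hardware):,} collisions/s")
print(f"After software: {int(after_software):,} collisions/s")
print(f"Overall fraction kept: 1 in {overall_keep.denominator:,}")
print(f"Stored: {int(stored_bytes_per_second):,} bytes/s")
```

Multiplying the two stages gives 0.01% × 1% = 1 in a million, which matches the "1 in a million of interest" figure. Under the proportionality assumption, 1 PB/s of raw data becomes about 1 GB/s kept.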
Computing gap
Challenges in
Administrative data and organic data increase the challenge
How are administrative data collected and stored?
Historical availability already compromised
How are organic data collected and stored?
Access to the raw data of the Ringtail system?
How to validate:
Many of the examples above are “siloed” because of computational constraints.
Administrative data is often “siloed” because of a combination of (perceived and real) confidentiality constraints and legal barriers.
LEHD succeeded in bringing together 51 (+) state administrations, which share their data with a trusted party.
Success: QWI
Less so: Researcher access (14 out of 51)
Mostly in confidential silos
Broad consensus on legal framework
Still some issues
28 countries
10 provinces + 3 territories + 1 federal government
Lock up the researcher
One view
Statistics Canada
But we like the results!
Kingi, Stanchi, Vilhuber (unpublished)
Difficult to replicate
Significant number of papers not reproducible because of access difficulties
Or maybe: Pervasive data
Challenge is to