Challenges for Official Statistics

Lars Vilhuber
2015-06-20

Where I'm coming from

Let's start with who's being challenged


Goals of NSO

Provide

  • high-quality
  • cost-effective statistics to
  • inform decisions made by a
  • wide range of stakeholders

Challenges for Official Statistics

There are many:

  • budget cuts
  • increasing privacy concerns
  • challenge from alternate data sources
  • increased expectations in an era of information
  • challenge to assess new stakeholder needs (who are the clients?)
  • changes in institutional culture (imposed and expected)

and

  • budget cuts

Challenges discussed here

Alternate data sources

What is "big" data?

Big data is:

  • 8 GB?
  • 2 GB/day?
  • 7.5 TB?
  • 309.3 million people, measured once?
  • 150 million people, measured 80 times?
  • 16,721,787,543 tables?
  • 100 countries, 5-15 times?
  • 19.3 billion records?
  • 112 weekly data points?

Representative big data is:

  • I can't run it on my laptop
  • My RA/co-author/[fill in] can't run it on her laptop
  • 2 years' worth of stock trades?
  • 10 questions / population / 1 country?
  • 3 variables for 98% of one country's workforce?
  • 30+ variables for same?
  • 1% samples of 100 countries' censuses?
  • 10% of tweets?
  • 1 variable for 10% of Twitter users?

What is "big" data?

This brings up the question: How do we collect data?

How do we collect data?

  • surveys:
    • well established science,
    • trained professionals designing, fielding, analyzing surveys
  • administrative data:
    • aim is to administer programs
    • definition of population, frame critical, but ex-post
    • statistics about populations is secondary purpose
  • organic data
    • aim is … something else (Twitter?)
    • population and frame often unclear

Old issues we know how to handle

  • defining population
  • estimating variability measures
  • computing statistics

New challenges

  • treating admin/organic as a noisy data source, different from surveys
  • designing administrative data collection with statistics in mind
  • handling large data flows in commonly accepted ways
  • novel confidentiality issues when ingesting other data sources
  • reconceptualizing the multiplicity of data sources

Data collection in surveys

“Respondent load should always be considered when planning a statistical collection and there should be policies and practices in place to manage relationships with respondents. The aim should always be to keep reporting load to the minimum and to maintain the high quality of collections.”

Australian National Statistical Service

Data collection in administrative data

Monsters

Data collection discrepancies

Survey

  • where did you work (precise lat/long) in the past 10 years?
  • who did you work for in the past 10 years?

Administrative

  • IRS Form W-4, line 8
  • CRA-ARC T4, box 54

Multiplicity of sources

Migration sources

ACS Migration (5-year)

ACS Migration

IRS Migration (1 year)

IRS Migration

OnTheMap for Burleigh County

OnTheMap county

OnTheMap for North Dakota

OnTheMap county

Job-to-job flows

J2J Migration

Challenges

  • Underlying population is the same
  • Discrete data products
  • Reconciling differences

Potential organic migration sources

What about using organic data?

Potential organic migration sources

Facebook

Using (changes in) Facebook home location

Potential organic migration sources

Twitter

Using tweeted information on “new house”

Potential organic migration sources

Travel tweets

Just Landed - 36 Hours by blprnt

Challenges

  • Representativeness?
  • Reconciling differences is a challenge

But:

  • no competition, right?

New statistical sources: Private sector

Existing organic statistics

Google Flu

Existing organic statistics

Commercial organic statistics

BPP

Commercial organic statistics

ADP

But notice...

… they all refer back to official statistics!

Benchmarking to NSO: Flu

Google Flu

Benchmarking to NSO: Prices

Google Flu

Benchmarking to NSO: Unemployment claims

Social Index

Benchmarking to NSO: Employment report

ADP benchmark

Challenges from other entities

  • Timeliness
  • Data collection by non-NSO

Challenges: Privacy

Challenges: Protection gap

  • More detailed data implies more data on specific individuals and firms
  • Are protection methods sufficient?

Access methods

  • apply methods, produce more public-use statistics (move the data to the researcher)
  • fail to apply methods, provide controlled direct access to data (move the researcher to the data; contracts/RDCs/etc.)

What level of protection?

  • older methods break down as published data become denser
  • newer methods are still being developed
    • synthetic data
    • noise infusion
  • more robust methods
    • differential privacy
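As a rough illustration of how differential privacy relates to noise infusion, the sketch below applies the Laplace mechanism to a single count. The cell count, the seed, and the function names are hypothetical, and this is a minimal sketch of one standard mechanism, not a description of any agency's production method.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count, epsilon, rng):
    # Adding or removing one person changes a count by at most 1
    # (sensitivity 1), so the Laplace mechanism uses scale = 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(2015)   # fixed seed so the sketch is repeatable
TRUE_COUNT = 120            # hypothetical cell count, not real data

mean_abs_error = {}
for epsilon in (0.1, 1.0, 10.0):
    draws = [noisy_count(TRUE_COUNT, epsilon, rng) for _ in range(10_000)]
    mean_abs_error[epsilon] = sum(abs(d - TRUE_COUNT) for d in draws) / len(draws)
    print(f"epsilon={epsilon}: mean absolute error = {mean_abs_error[epsilon]:.2f}")
```

A smaller epsilon means stronger protection and a larger expected error (the expected absolute error of Laplace noise equals its scale, here 1/epsilon), which is one concrete way to see utility traded against protection.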

How much protection?

  • tradeoff between utility and protection
  • how much protection do data providers (people, firms, etc.) require?
  • how much utility do stakeholders (people, firms, government, etc.) request?
  • what technology is available to implement?


Challenges: Computing resources

  • public use statistics as a form of efficient distribution (aggregation)
  • also allows for distributed computing!
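To make the aggregation point concrete, here is a small sketch that turns record-level data into a much smaller public-use table. The records, the cell definition, and the suppression threshold `MIN_CELL_SIZE` are all hypothetical. Small-cell suppression is added here only as an assumed example of a simple protection step.

```python
from collections import Counter

# Hypothetical confidential microdata: one (county, industry) pair per job.
microdata = (
    [("County A", "retail")] * 40
    + [("County A", "mining")] * 2
    + [("County B", "retail")] * 25
    + [("County B", "mining")] * 12
)

MIN_CELL_SIZE = 3  # hypothetical publication threshold (an assumption)

def public_use_table(records, min_cell_size):
    # Aggregate records into cell counts; cells below the threshold are
    # published as None (suppressed) instead of their true count.
    counts = Counter(records)
    return {
        cell: (n if n >= min_cell_size else None)
        for cell, n in counts.items()
    }

table = public_use_table(microdata, MIN_CELL_SIZE)
for cell, value in sorted(table.items()):
    print(cell, "suppressed" if value is None else value)
print(f"{len(microdata)} records reduced to {len(table)} published cells")
```

Once published, a table like this can be copied and analyzed anywhere, which is the sense in which public-use statistics distribute the computing. The cost is that anything not captured by the chosen cells is lost.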

The old model

old model

But what if we need information on thin tails?

Allow researchers to explore the entire multi-dimensional distribution, including its extremes, for instance for rare events or measures of inequality or program impact.

The new model?

supercomputer

In physics...

CERN

Data processing in physics

  • Large Hadron Collider (since 2008)
  • Arecibo Radiotelescope (since 1963)

Data processing in physics

  • pre-select (0.01%) [CERN animation]
  • select (1%) using 75,000 processors
  • distribute 25 Petabytes/year to 11 centers
  • analyze in 170 centers
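The arithmetic behind these steps can be sketched as follows. Two assumptions are not stated on the slide: that the 1% selection applies to the data that survived pre-selection, and that the 25 Petabytes/year are split evenly across the 11 centers (used only to get an average).

```python
from fractions import Fraction

PRE_SELECT = Fraction(1, 10_000)   # 0.01% kept by pre-selection
SELECT = Fraction(1, 100)          # 1% kept by selection (assumed: of the pre-selected data)
PETABYTES_PER_YEAR = 25            # volume distributed per year
N_DISTRIBUTION_CENTERS = 11

# Share of the original data that survives both filtering stages.
combined_fraction = PRE_SELECT * SELECT

# Average volume per center, assuming an even split (an assumption).
avg_pb_per_center = PETABYTES_PER_YEAR / N_DISTRIBUTION_CENTERS

print(f"kept after both stages: 1 in {combined_fraction.denominator:,}")
print(f"average per center: {avg_pb_per_center:.2f} PB/year")
```

Under these assumptions, roughly one part in a million of the raw data is retained, and even that remainder is large enough to need a distributed infrastructure.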

For social scientists

Computing gap

  • Census Bureau vs. XSEDE
  • Statistics Canada vs. Compute Canada

Challenges in

  • moving data to compute resource, or
  • compute resource to data

Administrative data and organic data increase the challenge

Challenge: Cultural change

Change in the culture

  • computing approaches require a change in procedures
  • openness to new research methods (not in statistics!) and new skill sets (not in statistics!)
  • academic-agency collaborations are an important part

Change in culture

“… the current Census Bureau survey and census methods are unsustainable. Changes must occur in the acquisition of data and construction of statistical information for the Census Bureau to succeed.”

Robert Groves, Director, Census Bureau, September 8, 2011

Change in culture

“Modern computational tools play the same role now that survey design and implementation did in the 1960s.”

John Abowd and Steve Fienberg, CNSTAT, May 8, 2015

Change in culture

“We would like to suggest […] implementing a variety of new models for facilitating the movements of researchers between academia and the Federal Statistical System”

Report of the NSF reverse site visit of the NCRN, April 2015

Summary: Challenges for NSO

Thank you