Data Practices:

1.2 Sourcing Data

Topics Covered

  1. Defining constraints for questions
  2. Importance of the "right" data
  3. Spectrum of Open vs Closed
  4. Finding Data
  5. What to do about dark data
  6. Data purchasing or bartering
  7. Other things to consider

Defining Constraints for Questions

Focus on Outcomes

  • Avoid analysis paralysis
  • Beat data overload
  • Take small bites
  • Don't let data dictate your actions
  • Goal oriented

The "Right" Metrics

  • Context is key
  • Research will evolve over time
  • "One Metric That Matters" (OMTM)
  • Explore provenance & methodology
  • Strong / Defensible foundations

Metrics Models

Pirate Metrics

Five metrics designed for online startups


  • Acquisition
  • Activation
  • Retention
  • Referral
  • Revenue
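
One way to make the five stages concrete is to compute how many users move from each stage to the next. The sketch below is illustrative only: the counts are invented, and treating the stages as a strict funnel is a simplifying assumption.

```python
# Hypothetical stage counts for one month; every number here is invented.
funnel = [
    ("Acquisition", 5000),  # visitors who arrived
    ("Activation", 1500),   # visitors who signed up
    ("Retention", 600),     # users who came back
    ("Referral", 120),      # users who invited someone
    ("Revenue", 90),        # users who paid
]

def stage_conversion(stages):
    """Return (stage name, share of the previous stage's users) pairs."""
    rates = []
    for (_, prev_count), (name, count) in zip(stages, stages[1:]):
        rate = count / prev_count if prev_count else 0.0
        rates.append((name, rate))
    return rates

rates = stage_conversion(funnel)
for name, rate in rates:
    print(f"{name}: {rate:.1%} of previous stage")
```

The stage with the lowest rate is a reasonable candidate for a "One Metric That Matters".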


Ensures focus on only important metrics you have authority to improve.

  • Important
  • Potential Improvement
  • Authority


Simple framework that makes complicated metric selection straightforward

  • Trackable
  • Important
  • Explainable


Select only the most important metrics, typically in the context of a goal or objective

  • Key Performance Indicators

Exercise 1: Provenance & Methodology

Let's work through an exercise in evaluating data that someone else has gathered. Some of the datasets have already been sourced and imported; one of them lives on another site and will need to be added.



  • Where your data came from
  • How it was gathered / derived
  • How clean is the data?
  • Is there bias or inconsistency in how the data is gathered / portrayed?
  • Examine all ethical considerations
  • How well documented and easy to understand is it? Could anyone pick up this project and run with it?
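
A quick profile can give a first answer to "how clean is the data". The sketch below counts rows, exact duplicates, and empty cells in a small CSV extract. The column names and values are hypothetical, and these three checks are an assumed starting point, not a complete quality review.

```python
import csv
import io

# Hypothetical extract; in practice this would be the dataset under review.
raw = """customer_id,region,signup_date
101,North,2021-03-01
102,,2021-03-04
103,South,
101,North,2021-03-01
"""

def profile(text):
    """Count rows, exact duplicate rows, and empty cells per column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    seen = set()
    duplicates = 0
    missing = {}
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            duplicates += 1
        seen.add(key)
        for column, value in row.items():
            if not (value or "").strip():
                missing[column] = missing.get(column, 0) + 1
    return {"rows": len(rows), "duplicates": duplicates, "missing": missing}

report = profile(raw)
print(report)
```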

Context, In Moderation

  • Gather ancillary info to frame your main metrics
  • Avoid data overload!
  • 3-5 supporting metrics
  • Define and regularly revisit

Qualitative vs Quantitative


  • Describes qualities or characteristics
  • Gathered using
    • Questionnaires
    • Interviews
    • Observation
  • Difficult to measure and analyze
  • Look for patterns
  • Good use for NLP


  • Describes quantity (what / how many)
  • Can be counted and compared numerically
  • Gathered using:
    • Instruments
    • Thermometer
    • Log data
  • Good use for statistical analysis
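
The difference shows up as soon as you try to summarize each kind of data. In the sketch below, the survey answers and temperature readings are invented, and counting repeated words stands in for real NLP as a deliberately crude first pass.

```python
import statistics
from collections import Counter

# Hypothetical free-text survey answers (qualitative).
answers = [
    "checkout was slow",
    "slow search results",
    "love the new search",
]
# Hypothetical temperature readings in Celsius (quantitative).
readings = [20.5, 21.0, 19.5, 22.0]

# Qualitative: look for recurring words as a first hint of a pattern.
word_counts = Counter(word for answer in answers for word in answer.split())
top_words = [word for word, count in word_counts.items() if count > 1]

# Quantitative: values can be summarized numerically right away.
mean_temp = statistics.mean(readings)
spread = statistics.stdev(readings)

print(top_words, mean_temp, round(spread, 2))
```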

Importance of the "Right" Data

Exercise 2: Finding the Right Data

  • What types of data sources does your organization have?
    • Data Lake / Warehouse
    • Relational Database
    • Spreadsheets
    • Unstructured data
    • Multidimensional Data
    • Knowledge graph
    • Public Data
    • Partner Data
    • Harvested web content
    • Other
  • Do individuals have data they use to make decisions that no one else has (dark data)?
  • What data aggregation methods do you employ now?

  • Are there data silos? Can users access all data sources within the org?
  • What inefficiencies exist within the organization?
    • Inefficiency and waste can be found in most organizations. When you understand where your organization is wasting time, energy, or capital you have a much better idea where you need to collect and analyze data.
  • What processes or automation could be put in place to reduce or remove those inefficiencies?
    • Implementing a solution to remove waste and inefficiency should be the goal. Whatever data you gather should be able to help make decisions and drive action.
  • What data do we need to make that happen?
    • Since you’re gathering data to drive decisions, your data needs to be actionable: a metric that can be moved by pulling levers within the org.

Garbage In, Garbage Out

  • From computer science (Charles Babbage)
  • Flawed input == Flawed output
  • Beware of "Garbage in, Gospel out"

Spectrum of Open vs Closed

The Spectrum of Open

Lessons from Open Source

  1. Stop reinventing the wheel!
  2. The world beyond code/data
  3. Expanding the industry
  4. For the love of it

Exercise 3: Using the "Right" Open

Using the spectrum of openness, consider the suggested theoretical datasets and classify how open you think they should be.

Do more than just apply a label.

  • Define the scope (what do YOU think is in the data?)
  • Provide justification for your decision
  • Identify potential difficulties (PII / PHI / Regulations / etc)
  • Would this data allow your competitors undue advantage / insight?
  • What economic return could you expect from each of these?
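
A first pass at spotting potential PII can be automated before the human review. The sketch below flags columns by name only. The keyword list and column names are hypothetical, and name matching will miss sensitive values hidden in generically named fields.

```python
# Assumed keyword hints; a real review would follow the applicable regulations.
PII_HINTS = ("name", "email", "phone", "ssn", "address", "birth")

def flag_sensitive(columns):
    """Return the columns whose names contain one of the hint keywords."""
    return [c for c in columns if any(hint in c.lower() for hint in PII_HINTS)]

# Hypothetical columns from a dataset being considered for release.
columns = ["order_id", "customer_email", "ship_address", "total", "Birth_Date"]
flagged = flag_sensitive(columns)
print(flagged)
```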

Finding Data

Utilize Existing Knowledge

  • Who has the data
    • IT/IS
    • Human Resources
    • Payroll / Accounting
    • Other functional departments?
  • Include the subject matter experts, not just the data

Find Your Dark Data

  • Dark data is everywhere!
  • Bad behavior that enables:
    • Duplicative effort
    • Stale / inaccurate data
    • Data brawls!
  • Change here is (mostly) social, not technological

Catalog Your Data

  • Democratize access to data
  • Ensure the best data is used
  • Reduce data prep
  • Build collaboratively

Incorporate Open Data

  • Many sources available
  • Governmental data is an especially good target
  • Beware purchased data; it can often be acquired for free

Other Sources of Truth to Incorporate

  • Documentation
  • Research
  • Team Members

Exercise 4: Connecting Disparate Data Sources

  • "Add to Project" (New Project)
  • "Add to Project" (Existing Project)
    • Or add your own via "add data"
    • Google Sheets, Excel, CSV, etc.
  • New SQL Query (click right column to copy)
        SELECT * FROM `1_2_disparate_data_1`
        INNER JOIN `1_2_disparate_data_2` 
        ON `1_2_disparate_data_1`.customer_id = `1_2_disparate_data_2`.customer_id
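
The inner join in this exercise can also be tried locally. The sketch below uses Python's built-in sqlite3 module with an in-memory database. The table names come from the exercise, but the columns other than customer_id and all the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical contents for the two exercise tables.
conn.executescript("""
CREATE TABLE "1_2_disparate_data_1" (customer_id INTEGER, name TEXT);
CREATE TABLE "1_2_disparate_data_2" (customer_id INTEGER, total REAL);
INSERT INTO "1_2_disparate_data_1" VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
INSERT INTO "1_2_disparate_data_2" VALUES (1, 40.0), (3, 15.5), (4, 99.0);
""")

# An inner join keeps only customer_id values present in both tables.
rows = conn.execute("""
SELECT a.customer_id, a.name, b.total
FROM "1_2_disparate_data_1" AS a
INNER JOIN "1_2_disparate_data_2" AS b
ON a.customer_id = b.customer_id
ORDER BY a.customer_id
""").fetchall()
conn.close()

print(rows)
```

Customers 2 and 4 drop out because each appears in only one table, which is worth checking before trusting a joined dataset.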

Data Purchasing or Bartering

Purchasing Data

The data brokerage world is a multi-million-dollar-a-year market, but is it right for you? There are definite pros and cons to buying data, but unless you’re buying specific market data, you can usually do a far better job of gathering it yourself.


  • Easy trade of capital for data
  • No data expertise required
  • Reduces search and discovery time


  • Costly
  • Often generic data
  • Rarely get to inspect the data first
  • Not specific to your use case or model
  • Many times it's open data that has just been packaged for easier acquisition

Data Bartering

  • Trading data for goods/services/data = Indirect Monetization
  • Highlight: Additional data / context
  • Be on the lookout for multipliers (Triple win!)
  • Example: Abe's (Now Direct Eats) [Infonomics]

Other Things to Consider

Provenance / Lineage

  • Provenance (backward) vs Lineage (bi-directional)
  • General Considerations:
    • Origins / Source
    • Actions that influence / transform
    • Where the data moves over time
    • Derivative works
  • Reproducibility is key!
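
One lightweight way to support reproducibility is to log each step together with a fingerprint of the data it produced. The sketch below is an assumed minimal approach. The source label, rows, and log fields are hypothetical, and it is not a substitute for a dedicated lineage tool.

```python
import hashlib
import json

def fingerprint(rows):
    """Hash a canonical JSON form of the rows so changes are detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical source rows and a hypothetical source label.
rows = [{"customer_id": 1, "spend": "40"}, {"customer_id": 2, "spend": ""}]
log = [{"step": "load", "source": "crm_export.csv", "sha256": fingerprint(rows)}]

# Each transformation appends what was done and the resulting fingerprint.
rows = [r for r in rows if r["spend"]]
log.append({"step": "drop rows with empty spend", "sha256": fingerprint(rows)})

for entry in log:
    print(entry["step"], entry["sha256"][:12])
```

Anyone rerunning the same steps on the same source should arrive at the same final fingerprint.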


  • Included in provenance
  • Consider the implications of the source (trustworthy? timely?)
  • Good opportunity to identify potential bias (even if unintentional)

Creation Methodology

  • Do you understand how it was created? Do others?
  • Is it documented?
  • Was a subject matter expert involved in the creation/curation/modeling?

Distribution / Size of Data

  • Who was the data created for?
  • Do you have all required context (especially if repurposing the data)?
  • Do you need it all? (sampling)
  • How big is the data likely to get over time?
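
If you do not need it all, a fixed-size random sample can be drawn even when the data is too large to hold in memory. The sketch below uses reservoir sampling (Algorithm R). The seed and stream are illustrative, and a plain random sample is assumed to be appropriate for the question at hand.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a kept item with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample

# Hypothetical stream of record ids.
sample = reservoir_sample(range(100_000), k=5)
print(sample)
```

Fixing the seed keeps the sample reproducible, which ties back to the provenance concerns above.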

Potential Economic Value

  • Even open data has value
  • Can you monetize this data directly? Indirectly?
  • Moving beyond the cost center

Want to run a workshop like this at your company?

Don't forget to sign the values and principles!