Data Practices:

1.3 Profile and Prepare

[Use arrow keys to navigate, "s" to show speaker notes, and "f" for fullscreen.]

PDF Print

With Notes

Topics Covered

  1. Data Joining / Normalization
  2. Profiling Data
    • Characteristics of data
    • Defining you data
    • Discover content
    • Discover relationships
  3. Preparing Data
    • Data cleaning
    • Cleaning Backflow and Automation
    • Data "wrangling"

Data Normalization

"The process of reorganizing data to remove redundance and ensure all data dependencies are logical."

  • Logical = all related data is stored together
  • More broadly, often means changing format / medium
  • Can be time consuming
  • Often introduces complexity for "live" data

Profiling Data

Data Characteristics

Cleanliness

Accuracy

Reliability

Validity

Cleanliness

  • Is your data complete?
  • Does you data have errors?
  • Are there missing values?
  • Are there formatting issues?

Accuracy

In this context, Accuracy is more about the "precision" of the measurement, rather than the errors or unpredictability.

Example: "measuring the weight of a commodity such as rice using an appropriate balance would be considered an accurate measurement. If, howerver, the balance only measured in 0.5 kilo intervals, forcing you to judge between these intervals, you would not consider your measurement to be accurate."

Reliability

Checking the distribution of data to determine if any outliers are "true" or the result of errors in measurement or recording.

  • Non-binary
  • More rigorous in research settings
  • Hard to measure absolutely
  • In a real world setting, more useful to use this as a litmus test for how much "wiggle room" you are willing to accept in data that isn't generated by you.

Validity

Does the data measure what it purports to measure?"

  • Easiest way is to engage a Subject Matter Expert
  • Often subjective or visceral
  • In some cases there is an established reference to compare against

Defining Your Data

Shape

  • Can be taken quite literally
    • How wide / long is the data?
    • Individual column distributions
  • Value relations - Look for things like:
    • Clusters
    • Trend lines
    • Curves
    • Other patterns
  • "The logical distribution of values"

Features

  • Shorthand: "features == columns" or "what's being measured"
  • More technical Dataset is (X,y)
    • X = Independent variable
    • y = target
  • Example
    • X is the "date" (could also be facebook ads running, school in session, etc)
    • y is the resulting traffic measure

Limitations

  • Important, especially as it relates to ethical data handling
  • Look at things like:
    • Completeness
    • Trustworthiness of responses (ex: survey data)
    • Differences in quality and format for disparate data sources
    • Bias in framing or collection of data

Exercise 1: Profile Data

Using the suggested data set, get a feel for the data. Consider the following:

  • Shape of the data
  • Features (what's being measured?)
  • Limitations (ex: is there inherent bias?)
  • What sorts of questions could this data answer?
  • Would you use this data? In what setting?
  • Be prepared to give a short summary to the group

Suggested Data:
https://data.world/dpe/provenance-exercise

Discover Content

What's in, and missing from, the data?

Null or ambiguous content

Incorrectly formatted data

Range (and other analysis)

Discover Relationships

Matchable / Relatable Datasets

Data being matchable is about connecting it to some other representation of the same real-world entity or concept. Data are related when they are "about" the same thing.

When talking about how matchable/relateable your data is, it is looking at items that could easily be used to link it with other real world data. For example:

  • Geo (zip, state, country)
  • Catalog Identifier (SKU, Product Name, Model Number)
  • Other indicators of uniqueness (gender, species, organization)

Vectors for Linkage

What is Linked Data?

The application of the same ideas that created the web (linking documents) applied to data (relationships between data). Four principles:

  1. URIs name things (identify)
  2. Use HTTP URIs for lookup
  3. Provide useful info about identity
  4. Refer to other things using HTTP URI

Linked Data Example

Nobel Prize + DBPedia

        PREFIX dbpo:  
        PREFIX rdfs:  
        PREFIX nobel:  
        PREFIX rdf:  
        PREFIX owl:  
        SELECT DISTINCT ?label ?country 
        WHERE { 
          ?laur rdf:type nobel:Laureate . 
          ?laur rdfs:label ?label . 
          ?laur dbpo:birthPlace ?country . 
          ?country rdf:type dbpo:Country . 
          ?country owl:sameAs ?dbp . 
          SERVICE  { 
            ?dbp dbpo:areaTotal ?area . 
            FILTER (?area < 10000000000) 
          } 
        }
        

(https://en.wikipedia.org/wiki/Linked_data)

Where else the data is used

Researching or documenting other places where you data is in use can help build context as well as understand where there may be work that you can build upon.

  • Provenance / Lineage
  • How was data collected?
  • How was data measured
  • Collaborative work can help everyone

Data Preparation

Data Cleaning

  • Big topic, start small (sampling)
  • Things to look for:
    • Missing / Zero data
    • Data type mismatch (ex: MM/DD/YYYY)
    • Outliers
  • Use pictures
  • Sanity check (SME / Common Sense)
  • Document everything!

Backflow and Automation

  • Use cleaning documentation as rules
  • Port those to source
  • Use ETL process if no access to source
  • Lots of rinse, lather, repear

Exercise 2: Cleaning Data

Using the provided data (CSV), download and fix the errors and reupload.

Dataset: https://data.world/dpe/data-cleaning-exercise

  1. Download the two data files
  2. Remove rows with completely empty data
  3. Check for other missing data
  4. Check for duplicate values
  5. Check for erroneous values
    • Hasty gender?
    • Is "old" an appropriate value for age?
    • -78 is probably just a sign error
    • Validate columns with set values (male/female)
    • Check formatting on preformatted values (emails)
  6. Drop "full name" as redundant
  7. Try the second file on your own

Exercise 3: Joining / Normalizing Data

Using the two clean data files from the previous exercise, write a very simple query to join them. Try your hand at a filter or an order by.

  1. Create a new data project on data.world
  2. Upload both files to the data project
  3. Enter the "workspace"
  4. Create a new SQL query
  5. Input query below (don't forget you can click and paste from right-hand nav)
SELECT *
  FROM `1_3_data_cleaning_1`
       OUTER JOIN `1_3_data_cleaning_2`
 WHERE `1_3_data_cleaning_1`.id = `1_3_data_cleaning_2`.customer_id

Want to run a workshop like this at your company?

community@data.world



Don't forget to sign the values and principles! https://datapractices.org/manifesto