Data Practices:

1.3 Profile and Prepare

Topics Covered

  1. Data Joining / Normalization
  2. Profiling Data
    • Characteristics of data
    • Defining your data
    • Discover content
    • Discover relationships
  3. Preparing Data
    • Data cleaning
    • Cleaning Backflow and Automation
    • Data "wrangling"

Data Normalization

"The process of reorganizing data to remove redundance and ensure all data dependencies are logical."

  • Logical = all related data is stored together
  • More broadly, often means changing format / medium
  • Can be time consuming
  • Often introduces complexity for "live" data
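The trade-offs above are easier to see with a toy example. A minimal Python sketch (hypothetical order records): redundancy is removed by storing each customer once and referencing them by id:

```python
# Denormalized: customer details are repeated on every order row.
orders = [
    {"order_id": 1, "customer": "Ada", "email": "ada@example.com", "total": 20},
    {"order_id": 2, "customer": "Ada", "email": "ada@example.com", "total": 35},
    {"order_id": 3, "customer": "Grace", "email": "grace@example.com", "total": 15},
]

customers = {}            # one row per customer, keyed by email
normalized_orders = []    # orders now reference customers by id
for row in orders:
    key = row["email"]
    if key not in customers:
        customers[key] = {"customer_id": len(customers) + 1,
                          "name": row["customer"], "email": key}
    normalized_orders.append({"order_id": row["order_id"],
                              "customer_id": customers[key]["customer_id"],
                              "total": row["total"]})

print(len(customers), normalized_orders[1])
```

Note the cost: answering "who placed order 2?" now takes a lookup across two tables — the complexity mentioned above.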

Profiling Data

Data Characteristics

Cleanliness

Accuracy

Reliability

Validity

Cleanliness

  • Is your data complete?
  • Does you data have errors?
  • Are there missing values?
  • Are there formatting issues?
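These checks can be screened programmatically; a minimal Python sketch over hypothetical rows:

```python
import re

rows = [
    {"name": "Ada", "age": "36", "email": "ada@example.com"},
    {"name": "", "age": None, "email": "not-an-email"},
]

# Missing values: None or empty strings in any field.
missing = sum(1 for r in rows for v in r.values() if v in (None, ""))

# Formatting issues: values that fail a simple pattern check.
bad_emails = [r["email"] for r in rows
              if r["email"] and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", r["email"])]

print(missing, bad_emails)
```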

Accuracy

In this context, accuracy is more about the "precision" of the measurement than about errors or unpredictability.

Example: "measuring the weight of a commodity such as rice using an appropriate balance would be considered an accurate measurement. If, howerver, the balance only measured in 0.5 kilo intervals, forcing you to judge between these intervals, you would not consider your measurement to be accurate."

Reliability

Checking the distribution of data to determine if any outliers are "true" or the result of errors in measurement or recording.

  • Non-binary
  • More rigorous in research settings
  • Hard to measure absolutely
  • In a real-world setting, it's more useful to treat this as a litmus test for how much "wiggle room" you are willing to accept in data that isn't generated by you.

Validity

Does the data measure what it purports to measure?

  • Easiest way is to engage a Subject Matter Expert
  • Often subjective or visceral
  • In some cases there is an established reference to compare against

Defining Your Data

Shape

  • Can be taken quite literally
    • How wide / long is the data?
    • Individual column distributions
  • Value relations - Look for things like:
    • Clusters
    • Trend lines
    • Curves
    • Other patterns
  • "The logical distribution of values"

Features

  • Shorthand: "features == columns" or "what's being measured"
  • More technically, a dataset is (X, y)
    • X = Independent variable
    • y = target
  • Example
    • X is the "date" (could also be facebook ads running, school in session, etc)
    • y is the resulting traffic measure
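Sketching that (X, y) split in Python (dates and traffic values are made up):

```python
# Each record: (date, ads_running, traffic). X holds the independent
# variables, y the target we want to explain or predict.
data = [
    ("2024-05-01", True, 340),
    ("2024-05-02", False, 210),
    ("2024-05-03", True, 365),
]
X = [(date, ads) for date, ads, _ in data]
y = [traffic for *_, traffic in data]
print(X[0], y)
```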

Limitations

  • Important, especially as it relates to ethical data handling
  • Look at things like:
    • Completeness
    • Trustworthiness of responses (ex: survey data)
    • Differences in quality and format for disparate data sources
    • Bias in framing or collection of data

Exercise 1: Profile Data

Using the suggested data set, get a feel for the data. Consider the following:

  • Shape of the data
  • Features (what's being measured?)
  • Limitations (ex: is there inherent bias?)
  • What sorts of questions could this data answer?
  • Would you use this data? In what setting?
  • Be prepared to give a short summary to the group

Suggested Data:
https://data.world/dpe/provenance-exercise

Discover Content

What's in, and missing from, the data?

Null or ambiguous content

Incorrectly formatted data

Range (and other analysis)
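These three checks can be sketched in a few lines of Python (toy column of values):

```python
values = ["12", "7", None, "forty", "-3", "900"]

nulls = [v for v in values if v is None]       # null / missing content
numeric, malformed = [], []
for v in values:
    if v is None:
        continue
    try:
        numeric.append(int(v))                 # incorrectly formatted values fail here
    except ValueError:
        malformed.append(v)

value_range = (min(numeric), max(numeric))     # range analysis
print(len(nulls), malformed, value_range)
```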

Discover Relationships

Matchable / Relatable Datasets

Data being matchable is about connecting it to some other representation of the same real-world entity or concept. Data are related when they are "about" the same thing.

When assessing how matchable/relatable your data is, look for items that could easily be used to link it with other real-world data. For example:

  • Geo (zip, state, country)
  • Catalog Identifier (SKU, Product Name, Model Number)
  • Other indicators of uniqueness (gender, species, organization)
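A sketch of why such keys matter, assuming two hypothetical datasets that share a zip code:

```python
# Two datasets about the same places; the zip code is the matchable key.
sales = [
    {"zip": "78701", "revenue": 1200},
    {"zip": "10001", "revenue": 800},
]
population = {"78701": 97000, "10001": 56000}

# A shared, unambiguous key lets one dataset enrich the other.
enriched = [dict(row, population=population[row["zip"]]) for row in sales]
print(enriched)
```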

Vectors for Linkage

What is Linked Data?

The same ideas that created the web (linking documents), applied to data (relationships between data). Four principles:

  1. URIs name things (identify)
  2. Use HTTP URIs for lookup
  3. Provide useful info about identity
  4. Refer to other things using HTTP URI

Linked Data Example

Nobel Prize + DBPedia

        PREFIX dbpo: <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX nobel: <http://data.nobelprize.org/terms/>
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT DISTINCT ?label ?country
        WHERE {
          ?laur rdf:type nobel:Laureate .
          ?laur rdfs:label ?label .
          ?laur dbpo:birthPlace ?country .
          ?country rdf:type dbpo:Country .
          ?country owl:sameAs ?dbp .
          SERVICE <http://dbpedia.org/sparql> {
            ?dbp dbpo:areaTotal ?area .
            FILTER (?area < 10000000000)
          }
        }
        

(https://en.wikipedia.org/wiki/Linked_data)

Where else the data is used

Researching or documenting other places where your data is in use can help build context as well as uncover existing work that you can build upon.

  • Provenance / Lineage
  • How was data collected?
  • How was data measured?
  • Collaborative work can help everyone

Data Preparation

Data Cleaning

  • Big topic, start small (sampling)
  • Things to look for:
    • Missing / Zero data
    • Data type mismatch (ex: MM/DD/YYYY)
    • Outliers
  • Use pictures
  • Sanity check (SME / Common Sense)
  • Document everything!
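As a small illustration of the outlier check, assuming a hypothetical age column, flag values more than two standard deviations from the mean along with suspicious zeros:

```python
import statistics

ages = [34, 29, 41, 37, 0, 33, 250, 38]

mean = statistics.mean(ages)
stdev = statistics.stdev(ages)

# Crude screen: values far from the mean, plus zeros that may mean "missing".
outliers = [a for a in ages if abs(a - mean) > 2 * stdev]
zeros = [a for a in ages if a == 0]
print(outliers, zeros)
```

Whether 250 is a typo or a true value is exactly the sanity check to run past an SME.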

Backflow and Automation

  • Use cleaning documentation as rules
  • Port those to source
  • Use ETL process if no access to source
  • Lots of lather, rinse, repeat
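One way to make cleaning repeatable: capture each documented fix as a function and apply the whole set in an ETL-style pass (a Python sketch; the function names are hypothetical):

```python
def strip_whitespace(value):
    # Documented rule: trim stray whitespace from text fields.
    return value.strip() if isinstance(value, str) else value

def fix_sign_error(value):
    # Documented rule: an age of -78 is probably a data-entry sign error.
    return abs(value) if isinstance(value, int) and value < 0 else value

RULES = [strip_whitespace, fix_sign_error]   # the "cleaning documentation"

def clean(record):
    out = {}
    for key, value in record.items():
        for rule in RULES:
            value = rule(value)
        out[key] = value
    return out

print(clean({"name": " Ada ", "age": -36}))
```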

Exercise 2: Cleaning Data

Using the provided data (CSV), download the files, fix the errors, and re-upload.

Dataset: https://data.world/dpe/data-cleaning-exercise

  1. Download the two data files
  2. Remove rows with completely empty data
  3. Check for other missing data
  4. Check for duplicate values
  5. Check for erroneous values
    • Hasty gender?
    • Is "old" an appropriate value for age?
    • -78 is probably just a sign error
    • Validate columns with set values (male/female)
    • Check formatting on preformatted values (emails)
  6. Drop "full name" as redundant
  7. Try the second file on your own
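The steps above can be sketched in Python with the csv module (the inline sample rows here are made up; the real files come from the dataset link):

```python
import csv
import io

raw = """id,first,last,full_name,age,gender,email
1,Ada,Lovelace,Ada Lovelace,36,female,ada@example.com
1,Ada,Lovelace,Ada Lovelace,36,female,ada@example.com
2,Grace,Hopper,Grace Hopper,-78,female,grace@example.com
,,,,,,
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# 2. Remove completely empty rows.
rows = [r for r in rows if any(v.strip() for v in r.values())]

# 4. Check for (and drop) duplicate rows.
seen, deduped = set(), []
for r in rows:
    key = tuple(r.values())
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 5. Fix the sign error and validate set-valued columns.
for r in deduped:
    if r["age"].lstrip("-").isdigit() and int(r["age"]) < 0:
        r["age"] = str(abs(int(r["age"])))
assert all(r["gender"] in {"male", "female"} for r in deduped)

# 6. Drop "full name" as redundant.
for r in deduped:
    r.pop("full_name")

print(len(deduped), deduped[1]["age"])
```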

Exercise 3: Joining / Normalizing Data

Using the two clean data files from the previous exercise, write a very simple query to join them. Try your hand at a filter or an order by.

  1. Create a new data project on data.world
  2. Upload both files to the data project
  3. Enter the "workspace"
  4. Create a new SQL query
  5. Input query below (don't forget you can click and paste from right-hand nav)
SELECT *
  FROM `1_3_data_cleaning_1`
       JOIN `1_3_data_cleaning_2`
         ON `1_3_data_cleaning_1`.id = `1_3_data_cleaning_2`.customer_id

Want to run a workshop like this at your company?

[email protected]



Don't forget to sign the values and principles! https://datapractices.org/manifesto