"The process of reorganizing data to remove redundancy and ensure all data dependencies are logical."
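As a minimal sketch of the definition above, the following Python splits a flat table, in which customer details repeat on every order, into a customers table and an orders table linked by a key. The rows and column names are invented for illustration.

```python
# Hypothetical denormalized rows: customer details repeat on every order.
flat_orders = [
    {"order_id": 1, "customer_id": 10, "customer_name": "Ada", "city": "London", "item": "rice"},
    {"order_id": 2, "customer_id": 10, "customer_name": "Ada", "city": "London", "item": "flour"},
    {"order_id": 3, "customer_id": 11, "customer_name": "Grace", "city": "New York", "item": "rice"},
]

def normalize(rows):
    """Split flat rows into a customers table and an orders table."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer's attributes are stored once, keyed by customer_id.
        customers[row["customer_id"]] = {
            "customer_name": row["customer_name"],
            "city": row["city"],
        }
        # Orders keep only their own attributes plus the foreign key.
        orders.append({
            "order_id": row["order_id"],
            "customer_id": row["customer_id"],
            "item": row["item"],
        })
    return customers, orders

customers, orders = normalize(flat_orders)
print(customers)
print(orders)
```

After the split, a change to a customer's city is made in one place rather than on every order row.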
In this context, accuracy is more about the "precision" of the measurement than about errors or unpredictability.
Example: "measuring the weight of a commodity such as rice using an appropriate balance would be considered an accurate measurement. If, however, the balance only measured in 0.5 kilo intervals, forcing you to judge between these intervals, you would not consider your measurement to be accurate."
Checking the distribution of data to determine if any outliers are "true" or the result of errors in measurement or recording.
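One common way to flag candidate outliers is the interquartile range (IQR) rule, sketched below with Python's standard library. The sample weights and the multiplier of 1.5 are assumptions for illustration. The rule only flags values for review; deciding whether a flagged value is "true" or an error still takes judgment.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical weights in kilos; 150.0 could be a typo for 1.50 or a real bulk order.
weights_kg = [1.0, 1.5, 1.0, 2.0, 1.5, 1.0, 2.5, 150.0]
print(iqr_outliers(weights_kg))
```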
"Does the data measure what it purports to measure?"
Using the suggested data set, get a feel for the data. Consider the following:
Suggested Data:
https://data.world/dpe/provenance-exercise
What's in, and missing from, the data?
Null or ambiguous content
Incorrectly formatted data
Range (and other analysis)
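The checks listed above could be sketched as a small profiling pass like the one below. The sample CSV, its column names, the set of "null-like" markers, and the expected date format are all assumptions for illustration; they are not taken from the suggested data set.

```python
import csv
import io
import re

# Hypothetical sample; in the exercise this would be the downloaded CSV.
SAMPLE = """id,signup_date,age
1,2017-03-01,34
2,03/02/2017,41
3,,N/A
4,2017-03-04,29
"""

AMBIGUOUS = {"", "n/a", "na", "null", "none", "unknown", "-"}
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed target format

def profile(rows):
    """Count null-like cells, list badly formatted dates, and find the age range."""
    null_counts = {}
    bad_dates = []
    ages = []
    for row in rows:
        # Null or ambiguous content, counted per column.
        for column, value in row.items():
            if value.strip().lower() in AMBIGUOUS:
                null_counts[column] = null_counts.get(column, 0) + 1
        # Incorrectly formatted data: non-empty dates that are not ISO 8601.
        date = row["signup_date"].strip()
        if date and not ISO_DATE.match(date):
            bad_dates.append(row["id"])
        # Range: collect values that parse as whole numbers.
        if row["age"].strip().isdigit():
            ages.append(int(row["age"]))
    age_range = (min(ages), max(ages)) if ages else None
    return null_counts, bad_dates, age_range

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
null_counts, bad_dates, age_range = profile(rows)
print(null_counts, bad_dates, age_range)
```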
Data being matchable is about connecting it to some other representation of the same real-world entity or concept. Data are related when they are "about" the same thing.
How matchable or relatable your data is depends on whether it contains items that could easily be used to link it with other real-world data. For example:
The same ideas that created the web (linking documents), applied to data (relationships between data). Four principles:
Nobel Prize + DBPedia
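PREFIX dbpo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# nobel: namespace assumed to be the Nobel Prize linked data vocabulary
PREFIX nobel: <http://data.nobelprize.org/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?label ?country
WHERE {
  # Laureates, their names, and the country they were born in
  ?laur rdf:type nobel:Laureate .
  ?laur rdfs:label ?label .
  ?laur dbpo:birthPlace ?country .
  ?country rdf:type dbpo:Country .
  # Match each country to its DBpedia equivalent
  ?country owl:sameAs ?dbp .
  # Federated lookup of the country's area (endpoint assumed to be DBpedia's public SPARQL service)
  SERVICE <http://dbpedia.org/sparql> {
    ?dbp dbpo:areaTotal ?area .
    FILTER (?area < 10000000000)
  }
}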
Researching or documenting other places where your data is in use can help build context and reveal existing work that you can build upon.
Download the provided data (CSV), fix the errors, and re-upload it.
Dataset: https://data.world/dpe/data-cleaning-exercise
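The kinds of fixes this exercise calls for might look like the sketch below. It is not based on the actual errors in the dataset: the null-like markers and the two input date formats are assumptions, and a slash-separated date is assumed to be month/day/year.

```python
from datetime import datetime

NULL_LIKE = {"", "n/a", "na", "null", "none"}  # assumed markers for missing values
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")        # assumed input formats

def clean_value(value):
    """Trim whitespace and map null-like markers to an empty string."""
    value = value.strip()
    return "" if value.lower() in NULL_LIKE else value

def clean_date(value):
    """Rewrite a date in any assumed format as ISO 8601; leave other text unchanged."""
    value = clean_value(value)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value

print(clean_date(" 03/02/2017 "))
print(repr(clean_value("N/A")))
```

Values that match none of the assumed formats are returned untouched so they can be reviewed by hand rather than silently altered.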
Using the two clean data files from the previous exercise, write a very simple query to join them. Try your hand at a filter or an order by.
SELECT *
FROM `1_3_data_cleaning_1`
LEFT OUTER JOIN `1_3_data_cleaning_2`
ON `1_3_data_cleaning_1`.id = `1_3_data_cleaning_2`.customer_id
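A join like this one, with a filter and an order by added, can be tried locally with Python's built-in sqlite3 module. The sketch below is an illustration only: the table names and the id / customer_id columns mirror the exercise, while the name and amount columns and all the rows are hypothetical. SQLite's dialect may also differ in places from the one used on data.world.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "1_3_data_cleaning_1" (id INTEGER, name TEXT);
    CREATE TABLE "1_3_data_cleaning_2" (customer_id INTEGER, amount REAL);
    INSERT INTO "1_3_data_cleaning_1" VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO "1_3_data_cleaning_2" VALUES (1, 20.0), (1, 5.0), (2, 12.5);
""")

query = """
    SELECT c.name, o.amount
    FROM "1_3_data_cleaning_1" AS c
    LEFT OUTER JOIN "1_3_data_cleaning_2" AS o
      ON c.id = o.customer_id
    WHERE o.amount IS NULL OR o.amount > 10   -- filter
    ORDER BY c.name, o.amount                 -- order by
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The left outer join keeps customers with no matching row in the second table; their columns from that table come back as NULL (None in Python).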
