Waldek Węgrzyn / May 18 2020

Dataviz 101: introduction to data wrangling and visualisation

Medialab Katowice | Karol Piekarski & Waldek Węgrzyn | workshop for Metropolitný inštitút Bratislavy

Based on real-life cases from Medialab Katowice, you are going to follow the process of data processing, analysis and visualisation.

you will get familiar with popular formats (csv, json, API), tools for data transformation (OpenRefine) and presentation online (Datawrapper)
you will see how choosing a specific type of data visualisation may affect the way your data-driven message is seen by the final recipient
understanding the principles of visual language will help you create accurate and engaging narratives
you will learn how to publish online the research results enriched with visualizations

1. Introduction

1.1. Hello, we're Medialab Katowice

Get inspired by Medialab case studies, see how we:

used spatial analysis to explore the cultural life in Katowice
engaged the crowd at the European Economic Congress to discover the stereotypes about Katowice
designed a unique visual language to promote Katowice architecture

1.2. And who are you?

Please introduce yourself in a few words :)

your name
profession/background
position at MiB

1.3. Topic: Krížna Street

We're briefly presenting why we are interested in Krížna Street.

2. How to start thinking about data visualisation?

2.1 Let's have a look at good examples

Data visualisation as a tool for explaining and understanding has been used and developed for quite a long time. One of the best-known examples of well crafted data representation is a map created in 1869 by a French military engineer Charles Minard. His map shows the size of Napoleon's army during the Moscow campaign, along with the temperature and main river crossings:

Another inspiring example dates back to 1930s. Along with a flourishing modernism came the development of simple visual language, with the aim of explaining complex issues of social and economic growth. This is when Marie and Otto Neurath together with Gerd Arntz developed the ISOTYPE – a concept of showing numbers by the repetition of simple symbols:

Nowadays, we can see yet another revival of data visualisation. Infographics fit perfectly to interactive digital media – and the possibility to engage the reader by merging texts, charts, images into an interactive storytelling (or scrollytelling) experience is explored by initiatives like The Pudding:

So what do all these great examples have in common?

A clear idea about the message the author wants to convey
(or sometimes about the impact he wants to make on the reader)
Understanding the reader: his or her skills, knowledge and motivation
Clarity and legibility: the optimal use of different visual properties

2.2 Important questions

What's our aim?
We should start by setting a main goal. It will serve as a reference point, helping us verify ideas and compare effectiveness of different solutions. For example, it could be one of those below:
- Finding out who's parking near Krížna street
- Figuring out what consequences we may expect if the number of parking lots is reduced
- Explaining the need of reducing number of parking lots or introducing a limited parking zone
- Finding out which areas should be covered by the zone
Who's our recipient / reader / user?
This is a second thing that's necessary to craft a good data visualisation. For example:
- Residents of Krížna neighborhood
- Car owners parking near Krížna
- City officials
- Ourselves, if we want to explore the data
What's the message?
We may already know it (e.g. "Too many cars parking during working hours"), but sometimes we need to explore our data first.

For the purpose of this workshop, we decided that we're going to explain the need of limiting the number of parking lots to city officials and urban activists. It means that our reader has much more knowledge and better skills to understand the topic than a local resident. He or she will also have a higher motivation to learn about the idea – we can create an online article and send it to them. While talking to car owners, we should probably choose much simpler and more direct medium.

2.3 Exploration and explanation

Apart from explaining things, data visualisation often serves as an exploratory tool for ourselves, helping to find specific aspects, patterns or questions in the dataset.

Why should we explore data by visual means? Let's look at the example of the Anscombe's quartet – a collection of four datasets (I, II, III, IV), which seem very similar when examined by the typical methods of statistical analysis:

However, a simple scatter plot visualisation of x and y values reveals obvious differences between four datasets:

We can easily imagine that a bigger and more complicated dataset can include a lot of information that could be extracted from its visualisations.

3. Exploring Krížna Street dataset

Let's start our data-based project. This is the dataset about cars parking near Krížna Street in the form the CSV file.

full_dataset.csv

And here's our first task:

Download the dataset (icon on the right)

You may wonder why to keep the data in a plain text file when we can use more sophisticated options such as spreadsheets or databases? The problem is that they store data in a way that is not always recognized by other applications. For this reason, small data sets are often kept in ordinary text files, such as CSV, so they can be read quickly by any computer without installing additional software.

In a nutshell, a CSV (comma separated values) is no different than a typical data table inside a text file with values separated (most often) by commas – as opposed to the column boundaries in the excel table.

0 items

What kind of data do we have? Let's look at several variables / columns:

lat & long – location of spotted cars (from the GPS)
datum & cas – date and time when the car was spotted
is weekday (true or false)
time of day ("evening", "midday" or "all day")
spotted count – how many times we have spotted the car
days count – during how many days of the study we have spotted the car
registration plates – we only left first two letters, available as ecv_code
dr_okres & dr_obec – some information on the origin of the car (but not available for all cars)
address of the owner – removed from this dataset
distance – from the parked car to the owner's address
distance category – set arbitrarily by us, based on the distance

However, it's extremely hard to read it from the table. Let's browse for interesting topics (and answers, if possible) by visual exploration.

3.1 Exercise: a basic exploration

A quick file inspection before you start working with the data is a good habit for any analyst. It will help you get a sense of what’s inside the data, discover missing records or spot other errors that may have occurred while writing data to the file. If you don't have more advanced software at hand, you can quickly preview the CSV file content by using online tools.

Let's start with the most basic file inspection using WTFcsv:

Go to https://databasic.io/en/wtfcsv/
Click the [ upload a file ] tab (yellow, as everything here)
Click [ upload a file ] button (white)
Select the file with our dataset
Click [ ANALYZE ]

You should see a bunch of small charts. They are generated automatically for each column, basing only on the guess of the data type (if it's a number, text or date) – that's why some of them make sense, and some don't. Try to examine the charts and look for any information.

Looking at the results, we may start considering the questions like:

Are all the cars from Bratislava?
If not: what's the number of local cars to cars from outside of Bratislava?
Where do they come from?
How long do they remain there?
Are there more or less cars during working hours?
Is it easy to park near your house?
Are they mostly private cars?

Two areas that may look interesting to us:

Origin of cars (locals, from other parts of Bratislava, from outside)
Changes over time of a day (morning, evening, weekday)

TIP: There are other exploratory tools, like

Nineteen – often useful for survey data
Pandas Profiling – a bit more advanced, see the results for our dataset

Of course it's perfectly ok to use any visualisation tool for the data exploration stage – starting from simple spreadsheet charts, through online tools (like Datawrapper, Flourish, RAW Graphs etc.) up to programming languages or libraries like Python, R or D3.

3.2 Exercise: Exploration on a map

While previewing the file in WTFCSV, we'be been trying to organize cars on Krizna by looking at different features we collected in a form of micro-summaries. Regardless the tool used, it's worth mentioning that any data can be organized in five general ways, as specified by the so-called LATCH method by Richard Saul Wurman:

Location
Alphabet
Time
Category
Hierarchy

Of course each of the five may adopted in multiple ways (like time of day or time of week, category of the car owner and category of car origin, etc.) – but it's very useful to consider all five categories when looking for interesting topics in your dataset. By the way, the Napoleon's Russian campaing chart by Minard mentioned in the very beginning, is a brilliant example of combining location, time and category on a single data visualization.

Since the location is a strong feature of our dataset, let's use the tool that can generate a detailed map –like Carto.

Feel free to explore the map and widgets below the map

(Probably it'll be easier for you to use the fullscreen version)

Obviously, one dot represents each car spotted during the study. At the bottom of the map you'll find a number of widgets, each one representing one column, same as in WTFcsv.

By clicking on the bars, you're able to filter points on the map. Feel free to explore it. Is there any relationship between the place where the cars are parked and the categories (like distance category, spotted count or time of day)?

Seeing the study outcomes on a map can lead to interesting discoveries. You probably already have some ideas – we will try to get back to them later on.

Now, let's focus on the basic, yet interesting comparison, that we've already seen in both WTFcsv and Carto Widget: How many spotted cars come from the neighborhood, from other districts or from outside of Bratislava?

To speed things up, we've prepared a separate file which contains counts of cars for all these categories. Please download the file.

day1_distance_categories_sums.csv

0 items

[ short break! ]

4. How should I present my data?

4.1 Exercise: choosing a chart

Let's start with a simple exercise:

Go to www.datawrapper.de (log in if you have an account)
Click [ Start creating ]
Upload your data (you can simply drag the file you previously downloaded)
The next step is to check if data has been imported properly. You should see a table created. Let's leave it for now. Click [ Proceed ]
Now you probably see a sadly dropping line chart, not very informative. Try different chart types available on the left.

Is a pie chart better choice than a stacked column chart? Or maybe a simple bar chart?

No matter if you're exploring data in search for information, or explaining something you already know to others, you'll quickly face a question: what type of visualisation should I pick?

There is a lot of data visualisation conventions: charts, diagrams, maps and good practices. Some of them are used specifically for one kind of data (like OHLC chart), others have been widely used for a long time and are easily understood by most of us (like a line chart or a bar chart, both introduced by William Playfair in 18th century).

Obviously the decision to choose one type of visualisation depends on your data, your reader and the context. Instead of sticking to one solution, you shouldn't be afraid to experiment: try one kind chart and if it's not exactly showing what you wanted, move to a different one. However, there are a few questions that may help narrow down the number of chart types you want to try.

The first thing to consider is the nature of the data we have:

Is it quantitative (like a number of cars) or categorical (like a name of a district)?
For categorical data: is it nominal or ordinal (like consecutive observation days)?
Is there any hierarchy? Are we talking about smaller elements that belong to bigger groups?

Secondly, we should be aware of what we want to focus on:

Do we want to compare numbers, sizes, quantities?
Are we talking about parts of a whole, or rather separate objects, groups?
Is it more important to see exact values in detail, or rather to have a general overview over outliers, extremes or patterns?
Do we want to show relationships, connections?

There is a number of great resources available on the web regarding the choice of different visualisation types – it's highly recommended to have a look:

The Data Visualisation Catalogue (try "Search by Function")
From Data to Viz
Interactive Chart Chooser
Visual Vocabulary (use tabs on the top)

4.2 Same data, different approach

Coming back to the example we're working on in Datawrapper – even our very simple case may be approached in several possible ways.

Approach 1: Compare number of cars for all 6 categories (focus on differences)

Choose the bar chart

With this approach, you enable the reader to easily compare values between categories, even if differences are not big:

Approach 2: Show categories as part of the whole (focus on proportions)

Choose the stacked column chart
...or choose the pie chart / donut chart

Whats the difference? We could spend more time discussing if stacked-like charts are better than pie-like charts. For now, let's just say that stacked chart may be a safer choice if there are many categories. See:

There are some arguments against pie (or donut) charts (we'll try to cover that topic later). However, they may work better, when there are few categories (Datawrapper limits them to 5 by default, but you can change it), especially for showing values like 1/4, 1/3 or a half, which are easily recognized as section of the circle.

So if you decide to focus on one particular category, like people parking near home, a pie (or donut) chart may work well:

This example was meant to show that there is no golden rule for choosing a visualisation type, even for simple cases – it always depends on the perspective you want to adopt for showing the topic.

4.3 Making a good use of a chart: the next steps

As you may noticed, examples above not only differ in terms of chart type. There are also other tweaks applied to them. Even for a basic chart there is quite a few details to concern.

Let's finish our chart in Datawrapper. While staying in "Step 3: Visualize" we're going to focus mostly on the tabs [ Refine ] and [ Annotate ].

Create the title. You can do it by double-clicking the title field or in tab called [ Annotate ]. The title often turns out to be helpful in making other design decisions – defining it in the very beginning is a good practice.
Think of any additional information that may be obvious for you, but not for the reader. For example: What's the unit we use for the measurement? What time and place are we talking about? How was the data collected? You can explain it in the [ Annotate ] tab by filling out "Description" or "Notes" fields.
Stacked column chart: [ Refine ] tab -> 🔘 Stack percantages
Pie chart: [ Refine ] tab -> "Labels" section -> 🔘 Convert values to percentages
Are the categories or labels easy to understand? In order to change them, navigate back to "Step 2: Check & Describe" and manually edit categories or column titles by duble-clicking them. Obviously, for a more complicated chart it's better to do it when preparing the data.
Think of the use of colour. The colour you use for bars or sections may be neutral, but it may also convey some meaningful information. We will talk more about it tomorrow. For now, let's consider one case: you may want to group some elements by making them visually similar. For example categories "70 m" and "Krížna surroundings" both mean "quite close". Coloring them both in red will emphasize that similarity.
Bar chart: [ Refine ] tab -> "Appearance" section -> 🔘 customize colors -> select a category and choose a color (square drop-down to the right)
Stacked column chart: [ Refine ] tab -> "Appearance" section -> [ customize colors... ] -> select a category and choose a color (square drop-down to the right)
Pie chart: [ Refine ] tab -> "Pie slices"/"Donut slices" section -> [ customize colors... ] -> select a category and choose a color (square drop-down to the right)
Sorting (arranging) your data in some specific order makes a huge impact on the perception of the chart. You may decide to sort categories basing on the number of cars, or arrange them according to the distance (starting from "70 m", ending with "Abroad"). Please notice how this decision affects the way we perceive the chart.
A. Automatically sort by values:
Bar chart: [ Refine ] tab -> "Sorting & Grouping" section -> 🔘 Sort bars
Pie chart: [ Refine ] tab -> "Pie slices"/"Donut slices" section -> "Sort by" -> 🔘 Largest first
Stacked column chart: You can't automatically sort data on vertical axis, but you can...
B. Manually sort according to custom order:
Navigate back to "Step 2: Check & Describe" and manually edit categories or column titles by duble-clicking them. Obviously, for a more complicated chart it's better to it do when preparing the data.
There is a lot of small visual improvements you may want to consider, like showing values and labels directly on the chart or using color key (legend), positioning labels, formatting numbers, changing thickness of bars or separating them by lines. In general, try to keep things simple.
For example: if you can place labels directly on the chart, don't use color key (legend). It requires more effort from your reader to decode and remember colors.
Give yourself time for experiment and try different results. We will also discuss some of this topics later.
Changing "Output locale" (either in [ Design ] tab or in "Step 2: Check & Describe") will affect all country-related features (like date and number format or the language of automatically generated categories like "Other").

In the end, you should always provide information on the source of your data. It makes your chart more reliable, but also allows your reader to conclude on facts that are not related to the visualisation itself. You can provide information on data source and chart authorship in the [ Annotate ] tab.

4.4 Publishing your results

When you're done with all the improvements, it's time to make the chart visible to others. Using the free version of Datawrapper, you have 3 possibilities:

Download your chart as .png file
Publish your chart as a standalone webpage and share the link
Embed your chart (as an iframe) on your webpage

Let's go to "Step 4: Publish & Embed"
Click [ Publish chart ].

If you're logged in, your chart is saved from the beginning, so you can access it later in your Dashboard. You can also duplicate the chart and modify the copy.

If you don't have an account, you will need to enter your email address in order to receive a link to the chart. Please copy the chart embed code and keep it in a safe place.

If you want to export the chart and edit it on your own, Datawrapper paid plan makes it also possible to download vector .pdf or .svg files, which can be edited or modified using vector editing software, like Adobe Illustrator. There are many other free dataviz tools that may suite your needs, like RAW Graphs, Flourish or D3.js.

4.5 Thinking outside the chart

While this online workshop has some limitations and we want to show you classic and well-working solutions, it's very important to remember that charts are nothing more but proportional visual representation of your data. You can represent your data, or the key information, in any other visual (or even non-visual) way, until only it's proportional.

When thinking on data-driven project outcomes, don't limit yourself to charts and infographics. Depending on your goal and recipient, you may consider

pure-text solution, like a poster summarizing the problem in one perfect sentence ("One third of these cars stay here unused for 7 hours a day")
a photographic comparison of parking situation during different times of day and days of week
a well-calculated information for a local inhabitant ("Paying extra parking fee will give you 15 minutes less on finding parking spot. On average it will be 80 meters closer to your place")

5. Transforming data the easy way

5.1 Exercise: data cleaning and transformation

Before data scientists conduct any study, they usually spend 80% of their time cleaning and transforming data – otherwise, they may be useless, resulting in misleading visualisations.

In this exercise you will explore OpenRefine, a tool used by researchers and data journalists, first created by Google, currently developed as an open source project. It will help you automatically clean your data, correct language inaccuracies, and perform table transformation from wide to long format.

-> OpenRefine will work in your web browser (after download and installation)

1. Import data and create new project

You can import data to OpenRefine directly from any online source (url) like Google Sheets or upload files in different formats (e.g. CSV, JSON, xlsx). Let's look at the original data set that has not yet been preprocessed for the analysis.

Select [ Browse… ] and upload the day1_openrefine_raw_dataset.csv file
Click on [ Next ] » Enter the project name and check out available options » Click on [ Create Project ]

2. Handle missing data

Select [ ▾ latitude ] » Facet » Customized facets » Facet by blank (null or empty string)
Click on [ include ] true values

Select [ ▾ All ] » Edit rows » Remove all matching rows

Tip: Use [ ▾ All ] to perform table-wise operations like removing or shifting rows/columns

3. Make common data cleaning tasks

A common job of data analyst is to remove trailing spaces or split values from one to multiple columns. Many of those predefined actions can be performed in OpenRefine much faster than by typing formulas into a spreadsheet.

Select [ ▾ dr_obec ] » Edit cells » Common transforms » Trim leading and trailing whitespace
Select [ ▾ datum ] » Edit column » Split into several columns (using separator)
Select [ ▾ ecv ] » Edit column » Split into several columns (using field length)
Select [ ▾ dr_okres ] » Edit cells » Transform… » value.replace("Okres", "").trim()
Select [ ▾ dr_okres ] » Facet » Text facet – edit manually Bratislava-staré mes

Explore other available options as well.

4. Clean up and standardise text data

If you work a lot with survey data, you may have a hard time cleaning up the records – e.g. street names or education background – entered manually by survey participants. Losing hours to make the names written in dozens of ways consistent? (e.g. Bratislava, BRA, Bratslava, Bratislava, Bratislava, Bratislava). This is where OpenRefine really shines – using algorithms that analyse natural language to help you tackle this otherwise tedious task.

Select [ ▾ dr_ulica ] » Facet » Text facet
Edit a few clustered values manually
Click on [ Cluster ] and test out available algorithms – check out: Method [ ▾ Key collision ] and Keying Function [ ▾ fingerprint]
Select [ ▾ poznamka ] » Facet » Text facet
Rename manually blank to SLOVENSKO

5. Transpose table from wide to long

If you've ever dealt with a spreadsheet, you're probably used to storing data in a wide table with multiple columns. However, many other tools, e.g. for data visualisation, require data in a long form with only one value in a single row. While completing this transformation using spreadsheets is not a trivial task, with OpenRefine it will only take a couple of minutes.

Import new dataset (this time it's a file from the Medialab's research)
Select [ ▾ label ] » Transpose » Transpose cells across columns into rows…
Choose columns to transpose and set new column names. Make sure to tick ✓ for “Fill down in other columns”

TIP: Read more about tidy data or see this video.

Untidy data (example)

0 items

Tidy data (example)

0 items

6. Transpose/pivot table in Google Sheets

We have just learned how to tidy up (stack) our data. While this can be done in OpenRefine, we will move for a moment to Google Sheets to discover a very popular and extremely useful feature of creating pivot tables.

Let's assume that we are interested in how long and at what times of a day the vehicles from different locations were parked at Krížna Street.

Inside a spreadsheet select [ ▾ File ] » Import – to upload the CSV file
Select [ ▾ Data ] » Pivot table
Select the appropriate values In the Pivot table editor:
In rows [ Add ] time_of_day and distance_cat columns
In values – select id, although it could be any other column with no missing values

7. Apply the same cleaning patterns to another file

Okay, let's go back to OpenRefine. Suppose you performed a number of operations on the data, then your data set has been updated with new values. Fortunately, you don't have to do all the work from scratch. You can save a list of executed steps and apply them again to another similar file.

Select [ Undo / Redo ] and than [ Extract… ]
Copy the contents of the right-hand window manually (use Ctrl + a)
Save it to a text file using your favourite editor
Create a fresh project in OpenRefine
Select [ Undo / Redo ] and than [ Apply… ]
Paste the formula from your text file and click on [ Perform Operations ]

8. More advanced processing (optional)

Select [ ▾ latitude ] » Edit cells » Common transforms » To number
Select [ ▾ latitude ] » Facet » Numeric facet
Check [ Non-numeric ] only
Select [ ▾ latitude ] » Edit cells » Transform...
Try out several options, e.g.:
value.replace(' ','').splitByLengths(2,100).join(".")
value.replace(' ','').replace(/(\d{2})(.+)/, "$1.$2")

9. Parsing data from large files (optional)

If you work with larger files containing over a million rows, your spreadsheet may not be ready for such challenges. However, OpenRefine should be able to handle them, and unlike a spreadsheet, they can come from xml or JSON as well (please note that you will have to allocate more RAM memory to OpenRefine). This makes OpenRefine a powerful companion to popular commercial data processing applications.

Search online for a large CSV file of more than 1 mln rows or use the Urban Dictionary corpus (2.6 mln rows) available at Kaggle (you need to sign up first).
Allocate more RAM to OpenRefine (e.g. 3072M)
Load the file using OpenRefine. Start with 1 200 000 rows to make sure your machine can handle it (to do this, tick Load at most 1200000 rows of data)
See which functions are possible to perform. You will most likely be able to work with numbers or split text cells but more complicated operations – e.g. text facets or clustering – can be difficult to perform

See OpenRefine documentation.

[ short break! ]

6. Where do charts come from?

Popular types of charts, like a line chart, a bar chart, a scatter plot or a pie chart are widely known, understandable and available in many datavis tools, from Excel and Datawrapper to Tableau. We've already seen that switching between them may help us look at the data from different perspectives.

Apart from classics, there are also other chart types, a bit less popular, but still very easy to read – like Isotype charts (and dot-matrix, a close relative), a flow chart or a heatmap.

What makes these examples of information design so universal and effective? A quick analysis of visual principles that are the very base of any data visualisation will give us more freedom and more understanding for crafting better messages.

6.1 The visual grammar

The Visual grammar presentation will be the final part of the first workshop day. You can download the pdf to follow it live during the presentation. Of course, you can also keep it after the workshop.

visual_grammar_presenattion_for_mib_v03.pdf

To summarize, let's go back to illustrations above and try to list all visual means used in the examples. If you'd like to look for more ways of visualizing numbers, see also this article.

That's it for today! Thanks! Tomorrow we'll start from here.

Unless otherwise indicated, all materials are published under Creative Commons Attribution 4.0 International license CC BY 4.0.