Waldek Węgrzyn / May 18 2020

with KKarolow

Remix of Dataviz 101: introduction to data wrangling and visualisation by

Waldek Węgrzyn

Dataviz 101: day two

Medialab Katowice | Karol Piekarski & Waldek Węgrzyn | workshop for Metropolitný inštitút Bratislavy

This is the second part of the workshop, see the first section here.

1. Documenting your work

After yesterday, you are already able to clean and prepare data for analysis, and you know several visualisation methods. Great! There are more exercises ahead of us, but first let's create a draft online article in which you will present the results of your research. To do this, you will need a Nextjournal account.

On this occasion, you will learn about the idea of reproducible research and why it is worth publishing detailed research data online.

2. Communicating with visual language

Let's try to examine the workshop topic a little further. Not only do we know where the cars on Krížna come from, but we also have some information on the time of day they were spotted. It might be a good idea to combine these two variables.

You already know how to make such calculations in Google Sheets using a pivot table (if you need a refresher, see exercise 5.6 Transpose/pivot table from long to wide from day one). Yesterday you prepared a dataset with two variables: time_of_day and distance_cat. The result looks exactly like the table below.


As you can see, we have counted how many cars from each distance category were spotted in the evening, at midday, or during both examinations. The table is stored in long format, so it has only three columns but quite a lot of rows. Let's download the file:

day2_time_of_day_distance_cat.csv
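
If you prefer code to spreadsheets, the difference between long and wide format can be sketched in a few lines of Python. This is only an illustration using the standard library: the column name `count`, the helper `long_to_wide` and the sample numbers are assumptions made up for the example, not the real contents of the file.

```python
import csv
import io

# Made-up sample in long format; the real file may use other names and values.
LONG_CSV = """time_of_day,distance_cat,count
evening,~70 m,12
midday,~70 m,7
all day,~70 m,20
evening,Abroad,3
midday,Abroad,5
"""


def long_to_wide(rows, row_key, col_key, value_key):
    """Pivot long-format rows into {row_label: {column_label: value}}."""
    wide = {}
    for row in rows:
        wide.setdefault(row[row_key], {})[row[col_key]] = int(row[value_key])
    return wide


rows = list(csv.DictReader(io.StringIO(LONG_CSV)))
wide = long_to_wide(rows, "distance_cat", "time_of_day", "count")

# One line per distance category, one entry per time of day.
for distance_cat, by_time in wide.items():
    print(distance_cat, by_time)
```

In the long table every combination of the two variables is a separate row; in the wide result each distance category becomes one row and each time of day becomes a column.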

2.1 Exercise: Deconstructing a chart

Let's try a different tool, called RAW Graphs. It was created as a supplementary tool for people working with vector graphics (e.g. Adobe Illustrator) – therefore final results are simply raw and need further work before publishing. However, it's perfect if you want to have a quick look under the hood of a chart generator. We'll see how some aspects of data can be remapped to different visual means.

Go to https://rawgraphs.io/
Click [ Use it now! ]
Select "Upload the file" on the left and drag the file from your folder (or click to find it)

Note: After half a second, data is imported. RAW Graphs accepts long data format (which we already have). In case you had a wide format, there is a small link just below the data table on the right, which lets you transform your table.

Scroll down to see the list of available charts

Note: This tool offers less typical charts – perfect for experimenting, but it is harder to get a presentable result than with Datawrapper

Select [ bar chart ]
Scroll down to see columns from your data available on the left as green rectangles
Drag the first part of your data (let it be [ distance category ]) onto one of the free slots available on the right
As you can see, only the first slot ("X Axis") gives us some result, which is visible below
After you have mapped [ distance categories ] to "X Axis", let's drag [ number of cars ] to the "Height" field

Below, as a result, you should see a bar chart very similar to what we did yesterday with Datawrapper.

Now let's drag [ time of day ] onto "Groups". Finally we see some new information!
On the left side of the chart, check 🔘 "Use Same Scale", so that we can also compare the height of bars between groups

What we see is a variant of a grouped column (or bar) chart. It is somewhat similar to a stacked column chart, but the segments of columns (or rows) are aligned on a common bottom axis – which allows us to compare bars within each time of day very precisely. That wouldn't be possible for the middle row ("evening") if we created a stacked column chart. It's a highly detailed comparison.

Ok, but we could have done the same with Datawrapper, and probably it would look better! Let's go further and hack the tool just a bit – we'll try to create a heatmap, still using a bar chart option. You can do it in just two moves:

Take [ number of cars ] from "Height" and drag it onto "Colors"
On the left side of the newly generated chart, find "Color Scale" and change it from [ Ordinal (categories) ] to [ Linear (numeric) ]

Now, instead of bars, we have a heatmap with the color brightness proportional to the number of cars in each category. To make it more compact, you can change the height of the chart to 300 on the left.

Note how the perception of the data changed. Instead of a detailed comparison (which is no longer possible with colors), we have a general overview. Now we focus on extremes (a lot of cars parking all day close to the apartment) and patterns (a similar number of cars from other districts of Bratislava for all three categories).
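
To make the idea of a linear (numeric) color scale more tangible, here is a minimal Python sketch of how a number can be mapped onto color brightness. It is not how RAW Graphs is implemented internally; the function name `lerp_color` and the two end colors are assumptions chosen for illustration.

```python
def lerp_color(value, vmin, vmax, low=(255, 255, 255), high=(8, 48, 107)):
    """Map a value from [vmin, vmax] linearly onto a color between low and high."""
    if vmax == vmin:
        t = 0.0
    else:
        t = (value - vmin) / (vmax - vmin)
    t = max(0.0, min(1.0, t))  # clamp values that fall outside the range
    rgb = [round(a + (b - a) * t) for a, b in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical car counts: the smallest gets the lightest color,
# the largest gets the darkest one.
counts = [2, 10, 25, 40]
for n in counts:
    print(n, lerp_color(n, min(counts), max(counts)))
```

Small differences between neighboring values produce almost identical colors, which is why the heatmap is good for an overview but poor for precise comparison.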

It's very convenient if you have a lot of data to show, but detailed differences are not that important – like in GitHub shown yesterday:

What you just did using this basic example is an essential activity for every data visualisation project – it's about mapping different features of data onto visual means in a way that will be the most effective for your message. This way of looking at charts and infographics is the best way to assess their effectiveness and to choose a proper solution.

If you want to see how different people propose different ways to visualize the same topic, have a look at Makeover Monday.

Now you can save your work and publish it on Nextjournal. It's not possible to directly publish charts from RAW Graphs, but you can download the image:

Navigate to the very bottom of the page
On the left, under "Download", select [ Image (png) ]
Type a file name
Click [ Download ]

You can also download scalable vector graphics for further editing (or simply copy the SVG code and paste it into Adobe Illustrator). For now, a png will be enough – you can embed the image in your Nextjournal article.

2.2 Keeping things simple

It's very easy to be overwhelmed by the possibilities of creating fancy and attractive charts. Instead of creating rich infographics just to make the topic attractive, it's better to enhance the informative value of your project. In most cases, that can be achieved with simple solutions.

Comparing the number of visual features with the amount of information conveyed is usually the best way to assess the quality of an infographic. This proportion of data to visualization is commonly described as the data-ink ratio (a term proposed by Edward Tufte).

2.3 Exercise: Crafting the message

All the charts we have created so far focused on a summary of some quantitative data. However, exploring the map also seemed promising. Now that we have some experience in creating basic visualizations, it's time to create a map out of the full dataset:

full_dataset.csv

We're going to use the same tool we already tried at the exploration stage.

Go to http://carto.com/
Log in
Click [ New Map ]
We don't have any dataset to work on – go to "Upload"
Drag&drop or browse your file
Click [ Upload dataset ] at the bottom
You should see the map with all the data-points (car locations). Zoom in!
On the left, you should see one data layer (our dataset) and one basemap. Let's click on the basemap ("Voyager" selected by default)
Choosing a basemap is the first decision. It should provide only the information that's relevant for your project, and not disturb seeing your data. Select one!
After you choose a background, click "<- Back" at the top to return to layer list
Now click on the data layer.

For now, let's stay inside "Style" tab:

The first available option is "Aggregation". Default setting to show data "by points" is probably the best for us, but you can try other options.
Below, in the "Style" section, you can change the points by setting size, color (and shape), stroke, blending and by adding a label.
Both size and color may depend on your data. To calculate the size, you need quantitative (numerical) data, but to change the color, categorical data (nominal or ordinal) will also work.

Now, let's map some values from our data onto visual features of our points. Any ideas? From yesterday's exploration we can remember columns like:

slovak district (with or without Bratislava districts)
is weekday (true or false)
time of day ("evening", "midday" or "all day")
spotted count – how many times have we spotted the car
days count – during how many days (1, 2 or 3) have we spotted the car
distance – from the parked car to the address
distance category

You can try mapping values from these columns onto the size and color of our points by choosing "🔘 by value".

Let's define the point size "🔘 by value" and select "spotted_count" in the drop-down below
The more often the car has been spotted, the bigger the circle. You can change the minimal and maximal size of the circle.
As you may have noticed, some circles overlap each other. We may try to use "blending" at the bottom (change it to "multiply") to see overlapping points.
After trying that, let's undo the changes: fix the size of the dot and set "blending" back to "none" – just to make our next steps easier. You can come back to this point later.

Ok, let's talk about colors.

Start by setting "Point color" to "🔘 by value" and select "distance_cat" column in the drop-down below
We can use one of the default color schemes to distinguish the distance categories
We can also change the color individually for each category by clicking the small color sample next to the value

After playing with color samples for a while you may realize that it's not that easy to use many colors without making the map cluttered and chaotic. Let's stop for a moment and talk about it.

[ short break! ]

2.4 The use of color: good practices

In general, it's hard to discuss colors. A lot of factors affect the way we perceive color – certainly even the background of this website seems different for each of us, not only (but also) because of the different screens we're using. There are, however, some rules to follow when choosing a color for datavis:

Legibility first
- make sure that the contrast between important elements and the background is high enough – especially when text is involved. For more information about text on colorful backgrounds have a look here.
- think about older readers and people with color blindness
- there are online tools like Colorbrewer or I want hue that can help create a legible palette of colors
The role of color in your project
- try to think what kind of differences (or similarities) you want to show. To use color for distinction is often the first choice, but it's not always the best – consider using other means like position, area, shape, visual grouping or sorting
- consistent use of the same color for the same thing or category may be a good idea, not only for one datavis, but also for the whole article or publication
- as you can remember from the visual grammar presentation, color (especially of different hue) is not accurate for presenting quantitative data
- we're also not very good at memorizing colors – even with categorical data it's hard to encode many values with color. A rule of thumb is that we can memorize (and then distinguish) between 5 and 7 colored categories
Meaning of colors
- remember that colors you want to use are probably already associated with some concepts (the simplest example is blue for "cold" and red for "hot", or red for "danger")
- the meaning of the color is always based on the context. Depending on the topic of the article (or other colors used), green may be associated with "free way", "vegetarian" or "polluted water". Meaning of a color is also different in different cultures
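
The "contrast is high enough" rule can be checked with numbers. One common way to do it (not prescribed by this workshop, so treat it as a swapped-in technique) is the contrast ratio defined in the WCAG accessibility guidelines, which recommend at least 4.5:1 for normal-sized text. The sketch below uses only the Python standard library; the function names and the example colors are illustrative assumptions.

```python
def _linear_channel(c8):
    """Convert one 8-bit sRGB channel to its linear value (WCAG formula)."""
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """Relative luminance of a color given as '#rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return (0.2126 * _linear_channel(r)
            + 0.7152 * _linear_channel(g)
            + 0.0722 * _linear_channel(b))


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio between two colors, from 1 (same) up to 21."""
    l1, l2 = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


# Example colors are arbitrary: dark gray text passes, pale yellow text does not.
for text_color in ("#333333", "#ffff99"):
    ratio = contrast_ratio(text_color, "#ffffff")
    print(text_color, round(ratio, 2), "ok" if ratio >= 4.5 else "too low")
```

A check like this does not replace looking at the chart on a real screen, but it quickly flags label and background combinations that are likely to be hard to read.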

Consider this example of a map showing voter turnout by region in Poland (in a public election a few years ago) and try to answer the question: Was the turnout in western Poland higher than in the eastern part?

Hint: not only is red used for "less", but the color key (the legend) is also turned upside-down. Both color and position are used in a misleading way, so you need to read the small numbers to understand the information.

Let's go back to the map we're working on. What conclusions can we draw, based on the rules mentioned above?

There are simply too many colorful points and we're not able to identify any patterns. We need to set priorities by asking a well-known question: What do I want to show? and then highlighting one thing at a time.

The first approach may be to reduce the number of categories by grouping similar ones. We have separate categories for "about 70 m" and "Krížna surroundings" – is that difference really important? For example, we could come up with two main groups:

cars from the neighborhood = ~70 m, Krížna surroundings and Staré Mesto
cars from outside = all the other categories
and a "no data" (null) group
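
The grouping above is just a mapping from many category names onto a few group names. Here is a minimal Python sketch of that idea; the exact category spellings in the dataset and the group labels are assumptions, so adjust them to the values you actually have in distance_cat.

```python
# Assumed spellings of the original categories -> proposed group.
GROUPS = {
    "~70 m": "neighborhood",
    "Krížna surroundings": "neighborhood",
    "Bratislava-Staré Mesto": "neighborhood",
    "Bratislava-other": "outside",
    "Slovakia": "outside",
    "Abroad": "outside",
}


def regroup(distance_cat):
    """Return the simplified group for one distance category."""
    if not distance_cat:  # None or an empty cell
        return "no data"
    # Any category not listed as neighborhood counts as "outside".
    return GROUPS.get(distance_cat, "outside")


sample = ["~70 m", "Abroad", "", "Krížna surroundings", None]
print([regroup(value) for value in sample])
```

With three groups instead of seven categories, two contrasting colors plus a neutral one are enough for the whole map.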

To create a color set for these two groups, we may start with the Colorbrewer tool.

Go to colorbrewer2.org
Choose [ 7 ] data classes at the top (as we have 7 categories)
Select 🔘 diverging palette, which will give us 2 opposite groups
Check 🔘 colorblind safe
Pick one of the color schemes

As a result, you should have a list of 7 colors available as #hex codes, which you can now copy into your Carto color scheme (BTW, there are also more advanced export options available in Colorbrewer). I decided to use:

dark blue for ~70 m
middle blue for Krížna surroundings
light blue for Bratislava-Staré Mesto
gray for null
light red for Bratislava-other
middle red for Slovakia
dark red for Abroad

Then, we need to

copy each of the 7 #hex color codes from Colorbrewer to the Carto color scheme (one per category). Note that you can quickly switch between categories by clicking the small color swatches:

You may decide that some differences between these categories are less important, and some are more interesting – feel free to adjust your colors in Carto to make it clear. You can even choose just two colors and ignore nuances.

Here's an example of the map I created – we start to see some differences between red-dominated and blue-dominated streets or localities.

However, a careful map reader may notice that some points simply overlap others, which may be a bit misleading. That leads us to the second approach: do not show everything at once. Instead you may consider:

adding interactivity, which will let your reader filter data and focus on one thing at a time. The downside is that your reader needs to be curious enough to use it, otherwise she or he simply won't click it
using more than one map (or chart). There is nothing bad about creating more of them to present one topic. Using many less-detailed charts instead of one is a popular trick in data visualisation, and it's called small multiples.

We have already used interactivity features in Carto yesterday, during the first exploration stage. Let's quickly recreate it:

Click "<- Back" to exit data layer
Select "Widgets" tab
Click [ + Add new widget ]
From the list of available data columns, select the one we're investigating (distance_cat)
Click [ Continue ]

Now we're back with our map, with an additional panel that lets us select categories:

Click "~70m" and "Krížna surroundings" to show only cars from these two categories. After you publish the map, your user will be able to do the same.
Click both categories again, or click "All" in top-right corner of the panel, to reset the view.

Unfortunately, this kind of widget shows at most 6 categories at a time, while we have 7. You can look for the others using "search in 6 categories". A cleaner solution would be to prepare the data with a maximum of 6 categories.

You can decide to leave the widget for the reader to explore, or to copy your map and publish two versions (minimal small multiples approach), with pre-selected categories.

Note: For the second case, category filtering can be also done in non-interactive way by navigating to your data layer -> "Analysis" tab -> [ + Add new analysis ] -> "Transform" tab -> "Filter by column value". However, if you want to create multiple non-interactive maps, the best solution would be to prepare separate datasets.
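
Preparing separate datasets by hand is tedious, so here is a hedged Python sketch of how one table could be split into one CSV per category. The column name `car_id`, the helper names and the sample rows are made up for illustration; only `distance_cat` comes from the workshop data.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up sample standing in for full_dataset.csv.
SAMPLE_CSV = """car_id,distance_cat
1,~70 m
2,Krížna surroundings
3,Abroad
4,~70 m
5,
"""


def split_by_column(rows, column):
    """Group rows by the value of one column; empty values become 'no data'."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column] or "no data"].append(row)
    return dict(groups)


def to_csv_text(rows):
    """Serialize a list of row dicts back to CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
groups = split_by_column(rows, "distance_cat")

for category, members in groups.items():
    print(category, len(members))
# Each to_csv_text(members) result could be saved as its own file and
# uploaded as a separate dataset, giving one map per category.
```

Each resulting file can then be styled on its own, which is the small multiples approach in its simplest form.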

The last two steps before we publish our map should be to explain to the reader what's actually presented on the map, similarly to what we did with Datawrapper charts.

While staying inside data layer, let's

Go to "Pop-up" tab
Make sure the first section ("click") is underlined (not "hover")
Choose a pop-up style
Select categories to show in the pop-up (e.g. distance_cat, time_of_day and slovak_district)
Change the labels of the categories you selected to make them more informative
You can change the order of the labels by dragging them

Now, all of the information you selected is available after clicking any of the points. You can also select some of the information to appear while hovering the mouse cursor over a point. To do this:

Go to "hover" section
Repeat steps you made in the "click" section

Finally, consider adding a legend – it's recommended, although some maps can be understood without one (or with pop-ups only)

Go to "Legend" tab
Select "Custom legend" style
Add the title
Edit category labels
Drag the categories to arrange them in a meaningful order
Consider changing the name of your data layer, since it's displayed as the header of your legend

Decide about the additional options:

Click the "Map options" button on the left (below the pen)
Decide what to include. If the screen space is limited, you can simply remove everything besides "zoom controls"

Finally, we're ready to publish our map and include it in the article.

Click "<- Back" to exit data layer
Click [ Publish ] at the bottom
Click [ Publish ] once again (you need to click "Update" here in case you make any changes to your map)
(if nothing happens, refresh the page)
Copy the embed code and place it in your article

2.5 Adding context

Let's make a little summary of what we have already done with our data:

Explored it in WTFCSV and Carto
Created basic charts showing distance categories (bar chart, stacked bar or pie chart) in Datawrapper
Transposed the table from long to wide to create a time_of_day vs distance_cat summary in Google Sheets
Used the newly created summary to create a heatmap in RAW Graphs, presenting how distance categories vary for different times of day
Crafted an interactive map in Carto showing where cars from selected categories actually park

We made it all using our primary dataset. Can we go even further and use external resources to support our article?

3. Data workflow in a nutshell

3.1 Exercise: from data acquisition to visualization

Now that you have explored the various formats and methods of data processing, are you ready to implement a data-based micro project? In this exercise, you will perform several key activities of a data analyst, completing the full workflow step by step – from data acquisition to visualization. This time you will preprocess the data in Google Sheets.

We obtained the registration numbers of vehicles parking on Krížna Street. Our task is to match the two-letter codes (e.g. BA, BB, KE) to their regions and then count the cars from different parts of Slovakia. Where would you expect to find such data? It turns out that the registration codes can be easily scraped from Wikipedia.

Find Slovak registration plates in English Wikipedia
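Use this formula to automatically scrape the data: =IMPORTHTML("wikipedia_url_here", "table", 1). The last argument is the index of the table on the page; Google Sheets counts from 1, so 1 stands for the first table found on the website. Increase it if the first table is not the one you need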
Copy-paste the obtained table to a new sheet (name it lookup_table) by selecting [ ▾ Edit ] » Paste special » Paste values only
Split multiple codes (e.g. BB, BC, BK) from single cell to the following rows, adding more records
Import data from full_dataset.csv file
Use the =VLOOKUP function to match the codes from ecv_code to the lookup_table created earlier, and store the results in the slovak_district column
You may want to use the =ARRAYFORMULA function to apply the lookup automatically to multiple rows: =ARRAYFORMULA(VLOOKUP(col_name2:col_name,lookup_table!A:B,2,false))
Select [ ▾ Data ] » Slicer – to filter out rows from poznamka column (we are only interested in Slovak registration numbers)
Select [ ▾ Data ] » Pivot table – to count the values from slovak_district column
Rename column names to slovak_district,count
Select [ ▾ File ] » Download » Comma-separated values
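
The core of the spreadsheet steps above (splitting multi-code cells, looking up each code, counting per district) can also be sketched in Python. This is an illustration with a tiny hand-typed sample, not the full Wikipedia table or the real dataset; the variable and function names are assumptions.

```python
from collections import Counter

# Illustrative excerpt of the scraped table: one cell may hold several codes.
raw_lookup = [
    ("BA, BL, BT", "Bratislava"),
    ("KE", "Košice"),
    ("BB, BC, BK", "Banská Bystrica"),
]


def expand_lookup(raw):
    """Split multi-code cells so that every code gets its own entry."""
    lookup = {}
    for codes, district in raw:
        for code in codes.split(","):
            lookup[code.strip()] = district
    return lookup


# Made-up values standing in for the ecv_code column.
ecv_codes = ["BA", "BL", "KE", "BB", "BA", "XX"]

lookup = expand_lookup(raw_lookup)
# The dictionary lookup plays the role of VLOOKUP with an exact match;
# codes missing from the table (here "XX") are skipped.
districts = [lookup[code] for code in ecv_codes if code in lookup]
# Counter plays the role of the pivot table.
counts = Counter(districts)

print("slovak_district,count")
for district, count in counts.most_common():
    print(f"{district},{count}")
```

The printed lines have the same two columns, slovak_district and count, as the CSV exported from Google Sheets.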

Now you can use this CSV file to make visualizations! This time we will go back to Datawrapper to see how it can handle a map.

Go to www.datawrapper.de (log in if you have an account)
Click [ Dashboard ] at top right
Find "New Map" in the top menu and click it
You can create 3 different types of map in Datawrapper: Choropleth (colored regions), Symbol map (similar to what we did in Carto) and Locator map (static maps for storytelling). For this exercise let's choose the Choropleth.

The next step is to select a map. It's a bit different than a basemap in Carto – what we need now is a set of geographical areas that can be later manipulated (recolored). There is a library of regions available in Datawrapper – let's try it and

type: "Slovakia" in the search box
You should see a list of 5 results – click the names to see how the maps look. The most interesting choice for us is 🔘 Slovakia » Districts

After closer examination you may see that this map includes multiple, detailed districts for Bratislava and Košice. We could give it a try, but in the end we would find out that the regions for these two cities are too small. We need a modified version of this map.

One of the standard formats for storing geographical features, such as districts of Slovakia, is a geojson file. You should be able to obtain (or maybe produce on your own?) such a file, with the regions you want, for the purpose of the visualization. It just so happens that we have such a file:

slovak_districts_epsg_4326_dissolved.geojson

BTW, if you've got the json file and you're not sure what's inside, you can always examine the file using a free online tool called geojson.io

Download the geojson file and preview it using geojson.io
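
If you want to peek inside a geojson file without a web tool, a few lines of Python are enough, because GeoJSON is ordinary JSON: a FeatureCollection holds a list of features, each with properties and a geometry. The sketch below uses a tiny made-up sample; the property key `name` and the coordinates are assumptions, and the real file may use different property names.

```python
import json

# Made-up, minimal FeatureCollection with two simplified regions.
SAMPLE_GEOJSON = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Bratislava"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[17.0, 48.1], [17.2, 48.1],
                                   [17.2, 48.2], [17.0, 48.1]]]}},
    {"type": "Feature",
     "properties": {"name": "Košice"},
     "geometry": {"type": "Polygon",
                  "coordinates": [[[21.2, 48.7], [21.3, 48.7],
                                   [21.3, 48.8], [21.2, 48.7]]]}}
  ]
}
"""


def summarize(geojson_text, name_key="name"):
    """Return (region name, geometry type) for every feature in the file."""
    data = json.loads(geojson_text)
    return [(feature["properties"].get(name_key), feature["geometry"]["type"])
            for feature in data["features"]]


for name, geometry_type in summarize(SAMPLE_GEOJSON):
    print(name, geometry_type)
```

Listing the region names this way is also a quick check of how they are spelled before you try to match them with your own dataset.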

As you can see, Bratislava and Košice are merged into one region in the new file. We can go back to Datawrapper and import the file:

Click [ or Upload Map ] and choose geojson file
After you see the map imported, click [ Next ]
What you can see now is the same list of regions presented as a map and as a table. Now, we need to merge this table with the dataset we created in Google Sheets. Click [ Import your dataset ] at the bottom of the table.
Datawrapper wants to do the job for us. Ok, click [ Start import ] and go through the import process step-by-step
Datawrapper may have a problem recognizing the name of a region that should be matched; in that case a table with red cells appears. The tool is smart enough to show you a list of rows that are still available for merging. You can try to solve it by double-clicking the red cell and editing the name manually to fit the names used in the geojson map.
After the data is added, you should see a map with some colors. We definitely want to customize them in the section "3. Visualize"

Although the color palette proposed by Datawrapper is a perfect linear gradient between the min and max values (number of cars), it does not show us a lot – basically because the number of cars from Bratislava is so much higher than anywhere else.

We need to adjust the palette to show differences between regions, and to keep Bratislava visible as a much higher number.

In the "Refine" tab, "Color palette" section we're able to customize colors and how they change along with the number of cars. You can add color stops by clicking on the bottom of the color palette
Color stops can be easily dragged and moved left-right
The same way you can add data stops by clicking on the top of the color palette
You can remap data by clicking on the small triangle below data stop and dragging it
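
What the stops do can be sketched as a piecewise-linear mapping: between each pair of neighboring stops the color changes linearly, so a stop placed at a low value gives the small regions most of the color range while one very large value still ends up darkest. This is a simplified Python illustration, not Datawrapper's actual implementation; the stop values and colors are made up.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def color_for(value, stops):
    """Pick a color for value; stops is a list of (data_value, '#rrggbb')
    pairs sorted by strictly increasing data_value."""
    if value <= stops[0][0]:
        return stops[0][1]
    if value >= stops[-1][0]:
        return stops[-1][1]
    for (v0, c0), (v1, c1) in zip(stops, stops[1:]):
        if v0 <= value <= v1:
            t = (value - v0) / (v1 - v0)
            rgb = [round(a + (b - a) * t)
                   for a, b in zip(hex_to_rgb(c0), hex_to_rgb(c1))]
            return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical stops: half of the palette is spent on values from 0 to 50,
# the rest stretches up to an outlier at 1000.
STOPS = [(0, "#ffffff"), (50, "#6baed6"), (1000, "#08306b")]

for cars in (5, 20, 50, 1000):
    print(cars, color_for(cars, STOPS))
```

With only two stops (min and max) the values 5, 20 and 50 would all look nearly white next to 1000; the middle stop is what makes them distinguishable.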

Once we are satisfied with the colors, we're ready to finish the work by applying all the small improvements you're already familiar with:

In the tooltips section you can customize the tooltip text, using the variables from your data table. It's very similar to Carto pop-ups.
Remember to add all necessary information in the "Annotate" tab
Use the "Publish & Embed" section to publish the chart and place it in your article

That's it! We spent quite a lot of time crafting visualisations! ;) Now you can wrap them up by adding a narrative that explains the subtleties, important questions or struggles of the process.

We hope you can use the same approach to create an article supporting your work – not only on the Krížna topic, but also for other reports, studies or campaigns.

4. Taking your teamwork to the next level

You can already do a lot of things by yourself – from cleaning data to visualizing it on charts and maps. It's time to think about how you can improve the data workflow and collaboration with your teammates.

In this short presentation you will find a solution (Git) to several common problems while working with datasets:

you get lost in versions of the same file on your hard drive (full_dataset.csv, full_dataset_final.csv, full_dataset_geocoded.csv, etc.)
you want to keep track of changes in your file, while avoiding keeping several versions of it (e.g. full_dataset_1.csv, full_dataset_2.csv)
you need to share datasets with your colleagues and edit them together with your team
you would like to have a backup copy of your files to make sure nothing gets lost by accident

5. Exercise: writing up your article

Now that you can explore the data and publish it in the form of maps and visualisations, use Nextjournal to write an article about Krížna Street, in which you will use the previously designed visualisations.

Unless otherwise indicated, all materials are published under Creative Commons Attribution 4.0 International license CC BY 4.0.