Visualizations Showcase
Top Olympic Medal Earning Countries
Preamble
import numpy as np # for multi-dimensional containers
import pandas as pd # for DataFrames
import itertools
from plotapi import Chord
Introduction
In previous sections, we visualised co-occurrences of Pokémon type. In this section, we're going to use an Olympic athlete events dataset to visualise the relationship between the 20 countries with the most Olympic medals and the medals (Gold, Silver, and Bronze) they were awarded.
The Dataset
Each row of the dataset describes one athlete's entry in one Olympic event. There are 15 variables per row, including the athlete's NOC (National Olympic Committee code) and the Medal awarded, if any.
Let's download the mirrored dataset and have a look for ourselves.
data_url = 'https://datacrayon.com/datasets/athlete_events.csv'
raw_data = pd.read_csv(data_url)
raw_data.head()
 | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |
data = raw_data[raw_data.Medal.notna()]
data.head()
 | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
37 | 15 | Arvo Ossian Aaltonen | M | 30.0 | NaN | NaN | Finland | FIN | 1920 Summer | 1920 | Summer | Antwerpen | Swimming | Swimming Men's 200 metres Breaststroke | Bronze |
38 | 15 | Arvo Ossian Aaltonen | M | 30.0 | NaN | NaN | Finland | FIN | 1920 Summer | 1920 | Summer | Antwerpen | Swimming | Swimming Men's 400 metres Breaststroke | Bronze |
40 | 16 | Juhamatti Tapio Aaltonen | M | 28.0 | 184.0 | 85.0 | Finland | FIN | 2014 Winter | 2014 | Winter | Sochi | Ice Hockey | Ice Hockey Men's Ice Hockey | Bronze |
41 | 17 | Paavo Johannes Aaltonen | M | 28.0 | 175.0 | 64.0 | Finland | FIN | 1948 Summer | 1948 | Summer | London | Gymnastics | Gymnastics Men's Individual All-Around | Bronze |
We have kept only the rows where a medal was awarded, dropping every entry whose Medal value is missing.
It looks good so far, but let's check how many medal-winning entries we are left with.
data.shape
(39783, 15)
data = data[data['NOC'].isin(list(data['NOC'].value_counts()[:20].index))]
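That leaves 39783 medal-winning entries with 15 variables each. The line above then narrows the data down to the 20 NOCs with the most medal-winning entries.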
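The two filtering steps used here (keep rows with a medal, then keep the most frequent NOCs) can be tried on a small example. This is a minimal sketch on made-up data: the `toy` DataFrame, its values, and `top_n = 2` are assumptions for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the athlete events table (values are made up).
toy = pd.DataFrame({
    "NOC":   ["USA", "USA", "USA", "FIN", "FIN", "FIN", "FIN", "CHN", "DEN"],
    "Medal": ["Gold", "Silver", np.nan, "Bronze", "Silver", "Bronze", "Gold", "Gold", np.nan],
})

# Step 1: keep only rows where a medal was awarded.
medals = toy[toy["Medal"].notna()]

# Step 2: keep only the top_n NOCs by number of medal-winning rows.
# value_counts() sorts from most to least frequent, so the first
# top_n index entries are the NOCs we want.
top_n = 2
top_nocs = list(medals["NOC"].value_counts().index[:top_n])
top_medals = medals[medals["NOC"].isin(top_nocs)]

print(sorted(top_nocs))
print(top_medals.shape)
```

In the toy data FIN has four medal rows, USA two, and CHN one, so only the FIN and USA rows survive the second step.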
Data Wrangling
We need to do a bit of data wrangling before we can visualise our data. We can see from the column names that the country code and the medal for each entry are stored in the columns NOC and Medal.
So let's select just these two columns and work with a DataFrame containing only them as we move forward.
species_personality = pd.DataFrame(data[['NOC', 'Medal']].values).dropna().astype(str)
species_personality
 | 0 | 1 |
---|---|---|
0 | FIN | Bronze |
1 | FIN | Bronze |
2 | FIN | Bronze |
3 | FIN | Bronze |
4 | FIN | Gold |
... | ... | ... |
30152 | URS | Gold |
30153 | URS | Silver |
30154 | URS | Bronze |
30155 | RUS | Bronze |
30156 | RUS | Silver |
30157 rows × 2 columns
species_personality = species_personality.dropna()
Now for the names of our medals and countries. The medals go on the left in a fixed Gold, Silver, Bronze order.
# fix the medal order explicitly rather than relying on value_counts()
left = ["Gold", "Silver", "Bronze"]
pd.DataFrame(left)
 | 0 |
---|---|
0 | Gold |
1 | Silver |
2 | Bronze |
# NOCs ordered from most to fewest medal-winning entries
right = list(data['NOC'].value_counts().index)
pd.DataFrame(right)
 | 0 |
---|---|
0 | USA |
1 | URS |
2 | GER |
3 | GBR |
4 | FRA |
5 | ITA |
6 | SWE |
7 | CAN |
8 | AUS |
9 | RUS |
10 | HUN |
11 | NED |
12 | NOR |
13 | GDR |
14 | CHN |
15 | JPN |
16 | FIN |
17 | SUI |
18 | ROU |
19 | KOR |
Which we can now use to create the matrix.
features= left+right
d = pd.DataFrame(0, index=features, columns=features)
Our chord diagram will need two inputs: the co-occurrence matrix, and a list of names to label the segments.
We can build the co-occurrence matrix with the following approach. For every (NOC, Medal) pair in our data, we increment both the cell for that pairing and the cell for its reversed form, which keeps the matrix symmetric.
species_personality.values
array([['FIN', 'Bronze'], ['FIN', 'Bronze'], ['FIN', 'Bronze'], ..., ['URS', 'Bronze'], ['RUS', 'Bronze'], ['RUS', 'Silver']], dtype=object)
for x in species_personality.values:
    d.at[x[0], x[1]] += 1
    d.at[x[1], x[0]] += 1
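The same symmetric matrix can also be built without an explicit loop. The following is a sketch of one alternative, using `pd.crosstab` on made-up pairs; the `pairs` DataFrame and the two-country `right` list are assumptions for illustration, and this is not the approach used in the code above.

```python
import pandas as pd

# Toy (NOC, Medal) pairs; values are illustrative, not taken from the dataset.
pairs = pd.DataFrame(
    [["FIN", "Bronze"], ["FIN", "Gold"], ["USA", "Gold"], ["USA", "Gold"]],
    columns=["NOC", "Medal"],
)

left = ["Gold", "Silver", "Bronze"]  # medal segments
right = ["USA", "FIN"]               # country segments
features = left + right

# Count each (NOC, Medal) combination, forcing a fixed row/column order
# and filling combinations that never occur with 0.
counts = pd.crosstab(pairs["NOC"], pairs["Medal"]).reindex(
    index=right, columns=left, fill_value=0
)

# Place the counts in both off-diagonal blocks so the matrix is symmetric.
matrix = pd.DataFrame(0, index=features, columns=features)
matrix.loc[right, left] = counts.values
matrix.loc[left, right] = counts.values.T

print(matrix)
```

The medal-to-medal and country-to-country blocks stay at zero, because every pair links exactly one country to one medal.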
Chord Diagram
Time to visualise the medals awarded to each country using a chord diagram. We are going to use a list of custom colours: the first three represent Gold, Silver, and Bronze, and the remaining twenty represent the countries.
colors = ["#FFD700", "#C0C0C0", "#A57164",  # Gold, Silver, Bronze
          # one colour per country
          '#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', '#f032e6', '#bcf60c', '#fabebe',
          '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080']
names = left + right
len(names)
23
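Before drawing anything, it can be worth checking that the inputs agree with each other. The helper below is a hypothetical sketch, not part of PlotAPI: `check_chord_inputs` is a name introduced here, and the rule of one colour per name is an assumption based on how the lists are built in this section.

```python
def check_chord_inputs(matrix, names, colors, bipartite_idx):
    """Return a list of problems; an empty list means the inputs look consistent."""
    n = len(names)
    if len(matrix) != n or any(len(row) != n for row in matrix):
        # Without a square n x n matrix the remaining checks cannot run.
        return ["matrix is not n x n for n = len(names)"]

    problems = []
    if len(colors) != n:
        problems.append("number of colours differs from number of names")
    if any(matrix[i][j] != matrix[j][i] for i in range(n) for j in range(n)):
        problems.append("matrix is not symmetric")

    def same_group(i, j):
        # Indices below bipartite_idx are medals, the rest are countries.
        return (i < bipartite_idx) == (j < bipartite_idx)

    # Each pair links one medal to one country, so cells inside a
    # single group should all be zero for data built this way.
    if any(matrix[i][j] != 0 for i in range(n) for j in range(n) if same_group(i, j)):
        problems.append("non-zero value inside a single group")
    return problems


# Toy inputs: two medals and one country.
toy_names = ["Gold", "Silver", "USA"]
toy_colors = ["#FFD700", "#C0C0C0", "#e6194b"]
toy_matrix = [[0, 0, 2], [0, 0, 1], [2, 1, 0]]

good = check_chord_inputs(toy_matrix, toy_names, toy_colors, bipartite_idx=2)
bad = check_chord_inputs([[0, 0, 2], [0, 0, 1], [3, 1, 0]], toy_names, toy_colors[:2], 2)
print(good)
print(bad)
```

With the real data, the equivalent call would pass `d.values.tolist()`, `names`, `colors`, and `len(left)`.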
Finally, we can put it all together.
# first version: reverse_gradients=True
Chord(d.values.tolist(), names, credit=True, colors=colors, curved_labels=False,
      margin=40, font_size_large=7, noun="medals", conjunction="awarded", verb="",
      details_separator="", bipartite=True, bipartite_idx=len(left), bipartite_size=.2,
      reverse_gradients=True).show()
# second version: identical apart from reverse_gradients=False
Chord(d.values.tolist(), names, credit=True, colors=colors, curved_labels=False,
      margin=40, font_size_large=7, noun="medals", conjunction="awarded", verb="",
      details_separator="", bipartite=True, bipartite_idx=len(left), bipartite_size=.2,
      reverse_gradients=False).show()
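import json
# save the diagram inputs to a JSON file
# (named chord_data so that the data DataFrame is not overwritten)
chord_data = {"matrix": d.values.tolist(),
              "names": names,
              "colors": colors,
              "bipartite_idx": len(left)}
with open("olympic_medals.json", "w") as fp:
    json.dump(chord_data, fp)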
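A saved file like this can be read back with `json.load`. The sketch below round-trips a small made-up payload through a temporary file to show that the structure survives unchanged; the payload values and the `toy_medals.json` file name are assumptions for illustration.

```python
import json
import os
import tempfile

# Toy payload with the same keys as the file written above (values made up).
payload = {
    "matrix": [[0, 2], [2, 0]],
    "names": ["Gold", "USA"],
    "colors": ["#FFD700", "#e6194b"],
    "bipartite_idx": 1,
}

# Write to a temporary directory so nothing in the working folder is touched.
path = os.path.join(tempfile.mkdtemp(), "toy_medals.json")
with open(path, "w") as fp:
    json.dump(payload, fp)

# Read the file back and compare it with what was written.
with open(path) as fp:
    restored = json.load(fp)

round_trip_ok = restored == payload
print(round_trip_ok)
```

Lists of integers and strings map directly onto JSON arrays, so the matrix, names, and colours come back as plain Python lists.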
Chord(
d.values.tolist(),
names,
colors=colors,
curved_labels=False,
margin=40,
font_size_large=7,
noun="medals",
conjunction="awarded",
verb="",
details_separator="",
bipartite=True,
bipartite_idx=len(left),
bipartite_size=0.2,
reverse_gradients=False,
).show()