After seeing many language-wars-style posts about R vs Python and the sort of comparisons being made, I realized that there aren’t many helpful side-by-sides that show you how to do x in y language (and vice versa). I thought about the kind of post I would like to see: one that leverages tidyverse, modern pandas method-chaining / pyjanitor or polars, and plotly (in both R and Python).
I decided to see if I could contribute something to the discourse. I’m not trying to reinvent an analysis wheel; I just want to focus on how something is accomplished in one language versus the other, so I’m pulling from a few sources to have some code to translate, using the same data for both languages.
Since polars is new to me and I like learning new things, I’m using it for the examples, but if you’re already familiar with pandas, I’d highly recommend pyjanitor.
Data
Data was obtained from the dslabs R package and written to parquet via R’s arrow::write_parquet for better interoperability between R and Python. Additionally, the file is small enough to pull the data as parquet from my GitHub repo.
packages
libraries
return top 10 rows in R
get quick info on the data with dplyr::glimpse()
Rows: 10,545
Columns: 9
$ country <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
$ life_expectancy <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
$ fertility <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
$ population <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
$ gdp <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
$ continent <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
$ region <fct> Southern Europe, Northern Africa, Middle Africa, Cari…
return top 10 rows in Python
get quick info on the data with pandas’s DataFrame info method
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10545 entries, 0 to 10544
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 10545 non-null object
1 year 10545 non-null int32
2 infant_mortality 9092 non-null float64
3 life_expectancy 10545 non-null float64
4 fertility 10358 non-null float64
5 population 10360 non-null float64
6 gdp 7573 non-null float64
7 continent 10545 non-null object
8 region 10545 non-null object
dtypes: float64(5), int32(1), object(3)
memory usage: 700.4+ KB
This will come back later, but it’s very easy to move your polars data into a pandas DataFrame.
Hans Rosling’s quiz
Following along with Hans Rosling’s New Insights on Poverty video, we’re going to answer the questions he poses about child mortality rates in 2015. He asks: which pairs do you think are most similar?
- Sri Lanka or Turkey
- Poland or South Korea
- Malaysia or Russia
- Pakistan or Vietnam
- Thailand or South Africa
Sri Lanka or Turkey
simple dplyr::filter and dplyr::select
gapminder %>%
  filter(year == 2015, country %in% c("Sri Lanka", "Turkey")) %>%
  select(country, infant_mortality)
simple filter and select method chain
This is where you can start to see how powerful polars can be in terms of the way it handles lazy evaluation. One of the reasons dplyr is so expressive and intuitive (at least in my view) is due in large part to the way it handles lazy evaluation. People who are tired of constantly needing to refer to the data and column in pandas will likely rejoice at polars.col!
Let’s just compare them all at once
same strategy; more countries
gapminder %>%
  filter(
    year == 2015,
    country %in% c(
      "Sri Lanka", "Turkey", "Poland", "South Korea",
      "Malaysia", "Russia", "Pakistan", "Vietnam",
      "Thailand", "South Africa")) %>%
  select(country, infant_mortality) %>%
  arrange(desc(infant_mortality))
same as above
(gapminder
    .filter(
        (pl.col("year") == 2015) &
        (pl.col("country").is_in([
            "Sri Lanka", "Turkey", "Poland", "South Korea",
            "Malaysia", "Russia", "Pakistan", "Vietnam",
            "Thailand", "South Africa"])))
    .select(["country", "infant_mortality"])
    # recent polars releases use `descending`; older ones spelled it `reverse`
    .sort("infant_mortality", descending = True)
)
Aggregates
With conditionals?
let’s do something slightly more complicated
gapminder %<>%
  mutate(group = case_when(
    region %in% c(
      "Western Europe", "Northern Europe", "Southern Europe",
      "Northern America",
      "Australia and New Zealand") ~ "West",
    region %in% c(
      "Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c(
      "Caribbean", "Central America",
      "South America") ~ "Latin America",
    continent == "Africa" &
      region != "Northern Africa" ~ "Sub-Saharan",
    TRUE ~ "Others"))
gapminder %>% count(group)
rather than use a case_when style function you can continue to chain .when and .then
gapminder = (gapminder.with_columns(
    # pl.lit() marks each label as a string value; recent polars releases
    # would otherwise read a bare string in .then() as a column name
    pl.when(
        pl.col("region").is_in([
            "Western Europe", "Northern Europe", "Southern Europe",
            "Northern America", "Australia and New Zealand"]))
    .then(pl.lit("West"))
    .when(
        pl.col("region").is_in([
            "Eastern Asia", "South-Eastern Asia"]))
    .then(pl.lit("East Asia"))
    .when(
        pl.col("region").is_in([
            "Caribbean", "Central America",
            "South America"]))
    .then(pl.lit("Latin America"))
    .when(
        (pl.col("continent") == "Africa") &
        (pl.col("region") != "Northern Africa"))
    .then(pl.lit("Sub-Saharan"))
    .otherwise(pl.lit("Others"))  # matches the "Others" label on the R side
    .alias("group")
))
(gapminder
    .group_by("group")  # spelled `groupby` in older polars releases
    .agg([ pl.count() ])
    .sort("group")
)
I think this is probably a good enough intro to how you’d generally do things. Filtering, aggregating, and case_when-style workflows are probably the most foundational, and this could already get you started in another language without as much headache.
Scatterplots
I’m trying to strike a balance between dead-basic plotly plots and some things you might want to do to make them look a little more the way you want. The great thing about customizing is that you can write functions to do specific things. I don’t want to overload you with defensive programming for custom function writing using Colin Fay’s attempt package, so I’m simplifying a bit, or at least trying to strike a balance. In some instances you can create simple functions or just save a list of values you want to recycle throughout.
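As a rough Python illustration of that "write a small function, recycle the values" idea, here is a sketch of a helper that bundles shared layout settings. `base_layout` is a hypothetical name, not something from plotly; the margin and title settings are simply the values used in this post.

```python
def base_layout(title, subtitle, **overrides):
    """Return layout settings to recycle across figures (hypothetical helper)."""
    layout = {
        "title": {
            "text": f"<b>{title}</b><br><sup>{subtitle}</sup>",
            "x": 0,
            "xref": "paper",
        },
        "margin": {"t": 95, "r": 40, "b": 120, "l": 79},
    }
    layout.update(overrides)  # per-plot tweaks win over the shared defaults
    return layout


layout = base_layout("Scatterplot", "Life expectancy by fertility", showlegend=False)
print(layout["title"]["text"])
```

A dict like this could then be unpacked into a figure with `fig.update_layout(**layout)`, so every plot starts from the same margins and title styling.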
R + plotly
plotly_title <- function(title, subtitle, ...) {
  return(
    list(
      text = str_glue(
        "
        <b>{title}</b>
        <sup>{subtitle}</sup>
        "),
      ...))
}

margin <- list(
  t = 95,
  r = 40,
  b = 120,
  l = 79)
gapminder %>%
  filter(year == 1962) %>%
  plot_ly(
    x = ~fertility, y = ~life_expectancy,
    color = ~continent, colors = "Set2",
    type = "scatter", mode = "markers",
    hoverinfo = "text",
    text = ~str_glue(
      "
      <b>{country}</b><br>
      Continent: <b>{continent}</b>
      Fertility: <b>{fertility}</b>
      Life Expectancy: <b>{life_expectancy}</b>
      "),
    marker = list(
      size = 7
    )) %>%
  layout(
    margin = margin,
    title = plotly_title(
      title = "Scatterplot",
      subtitle = "Life expectancy by fertility",
      x = 0,
      xref = "paper")) %>%
  config(displayModeBar = FALSE)
A quick note about having plotly work inside of the RStudio IDE: as of the time of this writing it isn’t very straightforward, i.e., not officially supported yet. The plot will open in a browser window and it’s fairly snappy. The good thing is that on the reticulate side, knitting works! So I was able to put all this together via rmarkdown when I started this post and Quarto now that I’m finishing it (remember any chunk will default to the knitr engine), so that’s pretty cool. We’re even using both renv and mamba for the two environments in the same file.
Python + plotly
def plotly_title(title, subtitle):
    return f"<b>{title}</b><br><sup>{subtitle}</sup>"

margin = dict(
    t = 95,
    r = 40,
    b = 120,
    l = 79)

config = {"displayModeBar": False}

(px.scatter(
    (gapminder.filter(pl.col("year") == 1962).to_pandas()),
    x = "fertility", y = "life_expectancy", color = "continent",
    hover_name = "country",
    color_discrete_sequence = px.colors.qualitative.Set2,
    title = plotly_title(
        title = "Scatterplot",
        subtitle = "Life expectancy by fertility"),
    opacity = .8,
    template = "plotly_white")
    .update_traces(
        marker = dict(
            size = 7))
    .update_layout(
        margin = margin)
).show(config = config)
plotly expects a pandas DataFrame, so we’re just using .to_pandas() to give it what it wants, but that doesn’t have to stop you from adding any filtering, summarizing, or aggregating before chaining the data into your viz.
Conclusion
Hopefully this is helpful. If people like posts like this, I can try to do more blogging; I just get busy and forgetful sometimes! Feel free to reach out with any feedback or questions.