A tidyverse R and polars Python side-by-side

R
Python
tidyverse
polars
plotly
Author

Robert Mitchell

Published

July 19, 2022

After seeing many language-wars-style posts about R vs. Python and the sort of library-to-library comparisons being made, I realized that there aren’t many helpful side-by-sides that show you how to do x in y language (and vice versa). I thought about the kind of post I would like to see: one that leverages tidyverse, modern pandas method-chaining with pyjanitor or polars, and plotly (in both R and Python).

I decided to try and see if I could contribute something to the discourse. I’m not really trying to reinvent an analysis wheel; I just want to focus on how something is accomplished from one language to the other, so I’m pulling from a few sources to have some code to translate, using the same data for both languages.

Since polars is new to me and I like learning new things, I’m using it for the examples, but if you’re familiar with pandas already, I’d highly recommend pyjanitor.

Data

Data was obtained from the dslabs R package and written to parquet via R’s arrow::write_parquet for better interoperability between R and Python. Additionally, the file is small enough to pull the data as parquet from my GitHub repo.

packages

library(tidyverse)
library(plotly)
library(arrow, include.only = "read_parquet")
library(magrittr, include.only = "%<>%")

gapminder <- read_parquet("gapminder.parquet")

libraries

import polars as pl
import plotly.express as px

gapminder = pl.read_parquet("gapminder.parquet")

gapminder = (gapminder
  .with_columns([
    pl.col("country").cast(pl.Utf8),
    pl.col("continent").cast(pl.Utf8),
    pl.col("region").cast(pl.Utf8)  
  ])
)

return top 10 rows in R

gapminder %>% head(10)

get quick info on the data with dplyr::glimpse()

gapminder %>% glimpse()
Rows: 10,545
Columns: 9
$ country          <fct> "Albania", "Algeria", "Angola", "Antigua and Barbuda"…
$ year             <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, 37.…
$ life_expectancy  <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 70.8…
$ fertility        <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2.70,…
$ population       <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 1867396,…
$ gdp              <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966778…
$ continent        <fct> Europe, Africa, Africa, Americas, Americas, Asia, Ame…
$ region           <fct> Southern Europe, Northern Africa, Middle Africa, Cari…

return top 10 rows in Python

gapminder.head(10)

get quick info on the data with pandas’s DataFrame.info method

gapminder.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10545 entries, 0 to 10544
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   country           10545 non-null  object 
 1   year              10545 non-null  int32  
 2   infant_mortality  9092 non-null   float64
 3   life_expectancy   10545 non-null  float64
 4   fertility         10358 non-null  float64
 5   population        10360 non-null  float64
 6   gdp               7573 non-null   float64
 7   continent         10545 non-null  object 
 8   region            10545 non-null  object 
dtypes: float64(5), int32(1), object(3)
memory usage: 700.4+ KB

This will come back later, but it’s very easy to move your polars data into a pandas DataFrame.

Hans Rosling’s quiz

Following along Hans Rosling’s New Insights on Poverty video, we’re going to answer the questions he poses in connection to child mortality rates in 2015. He asks, which pairs do you think are most similar?

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

Sri Lanka or Turkey

simple dplyr::filter and dplyr::select

gapminder %>%
  filter(year == 2015, country %in% c("Sri Lanka", "Turkey")) %>%
  select(country, infant_mortality)

simple filter and select method chain

(gapminder
  .filter(
    (pl.col("year") == 2015) & 
    (pl.col("country").is_in(["Sri Lanka", "Turkey"]))) 
  .select(["country", "infant_mortality"])
) 

This is where you can start to see how powerful polars can be in terms of the way it handles lazy evaluation. dplyr is so expressive and intuitive (at least in my view) in large part because of the way it handles lazy evaluation. People who are tired of constantly needing to refer to both the data and the column in pandas will likely rejoice at polars.col!

Let’s just compare them all at once

same strategy; more countries

gapminder %>%
  filter(
    year == 2015, 
    country %in% c(
      "Sri Lanka", "Turkey", "Poland", "South Korea",
      "Malaysia", "Russia", "Pakistan", "Vietnam",
      "Thailand", "South Africa")) %>%
  select(country, infant_mortality) %>%
  arrange(desc(infant_mortality))

same as above

(gapminder
  .filter(
    (pl.col("year") == 2015) & 
    (pl.col("country").is_in([
      "Sri Lanka", "Turkey", "Poland", "South Korea", 
      "Malaysia", "Russia", "Pakistan", "Vietnam", 
      "Thailand", "South Africa"]))) 
  .select(["country", "infant_mortality"])
  # descending=True in current polars; older releases spelled this reverse=True
  .sort("infant_mortality", descending = True)
)

Aggregates

grouping and taking an average

gapminder %>%
  group_by(continent) %>%
  summarise(mean_life_expectancy = mean(life_expectancy) %>%
              round(2), .groups = "keep")

now with polars

(gapminder
  # group_by in current polars; older releases spelled it groupby
  .group_by("continent")
  .agg([
    (pl.col("life_expectancy")
        .mean()
        .round(2)
        .alias("mean_life_expectancy"))
    ])
  .sort("continent")
) 

With conditionals?

let’s do something slightly more complicated

gapminder %<>% 
  mutate(group = case_when(
    region %in% c(
      "Western Europe", "Northern Europe","Southern Europe", 
      "Northern America", 
      "Australia and New Zealand") ~ "West",
    region %in% c(
      "Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c(
      "Caribbean", "Central America", 
      "South America") ~ "Latin America",
    continent == "Africa" & 
      region != "Northern Africa" ~ "Sub-Saharan",
    TRUE ~ "Others"))

gapminder %>% count(group)

rather than use a case_when style function you can continue to chain .when and .then

# Literals are wrapped in pl.lit(): recent polars releases parse a bare
# string passed to .then()/.otherwise() as a column name.
gapminder = (gapminder.with_columns(
  pl.when(
    pl.col("region").is_in([
      "Western Europe", "Northern Europe","Southern Europe", 
      "Northern America", "Australia and New Zealand"]))
    .then(pl.lit("West"))
    .when(
      pl.col("region").is_in([
        "Eastern Asia", "South-Eastern Asia"]))
    .then(pl.lit("East Asia"))
    .when(
      pl.col("region").is_in([
        "Caribbean", "Central America", 
        "South America"]))
    .then(pl.lit("Latin America"))
    .when(
      (pl.col("continent") == "Africa") & 
      (pl.col("region") != "Northern Africa"))
    .then(pl.lit("Sub-Saharan"))
    .otherwise(pl.lit("Others"))
    .alias("group")
))

(gapminder
  .group_by("group")
  # pl.len() counts rows per group (pl.count() in older releases)
  .agg([ pl.len().alias("count") ])
  .sort("group")
)

I think this is probably a good enough intro to how you’d generally do things. Filtering, aggregating, and doing case_when style workflows are probably the most foundational, and this could already get you started in another language without as much headache.

Scatterplots

I’m trying to strike a balance between dead-basic plotly plots and some of the things you might want to do to make them look a little more the way you want. The great thing about customizing is that you can write functions to do specific things. I don’t want to overload you with defensive programming for custom function writing using Colin Fay’s attempt package, so I’m simplifying a bit. In some instances you can create simple functions or just save a list of values you want to recycle throughout.

+ plotly

plotly_title <- function(title, subtitle, ...) {
  return(
    list(
      text = str_glue(
        "
        <b>{title}</b>
        <sup>{subtitle}</sup>
        "),
      ...))
}

margin <- list(
  t = 95,
  r = 40,
  b = 120,
  l = 79)

gapminder %>%
  filter(year == 1962) %>%
  plot_ly(
    x = ~fertility, y = ~life_expectancy, 
    color = ~continent, colors = "Set2", 
    type = "scatter", mode = "markers",
    hoverinfo = "text",
    text = ~str_glue(
      "
      <b>{country}</b><br>
      Continent: <b>{continent}</b><br>
      Fertility: <b>{fertility}</b><br>
      Life Expectancy: <b>{life_expectancy}</b>
      "),
    marker = list(
      size = 7
    )) %>%
  layout(
    margin = margin,
    title = plotly_title(
      title = "Scatterplot",
      subtitle = "Life expectancy by fertility",
      x = 0,
      xref = "paper")) %>%
  config(displayModeBar = FALSE)


Python Plotly rendering

A quick note about having plotly work inside of the RStudio IDE: as of this writing it isn’t very straightforward, i.e., not officially supported yet. The plot will open in a browser window, and it’s fairly snappy. The good thing is that on the reticulate side, knitting works! So I was able to put all this together via rmarkdown when I started this post and via Quarto now that I’m finishing it (remember, any chunk will default to the knitr engine), so that’s pretty cool. We’re even using both renv and mamba for the two environments in the same file.

+ plotly

def plotly_title(title, subtitle):
  return f"<b>{title}</b><br><sup>{subtitle}</sup>"

margin = dict(
  t = 95,
  r = 40,
  b = 120,
  l = 79)
  
config = {"displayModeBar": False}

(px.scatter(
  (gapminder.filter(pl.col("year") == 1962).to_pandas()),
  x = "fertility", y = "life_expectancy", color = "continent",
  hover_name = "country",
  color_discrete_sequence = px.colors.qualitative.Set2,
  title = plotly_title(
    title = "Scatterplot", 
    subtitle = "Life expectancy by fertility"),
  opacity = .8, 
  template = "plotly_white") 
  .update_traces(
    marker = dict(
      size = 7))
  .update_layout(
    margin = margin)
).show(config = config) 


plotly expects a pandas DataFrame so we’re just using .to_pandas() to give it what it wants, but that doesn’t have to stop you from adding any filtering, summarizing, or aggregating before chaining the data into your viz.

Conclusion

Hopefully this is helpful. If people like posts like this, I can try to do more blogging; I just get busy and forgetful sometimes! Feel free to reach out with any feedback or questions.