Mapping South America with R: A Deep Dive into Geo-Visualization | by Fernando Barbalho | Aug, 2023

Navigating datasets, geopolitical nuances, and coding challenges to paint a comprehensive picture of the continent

So you are that kind of data scientist and amateur Medium writer who has enjoyed maps and geography since childhood. You are searching for a good theme for your subsequent work with graphs and, most certainly, maps when you realize that the official statistics agency of your country, Brazil, released the most recent census data. Why not? Why not take a picture of Brazil compared to its neighbors in South America? It might be a simple task using R and all its good packages. Let’s do it.

The minute after this decision comes the realization that the simple task is indeed a hero’s journey with elements such as discovering the most suitable dataset with shapefiles, lack of information, shapefiles interoperability, latitude and longitude mathematics, cultural differences in Geography concepts, and even geopolitical issues, like understanding how to put French overseas territories’ map and data correctly in South America.

The next paragraphs explain one of some possible paths to paint demographic information in a delimited portion of a world map. The step-by-step described below might be useful for all those interested in international comparison with a Geo-visualization approach, even if one’s purpose is to compare access to water among African countries or obesity rates in North America.

Let’s start with the whole picture: an R version of mapa-mundi. See the image and code below.

`library(readxl)library(geobr)library(tidyverse)library(sf)library(spData)library(ggrepel)data("world")#mapa mundiworld %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +scale_fill_continuous_sequential(palette= "Heat 2" )+theme_void() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões de habitantes", 10))`

I use the package {spData} as a reference for a dataframe with geometry information for territories shapefiles all around the planet. The aes function uses the population information to fill the shapes. As we know, China and India are the most populated countries in the world, with over 1 billion people each. The heat colors show the contrast with all other countries. Most of the sequential colors are weak. We can barely understand the gradient of colors in the picture. The logarithm is the best alternative if you want a glimpse of a better color distribution. See below.

`world %>%ggplot() +geom_sf(aes(fill=pop)) +scale_fill_continuous_sequential(palette= "Heat 2", trans= "log2" )+theme_void() +theme(panel.background = element_rect(fill="#0077be"),legend.position = "none")`

In the code, you can see the logarithm transformation in the scale_fill_continuous_sequential function.

In the world dataframe structure, there is a Continent column. So, filtering the data using this column to get a South American map is obvious. See the code and, right after, the map.

`world %>%filter(continent == "South America") %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +scale_fill_continuous_sequential(palette= "Heat 2" )+theme_void() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões de habitantes", 10))`

As you can see, the dplyr filter function worked fine; this is just the map we wanted to see. But is it really correct?

Climate change is a huge issue, but the sea levels haven’t risen yet with such a volume to drown a pronounced area that used to appear in North of South America. What happened here? Let’s draw another map now with the help of coordinates and naming the polygons.

`southamerica<-world %>%filter(continent=="South America") southamerica\$lon<- sf::st_coordinates(sf::st_centroid(southamerica\$geom))[,1]   southamerica\$lat<- sf::st_coordinates(sf::st_centroid(southamerica\$geom))[,2]southamerica %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +scale_fill_continuous_sequential(palette= "Heat 2" )+theme_light() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões de habitantes", 10))+geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)), color = "black", fontface = "bold", size = 2.8)`

The theme_light instead of theme_void was sufficient to display the coordinates. The polygon naming took more work. We had to calculate the centroid of each polygon and then use this information as x and y coordinates in a geom_text_repel function.

With this new map version and some previous knowledge, we discovered the missing territory was French Guyana, between 0º and 10º north latitude and 53º and 55º west longitude. Our next quest is understanding how to get pieces of information on French Guyana: polygon, population, and some coordinates to fill our map.

I had to isolate France from the rest of the world to understand how the {spData} package dealt with this country map’s data. See the result below.

`france<-world %>%filter(iso_a2 == "FR")france %>%ggplot() +geom_sf(aes(fill=pop)) +scale_fill_continuous_sequential(palette= "Heat 2", trans= "log2" )+theme_light() +theme(#panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População", 30))`

France has many so-called overseas territories. The approach of the {spData} package was to represent only the main territory, plus Corsica, an island in the Mediterranean Sea, and French Guyana, located precisely in the coordinate range that characterizes the gap in our last map of South America.

My next try was to add the dataframe with France geometry data to my South America filter, but I knew I would need more. See below

`southamerica %>%bind_rows(france) %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +scale_fill_continuous_sequential(palette= "Heat 2" )+theme_light() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões de habitantes", 10))+geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)), color = "black", fontface = "bold", size = 2.8)`

As you can see in the code, I used bind_row to combine South American territories with France shapefile. So we had now the French Guyana well positioned. On the other hand, there is no population information on the map, and France is a part of South America on the reverse of colonialism’s history.

In other words, I wanted this map.

`data_guiana<-insee::get_idbank_list('TCRED-ESTIMATIONS-POPULATION') %>%filter(str_detect(REF_AREA_label_fr,"Guyane")) %>%filter(AGE == "00-") %>% #all agesfilter(SEXE == 0) %>% #men and womenpull(idbank) %>%insee::get_insee_idbank() %>%filter(TIME_PERIOD == "2023") %>% select(TITLE_EN,OBS_VALUE) %>%mutate(iso_a2 = "FR")data_guiana <- janitor::clean_names(data_guiana)southamerica %>%bind_rows(france) %>%left_join(data_guiana) %>%mutate(pop=ifelse(iso_a2=="FR",obs_value,pop))%>%mutate(lon= ifelse(iso_a2=="FR", france[[11]][[1]][[1]][[1]][1,1], lon),lat= ifelse(iso_a2=="FR",france[[11]][[1]][[1]][[1]][1,2], lat)) %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +scale_fill_continuous_sequential(palette= "Heat 2" )+geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)), color = "black", fontface = "bold", size = 2.8)+coord_sf(xlim = c(-82,-35), ylim=c(-60,15))+theme_light() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões de habitantes", 10))`

As you can read, I used an R package produced by France’s official statistics office to obtain the population of Guyana. In addition, I limited the map to the appropriate coordinates to see South America.

Now that the map hero finally resolved the South American issues and played the pipes of peace with France, it’s time to return to Brazilian data and maps. Remember, I want to compare some Brazilian census details with other countries and territories south of Panama.

The data census is available in an R package or at an API address. I did the more challenging option using the API. Using the other option another time might be a good idea. See the code and the map below, where I show the population of the Brazilian states in contrast to the other South American territories.

`central_america<-world %>%filter(subregion == "Central America")brasil<- geobr::read_country()estados<- geobr::read_state()#dados de populaçãoibge2022<-get_municipalies_data()estados<-estados %>%inner_join(ibge2022 %>%rename(abbrev_state = uf) %>%summarise(.by=abbrev_state,pop = sum(populacao_residente)))southamerica %>%filter(iso_a2!="BR") %>%bind_rows(france) %>%left_join(data_guiana) %>%mutate(pop=ifelse(iso_a2=="FR",obs_value,pop))%>%mutate(lon= ifelse(iso_a2=="FR", france[[11]][[1]][[1]][[1]][1,1], lon),lat= ifelse(iso_a2=="FR",france[[11]][[1]][[1]][[1]][1,2], lat)) %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +geom_sf(data=estados, aes(fill=pop/10^6)) +geom_sf(data=brasil,fill=NA, color="#00A859", lwd=1.2)+geom_sf(data= central_america,fill= "#808080")+scale_fill_continuous_sequential(palette= "Heat 2" )+geom_text_repel(aes(x=lon, y=lat, label= str_wrap(name_long,20)), color = "black", fontface = "bold", size = 2.8)+coord_sf(xlim = c(-82,-35), ylim=c(-60,15))+theme_void() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões", 10))`

I wrote the function get_municipalites_data using the API cited above. The code is available in my gist. Note also two functions that provide the shapefiles used to draw the Brazilian and its sub-region limits: read_country and read_states. These functions are present at the {geobr} package.

I used another filter from the world dataframe. In this case, the purpose is to show the beginning of the Central American subcontinent and paint its map with a shade of gray. Here, we faced a cultural divergence as we learned in Brazil that the Americas have three sub-continents: North America, Central America, and South America. For the authors of the dataset, Central America is a sub-region of North America.

Now it’s time to finish my work. I want to show the names of the eight most populous territories on the map. Even in this final sprint, there were a few code tricks.

`estados\$lon<- sf::st_coordinates(sf::st_centroid(estados\$geom))[,1]   estados\$lat<- sf::st_coordinates(sf::st_centroid(estados\$geom))[,2]most_populated<-southamerica %>%filter(iso_a2 !="BR") %>%rename(name= name_long) %>%as_tibble() %>%select(name, pop, lat, lon) %>%bind_rows(estados %>%rename(name= name_state) %>%as_tibble() %>%select(name, pop, lat, lon)) %>%slice_max(order_by = pop, n=8)southamerica %>%filter(iso_a2!="BR") %>%bind_rows(france) %>%left_join(data_guiana) %>%mutate(pop=ifelse(iso_a2=="FR",obs_value,pop))%>%mutate(lon= ifelse(iso_a2=="FR", france[[11]][[1]][[1]][[1]][1,1], lon),lat= ifelse(iso_a2=="FR",france[[11]][[1]][[1]][[1]][1,2], lat)) %>%ggplot() +geom_sf(aes(fill=pop/10^6)) +geom_sf(data=estados, aes(fill=pop/10^6)) +geom_sf(data=brasil,fill=NA, color="#00A859", lwd=1.2)+geom_sf(data= central_america,fill= "#808080")+scale_fill_continuous_sequential(palette= "Heat 2" )+geom_text_repel(data= most_populated,aes(x=lon, y=lat, label= str_c(str_wrap(name,10),": ",round(pop/10^6,1))), color = "black", fontface = "bold", size = 2.9)+coord_sf(xlim = c(-82,-35), ylim=c(-60,15))+theme_void() +theme(panel.background = element_rect(fill="#0077be")) +labs(fill= str_wrap("População em milhões", 10))`

Three Brazilian states are among the eight most populated territories in South America. In fact, São Paulo is the second most inhabited space on the map, surpassing all the countries except Colombia.

Now, focusing on the code, you can see that I created a new dataframe to build this ranking by combining two different sf objects. I selected a sub-set of columns and changed the type from sf to tibble to enable the row binding.

So this is it. The hero has completed one possible path and left the footprints for the next journey. Now it’s your turn. Remember all of your projects that could have a significant improvement using a map representation. Using the walk-through above and gathering all the data available about population, socioeconomic issues, and so on, it’s just a matter of choosing the variable to fill the polygons.