Updates to package countries


Version 0.2 of package countries is out. This article is a short description of some of the new functions.

Francesco S. Bellelli

The version 0.2 of the package introduces several functions that identify special columns in country data. These functions can be used as building blocks to automate basic tasks and detect data formats. Here is an overview of a the new functionalities.

First, let’s install the latest version of the package

# Install and load devtools if needed

# Install and load countries package
devtools::install_github("fbellelli/countries", build_vignettes = TRUE)
#load the package

Then let’s load an example dataset

In this article we will use a dataset of country policies as an example for the functions in the package. This dataset contains 41 columns and has a rather complex strutcture. Every row is identified by a combination of Country name (column 21), Year (Column 9), Policy measure number (column 1), and HS code (column 39) — which is a code identifying traded commodities. This dataset is an extension based on the WTO environmental databse (EDB). More information on the dataset can be found here.

#download data
temp <- tempfile()
download.file("https://fbellelli.com/data%20files/Extended%20EDB%20by%20measure-country-HS%20chapter.zip", temp, mode="wb")

#unzip and load it in R
example <- read.csv(
  unzip(temp,"Extended EDB by measure-country-HS chapter/Extended EDB (by measure-country-HS chapter).csv"),

Finding columns containing country and time information

The functions find_countrycol() and find_timecol() can be used to return respectively all columns containing country names and date/year information.

In the example below, the function find_countrycol() is used to examine our example dataset. The argument min_share is used to indicate whether to look for an entire column of Country names or any column that contains Country names. This is a numeric value indicating the minimum share of entries in the column that are country names.

#Columns of country names
#Return any column containing a country name
find_countrycol(example, min_share=0) 
[1] "Notifying.Member" "COUNTRIES"       

Similarly, we can look for time columns in the following way:

# Date and year columns
[1] "Year"

By default, the function will return the name of the column. However, the argument return_index can be used to get the indices of the column in the table.

find_countrycol(example, return_index = TRUE)
[1] 21
find_timecol(example, return_index = TRUE)  
[1] 9

Computations may take long on large tables with many columns. To speed up operations, these functions operate by evaluating the table on a random sample of rows. The sample size can be adjusted with the argument sample_size. It also possible to request to use the entire table by passing the value NA. Finally, it is also possible to request not to return columns containing NA values by specifying: allow_NA = FALSE.

These two functions are based on is_country() and is_date(), which can be used to test if a string is a Country name and a date. An overview of the first function is provided in the vignette Dealing with country names, here we have a look at `is_date()´.

Testing dates

is_date() takes a string vector as argument and outputs a boolean vector indicating whether the strings are dates. The argument formats can be used to specify the formats to be checked. Standard R notation can be used for date elements (see the table below).

test <- c("December 2022", "20/01/1970", "Banana", "12.13.2000") 

Symbol Definition Example
%d Day number 13
%a Abbreviated day name Mon
%A Full day name Monday
%m Month number 2
%b Abbreviated month name Feb
%B Full month name February
%y Year 2-digits 22
%Y Year 4-digits 2022

Finding table keys

find_keycol() is a function that can be used to automate the search of key columns in datasets. A table key is a column or set of columns that uniquely identifies the entries in the table. This function was specifically designed for country data, so it will prioritise the search of country and time columns, which are often keys in country data.

As shown below, the function correctly identifies the keys for the complex dataset introduced at the beginning of this article.

find_keycol(example, allow_NA = TRUE)
                      country                          time 
                  "COUNTRIES"                        "Year" 
                        other                         other 
                         "Nr" "Tentative.HS.chapters.match" 

The function outputs a vector of column names that uniquely identifies the entries in the table. In addition, it also indicates whether the column contains country, time or other information. Just like the other find function in this page, it is possible to request the column index instead of the name by passing return_index = TRUE.

Due to the potentially very high number of column combinations in a table, the function only tests the more likely options. The function will first check country and time columns. Next, left-most columns are prioritised. Also, the function will only look for the most common country data formats (e.g. cross-sectional, time-series, panel data, dyadic, etc.) and only searches for up to two additional key columns of type other. If no key is found at the end of the search, the function returns a NULL value.

Computation can be slow on large datasets. Therfore, the function relies on a sample of row. The sample size can be controlled with sample_size. Moreover, it is also possible to restrict the search to a subset of columns by providing a vector of column names or indices in the argument search_only.

#This will only check if a key is found among the first three columns
find_keycol(example, allow_NA = TRUE, search_only = 1:3)

Finally, the function also permits the user to specify whether columns containing NA values are allowed to be considered as table keys. The default is allow_NA = FALSE because keys should typically not contain missing values. However, in our example dataset the column Tentative.HS.chapters.match contains blank entries which are read in R as NA, therefore the default would return NULL. The argument allow_NA is present to deal with these edge cases.

find_keycol(example, allow_NA = FALSE)

If the user wants to test a specific set of columns, the package also provides the function is_keycol() to check whether the provided columns uniquely identify the entries in the dataset.

is_keycol(example, c("COUNTRIES","Year"), allow_NA = TRUE)
is_keycol(example, c("COUNTRIES","Year", "Nr","Tentative.HS.chapters.match"), allow_NA = TRUE)
[1] TRUE

Tweaked which.min and which.max functions and statistical mode

Version 0.2 also introduces the which_min, which_max and which_mode. These functions return the position (index) respectively of the minimum, maximum, and mode values of the vector. Unlike base R functions which.min and which.max, these functions return the position of ALL minima/maxima instead of only the first one. If desired, one can request only the first value with the argument first_only. Here is an example:

# Return position of all minimum values. Example: Return position of measures with the lowest score?  
[1] 82520 82521
# Compare this with which.min, which will return only the first value:
[1] 82520
# If desired, the same behaviour can be obtained with the argument "first_only"
which_min(example$MEASURE_SCORE, first_only = TRUE)
[1] 82520

The package also provide a function to get the mode of a vector. That is to say, the observation that appear most frequently in a vector. It supports both numeric, factor and character variables. In case of

#return the mode of a vector. Example: which country has the most measures in the dataset? United States
[1] "United States"
#if multiple modes are present, all of them will be returned, unless only the first is requested:
[1] 1 3
countries::mode(c(1,1,2,3,3,4), first_only=TRUE)
[1] 1
#we can also get the position of all US entries
 [1] 1093 1094 1098 1099 1100 1112 1115 1116 1120 1121

What will come next?

This version of the package introduces many building blocks that will be used to design more advanced functionalities.

The next package version will aim to bring a merging function that deals with differences in country names and with time columns. I will also work on a wrapper function to allow easy download of country data from public sources (e.g. World Bank, IMF, FRED, etc…). Ideally, we want to be able to download country data from multiple sources and obtain a single table with merged data ready to use, all in one simple command.


For attribution, please cite this work as

Bellelli (2022, Aug. 7). F.S.Bellelli: Updates to package countries. Retrieved from https://fbellelli.com/posts/2022-08-07-updates-to-package-countries/

BibTeX citation

  author = {Bellelli, Francesco S.},
  title = {F.S.Bellelli: Updates to package countries},
  url = {https://fbellelli.com/posts/2022-08-07-updates-to-package-countries/},
  year = {2022}