The first development version of the package countries is now available on Github. This post looks at how the package can be used to work with country names.
countries
is an R package designed to quickly wrangle, merge and explore country data. This package will contain functions to easily identify and convert country names, pull country info and datasets, merge country data from different sources, and make quick world maps.
I recently released the first development version of the package, which is now available on my Github page. The package also has a website containing information on the package’s usage.
In this article, we will have a look at how countries
can be used to work with country name. In particular, we will look in detail at the function country_name()
, which can be used to convert country names to different naming conventions or to translate them to different languages. country_name()
can identify countries even when they are provided in mixed formats or in different languages. It is robust to small misspellings and recognises many alternative country names and disused official names.
Since the package is not yet on CRAN, the development version needs to be downloaded directly from the Github repository. This can be done with the devtools
package.
# Install and load devtools
install.packages("devtools")
library(devtools)
# Install countries
devtools::install_github("fbellelli/countries", build_vignettes = TRUE)
The package can then be loaded normally
The function country_name()
can be used to convert country names to different naming conventions or to translate them to different languages.
example <- c("United States","DR Congo", "Morocco")
# Getting 3-letters ISO code
country_name(x= example, to="ISO3")
[1] "USA" "COD" "MAR"
# Translating to spanish
country_name(x= example, to="name_es")
[1] "Estados Unidos"
[2] "República Democrática del Congo"
[3] "Marruecos"
If multiple arguments are passed to the argument to
, the function will output a data.frame
object, with one column corresponding to every naming convention.
# Requesting translation to French and 2-letter and 3-letter ISO codes
country_name(x= example, to=c("name_fr","ISO2","ISO3"))
name_fr ISO2 ISO3
1 États-Unis US USA
2 République démocratique du Congo CD COD
3 Maroc MA MAR
The to
argument supports all the following naming conventions:
CODE | DESCRIPTION |
---|---|
simple | This is a simple english version of the name containing only ASCII characters. This nomenclature is available for all countries. |
ISO3 | 3-letter country codes as defined in ISO standard 3166-1 alpha-3 . This nomenclature is available only for the territories in the standard (currently 249 territories). |
ISO2 | 2-letter country codes as defined in ISO standard 3166-1 alpha-2 . This nomenclature is available only for the territories in the standard (currently 249 territories). |
ISO_code | Numeric country codes as defined in ISO standard 3166-1 numeric . This country code is the same as the UN’s country number (M49 standard). This nomenclature is available for the territories in the ISO standard (currently 249 countries). |
UN_xx | Official UN name in 6 official UN languages. Arabic (UN_ar ), Chinese (UN_zh ), English (UN_en ), French (UN_fr ), Spanish (UN_es ), Russian (UN_ru ). This nomenclature is only available for countries in the M49 standard (currently 249 territories). |
WTO_xx | Official WTO name in 3 official WTO languages: English (WTO_en ), French (WTO_fr ), Spanish (WTO_es ). This nomenclature is only available for WTO members and observers (currently 189 entities). |
name_xx | Translation of ISO country names in 28 different languages: Arabic (name_ar ), Bulgarian (name_bg ), Czech (name_cs ), Danish (name_da ), German (name_de ), Greek (name_el ), English (name_en ), Spanish (name_es ), Estonian (name_et ), Basque (name_eu ), Finnish (name_fi ), French (name_fr ), Hungarian (name_hu ), Italian (name_it ), Japponease (name_ja ), Korean (name_ko ), Lithuanian (name_lt ), Dutch (name_nl ), Norwegian (name_no ), Polish (name_po ), Portuguese (name_pt ), Romenian (name_ro ), Russian (name_ru ), Slovak (name_sk ), Swedish (name_sv ), Thai (name_th ), Ukranian (name_uk ), Chinese simplified (name_zh ), Chinese traditional (name_zh-tw ) |
GTAP | GTAP country and region codes. |
all | Converts to all the nomenclatures and languages in this table |
country_name()
can identify countries even when they are provided in mixed formats or in different languages. It is robust to small misspellings and recognises many alternative country names and old nomenclatures.
fuzzy_example <- c("US","C@ète d^Ivoire","Zaire","FYROM","Estados Unidos","ITA")
country_name(x= fuzzy_example, to=c("UN_en"))
Multiple country IDs have been matched to the same country name
Set - verbose - to TRUE for more details
[1] "United States of America"
[2] "Côte d’Ivoire"
[3] "Democratic Republic of the Congo"
[4] "North Macedonia"
[5] "United States of America"
[6] "Italy"
More information on the country matching process can be obtained by setting verbose=TRUE
. The function will print information on:
"C@ète d^Ivoire"
is the only name recognised with fuzzy matching. The function’s reference table can be accessed with the command data(country_reference_list)
."C@ète d^Ivoire"
) to the closest reference ("Côte d'Ivoire"
). Lower DISTANCE statistics indicate more reliable fuzzy matching.country_name(x= fuzzy_example, to=c("UN_en"), verbose=TRUE)
In total 6 unique country identifiers have been found
5/6 have been matched with EXACT matching
1/6 have been matched with FUZZY matching
Fuzzy matching DISTANCE summary:
| Average: 3
| Min: 3
| Q1: 3
| Median: 3
| Q3: 3
| Max: 3
Multiple arguments have been matched to the same country name:
- Estados Unidos : United States of America
- US : United States of America
[1] "United States of America"
[2] "Côte d’Ivoire"
[3] "Democratic Republic of the Congo"
[4] "North Macedonia"
[5] "United States of America"
[6] "Italy"
In addition, setting verbose=TRUE
will also print additional informations relating to specific warnings that are normally given by the function:
Multiple country IDs have been matched to the same country name
: This warning is issued if multiple strings have been matched to the same country. In verbose mode, the strings and corresponding countries will be listed. In the example above, both "US"
and "Estados Unidos"
are matched to the same country. If the vector of country names is a unique identifier, this could indicate that some country name was not recognised correctly. The user might consider using custom tables (refer to the next section).Some country IDs have no match in one or more country naming conventions
: indicates that it is impossible to find an exact match for one or more country names with fuzzy_match=TRUE
. The user might consider using fuzzy_match=TRUE
or custom tables (refer to the next section).There is low confidence on the matching of some country names
: This warning indicates that some strings have been matched poorly. Thus indicating that the country might have been mididentified. In verbose mode the function will provide a list of problematic strings (see the example below). The user might consider using custom tables to solve this type of issues (refer to the next section).Some country IDs have no match in one or more country naming conventions
: Conversion is requested to a nomenclature for which there is no information on the country. For instance, in the example below “Taiwan” has no correspondence in the UN M49 standard. In verbose mode, the function will print all the country names affected by this problem. The user might consider using custom tables to solve this type of issues (refer to the next section).country_name(x= c("Taiwan","lsajdèd"), to=c("UN_en"), verbose=FALSE)
Some country IDs have no match in one or more country naming conventions
There is low confidence on the matching of some country names
Set - verbose - to TRUE for more details
[1] NA
[2] "Lao People's Democratic Republic"
All the information from verbose mode can be accessed by setting ´simplify=FALSE´. This will return a list object containing:
In some cases, the user might be unhappy with the naming conversion or no valid conversion might exist for the provided territory. In these cases, it might be useful to tweak the conversion table. The package contains a utility function called ´match_table()´, which can be used to generate conversion tables for small adjustments.
example_custom <- c("Siam","Burma","H#@°)Koe2")
#suppose we are unhappy with how "H#@°)Koe2" is interpreted by the function
country_name(x = example_custom, to = "name_en")
There is low confidence on the matching of some country names
Set - verbose - to TRUE for more details
[1] "Thailand" "Myanmar" "Korea, Republic of"
#match_table can be used to generate a table for small adjustments
tab <- match_table(x = example_custom, to = "name_en")
There is low confidence on the matching of some country names
tab$name_en[2] <- "Hong Kong"
#which can then be used for conversion
country_name(x = example_custom, to = "name_en", custom_table = tab)
[1] "Thailand" "Myanmar" "Hong Kong"
I am still working on the package. In the near future, the following items will be added to the package:
For attribution, please cite this work as
Bellelli (2022, Feb. 27). F.S.Bellelli: First version of countries. Retrieved from https://fbellelli.com/posts/2022-02-27-first-version-of-countries/
BibTeX citation
@misc{bellelli2022first, author = {Bellelli, Francesco S.}, title = {F.S.Bellelli: First version of countries}, url = {https://fbellelli.com/posts/2022-02-27-first-version-of-countries/}, year = {2022} }