The country converter (coco) - a Python package for converting country names between different classifications and between different naming versions.

cynepiaadmin/country_converter
country converter

The country converter (coco) is a Python package to convert country names between different classifications and between different naming versions. Internally it uses regular expressions to match country names.

Installation

Download the package and add its location to your Python path:

import sys

# Location of the downloaded package (adjust to your system)
_fd = r'S:\coco'
if _fd not in sys.path:
    sys.path.append(_fd)
del _fd

import country_converter as coco

The package depends on pandas. Running the tests additionally requires py.test.

Usage

Basic usage

Convert various country names to some standard names:

import country_converter as coco
cc = coco.CountryConverter()

some_names = ['United Rep. of Tanzania', 'Cape Verde', 'Burma', 'Iran (Islamic Republic of)', 'Korea, Republic of', "Dem. People's Rep. of Korea"]

standard_names = cc.convert(names=some_names, src='regex', to='name_short')
print(standard_names)

Which results in ['Tanzania', 'Cabo Verde', 'Myanmar', 'Iran', 'South Korea', 'North Korea'].
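
To illustrate the idea behind the 'regex' source, here is a minimal, standalone sketch of regular-expression matching against a handful of standard names. The patterns, the `to_short_name` helper and the "not found" fallback are assumptions made up for this example. They are not the expressions or the code shipped with the package.

```python
import re

# Hypothetical patterns for illustration only; the package ships its own,
# far more thorough expressions in its data file.
PATTERNS = {
    "Tanzania": r"tanzania",
    "Myanmar": r"myanmar|burma",
    "Iran": r"\biran\b",
    "South Korea": r"^(?!.*\bdem).*korea",
    "North Korea": r"\bdem.*korea",
}


def to_short_name(name, not_found="not found"):
    """Return the standard name whose pattern is the only one matching `name`."""
    hits = [short for short, pattern in PATTERNS.items()
            if re.search(pattern, name, flags=re.IGNORECASE)]
    if not hits:
        return not_found
    if len(hits) > 1:
        # An ambiguous match means the patterns need refining.
        raise ValueError(f"{name!r} matches several countries: {hits}")
    return hits[0]


names = ["United Rep. of Tanzania", "Burma", "Iran (Islamic Republic of)",
         "Korea, Republic of", "Dem. People's Rep. of Korea"]
converted = [to_short_name(n) for n in names]
print(converted)
```

The negative lookahead in the "South Korea" pattern is what keeps the two Koreas apart: any name containing a word starting with "dem" is left to the "North Korea" pattern.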

Convert between classification schemes:

iso3_codes = ['USA', 'VUT', 'TKL', 'AUT']
iso2_codes = cc.convert(names=iso3_codes, src='ISO3', to='ISO2')
print(iso2_codes)

Which results in ['US', 'VU', 'TK', 'AT'].

Internally the data is stored in a pandas dataframe, which can be accessed directly. For example, the membership columns hold the year in which a country joined an organisation, so they can be used to filter countries by membership in a given year.

some_countries = ['Australia', 'Belgium', 'Brazil', 'Bulgaria', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'India', 'Indonesia', 'Ireland', 'Italy', 'Japan', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Romania', 'Russia',  'Turkey', 'United Kingdom', 'United States']

oecd_since_1995 = cc.data[(cc.data.OECD >= 1995) & cc.data.name_short.isin(some_countries)].name_short
eu_until_1980 = cc.data[(cc.data.EU <= 1980) & cc.data.name_short.isin(some_countries)].name_short
print(oecd_since_1995)
print(eu_until_1980)
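
The year comparison is easier to follow on a small table. The sketch below builds a hypothetical five-row stand-in for cc.data, assuming each membership column holds the accession year and is empty (NaN) for non-members. The table and variable names are made up for this example and only the three columns shown are included.

```python
import pandas as pd

# Hypothetical miniature of the internal table. Each membership column is
# assumed to hold the accession year, with NaN for non-members.
data = pd.DataFrame({
    "name_short": ["Australia", "Czech Republic", "Denmark", "Greece", "Brazil"],
    "OECD": [1971, 1995, 1961, 1961, None],
    "EU": [None, 2004, 1973, 1981, None],
})

# Countries that joined the OECD in 1995 or later.
joined_oecd_1995_or_later = data[data.OECD >= 1995].name_short.tolist()

# Countries that were already EU members by 1980.
in_eu_by_1980 = data[data.EU <= 1980].name_short.tolist()

print(joined_oecd_1995_or_later)
print(in_eu_by_1980)
```

Comparisons against NaN evaluate to False, so non-members drop out of both selections without any extra handling.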

Some properties provide direct access to affiliations:

cc.EU28
cc.OECD

cc.EU27in('ISO3')

and the classification schemes available:

cc.valid_class

The regular expressions can also be used to match any list of countries to any other. For example:

match_these = ['norway', 'united_states', 'china', 'taiwan']
master_list = ['USA', 'The Swedish Kingdom', 'Norway is a Kingdom too', 'Peoples Republic of China', 'Republic of China' ]

matching_dict = coco.match(match_these, master_list)
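
Conceptually, this kind of matching can be thought of as resolving both lists to a common standard name and pairing the entries that agree. The following standalone sketch shows that idea. The patterns and the helpers `standard_name` and `match_lists` are hypothetical and do not reflect how coco.match is implemented or what exactly it returns.

```python
import re

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "Norway": r"norway",
    "Sweden": r"swed",
    "United States": r"united.?states|^usa?$",
    "China": r"^china$|people.?s republic of china",
    "Taiwan": r"taiwan|^republic of china$",
}


def standard_name(name):
    """Return the single standard name matching `name`, or None."""
    hits = [std for std, pattern in PATTERNS.items()
            if re.search(pattern, name, flags=re.IGNORECASE)]
    return hits[0] if len(hits) == 1 else None


def match_lists(match_these, master_list):
    """Map each entry of `match_these` to the master entry for the same country."""
    by_standard = {}
    for entry in master_list:
        std = standard_name(entry)
        if std is not None:
            by_standard[std] = entry
    return {name: by_standard.get(standard_name(name)) for name in match_these}


match_these = ["norway", "united_states", "china", "taiwan"]
master_list = ["USA", "The Swedish Kingdom", "Norway is a Kingdom too",
               "Peoples Republic of China", "Republic of China"]
matching = match_lists(match_these, master_list)
print(matching)
```

Entries without a counterpart in the master list map to None in this sketch.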

See the IPython Notebook (country_converter_examples.ipynb) for more information.

Refining and Extending

The underlying raw data is a tab-separated file which is read into a pandas dataframe (available as attribute .data in the main class). Any column added to this dataframe can be used for all conversions. The tab-separated datafile is utf-8 encoded.
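
As a rough sketch of why an added column becomes usable for conversions, the example below reads a hypothetical two-row excerpt of a tab-separated table into pandas, adds a custom column, and converts through a plain column-to-column lookup. The `convert` function here is a simplified stand-in written for this example, not the package's method, and `my_region` is an invented column.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt; the real file has many more rows and columns.
RAW = "name_short\tISO2\tISO3\nAustria\tAT\tAUT\nVanuatu\tVU\tVUT\n"
data = pd.read_csv(io.StringIO(RAW), sep="\t")

# A custom classification added as a new column (invented for this sketch).
data["my_region"] = ["Europe", "Pacific"]


def convert(names, src, to, not_found="not found"):
    """Translate `names` from column `src` to column `to` via a dict lookup."""
    lookup = data.set_index(src)[to].to_dict()
    return [lookup.get(name, not_found) for name in names]


print(convert(["AUT", "VUT"], src="ISO3", to="my_region"))
print(convert(["AT", "XX"], src="ISO2", to="name_short"))
```

Because the lookup only depends on column names, the new column works as a conversion target without any further changes.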

The included regular expressions were tested against names commonly found in various databases. If the expressions need to be updated, I recommend rerunning all tests (using the py.test package).

These tests check:

  1. Do the short names uniquely match the regular expressions?
  2. Do the official names uniquely match the regular expressions?
  3. Do the alternative names tested so far still uniquely match the standard names?

To specify a new test set, add a tab-separated file with the headers "name_short" and "name_test". Each row holds one pair: the short name (as given in the main classification file) and the alternative name that should be tested against it. If the file name starts with "test_regex_", it will be recognised automatically by the test functions.
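
The check applied to such a file can be sketched as follows. This is a simplified standalone illustration, not the package's test code: the regex table, the file content and the `failing_rows` helper are all hypothetical.

```python
import csv
import io
import re

# Hypothetical stand-in for the regex column of the main classification file.
REGEX_BY_SHORT_NAME = {
    "Myanmar": r"myanmar|burma",
    "Cabo Verde": r"(cabo|cape).?verde",
}

# Hypothetical content of a test file whose name starts with "test_regex_".
TEST_FILE = (
    "name_short\tname_test\n"
    "Myanmar\tBurma\n"
    "Myanmar\tUnion of Myanmar\n"
    "Cabo Verde\tCape Verde\n"
)


def failing_rows(tsv_text):
    """Return the rows whose name_test does not uniquely match name_short."""
    failures = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        hits = [short for short, pattern in REGEX_BY_SHORT_NAME.items()
                if re.search(pattern, row["name_test"], flags=re.IGNORECASE)]
        # A row passes only if exactly the expected short name matched.
        if hits != [row["name_short"]]:
            failures.append((row["name_short"], row["name_test"], hits))
    return failures


print(failing_rows(TEST_FILE))
```

An empty result means every alternative name matched exactly one expression, and that it was the expected one.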

Classification schemes

Currently the following classification schemes are available:

  1. ISO2 (ISO 3166-1 alpha-2)
  2. ISO3 (ISO 3166-1 alpha-3)
  3. ISO - numeric (ISO 3166-1 numeric)
  4. UN numeric code (which largely follows ISO - numeric)
  5. A standard or short name
  6. The "official" name
  7. Continent
  8. UN region
  9. EXIOBASE 1 classification
  10. EXIOBASE 2 classification
  11. WIOD classification
  12. OECD membership (per year)
  13. UN membership (per year)
  14. EU membership (per year)

Data sources and further reading

Most of the underlying data can be found on Wikipedia; https://en.wikipedia.org/wiki/ISO_3166-1 is a good starting point. UN regions and codes are given on the United Nations Statistics Division (unstats) web page. The EXIOBASE and WIOD classifications were extracted from the respective databases. The membership of OECD, UN and EU can be found on the web pages of the respective organisations.

Acknowledgements

This package was inspired by (and the regular expressions are mostly based on) the R package countrycode by Vincent Arel-Bundock and its port to Python (pycountrycode) by Julian Hinz.
