Discovery

Functions used to discover and explore the data exposed by the ISTAT web service.

This module implements functions to discover the data exposed by ISTAT. To do so, istatapi makes metadata requests to the API endpoints. The Discovery module provides useful methods to parse and analyze API metadata responses. It makes use of the pandas library and returns data in DataFrame format, making it convenient for interactive and exploratory analysis in Jupyter notebooks.

The main class implemented in the Discovery module is DataSet.
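
All of the code examples below assume that the discovery helpers and the DataSet class are already in scope. A minimal import sketch, assuming they are exposed by the istatapi.discovery module (the test_eq and test_fail helpers used in the checks come from fastcore.test):

# Assumed imports for the examples in this page (a sketch, not verbatim library documentation)
from istatapi.discovery import all_available, search_dataset, DataSet
from fastcore.test import test_eq, test_fail  # only needed to run the inline checks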


source

parse_dataflows

 parse_dataflows (response)

Parse the response containing all the available datasets and return a list of dataflows.

The simplest way to get a full list of the dataflows provided by ISTAT is to call the all_available() function, which returns a list of all the explorable dataflows together with their IDs and descriptions.


source

all_available

 all_available (dataframe=True)

Return all available dataflows

available_datasets = all_available()
available_datasets.head()
df_id version df_description df_structure_id
0 101_1015 1.3 Crops DCSP_COLTIVAZIONI
1 101_1030 1.0 PDO, PGI and TSG quality products DCSP_DOPIGP
2 101_1033 1.0 slaughtering DCSP_MACELLAZIONI
3 101_1039 1.2 Agritourism - municipalities DCSP_AGRITURISMO_COM
4 101_1077 1.0 PDO, PGI and TSG products: operators - munici... DCSP_DOPIGP_COM
print(f'number of available datasets: {len(available_datasets)}')
number of available datasets: 509
test_eq(available_datasets.columns, ['df_id', 'version', 'df_description', 'df_structure_id'])

source

search_dataset

 search_dataset (keyword)

Search available dataflows that contain keyword. Return these dataflows in a DataFrame

This function looks for keyword inside all dataset descriptions. By default, the keyword needs to be an English word.

df = search_dataset(keyword="Tax")
df.head()
df_id version df_description df_structure_id
168 168_261 1.1 Hicp - at constant tax rates annual data(base ... DCSP_IPCATC2
169 168_306 1.2 Hicp - at constant tax rates monthly data (bas... DCSP_IPCATC1
172 168_756 1.4 Hicp - at constant tax rates monthly data (bas... DCSP_IPCATC1B2015
173 168_757 1.1 Hicp- at constant tax rates annual data (base ... DCSP_IPCATC2B2015
267 30_1008 1.1 Irpef taxable incomes (Ipef) - municipalities MEF_REDDITIIRPEF_COM
test_fail(lambda: search_dataset(keyword="disoccupazione"))  # an Italian keyword finds no match in the English descriptions, so this raises

Data Structures and Information about available Datasets


source

DataSet

 DataSet (dataflow_identifier:str, resource:str='datastructure')

Class that implements methods to retrieve information (metadata) about a dataset

The class takes a df_id, df_structure_id, or df_description as input. These three values can be found with the all_available() function.
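
Since any of the three identifiers is accepted, the unemployment-rate dataset used below could in principle also be instantiated from its structure ID or its description. A sketch, assuming the lookup treats the three fields interchangeably:

# Sketch: the same dataset referenced through different identifiers
# (assumes df_id, df_structure_id and df_description are resolved the same way)
ds_by_id = DataSet(dataflow_identifier="151_914")
ds_by_structure_id = DataSet(dataflow_identifier="DCCV_TAXDISOCCU1")
ds_by_description = DataSet(dataflow_identifier="Unemployment  rate")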

ds = DataSet(dataflow_identifier="151_914")
test_eq(ds.identifiers['df_id'], '151_914')
test_eq(ds.identifiers['df_description'], 'Unemployment  rate')
test_eq(ds.identifiers['df_structure_id'], 'DCCV_TAXDISOCCU1')
ds2 = DataSet(dataflow_identifier="22_289")
test_eq(ds2.identifiers['df_id'], '22_289')
test_eq(ds2.identifiers['df_description'], 'Resident population  on 1st January')
test_eq(ds2.identifiers['df_structure_id'], 'DCIS_POPRES1')
# test Dataset 729_1050 (https://github.com/Attol8/istatapi/issues/24)
assert len(available_datasets.query('df_id == "729_1050"')) == 1
# test that it raises ValueError if no dataset is found
test_fail(lambda: DataSet(dataflow_identifier="729_1050"), contains="No available data found for the requested query")
ds2.dimensions_info()
dimension dimension_ID description
0 FREQ CL_FREQ Frequency
1 ETA CL_ETA1 Age class
2 ITTER107 CL_ITTER107 Territory
3 SESSO CL_SEXISTAT1 Gender
4 STACIVX CL_STATCIV2 Marital status
5 TIPO_INDDEM CL_TIPO_DATO15 Data type 15

We can also look at the dimensions of a dataflow by simply accessing its dimensions attribute. However, this does not include the dimensions' descriptions.
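
For example, for the resident-population dataset above, accessing the attribute gives just the dimension IDs. A sketch, assuming dimensions holds a plain list of names:

# Sketch: dimension IDs without descriptions (assumes `dimensions` is a plain list)
ds2.dimensions
# e.g. ['FREQ', 'ETA', 'ITTER107', 'SESSO', 'STACIVX', 'TIPO_INDDEM']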


source

DataSet.dimensions_info

 DataSet.dimensions_info (dataframe=True, description=True)

Return the dimensions of a specific dataflow and their descriptions.

To have a look at the dimensions together with their descriptions, we can use the dimensions_info method. It returns an easy-to-read pandas DataFrame.

dimensions_df = ds.dimensions_info()
test_eq(dimensions_df.columns, ['dimension', 'dimension_ID', 'description'])
dimensions_df
dimension dimension_ID description
0 FREQ CL_FREQ Frequency
1 CITTADINANZA CL_CITTADINANZA Citizenship
2 DURATA_DISOCCUPAZ CL_DURATA Duration
3 CLASSE_ETA CL_ETA1 Age class
4 ITTER107 CL_ITTER107 Territory
5 SESSO CL_SEXISTAT1 Gender
6 TIPO_DATO CL_TIPO_DATO_FOL Data type FOL
7 TITOLO_STUDIO CL_TITOLO_STUDIO Level of education

The values that the different dimensions can take can also be explored. The available_values attribute contains a dictionary whose keys are the dimensions of the dataset. Each value is itself a dictionary that can be accessed through the values_ids and values_description keys: the former holds the IDs of the dimension's values, the latter their descriptions.
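
For instance, for the DURATA_DISOCCUPAZ dimension of the unemployment-rate dataset the nested dictionary looks roughly like this (sketched from the checks below, not a verbatim output):

# Sketch of the nested structure described above
ds.available_values['DURATA_DISOCCUPAZ']
# {'values_ids': ['TOTAL', 'M_GE12'],
#  'values_description': ['total', '12 months and over']}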

values_dict = ds.available_values
test_eq(isinstance(values_dict, dict), True)
test_eq(sorted(values_dict.keys()), sorted(ds.dimensions))
test_eq(values_dict['DURATA_DISOCCUPAZ']['values_ids'], ['TOTAL', 'M_GE12'])
test_eq(values_dict['DURATA_DISOCCUPAZ']['values_description'], ['total', '12 months and over'])

source

DataSet.get_dimension_values

 DataSet.get_dimension_values (dimension, dataframe=True)

Return the available values of a single dimension in the dataset

ds.get_dimension_values('DURATA_DISOCCUPAZ')
values_ids values_description
0 TOTAL total
1 M_GE12 12 months and over

source

DataSet.set_filters

 DataSet.set_filters (**kwargs)

Set filters for the dimensions of the dataset by passing dimension_name=value

# test dataset from https://github.com/Attol8/istatapi/issues/25
ds = DataSet(dataflow_identifier = "155_358")
assert 'WAGE_E_2021' not in ds.available_values['TIP_AGGR1']['values_ids']

With DataSet.set_filters() we can filter the dimensions of the dataset by passing the values that we want to filter on. The dataset will then only return data matching our filters. A dictionary with the selected filters is stored in the DataSet.filters attribute.

Note that the arguments of DataSet.set_filters() are lowercase, but in DataSet.filters they are converted to uppercase to be consistent with the dimension names in the ISTAT API.

dz = DataSet(dataflow_identifier="139_176")
dz.set_filters(freq="M", tipo_dato=["ISAV", "ESAV"], paese_partner="WORLD")

test_eq(dz.filters['FREQ'], 'M')
test_eq(dz.filters['TIPO_DATO'], ["ISAV", "ESAV"])
test_fail(lambda: dz.filters['freq'])  # the filter is not saved in lowercase