Discovery

Functions used to discover and explore the data exposed by the ISTAT web service.

This module implements functions to discover the data exposed by ISTAT. To do so, istatapi makes metadata requests to the API endpoints. The Discovery module provides useful methods to parse and analyze API metadata responses. It makes use of the pandas library and returns data in DataFrame format, making it convenient for interactive and exploratory analysis in Jupyter notebooks.

The main class implemented in the Discovery module is DataSet.
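
All of the code examples below assume that the discovery helpers and the DataSet class are already in scope. A minimal import sketch, assuming they are exposed by the istatapi.discovery module (the test_eq and test_fail helpers used in the checks come from fastcore.test):

# Assumed imports for the examples in this page (a sketch, not verbatim library documentation)
from istatapi.discovery import all_available, search_dataset, DataSet
from fastcore.test import test_eq, test_fail  # only needed to run the inline checks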


source

parse_dataflows

 parse_dataflows (response)

Parse the response containing all the available datasets and return a list of dataflows.

The simplest way to get a full list of the dataflows provided by ISTAT is to call the all_available() function, which returns a list of all the explorable dataflows together with their IDs and descriptions.


source

all_available

 all_available (dataframe=True)

Return all available dataflows

available_datasets = all_available()
available_datasets.head()
df_id version df_description df_structure_id
0 101_1015 1.3 Crops DCSP_COLTIVAZIONI
1 101_1030 1.0 PDO, PGI and TSG quality products DCSP_DOPIGP
2 101_1033 1.0 slaughtering DCSP_MACELLAZIONI
3 101_1039 1.2 Agritourism - municipalities DCSP_AGRITURISMO_COM
4 101_1077 1.0 PDO, PGI and TSG products: operators - munici... DCSP_DOPIGP_COM
print(f'number of available datasets: {len(available_datasets)}')
number of available datasets: 509
test_eq(available_datasets.columns, ['df_id', 'version', 'df_description', 'df_structure_id'])

source

search_dataset

 search_dataset (keyword)

Search available dataflows that contain keyword. Return these dataflows in a DataFrame

This function looks for keyword inside all dataset descriptions. By default, the keyword needs to be an English word.

df = search_dataset(keyword="Tax")
df.head()
df_id version df_description df_structure_id
168 168_261 1.1 Hicp - at constant tax rates annual data(base ... DCSP_IPCATC2
169 168_306 1.2 Hicp - at constant tax rates monthly data (bas... DCSP_IPCATC1
172 168_756 1.4 Hicp - at constant tax rates monthly data (bas... DCSP_IPCATC1B2015
173 168_757 1.1 Hicp- at constant tax rates annual data (base ... DCSP_IPCATC2B2015
267 30_1008 1.1 Irpef taxable incomes (Ipef) - municipalities MEF_REDDITIIRPEF_COM
test_fail(lambda: search_dataset(keyword="disoccupazione"))  # an Italian keyword finds no match in the English descriptions, so this raises

Data Structures and Information about available Datasets


source

DataSet

 DataSet (dataflow_identifier:str, resource:str='datastructure')

Class that implements methods to retrieve information (metadata) about a dataset

The class takes a df_id, df_structure_id, or df_description as input. These three values can be found with the all_available() function.
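
Since any of the three identifiers is accepted, the unemployment-rate dataset used below could in principle also be instantiated from its structure ID or its description. A sketch, assuming the lookup treats the three fields interchangeably:

# Sketch: the same dataset referenced through different identifiers
# (assumes df_id, df_structure_id and df_description are resolved the same way)
ds_by_id = DataSet(dataflow_identifier="151_914")
ds_by_structure_id = DataSet(dataflow_identifier="DCCV_TAXDISOCCU1")
ds_by_description = DataSet(dataflow_identifier="Unemployment  rate")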

ds = DataSet(dataflow_identifier="151_914")
test_eq(ds.identifiers['df_id'], '151_914')
test_eq(ds.identifiers['df_description'], 'Unemployment  rate')
test_eq(ds.identifiers['df_structure_id'], 'DCCV_TAXDISOCCU1')
ds2 = DataSet(dataflow_identifier="22_289")
test_eq(ds2.identifiers['df_id'], '22_289')
test_eq(ds2.identifiers['df_description'], 'Resident population  on 1st January')
test_eq(ds2.identifiers['df_structure_id'], 'DCIS_POPRES1')
# test Dataset 729_1050 (https://github.com/Attol8/istatapi/issues/24)
assert len(available_datasets.query('df_id == "729_1050"')) == 1
# test that it raises ValueError if no dataset is found
test_fail(lambda: DataSet(dataflow_identifier="729_1050"), contains="No available data found for the requested query")
ds2.dimensions_info()
dimension dimension_ID description
0 FREQ CL_FREQ Frequency
1 ETA CL_ETA1 Age class
2 ITTER107 CL_ITTER107 Territory
3 SESSO CL_SEXISTAT1 Gender
4 STACIVX CL_STATCIV2 Marital status
5 TIPO_INDDEM CL_TIPO_DATO15 Data type 15

We can also look at the dimensions of a dataflow by simply accessing its dimensions attribute. However, this does not include the dimensions' descriptions.
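
For example, for the resident-population dataset above, accessing the attribute gives just the dimension IDs. A sketch, assuming dimensions holds a plain list of names:

# Sketch: dimension IDs without descriptions (assumes `dimensions` is a plain list)
ds2.dimensions
# e.g. ['FREQ', 'ETA', 'ITTER107', 'SESSO', 'STACIVX', 'TIPO_INDDEM']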


source

DataSet.dimensions_info

 DataSet.dimensions_info (dataframe=True, description=True)

Return the dimensions of a specific dataflow and their descriptions.

To have a look at the dimensions together with their descriptions, we can use the dimensions_info method. It returns an easy-to-read pandas DataFrame.

dimensions_df = ds.dimensions_info()
test_eq(dimensions_df.columns, ['dimension', 'dimension_ID', 'description'])
dimensions_df
dimension dimension_ID description
0 FREQ CL_FREQ Frequency
1 CITTADINANZA CL_CITTADINANZA Citizenship
2 DURATA_DISOCCUPAZ CL_DURATA Duration
3 CLASSE_ETA CL_ETA1 Age class
4 ITTER107 CL_ITTER107 Territory
5 SESSO CL_SEXISTAT1 Gender
6 TIPO_DATO CL_TIPO_DATO_FOL Data type FOL
7 TITOLO_STUDIO CL_TITOLO_STUDIO Level of education

The values that the different dimensions can take can also be explored. The available_values attribute contains a dictionary whose keys are the dimensions of the dataset. Each value is itself a dictionary that can be accessed through the values_ids and values_description keys: the former holds the IDs of the dimension's values, the latter their descriptions.
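
For instance, for the DURATA_DISOCCUPAZ dimension of the unemployment-rate dataset the nested dictionary looks roughly like this (sketched from the checks below, not a verbatim output):

# Sketch of the nested structure described above
ds.available_values['DURATA_DISOCCUPAZ']
# {'values_ids': ['TOTAL', 'M_GE12'],
#  'values_description': ['total', '12 months and over']}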

values_dict = ds.available_values
test_eq(isinstance(values_dict, dict), True)
test_eq(sorted(values_dict.keys()), sorted(ds.dimensions))
test_eq(values_dict['DURATA_DISOCCUPAZ']['values_ids'], ['TOTAL', 'M_GE12'])
test_eq(values_dict['DURATA_DISOCCUPAZ']['values_description'], ['total', '12 months and over'])

source

DataSet.get_dimension_values

 DataSet.get_dimension_values (dimension, dataframe=True)

Return the available values of a single dimension in the dataset

ds.get_dimension_values('DURATA_DISOCCUPAZ')
values_ids values_description
0 TOTAL total
1 M_GE12 12 months and over

source

DataSet.set_filters

 DataSet.set_filters (**kwargs)

Set filters for the dimensions of the dataset by passing dimension_name=value

# test dataset from https://github.com/Attol8/istatapi/issues/25
ds = DataSet(dataflow_identifier = "155_358")
assert 'WAGE_E_2021' not in ds.available_values['TIP_AGGR1']['values_ids']

With DataSet.set_filters() we can filter the dimensions of the dataset by passing the values that we want to filter on. The dataset will then only return data matching our filters. A dictionary with the selected filters is stored in the DataSet.filters attribute.

Note that the arguments of DataSet.set_filters() are lowercase, but in DataSet.filters they are converted to uppercase to be consistent with the dimension names in the ISTAT API.

dz = DataSet(dataflow_identifier="139_176")
dz.set_filters(freq="M", tipo_dato=["ISAV", "ESAV"], paese_partner="WORLD")

test_eq(dz.filters['FREQ'], 'M')
test_eq(dz.filters['TIPO_DATO'], ["ISAV", "ESAV"])
test_fail(lambda: dz.filters['freq'])  # the filter is not saved in lowercase