Download whole dataset facility

(usually called Bulk Download)

 

 

Addresses of the site:

§    for all registered users: http://europa.eu/estatref/download/everybody/

§    for the Commission: http://europa.eu/estatref/download/commission/

 

 

On this site you will find:

-   this read_me file

-   the table_of_contents that includes the list of the datasets available on the Eurostat web site (in two versions: html and txt)

-   the directory (or folder) data that includes the zipped datasets

-   the directory dic that includes the "dictionaries" of all the coding systems used in the datasets

-   some tables of correspondences between old and new codes (of nodes and tables)

-   the directory comext, not treated here.

 

The information stored here is updated twice a day, at 11:00 and 23:00.

 

 

 

File table_of_contents

The file table_of_contents gives the following information on the datasets available on the Eurostat web site (which are stored in "tsv" format in the directory data):

 

title

the title (in English) of the items as it appears on the Eurostat web site (the titles in French and German are available on Eurostat web site).

code

the code of the item.  If the item is a table (see "type" below) this code corresponds to the filename (without the extension) of the dataset stored in the directory data.

N.B. some tables of this list are duplicated, i.e. appear in different "chapters".

type

= dataset (i.e. an "open dataset" available in the directory data)
= table (i.e. a "pre-defined table" available in the directory data)
= folder (i.e. a "section" or "chapter").

last update

last time the dataset/table has been modified.

data start

date of the oldest value included in the dataset (if available).

data end

date of most recent value included in the dataset (if available).

values

number of actual values included in the dataset.

rows

number of rows/records of the file (more details in the chapter Format of the datasets).

cols

number of "tab delimited" columns of the file (more details in the chapter Format of the datasets).

access_typ

type of access:  unfilled = free access  –  "restricted" = restricted access

 


Directory data

This directory includes approximately 4000 datasets in zipped (gz) format.  These datasets are stored in different sub-directories on the basis of the first characters of their file names (e.g. the files that begins with the character "a" are in the sub-directory a; the files that begins with the character "b" are in the sub-directory b; …).

The gz files can be uncompressed with WinZip.

You can sort the files by name, date or size by clicking on the header of the corresponding column.

 

The datasets are available in two formats: "tsv" (explained below) and "dft".  The latter is not treated here because it will be abandoned in 2007 and replaced by the "SDMX-ML" format.  Information on SDMX-ML format can be found at the following address: http://www.sdmx.org.

 

 

 

Format of the datasets

§    The "tsv" (= tab separated values) files are flat files that instead of containing one value per line/record contains a "tab delimited" sequence of values in each line.

Note for Excel users: these files can be straightforwardly opened in Excel (see chapter Hints for Excel users).

§    In most files the sequences of values are time series.

§    In the datasets that do not have the dimension (or attribute) time (e.g. "area of the regions") or that cover only one period of time, the sequences of values are not time series, but another dimension, e.g. geographical series.

§    The columns (or fields or cells) of the records are "tab delimited".

§    The time series are sorted in descending order (see the chapter Hints for Excel users for an explanation).

 

Example of dataset with time series (with made-up values):

unit,s_adj,partner,flow,indic,geo\time

2004m05

2004m04

2004m03

2004m02

mio-eur,nsa,ext_eurozone,net,bp-100,eurozone

11148

10660

13398

9437

mio-eur,nsa,ext_eurozone,net,bp-200,eurozone

3386 e

539

-185

-432

mio-eur,nsa,ext_eurozone,net,bp-300,eurozone

-5626 e

-6696 i

1902

919

mio-eur,nsa,ext_eurozone,net,bp-379,eurozone

-5758 e

-4165

-3970

-4703

mio-eur,nsa,ext_eurozone,net,bp-993,eurozone

3151.5 e

338.7 i

11146.1

5221.0

mio-eur,nsa,ext_eurozone,net,bp-994,eurozone

2314

669

543

2113

mio-eur,nsa,ext_eurozone,net,bp-010,eurozone

5465.1

1006.5

11689.0

7334.3

 

§    First line = the header.

§    Other lines = the records, with the sequence of values.

§    First column of first line = a sequence of codes separated by a comma "," followed by a code separated by a back slash "\"

-   the codes separated by a comma "," are the "names" of the dimensions used for identifying each (time) series

-   for each of these codes there is a file (with the same name plus the extension "dic") in the directory dic

-   the code separated by a back slash "\" is the "name" of the dimension of the sequence of values (e.g. time in case of time series, geo in case of geographical series).


§    First column of all the lines after the first one = sequence of codes separated by a comma "," that represent the "names" of the items (or instances or positions) of the dimensions.  The label/title of these codes can be found in the "dic" file that has the same name of the corresponding dimension.

§    Other columns of the first line = sequence of codes that represent the "names" of the items of the dimension of the sequence of values (each code is followed by a blank – see the chapter Hints for Excel users for an explanation).

§    Other columns of all lines after the first one = sequence of values. If a value has one or more flags, the value is separated from the string of flags by a blank. If there are non flags, the value is followed by a blank (see the chapter Hints for Excel users for an explanation).

§    The decimal symbol used in the files is the dot ".".

 

 

 

Dictionaries ("dic") files

The "dic" files contain the meaning of the codes used in the first columns of the "tsv files.  They are available in German, English and French, and are stored respectively in the sub-directories "de", "en", and "fr".

 

There is a "dic" file for each dimension used in the "tsv" files (i.e. for each code that appears in the first column of first line of the files, excluding "time" – see below).

 

Each "dic" file includes the items of the corresponding dimension, followed the corresponding "labels" (separated by a "tab").

 

There is no "dic" file for the dimension time.  The syntax of the codes used for this dimension is the following:

-  Yearly data: YYYYyMM  –  where YYYY is the year and MM is the number of the month (01 to 12) or "00" ("zero zero") if the value refers to the "whole year" (e.g. is total or average of the year).

-  Semi-annual data: YYYYsSS  –  where YYYY is the year and SS is the number of the semester (01 or 02).

-  Quarterly data: YYYYqQQ  –  where YYYY is the year and QQ is the number of the quarter (01 to 04).

-  Monthly data: YYYYmMM  –  where YYYY is the year and MM is the number of the month (01 to 12).

-  Daily data: YYYYmMMdDD  –  where YYYY is the year, MM is the number of the month (01 to 12) and DD is the number of the day (01 to 28 or 29 or 30 or 31).

 

 

 


Hints for Excel users

§    When you open a "tsv" file, Excel shows you the Text Import Wizard.  For a "basic" import, you can just click on the button Finish and get the table in Excel  (but do not forget that the decimal symbol used in the files is the dot ".", not the comma ",").

§    For a more "sophisticated" import, you can click on Next and, in the second step of the wizard, specify the character(s) used for delimiting the columns:

-   if you choose the Tab as delimiter (= the choice by default), you will get as many columns as specified in the file table_of_contents, i.e. all the "comma delimited" codes will be in the first column and the flags will be in the same cells of the values:

In the Data preview panel you can see how the table will look like.

-   if you chose the Tab and the Space as delimiters and you uncheck the option Treat consecutive delimiters as one, you will get values and flags in separated columns.  The total number of columns will be the double (minus one) of the number of columns given in file table_of_contents:

In the Data preview panel you can see that values and flags are in different columns.

-   if you chose the Tab and the Comma as delimiters, you will get the dimensions and their items in distinct columns (with the exception of the code that is separated by a back slash "\" in the first line);

In the Data preview panel you can see that the codes at the beginning of each line are in separated columns.

-   you can also chose the Tab and the Space and the Comma as delimiters (and uncheck the option Treat consecutive delimiters as one)

In the Data preview panel you can see that codes, values and flags are in different columns.

§    In the third and last step of the Text Import Wizard (see the picture below) you will be able to:

-   chose the Column data format.  The default option General works well for importing "tsv" files, unless you have decided to split the first column (using the Comma as delimiters).  In that case, we advise you to choose the Text format for the columns that contain the codes of the dimensions and their items, otherwise Excel may interpret them incorrectly (e.g. the code for the region DEC1 may become 1/12/2000!).

-   chose to skip columns;

-   define the decimal separator, by clicking on the button Advanced (the decimal symbol used in the "tsv" files is the dot ".").

§    The "notional" limits of an Excel sheet are 256 columns and 65536 lines (the actual limits depend on the memory available on your PC).  If you try to open a file that goes beyond these limits, Excel will "truncate" the exciding columns and/or lines and give a message of this kind::

-   in case of time series with more than 255 periods, you will keep the most recent ones because in the "tsv" files the time series are sorted in descending order;

-   in case of too many lines, you can open the file with Word and split it;

-   huge files can only be used/exploited with more powerful software such as Access, Dbase, Oracle, SAS, SyBase, etc.

§    If you import in Excel the values and flags in different columns, the total number of columns will be the double (minus one) of the number of columns given in file table_of_contents.  If the file is then "truncated" by Excel, the last imported column of values may loose its associated column of flags. In that case last column should be deleted.

§    If you import in Excel values and flags in the same columns, check for special combinations of values and flags such as "10 p", because Excel may automatically convert them into "10:00 PM". To solve this problem you can format the cell(s) as “text”.

§    How to sort the time series (i.e. the columns) in ascending order:

a)  If you have imported in Excel values and flags in the same columns

-   select all the columns that contain values (= typically all the columns but the first one) including the first row

-   open the menu Data and choose the item Sort  –  you will get the following menu:

-   check that the option Header row is selected (as shown in the picture above)

-   click on Options and select the option Sort left to right as shown in the following picture:

-   Click on OK and you will go back to the Sort menu:

-   Click on OK again.

b)  If you want values and flags in different columns, then you can sort the columns in ascending order either writing an ad hoc macro or with a double import:

1)  import the file in Excel with flags and values in the same columns

2)  sort the columns as explained in the previous paragraph (a)

3)  save the file as Text (Tab delimited)

4)  import the file in Excel with flags and values in separated columns (as explained in previous pages).