Format for data files

Back to home page

 

I have a collection of data sets for use during workshops or just to play with when trying out new statistical techniques or computer code.

A big question is what format to use, and I've changed my mind on this several times already!

After looking at this blog post by John Mount I've decided to try using tab-separated files with a .tsv extension.

The advantages of this scheme are:

1. I can have spaces, commas, quotes, hyphens, etc. in my fields, anything except tabs, new lines or carriage returns. So "Elephant, Asian" and "Long-tailed bat" are okay. In a locale where the decimal point is a comma, I can ctrl-h and replace "." with "," and everything still works.

2. I can put comments preceded by a "#" at the top with metadata: explanations of the variables, how the data were collected, citations for the data source and published analyses; metadata in the same file as the data won't get lost. The comments can contain commas and quotation marks without making a mess if I open or save the file in a spreadsheet program.

3. I can open the file in a text editor and see what I've got; the default in Windows appears to be Notepad. With native spreadsheet formats, such as .xls or .ods, what appears in a spreadsheet may not match what's in the file, eg. dates are stored as number of days since a base date (often 31 Dec 1899) but displayed as 6/12/12 or whatever. What happens when that file is read into R?

4. It works for most software packages, including spreadsheets, R, QGIS. In some cases you may need to "start import at line x" to skip the comments. In R, use  read.delim with comment = "#" (or read.table with sep = "\t" and header = TRUE).

5. The .tsv file extension indicates that it's a data file, avoiding confusion with .txt files containing text.

Here's an example data file opened in Notepad and in LibreOffice Calc:

 

Updated 5 June 2012 by Mike Meredith