A first look at the FUSE spreadsheet corpus

Following on the first paper by Titus Barik et al
static.barik.net/barik/publications/msr2015/PID3640389.pdf
and some work by Mark Townsend analysing the last row and column used in each file
http://markstownsend.com/what-are-all-the-rows-for/index.html

I downloaded the corpus (249,376 files, about 7GB) and did some summary analysis of the files and their VBA.

The top domain is .org (29.5%), followed by .gov (27.7%).
That’s because almost half the files are from one web site – triathlon.org. They look like files that were filled in for reporting purposes, and so contain no formulas.
Files that are simple web report downloads or automatically generated (e.g. quickfacts.census.gov) were not user-created spreadsheets at all, and so are of no interest to me.
5,600 have “SpreadsheetGear” as a write access user, all from worldbank.org.

So most of the FUSE spreadsheets are of no interest to me in formula error research.
There are no .xlsm files, although a simple Google search finds 106,000.

Of the 5037 web hosts, http://www.triathlon.org accounts for 106328 files, or 43% of the total.
http://www.triathlon.org 106328
quickfacts.census.gov 47025
theahl.com 10350
The top 3 account for 66% of the files, the top 50 (1% of the hosts) have 87% of the files.
So it’s pretty skewed towards a few domains.
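The concentration figures above can be rechecked with a few lines of Python. This is only a sketch, not the method used for the analysis: the helper names `host_of` and `top_share` are hypothetical, and only the three host counts and the corpus total quoted in this post are used.

```python
from collections import Counter
from urllib.parse import urlparse

def host_of(url):
    """Return the lower-cased host part of a URL (hypothetical helper).

    urlparse only fills in netloc when a scheme or a leading '//' is present,
    so bare hosts such as 'quickfacts.census.gov/...' get '//' prepended.
    """
    parsed = urlparse(url if "://" in url else "//" + url)
    return parsed.netloc.lower()

def top_share(counts, n, total=None):
    """Fraction of all files that come from the n hosts with the most files."""
    if total is None:
        total = sum(counts.values())
    top = sum(count for _, count in counts.most_common(n))
    return top / total

# File counts for the three largest hosts, as quoted in this post.
counts = Counter({
    "www.triathlon.org": 106328,
    "quickfacts.census.gov": 47025,
    "theahl.com": 10350,
})
CORPUS_TOTAL = 249376  # total number of files in the corpus, from this post

print(f"top 1: {top_share(counts, 1, CORPUS_TOTAL):.1%}")
print(f"top 3: {top_share(counts, 3, CORPUS_TOTAL):.1%}")
```

With a full list of source URLs, something like `Counter(host_of(u) for u in urls)` would build the counts, and `total` could then be left out.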

The POI analysis cannot handle BIFF5 files, but they can be processed in Excel if you relax the File Block settings.
The top 80% of files have no formulas or very few, again because they are really data files.
12854 (5.15%) have formulas.

Only 737 had VBA code, and 472 of them had unique VBA content as determined by an MD5 hash.
They typically range from 10 to 2000 lines of code.
102 have “Macro recorded by…” and no Dim statements.
Only 78 of 472 have Option Explicit
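For anyone who wants to repeat this kind of check on their own files, here is a rough Python sketch of the same idea. It assumes the VBA has already been extracted to plain text, one string per file. The function names and the regular expressions are my illustration, not the tooling used for the figures above, and the patterns are simple heuristics (for example, a Dim that follows a colon on the same line would be missed).

```python
import hashlib
import re

def unique_by_md5(modules):
    """Keep the first VBA text seen for each distinct MD5 digest."""
    seen = {}
    for text in modules:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        seen.setdefault(digest, text)
    return list(seen.values())

# Heuristic patterns (assumptions): statements are matched at the start of a line.
OPTION_EXPLICIT = re.compile(r"^\s*Option\s+Explicit\b", re.IGNORECASE | re.MULTILINE)
DIM_STATEMENT = re.compile(r"^\s*Dim\s+\w", re.IGNORECASE | re.MULTILINE)
RECORDED = re.compile(r"Macro recorded by", re.IGNORECASE)

def summarize(text):
    """Return line count and two style indicators for one VBA text."""
    has_dim = bool(DIM_STATEMENT.search(text))
    return {
        "lines": len(text.splitlines()),
        "option_explicit": bool(OPTION_EXPLICIT.search(text)),
        "recorded_no_dim": bool(RECORDED.search(text)) and not has_dim,
    }

# Two made-up modules, one of them duplicated.
recorded = "Sub Macro1()\n' Macro recorded by A. User\n    Range(\"A1\").Select\nEnd Sub\n"
written = "Option Explicit\nSub Total()\n    Dim i As Long\n    i = 1\nEnd Sub\n"

unique = unique_by_md5([recorded, written, recorded])
for text in unique:
    print(summarize(text))
```

Hashing the raw text means that two files differing only in whitespace or line endings count as different, so normalising the text first would give a lower unique count.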

I have prepared a slide deck of the findings, available at:
http://www.sysmod.com/vbainfusecorpus-pobeirne.pdf (185K PDF)

Does this interest anyone?

About Patrick O'Beirne, spreadsheet auditor

Patrick provides consultancy and training in spreadsheet development, auditing / testing and model review; and the Excel addin XLtest

One Response to A first look at the FUSE spreadsheet corpus

  1. profgarrett says:

    Very interesting. The top #n of formulas used is particularly helpful in having some ideas of what formulas should be commonly taught to students.