Extend Excel: London 18 Oct

Charles Williams is organising  a conference in London on October 18th, 2018. I’ll be there! This is a rare chance to meet leading developers from all over the world and hear directly from MS on Excel Extensibility. Expect to be seriously stretched. It’s a community event, not for profit, so the price for corporate-funded attendees and vendors is only £120 including VAT.

More information & booking: https://developexcel.wordpress.com

  • Charles Williams (UK) – Excel MVP:  Overview of extensions, APIs and Tools
  • Juan Balmori (USA) – Senior Program Manager, Microsoft Excel Extensibility:  Javascript APIs, UDFs and UI elements
  • Govert van Drimmelen (SA): Excel-DNA: .Net for Excel
  • Tony Roberts (UK): PyXLL: Python for Excel
  • Bruce Belson (AUS): Planatech XLL Plus: C++ for Excel
  • Gašper Kamenšek (SL): Power Query + M in VBA
  • Simon Murphy (CH): VSTO, Addin Express and other Dev Tools: real world experience

Network with, socialise and meet other Excel developers of addins and solutions

 

Posted in Excel/VBA, ExcelDna, Open Source, Productivity, Research, Software Quality, Software Testing, Uncategorized | Tagged , | Leave a comment

Excel and PowerPoint 2016 for Mac gripes

I’ve been trying to maintain compatibility between Mac and Windows versions of Excel/VBA apps for a while now, and it’s a pain.
These notes refer to versions 16.14 of both Excel and Powerpoint.

Here are some of my recent threads on answers.microsoft.com

Passing a user-defined object to a sub As Object passes an object with all Null properties. Passing As MyClass works. Methods work, just not properties.
Interoperability between Excel 2016 for Mac and Powerpoint 2016 for Mac is very frustrating.
To get VBA to copy a picture of a chart to a slide, I’ve tried four different ways.
Only in the latest Insider Build 16.14 does the obvious Copy / Paste method work but even that requires copying and pasting as an image in Excel first and then copying amd pasting the static picture.

1) Manual copy and paste works, as it does in Windows. I want to automate this for what could be 100 slides.

2) The CopyPicture method fails for all objects. Range gives error 1004 CopyPicture method of Range class failed. ChartArea gives error 438 Object doesn’t support this property or method. Chart.Copypicture the result when pasting is a shape with text “The picture can’t be displayed”. (Update 8-May-18: Excel 16.14 (180507) Chart.Copypicture now works, includes a TIFF format in the copied data. And to save a picture of a range as a file, we can now do range.copypicture followed by adding a temporary chart by chartobjects.add and then chartobject.chart.paste of the picture, then chart.export as PNG)

See my post on answers.microsoft.com.
To work around this, create the picture in Excel:

 oChart.ChartArea.Copy
 ' ws is the target sheet for the picture
 ws.Cells(30, 9).Select 'below the chart; to anchor the Paste
 ws.PasteSpecial 'default Format Link:=False,DisplayAsIcon:=False
 'Format: 0=PNG,1=JPEG,2=GIF,3=Enhanced metafile,4=Bitmap,5=DrawingObject
 ' but UI offers only PDF/TIFF/Drawingobject
 Set oShp = ws.Shapes(ws.Shapes.Count)
 oShp.Copy

 

3) There is no Slide.PasteSpecial method in Mac Powerpoint VBA. Slide.Paste takes NO data type argument.

4) ApplescriptTask can be used to execute the Paste in Powerpoint. Thanks to Ron de Bruin for help here.

This code is in the VBA:

    myChartObject.Copy
    AppleScriptTask "CopyPasteScript.applescript", "PastePowerpoint", ""

Save this code in a file eg “CopyPasteScript.applescript” in the special folder
/Users/myusername/Library/Application Scripts/com.microsoft.Excel

    on PastePowerpoint()
    -- Paste into Powerpoint
    tell application "Microsoft PowerPoint" to activate
    delay 0.1
    tell application "System Events"
    tell process "microsoft powerpoint" to keystroke "v" using command down
    end tell
    end PastePowerpoint

5) Another method is to export the Chart as a picture and then import it in PowerPoint

    Dim oChart As Chart, sFile As String
    Set oChart = ThisWorkbook.Worksheets(1).ChartObjects(1).Chart
    oChart.Parent.Activate ' Chartobject
    sFile = ThisWorkbook.Path & Application.PathSeparator & "test.png"
    oChart.Export sFile, "png" 
    ' Only works on Excel for Mac 16.14 or later; 16.9 to 16.13 produces zero bytes file
    oSlide.Shapes.AddPicture(sFile, False, True, 185, 110, -1, -1)

 

6) In practice, the above method is agonisingly slow, so what I really do is write a VBA script with lines that create a presentation, add slides, set the title text, and call AddPicture using the file names created.
Then I create a temporary presentation in PowerPoint, import that VBA code, execute it, and close the temporary presentation without saving it.

7) Before Chart.Export started to work in Excel 16.14 or later, I did try the ugliest workaround, and it works but it’s really a kludge.

7.1 Create a temporary workbook with a sheet for each slide and copy/paste the title and chart image into it. Add needed properties to the .AlternativeText of the pictures.

7.2 Save the workbook as html.

7.3 Open the tabstrip.htm file, read the text, split it at “<a href=” and extract the tab names.

7.4 for each tab file name, open its htm file, read the text, split it at “<v:shape id=” and parse out the picture file names and properties.

7.5 Now I have the picture file name, add a line to the PowerPoint VBA file I create in (6) above to add a slide using the right title and picture filename.
Horrible, isn’t it?

 

Posted in Excel/VBA, Mac, powerpoint, Uncategorized | 6 Comments

Dublin Tech Summit 18-19 Apr 2018

#DTS18 @dubtechsummit

Dublin Tech Summit is a two day (Wed 18 and Thu 19 April 2018) meetup of startups and business people intended to connect investors to tech companies, and to highlight emerging tech trends.
I attended as a guest of Failte Ireland, and brought two UK attendees to it, to help promote Ireland as a conference destination.20180418_141020,400x300

Several side tracks dealt with start-ups, mentoring, and Eir offered a prize of €5,000 for the best startup pitch. This looks like a good event for any tech company at launch or early development. If you can’t afford it, maybe (in Ireland) talk to your LEO (Local Enterprise Office) for support, or the equivalent in other countries. For example,  La French Tech, an initiative of Business France, featuring Kuzzle, Axione, Qare, and Keymetrics. I saw a number of other stands from French companies; one that caught my eye was Myoeno.com, a wine scanner device to measure oenological characteristics.

My opinions (and bear in mind I’m probably not its target market):
It is not a training event, there were no real techie or “how to” presentations, but an educational / product promotional event in a general sense loaded with “fantastic / incredible / amazing” buzzwords and hype. I understand that slides will be available later.
The schedule only gave speaker names and very few descriptions of what the talks were about. This did not improve as the event grew closer, so it became clear that speakers were chosen for being able to give engaging and fluent presentations, and not for promising any useful content. The general sense is of watching an animated magazine article. The panel sessions featured experts, including local ones such as Brian Honan, but the limited time meant that topics were necessarily high-level discussions rather than informative.

Quantum Computing, Bo Ewald, D-Wave systems: because many terms were used without definitions or examples, like “annealing”, this would have made more sense to me if I had known more about the topic, but the 2017 story of China communicating with a satellite using “entangled” photons was interesting.

Patricia Scanlon of Soapbox Labs described the particular challenges of recognising children’s voices.

The Publishing Revolution session spoke of developments in eBooks and interactive learning and self-publishing.

Fred Laluyaux of Aera spoke of the “self-driving” enterprise. Their solution to integration and interoperability of 140 separate IT systems is to introduce a 141st and link every other one to it to create a “digital twin” to capture corporate memory. Maintenance, anyone?

The MusicXTech events were in large nearly empty rooms, where other tracks were packed out.

20180418_140931,400x300
One MusicXTech session interested me, on “The Visualisation of Music” but the schedule had no information on the speaker and no abstract. I tweeted the organisers asking for more info but got no reply. I went anyway but it was nothing to do with the visualisation either of musical or music notation, but rather a look back of how the early augmented reality scenes in the Gorillaz ‘Humanz’ app were created by B-Reel and how it’s much easier now. A bit of pop history, maybe.

The Casey Neistat session was entitled “How to be heard” but he gave no “how to” information at all – in fact, disclaimed it. It was simply the story of his YouTuber career and probably interested people in media.

Jason Zander of Microsoft described the Azure eco system – he began by saying that at a dev conference the first thing he’d do is launch into Visual Studio so that sounds more like my kind of event.

At the panel on Bitcoin, an audience show of hands showed that more than 50% claimed to own Bitcoin. They expect the performance problems to be addressed someday – currently BTC handles 7 transactions per second but Visa handle 24,000 tps.

A session entitled “The Future is Dominated by Customer Experience​” sounded promising but again came across as a rehash of fairly obvious “customer first” assertions. It might have been of more interest to discuss why this does not actually happen in practice, what the motivations are that result in CX failure. Like the app for this event.

Jordan Bitterman of IBM described current applications of AI. He gave examples like Shazam for song recognition, “Do not pay” to challenge parking tickets, and voice recognition identifying which adolescents showed mental health risks. He also reported on using location data to infer possible forced migration of refugees before they are washed up on shore.

The Grip App for the event was my Gripe App because it was useless for planning my attendance. It had networking features but nobody I clicked as “interested” responded, perhaps because of the thousands of attendees, the few they chose to show me were not particularly relevant. They trumpet the “AI” aspect without saying what that means – probably just dumb text matching between stated interests. There is no search option in the Grip app. Well, there is but when I typed in the name of a speaker I was interested in hearing I got one result dated 1st April 2016 and no further info.
The only way to view the schedule is to select one track, then one day, then scroll down, tap the down arrow beside each session title in turn to see who the speakers are and any description, read it to see if it’s what you want, and if not then scroll down more and repeat. And repeat all that for the next day. And repeat all that for each of the four tracks. What a crock.
If anyone designing the app had much experience of attending a conference they would know the top task is to select which sessions to attend in which tracks.
I need an overview of the whole schedule with detail of all the talks. If I tap on a speaker, I want to see what sessions they are in and when and where.
There is a grid overview offered but most of the time it does not download. (I subsequently learned that was due to demand throttling.) And when it did, it was view-only in the reader, it could not be saved on my device so I would not have to download again. The smart solution is to print the schedule yourself before coming.
Or as some do, take a picture of the session displays on the screens outside the rooms. And do it while it’s on show – part of the time it wipes away and come back, for no discernable reason except to frustrate the reader.
A simple backup system for the organisers would have been to give everyone an A4 sheet with the parallel sesssions listed, to enable planning.
On the second day I took the time to scroll the schedule early morning and write down my choices on paper.

 

Posted in Uncategorized | Leave a comment

Free Add-In to profile VBA speed and coverage

I’m making available a beta version of my VBA performance profiler for people to try it out and let me know what they think. I’m interested in seeing what kinds of VBA code trip it up and produce syntax errors.

It adds profiling code to your VBA project which collects procedure timing and code coverage statistics. Here is what the Help form shows:

INSTRUCTIONS

Open the file XLTEST_VBA_PROFILER.xlam and enable macros, or install it as an add-in.

Open the workbook you want to profile. Save it under a different name to avoid overwriting the original.

On the Developer tab, in the VBA Profiler group, click the button “Profiling”.

This gives a form in which you can specify whether you wish to:

1. Add a call to every line of code to trace its execution for coverage analysis.
A tab-delimited text file is output of all the code after trace lines have been added
so that the Sub VBA_Profile_Reports can show what lines of code were executed.
The text file is named from your workbook name suffixed with _VBA_COVERAGE.TXT

2. Collect Profile timing data (needed to produce profile reports)
If unchecked, it sends the output to Outputdebugstring.
You can watch that output using TraceSpy (https://tracespy.codeplex.com/)
To be sure that TraceSpy can keep up with the output, do not use Auto-Scroll.

3. Export all the VBA to a text file before making the changes.
This is just for your record, it is not used in this process.
The tab-delimited text file is named from your workbook name suffixed with _VBA_BEFORE.TXT

When you click OK, it adds a line to every procedure in the project, unless its declaration line contains a comment ‘@notrace

After it is finished, go into the VBE and click Compile to check that no syntax errors have been introduced. If you get a Compile error: Variable not defined VBA_Profiler then you need to add the Tools Reference to VBA_Profiler although that should have been automatically added by the profiler.

If you checked Profile timing above, this line looks like this:
Dim P_ As Object: Set P_ = VBA_Profiler.New_Profiler(): P_.Start 12345, “modulename”, “procname”
Otherwise it looks like
Dim P_ As Object: Set P_ = VBA_Profiler.New_Debugger(): P_.Start 12345, “modulename”, “procname”

The way that works is that when the procedure exits and the variable P_ goes out of scope, the profiler counts up the number of calls and the length of time the procedure took. The resolution of the timer is about a microsecond.

If you checked code coverage above, it inserts a call at the beginning of every exceutable line like this:
P_.O 12345&:

In the examples above 12345 is a unique line number in the project, corresponding to lines in the coverage file.

Finally, insert a test sub which initialises the profiler, calls your main line procedure, and then produces reports.

Sub testMyProc
 VBA_Profile_Initialize ThisWorkbook
 MyProc
 VBA_Profile_Reports 
End Sub

VBA_Profile_Reports outputs a tab-delimited profile report named from the workbook name suffixed by _VBA_PROFILE.TXT;
and if you do the coverage analysis, a workbook is created with one sheet containing the contents of the _VBA_COVERAGE.TXT file with an extra column added to show the number of times a line of code was executed.
The workbook is unsaved, you can save it as a .xlsx file if you wish to keep it.

Let me know what you think!

Posted in Excel/VBA, Software Testing | 8 Comments

Tracing & Profiling VBA speed with TraceSpy

I’m interested in measuring VBA performance. I add tracing calls to the routines so I get the time on entrance and exit. I also optionally add a call to trace every line executed, which I use when I get unexpected crashes (“Excel has stopped working”), so that I can see the line it crashed on. Of course, Debug.Print is no use in that so I use the Windows API OutputDebugString. I used to monitor that by DebugView but was disappointed to see when I examined the log that dbgview.exe missed some output. It looks like if one call – eg an entry into a fast sub – is followed too quickly by another, DebugView only reports the second output. That makes it impossible for me to calculate the elapsed time for each sub. So I looked around and found TraceSpy. It kept up very well. The only thing I had to do was to turn off its Auto Scroll. With auto-scroll it looked like its queue filled up and it would take a half hour to finally work through all the debug messages, and if VBA was too fast, it would miss the last half of them. So I also put a Sleep 0 after OutputDebugString to get Excel to give other processes a chance. It doesn’t need DoEvents, I’m not looking to process events in Excel itself, just to give TraceSpy some time. TraceSpy does not log to file but it is easy enough to click on the first log line, Shift+End to select to the end, Copy entire line, and paste into Notepad++ to be able to save it. My longest log files so far are 600,000 lines, so I have not yet hit the Notepad++ file size limit.

When I saw the results, I could see which functions were slow and needed optimising so that the overall execution speed would be improved.

Posted in Software Testing, Uncategorized | Leave a comment

Rubberduck VBA addin

The Rubberduck VBA (@rubberduckvba) open source project has been going for a couple of years, and when I was reminded of it recently I decided to give it a try. It’s definitely a help in VBA development and I’ll go on using it with Excel.

It’s an addin for the VBE add-ins manager. I am currently using version 2.0.11.2453, downloaded from Github. I look forward to some enhancements they discuss on Github.

I focused on Code Inspections, where Rubberduck warns you about potential quality issues by inspecting the code in a VBProject, either Excel or Access. It also has Refactorings and Unit Tests, which I shall look at later.

The list of inspections is here. Some of them I don’t enable (eg “Encapsulate Public Field”, “Multiple Declarations”) as they are not issues for me, although the style police might think otherwise. Others I ignore with a comment flag in one specific place but leave the inspection active in case the poor style arises by mistake elsewhere.

I found one problem in one of my projects using this, so it’s already worthwhile.

The name, in case you’re curious, derives from the saying that if you don’t have a peer to review your code with, explaining it to a rubber duck helps you see things you missed first time.

And finally, if you have a special interest in Excel VBA code quality, please consider sharing your knowledge with others by presenting at the annual conference of the European Spreadsheet Risk Interest Group (@eusprig) . July 6, Imperial College London. http://www.eusprig.org/call-for-papers.htm

Posted in Excel/VBA, Open Source, Software Testing, Uncategorized | Leave a comment

VBA code for Microsoft Text Translator API

I have now updated the translator VBA source code to V3.

https://docs.microsoft.com/en-us/azure/cognitive-services/translator/migrate-to-v3

V2 was deprecated on April 30, 2018 . The Microsoft Translator Hub was to be retired on April 30, 2019. View important migration information and dates.

On Apr 30 Microsoft extended availability of version 2 of the Translator Text API:
“After receiving strong customer feedback about the scheduled retirement of version 2, we’ve decided to continue availability of this version along with version 3. Your applications that use version 2 will remain functional. Although version 2 will continue to be available, please migrate your applications to version 3 as your development cycle allows. By moving to version 3, you get access to exclusive functions, such as neural models, which are trained through Custom Translator.”

The V2 content below is now obsolete. See the V3 link above for new content.

Microsoft retired the Datamarket Translation API on April 30, 2017.
You need to sign up to the Azure Portal and use the Text Translator API in Cognitive Services.

https://translatorbusiness.uservoice.com/knowledgebase/articles/1078534-microsoft-translator-on-azure
Action Required before April 30, 2017: Microsoft Translator Moves to Azure
How to sign up for Microsoft Translator on Azure

https://www.microsoft.com/cognitive-services/en-us/translator-api
The Free plan allows 2M characters per month.

Documentation and interactive trials :
http://docs.microsofttranslator.com/text-translate.html

You define an Account Name and get two keys. Either work, I don’t know yet what the difference is.

Before using “Try out a Method” on that page, you first use your Key 1 to get an access token that lasts 10 minutes and then you can try the methods.
In the methods, leave the Authorization blank, and for appid specify Bearer followed by a space and the access token.

To use your key in VBA, I now provide a VBA module that I use in Excel for translation.
You may use this free with attribution. I am available for custom Excel/VBA development work.

Download here: http://www.sysmod.com/MSTranslator4.bas

A test with 9,900 phrases in 23 European languages totalling 241K took under 30 minutes to complete.
I did not use Sleep to throttle the performance, and no translations were missed.
Microsoft do not indicate a rate limit, only the Azure Subscription Plan total: 2M characters/month for the free plan.
To learn more about the technology, see
https://hubtest.microsofttranslator-int.com/Help/Download/Microsoft%20Translator%20Hub%20User%20Guide.pdf
“This guide will benefit any person who is interested in building a custom translation system using the Hub. A deeper background in machine translation is not essential to use the Hub. “

My old post from 2012 on the Datamarket translator is at
https://sysmod.wordpress.com/2012/01/09/microsoft-translator-vba-code-sample/

Example:

en Translate
pt Traduzir
fr Traduire
de Übersetzen
fi Kääntää
nl Vertalen
es Traducir
it Traduci
no Oversette
ja 翻訳
zh-CHS 翻译
ar ترجمة

Posted in Excel/VBA, Open Source, Productivity | Tagged , | 19 Comments

Conditional compilation VBA Excel Mac/Win/32/64…

To make my addin handle all the combinations of Excel 2007 onwards on Mac and Windows, 32-bit and 64-bit, I ended up with this conditional compilation block. The example I use is GetTickCount() but the principle applies to other declarations.

Microsoft say “We expressly DO NOT support external use of Win32 APIs on the Mac, even if we have an implementation of them in the Office frameworks.” If you ever find documentation on these, let me know.

'https://msdn.microsoft.com/en-us/library/office/ee691831(v=office.14).aspx
'Compatibility Between the 32-bit and 64-bit Versions of Office 2010
'Apple has deprecated HFS on the Mac, so Mac Office 2016 no longer uses or supports HFS paths.
'You must use POSIX paths (with "/" as the directory separator) in Mac Office 2016.
' In Macscript, when getting folders or files in Mac Office 2016,
' insert "return posix path of " before "(choose folder..."

#If Mac Then
 #If MAC_OFFICE_VERSION >= 15 Then
  #If VBA7 Then ' 64-bit Excel 2016 for Mac
   Declare PtrSafe Function GetTickCount Lib _
"/Applications/Microsoft Excel.app/Contents/Frameworks/MicrosoftOffice.framework/MicrosoftOffice" () As Long
  #Else ' 32-bit Excel 2016 for Mac
   Declare Function GetTickCount Lib _
"/Applications/Microsoft Excel.app/Contents/Frameworks/MicrosoftOffice.framework/MicrosoftOffice" () As Long
  #End If
 #Else
  #If VBA7 Then ' does not exist, but why take a chance
   Declare PtrSafe Function GetTickCount Lib _
"Applications:Microsoft Office 2011:Office:MicrosoftOffice.framework:MicrosoftOffice" () As Long
  #Else ' 32-bit Excel 2011 for Mac
   Declare Function GetTickCount Lib _
"Applications:Microsoft Office 2011:Office:MicrosoftOffice.framework:MicrosoftOffice" () As Long
  #End If
 #End If
#Else
 #If VBA7 Then ' Excel 2010 or later for Windows
  Declare PtrSafe Function GetTickCount Lib "kernel32" () As Long
 #Else ' pre Excel 2010 for Windows
  Declare Function GetTickCount Lib "kernel32" () As Long
 #End If
#End If

 

See http://www.rondebruin.nl/mac/ for lots more on Excel for Mac, in particular Macscript code for file and folder dialogs.

For Windows API declarations with the new LongPtr type, see

http://www.jkp-ads.com/articles/apideclarations.asp

My previous posts on Excel for Mac are here:

https://sysmod.wordpress.com/2011/10/21/list-of-mac-excel-2011-differences-from-excel-2011/

https://sysmod.wordpress.com/2011/10/24/more-differences-mainly-vba/

https://sysmod.wordpress.com/2011/10/28/mac-excel-vs-pc-excel/

 

UPDATE 14-March-2019

Excel VBA is now throwing an error 53 file not found for

“/Applications/Microsoft Excel.app/Contents/Frameworks/MicrosoftOffice.framework/MicrosoftOffice”

So the GetTickCount() function is no longer available, unless any reader has an idea.

 

Posted in Excel/VBA, Uncategorized | Tagged , , | 2 Comments

Excel & Genes: mutation and curation

On 23 Aug a paper by three Australian authors gained much publicity about how Excel interprets data entry and how the users don’t check their data.

http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

Gene name errors are widespread in the scientific literature. By Mark Ziemann, Yotam Eren and Assam El-Osta. Genome Biology 2016 17:177 DOI: 10.1186/s13059-016-1044-7 ©  The Author(s). 2016. Published: 23 August 2016

They wrote scripts to trawl through published papers on genetics which had data files attached, and checked those files for data errors. They screened 35,175 supplementary Excel files and confirmed gene name errors in 987 supplementary files from 704 published articles in 18 journals. Linear-regression estimates show gene name errors in supplementary files have increased at an annual rate of 15 % over the past five years, outpacing the increase in published papers (3.8 % per year). “In conclusion, we show that inadvertent gene name conversion errors persist in the scientific literature, but these should be easy to avoid if researchers, reviewers, editorial staff and database curators remain vigilant.”

The root problem is that the researchers who uploaded those files NEVER checked them. Once you have seen this once, you’re sensitised to it and can take steps to avoid it in future.

The reason they never check is because Excel is seen as a simple application, not programming or software development or data management. People don’t learn how to use it any more than they would take courses in Word or Powerpoint. So this point-and-click approach gives rise to attitudes like that expressed in a comment to a Slate article that Excel should just take the input and not change it. The problem with running with such naïve unvalidated expectations about any software package is that you then run the risk of publishing career-limiting mistakes.

There’s a skills certification called “Spreadsheet Safe” and one of their points is “validate CSV file imports.”

http://www.spreadsheetsafe.com/us_main_page_section/services/

I contacted the authors and Mark Ziemann kindly sent me three example files. The solution is so easy … when importing that CSV/TSV/TXT file from the data capture instrumentation, use the “Get External Data button”, set the data type for the gene columns and other IDs to text, and that’s it. A few seconds’ work.

Mark Ziemann says “Most, but not all of these data files are imported as text or csv formats from instrumentation such as DNA sequencers, gene microarrays or proteomics screens. Many of these files appear to be heavily modified from their original format and contain colour coding, modified column headers and additional columns. Some files are comparisons of 2 or more datasets or more in the same worksheet. A smaller number of files was simply a filtered list of gene names that could be a group of candidate genes for future analysis. As most error-containing files I screened also had accession numbers or other identifying information, the risks to altering the conclusions of the study by gene name errors are minimal, but embarrassingly common. Statistical analyses of large datasets is done in R, Matlab, Python, etc and its common to save the data in XLS file so that other researchers in the study can open and inspect the data. ”

Here’s how to work it:

First open the file as normal to eyeball it. I picked the CSV file first:

example1-1

Firstly, D4:E6 look different from the rest of columns D&E. What’s happening there is that the numeric values are outside Excel’s range of approximately 2E-308 to 1E308 so they are stored as text. Let’s take a quick look at the other columns. On the Data ribbon tab, click Filter.

Click the dropdown for “geneid” first:

example1-geneid

Well, that shows something is amiss – what are dates doing there? Pick the first one and apply that filter:

example1-sep11

So, one gene name has been changed to a date and displayed as Sep-11; that’s 1-Sep-2011. You can check some more, but you should now realise that this column (and maybe others, although in this specific case they all look like they should be numbers) needs to be imported as Text. On the Data tab, Get External Data, click From Text. In fact, if you were opening a .txt or .tsv file, Excel would offer you this Text Import Wizard right away rather than just opening it as it does with .csv files.

example-tiw1

Yes, it’s Delimited by commas, so click Next and check the Comma box :

example-tiw2

Click Next again:

example-tiw3

This is it. See where Excel says it converts date values to dates? Anything that looks like a date value in your locale, like MARCH1 or SEPT11, will become a date. So override that by checking the Text option:

example-tiw3text

Click Finish and specify where you want it to go:

example-tiw4

I usually like to click Properties and uncheck “Save query definition” as otherwise a data link is created in this workbook so that you can refresh it later from the same source file.

After using the Filter button and checking the “geneid” column, that now looks clean. There is still the text data in D4:E6 but Excel can’t handle that as numbers so you’ll have to know what to do with it.

example1-after

To further check on the data, I would add a sheet to calculate for each column the count of all values, of numbers, of blanks, and the minimum, maximum, average, and sum of the numeric data and the lengths of the text data. That would tell me of other oddities that need investigation such as missing or out-of-range data.  A check for duplicates reveals that eight of the gene IDs are duplicated in this sample.

If I had a reference list of gene names that should be in this sample, I could also do a lookup both ways to report gene IDs in the sample not in the reference, and those in the reference not in the sample. In any domain, there are always specific tests that can be performed. In addition, data can be formatted or highlighted in various ways to convey some significance.

What all that means is up to the expert in that domain, but it illustrates the tests that are easy to perform in Excel, and where there is really no excuse to skip the task of curating the data for presentation.

Example files

The paper’s lead author Mark Ziemann has kindly made available three files if you’d like to try it yourself. CSV, TSV and TXT compressed zip files.
example1.csv.zip 547K

example2.txt.zip 801K

example3.tsv.zip 480K

He adds “They were generated after statistical analysis of gene expression data from Illumina DNA sequencing instruments or microarray instruments. It is common to sort these sheets by p-value to help end-users see what the most significant datapoints are. Duplicated gene names can be an issue for downstream work and it is common practise to keep only the most significant one.”

Other references

This sample media comment illustrates the contrasting views:
http://www.slate.com/blogs/future_tense/2016/08/24/excel_created_major_typos_in_20_percent_of_scientific_papers_on_genes.html
Excel Created Major Typos in 20 Percent of Scientific Papers on Genes

One of the comments reads “Even if I were to pretend this was actually Excel’s fault, I still can’t understand them being so stupid as to not double check their work before publishing. No excuse for that.”

Another says “I read the article and give not two whits for what the last sentence says. I shouldn’t have to “wrestle” with the software to convince it to just let me enter my data already and stop pretending like it knows what I want better than I do.”

http://www.thetimes.co.uk/article/microsoft-excels-at-giving-genes-a-bad-name-db9xjlzpr

http://chandoo.org/wp/2016/08/26/psa-dont-let-auto-correct-spoil-your-party/

http://www.bbc.com/news/technology-37176926

The 2004 paper:

Zeeberg BR, Riss J, Kane DW, Bussey KJ, Uchio E, Linehan WM, et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004;5:80. http://dx.doi.org/10.1186/1471-2105-5-80

In 2012 this blog post says it’s still happening:

https://nsaunders.wordpress.com/2012/10/22/gene-name-errors-and-excel-lessons-not-learned/

I guess there are always people yet to learn for the first time. Maybe spend $40 of their research budget on a good general Excel book by Walkenbach, or one of the specific Excel For Data Analysis books?

Even textbooks don’t always cover good practice. Amazon shows a number of books for “Excel Data Analysis”. Those would be more specialised than the general books critically reviewed here:

Journal of Education for Business 90(4):169-174 · March 2015 by Nathan Garrett

https://www.researchgate.net/publication/276166463_Textbooks_for_Responsible_Data_Analysis_in_Excel

Garrett concludes:   “Teaching Excel as a point-and-click tool, or only examining individual features in isolation, results in amateur excel programmers who understand features, but do not know how to tie these features together.  The universally high error rates found in the field show the need for improving the state of instruction. Curriculum needs to reflect a professional approach. Without this shift, amateur work will continue to be the norm.”

https://support.office.com/en-ie/article/Excel-specifications-and-limits-ca36e2dc-1f09-4620-b726-67c00b05040f
Excel specifications and limits

 

 

Posted in Excel/VBA, Research, Risk, Software Testing | 4 Comments

Eusprig 2016 July 7, London

The European Spreadsheet Risks Interest Group 17th Annual Conference
(EuSpRIG 2016) will be held at the Institute of Chartered Accountants in
England and Wales (ICAEW), Moorgate, London, on the 7th July 2016.

http://www.eusprig.org/annual-conference.htm

The programme is now available. Register now to take advantage of the early bird discount! Here are the titles of the papers:

     Teaching methods are erroneous: approaches which lead to erroneous 
end-user computing.  Mária Csernoch, Piroska Biró

     Characteristics of Spreadsheets Developed with the SSMI 
Methodology.    Paul Mireault

     A Programmatic Approach to the development of Solutions in Excel. 
Peter Bartholomew

     A Conceptual Model for Measuring the Complexity of Spreadsheets. 
     Thomas Reschenhofer, Bernhard Waltl, Klym Shumaiev, Florian Matthes

     A Pilot Study Exploring Spreadsheet Risk in Scientific Research     
Ghada AlTarawneh, Simon Thorne

     The use of the Power Query / Get & Transform tools in Excel     
Simon Hurst

Registration and other information: http://www.eusprig.org

————————————————————————-
In other news:

At 18:00 the previous evening, Wed July 6th, the ICAEW launch their
Spreadsheet Competency Framework, a new guide for classifying
spreadsheeting ability in finance professionals. The event features
speeches from sector leaders and experts, and a Q&A session where
attendees can put their questions to a panel of our speakers and the
framework’s authors. There will be a drinks reception afterwards.

ICAEW Spreadsheet Capability Framework Launch

Posted in Excel/VBA, Research, Risk, Software Testing | Tagged , , | Leave a comment