Use of a workflow that makes use of R/Markdown and the R programming language can greatly increase the efficiency of producing reports and documents, especially if the same report is needed on a regular basis.
What are R and R/Markdown?
R is a programming language designed specifically to do statistical data analysis. It is an open-source variant of the S language created by Bell Labs in the late 1990s. It was created by Ross Ihaka and Robert Gentleman of the University of Auckland, New Zealand. As it is designed by and for statisticians it has a lot of statistical, mathematical and graphics functionality built-in, so there is less need to import statistical libraries. Nonetheless, there are more than 15 000 user-written packages that provides an enoormous range of extra functionality.
R/Markdown is a markup language designed to work with R. When using a markup language a user creates a text file then submits it to a program that them renders it in a chosen output form.
Markdown can produce output in the form of HTML, Word, PowerPoint or PDF files, and several other formats as well. The output format can be switched by simply changing one line of the code in the Markdown file.
Why would I want to use a markup language?
Writing in a markup language immediately makes your work reproducible and auditable. You can show exactly how you created a particular chart or a calculation result. If you produce a report that you thought was a one-ff but the customer comes back and asks you to update it with newer data, you can easily amend it and reproduce your analysis with the new data.
Spreadsheets users don’t do this, as a rule, although in principle they could record their keystrokes in a macro. Also, recording macros comes with a number of limitations. A macro cannot include code that creates a new function or implement a do-while loop. Most users don’t bother using named ranges so debugging is difficult. If two users calculate the same quantity and come to different answers it’s hard to check one spreadsheet against the other as the ‘code’ (formulas) are hidden.
Spreadsheets are very useful at the level of the individual employee, because they appear to fill in the gaps between the big corporate IT systems in a flexible way. BUT, at the enterprise level they can lead to disaster if ‘copy-paste’ becomes a regular part of a crucial business process.
They don’t support a ‘one truth’ culture and accurate data transfer between different users becomes brittle because it requires everyone in the chain to stick to exactly the same format. If someone breaks that protocol, perhaps by adding a comment, it fails.
Large spreadsheets with many cells containing formulas sometimes have very long run times or may fail entirely. Business processes built with spreadsheets simply don’t scale up to big data and can fail without warning.
Even at the individual person level, it’s very easy to make mistakes in formulas. Big mistakes, like the Excel copy-paste error that contributed to the $6 billion JP Morgan ‘London Whale’ and Reinhart and Rogoff’s mistaken economic advice that was used as the basis of the International Monetary Fund’s lending policy for 10 years, with enormous consequences for several developing countries.
R/Markdown also adds another important feature to Markdown. It can incorporate executable code blocks that are run when the document is rendered. This means that reports can be created on the fly, using the latest data.
Example
Suppose you are responsible for reporting on a number of Key Performance Indicators (KPIs) for your business. Every Monday morning new figures are posted to a SharePoint for you to look at. You pull these latest figures off the SharePoint and into a spreadsheet. YOu analyse them and plot them on a chart to see if they are trending up, or trending down or not trending at all. You write a report in Word incorporating some of the charts pasted from the spreadsheet and email it to the senior managers who need to know. This process is time-consuming, error-prone and very common.
Now imagine that you have an R/Markdown script that runs automatically at 7 am every Monday morning. It pulls in the data automatically and checks the data for errors. It plots the data and uses statistical tools to determine whether the trend is up, down or neither. It pulls in relevant ‘boiler-plate’ text appropriate to each case (up, down or neither) and generates a customised report in Word for each KPI, and emails it to you, ready for your arrival at 9 am. You can review it before sending it out to the senior managers. That’s a lot of regular work saved and with far fewer opportunities for error.
• R is a programming language designed specifically for data analysis and data visualisation.
- Spreadsheets were designed originally to help accountants do their sums. This would typically include tables of detailed information, with simple summaries by department, product, salesperson or month. These can be easily displayed as charts over time. Sales Ledgers, Balance Sheets, Profit & Loss statements and Cashflow reports can be produced automatically. For new projects, standard financial checks such as Payback Time, Net Present Value and Internal Rate of Return can be computed using built-in financial functions. For business development managers scenario management is supported, such as what the effect of different product price points would be, or goal-seeking: what price point would yield a specific revenue level? All of this is a boon to financial managers.
The problems start when spreadsheet manipulation becomes a critical component of an information pipeline that feeds decisions critical to the business.
Many large organisations now record and track large numbers of items. Phone companies produce a Call Data Record (CDR) for every single call made on their network. Delivery companies record the journey of every parcel. These data are stored in large corporate databases and they produce regular reports to management. However, development of IT systems is a slow and expensive business and there is always a need for some information that isn’t yet provided automatically in a convenient form.
For example, users often find that they get one piece of information from one of their corporate IT systems and another piece from a different IT system. They need both, so that they can construct a table that has both pieces of information. Often the customer ID code, or a part number is common to both and can used to link the two tables.
In Excel tables can be linked via a common column using the VLOOKUP command. If you have a few thousand customers and your table is a few thousand rows long this may work well. If you are a telephone company and you have 22 million customers and your table has millions of rows VLOOKUP will simply freeze up. Spreadsheets just aren’t built for ‘big data’ and VLOOKUP doesn’t scale up either. (It can take a long time even on relatively small tables.) Eventually, in a few years time the IT department may produce a corporate database solution that solves the problem for you, but that is no help to the manager who needs the answer today. The manager who uses a spreadsheet for this work has the wrong tool, but R will do this kind of thing almost immediately.
Wrangling. Anyone who analyses data on a regular basis knows that you will often spend a lot of your time (maybe 80%) simply getting your input data into the right format for your analysis. This is known as ‘data wrangling’. R has many tools to make this work a lot easier—and most spreadsheets don’t.
• Spreadsheets aren’t a great place to store data. They can alter your input data to what they think you meant, without telling you. If your data looks like a date it may be converted to a date. (Geneticists had to rename the MARCH2 and SEPT1 genes because of exactly this problem.) If it looks like a number with one or more leading zeroes it may remove them silently, thus corrupting telephone numbers and Standard Industry Codes. Programming languages and databases don’t do that.
Summary
The workflow created by using R and R/Markdown together has a number of big advantages
- Automatic production of regular reports, including calculations and charts
- Many kinds of error that are common in spreadsheets cannot occur or are much rarer in this workflow
- The processes scale to large numbers of customers, employees or stock numbers.