
Tech Debt in Research & Academia: 3 Keys for Getting on the Right Track

Mar 16, 2017

Tech debt: a concept in programming that reflects the extra development work that arises when code that is easy to implement in the short run is used instead of applying the best overall solution.

Research institutions often suffer from the effects of technical debt that slow processing, lead to inflexible procedures and pipelines, and ultimately limit the potential for forward progress in fast-paced academic environments. This article covers three key ways that researchers can begin to tackle the issue of tech debt.

Tech Debt in Software Engineering

Minimizing technical debt is one of the most important concepts in keeping a software system up and running. It can creep up on development teams, crippling them as the scales tip. Soon they end up spending more time working around hastily designed systems than maximizing impact and making their milestones.

Many don’t consider tech debt in and of itself problematic: proofs of concept and mockups can be thought of as short-term loans to be paid off quickly once approved and developed into a solid portion of the codebase (much like a car loan). However, whether the debt is good or bad, it should be paid off as soon as possible lest it become a burden to the team and its future goals.

This can be addressed in a number of ways. One firm I worked for held Tech Debt Tuesdays where the entire team would take a break from delay-able work and spend time improving documentation, code base cleanliness, and excising bad hacks and crufty code from the product. Another company had a Dead Code Society that awarded t-shirts to employees who cut > 1K lines of dead code in a commit. Regardless of how a company handles it, tech debt needs intentional attention lest work grind to a halt (or new hires take one look at the codebase and run the other way).

“Can we just make it work?”

While debt maintenance is well known in technical circles, it’s a relatively unknown concept in academia. For example, a colleague of mine is a research assistant to a professor who works primarily in MATLAB. Her project has begun collecting a new type of data and is running into storage issues due to the fixed field-length limitations of the historical data storage format they’ve been using. The new metadata is being crammed into three-character blocks that provide no intuitive understanding of the data they describe. As the research grows, so does the tech debt, and the day-to-day work of the lab suffers from poor systems design.
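To make that concrete, here’s a rough Python sketch of the kind of stopgap such a lab tends to accumulate: a hand-maintained lookup table translating cryptic three-character codes into readable field names. The codes and names below are invented for illustration, not my colleague’s actual format; the point is that the meaning of the data ends up living in a script (or in someone’s head) instead of in the data itself.

```python
# A hypothetical stopgap: translate cryptic three-character metadata codes
# into human-readable field names. The codes below are invented examples,
# not the lab's actual format.
CODE_TO_FIELD = {
    "smp": "sample_id",
    "tmc": "time_collected",
    "ins": "instrument_name",
    "opr": "operator_initials",
}

def decode_metadata(raw_codes):
    """Map a list of three-character codes to descriptive names,
    flagging anything the lookup table doesn't know about."""
    decoded = {}
    for code in raw_codes:
        decoded[code] = CODE_TO_FIELD.get(code, "UNKNOWN_CODE")
    return decoded

print(decode_metadata(["smp", "tmc", "xyz"]))
# {'smp': 'sample_id', 'tmc': 'time_collected', 'xyz': 'UNKNOWN_CODE'}
```

Every new data type means another code squeezed into the table, and anyone who joins the lab has to learn the table before they can read the data.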

Sadly, this story is not uncommon in research contexts. Unlike many software developers, whose jobs depend on staying on top of the cutting edge, researchers are often content to use tools they know work (in the case of MATLAB, for over 30 years) and shim fixes into them rather than holistically reevaluating their workflow when they hit obstacles. And understandably so! Their job is to perform their research; any time spent on technical infrastructure and paying off tech debt is time not spent doing the work they get paid for. But this is the same attitude many software engineers hold, right up until they reach an inflection point and are forced to reevaluate, which is often unplanned and more costly than fixing things as they go.

Technical debt is just as dangerous in academia as it is in the software world:

  • It increases the onboarding time of new researchers and assistants as they learn convoluted processes that have been patched and hacked together.
  • It limits flexibility in trying new workflows and code/data reuse in different contexts.
  • In the worst cases, it can fragment workflows that should be unified: when a tech debt wall is hit, researchers are forced to start an entirely new process just to accommodate new or newly formatted data.

How do we fix this?

1. Recognize your debt exposure.

Tech debt is often present even in processes that seem to be running smoothly. Even if you haven’t stumbled on an inflection point where poorly designed processes halt or cause issues during the course of research, minimizing debt is important. It’s like going to the gym weekly to ensure you don’t need to get bypass surgery at some future date — it will hurt a bit to get started, but it’s worth it in the long run. The first step to fixing a problem is admitting you have a problem, and if you use any kind of software toolchain, chances are you have tech debt and cruft rattling around in your codebase no matter who you are.

As a corollary, making headway on your debt takes time. It’s never easy to refresh workflows, so understanding the potential cost of ignoring issues and choosing hacks over holistic design is vital for any change to occur.

2. “Refinance” your debt using resources from the academic community.

Many academic researchers have access to a fantastic toolbox in the form of students at their institution. Leverage students outside your primary research context: computer science and engineering students, and even business students with a BI or data-processing background. Ask faculty about promising students who could provide good input on systems design in your field. While they may not grasp the intricacies of your research, seeing the format of your data and your general processing needs can help them choose the right pipeline for you, whether that involves commonly consumable formats such as CSV, relational databases, or analytics applications like KNIME or Tableau that can readily export datasets in flexible formats for further processing.
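As a rough sketch of what “commonly consumable” can look like in practice, the snippet below loads a hypothetical CSV of measurements into a SQLite database with pandas, so anything that speaks SQL (including tools like KNIME or Tableau) can query the same table. The file, table, and column names are placeholders, not a prescription.

```python
# A minimal sketch, assuming a flat CSV of measurements with a header row.
# File name, table name, and columns are hypothetical placeholders.
import sqlite3
import pandas as pd

df = pd.read_csv("measurements.csv")          # e.g. sample_id, timestamp, value

with sqlite3.connect("lab_data.sqlite") as conn:
    # Write (or replace) a single shared table that any SQL-aware tool can read.
    df.to_sql("measurements", conn, if_exists="replace", index=False)

    # Quick sanity check: count the rows we just loaded.
    count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
    print(f"Loaded {count} rows into lab_data.sqlite")
```

From there, every collaborator queries one shared table instead of keeping private copies of the spreadsheet.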

Reworking your data flow doesn’t mean giving up tools that you’ve grown to know and love; it just means setting up the storage and delivery systems feeding those endpoints as modularly and flexibly as possible. Plenty of intermediate, platform-agnostic formats mean more tools and more people can work on your data and provide new insights, which brings us to…
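For a MATLAB-centric lab like the one above, one hedged example of such a modular hand-off is dumping each .mat file to plain CSV so non-MATLAB tools and collaborators can read it. The file path and variable name below are assumptions, and scipy’s loadmat only handles classic (pre-v7.3) .mat files.

```python
# A sketch of exporting MATLAB data to a platform-agnostic format.
# Assumes a classic (pre-v7.3) .mat file containing a 2-D numeric matrix
# stored under a hypothetical variable name "results".
from scipy.io import loadmat
import pandas as pd

mat = loadmat("experiment_01.mat")        # hypothetical file name
matrix = mat["results"]                   # hypothetical variable name

# Give the columns descriptive names instead of relying on position.
columns = [f"channel_{i}" for i in range(matrix.shape[1])]
pd.DataFrame(matrix, columns=columns).to_csv("experiment_01.csv", index=False)
```

MATLAB still does the analysis it’s good at; the CSV is just the common currency everything else can consume.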

3. Keep your bus factor high by using intuitive processing.

Obviously, most academic research data is not consumable by a layperson. However, that’s not to say that the data pipelines need to be equally convoluted. Set up data processing systems that are logical and extensible: plan to pivot. Keep your systems flexible and adjustable to allow for sudden changes in the shape of your data. Backwards-compatible designs let fresh ideas or points of view be analyzed quickly, with minimal effort spent on data reformatting (or, in the case of my colleague, extracting mountains of static data from MATLAB matrices saved in files scattered around the research computer).
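One way to make “plan to pivot” concrete (a sketch, not a prescription) is to version your records and fill in sensible defaults when older data is missing newer fields. The field names and version numbers below are made up for illustration.

```python
# A minimal sketch of a backward-compatible record format: every record
# carries a schema version, readers fill in defaults for fields that older
# data doesn't have, and existing fields are never dropped.
CURRENT_SCHEMA_VERSION = 2

# Defaults for fields added after version 1 (hypothetical field names).
DEFAULTS_BY_VERSION = {
    2: {"detector_gain": None},
}

def upgrade_record(record):
    """Bring an older record up to the current schema without losing data."""
    version = record.get("schema_version", 1)
    for v in range(version + 1, CURRENT_SCHEMA_VERSION + 1):
        for field, default in DEFAULTS_BY_VERSION.get(v, {}).items():
            record.setdefault(field, default)
    record["schema_version"] = CURRENT_SCHEMA_VERSION
    return record

old = {"sample_id": "A17", "value": 0.42}          # a version-1 record
print(upgrade_record(old))
# {'sample_id': 'A17', 'value': 0.42, 'detector_gain': None, 'schema_version': 2}
```

When the data changes shape again, you add another entry to the defaults table instead of rewriting every downstream script.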

Building workflows that allow for collaboration is a prerequisite for truly open research. Open source tools and standardized formats for data open the door for others, even those without research backgrounds, to work on and aid with discovery.

One of my favorite transplants of a traditionally tech-world oriented idea into the academic world is a physics hackathon: a professor invited 10-20 graduate students from many majors to a Friday evening seminar that covered current topics of research by the faculty and staff. On Saturday, the students joined the research team that most interested them and were taught the basic metrics of the data they collected. Then the students went to work on learning about and looking at the data in new ways. Physics students collaborated with computer engineers to build new data visualizations or update old models that weren’t compatible with new data. Biologists worked with students who had business intelligence experience to extrapolate dosimetric data from work schedules and instrument readings in the radioactivity lab to learn more about the researchers’ radiation exposure.

While there weren’t always great leaps forward in understanding, by establishing systems such as version controlled models, engine-agnostic data formats, and well labeled and documented workflows, researchers could collaborate readily and open their data up for further work by other people.
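As a small sketch of what “engine-agnostic and well documented” might look like, the snippet below writes data out as CSV alongside a sidecar JSON data dictionary describing each column, so a hackathon guest doesn’t have to reverse-engineer the format. The column names, units, and values are invented for illustration.

```python
# A sketch of pairing an engine-agnostic export with a human-readable
# data dictionary. Column names, units, and values are hypothetical.
import csv
import json

columns = {
    "sample_id": "Unique identifier assigned at collection time",
    "dose_mSv": "Estimated dose in millisieverts",
    "shift_hours": "Hours worked in the lab during the measurement window",
}

rows = [
    {"sample_id": "A17", "dose_mSv": 0.12, "shift_hours": 6},
    {"sample_id": "B03", "dose_mSv": 0.09, "shift_hours": 4},
]

with open("exposure.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(columns))
    writer.writeheader()
    writer.writerows(rows)

# Sidecar data dictionary that travels with the CSV.
with open("exposure.columns.json", "w") as f:
    json.dump(columns, f, indent=2)
```
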

 

Tech debt is dangerous in the long run. The sooner it’s addressed, especially in research contexts, the sooner your data can be flexible and open to collaboration from others, and the sooner you can stop worrying about when your next data format change or movement in a new direction will break your workflow or slow down science.


Title image courtesy of Alan O’Rourke on Flickr under CC BY 2.0.