Data science code review manifesto

Data scientists typically come from physics, mathematics, statistics, biostatistics, finance, and engineering backgrounds. As such, the level of formal software engineering training can vary greatly within a team. It’s important to standardize coding processes so that teams can work together efficiently and cohesively. The benefits can be felt not just in increased productivity, but also in reducing bugs. This is a consequence of both peer-reviewing code and maintaining a cleaner codebase in which bugs can be more easily spotted.

Much of a data scientist’s time is spent writing scripts to explore data. These scripts are frequently written in a rush, with the aim of getting a result as quickly as possible, and readability is often sacrificed for speed, especially by data scientists without a software engineering background. This approach has its downsides: poorly written code is more difficult to debug, more difficult to extend, and more difficult to reuse. These difficulties only compound when a script has several contributors.

Some common offenses are: functions or methods that are unwieldy in length, inconsistent naming conventions, commented-out code blocks, inconsistent formatting, lines of code that are too wide, and substandard documentation.

We recommend the following processes for improving data science coding workflows:

Standardize software versions across your team

Bugs often emerge from version conflicts when code is shared or software is deployed. For larger teams where software installation is managed by a tech-ops department, this probably won’t be an issue. But for smaller teams where software is managed in a more ad-hoc manner, it can be a major pain point. Similarly, if you are deploying code to run in the cloud, ensure that the software versions in the deployment environment match those in the development environment.
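One way to catch version drift early is to compare the packages installed on a machine against the versions the team has agreed on. The following is a minimal Python sketch of that idea, not a definitive implementation: the function name find_version_mismatches and the pinned package versions are hypothetical and chosen only for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def find_version_mismatches(pinned):
    """Return {package: (expected, installed)} for each package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Hypothetical team-wide pins; replace with your own agreed versions.
    pinned_versions = {"numpy": "1.26.4", "pandas": "2.2.2"}
    mismatches = find_version_mismatches(pinned_versions)
    for name, (expected, installed) in mismatches.items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running the same check in both the development and deployment environments gives a quick signal that the two have diverged.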

Enforce a company-wide programming style guide

Using a style guide forces an individual and a team to write code to a consistent standard. It’s easier to read and edit other people’s code when it is all written to the same standard. When deciding on a company-wide style guide, there is no need to reinvent the wheel: plenty of style guides are available online. The tidyverse style guide is popular for R, and the PEP 8 style guide is popular for Python. Pick a style guide for each language that you work in and use code reviews (and plugins) to enforce the standards.
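To illustrate how part of a style guide can be checked automatically, here is a rough Python sketch that flags two of the offenses mentioned earlier: lines that are too wide and functions that are too long. The 79-character limit comes from PEP 8; the 50-line function limit and the name find_style_issues are assumptions made for this example. In practice a team would more likely rely on an established linter than on a hand-written script.

```python
import ast

MAX_LINE_WIDTH = 79       # PEP 8's default maximum line length
MAX_FUNCTION_LINES = 50   # assumed team threshold; PEP 8 does not set one


def find_style_issues(source):
    """Return a sorted list of (line_number, message) for simple style issues."""
    issues = []

    # Flag lines wider than the agreed limit.
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_WIDTH:
            issues.append((number, f"line is {len(line)} characters wide"))

    # Flag functions whose definition spans too many lines.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > MAX_FUNCTION_LINES:
                issues.append(
                    (node.lineno, f"function '{node.name}' spans {length} lines")
                )

    return sorted(issues)
```

A check like this could run before each code review so that reviewers can spend their attention on the analysis rather than on formatting.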

Implement a code review practice

Code reviews should be done to verify that a data scientist’s results are mathematically sound, to reinforce company coding style guides, and to foster a culture of peer mentorship and strong communication within the team. Code reviews take different forms: over-the-shoulder reviews, pair programming, pull requests, email or message board threads, or reviews conducted using dedicated software. Find a system that works for your team and enforce it.

Use version control

Version control provides a formal structure for code reviews, and it also lets you manage releases, quickly revert code changes, and efficiently fix bugs while limiting disruption to other team members and users.

Stay up to date with the latest best practices

In the data science world, new research techniques, new software releases, and new libraries or packages are continuously emerging. It’s important for you and your team to stay on top of the most up-to-date practices. For this, we recommend a three-pronged approach: complete online training courses, attend conferences related to the technology suite that you work in, and read the release notes for any software upgrade that you adopt.

