1 Introduction

To face the growing volumes of data generated by high-throughput analytical techniques involved in OMICS approaches, the current trend is to process these data as much as possible automatically by putting know-how and expertise into toolboxes within Virtual Research Environment (VRE), thus allowing non-experts to handle their data themselves (Candela et al. 2013). The metabolomics approach is therefore much concerned (http://phenomenal-h2020.eu/home/). Regarding metabolomics based on 1D NMR spectroscopy and although it has become a common approach, multiple challenges in spectra processing remain to be solved. As mentioned in Larive et al. (2015), continued improvement and development of new techniques for NMR data processing is essential for improving the high-throughput NMR metabolomics. As discussed in Vu and Laukens (2013) and Alonso et al. (2015), many of the spectra processing steps especially the spectra alignment require a certain degree of user expertise. Indeed, NMR analysis is not exempt of difficulties, and in particular subtle differences in pH, ionic strength, temperature, protein content, etc., between samples may cause differences in the NMR-detected peak position and line widths of a given metabolite (Cloarec et al. 2005; Cruz et al. 2014; Tredwell et al. 2016). In addition, in complex spectra there is a high degree of overlap in certain regions of the NMR spectrum, which hampers analysis. Therefore, the diversity of issues regarding the NMR spectra processing step is not lacking, namely those encountered during the various stages of processing (baseline correction, chemical shift calibration, removal of resonance signals of solvents and other contaminants, re-alignment of resonances having uncontrolled variations in chemical shifts between spectra, ...) and depending on the biological context (humans, plants, micro-organisms), the type of sample source (tissue, tissue extract or biofluid ...), the analytical protocol (choice of NMR sequence, use of additives for calibration and/or quantification, use of buffer solution and pH adjustment to control the sample pH, etc.). Given the nature of the 1D NMR spectra, and the issues cited above, the expert eye is often required and even crucial to disentangle the intertwined peaks and to unravel the metabolite composition of a complex mixture sample. Apart for very well-mastered and very reproducible use cases, the implementation of 1D NMR spectra processing workflows into a toolbox and operating automatically in batch mode (regarded as a black-box) in order to be widely used by non-expert users or new-comers has not yet reached full maturity. Therefore so far the best way is still to proceed interactively with an 1D NMR spectra viewer. To fulfill this need, we have been developing NMRProcFlow, an open source software that greatly helps spectra processing (Fig. 1). It was built by involving NMR spectroscopists eager to have a quick and easy tool to use and even for non-expert users. Although an open source software such as R (R Development Core Team 2005), can be leveraged to develop a comprehensive 1D NMR spectra processing, its command-line interface is a genuine barrier for metabolomics researchers with little or no programming background. Knowing that the processing of a set of NMR spectra may be long and tedious, the purpose of NMRProcFlow is to assist users in processing 1D NMR spectra by leveraging upon an easy and intuitive Graphical Unit Interface (GUI). Thus, it offers users significant time saving, while allowing them to apply their expertise without the skill barrier in programming.

Fig. 1
figure 1

Workflow of NMRProcFlow open source software for 1D NMR spectra processing within an interactive interface based on spectra visualization

2 Implementation

NMRProcFlow has been developed as two separate applications (NMRspec and NMRviewer), each of them embedded in Docker images, sharing a data volume and communicating together through AJAX web services by making intensively use JavaScript functionalities (Fig. 2). The Docker (https://www.docker.com/) technology was chosen to facilitate the setting up of the software.

Fig. 2
figure 2

Schema of the NMRProcFlow implementation into two applications

NMRspec, the NMR spectra processing application has been implemented using mainly the open source software language R (R Core Team 2014, http://www.R-project.org/), and some processing algorithms have been implemented in C + + using the Rcpp package (Eddelbuettel and Francois 2011) within an internal module called Rnmr1D. The NMRspec GUI has been implemented using the ‘R shiny’ (http://shiny.rstudio.com/) framework based on event-driven programming. NMRviewer is based on a client–server architecture. To obtain fast transfers from server-side to client-side, first spectral data are not stored as text format but only as binary format, because text to binary conversion is time consuming. Then, the open-source software Gnuplot (http://gnuplot.info) was chosen to generate images in PNG format directly from the binary data. Thus, data reading/writing and data transfers are minimized. The NMRProcFlow tool is available online as a web tool to provide advanced processing tools for 1D NMR spectral data for anyone who needs to process NMR spectra sets, expert or not, in order to save users the trouble of installing the software. In addition, virtual appliances of NMRProcFlow compliant with the two major virtualization platforms (VMware http://www.vmware.com/ and Oracle VirtualBox https://www.virtualbox.org/) are available for a local setup thus ensuring security of confidential data such as clinical data.

3 Results and discussion

One of the strongest point of NMRProcFlow is the possibility of visualizing the experimental factor levels within the NMR spectra set through the spectral viewer making the tool valuable to create links between the experimental design and subsequent statistical analyses, and thus facilitate interactions between biologists and NMR spectroscopists. The NMR spectra viewer is the central tool of NMRProcFlow and the core of the application. It allows users to visually explore the spectra overlaid or stacked, to zoom in on intensity scale, to group sets of spectra by coloring them based on their factor levels. Compared with the existing software based on a GUI dedicated to 1D NMR spectra processing (https://omictools.com/data-processing-category, Alonso et al. 2015), besides that some are commercial software (or implying to buy a software’s license), neither of them allows specifying the factor levels within the spectra viewer, nor capturing areas of ppm to be treated either on a full set of spectra, or on a subset belonging to the same factorial group as readily as NMRProcFlow can do. The current version of NMRProcFlow accepts the Free Induction Decay (FID) coming from two major vendors namely Bruker GmbH & Agilent Technologies (Varian) and in addition the Bruker folder structure with 1r files are also supported (see Supplementary Information S1). Besides, we already support the nmrML format (http://nmrml.org/, Rocca-Serra et al. 2016). In a next version the final goal will to achieve a complete export in this new standard format, thus allowing to describe based on an ontology the whole processing steps. However, NMRProcFlow implies that all NMR spectra were acquired and pre-proceeded with the same number of points. Nevertheless, it is possible for example to mix spectra acquired with different pulse sequences in order to compare them. Regarding spectra processing, automatic approaches when applicable are very suitable on well defined use-cases. This is the case e.g. for the approach implemented in BAYESIL web tool (Ravanbakhsh et al. 2015) provided that its analytical protocols defined for limited and well defined use-cases (only three matrices cerebrospinal fluid, serum, ultra-filtered plasma, collected and prepared with a standardized protocol and measured by NMR with a standardized acquisition) is appropriate to your use-case. Moreover, tools that integrate such approaches exist (Workflow4Metabolomics, Giacomoni et al. 2015). Because our approach is deliberately to the antipodes of a totally automatic approach, our criterion for the choice of methods for each spectra processing stages (baseline, alignment, binning, ...) implemented in our application were those that were evaluated as effective according to our tests and that can be executed in just a few seconds, allowing a fluid use of the application (See “Description of the implemented processing methods” in Supplementary Information S1). An additional key point is the ability to interactively choose and proceed, with the help of the NMR spectra viewer, the processing method (Fig. 1) that seems most relevant depending on the ppm region. Indeed, for each spectra processing stage, at least two methods can be applied (see Supplementary Information S1). For instance, to align a spectral region with intertwined peaks, a Parametric Time Warping method (Bloemberg et al. 2010) will produce a better result provided that all spectra have the same amount of peaks within this region. Otherwise, a Least-Square algorithm will be more adapted. Similarly to the visualization, the peak alignment within a spectral region can be proceeded on either a full set of spectra or on a subset belonging to the same factor level, thus expanding the range of processing options.

Beyond its capabilities of spectra viewing and processing, NMRProcFlow is especially dedicated to metabolomics. The two major metabolomics approaches, namely metabolic fingerprinting (MF) and targeted metabolomics (TM) are taken into account. The workflow covers all steps from the spectral data up to the output data matrix (Fig. 1).

Regarding the TM approach, the identity of the metabolites of interest is established before statistical data analysis, and this involves to be able to: (i) identify the spectral regions for which the quantification will be performed based on both knowledge and well-established metabolomic profiles, (ii) ensure that each of these regions is not polluted by peak contribution of neighbor regions. To fulfill these two points, it is necessary to locally correct the baseline in order to (i) eliminate the residual effects due to the presence of macromolecules in extracts, (ii) but also reduce the prevalence of an intense peak on the less intense neighbor ones. The best way in the TM approach for obtaining buckets is to choose yourself the ppm ranges to integrate. Here only a few dozen of peaks corresponding to targeted and selected compounds. Their size is depending on the signal pattern. After the processing stages and bucketing step, NMRProcFlow allows users to export all the data needed for the quantification into a spreadsheet workbook. The ‘qHNMR’ template aggregates information within five separate tabs like the sample table, the bucket table, the signal-to-noise ratio (SNR) matrix and the data matrix i.e. the values of integration for each bucket (columns) and for each spectrum (rows), and also includes another tab with the pre-calculated quantifications from data provided in the others tabs. Some information are set by default in both ‘samples’ and ‘buckets’ tabs. Just adjust them with the appropriate values and the quantifications within the eponymous tab will be automatically updated. This approach based on a local baseline correction applied on a specific pattern (singlet, doublet and triplet) works well when patterns are not overlapped (Moing et al. 2004). To address the issue in crowded regions with heavy peak overlap, a quantification method based on a local signal deconvolution seems more appropriate (Zheng et al. 2011). The implementation of such a method is planned in a future version of NMRProcFlow to cover this type of more complex needs.

Regarding the MF approach, the identity of the metabolites of interest is established after statistical data analysis of metabolic fingerprints, and this involves to be able to: (i) highlight that spectral regions having a difference between the groups are statistically significant, (ii) ensure that each of these regions involves only a single metabolite peak or pattern, i.e. there is unique correspondence between a bucket and a resonance pattern (spectral signature of a metabolite). The standard approach in NMR-based metabolomics used imply the division of spectra into equally sized bins, thereby simplifying subsequent data analysis. Yet, disadvantages are the loss of information and the occurrence of artifacts caused by peak shifts (up to 0.05 ppm, Cloarec et al. 2005). Therefore, we implemented the Adaptive Intelligent Binning (De Meyer et al. 2008) algorithm which largely circumvents these problems by recursively identifying bin edges in existing bins. The later algorithm requires only minimal user input, and avoids the use of arbitrary parameters or reference spectra. It is well adapted to meet the second point mentioned above. Moreover, whatever the bucketing approach used, the SNR is a good quality indicator. Thus, in NMRProcFlow it is possible to filter buckets based on this ratio either during the computing process, or at the output step. This possibility was carried on in a previous collaborative work where NMRProcFlow has been used with success (Bornet et al. 2016). Because statistical analysis is an important step within the MF approach, file manipulations have to be minimized to avoid laborious and time consuming conversions. Thus, NMRProcFlow can export data matrices fully compatible with online statistical analysis tools such as BioStatFlow (http://biostatflow.org) or MetaboAnalyst 3.0 (Xia et al. 2015).

In addition, NMRProcFlow allows experts to build their own spectra processing workflow, in order to become re-applicable to similar NMR spectra sets, i.e. stated as use-cases. By extension, the implementation of NMR spectra processing workflows executed in batch mode can be considered as relevant provided that we want to process in this way very well-mastered and very reproducible use cases, i.e. by applying the same Standard Operating Procedures (SOP). A subset of NMR spectra is firstly processed in interactive mode in order to build a well-suited workflow. This mode can be considered as the ‘expert mode’. Then, other subsets that are regarded as either similar or being included in the same case study, can be processed in batch mode, operating directly a Command Line Tool (CLI) or embedded in a workflow management system such as Workflow4Metabolomics (Giacomoni et al. 2015) or PhenoMeNal VRE App Library (http://portal.phenomenal-h2020.eu/app-library).