Reshaping (stacking and splitting) data

Data on your experiments can sometimes wind up in either of two forms: "key-attribute-value" or "tabular". When doing your analyses, you might want one or the other, or both. So how do you transform between the two forms? We provide some tools for that.

First let's explain in a little bit more detail what we're talking about. (Or skip to the AddIn section below if you already understand these concepts.)

In the "key-attribute-value" form, your "key" column describes the time point, a batch ID, or some other index on a property being measured. The "attribute" column describes the property being measured. The "value" column contains the value for that property. In this form of data there are usually many many rows, but only a three columns.  This form of data can sometimes occur in formulations experiments (e.g., Approach 1 in this article).  It is also commonly seen in data historians or chromatography output files. Here's an example:

In the "tabular" form of data, you have one column for each property (attribute) and the values corresponding to that property are listed in the rows below the property. Each row represents a single "key" of data (i.e. a single time point or single batch or other index). This is the data we are most accustomed to working with in Excel or other places, and is usually the most convenient for analysis. It is generally the way that Riffyn SDE produces data in the data tables. Here is an example of the above key-attribute-value data transformed into a table.

So the challenge is how to convert data from one form to another. This can be done by "pivoting" data or "splitting" data.  They are almost the same thing, but "pivoting" adds a data summarization step that "splitting" does not have.

Riffyn JMP AddIn for stacking and splitting data

The Riffyn JMP AddIn provides functionsthat allow you to stack data (put it into key-attribute-value form) or split it (put it into tabular form).  These functions are available on the AddIn menu like so:

And they are optimized for Riffyn SDE data tables because they will transform all properties associated with a resource automatically when you stack or split resources. They also nicely name your columns in an intuitive way.

JMP also provides out of the box functionality for stacking, splitting and pivoting data. Those functions are the menu items "Stack", "Split" and "Summary" menu items on the on the Tables menu, and the "Tabulate" item on the Analyze menu. (Summary does pivots if you use the subgroup option, and Tabulate always pivots.) Although this is quite handy, they don't account for the structure of Riffyn data tables and so don't automatically work with resource column groups the way that the Riffyn SDE JMP AddIn does.  

The Riffyn SDE JMP AddIn's functions work as follows:

Stack resource columns: This takes tabular data into key-attribute-value form. You select a set of resources you want to stack into a key-attribute-value form, and then the function will combine those resources into a single column, and will combine all properties of those resources into single columns (one column for reach type of property).  All other data in the table will be replicated appropriately and associated with the stacked data so that nothing is lost. This would, for example, allow you to combine multiple columns of carbon source data into a single carbon source data column set.

Split resource columns: This takes key-value-attribute data into tabular form. You select a particular resource, and a set of attributes (properties) to split the data by.  Those attributes for the resource will end up as column headings in a tabular form, and the run and event data in the table will be used as keys (row labels).  All other data in the table will be split according to the  the keys the chosen splitBy columns — i.e., the data will be assigned to the appropriate rows based on the key data (runs and events) and to the appropriate columns based on the splitBy data.  Columns are named automatically by renaming the resources with the values of the SplitBy columns.

Stack / split columns (sorting your resources in columns of the same type): Suppose you have three generic input resources on your process step, e.g. three "carbon source" resources as suggested in Approach 1 in this article. Resources of a given type (e.g., "glucose") might get assigned in any of those columns. But when you analyze the data, you may wish to sort the glucose resources all into one column, and other carbon sources into their own columns.

Fortunately there is an easy way to accomplish this: (1) stack the three resource columns together, then (2) split the stacked resource. This will have the desired effect of creating a specific column of data for each resource type that you used in the three carbon source resource inputs in the experiments.  I.e., this is a handy way to sort your data into columns of similar materials.

It's a little difficult to understand exactly what's going on from the explanations above, so it's best just to give these tools a try! (And we'll post a video here eventually.)

Other notes:

Overlapping values report. When you are splitting data, you may run into a warning that says you have multiple data values, and then get and extra "Overlapping values report".

This happens when your your split column and your splitBy column have repeated sets of values in them within a single run/event. The splitting function will then be faced with having to shove the multiple values of the split column (corresponding to the repeated sets) into a single cell in the resulting table. 

If you were doing a pivot operation (i.e. "Tabulate" in JMP), those values would be summarized (mean, median, max, min, count or something).  Currently the AddIn doesn't have a summarization, instead it warns you and takes the last value encountered among the multiples. The last column of the report ("N Rows") tells you how many overlapping values were found for each splitBy value.  Clicking on that row in the report will also highlight the corresponding repeated value sets in your original data table.  See below for an example of such a report. Note that the first row selected in the report (top table) highlights the overlapping values in the original table (bottom table). There you can see the two overlapping values (different concentrations and units).


Have more questions? Submit a request


Article is closed for comments.