Fast correlations with turbocor

October 19, 2021

We often want to calculate the Pearson correlation between different datasets; for example, we have used it to identify the hosts of different phages. Often we want to calculate Pearson correlations on really large matrices, and so our usual solution is to use crappy code and be patient!

However, Daniel Jones recently released turbocor, a fast, Rust-based implementation of pairwise Pearson correlations, and so we are intrigued to work with it. Here is a brief guide to calculating correlations using turbocor.

Installing turbocor

Of course, we want to install it using conda, but there are a couple of simple gotchas that are easy to overcome. Specifically, hdf5-sys requires hdf5=1.10.1 at the moment, and balks at hdf5=1.12.1 (the current version in anaconda).

Here are some step-by-step instructions.

Step 1: Create the conda environment and install hdf5

mamba create -n turbocor rust hdf5=1.10.1

Step 2: Activate the environment

conda activate turbocor

Step 3: Set your environment variables

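You will need to set HDF5_DIR to the base of your conda environment, and point the linker at that environment's lib directory. Change the paths as appropriate for your conda installation (with the environment active, $CONDA_PREFIX holds the base path). See the hdf5 conda installation for more details.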

export HDF5_DIR=$HOME/miniconda3/envs/turbocor
export RUSTFLAGS="-C link-args=-Wl,-rpath,$HOME/miniconda3/envs/turbocor/lib"

Step 4: Build the release

Run this from the top level of the turbocor source directory (the one that contains Cargo.toml):

cargo build --release

and now the executable will be in the target/release directory, so you can either add this to your path (e.g. PATH=$PATH:$PWD/target/release) or remember the location (e.g. TURBOCOR=$PWD/target/release).

Using turbocor

Conceptually, your data needs to be a matrix. Turbocor reads an hdf5 format file in which a dataset tag points to a two-dimensional matrix.
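To make "pairwise correlations on a matrix" concrete, here is a minimal pure-Python sketch of the naive computation on a tiny matrix. It is only an illustration of the quantity being calculated, not how turbocor does it, and it assumes that the rows are the vectors being compared (the function names are made up for this example).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_pearson(matrix):
    """Return {(i, j): r} for every pair of rows with i < j."""
    return {
        (i, j): pearson(matrix[i], matrix[j])
        for i, j in combinations(range(len(matrix)), 2)
    }


if __name__ == "__main__":
    toy = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]
    for pair, r in pairwise_pearson(toy).items():
        print(pair, r)
```

The number of pairs grows with the square of the number of rows, which is why this approach needs so much patience on large matrices.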

Step 1: Convert your data to hdf5 format

If you have your data in a tab-separated (or even comma-separated) text file, you can use matrix_to_h5.py to convert that data into the hdf5 format file that you need.

For example, here is how to convert the matrix to an hdf5 file. In this example, matrix.tsv is a tab-separated matrix file that has a header row. The command writes the data into matrix.h5 with a dataset tag mydata, and writes a separate file with the row names (taken from the first column) into indices.tsv.

python3 /home/edwa0468/GitHubs/EdwardsLab/h5py/matrix_to_h5.py --file matrix.tsv --output matrix.h5 --header --dataset mydata --indexfile indices.tsv
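If you are curious what a conversion like this involves, here is a rough sketch using h5py and numpy. It is not the matrix_to_h5.py script itself: the function name, the float32 data type, and the index<TAB>name layout of the index file are all assumptions made for illustration, so check them against the real script and against what turbocor expects.

```python
import csv

import h5py
import numpy as np


def tsv_to_h5(tsv_path, h5_path, dataset, index_path, delimiter="\t"):
    """Store the numeric part of a delimited matrix in an HDF5 dataset.

    Assumes a header row, and row names in the first column.
    Returns the shape of the matrix that was written.
    """
    names, rows = [], []
    with open(tsv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        next(reader)  # skip the header row
        for record in reader:
            names.append(record[0])  # first column holds the row name
            rows.append([float(value) for value in record[1:]])

    # float32 is an assumption here, not a documented requirement
    matrix = np.array(rows, dtype=np.float32)
    with h5py.File(h5_path, "w") as out:
        out.create_dataset(dataset, data=matrix)

    # hypothetical index layout: row number, tab, row name
    with open(index_path, "w") as out:
        for i, name in enumerate(names):
            out.write(f"{i}\t{name}\n")
    return matrix.shape
```

For example, tsv_to_h5("matrix.tsv", "matrix.h5", "mydata", "indices.tsv") would mirror the command above under those assumptions.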

Step 2: Run turbocor

Next, we run turbocor on the matrix.h5 file. Here we use the $TURBOCOR path we set earlier, and output the correlations to the matrix.cor file.

$TURBOCOR/turbocor compute --dataset mydata matrix.h5 matrix.cor

Step 3: Extract the coefficients

Finally, we convert the matrix.cor file to a comma-separated list of row indices and their correlation coefficients:

$TURBOCOR/turbocor topk 1000 matrix.cor

You can use the indices.tsv file we created in step 1 to identify the row names of things that correlate with each other.
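As a sketch of that lookup, the snippet below joins the coefficients back to names. It rests on several assumptions that you should check against your own files: that the topk output has been saved to a file (here called top.csv, a made-up name) with lines of the form i,j,r; that the index file holds index<TAB>name lines; and that both use the same numbering for the indices.

```python
import csv


def load_names(index_path):
    """Read index<TAB>name lines into a dict {index: name}."""
    names = {}
    with open(index_path, newline="") as handle:
        for idx, name in csv.reader(handle, delimiter="\t"):
            names[int(idx)] = name
    return names


def name_pairs(topk_path, names):
    """Yield (name_i, name_j, r) for each 'i,j,r' line of the saved output."""
    with open(topk_path, newline="") as handle:
        for i, j, r in csv.reader(handle):
            yield names[int(i)], names[int(j)], float(r)


if __name__ == "__main__":
    # top.csv is a hypothetical file holding the saved topk output
    names = load_names("indices.tsv")
    for first, second, r in name_pairs("top.csv", names):
        print(f"{first}\t{second}\t{r}")
```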

Enjoy!

Rob Edwards