We often want to calculate Pearson correlations between different datasets; for example, we have used them to identify the hosts of different phages. Often we want to calculate Pearson correlations on really large matrices, and our usual solution is to use crappy code and be patient!
However, Daniel Jones recently released turbocor, a fast, Rust-based implementation of pairwise Pearson correlations, and so we were intrigued to work with it. Here is a brief guide to making correlations using turbocor.
Of course, we want to install it using conda, but there are a couple of simple gotchas that are easy to overcome. In particular, the build needs hdf5=1.10.1 at the moment, and balks at hdf5=1.12.1 (the current version in anaconda). Here are some step-by-step instructions.
Step 1: Create the conda environment and install hdf5
mamba create -n turbocor rust hdf5=1.10.1
Step 2: Activate the environment
conda activate turbocor
Step 3: Set your environment variables
You will need to set HDF5_DIR to the base of your conda environment, and use RUSTFLAGS to pass that environment's lib directory to the linker as an rpath. Change the paths as appropriate for your conda installation. See the hdf5 conda installation for more details.
export HDF5_DIR=$HOME/miniconda3/envs/turbocor
export RUSTFLAGS="-C link-args=-Wl,-rpath,$HOME/miniconda3/envs/turbocor/lib"
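If you would rather not hard-code the path, the two variables above can be derived from the active environment. This is a minimal sketch, assuming you have already run conda activate (which sets CONDA_PREFIX); the function name turbocor_build_env is made up for this illustration, and the fallback path is simply the one used above.

```python
import os
import shlex


def turbocor_build_env(conda_prefix):
    """Return the two build variables for a given conda environment prefix."""
    prefix = conda_prefix.rstrip("/")
    return {
        # Base of the conda environment that holds hdf5=1.10.1
        "HDF5_DIR": prefix,
        # Ask the linker to record the environment's lib directory as an rpath
        "RUSTFLAGS": f"-C link-args=-Wl,-rpath,{prefix}/lib",
    }


if __name__ == "__main__":
    # CONDA_PREFIX is set by `conda activate`; otherwise fall back to the
    # example path used in this guide (an assumption about your installation).
    default = os.path.expanduser("~/miniconda3/envs/turbocor")
    prefix = os.environ.get("CONDA_PREFIX", default)
    for name, value in turbocor_build_env(prefix).items():
        print(f"export {name}={shlex.quote(value)}")
```

Saved under a name of your choosing, the printed export lines can be pasted into your shell or loaded with eval before building.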
Step 4: Build the release
cargo build --release
The executable will now be in the target/release directory, so you can either add this to your path (e.g. PATH=$PATH:$PWD/target/release) or remember the location in an environment variable (we use $TURBOCOR below).
You need your data conceptually in a matrix. Turbocor will read an hdf5 format file with a dataset tag that points to a two-dimensional matrix.
Step 1: Convert your data to hdf5 format
If you have your data in a tab-separated (or even comma-separated) text file, you can use matrix_to_h5.py to convert that data into the hdf5 format file that you need.
For example, here is how to convert the matrix to an hdf5 file. In this example, matrix.tsv is a tab-separated matrix file that has a header row. The command writes the data into matrix.h5 with a dataset tag mydata, and writes a separate file with the names (taken from the first column) into indices.tsv:
python3 /home/edwa0468/GitHubs/EdwardsLab/h5py/matrix_to_h5.py --file matrix.tsv --output matrix.h5 --header --dataset mydata --indexfile indices.tsv
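If you do not have matrix_to_h5.py to hand, the conversion can be sketched in a few lines with h5py. This is not the script itself, just a minimal illustration under some assumptions: the first column holds the row names, every other column is numeric, and the index file is written as "row number, tab, name". The function names (read_matrix, write_h5, write_index, convert) are made up for this example, and the real script's index file layout may differ.

```python
import csv


def read_matrix(lines, delimiter="\t", header=True):
    """Parse delimited text into (row_names, values).

    The first column is treated as the row name and the remaining
    columns as numbers. The header row, if present, is skipped.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    if header:
        next(reader, None)
    names, values = [], []
    for record in reader:
        if not record:
            continue  # ignore blank lines
        names.append(record[0])
        values.append([float(x) for x in record[1:]])
    if len({len(row) for row in values}) > 1:
        raise ValueError("rows have different numbers of columns")
    return names, values


def write_h5(values, h5_path, dataset="mydata"):
    """Store the matrix as a two-dimensional float dataset."""
    import h5py  # third-party; imported here so parsing works without it

    with h5py.File(h5_path, "w") as handle:
        handle.create_dataset(dataset, data=values, dtype="f8")


def write_index(names, index_path):
    """Write one 'row number<TAB>name' line per matrix row (assumed layout)."""
    with open(index_path, "w") as out:
        for i, name in enumerate(names):
            out.write(f"{i}\t{name}\n")


def convert(tsv_path, h5_path, index_path, dataset="mydata"):
    """Read a delimited matrix and write the hdf5 file plus the index file."""
    with open(tsv_path, newline="") as handle:
        names, values = read_matrix(handle)
    write_h5(values, h5_path, dataset)
    write_index(names, index_path)
```

Calling convert("matrix.tsv", "matrix.h5", "indices.tsv") would produce files with the same names as the command above.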
Step 2: Run turbocor
Next, we run turbocor on the matrix.h5 file. Here we use the $TURBOCOR path we set earlier, and output the correlations to the matrix.cor file:
$TURBOCOR/turbocor compute --dataset mydata matrix.h5 matrix.cor
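When trying a new tool, it helps to spot-check a few coefficients by hand. Here is a minimal pure-Python sketch of the Pearson coefficient for pairs of rows, assuming the rows are the things being correlated. It is the slow, patient approach rather than anything turbocor does internally, and the function names are made up for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pairwise_pearson(rows):
    """Return {(i, j): r} for every pair of rows with i < j."""
    return {
        (i, j): pearson(rows[i], rows[j])
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    }
```

This is quadratic in the number of rows, so it is only practical on a small slice of the matrix.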
Step 3: Extract the coefficients
We convert that to a comma-separated list of row indices and their correlation coefficients:
$TURBOCOR/turbocor topk 1000 matrix.cor
You can use the indices.tsv file we created in step 1 to identify the row names of things that correlate with each other.
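That lookup can be sketched as follows. The layouts here are assumptions rather than documented formats: each line of the topk output is taken to be "i,j,coefficient" with no header, and each line of indices.tsv is taken to be "row number, tab, name". The function names are made up for this example, so adjust the parsing to match your files.

```python
import csv


def read_index(lines):
    """Map row number -> name from 'number<TAB>name' lines (assumed layout)."""
    names = {}
    for record in csv.reader(lines, delimiter="\t"):
        if len(record) >= 2:
            names[int(record[0])] = record[1]
    return names


def name_pairs(topk_lines, names):
    """Replace the two row indices on each 'i,j,r' line with their names.

    Indices missing from the lookup are kept as strings so nothing is lost.
    """
    pairs = []
    for record in csv.reader(topk_lines):
        if len(record) < 3:
            continue  # skip blank or short lines
        i, j, r = int(record[0]), int(record[1]), float(record[2])
        pairs.append((names.get(i, str(i)), names.get(j, str(j)), r))
    return pairs
```

Both functions take any iterable of lines, so you can pass open file handles for indices.tsv and for a file holding the saved topk output.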