Extracting some data from Google Cloud
We are going to start by running a big query in Google Cloud Console
Note: Google Cloud may charge you for access to services or data, and you are responsible for those charges
Start by choosing Run a query in BigQuery
In this query, we have a list of CRC64 checksums, and we would like to get a list of PDB files that they correspond to.
We will start by uploading our data into Google Cloud. From the three dots by your project name choose Create data set
:
On the form that opens out, we need to provide some sensible answers. I usually let the data set expire so that I don’t accumulate old data!
Once you have created a new data set, you need to create a table to load your data. Choose the Create Table
option:
Now, we need to upload our file, and create a new dataset. The key things here are:
- Upload a file
- File format is CSV
- Give your table a useful name
- Auto-detect the
Schema
so that it will figure out the column types
Now that we have uploaded the data, we can start a query. Click on the dataset to expand it. Then click on the field name to initiate a query &em; in my case that is called string_field_0
.
My query looks like:
SELECT string_field_0 FROM `alphafold-359400.Susie_CRC64.CRC64s` LIMIT 1000
The first thing I do is reduce the number of results returned to 10. I can think about 10 results, but not 1,000!
So lets run this query:
SELECT string_field_0 FROM `alphafold-359400.Susie_CRC64.CRC64s` LIMIT 10
And the results look like this:
Now we can combine these results to get the paths to the PDB files:
SELECT sequenceChecksum, CONCAT('gs://public-datasets-deepmind-alphafold/', entryID, '-model_v', latestVersion, '.cif') as URLS from bigquery-public-data.deepmind_alphafold.metadata where sequenceChecksum in (SELECT string_field_0 FROM `alphafold-359400.Susie_CRC64.CRC64s` limit 10);
Which gives the results:
Or just the PDB files:
SELECT CONCAT(entryID, '-model_v', latestVersion, '.pdb') as PBDs, sequenceChecksum from bigquery-public-data.deepmind_alphafold.metadata where sequenceChecksum in (SELECT distinct string_field_0 FROM `alphafold-359400.Susie_CRC64.CRC64s` limit 10);
Which gives the results: