NCBI datasets and genome assembly data
Recently, NCBI released their new datasets API that might replace NCBI E-utils. At the moment, datasets is focused on genomes, genes, and viruses, but no doubt it will expand over time. (Note: I think the name is terrible, and they should use ncbi_datasets
(see this tweet)
Here is a rough guide to extracting some data about genomes using datasets
.
First, we have a list of all bacterial genome assemblies. There are currently just over a million genome assemblies, and you can download the latest list:
DATE=`date +%Y%m%d`
curl -Lo assembly_summary_$DATE.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
Now we want to just get the report data about these genomes so we can figure out which ones are worth interrogating further. In particular we are concerned with the number of contigs and the overall assembly length, but you might want other data. Here is how to get all the data for a lot of organisms.
Before you begin, you’ll need to install ncbi_datasets
, and you can easily do that with conda
:
mamba create -n ncbi_datasets -c conda-forge ncbi-datasets-cli
conda activate ncbi_datasets
First, we are going to get about 10 accessions to see if what happens, and then we’ll build up to get all the accessions.
We create a variable called $ACC
with the accession numbers separated by spaces.
ACC=$(head ../assembly_summary_20220130.txt | grep -v \# | cut -f 1 | tr \\n \ )
Now, we use datasets
to get the genome assembly report:
datasets download genome accession $ACC --exclude-genomic-cds --exclude-gff3 --exclude-protein --exclude-rna --exclude-seq --filename ncbi_data.zip
This will download three files:
- README.md: a generic readme file
- ncbi_dataset/data/assembly_data_report.jsonl: all the genome data in JSON format
- ncbi_dataset/data/dataset_catalog.json: a JSON summary of what was downloaded
However, we don’t really want to extract the archive, we can just access it directly using another ncbi datasets tool, dataformat. Let’s extract that data into a tsv file:
dataformat tsv genome --package ncbi_data.zip | awk -F"\t" '!s[$18]++ {print}' > ncbi_data.tsv
Note that we use awk
to only print out one line per accession (otherwise we get lots of lines that appear mostly redundant per accession).
This table will have the following columns:
- Annotation Info BUSCO Complete
- Annotation Info BUSCO Duplicated
- Annotation Info BUSCO Fragmented
- Annotation Info BUSCO Lineage
- Annotation Info BUSCO Missing
- Annotation Info BUSCO Single Copy
- Annotation Info BUSCO Total Count
- Annotation Info BUSCO Version
- Annotation Info Count Gene Non-coding
- Annotation Info Count Gene Other
- Annotation Info Count Gene Protein-coding
- Annotation Info Count Gene Pseudogene
- Annotation Info Count Gene Total
- Annotation Info Name
- Annotation Info Release Date
- Annotation Info Report URL
- Annotation Info Source
- Assembly Accession
- Assembly BioProject Lineage Accession
- Assembly BioProject Lineage Parent Accessions
- Assembly BioProject Lineage Title
- Assembly BioSample Accession
- Assembly BioSample Attribute Name
- Assembly BioSample Attribute Value
- Assembly BioSample BioProject Accession
- Assembly BioSample BioProject Parent Accessions
- Assembly BioSample BioProject Title
- Assembly BioSample Description Comment
- Assembly BioSample Description Organism Common Name
- Assembly BioSample Description Organism Organism Name
- Assembly BioSample Description Organism Pangolin Classification
- Assembly BioSample Description Organism Strain
- Assembly BioSample Description Organism Taxonomic ID
- Assembly BioSample Description Title
- Assembly BioSample Sample Identifiers Database
- Assembly BioSample Sample Identifiers Label
- Assembly BioSample Sample Identifiers Value
- Assembly BioSample Last updated
- Assembly BioSample Models
- Assembly BioSample Owner Contact Lab
- Assembly BioSample Owner Name
- Assembly BioSample Package
- Assembly BioSample Publication date
- Assembly BioSample Status Status
- Assembly BioSample Status When
- Assembly BioSample Submission date
- Assembly Blast URL
- Assembly Description
- Assembly GenBank Accession
- Assembly Level
- Assembly Linked Assembly
- Assembly Name
- Assembly Paired Accession
- Assembly RefSeq Accession
- Assembly Refseq Dategory
- Assembly Sequencing Tech
- Assembly Submission Date
- Assembly Submitter
- Assembly Type
- Assembly UCSC Assembly Name
- Assembly Stats Contig L50
- Assembly Stats Contig N50
- Assembly Stats Gaps Between Scaffolds Count
- Assembly Stats GC Count
- Assembly Stats Number of Component Sequences
- Assembly Stats Number of Contigs
- Assembly Stats Number of Scaffolds
- Assembly Stats Scaffold L50
- Assembly Stats Scaffold N50
- Assembly Stats Total Number of Chromosomes
- Assembly Stats Total Sequence Length
- Assembly Stats Total Ungapped Length
- Breed
- Common name
- Cultivar
- Ecotype
- Isolate
- Organelle Assembly Name
- Organelle BioProject Accessions
- Organelle Description
- Organelle Infraspecific Name
- Organelle Submitter
- Organelle Total Seq Length
- Organism name
- Sex
- Strain
- Taxonomic ID
- WGS contigs URL
- WGS project accession
- WGS URL
Running on all the accessions
Now we can put that together and run this on all the accessions at NCBI.
Note: Before you start this it is imperative you have an NCBI API Key set up.
We can make a simple script to process a lot of accessions at once. From trial and error, it appears that the limit is ~500 accessions, so we set that as our limit.
Lets have a little slurm script that sets this up as an array job:
#SBATCH --time=2-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
# How much memory. I usually request 2000M (2 GB) if I am not sure
#SBATCH --mem=4G
# bash strict mode
set -euo pipefail
# have we run already?
if [ -e ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.tsv ]; then exit 0; fi
# sleep for upto two minutes to delay concurrent jobs
sleep $((RANDOM%120))
# who many accessions to get at a time?
NUM=500
END=$(((SLURM_ARRAY_TASK_ID*NUM)+1))
# generate the list of accessions
ACC=$(head -n $END ../assembly_summary_20220130.txt | tail -n $NUM | grep -v \# | cut -f 1 | tr \\n \ )
# download the data
datasets download genome accession $ACC --exclude-genomic-cds --exclude-gff3 --exclude-protein --exclude-rna --exclude-seq --filename ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.zip
# extract it to a tsv file
dataformat tsv genome --package ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.zip | awk -F"\t" '!s[$18]++ {print}' > ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.tsv
Now you can submit this as an array job, say running max 10 at once so NCBI doesn’t get too upset:
sbatch --array=1-1000%10 extract_all_info.sh