NCBI datasets and genome assembly data

February 11, 2022

Recently, NCBI released their new datasets API, which might replace NCBI E-utils. At the moment, datasets is focused on genomes, genes, and viruses, but no doubt it will expand over time. (Note: I think the name is terrible, and they should use ncbi_datasets; see this tweet.)

Here is a rough guide to extracting some data about genomes using datasets.

First, we need a list of all the bacterial genome assemblies in GenBank. There are currently just over a million of them, and you can download the latest list:

DATE=`date +%Y%m%d`
curl -Lo assembly_summary_$DATE.txt ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
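
Before going further it is worth a quick sanity check on the download. This is a minimal sketch that assumes the summary file is tab-separated, with header lines that start with `#` (which is what the later `grep` step relies on). The name `count_assemblies` is just an illustrative helper, not part of any NCBI tool.

```shell
# Count the assemblies in a summary file, ignoring the '#' header lines.
# count_assemblies is a hypothetical helper name used for illustration.
count_assemblies() {
    grep -vc '^#' "$1"
}

# Example (assumes the file from the curl command above exists):
#   count_assemblies assembly_summary_$DATE.txt
```

If the number is far from what you expect, the download was probably truncated and is worth repeating.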

Now we want to just get the report data about these genomes so we can figure out which ones are worth interrogating further. In particular we are concerned with the number of contigs and the overall assembly length, but you might want other data. Here is how to get all the data for a lot of organisms.

Before you begin, you’ll need to install ncbi_datasets, and you can easily do that with conda:

mamba create -n ncbi_datasets -c conda-forge ncbi-datasets-cli
conda activate ncbi_datasets

First, we are going to get about 10 accessions to see what happens, and then we’ll build up to get all the accessions.

We create a variable called $ACC with the accession numbers separated by spaces.

ACC=$(head ../assembly_summary_20220130.txt | grep -v '^#' | cut -f 1 | tr '\n' ' ')
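
Because `head` runs before the header lines are removed, the command above returns a couple fewer than 10 accessions. If you want an exact count, one option is to drop the headers first. This is only a sketch: `first_accessions` is a made-up helper name, and it assumes the accession is in the first tab-separated column, as in the command above.

```shell
# Print the first N accessions (column 1) of a summary file on one line,
# separated by spaces. Header lines starting with '#' are skipped before
# counting, so N is the number of accessions actually returned.
# first_accessions is a hypothetical helper name.
first_accessions() {
    local file="$1" n="${2:-10}"
    awk -F'\t' -v n="$n" '!/^#/ { if (c++ >= n) exit; printf "%s ", $1 }' "$file"
}

# Example:
#   ACC=$(first_accessions ../assembly_summary_20220130.txt 10)
```

Doing it all in one `awk` call also avoids a pipeline that stops early, which matters if you later run under `set -o pipefail`.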

Now, we use datasets to get the genome assembly report:

datasets download genome accession $ACC --exclude-genomic-cds --exclude-gff3 --exclude-protein --exclude-rna --exclude-seq  --filename ncbi_data.zip

This will download a zip archive (ncbi_data.zip) containing three files:

README.md: a generic readme file
ncbi_dataset/data/assembly_data_report.jsonl: all the genome data in JSON format
ncbi_dataset/data/dataset_catalog.json: a JSON summary of what was downloaded

However, we don’t really want to extract the archive. We can access it directly using another ncbi datasets tool, dataformat. Let’s extract that data into a tsv file:

dataformat tsv genome --package  ncbi_data.zip | awk -F"\t" '!s[$18]++ {print}' > ncbi_data.tsv

Note that we use awk to print only the first line for each accession (column 18 is the Assembly Accession); otherwise we get lots of lines per accession that appear mostly redundant.
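
The hard-coded `$18` will silently pick the wrong column if the column order ever changes. As a hedged alternative, here is a sketch that looks the column up by its header name instead. `dedup_by_column` is a hypothetical helper, and it assumes the first line of the input is the header row.

```shell
# Keep the header plus the first row seen for each value in the named column.
# Reads a tab-separated table on stdin; dedup_by_column is a hypothetical name.
dedup_by_column() {
    local name="$1"
    awk -F'\t' -v name="$name" '
        NR == 1 {
            # find the position of the requested column in the header
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
            print; next
        }
        !seen[$col]++
    '
}

# Example:
#   dataformat tsv genome --package ncbi_data.zip \
#       | dedup_by_column "Assembly Accession" > ncbi_data.tsv
```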

This table will have the following columns:

Annotation Info BUSCO Complete
Annotation Info BUSCO Duplicated
Annotation Info BUSCO Fragmented
Annotation Info BUSCO Lineage
Annotation Info BUSCO Missing
Annotation Info BUSCO Single Copy
Annotation Info BUSCO Total Count
Annotation Info BUSCO Version
Annotation Info Count Gene Non-coding
Annotation Info Count Gene Other
Annotation Info Count Gene Protein-coding
Annotation Info Count Gene Pseudogene
Annotation Info Count Gene Total
Annotation Info Name
Annotation Info Release Date
Annotation Info Report URL
Annotation Info Source
Assembly Accession
Assembly BioProject Lineage Accession
Assembly BioProject Lineage Parent Accessions
Assembly BioProject Lineage Title
Assembly BioSample Accession
Assembly BioSample Attribute Name
Assembly BioSample Attribute Value
Assembly BioSample BioProject Accession
Assembly BioSample BioProject Parent Accessions
Assembly BioSample BioProject Title
Assembly BioSample Description Comment
Assembly BioSample Description Organism Common Name
Assembly BioSample Description Organism Organism Name
Assembly BioSample Description Organism Pangolin Classification
Assembly BioSample Description Organism Strain
Assembly BioSample Description Organism Taxonomic ID
Assembly BioSample Description Title
Assembly BioSample Sample Identifiers Database
Assembly BioSample Sample Identifiers Label
Assembly BioSample Sample Identifiers Value
Assembly BioSample Last updated
Assembly BioSample Models
Assembly BioSample Owner Contact Lab
Assembly BioSample Owner Name
Assembly BioSample Package
Assembly BioSample Publication date
Assembly BioSample Status Status
Assembly BioSample Status When
Assembly BioSample Submission date
Assembly Blast URL
Assembly Description
Assembly GenBank Accession
Assembly Level
Assembly Linked Assembly
Assembly Name
Assembly Paired Accession
Assembly RefSeq Accession
Assembly Refseq Category
Assembly Sequencing Tech
Assembly Submission Date
Assembly Submitter
Assembly Type
Assembly UCSC Assembly Name
Assembly Stats Contig L50
Assembly Stats Contig N50
Assembly Stats Gaps Between Scaffolds Count
Assembly Stats GC Count
Assembly Stats Number of Component Sequences
Assembly Stats Number of Contigs
Assembly Stats Number of Scaffolds
Assembly Stats Scaffold L50
Assembly Stats Scaffold N50
Assembly Stats Total Number of Chromosomes
Assembly Stats Total Sequence Length
Assembly Stats Total Ungapped Length
Breed
Common name
Cultivar
Ecotype
Isolate
Organelle Assembly Name
Organelle BioProject Accessions
Organelle Description
Organelle Infraspecific Name
Organelle Submitter
Organelle Total Seq Length
Organism name
Sex
Strain
Taxonomic ID
WGS contigs URL
WGS project accession
WGS URL
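
Most of these columns are not needed for deciding which genomes to look at. Since we mainly care about the number of contigs and the total assembly length, a small filter that picks columns by name keeps the table manageable. This is a sketch under the assumption that the header names match the list above exactly; `select_columns` is a hypothetical helper.

```shell
# Print only the named columns, in the order requested, from a
# tab-separated table on stdin whose first line is the header.
# select_columns is a hypothetical helper name.
select_columns() {
    local wanted
    wanted=$(printf '%s\t' "$@")   # join the names with tabs
    wanted=${wanted%$'\t'}         # drop the trailing tab
    awk -F'\t' -v OFS='\t' -v wanted="$wanted" '
        NR == 1 {
            n = split(wanted, names, "\t")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) {
                    print "column not found: " names[j] > "/dev/stderr"
                    exit 1
                }
                idx[j] = pos[names[j]]
            }
        }
        {
            line = $(idx[1])
            for (j = 2; j <= n; j++) line = line OFS $(idx[j])
            print line
        }
    '
}

# Example:
#   select_columns "Assembly Accession" \
#       "Assembly Stats Number of Contigs" \
#       "Assembly Stats Total Sequence Length" < ncbi_data.tsv
```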

Running on all the accessions

Now we can put that together and run this on all the accessions at NCBI.

Note: Before you start this it is imperative you have an NCBI API Key set up.
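
A cheap safeguard is to make the job refuse to start when no key is available. The sketch below assumes the datasets command picks the key up from the `NCBI_API_KEY` environment variable; check `datasets --help` for your version, since I have not verified this for every release. `require_api_key` is a hypothetical helper name.

```shell
# Fail early if no API key is available in the environment.
# Assumption: the datasets CLI reads the key from NCBI_API_KEY.
# require_api_key is a hypothetical helper name.
require_api_key() {
    if [ -z "${NCBI_API_KEY:-}" ]; then
        echo "NCBI_API_KEY is not set; export it before downloading" >&2
        return 1
    fi
}

# Example, near the top of a job script:
#   export NCBI_API_KEY="your-key-here"   # placeholder, not a real key
#   require_api_key || exit 1
```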

We can make a simple script to process a lot of accessions at once. From trial and error, it appears that the limit is ~500 accessions, so we set that as our limit.

Let’s have a little slurm script that sets this up as an array job:

#!/bin/bash
#SBATCH --time=2-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# How much memory. I usually request 2000M (2 GB) if I am not sure; here we ask for 4 GB
#SBATCH --mem=4G

# bash strict mode
set -euo pipefail

# have we run already?
if [ -e ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.tsv ]; then exit 0; fi

# sleep for up to two minutes to stagger concurrent jobs
sleep $((RANDOM%120))

# how many accessions to get at a time?
NUM=500
END=$(((SLURM_ARRAY_TASK_ID*NUM)+1))

# generate the list of accessions
ACC=$(head -n $END ../assembly_summary_20220130.txt | tail -n $NUM | grep -v '^#' | cut -f 1 | tr '\n' ' ')

# make sure the output directory exists, then download the data
mkdir -p ncbi
datasets download genome accession $ACC --exclude-genomic-cds --exclude-gff3 --exclude-protein --exclude-rna --exclude-seq  --filename ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.zip

# extract it to a tsv file
dataformat tsv genome --package  ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.zip | awk -F"\t" '!s[$18]++ {print}' > ncbi/ncbi_${SLURM_ARRAY_TASK_ID}.tsv
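
The `head`/`tail` arithmetic in the script is easy to get wrong, so here is a small worked check of which lines of the summary file each array task reads. `chunk_range` is a hypothetical helper that only mirrors the `END` calculation above; it does not touch any files.

```shell
# Print the first and last line of the summary file that a given array
# task reads, mirroring `head -n $END | tail -n $NUM` in the script above.
# chunk_range is a hypothetical helper for checking the arithmetic.
chunk_range() {
    local task="$1" num="${2:-500}"
    local end=$(( task * num + 1 ))
    local start=$(( end - num + 1 ))
    echo "$start $end"
}

chunk_range 1   # prints: 2 501
chunk_range 2   # prints: 502 1001
```

So the windows do not overlap and nothing is skipped. The first window starts at line 2, which is a header line that the `grep` step throws away, so task 1 fetches 499 accessions rather than 500. One thing to watch: if the file has fewer than `END` lines, `tail` still returns the last 500 lines it was given, so the final task (and any task past the end of the file) repeats accessions from earlier tasks. That is harmless, but wasteful.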

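Now you can submit this as an array job, running at most 10 tasks at once (the %10) so NCBI doesn’t get too upset. With 500 lines per task, 1,000 tasks reach the first 500,000 or so lines of the summary file, so set the upper end of the range to suit the size of your file: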

sbatch --array=1-1000%10 extract_all_info.sh
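
To work out how many array tasks a given summary file needs, you can derive it from the line count. This sketch assumes the windowing used in the script above, where task k reads lines (k-1)*NUM+2 to k*NUM+1; `tasks_needed` is a hypothetical helper name.

```shell
# Number of array tasks needed to cover a file of TOTAL lines when task k
# reads lines (k-1)*NUM+2 .. k*NUM+1, as in the script above.
# tasks_needed is a hypothetical helper name.
tasks_needed() {
    local total="$1" num="${2:-500}"
    echo $(( (total + num - 2) / num ))
}

# Example:
#   TOTAL=$(wc -l < ../assembly_summary_20220130.txt)
#   sbatch --array=1-$(tasks_needed "$TOTAL")%10 extract_all_info.sh
```

Your cluster may cap the size of job arrays (Slurm has a MaxArraySize setting), so check with your administrators if sbatch rejects a large range.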

Rob Edwards
