Analyzing 3024 rice genomes characterized by DeepVariant


Rice is an ideal candidate for study in genomics, not only because it’s one of the world’s most important food crops, but also because centuries of agricultural cross-breeding have created unique, geographically induced differences. With global population growth and climate change likely to affect crop yields, the study of this genome has important social implications.

This post explores how to identify and analyze different rice genome mutations with a tool called DeepVariant. To do this, we performed a re-analysis of the Rice 3K dataset and have made the data publicly available as part of the Google Cloud Public Dataset Program pre-publication and under the terms of the Toronto Statement.

We aim to show how AI can improve food security by accelerating genetic enhancement to increase rice crop yield. According to the Food and Agriculture Organization of the United Nations, crop improvements will reduce the negative impact of climate change and loss of arable land on rice yields, as well as support an estimated 25% increase in rice demand by 2030.

Why catalog genetic variation for rice on Google Cloud?

In March 2018, Google AI showed that deep convolutional neural networks can identify genetic variation in aligned DNA sequence data. This approach, called DeepVariant, outperforms existing methods on human data, and we showed that the same approach used to call variants in humans can also call variants in other animal species. This blog post shows that DeepVariant is effective at calling variants in a plant as well, demonstrating the effectiveness of deep neural network transfer learning in genomics.

In April 2018, three research institutions–the Chinese Academy of Agricultural Sciences (CAAS), the Beijing Genomics Institute (BGI) Shenzhen, and the International Rice Research Institute (IRRI)–published the results of a collaboration to sequence and characterize the genomic variation of the Rice 3K dataset, which consists of genomes from 3,024 varieties of rice from 89 countries. Variant calls used in this publication were identified against a Nipponbare reference genome using best practices and are available from the SNP-Seek database (Mansueto et al., 2017).

We recharacterized the genomic variation of the Rice 3K dataset with DeepVariant. Preliminary results indicate a larger number of variants discovered at a similar or lower error rate than those detected by conventional best practice, i.e. GATK.

In total, the Rice 3K DeepVariant dataset contains ~12 billion variants at ~74 million genomic locations (SNPs and indels). These are available in a 1.5 terabyte (TB) table that uses the BigQuery Variants Schema.
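
To get a feel for these numbers, here is a back-of-the-envelope calculation. It assumes that the ~12 billion figure counts per-variety variant calls and that 1 TB means 10^12 bytes; both are assumptions made for illustration, not statements from the dataset documentation.

```python
# Rough scale estimates from the figures quoted in the text.
n_calls = 12e9        # ~12 billion variants (assumed to be per-variety calls)
n_sites = 74e6        # ~74 million genomic locations
n_varieties = 3024    # rice varieties in the Rice 3K dataset
table_bytes = 1.5e12  # 1.5 TB, assuming decimal terabytes

calls_per_site = n_calls / n_sites                  # average calls per location
share_of_varieties = calls_per_site / n_varieties   # average share of varieties per site
bytes_per_call = table_bytes / n_calls              # average storage per call

print(f"{calls_per_site:.0f} calls per site")
print(f"{share_of_varieties:.1%} of varieties carry a variant at an average site")
print(f"{bytes_per_call:.0f} bytes per call")
```

Under these assumptions, an average variant site is called in roughly 160 of the 3,024 varieties, which suggests that most variants are carried by only a small share of the varieties.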

Even at this size, you can still run interactive analyses, thanks to the scalable design of BigQuery. The queries we present below run in a few seconds to a few minutes. Speed matters, because genomic data are often interlinked with data generated by other precision agriculture technologies.

Illustrative queries and analyses

Below, we present some example queries and visualizations of how to query and analyze the Rice 3K dataset. Our analyses focus on two topics:

  • The distribution of genome variant positions across 3,024 rice varieties.
  • The distribution of allele frequencies across the rice genome.
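
To make the second topic concrete: the allele frequency at a site is the fraction of called alleles, across all varieties, that differ from the reference. The minimal sketch below computes it from diploid genotype calls encoded as integers (0 for the reference allele, 1 or higher for an alternate allele). The function name, the toy genotypes, and the use of negative values to mark missing calls are assumptions made for illustration.

```python
def alt_allele_frequency(genotypes):
    """Fraction of called alleles at one site that are non-reference.

    `genotypes` is a list of per-variety genotype tuples, e.g. (0, 1).
    Negative allele values are treated as missing calls and skipped
    (an assumption of this sketch). Returns None if nothing was called.
    """
    called = [allele for gt in genotypes for allele in gt if allele >= 0]
    if not called:
        return None
    return sum(1 for allele in called if allele > 0) / len(called)


# Four hypothetical varieties at one site; the last has no call.
site = [(0, 0), (0, 1), (1, 1), (-1, -1)]
print(alt_allele_frequency(site))  # 3 alternate alleles out of 6 called
```

Repeating this per site and plotting a histogram of the results gives the allele frequency distribution across the genome.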

For a step-by-step tutorial on how to work with variant data in BigQuery using the Rice 3K data or another variant dataset of your choosing, consider trying out the Analyzing variants with BigQuery codelab.

Analysis 1: Genetic variants are not uniformly distributed

Genomic locations with very high or very low levels of variation can indicate regions of the genome that are under unusually high or low selective pressure.

In these rice varieties, low genetic variation indicates regions of the genome under high artificial selective pressure (i.e., domestication). Moreover, these regions contain genes responsible for traits that govern important cultivation or nutritional properties of the plant.

We can measure the magnitude of the regional pressure by calculating at each position the Z statistic of each individual variety vs. all varieties. Here’s the query we used to produce the heatmap below, which shows the distribution of genetic variation across all 1 Mbase regions of all 12 chromosomes as columns (labeled by the top colored row) vs. all 3,024 rice varieties as rows. Red indicates very low variant density relative to other samples within a particular genomic region, while pale yellow indicates very high variant density within a particular genomic region. The dendrogram below shows the similarity among samples (branch length) and groups similar rice varieties together.
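
As a rough illustration of the Z statistic described above, the sketch below bins variant positions into 1 Mbase regions for each variety, then standardizes each variety's count against all varieties within the same region. This is not the BigQuery query used for the heatmap. The function names, the toy positions, and the choice of the population standard deviation are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev

BIN_SIZE = 1_000_000  # 1 Mbase regions, as in the heatmap description


def bin_counts(variants):
    """Count variants per (chromosome, bin index) for one variety."""
    return Counter((chrom, pos // BIN_SIZE) for chrom, pos in variants)


def region_z_scores(variants_by_sample):
    """Return {region: {sample: z}}, comparing each variety to all varieties."""
    counts = {s: bin_counts(v) for s, v in variants_by_sample.items()}
    regions = set().union(*(c.keys() for c in counts.values()))
    scores = {}
    for region in sorted(regions):
        # A Counter returns 0 for regions where a variety has no variants.
        per_sample = {s: counts[s][region] for s in counts}
        values = list(per_sample.values())
        mu, sigma = mean(values), pstdev(values)
        scores[region] = {
            s: (v - mu) / sigma if sigma > 0 else 0.0
            for s, v in per_sample.items()
        }
    return scores


# Hypothetical variant positions for three varieties on one chromosome.
toy = {
    "variety_A": [("chr01", 120_000), ("chr01", 480_000), ("chr01", 910_000)],
    "variety_B": [("chr01", 300_000)],
    "variety_C": [("chr01", 650_000), ("chr01", 1_200_000)],
}
scores = region_z_scores(toy)
for region, by_sample in scores.items():
    print(region, {s: round(z, 2) for s, z in by_sample.items()})
```

A strongly negative score marks a variety with unusually few variants in that region compared with the others, which is the pattern the heatmap colors red.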
