Sequence-to-sequence prediction is making waves in multiple fields because of recent advances in deep learning. In this post, let me introduce a graph-theory-based seq2seq technology created by NaturalText to change the way bio-sequences such as genes, proteins, and nucleotides are analyzed in computational biology.

Why seq2seq for Computational Biology?

Bio-sequences are very hard to analyze. DNA has only four letters, arranged into sequences billions of letters long. In this very densely compacted search space, detecting even a single missing letter is very hard.

A technology developed for natural language understanding can solve some of the hardest problems in other fields.

How hard?

The four letters 'ACGT' can be arranged in various ways.

Two-letter sequences can be arranged in 16 ways: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT

Three-letter sequences can be arranged in 64 ways: AAA, AAC.. ACA.. AGA.. CAA.. CGA.. CCC.. GCA... TTT

Four-letter sequences can be arranged in 256 ways: AAAA, AAAC.. ACAA.. AGAA.. CAAC.. CGCA.. CCCC.. GCAC... TTTT

This can be written as 4^1, 4^2, 4^3, 4^4, 4^5, ... 4^n

How many possible combinations can a 100-letter sequence have? 4^100 ≈ 1.6x10^60

For comparison, the observable universe has about 10^82 atoms. A 200-letter sequence has 10^120 possible combinations, comparable to the number of possible games of chess. Imagine the combinations for a 300-million-letter sequence.
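
The arithmetic above can be checked in a few lines of Python. This is a generic illustration of the counting argument, not NaturalText's code:

```python
from itertools import product

def kmer_count(k):
    # Number of distinct DNA sequences of length k over the ACGT alphabet.
    return 4 ** k

def all_kmers(k):
    # Enumerate every length-k sequence over ACGT.
    return ["".join(p) for p in product("ACGT", repeat=k)]

print(len(all_kmers(2)))         # 16 two-letter sequences
print(len(all_kmers(3)))         # 64 three-letter sequences
print(f"{kmer_count(100):.1e}")  # roughly 1.6e+60
```

Enumerating all_kmers(100) is of course impossible; only the count 4^100 can be computed, which is the whole point.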

This shows how hard it is to find a match using a brute-force method. However, we can assume that there is a particular structure hidden in the DNA that makes up the entire living world.

How is it solved today?

Popular tools such as BLAST and Clustal Omega solve this using statistical methods and dynamic programming. For gene comparison and genome assembly, methods such as de Bruijn graphs and the Burrows–Wheeler aligner are available.

How to apply de Bruijn graphs to genome assembly
Fast and accurate short read alignment with Burrows–Wheeler transform
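
For readers unfamiliar with the first of those methods, here is a minimal sketch of how a de Bruijn graph is built from reads: nodes are (k-1)-mers, and each k-mer contributes one edge. This is only an illustration of the cited technique, not any particular tool's implementation:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    # Nodes are (k-1)-mers; each k-mer in a read contributes one edge
    # from its prefix (k-1)-mer to its suffix (k-1)-mer.
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

g = de_bruijn(["ACGTAC"], 3)
print(dict(g))  # edges AC->CG, CG->GT, GT->TA, TA->AC
```

Assemblers then recover the genome by walking paths (Eulerian paths) through this graph.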

These tools still require high-performance computing because of their computational complexity.

Computational complexity

Alignment of multiple sequences is NP-hard or NP-complete, depending on the task performed, as discussed in these papers:

Computational complexity of multiple sequence alignment with SP-score.
On the Complexity of Multiple Sequence Alignment.

A gist of the complexity can be found in Dynamic programming and computational complexity.

How can NaturalText’s seq2seq solve this?

NaturalText’s seq2seq algorithm solves this problem by extracting patterns using graphical reasoning. The exact method is proprietary technology. Generally, it is based on two things: reducing the search space, and failing fast.

Search Space Reduction

Finding a 100-letter string in a 1,000-letter string is faster than finding it in a 300-million-letter string. Even with the brute-force method, this results in a 300,000-fold increase in performance.

Narrowing the search down to only the regions that differ gives the same performance increase when comparing large sequences, or multiple large sequences. When comparing sequences as large as genomes, this method can do magic: if only small regions of the genes differ, those regions alone can be compared and analyzed.
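
The exact algorithm is proprietary, but the general idea of restricting comparison to differing regions can be sketched. The fixed-window scan below is an assumption made for illustration, not the actual method:

```python
def differing_regions(a, b, window=10):
    # Scan two sequences in fixed-size windows and report only the
    # windows whose contents differ; any detailed (expensive) alignment
    # can then be restricted to those regions alone.
    regions = []
    for start in range(0, min(len(a), len(b)), window):
        if a[start:start + window] != b[start:start + window]:
            regions.append((start, min(start + window, len(a))))
    return regions

a = "A" * 100
b = a[:50] + "C" + a[51:]   # one substitution at position 50
print(differing_regions(a, b))  # [(50, 60)]
```

Here a single-letter difference in a 100-letter sequence narrows the work to one 10-letter window, a 10x reduction; on genome-scale inputs the same idea is where the large speedups come from.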

Fail Fast, Fail Faster, Finish Fastest

Reducing the search space is a nice strategy, but what about worst-case performance? Well, that is where the fail-fast method comes in. If there is no likelihood of finding a motif, the process is stopped immediately. This ensures that if and when regions are compared, there is a 100% chance of finding a match.
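
As a hedged illustration of the fail-fast idea (again, not the proprietary method), a cheap pre-check can rule out a motif before any full scan is attempted:

```python
def fail_fast_find(motif, region):
    # Cheap pre-check: if any letter of the motif is absent from the
    # region, the motif cannot occur there, so stop before a full scan.
    if not set(motif) <= set(region):
        return -1  # failed fast: no scan performed
    # The full scan runs only when a match is still plausible.
    return region.find(motif)

print(fail_fast_find("GATTACA", "CCCCCCCC"))  # -1, ruled out instantly
print(fail_fast_find("ACG", "TTACGTT"))       # 2
```

A letter-set test is a deliberately weak filter chosen for brevity; any real system would use stronger fingerprints, but the principle is the same: spend almost nothing to skip hopeless regions.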

The 100% match advantage

Current tools can’t match to 100% owing to the NP-hardness of the problem. This leads to problems such as gaps in whole-genome sequencing data. NaturalText’s seq2seq can match 100%, down to a single letter. This advantage comes from never running into the NP-hard problem in the first place; the idea is to bypass the NP-hard problem instead of solving it.

This has the potential to change genomics research by reducing cost and increasing accuracy.

Seq2seq: Patterns, not Statistics

Speed and accuracy are possible because patterns are used instead of statistical calculations. Statistics is a great tool; however, when thrown at the densely compacted information of DNA or proteins, its performance takes a hit, along with its accuracy. Using patterns instead of statistical methods gives the advantage of finding a 100% match.

Graphical Reasoning, not Deep Learning

Even though deep learning methods also do sequence-to-sequence prediction, NaturalText’s methods are not deep learning; they are based on graph theory. The decision making (the discriminator, in deep learning terms) is done by reasoning using graph theory. Hence, it is artificial reasoning instead of artificial intelligence.

Where are Variants, Indels, etc.?

This is a mathematical solution that treats bio-sequences as strings. Variants and other concepts can be applied on top, using artificial reasoning.

Comparing Millions of short reads in bulk for Genome assembly

If applied to genome assembly, this can cut the time tenfold. If three 150-letter short reads are 99% similar, there is a 100% chance that all three short reads occur in the same region.

Instead of finding that region three times, the particular motif area can be found once, and the three reads matched at once. For a million reads, reads can be grouped and assembled: a million reads might be categorized into 5,000 categories, a 200-fold performance improvement over comparing them one at a time.
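
The grouping idea can be sketched generically. Keying reads on a shared leading k-mer, as below, is an illustrative stand-in for whatever similarity measure is actually used:

```python
from collections import defaultdict

def group_reads(reads, k=8):
    # Bucket reads that share the same leading k-mer; each bucket is
    # then located in the reference once, instead of once per read.
    groups = defaultdict(list)
    for read in reads:
        groups[read[:k]].append(read)
    return groups

reads = ["ACGTACGTAA", "ACGTACGTCC", "TTTTGGGGAA"]
groups = group_reads(reads)
print(len(groups))  # 2 groups instead of 3 separate searches
```

With a million reads collapsing into a few thousand groups, the number of expensive region searches drops by the same factor, which is where the claimed 200x figure would come from.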

If NGS results are in the order of millions, short reads can be compared in bulk instead of one by one, saving time and cost.

Instead of following the familiar path of matching against a reference genome, short reads can be compared with each other, eliminating another round of verification and validation.

Storing in Graph Database

Storing bio-sequences in a graph database, or any database, is hard because the distribution is extremely dense: just 4 nodes and millions of edges. However, storing extracted motifs is easy. A gene can be reduced to 30 million motifs, which can be stored in a graph database and analyzed.
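
A motif store of this kind can be sketched with a plain dictionary standing in for the graph database. Extracting motifs by simple k-mer slicing is an assumption made for illustration; the actual motif extraction is proprietary:

```python
from collections import defaultdict

def motif_index(sequence, k=6):
    # Map every length-k motif to the positions where it occurs. In a
    # graph database, each motif would become a node and each occurrence
    # an edge back to its location in the sequence.
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

idx = motif_index("ACGTACGT", k=4)
print(idx["ACGT"])  # [0, 4]
```

The payoff is that queries now touch a motif node and its handful of edges, rather than rescanning the raw sequence.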

However, at 30 million patterns per gene and 23 genes, that is 690 million patterns, and this is just for one set. If there are small differences, this count can go up by millions, and now consider sequencing millions of people. Can any database handle this load? Given billions of patterns, how do we find and compare patterns quickly?

High Throughput Database

A high-throughput database handling half a billion nodes and eight billion edges, with response times in the range of milliseconds, can handle even trillions of nodes in a distributed environment.

This database was developed by NaturalText to handle big data.

Performance Details

Gene assembly

Gene taken for comparison :
gi|568815597|ref|NC_000001.11| Homo sapiens chromosome 1, GRCh38.p7 Primary Assembly
Number of Nucleotide short reads : 1 Million

System :

Processor: AMD FX(tm)-6300 Six-Core Processor
Memory : 24GB (DDR2 8GB *3)
Time to process ref Gene : 20 Minutes
Time to compare a million nucleotides : 32 Minutes
Max Memory : 10 GB
Threads : Single Thread

These times are on a low-cost, low-end machine. The time drops by half when processed on an i7 processor.

Processor :  Intel(R) Core(TM) i7-4770K CPU
Memory : 16 GB (DDR3 8GB * 2)
Time to process ref Gene : 9 Minutes
Time to compare a million nucleotides : 14 Minutes
Max Memory : 10 GB
Threads : Single Thread

Check the NaturalText Genome Alignment and Search Demo

If processing the reference gene is omitted (it can be stored in a database and loaded in less than a minute), the total time taken to process a million nucleotides is 4x faster than current methods, which take an hour on a high-end machine with 256 GB of memory. Only a single thread is used, to make comparisons easy. Open-source tools such as the Burrows–Wheeler Aligner take one hour on a single thread.

For high-end machines, running costs are high on top of the initial investment. This makes treatments such as precision medicine costly and inaccessible to the masses.

Nucleotide Comparison

Processor:	AMD FX(tm)-6300 Six-Core Processor
Memory : 24GB (DDR2 8GB *3)
Time to process a million nucleotides : 2 Minutes
Time to compare a million nucleotides : 10 Minutes
Max Memory : 10 GB
Threads : Single Thread

Check the NaturalText Nucleotide Alignment and Search Demo

API Access and OnDemand Analysis

Another big advantage of this approach is that it can be accessed via API, and on-demand analysis is possible. But another question arises: what if there is a new sequence, or a need to find patterns in another set of data? Well, this is built on a general solution for sequence-to-sequence prediction applied across multiple fields, so it can scale both vertically (more data, more analysis) and horizontally (various comparisons, different datasets).

How is that possible? Because this sequence alignment technology was created as a general solution for sequence prediction in various areas, such as natural language.

Cloud Hosting Details

Processor :  2 Core
Memory : 2 GB

Yes, the millisecond performance in the demo runs on 2 GB of memory.

Related Posts:

Protein Alignment and Search by Graphical Models

Generating new sentences in Natural Language as a Graph Clique Problem : Graphical Reasoning based solution

Natural Language Grammar Correction using Graphical Logical Reasoning
