For a researcher working with thousands of research papers, finding the relevant ones is a daunting task. The usual approach is to search by author name or to follow referenced and cited papers. Tools that annotate PDFs and collect references make life somewhat easier for scientists and researchers.

Even with these aids, a researcher still needs to read hundreds of papers, or skim through thousands, to find the needed information. This consumes most of the time spent on research.

Another issue is redundancy. A researcher starting new work needs to make sure that no one else in the world has already studied that specific topic. Many scientists have finished a piece of research only to find that someone else had already done the same thing.

Even checking all the published papers is beyond human capacity, and the sheer size of the available information is hard to comprehend. With the advent of machine learning techniques, these issues can be addressed.

One solution is to find similar lines across papers and use them to identify related papers. Our **NaturalText algorithm** uses grammar to find similar lines.

This analysis used 2200 papers downloaded from NIPS (https://papers.nips.cc/). The algorithm found more than **100 similar lines in 82 papers**. Only matches at around 95% accuracy were taken, and these were verified manually.

The similar lines include steps from an algorithm:

```
2: while convergence criteria is not met do 3: Sample a subset S of node pairs.
2: while convergence criteria is not met do 3: Sample a mini-batch S of nodes.
```

An abbreviation expanded in one of the two lines:

```
The label complexity of standard DBAL is measured in terms of the disagreement coefficient.
The label complexity of disagreement-based active learning is characterized in terms of the disagreement coefficient.
```

Similar descriptions of a formula:

```
A function f : Z → R is DR-submodular if and only if f is lattice submodular and coordinate-wise concave.
A function f : (k + 1) V → R is k-submodular if and only if f is orthant submodular and pairwise monotone.
```

Different words in the same sentence structure:

```
An EM-type algorithm is proposed to fit the model for high dimensional data.
An on-line learning algorithm is devised to make the model feasible for large-scale learning problems.
```

With a total of about a million lines in those files, it is not practical to compare each line with every other line. That would be **a million times a million, i.e. a trillion comparisons**.
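The arithmetic can be checked with a few lines of Python. This is only a sketch of the counting argument: the function names are made up for illustration, and the line count is the rough figure of one million quoted above.

```python
from math import comb


def ordered_comparisons(n: int) -> int:
    """Every line against every line, as counted in the text: n * n."""
    return n * n


def unique_pairs(n: int) -> int:
    """Each unordered pair of distinct lines counted once: n * (n - 1) / 2."""
    return comb(n, 2)


n_lines = 1_000_000  # rough figure from the post, not an exact count
print(f"{ordered_comparisons(n_lines):,}")  # 1,000,000,000,000
print(f"{unique_pairs(n_lines):,}")         # 499,999,500,000
```

Even when each pair is compared only once, the count is still about half a trillion.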

Doing a trillion comparisons is not only infeasible, the number of comparisons also grows quadratically as lines are added: doubling the number of lines quadruples the work.

So the algorithm uses a form of hashing to find similar lines among those million lines. As this is the first attempt at comparing lines and checking the results, only high-accuracy matches were considered, and they were checked manually.
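To give a feel for how hashing avoids pairwise comparison, here is a minimal Python sketch. It is not the NaturalText algorithm, whose details are not described in this post. It assumes a crude stand-in for "grammar": keep a small, hypothetical list of function words, collapse everything between them into a placeholder, and use the resulting skeleton as a hash key. Lines that land in the same bucket become candidates, and every line is touched only once.

```python
from collections import defaultdict

# Hypothetical, deliberately small list of function words; a real system
# would need a much richer notion of grammar than this.
FUNCTION_WORDS = frozenset(
    "a an the is are of in to for and or if only not do while".split()
)


def skeleton(line: str) -> tuple[str, ...]:
    """Keep function words; collapse each run of other tokens into one '_'."""
    out: list[str] = []
    for token in line.lower().split():
        if token in FUNCTION_WORDS:
            out.append(token)
        elif not out or out[-1] != "_":
            out.append("_")
    return tuple(out)


def group_by_skeleton(lines: list[str]) -> list[list[str]]:
    """Bucket lines by skeleton in one pass; return buckets with 2+ lines."""
    buckets: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for line in lines:
        buckets[skeleton(line)].append(line)
    return [group for group in buckets.values() if len(group) > 1]


lines = [
    "An EM-type algorithm is proposed to fit the model for high dimensional data.",
    "An on-line learning algorithm is devised to make the model feasible "
    "for large-scale learning problems.",
    "With total of a million lines, comparing every pair is costly.",
]
for group in group_by_skeleton(lines):
    print(group)
```

The first two lines share the skeleton `an _ is _ to _ the _ for _` and end up in one bucket, while the third does not. A skeleton this coarse would produce many false matches on a real corpus, so candidates in a bucket would still need a stricter scoring step before being reported.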

Full results are available in both HTML and PDF formats.

Click Here for viewing Full Analysis in HTML format

Click Here to Download Full Analysis in PDF format

Check the demo of this method using US Patents

Because this method can be used on any kind of text, historical and legal texts will be analyzed next.

Keep watching this blog.

Email rajasankar@naturaltext.com to have your text data analyzed.
