Introduction

Contex2Vec is a text mining algorithm that can be compared to Word2Vec. As in Word2Vec, NaturalText's Contex2Vec groups words based on contextual similarity, and it uses Machine Learning techniques to generate the word groups.
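To make the idea concrete, here is a minimal sketch of grouping words by the similarity of their context vectors. This is not the actual Contex2Vec model; the toy vectors, the cosine measure and the threshold are illustrative assumptions only, and the real vector-generation step is not shown.

import math

def cosine(a, b):
    # Cosine similarity between two context vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def group_similar(word_vectors, threshold=0.8):
    # Words whose context vectors are similar enough end up in one group
    words = list(word_vectors)
    groups = []
    for i, w in enumerate(words):
        group = [w]
        for other in words[i + 1:]:
            if cosine(word_vectors[w], word_vectors[other]) >= threshold:
                group.append(other)
        if len(group) > 1:
            groups.append(group)
    return groups

# Toy context vectors (purely illustrative)
vectors = {
    "tumor":  [0.90, 0.10, 0.20],
    "tumour": [0.88, 0.12, 0.21],
    "engine": [0.10, 0.90, 0.30],
}
print(group_similar(vectors))  # [['tumor', 'tumour']]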

As with every Machine Learning algorithm, speed is an important factor in judging its performance. This post explains that with statistics.

Performance

The data set used for this analysis is the PubMed Open Access articles listed on the NCBI PubMed site.

Details of the system used for this analysis:

Processor     AMD FX6300 Six-Core Processor
Memory        16 GB DDR3
GPU           None
Motherboard   M5A78L-M/USB3
HDD           Seagate ST4000VX000 4TB
OS            Ubuntu 14.04.4 LTS

The total size of the text data is 46 GB. Preprocessing, processing, generating vectors and comparing the words to group are all done as a sequence of steps. Python's multiprocessing module is used to make efficient use of all 6 cores of the processor; a sketch of this setup follows the table. Below is the table listing the time taken for each step.

Work                  Size        Time Taken
Preprocessing         46 GB       45 mins
Generating Vectors    27 GB       78 mins
Comparisons           4 billion   220 mins
Total time taken                  345 mins
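The sketch below shows how such a pipeline step can be spread across the 6 cores with Python's multiprocessing module. The corpus layout and the preprocess function are stand-ins, not the actual Contex2Vec code; the vector-generation and comparison stages are not shown.

import glob
import multiprocessing as mp

def preprocess(path):
    # Stand-in preprocessing step: read one article and lowercase it
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    return path, text.lower()

if __name__ == "__main__":
    files = glob.glob("pubmed_oa/*.txt")       # hypothetical corpus layout
    with mp.Pool(processes=6) as pool:         # one worker per core
        for path, cleaned in pool.imap_unordered(preprocess, files, chunksize=16):
            pass  # later stages (vectors, comparisons) would consume `cleaned`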

Algorithm

Our code uses pure Python to do all the work, including the heavy lifting. Nothing other than Python: no NoSQL database, no Hadoop/Spark, etc. This merits some explanation.

One might wonder: when open source algorithms are already available, why reinvent the wheel? There are multiple reasons for this.

First is licensing. As a company, I want a completely open source solution to avoid any IPR or licensing issues in the future, so any Java-based solution is out of consideration.

Second, each dependency would add an exponential cost in terms of resources and performance. Fewer dependencies lead to better performance, much as fewer moving parts in an engine lead to fewer issues.

Third, there is no database or framework that can do wonders with meagre resources, as I expected. Instead of tweaking an existing framework, it is easier to write and maintain a new one.

Fourth, Python has quite a lot of frameworks for Statistical Analysis, Scientific Computing, Machine Learning, Data Mining and Natural Language Processing, so it would be easy to pull in those frameworks if and when needed. Maintaining Python code is also much easier than in most other languages.

Fifth, using a multicore processor efficiently needs an intelligent framework. For larger data, multiple machines will be used, so the framework should be able to discover, manage, deploy and monitor the execution of the code across multiple machines.

So the decision to create a custom, in-house framework was reached after considering the above reasons and future needs.

Conclusion

Search the grouped words in the Contex2Vec PubMed data demo, and see the same in the USPTO data demo. Some examples are listed in separate blog posts too.

Contex2Vec worked well on different datasets and is comparatively fast on a single machine. Since it groups contextually related words, it can be used to enhance search, find various formatting issues, and more.

Links to the demos and related blog posts:

NaturalText Contex2Vec Demo Using USPTO data

NaturalText Contex2Vec Demo Using PubMed OA data

Mining PubMed data using Context2Vec

Introducing Contex2Vec, Mining words based on vectors generated by context

Email rajasankar@naturaltext.com to analyze your text data.

Follow @naturaltext