Contex2Vec is a text mining algorithm comparable to Word2Vec. Like Word2Vec, NaturalText's Contex2Vec groups words based on contextual similarity, using Machine Learning techniques to generate the word groups.
As with every Machine Learning algorithm, speed is an important factor in judging performance. This post examines that with statistics.
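Contex2Vec's internals are not public, but the general idea of grouping words by contextual similarity can be sketched in a few lines: collect the words that appear around each word, then compare those context profiles with cosine similarity. Everything below (the window size, the toy sentences) is an illustrative assumption, not NaturalText's actual method.

```python
from collections import Counter
from math import sqrt

def context_vectors(sentences, window=2):
    """Build a bag-of-context-words vector for every word.
    Illustrative only: Contex2Vec's actual method is not public."""
    vecs = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            vecs.setdefault(word, Counter()).update(ctx)
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    ["the", "patient", "received", "aspirin", "daily"],
    ["the", "patient", "received", "ibuprofen", "daily"],
]
vecs = context_vectors(sentences)
# "aspirin" and "ibuprofen" share identical contexts, so similarity is ~1.0
print(cosine(vecs["aspirin"], vecs["ibuprofen"]))
```

Words that occur in similar contexts end up with similar context vectors, which is the signal used to group them.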
The data set used for this analysis is the PubMed Open Access articles listed on the NCBI PubMed site.
Details of the system used for this analysis:
| Component | Details |
| --- | --- |
| Processor | AMD FX6300 Six-Core Processor |
| Memory | 16 GB DDR3 |
| HDD | Seagate ST4000VX000 4TB |
| OS | Ubuntu 14.04.4 LTS |
The total size of the text data is 46 GB. Preprocessing, processing, generating vectors, and comparing the words to group are all done as a sequence of steps. Python's multiprocessing module is used to make efficient use of all six cores of the processor. The table below lists the time taken for each step:
| Step | Data Processed | Time Taken |
| --- | --- | --- |
| Preprocessing | 46 GB | 45 mins |
| Generating Vectors | 27 GB | 78 mins |
| Comparisons | 4 billion | 220 mins |
| **Total** | | **345 mins** |
Our code uses pure Python for everything, including the heavy lifting: nothing other than Python, no NoSQL database, no Hadoop/Spark. This merits some explanation.
One might wonder, when Open Source algorithms are available, what is the need for reinventing the wheel? There are multiple reasons for this.
First is the license. As a company, I want a completely open source solution to avoid any IPR or license issues in the future. So any Java based solution is out of consideration.
Second, each dependency adds its own cost in terms of resources and performance. Fewer dependencies mean better performance, much as fewer moving parts in an engine lead to fewer problems.
Third, no database or framework can work wonders with meagre resources as I expected. Instead of tweaking an existing framework, it is easier to write and maintain our own.
Fourth, Python has plenty of frameworks for statistical analysis, scientific computing, Machine Learning, data mining and Natural Language Processing, so it is easy to adopt them if and when needed. In our experience, Python code is also easier to maintain than most other languages.
Fifth, using a multicore processor efficiently needs an intelligent framework. For larger data sets, multiple machines will be used, so the framework should be able to discover, manage, deploy and monitor the execution of code across multiple machines.
The decision to create a custom in-house framework followed from the reasons above and from future needs.
Search the grouped words in the Contex2Vec PubMed data demo, or see the same in the USPTO data demo. Some examples are listed in separate blog posts too.
Contex2Vec worked well across different data sets and is comparatively fast on a single machine. Since it groups contextual words, it can be used to enhance search, find various formatting issues, and more.
Links to the demos and related blog posts:
NaturalText Contex2Vec Demo Using USPTO data
NaturalText Contex2Vec Demo Using PubMed OA data
Mining PubMed data using Context2Vec
Introducing Contex2Vec, Mining words based on vectors generated by context
Email email@example.com to analyze your text data.