Contex2Vec mines and groups words based on the contexts in which they are used. Like Google's well-known Word2Vec, NaturalText's Contex2Vec finds similar words based on context and uses machine learning techniques to generate the context vectors.

Converting words into numbers by a "magic method" is the dream of those working in Natural Language Processing, because numbers can be processed more easily than words. The "magic method" can be vectorization, machine learning, natural language processing, or a combination of these. Contex2Vec uses vectorization and machine learning.

Ever since Word2Vec was introduced, that dream has looked like reality. Asked "if man is to woman, then king is to what?", Word2Vec returns the answer "queen".
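The analogy question is usually answered with vector arithmetic: take the vector for "woman", subtract "man", add "king", and look for the nearest remaining word. The sketch below illustrates only that arithmetic. The two-dimensional `toy_vectors` are hand-made for this example (they are not real Word2Vec embeddings), and the function names are hypothetical.

```python
import math

# Hand-made 2-D toy vectors (axes: royalty, gender). These are an assumption
# for illustration only, NOT embeddings produced by Word2Vec or Contex2Vec.
toy_vectors = {
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors):
    """Return the word d such that 'a is to b as c is to d'."""
    # Target point: vector(b) - vector(a) + vector(c)
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    # The three question words are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("man", "woman", "king", toy_vectors))  # prints: queen
```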

Contex2Vec by NaturalText extracts words based on context. For example, "data base" and "information flow" can be similar in context, and "1/3" can be equivalent to "one-third".
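To make "similar in context" concrete, here is a minimal sketch of one common approach: describe each word by counts of its neighbouring words and compare those count vectors with cosine similarity. This is not the Contex2Vec algorithm, whose details are not given here. The window size, the toy corpus, and the function names are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy corpus invented for this sketch; it is not taken from the patent data.
toy_corpus = [
    "the mixture was heated for 3 hours",
    "the mixture was heated for 30 minutes",
    "the sample was stored in a sealed container",
]

vecs = context_vectors(toy_corpus)
# "hours" and "minutes" share the neighbour "for"; "hours" and "stored" share none.
print(cosine(vecs["hours"], vecs["minutes"]) > cosine(vecs["hours"], vecs["stored"]))  # True
```

With real data, the vectors would be far larger and would typically be weighted or learned rather than left as raw counts.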

The past 5 months of US patent data released by the USPTO was used for this analysis. From over 200k words, Contex2Vec extracted around 9k words based on similarity.

This dataset was processed on a 6-core AMD processor with 16 GB of memory in 15 hours, including preprocessing the data, extracting words, learning vectors, and comparing each word with every other word. No GPU processing or deep learning was used. The original text data is 5 GB, of which only words occurring in more than 50 contexts were used for this analysis. What counts as a context is determined by the Contex2Vec algorithm.
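The filtering and all-pairs comparison steps can be sketched roughly as follows. Because the definition of a context is internal to Contex2Vec, this sketch assumes a context is simply the (previous word, next word) pair around a word, and it uses Jaccard overlap as a stand-in similarity measure. The function names and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def contexts_per_word(token_lists):
    """Collect the distinct (left, right) neighbour pairs seen around each word.

    Assumption: a 'context' is the pair of adjacent words; Contex2Vec's own
    definition is not described in this post.
    """
    contexts = defaultdict(set)
    for tokens in token_lists:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            contexts[padded[i]].add((padded[i - 1], padded[i + 1]))
    return contexts

def frequent_words(contexts, min_contexts=50):
    """Keep only words that occur in more than `min_contexts` distinct contexts."""
    return {w: c for w, c in contexts.items() if len(c) > min_contexts}

def jaccard(a, b):
    """Share of contexts two words have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similar_pairs(contexts, threshold):
    """Compare every word with every other word; yield pairs at or above `threshold`."""
    for w1, w2 in combinations(sorted(contexts), 2):
        score = jaccard(contexts[w1], contexts[w2])
        if score >= threshold:
            yield w1, w2, score

# Toy data; the threshold is lowered to 1 because the sample is tiny.
toy = [
    ["heated", "for", "hours", "today"],
    ["stored", "for", "hours", "again"],
    ["heated", "for", "minutes", "today"],
    ["stored", "for", "minutes", "again"],
]
kept = frequent_words(contexts_per_word(toy), min_contexts=1)
pairs = list(similar_pairs(kept, threshold=0.5))
print(pairs)
```

Comparing each word with every other word grows quadratically with the vocabulary, which is one reason a frequency cut-off such as the 50-context threshold helps keep the run time manageable.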

The Contex2Vec algorithm can mine any text data, including legal and medical text, to generate contextually similar words for use in semantic search and other Natural Language Processing tasks.

Take a look at some example groupings by Contex2Vec:

      example described

      embodiment disclosed
      example shown
      invention described
      process described
      action taken
      conditions described
      elements described
      embodiment described
      embodiment disclosed
      embodiment shown
      embodiments described
      example provided
      example shown
      examples provided
      features disclosed
      functionality described
      functions described
      illustrated embodiments
      illustrative embodiment
      information described
      invention described
      invention disclosed
      materials described
      method described
      methods described
      n-type semiconductor layer
      organic light emitting element
      procedures described
      process described
      processing described
      references cited
      steps described
      structures described
      terms used

As you can see, our Contex2Vec algorithm groups the phrases that may convey the meaning of "example". These groups can be used in search, rewriting, etc.
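As one illustration of the search use case, a group of contextually similar phrases could be used to expand a query into several variants. The sketch below is a hypothetical example: the `groups` table holds a few phrases copied from the grouping above, and `expand_query` is an invented helper, not part of any NaturalText API.

```python
# Hypothetical lookup table; the phrases are copied from the grouping above.
groups = {
    "example described": [
        "embodiment disclosed",
        "example shown",
        "invention described",
    ],
}

def expand_query(query, groups):
    """Return the original query plus one variant per contextually similar phrase."""
    variants = [query]
    for phrase, similar in groups.items():
        if phrase in query:
            variants.extend(query.replace(phrase, alt) for alt in similar)
    return variants

for variant in expand_query("example described in figure 2", groups):
    print(variant)
```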

      10 μm

      100 nm
      20 mm
      50 nm

      12 hours

      3 hours
      30 minutes
      30 mm
      48 hours
      5 hours
      170° C.

      1000° C.
      110° C.
      130° C.
      140° C.
      160° C.
      3 times

      1.5 times
      10 times
      2 times
      4 times
      five times
      four times
      located outside
      made aware
      data associated

      information associated
      data collected

      images captured
      information obtained
      information provided
      heat generated

      light emitted
      light generated


      above described embodiments
      above-described embodiments
      embodiments described above
      foregoing embodiments

As with any other machine learning technique, Contex2Vec also generates incorrect pairs. Take a look at some of them:


      heat pipe 


      5 times

      1.5 times
      collectively referred
      formed along
      located outside
      made using

This algorithm can be applied to various Natural Language Processing tasks.

Check the alphabetically listed words in our demo.

Email us to analyze your text data.
