In Natural Language Processing, the standard approach is to train an algorithm on a tagged corpus. If the algorithm works, it can be used to generate models, which in turn can be used to process new text.

How about trying to extract names, places, unique expressions, etc. without using a tagged corpus?

The NaturalText algorithm discussed here does exactly that.

The algorithm works by processing the full text and trying to find patterns. Once the patterns are generated, they are generalized to get the unique words.
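The post does not describe how NaturalText finds or generalizes its patterns, so the following is only a stand-in to illustrate the general idea of surfacing repeated patterns from raw, untagged text. It counts recurring word n-grams; the function name `recurring_ngrams`, the tokenizer, and the thresholds are all assumptions made for this sketch, not part of NaturalText.

```python
import re
from collections import Counter


def recurring_ngrams(text, n=2, min_count=2):
    """Return word n-grams that occur at least min_count times in raw text.

    Illustrative stand-in only: this is NOT the NaturalText algorithm.
    """
    # Simple tokenizer: runs of word characters, keeping internal
    # hyphens and apostrophes (e.g. "Test-Ban", "l'argent") together.
    tokens = re.findall(r"\w+(?:[-']\w+)*", text)
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}


# Made-up sample text, not taken from the corpus.
sample = (
    "Deputy Prime Minister Fini spoke. "
    "The Deputy Prime Minister replied. "
    "Deepwater Horizon was discussed."
)
print(recurring_ngrams(sample, n=3))
```

On a real corpus, a count threshold like this would only be a first filter; how the repeated patterns are then generalized is the part the post leaves unspecified.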

The data set used for this is the EU Parliament corpus.

The corpus is 54 million words long and contains 330k unique words. After the algorithm processed the text, the unique word count was 530k.

How many unique words did the algorithm extract? 277k words, or 52% of the 530k unique words.
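The 52% figure can be checked from the counts given above. This quick check assumes the percentage is relative to the 530k unique words counted after processing, and it uses the rounded figures from the post, so the result is approximate.

```python
# Rounded figures as reported in the post.
total_unique = 530_000  # unique words after the algorithm processed the text
extracted = 277_000     # unique words extracted by the algorithm

share = extracted / total_unique
print(f"{share:.1%}")  # prints 52.3%
```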

Download the total extracted words.

The algorithm took 8 hours to extract the words on an AMD processor with 16 GB of memory, running Python on Ubuntu.

What about the accuracy of the extracted words?

Because the algorithm used raw text, I haven't run those metrics yet; that is the next step. Pre-tagged data would make this easier: the UMBC WebBase corpus or the OANC_GrAF corpus could be used. I am also planning to use some scientific texts.
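Once a tagged corpus is available, one common way to measure accuracy would be precision and recall of the extracted phrases against the tagged ones. A minimal sketch, assuming both are available as collections of strings and that an exact string match counts as a hit; the function name `precision_recall` and the sample data are hypothetical.

```python
def precision_recall(extracted, gold):
    """Compare extracted phrases with gold (tagged) phrases by exact match."""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    # Guard against empty inputs to avoid division by zero.
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Made-up example data, not results measured on the corpus.
extracted = {"Archbishop of Canterbury", "Deepwater Horizon", "voltage electrical"}
gold = {
    "Archbishop of Canterbury",
    "Deepwater Horizon",
    "Deputy Prime Minister",
    "Frankfurter Allgemeine Zeitung",
}
p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact matching is strict: "Archbishop of Bulawayo," with a trailing comma would not match "Archbishop of Bulawayo", so some normalization of the extracted lines would probably be needed first.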

However, I would like to list some of the lines extracted from the text.

Lines in other languages, extracted without language identification

"Arbitrio sobre las Importaciones y Entregas de Mercancías en las islas Canarias",
"Commission Interprofessionnelle et Interministériel de l'Alimentation Animale'
"L'Europe ne se fera pas d'un coup ni dans une construction d'ensemble;
"Les affaires du Parlement européen ne peuvent pas faire l'objet d'un recours juridictionnel." [No
"Lorsque la vacance résulte de l'application des dispositions nationales en vigueur dans un État membre,
"Nous nʼavons que le choix entre les changements dans lesquels nous serons entrainés et ceux que nous aurons su vouloir et accomplir'
"Quousque tandem abutere Catilina patientia nostra [How
"c'est l'argent qui fait la guerre
'Estime important d'évaluer l'efficacité de la politique de l'aide humanitaire.' Qu'est-ce qu'on doit évaluer?
'Frankfurter Allgemeine Zeitung
'Office International du Coin de Terre Jardins Familiaux'
'Qualitätswein garantierten Ursprungs' and 'Qualitätsschaumwein garantierten Ursprungs'
'Verzoekt de lidstaten ernstig in overweging te nemen om in dat geval de structuur van hun economieën waar mogelijk te flexibiliseren en diversificeren,
'anche garantendo l'indipendenza e la separazione delle istituzioni e del potere politico dalle autorità religiose' - 'also
'die besten nachrichtendienstlichen Instrumente '.
'eist dat deze maatregelen de uitbreiding omvatten van de lijst van buiten de wet geplaatste individuen...?
'entreprises pharmaceutiques et organisations scientifiques'
'erede della cultura umanística e della cultura cristiana
'exhorte le Conseil à ne ménager aucun effort pour créer les conditions propices à une brève interruption des bombardements '.
Cuirim fáilte freisin roimh shocrú rialtas Iosrael an deireadh seachtaine seo caite 250 príosúnach a scaoileadh saor.
Cuirimse céad mile fáilte roimh Taoiseach na h'Eireann agus Aire Gnothai na h'Eorpa ata anseo chun deire oifiguil a chuir le Uachtaranacht Eireann don Comhphobail.
ar an céad dul síos ba mhaith liom fáilte a chur roimh Aire Airgeadais na hEireann agus Uachtarán in Oifig,
ar an céad turas dó go dtí Parlaimint na hEorpa.
ar an drochuair bhí meancóg cló ag 183 nuair a chuireamar isteach é agus ní fhacamar é go dtí an bomaite deireanach.
ar an ocáid specialta seo,
athaontú na hEorpa a leanfaidh sé go dtabharfar Eoraip scoilte le chéile d'fhonn comhluachanna agus rath eacnamaíoch a bheith ag a muintir i bpáirt le chéile.
ba mhaith liom a rá go cruinn agus go sóiléir,
ba mhaith liom comhghairdeas a dhéanamh le Rialtas na hÉireann,
ba mhaith liomsa i dtosac báire céad mile fáilte a chur roimh Taoiseach na hEireann,
ba mhaith liomsa i dtosac céad míle fáilte a chuir roimh Aire Gnóthai Eachtracha na hEireann chuig Parlaimint na hEorpa i maidin.

Extracted number descriptions

0.75 l bottles,
0.75 litre bottle.
0.75 litre bottles
1.4 trillion,
1.4% of Gabon's
1.4% per year:
1.4% to 1.5%
1.43 escudos

Dates extracted

13 November was thwarted
13 October 2006 EU-India
13 October and 25 November 2003,

Extracted lines of names, positions, etc.

Archbishop of Braga,
Archbishop of Bulawayo
Archbishop of Bulawayo,
Archbishop of Canterbury
Bessarabian Orthodox Church in Moldavia
Best Available Techniques
Best Available Technology
Best Emerging European Rural Destinations of Excellence.
Chişinau as persona non-grata
Chişinău Municipal
Chişinău airport,
Chişinău and Tiraspol
Chişinău specialist consultancy
Chişinău to scrupulously adhere
Comprehensive Monitoring Report
Comprehensive Monitoring Report,
Comprehensive Monitoring Reports,
Comprehensive Nuclear Test Ban
Comprehensive Nuclear Test-Ban
Comprehensive Nuclear Test-Ban Treaty;
Comprehensive Nuclear Weapons Non-Proliferation
Deepwater Horizon
Deepwater Horizon explosion
Deepwater Horizon oil platform
Deepwater Horizon oil platform sank
Deputy Defence
Deputy Director General
Deputy Director General for Enterprise and Industry,
Deputy Director-General
Deputy Director-General,
Deputy Executive Director
Deputy Executive Director,
Deputy Foreign Minister
Deputy Prime Minister
Deputy Prime Minister Anwar Ibrahim
Deputy Prime Minister Anwar Ibrahim,
Deputy Prime Minister Baramidze
Deputy Prime Minister Baramidze.
Deputy Prime Minister Bozidar Djelic,
Deputy Prime Minister El Gamal,
Deputy Prime Minister Fini
Deputy Prime Minister John Prescott
Deputy Prime Minister John Prescott secured,
Deputy Prime Minister Khiem
Deputy Prime Minister Khloponin,
Deputy Prime Minister Medvedev
Deputy Prime Minister Michael Heseltine
Deputy Prime Minister Michel
Director-General for Fisheries.
Director-General for Justice,
Director-General for Personnel
Director-General for Personnel and Administration
Director-General for Personnel and Administration,
Director-General for Personnel.
Díaz de Mera García Consuegra
Díaz de Mera García Consuegra emphasised,
Díaz de Mera García Consuegra has amply demonstrated,
Díaz de Mera García Consuegra on EUROPOL
Díaz de Mera García Consuegra on EUROPOL,
Díaz de Mera García Consuegra's
Díaz de Mera García Consuegra,
Orthodox Ecumenical Patriarch
Orthodox Ecumenical Patriarch Bartholomew,
Orthodox Ecumenical Patriarch;
Orthodox Ecumenical Patriarchate,
Orthodox Halki Seminary
Orthodox Jewish fundamentalists,
Orthodox Monastery at Halki
Orthodox Patriarch Petros VII
Orthodox Patriarch,
Orthodox Patriarchate in Alexandria.
Orthodox Patriarchate,
Orthodox Patriarchates and Autocephalous Churches

The algorithm also extracted some descriptions

communitarisation of "border management”
communitarisation of debts,
communitarisation of policies'
communitarising the ‘third pillar’,
communitarization being desirable,
dyed their hair blond
dyed-in-the-wool European federalists
dyed-in-the-wool Europhile
dyed-in-the-wool anti-Americans,
dyed-in-the-wool anti-communism.
dyed-in-the-wool conservative,
dyed-in-the-wool dogmatists
dyed-in-the-wool liberal
dyed-in-the-wool liberal.
voltage electrical
voltage gas appliances
voltage of 220 kW
voltage of 225 kV was chosen,
two-stroke and four-stroke engines.
two-stroke engines are dirtier
two-stroke engines emitting hydrocarbons and particulates
two-stroke engines for generators and snow-blowers.

What is unique about this algorithm compared to the other Natural Language Processing algorithms out there?

This algorithm can be applied to any text to extract information. It can be applied to any language without requiring a tagged corpus, and to text from any domain or field.

I think the algorithm may also be useful for finding relationships between words or extracting facts from text, and perhaps even for synonyms/antonyms or sentiment.

I will keep posting the results.

Email rajasankar@naturaltext.com to have your text data analyzed.