Information Extraction, Named-Entity Recognition, and Part-of-Speech Tagging

Defining Information Extraction

Information extraction (IE) systems retrieve, understand, and extract relevant pieces of text, such as phrases and named entities, from documents. They generate structured representations in formats like JSON, XML, or database tables, enabling further analysis and querying.

IE systems answer questions such as: Who did what to whom? When? Where?
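As a rough illustration of what "structured" means here, the sketch below writes out by hand the kind of record an IE system might produce for one sentence and serialises it as JSON. The sentence is paraphrased from the NER excerpt on this page; the field names (who, did_what, of_what, when, where) are hypothetical and chosen only for this example.

```python
import json

# One sentence of unstructured text.
sentence = "John Major chaired a session of the security council in 1992."

# Hand-built record (an assumption for illustration): a real IE system
# would fill these fields automatically from the sentence.
record = {
    "who": "John Major",
    "did_what": "chaired a session",
    "of_what": "security council",
    "when": "1992",
    "where": None,  # not stated in the sentence
}

# Serialise to JSON so the result can be stored or queried.
as_json = json.dumps(record, indent=2)
print(as_json)
```

Once text is in this shape, a question like "When?" becomes a simple field lookup rather than a reading task.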

IE System Applications

[Image Placeholder: IE System Applications]

Named-Entity Recognition (NER)

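NER identifies spans of text that refer to named entities and classifies them by type, such as people, locations, and organisations. The term is commonly extended to dates, times, and quantities, even though these are not proper names. It is foundational for many language processing applications, including sentiment analysis.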

NER Example

Excerpt:
"Funding for poor countries to cope with the impacts of the climate crisis will be a key focus at Cop26. The UN secretary general, António Guterres, warned last year in an interview with the Guardian that the longstanding pledge by rich countries to provide $100bn (£70bn) a year to developing countries from 2020 was unlikely to be met… Along with the US, China, Russia and France, the UK is one of the five permanent members of the security council but has not chaired a session since John Major did so in 1992."

Why Recognise Named Entities?

NER Ambiguities

Type                  Tag  Examples
People                PER  Turing, John Major
Organisation          ORG  IPCC, Cop26
Location              LOC  Mt. Sanitas, Chicago
Geo-political entity  GPE  Palo Alto, UK
Facility              FAC  Golden Gate Bridge
Vehicle               VEH  Ford Falcon
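To make the tag set concrete, here is a minimal sketch of a lookup-based (gazetteer) tagger that uses some of the example entities from the table above. This is a deliberately naive stand-in, not how trained NER systems work: it cannot recognise names it has never seen and it cannot resolve ambiguity. The function name find_entities and the test sentence are assumptions made for this example.

```python
import re

# Toy gazetteer: entity string -> tag, taken from the table above.
GAZETTEER = {
    "Turing": "PER",
    "John Major": "PER",
    "IPCC": "ORG",
    "Cop26": "ORG",
    "Chicago": "LOC",
    "Palo Alto": "GPE",
    "UK": "GPE",
    "Golden Gate Bridge": "FAC",
    "Ford Falcon": "VEH",
}

def find_entities(text):
    """Return (start_offset, entity, tag) for each gazetteer match, in text order."""
    found = []
    for name, tag in GAZETTEER.items():
        # \b keeps "UK" from matching inside a longer word.
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), name, tag))
    return sorted(found)

text = ("Funding will be a key focus at Cop26, and the UK has not "
        "chaired a session since John Major did so in 1992.")

for start, name, tag in find_entities(text):
    print(start, name, tag)
```

The limits of this approach (unseen names, the same string belonging to different types in different contexts) are one motivation for training a sequence classifier instead.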

How Does NER Work?

NER Training and Testing Process

  1. Collect representative training texts
  2. Label each token for its entity class
  3. Design feature extractors
  4. Train a sequence classifier

Testing: Apply the trained model to new texts and output recognised entities.
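Steps 2 and 3 above can be sketched as follows. The example assumes the BIO labelling convention (B- for the first token of an entity, I- for tokens inside it, O for everything else), which is a common choice but is not specified on this page. The feature names are illustrative only; a real system would use a richer feature set or learned representations, and step 4 (training the classifier) is left out.

```python
def bio_labels(tokens, spans):
    """Label each token, given (start, end_exclusive, entity_type) spans."""
    labels = ["O"] * len(tokens)
    for start, end, entity_type in spans:
        labels[start] = "B-" + entity_type
        for i in range(start + 1, end):
            labels[i] = "I-" + entity_type
    return labels

def token_features(tokens, i):
    """Hand-designed features for token i (hypothetical feature set)."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_capitalised": token[0].isupper(),
        "is_all_caps": token.isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

tokens = ["John", "Major", "chaired", "a", "session", "in", "1992"]
spans = [(0, 2, "PER")]  # tokens 0-1 form one PER entity

labels = bio_labels(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]

for token, label in zip(tokens, labels):
    print(token, label)
```

A sequence classifier is then trained to predict each label from the corresponding feature dictionary, usually also taking neighbouring labels into account.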

Part-of-Speech (POS) Tagging

POS tagging determines the grammatical category of each token in a text, such as noun, verb, or adjective. The same word can have different tags depending on context: "book" is a noun in "read the book" but a verb in "book a flight".

Penn Treebank POS Tags (Selected)

Tag  Description               Example
NN   Noun, singular            llama
NNS  Noun, plural              llamas
VB   Verb, base form           eat
VBD  Verb, past tense          ate
JJ   Adjective                 yellow
RB   Adverb                    quickly
IN   Preposition               in, of
CC   Coordinating conjunction  and, but
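The sketch below shows, on a toy scale, how context can decide between tags for the same word. It counts which tag each word received after each previous tag in a tiny hand-tagged corpus, then tags new text greedily from left to right. The corpus is invented for this example, and it uses two Penn Treebank tags not listed in the table above (DT for determiner, TO for "to"). Real taggers are trained on far larger corpora with stronger models.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical, for illustration only).
TRAIN = [
    [("the", "DT"), ("yellow", "JJ"), ("book", "NN")],
    [("to", "TO"), ("book", "VB"), ("llamas", "NNS")],
    [("the", "DT"), ("book", "NN")],
]

context_counts = defaultdict(Counter)  # (previous tag, word) -> tag counts
word_counts = defaultdict(Counter)     # word -> tag counts

for sentence in TRAIN:
    prev_tag = "<s>"  # sentence-start marker
    for word, tag in sentence:
        context_counts[(prev_tag, word)][tag] += 1
        word_counts[word][tag] += 1
        prev_tag = tag

def tag(words):
    """Greedy tagger: prefer the (previous tag, word) evidence, then the word alone."""
    tags = []
    prev_tag = "<s>"
    for word in words:
        if (prev_tag, word) in context_counts:
            best = context_counts[(prev_tag, word)].most_common(1)[0][0]
        elif word in word_counts:
            best = word_counts[word].most_common(1)[0][0]
        else:
            best = "NN"  # crude default for unknown words
        tags.append(best)
        prev_tag = best
    return tags

print(tag(["to", "book", "the", "book"]))
```

Here the first "book" follows TO and is tagged as a verb, while the second follows a determiner and is tagged as a noun.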

POS Tagging as Sequence Labelling

[Image Placeholder: POS Tagging Example]

Sentiment Analysis

Sentiment analysis is a text classification task that extracts opinions and determines the positive or negative orientation of a writer toward a subject, such as a product or political event.

Sentiment Levels Example

Applications of Sentiment Analysis

Bag of Words Model

The bag-of-words model represents a text by the frequency of its words, ignoring their order. It is widely used for document classification and language modelling.
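A minimal sketch of the idea, using only the standard library: each text becomes a table of word counts. The tokenisation rule (lowercase, letters and apostrophes only) is a simplifying assumption, and the two sentences are invented to show what is lost when order is ignored.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts; word order is discarded."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog bit the man")
b = bag_of_words("The man bit the dog")

print(a)
print(a == b)  # the two sentences have identical bags
```

The two sentences mean different things but produce the same representation, which is the price paid for the model's simplicity.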

Sentiment Analysis Techniques

Common Sentiment Lexicons

ML-Based Sentiment Classifier Architecture

  1. Data collection (e.g., tweets via API)
  2. Text cleaning and preprocessing
  3. Feature extraction (Bag of Words, TF-IDF)
  4. Data split: training (80%), testing (20%)
  5. Model training (Naïve Bayes, Logistic Regression, Random Forest, SVM)
  6. Model evaluation: precision, recall, F1-score
  7. Deployment for prediction
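The steps above can be sketched end to end with the standard library alone. This is a toy version under stated assumptions: the ten labelled texts are invented (1 = positive, 0 = negative), feature extraction is plain bag-of-words counts, and the classifier is a hand-written multinomial Naive Bayes with add-one smoothing. In practice a library such as scikit-learn would normally replace the hand-written parts, and the data set would be far larger.

```python
import math
import random
import re
from collections import Counter, defaultdict

# Step 1: hypothetical labelled data, invented for this example.
DATA = [
    ("I love this phone, it is great", 1),
    ("great battery and a lovely screen", 1),
    ("what a wonderful, happy day", 1),
    ("love the new update, works great", 1),
    ("really happy with the service", 1),
    ("I hate this phone, it is awful", 0),
    ("awful battery and a terrible screen", 0),
    ("what a sad, miserable day", 0),
    ("hate the new update, it keeps crashing", 0),
    ("really upset with the service", 0),
]

def tokenize(text):
    # Steps 2-3: lowercase, keep words only, treat the text as a bag of words.
    return re.findall(r"[a-z']+", text.lower())

def train_nb(examples):
    # Step 5: collect the counts that multinomial Naive Bayes needs.
    class_docs, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in examples:
        class_docs[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return class_docs, word_counts, vocab

def predict(model, text):
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                count = word_counts[label][token] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Step 4: shuffle, then split 80% training / 20% testing.
shuffled = DATA[:]
random.Random(0).shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

model = train_nb(train)
predictions = [predict(model, text) for text, _ in test]

# Step 6 (simplified): count correct predictions on the held-out texts.
correct = sum(p == label for p, (_, label) in zip(predictions, test))
print(f"{correct} of {len(test)} test texts classified correctly")
```

With only two test texts the score printed here says very little; the point is the shape of the pipeline, not the number.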

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF evaluates how relevant a word is to a document in a collection. The score increases with the word's frequency in the document but decreases with the number of documents in the collection that contain the word, so terms that are distinctive for a document score highest.
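One common variant of the score is tf(t, d) × log(N / df(t)), where tf is the term's relative frequency in document d, N is the number of documents, and df(t) is the number of documents containing t. Several other weighting variants exist, so the sketch below should be read as one possible definition, not the definitive one. The three documents are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-collection.
docs = [
    "the llama eats grass",
    "the llama sleeps",
    "the dog eats quickly",
]
tokenised = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenised for term in set(doc))

def tf_idf(term, doc_tokens):
    """tf * log(N / df), using relative term frequency."""
    if df[term] == 0:
        return 0.0  # term does not occur anywhere in the collection
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(len(tokenised) / df[term])
    return tf * idf

for term in ["the", "llama", "grass"]:
    print(term, round(tf_idf(term, tokenised[0]), 3))
```

In the first document, "the" scores zero because it appears in every document, while "grass" scores highest because it appears nowhere else.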

Semi-Supervised Learning in Sentiment Analysis

Evaluating Text Classifiers

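Precision and recall are preferred over accuracy when dealing with rare or unbalanced classes, because a classifier that always predicts the majority class can reach high accuracy while never finding a single instance of the rare class.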
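The sketch below computes the three measures from first principles on made-up labels: 5 positive examples among 100. Two hypothetical classifiers are compared. Both reach the same accuracy, but only precision, recall, and F1 reveal that one of them never detects the rare class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority (negative) class.
pred_a = [0] * 100
# Classifier B finds 4 of the 5 positives and raises 4 false alarms.
pred_b = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, accuracy(y_true, pred), precision_recall_f1(y_true, pred))
```

Both classifiers are 95% accurate, yet A has a recall of 0 while B has a recall of 0.8 at a precision of 0.5.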

[Image Placeholder: F1, Precision, and Recall Diagram]

Twitter/X Post Sentiment Analysis

A sample of Twitter posts:

sentiment  id          date                          query     user             text
0          1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0          1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY  scotthamilton    is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0          1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY  mattycus         @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
0          1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  ElleCTF          my whole body feels itchy and like its on fire
0          1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY  Karoli           @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
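Posts like these usually need cleaning before feature extraction. The sketch below shows one possible cleaning function applied to the text of the first post above. The specific rules (drop URLs and @mentions, lowercase, keep only letters and apostrophes) are assumptions made for this example and not a prescribed recipe.

```python
import re

def clean_tweet(text):
    """Strip URLs, @mentions, and punctuation; lowercase; collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = text.lower()
    text = re.sub(r"[^a-z'\s]", " ", text)     # keep letters and apostrophes
    return " ".join(text.split())

raw = ("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. "
       "You shoulda got David Carr of Third Day to do it. ;D")

cleaned = clean_tweet(raw)
print(cleaned)
```

A cleaner this simple also destroys emoticons such as ";D", which can themselves carry sentiment, so the rules are worth revisiting for a real task.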

Checking Sentiments: