Information Extraction, Named-Entity Recognition, and Part-of-Speech Tagging
What's on this page?
- Information Extraction (IE)
- Named-entity recognition (NER)
- Part-of-speech tagging (POS)
- Sentiment analysis: goals, applications, methods, and evaluation
- Lexicon-based approaches for sentiment analysis
- Evaluating sentiment classification using Twitter/X posts
Defining Information Extraction
Information extraction (IE) systems retrieve, understand, and extract relevant pieces of text, such as phrases and named entities, from documents. They generate structured representations in formats like JSON, XML, or database tables, enabling further analysis and querying.
IE systems answer questions like: Who did what to whom? When? Where? They transform unstructured text into structured data for querying and further use.
IE System Applications
- Creating calendar events from emails
- Business intelligence: extracting insights from reports
- Bioinformatics: learning drug-gene interactions from research
- Retail: supporting marketing and inventory decisions
[Image Placeholder: IE System Applications]
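As a minimal illustration of the "calendar events from emails" use case, the sketch below pulls a title, date, and time out of free text with a regular expression and emits a structured record (a dict that could then be serialised to JSON or stored in a table). The pattern and function names are illustrative, not part of any real IE system:

```python
import re

# Hypothetical pattern: "<title> on YYYY-MM-DD at HH:MM"
EVENT_PATTERN = re.compile(
    r"(?P<title>.+?) on (?P<date>\d{4}-\d{2}-\d{2}) at (?P<time>\d{2}:\d{2})"
)

def extract_event(text):
    """Return a structured event record, or None if no event is found."""
    match = EVENT_PATTERN.search(text)
    return match.groupdict() if match else None

email = "Reminder: Project review on 2024-05-01 at 14:00 in room B."
print(extract_event(email))
# {'title': 'Reminder: Project review', 'date': '2024-05-01', 'time': '14:00'}
```

Real IE systems replace the hard-coded regex with trained extractors, but the output shape (structured fields answering who/what/when/where) is the same.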
Named-Entity Recognition (NER)
NER identifies and classifies proper names in text, such as people, locations, organisations, dates, times, and quantities. It is foundational for many language processing applications, including sentiment analysis.
NER Example
Excerpt:
"Funding for poor countries to cope with the impacts of the climate crisis will be a key focus at Cop26.
The UN secretary general, António Guterres, warned last year in an interview with the Guardian that the longstanding pledge by rich countries to provide $100bn (£70bn) a year to developing countries from 2020 was unlikely to be met… Along with the US, China, Russia and France, the UK is one of the five permanent members of the security council but has not chaired a session since John Major did so in 1992."
- People: António Guterres, John Major
- Dates: 2020, 1992
- Locations: US, China, Russia, France, UK
- Organisations: Cop26, UN, Guardian
- Quantities: $100bn (£70bn)
Why Recognise Named Entities?
- Tagging and indexing for search and linking
- Key for question-answering systems
- Supports sentiment analysis by linking opinions to entities
- Enables structured knowledge extraction for databases and knowledge graphs
NER Ambiguities
- Boundary ambiguity: "First Bank of Chicago" vs. "Bank of Chicago"
- Type ambiguity: "Washington" can refer to a person, city, organisation, or vehicle
| Type | Tag | Examples |
| --- | --- | --- |
| People | PER | Turing, John Major |
| Organisation | ORG | IPCC, Cop26 |
| Location | LOC | Mt. Sanitas, Chicago |
| Geo-political entity | GPE | Palo Alto, UK |
| Facility | FAC | Golden Gate Bridge |
| Vehicle | VEH | Ford Falcon |
How Does NER Work?
- NER uses sequence labelling models trained on annotated datasets
- Common encoding schemes: IO (Inside-Outside), IOB (Inside-Outside-Beginning)
- Features: current word, context words, part-of-speech tags, previous/next labels
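The IOB scheme above can be made concrete with a small helper that converts token-level entity spans into one tag per token: `B-<TYPE>` opens an entity, `I-<TYPE>` continues it, and `O` marks tokens outside any entity. This is a sketch of the encoding only, not a trained recogniser:

```python
def to_iob(tokens, spans):
    """Encode entity spans as IOB tags.

    spans: list of (start, end, entity_type) with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # continuation tokens
    return tags

tokens = ["António", "Guterres", "warned", "the", "UK", "."]
spans = [(0, 2, "PER"), (4, 5, "GPE")]
print(list(zip(tokens, to_iob(tokens, spans))))
# [('António', 'B-PER'), ('Guterres', 'I-PER'), ('warned', 'O'),
#  ('the', 'O'), ('UK', 'B-GPE'), ('.', 'O')]
```

The `B-` prefix is what resolves boundary ambiguity: two adjacent entities of the same type stay distinguishable because the second one starts with a fresh `B-` tag.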
NER Training and Testing Process
- Collect representative training texts
- Label each token for its entity class
- Design feature extractors
- Train a sequence classifier
Testing: Apply the trained model to new texts and output recognised entities.
Part-of-Speech (POS) Tagging
POS tagging determines the grammatical function of each token in a text, such as noun, verb, adjective, etc. The same word can have different tags depending on context.
Example: "back" can be a noun ("on my back"), adverb ("win the voters back"), adjective ("the back door"), or verb ("promised to back the bill").
Penn Treebank POS Tags (Selected)
| Tag | Description | Example |
| --- | --- | --- |
| NN | Noun, singular | llama |
| NNS | Noun, plural | llamas |
| VB | Verb, base form | eat |
| VBD | Verb, past tense | ate |
| JJ | Adjective | yellow |
| RB | Adverb | quickly |
| IN | Preposition | in, of |
| CC | Coordinating conjunction | and, but |
POS Tagging as Sequence Labelling
- Taggers use information about neighbouring tokens for accurate tagging
- Important for parsing, speech recognition, and word sense disambiguation
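The value of context can be shown with a deliberately tiny tagger: a lexicon gives each word a default tag, and a single hand-written rule uses the previous tag to disambiguate "back". The lexicon entries and the rule are illustrative assumptions, not a trained sequence model:

```python
# Default (most-frequent) tag per word -- a hypothetical toy lexicon.
LEXICON = {"the": "DT", "promised": "VBD", "to": "TO",
           "back": "NN", "bill": "NN", "door": "NN"}

def tag(tokens):
    tags = []
    for word in tokens:
        t = LEXICON.get(word.lower(), "NN")  # fall back to NN for unknowns
        # Context rule: after infinitival "to", "back" is a verb (VB).
        if word.lower() == "back" and tags and tags[-1] == "TO":
            t = "VB"
        tags.append(t)
    return tags

print(tag(["promised", "to", "back", "the", "bill"]))
# ['VBD', 'TO', 'VB', 'DT', 'NN']
print(tag(["the", "back", "door"]))
# ['DT', 'NN', 'NN']  -- the toy rule misses the adjective reading (JJ)
```

The second sentence shows the limit of one rule: real taggers learn many such contextual cues from annotated corpora instead of hand-writing them.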
[Image Placeholder: POS Tagging Example]
Sentiment Analysis
Sentiment analysis is a text classification task that extracts opinions and determines the positive or negative orientation of a writer toward a subject, such as a product or political event.
- Used in marketing, politics, recommendation systems, trust/reputation systems, and research
- Can be binary (positive/negative) or multi-level (e.g., very positive, positive, neutral, negative, very negative)
Sentiment Levels Example
- [-5, -3): very negative
- [-3, -1): negative
- [-1, 1]: neutral
- (1, 3]: positive
- (3, 5]: very positive
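Mapping a numeric score onto these levels is a simple chain of threshold checks. The sketch below uses half-open bins so every score falls in exactly one level, and assumes the middle band is labelled "neutral":

```python
def sentiment_level(score):
    """Map a score in [-5, 5] to a discrete sentiment level."""
    if score < -3:
        return "very negative"
    if score < -1:
        return "negative"
    if score <= 1:
        return "neutral"
    if score <= 3:
        return "positive"
    return "very positive"

print(sentiment_level(-4.2))  # very negative
print(sentiment_level(0.0))   # neutral
print(sentiment_level(2.5))   # positive
```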
Applications of Sentiment Analysis
- Marketers and businesses for customer feedback
- Political analysts for campaign tracking
- Recommendation engines
- Trust and reputation systems in e-commerce
- Researchers in psychology, finance, and social science
Bag of Words Model
The bag-of-words model represents text by the frequency of its words, ignoring their order. It is widely used for document classification and language modelling.
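In its simplest form a bag of words is just a word-frequency map, which Python's `collections.Counter` provides directly (tokenisation here is naive whitespace splitting, for illustration only):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as word frequencies, discarding word order."""
    return Counter(text.lower().split())

doc = "the movie was great the acting was great"
print(bag_of_words(doc))
# Counter({'the': 2, 'was': 2, 'great': 2, 'movie': 1, 'acting': 1})
```

Note that "great the acting movie was the was great" would produce an identical representation: the order really is gone.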
Sentiment Analysis Techniques
- Machine Learning Approaches: Supervised, semi-supervised, unsupervised
- Lexicon-Based Approaches: Use sentiment lexicons (dictionaries or corpus-based)
- Hybrid Approaches: Combine machine learning and lexicon-based methods
Common Sentiment Lexicons
- Harvard General Inquirer
- LIWC (Linguistic Inquiry and Word Count)
- MPQA Subjectivity Lexicon
- SentiWordNet
- VADER (Valence Aware Dictionary and sEntiment Reasoner)
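The core of a lexicon-based approach can be sketched in a few lines: sum per-word valence scores from a dictionary. The tiny lexicon below is hand-made for illustration; real lexicons such as VADER are far larger and also handle negation, intensifiers, punctuation, and emoticons:

```python
# Hypothetical mini-lexicon: word -> valence score.
LEXICON = {"great": 3.0, "love": 2.5, "bummer": -2.0, "upset": -1.5, "mad": -2.0}

def lexicon_score(text):
    """Sum the valence of known words; unknown words contribute 0."""
    return sum(LEXICON.get(tok, 0.0) for tok in text.lower().split())

print(lexicon_score("that's a bummer"))         # -2.0
print(lexicon_score("i love this great film"))  # 5.5
```

A positive total suggests positive sentiment, a negative total the opposite; the resulting score could then be discretised into the levels described earlier.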
ML-Based Sentiment Classifier Architecture
- Data collection (e.g., tweets via API)
- Text cleaning and preprocessing
- Feature extraction (Bag of Words, TF-IDF)
- Data split: training (80%), testing (20%)
- Model training (Naïve Bayes, Logistic Regression, Random Forest, SVM)
- Model evaluation: precision, recall, F1-score
- Deployment for prediction
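The steps above can be sketched end-to-end on a toy dataset. A crude count-based classifier stands in for Naïve Bayes or SVM, and all data and thresholds are illustrative assumptions:

```python
import random
from collections import Counter

# Toy labelled data: (text, label) with 1 = positive, 0 = negative.
data = [("great fun loved it", 1), ("awful boring mess", 0),
        ("loved every minute", 1), ("boring and awful", 0),
        ("great acting", 1), ("what a mess", 0),
        ("fun and great", 1), ("awful ending", 0),
        ("loved it", 1), ("boring mess", 0)]

def features(text):
    return Counter(text.lower().split())  # bag-of-words features

random.seed(0)                 # fixed seed so the split is reproducible
random.shuffle(data)
split = int(0.8 * len(data))   # 80% training, 20% testing
train, test = data[:split], data[split:]

# "Training": accumulate word counts per class.
class_counts = {0: Counter(), 1: Counter()}
for text, label in train:
    class_counts[label] += features(text)

def predict(text):
    scores = {c: sum(class_counts[c][w] for w in features(text)) for c in (0, 1)}
    return max(scores, key=scores.get)

accuracy = sum(predict(t) == y for t, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

In practice the feature extraction would use TF-IDF, the model would come from a library, and evaluation would report precision, recall, and F1 rather than plain accuracy.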
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF evaluates how relevant a word is to a document in a collection. It increases with the word's frequency in the document but decreases with its frequency across all documents, highlighting unique terms.
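The standard formulation multiplies the term frequency by the log of the inverse document frequency; a minimal sketch, assuming raw counts for tf and idf = log(N / df):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "purred"]]

def tfidf(term, doc, docs):
    """tf-idf of a term in one document of a collection."""
    tf = doc.count(term)                    # frequency within the document
    df = sum(term in d for d in docs)       # documents containing the term
    idf = math.log(len(docs) / df)          # rarer terms get a larger idf
    return tf * idf

print(round(tfidf("cat", docs[0], docs), 3))  # 0.405  (in 2 of 3 docs)
print(round(tfidf("the", docs[0], docs), 3))  # 0.0    (in every doc)
```

"the" scores zero because a word that appears in every document carries no discriminating power, which is exactly the behaviour the definition above describes.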
Semi-Supervised Learning in Sentiment Analysis
- Combines labelled and unlabelled data for training
- Wrapper-based and self-training methods iteratively label and retrain on new data
- Topic-based SSL (Semi-Supervised Learning) clusters tweets by topic for specialised sentiment models
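The self-training loop can be sketched with a trivial count-based scorer standing in for a real classifier (the margin threshold and toy data are assumptions for illustration):

```python
from collections import Counter

def train(docs):
    """docs: list of (tokens, label); returns per-class word counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in docs:
        counts[label].update(tokens)
    return counts

def predict(model, tokens):
    scores = {c: sum(model[c][t] for t in tokens) for c in model}
    pos, neg = scores["pos"], scores["neg"]
    label = "pos" if pos >= neg else "neg"
    return label, abs(pos - neg)   # margin as a crude confidence measure

labelled = [(["great", "fun"], "pos"), (["awful", "boring"], "neg")]
unlabelled = [["great", "movie"], ["boring", "movie"], ["some", "movie"]]

model = train(labelled)
for tokens in unlabelled:
    label, margin = predict(model, tokens)
    if margin >= 1:                # keep only confident pseudo-labels
        labelled.append((tokens, label))
model = train(labelled)            # retrain on the enlarged set

print(len(labelled))  # 4: two confident pseudo-labels were added
```

The ambiguous tweet (`["some", "movie"]`) stays unlabelled because its margin is zero; in practice the label/retrain cycle repeats until no confident examples remain.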
Evaluating Text Classifiers
- Accuracy: Percentage of correct predictions (not reliable with unbalanced classes)
- Precision: Proportion of predicted positives that are true positives
- Recall: Proportion of actual positives correctly identified
- F1-score: Harmonic mean of precision and recall
Precision and recall are preferred over accuracy when dealing with rare or unbalanced classes.
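These metrics follow directly from the counts of true positives (tp), false positives (fp), and false negatives (fn); a minimal sketch, treating label 1 as the positive class:

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = prf1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

Note that a classifier predicting the majority class on a 95/5 split scores 95% accuracy yet 0% recall on the rare class, which is why precision and recall are preferred for unbalanced data.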
Twitter/X Post Sentiment Analysis
A sample of Twitter/X posts:
| sentiment | id | date | query | user | text |
| --- | --- | --- | --- | --- | --- |
| 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |
Checking Sentiments: