Information Extraction, Named-Entity Recognition, and Part-of-Speech Tagging
What's on this page?
- Information Extraction (IE)
- Named-entity recognition (NER)
- Part-of-speech tagging (POS)
- Sentiment analysis: goals, applications, methods, and evaluation
- Lexicon-based approaches for sentiment analysis
- Evaluating sentiment classification using Twitter/X posts
Defining Information Extraction
Information extraction (IE) systems retrieve, understand, and extract relevant pieces of text, such as phrases and named entities, from documents. They generate structured representations in formats like JSON, XML, or database tables, enabling further analysis and querying.
IE systems answer questions like: Who did what to whom? When? Where? They transform unstructured text into structured data for querying and further use.
IE System Applications
- Creating calendar events from emails
- Business intelligence: extracting insights from reports
- Bioinformatics: learning drug-gene interactions from research
- Retail: supporting marketing and inventory decisions
[Image Placeholder: IE System Applications]
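As a minimal illustration of the "calendar events from emails" use case, the sketch below pulls a title, date, and time out of free text with a regular expression and emits a structured record (a dict that could then be serialised to JSON or stored in a table). The pattern and function names are illustrative, not part of any real IE system:

```python
import re

# Hypothetical pattern: "<title> on YYYY-MM-DD at HH:MM"
EVENT_PATTERN = re.compile(
    r"(?P<title>.+?) on (?P<date>\d{4}-\d{2}-\d{2}) at (?P<time>\d{2}:\d{2})"
)

def extract_event(text):
    """Return a structured event record, or None if no event is found."""
    match = EVENT_PATTERN.search(text)
    return match.groupdict() if match else None

email = "Reminder: Project review on 2024-05-01 at 14:00 in room B."
print(extract_event(email))
# {'title': 'Reminder: Project review', 'date': '2024-05-01', 'time': '14:00'}
```

Real IE systems replace the hard-coded regex with trained extractors, but the output shape (structured fields answering who/what/when/where) is the same.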
Named-Entity Recognition (NER)
NER identifies and classifies proper names in text, such as people, locations, organisations, dates, times, and quantities. It is foundational for many language processing applications, including sentiment analysis.
NER Example
Excerpt:
"Funding for poor countries to cope with the impacts of the climate crisis will be a key focus at Cop26.
The UN secretary general, António Guterres, warned last year in an interview with the Guardian that the longstanding pledge by rich countries to provide $100bn (£70bn) a year to developing countries from 2020 was unlikely to be met… Along with the US, China, Russia and France, the UK is one of the five permanent members of the security council but has not chaired a session since John Major did so in 1992."
- People: António Guterres, John Major
- Dates: 2020, 1992
- Locations: US, China, Russia, France, UK
- Organisations: Cop26, UN, Guardian
- Quantities: $100bn (£70bn)
Why Recognise Named Entities?
- Tagging and indexing for search and linking
- Key for question-answering systems
- Supports sentiment analysis by linking opinions to entities
- Enables structured knowledge extraction for databases and knowledge graphs
NER Ambiguities
- Boundary ambiguity: "First Bank of Chicago" vs. "Bank of Chicago"
- Type ambiguity: "Washington" can refer to a person, city, organisation, or vehicle
| Type | Tag | Examples |
| --- | --- | --- |
| People | PER | Turing, John Major |
| Organisation | ORG | IPCC, Cop26 |
| Location | LOC | Mt. Sanitas, Chicago |
| Geo-political entity | GPE | Palo Alto, UK |
| Facility | FAC | Golden Gate Bridge |
| Vehicle | VEH | Ford Falcon |
How Does NER Work?
- NER uses sequence labelling models trained on annotated datasets
- Common encoding schemes: IO (Inside-Outside), IOB (Inside-Outside-Beginning)
- Features: current word, context words, part-of-speech tags, previous/next labels
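The IOB scheme above can be made concrete with a small helper that converts token-level entity spans into one tag per token: `B-<TYPE>` opens an entity, `I-<TYPE>` continues it, and `O` marks tokens outside any entity. This is a sketch of the encoding only, not a trained recogniser:

```python
def to_iob(tokens, spans):
    """Encode entity spans as IOB tags.

    spans: list of (start, end, entity_type) with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # continuation tokens
    return tags

tokens = ["António", "Guterres", "warned", "the", "UK", "."]
spans = [(0, 2, "PER"), (4, 5, "GPE")]
print(list(zip(tokens, to_iob(tokens, spans))))
# [('António', 'B-PER'), ('Guterres', 'I-PER'), ('warned', 'O'),
#  ('the', 'O'), ('UK', 'B-GPE'), ('.', 'O')]
```

The `B-` prefix is what resolves boundary ambiguity: two adjacent entities of the same type stay distinguishable because the second one starts with a fresh `B-` tag.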
NER Training and Testing Process
- Collect representative training texts
- Label each token for its entity class
- Design feature extractors
- Train a sequence classifier
Testing: Apply the trained model to new texts and output recognised entities.
Part-of-Speech (POS) Tagging
POS tagging determines the grammatical function of each token in a text, such as noun, verb, adjective, etc. The same word can have different tags depending on context.
Example: "back" can be a noun ("on my back"), adverb ("win the voters back"), adjective ("the back door"), or verb ("promised to back the bill").
Penn Treebank POS Tags (Selected)
| Tag | Description | Example |
| --- | --- | --- |
| NN | Noun, singular | llama |
| NNS | Noun, plural | llamas |
| VB | Verb, base form | eat |
| VBD | Verb, past tense | ate |
| JJ | Adjective | yellow |
| RB | Adverb | quickly |
| IN | Preposition | in, of |
| CC | Coordinating conjunction | and, but |
POS Tagging as Sequence Labelling
- Taggers use information about neighbouring tokens for accurate tagging
- Important for parsing, speech recognition, and word sense disambiguation
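The value of context can be shown with a deliberately tiny tagger: a lexicon gives each word a default tag, and a single hand-written rule uses the previous tag to disambiguate "back". The lexicon entries and the rule are illustrative assumptions, not a trained sequence model:

```python
# Default (most-frequent) tag per word -- a hypothetical toy lexicon.
LEXICON = {"the": "DT", "promised": "VBD", "to": "TO",
           "back": "NN", "bill": "NN", "door": "NN"}

def tag(tokens):
    tags = []
    for word in tokens:
        t = LEXICON.get(word.lower(), "NN")  # fall back to NN for unknowns
        # Context rule: after infinitival "to", "back" is a verb (VB).
        if word.lower() == "back" and tags and tags[-1] == "TO":
            t = "VB"
        tags.append(t)
    return tags

print(tag(["promised", "to", "back", "the", "bill"]))
# ['VBD', 'TO', 'VB', 'DT', 'NN']
print(tag(["the", "back", "door"]))
# ['DT', 'NN', 'NN']  -- the toy rule misses the adjective reading (JJ)
```

The second sentence shows the limit of one rule: real taggers learn many such contextual cues from annotated corpora instead of hand-writing them.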
[Image Placeholder: POS Tagging Example]
Sentiment Analysis
Sentiment analysis is a text classification task that extracts opinions and determines the positive or negative orientation of a writer toward a subject, such as a product or political event.
- Used in marketing, politics, recommendation systems, trust/reputation systems, and research
- Can be binary (positive/negative) or multi-level (e.g., very positive, positive, neutral, negative, very negative)
Sentiment Levels Example
- [-5, -3): very negative
- [-3, -1): negative
- [-1, 1]: neutral
- (1, 3]: positive
- (3, 5]: very positive
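Mapping a numeric score onto these levels is a simple chain of threshold checks. The sketch below uses half-open bins so every score falls in exactly one level, and assumes the middle band is labelled "neutral":

```python
def sentiment_level(score):
    """Map a score in [-5, 5] to a discrete sentiment level."""
    if score < -3:
        return "very negative"
    if score < -1:
        return "negative"
    if score <= 1:
        return "neutral"
    if score <= 3:
        return "positive"
    return "very positive"

print(sentiment_level(-4.2))  # very negative
print(sentiment_level(0.0))   # neutral
print(sentiment_level(2.5))   # positive
```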
Applications of Sentiment Analysis
- Marketers and businesses for customer feedback
- Political analysts for campaign tracking
- Recommendation engines
- Trust and reputation systems in e-commerce
- Researchers in psychology, finance, and social science
Bag of Words Model
The bag-of-words model represents text by the frequency of its words, ignoring their order. It is widely used for document classification and language modelling.
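In its simplest form a bag of words is just a word-frequency map, which Python's `collections.Counter` provides directly (tokenisation here is naive whitespace splitting, for illustration only):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as word frequencies, discarding word order."""
    return Counter(text.lower().split())

doc = "the movie was great the acting was great"
print(bag_of_words(doc))
# Counter({'the': 2, 'was': 2, 'great': 2, 'movie': 1, 'acting': 1})
```

Note that "great the acting movie was the was great" would produce an identical representation: the order really is gone.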
Sentiment Analysis Techniques
- Machine Learning Approaches: Supervised, semi-supervised, unsupervised
- Lexicon-Based Approaches: Use sentiment lexicons (dictionaries or corpus-based)
- Hybrid Approaches: Combine machine learning and lexicon-based methods
Common Sentiment Lexicons
- Harvard General Inquirer
- LIWC (Linguistic Inquiry and Word Count)
- MPQA Subjectivity Lexicon
- SentiWordNet
- VADER (Valence Aware Dictionary and sEntiment Reasoner)
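The core of a lexicon-based approach can be sketched in a few lines: sum per-word valence scores from a dictionary. The tiny lexicon below is hand-made for illustration; real lexicons such as VADER are far larger and also handle negation, intensifiers, punctuation, and emoticons:

```python
# Hypothetical mini-lexicon: word -> valence score.
LEXICON = {"great": 3.0, "love": 2.5, "bummer": -2.0, "upset": -1.5, "mad": -2.0}

def lexicon_score(text):
    """Sum the valence of known words; unknown words contribute 0."""
    return sum(LEXICON.get(tok, 0.0) for tok in text.lower().split())

print(lexicon_score("that's a bummer"))         # -2.0
print(lexicon_score("i love this great film"))  # 5.5
```

A positive total suggests positive sentiment, a negative total the opposite; the resulting score could then be discretised into the levels described earlier.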
ML-Based Sentiment Classifier Architecture
- Data collection (e.g., tweets via API)
- Text cleaning and preprocessing
- Feature extraction (Bag of Words, TF-IDF)
- Data split: training (80%), testing (20%)
- Model training (Naïve Bayes, Logistic Regression, Random Forest, SVM)
- Model evaluation: precision, recall, F1-score
- Deployment for prediction
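The steps above can be sketched end-to-end on a toy dataset. A crude count-based classifier stands in for Naïve Bayes or SVM, and all data and thresholds are illustrative assumptions:

```python
import random
from collections import Counter

# Toy labelled data: (text, label) with 1 = positive, 0 = negative.
data = [("great fun loved it", 1), ("awful boring mess", 0),
        ("loved every minute", 1), ("boring and awful", 0),
        ("great acting", 1), ("what a mess", 0),
        ("fun and great", 1), ("awful ending", 0),
        ("loved it", 1), ("boring mess", 0)]

def features(text):
    return Counter(text.lower().split())  # bag-of-words features

random.seed(0)                 # fixed seed so the split is reproducible
random.shuffle(data)
split = int(0.8 * len(data))   # 80% training, 20% testing
train, test = data[:split], data[split:]

# "Training": accumulate word counts per class.
class_counts = {0: Counter(), 1: Counter()}
for text, label in train:
    class_counts[label] += features(text)

def predict(text):
    scores = {c: sum(class_counts[c][w] for w in features(text)) for c in (0, 1)}
    return max(scores, key=scores.get)

accuracy = sum(predict(t) == y for t, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

In practice the feature extraction would use TF-IDF, the model would come from a library, and evaluation would report precision, recall, and F1 rather than plain accuracy.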
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF evaluates how relevant a word is to a document in a collection. It increases with the word's frequency in the document but decreases with its frequency across all documents, highlighting unique terms.
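The standard formulation multiplies the term frequency by the log of the inverse document frequency; a minimal sketch, assuming raw counts for tf and idf = log(N / df):

```python
import math

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "purred"]]

def tfidf(term, doc, docs):
    """tf-idf of a term in one document of a collection."""
    tf = doc.count(term)                    # frequency within the document
    df = sum(term in d for d in docs)       # documents containing the term
    idf = math.log(len(docs) / df)          # rarer terms get a larger idf
    return tf * idf

print(round(tfidf("cat", docs[0], docs), 3))  # 0.405  (in 2 of 3 docs)
print(round(tfidf("the", docs[0], docs), 3))  # 0.0    (in every doc)
```

"the" scores zero because a word that appears in every document carries no discriminating power, which is exactly the behaviour the definition above describes.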
Semi-Supervised Learning in Sentiment Analysis
- Combines labelled and unlabelled data for training
- Wrapper-based and self-training methods iteratively label and retrain on new data
- Topic-based SSL (Semi-Supervised Learning) clusters tweets by topic for specialised sentiment models
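The self-training loop can be sketched with a trivial count-based scorer standing in for a real classifier (the margin threshold and toy data are assumptions for illustration):

```python
from collections import Counter

def train(docs):
    """docs: list of (tokens, label); returns per-class word counts."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in docs:
        counts[label].update(tokens)
    return counts

def predict(model, tokens):
    scores = {c: sum(model[c][t] for t in tokens) for c in model}
    pos, neg = scores["pos"], scores["neg"]
    label = "pos" if pos >= neg else "neg"
    return label, abs(pos - neg)   # margin as a crude confidence measure

labelled = [(["great", "fun"], "pos"), (["awful", "boring"], "neg")]
unlabelled = [["great", "movie"], ["boring", "movie"], ["some", "movie"]]

model = train(labelled)
for tokens in unlabelled:
    label, margin = predict(model, tokens)
    if margin >= 1:                # keep only confident pseudo-labels
        labelled.append((tokens, label))
model = train(labelled)            # retrain on the enlarged set

print(len(labelled))  # 4: two confident pseudo-labels were added
```

The ambiguous tweet (`["some", "movie"]`) stays unlabelled because its margin is zero; in practice the label/retrain cycle repeats until no confident examples remain.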
Evaluating Text Classifiers
- Accuracy: Percentage of correct predictions (not reliable with unbalanced classes)
- Precision: Proportion of predicted positives that are true positives
- Recall: Proportion of actual positives correctly identified
- F1-score: Harmonic mean of precision and recall
Precision and recall are preferred over accuracy when dealing with rare or unbalanced classes.
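These metrics follow directly from the counts of true positives (tp), false positives (fp), and false negatives (fn); a minimal sketch, treating label 1 as the positive class:

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = prf1(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

Note that a classifier predicting the majority class on a 95/5 split scores 95% accuracy yet 0% recall on the rare class, which is why precision and recall are preferred for unbalanced data.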
Twitter/X Post Sentiment Analysis
A sample of Twitter/X posts:
| sentiment | id | date | query | user | text |
| --- | --- | --- | --- | --- | --- |
| 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D |
| 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! |
| 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds |
| 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. |
Checking Sentiments: