Generating Network Insights Using Named Entity Recognition

Thursday, May 23
2:00 PM - 2:40 PM
Magnolia

The speed with which information is posted online makes it difficult for analysts to identify and distill important trends. This is true for companies managing potentially harmful stories and for government analysts monitoring emergent regional developments. To help analysts on the Novetta Mission Analytics (NMA) team address this challenge, we conducted a novel analysis of open source and cloud-based Named Entity Recognition (NER) tools. NMA collects and enriches open source content and provides the analyst a customizable UI used to analyze trends across topics, regions, and news sources. To support their users, we generated a network visualization tool that allowed analysts to interpret entities and relationships in a corpus of news articles or other text datasets.

In Phase 1, we analyzed five open source NER libraries and Amazon Comprehend (an AWS service). Tests were conducted using two corpora: the WikiGold Corpus and an operational NMA dataset. Together, these corpora totaled approximately 50,000 tokens, sufficient to generate baseline evaluation metrics. To handle the libraries’ varied inputs and outputs, we configured a Python pipeline to streamline the evaluation process. To address dependencies, we created a Dockerfile that allows other data scientists to replicate this analysis.
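One way to reconcile varied tool outputs is to map every tool's tags onto a single shared label set before scoring. The sketch below is illustrative only: the `LABEL_MAP` entries and the function names are assumptions, not the mapping or code used in the NMA pipeline.

```python
# Hypothetical label map: tag names vary by tool, so these entries are
# illustrative assumptions rather than the mapping used in the project.
LABEL_MAP = {
    "PER": "PER", "PERSON": "PER",
    "LOC": "LOC", "LOCATION": "LOC", "GPE": "LOC",
    "ORG": "ORG", "ORGANIZATION": "ORG",
}


def normalize_label(raw_label):
    """Map a tool-specific tag to PER, LOC, ORG, or O (non-entity)."""
    core = raw_label.upper()
    # Strip a positional prefix such as "B-" or "I-" before the lookup.
    if len(core) > 2 and core[1] == "-" and core[0] in "BILUES":
        core = core[2:]
    # Anything outside the three evaluated types is treated as non-entity.
    return LABEL_MAP.get(core, "O")


def normalize_output(tagged_tokens):
    """Normalize a list of (token, raw_label) pairs from any one tool."""
    return [(token, normalize_label(label)) for token, label in tagged_tokens]
```

With every tool reduced to the same `(token, label)` shape, a single scoring routine can be applied across all of them.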

We assessed each model on its ability to correctly identify people, locations, and organizations. The most effective libraries, as measured by F1 score, correctly identified these entity types without falsely tagging non-entity tokens. Based on this evaluation, we determined that Amazon Comprehend performed robustly against the test corpora, while the top open source solutions for the NER task were AllenNLP ELMo, Stanford NER Tagger, and NeuroNER.
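For reference, F1 is the harmonic mean of precision P and recall R: F1 = 2PR / (P + R). The sketch below computes a micro-averaged, token-level score over the three entity types. Token-level scoring and the function name `token_prf1` are assumptions made for illustration; the abstract does not say whether the evaluation was scored per token or per entity span.

```python
def token_prf1(gold, predicted, entity_types=("PER", "LOC", "ORG")):
    """Return (precision, recall, F1) over aligned token label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        g_is_entity = g in entity_types
        p_is_entity = p in entity_types
        if p_is_entity and p == g:
            tp += 1  # correct entity label
        else:
            if p_is_entity:
                fp += 1  # entity predicted where gold disagrees
            if g_is_entity:
                fn += 1  # gold entity missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Tagging a non-entity token as an entity lowers precision, and missing a true entity lowers recall, so a high F1 requires doing well on both.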

In Phase 2, we developed a web-based Entity Network Visualization Tool capable of (1) processing raw text or a news article URL and (2) displaying a dynamic network graph visualization of the relationships between entities in the text. Front-end development was based on Flask, a micro web framework written in Python. We used Cytoscape (a JavaScript library) to render the graph displays.
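As a rough illustration of the hand-off between the Python side and the graph renderer, the sketch below converts entities and weighted edges into the flat `elements` list that Cytoscape.js accepts, where each element carries a `data` object and edges add `source` and `target`. The function name and the `label`, `type`, and `weight` data fields are hypothetical choices, not the tool's actual schema.

```python
import json


def to_cytoscape_elements(entities, edges):
    """Build a Cytoscape.js-style elements list.

    entities: dict mapping entity name -> category (e.g. "PER")
    edges: dict mapping (name_a, name_b) -> co-occurrence weight
    """
    elements = []
    for name, category in sorted(entities.items()):
        # Node: "id" is required; "label" and "type" are illustrative extras.
        elements.append({"data": {"id": name, "label": name, "type": category}})
    for (source, target), weight in sorted(edges.items()):
        elements.append({
            "data": {
                "id": f"{source}|{target}",
                "source": source,
                "target": target,
                "weight": weight,  # hypothetical field for edge thickness
            }
        })
    return elements


if __name__ == "__main__":
    demo = to_cytoscape_elements(
        {"Alice": "PER", "Acme": "ORG"}, {("Acme", "Alice"): 3}
    )
    print(json.dumps(demo, indent=2))
```

A web route could return this list as JSON for the browser to pass to the graph library, with styling rules reading the extra data fields.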

Each entity detected in the text is represented by a node colored according to its category: person, location, or organization. The size of each node is proportional to its degree, that is, its number of connections to other nodes. The weight of an edge is determined by how often the two entities appear within a set distance of each other in the text.
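The node and edge rules described here can be sketched in a few lines. This is a minimal illustration that assumes distance is measured in tokens and uses a hypothetical default window of 10; the abstract does not state the actual distance unit or threshold.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(mentions, window=10):
    """Count how often two distinct entities appear within `window` tokens.

    mentions: list of (token_index, entity_name) pairs.
    Returns a Counter keyed by an alphabetically sorted (name_a, name_b) pair.
    """
    weights = Counter()
    for (i, a), (j, b) in combinations(mentions, 2):
        if a != b and abs(i - j) <= window:
            weights[tuple(sorted((a, b)))] += 1
    return weights


def node_degrees(weights):
    """Degree of each node: the number of distinct neighbors it connects to."""
    degrees = Counter()
    for a, b in weights:
        degrees[a] += 1
        degrees[b] += 1
    return degrees
```

For example, with mentions `[(0, "Alice"), (3, "Acme"), (5, "Boston"), (40, "Alice"), (42, "Acme")]`, the Alice and Acme pair falls inside the window twice, so that edge receives weight 2, while each node ends up with degree 2.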

In some cases, NER tools extract the same entity in slightly different forms, such as Donald Trump, President Trump, and Donald J. Trump. Left unresolved, these variants appear as separate nodes and degrade the analyst experience. We used dedupe, a fuzzy matching library, to resolve and combine such instances.
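To make the idea concrete, the sketch below merges name variants with a simple token-subset heuristic written against the standard library. It is a stand-in for illustration and does not use or reproduce the dedupe library's API; the `TITLES` set and the function names are assumptions.

```python
# Illustrative list of titles to ignore; a real system would need a fuller one.
TITLES = {"president", "mr", "mrs", "ms", "dr", "sen", "senator"}


def name_key(name):
    """Reduce a name to its significant lowercase tokens."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    # Drop titles and single-letter initials such as "J."
    return frozenset(t for t in tokens if len(t) > 1 and t not in TITLES)


def merge_variants(names):
    """Map each name to a canonical form.

    A name joins an existing group when its token set is a subset of that
    group's token set. Names with the most tokens are processed first, so
    the canonical form is the first fullest variant seen.
    """
    groups = []  # list of (token_set, [member names])
    for name in sorted(names, key=lambda n: len(name_key(n)), reverse=True):
        key = name_key(name)
        for group_key, members in groups:
            if key and key <= group_key:
                members.append(name)
                break
        else:
            groups.append((key, [name]))
    return {name: members[0] for _, members in groups for name in members}
```

Applied to `["Donald Trump", "President Trump", "Donald J. Trump", "Angela Merkel"]`, the three Trump variants map to "Donald Trump" and "Angela Merkel" maps to itself. The heuristic is deliberately crude: a bare surname would attach to the first matching group even when several people share it, which is one reason a trained matching library is the better fit in practice.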

The Entity Network Visualization Tool has proved to be a useful, functional part of the NMA analyst toolkit. From an analytical standpoint it facilitates insights through visual representation of text. Key entities stand out based on their importance and prominence in the article. Important relationships come into focus based on node size and edge thickness. Another dimension reflected in our visualization is the nature of communities that entities form. This speaks to the subgroups and interrelations within a larger network.

Speaker

Akshitha Ramachandran
Software Engineer
Novetta Solutions
Akshitha Ramachandran is a junior at Harvard University pursuing a joint degree in Computer Science and Statistics. She was a founding member and Lead Engineer at Harvard Student Agencies - DEV, a start-up focused on developing mobile and web applications for third-party clients. She is both a senior developer and board member at ProMazo, a campus organization partnering top students from leading universities with projects at leading companies ranging from Unilever to Whirlpool. She is the Director of the Harvard College Consulting Group, where she is responsible for the acquisition, organization, and execution of eighteen client projects along with managing organization operations. She directly manages 13 board members who collectively oversee more than 120 members of the group. Additionally, Akshitha attends hackathons and has been on the board of Harvard’s Women Engineers Code (WECode) conference. This past summer, she spent time at Novetta expanding their Machine Learning practices, specifically in the Named Entity Resolution space. She has contributed to the company’s internal pipeline, designed a demo for them, and published some of her work (https://www.novetta.com/2018/09/named-entity-recognition-and-graph-visualization/ and https://www.novetta.com/2018/08/evaluating-solutions-for-named-entity-recognition/).