The speed at which information is posted online makes it difficult for analysts to identify and distill important trends. This is true both for companies managing potentially harmful stories and for government analysts monitoring emerging regional developments. To help analysts on the Novetta Mission Analytics (NMA) team address this challenge, we conducted a novel analysis of open source and cloud-based Named Entity Recognition (NER) tools. NMA collects and enriches open source content and provides analysts with a customizable UI for analyzing trends across topics, regions, and news sources. To support these users, we built a network visualization tool that lets analysts interpret the entities and relationships in a corpus of news articles or other text datasets.
In Phase 1, we evaluated five open source NER libraries and Amazon Comprehend (an AWS service). Tests were conducted using two corpora: the WikiGold corpus and an operational NMA dataset. Together these corpora contain approximately 50,000 tokens, sufficient to generate baseline evaluation metrics. To handle the libraries' varied input and output formats, we built a Python pipeline that streamlines the evaluation process. To manage dependencies, we created a Dockerfile that allows any other data scientist to replicate the analysis.
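Because each library reports entities with its own label scheme and output shape, a pipeline like ours needs a normalization step before libraries can be compared. The sketch below illustrates the idea; the `Entity` record, `LABEL_MAP` entries, and `normalize` helper are our own illustrative names, not the actual pipeline's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    text: str   # surface form, e.g. "Donald Trump"
    label: str  # unified label: PERSON, LOCATION, or ORGANIZATION
    start: int  # character offset in the source document
    end: int

# Different libraries use different tag names; map them to one scheme.
LABEL_MAP = {
    "PER": "PERSON", "PERSON": "PERSON",
    "LOC": "LOCATION", "GPE": "LOCATION", "LOCATION": "LOCATION",
    "ORG": "ORGANIZATION", "ORGANIZATION": "ORGANIZATION",
}

def normalize(raw_spans):
    """Convert a library's raw (text, label, start, end) tuples into
    Entity records with unified labels; spans with unmapped labels
    (e.g. MISC) are dropped from the comparison."""
    out = []
    for text, label, start, end in raw_spans:
        mapped = LABEL_MAP.get(label.upper())
        if mapped:
            out.append(Entity(text, mapped, start, end))
    return out
```

With every library's output funneled through one adapter like this, the same scoring code can run unchanged across all six tools.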
We assessed the models' performance on their ability to correctly identify people, locations, and organizations. The most effective libraries, as measured by F1 score, correctly identified these entity types without falsely labeling non-entity tokens. Based on this evaluation, we determined that Amazon Comprehend performed robustly against the test corpora, while the top open source solutions for the NER task were AllenNLP ELMo, the Stanford NER Tagger, and NeuroNER.
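F1 balances exactly the two failure modes described above: missing real entities (recall) and falsely tagging non-entity tokens (precision). A minimal token-level version of the metric, assuming predictions and gold annotations are represented as sets of (token index, label) pairs, looks like this:

```python
def prf1(predicted, gold):
    """Token-level precision, recall, and F1 for an NER tagger.
    `predicted` and `gold` are sets of (token_index, label) pairs."""
    tp = len(predicted & gold)   # correctly labeled tokens
    fp = len(predicted - gold)   # falsely identified tokens
    fn = len(gold - predicted)   # missed entity tokens
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Computing this separately for PERSON, LOCATION, and ORGANIZATION shows where each library is strong or weak, rather than hiding differences in a single aggregate number.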
Each entity detected in the text is represented by a node colored according to its category: person, location, or organization. The size of each node is proportional to its degree, that is, its number of connections to other nodes. The weight of an edge reflects how frequently the two entities it connects appear within a set distance of each other in the text.
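The co-occurrence logic behind those edge weights and node sizes can be sketched with the standard library alone. The function name, the `window` parameter, and the input format are illustrative assumptions, not the tool's actual interface:

```python
from collections import Counter
from itertools import combinations

def build_network(mentions, window=40):
    """Build a weighted co-occurrence network from entity mentions.
    `mentions` is a list of (entity_name, char_offset) pairs; two
    distinct entities within `window` characters of each other share
    an edge, and repeated co-occurrences increase the edge weight."""
    edges = Counter()
    for (a, pos_a), (b, pos_b) in combinations(mentions, 2):
        if a != b and abs(pos_a - pos_b) <= window:
            edges[tuple(sorted((a, b)))] += 1
    # Node size is driven by degree: the number of distinct neighbors.
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return edges, degree
```

The resulting edge and degree tables map directly onto the rendering rules above: heavier edges become thicker lines, higher-degree nodes become larger circles.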
In some cases NER tools extract the same entity in slightly different forms, such as "Donald Trump," "President Trump," and "Donald J. Trump." Leaving these unmerged clutters the network and degrades the analyst experience. We applied a fuzzy matching library, dedupe, to resolve and combine such instances.
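The dedupe library learns a statistical matching model; the greatly simplified stand-in below conveys the shape of the problem using token containment after stripping honorifics and initials. All names here (`tokens`, `similar`, `canonicalize`, the `HONORIFICS` set) are our own, not part of dedupe's API:

```python
HONORIFICS = {"president", "mr", "mr.", "dr", "dr."}

def tokens(name):
    """Lowercased content tokens, dropping honorifics and initials."""
    return {
        t for t in name.lower().split()
        if t not in HONORIFICS and len(t.rstrip(".")) > 1
    }

def similar(a, b):
    """Treat two names as aliases when one's token set contains the other's."""
    ta, tb = tokens(a), tokens(b)
    return bool(ta and tb) and (ta <= tb or tb <= ta)

def canonicalize(names):
    """Greedy alias resolution: map each mention to the first earlier
    mention it matches, so all aliases collapse onto one node."""
    canonical, seen = {}, []
    for name in names:
        match = next((s for s in seen if similar(name, s)), None)
        canonical[name] = match or name
        if match is None:
            seen.append(name)
    return canonical
```

A rule this naive would over-merge in practice (any shared surname collapses), which is precisely why a trained matcher such as dedupe is worth the extra setup.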
The Entity Network Visualization Tool has proved to be a useful, functional part of the NMA analyst toolkit. From an analytical standpoint it facilitates insights through visual representation of text. Key entities stand out based on their importance and prominence in the article. Important relationships come into focus based on node size and edge thickness. Another dimension reflected in our visualization is the nature of communities that entities form. This speaks to the subgroups and interrelations within a larger network.