Much of the current writing on text analytics emphasizes its importance, its benefits, and the dangers of not doing it, all at a quite general level. Often, when I teach search applications such as Solr, Elasticsearch, and Lucene, I tell the students the following: databases return all information matching your query, but search systems belong to the class of IR (Information Retrieval) tools, and they return relevant information first. This is still selecting from your own information, just doing it better.
Then I mention that text analytics returns non-obvious information, enhancing your data in some way. At this point, I may notice that the students feel that "this is too much," and that the concept may be beyond their reach.
In this blog post, I want to simplify and explain practical text analytics. I want to show a few examples, and in each one I want to show the practical rationale, the technology used, and the benefits resulting from text analytics. At the end I will summarize this in a few short rules.
Some use cases are my own work, while others come from my work in the industry, or from research and interviews with the people who are doing this work.
First I want to mention that there are two approaches to text analytics: grammar-based natural language processing (NLP), and machine learning based methods. Both approaches have their pros and cons, and both can lead to very useful results. Let us start with the simpler ones, and continue to more modern and more “machine learning” flavors.
Law Enforcement: Crime Fighting with Privacy Protection for the Royal Canadian Mounted Police
Business use case
Memex is a DARPA program to arm global law enforcement with Big Data technology to fight labor trafficking, sex trafficking, and other crimes.
The Royal Canadian Mounted Police (RCMP) asked the Memex researchers to come up with optimal recommendations on the deployment of additional law enforcement resources, without violating the privacy of its citizens. We should note that US companies are only now coming to grips with the idea of protecting citizens' privacy, while in Canada these protections are well developed.
For our RCMP use case this means the following: the traffic-violation and crime-statistics information needed by the Memex scientists cannot be given to them directly, because of privacy regulations. The Memex researchers came up with a solution: they would develop software to find and remove personally identifiable information (PII). The software would be sent to the RCMP to be run inside the RCMP office by their personnel, and the anonymized results would be delivered to Memex.
However, in order to preserve the information for future reference, the software produces two data sets:
- RCMP citations with private information removed
- The KEY data set, mapping the person/citation combination to a unique key.
The KEY data set stays under the RCMP control, assuring both privacy and the ability to go back to the original record should the need arise.
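The two-data-set split described above can be sketched in a few lines of Python. The field names here are hypothetical (the real RCMP schema is not public); the point is only to show the mechanics of separating the anonymized records from the KEY data set.

```python
import uuid

def anonymize_citations(citations):
    """Split citations into an anonymized data set and a KEY data set.

    `citations` is a list of dicts with hypothetical fields; only the
    KEY data set retains the link from a unique key back to the person.
    """
    anonymized, key_dataset = [], {}
    for citation in citations:
        # A random key avoids leaking anything derivable from the PII.
        key = uuid.uuid4().hex
        key_dataset[key] = {
            "person": citation["person"],
            "citation_id": citation["citation_id"],
        }
        # The public record keeps everything except the person's name.
        record = {k: v for k, v in citation.items() if k != "person"}
        record["key"] = key
        anonymized.append(record)
    return anonymized, key_dataset

tickets = [
    {"citation_id": "C-1001", "person": "Jane Doe", "offence": "speeding"},
    {"citation_id": "C-1002", "person": "John Roe", "offence": "parking"},
]
public, keys = anonymize_citations(tickets)
```

The KEY data set (`keys`) would stay inside the RCMP, while `public` could be shared; joining them on the key recovers the original record should the need arise.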
Technology use case
The RCMP originally proposed the following steps:
(a) Memex software extracts the citation information from the PDF-based citations, and (b) RCMP clerks anonymize the information and assign unique keys. However, it was quickly realized that step (b) would take an inordinate amount of human effort. Therefore, the Memex researchers suggested handling both steps with NLP. Here we come to our first example of text analytics: classical NLP based on text extraction rules (regular expressions) and grammar (using tools such as GATE). This approach works best in situations characterized by:
- Relatively small amounts of data
- Small document size
- Documents are well structured, mostly standardized
These assumptions all hold for the RCMP citations. To give the reader an idea of why the technology works so well, imagine a typical computer-printed traffic ticket. At the top left or top right corner you will find the ticket number. Next to it will be the date, then the person's name, the description of the incident, and so on.
If you have software that can identify text elements using simple patterns, like "first entry after the date," then you can extract the basic text fields. This is a good start. Going further, NLP software can find entities such as officer names, dates, court names, person names, and place names. Software of this type is built on rules of the form "go through all entries that look like a place name and find matches against a court from the given list." Such software often relies on vocabularies called "gazetteers."
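A minimal sketch of this rule-plus-gazetteer style of extraction is shown below. The ticket layout, field patterns, and court list are all invented for illustration; production systems such as GATE use far larger curated gazetteers and richer grammars.

```python
import re

# Hypothetical gazetteer of known court names. Real systems ship
# large curated lists like this, often with spelling variants.
COURT_GAZETTEER = {"Ottawa Provincial Court", "Toronto Municipal Court"}

# Simple regular-expression rules keyed on the labels a printed
# citation typically carries next to each field.
TICKET_PATTERNS = {
    "ticket_no": re.compile(r"Ticket\s*No[.:]\s*(\S+)"),
    "date": re.compile(r"Date[.:]\s*(\d{4}-\d{2}-\d{2})"),
    "name": re.compile(r"Name[.:]\s*([A-Z][a-z]+ [A-Z][a-z]+)"),
}

def extract_fields(text):
    """Pull basic fields out of a semi-structured citation with regexes."""
    fields = {}
    for field, pattern in TICKET_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[field] = match.group(1)
    # Gazetteer lookup: flag any known court name appearing in the text.
    fields["courts"] = [c for c in COURT_GAZETTEER if c in text]
    return fields

sample = """Ticket No: ON-4417
Date: 2016-03-02
Name: Jane Doe
Appear at Ottawa Provincial Court."""
fields = extract_fields(sample)
```

Because the documents are small and standardized, a handful of patterns like these can cover nearly every field, which is exactly why the rule-based approach works so well here.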
The Memex program is organized around open-sourcing most of its contributions on GitHub. An example of the RCMP text extraction code can be found in the RCMP Memex project (https://github.com/TeamHG-Memex/RCMPmemex). We used Aspose software to extract text from the PDF documents, because it was the only software that could extract correctly formatted text, and this was very important for field identification. I mention it because it was an exception to our usual use of open source software, but sometimes there is no alternative. We got a discount, though, because we wrote about Aspose in our open source blog post. An example of GATE-based text analytics is found in the MemexGate project (https://github.com/memex-explorer/memex-gate).
Both the rules and the gazetteers have to be actively maintained by the people responsible for the software. However, once they are tuned, they work very accurately. The project is actively deployed.
Patent law analytics
Business use case
Lex Machina (https://lexmachina.com/) made its strong business use case by being acquired by Lexis Nexis in November 2015. Although the exact terms of the acquisition were undisclosed, estimates put it (https://bol.bna.com/access-to-justice-problem-fueled-lexisnexis-lex-machina-deal/) at around $30 - $35 million, with annual revenues of between $5 and $8 million.
So what was in the shopping cart when LexisNexis clicked "Buy"?
Imagine you could read all of the patents and patent litigation cases. You could then classify each person and entity as a participant, plaintiff, or defendant, and tally wins, losses, and so on. You would get a good picture of which patents bring in more money, which lawyers are more successful with certain types of patent litigation, and so on.
Obviously, this is very useful. That is exactly what Lex Machina does for you. You get dozens of useful displays like the screenshot below, slicing and dicing this data in every way possible. Source: https://lexmachina.com/lex-machina-adds-bigger-data-custom-insights-raising-analytics-bar-ip-litigators/
There are many practical business considerations when planning this analytics thrust. All court records are publicly available on a website called PACER (https://www.pacer.gov/). The caveat is that it costs 10 cents per page for every document. So all court proceedings since the year 2000 would cost you about $10 million, and there are no discounts.
Furthermore, although the analytics data is nice, it comes at a price: various services and custom reports run into the thousands of dollars (https://lexmachina.com/services/). But we will leave the business side to LexisNexis. What is important is that this is a classic business use case for text analytics: knowledge is power. Knowing which lawyer has the best chance of winning on your behalf in your specific situation may well be worth orders of magnitude more than what the report costs.
In the book, "The Zigzag Kid" by David Grossman, the policeman father teaches his son that "Knowledge is powerrrr!!", pounding with his fist for emphasis. The son continues, "I could never be sure which was more important to him, knowledge or power."
Having put this into proper perspective, let us now look at the technology side.
Technology use case
So how was this feat achieved? What search analytics technology is inside Lex Machina? The reader will recognize the same questions the Lex Machina engineers had to ask themselves as in the Royal Canadian Mounted Police use case above. What are the possible fields? How can we extract them?
The target patent applications are fairly well structured and relatively few in number (a few million), so Lex Machina must have used this kind of field-based analytics first. They then created a number of visually appealing analytical charts and displays.
It is essential to point out that for search analytics to be useful, it has to be combined with various displays and textual disambiguation, allowing users to see actionable results. For effective adoption, you will likely need to create customized visualizations and analysis actions, as Lex Machina did.
There is more than meets the eye in the Lex Machina displays. A patent search reveals that they have two unissued, but published, patents. The first one (https://www.google.com/patents/US20140180934) is based on document clustering, which is a part of machine learning. It makes it possible to calculate patent similarity from the words used in the patent description and from other factors. This is truly extracting hidden, non-obvious information from all the patents at once, a task at which computers excel.
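To illustrate the kind of word-based similarity that such clustering builds on, here is a minimal TF-IDF plus cosine-similarity sketch in plain Python. This is not Lex Machina's actual method, just the standard textbook building block; the toy "patent" texts are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Terms that appear everywhere get weight 0 (log(n/n) = 0).
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

patents = [
    "wireless antenna signal transmission method".split(),
    "antenna array for wireless signal reception".split(),
    "pharmaceutical compound for blood pressure".split(),
]
vecs = tfidf_vectors(patents)
```

With vectors like these, the two wireless patents score much closer to each other than either does to the pharmaceutical one, and a clustering algorithm can group the collection accordingly.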
The second patent (https://www.google.com/patents/US20140279583) goes beyond that. Its purpose is to identify the Patent Monetizing Entities (PME), or patent trolls in simple terms. It does this using features extracted from various sources, such as the entities' litigation behavior, the patents they asserted, and their presence on the web. Lex Machina claims that the classifier can correctly separate PMEs from operating companies with a reasonable degree of accuracy.
Big Data and Text Analytics for Oil and Gas Operations
Business use case
The oil and gas industry has been dealing with large data sizes for a long time. For example, a seismic data set can exceed 10 TB. This may explain why people from the oil patch were not impressed with the new Big Data tools and their data sizes of terabytes and petabytes.
Until now, that is. Kathy Ball, general manager for Devon Energy’s advanced analytics team, saw it coming early. According to her, “The amount of data generated by oil and gas operations is starting to explode as real-time information from sensors is being collected at a rate of 4 milliseconds. The speed of this data gathering is pushing the size of oil and gas Big Data into the exabyte range and larger as companies handle more data sets.” (http://www.rigzone.com/news/oil_gas/a/140631/Big_Data_Internet_of_Things_Transforming_Oil_and_Gas_Operations)
So Devon may be one of the first oil and gas companies using Big Data to great advantage. But oil people are practical, so what is the first thing they will do that will have an impact on profitability? Here it is: reduce nonproductive time (NPT).
Using Hadoop and SAS, as well as text and numerical data, Devon developed an algorithm to determine the main factors behind NPT. It turned out that NPT was occurring due to generator issues. By taking all the comments and finding “bad words,” such as “motor failure,” combined with sensor data, Devon pinpointed the issue and concluded that having the electrical skill set on site could reduce NPT by 30 percent.
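A toy version of the "bad words" step might look like the following. The phrase list and log format are invented for illustration; Devon's actual vocabulary and data pipeline are not public.

```python
# Hypothetical "bad phrase" list for flagging failure-related comments.
BAD_PHRASES = ("motor failure", "generator fault", "power loss")

def flag_npt_causes(maintenance_logs):
    """Count failure-phrase occurrences in free-text rig comments.

    `maintenance_logs` is a list of (rig_id, comment) pairs; returns a
    per-rig count of comments mentioning a known failure phrase.
    """
    counts = {}
    for rig_id, comment in maintenance_logs:
        text = comment.lower()
        if any(phrase in text for phrase in BAD_PHRASES):
            counts[rig_id] = counts.get(rig_id, 0) + 1
    return counts

logs = [
    ("rig-7", "Motor failure during tripping; crew waiting on parts"),
    ("rig-7", "Routine inspection, no issues"),
    ("rig-9", "Generator fault tripped the top drive"),
]
flags = flag_npt_causes(logs)
```

In practice these text flags would then be joined against sensor readings for the same rigs and time windows, which is where combining text and numerical data pays off.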
This and similar use cases justify the popular, if clichéd, adage that “data is the new oil.”
Technology use case
How can such a feat be achieved? Devon used the following components:
- Hadoop to store the massive volumes of data
- SAS for machine learning based on word occurrence statistics
SAS is well known to professional data scientists; open source alternatives would be Spark and its MLlib machine learning library, which contains a number of text-analytics tools.
Devon is expanding its Big Data and analytics use. Ball’s team has found that drilling, completion and production are all asking the same questions. “Getting people to talk – and allowing people access to data they didn’t have before – has been amazing. Instead of different versions of the truth, there’s one version and one system that everyone goes by.”
How to start using text analytics
1. Get familiar with the basic tools for text analytics: Python NLTK, GATE, Spark MLlib.
2. Start using these tools to solve practical problems, and get used to thinking in text analytics categories.
3. Combine multiple starter applications into a real-world system.
4. Watch the developments in the area, stay in touch with researchers and open source committers, and start an open source project yourself.