{"id":1790,"date":"2019-07-01T02:19:08","date_gmt":"2019-07-01T02:19:08","guid":{"rendered":"http:\/\/kusuaks7\/?p=1395"},"modified":"2023-07-28T07:50:07","modified_gmt":"2023-07-28T07:50:07","slug":"spark-nlp-getting-started-with-the-worlds-most-widely-used-nlp-library-in-the-enterprise","status":"publish","type":"post","link":"https:\/\/www.experfy.com\/blog\/ai-ml\/spark-nlp-getting-started-with-the-worlds-most-widely-used-nlp-library-in-the-enterprise\/","title":{"rendered":"Spark NLP: Getting Started With The World\u2019s Most Widely Used NLP Library In The Enterprise"},"content":{"rendered":"<h3 style=\"text-align: center;\"><\/h3>\n<h3><strong>AI Adoption in the Enterprise<\/strong><\/h3>\n<p>The annual O\u2019Reilly report on\u00a0<a href=\"https:\/\/www.oreilly.com\/data\/free\/ai-adoption-in-the-enterprise.csp\" target=\"_blank\" rel=\"noopener noreferrer\">AI Adoption in the Enterprise<\/a>\u00a0was released in February 2019. It is a survey of 1,300 practitioners in multiple industry verticals, which asked respondents about revenue-bearing AI projects their organizations have in production. It\u2019s a fantastic analysis of how AI is really used by companies today \u2013 and how that use is quickly expanding into deep learning, human-in-the-loop, knowledge graphs, and reinforcement learning.<\/p>\n<p>The survey asks respondents to list all the ML or AI frameworks and tools that they use. 
This is the summary of the answers:<\/p>\n<p style=\"text-align: center;\"><img fetchpriority=\"high\" decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig1-survey.jpg\" sizes=\"(max-width: 600px) 100vw, 600px\" srcset=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig1-survey.jpg 600w, https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig1-survey-300x191.jpg 300w\" alt=\"\" width=\"600\" height=\"381\" \/><\/p>\n<p>The 18-month-old Spark NLP library is the 7<sup>th<\/sup>\u00a0most popular across all AI frameworks and tools (note the \u201cother open source tools\u201d and \u201cother cloud services\u201d buckets). It is also by far the most widely used NLP library \u2013 twice as common as spaCy. In fact, it is the most popular AI library in this survey following scikit-learn, TensorFlow, Keras, and PyTorch.<\/p>\n<h3><strong>State-of-the-art Accuracy, Speed, and Scalability<\/strong><\/h3>\n<p>This survey is in line with the uptick in adoption we\u2019ve experienced in the past year, and the public case studies on using Spark NLP successfully in\u00a0<a href=\"https:\/\/www.oreilly.com\/library\/view\/strata-data-conference\/9781492025955\/video318957.html\" target=\"_blank\" rel=\"noopener noreferrer\">healthcare<\/a>,\u00a0<a href=\"https:\/\/conferences.oreilly.com\/strata\/strata-eu-2018\/public\/schedule\/detail\/68625\" target=\"_blank\" rel=\"noopener noreferrer\">finance<\/a>,\u00a0<a href=\"https:\/\/conferences.oreilly.com\/strata\/strata-ca-2019\/public\/schedule\/detail\/72568\" target=\"_blank\" rel=\"noopener noreferrer\">life science<\/a>, and\u00a0<a href=\"https:\/\/conferences.oreilly.com\/strata\/strata-eu-2019\/public\/schedule\/detail\/74111\" target=\"_blank\" rel=\"noopener noreferrer\">recruiting<\/a>. 
The root causes for this rapid adoption lie in the major shift in state-of-the-art NLP that happened in recent years.<\/p>\n<p><strong>ACCURACY<\/strong><\/p>\n<p>The rise of deep learning for natural language processing in the past 3-5 years meant that the algorithms implemented in popular libraries, like\u00a0<a href=\"https:\/\/spacy.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">spaCy<\/a>,\u00a0<a href=\"https:\/\/stanfordnlp.github.io\/CoreNLP\/\" target=\"_blank\" rel=\"noopener noreferrer\">Stanford CoreNLP<\/a>,\u00a0<a href=\"https:\/\/www.nltk.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">nltk<\/a>, and\u00a0<a href=\"https:\/\/opennlp.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">OpenNLP<\/a>, are less accurate than what the latest scientific papers made possible.<\/p>\n<p>Claiming to deliver state-of-the-art accuracy and speed has us constantly on the hunt to productize the latest scientific advances (yes, it is as fun as it sounds!). Here\u2019s how we\u2019re doing so far (on the en_core_web_lg benchmark, micro-averaged F1 score):<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig2-benchmark.jpg\" sizes=\"(max-width: 522px) 100vw, 522px\" srcset=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig2-benchmark.jpg 522w, https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig2-benchmark-300x259.jpg 300w\" alt=\"\" width=\"522\" height=\"451\" \/><\/p>\n<p><strong>SPEED<\/strong><\/p>\n<p>Optimizations done to get\u00a0Apache Spark\u2019s performance closer to bare metal, on both a single machine and cluster, meant that common NLP pipelines could run orders of magnitude faster\u00a0than what the inherent design limitations of legacy libraries allowed.<\/p>\n<p>The most comprehensive benchmark to date,\u00a0<a href=\"https:\/\/www.oreilly.com\/ideas\/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability\" 
target=\"_blank\" rel=\"noopener noreferrer\">Comparing production-grade NLP libraries<\/a>, was published a year ago on O\u2019Reilly Radar. On the left is the comparison of runtimes for training a simple pipeline (sentence boundary detection, tokenization, and part of speech tagging) on a single Intel i5, 4-core, 16 GB memory machine:<\/p>\n<p style=\"text-align: center;\"><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig2-speed.jpg\" sizes=\"(max-width: 600px) 100vw, 600px\" srcset=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig2-speed.jpg 600w, https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig2-speed-300x198.jpg 300w\" alt=\"\" width=\"600\" height=\"396\" \/><\/p>\n<p>Being able to leverage GPUs for training and inference has become table stakes. Using TensorFlow under the hood for deep learning enables Spark NLP to make the most of modern computing platforms \u2013 from\u00a0<a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-1\/\" target=\"_blank\" rel=\"noopener noreferrer\">nVidia\u2019s DGX-1<\/a>\u00a0to\u00a0<a href=\"https:\/\/fuse.wikichip.org\/news\/1773\/intel-cascade-lake-brings-hardware-mitigations-ai-acceleration-scm-support\/2\/\" target=\"_blank\" rel=\"noopener noreferrer\">Intel\u2019s Cascade Lake<\/a>\u00a0processors. Older libraries, whether or not they use some deep learning techniques, will require a rewrite to take advantage of these new hardware innovations, which can improve the speed and scale of your NLP pipelines by another order of magnitude.<\/p>\n<p><strong>SCALABILITY<\/strong><\/p>\n<p>Being able to scale model training, inference, and full AI pipelines from a local machine to a cluster with little or no code changes has also become table stakes. Being natively built on Apache Spark ML enables Spark NLP to scale on any Spark cluster, on-premise or in any cloud provider. 
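The promise of little or no code changes shows up at the deployment level: the same script is simply submitted with a different master URL. A sketch \u2013 the script name, package coordinates, and cluster URL below are placeholders:

```shell
# Run locally on all cores of one machine
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.0 \
  --master "local[*]" my_nlp_pipeline.py

# Submit the identical script, unchanged, to a Spark cluster
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.0 \
  --master spark://cluster-host:7077 my_nlp_pipeline.py
```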
Speedups are optimized thanks to Spark\u2019s distributed execution planning and caching, which have been tested on just about any current storage and compute platform.<\/p>\n<h3><strong>Other Drivers of Enterprise Adoption<\/strong><\/h3>\n<p><strong>PRODUCTION-GRADE CODEBASE<\/strong><\/p>\n<p>We make our living delivering working software to enterprises. This was our primary goal, in contrast to research-oriented libraries like\u00a0<a href=\"https:\/\/allennlp.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">AllenNLP<\/a>\u00a0and\u00a0<a href=\"http:\/\/nlp_architect.nervanasys.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">NLP Architect<\/a>.<\/p>\n<p><strong>PERMISSIVE OPEN SOURCE LICENSE<\/strong><\/p>\n<p>The permissive Apache 2.0 license was chosen so that the library can be used freely, including in a commercial setting. This is in contrast to Stanford CoreNLP, which requires a paid license for commercial use, or the problematic ShareAlike CC licenses used for some spaCy models.<\/p>\n<p><strong>FULL PYTHON, JAVA, AND SCALA APIs<\/strong><\/p>\n<p>Supporting multiple programming languages does not just increase the audience for a library. It also enables you to take advantage of the implemented models without having to move data back and forth between runtime environments. For example, using spaCy, which is Python-only, requires moving data from JVM processes to Python processes in order to call it \u2013 resulting in architectures that are more complex and often much slower than necessary.<\/p>\n<p><strong>FREQUENT RELEASES<\/strong><\/p>\n<p>Spark NLP is under active development by a full core team in addition to community contributions. We release about twice a month \u2013 there were\u00a0<a href=\"https:\/\/github.com\/JohnSnowLabs\/spark-nlp\/releases\" target=\"_blank\" rel=\"noopener noreferrer\">26 new releases in 2018<\/a>. 
We welcome contributions of code, documentation, models, or issues \u2013 please start by looking at the existing\u00a0<a href=\"https:\/\/github.com\/JohnSnowLabs\/spark-nlp\/issues\" target=\"_blank\" rel=\"noopener noreferrer\">issues<\/a>\u00a0on GitHub.<\/p>\n<h3><strong>Getting Started<\/strong><\/h3>\n<p><strong>PYTHON<\/strong><\/p>\n<p>A major design goal of Spark NLP 2.0 was to enable people to get the benefits of Spark and TensorFlow\u00a0<em>without knowing anything about them<\/em>. You shouldn\u2019t have to know what a Spark ML estimator or transformer is, or what a TensorFlow graph or session is. These are all still available if you\u2019re looking to build your own custom models or graphs, but they are now fronted by facades that get stuff done with minimal time and learning curve. We\u2019ve also added 15 pre-trained pipelines and models that cover the most common use cases.<\/p>\n<p>Installing Spark NLP for Python requires a one-line\u00a0<em>pip install<\/em>\u00a0or\u00a0<em>conda install<\/em>. This\u00a0<a href=\"https:\/\/nlp.johnsnowlabs.com\/docs\/en\/install\" target=\"_blank\" rel=\"noopener noreferrer\">installation page<\/a>\u00a0contains more detail about Jupyter and Databricks configurations. 
Live projects routinely use the library on Zeppelin, SageMaker, Azure, GCP, Cloudera, and vanilla Spark \u2013 inside and outside Kubernetes.<\/p>\n<p>Once installed, here is what it takes to run sentiment analysis:<\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig4-code-sentiment.jpg\" sizes=\"(max-width: 600px) 100vw, 600px\" srcset=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig4-code-sentiment.jpg 600w, https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig4-code-sentiment-300x155.jpg 300w\" alt=\"\" width=\"600\" height=\"309\" \/><\/p>\n<p>Here is all it takes to run named entity recognition with BERT embeddings:<\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig5-code-recognition.jpg\" sizes=\"(max-width: 600px) 100vw, 600px\" srcset=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig5-code-recognition.jpg 600w, https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig5-code-recognition-300x106.jpg 300w\" alt=\"\" width=\"600\" height=\"212\" \/><\/p>\n<p>The pipeline object in these examples has two key methods \u2013\u00a0<em>annotate()<\/em>, which takes a string, and\u00a0<em>transform()<\/em>, which takes a Spark data frame. These enable you to scale this code to process a large body of text on any Spark cluster.<\/p>\n<p><strong>SCALA<\/strong><\/p>\n<p>Spark NLP is written in Scala, which lets it operate directly on Spark data frames with zero copying of data while taking full advantage of the Spark execution planner and other optimizations. As a result, the library is a breeze to use for Scala and Java developers.<\/p>\n<p>The library is published on Maven Central, so just adding the dependency to your Maven or SBT file is all it takes to install it. 
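As a sketch, the SBT coordinates look roughly like this \u2013 the group and artifact IDs follow the project\u2019s published coordinates, and the version below is a placeholder to replace with the latest release on Maven Central:

```scala
// build.sbt -- Spark NLP core; replace the version with the latest release
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.0"
```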
There\u2019s a second dependency to add if you also want to install Spark NLP\u2019s OCR (optical character recognition) capabilities.<\/p>\n<p>Once installed, here is what it takes to spell check a sentence:<\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig6-code-spell-check.jpg\" sizes=\"(max-width: 600px) 100vw, 600px\" srcset=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig6-code-spell-check.jpg 600w, https:\/\/www.kdnuggets.com\/wp-content\/uploads\/talby-fig6-code-spell-check-300x105.jpg 300w\" alt=\"\" width=\"600\" height=\"209\" \/><\/p>\n<p>The Scala and Python APIs are kept similar and 100% complete in every release.<\/p>\n<p><strong>UNDER THE HOOD<\/strong><\/p>\n<p>There\u2019s a lot that goes on behind the scenes in the few lines of code shared above \u2013 and a lot that you can customize for your own applications. Spark NLP is heavily optimized towards training domain-specific NLP models \u2013 see for example the\u00a0<a href=\"https:\/\/www.johnsnowlabs.com\/spark-nlp-health\/\" target=\"_blank\" rel=\"noopener noreferrer\">Spark NLP for Healthcare<\/a>\u00a0commercial extension \u2013 so all the tools for defining your pre-trained models, pipelines, and resources are in the public and documented APIs.<\/p>\n<p>Here are the main steps taken by the named entity recognition with BERT Python code from the previous section:<\/p>\n<ol>\n<li><em>sparknlp.start()<\/em>\u00a0starts a new Spark session if there isn\u2019t one, and returns it.<\/li>\n<li><em>PretrainedPipeline()<\/em>\u00a0loads the English language version of the explain_document_dl pipeline, the pre-trained models, and the embeddings it depends on.<\/li>\n<li>These are stored and cached locally.<\/li>\n<li>TensorFlow is initialized within the same JVM process that runs Spark. The pre-trained embeddings and deep-learning models (like NER) are loaded. 
Models are automatically distributed and shared if running on a cluster.<\/li>\n<li>The\u00a0<em>annotate()<\/em>\u00a0call runs an NLP inference pipeline, which activates each stage\u2019s algorithm (tokenization, POS,\u00a0<em>etc<\/em>.).<\/li>\n<li>The NER stage is run on TensorFlow \u2013 applying a neural network with bi-LSTM layers for tokens and a CNN for characters.<\/li>\n<li>Embeddings are used to convert\u00a0<em>contextual\u00a0<\/em>tokens into vectors during the NER inference process.<\/li>\n<\/ol>\n<p><strong>GIVE IT A GO!<\/strong><\/p>\n<p>The Spark NLP homepage has examples, documentation, and an installation guide.<\/p>\n<p>If you\u2019re looking to explore sample notebooks on your own, the\u00a0<a href=\"https:\/\/github.com\/JohnSnowLabs\/spark-nlp-workshop\" target=\"_blank\" rel=\"noopener noreferrer\">Spark NLP Workshop<\/a>\u00a0has a pre-built Docker container that enables you to run a complete environment on your local machine by typing three one-liners.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Spark NLP library is the 7th&nbsp;most popular across all AI frameworks and tools. It is also by far the most widely used NLP library &ndash; twice as common as spaCy. Optimizations done to get&nbsp;Apache Spark&rsquo;s performance closer to bare metal, on both a single machine and a cluster, meant that common NLP pipelines could run orders of magnitude faster&nbsp;than what the inherent design limitations of legacy libraries allowed. 
Being natively built on Apache Spark ML enables Spark NLP to scale on any Spark cluster, on-premise or in any cloud provider.&nbsp;<\/p>\n","protected":false},"author":586,"featured_media":3153,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"content-type":"","footnotes":""},"categories":[183],"tags":[97],"ppma_author":[3282],"class_list":["post-1790","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml","tag-artificial-intelligence"],"authors":[{"term_id":3282,"user_id":586,"is_guest":0,"slug":"david-talby","display_name":"David Talby","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","user_url":"","last_name":"Talby","first_name":"David","job_title":"","description":"David Talby, Ph.D. in computer science, is Chief Technology Officer at Pacific AI that builds artificial intelligence software systems for other companies. A regular and seasoned speaker at data science events, he specialises in applying machine learning, deep learning, and natural language understanding in healthcare.&nbsp;He started and runs the Linkedin group for data science in healthcare, and is a member of the Forbes Technology Council, an invitation-only community for CIOs, CTOs and technology 
executives."}],"_links":{"self":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1790","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/users\/586"}],"replies":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/comments?post=1790"}],"version-history":[{"count":4,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1790\/revisions"}],"predecessor-version":[{"id":29708,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/posts\/1790\/revisions\/29708"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media\/3153"}],"wp:attachment":[{"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/media?parent=1790"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/categories?post=1790"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/tags?post=1790"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.experfy.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=1790"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}