Social Signal Detection Using NLP & Text Analytics

Industry: Media and Advertising

Specialization or Business Function: Customer Analytics, Media and Advertising, Strategic Business Planning, Market Research (Product Development, Social Media Research)

Technical Function: Data Warehousing (Data Integration, Scheduling & Monitoring), Analytics (Trend Analysis, Real-time Analytics, Machine Learning, Time Series Analysis, Natural Language Processing, Text Analytics)

Technology & Tools

COMPLETED

Project Description

A SUMMARY OF OUR BUSINESS:

Big Spaceship is a creative agency focused on leveraging cultural intelligence to solve key business problems for our partners. We recently won OMMA’s Agency of the Year and work with industry leaders - including JetBlue, Starbucks, Google, and Hasbro. We have 115 employees, with one centralized office in Brooklyn.

 

THE PROBLEM WE’RE TRYING TO SOLVE:

Big Spaceship is working with various brands to better detect trends before they enter mainstream internet culture. In order to get at the forefront of these trends, we have set up different “tribes” or communities to monitor. These tribes are made up of a defined set of Twitter users who we’ve manually categorized based on target audiences relevant to our clients (e.g. “Millennial Parents”). We are using Crimson Hexagon - a social listening platform with direct access to the Twitter API - to track and monitor conversations generated from these tribes in real-time.

 

To reach “trends” we must first identify significant terms (words or phrases). We’re defining significant as anomalous based on historical data from within the tribe’s user set and anomalous in comparison to the general population. Therefore each tribe’s data must be compared to itself and a general population tribe to determine what is significant to that tribe alone.

 

Our challenge is that we have no automated way to detect trends within these tribes in real-time. We believe there are two potential approaches, but welcome other solutions:

 

Potential Approach 1: Word/Phrase Indexing

  • Analyze term usage at a user level (i.e. the proportion of users posting tweets that contain a given word, out of the total set of users; e.g. 35% of users used the word “candle”)

  • Slice data into set intervals (e.g. daily, every 3 days, weekly, etc.)

  • Establish a rolling baseline of term usage based on previous data (e.g. 30 days, 90 days, etc.)

  • Index term usage against this rolling baseline accounting for variance within the baseline range

  • Index term usage against the general population to subtract general trends in term usage

  • Identify anomalous words/phrases and raise to user.
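As a rough illustration only, the steps above could be sketched as follows. It uses the Python standard library; the function names, the (interval, user_id, term) row shape, the z-score as the "index", and the threshold of 3 are all assumptions made for the example rather than part of the brief. Only the most recent interval is scored.

```python
from collections import defaultdict
from statistics import mean, stdev


def user_share(rows, n_users):
    """rows: iterable of (interval, user_id, term) tuples (assumed shape).

    Returns {term: {interval: share of the tribe's users who used the term}}.
    """
    users = defaultdict(set)
    for interval, user_id, term in rows:
        users[(term, interval)].add(user_id)  # each user counts once per interval
    share = defaultdict(dict)
    for (term, interval), ids in users.items():
        share[term][interval] = len(ids) / n_users
    return share


def z_score(series, intervals, window=30, min_periods=7):
    """Z-score of the last interval against the preceding `window` intervals.

    series: {interval: share}; intervals: ordered list ending with the interval
    to score. Intervals missing from the series count as zero usage.
    """
    values = [series.get(i, 0.0) for i in intervals]
    baseline = values[-(window + 1):-1]  # rolling baseline excludes the scored interval
    if len(baseline) < max(min_periods, 2):
        return None  # not enough history yet
    spread = stdev(baseline)
    if spread == 0:
        return None  # no variance in the baseline, so the score is undefined
    return (values[-1] - mean(baseline)) / spread


def flag_terms(tribe, general, intervals, threshold=3.0):
    """Terms that spike in the tribe and still stand out after subtracting
    the general population's z-score (one possible way to remove general trends)."""
    flagged = {}
    for term, series in tribe.items():
        z = z_score(series, intervals)
        if z is None or z < threshold:
            continue
        z_general = z_score(general.get(term, {}), intervals) or 0.0
        if z - z_general >= threshold:
            flagged[term] = z - z_general
    return flagged
```

Calling flag_terms(user_share(tribe_rows, n_users), user_share(general_rows, n_general), intervals) would return a dictionary of term to excess score for the latest interval, where tribe_rows, general_rows, and intervals are hypothetical inputs prepared upstream.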

Potential Approach 2: Streaming Topic Model

    • Based around Liang, Yilmaz, Kanoulas’ paper “Dynamic Clustering of Streaming Short Documents”

    • Implement the Dynamic Clustering Topic Model (DCT), their proposed variation of Latent Dirichlet Allocation that assigns one topic per tweet and models topics dynamically over time, on our data for each tribe

    • Slice data into set intervals (e.g. daily, every 3 days, weekly, etc.)

    • Establish topic distributions within each tribe at each time interval

    • Establish a rolling baseline of topic distributions based on previous data (e.g. 30 days, 90 days, etc.)

    • Index topic distribution against this rolling baseline accounting for variance within the baseline

    • Index topic distribution against the general population to subtract general trends in topic usage

    • Identify anomalous topics and raise to user. Users can easily label topics based on the context of the terms within them
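The DCT model itself is not sketched here. Assuming some model has already assigned one topic id to each tweet, the later steps (per-interval topic distributions, a baseline, and a comparison) might look like the following. Jensen-Shannon divergence and the per-topic lift ratio are swapped in as example measures and are not taken from the paper or the brief; all names are hypothetical.

```python
from collections import Counter
from math import log2


def topic_distribution(topic_ids, n_topics):
    """Share of tweets assigned to each topic, given one topic id per tweet."""
    if not topic_ids:
        raise ValueError("no tweets in this interval")
    counts = Counter(topic_ids)
    return [counts.get(k, 0) / len(topic_ids) for k in range(n_topics)]


def mean_distribution(distributions):
    """Element-wise mean of equally sized distributions (a simple rolling baseline)."""
    return [sum(column) / len(distributions) for column in zip(*distributions)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def topic_lift(current, baseline, floor=1e-6):
    """Ratio of each topic's current share to its baseline share.

    `floor` avoids division by zero for topics absent from the baseline.
    """
    return [c / max(b, floor) for c, b in zip(current, baseline)]
```

A large divergence between the current interval and the baseline would suggest that the tribe's conversation has shifted, and the topics with the highest lift would be the candidates to raise to the user. The same comparison could be run against the general population's distribution.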

 

From this analysis we would likely need daily exports of these top terms or topics in the form of CSVs relevant to each tribe.
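A daily export of this kind could be as simple as the sketch below. The file naming scheme, the column names, and the top_n cutoff are assumptions for illustration, and scores is a hypothetical dictionary of term (or topic label) to anomaly score.

```python
import csv
from datetime import date
from pathlib import Path


def export_top_terms(tribe: str, day: date, scores: dict, out_dir, top_n: int = 50) -> Path:
    """Write the highest-scoring terms for one tribe and day to a CSV file.

    Returns the path of the file that was written.
    """
    # File-system-friendly tribe name, e.g. "Millennial Parents" -> "millennial_parents".
    slug = "".join(ch if ch.isalnum() else "_" for ch in tribe.lower())
    path = Path(out_dir) / f"{slug}_{day.isoformat()}.csv"
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["tribe", "date", "term", "score"])
        for term, score in ranked:
            writer.writerow([tribe, day.isoformat(), term, f"{score:.3f}"])
    return path
```

One file per tribe per day keeps each export small and easy to load into a spreadsheet or a PostgreSQL table.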

 

THE KIND OF EXPERTISE REQUIRED:

Natural Language Processing

  • Topic Modeling

  • Tokenization

  • etc.

Anomaly detection

Unsupervised Learning

Neural Networks (optional)

Data Storage/Management

 

DATA SOURCES & FORMATS:

We expect to have 10-15 tribes with 500-2000 tweets per day. Each tribe will have a monitor in Crimson Hexagon; tweet links are pulled from Crimson Hexagon, and tweet content is then pulled from the Twitter API. We will collect and store this data daily for analysis.
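The link-to-content step could be sketched as below: status IDs are parsed out of the post URLs returned by Crimson Hexagon and grouped into comma-separated batches for the Twitter /statuses/lookup endpoint, whose v1.1 documentation gives a limit of 100 IDs per request. The function names and the URL pattern are assumptions, and the HTTP calls and authentication are left out.

```python
import re

# Matches URLs such as http://twitter.com/<screen_name>/status/<id>
STATUS_ID = re.compile(r"twitter\.com/[^/]+/status(?:es)?/(\d+)")


def status_ids(posts):
    """Unique tweet IDs, in first-seen order, from Crimson Hexagon post records."""
    seen = set()
    ids = []
    for post in posts:
        match = STATUS_ID.search(post.get("url", ""))
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            ids.append(match.group(1))
    return ids


def id_batches(ids, size=100):
    """Yield comma-separated ID strings, each holding at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])
```

At 500-2000 tweets per day per tribe, this works out to roughly 5-20 lookup requests per tribe per day.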

 

CURRENT TECH STACK:

Python 3.0 (Required)

  • Pandas

  • NumPy

  • SciPy

  • scikit-learn

  • Gensim

  • TensorFlow

  • Peewee

PostgreSQL - Google CloudSQL (Flexible)

Spark (if necessary)

 

BID:

For our bidding process, we would like experts to submit an outline of their approach, a rationale explaining why that approach is the right solution, existing references they’ll use to support their approach (e.g. published white papers outlining an approach for a similar problem), and an estimate of hours. Our hourly rate will range between $100 - $200 for this project.

 

DELIVERABLE:

A replicable approach to detecting the emergence of trends within ongoing conversations, with thorough documentation describing the general methodology used.

 

LOCATION PREFERENCE:

We would like a collaborative working model in which the candidate works iteratively alongside our in-house data scientist and analysts, either onsite in Brooklyn or within the Eastern Time Zone.

 

SAMPLE DATASET:

Crimson Hexagon /posts endpoint with Twitter Output (JSON):

{

   "posts": [

       {

           "url": "http://twitter.com/mirl/status/882700164401692672",

           "title": "",

           "type": "Twitter",

           "location": "VA, USA",

           "geolocation": {

               "id": "USA.VA",

               "name": "Virginia",

               "country": "USA",

               "state": "VA"

           },

           "language": "en",

           "assignedCategoryId": 4763388608,

           "assignedEmotionId": 4763388602,

           "categoryScores": [

               {

                   "categoryId": 4763388606,

                   "categoryName": "Basic Negative",

                   "score": 0

               },

               {

                   "categoryId": 4763388610,

                   "categoryName": "Basic Positive",

                   "score": 0

               },

               {

                   "categoryId": 4763388608,

                   "categoryName": "Basic Neutral",

                   "score": 1

               }

           ],

           "emotionScores": [

               {

                   "emotionId": 4763388602,

                   "emotionName": "Neutral",

                   "score": 0.86

               },

               {

                   "emotionId": 4763388603,

                   "emotionName": "Sadness",

                   "score": 0.01

               },

               {

                   "emotionId": 4763388607,

                   "emotionName": "Surprise",

                   "score": 0

               },

               {

                   "emotionId": 4763388604,

                   "emotionName": "Fear",

                   "score": 0

               },

               {

                   "emotionId": 4763388605,

                   "emotionName": "Disgust",

                   "score": 0

               },

               {

                   "emotionId": 4763388611,

                   "emotionName": "Anger",

                   "score": 0

               },

               {

                   "emotionId": 4763388609,

                   "emotionName": "Joy",

                   "score": 0.12

               }

           ],

           "imageInfo": [

               {

                   "url": "http://pbs.twimg.com/media/DD_7HvpWsAEHq4E.jpg"

               }

           ]

       }

   ],

 "totalPostsAvailable": 1,

 "status": "success"

}

Example of the Twitter API /statuses/lookup endpoint (JSON):

[

 {

   "created_at": "Tue Mar 21 20:50:14 +0000 2006",

   "id": 20,

   "id_str": "20",

   "text": "just setting up my twttr",

   "source": "web",

   "truncated": false,

   "in_reply_to_status_id": null,

   "in_reply_to_status_id_str": null,

   "in_reply_to_user_id": null,

   "in_reply_to_user_id_str": null,

   "in_reply_to_screen_name": null,

   "user": {

     "id": 12,

     "id_str": "12",

     "name": "Jack Dorsey",

     "screen_name": "jack",

     "location": "California",

     "description": "",

     "url": null,

     "entities": {

       "description": {

         "urls": []

       }

     },

     "protected": false,

     "followers_count": 2577282,

     "friends_count": 1085,

     "listed_count": 23163,

     "created_at": "Tue Mar 21 20:50:14 +0000 2006",

     "favourites_count": 2449,

     "utc_offset": -25200,

     "time_zone": "Pacific Time (US & Canada)",

     "geo_enabled": true,

     "verified": true,

     "statuses_count": 14447,

     "lang": "en",

     "contributors_enabled": false,

     "is_translator": false,

     "is_translation_enabled": false,

     "profile_background_color": "EBEBEB",

     "profile_background_image_url": "http://abs.twimg.com/images/themes/theme7/bg.gif",

     "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme7/bg.gif",

     "profile_background_tile": false,

     "profile_image_url": "http://pbs.twimg.com/profile_images/448483168580947968/pL4ejHy4_normal.jpeg",

     "profile_image_url_https": "https://pbs.twimg.com/profile_images/448483168580947968/pL4ejHy4_normal.jpeg",

     "profile_banner_url": "https://pbs.twimg.com/profile_banners/12/1347981542",

     "profile_link_color": "990000",

     "profile_sidebar_border_color": "DFDFDF",

     "profile_sidebar_fill_color": "F3F3F3",

     "profile_text_color": "333333",

     "profile_use_background_image": true,

     "default_profile": false,

     "default_profile_image": false,

     "following": true,

     "follow_request_sent": false,

     "notifications": false

   },

   "geo": null,

   "coordinates": null,

   "place": null,

   "contributors": null,

   "retweet_count": 23936,

   "favorite_count": 21879,

   "entities": {

     "hashtags": [],

     "symbols": [],

     "urls": [],

     "user_mentions": []

   },

   "favorited": false,

   "retweeted": false,

   "lang": "en"

 },

 {

   "created_at": "Sun Feb 09 23:25:34 +0000 2014",

   "id": 432656548536401920,

   "id_str": "432656548536401920",

   "text": "POST statuses/update. Great way to start. https://t.co/9S8YO69xzf (disclaimer, this was not posted via the API).",

   "source": "web",

   "truncated": false,

   "in_reply_to_status_id": null,

   "in_reply_to_status_id_str": null,

   "in_reply_to_user_id": null,

   "in_reply_to_user_id_str": null,

   "in_reply_to_screen_name": null,

   "user": {

     "id": 2244994945,

     "id_str": "2244994945",

     "name": "TwitterDev",

     "screen_name": "TwitterDev",

     "location": "Internet",

     "description": "Developers and Platform Relations @Twitter. We are developers advocates. We can't answer all your questions, but we listen to all of them!",

     "url": "https://t.co/66w26cua1O",

     "entities": {

       "url": {

         "urls": [

           {

             "url": "https://t.co/66w26cua1O",

             "expanded_url": "/",

             "display_url": "dev.twitter.com",

             "indices": [

               0,

               23

             ]

           }

         ]

       },

       "description": {

         "urls": []

       }

     },

     "protected": false,

     "followers_count": 3147,

     "friends_count": 909,

     "listed_count": 53,

     "created_at": "Sat Dec 14 04:35:55 +0000 2013",

     "favourites_count": 61,

     "utc_offset": -25200,

     "time_zone": "Pacific Time (US & Canada)",

     "geo_enabled": false,

     "verified": true,

     "statuses_count": 217,

     "lang": "en",

     "contributors_enabled": false,

     "is_translator": false,

     "is_translation_enabled": false,

     "profile_background_color": "FFFFFF",

     "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",

     "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",

     "profile_background_tile": false,

     "profile_image_url": "http://pbs.twimg.com/profile_images/431949550836662272/A6Ck-0Gx_normal.png",

     "profile_image_url_https": "https://pbs.twimg.com/profile_images/431949550836662272/A6Ck-0Gx_normal.png",

     "profile_banner_url": "https://pbs.twimg.com/profile_banners/2244994945/1391977747",

     "profile_link_color": "0084B4",

     "profile_sidebar_border_color": "FFFFFF",

     "profile_sidebar_fill_color": "DDEEF6",

     "profile_text_color": "333333",

     "profile_use_background_image": false,

     "default_profile": false,

     "default_profile_image": false,

     "following": true,

     "follow_request_sent": false,

     "notifications": false

   },

   "geo": null,

   "coordinates": null,

   "place": null,

   "contributors": null,

   "retweet_count": 1,

   "favorite_count": 5,

   "entities": {

     "hashtags": [],

     "symbols": [],

     "urls": [

       {

         "url": "https://t.co/9S8YO69xzf",

         "expanded_url": "/docs/api/1.1/post/statuses/update",

         "display_url": "dev.twitter.com/docs/api/1.1/p…",

         "indices": [

           42,

           65

         ]

       }

     ],

     "user_mentions": []

   },

   "favorited": false,

   "retweeted": false,

   "possibly_sensitive": false,

   "lang": "en"

 }

]
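For user-level term counts, each status in a response like the one above would need to be reduced to a date, a user ID, and a set of terms. A minimal sketch follows, assuming a deliberately simple tokenizer (lowercasing, URL removal, word characters with an optional # or @ prefix); a real pipeline would likely add stop-word removal and phrase detection. The function name is hypothetical.

```python
import re
from datetime import datetime

URL = re.compile(r"https?://\S+")
TOKEN = re.compile(r"[#@]?\w+")

# Format of the "created_at" field, e.g. "Tue Mar 21 20:50:14 +0000 2006"
CREATED_AT = "%a %b %d %H:%M:%S %z %Y"


def tweet_rows(status):
    """Turn one status object into sorted (date, user_id, term) rows.

    Each distinct term appears once per tweet, so repeated words in a single
    tweet do not inflate user-level counts.
    """
    created = datetime.strptime(status["created_at"], CREATED_AT)
    text = URL.sub(" ", status["text"].lower())
    terms = sorted(set(TOKEN.findall(text)))
    return [(created.date(), status["user"]["id_str"], term) for term in terms]
```

Using id_str rather than id for the user avoids precision problems if the data later passes through tools that treat large integers as floats.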

Project Overview

  • Posted
    January 23, 2018
  • Planned Start
    February 05, 2018
  • Delivery Date
    March 16, 2018
  • Preferred Location
    United States
