Entity Extraction, NLP and Network Analysis for a Research and Advisory Firm

Industry Chemical, Oil and Gas, Energy and Utility

Specialization Or Business Function Strategic Business Planning (Competitive Intelligence)

Technical Function Analytics (Natural Language Processing, Text Analytics)

Technology & Tools Programming Languages and Frameworks (R, Python)

COMPLETED Jan 15, 2019

Project Description

We have two distinct projects with aggressive deadlines.  These projects collectively will serve as a proof of concept for a research platform that we would like to build.  We intend to hire two data scientists, one for each project.

Project 1: Academic Network Formation Detection

Objectives

  1. Discover interesting emerging technologies by clustering related academic publications and researchers, and track their development over time. (a) We are particularly interested in discovering the points in time at which networks emerge. (b) “Interesting” could be defined as the point at which similar topics begin to see increases in publications at multiple institutions by many different researchers.
  2. In the process of doing the above network analysis, build an underlying data asset (ideally in a relational database) of publications, people, institutions, and time-stamped affiliations.
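
One way to make the definition of “interesting” in objective 1 concrete is to count, per topic and year, the distinct institutions and researchers publishing on that topic, and to flag the first year in which both counts cross a threshold. The sketch below is only an illustration under assumptions: the record layout, the `first_emergence_year` helper, the sample rows, and the threshold values are hypothetical and are not part of the brief.

```python
from collections import defaultdict

def first_emergence_year(records, min_institutions=2, min_authors=3):
    """Map each topic to the first year in which it was published on by at
    least `min_institutions` distinct institutions and `min_authors`
    distinct authors. Both thresholds are illustrative assumptions."""
    institutions = defaultdict(set)  # (topic, year) -> institutions seen
    authors = defaultdict(set)       # (topic, year) -> authors seen
    for topic, year, author, institution in records:
        institutions[(topic, year)].add(institution)
        authors[(topic, year)].add(author)

    emergence = {}
    # Sorting the (topic, year) keys visits each topic's years in order.
    for topic, year in sorted(institutions):
        if topic in emergence:
            continue  # already found an earlier qualifying year
        if (len(institutions[(topic, year)]) >= min_institutions
                and len(authors[(topic, year)]) >= min_authors):
            emergence[topic] = year
    return emergence

# Hypothetical sample rows: (topic, year, author, institution)
sample = [
    ("perovskite solar", 2010, "A", "Univ X"),
    ("perovskite solar", 2012, "A", "Univ X"),
    ("perovskite solar", 2012, "B", "Univ Y"),
    ("perovskite solar", 2012, "C", "Univ Z"),
    ("perovskite solar", 2013, "D", "Univ Y"),
]
print(first_emergence_year(sample))  # {'perovskite solar': 2012}
```

Counting per calendar year is the simplest choice; a rolling window or a comparison against the topic's own baseline may suit sparse topics better.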

Project Description

The finished project will include scripts/software to scrape/pull the dataset from one or more sources (e.g. ResearchGate) and

  • Create a table of publications (title, date, abstract)
  • Create a table of authors (associated with the publication)
  • Create a table of institutions (associated with authors at a point in time based on a publication)
  • Classify the publications based on similarity of text in abstracts (or authorship or other

You will also deliver the actual database/files produced by the above scripts.
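
The three tables listed above could be laid out roughly as follows. This is a minimal sketch using SQLite through Python's standard `sqlite3` module; the table names, column names, and the `authorships` link table are assumptions made for illustration, and the sample row is invented.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publications (
    publication_id INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    published_on   TEXT,          -- ISO date, e.g. '2014-06-01'
    abstract       TEXT
);
CREATE TABLE authors (
    author_id INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE institutions (
    institution_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL
);
-- One row per (publication, author). The institution stored here is the
-- author's affiliation on that publication, so the publication date
-- time-stamps the affiliation.
CREATE TABLE authorships (
    publication_id INTEGER NOT NULL REFERENCES publications(publication_id),
    author_id      INTEGER NOT NULL REFERENCES authors(author_id),
    institution_id INTEGER REFERENCES institutions(institution_id),
    PRIMARY KEY (publication_id, author_id)
);
"""

def create_database(path=":memory:"):
    """Create the (hypothetical) schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_database()
conn.execute("INSERT INTO publications VALUES "
             "(1, 'Example title', '2014-06-01', 'Example abstract')")
conn.execute("INSERT INTO authors VALUES (1, 'Example Author')")
conn.execute("INSERT INTO institutions VALUES (1, 'Example University')")
conn.execute("INSERT INTO authorships VALUES (1, 1, 1)")

# Affiliation history: where each author was, dated by publication.
rows = conn.execute("""
    SELECT a.full_name, i.name, p.published_on
    FROM authorships s
    JOIN authors a      ON a.author_id = s.author_id
    JOIN institutions i ON i.institution_id = s.institution_id
    JOIN publications p ON p.publication_id = s.publication_id
    ORDER BY p.published_on
""").fetchall()
print(rows)  # [('Example Author', 'Example University', '2014-06-01')]
```

Keeping the affiliation on the link table rather than on the author row is what allows one person to appear at different institutions over time.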

We are open to recommendations of techniques, approaches, and strategy based on the expertise and experience of the expert.

We have no particular requirement around tools, programming language, or methodology. We will ultimately need to refresh our dataset at least monthly; however, our immediate need is to have historical data for a proof of concept.

We expect the output to eventually be imported into a relational database, and merged/deduped with people, institutions, and relationships from other data sources. While we currently don’t have a full working taxonomy, we also expect to combine the categorization scheme extracted from this data source with other data sources (e.g. company profiles/descriptions, patents, press releases).
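
As a rough illustration of the merge/dedupe step, records from different sources can be grouped under a normalized name key. The sketch below assumes exact matching on a cleaned-up name; the `normalize_name` and `merge_by_name` helpers and the sample names are hypothetical.

```python
import re
import unicodedata

def normalize_name(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_accents.lower())
    return " ".join(cleaned.split())

def merge_by_name(*sources):
    """Group raw names from several sources under one normalized key."""
    merged = {}
    for source in sources:
        for raw in source:
            merged.setdefault(normalize_name(raw), []).append(raw)
    return merged

# Hypothetical names as they might appear in two different sources.
publication_names = ["José García", "ACME Corp."]
patent_names = ["Jose Garcia", "Acme Corp"]
print(merge_by_name(publication_names, patent_names))
# {'jose garcia': ['José García', 'Jose Garcia'],
#  'acme corp': ['ACME Corp.', 'Acme Corp']}
```

Exact-key matching of this kind misses abbreviations, initials, and name changes, so a real pipeline would likely add fuzzy matching or additional identifiers on top of it.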

Boundaries

To narrow the universe of publications and people for this dataset, we are only interested in publications, people, and institutions in any STEM discipline, within the date range of 2000-present.

Project 2: Entity extraction and classification from patent filing databases

Objectives

  1. Discover interesting emerging technologies by clustering related companies and people by classifying them by topic/technology based on their patent filings over time
  2. We are particularly interested in trends in new cluster formation and predicting growth in certain technology areas
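
Grouping filings by topic generally starts from some measure of text similarity. The sketch below shows one common option, TF-IDF weighting with cosine similarity, written with the standard library only so that it is self-contained. It is an assumption about approach rather than a requirement of the brief, the sample abstracts are invented, and in practice an established text-mining library would probably replace these helpers.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, weight = tf * log(N/df)."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(index, vectors):
    """Index of the document closest to the one at `index`."""
    scores = [(cosine(vectors[index], vec), j)
              for j, vec in enumerate(vectors) if j != index]
    return max(scores)[1]

# Invented abstracts: two about batteries, two about hydrogen production.
abstracts = [
    "lithium battery anode with silicon coating",
    "silicon anode material for lithium battery cells",
    "catalyst for hydrogen production from methane",
    "methane reforming catalyst for hydrogen production",
]
vectors = tfidf_vectors(abstracts)
print([most_similar(i, vectors) for i in range(len(abstracts))])
# [1, 0, 3, 2]
```

Pairwise similarities like these can then feed a clustering step; which clustering method fits best depends on the data and is left open here.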

Project Description

The finished project will include:

  • Scripts/software to build the classifications from Thomson Innovation or other patent sources and
  • A database or files containing the data on the classifications/clusters, and appropriate visualizations to show growth of clusters over time
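
To show what “growth of clusters over time” might look like as data, the sketch below tallies filings per cluster and year and prints a plain-text bar for each year. The `growth_table` helper, the cluster label, and the sample filings are hypothetical; a plotting library would normally take the place of the text bars.

```python
from collections import Counter

def growth_table(filings):
    """Return {cluster: [(year, count, change_vs_previous_year), ...]}.

    `filings` is an iterable of (cluster, year) pairs. Years with no
    filings between a cluster's first and last year appear with count 0.
    """
    counts = Counter(filings)
    table = {}
    for cluster in sorted({c for c, _ in counts}):
        years = sorted(y for c, y in counts if c == cluster)
        rows, previous = [], None
        for year in range(years[0], years[-1] + 1):
            count = counts.get((cluster, year), 0)
            change = None if previous is None else count - previous
            rows.append((year, count, change))
            previous = count
        table[cluster] = rows
    return table

# Invented sample: one filing in 2010, two in 2011, none in 2012, three in 2013.
filings = [
    ("batteries", 2010),
    ("batteries", 2011), ("batteries", 2011),
    ("batteries", 2013), ("batteries", 2013), ("batteries", 2013),
]
for cluster, rows in growth_table(filings).items():
    for year, count, change in rows:
        print(f"{cluster} {year} {'#' * count} ({count})")
```

The year-over-year change column is the simplest growth signal; any forecasting of future growth would need a proper time-series model built on top of these counts.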

We are open to recommendations of techniques, approaches, and strategy based on the expertise and experience of the expert.

We have no particular requirement around tools, programming language, or methodology.

We expect the output to eventually be imported into a relational database, and merged/deduped with people, institutions, and relationships from other data sources. While we currently don’t have a full working taxonomy, we also expect to combine the categorization scheme extracted from this data source with other data sources (e.g. company profiles/descriptions, patents, press releases).

Boundaries

To narrow the universe of patents for this dataset, we are only interested in patents, people, and institutions in any STEM discipline, within the date range of 2000-present.

Proposal

Please specify which of the two projects you are most interested in and how much time you can dedicate each week. An estimate of how long it would take to create a proof of concept would be very helpful. We would also like to understand the specific methodology you would use to tackle the challenges described above.

Project Overview

  • Posted
    November 17, 2015
  • Planned Start
    November 23, 2015
  • Preferred Location
    United States
