• Artificial Intelligence
  • Max Ved
  • JAN 03, 2020

Soft skills detection

The new world we live in gives us more help — and more doubts. Machines stand behind everything — and the scope of this everything is only growing. To what extent can we trust such machines? We are used to relying on them in market trends, traffic management, maybe even in healthcare. Machines are now analysts, medical assistants, secretaries and teachers. Are they reliable enough to work as HRs? Psychologists? What can they tell about us?

Let’s see how text analysis can analyze your soft skills and tell a potential employer whether you can join the team smoothly.

Project Description

In this project, we used text analysis techniques to analyze the soft skills of young men (aged 15–24) looking for career opportunities.

What we had in mind was to perform a number of tests, or to choose the most effective one, to determine ground truth values. The tests we were experimenting with included:

  • Mind Tools test — short and simple, but may be difficult for centennials. Test’s output is score (from 1 to 15) for each of the following soft skills categories: Personal Mastery, Time Management, Communication Skills, Problem Solving and Decision Making, Leadership and Management.
  • TAT test — the input is a story about a picture, and the system automatically rates it in the following categories: Need for Affiliation, Need for power, Self-references (I, me, my), Social words, Positive emotions, Negative emotions, Big words (> 6 letters). We thought about mapping certain traits: Communication — Need for Affiliation, Courtesy — Social words, Flexibility — Need for Affiliation, Integrity — Self-references, Interpersonal skills — Need for Affiliation, Positive attitude — Positive emotions, Professionalism — Need for Achievement, Responsibility — Big words (words with more than 6 letters), Teamwork — Social words, Work ethic — Social words. Together with MBTI test results these categories can be mapped on soft skills more accurately.
  • Myers-Briggs tests — self-report questionnaires indicating psychological preferences in how people perceive the world around them and make decisions. The received answers are mapped onto four categories: Extraversion-Introversion, Sensing-Intuition, Thinking-Feeling, Judging-Perceiving. There are three options for MBTI tests with about 60 questions in each: MBTI test 1, MBTI test 2, MBTI test 3.

The following table describes how to map MBTI and TAT test results on soft skills.

Table 1. Mapping of MBTI and TAT results on soft skills

Our Approach

To have the fullest understanding of the user’s personality, we combine insights coming from soft skills prediction analytics and the MBTI prediction. Moreover, we keep track of the user’s past inputs and result to update our findings and to monitor possible changes in the user’s attitudes.

1. Data Collection

Any text analysis system requires a minimal amount of the user’s text at the input, as well as a minimal number of user’s tweets or FB posts (in case of linking with social media accounts). As soon as there is no publicly available labeled dataset for soft skills detection, we had to collect our own.

To quickly develop a reliable dataset we decided to expand the users’ tweets and FB posts with produced by them essays and answers to the questions generated by our system.

At the registration, the user is encouraged to specify his or her social networks profile (Instagram, Twitter and Facebook profiles). The system then collects all messages authored by the user.

As the second step, the system prompts the user to write a brief essay, answering prearranged sets of questions, which should be answered in the specific order:

1. How are you doing today?

2. Please tell me about yourself

3. Tell me about a time when you demonstrated leadership.

4. Tell me about a time when you were working in a team and faced with a challenge. How did you solve this problem?

5. What is your weakness, and how do you plan to overcome it?

One more set of questions:

1. How are you doing today?

2. Please tell me about yourself

3. How do you like to spend your free time? Please, tell me about your hobbies.

4. Please tell me about some of the most memorable moments which happened to you during your study at school/college/university.

5. What is your weakness, and how do you plan to overcome it?

The questions are generated by the system based on previous data analysis.

Afterwards, the user is prompted to enter daily status, for example, as a short blog-post or essay on his past day.

To make it easy for the user, we offer the following questions as a plan:

  1. What new things did you learn today?
  2. What difficulties did you face?
  3. How did you overcome those difficulties?

Of course, essays are not always the most convenient way to sketch your feelings about the day and not everyone likes writing. For this, we can use a chatbot interface integrated with social media, such as Facebook Messenger.

2. Data preprocessing

All available texts are used as an input to the model which determines soft skills from the customers’ set.

To be fed to the model for training, both questionnaire answers and collected tweets undergo preprocessing, i.e. normalization and tokenizing. At first, with the help of regular expressions, the text is freed from emoji, URLS, numbers, stop words, acronyms, etc.; HTML entities are also replaced with characters. If necessary, we complement text normalization with tokenizing.

A single module is used as an input to the model which determines soft skills from the dataset. A model’s output will be an N-dimensional vector, where N is the number of soft skills used by the system. The n-th element of the set will be a probability of the user having n-th skill. As an optional second output, we include a confidence score (depending on the model choice).

3. MBTI and soft skills detection

For MBTI detection, we chose the approach described at Plank, Hovy, 2015. The authors collected a corpus of 1.5M English tweets labeled by gender and MBTI type, and created a model for automatic MBTI detection. Both the corpus and the model are available, and the reported accuracy is 55–72% depending on a category (on 2000 tweets). To improve the accuracy and decrease the amount of texts required for accurate detection, we additionally used the approach described in Arnoux et. al., 2017.

For soft skills detection, we chose a slightly different approach, namely, GloVE (Pennington et. al, 2014), and used the Gaussian process as a classifier.

4. MBTI and soft skills prediction

At the next stage we calculated the features for both MBTI and Soft Skills models. We tried to use the same features for both models in order to optimize calculations. Features were then passed to the MBTI and Soft Skills calculation models. Such features can be run in parallel if the Soft Skills model does not use the MBTI model’s output as an additional input.

At this stage, the system returns a 4-dimensional vector and an optional confidence score. Optionally, it can return 4-letter codes of Myers-Biggs personality type, e.g. “ISTJ”, “ENFP”.

It outputs an N-dimensional vector with a certain score for each soft skill (normalized to [0, 1]) and a confidence score as a second (optional) output. The system can also return an error code in case of error (e.g. too short text, input language other than English, etc.).

5. Results validation

This component analyzes error codes of Soft skills and MBTI prediction modules. Gaussian processes or another Bayesian model allow us to get a confidence score which is compared with a preselected threshold value. A low confidence score means that system is not confident enough, and it cannot estimate MBTI or soft skills. In this case, an additional user’s input may be required.

6. User’s results recording

Finally, the recent MBTI and soft skills assessment results are recorded in the database, including user’s text input, the system output, and the corresponding timestamp The database stores both the initial estimation, and daily updates.

Text analysis flowchart


To work effectively, any text analysis system has a minimal required amount of user’s text at the input, as well as a minimal number of user’s tweets or FB posts (in case we link it with social media accounts). Besides, for both soft skills and MBTI, the system can have a confidence score as an output — a floating point value from 0 to 1 (where 1 is the highest score). In case of low confidence score (e.g. < 0.5), the user should be asked to write more detailed answers. The system will generate questions based on previous data analysis.

The second consideration is that the set of questions and the rules for their generation should be discussed. As we are not professional psychologists, we have two options to create a set of questions and rules:

  • Let human experts in psychology define the set of questions and rules;
  • Find existing dataset of relevant interviews

Technology stack

  • Python 3.6 is used for Text Analysis back-end implementation. The set of packages includes (but is not limited to) the following packages: Numpy, Scipy, Scikit-learn, Gensim.
  • Database for storing all users’ data, including their input text and all analysis results history can be either relational (Postgresql) or no-SQL (Mongo, Redis).

Third-party components


We showed that Artificial Intelligence can provide deep insights into the human personality. Sentiment analysis and analysis of soft skills based on the text produced by the users are examples of tools that can be immediately used by the HRs, businesses that want to monitor their employees’ attitudes and hire new specialists, as well as by candidates who want to assess and improve their chances of getting a job.

The Harvard Innovation Lab

Made in Boston @

The Harvard Innovation Lab


Matching Providers

Matching providers 2
comments powered by Disqus.