Dimensionality Reduction with t-SNE

Niklas Donges
June 21, 2019 AI & Machine Learning

Since we live in a three-dimensional world, we can easily understand things in one, two or three dimensions, but datasets can be very complex and hard to understand, especially if you don’t have the right tools at your disposal. In machine learning, we sometimes need to reason about hundreds or even thousands of dimensions. Our brains simply can’t do that, which is why we build machine learning systems that recognize and learn patterns in data that humans can’t. A well-known example is IBM’s Watson, which has been used to support cancer diagnosis: it can analyze millions of cancer research papers at once and match a patient’s genetic profile against what it has learned.

Table of contents:

  • The Data Set 
  • Data Preparation
  • Train Test Split
  • Dimensionality Reduction with t-SNE
  • Summary

Data Set

Today we will visualize a dataset that contains measurements from people who were tracked with a fitness-tracking device while doing exercises.

[Screenshot: a sample of the dataset]

Above you can see a part of the dataset we will be working with today. Each row represents a different person and each column represents a different physical measurement (the features). On the right you can see the “class” column, which describes what a person was doing while being tracked. You can see that some fields contain “NA”, which means their value is missing. Our first job is to remove these so that our data is clean.

Let’s first import the libraries and tools we need to complete our task.

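The original import cell was a screenshot; a sketch of the imports the walkthrough relies on might look like this:

```python
# Libraries used throughout this walkthrough: pandas for loading the CSV,
# scikit-learn for scaling, splitting and t-SNE, matplotlib for plotting.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
```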

Now we can use the pandas read_csv() function to download our dataset directly from the internet. We will also save the number of rows in the dataset using the DataFrame’s shape attribute, which holds the row and column counts.

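The dataset URL was only visible in the original screenshot, so this sketch uses a tiny inline CSV (with made-up column names) in its place; for the real data you would pass the URL string to pd.read_csv().

```python
import io
import pandas as pd

# Inline stand-in for the real CSV download; "NA" is parsed as a missing
# value by read_csv's defaults.
csv_text = io.StringIO(
    "user,roll_belt,pitch_belt,classe\n"
    "adelmo,1.41,8.07,A\n"
    "carlitos,1.42,NA,B\n"
)
dataset = pd.read_csv(csv_text)

# shape is an attribute holding (rows, columns), so the row count is:
num_rows = dataset.shape[0]
print(num_rows)  # 2
```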

Data Preparation

We will now start cleaning our data. We use the isnull() function together with sum() to count the missing values in each column of our dataset. Then we select the columns that contain no missing values and keep only those, which removes the columns with missing data. If you examine the dataset a little more, you will notice that the first seven columns don’t contain information that helps differentiate between our classes. That’s why we also remove them, using positional indexing on the columns (the post used the .ix indexer, which has since been removed from pandas; .iloc is the modern equivalent).

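The cleaning steps above can be sketched on a toy frame (with made-up column names standing in for the real sensor columns):

```python
import numpy as np
import pandas as pd

# Toy frame: one mostly-empty sensor column and one identifier column.
dataset = pd.DataFrame({
    "user": ["adelmo", "carlitos", "eurico"],
    "roll_belt": [1.41, 1.42, 1.48],
    "kurtosis_roll_belt": [np.nan, np.nan, np.nan],
    "classe": ["A", "B", "A"],
})

# Missing values per column...
null_counts = dataset.isnull().sum()
# ...and the columns with zero missing values; keep only those.
complete_columns = null_counts[null_counts == 0].index
dataset = dataset[complete_columns]

# The post drops the first seven identifier columns; our toy frame has
# only one, so we slice it off with .iloc (the successor of .ix).
dataset = dataset.iloc[:, 1:]
print(list(dataset.columns))  # ['roll_belt', 'classe']
```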

Now it is time to transform our data into vectors so that our machine learning model can take it as input. If you don’t know what a vector is, you can take a look at my previous blog post about it (https://machinelearning-blog.com/2017/11/04/calculus-derivatives/). We will create vectors that represent the features of each person in our dataset.

We start by storing all the features of our data in a variable. Then we standardize the features using scikit-learn’s StandardScaler.

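A minimal sketch of the standardization step, with a toy feature matrix standing in for the real sensor measurements:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: every column except the class label.
X = np.array([[1.41, 8.07],
              [1.42, 8.05],
              [1.48, 8.10]])

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))
```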

Train and Test split

Now we will split our data into training and testing subsets so that we can train and test our model.

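The split can be sketched with scikit-learn's train_test_split on toy data (the real split sizes were only visible in the screenshot; 30% test is an assumption):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y = np.array([0, 1] * 5)          # toy class labels

# Hold out 30% of the samples for testing; random_state makes the split
# reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```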

Dimensionality Reduction with t-SNE

Dimensionality reduction is an entire subfield of machine learning. It lets us represent high-dimensional data in a 3D or 2D space. Note that even a normal picture can have up to 32 million dimensions if we consider each pixel to be a dimension. But a picture can also have just two dimensions: its length and its width. The key is to find the intrinsic low dimensionality in our data, which lets us visualize it better for the human eye.

One of the most popular methods for doing exactly this is t-SNE (t-distributed Stochastic Neighbor Embedding). With t-SNE we can reduce the dimensionality of our data to the number of dimensions we think is ideal. The technique takes each of our 70-dimensional feature vectors and compares it with every other vector to measure how similar they are, storing these similarities in a so-called similarity matrix. t-SNE then builds a second similarity matrix for the projected map points, which contain our final representation of the dataset. The first similarity matrix describes the data as it is; the second describes where we want to end up. t-SNE then minimizes the mismatch between these two matrices using gradient descent (other optimizers are possible). Gradient descent slowly and iteratively moves the low-dimensional map points, updating the second similarity matrix over time until it matches the first one as closely as possible.
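The first similarity matrix described above can be sketched in a few lines. Note the simplifications: real t-SNE tunes a separate Gaussian bandwidth per point via the perplexity setting and uses a Student-t kernel for the low-dimensional matrix; this sketch uses one fixed bandwidth on random toy vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 toy feature vectors

# Squared Euclidean distance between every pair of vectors.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# Gaussian similarities, normalized into a joint distribution P.
P = np.exp(-sq_dists / 2.0)
np.fill_diagonal(P, 0.0)  # a point is not its own neighbor
P /= P.sum()              # all entries now sum to 1
print(P.shape)  # (5, 5)
```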

First we will initialize the t-SNE model with scikit-learn and set the number of components to 2. We then fit it with fit_transform() on our feature vectors and save the resulting two-dimensional feature vectors.

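A sketch of that step, with random vectors standing in for the standardized features (the perplexity value here is an assumption; it must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_std = rng.normal(size=(60, 10))  # stand-in for standardized feature vectors

# n_components=2 asks t-SNE for a 2D embedding.
tsne = TSNE(n_components=2, perplexity=20, random_state=42)
X_2d = tsne.fit_transform(X_std)
print(X_2d.shape)  # (60, 2)
```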

Now we can plot our points on a 2D graph. For that we create a legend for our class labels and plot each point using matplotlib.

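A plotting sketch, with random points and labels standing in for the real t-SNE output and classes:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X_2d = rng.normal(size=(60, 2))       # stand-in for the t-SNE output
labels = rng.integers(0, 5, size=60)  # stand-in class labels
class_names = ["A", "B", "C", "D", "E"]

# One scatter call per class, so every class gets its own color and
# exactly one legend entry.
fig, ax = plt.subplots()
for value, name in enumerate(class_names):
    mask = labels == value
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], s=15, label=name)
ax.legend(title="class")
fig.savefig("tsne_plot.png")
```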


[Plot: the 2D t-SNE projection, colored by class]

On the plot we can see that points from the same class tend to cluster together. Note that t-SNE achieved this without knowing the classes of our feature vectors: it learned on its own to place similar feature vectors close together in the two-dimensional space.

Summary

  1. Machine learning lets us work with high-dimensional data.
  2. With dimensionality reduction we are able to visualize data that would otherwise be too complex to understand.
  3. t-SNE is a dimensionality reduction technique.
© 2021, Experfy Inc. All rights reserved.