And getting the most out of your data
source: Isaac Smith @Unsplash — free stock images
In real-life data science, plotting does matter. In my day-to-day work, I spend more time plotting and analysing charts than doing anything else. Let me explain: I work at Ravelin Technology. Our business is data, specifically analysing and predicting fraud for online merchants. The company’s main product uses a combination of machine learning, network analysis, rules and human insights to predict whether a transaction might be fraudulent. We have an ad-hoc machine learning model for each of our clients, but building that model happens at the beginning of the relationship with them, and after that it mostly requires maintenance.
Maintenance how? Sometimes it’s for introducing new features or because of behavioural changes in customers. However, it can also be that something changes in the data we receive. Or perhaps there’s something we originally missed because we didn’t have enough data when we first built the model. It can also happen that either the client or we spot some dodgy performance in our predictions in, for example, one specific country. Whatever the case, there’s usually an extensive investigation to find out what the problem is and what we could do better. To give a bit more context, analysing the performance of a model usually means working with datasets of millions of rows and thousands of columns. At that scale it’s almost impossible to find patterns or insights just by looking at the raw data, so plotting is the only practical way in. Plotting allows us to compare features’ performance and to see evolution through time, distributions of values, differences in mean and median values, and so on.
As I said in my previous story, in our field we must give equal weight to explainability and interpretability. In real-life data science you never work alone on a project, and your workmates and clients usually won’t know much about the data you’ll be using. Being able to explain your thinking process is a key part of any data-related job. That’s why copying and pasting is not enough and chart customization becomes key.
Today we’ll go through 5 techniques for making better charts that I’ve found useful in the past. Some of them are day-to-day tools, while others you’ll only use every now and then. Hopefully, having this story at hand will be useful when the moment arrives. The libraries we’ll be using are:
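As a rough guide, and assuming the usual plotting stack (the exact list is an assumption based on the functions mentioned later in this story), the imports could look like this:

```python
# Assumed imports for the examples in this story
import numpy as np                              # numeric ranges for tick positions
import matplotlib.pyplot as plt                 # figures, axes and layout
import seaborn as sns                           # statistical charts built on matplotlib
from sklearn.metrics import confusion_matrix    # raw confusion-matrix counts
```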
With the following style and configurations:
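A minimal sketch of such a configuration follows. The specific values (grid style, figure size, font sizes) are hypothetical choices for illustration, so adjust them to taste:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical settings, chosen only for illustration
sns.set_style("whitegrid")                  # light grid behind every chart
plt.rcParams["figure.figsize"] = (12, 6)    # default figure size in inches
plt.rcParams["axes.titlesize"] = 14         # font size for chart titles
plt.rcParams["axes.labelsize"] = 12         # font size for axis labels
```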
1. Change range and steps in axis
The default range and steps that matplotlib or seaborn choose are usually good enough for visualizing our data, but sometimes we’ll want every step on an axis shown explicitly. Something else I’ve found useful is drawing all the data but showing tick labels only for a specific range of the x or y axis.
For example, let’s say we’re plotting the distribution of our model’s predictions and we want to concentrate on the values between 30 and 50, with a step every two units, without losing sight of the rest of the values. Our original seaborn ‘distplot’ would look like:
We have now two options for accomplishing the idea above:
In both cases, we need to specify the starting point, ending point and step. Mind how the ending point follows ‘less than’ logic rather than ‘less than or equal to’, so the ending value itself is excluded. The result would be the following chart:
Also, note that I’m calling both options from the ‘ax’ object, since that’s the default. When we create any kind of chart, an axes object (‘ax’) and a figure (‘fig’) are automatically created. We can also do the same the following way:
2. Rotate ticks
This is an easy but very useful tip if, for example, we’re dealing with text labels instead of numbers. We can do this just by using the ‘rotation’ parameter in ‘set_xticklabels’:
Note how I’m also passing ‘my_labels’ to the ‘labels’ parameter, since that’s mandatory when using ‘set_xticklabels’. If you’re drawing a ‘distplot’, you can simply pass the range of values to be shown, while for any other chart you can pass exactly the same array you specified for the x-axis. You can also combine this with the first technique like this:
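A small sketch with made-up categories follows. The label names and counts are hypothetical, and pinning the tick positions before relabelling is a precaution I have added:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical categories and counts
my_labels = ["United Kingdom", "Germany", "Argentina", "United States"]
counts = [120, 95, 60, 150]

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x=my_labels, y=counts, ax=ax)

# Pin the tick positions first so each label maps to exactly one tick
ax.set_xticks(range(len(my_labels)))
ax.set_xticklabels(labels=my_labels, rotation=45, ha="right")
```

Setting ‘ha="right"’ anchors the end of each rotated label under its tick, which tends to read better with long names.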
Obtaining the following result:
3. Change the space in between plots
More often than not, we’ll want to plot several charts at once to compare their results, visualize them all together, or perhaps just to save time and/or space. In any case, we can do that by using ‘subplots’ in a very simple way:
We specified two rows, and therefore we’ll be plotting two charts:
Now, sometimes, instead of having only two charts, we might have more. Perhaps we need to include titles for all of them. We may also have some charts with text labels that need rotating for better readability. In cases like this, we could end up with some overlap between plots, and increasing the space between charts helps us to visualize them better.
Mind how the height of the figure remains the same (10 in this case), but the space in between charts increases. If you want to maintain your charts’ size, you’d have to increase your figure size through the ‘figsize’ parameter.
Also, despite the ‘h’, the ‘hspace’ parameter controls the height of the gap between rows, that is, the vertical spacing. If you were drawing multiple columns instead of rows, you could accomplish the same by using the ‘wspace’ parameter, which controls the width of the gap between columns.
By the way, if you want to find out how to set titles for your charts, you can find that tip and some others for better plotting in my previous story.
NOTE: just like I specified two rows to be drawn above, you could also specify a number of columns. With more than one row and more than one column, the charts are indexed with two indices, like ax[0, 1].
4. Customize your confusion matrix
Unfortunately, this is not the space for explaining in depth how the confusion matrix works or what it is useful for. Nonetheless, if you fancy learning more about it, I always recommend this story from M. Sunasra.
Now, if you’re already familiar with the concept, you might have found that the default heatmap created by ‘plot_confusion_matrix’ from the ‘sklearn.metrics’ library sometimes comes out with the upper and lower squares cut off, like the following picture:
We can solve this by plotting our own confusion matrix from scratch with just a few lines. For example:
What we’re doing here is:
- We create an empty figure, wider than it is tall since we’ll have the annotations to the right of the heatmap
- We get the values of our confusion matrix through ‘skplt.metrics.confusion_matrix’
- We specify the ‘labels’ according to the number of categories we have
- We create a heatmap using the values from point 2 and specifying: i) ‘annot’ equal to True, ii) ‘annot_kws’ for the font size of the annotations (12 in this case), iii) ‘fmt’ for passing the string formatting code, iv) ‘cmap’ for the colour pattern, v) and finally the labels for both axes (in a confusion matrix both of them are the same)
- We get the y-axis view limits and set them again, extended by 0.5 at each end
- Last step: set the ‘ylabel’ and ‘xlabel’ to true label and predicted label respectively
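The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original snippet: the labels are made up, ‘confusion_matrix’ is imported from ‘sklearn.metrics’ rather than through ‘skplt’, and I have added a guard so the y-limits are only widened when the outer rows are actually cropped (an issue widely reported for matplotlib 3.1.1):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted classes for a binary problem
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]

fig = plt.figure(figsize=(8, 6))            # 1. figure wider than it is tall
cm = confusion_matrix(y_true, y_pred)       # 2. raw counts, rows are true classes
labels = [0, 1]                             # 3. one label per category

# 4. annotated heatmap with integer formatting and a blue colour map
ax = sns.heatmap(cm, annot=True, annot_kws={"size": 12}, fmt="d",
                 cmap="Blues", xticklabels=labels, yticklabels=labels)

# 5. widen the y limits by half a cell at each end, but only if rows are cropped
bottom, top = ax.get_ylim()
if abs(bottom - top) < len(labels):
    ax.set_ylim(bottom + 0.5, top - 0.5)

# 6. axis labels
ax.set_ylabel("True label")
ax.set_xlabel("Predicted label")
```

Without the guard, step 5 would add an empty half-row above and below the matrix on matplotlib versions that already draw the heatmap correctly.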
The result should be something like this:
5. Plot cumulative distributions
Surely I don’t need to stress how useful plotting cumulative distributions can be, either for better understanding the percentage of elements up to a certain value or for comparing two different groups within our data.
You can easily get this kind of chart through seaborn’s ‘distplot’ itself just by setting the following:
sns.distplot(my_data, label='my label', color='red', hist_kws=dict(cumulative=True))
We can make the chart look better by setting the limits for the x-axis:
As I said at the beginning of the story, I use some of these tools and tips all the time, and others only every now and then. Hopefully, knowing these quick fixes and techniques will help you make better plots and better understand your data.