How to take the probability density of any set of numbers and turn it into an analysis.
Let’s start at the beginning: the probability density function (PDF). A probability density describes where the values in a list of numbers tend to concentrate. For example:
numbers = [1, 1, 1, 1, 2]
This list holds four 1’s and a single 2, so the values lean heavily towards 1, which is what we expect because we see more 1’s in the list. Well, great, but why is this useful?
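For a small discrete list like this, the "density" is just the share of the list each value takes up. A minimal sketch, using only the standard library:

```python
from collections import Counter

numbers = [1, 1, 1, 1, 2]

# Empirical probability of each value: count / total
counts = Counter(numbers)
probabilities = {value: count / len(numbers) for value, count in counts.items()}
print(probabilities)  # {1: 0.8, 2: 0.2}
```

The value 1 carries 80% of the probability mass, matching what we eyeballed above.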
So let’s say you have a bunch of numbers, maybe a million, and you don’t know where you stand in the trend of the data.
Looking at where the values are most concentrated is a raw starting point: it helps you see whether any numbers can be separated or categorized, and you can often mark those groupings as features.
So let’s take a more complex list: Perfect Numbers
perfect_numbers = [191, 561, 942, 608, 236, 107, 294, 793, 378, 84, 303, 638, 130, 997, 321, 548, 169, 216]
Interesting plot? It looks like most of these numbers fall within the 200–300 range. That is a trend I could pick out quickly. We would need something else to validate it, but at least it starts us off with a question.
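One quick way to sanity-check an eyeballed trend like that is a plain bin count before reaching for a plot. A sketch using only the standard library (the bin width of 100 is an arbitrary choice here):

```python
from collections import Counter

perfect_numbers = [191, 561, 942, 608, 236, 107, 294, 793, 378, 84,
                   303, 638, 130, 997, 321, 548, 169, 216]

# Count how many values land in each 100-wide bin
bins = Counter((n // 100) * 100 for n in perfect_numbers)
for lo in sorted(bins):
    print(f"{lo}-{lo + 99}: {bins[lo]}")
```

The counts show the low hundreds are just as crowded as the 200s, which is exactly why the smoothed density curve is worth a closer look.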
I use plotly for pretty much all my plots, and you can come up with some pretty wild designs for large-scale data analysis.
Of course, your imports and standard variables:
import plotly.figure_factory as ff

bin_size = .5
colors = ["#A800FF"]
group_labels = ['perfect_numbers']
perfect_numbers = [191, 561, 942, 608, 236, 107, 294, 793, 378, 84, 303, 638, 130, 997, 321, 548, 169, 216]
Set up the plotly distplot:
fig = ff.create_distplot([perfect_numbers], group_labels, bin_size=bin_size, show_hist=False, show_rug=False, colors=colors)
For large datasets, the show_rug argument can slow rendering down by an order of magnitude; I have seen this with sets of over a couple hundred thousand points.
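If your dataset really is that large, one workaround (assuming a uniform subsample is acceptable for an exploratory plot) is to downsample before building the distplot. A sketch:

```python
import random

random.seed(0)  # reproducible subsample
big_data = [random.gauss(500, 150) for _ in range(300_000)]

# Keep a manageable subsample for plotting; ~10,000 points usually
# preserves the overall shape of the density curve
sample = random.sample(big_data, 10_000)
print(len(sample))  # 10000
```

You would then pass [sample] to create_distplot instead of the full list.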
And the rest is all to make it pretty-ish, if you want it:
fig.update_layout(legend=dict(itemsizing='constant'))
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="bottom",
    y=1.02,
    xanchor="right",
    x=1.05,
    font=dict(family="Arial", size=50),
    bordercolor="LightSteelBlue",
    borderwidth=2,
    ),
)
fig.update_traces(line=dict(width=10))
fig.update_xaxes(
    ticks="outside",
    tickwidth=3,
    tickcolor='black',
    tickfont=dict(family='Arial', color='black', size=50),
    title_font=dict(size=46, family='Arial'),
    title_text='Numbers',
    ticklen=15,
)
fig.update_yaxes(
    ticks="outside",
    tickwidth=3,
    tickcolor='black',
    title_text='Probability Density',
    tickfont=dict(family='Arial', color='black', size=50),
    title_font=dict(size=46, family='Arial'),
    ticklen=15,
)
fig.update_layout(
title_text="Probability Distribution",
title_font=dict(size=28, family='Arial'),
template='simple_white',
xaxis_tickformat = 'd',  # integer tick labels (d3-format; 'i' is not a valid type)
bargap=0.2, # gap between bars of adjacent location coordinates,
height=600,
width=1350
)
fig.show()
Here’s a Google Colab Link to make it easier: https://colab.research.google.com/drive/1-LbwjiLcfLA0_rB8UjSV6hvqHTj_zHym?usp=sharing
Eventually, with some design work, you can get to some pretty wild feature distributions :).