In this tutorial we will explore the continuous and discrete uniform distributions in Python.
To continue following this tutorial we will need the following Python libraries: scipy, numpy, and matplotlib.
If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:
pip install scipy
pip install numpy
pip install matplotlib
There are two types of uniform distributions:
A continuous uniform probability distribution is a distribution with constant probability density, meaning that every value in its range is equally likely to be observed.
A continuous uniform distribution is also called a rectangular distribution. Why is that? Let’s explore!
This type of distribution is defined by two parameters: \(a\), the minimum value, and \(b\), the maximum value, and is written as: \(U(a, b)\).
The difference between \(b\) and \(a\) is the interval length: \(l=b-a\). Since the density is constant, all sub-intervals within the interval are equally probable (given that those sub-intervals are of the same length).
The PDF (probability density function) of a continuous uniform distribution is given by:
$$f(x) = \frac{1}{b-a} \textit{ for } a\leq x \leq b$$
and 0 otherwise.
And the CDF (cumulative distribution function) of a continuous uniform distribution is given by:
$$F(x) = \frac{x-a}{b-a} \textit{ for } a\leq x \leq b$$
with 0 for \(x < a\) and 1 for \(x>b\).
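As a quick sketch in plain Python (illustrative helper names, no libraries required), these two formulas translate directly into code:

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): constant 1/(b-a) inside [a, b], 0 otherwise."""
    return 1 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """Cumulative probability of U(a, b): 0 below a, 1 above b, linear in between."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# For example, with a = 0 and b = 20:
print(uniform_pdf(10, 0, 20))  # 0.05
print(uniform_cdf(6, 0, 20))   # 0.3
```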
A discrete uniform probability distribution, is a distribution with constant probability, meaning that a finite number of values are equally likely to be observed.
This type of distribution is defined by two parameters: \(a\), the smallest value, and \(b\), the largest value, and is written as: \(U(a, b)\).
The difference between \(b\) and \(a\) plus one is the number of observations: \(n=b-a+1\), and all observations are equally probable.
For any \(x \in [a, b]\), the PMF (probability mass function) of a discrete uniform distribution is given by:
$$f(x) = \frac{1}{b-a+1} = \frac{1}{n}$$
And for any \(x \in [a, b]\), the CDF (cumulative distribution function) of a discrete uniform distribution is given by:
$$F(x) = P(X\leq x) = \frac{x-a+1}{b-a+1} = \frac{x-a+1}{n}$$
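The discrete formulas can be sketched the same way (illustrative helpers; \(x\) is assumed to be an integer in \([a, b]\)):

```python
def discrete_uniform_pmf(x, a, b):
    """PMF of a discrete uniform on the integers a..b: 1/n for each outcome."""
    n = b - a + 1
    return 1 / n if a <= x <= b else 0.0

def discrete_uniform_cdf(x, a, b):
    """CDF of a discrete uniform: number of outcomes <= x divided by n."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a + 1) / (b - a + 1)

# For example, a fair 6-sided die (a = 1, b = 6):
print(discrete_uniform_pmf(3, 1, 6))  # ≈ 0.1667
print(discrete_uniform_cdf(2, 1, 6))  # ≈ 0.3333
```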
Let’s consider an example: you live in an apartment building that has 10 floors and just came home. You enter the lobby and are about to press the elevator button. You know that it can take anywhere between 0 and 20 seconds to wait for the elevator: 0 seconds if the elevator is on the first floor (no wait), and 20 seconds if it is on the tenth floor (maximum wait). This is an example of a continuous uniform distribution, since the wait time can take any value in this range with equal likelihood, and it is continuous because the elevator can be anywhere in the building between the first and tenth floors (for example, between the fifth and sixth floors).
Here we have the minimum value \(a = 0\), and the maximum value \(b = 20\).
Knowing the values of \(a\) and \(b\), we can easily compute the continuous uniform distribution PDF:
$$f(x)=\frac{1}{20-0} = \frac{1}{20} = 0.05$$
Using the \(f(x)\) formula and given parameters we can create the following visualization of continuous uniform PDF:
So what does this really tell us in the context of a continuous uniform distribution? Let’s take two 1-second intervals anywhere within [0, 20]. For example, from 1 to 2 (\(i_1 = [1, 2]\)) and from 15 to 16 (\(i_2 = [15, 16]\)). It is important to note that both of these intervals have the same length, equal to 1. Using the PDF result, we can say that these intervals are equally likely to occur, each with probability 0.05. In other words, it is as likely for the elevator to arrive between 1 and 2 seconds as between 15 and 16 seconds (with probability 0.05).
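As a quick check, the probability of each interval is the difference of CDF values at its endpoints. A sketch using scipy.stats.uniform (note that scale is the interval length \(b-a\), not the upper bound; the variable names here are illustrative):

```python
from scipy.stats import uniform

# U(0, 20): loc is the lower bound, scale is the interval length b - a
elevator = uniform(loc=0, scale=20)

# Probability of each 1-second interval is a difference of CDF values
p1 = elevator.cdf(2) - elevator.cdf(1)    # interval [1, 2]
p2 = elevator.cdf(16) - elevator.cdf(15)  # interval [15, 16]
print(round(p1, 5), round(p2, 5))  # 0.05 0.05
```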
Now let’s consider an addition to the example in this section. You are still in the apartment building waiting for the elevator, but now you want to find out what is the probability that it will take the elevator 6 seconds or less to arrive after you press the button.
Using continuous distribution CDF formula from this section we can solve for:
$$F(6) = P(X\leq 6) = \frac{6-0}{20} = \frac{6}{20} = 0.3$$
We observe that the probability that it will take the elevator 6 seconds or less (anywhere between 0 and 6) to arrive is 0.3.
Using \(F(x)\) formula and given parameters we can create the following visualization of continuous uniform CDF:
And we observe a linear relationship between cumulative probability and random variable \(X\), where the function is monotonically increasing at the rate \(f(x)\) (in our case \(f(x)=0.05\)).
In one of the previous sections we computed continuous uniform distribution probability density function by hand. In this section, we will reproduce the same results using Python.
We will begin with importing the required dependencies:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform
Next, we will create an array of values between 0 and 20 (the minimum and maximum wait times). Mathematically, there are infinitely many values in this range, so for the purposes of this example we will create 4,000 evenly spaced values between 0 and 20. We will also print the first 3 of them to take a look.
a = 0
b = 20
size = 4000
x = np.linspace(a, b, size)
print(x[:3])
And you should get:
[0. 0.00500125 0.0100025 ]
And now we will have to create a continuous uniform random variable using scipy.stats.uniform (note that loc is the lower bound and scale is the interval length \(b-a\)):
continuous_uniform_distribution = uniform(loc=a, scale=b-a)
In the following sections we will focus on calculating the PDF and CDF using Python.
In order to calculate the continuous uniform distribution PDF using Python, we will use the .pdf() method of the scipy.stats.uniform generator:
continuous_uniform_pdf = continuous_uniform_distribution.pdf(x)
print(continuous_uniform_pdf)
And you should get:
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
We find that the density at each value is the same and equal to 0.05, which is exactly what we calculated by hand.
Using matplotlib library, we can easily plot the continuous uniform distribution PDF using Python:
plt.plot(x, continuous_uniform_pdf)
plt.xlabel('X')
plt.ylabel('Probability')
plt.show()
And you should get:
In order to calculate the continuous uniform distribution CDF using Python, we will use the .cdf() method of the scipy.stats.uniform generator:
continuous_uniform_cdf = continuous_uniform_distribution.cdf(x)
Since we have 4,000 values, double-checking the calculation we did by hand requires finding the cumulative probability associated with the value 6. It is indeed around 0.3.
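Rather than searching the array, the CDF can also be evaluated directly at 6. A minimal sketch (recreating the distribution so the snippet stands alone):

```python
from scipy.stats import uniform

# U(0, 20): loc is the lower bound, scale is the interval length
continuous_uniform_distribution = uniform(loc=0, scale=20)

# Evaluate the CDF directly at x = 6
print(continuous_uniform_distribution.cdf(6))  # 0.3
```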
Using matplotlib library, we can easily plot the continuous uniform distribution CDF using Python:
plt.plot(x, continuous_uniform_cdf)
plt.xlabel('X')
plt.ylabel('Cumulative Probability')
plt.show()
And you should get:
Let’s consider an example (and one most of us have tried ourselves): rolling a die. The possible outcomes of rolling a single 6-sided die follow the discrete uniform distribution.
Why is that? It’s because you can only have 1 outcome out of 6 possible outcomes (you can get either 1, 2, 3, 4, 5, or 6). The number of possible outcomes is finite, and each outcome has an equal probability of being observed, which is \(\frac{1}{6}\).
Knowing the number of all possible outcomes \(n\), we can easily compute the discrete uniform distribution PMF:
$$f(x)=\frac{1}{6} \approx 0.167$$
Using the \(f(x)\) formula and given parameters we can create the following visualization of discrete uniform PMF:
In this example, each side of the die has an equal probability of being observed, approximately equal to 0.167.
Now let’s consider an addition to this example. You are rolling the same 6-sided die and now want to find out the probability of you observing outcome that is equal to or less than 2 (meaning either 1 or 2).
Knowing the number of all possible outcomes \(n\), we can easily compute the discrete uniform distribution CDF:
$$F(2)=\frac{2-1+1}{6-1+1} = \frac{2}{6} \approx 0.33$$
This tells us that if we roll a 6-sided die, the probability of observing a value less than or equal to 2 is 0.33.
Using the \(F(x)\) formula and given parameters we can create the following visualization of discrete uniform CDF:
And we observe a step-wise relationship since we have discrete values as possible outcomes.
In the previous sections we computed the discrete uniform distribution PMF and CDF by hand. In this section, we will reproduce the same results using Python.
We will begin with importing the required dependencies:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint
Next, we will create an array of values between 1 and 6 (smallest and largest die values), and print them to take a look.
a = 1
b = 6
x = np.arange(a, b+1)
print(x)
And you should get:
[1 2 3 4 5 6]
And now we will have to create a discrete uniform random variable using scipy.stats.randint (note that the upper bound is exclusive, hence b+1):
discrete_uniform_distribution = randint(a, b+1)
In the following sections we will focus on calculating the PMF and CDF using Python.
In order to calculate the discrete uniform distribution PMF using Python, we will use the .pmf() method of the scipy.stats.randint generator:
discrete_uniform_pmf = discrete_uniform_distribution.pmf(x)
print(discrete_uniform_pmf)
You should get:
[0.16666667 0.16666667 0.16666667 0.16666667 0.16666667 0.16666667]
Which is exactly the \(\frac{1}{6} \approx 0.167\) value that we calculated by hand.
Using matplotlib library, we can easily plot the discrete uniform distribution PMF using Python:
plt.plot(x, discrete_uniform_pmf, 'bo', ms=8)
plt.vlines(x, 0, discrete_uniform_pmf, colors='b', lw=5, alpha=0.5)
plt.xlabel('X')
plt.ylabel('Probability')
plt.show()
And you should get:
In order to calculate the discrete uniform distribution CDF using Python, we will use the .cdf() method of the scipy.stats.randint generator:
discrete_uniform_cdf = discrete_uniform_distribution.cdf(x)
print(discrete_uniform_cdf)
And you should get:
[0.16666667 0.33333333 0.5 0.66666667 0.83333333 1. ]
We see here that the second value in the array is approximately 0.33, which is exactly what we calculated by hand.
Using matplotlib library, we can easily plot the discrete uniform distribution CDF using Python:
plt.plot(x, discrete_uniform_cdf, 'bo', ms=8)
plt.xlabel('X')
plt.ylabel('Cumulative Probability')
plt.show()
And you should get:
In this article we explored the continuous uniform distribution and discrete uniform distribution, as well as how to create and plot them in Python.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Statistics articles.
The post Continuous and discrete uniform distribution in Python appeared first on PyShark.
I recently presented a ~45min session at the Enterprise DNA Analytics Summit on the topic of automating & augmenting Excel with Python. This is an in-demand topic and there’s likely to be an official Python <> Excel integration in the future.
Until then, there are lots of great opportunities to use the two in tandem already. In this presentation I cover what the pandas package for data analysis can and can’t do with Excel. Have a look and please give a like if you could:
If you would like to learn more about these topics, make sure to check out the following:
Thanks for tuning in and be sure to subscribe to be notified of upcoming speaking events and more. You can also drop me a line if your Excel-heavy team is looking for help to get up and running with Python.
Data Scientists are used to working with Anaconda environments and installing packages with “conda” commands. However, apart from conda there is the “pip” package manager, which is still the most popular. Although these two package managers are very similar, they are designed for different purposes and should be used accordingly. In this tutorial, we will show you some tips about pip that you can apply to your daily tasks.
According to Wikipedia, pip is a package-management system written in Python used to install and manage software packages. It connects to an online repository of public packages, called the Python Package Index. pip can also be configured to connect to other package repositories, provided that they comply with Python Enhancement Proposal 503.
I think that most of you know how to install packages using pip which is simply by running the command:
pip install some-package-name
If you would like to install a specific version you can run:
pip install 'some-package-name==1.2.2' --force-reinstall
Where 1.2.2 is the version of the package. We can add the --force-reinstall flag in case we want to re-install the package if it is already installed. Moreover, you can give a range of versions like:
pip install 'some-package-name>=1.3.0,<1.4.0' --force-reinstall
Finally, you can install packages for a specific Python version. For example, if we want Python 3, we can run:
pip3 install some-package-name
We can easily remove a package by running:
pip uninstall some-package-name
We have explained how to create the requirements.txt file. Let’s assume that the requirements.txt is the file below:
pandas==1.2.5
numpy==1.21.1
We can install these libraries by running:
pip install -r requirements.txt
Usually, we work with virtual environments and once we have installed the required libraries, we can easily generate the requirements.txt file using pip.
pip freeze > requirements.txt
Using pip, we can get a list of the installed packages in our environment by running:
pip list
You can search for a specific package using the list and the grep command. Let’s get my pandas version.
pip list | grep pandas
pandas 1.2.5
When we install packages, it is common to have compatibility issues with dependencies and so on. We can check if everything is OK by running:
pip check
If I run it at my base environment, I get the following:
streamlit 0.86.0 requires protobuf, which is not installed.
spyder 4.2.5 requires pyqt5, which is not installed.
spyder 4.2.5 requires pyqtwebengine, which is not installed.
qdarkstyle 2.8.1 requires helpdev, which is not installed.
conda-repo-cli 1.0.4 requires pathlib, which is not installed.
anaconda-project 0.10.1 requires ruamel-yaml, which is not installed.
awswrangler 2.9.0 has requirement numpy<1.21.0,>=1.18.0, but you have numpy 1.21.1.
awswrangler 2.9.0 has requirement pyarrow<4.1.0,>=2.0.0, but you have pyarrow 5.0.0.
awscli 1.20.12 has requirement botocore==1.21.12, but you have botocore 1.20.112.
awscli 1.20.12 has requirement colorama<0.4.4,>=0.2.5, but you have colorama 0.4.4.
awscli 1.20.12 has requirement docutils<0.16,>=0.10, but you have docutils 0.17.1.
awscli 1.20.12 has requirement s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.4.2.
Apparently, I have some work to do!
We can get more information about an installed package by running
pip show some-package-name
For example, this is what I get for pandas.
Name: pandas
Version: 1.2.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author:
Author-email:
License: BSD
Location: c:\users\gpipis\anaconda3\lib\site-packages
Requires: pytz, numpy, python-dateutil
Required-by: streamlit, statsmodels, seaborn, mlxtend, awswrangler, altair
Data Scientists and/or Data Engineers work with Python on a daily basis, so a basic knowledge of “pip” is really useful for their work. That was an introduction to pip; I encourage you to dive into pip and unlock its power, and feel free to share your tips with our community.
In this article we will explore Poisson distribution and Poisson process in Python.
To continue following this tutorial we will need the following Python libraries: scipy, numpy, and matplotlib.
If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:
pip install scipy
pip install numpy
pip install matplotlib
A Poisson point process (or simply, Poisson process) is a collection of points randomly located in a mathematical space. Due to its several properties, the Poisson process is often defined on the real line, where it can be considered a random (stochastic) process in one dimension. This further allows us to build mathematical systems and study certain events that appear in a random manner.
One of its important properties is that each point of the process is stochastically independent from other points in the process.
As an example, think of where such a process can be observed in real life. Suppose you are studying the historical frequencies of hurricanes. This is indeed a random process, since the number of hurricanes this year is independent of the number of hurricanes last year, and so on. However, over time you may observe some trends, an average frequency, and more.
Mathematically speaking, in this case, the point process depends on some constant, such as an average rate (for example, the average number of customers calling).
A Poisson process is defined by a Poisson distribution.
A Poisson distribution is a discrete probability distribution of the number of events occurring in a fixed interval of time, given two conditions: events occur at a known constant average rate, and each event occurs independently of the time since the last event.
To put this in some context, consider our example of hurricane frequencies from the previous section. Assume that we have data on hurricanes observed over a period of 20 years. We find that the average number of hurricanes per year is 7. Each year is independent of previous years, which means that if we observed 8 hurricanes this year, it doesn’t mean we will observe 8 next year.
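As an illustrative simulation (not part of the hand calculations that follow; numpy is assumed), we can draw many such yearly counts and check that their average is close to the chosen rate:

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility

# Simulate 10,000 "years" of hurricane counts with an average rate of 7 per year
counts = rng.poisson(lam=7, size=10_000)
print(counts[:10])
print(counts.mean())  # close to 7
```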
The PMF (probability mass function) of a Poisson distribution is given by:
$$p(k, \lambda) = \frac{\lambda^{k}e^{-\lambda}}{k!}$$
where \(k\) is the number of occurrences, \(\lambda\) is the average number of occurrences in the interval, and \(e\) is Euler’s number. The \(p(k, \lambda)\) (also written \(Pr(X=k)\)) can be read as: the Poisson probability of k events in an interval.
And the CDF (cumulative distribution function) of a Poisson distribution is given by:
$$F(k, \lambda) = \sum^{k}_{i=0} \frac{\lambda^{i}e^{-\lambda}}{i!} $$
Now that we know some formulas to work with, let’s go through an example in detail.
Recall the hurricanes data we mentioned in the previous sections. We know that the historical frequency of hurricanes is 7 per year, and this forms our \(\lambda\) value:
$$\lambda = 7$$
The question we can have is what is the probability of observing exactly 5 hurricanes this year? And this forms our \(k\) value:
$$k = 5$$
Using the formula from the previous section, we can calculate the Poisson probability:
$$p(5, 7) = \frac{(7^{5})(e^{-7})}{5!} = 0.12772 \approx 12.77\%$$
Therefore, the probability of observing exactly 5 hurricanes next year is equal to 12.77%.
Naturally, we are curious about the probabilities of other frequencies.
Consider the table below, which shows the Poisson probability of hurricane frequencies (0-16):
\(k\) | \(p(k, \lambda)\) | % |
0 | 0.00091 | 0.09% |
1 | 0.00638 | 0.64% |
2 | 0.02234 | 2.23% |
3 | 0.05213 | 5.21% |
4 | 0.09123 | 9.12% |
5 | 0.12772 | 12.77% |
6 | 0.14900 | 14.9% |
7 | 0.14900 | 14.9% |
8 | 0.13038 | 13.04% |
9 | 0.10140 | 10.14% |
10 | 0.07098 | 7.1% |
11 | 0.04517 | 4.52% |
12 | 0.02635 | 2.64% |
13 | 0.01419 | 1.42% |
14 | 0.00709 | 0.71% |
15 | 0.00331 | 0.33% |
16 | 0.00145 | 0.15% |
Using the above table we can create the following visualization of the Poisson probability mass function for this example:
Consider the table below, which shows the Poisson cumulative probability of hurricane frequencies (0-16):
\(k\) | \(F(k, \lambda)\) | % |
0 | 0.00091 | 0.09% |
1 | 0.00730 | 0.73% |
2 | 0.02964 | 2.96% |
3 | 0.08177 | 8.18% |
4 | 0.17299 | 17.3% |
5 | 0.30071 | 30.07% |
6 | 0.44971 | 44.97% |
7 | 0.59871 | 59.87% |
8 | 0.72909 | 72.91% |
9 | 0.83050 | 83.05% |
10 | 0.90148 | 90.15% |
11 | 0.94665 | 94.67% |
12 | 0.97300 | 97.3% |
13 | 0.98719 | 98.72% |
14 | 0.99428 | 99.43% |
15 | 0.99759 | 99.76% |
16 | 0.99904 | 99.9% |
Using the above table we can create the following visualization of the Poisson cumulative distribution function for this example:
The table also allows us to answer some interesting questions.
For example, what if we wanted to find out the probability of seeing up to 5 hurricanes (mathematically: \(k\leq5\)), we can see that it’s \(0.30071\) or \(30.07\%\).
On the other hand, we may be interested in the probability of observing more than 5 hurricanes (mathematically: \(k>5\)), which would be \(1-F(5,7) = 1-0.30071 = 0.69929\) or \(69.93\%\).
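In Python, this complement can be computed directly with the survival function .sf() of scipy.stats.poisson, which is equivalent to 1 - CDF. A quick sketch:

```python
from scipy.stats import poisson

p_up_to_5 = poisson.cdf(5, mu=7)     # P(k <= 5)
p_more_than_5 = poisson.sf(5, mu=7)  # P(k > 5) = 1 - P(k <= 5)
print(round(p_up_to_5, 5), round(p_more_than_5, 5))  # 0.30071 0.69929
```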
In the previous sections we computed the Poisson probability mass function and cumulative distribution function by hand. In this section, we will reproduce the same results using Python.
We will begin with importing the required dependencies:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson
Next we will need an array of the \(k\) values for which we will compute the Poisson PMF. In the previous section, we calculated it for 17 values of \(k\) from 0 to 16, so let’s create an array with these values:
k = np.arange(0, 17)
print(k)
You should get:
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16]
In the following sections we will focus on calculating the PMF and CDF using Python.
In order to calculate the Poisson PMF using Python, we will use the .pmf() method of the scipy.stats.poisson generator. It will need two parameters: the array of \(k\) values and the rate parameter mu (our \(\lambda = 7\)).
And now we can create an array with Poisson probability values:
pmf = poisson.pmf(k, mu=7)
pmf = np.round(pmf, 5)
print(pmf)
And you should get:
[0.00091 0.00638 0.02234 0.05213 0.09123 0.12772 0.149 0.149 0.13038 0.1014 0.07098 0.04517 0.02635 0.01419 0.00709 0.00331 0.00145]
Note: the probabilities are rounded to 5 decimal places to make the output easier to read.
If you want to print it in a nicer way with each \(k\) value and the corresponding probability:
for val, prob in zip(k, pmf):
    print(f"k-value {val} has probability = {prob}")
And you should get:
k-value 0 has probability = 0.00091
k-value 1 has probability = 0.00638
k-value 2 has probability = 0.02234
k-value 3 has probability = 0.05213
k-value 4 has probability = 0.09123
k-value 5 has probability = 0.12772
k-value 6 has probability = 0.149
k-value 7 has probability = 0.149
k-value 8 has probability = 0.13038
k-value 9 has probability = 0.1014
k-value 10 has probability = 0.07098
k-value 11 has probability = 0.04517
k-value 12 has probability = 0.02635
k-value 13 has probability = 0.01419
k-value 14 has probability = 0.00709
k-value 15 has probability = 0.00331
k-value 16 has probability = 0.00145
which is exactly the same as we saw in the example where we calculated probabilities by hand.
We will need the k values array that we created earlier as well as the pmf values array in this step.
Using matplotlib library, we can easily plot the Poisson PMF using Python:
plt.plot(k, pmf, marker='o')
plt.xlabel('k')
plt.ylabel('Probability')
plt.show()
And you should get:
In order to calculate the Poisson CDF using Python, we will use the .cdf() method of the scipy.stats.poisson generator. It will need the same two parameters: the array of \(k\) values and the rate parameter mu (our \(\lambda = 7\)).
And now we can create an array with Poisson cumulative probability values:
cdf = poisson.cdf(k, mu=7)
cdf = np.round(cdf, 3)
print(cdf)
And you should get:
[0.001 0.007 0.03 0.082 0.173 0.301 0.45 0.599 0.729 0.83 0.901 0.947 0.973 0.987 0.994 0.998 0.999]
Note: the cumulative probabilities are rounded to 3 decimal places to make the output easier to read.
If you want to print it in a nicer way with each \(k\) value and the corresponding cumulative probability:
for val, prob in zip(k, cdf):
    print(f"k-value {val} has probability = {prob}")
And you should get:
k-value 0 has probability = 0.001
k-value 1 has probability = 0.007
k-value 2 has probability = 0.03
k-value 3 has probability = 0.082
k-value 4 has probability = 0.173
k-value 5 has probability = 0.301
k-value 6 has probability = 0.45
k-value 7 has probability = 0.599
k-value 8 has probability = 0.729
k-value 9 has probability = 0.83
k-value 10 has probability = 0.901
k-value 11 has probability = 0.947
k-value 12 has probability = 0.973
k-value 13 has probability = 0.987
k-value 14 has probability = 0.994
k-value 15 has probability = 0.998
k-value 16 has probability = 0.999
which is exactly the same as we saw in the example where we calculated cumulative probabilities by hand.
We will need the k values array that we created earlier as well as the cdf values array in this step.
Using matplotlib library, we can easily plot the Poisson CDF using Python:
plt.plot(k, cdf, marker='o')
plt.xlabel('k')
plt.ylabel('Cumulative Probability')
plt.show()
And you should get:
In this article we explored Poisson distribution and Poisson process, as well as how to create and plot Poisson distribution in Python.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Statistics articles.
The post Poisson Distribution and Poisson Process in Python appeared first on PyShark.
This free, online workshop is to introduce researchers in healthcare, agriculture, academia and so forth to the concepts in my book Advancing into Analytics: From Excel to Python and R.
By the end of this workshop, you will have a complete working environment with R and Python on your computer to complete the exercises in the book.
We’ll be summarizing and visualizing a clinical trials dataset using both applications.
Requirements:
Learn more about Advancing into Analytics here.
The recording of this event will be posted for two weeks, then taken down for good.
In this article we will focus on a complete walk through of a Python dictionary data structure.
A Python dictionary is a data structure for storing groups of objects. It consists of a mapping of key-value pairs, where each key is associated with a value. It can contain data with the same or different data types, is unordered, and is mutable.
To initialize an empty dictionary in Python, we can simply run the code below and print its content:
empty_dict = {}
print(empty_dict)
You should get:
{}
When we want to create a dictionary with some values that we want to populate, we add the values as a sequence of comma-separated key-value pairs. For example, let’s say we want to create a dictionary with countries as keys and their populations as values:
countries = { "China": 1439323776, "India": 1380004385, "USA": 331002651, "Indonesia": 273523615, "Pakistan": 220892340, "Brazil": 212559417 } print(countries)
You should get:
{'China': 1439323776, 'India': 1380004385, 'USA': 331002651, 'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417}
In order to access values stored in a Python dictionary, we should use the key associated with the value. For example, let’s say we want to get the population of USA from the above countries dictionary. We know that the key of the population value is “USA”, and we use it to access the population value:
usa_population = countries["USA"] print(usa_population)
You should get:
331002651
Note: unlike a Python list, you can’t access values in a dictionary using integer indices. The only way to access a value is through a key that is present in the dictionary.
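One consequence of this is that indexing with a missing key raises a KeyError. When a key may be absent, the .get() method returns None (or a default you supply) instead. A small sketch, using a shortened copy of the countries dictionary:

```python
countries = {"China": 1439323776, "India": 1380004385, "USA": 331002651}

# Direct indexing raises KeyError for a missing key
try:
    countries["Japan"]
except KeyError:
    print("'Japan' is not in the dictionary")

# .get() returns a default instead of raising
print(countries.get("Japan"))     # None
print(countries.get("Japan", 0))  # 0
```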
In this section we continue working with the countries dictionary and discuss ways of adding elements to a Python dictionary.
Let’s say we want to add another country with its population to our countries dictionary. For example, we want to add Japan’s population of 126,476,461. We can easily do it by adding it as an additional key-value pair to the dictionary:
countries["Japan"] = 126476461 print(countries)
You should get:
{'China': 1439323776, 'India': 1380004385, 'USA': 331002651, 'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417, 'Japan': 126476461}
And you can see that we have successfully added a new element to the dictionary.
Now, what if we want to add more than one country? Let’s say we now want to add two more countries with their populations to our dictionary. For example, Russia and Mexico with populations of 145,934,462 and 128,932,753 respectively.
Will the same syntax work? Not for several pairs at once. Instead we will use the .update() method of the Python dictionary data structure, which allows us to add multiple comma-separated key-value pairs to the dictionary.
The logic is to create a new dictionary (new_countries) from the new key-value pairs and then merge it into the countries dictionary:
new_countries = { "Russia": 145934462, "Mexico": 128932753 } countries.update(new_countries) print(countries)
You should get:
{'China': 1439323776, 'India': 1380004385, 'USA': 331002651, 'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417, 'Japan': 126476461, 'Russia': 145934462, 'Mexico': 128932753}
And you can see that we have successfully added new elements to the dictionary.
In this section we continue working with the countries dictionary and discuss ways of removing elements from a Python dictionary.
Let’s say we need to make some changes and remove a key-value pair for China and its population from a dictionary. We can easily remove it using the .pop() method:
countries.pop("China") print(countries)
You should get:
{'India': 1380004385, 'USA': 331002651, 'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417, 'Japan': 126476461, 'Russia': 145934462, 'Mexico': 128932753}
And you can see that we have successfully removed an element from the dictionary.
The next step is to explore how to remove multiple elements from a Python dictionary. Let’s say we want to remove Japan and Mexico and their respective populations from the countries dictionary.
We know that the .pop() method allows us to remove a single element per call, which suggests that if we iterate over a list of the keys we want to remove, we can call .pop() for each entry:
to_remove = ["Japan", "Mexico"] for key in to_remove: countries.pop(key) print(countries)
You should get:
{'India': 1380004385, 'USA': 331002651, 'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212559417, 'Russia': 145934462}
And you can see that we have successfully removed the elements from the dictionary.
Another functionality to cover is changing elements in a Python dictionary. You will see in the sections below that the functionality for changing elements is identical to the functionality for adding elements.
Why is that? When we try to add an element to the dictionary, Python looks for the specific key we are adding: if the key already exists in the dictionary, its value is overwritten; but if the key doesn’t exist, a new key-value pair is added to the dictionary.
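A minimal sketch of both branches of this behavior (a tiny throwaway dictionary, separate from the tutorial’s countries dictionary):

```python
populations = {"Brazil": 212559417}

populations["Brazil"] = 212560000  # key exists: the value is overwritten
populations["Japan"] = 126476461   # key doesn't exist: a new pair is added

print(populations)
# {'Brazil': 212560000, 'Japan': 126476461}
```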
Let’s say we want to update the value of Brazil’s population to 212560000 in the countries dictionary:
countries["Brazil"] = 212560000 print(countries)
You should get:
{'India': 1380004385, 'USA': 331002651, 'Indonesia': 273523615, 'Pakistan': 220892340, 'Brazil': 212560000, 'Russia': 145934462}
Now, let’s say we want to update the values of Indonesia and Pakistan populations to 273530000 and 220900000 in the countries dictionary respectively.
The logic is to create a new dictionary (update_countries) from the new key-value pairs and then update the existing key-value pairs in the countries dictionary:
update_countries = { "Indonesia": 273530000, "Pakistan": 220900000 } countries.update(update_countries) print(countries)
And you should get:
{'India': 1380004385, 'USA': 331002651, 'Indonesia': 273530000, 'Pakistan': 220900000, 'Brazil': 212560000, 'Russia': 145934462}
In this section we will focus on different ways of iterating over a Python dictionary.
Let’s say we want to iterate over the keys of the countries dictionary and print each key (in our case each country) on a separate line.
We will simply use a for loop together with .keys() dictionary method:
for country in countries.keys():
    print(country)
And you should get:
India
USA
Indonesia
Pakistan
Brazil
Russia
Another use case can be that we want to find the sum of all countries’ populations stored in the countries dictionary.
As you can imagine, we will need a for loop again, and this time we will use the .values() dictionary method:
sum_populations = 0
for population in countries.values():
    sum_populations += population
print(sum_populations)
And you should get:
2563931498
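The loop above can also be collapsed into a single call to the built-in `sum()`, which accepts any iterable, including dictionary values:

```python
# Populations after the earlier updates to Indonesia, Pakistan and Brazil
countries = {"India": 1380004385, "USA": 331002651, "Indonesia": 273530000,
             "Pakistan": 220900000, "Brazil": 212560000, "Russia": 145934462}

# sum() over the values replaces the explicit accumulator loop
sum_populations = sum(countries.values())
print(sum_populations)  # 2563931498
```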
An item of a Python dictionary is a key-value pair. This allows us to iterate over keys and values together.
How can we use it? Let’s say you want to find the country with the largest population from the countries dictionary. Iterating over each item of the dictionary allows us to keep track of both keys and values together:
max_population = float('-inf')
max_country = ''
for country, population in countries.items():
    if population > max_population:
        max_population = population
        max_country = country
print(max_country, max_population)
And you should get:
India 1380004385
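Alternatively, the built-in `max()` with a key function finds the same answer in one line:

```python
countries = {"India": 1380004385, "USA": 331002651, "Russia": 145934462}

# max over items, comparing by value (the second element of each tuple)
max_country, max_population = max(countries.items(), key=lambda item: item[1])
print(max_country, max_population)  # India 1380004385
```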
A nested dictionary is a dictionary that consists of other dictionaries.
You can create nested dictionaries in a similar way to creating regular dictionaries.
For example, let’s say we want to create a dictionary that will have information about each country’s capital as well as its population:
countries_info = {
    "China": {"capital": "Beijing", "population": 1439323776},
    "India": {"capital": "New Delhi", "population": 1380004385},
    "USA": {"capital": "Washington, D.C.", "population": 331002651},
    "Indonesia": {"capital": "Jakarta", "population": 273523615},
    "Pakistan": {"capital": "Islamabad", "population": 220892340},
    "Brazil": {"capital": "Brasilia", "population": 212559417}
}
print(countries_info)
And you should get:
{'China': {'capital': 'Beijing', 'population': 1439323776}, 'India': {'capital': 'New Delhi', 'population': 1380004385}, 'USA': {'capital': 'Washington, D.C.', 'population': 331002651}, 'Indonesia': {'capital': 'Jakarta', 'population': 273523615}, 'Pakistan': {'capital': 'Islamabad', 'population': 220892340}, 'Brazil': {'capital': 'Brasilia', 'population': 212559417}}
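Values in a nested dictionary are accessed by chaining square brackets, one level of keys per bracket:

```python
countries_info = {
    "China": {"capital": "Beijing", "population": 1439323776},
    "India": {"capital": "New Delhi", "population": 1380004385},
}

# First bracket selects the country, second selects the inner field
print(countries_info["India"]["capital"])     # New Delhi
print(countries_info["China"]["population"])  # 1439323776
```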
This article is an introductory walkthrough of the Python dictionary and its methods, which are important to learn as they are used in many areas of programming and machine learning.
Feel free to leave comments below if you have any questions or have suggestions for some edits.
The post Everything About Python Dictionary Data Structure: Beginner’s Guide appeared first on PyShark.
What makes a great analyst, and how do you interview to find one? These are topics close to my heart and mission, and I had a great time speaking with those who feel the same at Enterprise DNA, an analytics community specializing in Power BI.
Among other topics, we discussed how we go about sizing up a dataset, how to keep a data project on track, and where we are focusing on skills improvement in the coming year.
Watch the discussion on YouTube :
When asked about how to improve as a business thinker and communicator, I suggested analysts read The Wall Street Journal. Friends of mine have made an inside joke of my WSJ obsession, but reading it really helps with sizing up industry trends and forces and knowing what’s going on in the wider business community:
(By the way, you can get plenty more analytics-related memes by signing up for my newsletter below )
I’ve really enjoyed my new-ish partnership with Enterprise DNA; stay tuned for my forthcoming R for Power BI users course with them.
I’m excited to share that I’ll be presenting to the MS Excel Toronto meetup on Weds 12/8 at 5p Eastern. The topic is “What Excel Users Should Know about Python.” This is a free online event.
I’ve presented to this meetup once before (on learning statistics in Excel) and it’s one of the best meetups going, run by Excel and all-around superstar MVP Celia Alves.
Integration with Python is one of the most highly sought-after Excel features and, if rumors are to be believed, is coming soon. So, what do you as an Excel user need to know about Python, and how should you think about combining these two data power tools? In this presentation, you’ll learn the basics of Python programming, including an introduction to the packages for analysis and visualization which have made it so popular in the data world. Then, you’ll see how to augment and automate your Excel work with your new Python skills, and where to go next.
To make the most of this interactive presentation, please have the following installed on your computer:
A recording will be made available after the event.
All slides, data and files to be used are available at this GitHub repo.
This meetup will be loosely based on my white paper, “Five things Excel users should know about Python.” You can download a copy by signing up below.
If you want to further immerse yourself in Python as an Excel user, check out my book Advancing into Analytics: From Excel to Python and R. More information about the book including how to read for free is available here.
Kick-start your journey into data analytics; read the book later
If you’re an analyst or researcher looking to level up your data analysis skills, this is the workshop and book for you. After attending this class, you’ll be in great shape to explore advanced analytics in Excel, Python and R.
I’ll send you a signed copy of my book Advancing into Analytics just for signing up for this 75-minute online workshop.
We’ll be covering topics from Chapter 1 (Foundations of Exploratory Data Analysis in Excel) and a bit of Chapter 5 (The Data Analytics Stack):
There is a plethora of Automated Machine Learning tools in the wild, implementing Machine Learning (ML) pipelines from data cleaning to model validation.
In this post, the input data set is already cleaned and pre-processed (the diabetes dataset), and the ML model is already chosen too: mlsauce's LSBoost. We are going to focus on two important steps of a ML pipeline:
LSBoost's hyperparameter tuning with GPopt on the diabetes data
Explaining LSBoost's output using the-teller's new version, 0.7.0. It's worth mentioning that LSBoost, although nonlinear, can be interpreted in the same way a linear model is (thanks to the-teller).
Install packages from PyPI:
pip install mlsauce
pip install GPopt
pip install the-teller==0.7.0
pip install matplotlib==3.1.3
Python packages for the demo:
import GPopt as gp
import mlsauce as ms
import numpy as np
import pandas as pd
import seaborn as sns
import teller as tr
import matplotlib.pyplot as plt
import matplotlib.style as style
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from time import time
# Number of boosting iterations (global variable, dangerous)
n_estimators = 250
def lsboost_cv(X_train, y_train,
               learning_rate=0.1,
               n_hidden_features=5,
               reg_lambda=0.1,
               dropout=0,
               tolerance=1e-4,
               col_sample=1,
               seed=123):
    # 5-fold cross-validation RMSE for a given set of hyperparameters
    # (np.int is deprecated in recent NumPy; the built-in int is used instead)
    estimator = ms.LSBoostRegressor(n_estimators=n_estimators,
                                    learning_rate=learning_rate,
                                    n_hidden_features=int(n_hidden_features),
                                    reg_lambda=reg_lambda,
                                    dropout=dropout,
                                    tolerance=tolerance,
                                    col_sample=col_sample,
                                    seed=seed, verbose=0)
    return -cross_val_score(estimator, X_train, y_train,
                            scoring='neg_root_mean_squared_error',
                            cv=5, n_jobs=-1).mean()
def optimize_lsboost(X_train, y_train):

    def crossval_objective(x):
        return lsboost_cv(X_train=X_train,
                          y_train=y_train,
                          learning_rate=x[0],
                          n_hidden_features=int(x[1]),
                          reg_lambda=x[2],
                          dropout=x[3],
                          tolerance=x[4],
                          col_sample=x[5])

    # Bayesian optimization of the cross-validation objective
    gp_opt = gp.GPOpt(objective_func=crossval_objective,
                      lower_bound=np.array([0.001, 5, 1e-2, 0.1, 1e-6, 0.5]),
                      upper_bound=np.array([0.4, 250, 1e4, 0.8, 0.1, 0.999]),
                      n_init=10, n_iter=190, seed=123)
    return {'parameters': gp_opt.optimize(verbose=2, abs_tol=1e-3),
            'opt_object': gp_opt}
In the diabetes dataset, the response is “a quantitative measure of disease progression one year after baseline”. The explanatory variables are:
age: age in years
sex
bmi: body mass index
bp: average blood pressure
s1: tc, total serum cholesterol
s2: ldl, low-density lipoproteins
s3: hdl, high-density lipoproteins
s4: tch, total cholesterol / HDL
s5: ltg, possibly log of serum triglycerides level
s6: glu, blood sugar level
# load dataset
dataset = load_diabetes()
X = dataset.data
y = dataset.target

# split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1,
                                                    random_state=13)

# Bayesian optimization for hyperparameters tuning
res = optimize_lsboost(X_train, y_train)
res
{'opt_object': <GPopt.GPOpt.GPOpt.GPOpt at 0x7f550e5c5f50>, 'parameters': (array([1.53620422e-01, 6.20779419e+01, 8.39242559e+02, 1.74212646e-01, 5.48527464e-02, 7.15906433e-01]), 53.61909741088658)}
# _best_ hyperparameters
parameters = res["parameters"][0]

# Adjusting LSBoost to diabetes data (training set)
estimator = ms.LSBoostRegressor(n_estimators=n_estimators,
                                learning_rate=parameters[0],
                                n_hidden_features=int(parameters[1]),
                                reg_lambda=parameters[2],
                                dropout=parameters[3],
                                tolerance=parameters[4],
                                col_sample=parameters[5],
                                seed=123, verbose=1).fit(X_train, y_train)

# predict on test set
err = estimator.predict(X_test) - y_test
print(f"\n\n Test set RMSE: {np.sqrt(np.mean(np.square(err)))}")
100%|██████████| 250/250 [00:01<00:00, 132.50it/s] Test set RMSE: 55.92500853500942
Explaining LSBoost's decisions
As a reminder, the-teller computes changes (effects) in the response (the variable to be explained) resulting from a small change in an explanatory variable.
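Conceptually, such an effect can be estimated with a finite difference: perturb one explanatory variable slightly and measure the change in the model's prediction. This is only an illustrative sketch of the idea (with a hypothetical `marginal_effect` helper and a toy linear model), not the-teller's actual implementation:

```python
import numpy as np

def marginal_effect(predict, x, j, h=1e-4):
    """Finite-difference estimate of the effect of feature j at observation x.

    predict: callable mapping a 2D array of observations to predictions
    x: 1D array, a single observation
    j: index of the explanatory variable to perturb
    h: size of the perturbation
    """
    x_plus = x.copy()
    x_plus[j] += h
    # Change in prediction divided by the change in the input
    return (predict(x_plus.reshape(1, -1))[0] - predict(x.reshape(1, -1))[0]) / h

# Toy model: y = 3*x0 + 2*x1, so the effect of x0 is approximately 3
predict = lambda X: 3 * X[:, 0] + 2 * X[:, 1]
x = np.array([1.0, 2.0])
print(marginal_effect(predict, x, j=0))  # close to 3.0
```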
# creating an Explainer object
explainer = tr.Explainer(obj=estimator)

# fitting the Explainer to unseen data
explainer.fit(X_test, y_test, X_names=dataset.feature_names, method="avg")
Heterogeneity of marginal effects:
# heterogeneity because 45 patients in test set => a distribution of effects
explainer.summary()
            mean         std      median         min         max
bmi   556.001858  198.440761  498.042418  295.134632  877.900389
s5    502.361989   56.518532  488.352521  423.339630  663.398877
bp    256.974826  121.099501  245.205494   83.019164  495.913721
s4    190.995503   69.881801  185.163689   49.870049  356.093240
s6     72.047634  100.701186   76.269634  -68.037669  229.263444
age    55.482125  185.000373   61.218433 -174.677003  329.485983
s2     -8.097623   49.166848  -10.127223  -78.075175  104.572880
s1   -141.735836   72.327037 -115.976202 -292.320955   -6.694544
s3   -146.470803  164.826337 -196.285307 -357.895526  132.102133
sex  -234.702770  162.564859 -314.707386 -415.665287   24.017851
Visualizing the average effects (new in version 0.7.0):
explainer.plot(what="average_effects")
Visualizing the distribution (heterogeneity) of effects (new in version 0.7.0):
explainer.plot(what="hetero_effects")
If you’re interested in obtaining all the individual effects for each patient, then type:
print(explainer.get_individual_effects())