Want to share your content on python-bloggers? click here.

**Feature importance** is a score assigned to the features of a machine learning model that defines how "important" a feature is to the model's predictions. It can help with feature selection, and it can give us very useful insights about our data. We will show you how to get it from the most common machine learning models.

We will use the famous Titanic Dataset from Kaggle.

```python
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer

# we used only the train dataset from the Titanic data
data = pd.read_csv('train.csv')
data = data[['Sex', 'Age', 'Embarked', 'Pclass', 'SibSp', 'Parch', 'Survived']]
data.dropna(inplace=True)
```

```python
model = LogisticRegression(random_state=1)
features = pd.get_dummies(data[['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch']], drop_first=True)
features['Age'] = data['Age']
model.fit(features, data['Survived'])

feature_importance = pd.DataFrame({'feature': list(features.columns),
                                   'feature_importance': [abs(i) for i in model.coef_[0]]})
feature_importance.sort_values('feature_importance', ascending=False)

# if you don't want the absolute value
# feature_importance = pd.DataFrame({'feature': list(features.columns),
#                                    'feature_importance': [i for i in model.coef_[0]]})
# feature_importance.sort_values('feature_importance', ascending=False)
```

```
      feature  feature_importance
3    Sex_male            2.501471
0      Pclass            1.213811
4  Embarked_Q            0.595491
5  Embarked_S            0.380094
1       SibSp            0.336785
6         Age            0.042501
2       Parch            0.029937
```

As you can see, we took the absolute value of the coefficients because we want the importance of a feature whether its effect is negative or positive. If you want to keep the sign, you can remove the **abs** function from the code. Keep in mind that you will not have this option with tree-based models like Random Forest or XGBoost.

```python
model = RandomForestClassifier()
model.fit(features, data['Survived'])
feature_importances = pd.DataFrame({'features': features.columns,
                                    'feature_importance': model.feature_importances_})
feature_importances.sort_values('feature_importance', ascending=False)
```

```
     features  feature_importance
6         Age            0.416853
3    Sex_male            0.288845
0      Pclass            0.145641
1       SibSp            0.063167
2       Parch            0.052152
5  Embarked_S            0.025383
4  Embarked_Q            0.007959
```

```python
model = smf.logit('Survived ~ Sex + Age + Embarked + Pclass + SibSp + Parch', data=data)
result = model.fit()
# note: conf_int()[1] is the *upper bound* of the 95% confidence interval,
# not the point estimate -- use result.params for the coefficients themselves
feature_importances = (pd.DataFrame(result.conf_int()[1])
                       .rename(columns={1: 'Coefficients'})
                       .eval("absolute_coefficients=abs(Coefficients)"))
feature_importances.sort_values('absolute_coefficients', ascending=False).drop('Intercept')[['absolute_coefficients']]
```

```
               absolute_coefficients
Sex[T.male]                 2.204154
Pclass                      0.959873
Embarked[T.Q]               0.329163
Parch                       0.192208
SibSp                       0.103804
Embarked[T.S]               0.084723
Age                         0.027517
```

```python
model = XGBClassifier()
model.fit(features, data['Survived'])
feature_importances = pd.DataFrame({'features': features.columns,
                                    'feature_importance': model.feature_importances_})
print(feature_importances.sort_values('feature_importance', ascending=False))
```

```
     features  feature_importance
3    Sex_male            0.657089
0      Pclass            0.163064
1       SibSp            0.067181
6         Age            0.041643
5  Embarked_S            0.029463
2       Parch            0.027073
4  Embarked_Q            0.014488
```

In most cases when we are dealing with text, we apply a word vectorizer like Count or TF-IDF. The features we feed the model form a sparse matrix, not a structured data frame with column names. However, we can still get the feature importances using the following technique.

We are using a dataset from Kaggle about **spam or ham** message classification. This is interesting because words with high importance are words that, if contained in a message, make that message more likely to be spam.

```python
df = pd.read_csv('SPAM text message 20170820 - Data.csv')
df.head()
```

```
  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
```

```python
v = CountVectorizer(ngram_range=(1, 1))
x = v.fit_transform(df['Message'])
model = LogisticRegression()
model.fit(x, df['Category'])

# we are not taking the absolute value here
# (in scikit-learn >= 1.0, use v.get_feature_names_out() instead)
feature_importance = pd.DataFrame({'feature': v.get_feature_names(),
                                   'feature_importance': [i for i in model.coef_[0]]})
feature_importance.sort_values('feature_importance', ascending=False).head(10)
```

```
       feature  feature_importance
2978     error            2.606383
7982       txt            2.178409
6521  ringtone            1.788390
7640      text            1.777959
8012        uk            1.717855
1824      call            1.709997
6438     reply            1.643512
1975      chat            1.528649
5354       new            1.441076
8519       won            1.436101
```

Here we can see how useful feature importance can be. From the example above, the word **error** is very important when classifying a message. In other words, because we didn't take the absolute value, we can say that if this word is contained in a message, the message is most likely **spam**.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Python – Predictive Hacks**.


Following up on our post about Logistic Regression on Aggregated Data in R, we will show you how to deal with grouped data when you want to perform a logistic regression in Python. Let us first create some dummy data.

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    'Gender': np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
    'Age': np.random.choice(["[<30]", "[30-65]", "[65+]"], 200, p=[0.3, 0.6, 0.1]),
    'Response': np.random.binomial(1, size=200, p=0.2)
})
df.head()
```

```
  Gender      Age  Response
0      f  [30-65]         0
1      m  [30-65]         0
2      m    [<30]         0
3      f  [30-65]         1
4      f    [65+]         0
```

Firstly, we will run a logistic regression model on the non-aggregated data. We will use the statsmodels library because it is the library we will use for the aggregated data, which makes the models easier to compare. Also, statsmodels can give us a model summary in a more classic statistical form, like R.

Tip: If you don't want to convert your categorical data into binary before performing a logistic regression, you can use the statsmodels formula API instead of scikit-learn.

```python
model = smf.logit('Response ~ Gender + Age', data=df)
result = model.fit()
print(result.summary())
```

```
                           Logit Regression Results
==============================================================================
Dep. Variable:               Response   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
Time:                        18:09:11   Log-Likelihood:                -85.502
converged:                       True   LL-Null:                       -87.934
Covariance Type:            nonrobust   LLR p-value:                    0.1821
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
================================================================================
```

Below are 3 methods we used to deal with aggregated data.

In the following code, we group our data and create columns for the responders (**Yes**) and non-responders (**No**).

```python
grouped = (df.groupby(['Gender', 'Age'])
             .agg({'Response': [sum, 'count']})
             .droplevel(0, axis=1)
             .rename(columns={'sum': 'Yes', 'count': 'Impressions'})
             .eval('No=Impressions-Yes'))
grouped.reset_index(inplace=True)
grouped
```

```
  Gender      Age  Yes  Impressions  No
0      f  [30-65]    9           38  29
1      f    [65+]    2            7   5
2      f    [<30]    8           25  17
3      m  [30-65]   17           79  62
4      m    [65+]    2           12  10
5      m    [<30]    9           39  30
```

```python
glm_binom = smf.glm('Yes + No ~ Age + Gender', grouped, family=sm.families.Binomial())
result_grouped = glm_binom.fit()
print(result_grouped.summary())
```

```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:          ['Yes', 'No']   No. Observations:                    6
Model:                            GLM   Df Residuals:                        2
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -8.9211
Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
Time:                        18:15:15   Pearson chi2:                    0.929
No. Iterations:                     5
Covariance Type:            nonrobust
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
================================================================================
```

For this method, we need to create a new column with the **response rate** of every group.

```python
grouped['RR'] = grouped['Yes'] / grouped['Impressions']
```

```python
glm = smf.glm('RR ~ Age + Gender', data=grouped, family=sm.families.Binomial(),
              freq_weights=np.asarray(grouped['Impressions']))
result_grouped2 = glm.fit()
print(result_grouped2.summary())
```

```
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                     RR   No. Observations:                    6
Model:                            GLM   Df Residuals:                      196
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -59.807
Date:                Mon, 22 Feb 2021   Deviance:                       1.2641
Time:                        18:18:16   Pearson chi2:                    0.929
No. Iterations:                     5
Covariance Type:            nonrobust
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
================================================================================
```

Lastly, we can "ungroup" our data and transform the dependent variable into binary so we can perform a logistic regression as usual.

```python
grouped['No'] = grouped['No'].apply(lambda x: [0] * x)
grouped['Yes'] = grouped['Yes'].apply(lambda x: [1] * x)
grouped['Response'] = grouped['Yes'] + grouped['No']
expanded = grouped.explode('Response')[['Gender', 'Age', 'Response']]
expanded['Response'] = expanded['Response'].astype(int)
expanded.head()
```

```
  Gender      Age  Response
0      f  [30-65]         1
0      f  [30-65]         1
0      f  [30-65]         1
0      f  [30-65]         1
0      f  [30-65]         1
```

```python
model = smf.logit('Response ~ Gender + Age', data=expanded)
result = model.fit()
print(result.summary())
```

```
                           Logit Regression Results
==============================================================================
Dep. Variable:               Response   No. Observations:                  200
Model:                          Logit   Df Residuals:                      196
Method:                           MLE   Df Model:                            3
Date:                Mon, 22 Feb 2021   Pseudo R-squ.:                 0.02765
Time:                        18:29:33   Log-Likelihood:                -85.502
converged:                       True   LL-Null:                       -87.934
Covariance Type:            nonrobust   LLR p-value:                    0.1821
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -2.1741      0.396     -5.494      0.000      -2.950      -1.399
Gender[T.m]      0.8042      0.439      1.831      0.067      -0.057       1.665
Age[T.[65+]]    -0.7301      0.786     -0.929      0.353      -2.270       0.810
Age[T.[<30]]     0.1541      0.432      0.357      0.721      -0.693       1.001
================================================================================
```

To **leave a comment** for the author, please follow the link and comment on their blog: ** Python – Predictive Hacks**.


In this article we will explore how to perform feature engineering with VectorAssembler in PySpark.

**Table of contents:**

- Introduction
- Create a SparkSession with PySpark
- Create a Spark DataFrame with PySpark
- Create a single vector column using VectorAssembler in PySpark
- Conclusion

In Python, especially when working with sklearn, most models can take raw DataFrames as input for training. In a distributed environment it is a little more complicated, as we should use assemblers to prepare our training data.

VectorAssembler from the Spark ML library is a module that combines numerical features into the single vector used by machine learning models.

As an overview, what it does is take a list of columns (features) and combine them into a single vector column (a feature vector). That vector is then used as the input to the machine learning models in Spark ML.

To follow this tutorial you will need Spark installed on your machine, plus the following Python library: pyspark.

pip install pyspark

The first step and the main entry point to all Spark functionality is the SparkSession class:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('mysession').getOrCreate()
```

As the next step we will create a simple Spark DataFrame with three features (“Age”, “Experience”, “Education”) and a target variable (“Salary”):

```python
df = spark.createDataFrame(
    [
        (20, 1, 2, 22000),
        (25, 2, 3, 30000),
        (36, 12, 6, 70000),
    ],
    ["Age", "Experience", "Education", "Salary"]
)
```

Let’s take a look:

df.show()

```
+---+----------+---------+------+
|Age|Experience|Education|Salary|
+---+----------+---------+------+
| 20|         1|        2| 22000|
| 25|         2|        3| 30000|
| 36|        12|        6| 70000|
+---+----------+---------+------+
```

For this example, the DataFrame is simple with all the data of numerical type. When working on projects with other datasets you should always correctly identify and convert the data types, check for null values, and do the required data transformations.

Our goal in this step is to combine the three numerical features (“Age”, “Experience”, “Education”) into a single vector column (let’s call it “features”).

VectorAssembler will have two parameters:

- inputCols – list of features to combine into a single vector column
- outputCol – the new column that will contain the transformed vector

Let’s create our assembler:

```python
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(
    inputCols=["Age", "Experience", "Education"],
    outputCol="features")
```

Now, using this assembler, we can transform the original dataset and take a look at the result:

```python
output = assembler.transform(df)
output.show()
```

```
+---+----------+---------+------+---------------+
|Age|Experience|Education|Salary|       features|
+---+----------+---------+------+---------------+
| 20|         1|        2| 22000| [20.0,1.0,2.0]|
| 25|         2|        3| 30000| [25.0,2.0,3.0]|
| 36|        12|        6| 70000|[36.0,12.0,6.0]|
+---+----------+---------+------+---------------+
```

Perfect! This DataFrame can now be used for training models available in Spark ML by passing the "features" vector column as your input variable and "Salary" as your target variable.

In this article we explored how to use VectorAssembler for feature engineering in PySpark.

I also encourage you to check out my other posts on Feature Engineering.

Feel free to leave comments below if you have any questions or have suggestions for some edits.

The post VectorAssembler in PySpark appeared first on PyShark.

To **leave a comment** for the author, please follow the link and comment on their blog: ** PyShark**.


Linear regression is the simplest algorithm you’ll encounter while studying machine learning. If we’re talking about *simple linear regression*, you only need to find values for two parameters – slope and the intercept – but more on that in a bit.

Today you’ll get your hands dirty implementing *simple linear regression* algorithm from scratch. This is the first of many upcoming from scratch articles, so stay tuned to the blog if you want to learn more.

Today’s article is structured as follows:

- Introduction to Simple Linear Regression
- Math Behind Simple Linear Regression
- From-Scratch Implementation
- Comparison with Scikit-Learn
- Conclusion

You can download the corresponding notebook here.

As the name suggests, simple linear regression is simple. It's an algorithm used by many in introductory machine learning, but it doesn't require any "learning". It's as simple as plugging a few values into a formula – more on that in the following section.

In general, linear regression is used to predict continuous variables – something such as stock price, weight, and similar.

Linear regression is a linear algorithm, meaning the linear relationship between input variables (what goes in) and the output variable (the prediction) is assumed. It’s not the end of the world if the relationships in your dataset aren’t linear, as there’s plenty of conversion methods.

Several types of linear regression models exist:

- **Simple linear regression** – has a single input variable and a single output variable. For example, using height to predict weight.
- **Multiple linear regression** – has multiple input variables and a single output variable. For example, using height, body fat, and BMI to predict weight.

Today we’ll deal with simple linear regression. The article on multiple linear regression is coming out next week, so stay tuned to the blog if you want to learn more.

Linear regression is rarely used as a go-to algorithm for solving complex machine learning problems. Instead, it’s used as a baseline model – a point which more sophisticated algorithms have to outperform.

The algorithm is also rather strict on the requirements. Let’s list and explain a few:

- **Linear Assumption** — the model assumes the relationship between variables is linear
- **No Noise** — the model assumes that the input and output variables are not noisy — so remove outliers if possible
- **No Collinearity** — the model will overfit when you have highly correlated input variables
- **Normal Distribution** — the model will make more reliable predictions if your input and output variables are normally distributed. If that's not the case, try using some transforms on your variables to make them more normal-looking
- **Rescaled Inputs** — use scalers or normalizers to make more reliable predictions

You now know enough theory behind this simple algorithm. Let’s look at the math next before the implementation.

In essence, simple linear regression boils down to solving a couple of equations. You only need to solve the line equation, displayed in the following figure:

As you can see, we need to calculate the beta coefficients somehow. X represents input data, so that’s something you already have at your disposal.

The Beta 1 coefficient has to be calculated first. It represents the slope of the line and can be obtained with the following formula:

The *Xi* represents the current value of the input feature, and *X* with a bar on top represents the mean of the entire variable. The same goes with *Y*, but we’re looking at the target variable instead.

Next, we have the Beta 0 coefficient. You can calculate it with the following formula:

And that’s all there is to simple linear regression! Once the coefficient values are calculated, you can plug in the number for X and get the prediction. As simple as that.
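The equations described above appear as images in the original article; written out, they are:

```latex
\hat{y} = \beta_0 + \beta_1 x,
\qquad
\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^{2}},
\qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}
```

where \(\bar{x}\) and \(\bar{y}\) are the means of the input and target variables.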

Let’s take a look at the Python implementation next.

Let's start with the library imports. You'll only need Numpy and Matplotlib for now. The `rcParams` modifications are optional, only to make the visuals look a bit better:

Now onto the algorithm implementation. Let's declare a class called `SimpleLinearRegression` with the following methods:

- `__init__()` – the constructor; contains the values for the Beta 0 and Beta 1 coefficients. These are initially set to `None`
- `fit(X, y)` – calculates the Beta 0 and Beta 1 coefficients from the input `X` and `y` parameters. After the calculation is done, the results are stored in the constructor
- `predict(X)` – makes the prediction using the line equation. It throws an error if the `fit()` method wasn't called beforehand

If you understand the math behind this simple algorithm, implementation in Python is easy. Here’s the entire code snippet for the class:
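The class itself didn't survive in this copy of the article; here is a minimal sketch consistent with the description above (the attribute names `b0` and `b1` are my choice):

```python
import numpy as np

class SimpleLinearRegression:
    def __init__(self):
        # coefficients start as None until fit() is called
        self.b0 = None  # intercept (Beta 0)
        self.b1 = None  # slope (Beta 1)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Beta 1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
        self.b1 = (np.sum((X - X.mean()) * (y - y.mean()))
                   / np.sum((X - X.mean()) ** 2))
        # Beta 0 = y_bar - Beta 1 * x_bar
        self.b0 = y.mean() - self.b1 * X.mean()
        return self

    def predict(self, X):
        if self.b0 is None or self.b1 is None:
            raise RuntimeError("Call fit() before predict()")
        return self.b0 + self.b1 * np.asarray(X, dtype=float)
```
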

Next, let’s create some **dummy data**. We’ll make a range of 300 data points as the input variable, and 300 normally distributed values as the target variable. The target variable is centered around the input variable, with a standard deviation of 20.

You can use the following code snippet to create and visualize the dataset:
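The snippet isn't included in this copy; a minimal version consistent with the description (300 input points, a target centered on the input with a standard deviation of 20; the seed and variable names are assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.arange(300)                      # 300 data points as the input variable
y = np.random.normal(loc=X, scale=20)   # target centered on X, sd = 20

plt.scatter(X, y)
plt.title("Source dataset")
plt.savefig("dataset.png")
```
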

The visualization of the dataset is shown in the following figure:

Next, let's **split the dataset** into training and testing subsets. You can use the `train_test_split()` function from Scikit-learn to do so:
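The original snippet isn't shown here; a sketch under the assumption of an 80/20 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.arange(300)
y = np.random.normal(loc=X, scale=20)

# hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```
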

Finally, let's make an instance of the `SimpleLinearRegression` class, fit the training data, and make predictions on the test set. The following code snippet does just that, and also prints the values of the Beta 0 and Beta 1 coefficients:
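This snippet is also missing; here is a condensed, self-contained equivalent that computes the coefficients directly from the formulas instead of calling the class:

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.arange(300)
y = np.random.normal(loc=X, scale=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# closed-form simple linear regression, equivalent to the fit() method
b1 = (np.sum((X_train - X_train.mean()) * (y_train - y_train.mean()))
      / np.sum((X_train - X_train.mean()) ** 2))
b0 = y_train.mean() - b1 * X_train.mean()
preds = b0 + b1 * X_test

print(f"Beta 0: {b0:.2f}, Beta 1: {b1:.2f}")
```

With this synthetic dataset, Beta 1 should come out close to 1 and Beta 0 close to 0.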

The coefficient values are displayed below:

And that's your line equation formula. Next, we need a way to **evaluate the model**. Before doing that, let's quickly see what the `preds` and `y_test` variables look like.

Here's what's inside the `preds` variable:

And here's how the actual test data looks:

Not identical, sure, but quite similar overall. For a more quantitative evaluation metric, we’ll use RMSE (Root Mean Squared Error). Here’s how to calculate its value with Python:
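The RMSE snippet isn't reproduced here; a minimal sketch (the helper name `rmse` is my choice):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

On the data above, `rmse(y_test, preds)` should come out around 20, matching the standard deviation we introduced.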

The average error is displayed below:

As you can see, our model is around 20 units wrong on average. That’s due to introduced variance when declaring the dataset, so there’s nothing we can do to improve the model further.

If you want to visualize the **best fit line**, you’d have to retrain the model on the entire dataset and plot the predictions. You can do so with the following code snippet:
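The plotting snippet didn't survive either; a self-contained sketch that retrains on the full dataset and overlays the fitted line:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.arange(300)
y = np.random.normal(loc=X, scale=20)

# retrain on the entire dataset
b1 = (np.sum((X - X.mean()) * (y - y.mean()))
      / np.sum((X - X.mean()) ** 2))
b0 = y.mean() - b1 * X.mean()
line = b0 + b1 * X

plt.scatter(X, y, s=10, label="data")
plt.plot(X, line, color="red", lw=2, label="best fit line")
plt.legend()
plt.savefig("best_fit.png")
```
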

Here's how it looks:

And that's all there is to a simple linear regression model. Let's compare it to the `LinearRegression` class from Scikit-Learn and see if there are any severe differences.

We want to know if our model is any good, so let's compare it with something we know works well – the `LinearRegression` class from Scikit-Learn.

You can use the following snippet to import the class, train the model, make predictions, and print the values for Beta 0 and Beta 1 coefficients:
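A sketch of that comparison (note that sklearn expects a 2-D feature array, hence the `reshape`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.arange(300).reshape(-1, 1)   # sklearn expects a 2-D feature array
y = np.random.normal(loc=np.arange(300), scale=20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)

print(f"Beta 0: {lr.intercept_:.2f}, Beta 1: {lr.coef_[0]:.2f}")
```
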

The coefficient values are displayed below:

As you can see, the coefficients are nearly identical! Next, let’s check the RMSE value:

Once again, nearly identical! Model quality – check.

Let’s wrap things up in the next section.

Today you've learned how to implement the simple linear regression algorithm in Python entirely from scratch.

*Does that mean you should ditch the de facto standard machine learning libraries?* No, not at all. Let me elaborate.

Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other *fit and predict* data scientist.

Thanks for reading, and please stay tuned to the blog if you’re interested in more machine learning from scratch articles.

- Are The New M1 Macbooks Any Good for Data Science? Let’s Find Out
- PyTorch + SHAP = Explainable Convolutional Neural Networks
- 3 Ways to Tune Hyperparameters of Machine Learning Models with Python
- Python Parallelism: Essential Guide to Speeding up Your Python Code in Minutes
- Concurrency in Python: How to Speed Up Your Code With Threads

- Follow me on Medium for more stories like this
- Sign up for my newsletter
- Connect on LinkedIn

The post Master Machine Learning: Simple Linear Regression From Scratch With Python appeared first on Better Data Science.

To **leave a comment** for the author, please follow the link and comment on their blog: ** python – Better Data Science**.


I have put up what I think is a really neat tutorial on how to plot multiple curves on a graph in Python, using seaborn and data_algebra.

It is a great way to show some of the data shaping theory convenience functions we have developed.

Please check it out.

To **leave a comment** for the author, please follow the link and comment on their blog: ** python – Win Vector LLC**.


As Data Scientists, we want our work to be reproducible, meaning that when we share our analysis, everyone should be able to re-run it and come up with the same results. This is not always easy, since we are dealing with different operating systems (macOS, Windows, Linux) and different programming language versions and packages. That is why we encourage you to work with virtual environments like conda environments. An even more robust solution than conda environments is to work with Docker.

**Scenario**: We have run an analysis using Python Jupyter Notebooks on our own data, and we want to share this analysis with the Predictive Hacks community ensuring that everyone will be able to reproduce the results.

For simplicity, let’s assume that I have run the following analysis:

In essence, I try to run a sentiment analysis on `my_data.csv` using the `pandas`, `numpy` and `vaderSentiment` libraries. Thus, I want to share this Jupyter notebook and have it be plug and play. Let's see how I can create a Docker image containing a Jupyter Notebook, as well as my data and the required libraries.

Jupyter Docker Stacks are a set of ready-to-run Docker images containing Jupyter applications and interactive computing tools. You can use a stack image to do any of the following (and more):

- Start a personal Jupyter Notebook server in a local Docker container
- Run JupyterLab servers for a team using JupyterHub
- Write your own project Dockerfile

We will build our custom image based on `jupyter/scipy-notebook`.

The Jupyter Docker core images contain the most common libraries, but you may need to install some extras, as in our case, where we want the `vaderSentiment==3.3.2` library. This means that we have to create a `requirements.txt` file.

The `requirements.txt` file is:

vaderSentiment==3.3.2

Now we need to create the `Dockerfile` as follows:

```dockerfile
FROM jupyter/scipy-notebook
COPY requirements.txt ./requirements.txt
COPY my_data.csv ./my_data.csv
COPY my_jupyter.ipynb ./my_jupyter.ipynb
RUN pip install -r requirements.txt
```

So, we start our image from `jupyter/scipy-notebook`, then we copy the required files from our local computer into the image. Note that we could have used paths and directories. Finally, we install the required libraries from the `requirements.txt` file.

Since we have created the Dockerfile, we are ready to build it. The command is the following. Note that you can give it any name; I chose to call it `mysharednotebook`. Tip: Do not forget the period!

$ docker build -t mysharednotebook .

If you want to make sure that your image has been created, you can type:

$ docker images

to get the docker images.

If we want to make sure that the image is running as expected we run:

$ docker run -it -p 8888:8888 mysharednotebook

And we will get a link for our jupyter notebook!

If you want to see which containers are running you can type:

$ docker ps -a

Once you make sure that the image works as expected, you can push it to Docker Hub so that everyone will be able to pull it. The first thing you need to do is tag your image.

$ docker tag 9811503b3d3a gpipis/mysharednotebook:first

The `9811503b3d3a` is the Image ID, obtained from the command `docker images`. The `gpipis` is my username, and `mysharednotebook` is the image name I created above. Finally, `:first` is an optional tag.

Now we are ready to push the image by typing:

$ docker push gpipis/mysharednotebook:first

The work above is done by the person who wants to share his/her work. Now, let’s see how we can get this image and work on the reproducible Jupyter Notebook.

What we have to do is `pull` the image by typing:

$ docker pull gpipis/mysharednotebook:first

And now we are ready to run it by typing:

$ docker run -it -p 8888:8888 gpipis/mysharednotebook:first

If we copy-paste the URL to our browser we get:

Notice that you can change the port. For example, if you want to run on `8889`, then you type:

$ docker run -it -p 8889:8888 gpipis/mysharednotebook:first

and you also have to change the port in your URL:

http://127.0.0.1:8889/?token=7e767d9a8dbb92e9d93ce7a5f52ba3c524a3cfcc65401714

When you want to share your work with many people and you want them to be able to reproduce your analysis, the best approach is to work with Docker.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Python – Predictive Hacks**.


Visualizing data beyond two dimensions isn’t a good idea – most of the time. That’s where radar charts come in, enabling you to visually represent one or more groups of values over multiple identically scaled variables.

Today you’ll learn how radar charts can visualize data across multiple dimensions, both with Matplotlib and Plotly. You’ll also learn what radar charts are and the pros and cons of using them.

The article is structured as follows:

- Introduction to Radar Charts
- Pros and Cons of Radar Charts
- Radar Charts with Matplotlib
- Radar Charts with Plotly
- Conclusion

You can download the corresponding Notebook here.

You most likely know what a radar chart is. Sometimes they’re referred to as *spider charts* or *polar charts*, but these terms represent the same idea. The goal of the radar chart is to visually represent one or more groups of values over multiple variables.

For example, let’s say you want to visually represent restaurants over some set of common variables – such as food quality, food variety, service quality, and others (*spoiler alert:* you’ll do that later). Radar charts should be a go-to visualization type for this scenario.

Each variable is given an axis, and axes are arranged radially around the center. Needless to say, but the axes are spaced equally. A single observation is then plotted along each axis like a scatter plot, but the points are then connected to form a **polygon**. You can reuse the same logic to plot multiple polygons in the same chart.

And that’s the basic idea behind radar charts. Let’s examine the pros and cons before diving into hands-on examples.

Let’s talk about the **pros** first:

- Radar charts are excellent for visualizing comparisons between observations – you can easily compare multiple attributes among different observations and see how they stack up. For example, you could use radar charts to compare restaurants based on some common variables.
- It’s easy to see overall “top performers” – the observation with the highest polygon area should be the best if you’re looking at the overall performance.

But things aren’t all sunshine and rainbows, as you can see from the following **cons** list:

- Radar charts can get confusing fast – comparing more than a handful of observations leads to a mess no one wants to look at.
- It can be tough to find the best options if there are too many variables – just imagine seeing a radar chart with 20+ variables. No one wants to even look at it, let alone interpret it.
- The variables have to be on the same scale – it makes no sense to compare student grades (ranging from 1 to 5) and satisfaction with some service (ranging from 0 to 100).

You now know what radar charts are and when it makes sense to use them. You’ll learn how to draw them with Matplotlib next.

Matplotlib is the de facto standard data visualization library for Python, which is why we’re looking at it first.

The goal is to compare three restaurants among the following categories: food quality, food variety, service quality, ambiance, and affordability. All categories range from 1 to 5, so they are a perfect candidate for visualization with radar charts.

The following code snippet demonstrates how you can specify data and categories, label locations, and visualize the chart. There are a couple of things you should know beforehand:

- `label_loc` is a list that represents the label locations in radians
- `plt.subplot(polar=True)` must be used to make a radar chart
- `plt.thetagrids()` is used to place the category names at the label locations

These might be confusing at first, but you’ll get the gist in no time. You can use the following code snippet to make the visualization:
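A minimal sketch of such a snippet, with made-up ratings for the three restaurants (the names and scores are purely illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative ratings, each from 1 to 5
categories = ['Food Quality', 'Food Variety', 'Service Quality', 'Ambiance', 'Affordability']
restaurant_1 = [4, 4, 5, 4, 3]
restaurant_2 = [5, 5, 4, 5, 2]
restaurant_3 = [3, 4, 5, 3, 5]

# Label locations in radians - one per category, spaced evenly around the circle
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(restaurant_1))

plt.figure(figsize=(8, 8))
plt.subplot(polar=True)  # polar=True turns the subplot into a radar chart
plt.plot(label_loc, restaurant_1, label='Restaurant 1')
plt.plot(label_loc, restaurant_2, label='Restaurant 2')
plt.plot(label_loc, restaurant_3, label='Restaurant 3')
plt.title('Restaurant comparison', size=20)
plt.thetagrids(np.degrees(label_loc), labels=categories)  # category names at the label locations
plt.legend()
plt.show()
```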

The figure is displayed below:

A quick look at the previous figure indicates something is wrong. The last data point isn’t connected to the first one, and you’ll need to fix that somehow. There isn’t a 100% intuitive fix, but **here’s what you should do**: add an additional element to both categories and restaurants that’s identical to the first item.

You could do this manually, but what if you don’t know what the first value is? You can use the unpacking and indexing operations to solve this issue. Here’s how:
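Assuming the lists from the previous sketch, the fix boils down to appending a copy of each list’s first element and then recomputing `label_loc`:

```python
import numpy as np

categories = ['Food Quality', 'Food Variety', 'Service Quality', 'Ambiance', 'Affordability']
restaurant_1 = [4, 4, 5, 4, 3]

# Unpacking + indexing appends the first value without needing to know what it is
restaurant_1 = [*restaurant_1, restaurant_1[0]]
categories = [*categories, categories[0]]

# label_loc needs one extra point now, so the polygon closes at 2*pi
label_loc = np.linspace(start=0, stop=2 * np.pi, num=len(restaurant_1))
```

Repeat the same appending for `restaurant_2` and `restaurant_3`, then re-run the plotting code unchanged.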

As you can see, it’s a bit tedious to write this logic every time (you could make a function out of it), but here’s how the radar chart looks now:

As you can see, a lot better!

Matplotlib isn’t widely recognized for its aesthetics, so let’s see how to produce better-looking visualization with Plotly next.

Plotly is something else. It’s easy to make highly customizable, good-looking, and interactive charts with almost the same amount of code. Radar charts are no exception.

That doesn’t mean they’re immune to the issues Matplotlib was having. You still need to manually “close” the polygon, but the result is a somewhat better-looking visualization.

The following snippet produces the same visualization created earlier with Matplotlib:

The visualization is shown below:

And that’s all there is to it! Plotly also makes it easy to fill the polygons – just specify `fill='toself'`. Here’s an example:

The visualization is shown below:

And that’s how easy it is to make radar charts with Plotly. Let’s wrap things up next.

Radar charts provide an excellent way to visualize one or more groups of values over multiple variables. Today you’ve learned how to do just that – with completely made-up restaurant satisfaction data.

Keep in mind the restrictions or cons of radar charts. They aren’t the best option if you want to visualize many observations, so stick to a single one or a couple of them at most.

- Top 5 Books to Learn Data Science in 2021
- Ridgeline Plots: The Perfect Way to Visualize Data Distributions with Python
- Python Dictionaries: Everything You Need to Know
- How to Send Beautiful Emails with Python – The Essential Guide
- Are the New M1 Macbooks Any Good for Data Science? Let’s Find Out

- Follow me on Medium for more stories like this
- Sign up for my newsletter
- Connect on LinkedIn

The post How to Make Stunning Radar Charts with Python – Implemented in Matplotlib and Plotly appeared first on Better Data Science.

To **leave a comment** for the author, please follow the link and comment on their blog: ** python – Better Data Science**.


We will show how you can build a flask API that gets URL variables. Recall that when we would like to pass a parameter to the URL we use the following syntax:

Assume that we want to pass the `name` and the `age`. Then the URL will be:

http://127.0.0.1:5000?name=Sergio&age=40

We will provide an example of how we can pass URL variables and the URL will be:

http://127.0.0.1:5000/Sergio/40

Let’s provide the code of the Flask API with these two cases:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/with_parameters')
def with_parameters():
    name = request.args.get('name')
    age = int(request.args.get('age'))
    return jsonify(message="My name is " + name + " and I am " + str(age) + " years old")

@app.route('/with_url_variables/<string:name>/<int:age>')
def with_url_variables(name: str, age: int):
    return jsonify(message="My name is " + name + " and I am " + str(age) + " years old")

if __name__ == '__main__':
    app.run()
```

Let’s see the GET requests in Postman.

**With Parameters:**

**With Variables**:
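If you’d rather verify the two routes from code instead of Postman, Flask’s built-in test client can exercise both endpoints; a minimal sketch reusing the same app definition:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/with_parameters')
def with_parameters():
    name = request.args.get('name')
    age = int(request.args.get('age'))
    return jsonify(message="My name is " + name + " and I am " + str(age) + " years old")

@app.route('/with_url_variables/<string:name>/<int:age>')
def with_url_variables(name: str, age: int):
    return jsonify(message="My name is " + name + " and I am " + str(age) + " years old")

# test_client() issues requests directly, without starting a server
client = app.test_client()
print(client.get('/with_parameters?name=Sergio&age=40').get_json())
print(client.get('/with_url_variables/Sergio/40').get_json())
```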


Amit Ness gathered an impressive list of learning resources for becoming a **data scientist**.

It’s great to see that he shares them publicly on his github so that others may follow along.

But beware, this learning guideline covers a **multi-year process**.

Amit’s personal motto seems to be “*Becoming better at data science every day*“.

Completing the **hyperlinked list below** will take you several hundred days at the least!

**Learning Philosophy**:

- The Power of Tiny Gains
- Master Adjacent Disciplines
- T-shaped skills
- Data Scientists Should Be More End-to-End
- Just in Time Learning

- Have basic business understanding
- Be able to frame an ML problem
- Be familiar with data ethics
- Be able to import data from multiple sources
- Be able to annotate data efficiently
- Be able to manipulate data with Numpy
- Be able to manipulate data with Pandas
- Be able to manipulate data in spreadsheets
- Be able to manipulate data in databases
- Be able to use the command line
- Be able to perform feature engineering
- Be able to experiment in a notebook
- Be able to visualize data
- Be able to read research papers
- Be able to model problems mathematically
- Be able to structure machine learning projects
- Be able to utilize version control
- Be able to use data version control
- Be familiar with fundamental ML algorithms
- Be familiar with fundamentals of deep learning
- Be able to implement models in scikit-learn
- Be able to implement models in Tensorflow and Keras
- Be able to implement models in PyTorch
- Be able to implement models using cloud services
- Be able to apply unsupervised learning algorithms
- Be able to implement NLP models
- Be familiar with Recommendation Systems
- Be able to implement computer vision models
- Be able to model graphs and network data
- Be able to implement models for timeseries and forecasting
- Be familiar with Reinforcement Learning
- Be able to optimize performance metrics
- Be familiar with literature on model interpretability
- Be able to optimize models for production
- Be able to write unit tests
- Be able to serve models as REST APIs
- Be able to build interactive UI for models
- Be able to deploy model to production
- Be able to perform load testing
- Be able to perform A/B testing
- Be proficient in Python
- Be familiar with compiled languages
- Have a general understanding of other parts of the stack
- Be familiar with fundamental Computer Science concepts
- Be able to apply proper software engineering process
- Be able to efficiently use a text editor
- Be able to communicate and collaborate well
- Be familiar with the hiring pipeline
- Broaden Perspective

To **leave a comment** for the author, please follow the link and comment on their blog: ** python – paulvanderlaken.com**.


Sequential execution doesn’t always make sense. For example, there’s no point in leaving the program sitting idle if the outputs aren’t dependent on one another. That’s the basic idea behind **concurrency** – a topic you’ll learn a lot about today.

This article will teach you how you can speed up your Python code by running tasks concurrently. Keep in mind – concurrent execution doesn’t mean simultaneous. For more info on simultaneous (parallel) execution, check out this article.

This article is structured as follows:

You can download the source code for this article here.

So, what is threading precisely? Put simply, it’s a programming concept that allows you to run code concurrently. Concurrency means the application runs more than one task – the first task doesn’t have to finish before the second one is started.

Let’s say you’re making a bunch of requests towards some web API. It makes no sense to send one request, wait for the response, and repeat the same process over and over again.

Concurrency enables you to send the second request while the first one waits for the response. The following image should explain the idea behind sequential and concurrent execution better than words can:

Note that a single point represents a small portion of the task. Concurrency can help to speed up the runtime if the task sits idle for a while (think request-response type of communication).

You now know the basics of threading in theory. The following section will show you how to implement it in Python.

Threading is utterly simple to implement with Python. But first, let’s describe the task.

We want to declare a function that makes a GET request to an endpoint and fetches some JSON data. The JSONPlaceholder website is perfect for the task, as it serves as a dummy API. We’ll repeat the process 1000 times and examine how long our program basically does nothing – waits for the response.

Let’s do the test without threading first. Here’s the script:
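A minimal sketch of such a script. To keep it runnable offline, the GET request to JSONPlaceholder is simulated with `time.sleep(0.1)` and the loop is shortened to 20 iterations – the actual test fires 1,000 real requests with `requests.get()`:

```python
import time

def fetch_single(post_id: int) -> None:
    print(f'Fetching post {post_id}...')
    # Stand-in for: requests.get(f'https://jsonplaceholder.typicode.com/posts/{post_id}')
    time.sleep(0.1)
    print(f'Fetched post {post_id}!')

time_start = time.time()
for post_id in range(1, 21):  # the article repeats this 1,000 times
    fetch_single(post_id)
time_end = time.time()
elapsed_sequential = time_end - time_start
print(f'All done! Took {elapsed_sequential:.2f} seconds')
```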

I reckon nothing should look unfamiliar in the above script. We’re repeating the request 1000 times and keeping track of start and end times. The print statements in the `fetch_single()` function are here for a single reason – to see how the program behaves when executed.

Here’s the output you’ll see after running this script:

As you can see, one task has to finish for the other one to start. Not an optimal behavior for our type of problem.

Let’s implement threading next. The script will look more-or-less identical, with a couple of differences:

- We need an additional import – `concurrent.futures`
- We’re not printing the last statement but returning it instead
- The `ThreadPoolExecutor()` is used for submitting and running tasks concurrently
Here’s the entire snippet:
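Under the same simulation assumption (`time.sleep(0.1)` standing in for the GET request, 20 iterations instead of 1,000), the threaded version could look like this:

```python
import time
import concurrent.futures

def fetch_single(post_id: int) -> str:
    # Returning the message instead of printing it
    time.sleep(0.1)  # stand-in for the actual GET request
    return f'Fetched post {post_id}!'

time_start = time.time()
with concurrent.futures.ThreadPoolExecutor() as executor:
    # submit() schedules each call; as_completed() yields futures as they finish
    futures = [executor.submit(fetch_single, post_id) for post_id in range(1, 21)]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]
time_end = time.time()
elapsed_threaded = time_end - time_start
print(f'All done! Took {elapsed_threaded:.2f} seconds')
```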

Once executed, you’ll see output similar to this one:

That’s all great, but is there an actual speed improvement? Let’s examine that next.

By now, you know the difference between sequential and concurrent execution and how to transform your code to execute function calls concurrently.

Let’s compare the runtime performance now. The following image summarizes runtime in seconds for the above task – making 1000 API calls:

As you can see, there’s around a 13x reduction in execution time – decent, to say the least.

Today you’ve learned a lot – from the basic theory behind threading and concurrent execution to how you can “convert” your non-concurrent code into a concurrent one.

Keep in mind that concurrency isn’t a be-all-end-all answer for speed increase with Python. Before implementing threading in your application, please consider how the app was designed. Is the output from one function directly fed as an input into another? If so, concurrency probably isn’t what you’re looking for.

On the other hand, if your app is sitting idle most of the time, “concurrent executing” might just be the term you’ve been waiting for.

Thanks for reading.

- Python Parallelism: Essential Guide to Speeding up Your Python Code in Minutes
- Are The New M1 Macbooks Any Good for Data Science? Let’s Find Out
- How to Create PDF Reports with Python – The Essential Guide
- How to Build and Deploy a Machine Learning Model with FastAPI
- PyTorch + SHAP = Explainable Convolutional Neural Networks

- Follow me on Medium for more stories like this
- Sign up for my newsletter
- Connect on LinkedIn

The post Concurrency in Python: How to Speed Up Your Code With Threads appeared first on Better Data Science.

To **leave a comment** for the author, please follow the link and comment on their blog: ** python – Better Data Science**.

Want to share your content on python-bloggers? click here.