This post is about LSBoost, an explainable ‘AI’ algorithm that uses gradient-boosted randomized networks for pattern recognition. As discussed last week, LSBoost is a cousin of GFAGBM’s LS_Boost. More specifically, in LSBoost the so-called weak learners from LS_Boost are based on components of randomized neural networks and on variants of least squares regression models.
I’ve already presented some promising examples of LSBoost based on Ridge regression weak learners. As of mlsauce version 0.7.1, the Lasso can also be used as an alternative ingredient in the weak learners. Here is a comparison of the regression coefficients obtained with mlsauce’s implementations of Ridge regression and the Lasso:
The following example compares training-set error with testing-set error, as a function of the regularization parameter, for both Ridge-based and Lasso-based weak learners.
# 0 - Packages and data -------------------------------------------------------
library(devtools)
devtools::install_github("thierrymoudiki/mlsauce/R-package")
library(mlsauce)
library(datasets)
print(summary(datasets::mtcars))
X <- as.matrix(datasets::mtcars[, -1])
y <- as.integer(datasets::mtcars[, 1])
n <- dim(X)[1]
p <- dim(X)[2]
set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index
X_train <- as.matrix(X[train_index, ])
y_train <- as.double(y[train_index])
X_test <- as.matrix(X[test_index, ])
y_test <- as.double(y[test_index])
LSBoost using Ridge regression
# 1 - Ridge -------------------------------------------------------------------
obj <- mlsauce::LSBoostRegressor() # default h is Ridge
print(obj$get_params())
n_lambdas <- 100
lambdas <- 10**seq(from=-6, to=6,
length.out = n_lambdas)
rmse_matrix <- matrix(NA, nrow = 2, ncol = n_lambdas)
rownames(rmse_matrix) <- c("training rmse", "testing rmse")
for (j in 1:n_lambdas)
{
obj$set_params(reg_lambda = lambdas[j])
obj$fit(X_train, y_train)
rmse_matrix[, j] <- c(sqrt(mean((obj$predict(X_train) - y_train)**2)),
sqrt(mean((obj$predict(X_test) - y_test)**2)))
}
LSBoost using the Lasso
# 2 - Lasso -------------------------------------------------------------------
obj <- mlsauce::LSBoostRegressor(solver = "lasso")
print(obj$get_params())
n_lambdas <- 100
lambdas <- 10**seq(from=-6, to=6,
length.out = n_lambdas)
rmse_matrix2 <- matrix(NA, nrow = 2, ncol = n_lambdas)
rownames(rmse_matrix2) <- c("training rmse", "testing rmse")
for (j in 1:n_lambdas)
{
obj$set_params(reg_lambda = lambdas[j])
obj$fit(X_train, y_train)
rmse_matrix2[, j] <- c(sqrt(mean((obj$predict(X_train) - y_train)**2)),
sqrt(mean((obj$predict(X_test) - y_test)**2)))
}
> print(session_info())
─ Session info ─────────────────────────────────────────────────────────────
setting value
version R version 4.0.2 (2020-06-22)
os Ubuntu 16.04.6 LTS
system x86_64, linux-gnu
ui RStudio
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Etc/UTC
date 2020-07-31
─ Packages ─────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.2)
backports 1.1.8 2020-06-17 [1] RSPM (R 4.0.2)
callr 3.4.3 2020-03-28 [1] RSPM (R 4.0.2)
cli 2.0.2 2020-02-28 [1] RSPM (R 4.0.2)
crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.2)
curl 4.3 2019-12-02 [1] RSPM (R 4.0.2)
desc 1.2.0 2018-05-01 [1] RSPM (R 4.0.2)
devtools * 2.3.1 2020-07-21 [1] RSPM (R 4.0.2)
digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.2)
ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.2)
fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.2)
fs 1.4.2 2020-06-30 [1] RSPM (R 4.0.2)
glue 1.4.1 2020-05-13 [1] RSPM (R 4.0.2)
jsonlite 1.7.0 2020-06-25 [1] RSPM (R 4.0.2)
lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.2)
magrittr 1.5 2014-11-22 [1] RSPM (R 4.0.2)
Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.2)
memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.2)
mlsauce * 0.7.1 2020-07-31 [1] Github (thierrymoudiki/mlsauce@68e391a)
pkgbuild 1.1.0 2020-07-13 [1] RSPM (R 4.0.2)
pkgload 1.1.0 2020-05-29 [1] RSPM (R 4.0.2)
prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.0.2)
processx 3.4.3 2020-07-05 [1] RSPM (R 4.0.2)
ps 1.3.3 2020-05-08 [1] RSPM (R 4.0.2)
R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.2)
rappdirs 0.3.1 2016-03-28 [1] RSPM (R 4.0.2)
Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
remotes 2.2.0 2020-07-21 [1] RSPM (R 4.0.2)
reticulate 1.16 2020-05-27 [1] RSPM (R 4.0.2)
rlang 0.4.7 2020-07-09 [1] RSPM (R 4.0.2)
rprojroot 1.3-2 2018-01-03 [1] RSPM (R 4.0.2)
rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.2)
sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.2)
testthat 2.3.2 2020-03-02 [1] RSPM (R 4.0.2)
usethis * 1.6.1 2020-04-29 [1] RSPM (R 4.0.2)
withr 2.2.0 2020-04-20 [1] RSPM (R 4.0.2)
[1] /home/rstudio-user/R/x86_64-pc-linux-gnu-library/4.0
[2] /opt/R/4.0.2/lib/R/library
No post in August
In our last post, we ran through a bunch of weighting scenarios using our returns simulation. This resulted in three million portfolios composed, in part or in whole, of four assets: stocks, bonds, gold, and real estate. These simulations relaxed the allocation constraints to allow us to exclude assets, yielding a wider range of return and risk results, while lowering the likelihood of achieving our risk and return targets. We bucketed the portfolios to simplify the analysis around the risk-return trade-off. We then calculated the median return and risk for each bucket and found that some buckets achieved Sharpe ratios close to, or better than, the one implied by our original risk-return constraint. Cutting the data further, we calculated the average weights for the better Sharpe ratio portfolios. The result: relatively equal weighting tended to produce a better risk-reward outcome than significant overweighting.
At the end of the post we noted that we could have bypassed much of this data wrangling and simply calculated the optimal portfolio weights for various risk profiles using mean-variance optimization. That is what we plan to do today.
The madness behind all this data wrangling was to identify the best return afforded by a given level of risk. Mean-variance optimization (MVO) solves that problem more elegantly than our “hacky” methods. It uses quadratic programming^{1} to minimize the portfolio variance by altering the weights of the various assets in the portfolio, subject to the constraints (in the simplest form) that the portfolio’s return equals a target expected return^{2} and that the weights of the assets sum to one.
More formally it can be expressed as follows:
Minimize: \(\frac{1}{2}w'\Sigma w\)
Subject to: \(r'w = \mu \quad \text{and} \quad e'w = 1\)
Here \(w\) is the vector of asset weights, \(\Sigma\) the covariance matrix of the assets, \(r\) the vector of asset returns, \(\mu\) the expected (target) return of the portfolio, and \(e\) a vector of ones. It is understood that we are using matrix notation, so \(w'\) is the transpose of \(w\).
If you understood that, it’s probably the roughest rendition of MVO you’ve ever seen; if you didn’t, don’t worry about it. The point is that, through some nifty math, you can solve for the precise weights such that every portfolio falling along a certain line has the lowest volatility for a given level of return, or the highest return for a given level of volatility. That line is called the efficient frontier: “efficient” because in econospeak it means every asset is optimally allocated, and “frontier”, well, you get that one, we hope.
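For the equality-constrained version above, you don't even strictly need a QP solver: the first-order (KKT) conditions form a linear system in the weights and two Lagrange multipliers. Here is a minimal numpy sketch with toy numbers (short sales allowed, since there is no non-negativity constraint; the quadprog-based R code later in this post additionally enforces non-negative weights):

```python
import numpy as np

def mvo_weights(Sigma, r, mu):
    """Minimize 0.5 * w' Sigma w subject to r'w = mu and e'w = 1.
    Equality constraints only, so short positions are allowed."""
    n = len(r)
    e = np.ones(n)
    # KKT system: [[Sigma, r, e], [r', 0, 0], [e', 0, 0]] @ [w, l1, l2] = [0, mu, 1]
    A = np.zeros((n + 2, n + 2))
    A[:n, :n] = Sigma
    A[:n, n], A[:n, n + 1] = r, e
    A[n, :n], A[n + 1, :n] = r, e
    b = np.concatenate([np.zeros(n), [mu, 1.0]])
    return np.linalg.solve(A, b)[:n]  # drop the two Lagrange multipliers

# Toy case: two uncorrelated assets with made-up annual figures
Sigma = np.diag([0.04, 0.01])
r = np.array([0.08, 0.03])
w = mvo_weights(Sigma, r, mu=0.05)
print(w)               # with two assets and two constraints: [0.4, 0.6]
print(w.sum(), r @ w)  # both constraints hold
```

Sweeping `mu` over a grid of target returns and recording the resulting risk of each solution traces out the frontier.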
What does this look like in practice? Let’s bring back our original portfolio, run the simulations, and then calculate the efficient frontier. We graph our original simulation with the original weighting constraint (all assets are in the portfolio) below.
Recall that after we ran this simulation we averaged the weightings for those portfolios that achieved our constraints of not less than a 7% return and not more than 10% risk on an annual basis. We then applied that weighting to our first five-year test period. We show the weighting below.
Before we look at the forward returns and the efficient frontier, let’s see where our portfolio lies in the original simulation to orient ourselves. It’s the red dot.
As is clear, the portfolio ends up in the higher end of the continuum, but there are other portfolios that dominate it. Now the moment we’ve been waiting for—portfolio optimization! Taking a range of returns between the minimum and maximum of the simulated portfolios, we’ll calculate the optimal weights to produce the highest return for the lowest amount of risk.
Wow! That optimization stuff sure does work. The blue line representing the efficient frontier clearly shows that there are other portfolios that could generate much higher returns for the implied level of risk we’re taking on. Alternatively, if we move horizontally to the left we see that we could achieve the same level of return at a much lower level of risk, shown by where the blue line crosses above 7% return.
Recall for illustrative purposes we used a simple version for the original weight simulation that required an investment in all assets. When we relax that constraint, we get a much wider range of outcomes, as we pointed out in the last post. What if we ran the weighting simulation with the relaxed constraint? What would our simulation and allocation look like in that case? We show those results below.
We see a much broader range of outcomes, which yields a higher weighting to bonds and a lower one to gold than the previous portfolio. Now we’ll overlay the placement of our satisfactory portfolio on the broader weight simulation along with the efficient frontier in the graph below.
Who needs mean-variance optimization when you’ve got data science simulation?! As one can see, when you allow portfolio weights to approach zero in many, but not all, of the assets, you can approximate the efficient frontier without having to rely on quadratic programming. This should give new meaning to “p-hacking.” Still, quadratic programming is likely to be a lot faster than running thousands of simulations with a large portfolio of assets. Recall that for the four-asset portfolio, relaxing the inclusion constraint tripled the number of simulations. Hence, for any simulation in which some portfolios won’t be invested in all the assets, the number of calculations increases by a factor of the total number of assets minus one.
Whatever the case, we see that the satisfactory portfolio may not be that satisfactory given how much it’s dominated by the efficient frontier. Recall, however, we weren’t trying to achieve an optimal portfolio per se. We “just” wanted a portfolio that would meet our risk-return constraints.
Let’s see what happens when we use our satisfactory portfolio’s weights on the first five-year test period. In the graph below, we calculate our portfolio’s risk and return and then place it within our weight simulation scatter plot. We also calculate the risk and returns of various portfolios using the weights we derived from our efficient frontier above and add them to the graph as the blue line.
Uh oh, not so efficient. The weights from the previous efficient frontier did not produce optimal portfolios in the future, and they traced out an unusual shape too. This illustrates one of the main problems with mean-variance optimization: “optimal weights are sensitive to return estimates”. In other words, if your estimates of returns aren’t that great, your optimal portfolio weights won’t be so optimal. Moreover, even if your estimates reflect all presently available information, that doesn’t mean they’ll be accurate in the future.
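A stylized numpy illustration of that sensitivity (made-up numbers, not our simulation data): with two highly correlated assets, nudging one expected return by half a percentage point flips the unconstrained mean-variance weights dramatically.

```python
import numpy as np

# Two hypothetical assets, both 20% volatility, correlation 0.95
Sigma = np.array([[0.040, 0.038],
                  [0.038, 0.040]])

def mv_weights(Sigma, r):
    """Unconstrained mean-variance weights: proportional to Sigma^{-1} r,
    rescaled to sum to one (short sales allowed)."""
    raw = np.linalg.solve(Sigma, r)
    return raw / raw.sum()

w_base = mv_weights(Sigma, np.array([0.080, 0.075]))  # base estimates
w_bump = mv_weights(Sigma, np.array([0.080, 0.080]))  # asset 2 up 0.5%-pt
print(w_base)  # roughly [1.13, -0.13]: heavily long asset 1, short asset 2
print(w_bump)  # [0.5, 0.5]: a tiny estimate change, a huge reallocation
```

When assets are near substitutes, the optimizer leans hard on small differences in expected returns, so small estimation errors produce large swings in the “optimal” weights.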
A great way to see this is to calculate the efficient frontier using as much of the data as we have, ignoring incomplete cases (which introduces bias), and to plot that against the original and first five-year simulations.
You win some; you lose some. As is evident, different return estimates yield different frontiers, both retrospectively and prospectively. Should we be as skeptical of mean-variance optimization as Warren Buffett is of “geeks bearing gifts”? Not really. It’s an elegant solution to the thorny problem of portfolio construction. But it’s not very dynamic, and it doesn’t exactly allow for much uncertainty around estimates.
There have been a number of attempts to address such shortcomings, including multi-period models, inter-temporal models, and even a statistics-free approach, among others. Even summarizing these different approaches would take us far afield of this post. Suffice it to say, there isn’t a clear winner; instead, each refinement addresses a particular issue or fits a particular risk preference.
We’ve now partially revealed why we’ve been talking about a “satisfactory” portfolio all along. It’s the trade-off between satisficing and optimal. While we cannot possibly discuss all the nuances of satisficing now, our brief explanation is this. Satisficing is finding the best available solution when the optimal one is uncertain or unattainable. The concept was developed by Herbert Simon, who argued that decision makers could choose an optimal solution to a simplified reality or a satisfactory solution to a messy one.
If the “optimal” solution to portfolio allocation is a moving target with multiple approaches to calculating it, many of which involve a great deal of complexity, then electing a “good-enough” solution might be more satisfactory. The cost of becoming conversant in the technical details necessary to understand some of the solutions, let alone compiling all the data necessary, could be prohibitive. Of course, if you’re a fund manager being paid to outperform (i.e., beat everyone else trying to beat you), then it behooves you to seek out these arcane solutions if your competitors are apt to use them too.
This discussion explains, in part, why the “simple” 1/n or 60/40 stock/bond portfolios are so popular. The exercise of mean-variance optimization and all its offshoots may simply be too much effort if the answers it gives aren’t dramatically better than a simplified approach. But it would be wrong to lay the blame for poor results or uncertainty on MVO: financial markets have way more noise than signal.
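Part of the appeal of the “simple” portfolios is how little machinery they need: given mean returns and a covariance matrix, the annualized return, risk, and Sharpe ratio of a 1/n allocation are one-liners. A quick sketch with hypothetical monthly figures (not our actual data):

```python
import numpy as np

# Hypothetical monthly mean returns and covariance for four assets
mu = np.array([0.007, 0.004, 0.003, 0.005])
Sigma = np.array([[0.0020, 0.0002, 0.0001, 0.0008],
                  [0.0002, 0.0010, 0.0000, 0.0001],
                  [0.0001, 0.0000, 0.0016, 0.0001],
                  [0.0008, 0.0001, 0.0001, 0.0014]])

w = np.repeat(0.25, 4)                   # the "simple" 1/n portfolio
ann_ret = 12 * w @ mu                    # annualized return
ann_risk = np.sqrt(12 * w @ Sigma @ w)   # annualized volatility
print(ann_ret, ann_risk, ann_ret / ann_risk)
```

No optimizer, no return forecasts: only the realized moments, which is exactly why we want to see how such portfolios stack up against the satisfactory and optimal ones.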
In pursuit of the signal, our next posts will look at the “simple” portfolios and see what they produce over multiple simulations relative to the satisfactory and optimal portfolios we’ve already discussed. If you think this blog is producing more noise than signal or vice versa, we want to know! Our email address is after the R and Python code below.
R code:
# Written in R 3.6.2
# Code for any source('function.R') is found at the end.
## Load packages
suppressPackageStartupMessages({
library(tidyquant)
library(tidyverse)
library(quadprog)
})
## Load data
df <- readRDS("port_const.rds")
dat <- readRDS("port_const_long.rds")
sym_names <- c("stock", "bond", "gold", "realt", "rfr")
## Call simulation functions
source("Portfolio_simulation_functions.R")
## Run simulation
set.seed(123)
port_sim_1 <- port_sim(df[2:61,2:5],1000,4)
## Graph
port_sim_1$graph +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Run selection function and graph
results_1 <- port_select_func(port_sim_1, 0.07, 0.1, sym_names[1:4])
results_1$graph
# Create satisfactory portfolio
satis_ret <- sum(results_1$port_wts*colMeans(df[2:61, 2:5]))
satis_risk <- sqrt(as.numeric(results_1$port_wts) %*%
cov(df[2:61, 2:5]) %*% as.numeric(results_1$port_wts))
port_satis <- data.frame(returns = satis_ret, risk = satis_risk)
# Graph with simulated
port_sim_1$graph +
geom_point(data = port_satis,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Find efficient frontier
source("Efficient_frontier.R")
eff_port <- eff_frontier_long(df[2:61,2:5], risk_increment = 0.01)
df_eff <- data.frame(returns = eff_port$exp_ret, risk = eff_port$stdev)
port_sim_1$graph +
geom_line(data = df_eff,
aes(risk*sqrt(12)*100, returns*1200),
color = 'blue',
size = 1.5,
linetype = "dashed") +
geom_point(data = port_satis,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
# Simulation with leaving out assets
port_sim_1lv <- port_sim_lv(df[2:61,2:5],1000,4)
lv_graf <- port_sim_1lv$graph +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA),
plot.title = element_text(size=10))
## Run selection function
results_1lv <- port_select_func(port_sim_1lv, 0.07, 0.1, sym_names[1:4])
lv_res_graf <- results_1lv$graph +
theme(plot.title = element_text(size=10))
gridExtra::grid.arrange(lv_graf, lv_res_graf, ncol=2)
## Create satisfactory data frame and graph leave out portfolios with efficient frontier
satis_ret_lv <- sum(results_1lv$port_wts*colMeans(df[2:61, 2:5]))
satis_risk_lv <- sqrt(as.numeric(results_1lv$port_wts) %*%
cov(df[2:61, 2:5]) %*% as.numeric(results_1lv$port_wts))
port_satis_lv <- data.frame(returns = satis_ret_lv, risk = satis_risk_lv)
port_sim_1lv$graph +
geom_line(data = df_eff,
aes(risk*sqrt(12)*100, returns*1200),
color = 'blue',
size = 1.5,
linetype = "dashed") +
geom_point(data = port_satis_lv,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Run function and create actual portfolio and data frame for graph
port_1_act <- rebal_func(df[62:121,2:5],results_1lv$port_wts)
port_act <- data.frame(returns = mean(port_1_act$ret_vec),
risk = sd(port_1_act$ret_vec),
sharpe = mean(port_1_act$ret_vec)/sd(port_1_act$ret_vec)*sqrt(12))
## Simulate portfolios on first five-year period
set.seed(123)
port_sim_2 <- port_sim_lv(df[62:121,2:5], 1000, 4)
eff_ret1 <- apply(eff_port[,1:4], 1, function(x) x %*% colMeans(df[62:121, 2:5]))
eff_risk1 <- sqrt(apply(eff_port[,1:4],
1,
function(x)
as.numeric(x) %*% cov(df[62:121,2:5]) %*% as.numeric(x)))
eff_port1 <- data.frame(returns = eff_ret1, risk = eff_risk1)
## Graph simulation with chosen portfolio
port_sim_2$graph +
geom_point(data = port_act,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
geom_line(data = eff_port1,
aes(risk*sqrt(12)*100, returns*1200),
color = 'red',
size = 2) +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Using longer term data
eff_port_old <- eff_frontier_long(dat[1:253,2:5], risk_increment = 0.01)
df_eff_old <- data.frame(returns = eff_port_old$exp_ret, risk = eff_port_old$stdev)
p1 <- port_sim_1lv$graph +
geom_line(data = df_eff_old,
aes(risk*sqrt(12)*100, returns*1200),
color = 'blue',
size = 1.5) +
geom_point(data = port_satis,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA),
plot.title = element_text(size=10)) +
labs(title = 'Simulated portfolios with long-term optimization')
# For forward graph
eff_ret1_old <- apply(eff_port_old[,1:4], 1,
function(x) x %*% colMeans(dat[1:253, 2:5], na.rm = TRUE))
eff_risk1_old <- sqrt(apply(eff_port_old[,1:4],
1,
function(x)
as.numeric(x) %*%
cov(dat[1:253,2:5],
use = 'pairwise.complete.obs') %*%
as.numeric(x)))
eff_port1_old <- data.frame(returns = eff_ret1_old, risk = eff_risk1_old)
## Graph simulation with chosen portfolio
p2 <- port_sim_2$graph +
geom_point(data = port_act,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
geom_line(data = eff_port1_old,
aes(risk*sqrt(12)*100, returns*1200),
color = 'blue',
size = 2) +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA),
plot.title = element_text(size=10)) +
labs(title = 'Forward portfolios with long-term optimization')
gridExtra::grid.arrange(p1, p2, ncol=2)
#### Portfolio_simulation_functions.R
# Portfolio simulations
## Portfolio simulation function
port_sim <- function(df, sims, cols){
if(ncol(df) != cols){
stop("Columns don't match") # break is only valid inside a loop; stop() halts with an error
}
# Create weight matrix
wts <- matrix(nrow = sims, ncol = cols)
for(i in 1:sims){
a <- runif(cols,0,1)
b <- a/sum(a)
wts[i,] <- b
}
# Find returns
mean_ret <- colMeans(df)
# Calculate covariance matrix
cov_mat <- cov(df)
# Calculate random portfolios
port <- matrix(nrow = sims, ncol = 2)
for(i in 1:sims){
port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
}
colnames(port) <- c("returns", "risk")
port <- as.data.frame(port)
port$Sharpe <- port$returns/port$risk*sqrt(12)
max_sharpe <- port[which.max(port$Sharpe),]
graph <- port %>%
ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) +
geom_point(size = 1.2, alpha = 0.4) +
scale_color_gradient(low = "darkgrey", high = "darkblue") +
labs(x = "Risk (%)",
y = "Return (%)",
title = "Simulated portfolios")
out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts)
}
## Portfolio Simulation leave
port_sim_lv <- function(df, sims, cols){
if(ncol(df) != cols){
stop("Columns don't match") # break is only valid inside a loop; stop() halts with an error
}
# Create weight matrix
wts <- matrix(nrow = (cols-1)*sims, ncol = cols)
count <- 1
for(i in 1:(cols-1)){
for(j in 1:sims){
a <- runif((cols-i+1),0,1)
b <- a/sum(a)
c <- sample(c(b,rep(0,i-1)))
wts[count,] <- c
count <- count+1
}
}
# Find returns
mean_ret <- colMeans(df)
# Calculate covariance matrix
cov_mat <- cov(df)
# Calculate random portfolios
port <- matrix(nrow = (cols-1)*sims, ncol = 2)
for(i in 1:nrow(port)){
port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
}
colnames(port) <- c("returns", "risk")
port <- as.data.frame(port)
port$Sharpe <- port$returns/port$risk*sqrt(12)
max_sharpe <- port[which.max(port$Sharpe),]
graph <- port %>%
ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) +
geom_point(size = 1.2, alpha = 0.4) +
scale_color_gradient(low = "darkgrey", high = "darkblue") +
labs(x = "Risk (%)",
y = "Return (%)",
title = "Simulated portfolios")
out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts)
}
## Load portfolio selection function
port_select_func <- function(port, return_min, risk_max, port_names){
port_select <- cbind(port$port, port$wts)
port_wts <- port_select %>%
mutate(returns = returns*12,
risk = risk*sqrt(12)) %>%
filter(returns >= return_min,
risk <= risk_max) %>%
summarise_at(vars(4:7), mean) %>%
`colnames<-`(port_names)
p <- port_wts %>%
rename("Stocks" = stock,
"Bonds" = bond,
"Gold" = gold,
"Real estate" = realt) %>%
gather(key,value) %>%
ggplot(aes(reorder(key,value), value*100 )) +
geom_bar(stat='identity', position = "dodge", fill = "blue") +
geom_text(aes(label=round(value,2)*100), vjust = -0.5) +
scale_y_continuous(limits = c(0,max(port_wts*100+2))) +
labs(x="",
y = "Weights (%)",
title = "Average weights for risk-return constraints")
out <- list(port_wts = port_wts, graph = p)
out
}
## Function for portfolio returns without rebalancing
rebal_func <- function(act_ret, weights){
ret_vec <- c()
wt_mat <- matrix(nrow = nrow(act_ret), ncol = ncol(act_ret))
for(i in 1:nrow(wt_mat)){
wt_ret <- act_ret[i,]*weights # wt'd return
ret <- sum(wt_ret) # total return
ret_vec[i] <- ret
weights <- (weights + wt_ret)/(sum(weights)+ret) # new weight based on change in asset value
wt_mat[i,] <- as.numeric(weights)
}
out <- list(ret_vec = ret_vec, wt_mat = wt_mat)
out
}
#### Efficient_frontier.R
# Adapted from https://www.nexteinstein.org/wp-content/uploads/sites/6/2017/01/ORIG_Portfolio-Optimization-Using-R_Pseudo-Code.pdf
eff_frontier_long <- function(returns, risk_premium_up = 0.5, risk_increment = 0.005){
covariance <- cov(returns, use = "pairwise.complete.obs")
num <- ncol(covariance)
Amat <- cbind(1, diag(num))
bvec <- c(1, rep(0, num))
meq <- 1
risk_steps <- risk_premium_up/risk_increment+1
count <- 1
eff <- matrix(nrow = risk_steps, ncol = num + 3)
colnames(eff) <- c(colnames(returns), "stdev", "exp_ret", "sharpe")
loop_step <- seq(0, risk_premium_up, risk_increment)
for(i in loop_step){
dvec <- colMeans(returns, na.rm = TRUE)*i
sol <- quadprog::solve.QP(covariance, dvec = dvec, Amat = Amat, bvec = bvec, meq = meq)
eff[count, "stdev"] <- sqrt(sum(sol$solution * colSums(covariance * sol$solution)))
eff[count, "exp_ret"] <- as.numeric(sol$solution %*% colMeans(returns, na.rm = TRUE))
eff[count, "sharpe"] <- eff[count,"exp_ret"]/eff[count, "stdev"]
eff[count, 1:num] <- sol$solution
count <- count + 1
}
return(as.data.frame(eff))
}
Python code:
# Load libraries
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
# SKIP IF ALREADY HAVE DATA
# Load data
start_date = '1970-01-01'
end_date = '2019-12-31'
symbols = ["WILL5000INDFC", "BAMLCC0A0CMTRIV", "GOLDPMGBD228NLBM", "CSUSHPINSA", "DGS5"]
sym_names = ["stock", "bond", "gold", "realt", 'rfr']
filename = 'data_port_const.pkl'
try:
df = pd.read_pickle(filename)
print('Data loaded')
except FileNotFoundError:
print("File not found")
print("Loading data", 30*"-")
data = web.DataReader(symbols, 'fred', start_date, end_date)
data.columns = sym_names
data_mon = data.resample('M').last()
df = data_mon.pct_change()['1987':'2019']
# df.to_pickle(filename) # If you haven't saved the file
dat = data_mon.pct_change()['1971':'2019']
# pd.to_pickle(df,filename) # if you haven't saved the file
# Portfolio simulation functions
## Simulation function
class Port_sim:
    @staticmethod
    def calc_sim(df, sims, cols):
        wts = np.zeros((sims, cols))
        for i in range(sims):
            a = np.random.uniform(0, 1, cols)
            b = a / np.sum(a)
            wts[i, ] = b
        mean_ret = df.mean()
        port_cov = df.cov()
        port = np.zeros((sims, 2))
        for i in range(sims):
            port[i, 0] = np.sum(wts[i, ] * mean_ret)
            port[i, 1] = np.sqrt(np.dot(np.dot(wts[i, ].T, port_cov), wts[i, ]))
        sharpe = port[:, 0] / port[:, 1] * np.sqrt(12)
        best_port = port[np.where(sharpe == max(sharpe))]
        max_sharpe = max(sharpe)
        return port, wts, best_port, sharpe, max_sharpe

    @staticmethod
    def calc_sim_lv(df, sims, cols):
        wts = np.zeros(((cols - 1) * sims, cols))
        count = 0
        for i in range(1, cols):
            for j in range(sims):
                a = np.random.uniform(0, 1, cols - i + 1)
                b = a / np.sum(a)
                # Shuffle the (cols - i + 1) non-zero weights together with
                # (i - 1) zeros so each row still sums to one, mirroring the
                # R version's sample(c(b, rep(0, i - 1)))
                c = np.random.permutation(np.concatenate((b, np.zeros(i - 1))))
                wts[count, ] = c
                count += 1
        mean_ret = df.mean()
        port_cov = df.cov()
        port = np.zeros(((cols - 1) * sims, 2))
        for i in range((cols - 1) * sims):  # iterate over every simulated portfolio
            port[i, 0] = np.sum(wts[i, ] * mean_ret)
            port[i, 1] = np.sqrt(np.dot(np.dot(wts[i, ].T, port_cov), wts[i, ]))
        sharpe = port[:, 0] / port[:, 1] * np.sqrt(12)
        best_port = port[np.where(sharpe == max(sharpe))]
        max_sharpe = max(sharpe)
        return port, wts, best_port, sharpe, max_sharpe

    @staticmethod
    def graph_sim(port, sharpe):
        plt.figure(figsize=(14, 6))
        plt.scatter(port[:, 1] * np.sqrt(12) * 100, port[:, 0] * 1200, marker='.', c=sharpe, cmap='Blues')
        plt.colorbar(label='Sharpe ratio', orientation='vertical', shrink=0.25)
        plt.title('Simulated portfolios', fontsize=20)
        plt.xlabel('Risk (%)')
        plt.ylabel('Return (%)')
        plt.show()

# Constraint function
def port_select_func(port, wts, return_min, risk_max):
    port_select = pd.DataFrame(np.concatenate((port, wts), axis=1))
    port_select.columns = ['returns', 'risk', 1, 2, 3, 4]
    port_wts = port_select[(port_select['returns'] * 12 >= return_min) &
                           (port_select['risk'] * np.sqrt(12) <= risk_max)]
    port_wts = port_wts.iloc[:, 2:6]
    port_wts = port_wts.mean(axis=0)
    return port_wts

def port_select_graph(port_wts):
    plt.figure(figsize=(12, 6))
    key_names = {1: "Stocks", 2: "Bonds", 3: "Gold", 4: "Real estate"}
    lab_names = []
    graf_wts = port_wts.sort_values() * 100
    for i in range(len(graf_wts)):
        name = key_names[graf_wts.index[i]]
        lab_names.append(name)
    plt.bar(lab_names, graf_wts, color='blue')
    plt.ylabel("Weight (%)")
    plt.title("Average weights for risk-return constraint", fontsize=15)
    for i in range(len(graf_wts)):
        plt.annotate(str(round(graf_wts.values[i])), xy=(lab_names[i], graf_wts.values[i] + 0.5))
    plt.show()

# Return function with no rebalancing
def rebal_func(act_ret, weights):
    ret_vec = np.zeros(len(act_ret))
    wt_mat = np.zeros((len(act_ret), len(act_ret.columns)))
    for i in range(len(act_ret)):
        wt_ret = act_ret.iloc[i, :].values * weights  # weighted return
        ret = np.sum(wt_ret)                          # total return
        ret_vec[i] = ret
        # New weights based on the change in asset values
        weights = (weights + wt_ret) / (np.sum(weights) + ret)
        wt_mat[i, ] = weights
    return ret_vec, wt_mat
## Run simulation and graph
np.random.seed(123)
port_sim_1, wts_1, _, sharpe_1, _ = Port_sim.calc_sim(df.iloc[1:60,0:4],1000,4)
Port_sim.graph_sim(port_sim_1, sharpe_1)
# Weight choice
results_1_wts = port_select_func(port_sim_1, wts_1, 0.07, 0.1)
port_select_graph(results_1_wts)
# Compute satisfactory portfolio
satis_ret = np.sum(results_1_wts * df.iloc[1:60,0:4].mean(axis=0).values)
satis_risk = np.sqrt(np.dot(np.dot(results_1_wts.T, df.iloc[1:60,0:4].cov()),results_1_wts))
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_sim_1[:,1]*np.sqrt(12)*100, port_sim_1[:,0]*1200, marker='.', c=sharpe_1, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.scatter(satis_risk*np.sqrt(12)*100, satis_ret*1200, c='red', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
# Create efficient frontier function
from scipy.optimize import minimize
def eff_frontier(df_returns, min_ret, max_ret):
n = len(df_returns.columns)
def get_data(weights):
weights = np.array(weights)
returns = np.sum(df_returns.mean() * weights)
risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights)))
sharpe = returns/risk
return np.array([returns,risk,sharpe])
# Constraints
def check_sum(weights):
return np.sum(weights) - 1
# Range of returns
mus = np.linspace(min_ret,max_ret,20)
# Function to minimize
def minimize_volatility(weights):
return get_data(weights)[1]
# Inputs
init_guess = np.repeat(1/n,n)
bounds = tuple([(0,1) for _ in range(n)])
eff_risk = []
port_weights = []
for mu in mus:
# function for return
cons = ({'type':'eq','fun': check_sum},
{'type':'eq','fun': lambda w: get_data(w)[0] - mu})
result = minimize(minimize_volatility,init_guess,method='SLSQP',bounds=bounds,constraints=cons)
eff_risk.append(result['fun'])
port_weights.append(result.x)
eff_risk = np.array(eff_risk)
return mus, eff_risk, port_weights
## Create variables for frontier function
df_returns = df.iloc[1:60, 0:4]
min_ret = min(port_sim_1[:,0])
max_ret = max(port_sim_1[:,0])
eff_ret, eff_risk, eff_weights = eff_frontier(df_returns, min_ret, max_ret)
## Graph efficient frontier
plt.figure(figsize=(12,6))
plt.scatter(port_sim_1[:,1]*np.sqrt(12)*100, port_sim_1[:,0]*1200, marker='.', c=sharpe_1, cmap='Blues')
plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2)
plt.scatter(satis_risk*np.sqrt(12)*100, satis_ret*1200, c='red', s=50)
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
## Graph with unconstrained weights
np.random.seed(123)
port_sim_1lv, wts_1lv, _, sharpe_1lv, _ = Port_sim.calc_sim_lv(df.iloc[1:60,0:4],1000,4)
Port_sim.graph_sim(port_sim_1lv, sharpe_1lv)
# Weight choice
results_1lv_wts = port_select_func(port_sim_1lv, wts_1lv, 0.07, 0.1)
port_select_graph(results_1lv_wts)
# Satisfactory portfolio unconstrained weights
satis_ret1 = np.sum(results_1lv_wts * df.iloc[1:60,0:4].mean(axis=0).values)
satis_risk1 = np.sqrt(np.dot(np.dot(results_1lv_wts.T, df.iloc[1:60,0:4].cov()),results_1lv_wts))
# Graph with efficient frontier
plt.figure(figsize=(12,6))
plt.scatter(port_sim_1lv[:,1]*np.sqrt(12)*100, port_sim_1lv[:,0]*1200, marker='.', c=sharpe_1lv, cmap='Blues')
plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2)
plt.scatter(satis_risk1*np.sqrt(12)*100, satis_ret1*1200, c='red', s=50)
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
# Five year forward with unconstrained satisfactory portfolio
# Returns
## Run rebalance function using desired weights
port_1_act, wt_mat = rebal_func(df.iloc[61:121,0:4], results_1lv_wts)
port_act = {'returns': np.mean(port_1_act),
'risk': np.std(port_1_act),
'sharpe': np.mean(port_1_act)/np.std(port_1_act)*np.sqrt(12)}
# Run simulation on recent five-years
np.random.seed(123)
port_sim_2lv, wts_2lv, _, sharpe_2lv, _ = Port_sim.calc_sim_lv(df.iloc[61:121,0:4],1000,4)
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_sim_2lv[:,1]*np.sqrt(12)*100, port_sim_2lv[:,0]*1200, marker='.', c=sharpe_2lv, cmap='Blues')
plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2)
plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50)
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
## Efficient frontier on long-term data
df_returns_l = dat.iloc[1:254, 0:4]
min_ret_l = min(port_sim_1[:,0])
max_ret_l = max(port_sim_1[:,0])
eff_ret_l, eff_risk_l, eff_weights_l = eff_frontier(df_returns_l, min_ret_l, max_ret_l)
## Graph with original
plt.figure(figsize=(12,6))
plt.scatter(port_sim_1lv[:,1]*np.sqrt(12)*100, port_sim_1lv[:,0]*1200, marker='.', c=sharpe_1lv, cmap='Blues')
plt.plot(eff_risk_l*np.sqrt(12)*100,eff_ret_l*1200,'b--',linewidth=2)
plt.scatter(satis_risk1*np.sqrt(12)*100, satis_ret1*1200, c='red', s=50)
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
## Graph with five-year forward
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_sim_2lv[:,1]*np.sqrt(12)*100, port_sim_2lv[:,0]*1200, marker='.', c=sharpe_2lv, cmap='Blues')
plt.plot(eff_risk_l*np.sqrt(12)*100,eff_ret_l*1200,'b--',linewidth=2)
plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50)
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
Selenium is a powerful library available for both Python and R (the R version is called RSelenium
) which can automate tasks such as form filling, job applications, CRM system administration, and much more. That being said, Selenium can also be used to do a lot of harm, such as filling up forms with fake answers, making bots to create fake views on YouTube, and other nefarious purposes.
With this in mind, I can only think of what Peter Parker was told by Uncle Ben:
This blog post is about how setting up Selenium on R and Python went for me. If you can relate to this or have any insight, please leave a comment below!
Learning how to use Selenium in Python took me about 10 minutes to figure out. All I needed to do was download ChromeDriver, install Selenium with pip install selenium
, and I was ready to start working with it.
I was even able to do some form automation with it:
According to the official documentation, RSelenium is recommended to be run on Docker.
Coming from Python and wanting to do this in R presented an inconvenience: my main machine does not support virtualization, which disqualifies me from even being able to install Docker on the machine I have been working on.
This left me with no other choice but to use Selenium strictly in Python.
While ChromeDriver is recommended to be run in a VM, it is not a requirement, and I was able to use it in Python. My experience with RSelenium is that it is impossible to use without Docker or something similar, which is disappointing, as I wanted to see how RSelenium
matched up.
If you are really set on wanting to use Selenium in an R framework (maybe because you need to do some data wrangling or want to use tidyverse
as part of your project, etc.), I would recommend writing the script in Python and executing it in R with the reticulate
package, with something like:
reticulate::py_run_file("path_to_python_file") ... ... (Rest of your R Code)
Let me reiterate: you can learn how to use Selenium in Python in around 10 minutes, so the learning curve is far gentler than finding a workaround for RSelenium
, and the script will integrate with your R code thanks to the reticulate
package.
So as things look now- unless things change, my Selenium work will have to be written in Python.
This post was originally going to compare the use and speed of Selenium in R and Python, but the inability to install Docker on my computer left me unable to use the RSelenium
package.
I’m sure I am not the only one who faced this challenge, so I thought I would share my thoughts about how to get around it.
If you have a better solution- please feel free to share it with me as I would want to do a comparison between Python and R using Selenium!
Disclaimer: I have no affiliation with The Next Web (cf. online article)
A few weeks ago I read this interesting and accessible article about explainable AI, discussing more specifically self-explainable AI issues. I’m no longer sure there’s a mandatory need for AI models that explain themselves, as there are model-agnostic tools – such as the teller, among many others – that help them do just that.
With that being said, the new LSBoost
algorithm implemented in mlsauce does explain itself. LSBoost
is a cousin of the LS_Boost
algorithm introduced in
GREEDY FUNCTION APPROXIMATION: A GRADIENT BOOSTING MACHINE (GFAGBM). GFAGBM’s LS_Boost
is outlined below:
So, what makes the new LSBoost
different, you would be legitimately entitled to ask. Well, about the seemingly new name: I actually misspelled LS_Boost
in my code in the first place! So, it’ll remain named as it is now and forever. Otherwise, in the new LSBoost
we have:
LSBoost
contains a learning rate which can accelerate or slow down the convergence of residuals towards 0 (overfitting, fast or slow). Besides this, we can also remark that LSBoost
is explainable as a linear model, while being a highly nonlinear one. Indeed, by using some calculus, it’s possible to compute derivatives of F (still referring to Algorithm 2 outlined before) with respect to x, wherever the function h admits a derivative.
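Those derivatives can also be approximated numerically for any fitted model. Below is a minimal sketch of the idea using central finite differences; scikit-learn’s KernelRidge stands in for LSBoost so the snippet runs without mlsauce installed, and the helper name numerical_sensitivities is mine, not part of any library.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.kernel_ridge import KernelRidge

def numerical_sensitivities(model, X, eps=1e-4):
    """Central finite-difference estimate of dF/dx_j at each row of X.

    Works for any fitted object exposing predict(); the resulting (n, p)
    array is the model-agnostic analogue of linear-model coefficients.
    """
    n, p = X.shape
    grads = np.zeros((n, p))
    for j in range(p):
        X_hi, X_lo = X.copy(), X.copy()
        X_hi[:, j] += eps
        X_lo[:, j] -= eps
        grads[:, j] = (model.predict(X_hi) - model.predict(X_lo)) / (2 * eps)
    return grads

X, y = load_diabetes(return_X_y=True)
model = KernelRidge(kernel="rbf").fit(X, y)  # smooth stand-in for LSBoost
sens = numerical_sensitivities(model, X[:25])
print(sens.shape)  # (25, 10): one sensitivity per observation and feature
```

Each entry of sens plays the role of a local regression coefficient: the estimated change in F for a small change in one feature, holding the others fixed.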
In the following Python+R examples appearing after the short survey (both tested on Linux and macOS so far), we’ll use LSBoost
with default hyperparameters for solving regression and classification problems. There’s still some room for improving model performance.
Install mlsauce (command line)
pip install mlsauce --upgrade
Import packages
import numpy as np
from sklearn.datasets import (load_breast_cancer, load_wine, load_iris,
                              load_boston, load_diabetes)
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from time import time
import mlsauce as ms
# data 1
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
# split data into training test and test set
np.random.seed(15029)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
print("dataset 1 -- breast cancer -----")
print(X.shape)
obj = ms.LSBoostClassifier()
# using default parameters
print(obj.get_params())
start = time()
obj.fit(X_train, y_train)
print(time()-start)
start = time()
print(obj.score(X_test, y_test))
print(time()-start)
# classification report
y_pred = obj.predict(X_test)
print(classification_report(y_test, y_pred))
dataset 1 -- breast cancer -----
(569, 30)
{'backend': 'cpu', 'col_sample': 1, 'direct_link': 1, 'dropout': 0, 'learning_rate': 0.1, 'n_estimators': 100, 'n_hidden_features': 5, 'reg_lambda': 0.1, 'row_sample': 1, 'seed': 123, 'tolerance': 0.0001, 'verbose': 1}
0.16006875038146973
0.9473684210526315
0.015897750854492188
precision recall f1-score support
0 1.00 0.86 0.92 42
1 0.92 1.00 0.96 72
accuracy 0.95 114
macro avg 0.96 0.93 0.94 114
weighted avg 0.95 0.95 0.95 114
# data 2
wine = load_wine()
Z = wine.data
t = wine.target
np.random.seed(879423)
X_train, X_test, y_train, y_test = train_test_split(Z, t,
test_size=0.2)
print("dataset 2 -- wine -----")
print(Z.shape)
obj = ms.LSBoostClassifier()
# using default parameters
print(obj.get_params())
start = time()
obj.fit(X_train, y_train)
print(time()-start)
start = time()
print(obj.score(X_test, y_test))
print(time()-start)
# classification report
y_pred = obj.predict(X_test)
print(classification_report(y_test, y_pred))
dataset 2 -- wine -----
(178, 13)
{'backend': 'cpu', 'col_sample': 1, 'direct_link': 1, 'dropout': 0, 'learning_rate': 0.1, 'n_estimators': 100, 'n_hidden_features': 5, 'reg_lambda': 0.1, 'row_sample': 1, 'seed': 123, 'tolerance': 0.0001, 'verbose': 1}
0.1548290252685547
0.9722222222222222
0.021778583526611328
precision recall f1-score support
0 1.00 0.93 0.97 15
1 0.92 1.00 0.96 12
2 1.00 1.00 1.00 9
accuracy 0.97 36
macro avg 0.97 0.98 0.98 36
weighted avg 0.97 0.97 0.97 36
# data 3
iris = load_iris()
Z = iris.data
t = iris.target
np.random.seed(734563)
X_train, X_test, y_train, y_test = train_test_split(Z, t,
test_size=0.2)
print("dataset 3 -- iris -----")
print(Z.shape)
obj = ms.LSBoostClassifier()
# using default parameters
print(obj.get_params())
start = time()
obj.fit(X_train, y_train)
print(time()-start)
start = time()
print(obj.score(X_test, y_test))
print(time()-start)
# classification report
y_pred = obj.predict(X_test)
print(classification_report(y_test, y_pred))
dataset 3 -- iris -----
(150, 4)
{'backend': 'cpu', 'col_sample': 1, 'direct_link': 1, 'dropout': 0, 'learning_rate': 0.1, 'n_estimators': 100, 'n_hidden_features': 5, 'reg_lambda': 0.1, 'row_sample': 1, 'seed': 123, 'tolerance': 0.0001, 'verbose': 1}
100%|██████████| 100/100 [00:00<00:00, 1157.03it/s]
0.0932917594909668
0.9666666666666667
0.007458209991455078
precision recall f1-score support
0 1.00 1.00 1.00 13
1 1.00 0.90 0.95 10
2 0.88 1.00 0.93 7
accuracy 0.97 30
macro avg 0.96 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
# data 1
boston = load_boston()
X = boston.data
y = boston.target
# split data into training test and test set
np.random.seed(15029)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
print("dataset 4 -- boston -----")
print(X.shape)
obj = ms.LSBoostRegressor()
# using default parameters
print(obj.get_params())
start = time()
obj.fit(X_train, y_train)
print(time()-start)
start = time()
print(np.sqrt(np.mean(np.square(obj.predict(X_test) - y_test))))
print(time()-start)
dataset 4 -- boston -----
(506, 13)
{'backend': 'cpu', 'col_sample': 1, 'direct_link': 1, 'dropout': 0, 'learning_rate': 0.1, 'n_estimators': 100, 'n_hidden_features': 5, 'reg_lambda': 0.1, 'row_sample': 1, 'seed': 123, 'tolerance': 0.0001, 'verbose': 1}
100%|██████████| 100/100 [00:00<00:00, 896.24it/s]
0%| | 0/100 [00:00<?, ?it/s]
0.1198277473449707
3.4934156173105206
0.01007080078125
# data 2
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
# split data into training test and test set
np.random.seed(15029)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2)
print("dataset 5 -- diabetes -----")
print(X.shape)
obj = ms.LSBoostRegressor()
# using default parameters
print(obj.get_params())
start = time()
obj.fit(X_train, y_train)
print(time()-start)
start = time()
print(np.sqrt(np.mean(np.square(obj.predict(X_test) - y_test))))
print(time()-start)
dataset 5 -- diabetes -----
(442, 10)
{'backend': 'cpu', 'col_sample': 1, 'direct_link': 1, 'dropout': 0, 'learning_rate': 0.1, 'n_estimators': 100, 'n_hidden_features': 5, 'reg_lambda': 0.1, 'row_sample': 1, 'seed': 123, 'tolerance': 0.0001, 'verbose': 1}
100%|██████████| 100/100 [00:00<00:00, 1000.60it/s]
0.10351037979125977
55.867989174555625
0.012843847274780273
library(devtools)
devtools::install_github("thierrymoudiki/mlsauce/R-package")
library(mlsauce)
library(datasets)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris[, 5]) - 1L
n <- dim(X)[1]
p <- dim(X)[2]
set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index
X_train <- as.matrix(X[train_index, ])
y_train <- as.integer(y[train_index])
X_test <- as.matrix(X[test_index, ])
y_test <- as.integer(y[test_index])
# using default parameters
obj <- mlsauce::LSBoostClassifier()
start <- proc.time()[3]
obj$fit(X_train, y_train)
print(proc.time()[3] - start)
start <- proc.time()[3]
print(obj$score(X_test, y_test))
print(proc.time()[3] - start)
elapsed
0.051
0.9253731
elapsed
0.011
library(datasets)
X <- as.matrix(datasets::mtcars[, -1])
y <- as.integer(datasets::mtcars[, 1])
n <- dim(X)[1]
p <- dim(X)[2]
set.seed(21341)
train_index <- sample(x = 1:n, size = floor(0.8*n), replace = TRUE)
test_index <- -train_index
X_train <- as.matrix(X[train_index, ])
y_train <- as.double(y[train_index])
X_test <- as.matrix(X[test_index, ])
y_test <- as.double(y[test_index])
# using default parameters
obj <- mlsauce::LSBoostRegressor()
start <- proc.time()[3]
obj$fit(X_train, y_train)
print(proc.time()[3] - start)
start <- proc.time()[3]
print(sqrt(mean((obj$predict(X_test) - y_test)**2)))
print(proc.time()[3] - start)
elapsed
0.044
6.482376
elapsed
0.01
Our last few posts on portfolio construction have simulated various weighting schemes to create a range of possible portfolios. We’ve then chosen portfolios whose average weights yield the type of risk and return we’d like to achieve. However, we’ve noted there is more to portfolio construction than simulating portfolio weights. We also need to simulate return outcomes given that our use of historical averages to set return expectations is likely to be biased. It only accounts for one possible outcome.
Why did we bother to simulate weights in the first place? Finance theory posits that it is possible to find an optimal allocation of assets that achieves the highest return for a given level of risk, or the lowest risk for a given level of return, through a process known as mean-variance optimization. Explaining such concepts is all well and good if one understands the math, but isn’t so great at developing intuition if one doesn’t. Additionally, simulating portfolio weights has the added benefit of approximating the range of different portfolios that different investors might hold if they had a view on asset attractiveness.
Of course, as we pointed out two posts ago, our initial weighting scheme was relatively simple as it assumed all assets were held. We then showed that if we excluded some of the assets, the range of outcomes would increase significantly. In our last post, we showed that when we simulated many potential return paths, based on the historical averages along with some noise, and then simulated a range of portfolio weights, the range of outcomes also increased.
Unfortunately, the probability of achieving our desired risk-return constraints decreased. The main reason: a broader range of outcomes implies an increase in volatility, which means the likelihood of achieving our risk constraint declines. Now would be a great time to note that in this case (as in much of finance theory) volatility stands in for risk, which raises the question of whether volatility captures risk accurately.^{1} Warren Buffett claims he prefers a choppy 15% return to a smooth 12%, a view finance theory might abhor. But the volatility vs. risk discussion is a huge can of worms we don’t want to open just yet. So like a good economist, after first assuming a can opener, we’ll assume a screw-top jar in which to toss this discussion, to unscrew at a later date.
For now, we’ll run our simulations again but allowing for more extreme outcomes. Recall we generate 1,000 possible return scenarios over a 60-month (5-year) period for the set of assets. In other words, each simulation features randomly generated returns for each of the assets along with some noise, all of which hopefully accounts for random asset correlations too. As in our previous post, we then randomly select four of the possible return profiles and run our portfolio weighting algorithm to produce 1,000 different portfolios. Here are four samples below.
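The return-generation step described above can be sketched as follows. The means, volatilities, and the size of the noise term below are hypothetical placeholders, since the post estimates the real ones from the historical data loaded earlier.

```python
import numpy as np

# Hypothetical monthly means and standard deviations for four assets
# (stocks, bonds, gold, real estate); the post estimates these from data.
mean_ret = np.array([0.007, 0.003, 0.004, 0.006])
stds = np.array([0.045, 0.012, 0.050, 0.040])
cov = np.diag(stds ** 2)  # assume uncorrelated assets for this sketch

rng = np.random.default_rng(123)
sims, months = 1000, 60
# One draw per simulation-month, plus a small extra noise term
paths = rng.multivariate_normal(mean_ret, cov, size=(sims, months))
paths += rng.normal(0.0, 0.005, size=paths.shape)
print(paths.shape)  # (1000, 60, 4)
```

Each of the 1,000 slices of `paths` is one 60-month return scenario for the four assets, ready to be fed into the weighting simulation.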
Now here’s the same simulations using the algorithm that allows one to exclude up to any two of the four assets. (A portfolio of a single asset isn’t really a portfolio.)
Who knew portfolio simulations could be so artistic. For the portfolios in which all assets are held, the probability of hitting our not-less-than-7% return and not-more-than-10% risk constraint is 8%, 0%, 29%, and 29%. For the portfolios that allow assets to be excluded, the probability of achieving our risk-return constraints is 12%, 0%, 25%, and 25%.
We shouldn’t read too much into four samples. Still, allowing a broader range of allocations sometimes yields an improved probability of success, and sometimes it doesn’t. So much for more choice is better! The point is that a weighting scheme is only as good as the potential returns available. Extreme weights (99% in gold, 1% in bonds, and 0% in everything else, for example) will yield even more extreme performance, eroding the benefits of diversification. In essence, by excluding assets we’re increasing the likelihood of achieving dramatically awesome or dramatically awful returns. Portfolios with extreme returns tend to have higher volatility. So we are in effect filling in outcomes in the “tails” of the distribution. All things being equal, more tail events (2x more in fact) generally yield more outcomes where we’re likely to miss our modest risk constraint.
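To see why extreme weights erode diversification, compare the volatility of an equal-weight portfolio with a 99%-gold portfolio under a hypothetical covariance matrix (the numbers below are illustrative, not estimated from the post’s data):

```python
import numpy as np

# Hypothetical annualized covariance matrix: stocks, bonds, gold, real estate
cov = np.array([[0.0225, 0.0012, 0.0015, 0.0120],
                [0.0012, 0.0036, 0.0008, 0.0010],
                [0.0015, 0.0008, 0.0256, 0.0014],
                [0.0120, 0.0010, 0.0014, 0.0196]])

def port_vol(w, cov):
    # Standard portfolio volatility: sqrt(w' * Sigma * w)
    return np.sqrt(w @ cov @ w)

equal = np.repeat(0.25, 4)
extreme = np.array([0.0, 0.01, 0.99, 0.0])  # 99% gold, 1% bonds

print(round(port_vol(equal, cov), 3))    # diversified: below any single asset's vol
print(round(port_vol(extreme, cov), 3))  # concentrated: near gold's stand-alone vol
```

The concentrated portfolio’s volatility is close to gold’s on its own, roughly double the equal-weight portfolio’s, which is exactly the tail-fattening effect described above.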
We’ll now extend the weighting simulation to the 1,000 return simulations from above, yielding three million different portfolios. We graph a random selection of 10,000 of those portfolios below.
What a blast! Now here are the histograms of the entire dataset for returns and risk.
Returns are relatively normal. Volatility, predictably, is less than normal and positively skewed. We could theoretically transform volatility if we needed a more normal shape, but we’ll leave it as is for now. Still, this is something to keep in the back of our minds—namely, once we start excluding assets, we’re no longer in an easy-to-fit normal world, so we should be wary of the probabilities we calculate.
Given these results, let’s think about what we want our portfolio to achieve. Greater than 7% returns matches the nominal return of stocks over the long term. Less than 10% risk does not. But recall, we were “hoping” to generate equity-like returns with lower risk. We’re not trying to generate 10% or 20% average annual returns. If we were, we’d need to take on a lot more risk, at least theoretically.
Thus the question is, how much volatility are we willing to endure for a given return? While we could phrase this question in reverse (return for a given level of risk), we don’t think that is intuitive for most non-professional investors. If we bucket the range of returns and then calculate the volatility for each bucket, we can shrink our analysis to get a more manageable estimate of the magnitude of the risk-return trade-off.
We have three choices for how to bucket the returns: by interval, number, or width. We could have equal intervals of returns, with a different number of observations in each interval. We could have an equal number of observations, with the return spread of each bucket sporting a different width. Or we could choose an equal width for the cut-offs between returns, resulting in a different number of observations in each bucket. We’ll choose the last scheme.
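In pandas, the equal-width and equal-count schemes map to `pd.cut` and `pd.qcut`; a small sketch with made-up returns (illustrative numbers, not the simulation data):

```python
import numpy as np
import pandas as pd

# Illustrative data: 1,000 "annualized" returns in percent
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(10, 15, 1000))

equal_width = pd.cut(returns, bins=10)   # equal-width buckets, varying counts
equal_count = pd.qcut(returns, q=10)     # equal-count buckets, varying widths

print(equal_width.value_counts().nunique())  # counts differ across buckets
print(equal_count.value_counts().unique())   # every bucket holds 100 observations
```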
While there are returns that far exceed negative 15% and positive 45% on an average annual basis, their frequency of occurrence is de minimis. So we’ll exclude those outcomes and only show the most frequent ranges in the graph below.
We see that around 59% of the occurrences are within the return range of 5% to 15%. Around 76% are between 5% and 25%. That’s a good start. A majority of the time we’ll be close to or above our return constraint. If we alter the buckets so most of the outlier returns fall in a single bucket (anything below -5% or above 35%) and then calculate the median return and risk for each bucket, we’ll have a manageable data set, as shown below.
The bucket that includes our greater-than-7% return constraint has a median return of about 9% with a median risk of about 12%. That equates to a Sharpe ratio of about 0.75, which is better than our implied target of 0.7. The next bucket, with a median return and risk of 18% and 16%, is better still. Buckets above that have even better risk-reward ratios. However, only 3% of the portfolios reach that stratosphere.
Given that around 76% of the portfolios have a better risk-reward than our target, we could easily achieve our goal by only investing a portion of our assets in the risky portfolios and putting the remainder in risk-free assets if one believes such things exist.^{2} But we’d still need to figure out our allocations.
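A back-of-the-envelope sketch of that blending idea (the 9% return and 12% risk are the bucket medians from the text; the 2% risk-free rate is an assumption for illustration): scale down the risky portfolio until its volatility hits the 10% target and see what return is left.

```python
# Risky portfolio medians from the text; risk-free rate is an assumed figure
ret_risky, vol_risky = 0.09, 0.12
rf, vol_target = 0.02, 0.10

# The risk-free asset has (nominally) zero volatility, so portfolio vol
# scales linearly with the risky weight w.
w = vol_target / vol_risky
blended_ret = w * ret_risky + (1 - w) * rf

print(round(w, 2), round(blended_ret, 3))  # w ~ 0.83, return ~ 7.8%
```

So roughly five-sixths in the risky portfolio and the rest in the risk-free asset would still clear the 7% return hurdle while meeting the 10% risk constraint.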
Let’s look at the weighting for these different portfolios. First, we bracket 76% or so of the portfolios that are in the sweet spot and take the average of the weights.
We see that average weights are roughly equal if slightly biased toward stocks and bonds. Now let’s calculate the average weights for the returns above the mid-range.
Instructive. Those portfolios that saw very high returns had a very high exposure to gold. What about the low return portfolios?
Also a high exposure to gold. We’re not picking on the yellow metal, but this is a great illustration of the perils of overweighting a highly volatile asset. Sometimes you knock it out of the park and sometimes you crash and burn. Recall that our return simulations took each asset’s historical return and risk and added in some noise^{3} similar to the asset’s underlying risk. Hence, by randomness alone it was possible to generate spectacular or abysmal returns. That massive outperformance your friend is enjoying could be entirely due to luck.
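The noise scheme in the footnote has a quantifiable effect worth spelling out: adding an independent zero-mean error with the same standard deviation as the asset roughly multiplies the effective volatility by the square root of two. A sketch with illustrative numbers (not the actual asset figures):

```python
import numpy as np

# Each simulated return is a draw from N(mu, sigma) plus an independent
# N(0, sigma) noise term, so the variances add: effective sd = sigma*sqrt(2).
rng = np.random.default_rng(0)
mu, sigma, n = 0.007, 0.04, 1_000_000  # illustrative monthly figures

sim = rng.normal(mu, sigma, n) + rng.normal(0, sigma, n)
print(sim.std(), sigma * np.sqrt(2))  # the two should be close
```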
But, before you begin to think we’ve drunk the Efficient Market Hypothesis kool-aid, let’s look at the weights for the portfolios that meet or exceed our risk-return constraints.
An interesting result. While the weights are still relatively equal, the higher risk assets have a lower exposure overall.
Let’s summarize. When we simulated multiple return outcomes and relaxed the allocation constraints to allow us to exclude assets, the range of return and risk results increased significantly. But the likelihood of achieving our risk and return targets decreased. So we decided to bucket the portfolios to make it easier to assess how much risk we’d have to accept for the type of return we wanted. Doing so, we calculated the median returns and risk for each bucket and found that some buckets achieved Sharpe ratios close to or better than that implied by our original risk-return constraint. We then looked at the average asset allocations for some of the different buckets, ultimately, cutting the data again to calculate the average weights for the better Sharpe ratio portfolios. The takeaway: relatively equal-weighting tended to produce a better risk-reward outcome than significant overweighting. Remember this takeaway because we’ll come back to it in later posts.
In the end, we could have bypassed some of this data wrangling and just calculated the optimal portfolio weights for various risk profiles. But that will have to wait until we introduce our friend, mean-variance optimization. Until then, the Python and R code that produce the foregoing analysis and charts are below.
For the Pythonistas:
# Built using Python 3.7.4
# Load libraries
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('ggplot')
sns.set()
# SKIP IF ALREADY HAVE DATA
# Load data
start_date = '1970-01-01'
end_date = '2019-12-31'
symbols = ["WILL5000INDFC", "BAMLCC0A0CMTRIV", "GOLDPMGBD228NLBM", "CSUSHPINSA", "DGS5"]
sym_names = ["stock", "bond", "gold", "realt", 'rfr']
filename = 'data_port_const.pkl'
try:
df = pd.read_pickle(filename)
print('Data loaded')
except FileNotFoundError:
print("File not found")
print("Loading data", 30*"-")
data = web.DataReader(symbols, 'fred', start_date, end_date)
data.columns = sym_names
data_mon = data.resample('M').last()
df = data_mon.pct_change()['1987':'2019']
dat = data_mon.pct_change()['1971':'2019']
## Simulation function
class Port_sim:
def calc_sim(df, sims, cols):
wts = np.zeros((sims, cols))
for i in range(sims):
a = np.random.uniform(0,1,cols)
b = a/np.sum(a)
wts[i,] = b
mean_ret = df.mean()
port_cov = df.cov()
port = np.zeros((sims, 2))
for i in range(sims):
port[i,0] = np.sum(wts[i,]*mean_ret)
port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,]))
sharpe = port[:,0]/port[:,1]*np.sqrt(12)
best_port = port[np.where(sharpe == max(sharpe))]
max_sharpe = max(sharpe)
return port, wts, best_port, sharpe, max_sharpe
def calc_sim_lv(df, sims, cols):
wts = np.zeros(((cols-1)*sims, cols))
count=0
for i in range(1,cols):
for j in range(sims):
a = np.random.uniform(0,1,(cols-i+1))
b = a/np.sum(a)
c = np.random.choice(np.concatenate((b, np.zeros(i))),cols, replace=False)
wts[count,] = c
count+=1
mean_ret = df.mean()
port_cov = df.cov()
port = np.zeros(((cols-1)*sims, 2))
for i in range((cols-1)*sims):
port[i,0] = np.sum(wts[i,]*mean_ret)
port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,]))
sharpe = port[:,0]/port[:,1]*np.sqrt(12)
best_port = port[np.where(sharpe == max(sharpe))]
max_sharpe = max(sharpe)
return port, wts, best_port, sharpe, max_sharpe
def graph_sim(port, sharpe):
plt.figure(figsize=(14,6))
plt.scatter(port[:,1]*np.sqrt(12)*100, port[:,0]*1200, marker='.', c=sharpe, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
# Calculate returns and risk for longer period
hist_mu = dat['1971':'1991'].mean(axis=0)
hist_sigma = dat['1971':'1991'].std(axis=0)
# Run simulation based on historical figures
np.random.seed(123)
sim1 = []
for i in range(1000):
#np.random.normal(mu, sigma, obs)
a = np.random.normal(hist_mu[0], hist_sigma[0], 60) + np.random.normal(0, hist_sigma[0], 60)
b = np.random.normal(hist_mu[1], hist_sigma[1], 60) + np.random.normal(0, hist_sigma[1], 60)
c = np.random.normal(hist_mu[2], hist_sigma[2], 60) + np.random.normal(0, hist_sigma[2], 60)
d = np.random.normal(hist_mu[3], hist_sigma[3], 60) + np.random.normal(0, hist_sigma[3], 60)
df1 = pd.DataFrame(np.array([a, b, c, d]).T)
cov_df1 = df1.cov()
sim1.append([df1, cov_df1])
# create graph objects
np.random.seed(123)
samp = np.random.randint(1, 1000, 4)
graphs1 = []
for i in range(4):
port, _, _, sharpe, _ = Port_sim.calc_sim(sim1[samp[i]][0], 1000, 4)
graf = [port,sharpe]
graphs1.append(graf)
# Graph sample portfolios
fig, axes = plt.subplots(2, 2, figsize=(12,6))
for i, ax in enumerate(fig.axes):
ax.scatter(graphs1[i][0][:,1]*np.sqrt(12)*100, graphs1[i][0][:,0]*1200, marker='.', c=graphs1[i][1], cmap='Blues')
plt.show()
# create graph objects
np.random.seed(123)
graphs2 = []
for i in range(4):
port, _, _, sharpe, _ = Port_sim.calc_sim_lv(sim1[samp[i]][0], 1000, 4)
graf = [port,sharpe]
graphs2.append(graf)
# Graph sample portfolios
fig, axes = plt.subplots(2, 2, figsize=(12,6))
for i, ax in enumerate(fig.axes):
ax.scatter(graphs2[i][0][:,1]*np.sqrt(12)*100, graphs2[i][0][:,0]*1200, marker='.', c=graphs2[i][1], cmap='Blues')
plt.show()
# Calculate probability of hitting risk-return constraints based on sample portfolios
probs = []
for i in range(8):
if i <= 3:
out = round(np.mean((graphs1[i][0][:,0] >= 0.07/12) & (graphs1[i][0][:,1] <= 0.1/np.sqrt(12))),2)*100
probs.append(out)
else:
out = round(np.mean((graphs2[i-4][0][:,0] >= 0.07/12) & (graphs2[i-4][0][:,1] <= 0.1/np.sqrt(12))),2)*100
probs.append(out)
print(probs)
# Simulate portfolios from return simulations
def wt_func(sims, cols):
wts = np.zeros(((cols-1)*sims, cols))
count=0
for i in range(1,cols):
for j in range(sims):
a = np.random.uniform(0,1,(cols-i+1))
b = a/np.sum(a)
c = np.random.choice(np.concatenate((b, np.zeros(i))),cols, replace=False)
wts[count,] = c
count+=1
return wts
# Note this takes over 4min to run, substantially worse than the R version, which runs in under a minute. Not sure what I'm missing.
np.random.seed(123)
portfolios = np.zeros((1000, 3000, 2))
weights = np.zeros((1000,3000,4))
for i in range(1000):
wt_mat = wt_func(1000,4)
port_ret = sim1[i][0].mean(axis=0)
cov_dat = sim1[i][0].cov()
returns = np.dot(wt_mat, port_ret)
risk = [np.sqrt(np.dot(np.dot(wt.T,cov_dat), wt)) for wt in wt_mat]
portfolios[i][:,0] = returns
portfolios[i][:,1] = risk
weights[i][:,:] = wt_mat
port_1m = portfolios.reshape((3000000,2))
wt_1m = weights.reshape((3000000,4))
# Find probability of hitting risk-return constraints on simulated portfolios
port_1m_prob = round(np.mean((port_1m[:][:,0] > 0.07/12) & (port_1m[:][:,1] <= 0.1/np.sqrt(12))),2)*100
print(f"The probability of meeting our portfolio constraints is:{port_1m_prob: 0.0f}%")
# Plot sample portfolios
np.random.seed(123)
port_samp = port_1m[np.random.choice(1000000, 10000),:]
sharpe = port_samp[:,0]/port_samp[:,1]
plt.figure(figsize=(14,6))
plt.scatter(port_samp[:,1]*np.sqrt(12)*100, port_samp[:,0]*1200, marker='.', c=sharpe, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Ten thousand samples from three million simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
# Graph histograms
fig, axes = plt.subplots(1,2, figsize = (12,6))
for idx,ax in enumerate(fig.axes):
if idx == 1:
ax.hist(port_1m[:][:,1], bins = 100)
else:
ax.hist(port_1m[:][:,0], bins = 100)
plt.show()
## Create buckets for analysis and graphing
df_port = pd.DataFrame(port_1m, columns = ['returns', 'risk'])
port_bins = np.arange(-35,65,10)
df_port['dig_ret'] = pd.cut(df_port['returns']*1200, port_bins)
xs = ["(-35, -25]", "(-25, -15]", "(-15, -5]","(-5, 5]", "(5, 15]", "(15, 25]", "(25, 35]", "(35, 45]", "(45, 55]"]
ys = df_port.groupby('dig_ret').size().values/len(df_port)*100
# Graph buckets with frequency
fig,ax = plt.subplots(figsize = (12,6))
ax.bar(xs[2:7], ys[2:7])
ax.set(xlabel = "Return bucket (%)",
ylabel = "Frequency (%)",
title = "Frequency of occurrence for return bucket")
plt.show()
# Calculate frequency of occurrence for mid-range of returns
good_range = np.sum(df_port.groupby('dig_ret').size()[4:6])/len(df_port)
good_range
## Graph buckets with median return and risk
med_ret = df_port.groupby('dig_ret').agg({'returns':'median'})*1200
med_risk = df_port.groupby('dig_ret').agg({'risk':'median'})*np.sqrt(12)*100
labs_ret = np.round(med_ret['returns'].to_list()[2:7])
labs_risk = np.round(med_risk['risk'].to_list()[2:7])
fig, ax = plt.subplots(figsize = (12,6))
ax.bar(xs[2:7], ys[2:7])
for i in range(len(xs[2:7])):
ax.annotate(str('Returns: ' + str(labs_ret[i])), xy = (xs[2:7][i], ys[2:7][i]+2), xycoords = 'data')
ax.annotate(str('Risk: ' + str(labs_risk[i])), xy = (xs[2:7][i], ys[2:7][i]+5), xycoords = 'data')
ax.set(xlabel = "Return bucket (%)",
ylabel = "Frequency (%)",
title = "Frequency of occurrence for return bucket",
ylim = (0,60))
plt.show()
# Find frequency of high return buckets
hi_range = np.sum(df_port.groupby('dig_ret').size()[6:])/len(df_port)
hi_range
## Identify weights for different buckets for graphing
wt_1m = pd.DataFrame(wt_1m, columns = ['Stocks', 'Bonds', 'Gold', 'Real estate'])
port_ids_mid = df_port.loc[(df_port['returns'] >= 0.05/12) & (df_port['returns'] <= 0.25/12)].index
mid_ports = wt_1m.loc[port_ids_mid,:].mean(axis=0)
port_ids_hi = df_port.loc[(df_port['returns'] >= 0.35/12)].index
hi_ports = wt_1m.loc[port_ids_hi,:].mean(axis=0)
port_ids_lo = df_port.loc[(df_port['returns'] <= -0.05/12)].index
lo_ports = wt_1m.loc[port_ids_lo,:].mean(axis=0)
# Sharpe portfolios
df_port['sharpe'] = df_port['returns']/df_port['risk']*np.sqrt(12)
port_ids_sharpe = df_port[(df_port['sharpe'] > 0.7)].index
sharpe_ports = wt_1m.loc[port_ids_sharpe,:].mean(axis=0)
# Create graph function
def wt_graph(ports, title):
fig, ax = plt.subplots(figsize=(12,6))
ax.bar(ports.index.values, ports*100)
for i in range(len(ports)):
ax.annotate(str(np.round(ports[i],2)*100), xy=(ports.index.values[i], ports[i]*100+2), xycoords = 'data')
ax.set(xlabel = '', ylabel = 'Weights (%)', title = title, ylim = (0,max(ports)*100+5))
plt.show()
# Graph weights
wt_graph(mid_ports, "Average asset weights for mid-range portfolios")
wt_graph(hi_ports, "Average asset weights for high return portfolios")
wt_graph(lo_ports, "Average asset weights for negative return portfolios")
wt_graph(sharpe_ports, "Average asset weights for Sharpe portfolios")
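On the timing note in the comment above the simulation loop: a likely culprit is the Python-level list comprehension that computes one portfolio risk per weight vector. A hedged sketch (illustrative sizes, not the post’s data) showing that `np.einsum` produces the same risks in a single vectorized call:

```python
import numpy as np

# Illustrative weights and covariance matrix (assumed shapes, not the post's data)
rng = np.random.default_rng(123)
wt_mat = rng.uniform(size=(3000, 4))
wt_mat = wt_mat / wt_mat.sum(axis=1, keepdims=True)  # rows sum to one
cov_dat = np.cov(rng.normal(size=(60, 4)), rowvar=False)

# Loop version, as in the post: one quadratic form per weight vector
risk_loop = np.array([np.sqrt(wt @ cov_dat @ wt) for wt in wt_mat])

# Vectorized version: einsum evaluates every w' C w at once
risk_vec = np.sqrt(np.einsum('ij,jk,ik->i', wt_mat, cov_dat, wt_mat))

print(np.allclose(risk_loop, risk_vec))
```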
For the Rtists:
# Built using R 3.6.2
## Load packages
suppressPackageStartupMessages({
library(tidyquant)
library(tidyverse)
library(gtools)
library(grid)
})
## Load data
df <- readRDS("port_const.rds")
dat <- readRDS("port_const_long.rds")
sym_names <- c("stock", "bond", "gold", "realt", "rfr")
## Call simulation functions
source("Portfolio_simulation_functions.R")
## Prepare sample
hist_avg <- dat %>%
filter(date <= "1991-12-31") %>%
summarise_at(vars(-date), list(mean = function(x) mean(x, na.rm=TRUE),
sd = function(x) sd(x, na.rm = TRUE))) %>%
gather(key, value) %>%
mutate(key = str_remove(key, "_.*"),
key = factor(key, levels =sym_names)) %>%
mutate(calc = c(rep("mean",5), rep("sd",5))) %>%
spread(calc, value)
# Run simulation
set.seed(123)
sim1 <- list()
for(i in 1:1000){
a <- rnorm(60, hist_avg[1,2], hist_avg[1,3]) + rnorm(60, 0, hist_avg[1,3])
b <- rnorm(60, hist_avg[2,2], hist_avg[2,3]) + rnorm(60, 0, hist_avg[2,3])
c <- rnorm(60, hist_avg[3,2], hist_avg[3,3]) + rnorm(60, 0, hist_avg[3,3])
d <- rnorm(60, hist_avg[4,2], hist_avg[4,3]) + rnorm(60, 0, hist_avg[4,3])
df1 <- data.frame(a, b, c, d)
cov_df1 <- cov(df1)
sim1[[i]] <- list(df1, cov_df1)
names(sim1[[i]]) <- c("df", "cov_df")
}
# Plot random four portfolios
## Sample four return paths
## Note this sampling does not produce the same draws in Rmarkdown/blogdown as in the console. Not sure why.
set.seed(123)
samp <- sample(1000,4)
graphs <- list()
for(i in 1:8){
if(i <= 4){
graphs[[i]] <- port_sim(sim1[[samp[i]]]$df,1000,4)
}else{
graphs[[i]] <- port_sim_lv(sim1[[samp[i-4]]]$df,1000,4)
}
}
library(grid)
gridExtra::grid.arrange(graphs[[1]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
graphs[[2]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
graphs[[3]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
graphs[[4]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
ncol=2, nrow=2,
top = textGrob("Four portfolio and return simulations",gp=gpar(fontsize=15)))
# Graph second set
gridExtra::grid.arrange(graphs[[5]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
graphs[[6]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
graphs[[7]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
graphs[[8]]$graph +
theme(legend.position = "none") +
labs(title = NULL),
ncol=2, nrow=2,
top = textGrob("Four portfolio and return simulations allowing for excluded assets",gp=gpar(fontsize=15)))
# Calculate probability of hitting risk-return constraint
probs <- c()
for(i in 1:8){
probs[i] <- round(mean(graphs[[i]]$port$returns >= 0.07/12 &
graphs[[i]]$port$risk <=0.1/sqrt(12)),2)*100
}
## Load data
port_1m <- readRDS("port_3m_sim.rds")
## Graph sample of port_1m
set.seed(123)
port_samp = port_1m[sample(1e6, 1e4),]
port_samp %>%
mutate(Sharpe = returns/risk) %>%
ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) +
geom_point(size = 1.2, alpha = 0.4) +
scale_color_gradient(low = "darkgrey", high = "darkblue") +
labs(x = "Risk (%)",
y = "Return (%)",
title = "Ten thousand samples from simulation of three million portfolios") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Graph histogram
port_1m %>%
mutate(returns = returns*1200,
risk = risk*sqrt(12)*100) %>%
gather(key, value) %>%
ggplot(aes(value)) +
geom_histogram(bins=100, fill = 'darkblue') +
facet_wrap(~key, scales = "free",
labeller = as_labeller(c(returns = "Returns (%)",
risk = "Risk (%)"))) +
scale_y_continuous(labels = scales::comma) +
labs(x = "",
y = "Count",
title = "Portfolio simulation return and risk histograms")
## Graph quantile returns for total series
x_lim = c("(-15,-5]",
"(-5,5]", "(5,15]",
"(15,25]", "(25,35]")
port_1m %>%
mutate(returns = cut_width(returns*1200, 10)) %>%
group_by(returns) %>%
summarise(risk = median(risk*sqrt(12)*100),
count = n()/nrow(port_1m)) %>%
ggplot(aes(returns, count*100)) +
geom_bar(stat = "identity", fill = "blue") +
xlim(x_lim) +
labs(x = "Return bucket (%)",
y = "Frequency (%)",
title = "Frequency of occurrence for return bucket")
## Occurrences
mid_range <- port_1m %>%
mutate(returns = cut_width(returns*1200, 10)) %>%
group_by(returns) %>%
summarise(risk = median(risk*sqrt(12)*100),
count = n()/nrow(port_1m)) %>%
filter(as.character(returns) %in% c("(5,15]")) %>%
summarise(sum = round(sum(count),2)) %>%
as.numeric()*100
good_range <- port_1m %>%
mutate(returns = cut_width(returns*1200, 10)) %>%
group_by(returns) %>%
summarise(risk = median(risk*sqrt(12)*100),
count = n()/nrow(port_1m)) %>%
filter(as.character(returns) %in% c("(5,15]" , "(15,25]")) %>%
summarise(sum = round(sum(count),2)) %>%
as.numeric()*100
# Set quantiles for graph and labels
quants <- port_1m %>%
mutate(returns = cut(returns*1200, breaks=c(-Inf, -5, 5, 15, 25, 35, Inf))) %>%
group_by(returns) %>%
summarise(prop = n()/nrow(port_1m)) %>%
select(prop) %>%
mutate(prop = cumsum(prop))
# Calculate quantile
x_labs <- quantile(port_1m$returns, probs = unlist(quants))*1200
x_labs_median <- tapply(port_1m$returns*1200,
findInterval(port_1m$returns*1200, x_labs), median) %>%
round()
x_labs_median_risk <- tapply(port_1m$risk*sqrt(12)*100, findInterval(port_1m$risk*sqrt(12)*100, x_labs), median) %>% round()
# Graph frequency of occurrence for equal width returns
port_1m %>%
mutate(returns = cut(returns*1200, breaks=c(-45, -5,5,15,25,35,95))) %>%
group_by(returns) %>%
summarise(risk = median(risk*sqrt(12)*100),
count = n()/nrow(port_1m)) %>%
ggplot(aes(returns, count*100)) +
geom_bar(stat = "identity", fill = "blue") +
geom_text(aes(returns, count*100+5, label = paste("Risk: ", round(risk), "%", sep=""))) +
geom_text(aes(returns, count*100+2,
label = paste("Return: ", x_labs_median[-7], "%", sep=""))) +
labs(x = "Return bucket (%)",
y = "Frequency (%)",
title = "Frequency of occurrence for return bucket with median risk and return per bucket")
# High range probability
high_range <- port_1m %>%
mutate(returns = cut(returns*1200, breaks=c(-45, -5,5,15,25,35,95))) %>%
group_by(returns) %>%
summarise(risk = median(risk*sqrt(12)*100),
count = n()/nrow(port_1m)) %>%
filter(as.character(returns) %in% c("(25,35]", "(35,95]")) %>%
summarise(sum = round(sum(count),2)) %>%
as.numeric()*100
## Identify weights for target portfolios
wt_1m <- readRDS('wt_3m.rds')
## Portfolio ids
# Mid-range portfolios
port_ids_mid <- port_1m %>%
mutate(row_ids = row_number()) %>%
filter(returns >= 0.05/12, returns < 0.25/12) %>%
select(row_ids) %>%
unlist() %>%
as.numeric()
mid_ports <- colMeans(wt_1m[port_ids_mid,])
# Hi return portfolio
port_ids_hi <- port_1m %>%
mutate(row_ids = row_number()) %>%
filter(returns >= 0.35/12) %>%
select(row_ids) %>%
unlist()
hi_ports <- colMeans(wt_1m[port_ids_hi,])
# Low return portfolios
port_ids_lo <- port_1m %>%
mutate(row_ids = row_number()) %>%
filter(returns <= -0.05/12) %>%
select(row_ids) %>%
unlist()
lo_ports <- colMeans(wt_1m[port_ids_lo,])
# Sharpe portfolios
port_ids_sharpe <- port_1m %>%
mutate(sharpe = returns/risk*sqrt(12),
row_ids = row_number()) %>%
filter(sharpe > 0.7) %>%
select(row_ids) %>%
unlist()
sharpe_ports <- colMeans(wt_1m[port_ids_sharpe,])
## Graph portfolio weights
# Function
wt_graf <- function(assets, weights, title){
data.frame(assets = factor(assets, levels = assets),
weights = weights) %>%
ggplot(aes(assets, weights*100)) +
geom_bar(stat = "identity", fill="blue") +
geom_text(aes(assets ,weights*100+3, label = round(weights,2)*100)) +
labs(x='',
y = "Weights (%)",
title = title)
}
assets = c("Stocks", "Bonds", "Gold", "Real Estate")
# Graph different weights
wt_graf(assets, mid_ports, "Average asset weights for mid-range portfolios")
wt_graf(assets, hi_ports, "Average asset weights for high return portfolios")
wt_graf(assets, lo_ports, "Average asset weights for negative return portfolios")
wt_graf(assets, sharpe_ports, "Average asset weights for Sharpe constraints")
What is risk anyway? There are plenty of definitions out there. In the case of investing, a working definition might be that risk is the chance that one won’t achieve his or her return objectives. Volatility, on the other hand, describes the likely range of returns. So volatility captures the probability of failure; but it also captures success and every other occurrence along the continuum of outcomes.
Is a demand deposit or a government bond whose yield is below inflation, or worse, negative, risk-free?
A normally distributed error term whose mean was zero and whose standard deviation was the same as the asset’s.
Every now and then, Twitter will offer these golden resources.
Ashley Willis recently asked people to name the best tech talk they’ve ever seen and the results are a resource I don’t want to lose.
Hundreds of people responded, sharing their contenders for the title.
Below, I selected some of the top-rated talks and clustered them accordingly. Click a category to jump to the section.
Cover image via: https://toggl.com/blog/best-tech-websites
Note: This is an older post originally written as a LinkedIn article I wrote in late May. I have added information about shaping data thanks to Casper Crause using the data.table
library. You can see our original correspondence in the comments there (for now)
If you dabble in data, you know one of the challenges everyone faces is reshaping data into the form you want to use. Thankfully, there are ways to shape data in both Python and R that speed up the process, using some of the functions available in their extensive libraries.
In this post, we will be looking at how to pivot data from long to wide form using Python’s pandas
library and R’s stats
, tidyr
and data.table
libraries and how they match up.
I did write more annotations on the Python code, as I am still learning the language, and while it’s been pretty easy to pick up, I still need to work through the steps. I’m sure there’s another way to wrangle and shape data in Python besides pandas
; if you know of another one, be sure to leave a comment below and let me know!
Let’s go!
The problem that we’ll be using will be a problem I saw on StackExchange’s Data Science site. (link to problem: here). Here are the screenshots of the question.
While the OP only asks how to do this in R, I thought it would be good to show how this works in Python as well! Let’s dive right into it!
In R, we can use the stats, tidyr, or data.table libraries.

Disclaimer: for this problem, I will be focusing on getting the data into its proper form. I won’t rename columns, as that is a cosmetic issue.
The Python way (using the pandas library):

First, let’s input our data:
# The Raw Data
x = {"ID": [1234, 1234],
     "APPROVAL_STEP": ["STEP_A", "STEP_B"],
     "APPROVAL_STATUS": ["APPROVED", "APPROVED"],
     "APPROVAL_DATE": ["23-Jan-2019", "21-Jan-2019"],
     "APPROVER": ["John Smith", "Jane Doe"]}
print(x)
## {'ID': [1234, 1234], 'APPROVAL_STEP': ['STEP_A', 'STEP_B'], 'APPROVAL_STATUS': ['APPROVED', 'APPROVED'], 'APPROVAL_DATE': ['23-Jan-2019', '21-Jan-2019'], 'APPROVER': ['John Smith', 'Jane Doe']}
Now to convert this data into a data frame by using the DataFrame()
function from the pandas
library.
import pandas as pd
df = pd.DataFrame(x)
df
## ID APPROVAL_STEP APPROVAL_STATUS APPROVAL_DATE APPROVER ## 0 1234 STEP_A APPROVED 23-Jan-2019 John Smith ## 1 1234 STEP_B APPROVED 21-Jan-2019 Jane Doe
Now, to convert the data into wide form; this can be done by using the .pivot_table()
method. We want to index the data based on ID
and see each data point based on the step. This can be done with the code below:
df = df.pivot_table(index="ID", columns="APPROVAL_STEP", aggfunc="first")
df
## APPROVAL_DATE APPROVAL_STATUS \ ## APPROVAL_STEP STEP_A STEP_B STEP_A STEP_B ## ID ## 1234 23-Jan-2019 21-Jan-2019 APPROVED APPROVED ## ## APPROVER ## APPROVAL_STEP STEP_A STEP_B ## ID ## 1234 John Smith Jane Doe
We’re starting to have our data look like what we want it to be. Now, to flatten the column names.
df.columns = ['_'.join(col) for col in df.columns]
df
## APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVAL_STATUS_STEP_A \ ## ID ## 1234 23-Jan-2019 21-Jan-2019 APPROVED ## ## APPROVAL_STATUS_STEP_B APPROVER_STEP_A APPROVER_STEP_B ## ID ## 1234 APPROVED John Smith Jane Doe
Now, for the finishing touches, we use the .reset_index()
method and reorder the columns.
## ID APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVAL_STATUS_STEP_A \ ## 0 1234 23-Jan-2019 21-Jan-2019 APPROVED ## ## APPROVAL_STATUS_STEP_B APPROVER_STEP_A APPROVER_STEP_B ## 0 APPROVED John Smith Jane Doe
## APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVAL_STATUS_STEP_A \ ## ID ## 1234 23-Jan-2019 21-Jan-2019 APPROVED ## ## APPROVAL_STATUS_STEP_B APPROVER_STEP_A APPROVER_STEP_B ## ID ## 1234 APPROVED John Smith Jane Doe
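Since the post invites alternatives: the same reshape can be sketched in one chain with `set_index` and `unstack` (a minimal sketch, not from the original post):

```python
import pandas as pd

# Same raw data as in the post
x = {"ID": [1234, 1234],
     "APPROVAL_STEP": ["STEP_A", "STEP_B"],
     "APPROVAL_STATUS": ["APPROVED", "APPROVED"],
     "APPROVAL_DATE": ["23-Jan-2019", "21-Jan-2019"],
     "APPROVER": ["John Smith", "Jane Doe"]}

wide = (pd.DataFrame(x)
          .set_index(["ID", "APPROVAL_STEP"])
          .unstack("APPROVAL_STEP"))
# Flatten the resulting MultiIndex columns to e.g. APPROVAL_STATUS_STEP_A
wide.columns = ["_".join(col) for col in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```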
Phew! That was a lot of steps to follow to get here! Let’s see how R matches up!
In R (using the tidyr package)

The tidyr library is a package made by Hadley Wickham and his team at RStudio. It is one of the many packages in the tidyverse made for managing data. We can solve this problem by using the pivot_wider() function.
# The Raw Data
x <- data.frame(ID = c(1234, 1234),
                APPROVAL_STEP = c("STEP_A", "STEP_B"),
                APPROVAL_STATUS = c("APPROVED", "APPROVED"),
                APPROVAL_DATE = c("23-Jan-2019", "21-Jan-2019"),
                APPROVER = c("John Smith", "Jane Doe"))

# Use pivot_wider()
library(tidyr)
t <- x %>%
  pivot_wider(id_cols = ID,
              names_from = APPROVAL_STEP,
              values_from = c(APPROVAL_STATUS, APPROVAL_DATE, APPROVER))
t
## # A tibble: 1 x 7 ## ID APPROVAL_STATUS_STEP_A APPROVAL_STATUS_STEP~ APPROVAL_DATE_STEP~ APPROVAL_DATE_STEP~ APPROVER_STEP_A APPROVER_STEP_B ## <dbl> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 1234 APPROVED APPROVED 23-Jan-2019 21-Jan-2019 John Smith Jane Doe
Now, we just need to reorder the columns.
# Reordered
t <- t[, c(1, 2, 4, 6, 3, 5, 7)]
t
## # A tibble: 1 x 7 ## ID APPROVAL_STATUS_STEP_A APPROVAL_DATE_STEP~ APPROVER_STEP_A APPROVAL_STATUS_STEP~ APPROVAL_DATE_STEP~ APPROVER_STEP_B ## <dbl> <fct> <fct> <fct> <fct> <fct> <fct> ## 1 1234 APPROVED 23-Jan-2019 John Smith APPROVED 21-Jan-2019 Jane Doe
In R (using the stats package)

Using the reshape() function from R’s stats package is a more “old school” way of doing this, popular with people who learned to write R in the pre-tidyverse era. Since I initially learned R from people who programmed pre-tidyverse, this is how I learned to do it. It can all be done with one function, without having to reorder columns!
(This can also be seen on my answer to this question on Data Science StackExchange page)
library(stats)
reshape(x,
        timevar = "APPROVAL_STEP",
        idvar = "ID",
        sep = "_",
        direction = "wide")
## ID APPROVAL_STATUS_STEP_A APPROVAL_DATE_STEP_A APPROVER_STEP_A APPROVAL_STATUS_STEP_B APPROVAL_DATE_STEP_B ## 1 1234 APPROVED 23-Jan-2019 John Smith APPROVED 21-Jan-2019 ## APPROVER_STEP_B ## 1 Jane Doe
There you have it! Everything with one function!
In R (using the data.table package)

Casper Crause pointed out that this task can also be done with the data.table package.
The advantage of using this over tidyr
or the stats
packages is that data.table is written largely in C (see breakdown in languages used on Github page linked). So for larger datasets, using this in a script will save more time computationally.
The quirk here is that your data frame needs to be converted to a data table (which for this example was not hard at all). But throwing this into dcast()
works like a charm and puts your shaping of data in “mathematical” terms where the ID variables (rows) are placed on the left hand side and your measuring variables are placed on the right hand side.
Thank you Casper for pointing this out!
library(data.table)
x <- as.data.table(x)
dcast(
  data = x,
  formula = ID ~ ...,
  value.var = c("APPROVAL_STATUS", "APPROVAL_DATE", "APPROVER")
)
## ID APPROVAL_STATUS_STEP_A APPROVAL_STATUS_STEP_B APPROVAL_DATE_STEP_A APPROVAL_DATE_STEP_B APPROVER_STEP_A ## 1: 1234 APPROVED APPROVED 23-Jan-2019 21-Jan-2019 John Smith ## APPROVER_STEP_B ## 1: Jane Doe
While there are ways to pivot data from long to wide form in both Python and R, R makes shaping data less labor-intensive and more intuitive than Python. I am learning that both languages have their strengths, but for this data-wrangling challenge R saves time working through these sorts of details.
If you write in R or Python and have an alternative/better solution to answering this problem (or see a mistake) please feel free to reach out to me in a comment or message to share it with me!
nnetsauce is a general purpose tool for Statistical/Machine Learning, in which pattern recognition is achieved by using quasi-randomized networks. A new version, 0.5.0
, is out on PyPI and for R:
Install from pip (stable version):

pip install nnetsauce --upgrade

Or from GitHub (development version):

pip install git+https://github.com/thierrymoudiki/nnetsauce.git --upgrade
library(devtools)
devtools::install_github("thierrymoudiki/nnetsauce/R-package")
library(nnetsauce)
This could be the occasion for you to re-read all the previous posts about nnetsauce, or to play with various examples in Python or R. Here are a few other ways to interact with the nnetsauce:
1) Forms
2) Submit Pull Requests on GitHub
yourgithubname_ddmmyy_shortdescriptionofdemo.[ipynb|Rmd]
If it’s a jupyter notebook written in R, then just add _R
to the suffix.
3) Reaching out directly via email
To those who are contacting me through LinkedIn: no, I’m not declining; please add a short message to your request so that I know a bit more about who you are and/or how we might work together.
This new version, 0.5.0
:
Base
class, and for many other utilities.n_hidden_features
parameter. How do you try it out? By instantiating a class with the option:backend = "gpu"
or
backend = "tpu"
An example can be found in this notebook, on GitHub.
nnetsauce’s future release is planned to be much faster on CPU, due to the use of Cython, as with mlsauce. Many parts of nnetsauce can indeed be cythonized. If you’ve ever considered joining the project, now is the right time. For example, among other things, I’m looking for a volunteer to do some testing in R + Python on Microsoft Windows. Expect a smooth onboarding, even if you don’t have a lot of experience.