Python-bloggers aggregates blogs focused on using Python’s data analysis super-power for data science, machine learning, and statistics. Brought to you by the same folks that publish the hugely popular R-bloggers, it is well worth a read. Check it out here!
Quantocracy is a great resource for all things related to quantitative and empirical investing. We learn something every time we visit. Expand your knowledge here!
R-bloggers is a great resource. We visit the website almost every day. Shouldn’t you? Have a look https://www.r-bloggers.com/.
Working with Python’s pandas library often?
This resource will be worth its weight in gold!
Kevin Markham shares his tips and tricks for the most common data handling tasks on Twitter. He compiled the top 100 in this one amazing overview page. Find the hyperlinks to specific sections below!
Kevin even made a video demonstrating his 25 most useful tricks:
Late one evening, I was scrolling through Reddit and came across a news article titled “Why Bill Gates wants us all to get vaccinated?”. The news site looked legitimate. I was halfway through the article when I noticed quite a few grammatical errors. Being a lurker, I switched to the comments and saw a few of them mention that the article was AI generated. I spent a couple of minutes googling how to generate text using AI and found at least half a dozen websites. The first one that caught my eye was DeepAI.
The text generation API is backed by a large-scale unsupervised language model that can generate paragraphs of text. This transformer-based language model, based on the GPT-2 model by OpenAI, intakes a sentence or partial sentence and predicts subsequent text from that input.
Recently, I started to advertise my blog on Google. The most common search term that triggered my ad was “How to become a data scientist?”. I decided to test this unsupervised neural network by asking it “How could one become a data scientist in 30 days?”. The results generated by DeepAI were pretty interesting. You do have the option to use the website to generate text, but why take the easy road when you can make it a little harder by using R and Python?
# install httr package
install.packages("httr")
# load httr package
library(httr)
# do a POST request
r = httr::POST(
  url = "https://api.deepai.org/api/text-generator",
  body = list("text" = "How to become a data scientist in 30 days?"),
  add_headers(.headers = c("api-key" = "quickstart-QUdJIGlzIGNvbWluZy4uLi4K"))
)
# convert raw result to text
cat(content(r, "text"), "\n")
import requests

r = requests.post(
    "https://api.deepai.org/api/text-generator",
    data={
        'text': 'How to become a data scientist in 30 days?',
    },
    headers={'api-key': 'quickstart-QUdJIGlzIGNvbWluZy4uLi4K'}
)
print(r.json())
Below is what DeepAI generated:
I was introduced to the concept of statistical data science. Then when I was 16 at the beginning of my senior year at Stanford, I was an undergraduate. It’s a great experience. I did theoretical data science a few years ago — a lot. I also spent a lot of time working with my colleague John Roberts and his team, for example, to do the software development for computing, or for things like this. I’d like to go to another series of data science areas. My colleagues, I had a big day at Stanford when they told me that they thought I was one of the few scientists in their field – we would be in that particular field, or, it seemed, at least it was out in this field. So I spent a lot of time with John Roberts and me as we were doing a paper. But in the late 70s, I was still getting really into statistics, and he had the idea of making our paper, a new form of math. His paper is very important.
So for the most part, you got data science at Stanford, you had computer programs at Princeton, or the National Academy of Sciences, or the NASA/ESA center. You really didn’t have to learn math by computer, to have that knowledge. There was little information for about mathematical data science there. How did the first data science team come across that?
We were working really hard on getting a complete research plan. We were looking to learn in 20 days, about the basic tasks — how to build a machine, how to make a database that can give you great results. In this case, we were really looking to learn the basic problem of building a data machine to solve this task because I was studying a very abstract software framework. I hadn’t really taken the time to get much education in computer science. So I did an internship in a lab where I was reading books in the lab and writing a paper. But in the meantime, I was really getting really motivated with my training to make paper paper. We had a team of people working as an engineer and I wanted to learn how to build a relational database so that I could have the help of people like John Roberts and his team.
Now, there were a couple of other types of projects that the team had that they came across. In 2003, I had made a paper that had been published in the journal PLOS One, from a person with a little bit of a computer science background. I was an editor of its paper called the “Cogent Bayes Analysis of Data Structure” [which] has shown the effect of a given object having a complex structure that has different levels of complexity than a typical human-made database … I had had the idea that what was being used for a computation is actually a complex computer. The problem with that paper was, it has a large size, but it had very small size. But basically, I was reading the paper on the difficulty of constructing a complex structure that was going to be very difficult to make, and I think I might have made a paper about that. The problem is, I thought to myself that the paper you presented would be pretty easy to make.
When I wrote that paper, was it ever just about finding stuff?
It’s not always easy. To be able to think like this is something you never really want to make. In the most simple, I mean, you are just looking at things from the inside out. It takes really long enough to build a database and then it might take a year or so to make it a bit of work. It requires a lot of resources.
Where did you see this pattern in data science as the next step for you, but what do you see when you do it?
The pattern was kind of like a pattern in science. It’s always something we’re gonna have to look at, and then the data that we’re collecting it becomes something that we’re going to look at and think about, then we’ll get really good at what we’re doing and then we’ll make the data more interesting, and then we’re like, oh, I’m going to start making a paper and it’s going to take too long to go back and figure out how to do that. But, ultimately, what we’re doing is not making a paper that can actually understand things without a lot of research because a lot of its parts might be just things that are really important for a paper object or whatever, and we’re really digging into the literature or just finding stuff. That was the real challenge.
As you’ve mentioned, the main goal of Data Science is to take complex data into an understanding system that, for example, could be something like an evolutionary programming model, how you can create a data network that allows you to define what we could do.
AI-writer is an online platform where you can ask the AI to write an article for you. It also generates citations and sources for your article. Below is the article written by AI-writer.
Becoming a good data scientist takes a lot of work, energy and time, but don’t think the whole world would be a data scientist if it were easy and fast. Even if there is no right way to learn data science, a learning period of a few months will be one of your lifelong investments. ^{Sources: 3}
Even Harvard Business School has found it the sexiest job of the 21st century, with an average salary of more than $100,000 a year and an annual salary of more than $200,000. ^{Sources: 3}
No wonder so many people are interested in becoming computer scientists. According to the US Census Bureau, the expected growth rate for data science jobs over the next decade includes more than 1.5 million new jobs for those who work as similar data scientists. As the word “data scientist” has become increasingly popular recently, many companies looking for mathematicians, statisticians, and business analysts will be posting jobs with the title “data scientist” to attract more attention. ^{Sources: 5}
You don’t have to be the world’s best programmer or spend a fortune to work in big data analytics and artificial intelligence, but you do need to know the basics. You can start right away by working with a data science program at a company like Google, Microsoft, Apple, Facebook, Amazon or any other company. ^{Sources: 5, 7}
This means that virtually anyone can improve their employability and career prospects by learning the basic theoretical and practical skills required for data science. Finally, it should be noted that data science is likely to offer a number of benefits that can help your current career if you have the opportunity to become a full-time or freelance data scientist. Demand for data analysis skills in their day-to-day work is forecast to exceed the requirements of traditionally qualified data scientists in the coming years. ^{Sources: 0, 7}
If you have skills in data analysis, you will almost certainly be able to add value in some way, but what exactly you can do depends heavily on what your job is. Although they can affect the outcome of your business and improve your own productivity, they are unlikely to help you earn more from your current role. ^{Sources: 0}
There is currently a skills gap in data science, where the majority of data scientists have less than five years of experience. Some companies are looking for experienced professionals with 10 years or more, and in these cases, data analysts will continue their training and hone their skills as data scientists. After 10 + years of work, a data scientist can obtain a doctorate, take on the role of Director of Data Science and undergo further training. His or her title may not change, but in this case the data analyst will continue his or her education or hone his or her skills before becoming a data scientist. ^{Sources: 1}
If you are interested in a career related to data, there are two ways to become a data scientist: to become a data analyst or director of data science. ^{Sources: 1}
In this way, the candidate will be able to limit the chances of failure and understand the questions and skills that recruiters expect from a data scientist. Data science and analytics professionals are in high demand and enjoy high pay and benefits, as well as access to a wide range of opportunities. A 2017 study by IBM, Quant Crunch, found that employers are looking for in-demand skills such as analytics, machine learning, and artificial intelligence. ^{Sources: 1, 6}
The road to learning data science is rocky, but regardless of whether you put in enough hard work, this period is suitable for anyone who knows the pros and cons of data science. ^{Sources: 6}
If becoming a data scientist sounds fascinating, the Flatiron School offers several ways to start your career in big data. For candidates looking for a way to crack an interview, we recommend taking a look at the Landing School’s data science education program, as it is a great example of how the interview process works in the field of data science. The Flat Iron School’s “Data Science in 30 Days” program turns you into a data scientist in just 15 weeks. ^{Sources: 2, 6}
For those who want to hone their skills but may not be quite ready to change their career, this could be the right option for you. ^{Sources: 2}
If you’re not sure if you can commit to a full bootcamp when it comes to data science, the Online Data Science Bootcamp Prep will cover the basics. Most of the people who attended NYC science academy bootcamps have STEM backgrounds. ^{Sources: 2, 4}
Business data is increasingly digitalized, and computers with higher computing power are faster and cheaper. The Internet of Things leads data scientists to sift through billions of pieces of data, including emails, social media posts, and other data. Soon, almost every type of company will be looking for a data scientist, and their business data will increasingly be digitized.
These articles are not the greatest. But once an article is generated, you could spend about an hour brushing it up before publishing. In a time when fake news is spread everywhere and these tools are getting better every day, it becomes quite challenging to tell what is “true” and what is “fake”. At some point we will need government legislation and a set of laws on building ethical AI.
If you enjoyed this article, do check out my other recent post. Also, feel free to comment below and share it.
The post How to become a data scientist in 30 days? appeared first on Hi! I am Nagdev.
In our last post, we took a quick look at building a portfolio based on the historical averages method for setting return expectations. Beginning in 1987, we used the first five years of monthly return data to simulate a thousand possible portfolio weights, found the average weights that met our risk-return criteria, and then tested that weighting scheme on two five-year cycles in the future. At the end of the post, we outlined the next steps for analysis among which performance attribution and different rebalancing schemes were but a few.
For this post, we’ll begin analyzing performance. Before we start we want to highlight a few points. First, over the last few posts we’ve included python versions of the analysis and graphs we produce in R. While we’ll stop flagging the presence of the python code going forward, we will continue to present it after the R code at the end of the text. If you find seeing the python code useful or have questions on reproducing our results, let us know at the email address below. We respond!
Second, we want to point out that the simulation method we used for the last post and this one is biased toward an average allocation. In other posts, we used a different, “hacky” method to allow for more extreme allocations. One kind reader suggested a more elegant solution based on the Dirichlet distribution. These different methods deserve a post on their own to explain what they’re doing and why you might prefer one over the other. That will have to wait. Nonetheless, we wanted to mention it in case anyone looked closely at the code. We’re opting for simple methods to drive our illustrations before moving into more complicated expositions. Cave paintings before Jackson Pollock if you will.
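For the curious, here is a rough Python sketch, not code from our posts, contrasting the two weight-generation schemes; the Dirichlet alpha value is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(123)

# "Hacky" baseline: draw uniforms and normalize each row.
# Weights cluster around 1/n, so extreme allocations are rare.
uniform_wts = rng.uniform(0, 1, size=(1000, 4))
uniform_wts = uniform_wts / uniform_wts.sum(axis=1, keepdims=True)

# Dirichlet alternative: alpha < 1 pushes mass toward the corners
# of the simplex, i.e. more concentrated portfolios.
dirichlet_wts = rng.dirichlet(alpha=[0.5] * 4, size=1000)

# Both schemes yield valid weights that sum to one per portfolio,
# but the Dirichlet draws are much more spread out per asset.
print(uniform_wts.std(), dirichlet_wts.std())
```

Either way each row is a legitimate portfolio; the difference is how much of the weight space the simulation actually explores.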
Third, if you’ve read some of the past posts and are scratching your head, wondering why the heck we haven’t graphed the efficient frontier or shown the tangency portfolio yet, don’t worry. We’ll get there! We want to build some intuitions for non-finance folks first. We also have some reasons for avoiding these tools that we’ll explain when we finally tackle them. On the other hand, if the efficient frontier sounds like a destination for Captain Kirk’s evil twin, we’ll explain it all soon enough. Now let’s boldly move on to the post.
When we looked at the two test periods for our portfolio, we noticed we significantly beat our return constraints in the first period, yet missed them in the second. Given these results, it might be a good idea to try to understand the sources of performance in the first period and what they might suggest about our allocation decisions for the next.
Let’s bring in the data, graph our portfolio simulations, and then look at the first performance period. Recall, our portfolio comprises four indices that encapsulate the major asset classes: stocks, bonds, commodities (gold in this case), and real estate. We pulled this data from the FRED database. Here’s the simulation of 1,000 potential portfolios shaded for the Sharpe ratio, based on the average risk and return of each asset for the five years ending in 1991. See the first post for details on our choice of data and length of time series.
Here’s the proposed weighting based on a required return of not less than 7% and a risk of not more than 10%.
This is how that portfolio performed relative to 1,000 simulated portfolios in the first five-year test period: 1992-1997.
What did each asset class contribute to the overall portfolio performance in the first test period?
We see that stocks were the biggest contributor while gold did nothing. Let’s look at risk contribution.
Here again, stocks sported the biggest contribution. Real estate’s “zero” contribution is an artifact of rounding. Still, it enjoyed very low volatility and low correlation with the other asset classes, hence a de minimis effect on portfolio volatility, an important finding.
A couple of things should stand out about these results. First, it’s all about stocks and bonds: they were the biggest contributors to both returns and risk. Despite a combined initial weight of about 67%, these two assets were responsible for about 96% of the return and 95% of the risk.
Second, comparing contributions, we see that stocks contributed slightly more to risk than they did to return, though the differences are modest. Conversely, bonds contributed relatively more to returns than they did to risk.
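To make the contribution calculations concrete, here is a small Python sketch with made-up monthly inputs; the weights echo our starting allocation, but the return and covariance numbers are purely illustrative:

```python
import numpy as np

# Illustrative weights and monthly inputs (not the post's actual data):
# stocks, bonds, gold, real estate
wts = np.array([0.37, 0.30, 0.11, 0.22])
mean_ret = np.array([0.010, 0.006, 0.002, 0.007])   # monthly mean returns
cov = np.array([[0.0018, 0.0002, 0.0001, 0.0003],
                [0.0002, 0.0006, 0.0000, 0.0001],
                [0.0001, 0.0000, 0.0012, 0.0000],
                [0.0003, 0.0001, 0.0000, 0.0004]])

# Return contribution: each asset's weighted mean return as a share
# of the portfolio's mean return.
ret_contrib = wts * mean_ret / (wts @ mean_ret)

# Risk contribution: w_i * (Sigma w)_i / sigma_p^2 gives each asset's
# share of portfolio variance; the shares sum to one.
port_vol = np.sqrt(wts @ cov @ wts)
risk_contrib = wts * (cov @ wts) / port_vol**2

print(ret_contrib.round(2), risk_contrib.round(2))
```

Comparing the two vectors asset by asset is exactly the stocks-versus-bonds comparison above: an asset whose risk share exceeds its return share is a relatively expensive source of performance.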
Should we be surprised by these results? The original weighting for stocks was 37% vs. bonds at 30%, real estate at 22% and the remainder in gold. The stocks’ high contribution was due to their overall strong performance during the period. We nearly met our return constraints with one asset! However, since we didn’t rebalance, it also means that by the end of the period, stocks made up significantly more of the portfolio than they did at inception as shown in the graph below. Portfolio risk would be meaningfully driven by stocks going forward.
Recall that our constraints were not more than 10% risk and not less than 7% return on an annual basis. In the last post we decided not to change the original weighting for the next five-year period, although we did compare it to a different weighting scheme. Indeed, when we re-ran the weighting calculation to achieve our target constraints incorporating the recent performance, the results advised a return to the original allocation. What we didn’t mention was that returning to the old weighting scheme implies rebalancing, which would entail transaction costs. If we had left everything as is at the end of the first five-year period (purple dot), here is how the portfolio would have looked compared to rebalancing (red dot) and 1,000 simulations on the second five-year period.
The rebalanced portfolio yielded a modestly worse return, but a much better risk profile. Of course, in both cases, those portfolios were dominated by others. That is, there were other allocations that generated a higher return for the same level of risk. The rebalanced portfolio kept risk within the constraints even though returns missed the target. The non-rebalanced portfolio overshot the risk constraint. Still, only 14% of the portfolios met or exceeded the return constraints, while 94% of the simulated portfolios met the risk constraints.
In this case, rebalancing might not have been a bad idea. However, to analyze it properly we’d need to estimate transaction costs, including tax. Coding such effects warrants a series on its own. But back of the envelope suggests the following. The portfolio grew almost 58% so if we redeployed 12% of ending capital (the amount by which stocks are greater than their target weight), that equates to almost 30% of the growth being taxed.^{1} Assuming long term capital gains tax rates of 20%, that would equate to a 6 percentage point drag on cumulative performance.
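The back-of-the-envelope arithmetic, with the numbers straight from the text:

```python
# Back-of-envelope tax drag from rebalancing, using the figures above.
growth = 0.58                            # cumulative portfolio growth
ending_capital = 1 + growth
redeployed = 0.12 * ending_capital       # stocks above their target weight
share_of_growth_taxed = redeployed / growth
tax_rate = 0.20                          # long-term capital gains rate
drag = share_of_growth_taxed * tax_rate  # drag on cumulative performance

print(f"{share_of_growth_taxed:.0%} of growth taxed, {drag:.1%} drag")
```

Rounding aside, roughly a third of the growth gets taxed at 20%, which is where the six-or-so percentage point drag comes from.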
So even if we like the lower risk conferred by rebalancing, it needs to be worth the lower potential return due to taxes. How might we analyze that? One way might be to calculate the improvement in the Sharpe ratio. By returning to the original weighting, the Sharpe ratio would improve by 17 percentage points, based on the returns of the prior 10-year period. Seems like a lot, but even if the return per unit of risk improves, are we taking enough risk to meet our return targets?^{2} Sharpe ratios don’t feed the kids, as they say. Of course, if the portfolio is operating within a tax-free environment, that point is moot. Whatever the case, let’s think about what the performance results suggest.
First, we’d want to consider the financial assets vs. the real ones. The financial assets (stocks and bonds) are generating the bulk of the return and risk. Gold did little, though it didn’t add risk to the portfolio. Real estate was more interesting as it did contribute modestly to the return, but contributed very little to risk, mainly due to its low to negative correlation with financial assets. Should we dispense with gold altogether? Its negative correlation with financial assets yields some salutary benefits in terms of risk. But returns leave a lot to be desired.
Second, do we want to maintain the same weights on the financial assets? If we had a taxable, static portfolio we’d need to consider whether the tax impact was worth the lower risk, as discussed above. However, if we received regular inflows of capital, then we could deploy it to return to the target weights without paying taxes. Of course, that would expose us to some additional risk; that is, how long we’d be willing to sustain off target allocations and how much timing risk we’d want to endure. Even if we wanted to bring our allocations back to their original weights, there might be a timing mismatch between when we had new funds to allocate and whether the price we might pay still aligned with our original risk-return projections.
Alternatively, rather than selling winners to buy losers, what if we sold the losers and redeployed the capital into more attractive assets, entirely eliminating exposure to one asset class?^{3} Recall how we flagged how simple our simulation was? Well, it was even more simplistic than that, since we didn’t include an option to invest in only a subset of the assets. Excluding assets adds another wrinkle to the analysis and to the growing list of posts we need to write. But let’s look briefly at how the portfolio would have fared relative to others that excluded some assets.
In the graph below, we simulate 3,000 portfolios for different combinations of assets. That is, a thousand each for four, three, and two asset portfolios.
As one can see, when we add in the chance to exclude assets, the range of potential outcomes increases significantly. There’s only one way to choose four out of four assets. But there are eleven ways to choose four, three, and two out of four assets.^{4} The original portfolio lies along the dark band of better risk-return (e.g., higher Sharpe ratio) portfolios. In fact, its Sharpe ratio is better than 84% of the portfolios. But about 85% of the portfolios beat our risk-return constraints, so one could have been playing with blunt darts and still have hit the board.
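The subset count is just a sum of binomial coefficients:

```python
from math import comb

# Ways to hold exactly 4, 3, or 2 of the 4 assets:
# C(4,4) + C(4,3) + C(4,2) = 1 + 4 + 6
n_subsets = comb(4, 4) + comb(4, 3) + comb(4, 2)
print(n_subsets)  # 11
```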
Where does this leave us? The key is that past performance provides powerful insights, but is only one sample. The range of possible results is vast, and grows multiplicatively when you add even a few reasonable choices.
We see that when only a portion of our assets are responsible for the bulk of meeting our constraints, it’s worthwhile considering removing, or at least de-emphasizing, the non-contributing assets. Before we do that, we should analyze whether we’re removing them because they performed poorly, yielded results outside the range of likely expectations, or because we’re trying to predict the future. In the first case, poor performance alone doesn’t warrant surgery, as all assets suffer periods of weak returns. If we’re considering exclusion due to results outside of likely expectations, we need to make sure our expectations were well calibrated. Then we need to decide whether to revise our expectations and understand the level of mean reversion present in that asset. Five-hundred-year floods can happen more than once in a decade. That doesn’t mean the probability that there will be more than one in the next decade has changed.
Finally, if we’re trying to predict the future, do we have a framework or process that gives us a reasonable probability of success at such a fraught endeavor? Even if we’re not, are we implicitly doing so due to behavioral biases, because we’re all too human? If we’re only trying to earn a return commensurate with risk, then we have to be careful that we’re not substituting a “belief” about the future for a disciplined process to justify changes to our portfolio.
All of these questions refer back to what our expectations were. We looked at historical averages and chose a portfolio based on the average weights that would produce a reasonable risk-return profile. That implicitly assumes the future will look like the past. But as we saw, we outperformed and then underperformed that assumption. Was it reasonable to expect our portfolio to meet those constraints? Whether it did or didn’t, shouldn’t we be analyzing performance based not only on what happened but also on reasonable expectations of what could have happened? Sure, we assumed implicitly that the future would look like the past, but did it, and what might it have looked like otherwise?
Ultimately, we want to know whether our logic was sound regardless of the outcome. We want to make sure we’re wrong for the right reasons, not right for the wrong ones. We’ll look at ways to simulate expectations and analyze our logic in our next post. Until then the R and then python code are below. Enjoy.
#### Load packages ####
suppressPackageStartupMessages({
library(tidyquant)
library(tidyverse)
})
## Load data
# NOTE: PLEASE SEE PRIOR POST FOR DATA DOWNLOAD AND WRANGLING
df <- readRDS("port_const.rds")
sym_names <- c("stock", "bond", "gold", "realt", "rfr")
## Load simulation function
port_sim <- function(df, sims, cols){
if(ncol(df) != cols){
stop("Columns don't match")
}
# Create weight matrix
wts <- matrix(nrow = sims, ncol = cols)
for(i in 1:sims){
a <- runif(cols,0,1)
b <- a/sum(a)
wts[i,] <- b
}
# Find returns
mean_ret <- colMeans(df)
# Calculate covariance matrix
cov_mat <- cov(df)
# Calculate random portfolios
port <- matrix(nrow = sims, ncol = 2)
for(i in 1:sims){
port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
}
colnames(port) <- c("returns", "risk")
port <- as.data.frame(port)
port$Sharpe <- port$returns/port$risk*sqrt(12)
max_sharpe <- port[which.max(port$Sharpe),]
graph <- port %>%
ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) +
geom_point(size = 1.2, alpha = 0.4) +
scale_color_gradient(low = "darkgrey", high = "darkblue") +
labs(x = "Risk (%)",
y = "Return (%)",
title = "Simulated portfolios")
out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts)
}
## Run simulation
set.seed(123)
port_sim_1 <- port_sim(df[2:61,2:5],1000,4)
## Graph simulation
port_sim_1$graph +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Load portfolio selection function
port_select_func <- function(port, return_min, risk_max, port_names){
port_select <- cbind(port$port, port$wts)
port_wts <- port_select %>%
mutate(returns = returns*12,
risk = risk*sqrt(12)) %>%
filter(returns >= return_min,
risk <= risk_max) %>%
summarise_at(vars(4:7), mean) %>%
`colnames<-`(port_names)
graph <- port_wts %>%
rename("Stocks" = 1,
"Bonds" = 2,
"Gold" = 3,
"Real estate" = 4) %>%
gather(key,value) %>%
ggplot(aes(reorder(key,value), value*100 )) +
geom_bar(stat='identity', position = "dodge", fill = "blue") +
geom_text(aes(label=round(value,2)*100), vjust = -0.5) +
scale_y_continuous(limits = c(0,40)) +
labs(x="",
y = "Weights (%)",
title = "Average weights for risk-return constraints")
out <- list(port_wts = port_wts, graph = graph)
out
}
## Run selection function and graph results
results_1 <- port_select_func(port_sim_1,0.07, 0.1, sym_names[1:4])
results_1$graph
## Function for portfolio returns without rebalancing
rebal_func <- function(act_ret, weights){
ret_vec <- c()
wt_mat <- matrix(nrow = nrow(act_ret), ncol = ncol(act_ret))
for(i in 1:60){
wt_ret <- act_ret[i,]*weights # wt'd return
ret <- sum(wt_ret) # total return
ret_vec[i] <- ret
weights <- (weights + wt_ret)/(sum(weights)+ret) # new weight based on change in asset value
wt_mat[i,] <- as.numeric(weights)
}
out <- list(ret_vec = ret_vec, wt_mat = wt_mat)
out
}
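Since we’ve been pairing the R code with Python, here is a rough Python equivalent of the drift function above (a sketch with an illustrative name, not the exact code we use):

```python
import numpy as np

def drift_returns(act_ret, weights):
    """Portfolio returns without rebalancing: weights drift with asset values.

    act_ret: (n_periods, n_assets) array of simple returns
    weights: initial weight vector summing to one
    """
    act_ret = np.asarray(act_ret, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ret_vec = np.empty(len(act_ret))
    wt_mat = np.empty_like(act_ret)
    for i, period in enumerate(act_ret):
        wt_ret = period * weights                 # weighted return per asset
        ret_vec[i] = wt_ret.sum()                 # total portfolio return
        # new weights reflect each asset's change in value
        weights = (weights + wt_ret) / (weights.sum() + ret_vec[i])
        wt_mat[i] = weights
    return ret_vec, wt_mat

# Two-period toy example: each asset returns 10% in one period
rets = [[0.10, 0.00], [0.00, 0.10]]
rv, wm = drift_returns(rets, [0.5, 0.5])
print(rv, wm[-1])
```

Note the weight update divides by one plus the period return, so the weights stay on the unit simplex while drifting toward whichever assets outperformed.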
## Run function and create actual portfolio and data frame for graph
port_1_act <- rebal_func(df[62:121,2:5],results_1$port_wts)
port_act <- data.frame(returns = mean(port_1_act$ret_vec),
risk = sd(port_1_act$ret_vec),
sharpe = mean(port_1_act$ret_vec)/sd(port_1_act$ret_vec)*sqrt(12))
## Simulate portfolios on first five-year test period
set.seed(123)
port_sim_2 <- port_sim(df[62:121,2:5], 1000, 4)
## Graph simulation with chosen portfolio
port_sim_2$graph +
geom_point(data = port_act,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Weighted performance by asset
assets = factor(c("Stocks", "Bonds", "Gold", "Real estate"),
levels = c("Stocks", "Bonds", "Gold", "Real estate"))
calc <- apply(df[62:121, 2:5]*rbind(as.numeric(results_1$port_wts),port_1_act$wt_mat[1:59,]),
2,
function(x) (prod(1+x)^(1/5)-1)*100) %>%
as.numeric()
perf_attr <- data.frame(assets = assets,
returns = calc)
perf_attr %>%
ggplot(aes(assets, returns)) +
geom_bar(stat="identity", fill ="darkblue") +
geom_text(aes(assets,
returns + 1,
label=paste("Return: ",
round(returns,1),
"%",
sep = "")),
size = 3) +
geom_text(aes(assets,
returns +0.5,
label=paste("Contribution: ",
round(returns/sum(returns),2)*100,
"%",
sep = "")),
size = 3) +
labs(x="",
y = "Return (%)",
title = "Asset performance weighted by allocation",
subtitle = "Returns are at a compound annual rate")
## Volatility contribution
port_1_vol <- sqrt(t(as.numeric(results_1$port_wts)) %*%
cov(df[62:121,2:5]) %*%
as.numeric(results_1$port_wts))
vol_cont <- as.numeric(results_1$port_wts) %*%
cov(df[62:121,2:5])/port_1_vol[1,1] *
as.numeric(results_1$port_wts)
vol_attr <- data.frame(assets = assets,
risk = as.numeric(vol_cont)*sqrt(12)*100)
vol_attr %>%
ggplot(aes(assets, risk)) +
geom_bar(stat="identity", fill ="darkblue") +
geom_text(aes(assets,
risk + 0.5,
label=paste("Risk: ",
round(risk,1),
"%",
sep = "")),
size = 3) +
geom_text(aes(assets,
risk +0.25,
label=paste("Contribution: ",
round(risk/sum(risk),2)*100,
"%",
sep = "")),
size = 3) +
labs(x="",
y = "Return (%)",
title = "Asset risk weighted by allocation")
## Asset weighing beginning and end of period
rbind(results_1$port_wts,port_1_act$wt_mat[60,]) %>%
data.frame() %>%
`colnames<-`(c("Stocks", "Bonds", "Gold", "Real estate")) %>%
gather(key,value) %>%
mutate(time = rep(c("Beg", "End"), 4),
key = factor(key, levels = c("Stocks", "Bonds", "Gold", "Real estate"))) %>%
ggplot(aes(key, value*100, fill = time)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(, values = c("slategrey", "darkblue")) +
geom_text(aes(key,
value *100+5,
label=round(value,2)*100),
position = position_dodge(1)) +
labs(x="",
y = "Weight (%)",
title = "Asset weights at beginning and end of period") +
theme(legend.position = "top",
legend.title = element_blank())
set.seed(123)
port_sim_3 <- port_sim(df[122:181,2:5], 1000, 4)
ret_old_wt <- rebal_func(df[122:181, 2:5], results_1$port_wts)
ret_same_wt <- rebal_func(df[122:181, 2:5], port_1_act$wt_mat[60,])
port_act_1_old <- data.frame(returns = mean(ret_old_wt$ret_vec),
risk = sd(ret_old_wt$ret_vec),
sharpe = mean(ret_old_wt$ret_vec)/sd(ret_old_wt$ret_vec)*sqrt(12))
port_act_1_same <- data.frame(returns = mean(ret_same_wt$ret_vec),
risk = sd(ret_same_wt$ret_vec),
sharpe = mean(ret_same_wt$ret_vec)/sd(ret_same_wt$ret_vec)*sqrt(12))
port_sim_3$graph +
geom_point(data = port_act_1_old,
aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") +
geom_point(data = port_act_1_same,
aes(risk*sqrt(12)*100, returns*1200), size = 4, color="purple") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
# Rebalancing comp
ret_10 <- apply(df[2:121,2:5], 2, mean)
cov_10 <- cov(df[2:121,2:5])
port_ret_old <- sum(results_1$port_wts*ret_10)
port_ret_new <- sum(port_1_act$wt_mat[60,]*ret_10)
vol_old <- sqrt(t(as.numeric(results_1$port_wts)) %*% cov_10 %*% as.numeric(results_1$port_wts))
vol_new <- sqrt(t(port_1_act$wt_mat[60,]) %*% cov_10 %*% port_1_act$wt_mat[60,])
sharpe_old <- port_ret_old/vol_old*sqrt(12)
sharpe_new <- port_ret_new/vol_new*sqrt(12)
port_sim_lv <- function(df, sims, cols){
  if(ncol(df) != cols){
    stop("Columns don't match")
  }
  # Create weight matrix
  wts <- matrix(nrow = (cols-1)*sims, ncol = cols)
  count <- 1
  for(i in 1:(cols-1)){
    for(j in 1:sims){
      a <- runif((cols-i+1), 0, 1)
      b <- a/sum(a)
      c <- sample(c(b, rep(0, i-1))) # shuffle the normalized weights plus zero padding
      wts[count,] <- c
      count <- count + 1
    }
  }
  # Find returns
  mean_ret <- colMeans(df)
  # Calculate covariance matrix
  cov_mat <- cov(df)
  # Calculate random portfolios
  port <- matrix(nrow = (cols-1)*sims, ncol = 2)
  for(i in 1:nrow(port)){
    port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
    port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
  }
  colnames(port) <- c("returns", "risk")
  port <- as.data.frame(port)
  port$Sharpe <- port$returns/port$risk*sqrt(12)
  max_sharpe <- port[which.max(port$Sharpe),]
  graph <- port %>%
    ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) +
    geom_point(size = 1.2, alpha = 0.4) +
    scale_color_gradient(low = "darkgrey", high = "darkblue") +
    labs(x = "Risk (%)",
         y = "Return (%)",
         title = "Simulated portfolios")
  out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts)
  out
}
test_port <- port_sim_lv(df[62:121, 2:5], 1000, 4)
test_port$graph +
geom_point(data = port_act,
aes(risk*sqrt(12)*100, returns*1200),
size = 4,
color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
And for the Pythonistas:
# Load libraries
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('ggplot')
sns.set()
## Load data
start_date = '1970-01-01'
end_date = '2019-12-31'
symbols = ["WILL5000INDFC", "BAMLCC0A0CMTRIV", "GOLDPMGBD228NLBM", "CSUSHPINSA", "DGS5"]
sym_names = ["stock", "bond", "gold", "realt", 'rfr']
filename = 'port_const.pkl'
try:
    df = pd.read_pickle(filename)
    print('Data loaded')
except FileNotFoundError:
    print("File not found")
    print("Loading data", 30*"-")
    data = web.DataReader(symbols, 'fred', start_date, end_date)
    data.columns = sym_names
    # Resample to month-end and convert to returns (mirroring the R data prep), then cache
    data = data.ffill().resample('M').last().loc['1987':'2019']
    df = data.iloc[:, :4].pct_change().assign(rfr=data['rfr']/100)
    df.to_pickle(filename)
## Simulation function
class Port_sim:
    @staticmethod
    def calc_sim(df, sims, cols):
        wts = np.zeros((sims, cols))
        for i in range(sims):
            a = np.random.uniform(0, 1, cols)
            b = a/np.sum(a)
            wts[i,] = b
        mean_ret = df.mean()
        port_cov = df.cov()
        port = np.zeros((sims, 2))
        for i in range(sims):
            port[i,0] = np.sum(wts[i,]*mean_ret)
            port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T, port_cov), wts[i,]))
        sharpe = port[:,0]/port[:,1]*np.sqrt(12)
        best_port = port[np.where(sharpe == max(sharpe))]
        max_sharpe = max(sharpe)
        return port, wts, best_port, sharpe, max_sharpe

    @staticmethod
    def calc_sim_lv(df, sims, cols):
        wts = np.zeros(((cols-1)*sims, cols))
        count = 0
        for i in range(1, cols):
            for j in range(sims):
                a = np.random.uniform(0, 1, (cols-i+1))
                b = a/np.sum(a)
                # pad with i-1 zeros so each row has cols entries, then shuffle
                c = np.random.choice(np.concatenate((b, np.zeros(i-1))), cols, replace=False)
                wts[count,] = c
                count += 1
        mean_ret = df.mean()
        port_cov = df.cov()
        port = np.zeros(((cols-1)*sims, 2))
        for i in range(len(port)):  # iterate over all (cols-1)*sims portfolios
            port[i,0] = np.sum(wts[i,]*mean_ret)
            port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T, port_cov), wts[i,]))
        sharpe = port[:,0]/port[:,1]*np.sqrt(12)
        best_port = port[np.where(sharpe == max(sharpe))]
        max_sharpe = max(sharpe)
        return port, wts, best_port, sharpe, max_sharpe

    @staticmethod
    def graph_sim(port, sharpe):
        plt.figure(figsize=(14,6))
        plt.scatter(port[:,1]*np.sqrt(12)*100, port[:,0]*1200, marker='.', c=sharpe, cmap='Blues')
        plt.colorbar(label='Sharpe ratio', orientation='vertical', shrink=0.25)
        plt.title('Simulated portfolios', fontsize=20)
        plt.xlabel('Risk (%)')
        plt.ylabel('Return (%)')
        plt.show()
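The weight generation inside calc_sim (draw uniforms, divide by their sum) can be sketched on its own; this is a minimal standalone illustration, not part of the original post's code:

```python
import numpy as np

rng = np.random.default_rng(123)
a = rng.uniform(0, 1, 4)  # four raw draws, one per asset
w = a / a.sum()           # normalize so the weights sum to one
print(w.sum())            # 1.0 up to floating point
```

Note that normalizing uniform draws is not a uniform sample from the simplex (that would be a Dirichlet(1, 1, 1, 1) draw), but it serves for generating a spread of long-only portfolios here.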
# Plot
np.random.seed(123)
port_sim_1, wts_1, _, sharpe_1, _ = Port_sim.calc_sim(df.iloc[1:60,0:4],1000,4)
Port_sim.graph_sim(port_sim_1, sharpe_1)
## Selection function
# Constraint function
def port_select_func(port, wts, return_min, risk_max):
    port_select = pd.DataFrame(np.concatenate((port, wts), axis=1))
    port_select.columns = ['returns', 'risk', 1, 2, 3, 4]
    port_wts = port_select[(port_select['returns']*12 >= return_min) &
                           (port_select['risk']*np.sqrt(12) <= risk_max)]
    port_wts = port_wts.iloc[:,2:6]
    port_wts = port_wts.mean(axis=0)

    def graph():
        plt.figure(figsize=(12,6))
        key_names = {1: "Stocks", 2: "Bonds", 3: "Gold", 4: "Real estate"}
        lab_names = []
        graf_wts = port_wts.sort_values()*100
        for i in range(len(graf_wts)):
            name = key_names[graf_wts.index[i]]
            lab_names.append(name)
        plt.bar(lab_names, graf_wts)
        plt.ylabel("Weight (%)")
        plt.title("Average weights for risk-return constraint", fontsize=15)
        for i in range(len(graf_wts)):
            plt.annotate(str(round(graf_wts.values[i])), xy=(lab_names[i], graf_wts.values[i]+0.5))
        plt.show()

    return port_wts, graph()
# Graph
results_1_wts,_ = port_select_func(port_sim_1, wts_1, 0.07, 0.1)
# Return function with no rebalancing
def rebal_func(act_ret, weights):
    ret_vec = np.zeros(len(act_ret))
    wt_mat = np.zeros((len(act_ret), len(act_ret.columns)))
    for i in range(len(act_ret)):
        wt_ret = act_ret.iloc[i,:].values*weights
        ret = np.sum(wt_ret)
        ret_vec[i] = ret
        # let the weights drift with returns: w_new = w*(1+r)/(1 + portfolio return)
        weights = (weights + wt_ret)/(np.sum(weights) + ret)
        wt_mat[i,] = weights
    return ret_vec, wt_mat
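The weight update in rebal_func lets the allocation drift with returns, w_new = w*(1+r)/(1 + portfolio return), which keeps the weights summing to one. A quick sanity check with hypothetical numbers:

```python
import numpy as np

w = np.array([0.25, 0.25, 0.25, 0.25])   # starting weights
r = np.array([0.02, -0.01, 0.03, 0.00])  # one month of made-up asset returns
wt_ret = w * r
port_ret = wt_ret.sum()                  # portfolio return for the month
new_w = (w + wt_ret) / (w.sum() + port_ret)  # same update as rebal_func
print(new_w.sum())                       # still 1.0
```

The best-performing asset gains weight and the worst loses it, which is exactly the drift a buy-and-hold portfolio experiences.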
## Run rebalance function using desired weights
port_1_act, wt_mat = rebal_func(df.iloc[61:121,0:4], results_1_wts)
port_act = {'returns': np.mean(port_1_act),
'risk': np.std(port_1_act),
'sharpe': np.mean(port_1_act)/np.std(port_1_act)*np.sqrt(12)}
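As elsewhere in the post, the Sharpe ratio is annualized with a sqrt(12) factor: mean returns scale with time while volatility scales with the square root of time. A small sketch with made-up monthly statistics:

```python
import numpy as np

monthly_mean, monthly_sd = 0.0075, 0.0115  # hypothetical monthly stats
annual_sharpe = monthly_mean / monthly_sd * np.sqrt(12)
print(round(annual_sharpe, 2))
```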
# Run simulation on recent five-years
np.random.seed(123)
port_sim_2, wts_2, _, sharpe_2, _ = Port_sim.calc_sim(df.iloc[61:121,0:4],1000,4)
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_sim_2[:,1]*np.sqrt(12)*100, port_sim_2[:,0]*1200, marker='.', c=sharpe_2, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
## Show performance attribution
calc = df.iloc[61:121,0:4]*np.concatenate((np.array([results_1_wts]).reshape(1,4),wt_mat))[0:60]
calc = calc.apply(lambda x: (np.prod(x+1)**(1/5)-1)*100)
contribution = round(calc/sum(calc),1)*100
plt.figure(figsize=(12,6))
key_names = ["Stocks", "Bonds", "Gold", "Real estate"]
graf_hts = calc.values
plt.bar(key_names, graf_hts)
plt.ylabel("Return (%)")
plt.ylim(0,7)
plt.title("Compound annual asset return weighted by allocation", fontsize=15)
for i in range(len(graf_hts)):
    plt.annotate("Return: " + str(round(graf_hts[i]))+"%", xy=(i-0.2, graf_hts[i]+0.35))
    plt.annotate("Contributions: " + str(contribution[i])+"%", xy=(i-0.2, graf_hts[i]+0.15))
plt.show()
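The np.prod(x+1)**(1/5) - 1 step above converts 60 monthly returns into a compound annual rate. For example, a constant 1% per month over five years compounds to the same figure as a single year of 1% months:

```python
import numpy as np

monthly = np.full(60, 0.01)                 # five years of 1%-per-month returns
cagr = np.prod(1 + monthly) ** (1 / 5) - 1  # compound annual growth rate
print(round(cagr, 6))                       # matches 1.01**12 - 1
```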
## Show volatility contribution
port_1_vol = np.sqrt(np.dot(np.dot(results_1_wts.T, df.iloc[61:121,0:4].cov()), results_1_wts))
vol_cont = np.dot(results_1_wts.T, df.iloc[61:121,0:4].cov()/port_1_vol) * results_1_wts
contribution = round(vol_cont/sum(vol_cont),1).values*100
plt.figure(figsize=(12,6))
key_names = ["Stocks", "Bonds", "Gold", "Real estate"]
graf_hts = vol_cont.values * np.sqrt(12) * 100
plt.bar(key_names, graf_hts)
plt.ylabel("Risk (%)")
plt.ylim(0,4)
plt.title("Asset risk weighted by allocation", fontsize=15)
for i in range(len(graf_hts)):
    plt.annotate("Risk: " + str(round(graf_hts[i]))+"%", xy=(i-0.2, graf_hts[i]+0.35))
    plt.annotate("Contribution: " + str(contribution[i])+"%", xy=(i-0.2, graf_hts[i]+0.15))
plt.show()
## Show beginning and ending portfolio weights
key_names = ["Stocks", "Bonds", "Gold", "Real estate"]
beg = results_1_wts.values*100
end = wt_mat[-1]*100
ind = np.arange(len(beg))
width = 0.4
fig,ax = plt.subplots(figsize=(12,6))
rects1 = ax.bar(ind - width/2, beg, width, label = "Beg", color="slategrey")
rects2 = ax.bar(ind + width/2, end, width, label = "End", color="darkblue")
for i in range(len(beg)):
    ax.annotate(str(round(beg[i])), xy=(ind[i] - width/2, beg[i]))
    ax.annotate(str(round(end[i])), xy=(ind[i] + width/2, end[i]))
ax.set_ylabel("Weight (%)")
ax.set_title("Asset weights at beginning and end of period\n", fontsize=16)
ax.set_xticks(ind)
ax.set_xticklabels(key_names)
ax.legend(loc='upper center', ncol=2)
plt.show()
## Run simulation on second five year period
np.random.seed(123)
port_sim_3, wts_3, _, sharpe_3, _ = Port_sim.calc_sim(df.iloc[121:181,0:4],1000,4)
ret_old_wt, _ = rebal_func(df.iloc[121:181, 0:4], results_1_wts)
ret_old = {'returns': np.mean(ret_old_wt),
'risk': np.std(ret_old_wt),
'sharpe': np.mean(ret_old_wt)/np.std(ret_old_wt)*np.sqrt(12)}
ret_same_wt, _ = rebal_func(df.iloc[121:181, 0:4], wt_mat[-1])
ret_same = {'returns': np.mean(ret_same_wt),
'risk': np.std(ret_same_wt),
'sharpe': np.mean(ret_same_wt)/np.std(ret_same_wt)*np.sqrt(12)}
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_sim_3[:,1]*np.sqrt(12)*100, port_sim_3[:,0]*1200, marker='.', c=sharpe_3, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.scatter(ret_old['risk']*np.sqrt(12)*100, ret_old['returns']*1200, c='red', s=50)
plt.scatter(ret_same['risk']*np.sqrt(12)*100, ret_same['returns']*1200, c='purple', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
np.random.seed(123)
test_port,_ ,_ ,sharpe_test, _ = Port_sim.calc_sim_lv(df.iloc[61:121, 0:4], 1000, 4)
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(test_port[:,1]*np.sqrt(12)*100, test_port[:,0]*1200, marker='.', c=sharpe_test, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
If the beginning portfolio value is $100 and the ending value is $160, then 12% of $160 is around $19, which is roughly 30% of the growth in value.
Back of the envelope suggests that one might give all of that advantage up in taxes, but a simulation would give a more rigorous answer.
There might even be tax advantages if the loss could offset taxes (up to a limit if there were no gains to absorb first).
To count the number of combinations of k objects from a set of n objects where order doesn’t matter, one uses the formula n!/(k!(n-k)!). Hence, choosing three from four and two from four yields four and six possible combinations, respectively. Add the case of four out of four and you have eleven possible combinations.
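The footnote's arithmetic can be checked with Python's math.comb (this check is ours, not part of the original post):

```python
from math import comb

# choose 3 of 4, 2 of 4, and 4 of 4 assets
counts = [comb(4, 3), comb(4, 2), comb(4, 4)]
print(counts, sum(counts))  # [4, 6, 1] 11
```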
As some of you may or may not know, HackerRank is a website that offers a variety of practice questions to work on your coding skills in an interactive online environment. You can work in a variety of languages like Java, C, C++, Python, and more! There are a lot of high-quality questions that can really challenge your present coding and problem-solving skills and help you build on them.
When I started out, I found that reading raw data was more challenging than writing the rest of the solution to the problem. This blog post shows how to read raw data as lists, arrays, and matrices, and hopefully sheds some light on how to do this in other problems.
I’m sure there are other, more effective ways to read raw input from HackerRank, but this has worked for me and I hope it will be helpful for others as well. Sorry in advance if my code appears to be juvenile.
To solve these problems, I will be working with Python 3.
To read a line of raw input, simply use the input() function.
While this is great for reading data, it comes in raw form: the data is received as a string, which isn’t any good if we want to do calculations.
Now that we know how to read raw input, let’s get the data readable and in the form we want for solving HackerRank challenges.
Let’s look at a problem that requires reading raw input as a list to solve.
This problem is called “Shape and Reshape“. It involves reading raw, space-separated data and turning it into a matrix.
Turning the data into a matrix can be done with the numpy package, but the data first needs to be made into a list. I do this by reading the raw data with input().strip().split(' ').
Let’s explain what each part of this code does.
input() takes in the raw input.
The .strip() method clears all leading and trailing spaces. While not necessary for our example, it is good practice so as not to run into this problem in other cases.
The .split(' ') method splits the raw data into individual elements. Because the data is space-separated, we define the split character to be a space. We could technically write split() with nothing inside (the default is any whitespace), but for this example we define the separator as a single space character.
The problem now is that the data needs to be converted to the proper type. Remember, input() reads all raw data as a string. For our problem, the data needs to be integers, which can be done by applying the int() function to each element in our list.
The code we use is thus:
n = input("Write input: ").strip().split(' ')
data=[int(i) for i in n]
# Print the data to see our result (Not required for the solution)
print(data)
The for-loop above may not seem intuitive at first if you are used to writing for-loops traditionally, but once you learn to write them this way, you may well come to prefer it. As someone who came to Python after initially spending a lot of time with R, I would describe this way of writing a for-loop as analogous to R’s sapply() or lapply() functions.
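To make the analogy concrete, here is the same conversion written three ways; the comprehension is the style used above, and map() is perhaps the closest cousin of R's lapply():

```python
n = "3 7 11".split(' ')          # stands in for input().strip().split(' ')

data_comp = [int(i) for i in n]  # list comprehension
data_map = list(map(int, n))     # map, akin to R's lapply/sapply
data_loop = []                   # traditional for-loop
for i in n:
    data_loop.append(int(i))

print(data_comp)  # [3, 7, 11]
```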
And there you have it! With two lines of code our data is ready for solving the problem!
(For actually solving the problem, you will have to figure it out yourself, or you can check out my code in my Python Musings GitHub repository. It’s still a work in progress, messy code and all.)
After looking in the discussions for this problem, I found that the raw data can be read directly as an array using the numpy package.
The code is:
import numpy as np
data = np.array(input("Write input: ").strip().split(' '),dtype= int)
# Print the data to see our result (Not required for the solution)
print(data)
Essentially, we pass the raw input that we previously read as a list directly into numpy’s np.array() function. To coerce the data into integers, we define the data type as integer by setting dtype=int in the np.array() call... and there you have it! An array of integers!
For reading matrices, let’s look at the problem titled “Transpose and Flatten“. It requires reading a matrix of data, transposing it, and flattening it. While those operations are pretty straightforward, reading the data might be a challenge.
Let’s first look at the input format.
We need to read the data in a way that tells us the number of rows and columns, followed by the elements to read.
To do this, the code is:
import numpy as np
n, m = map(int, input().strip().split())
array = np.array([input().strip().split() for i in range(n)], dtype=int)
print(array)
Let’s now break down the code:
Python has a very cool feature that lets you assign multiple variables in a single line by separating them with commas, so we can assign our row and column counts from the input in one line. To convert the two values, we use the map() function on the split input; applying int to both ensures the values are integers.
To get the rest of the data, we can read it directly into numpy’s np.array() function, iterating over the number of input lines we know we will have (i.e., the number of rows). With this, we get the matrix we want for our input and can now work on the problem!
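Once the matrix is read, the operations the problem actually asks for are numpy one-liners. Shown here on a hard-coded array rather than HackerRank input:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]])
print(arr.T)          # transpose: rows become columns
print(arr.flatten())  # flatten: row-major 1-D copy
```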
Doing challenges on HackerRank is a good way to build your skills in writing code and problem solving in Python. I personally found it challenging at first to read data into the form I wanted. I hope this article shed some light on solving these problems!
Be sure to check out my Python Musings GitHub repository to see where I am in my adventures!
Last week, I had a great opportunity to give a talk on data science applications in manufacturing at Acharya Institute of Technology (AIT), Bangalore. As an alumnus, I have a special place in my heart for AIT. A lot of the curious young minds who attended my session had great questions. Some highlights of the Q&A session follow.
What is the difference between Data Scientist and Data Analyst?
A data analyst works on combining data from different sources, performing data discovery, creating schemas, verifying and validating data consistency, and providing data reports. They also perform visualization tasks using BI tools. A data scientist often performs the job of a data analyst while also using mathematical and statistical principles to build models that solve a specific problem.
What is the difference between AI and deep learning?
I read an answer to this particular question recently that goes something along these lines: “AI is usually built using PowerPoint, and deep learning models are built using R and Python.” Most folks who use the term AI are usually from sales, marketing a product. To call a deep learning model AI would be specious. Currently, all publicly available deep learning models are still in their infancy, and we are a long way from reaching cognitive ability in these models.
If I want to be a data scientist, where do I start?
The data science field is an amalgamation of different fields. Most commonly, one needs a strong background in probability theory and statistics; if you can master these two fields, you are halfway there. Then you can figure out what domain you want to get into and work on your skills from there. For example, if you want to work at a social media platform, you can learn A/B testing, and so on.
Which models should I know from the different types of machine learning, such as supervised and unsupervised, if I am asked in an interview?
Learn the basics first, starting with simpler ones like linear regression and k-means clustering. Learn the math and assumptions underneath them and how they work. Once you’ve mastered that, fitting a model in R is as simple as calling the “lm” or “kmeans” function. The most basic models are also the simplest to explain in an interview or to a non-subject-matter expert.
As a recent graduate, I find companies ask for experience in data science. How can I go about getting it?
Data science is a very new field. Even data science executives at various companies come from different backgrounds like R&D, deployment, IT, etc. There is nothing wrong with companies asking for experience; it mainly reduces their training time and risk and lets them monetize faster. Coming to the question: one needs to build a portfolio to put their knowledge out there, for example writing LinkedIn articles, posting some of their work like tutorials and packages on GitHub, getting certifications, being active in data science community discussions, and volunteering at data science conferences. There are numerous ways you could build your portfolio.
Should I go for a graduate program or get certifications to advance my career as a data scientist?
I would recommend getting a degree. There are not a lot of schools that offer a data science graduate program, and even where you find one, it’s usually an extension of a statistics or business school program. Some alternatives are degrees in business information systems, statistics, or computer science. The main reason I would recommend a degree is that it helps with your career ladder: during promotion cycles, companies often require at least a master’s degree, and many prefer a Ph.D. There is a reason for this. Data science is not just knowing how to use R/Python and fitting a model with functions. It’s like any other research job, where you build qualitative and quantitative reasoning to set up trials and build models in different scenarios, adding, removing, and creating data through experimentation to build effective models. It also involves reading a lot of new research and testing those techniques in your work. Coding and fitting models is just a small part of data science, while most certifications teach you just coding, fitting models, and using a set of tools. So, my recommendation is to get a degree as the primary objective and pick up a few certifications. There is a nice report by Burtch Works on this here.
If you read through all of the above, thanks for sticking around. Comment below to share your thoughts. If you like, check out the slides from my presentation.
If you like this post, look at my other posts as well.
The post Data Science Application in Manufacturing appeared first on Hi! I am Nagdev.
In these previous posts, I introduced AdaOpt, a novel probabilistic classifier based on a mix of multivariable optimization and a nearest-neighbors algorithm. More details about the algorithm can also be found in this (short) paper. mlsauce’s development version now contains a parallel implementation of AdaOpt. In order to install this development version from the command line, you’ll need to type:
pip install git+https://github.com/thierrymoudiki/mlsauce.git --upgrade
And in order to use parallel processing, create the AdaOpt object (see the previous post) with:
n_jobs = 2 # or 4, or -1
In our last post, we compared the three most common methods used to set return expectations prior to building a portfolio. Of the three—historical averages, discounted cash flow models, and risk premia models—no single method dominated the others on average annual returns over one, three, and five-year periods. Accuracy improved as the time frame increased. Additionally, aggregating all three methods either by averaging predictions, or creating a multivariate regression from the individual explanatory variables, performed better than two out of the three individual methods. Nonetheless, on a five-year time horizon, the historical average method enjoyed the highest accuracy even against an ensemble of all methods.
In this post, we’ll take the historical average method, apply it to building a portfolio, and then test that portfolio against actual results. Our study will proceed as follows:
Before we begin, we want to highlight that we’re now including a python version of the code that replicates the analysis and graphs we achieve with R. You’ll find it after the R code at the bottom of the post.
Choosing a representative group of assets is not a trivial matter for reasons we’ve discussed in previous posts, not least of which is finding publicly available series that have a sufficiently long record. In the past, we’ve used large, liquid ETFs. Unfortunately, start dates vary, which means the amount of data is limited by the ETF with the shortest trading record. On the flip side, getting long data series in some cases implies using data that didn’t exist at the time. Those 100-year analyses of the S&P 500 are fine for academic purposes, but they don’t reflect investing reality since the index didn’t exist before the 1950s and wasn’t broadly investable by individuals until the great John Bogle created the first index fund in the mid-1970s. And no one wanted to buy it at first anyway!
We also want total return indices, as these reflect the benefits of dividends and interest that a real owner would receive. On this account, we scoured the St. Louis Fed’s database (FRED) to come up with four representative asset classes: stocks, bonds, commodities, and real estate as well as the risk-free rate. These are the Wilshire 5000 total market stock index, the ICE Bank of America investment grade corporate bond index, the gold index, the Case-Shiller Home price index, and the five-year Treasury note constant maturity index. The earliest start date for which all values are available is 1987, so this is a pretty good length. The downside is that apart from gold and Treasuries, these indices weren’t investable for much of the period (Wilshire 5000) or at all (Case-Shiller). So we’re sacrificing reality for illustration.
Whatever the case, let’s pull that data and then start with exploratory data analysis. We show the scatter plot, histograms, and correlation of returns in the chart below.^{1}
We see that the correlations between the various asset classes are relatively low, the return distributions are dissimilar, and the scatter plots show limited linear relationships.
Now we’ll run 1,000 simulations in which we randomly choose weights for the different asset classes and graph the resulting return (y-axis) against the risk (x-axis) associated with those portfolios. We modulate the hue of each portfolio by its return-to-risk, or Sharpe, ratio: darker is higher.
The graph seems fairly well distributed with a roughly balanced shape. The question now is what allocation based on this simulation is likely to produce our required return and risk tolerance? Hard to answer without a mandate or return target such as a benchmark or sustainable spending goal. Instead, we could put ourselves in the shoes of someone in 1992 (the start of the first portfolio) and imagine what she or he would be satisfied to achieve. The average annualized return for stocks from 1970-1992 was around 10% with a standard deviation close to 16%, yielding a Sharpe ratio of 0.125 to 0.6 depending on whether or not you adjust for Treasury yields (the risk-free rate) and which duration you use.
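The Sharpe range quoted above follows from simple arithmetic. A sketch of the calculation, where the 8% risk-free rate is our assumption for an early-1990s five-year Treasury yield, not a figure from the post:

```python
mean_ret, vol = 0.10, 0.16           # ~10% annual return, ~16% standard deviation
rf = 0.08                            # assumed early-1990s risk-free rate

sharpe_unadjusted = mean_ret / vol         # no risk-free adjustment
sharpe_adjusted = (mean_ret - rf) / vol    # adjusted for the risk-free rate
print(round(sharpe_unadjusted, 3), round(sharpe_adjusted, 3))  # 0.625 0.125
```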
Assuming we want equity-like returns with lower risk (what’s the point of diversification anyway!), we might search for returns that are above the 7% range with risk below 10%. We see from the graph above that this doesn’t cover a lot of the portfolios, so such constraints might not be realistic, but let’s see what the average weighting would be to achieve such a result.
The weights are rather balanced and not outlandish. While one could quibble with a weighting here or there, it would be hard to argue that this a “bad” or inherently risky allocation. Instead, one might argue that the allocation to stocks isn’t high enough.
Now we’ll see what our portfolio would have looked like vs. what actually happened over the next five years. In this case, we’ll assume we buy all of the assets at the prescribed weights at the beginning of the period and hold them until the end without any rebalancing. Clearly a highly artificial assumption, but we need to start somewhere. We’ll indicate our portfolio by the red dot in the scatter.
A very good result. Our portfolio sits close to the top end of the range for its risk. Portfolio returns averaged about 9% annually with risk of about 4%, which is actually quite unusual. (Then again, this was the internet boom.) Adjusting for the risk-free rate based on constant maturity five-year Treasury yields, the Sharpe ratio for the portfolio is 0.61, which is quite good for such a naive allocation.
Seeing that the portfolio performed so well, should we keep our current allocation or adjust it? And if we adjust it, should we base the simulations on the most recent five-years of data or the full period? The answer: let R do the heavy lifting while we survey the results.
If we keep the same constraints on returns of not less than 7% and risk not greater than 10%, the average weighting based on the previous five-years of data yields the following.
The weights appear to be relatively the same in terms of allocations to stocks and bonds, but gold gets a much higher allocation on the back of real estate. We’ll save space by omitting the simulation based on 10-year returns; it’s not much different than the five-year. Nonetheless, it might be instructive to graph the simulation on 10-year returns and compare it to the portfolio’s results, the red dot.
What the fudge? The portfolio achieved results that are out of the bounds of the simulation? This is not as crazy as it looks and is actually quite informative. Recall, the portfolio’s results are only for the second five-year period. Hence, if returns are back-end loaded, as appeared to be the case, then that subset could appear to be outside the realm of possibility since the earlier returns weigh down the averages for the entire period.
Second, what this suggests is that our weighting scheme may have just been lucky due to the timing of our allocation. Recall, we bought the entire portfolio on day one. We’d need to test whether a more measured allocation framework would have yielded different results. But that will have to wait for another post. Critically, the levitating red dot raises the question of whether we want to alter our allocations or constraints. Maybe returns of not less than 7% and risk of not more than 10% are unrealistic. Say we lowered it to a return of not less than 6% and a risk of not more than 10%. What would the average allocations approximate? We graph the results based on the last ten years of data.
This allocation isn’t much different from the one above. The weighting to stocks declines in favor of gold. Real estate returns to its original allocation. This is instructive because it shows us how an aggregate weighting scheme may not produce logical results. Few investors would consider putting more than 5% of their portfolio in a commodity like gold unless they had a strong view on inflation or potential economic shocks. Importantly, what’s causing the shift to gold? Is it risk or return? As a perceptive reader explained to us previously, returns generally have a greater impact on allocation than volatility. And this case proves that out, as gold exhibited a negative return on average, while its volatility was the second highest of the group.
A more reasonable allocation would be to increase one’s exposure to bonds and to keep the gold allocation below 10%. We’ll save the analysis of ranges of allocations for a later post. If this were reality, it might make more sense to keep the original allocation, but to be prepared for lower returns going forward. Yet, this is about illustration. So let’s look at two allocation strategies. Keeping the prior one (why mess with a good thing?) and the one based on longer term data.
Interestingly, the different allocations didn’t result in meaningfully different results when looking at the graph. We see that in general, returns averaged about 5%, below our threshold constraint. This was mainly due to the bursting of the tech bubble in 2000 and the 2001 recession. Still our portfolio performed better than 80% of the simulations. Better to be lucky than smart.
We could keep running simulations to the end of the period, but we’ll end it here. The code below provides functions that should allow one to produce simulations up to the present relatively easily.
What are the key takeaways? While it’s tough to generalize on such a small sample, it’s safe to say that portfolio results are decidedly at the mercy of what actually occurs. Those strong risk-adjusted returns in the first test period were likely due to the bull market rather than our allocation. Additionally, the gimlet-eyed reader will notice how the shape of the portfolio simulations changed with the different periods. In the first simulation, one could almost trace a sideways parabola along the edges of the dots. The last looked more like a lop-sided V. Examining these results gives us a good launchpad for further analysis. What’s next then? In future posts, we’ll
If you’d prefer to read about one of these analyses sooner rather than later, drop us an email at the address below. Until next time, the R code, followed by the Python code, for all the analyses and graphs is below.
## Coded in R 3.6.2
## Load packages
suppressPackageStartupMessages({
library(tidyquant)
library(tidyverse)
})
## Load data
# Create symbol vectors
symbols <- c("WILL5000INDFC", "BAMLCC0A0CMTRIV", "GOLDPMGBD228NLBM", "CSUSHPINSA", "DGS5")
sym_names <- c("stock", "bond", "gold", "realt", "rfr")
# Get symbols
getSymbols(symbols, src="FRED", from = "1970-01-01", to = "2019-12-31")
# Merge xts objects and resample to monthly
index <- merge(WILL5000INDFC, BAMLCC0A0CMTRIV, GOLDPMGBD228NLBM, CSUSHPINSA, DGS5)
index <- na.locf(index)
colnames(index) <- sym_names
idx_mon <- to.monthly(index, indexAt = "lastof", OHLC=FALSE)
idx_mon <- idx_mon["1987/2019"]
# Create data frame
df <- data.frame(date = index(idx_mon), coredata(idx_mon)) %>%
mutate_at(vars(-c(date, rfr)), function(x) x/lag(x)-1) %>%
mutate(rfr = rfr/100)
## Plot data
# Special thanks to Stack Overflow user eipi10 for the code to make a ggpairs-like plot without the constraints of that package (simple manual color changes aren't workable in ggpairs).
# https://stackoverflow.com/questions/13367248/pairs-move-labels-to-sides-of-scatter-plot
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, panel.first = grid(),...)
}
panel.pearson <- function(x, y, ...) {
horizontal <- (par("usr")[1] + par("usr")[2]) / 2;
vertical <- (par("usr")[3] + par("usr")[4]) / 2;
text(horizontal, vertical, round(cor(x,y)+0.009, 2))}
pairs(df[2:61 , 2:5],
col = "blue",
pch = 19,
cex = 1.5,
labels = NULL,
gap = 0.5,
diag.panel = panel.hist,
upper.panel = panel.pearson)
title("Scatter plot, histogram, & correlation", adj = 0, line = 3)
x.coords = par('usr')[1:2]
y.coords = par('usr')[3:4]
# Offset is estimated distance between edge of plot area and beginning of actual plot
x.offset = 0.03 * (x.coords[2] - x.coords[1])
xrng = (x.coords[2] - x.coords[1]) - 2*x.offset
x.width = xrng/4
y.offset = 0.028 * (y.coords[2] - y.coords[1])
yrng = (y.coords[2] - y.coords[1]) - 2*y.offset
y.width = yrng/4
# x-axis labels
text(seq(x.coords[1] + x.offset + 0.5*x.width, x.coords[2] - x.offset - 0.5*x.width,
length.out=4), 0,
c("Stocks","Bonds","Gold","Real Estate"),
xpd=TRUE,adj=c(.5,.5), cex=.9)
# y-axis labels
text(x.coords, seq(y.coords[1] + y.offset + 0.5*y.width, y.coords[2] - 3*y.offset - 0.5*y.width,
length.out=4),
rev(c("Stocks","Bonds","Gold","Real Estate")),
xpd=TRUE, adj=c(0.5, 0.5),
srt=90, # rotates text to be parallel to axis
cex=.9)
## Portfolio simulation
# Weighting that ensures more variation and random weighting to stocks
set.seed(123)
# Function for simulation and graph
port_sim <- function(df, sims, cols){
if(ncol(df) != cols){
stop("Columns don't match") # break is invalid outside a loop; stop() halts with a message
}
# Create weight matrix
wts <- matrix(nrow = sims, ncol = cols)
for(i in 1:sims){
a <- runif(cols,0,1)
b <- a/sum(a)
wts[i,] <- b
}
# Find returns
mean_ret <- colMeans(df)
# Calculate covariance matrix
cov_mat <- cov(df)
# Calculate random portfolios
port <- matrix(nrow = sims, ncol = 2)
for(i in 1:sims){
port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
}
colnames(port) <- c("returns", "risk")
port <- as.data.frame(port)
port$Sharpe <- port$returns/port$risk*sqrt(12)
max_sharpe <- port[which.max(port$Sharpe),]
graph <- port %>%
ggplot(aes(risk*sqrt(12)*100, returns*1200, color = Sharpe)) +
geom_point(size = 1.2, alpha = 0.4) +
scale_color_gradient(low = "darkgrey", high = "darkblue")+
labs(x = "Risk (%)",
y = "Return (%)",
title = "Simulated portfolios")
out <- list(port = port, graph = graph, max_sharpe = max_sharpe, wts = wts)
out
}
## Run simulation and plot
port_sim_1 <- port_sim(df[2:61,2:5],1000,4)
port_sim_1$graph +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
# Create function to calculate portfolio weights based on constraints and a graph
port_select_func <- function(port, return_min, risk_max, port_names){
port_select <- cbind(port$port, port$wts)
risk_sd <- sd(port$port$risk) # use the function argument, not the global port_sim_1
port_wts <- port_select %>%
mutate(returns = returns*12,
risk = risk*sqrt(12)) %>%
filter(returns >= return_min,
risk <= risk_max) %>%
summarise_at(vars(4:7), mean) %>%
`colnames<-`(port_names)
graph <- port_wts %>%
rename("Stocks" = 1,
"Bonds" = 2,
"Gold" = 3,
"Real estate" = 4) %>%
gather(key,value) %>%
ggplot(aes(reorder(key,value), value*100 )) +
geom_bar(stat='identity', position = "dodge", fill = "blue") +
geom_text(aes(label=round(value,2)*100), vjust = -0.5) +
scale_y_continuous(limits = c(0,40)) +
labs(x="",
y = "Weights (%)",
title = "Average weights for risk-return constraints")
out <- list(port_wts = port_wts, graph = graph)
out
}
## Run selection function
results_1 <- port_select_func(port_sim_1,0.07, 0.1, sym_names[1:4])
results_1$graph
## Instantiate weighting
fut_wt <- results_1$port_wts
## Create rebalancing function
rebal_func <- function(act_ret, weights){
tot_ret <- 1
ret_vec <- c()
for(i in 1:60){
wt_ret <- act_ret[i,]*weights # wt'd return
ret <- sum(wt_ret) # total return
tot_ret <- tot_ret * (1+ret) # cumulative return
ret_vec[i] <- ret
weights <- (weights + wt_ret)/(sum(weights)+ret) # new weight based on change in asset value
}
ret_vec
}
## Run function and create actual portfolio
ret_vec <- rebal_func(df[61:121,2:5], fut_wt)
port_act <- data.frame(returns = mean(ret_vec),
risk = sd(ret_vec),
sharpe = mean(ret_vec)/sd(ret_vec)*sqrt(12))
# Run simulation on recent five-years
port_sim_2 <- port_sim(df[62:121,2:5], 1000, 4)
# Graph simulation with actual portfolio return
port_sim_2$graph +
geom_point(data = port_act,
aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
# Run function on next five years implied weight
results_2 <- port_select_func(port_sim_2, 0.07, 0.1,sym_names[1:4])
results_2$graph
# Run simulation on last 10 years
port_sim_2l <- port_sim(df[2:121,2:5], 1000,4)
# Graph simulation with actual results of last five years
port_sim_2l$graph +
geom_point(data = port_act,
aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
## Run portfolio selection function on more conservative constraints and graph
results_2l_cons <- port_select_func(port_sim_2l, 0.06, 0.12, sym_names[1:4])
results_2l_cons$graph
## Run two separate allocations on next five-years
results_2l <- port_select_func(port_sim_2l, 0.07, 0.1, sym_names[1:4])
ret_old_wt <- rebal_func(df[122:181, 2:5], fut_wt)
ret_new_wt <- rebal_func(df[122:181, 2:5], results_2l$port_wts)
port_act_1_old <- data.frame(returns = mean(ret_old_wt),
risk = sd(ret_old_wt),
sharpe = mean(ret_old_wt)/sd(ret_old_wt)*sqrt(12))
port_act_1_new <- data.frame(returns = mean(ret_new_wt),
risk = sd(ret_new_wt),
sharpe = mean(ret_new_wt)/sd(ret_new_wt)*sqrt(12))
port_sim_3 <- port_sim(df[122:181,2:5], 1000, 4)
port_sim_3$graph +
geom_point(data = port_act_1_old,
aes(risk*sqrt(12)*100, returns*1200), size = 4, color="red") +
geom_point(data = port_act_1_new,
aes(risk*sqrt(12)*100, returns*1200), size = 4, color="purple") +
theme(legend.position = c(0.05,0.8), legend.key.size = unit(.5, "cm"),
legend.background = element_rect(fill = NA))
And for the Pythonistas:
# Coded in Python 3.7.4
# Load libraries
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('ggplot')
sns.set()
# Load data
start_date = '1970-01-01'
end_date = '2019-12-31'
symbols = ["WILL5000INDFC", "BAMLCC0A0CMTRIV", "GOLDPMGBD228NLBM", "CSUSHPINSA", "DGS5"]
sym_names = ["stock", "bond", "gold", "realt", 'rfr']
filename = 'port_const.pkl'
try:
df = pd.read_pickle(filename)
print('Data loaded')
except FileNotFoundError:
print("File not found")
print("Loading data", 30*"-")
data = web.DataReader(symbols, 'fred', start_date, end_date)
data.columns = sym_names
data_mon = data.resample('M').last()
df = data_mon.pct_change()['1987':'2019']
df.to_pickle(filename)
# Exploratory data analysis
sns.pairplot(df.iloc[1:61,0:4])
plt.show()
# Create function
class Port_sim:
def calc_sim(df, sims, cols):
wts = np.zeros((sims, cols))
for i in range(sims):
a = np.random.uniform(0,1,cols)
b = a/np.sum(a)
wts[i,] = b
mean_ret = df.mean()
port_cov = df.cov()
port = np.zeros((sims, 2))
for i in range(sims):
port[i,0] = np.sum(wts[i,]*mean_ret)
port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,]))
sharpe = port[:,0]/port[:,1]*np.sqrt(12)
best_port = port[np.where(sharpe == max(sharpe))]
max_sharpe = max(sharpe)
return port, wts, best_port, sharpe, max_sharpe
def graph_sim(port,sharpe):
plt.figure(figsize=(14,6))
plt.scatter(port[:,1]*np.sqrt(12)*100, port[:,0]*1200, marker='.',c=sharpe, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
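The two formulas at the heart of calc_sim — random weights normalized to sum to one, portfolio volatility as the square root of w′Σw, and the Sharpe ratio annualized by √12 — can be sanity-checked on made-up monthly data. This is just a numeric spot check, not part of the article's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
rets = rng.normal(0.005, 0.03, size=(60, 4))   # 60 months of made-up returns
w = rng.uniform(0, 1, 4)
w = w / w.sum()                                 # random weights, normalized

cov = np.cov(rets, rowvar=False)                # sample covariance (ddof=1)
port_vol = np.sqrt(w @ cov @ w)                 # vol via w' Sigma w
series_vol = np.std(rets @ w, ddof=1)           # vol of the weighted series
sharpe_ann = (rets @ w).mean() / series_vol * np.sqrt(12)
```

The quadratic-form volatility and the standard deviation of the weighted return series agree exactly (both use the n−1 denominator), which is why the simulation can skip building the full return series for each of the 1,000 portfolios.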
# Create simulation
np.random.seed(123)
port, wts, _, sharpe, _ = Port_sim.calc_sim(df.iloc[1:61,0:4],1000,4) # rows 1:61 give the same 60 months as the R code's df[2:61,]
# Graph simulation
Port_sim.graph_sim(port, sharpe)
#### Portfolio constraint function
def port_select_func(port, wts, return_min, risk_max):
port_select = pd.DataFrame(np.concatenate((port, wts), axis=1))
port_select.columns = ['returns', 'risk', 1, 2, 3, 4]
port_wts = port_select[(port_select['returns']*12 >= return_min) & (port_select['risk']*np.sqrt(12) <= risk_max)]
port_wts = port_wts.iloc[:,2:6]
# port_wts.columns = ["Stocks", "Bonds", "Gold", "Real estate"]
port_wts = port_wts.mean(axis=0)
def graph():
plt.figure(figsize=(12,6))
key_names = {1:"Stocks", 2:"Bonds", 3:"Gold", 4:"Real estate"}
lab_names = []
graf_wts = port_wts.sort_values()*100
for i in range(len(graf_wts)):
name = key_names[graf_wts.index[i]]
lab_names.append(name)
plt.bar(lab_names, graf_wts)
plt.ylabel("Weight (%)")
plt.title("Average weights for risk-return constraint", fontsize=15)
for i in range(len(graf_wts)):
plt.annotate(str(round(graf_wts.values[i])), xy=(lab_names[i], graf_wts.values[i]+0.5))
plt.show()
return port_wts, graph()
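The selection step boils down to a filter-and-average. Here is a self-contained miniature (all numbers invented, column names stand in for the real assets) that shows the same mechanics without depending on the simulation objects above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
sims = 1000
w = rng.uniform(0, 1, (sims, 4))
w = w / w.sum(axis=1, keepdims=True)             # each weight row sums to 1
monthly_ret = rng.normal(0.005, 0.002, sims)     # stand-in simulated stats
monthly_risk = rng.uniform(0.02, 0.05, sims)

tbl = pd.DataFrame(np.column_stack([monthly_ret, monthly_risk, w]),
                   columns=['returns', 'risk', 'stock', 'bond', 'gold', 'realt'])
# keep portfolios clearing a 5% annual return floor and a 14% annual risk cap
mask = (tbl['returns'] * 12 >= 0.05) & (tbl['risk'] * np.sqrt(12) <= 0.14)
avg_wts = tbl.loc[mask, ['stock', 'bond', 'gold', 'realt']].mean()
```

Because every surviving row's weights sum to one, the averaged weights do too, so `avg_wts` can be handed straight to the rebalancing function as a valid allocation.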
## Run function
results_wts, results_graph = port_select_func(port, wts, 0.05, 0.14)
## Create rebalancing function
def rebal_func(act_ret, weights):
tot_ret = 1
ret_vec = np.zeros(60)
for i in range(60):
wt_ret = act_ret.iloc[i,:].values*weights
ret = np.sum(wt_ret)
tot_ret = tot_ret * (1+ret)
ret_vec[i] = ret
weights = (weights + wt_ret)/(np.sum(weights) + ret)
return ret_vec
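The weight update inside rebal_func tracks how the book drifts as asset values change between periods. A one-period check with hypothetical numbers (a 60/40 book, stocks up 10%, bonds flat) makes the arithmetic concrete:

```python
import numpy as np

weights = np.array([0.6, 0.4])       # hypothetical 60/40 starting book
asset_ret = np.array([0.10, 0.0])    # stocks +10%, bonds flat for the month

wt_ret = asset_ret * weights         # each sleeve's contribution to return
ret = wt_ret.sum()                   # total portfolio return: 6%
new_weights = (weights + wt_ret) / (weights.sum() + ret)
# the book drifts to roughly 62.3% / 37.7%: stocks grew, bonds stood still
```

Dividing by the grown total (1 + ret) keeps the new weights summing to one, which is what lets the loop compound cleanly over all 60 months.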
# Run rebalancing function and dictionary
ret_vec = rebal_func(df.iloc[61:121,0:4], results_wts)
port_act = {'returns': np.mean(ret_vec),
'risk': np.std(ret_vec),
'sharpe': np.mean(ret_vec)/np.std(ret_vec)*np.sqrt(12)}
#### Run simulation on next group
# Run simulation on recent five-years
np.random.seed(123)
port_2, wts_2, _, sharpe_2, _ = Port_sim.calc_sim(df.iloc[61:121,0:4],1000,4)
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_2[:,1]*np.sqrt(12)*100, port_2[:,0]*1200, marker='.', c=sharpe_2, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='blue', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
#### Run selection function on next five years' implied weight
results_wts_2, results_graph_2 = port_select_func(port_2, wts_2, 0.07, 0.1)
# Run simulation on last 10 years
np.random.seed(123)
port_2l, wts_2l, _, sharpe_2l, _ = Port_sim.calc_sim(df.iloc[1:121,0:4],1000,4)
# Graph simulation with actual portfolio return
plt.figure(figsize=(14,6))
plt.scatter(port_2l[:,1]*np.sqrt(12)*100, port_2l[:,0]*1200, marker='.', c=sharpe_2l, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25)
plt.scatter(port_act['risk']*np.sqrt(12)*100, port_act['returns']*1200, c='red', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()
# Run function on next five years implied weight
results_wts_2l, _ = port_select_func(port_2l, wts_2l, 0.06, 0.12)
# Run simulation on next five years with two different weightings
np.random.seed(123)
ret_old_wt = rebal_func(df.iloc[121:181, 0:4], results_wts) # old weights from the first selection (fut_wt in the R code)
ret_new_wt = rebal_func(df.iloc[121:181, 0:4], results_wts_2l)
port_act_1_old = {'returns' : np.mean(ret_old_wt),
'risk' : np.std(ret_old_wt),
'sharpe' : np.mean(ret_old_wt)/np.std(ret_old_wt)*np.sqrt(12)}
port_act_1_new = {'returns' : np.mean(ret_new_wt),
'risk' : np.std(ret_new_wt),
'sharpe' : np.mean(ret_new_wt)/np.std(ret_new_wt)*np.sqrt(12)}
port_3, wts_3, _, sharpe_3, _ = Port_sim.calc_sim(df.iloc[121:181, 0:4], 1000, 4)
plt.figure(figsize=(14,6))
plt.scatter(port_3[:,1]*np.sqrt(12)*100, port_3[:,0]*1200, marker='.', c=sharpe_3, cmap='Blues')
plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.5)
plt.scatter(port_act_1_old['risk']*np.sqrt(12)*100, port_act_1_old['returns']*1200, c='red', s=50)
plt.scatter(port_act_1_new['risk']*np.sqrt(12)*100, port_act_1_new['returns']*1200, c='purple', s=50)
plt.title('Simulated portfolios', fontsize=20)
plt.xlabel('Risk (%)')
plt.ylabel('Return (%)')
plt.show()