The post Docker + Flask | Dockerizing a Python API first appeared on Python-bloggers.
Docker containers are one of the hottest trends in software development right now. Not only do they make it easier to create, deploy, and run applications, but with containers you can be confident that your application will run on any machine, regardless of how that machine differs from the one on which you wrote and tested the code.
In this tutorial, we will show you how you can easily dockerize a Flask API. We will use this Python REST API example: a simple API that, given an image URL, returns the dominant colors of the image.
We highly recommend creating a new Python environment (using Conda or venv) so you can easily generate a requirements.txt file that contains all the libraries you are using in the project.
The Flask API that we will dockerize consists of two .py files: colors.py and main.py.
```python
import requests
import webcolors
import pandas as pd
from PIL import Image
from io import BytesIO


def closest_colour(requested_colour):
    min_colours = {}
    for key, name in webcolors.css3_hex_to_names.items():
        r_c, g_c, b_c = webcolors.hex_to_rgb(key)
        rd = (r_c - requested_colour[0]) ** 2
        gd = (g_c - requested_colour[1]) ** 2
        bd = (b_c - requested_colour[2]) ** 2
        min_colours[(rd + gd + bd)] = name
    return min_colours[min(min_colours.keys())]


def top_colors(url, n=10):
    # read the image from the URL
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    # convert the image to RGB
    image = img.convert('RGB')
    # resize the image to 100 x 100
    image = image.resize((100, 100))
    detected_colors = []
    for x in range(image.width):
        for y in range(image.height):
            detected_colors.append(closest_colour(image.getpixel((x, y))))
    Series_Colors = pd.Series(detected_colors)
    output = Series_Colors.value_counts() / len(Series_Colors)
    return output.head(n).to_dict()
```
```python
from flask import Flask, jsonify, request
# we are importing our function from the colors.py file
from colors import top_colors

app = Flask(__name__)


@app.route("/", methods=['GET', 'POST'])
def index():
    if request.method == 'GET':
        # getting the url argument
        url = request.args.get('url')
        result = top_colors(str(url))
        return jsonify(result)
    else:
        return jsonify({'Error': "This is a GET API method"})


if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9007)
```
As we said before, we have to create the requirements.txt file. We use the pip freeze command after activating the project's environment.
pip freeze > requirements.txt
If you open requirements.txt, you should see all the required libraries of the project listed.
```
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
Flask==1.1.2
idna==2.10
itsdangerous==1.1.0
Jinja2==2.11.2
jsonify==0.5
MarkupSafe==1.1.1
numpy==1.19.2
pandas==1.1.3
Pillow==8.0.1
python-dateutil==2.8.1
pytz==2020.1
requests==2.24.0
six==1.15.0
urllib3==1.25.11
webcolors==1.4
Werkzeug==1.0.1
```
Let’s start the dockerizing process. We only need to create a new file called Dockerfile and add a few lines of code to it.
The Dockerfile is made up of simple commands that define how to build the image. The first line is our base image. There are many images you can choose from, such as plain Linux, Linux with Python and common libraries preinstalled, or images made especially for data science projects. You can explore them all on Docker Hub. We will use the python:3.8 image.
FROM python:3.8
Then we need to copy the required files from our host machine and add them to the filesystem of the container. To keep it simple, we will not use any subfolders.
FROM python:3.8 COPY requirements.txt ./requirements.txt COPY colors.py ./colors.py COPY main.py ./main.py
Then we have to install the libraries, so we add a RUN instruction with the pip install command.
FROM python:3.8 COPY requirements.txt ./requirements.txt COPY colors.py ./colors.py COPY main.py ./main.py RUN pip install -r requirements.txt
Lastly, we have to specify the command to run within the container using CMD. In our case it is python main.py.
FROM python:3.8 COPY requirements.txt ./requirements.txt COPY colors.py ./colors.py COPY main.py ./main.py RUN pip install -r requirements.txt CMD ["python", "./main.py"]
To build the Docker image, go to the working directory where the Dockerfile is placed and run the following.
docker build -t your_docker_image_name -f Dockerfile .
You just built your image! The next step is to run our container. The tricky part here is the mapping of the ports: the first is the host port we will use locally, and the second is the port on which the API runs inside the container.
docker run -d -p 5000:9007 your_docker_image_name
If everything is OK, you should get a response when you hit the following URL in your browser.
http://localhost:5000/?url=https://image.shutterstock.com/z/stock-photo-at-o-clock-at-the-top-of-the-mountains-sunrise-1602307492.jpg
```
{
  "burlywood": 0.1212,
  "cornsilk": 0.0257,
  "darksalmon": 0.229,
  "darkslategrey": 0.0928,
  "indianred": 0.1663,
  "lemonchiffon": 0.021,
  "lightsalmon": 0.0479,
  "navajowhite": 0.0426,
  "rosybrown": 0.097,
  "wheat": 0.0308
}
```
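Passing the image URL as a raw query string works fine in the browser, but when calling the API programmatically it is safer to percent-encode it, since the image URL itself contains `:` and `/` characters. A small sketch (the host port follows the mapping above; the image URL here is a hypothetical placeholder):

```python
from urllib.parse import urlencode

# host port 5000 maps to the API's port 9007 inside the container
base = "http://localhost:5000/"
params = {"url": "https://example.com/some-image.jpg"}  # hypothetical image URL
query_url = f"{base}?{urlencode(params)}"
print(query_url)
```

You can then pass `query_url` to `requests.get` or any HTTP client.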
You made it! You’ve just dockerized your Flask API! Simple as that.
Get the list of the running containers
docker container list
```
CONTAINER ID   IMAGE        COMMAND              CREATED             STATUS             PORTS                    NAMES
fe7726349933   image_name   "python ./main.py"   About an hour ago   Up About an hour   0.0.0.0:5000->9007/tcp   eager_chaum
```
If you want to stop the container, take the first 3-4 characters of the container ID from the previous command and run the following.
docker stop fe77
Get the Logs of the API
docker logs fe77
The post Kernel of error first appeared on Python-bloggers.
In our last post, we looked at a rolling average of pairwise correlations for the constituents of XLI, an ETF that tracks the industrials sector of the S&P 500. We found that spikes in the three-month average coincided with declines in the underlying index. There was some graphical evidence of a correlation between the three-month average and forward three-month returns. However, a linear model didn’t do a great job of explaining the relationship given its relatively high error rate and unstable variability.
We proposed further analyses and were going to conduct one of them for this post, but then discovered the interesting R package generalCorr, developed by Professor H. Vinod of Fordham University, NY. While we can’t do justice to all the package’s functionality, it does offer ways to calculate non-linear dependence often missed by common correlation measures because such measures assume a linear relationship between the two sets of data. One particular function allows the user to identify probable causality between a pair of variables. In other words, it tells you whether it is more likely that x causes y or y causes x.
Why is this important? Our project is about exploring, and, if possible, identifying the predictive capacity of average rolling index constituent correlations on the index itself. If the correlation among the parts is high, then macro factors are probably exhibiting strong influence on the index. If correlations are low, then micro factors are probably the more important driver. Of course, other factors could cause rising correlations and the general upward trend of US equity markets should tend to keep correlations positive.
Additionally, if only a few stocks explain the returns on the index over a certain time frame, it might be possible to use the correlation of those stocks to predict future returns on the index. The notion is that the “memory” in the correlation could continue into the future. Then again, it might not!
But there’s a bit of problem with this. If we’re using a function that identifies non-linear dependence, we’ll need to use a non-linear model to analyze the predictive capacity too. That means before we explore the generalCorr package we’ll need some understanding of non-linear models.
In our previous post we analyzed the prior 60-trading-day average pairwise correlations for all the constituents of the XLI and then compared those correlations to the forward 60-trading-day return. Let’s look at a scatter plot to refresh our memory. Recall that we split the data into roughly a 70/30 train-test split and only analyzed the training set. Since the data begins around 2005, the training set ends around mid-2015.
In the graph above, we see the rolling correlation doesn’t yield a very strong linear relationship with forward returns. Moreover, there’s clustering and apparent variability in the relationship. Since our present concern is the non-linearity, we’ll have to shelve these other issues for the moment.
The relationship between correlation and returns is clearly non-linear if one could call it a relationship at all. But where do we begin trying to model the non-linearity of the data? There are many algorithms that are designed to handle non-linearity: splines, kernels, generalized additive models, and many others. We’ll use a kernel regression for two reasons: a simple kernel is easy to code—hence easy for the interested reader to reproduce—and the generalCorr package, which we’ll get to eventually, ships with a kernel regression function.
What is kernel regression? In simplistic terms, a kernel regression finds a way to connect the dots without looking like scribbles or flat lines. It assumes no underlying distribution. That is, it doesn’t believe the data hails from a normal, lognormal, exponential, or any other kind of distribution. How does it do all this? The algorithm takes successive windows of the data and uses a weighting function (or kernel) to assign weights to each value of the independent variable in that window. Those weights are then applied to the values of the dependent variable in the window, to arrive at a weighted average estimate of the likely dependent value. Look at a section of data; figure out what the relationship looks like; use that to assign an approximate y value to the x value; repeat.
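The "look at a window, weight the neighbors, average" recipe above can be sketched in a few lines. This is a minimal Nadaraya-Watson-style estimator with a Gaussian weighting function; the toy data and bandwidth are made up for illustration, not taken from the post's analysis:

```python
import numpy as np

def gaussian_kernel_regression(x_train, y_train, x_query, bandwidth):
    """For each query point, weight every training x by a Gaussian
    centered on the query, then return the weighted average of y."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    estimates = []
    for x0 in np.atleast_1d(x_query):
        w = np.exp(-0.5 * ((x_train - x0) / bandwidth) ** 2)
        estimates.append(np.sum(w * y_train) / np.sum(w))
    return np.array(estimates)

# toy data: y = x^2 sampled on a grid
x = np.linspace(0, 1, 11)
y = x ** 2
est = gaussian_kernel_regression(x, y, [0.5], bandwidth=0.05)
print(est)  # close to 0.25, the true value at x = 0.5
```

With a small bandwidth, only the points nearest the query carry meaningful weight, which is exactly the "local" behavior discussed later in the post.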
There are a bunch of different weighting functions: k-nearest neighbors, Gaussian, and others with eponymous multi-syllabic names. Window sizes trade off between bias and variance: constant windows keep bias stable, with variance inversely proportional to how many values are in the window. Varying window sizes (nearest neighbor, for example) allow bias to vary, but variance remains relatively constant. Larger window sizes within the same kernel function lower the variance. Bias and variance here refer to whether the model’s error is due to bad assumptions or to poor generalizability.
If all this makes sense to you, you’re doing better than we are. Clearly, we can’t even begin to explain all the nuances of kernel regression. Hopefully, a graph will make things a bit clearer; not so much around the algorithm, but around the results. In the graph below, we show the same scatter plot, using a weighting function that relies on a normal distribution (i.e., a Gaussian kernel) whose width parameter is equivalent to about half the volatility of the rolling correlation.^{1}
We see that there’s a relatively smooth line that seems to follow the data a bit better than the straight one from above. How much better is hard to tell. What if we reduce the volatility parameter even further? We show three different parameters below, using widths equivalent to a half, a quarter, and an eighth of the volatility of the correlation.
This graph shows that as you lower the volatility parameter, the curve fluctuates even more. A smaller smoothing parameter gives more weight to the closer data, narrowing the effective width of the window and making the estimate more sensitive to local fluctuations.^{2}
How does a kernel regression compare to the good old linear one? We run a linear regression and the various kernel regressions (as in the graph) on the returns vs. the correlation. We present the error (RMSE) and error scaled by the volatility of returns (RMSE scaled) in the table below.
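The scaled error is simply the RMSE divided by the standard deviation of the returns, so a value near 1 means the model's error is about as large as the series' own volatility. A quick sketch with made-up numbers (not the post's data):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between actual and predicted values."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

# toy actual vs. predicted returns, purely illustrative
y = [0.1, -0.05, 0.02]
y_hat = [0.08, -0.02, 0.0]

# scale by the volatility of returns so errors are comparable across series
scaled = rmse(y, y_hat) / np.std(y)
print(scaled)
```

A scaled RMSE well below 1, as here, indicates errors small relative to the series' variability; the table's values near 1 show the models barely beat that bar.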
| Model | RMSE | RMSE scaled |
|---|---|---|
| Linear | 0.097 | 0.992 |
| Kernel @ half volatility | 0.095 | 0.971 |
| Kernel @ quarter volatility | 0.092 | 0.944 |
| Kernel @ eighth volatility | 0.090 | 0.921 |
The table shows that, as the volatility parameter declines, the kernel regression’s error improves from 2.1 percentage points lower to 7.7 percentage points lower than the linear model’s. Whether a 7.7-point improvement in the error is significant ultimately depends on how the model will be used. Is it meant to yield a trading signal? A tactical reallocation? Only the user can decide. Whatever the case, if improved risk-adjusted returns is the goal, we’d need to look at model-implied returns vs. a buy-and-hold strategy to quantify the significance, something we’ll save for a later date.
For now, we could lower the volatility parameter even further. Instead, we’ll check how the regressions perform using cross-validation to assess the degree of overfitting that might occur. We suspect that as we lower the volatility parameter, the risk of overfitting rises.
We run a four-fold cross-validation on the training data: we train a kernel regression model for each of the three volatility parameters using three-quarters of the data and then validate that model on the other quarter. We calculate the error on each fold, then average those errors for each parameter. We present the results below.
| Parameter | Train | Validation | Decline (%) |
|---|---|---|---|
| Half | 0.090 | 0.103 | -12.6 |
| Quarter | 0.087 | 0.107 | -18.1 |
| Eighth | 0.085 | 0.110 | -22.4 |
As should be expected, as we lower the volatility parameter we effectively increase the sensitivity to local variance, thus magnifying the performance decline from training to validation set.
Let’s compare this to the linear regression. We run the cross-validation on the same data splits. We present the results of each fold, which we omitted in the prior table for readability.
| Train | Validation | Decline (%) |
|---|---|---|
| 0.106 | 0.045 | 137.0 |
| 0.099 | 0.087 | 14.3 |
| 0.099 | 0.095 | 4.7 |
| 0.067 | 0.177 | -62.3 |
What a head scratcher! The error rate improves in some cases! Normally, one wouldn’t expect this to happen. A model trained on one set of data shouldn’t perform better on data it hasn’t seen; it should perform worse! But that’s the idiosyncratic nature of time series data. We believe this “anomaly” is caused by training a model on a period with greater volatility and less of an upward trend than the period on which it’s validated. Given generally upwardly trending markets, when the model’s predictions are run on the validation data, it appears more accurate since it is more likely to predict an up move anyway; and even if the model’s size effect is high, the error is unlikely to be as severe as in choppy markets because it won’t suffer high errors due to severe sign-change effects.
So which model is better? If we aggregate the cross-validation results, we find that the kernel regressions see a -18% worsening in the error vs. a 23.4% improvement for the linear model. But we know we can’t trust that improvement. Clearly, we need a different performance measure to account for regime changes in the data. Or we could run the cross-validation with some sort of block sampling to account for serial correlation while diminishing the impact of regime changes. Not exactly a trivial endeavor.
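The block-sampling idea mentioned above means validating on contiguous chunks of the series rather than random draws, so serially correlated observations stay together. A minimal sketch of building such contiguous folds (a simplified illustration, not the exact scheme used in the post's code):

```python
import numpy as np

def contiguous_blocks(n, n_folds):
    """Split indices 0..n-1 into contiguous blocks, preserving time
    order within each block; each block serves once as validation."""
    edges = np.linspace(0, n, n_folds + 1, dtype=int)
    return [np.arange(edges[i], edges[i + 1]) for i in range(n_folds)]

blocks = contiguous_blocks(100, 4)
print([len(b) for b in blocks])  # four blocks of 25 observations
```

Training on the remaining blocks and validating on each held-out block in turn keeps the serial structure intact, though it does not eliminate the regime-change problem.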
These results beg the question as to why we didn’t see something similar in the kernel regression. Same time series, so why not the same effect? The short answer is we have no idea without looking at the data in more detail. We suspect there might be some data snooping, since we used a range for the weighting function that might not have existed in the training set. We assume a range of correlation values from zero to one on which to calculate the respective weights. But in the data, the range of correlation is much tighter: it doesn’t drop much below ~20% and rarely exceeds ~80%.
Another question that pops out of the results is whether it is appropriate (or advisable) to use kernel regression for prediction at all. In many cases it probably isn’t, insofar as kernel regression could be considered a “local” regression: it derives the relationship between the dependent and independent variables from values within a set window. Linear regression, at least, calculates the best fit using all of the available data in the sample. But just as linear regression will yield poor predictions when it encounters x values significantly different from the range on which the model was trained, the same phenomenon is likely to occur with kernel regression. Using correlation as the independent variable glosses over this problem somewhat, since its range is bounded.^{3}
Whatever the case, should we trust the kernel regression more than the linear? In one sense yes, since it performed—at least in terms of errors—exactly as we would expect any model to perform. That the linear model shows an improvement in error could lull one into a false sense of success. Not that we’d expect anyone to really believe they’ve found the Holy Grail of models because the validation error is better than the training error. But, paraphrasing Feynman, the easiest person to fool is the model-builder himself.
We’ve written much more for this post than we had originally envisioned. And we haven’t even reached the original analysis we were planning to present! Nonetheless, as we hope you can see, there’s a lot to unpack on the topic of non-linear regressions. We’ll next look at actually using the generalCorr package we mentioned above to tease out any potential causality we can find between the constituents and the index. From there we’ll be able to test out-of-sample results using a kernel regression. The suspense is killing us!
Until next time, let us know what you think of this post. Did we fall down a rabbit hole or did we not go deep enough? And while you think about that, here’s the code.
R code:
```r
# Built using R 3.6.2

## Load packages
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidyquant)
  library(reticulate)
})

## Load data from Python
pd <- import("pandas")
prices <- pd$read_pickle("python/xli_prices.pkl")
xli <- pd$read_pickle("python/xli_etf.pkl")

# Create date index for xts
dates <- as.Date(rownames(prices))
price_xts <- as.xts(prices, order.by = dates)
price_xts <- price_xts[, !colnames(price_xts) %in% c("OTIS", "CARR")]
xli_xts <- as.xts(xli, order.by = dates)
names(xli_xts) <- "xli"
head(xli_xts)
prices_xts <- merge(xli_xts, price_xts)
prices_xts <- prices_xts[, !colnames(prices_xts) %in% c("OTIS", "CARR")]

# Create function for rolling correlation
mean_cor <- function(returns) {
  # calculate the correlation matrix
  cor_matrix <- cor(returns, use = "pairwise.complete")
  # set the diagonal to NA (may not be necessary)
  diag(cor_matrix) <- NA
  # calculate the mean correlation, removing the NA
  mean(cor_matrix, na.rm = TRUE)
}

# Daily constituent returns
# (note: comp_returns was not defined in the original listing; this is a plausible definition)
comp_returns <- ROC(price_xts, n = 1, type = "discrete")

# Create data frame for regression
corr_comp <- rollapply(comp_returns, 60, mean_cor, by.column = FALSE, align = "right")
xli_rets <- ROC(prices_xts[, 1], n = 60, type = "discrete")
total_60 <- merge(corr_comp, lag.xts(xli_rets, -60))[60:(nrow(corr_comp) - 60)]
colnames(total_60) <- c("corr", "xli")
split <- round(nrow(total_60) * 0.70)
train_60 <- total_60[1:split, ]
test_60 <- total_60[(split + 1):nrow(total_60), ]

train_60 %>%
  ggplot(aes(corr * 100, xli * 100)) +
  geom_point(color = "darkblue", alpha = 0.4) +
  labs(x = "Correlation (%)", y = "Return (%)",
       title = "Return (XLI) vs. correlation (constituents)") +
  geom_smooth(method = "lm", se = FALSE, size = 1.25, color = "blue")

## Simple kernel
x_60 <- as.numeric(train_60$corr)
y_60 <- as.numeric(train_60$xli)

# Code derived from the following article:
# https://towardsdatascience.com/kernel-regression-made-easy-to-understand-86caf2d2b844

## Gaussian kernel function
gauss_kern <- function(X, y, bandwidth, X_range = c(0, 1)) {
  pdf <- function(X, bandwidth) {
    (sqrt(2 * pi))^-1 * exp(-0.5 * (X / bandwidth)^2)
  }
  if (length(X_range) > 1) {
    x_range <- seq(min(X_range), max(X_range), 0.01)
  } else {
    x_range <- X_range
  }
  xy_out <- c()
  for (x_est in x_range) {
    x_test <- x_est - X
    kern <- pdf(x_test, bandwidth)
    weight <- kern / sum(kern)
    y_est <- sum(weight * y)
    x_y <- c(x_est, y_est)
    xy_out <- rbind(xy_out, x_y)
  }
  xy_out
}

kernel_out <- gauss_kern(x_60, y_60, 0.06)
kernel_df <- data.frame(corr = as.numeric(train_60$corr),
                        xli = as.numeric(train_60$xli),
                        estimate = rep(NA, nrow(train_60)))

# Output y value
for (i in 1:nrow(kernel_df)) {
  kernel_df$estimate[i] <- gauss_kern(x_60, y_60, 0.06, kernel_df$corr[i])[2]
}

kernel_out <- kernel_out %>% as.data.frame()
colnames(kernel_out) <- c("x_val", "y_val")

# Graph model and scatter plot
kernel_df %>%
  ggplot(aes(corr * 100, xli * 100)) +
  geom_point(color = "darkblue", alpha = 0.4) +
  geom_path(data = kernel_out, aes(x_val * 100, y_val * 100),
            color = "red", size = 1.25) +
  xlim(20, 80) +
  labs(x = "Correlation (%)", y = "Return (%)",
       title = "Return vs. correlation, linear vs. kernel regression")

# Calculate errors
rmse_kern <- sqrt(mean((kernel_df$xli - kernel_df$estimate)^2))
rmse_kern_scale <- rmse_kern / sd(kernel_df$xli)
linear_mod <- lm(xli ~ corr, kernel_df)
rmse_linear <- sqrt(mean(linear_mod$residuals^2))  # was kern_df_mod, which is undefined
rmse_lin_scale <- rmse_linear / sd(kernel_df$xli)

## Multiple kernels
b_width <- c(0.0625, 0.03125, 0.015625)
kernel_out_list <- list()
for (i in 1:3) {
  b_i <- b_width[i]
  val_out <- gauss_kern(x_60, y_60, b_i)
  val_out <- val_out %>%
    as.data.frame() %>%
    mutate(var = i) %>%
    `colnames<-`(c("x_val", "y_val", "var"))
  kernel_out_list[[i]] <- val_out
}
kernel_out_list <- do.call("rbind", kernel_out_list) %>% as.data.frame()

# Graph multiple kernel regressions
ggplot() +
  geom_point(data = kernel_df, aes(corr * 100, xli * 100),
             color = "darkblue", alpha = 0.4) +
  geom_path(data = kernel_out_list,
            aes(x_val * 100, y_val * 100, color = as.factor(var)), size = 1.25) +
  xlim(20, 80) +
  scale_color_manual("Relative volatility",
                     labels = c("1" = "Half", "2" = "Quarter", "3" = "Eighth"),
                     values = c("1" = "red", "2" = "purple", "3" = "blue")) +
  labs(x = "Correlation (%)", y = "Return (%)",
       title = "Return vs. correlation with three kernel regressions") +
  theme(legend.position = c(0.06, 0.85),
        legend.background = element_rect(fill = NA))

# Create table for comparisons
kernel_df <- kernel_df %>% mutate(est_h = NA, est_q = NA, est_8 = NA)
for (j in 1:3) {
  ests <- c()
  for (i in 1:nrow(kernel_df)) {
    out <- gauss_kern(x_60, y_60, b_width[j], kernel_df$corr[i])[2]
    ests[i] <- out
  }
  kernel_df[, j + 3] <- ests
}
kern_rmses <- apply(kernel_df[, 4:6], 2,
                    function(x) sqrt(mean((kernel_df$xli - x)^2))) %>% as.numeric()
kern_rmses_scaled <- kern_rmses / sd(kernel_df$xli)

rmse_df <- data.frame(Model = c("Linear", "Kernel @ half", "Kernel @ quarter", "Kernel @ eighth"),
                      RMSE = c(rmse_linear, kern_rmses),
                      `RMSE scaled` = c(rmse_lin_scale, kern_rmses_scaled),
                      check.names = FALSE)
rmse_df %>%
  mutate_at(vars(RMSE, `RMSE scaled`), function(x) round(x, 3)) %>%
  knitr::kable()

min_improv <- round(min(rmse_df[1, 2] / rmse_df[2:4, 2] - 1), 3) * 100
max_improv <- round(max(rmse_df[1, 2] / rmse_df[2:4, 2] - 1), 3) * 100

## Simple kernel cross-validation
c_val_idx <- round(nrow(train_60) / 5)
c_val1 <- seq(1, c_val_idx * 4)
c_val2 <- c(seq(1, c_val_idx * 3), seq(c_val_idx * 4, nrow(train_60)))
c_val3 <- c(seq(1, c_val_idx * 2), seq(c_val_idx * 3, nrow(train_60)))
c_val4 <- c(seq(1, c_val_idx), seq(c_val_idx * 2, nrow(train_60)))
seqs <- list(c_val1, c_val2, c_val3, c_val4)
b_width <- c(0.0625, 0.03125, 0.015625)

kern_df <- c()
for (band in b_width) {
  for (i in 1:length(seqs)) {
    x_train <- as.numeric(train_60$corr)[seqs[[i]]]
    x_test <- as.numeric(train_60$corr)[!seq(1, nrow(train_60)) %in% seqs[[i]]]
    y_train <- as.numeric(train_60$xli)[seqs[[i]]]
    y_test <- as.numeric(train_60$xli)[!seq(1, nrow(train_60)) %in% seqs[[i]]]

    pred <- NULL
    for (xs in x_train) {
      out <- gauss_kern(x_train, y_train, band, xs)[2]
      pred <- rbind(pred, out)
    }
    rmse_train_kern <- sqrt(mean((y_train - pred)^2, na.rm = TRUE))

    pred_test <- c()
    for (xs in x_test) {
      out <- gauss_kern(x_train, y_train, band, xs)[2]
      pred_test <- rbind(pred_test, out)
    }
    rmse_test_kern <- sqrt(mean((y_test - pred_test)^2, na.rm = TRUE))

    rmses_kern <- cbind(band, rmse_train_kern, rmse_test_kern)
    kern_df <- rbind(kern_df, rmses_kern)
  }
}

# Print results table
kern_df %>%
  as.data.frame() %>%
  group_by(band) %>%
  summarise_all(mean) %>%
  mutate(decline = round((rmse_train_kern / rmse_test_kern - 1) * 100, 1)) %>%
  arrange(desc(band)) %>%
  mutate_at(vars(-band, -decline), function(x) round(x, 3)) %>%
  mutate(band = ifelse(band > 0.06, "Half", ifelse(band < 0.02, "Eighth", "Quarter"))) %>%
  rename("Parameter" = band, "Train" = rmse_train_kern,
         "Validation" = rmse_test_kern, "Decline (%)" = decline) %>%
  knitr::kable(caption = "Cross-validation errors and performance decline")

mean_kern_decline <- kern_df %>%
  as.data.frame() %>%
  group_by(band) %>%
  summarise_all(mean) %>%
  mutate(decline = rmse_train_kern / rmse_test_kern - 1) %>%
  summarise(decline = round(mean(decline), 2) * 100) %>%
  as.numeric()

## Linear cross-validation
lm_df <- c()
lm_dat <- coredata(train_60[, c("xli", "corr")]) %>% as.data.frame()
seqs <- list(c_val1, c_val2, c_val3, c_val4)
for (i in 1:length(seqs)) {
  train <- lm_dat[seqs[[i]], ]
  test <- lm_dat[!seq(1, nrow(lm_dat)) %in% seqs[[i]], ]
  mod <- lm(xli ~ corr, train)
  pred_train <- predict(mod, train, type = "response")
  pred_test <- predict(mod, test, type = "response")
  rmse_train <- sqrt(mean((train$xli - pred_train)^2, na.rm = TRUE))
  rmse_test <- sqrt(mean((test$xli - pred_test)^2, na.rm = TRUE))
  rmse <- cbind(rmse_train, rmse_test)
  lm_df <- rbind(lm_df, rmse)
}

# Print results table
lm_df %>%
  as.data.frame() %>%
  mutate(decline = round((rmse_train / rmse_test - 1) * 100, 1)) %>%
  mutate_at(vars(-decline), function(x) round(x, 3)) %>%
  rename("Train" = rmse_train, "Validation" = rmse_test, "Decline (%)" = decline) %>%
  knitr::kable(caption = "Linear regression cross-validation errors and performance decline")

mean_lin_decline <- lm_df %>%
  as.data.frame() %>%
  mutate(decline = rmse_train / rmse_test - 1) %>%
  summarise(decline = round(mean(decline), 3) * 100) %>%
  as.numeric()
```
Python code:
```python
# Built using Python 3.7.4

## Import libraries
import numpy as np
import pandas as pd
import pandas_datareader as dr
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 6)
plt.style.use('ggplot')

## See prior post for the code to download prices
## Get prices
prices = pd.read_pickle('xli_prices.pkl')
xli = pd.read_pickle('xli_etf.pkl')
returns = prices.drop(columns=['OTIS', 'CARR']).pct_change()
returns.head()

## Create rolling correlation function
def mean_cor(df):
    corr_df = df.corr()
    np.fill_diagonal(corr_df.values, np.nan)
    return np.nanmean(corr_df.values)

## Compile data and create train, test split
corr_comp = pd.DataFrame(index=returns.index[59:])
corr_comp['corr'] = [mean_cor(returns.iloc[i - 59:i + 1, :]) for i in range(59, len(returns))]
corr_comp.head()
xli_rets = xli.pct_change(60).shift(-60)
total_60 = pd.merge(corr_comp, xli_rets, how="left", on="Date").dropna()
total_60.columns = ['corr', 'xli']
split = round(len(total_60) * .7)
train_60 = total_60.iloc[:split, :]
test_60 = total_60.iloc[split:, :]

## Scatter plot with linear regression
# Note: could have done this with Seaborn, but wanted flexibility later
# for other kernel regressions
from sklearn.linear_model import LinearRegression

X = train_60['corr'].values.reshape(-1, 1)
y = train_60['xli'].values.reshape(-1, 1)
lin_reg = LinearRegression().fit(X, y)
y_pred = lin_reg.predict(X)

plt.figure(figsize=(12, 6))
plt.scatter(train_60['corr'] * 100, train_60['xli'] * 100, color='blue', alpha=0.4)
plt.plot(X * 100, y_pred * 100, color='darkblue')
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return (XLI) vs. correlation (constituents)")
plt.show()

# Create Gaussian kernel function
# The following were helpful in addition to the article mentioned above:
# https://github.com/kunjmehta/Medium-Article-Codes/blob/master/gaussian-kernel-regression-from-scratch.ipynb
# https://www.kaggle.com/kunjmehta/gaussian-kernel-regression-from-scratch
def gauss_kern(X, y, bandwidth, X_range=[0, 1]):
    def pdf(X, bandwidth):
        return (bandwidth * np.sqrt(2 * np.pi))**-1 * np.exp(-0.5 * (X / bandwidth)**2)

    if len(X_range) > 1:
        x_range = np.arange(min(X_range), max(X_range), 0.01)
    else:
        x_range = X_range

    xy_out = []
    for x_est in x_range:
        x_test = x_est - X
        kern = pdf(x_test, bandwidth)
        weight = kern / np.sum(kern)
        y_est = np.sum(weight * y)
        xy_out.append([x_est, y_est])
    return np.array(xy_out)

## Run kernel regression
kernel_out = gauss_kern(train_60['corr'].values, train_60['xli'].values, 0.06)

## Plot kernel regression over scatter plot
plt.figure(figsize=(12, 6))
plt.scatter(train_60['corr'] * 100, train_60['xli'] * 100, color='blue', alpha=0.4)
plt.plot(kernel_out[:, 0] * 100, kernel_out[:, 1] * 100, color='red')
plt.xlim(20, 80)
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return (XLI) vs. correlation (constituents)")
plt.show()

## Run kernel regression on multiple bandwidths
b_width = [0.0625, 0.03125, 0.015625]
kernel_out_list = []
for i in range(3):
    b_i = b_width[i]
    val_out = gauss_kern(train_60['corr'], train_60['xli'], b_i)
    kernel_out_list.append(val_out)

## Plot multiple regressions
cols = ['red', 'purple', 'blue']
labs = ['Half', 'Quarter', 'Eighth']
plt.figure(figsize=(12, 6))
plt.scatter(train_60['corr'] * 100, train_60['xli'] * 100, color='blue', alpha=0.4)
for i in range(3):
    plt.plot(kernel_out_list[i][:, 0] * 100, kernel_out_list[i][:, 1] * 100,
             color=cols[i], label=labs[i])
plt.xlim(20, 80)
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return vs. correlation with kernel regressions")
plt.legend(loc='upper left', title="Relative volatility")
plt.show()

## Print RMSE comparisons
ests = []
for j in range(3):
    est_out = []
    for i in range(len(train_60)):
        out = gauss_kern(train_60['corr'], train_60['xli'], b_width[j],
                         [train_60['corr'][i]])[0][1]
        est_out.append(out)
    ests.append(np.array(est_out))

lin_rmse = np.sqrt(np.mean((train_60['xli'].values - y_pred)**2))
rmse = [lin_rmse]
for k in range(3):
    rmse_k = np.sqrt(np.mean((train_60['xli'].values - ests[k])**2))
    rmse.append(rmse_k)
rmse_scaled = [x / np.std(train_60['xli']) for x in rmse]
models = ["Linear", "Kernel @ half", "Kernel @ quarter", "Kernel @ eighth"]
for l in range(4):
    print(f'{models[l]} -- RMSE: {rmse[l]:0.03f} RMSE scaled: {rmse_scaled[l]:0.03f}')

## Kernel cross-validation
c_val_idx = round(len(train_60) / 5)
c_val1 = np.arange(0, c_val_idx * 4)
c_val2 = np.concatenate((np.arange(0, c_val_idx * 3), np.arange(c_val_idx * 4 - 1, len(train_60))))
c_val3 = np.concatenate((np.arange(0, c_val_idx * 2), np.arange(c_val_idx * 3 - 1, len(train_60))))
c_val4 = np.concatenate((np.arange(0, c_val_idx), np.arange(c_val_idx * 2 - 1, len(train_60))))
seqs = [c_val1, c_val2, c_val3, c_val4]

kern_df = []
b_width = [0.065, 0.03125, 0.015625]
for band in b_width:
    for seq in seqs:
        test_val = [x for x in np.arange(len(train_60)) if x not in seq]
        x_train = train_60['corr'][seq].values
        x_test = train_60['corr'][test_val].values
        y_train = train_60['xli'][seq].values
        y_test = train_60['xli'][test_val].values

        pred = []
        for xs in x_train:
            out = gauss_kern(x_train, y_train, band, [xs])[0][1]
            pred.append(out)
        rmse_train_kern = np.sqrt(np.mean((y_train - pred)**2))

        pred_test = []
        for xs in x_test:
            out = gauss_kern(x_train, y_train, band, [xs])[0][1]
            pred_test.append(out)
        rmse_test_kern = np.sqrt(np.mean((y_test - pred_test)**2))

        kern_df.append([band, rmse_train_kern, rmse_test_kern])

## Print cross-validation results
kern_df = pd.DataFrame(kern_df, columns=['Parameter', 'Train', 'Validation'])
kern_df
kern_df_out = kern_df.groupby('Parameter').mean()
# note: the original referenced a 'Test' column that does not exist; it is 'Validation'
kern_df_out['Decline'] = kern_df_out['Train'] / kern_df_out['Validation'] - 1
kern_df_out.apply(lambda x: round(x, 3))

## Linear model cross-validation
lm_df = []
for seq in seqs:
    test_val = [x for x in np.arange(len(train_60)) if x not in seq]
    x_train = train_60['corr'][seq].values.reshape(-1, 1)
    x_test = train_60['corr'][test_val].values.reshape(-1, 1)
    y_train = train_60['xli'][seq].values.reshape(-1, 1)
    y_test = train_60['xli'][test_val].values.reshape(-1, 1)
    lin_reg = LinearRegression().fit(x_train, y_train)
    pred_train = lin_reg.predict(x_train)
    rmse_train = np.sqrt(np.mean((y_train - pred_train)**2))
    pred_test = lin_reg.predict(x_test)
    rmse_test = np.sqrt(np.mean((y_test - pred_test)**2))
    lm_df.append([rmse_train, rmse_test])

## Print linear model results
lm_df = pd.DataFrame(lm_df, columns=['Train', 'Validation'])
lm_df['Decline'] = lm_df['Train'] / lm_df['Validation'] - 1
lm_df.apply(lambda x: round(x, 3))
```
For the Gaussian kernel, the weighting function substitutes a user-defined smoothing parameter for the standard deviation (\(\sigma\)) in a function that resembles the Normal probability density function given by \(\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}\). The Gaussian kernel omits \(\sigma\) from the denominator.
For the Gaussian kernel, the lower \(\sigma\), the narrower the bell, which lowers the weight of the x values further away from the center.
Even more so with the rolling pairwise correlation since the likelihood of a negative correlation is low.
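To make that weighting concrete, here is a minimal sketch of a Gaussian-kernel weight and the weighted-average (Nadaraya-Watson) estimate it produces. This is an illustration only, not the post's actual gauss_kern function, whose signature differs:

```python
import numpy as np

def gauss_weight(x, center, sigma):
    """Gaussian kernel weight: a Normal-shaped bump, with no 1/sigma in front."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2)

def kernel_estimate(x_train, y_train, sigma, x0):
    """Weighted average of y_train, weights decaying with distance from x0."""
    w = gauss_weight(x_train, x0, sigma)
    return np.sum(w * y_train) / np.sum(w)

x = np.linspace(-1, 1, 201)
y = x ** 2
# With a narrow bandwidth, the estimate at 0.5 stays close to the local value 0.25;
# widen sigma and points far from 0.5 start pulling the estimate away from it.
print(kernel_estimate(x, y, 0.05, 0.5))
```

Shrinking sigma is exactly the "narrower bell" above: the weights on distant points collapse toward zero, so the fit hugs the local data more tightly.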
The post Kernel of error first appeared on Python-bloggers.
]]>The post How to Scrape Google Results for Free Using Python first appeared on Python-bloggers.
]]>There are a lot of paid services that provide Google results, and for good reason: in the right hands, Google results can be gold. In this post, we will show you how to get the results in a few lines of code, for free.
# importing the libraries we will need
import pandas as pd
import numpy as np
import urllib
from fake_useragent import UserAgent
import requests
import re
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
The key here is to build the Google URL using our keyword and the number of results. To do this, we have to URL-encode the keyword using urllib and add it to the URL. Let’s say our keyword is “elbow method python”.
keyword = "elbow method python"
html_keyword = urllib.parse.quote_plus(keyword)
print(html_keyword)
'elbow+method+python'
Now let’s build the google URL
number_of_result = 20
google_url = "https://www.google.com/search?q=" + html_keyword + "&num=" + str(number_of_result)
print(google_url)
'https://www.google.com/search?q=elbow+method+python&num=20'
Now we need to hit the URL and get the results. The fake_useragent and Beautiful Soup libraries will help us with that.
ua = UserAgent()
response = requests.get(google_url, headers={"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")
The only thing we need now is regular expressions to extract the information we want.
result = soup.find_all('div', attrs={'class': 'ZINbbc'})
results = [re.search(r'\/url\?q\=(.*)\&sa', str(i.find('a', href=True)['href'])) for i in result]
# this is because in rare cases we can't get the urls
links = [i.group(1) for i in results if i is not None]
links
['https://predictivehacks.com/k-means-elbow-method-code-for-python/', 'https://www.scikit-yb.org/en/latest/api/cluster/elbow.html', 'https://pythonprogramminglanguage.com/kmeans-elbow-method/', 'https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/', 'https://blog.cambridgespark.com/how-to-determine-the-optimal-number-of-clusters-for-k-means-clustering-14f27070048f', 'https://medium.com/analytics-vidhya/elbow-method-of-k-means-clustering-algorithm-a0c916adc540', 'https://www.youtube.com/watch%3Fv%3Dqs8nfzUsW5U', 'https://www.youtube.com/watch%3Fv%3DnMXg0f5HBac', 'https://www.youtube.com/watch%3Fv%3DzQfEc7vA1gU', 'https://stackoverflow.com/questions/41540751/sklearn-kmeans-equivalent-of-elbow-method', 'https://campus.datacamp.com/courses/cluster-analysis-in-python/k-means-clustering-3%3Fex%3D6', 'https://github.com/topics/elbow-method', 'https://github.com/topics/elbow-method%3Fl%3Dpython', 'https://towardsdatascience.com/clustering-metrics-better-than-the-elbow-method-6926e1f723a6', 'https://vitalflux.com/k-means-elbow-point-method-sse-inertia-plot-python/', 'https://www.kdnuggets.com/2019/10/clustering-metrics-better-elbow-method.html', 'https://www.kaggle.com/abhishekyadav5/kmeans-clustering-with-elbow-method-and-silhouette', 'https://realpython.com/k-means-clustering-python/', 'https://pyclustering.github.io/docs/0.8.2/html/d3/d70/classpyclustering_1_1cluster_1_1elbow_1_1elbow.html', 'https://jtemporal.com/kmeans-and-elbow-method/']
And this is how you can scrape Google results using Python. If you want to go even further, you can use a VPN to get Google results from different countries and cities.
Let’s sum it up in a single function.
def google_results(keyword, n_results):
    query = urllib.parse.quote_plus(keyword)  # Format into URL encoding
    ua = UserAgent()
    google_url = "https://www.google.com/search?q=" + query + "&num=" + str(n_results)
    response = requests.get(google_url, headers={"User-Agent": ua.random})
    soup = BeautifulSoup(response.text, "html.parser")
    result = soup.find_all('div', attrs={'class': 'ZINbbc'})
    results = [re.search(r'\/url\?q\=(.*)\&sa', str(i.find('a', href=True)['href'])) for i in result]
    links = [i.group(1) for i in results if i is not None]
    return links
google_results('machine learning in python', 10)
['https://www.coursera.org/learn/machine-learning-with-python', 'https://www.w3schools.com/python/python_ml_getting_started.asp', 'https://machinelearningmastery.com/machine-learning-in-python-step-by-step/', 'https://www.tutorialspoint.com/machine_learning_with_python/index.htm', 'https://towardsai.net/p/machine-learning/machine-learning-algorithms-for-beginners-with-python-code-examples-ml-19c6afd60daa', 'https://www.youtube.com/watch%3Fv%3DujTCoH21GlA', 'https://www.youtube.com/watch%3Fv%3DRnFGwxJwx-0', 'https://www.edx.org/course/machine-learning-with-python-a-practical-introduct', 'https://scikit-learn.org/', 'https://www.geeksforgeeks.org/introduction-machine-learning-using-python/']
The post How to Scrape Google Results for Free Using Python first appeared on Python-bloggers.
]]>The post Object Detection with Rekognition on Images first appeared on Python-bloggers.
]]>We will provide an example of how you can get the image labels using the AWS Rekognition. If you are not familiar with boto3, I would recommend having a look at the Basic Introduction to Boto3.
You can start experimenting with the Rekognition on the AWS Console. Let’s have a look at the example that they provided. Notice that you can upload your own image as well.
As we can see from the screenshot above, many objects are returned, along with their corresponding confidence scores.
We can also get the image labels using Boto3. Let’s see how we can do it. For this example, we will use the same image as above. There are two ways to get the images, one is from the S3 and the other is from local files. We will show both ways.
I have created a bucket called 20201021-example-rekognition, where I have uploaded the skateboard_thumb.jpg image. Let’s assume that I want to get a list of the image labels as well as of their parents.
import boto3

client = boto3.client('rekognition')

# My bucket
bucket = '20201021-example-rekognition'
# My photo
photo = 'skateboard_thumb.jpg'

response = client.detect_labels(Image={'S3Object': {'Bucket': bucket, 'Name': photo}},
                                MaxLabels=50)

# get a list of labels
label_lst = []
for label in response['Labels']:
    label_lst.append(label['Name'])

# get a list of parents
parent_lst = []
for label in response['Labels']:
    for parents in label['Parents']:
        parent_lst.append(parents['Name'])
Let’s see the label_lst
['Town', 'Road', 'Urban', 'Street', 'Building', 'City', 'Human', 'Person', 'Pedestrian', 'Vehicle', 'Automobile', 'Transportation', 'Car', 'Downtown', 'Path', 'Neighborhood', 'Asphalt', 'Tarmac', 'High Rise', 'Alleyway', 'Alley', 'Apparel', 'Clothing', 'Photo', 'Photography', 'Architecture', 'Parking', 'Parking Lot', 'Face', 'Sedan', 'People', 'Selfie', 'Portrait', 'Apartment Building', 'Intersection']
Let’s see also the parent_lst
['Urban', 'Building', 'City', 'Road', 'Urban', 'Building', 'Urban', 'Building', 'Person', 'Transportation', 'Vehicle', 'Transportation', 'Vehicle', 'Transportation', 'City', 'Urban', 'Building', 'Urban', 'Building', 'City', 'Urban', 'Building', 'Street', 'City', 'Road', 'Urban', 'Building', 'Street', 'City', 'Road', 'Urban', 'Building', 'Person', 'Person', 'Building', 'Car', 'Vehicle', 'Transportation', 'Car', 'Vehicle', 'Transportation', 'Person', 'Car', 'Vehicle', 'Transportation', 'Person', 'Portrait', 'Face', 'Photography', 'Person', 'Face', 'Photography', 'Person', 'High Rise', 'City', 'Urban', 'Building', 'Road']
Let’s say that we want to get the unique labels of the parent list. We can simply use the set function.
set(parent_lst)
{'Building', 'Car', 'City', 'Face', 'High Rise', 'Person', 'Photography', 'Portrait', 'Road', 'Street', 'Transportation', 'Urban', 'Vehicle'}
Notice that by taking the labels and the parent labels of the images you can build a machine learning model to measure the performance of the images, where your features will be the labels. You can then apply tf-idf on them and generally treat them as an NLP task.
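As a rough sketch of that idea (a from-scratch toy, not a production tf-idf implementation), each image becomes a "document" whose words are its labels, so tf-idf downweights labels that appear in every image. The label lists below are hypothetical:

```python
import math
from collections import Counter

# Hypothetical label lists for three images (each list plays the role of a document)
image_labels = [
    ['Person', 'Car', 'Road'],
    ['Person', 'Building', 'Road'],
    ['Person', 'Dog'],
]

n_docs = len(image_labels)
# document frequency: in how many images does each label appear?
doc_freq = Counter(label for labels in image_labels for label in set(labels))

def tfidf(labels):
    """tf-idf scores for one image's label list."""
    tf = Counter(labels)
    return {label: (count / len(labels)) * math.log(n_docs / doc_freq[label])
            for label, count in tf.items()}

# 'Person' appears in every image, so its idf (and hence its tf-idf) is 0
print(tfidf(image_labels[0]))
```

Ubiquitous labels like 'Person' carry no signal and get a score of 0, while rarer labels like 'Car' are promoted, which is exactly what you want before feeding the vectors into a model.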
Below we represent the same example as above, but this time by reading the image from the local file system.
import boto3

client = boto3.client('rekognition')

# My photo
photo = 'skateboard_thumb.jpg'

with open(photo, 'rb') as image:
    response = client.detect_labels(Image={'Bytes': image.read()}, MaxLabels=50)

# get a list of labels
label_lst = []
for label in response['Labels']:
    label_lst.append(label['Name'])

# get a list of parents
parent_lst = []
for label in response['Labels']:
    for parents in label['Parents']:
        parent_lst.append(parents['Name'])
The post Object Detection with Rekognition on Images first appeared on Python-bloggers.
]]>The post Example of Celebrity Rekognition with AWS first appeared on Python-bloggers.
]]>Amazon Rekognition gives us the chance to recognize celebrities in images and videos. For our example, I will choose an image of the Antetokounmpo brothers and we will see if Rekognition can recognize them.
You can try this image in the AWS Console. Let’s see what we get:
As we can see, it managed to detect Giannis and Thanasis Antetokounmpo! The rest of the brothers should wait until they become true celebrities, apparently.
We can call the API via Python and boto3 and get all the info from the API response, which is in json format. We will provide an example of how you can simply get the names of the celebrities. We will work again with the same image.
import boto3
import json

# create a connection with rekognition
client = boto3.client('rekognition')

# define the photo
photo = "Giannis-Antetokounmpo-Brothers.jpg"

# call the API and get the response
with open(photo, 'rb') as image:
    response = client.recognize_celebrities(Image={'Bytes': image.read()})

for celebrity in response['CelebrityFaces']:
    print('Name: ' + celebrity['Name'] + ' with Confidence: ' + str(celebrity['MatchConfidence']))
Output:
Name: Thanasis Antetokounmpo with Confidence: 58.999996185302734
Name: Giannis Antetokounmpo with Confidence: 57.0
Notice that the response contains much more information, like the Bounding Boxes etc.
As you can see, we used boto3. If you want to learn more about it, you can have a look at the Basic Introduction to Boto3.
The post Example of Celebrity Rekognition with AWS first appeared on Python-bloggers.
]]>How Do I Get Started with Image Classification? In this article, we’ll help you choose the right tools and architectures for your first Image Classification project. We’ll recommend some of the best programming tools and model architectures available for classification problems in computer vision. Image classification is subject ...
The post Getting Started With Image Classification: fastai, ResNet, MobileNet, and More first appeared on Python-bloggers.
]]>In this article, we’ll help you choose the right tools and architectures for your first Image Classification project. We’ll recommend some of the best programming tools and model architectures available for classification problems in computer vision. Image classification is subject to the same rules as any modeling problem. Choosing the right tools for the job is of critical importance for success.
Interested in Object Detection? Check out our Introduction to YOLO Object Detection.
You might be wondering whether to implement your model in PyTorch or TensorFlow. In short – it doesn’t matter, as a huge and credible community supports both frameworks. If you follow trends from Papers with Code, you might have noticed that PyTorch is gaining popularity inside the research community, which usually translates into future industry trends.
We, however, recommend using the fastai library. It is the most popular package for adding higher-level functionality on top of PyTorch. The official docs state:
“fastai simplifies training fast and accurate neural nets using modern best practices.”
This is an accurate description. Using fastai saves you a lot of time, as the first baseline model is built very quickly.
Such an approach has at least two upsides. Producing a baseline model is critical for the next research iterations. Without the baseline, you can’t validate if your experiment results in an improvement or not. Secondly, you can invest this saved time on model development in inspecting model results. The end goal is to understand better how the model works so you can improve its performance.
To start, ask yourself the following question: What are my success criteria?
Defining your success criteria is crucial and independent of the problem you are trying to solve. For example, maybe you want to maximize accuracy. Maximizing accuracy is the most common end-goal of any computer vision project. On the other hand, perhaps you are limited by hardware, so you are willing to trade accuracy for efficiency. It’s important to know what your primary goal is before you start with the project. We’ll consider the top two goals and how they impact the architecture choices.
We always suggest starting with an off-the-shelf architecture, adjusting it to your problem, and leveraging Transfer Learning (TL).
Wait, what the heck is transfer learning? Here’s a concise hands-on introduction to Transfer Learning.
If your goal is to maximize accuracy, starting with ResNet-50 or ResNet-101 is a good choice. They are easier to train and require fewer epochs to reach excellent performance than EfficientNets. ResNets from 50 layers use Bottleneck Blocks instead of Basic Blocks, which results in a higher accuracy with less computation time.
Most of the advancements in image models in recent years are tweaks to the original ResNet. Using these architectures and tricks such as progressive resizing or mixed precision gives excellent results that are usually satisfactory in business settings.
If you want to build a model running on mobile or edge devices, you are constrained by limited computation, power, and space. In these cases, using a recent version of MobileNet is the right choice.
Note: if we assume the mobile device has access to the internet, the model can be deployed to a remote server. The inference will happen on the server, which is easier to scale and isn’t restricted (or at least is restricted to a lesser extent) by memory or processing capacity. The choice here is project-specific, but it’s good to be aware of alternatives and options.
Are you an R Programmer? Learn How to Make a Computer Vision Model Within an R Environment
For most image classification projects, we propose to start building your models using fastai with pre-trained ResNet-50 or ResNet-101 architectures. This way, you should be able to create solid baseline models. If your project is limited by computation and storage resources, you should probably look into more efficient networks such as MobileNet, which is optimized to work on mobile or edge devices.
To summarize:
Article Getting Started With Image Classification: fastai, ResNet, MobileNet, and More comes from Appsilon Data Science | End to End Data Science Solutions.
The post Getting Started With Image Classification: fastai, ResNet, MobileNet, and More first appeared on Python-bloggers.
]]>The post Bayesian Statistics using R, Python, and Stan first appeared on Python-bloggers.
]]>For a year now, this course on Bayesian statistics has been on my to-do list. So without further ado, I decided to share it with you already.
Richard McElreath is an evolutionary ecologist who is famous in the stats community for his work on Bayesian statistics.
At the Max Planck Institute for Evolutionary Anthropology, Richard teaches Bayesian statistics, and he was kind enough to put his whole course on Statistical Rethinking: Bayesian statistics using R & Stan open access online.
You can find the video lectures here on Youtube, and the slides are linked to here:
Richard also wrote a book that accompanies this course:
For more information about the book, click here.
For the Python version of the code examples, click here.
The post Bayesian Statistics using R, Python, and Stan first appeared on Python-bloggers.
]]>The post A Basic Introduction to Boto3 first appeared on Python-bloggers.
]]>In a previous post, we showed how to interact with S3 using AWS CLI. In this post, we will provide a brief introduction to boto3 and especially how we can interact with the S3.
You can download the Boto3 packages with pip install:
$ python -m pip install boto3
or through Anaconda:
conda install -c anaconda boto3
Then, it is better to configure it as follows:
For the credentials, which are under ~/.aws/credentials:
[default]
aws_access_key_id = YOUR_KEY
aws_secret_access_key = YOUR_SECRET
And for the region, you can use the file which is under ~/.aws/config:
[default]
region=us-east-1
Once you are ready, you can create your client:
import boto3

s3 = boto3.client('s3')
Notice that in many cases and in many examples you will see boto3.resource instead of boto3.client. There are small differences, and I will use the answer I found on StackOverflow:
Client:
Resource:
Assume that you have already created some S3 buckets, you can list them as follow:
list_buckets = s3.list_buckets()

for bucket in list_buckets['Buckets']:
    print(bucket['Name'])
gpipis-cats-and-dogs
gpipis-test-bucket
my-petsdata
Let’s say that we want to create a new bucket in S3. Let’s call it 20201920-boto3-tutorial.
s3.create_bucket(Bucket='20201920-boto3-tutorial')
Let’s see if the bucket is actually on S3
for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'])
20201920-boto3-tutorial
gpipis-cats-and-dogs
gpipis-test-bucket
my-petsdata
As we can see, the 20201920-boto3-tutorial bucket was added.
We can simply delete an empty bucket:
s3.delete_bucket(Bucket='my_bucket')
If you want to delete multiple empty buckets, you can write the following loop:
list_of_buckets_i_want_to_delete = ['my_bucket01', 'my_bucket02', 'my_bucket03']

for bucket in s3.list_buckets()['Buckets']:
    if bucket['Name'] in list_of_buckets_i_want_to_delete:
        s3.delete_bucket(Bucket=bucket['Name'])
A bucket has a unique name in all of S3 and it may contain many objects which are like the “files”. The name of the object is the full path from the bucket root, and any object has a key which is unique in the bucket.
I have 3 txt files and I will upload them to my bucket under a key called mytxt.
# Set filename and key
s3.upload_file(Bucket='20201920-boto3-tutorial',
               Filename='file01.txt',
               Key='mytxt/file01.txt')

s3.upload_file(Bucket='20201920-boto3-tutorial',
               Filename='file02.txt',
               Key='mytxt/file02.txt')

s3.upload_file(Bucket='20201920-boto3-tutorial',
               Filename='file03.txt',
               Key='mytxt/file03.txt')
As we can see, the three txt files were uploaded to the 20201920-boto3-tutorial bucket under the mytxt key.
Notice: The files that we upload to S3 are private by default. If we want to make them public, then we need to add ExtraArgs={'ACL': 'public-read'}. For example:
s3.upload_file(Bucket='20201920-boto3-tutorial',
               Filename='file03.txt',
               Key='mytxt/file03.txt',
               ExtraArgs={'ACL': 'public-read'})
We can list the objects as follow:
for obj in s3.list_objects(Bucket='20201920-boto3-tutorial', Prefix='mytxt/')['Contents']:
    print(obj['Key'])
Output:
mytxt/file01.txt
mytxt/file02.txt
mytxt/file03.txt
Let’s assume that I want to delete all the objects in ‘20201920-boto3-tutorial’ bucket under the ‘mytxt’ Key. We can delete them as follows:
for obj in s3.list_objects(Bucket='20201920-boto3-tutorial', Prefix='mytxt/')['Contents']:
    s3.delete_object(Bucket='20201920-boto3-tutorial', Key=obj['Key'])
Let’s assume that we want to download the dataset.csv file which is under the mycsvfiles key in MyBucketName. We can download the existing object (i.e. file) as follows:
s3.download_file(Filename='my_csv_file.csv',
                 Bucket='MyBucketName',
                 Key='mycsvfiles/dataset.csv')
Instead of downloading an object, you can read it directly. For example, it is quite common to deal with csv files that you want to read as pandas DataFrames. Let’s see how we can get the file01.txt which is under the mytxt key.
obj = s3.get_object(Bucket='20201920-boto3-tutorial', Key='mytxt/file01.txt')
obj['Body'].read().decode('utf-8')
Output:
'This is the content of the file01.txt'
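The same pattern extends to csv objects: once you have the Body bytes, pandas can parse them directly. Below is a small self-contained sketch in which a byte string stands in for what s3.get_object would return; the bucket and key names in the comment are hypothetical:

```python
import io

import pandas as pd

# Simulated payload; in practice this would come from S3, e.g.:
#   body_bytes = s3.get_object(Bucket='20201920-boto3-tutorial',
#                              Key='mycsvfiles/dataset.csv')['Body'].read()
body_bytes = b"col1,col2\n1,a\n2,b\n"

# pandas reads the in-memory bytes just like a file on disk
df = pd.read_csv(io.BytesIO(body_bytes))
print(df.shape)  # (2, 2)
```

Wrapping the bytes in io.BytesIO avoids writing a temporary file, which is handy when the same code has to run in Lambda or other environments with limited disk access.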
That was a brief introduction to Boto3. Actually, with the Boto3 you can have almost full control of the platform.
The post A Basic Introduction to Boto3 first appeared on Python-bloggers.
]]>The post How to Switch Into a Data Science Career first appeared on Python-bloggers.
]]>This article was written by Rosana de Oliveira Gomes, Lead Machine Learning Engineer at Omdena, and Joseph Itopa A, Junior Machine Learning Engineer at Omdena.
Transitioning into a new career can feel like boarding a plane that’s already taking off. The data science profession is relatively new, which means that many data scientists and machine learning engineers didn’t start their careers on this path. They switched from other fields like we did, and perhaps like many of you reading this.
So let’s talk about what tools and skills you need to transition into data science—we’ll highlight possible challenges and give you practical advice on how to overcome them.
Data science doesn’t require advanced math knowledge beyond what’s required for any science degree. But every artificial intelligence algorithm is based on some mathematical structure which you will need to understand. This usually involves linear algebra and some concepts from calculus. To interpret results, you will need to conduct statistical analysis, which also requires knowledge of probability and statistics.
While math provides the concepts, programming languages are the tools to make those concepts tangible. This means that you have to choose a programming language to learn, which in this field usually is either Python or R, likely combined with SQL and Bash. A KDnuggets poll on programming software used by data scientists reveals that Python has surpassed R as the tool of choice.
But the choice of a programming language essentially boils down to the task at hand and style preferences. Python is easy to pick up for someone who has experience with programming, and is widely used across industries and specialties such as data science and machine learning. R is a good choice too if you have a statistics background and will be working mostly on analysis. It also has built-in tools and libraries to communicate results through reports. Our advice is to stick with one language and start building something after you are done with the basics.
Speaking from experience, we’ve learned that to acquire the necessary skills in data science, you should choose only one learning provider at a time and stick with it. The worst thing you can do is to keep learning the same things over and over.
You can view the science of data science as the ability to solve problems with creative and logical thinking. This requires knowledge of programming and an understanding of algorithms gained through practice.
After acquiring some basic knowledge of programming, you can begin solving real-world problems by practicing via courses or platforms. GeeksforGeeks provides hands-on projects for competitive coding, Python, JAVA, and SQL. Solving some Kaggle competition problems can also boost your problem-solving skills, as you can easily leverage real-world data to practice with and find a lot of help in the community. DataCamp’s unguided projects are a great way to find your own solutions to open-ended projects.
It’s important that you find enjoyment in these accomplishments to pursue a data science career. In a recent Omdena webinar, data science influencer Eric Weber said, “Don’t optimize for income only but for what brings you joy; otherwise, you may burn out quickly.”
After working with some algorithms and practicing on a few projects, you will be ready for more advanced projects. This is where collaborative platforms come in. Collaborative data science programs rely on communities to develop projects in a diverse and productive way.
The ruggedness and fun in street coding is best found in collaborative projects. You will learn from others as they learn from you. You will make friends while struggling with unstructured and messy data—there’s nothing like it.
Inspiring collaborative initiatives include Data Kind, Science to data science, and Data Science for Social Good. These options are often location-specific and may be costly or competitive due to limited availability and an extensive application process.
An alternative to collaborative projects is Omdena, which launches several projects per month and applies a principle of volunteering to solve real-world problems through online collaboration. Learners work with domain experts who will help them stay motivated with webinars, courses, books, and blog posts.
Changing careers is a project. It requires a strategic plan, a timeline, and specific (and realistic) milestones.
Ask yourself these questions:
Once you answer these truthfully, you need a plan. In the end, you have to find what works for you. Here are some things you can do:
Another must-have skill for data science is communication. A data scientist translates tons of data into actionable insights for decision makers and stakeholders. However, not all the people that you will need to communicate with are data scientists or have a background in STEM. If you’re an introvert, you may not be super excited about public speaking or constant communication. Communication in data science is not only about being a good speaker but also about building these habits and skills:
Documentation is power, so write and keep writing. In the very near future, you will need to demonstrate the skills that you put on your resume, and you can back them via nicely documented code repositories (for example, on GitHub), blog posts, and webinars or talks about your work. The earlier you start, the faster you’ll improve.
One common challenge experienced by many career movers is imposter syndrome. It’s never easy having an established career and suddenly becoming a newbie. In this case, it’s all a matter of mindset: keep yourself motivated and excited about all the new things you are about to learn! Omdena has hosted a webinar on how to overcome imposter syndrome as a data scientist, which includes knowing the skills gap that you need to fill and identifying the skills you already learned from your previous career.
Networking on your job hunt doesn’t have to be awkward. In fact, statistically, the majority of job opportunities come from an individual’s network, not from applications (check the references here and here). When you network, you’re simply connecting to people who have similar interests as you, getting their take on data science topics, and getting a peek into their careers. These people may become your future colleagues. The absolute worst thing that can happen is not getting any response—what do you have to lose?
One of the easiest ways to network is through collaborative projects, where you have the opportunity to share knowledge, work with experienced practitioners, and gain insights from people in different roles, and find leads for jobs.
Another good way to network is by following authorities in data science such as:
Contact any interesting people you encounter—a speaker from a podcast you liked, a teacher from an online course, or a blogger whose posts you enjoy. The golden rule is: Put yourself out there and ask questions. That’s the best way to get feedback.
Joseph is currently transitioning into data science from Engineering, and Rosana from Astrophysics. Contact them and continue this conversation on LinkedIn: Joseph and Rosana.
The post How to Switch Into a Data Science Career first appeared on Python-bloggers.
]]>The post ggplot2 In Python using Plotnine first appeared on Python-bloggers.
]]>If you are familiar with ggplot2 in R, you know that this library is one of the best-structured ways to make plots. We will show you how to create plots in python with the syntax of ggplot2, using the library plotnine.
# Using pip
$ pip install plotnine

# Or using conda
$ conda install -c conda-forge plotnine
Firstly, let’s import the libraries and create our dummy data.
import pandas as pd
import numpy as np
import plotnine as p9
import random

data = np.random.randint(1, 10, size=300)
df = pd.DataFrame(data, columns=['variable'])
df['category'] = random.choices(['A', 'B', 'C'], k=300)
df['variable2'] = random.sample(range(10, 1000), 300)
df['variable3'] = df['variable2'].apply(lambda x: x*random.random())
   variable category  variable2   variable3
0         3        A        747  356.282975
1         6        A        837  432.941801
2         2        A        941  195.533003
3         4        A        679  131.990057
4         7        A        912  696.910478
Now, Let’s create some basic plots using plotnine.
p9.ggplot(df)+ p9.aes(x='variable')+p9.geom_histogram(binwidth=2)
As you can see, it’s almost identical to ggplot. Let’s see some other basic examples.
p9.ggplot(df)+ p9.aes(x='variable') + p9.geom_density(fill="darkgrey")
p9.ggplot(df)+p9.aes(y='variable',x='category')+p9.geom_boxplot()+ p9.coord_flip()
p9.ggplot(df)+p9.aes(x='category')+ p9.geom_bar()
p9.ggplot(df)+p9.aes(y='variable3',x='variable2')+p9.geom_point(size=4)
p9.ggplot(df)+p9.aes(y='variable3',x='variable2',color='category')+p9.geom_point(size=4)
p9.ggplot(df)+p9.aes(y='variable2',x='category',fill='category')+ p9.geom_violin(scale = "width")
As you can see, the syntax is almost identical to ggplot2 in R. Be sure to check out dplyr pipes in python.
The post ggplot2 In Python using Plotnine first appeared on Python-bloggers.
]]>