The post LondonR Talks – Computer Vision Classification – Turning a Kaggle example into a clinical decision making tool first appeared on Python-bloggers.
I had the pleasure of speaking at the last LondonR event of 2020. What a strange year it has been! But this put the icing on the cake.
The premise of my talk was to take a novel Kaggle parasite cell dataset and show how this type of classification task could be transferred to other areas such as clinical x-ray scanning, diagnostic image condition detection, etc.
To view the talk, have a look at the LondonR event below. I was on first and then two very interesting talks followed from Gwynn Sturdevant – FasteR coding: vectorizing computations in R and Stuart Lodge – Raindrops on roses and whiskers on kittens – a few small things that make me a happy R developer:
The presentations from the session are available here. The GitHub code for the convolutional neural network can be found by clicking the GitHub button:
I have written a tutorial about this in my previous blog: https://hutsons-hacks.info/nhs-r-community-lightening-talk-computer-vision-classification-how-it-can-aid-clinicians-malaria-cell-case-study-with-r.
This presentation had two new additions on top of what was presented recently at an NHS-R Community Conference event.
The two new additions delved into how computer vision classification can be used with localisation (bounding box) detection to create novel ideas such as a face mask detector:
The Mango team were used as an example of how facial recognition – specifically the YOLO framework – can be used to detect faces in Python:
The code for this Python file is also contained in the LondonR file.
This was a really fun experience and I would urge anyone with an R-based project to sign up for LondonR. The hosts, Mango, are great and you will be treated to a feast of discussions, as well as really friendly people to boot.
Look out for my next blog post on this site and please follow the social icons below to connect with me.
The post Reshape Pandas Data Frames first appeared on Python-bloggers.
We will provide some examples of how we can reshape Pandas data frames based on our needs. We want to provide a concrete and reproducible example and for that reason, we assume that we are dealing with the following scenario.
We have a data frame of three columns such as:
The data are in a long format, where each case is one row. Let’s create the data frame:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4],
                   'Type': ['A', 'B', 'C', 'E', 'D', 'A', 'E', 'B', 'C', 'A'],
                   'Value': ['L', 'L', 'M', 'H', 'H', 'H', 'L', 'M', 'M', 'M']})
df
Let’s say that we want to aggregate the data by ID by concatenating the text variables Type and Value respectively. We will use a lambda function with the join method, where our separator will be the pipe character | (but it can be whatever you want).
# Aggregate the data by ID
df_agg = df.groupby('ID', as_index=False)[['Type', 'Value']].agg(lambda x: '|'.join(x))
df_agg
As we can see, we now have 4 rows, one per ID.
Our goal is to convert df_agg back to the initial data frame. We will need a few steps to achieve this.
We will need to split the Type and Value columns and transform them into lists.
df_agg['Type'] = df_agg['Type'].apply(lambda x: x.split('|'))
df_agg['Value'] = df_agg['Value'].apply(lambda x: x.split('|'))
df_agg
We know that the elements of each list appear in order. So, we need to do a mapping between the Type and the Value lists element-wise. For that reason, we will use the zip function.
df_agg['Type_Value'] = df_agg.apply(lambda x: list(zip(x.Type, x.Value)), axis=1)
df_agg
Now, we will explode the Type_Value as follows:
df_agg = df_agg.explode('Type_Value')
df_agg
Now we want to split the tuple into two different columns, where the first refers to the Type and the second to the Value.
df_agg[['New_Type', 'New_Value']] = pd.DataFrame(df_agg['Type_Value'].tolist(), index=df_agg.index)
df_agg
Now we will keep only the columns that we want and we will rename them.
df_agg = df_agg[['ID', 'New_Type', 'New_Value']].\
    rename(columns={"New_Type": "Type", "New_Value": "Value"}).\
    reset_index(drop=True)
df_agg
As we can see, we started with a long format, reshaped the data, and then converted it back to the initial format. Let’s verify that the initial data frame is the same as the last one that we created.
df_agg.equals(df)
and we get True!
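As a side note, newer pandas versions support exploding several columns in a single call, which shortens the whole round trip. A sketch of the same reconstruction, assuming pandas >= 1.3 (where DataFrame.explode accepts a list of columns):

```python
import pandas as pd

# Rebuild the aggregated frame from earlier in the post
df = pd.DataFrame({'ID': [1, 1, 1, 1, 2, 2, 3, 3, 3, 4],
                   'Type': ['A', 'B', 'C', 'E', 'D', 'A', 'E', 'B', 'C', 'A'],
                   'Value': ['L', 'L', 'M', 'H', 'H', 'H', 'L', 'M', 'M', 'M']})
df_agg = df.groupby('ID', as_index=False)[['Type', 'Value']].agg(lambda x: '|'.join(x))

# Split both text columns back into lists ...
df_agg[['Type', 'Value']] = df_agg[['Type', 'Value']].apply(lambda col: col.str.split('|'))

# ... and explode them together in one call (pandas >= 1.3)
df_restored = df_agg.explode(['Type', 'Value']).reset_index(drop=True)
print(df_restored.equals(df))  # True
```

This avoids the intermediate zip/tuple-splitting steps entirely, at the cost of requiring a recent pandas.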
The post Boosting nonlinear penalized least squares first appeared on Python-bloggers.
For some reasons I couldn’t foresee, there were no blog posts here on November 13 and November 20. So, here is the post about LSBoost announced here a few weeks ago.
First things first, what is LSBoost? Gradient boosted nonlinear penalized least squares. More precisely in LSBoost, the ensembles’ base learners are penalized, randomized neural networks.
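The real implementation lives in the mlsauce package, but the core idea can be sketched in a few lines: boost on residuals, where each base learner is a penalized least squares fit on a random (untrained) hidden layer. All names and parameter choices below are my own illustration, not mlsauce’s API:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy regression data
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

def fit_base_learner(X, r, n_hidden=25, reg=1.0):
    """One base learner: random hidden layer + ridge (penalized least squares)."""
    W = rng.normal(size=(X.shape[1], n_hidden))   # random, untrained weights
    H = np.tanh(X @ W)                            # hidden-layer activations
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ r)
    return W, beta

def predict_base(X, W, beta):
    return np.tanh(X @ W) @ beta

# Gradient boosting on the residuals, with learning rate nu
nu, learners = 0.1, []
pred = np.zeros_like(y)
for _ in range(100):
    W, beta = fit_base_learner(X, y - pred)       # fit the current residuals
    learners.append((W, beta))
    pred += nu * predict_base(X, W, beta)

print(np.mean((y - pred) ** 2))  # training MSE, far below var(y)
```

The actual LSBoost adds column/row subsampling and more careful regularization; this sketch only shows the boosting-on-randomized-networks skeleton.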
These previous posts, with several Python and R examples, constitute a good introduction to LSBoost:
https://thierrymoudiki.github.io/blog/2020/07/24/python/r/lsboost/explainableml/mlsauce/xai-boosting
More recently, I’ve also written a more formal, short introduction to LSBoost:
The paper’s code – and more insights on LSBoost – can be found in the following Jupyter notebook:
Comments and suggestions are welcome, as usual.
The post 13 Use Cases for Data-Driven Digital Transformation in Finance first appeared on Python-bloggers.
Over the past decade, big data and digital technologies have disrupted industries and consumer behavior alike. IDC and Statista estimate that the volume of data generated yearly rose from two zettabytes in 2010 to 59 zettabytes in 2020, marking a thirtyfold increase in data generated in the past 10 years alone (Statista). This data deluge is only expected to grow, with projections predicting 149 zettabytes produced yearly by 2024.
While various industries are vying to take advantage of the data deluge with business intelligence, data science, and machine learning, the financial services industry is best equipped to benefit from big data. Data is at the heart of the financial services industry—across retail banking, investment banking, and insurance. Financial services organizations produce and store data on their customer transactions, detailed customer profiles through compliance processes, insurance claims, stock market exchanges, and more. The amount of data generated is astounding: The New York Stock Exchange alone produces one terabyte of trade data daily (Investopedia).
We’ve already seen fintech startups take advantage of shifting consumer behavior and the financial industry’s data deluge. Digital banks such as N26, Revolut, and Monzo abandoned the brick-and-mortar model and opted for a purely digital banking experience, relying on data to improve user experience and automate workflows (Revolut). Klarna, Europe’s largest fintech unicorn, provides interest-free installment options with automated approval or rejection using machine learning (CNBC). The data deluge has not only opened up space for disruptive, innovative services—it’s opened the door for data-enabled digital transformation across the industry.
Disruptive digital-first startups across all industries have prompted many incumbents to invest heavily in digital transformation. The financial services industry is no exception. An Accenture and Oxford study in 2018 found that 87% of retail banking executives have developed a long-term plan for technology investment and digital transformation (Accenture). This is especially true in the COVID-19 economy, which has moved consumer purchasing online and accelerated digital transformation programs across all industries.
This acceleration is exceptionally pressing in the financial services industry. A recent study from the Economist Intelligence Unit cites that 45% of banking executives believe building a “true digital ecosystem” is the best strategic response to the pandemic. In the same survey, 66% of respondents believe that new technologies such as machine learning and artificial intelligence will bring the most significant impact on the banking industry by 2025.
Taking an example from the ground, the urgency of using contactless financial tools ushered in an 84% increase in Citibank’s daily mobile check deposits, and a tenfold increase in activity on Apple Pay (Forbes). This has prompted Jane Fraser, president of Citigroup and CEO of its consumer bank, to declare, “Banking has changed irrevocably as a result of the pandemic. The pivot to digital has been supercharged. […] We believe we have the model of the future—a light branch footprint, seamless digital capabilities, and a network of partners that expand our reach to hundreds of millions of customers.”
The success of such digital transformation programs pivots on the seamless integration of digital technologies with data-driven insights and high-impact data science use-cases. What are these high-impact use cases and what are the challenges standing in the way? In our white paper, Digital Transformation in Finance: Upskilling for a data-driven age, we dissect 13 high-impact use cases spread across domain and sector and the challenges large financial institutions face to becoming data-driven.
The post MongoDB and Python - Simplifying Your Schema - ETL Part 2 first appeared on Python-bloggers.
In my previous post, we created a user object and saved it into a MongoDB collection. This was relatively straightforward but you may see some immediate challenges. While directly inserting data is an acceptable practice, it is similar to writing SQL in the backend of your application. SQL is just fine, but it may not always be the easiest to work with and can potentially lead to security problems (i.e. SQL injection attacks). It may also require you to write a lot of code to check the data schema, query efficiently, and/or write rules within your database.
Let’s take a look at an example schema of our previous data if it were normalized in a traditional SQL database. We would have two tables: users and states. The major change that happened here is that a states table was formed and linked to users with the state_id key. All of the states would be listed in the states table and integers would represent those states in the users table. While this is a simple example, it illustrates the fact that complexity will increase quickly. At this point, to utilize the state data we need to perform a join.
*Note that the state field could simply be a string within the users table, but it is not a best practice (and does not illustrate the point I’m trying to make).
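To make the join concrete, here is a minimal sketch of that normalized two-table layout, using Python’s built-in sqlite3. The table and column names follow the description above; the row contents are just illustrative:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Normalized schema: states live in their own table, users reference them by id
cur.execute("CREATE TABLE states (state_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE users (username TEXT, state_id INTEGER)")
cur.execute("INSERT INTO states VALUES (1, 'Colorado')")
cur.execute("INSERT INTO users VALUES ('scott', 1)")

# Even reading back a single user's state now requires a join
cur.execute("""SELECT u.username, s.name
               FROM users u JOIN states s ON u.state_id = s.state_id""")
rows = cur.fetchall()
print(rows)  # [('scott', 'Colorado')]
conn.close()
```

Every additional linked entity (restaurants, say) means another table and another join.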
Compare that schema to how this could simply be modeled in MongoDB.
Let’s take our schema a little bit further.
Imagine if we had to expand our application to do something more fun: users could name their favorite restaurant in each of five states. That would require big expansions of the schema, and any subsequent changes to that schema would require enormous amounts of effort. Here’s how that might look.
Compare that to how this might look in MongoDB.
Here’s an example of an object we would like to store in MongoDB. It follows the exact outline above, with favorite_restaurants being a nested object that uses the state as a key and restaurant name as the value.
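The original figure showing the object didn’t survive publishing, so here is a sketch of that document as a Python dictionary (the values are taken from the full code below):

```python
user_document = {
    'username': 'scott',
    'state': 'Colorado',
    # Nested object: state name as the key, restaurant name as the value
    'favorite_restaurants': {
        'Colorado': "Nick's Italian",
        'North Carolina': "Midnight Diner",
        'California': "Trujillo's Taco Shop",
        'Texas': "Killen's Barbecue",
        'New York': "Artichoke Basille's Pizza",
    },
}

# The whole structure lives in one document -- no join needed
print(user_document['favorite_restaurants']['Texas'])  # Killen's Barbecue
```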
Let’s look at this in terms of Python code.
import os
import datetime

import dotenv
import pymongo

dotenv.load_dotenv()

mongo_database_name = 'example_db'
mongo_collection_name = 'example_collection'

# The credentials in the original connection string were stripped in publishing;
# MONGO_USER/MONGO_PASSWORD/MONGO_CLUSTER are placeholder names for your .env values
db_client = pymongo.MongoClient(
    f"mongodb+srv://{os.environ['MONGO_USER']}:{os.environ['MONGO_PASSWORD']}"
    f"@{os.environ['MONGO_CLUSTER']}/?retryWrites=true&w=majority")
db = db_client[mongo_database_name]
collection = db[mongo_collection_name]

username = 'scott'
hashed_password = '34hl2jlkfdjlk23jlk23'
favorite_integer = 1
favorite_float = 3.14
state = 'Colorado'
favorite_restaurants = {
    'Colorado': "Nick's Italian",
    'North Carolina': "Midnight Diner",
    'California': "Trujillo's Taco Shop",
    'Texas': "Killen's Barbecue",
    'New York': "Artichoke Basille's Pizza"
}

user_data = {
    'created_at': datetime.datetime.utcnow(),
    'username': username,
    'hashed_password': hashed_password,
    'favorite_integer': favorite_integer,
    'favorite_float': favorite_float,
    'state': state,
    'favorite_restaurants': favorite_restaurants
}

print("Here are your data types:")
for k, v in user_data.items():
    print(f" - {k}: {type(v)}")  # the f-string braces were stripped in publishing

inserted_data = collection.insert_one(user_data)
if inserted_data.acknowledged:
    print('Data was stored!')
else:
    print('You had an issue writing to the database')

database_return = collection.find_one()
print("Here is your returned data:")
print(database_return)
print("Here are your returned data types:")
for k, v in database_return.items():
    print(f" - {k}: {type(v)}")

db_client.close()
It’s that simple. We inserted a dictionary with the states and restaurants, and the user’s data is easily accessible within a single document. It is also worth noting that the data types and structures are maintained when we query the data.
Coming up next time, we’ll go over some slightly more complicated database inserts, queries and “gotchas”. As always, the code for this can be found on our GitHub repository.
The post MongoDB and Python - Avoiding Pitfalls by Using an "ORM" - ETL Part 3 first appeared on Python-bloggers.
In my previous post, I showed how you can simplify your life by using MongoDB compared to a traditional relational SQL database. To put it simply, it is trivial to provide extra depth in MongoDB (or any document database) by nesting data structures. However, this level of simplicity has its drawbacks, and you need to be aware of these as a new user.
Let’s start by looking at a potential problem with the workflow from my previous post. We have a user in our app with the following information: username, hashed password, favorite integer, favorite float, and state. Typically, an app would only allow for unique usernames to avoid confusion. However, our previous application would simply continue to add users regardless of duplication. Rules would have to be added at the database level and it’s not trivial to do that.
However, if we use an ORM like mongoengine, we can solve this problem easily. If you are familiar with the sqlalchemy package in Python, you are probably already aware of the benefits of an ORM. If you are not familiar with this concept, there are plenty of great resources out there to look at ( https://www.google.com/search?q=benefits+of+an+orm&oq=benefits+of+an+ORM ). Those resources will be better than what I can provide, so feel free to use those. The short version is: an ORM will make your life easier because it handles your schema, provides validation, improves querying capabilities, etc.
Data in mongoengine can be represented with an object that defines the schema of the collection. Let’s consider the data from the last post in which a user has the following data points: username, hashed password, favorite floating point number, favorite integer, state, and favorite restaurants per state. The dictionary representation can be seen below.
Previously, we simply took the user_data variable and inserted it using pymongo.
inserted_data = collection.insert_one(user_data)
It isn’t immediately apparent, but a major problem with this is that there are no restrictions. A username could be used multiple times, any data type could be passed in any field, etc. In order to handle these types of issues, you would have to write a lot of code. Instead, let’s take a look at how this can be represented in mongoengine.
import datetime

from mongoengine import (Document, StringField, DateTimeField, IntField,
                         FloatField, DictField, connect, disconnect_all)


class User(Document):
    created_at = DateTimeField(default=datetime.datetime.utcnow)
    username = StringField(required=True, unique=True)
    hashed_password = StringField(required=True)
    favorite_integer = IntField()
    favorite_float = FloatField()
    state = StringField()
    favorite_restaurants = DictField()

    meta = {'collection': 'users'}
This is incredibly simple and the code almost speaks for itself (I love when that happens). The new User class simply inherits from the Document class, which does the heavy lifting for you. Let’s break down the individual items within the class:
created_at – sets the field as a datetime type and defaults to utcnow when the document is created
username – sets the field as a string type, requires it to be entered, and makes sure it’s unique
hashed_password – sets the field as a string type and requires it to be entered
…
meta – sets the collection name to ‘users’ (can be whatever you’d like)
This simple little bit of code provides tremendous value. We’ll walk through some examples.
Inserting data is trivial, simply create the object and use the save method:
user1 = User(
    username=user_data['username'],
    hashed_password=user_data['hashed_password'],
    favorite_integer=user_data['favorite_integer'],
    favorite_float=user_data['favorite_float'],
    state=user_data['state'],
    favorite_restaurants=user_data['favorite_restaurants']
)
user1_insert = user1.save()
That’s it! You have inserted the user into your database. Let’s try to create another user with the same username. It should fail.
Perfect! This does not allow you to duplicate usernames. What happens if we try to insert a string instead of an integer to the favorite_integer field? It should fail.
user1 = User(
    username='blah',
    hashed_password=user_data['hashed_password'],
    favorite_integer='one',
    favorite_float=user_data['favorite_float'],
    state=user_data['state'],
    favorite_restaurants=user_data['favorite_restaurants']
)
user1_insert = user1.save()
Perfect! It recognizes we can’t enter a string where an integer should be. We don’t need to run through all of the fields, but this type checking works for all of them.
How easy is it to query the data? You can easily create filters, groupings, etc. However, we’ll just pull the most recent data. All you do is iterate through User.objects:
for user in User.objects:
    print("USER DATA")
    print("----------")
    print(user.username)
    print(user.hashed_password)
    print(user.favorite_integer)
    print(user.state)
    print(user.favorite_restaurants)
Output:
That’s it! Trust me, you’ll want to use this over pymongo in many instances, but you will still need to know some MongoDB syntax when it comes time to create aggregation pipelines!
We’ll move beyond these basics to go through some slightly more complicated queries and aggregations. As always, the code for this can be found on our GitHub repository.
The post MongoDB and Python - Inserting and Retrieving Data - ETL Part 1 first appeared on Python-bloggers.
For those who are behind the times, the so-called “NoSQL” movement has really gained momentum over the last 5 years (but it has been around much longer than that). The term “NoSQL” is a bit silly, but it conveys the point well enough for those who have lived in the traditional relational SQL world. While there are a lot of databases that fit into this category, we’ll start by focusing on one of the most popular open source versions, MongoDB. We’ll utilize Python in order to send our data to the database.
Let’s say we have been tasked to find out if there is a relationship between favorite integers and favorite floating point numbers of people using a particular website. In order to collect that information, we set up a form where users must register a username, a password, and the state they live in. We also need to stipulate that the time the user submitted the form is irrelevant, but the time it is sent to the database needs to be tracked.
Here’s an example of a user’s submission:
Username: Scott
Hashed Password: 34hl2jlkfdjlk23jlk23
Favorite Integer: 1
Favorite Float: 3.14
State: Colorado
Let’s represent that as a Python dictionary, where the datetime module is used to capture the current time in UTC format:
import datetime

username = 'scott'
hashed_password = '34hl2jlkfdjlk23jlk23'
favorite_integer = 1
favorite_float = 3.14
state = 'Colorado'

user_data = {
    'created_at': datetime.datetime.utcnow(),
    'username': username,
    'hashed_password': hashed_password,
    'favorite_integer': favorite_integer,
    'favorite_float': favorite_float,
    'state': state
}
In order to see the data types in the dictionary, we can print them out:
print("Here are your data types:")
for k, v in user_data.items():
    print(f" - {k}: {type(v)}")  # the f-string braces were stripped in publishing
You’ll notice that the created_at variable is a datetime object. This is important to keep in mind because we would not only like to store it this way but we’d also like to retrieve it as a datetime – which allows us to perform operations like grouping and sorting.
To interact with MongoDB, a database server needs to be running. This can be done in any number of ways; for this demonstration we’ll use https://cloud.mongodb.com/ for simplicity. If you do not have an account, you can set one up for free. We’ll walk through the most basic setup for this demonstration.
After setting up your account, create a cluster. Step 1, click the “Create a New Cluster” button.
Step 2, select the cluster tier (M0 Sandbox is a free tier).
That’s it! You have created a cluster and after a couple of minutes it will be up and running. Once it shows up, click the “Connect” button. Note, I did not change the cluster name (Cluster0 by default).
This brings up a menu where you can find your MongoDB cluster and credentials. The easiest way to find this information is through the “Connect your application” button.
You will see the “Copy” button to the right of the information you’ll use for your connection. It is crucial to keep this data private in order to maintain security. Here, this has been done by creating a .env file (a common place to keep secrets).
An example of the .env file can be seen below. This simply needs to be stored in the root directory (there are other ways of doing this, but this is a common practice).
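The original screenshot of the file didn’t survive publishing; a sketch of what such a .env file might contain is below. The variable names are placeholders I’ve chosen for illustration — use whatever names your connection code reads:

```shell
# .env -- never commit this file to version control
MONGO_USER=your_username
MONGO_PASSWORD=your_password
MONGO_CLUSTER=cluster0.xxxxx.mongodb.net
```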
The pymongo library provides an easy-to-use API to connect your Python runtime to a MongoDB server. All you need are your credentials and a couple of packages. In your terminal, use pip to install the required packages (python-dotenv simply makes loading the environment variables from the .env file easier).
pip install pymongo
pip install dnspython
pip install python-dotenv
At this point, connecting to your MongoDB server is simple. You’ll notice that we use the os library to read the environment variables after they have been loaded.
import os

import dotenv
import pymongo

dotenv.load_dotenv()

mongo_database_name = 'example_db'
mongo_collection_name = 'example_collection'

# The credentials in the original connection string were stripped in publishing;
# MONGO_USER/MONGO_PASSWORD/MONGO_CLUSTER are placeholder names for your .env values
db_client = pymongo.MongoClient(
    f"mongodb+srv://{os.environ['MONGO_USER']}:{os.environ['MONGO_PASSWORD']}"
    f"@{os.environ['MONGO_CLUSTER']}/?retryWrites=true&w=majority")
db = db_client[mongo_database_name]
collection = db[mongo_collection_name]
A few things to notice here:
db_client is simply your connection to the server/cluster
db is the connection to your database within the server/cluster
collection is a term in MongoDB that refers to a location where you can store data (similar to a table in SQL)
You may also have noticed that you connected to both a database and a collection that you had never created. These will be created for you when you insert data the first time.
Now that you have your connection, it is simple to insert data. Let’s insert the user_data dictionary previously created.
inserted_data = collection.insert_one(user_data)

if inserted_data.acknowledged:
    print('Data was stored!')
else:
    print('You had an issue writing to the database')
By storing your result as a variable (in this case, inserted_data), you make it easier to see whether it was stored in MongoDB or not.
Retrieving data from MongoDB is easy as well.
database_return = collection.find_one()

print("Here is your returned data:")
print(database_return)

print("Here are your returned data types:")
for k, v in database_return.items():
    print(f" - {k}: {type(v)}")
You’ll notice that the data all comes back in the same format it was inserted in! This is great news! However, you’ll also notice that there is a new field in your data, _id. This is a field generated automatically in MongoDB and is associated with every insert. This field is important but keep in mind that it changes anytime the data is updated.
Finally, always remember to close your connection to a database.
db_client.close()
Coming up next time, we’ll go over some slightly more complicated database inserts, queries and “gotchas”. As always, the code for this can be found on our GitHub repository.
The post Round about the kernel first appeared on Python-bloggers.
In our last post, we took our analysis of rolling average pairwise correlations on the constituents of the XLI ETF one step further by applying kernel regressions to the data and comparing those results with linear regressions. Using a cross-validation approach to analyze prediction error and overfitting potential, we found that kernel regressions saw average error increase between training and validation sets, while the linear models saw it decrease. We reasoned that the decrease was due to the idiosyncrasies of the time series data: models trained on volatile markets, validating on less choppy ones. Indeed, we felt we should trust the kernel regression results more than the linear ones precisely because those results followed the common expectation that error increases when exposing models to cross-validation. But such trust could be misplaced! However, we weren’t trying to pit kernel regressions against linear ones. Rather we were introducing the concept of kernel regressions prior to our examination of generalized correlations.
In this post, we’ll look at such correlations using the generalCorr package written by Prof. H. Vinod of Fordham University, NY.^{1} Our plan is to tease out any potential causality we can find between the constituents and the index. From there we’ll be able to test that causality on out-of-sample results using a kernel regression. Strap in, we’re in for a bumpy ride!
What is generalized correlation? While we can’t do justice to the nuances or the package, we’ll try to give you the 10,000 foot view. Any errors in interpretation are ours of course. Standard correlation measures assume a linear relationship between two variables. Yet many data series do not exhibit such a relationship and so using this measure fails to capture a whole host of interesting dependencies. Add in times series data, and one finds assessing causality based on linear dependence becomes even more thorny. Economists sometimes use a work around that relies on time lags^{2}. Employing a less common “instantaneous” version of this work around, by including non-lagged data, may yield a better estimate.
Using the instantaneous version with a generalized measure of correlation, one can estimate whether one variable causes another. That is, does X cause Y or Y cause X? What’s this generalized measure? Using a more sophisticated version of kernel regression than we discussed in our last post, one regresses Y on X and then X on Y. After taking the difference of the two \(R^{2}\)s of X on Y and Y on X, if that difference, call it \(\delta\), is less than zero, then X predicts Y rather than the other way around.
Obviously, it’s way more complicated than this, but the intuition is kind of neat if you remember that the squared correlation of Y and X is equivalent to the \(R^{2}\) of the linear regression of Y on X. Recall too that the \(R^{2}\) of Y on X should be the same as X on Y for linear functions. Hence, the \(\delta\) mentioned above should be (roughly) zero if there is a linear relationship or limited causality. If not, there’s probably some non-linear dependence and causality too, depending on the sign of \(\delta\).
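The generalCorr package does this far more carefully, but the asymmetry itself is easy to see with any flexible regressor. A sketch using scikit-learn’s KernelRidge (my stand-in here, not what generalCorr actually uses) on data where X drives Y nonlinearly:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Nonlinear, asymmetric dependence: X drives Y
x = rng.uniform(-2, 2, size=500)
y = x ** 2 + 0.1 * rng.normal(size=500)

def kernel_r2(u, v):
    """R-squared of a kernel regression of v on u."""
    model = KernelRidge(kernel='rbf', alpha=0.1, gamma=1.0)
    model.fit(u.reshape(-1, 1), v)
    return r2_score(v, model.predict(u.reshape(-1, 1)))

r2_y_on_x = kernel_r2(x, y)   # high: y is (nearly) a function of x
r2_x_on_y = kernel_r2(y, x)   # low: x = +/- sqrt(y) is not a function of y
delta = r2_x_on_y - r2_y_on_x
print(delta < 0)  # negative delta: X predicts Y, not the other way round
```

For linear data the two fits would be roughly symmetric and \(\delta \approx 0\), matching the intuition above.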
Okay, so how the heck does this relate to pairwise correlations and predictive ability? If we can establish the return of the index is causally determined by a subset of the constituents, then using only the pairwise correlations of that subset, we might be able to achieve better predictions of the returns. How would we set this up?
We could run the generalized correlations across our entire training series, then calculate the pairwise correlations on only the “high” causality constituents and use that metric to predict returns on the index. The difficulty with that method is that it’s forecasting the past. By the time we’ve calculated the causality we already know that it’s driven index returns! Instead, we could use the causality subset to calculate pairwise correlations in the next period, which would then be used to forecast forward returns. In other words, take the causality subset from \(t_{0-l}\) to \(t_{0}\) to calculate the average pairwise correlations on that subset from \(t_{0-w}\) to \(t_{f}\) and then regress the \(w\)-length returns on the index from \(t_{0+w}\) to \(t_{f+w}\) on those pairwise correlations. Here \(l\) is the lookback period, \(w\) is the window length, and \(f\) is the look forward period. Seems convoluted but hopefully will make sense once we go through an example.
We’ll take 250 trading days, calculate the subset causality, and then compute the rolling 60-day pairwise correlations starting on day 191 (to have the 60-day correlation available prior to the start of the return period) and continue until day 500. We’ll calculate the rolling past 60-day return starting on day 310 (for the 60-day forward return starting on day 250) until day 560. Then we’ll regress the returns against the correlations for the 250-day period. This effectively means we’re regressing 60-day forward returns on the prior 60-day average pairwise correlations. Whew!
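The indexing is the fiddly part. A toy pandas sketch of how the rolling average pairwise correlation lines up with the forward index return (synthetic data; \(w = f = 60\) as in the post, and the "causal subset" is just all ten fake constituents here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n, w = 600, 60                                  # trading days, window length

# Synthetic daily returns: 10 "constituents" plus an equal-weight "index"
rets = pd.DataFrame(rng.normal(0, 0.01, size=(n, 10)),
                    columns=[f'stock_{i}' for i in range(10)])
index_ret = rets.mean(axis=1)

# Rolling w-day average pairwise correlation of the constituents
def mean_pairwise_corr(window):
    c = np.corrcoef(window.T)
    return c[np.triu_indices_from(c, k=1)].mean()

avg_corr = pd.Series(
    [mean_pairwise_corr(rets.iloc[t - w:t].values) for t in range(w, n)],
    index=range(w, n))

# w-day forward return on the index, stamped on the day the correlation is known
fwd_ret = pd.Series(
    [(1 + index_ret.iloc[t:t + w]).prod() - 1 for t in range(w, n - w)],
    index=range(w, n - w))

# Regression-ready frame: each row pairs a known correlation with a forward return
pairs = pd.concat({'avg_corr': avg_corr, 'fwd_ret': fwd_ret}, axis=1).dropna()
print(len(pairs))  # n - 2*w = 480 usable observations
```

Each row uses only information available at time t to explain the return over (t, t + w], which is the look-ahead-free setup the post describes.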
What does this actually look like? Let’s first recall the original pairwise correlation vs. forward return scatter plot and then we’ll show an example from one period.
Here the red line is not a kernel regression, but another non-parametric method we use for ease of presentation.^{3} Now we’ll run through all the data wrangling and calculations to create multiple windows on separate 250 trading periods in our training set, which runs from about 2005 to mid-2015. We’ll select one of those periods to show what’s going on graphically to compare the different results and predictive models. First, we show a scatter plot of the causal subset of pairwise correlations and XLI returns for the 2007-2008 period.
The linear regression is the dashed line and the non-parametric is the wavy line. Clearly, the relationship is on a downward trend as one might expect for that period. Now let’s look at the non-causal subset.
The non-parametric regression line implies a pretty weird function, while the linear regression suggests almost no relationship. Already, we can see some evidence that the causal subset may do a better job at explaining the returns on the index than the non-causal. We’ll run regressions on the returns vs. the correlation subsets for a kernel and linear model. We present the \(R^{2}\) results in the table below.
Models | Causal | Non-causal
---|---|---
Kernel | 41.3 | 15.4
Linear | 7.7 | 0.9
The causal subset using a kernel regression outperforms the linear model on the same subset by more than five times. Its explanatory power is more than double that of a kernel regression using the non-causal subset, and more than forty times that of a linear model using the non-causal subset.
Now we’ll run the kernel and linear regressions on all the periods. We present the \(R^{2}\) for the different regression models in a chart below.
In every case, the kernel regression model does a better job of explaining the variability in the index’s returns than the linear model. For completeness, we’ll include the linear model with the non-causal constituents.
Based on the chart above, we see that the kernel regression outperforms linear regression based on the non-causal subset too. Interestingly, the non-causal subset sometimes outperforms the causal on the linear model. We’re not entirely sure why, though it happens mainly in the 2005-2007 time frame. It could be the result of correlations changing in the forecast period on those few occasions. Or it might be due to periods where correlations among the non-causal constituents rise, suggesting a stronger linear relationship with forward returns (especially during an upward-trending market) even if the causal link from the prior period is weak.
The final metric we’d like to check is the root mean-squared error (RMSE) on the kernel vs. linear regressions.
As is evident, the kernel regression using the causal subset has a lower error than the linear regressions on either the causal or non-causal subsets. Notice too that the error is not stable across periods and increases (predictably) during market turbulence (2007-2009). Interestingly, the differences in error rates also rise and fall over time, as is more easily seen in the graph below.
The average difference in error between the kernel and linear regression using the causal subset is about 1.7 percentage points, while it’s about 1.8 points for the non-causal. Those averages are close, but as the graph shows, there is real variability across periods. Whether fund managers would find a roughly two-point difference meaningful likely depends on the degree to which the lower error rate contributes to performance improvement. A better-performing predictive model is one thing; whether it can be used to generate better risk-adjusted returns is another! Nonetheless, two points of achievable outperformance is nothing to sneeze at. Of concern is the narrowing in performance in the 2012-2014 time frame. We suspect that is because the market enjoyed a relatively smooth, upwardly trending environment in that period, as shown in the chart below.
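Stepping back from the charts, the RMSE comparison itself is simple to reproduce. Continuing the synthetic sketch from earlier (KernelRidge as a stand-in for kern(), made-up data — the resulting gap is illustrative, not the post's ~1.7-point figure):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, (300, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.2, 300)

def rmse(model, X, y):
    # Root mean-squared error of in-sample predictions
    return np.sqrt(np.mean((model.predict(X) - y) ** 2))

lin_rmse = rmse(LinearRegression().fit(X, y), X, y)
kern_rmse = rmse(KernelRidge(kernel="rbf", gamma=5.0, alpha=0.1).fit(X, y), X, y)
gap_points = (lin_rmse - kern_rmse) * 100  # difference in percentage points
```

Repeating this per 250-day period and charting `gap_points` gives the kind of difference-by-period bar chart discussed above.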
Whether or not the narrowing of error between the kernel and linear regressions is likely to persist would need to be analyzed on the test set. But we’ll save that for another post. What have we learned thus far?
Generalized correlations appear to do a good job at identifying causal, non-linear relationships.
Using the output from a generalized correlation and employing a kernel regression algorithm, we can produce models that explain the variability in returns better than linear regressions using both causal and non-causal subsets.
The model error is lower for the causal subset kernel regressions vs. the other models too. However, that performance does appear to moderate in calm, upwardly trending markets.
Where could we go from here with these results? We might want to test this on other sector ETFs. And if the results bore out similar performance, then we could look to build some type of factor model. For example, we might go long the high causal and short the low causal stocks. Or we could go long the high causal and short the index to see if there is an “invisible” factor. Whatever the case, we’re interested to know what readers would like to see. More on generalized correlations or time to move on? Drop us a response at nbw dot osm at gmail. Until next time, the R and Python code are below.
One note on the Python code: we could not find an equivalent Python package to generalCorr. We did attempt to recreate it, but given the complexity of the functions and the dependencies (e.g., the R package np), it was simply taking too long. We also considered running R from within Python (using the rpy package), but realized that was quite an undertaking too! Hence, we imported the causal list created in R into our Python environment and then ran the rest of the analysis. We apologize to the Pythonistas. We try to reproduce everything in both languages, but in this case it was a task better suited for a long-term project. We may, nevertheless, put a Python version on our TODO stack. If completed, we’ll post it here and on GitHub. Thanks for understanding!
R code:
# Built using R 3.6.2

## Load packages
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidyquant)
  library(reticulate)
  library(generalCorr)
})

## Load data
prices_xts <- readRDS("corr_2_prices_xts.rds")

# Create function for rolling correlation
mean_cor <- function(returns) {
  # Calculate the correlation matrix
  cor_matrix <- cor(returns, use = "pairwise.complete")
  # Set the diagonal to NA (may not be necessary)
  diag(cor_matrix) <- NA
  # Calculate the mean correlation, removing the NAs
  mean(cor_matrix, na.rm = TRUE)
}

# Create return frames for manipulation
comp_returns <- ROC(prices_xts[,-1], type = "discrete") # kernel regression
tot_returns <- ROC(prices_xts, type = "discrete")       # for generalCorr

# Create data frame for regression
corr_comp <- rollapply(comp_returns, 60, mean_cor, by.column = FALSE, align = "right")
xli_rets <- ROC(prices_xts[,1], n = 60, type = "discrete")

# Merge series and create train-test split
total_60 <- merge(corr_comp, lag.xts(xli_rets, -60))[60:(nrow(corr_comp)-60)]
colnames(total_60) <- c("corr", "xli")
split <- round(nrow(total_60)*.70)
train_60 <- total_60[1:split,]
test_60 <- total_60[(split+1):nrow(total_60),]

# Create train set for generalCorr
tot_split <- nrow(train_60) + 60
train <- tot_returns[1:tot_split,]
test <- tot_returns[(tot_split+1):nrow(tot_returns),]

# Graph original scatter plot
train_60 %>%
  ggplot(aes(corr*100, xli*100)) +
  geom_point(color = "darkblue", alpha = 0.4) +
  labs(x = "Correlation (%)", y = "Return (%)",
       title = "Return (XLI) vs. correlation (constituents)") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, size = 1.25, color = "red")

# Create helper function
cause_mat <- function(df){
  mat_1 <- df[, !apply(is.na(df), 2, all)]
  mat_1 <- as.matrix(coredata(mat_1))
  out <- causeSummary(mat_1)
  out <- as.data.frame(out)
  out
}

# Create column and row indices
col_idx <- list(c(1:22), c(1,23:44), c(1,45:64))
row_idx <- list(c(1:250), c(251:500), c(501:750), c(751:1000), c(1001:1250),
                c(1251:1500), c(1501:1750), c(1751:2000), c(2001:2250), c(2251:2500))

# Create cause list for each period: which stocks cause the index
cause <- list()
for(i in 1:length(row_idx)){
  out <- list()
  for(j in 1:length(col_idx)){
    out[[j]] <- cause_mat(train[row_idx[[i]], col_idx[[j]]])
  }
  cause[[i]] <- out
}

# Bind cause into one list
cause_lists <- list()
for(i in 1:length(cause)){
  out <- do.call("rbind", cause[[i]]) %>%
    filter(cause != "xli") %>%
    select(cause) %>%
    unlist() %>%
    as.character()
  cause_lists[[i]] <- out
}

# Save cause_lists for use in Python
max_l <- 0
for(i in 1:length(cause_lists)){
  if(length(cause_lists[[i]]) > max_l){
    max_l <- length(cause_lists[[i]])
  }
}

write_l <- matrix(nrow = length(cause_lists), ncol = max_l)
for(i in 1:length(cause_lists)){
  write_l[i, 1:length(cause_lists[[i]])] <- cause_lists[[i]]
}
write.csv(write_l, "cause_lists.csv")

## Use cause list to run rolling correlations and aggregate forward returns for regression
cor_idx <- list(c(191:500), c(441:750), c(691:1000), c(941:1250), c(1191:1500),
                c(1441:1750), c(1691:2000), c(1941:2250), c(2191:2500))

# Add 1 since xli is price while train is ret so begin date is off by 1 biz day
ret_idx <- list(c(251:561), c(501:811), c(751:1061), c(1001:1311), c(1251:1561),
                c(1501:1811), c(1751:2061), c(2001:2311), c(2251:2561))

merge_list <- list()
for(i in 1:length(cor_idx)){
  corr <- rollapply(train[cor_idx[[i]], cause_lists[[i]]], 60, mean_cor,
                    by.column = FALSE, align = "right")
  ret <- ROC(prices_xts[ret_idx[[i]],1], n = 60, type = "discrete")
  merge_list[[i]] <- merge(corr = corr[60:310], xli = coredata(ret[61:311]))
}

# Run correlations on non-causal list
non_cause_list <- list()
for(i in 1:length(cor_idx)){
  corr <- rollapply(train[cor_idx[[i]], !colnames(train)[-1] %in% cause_lists[[i]]],
                    60, mean_cor, by.column = FALSE, align = "right")
  ret <- ROC(prices_xts[ret_idx[[i]],1], n = 60, type = "discrete")
  non_cause_list[[i]] <- merge(corr = corr[60:310], xli = coredata(ret[61:311]))
}

## Load data
merge_list <- readRDS("corr3_genCorr_list.rds")
non_cause_list <- readRDS("corr3_genCorr_non_cause_list.rds")

# Graphical example of one period
cause_ex <- merge_list[[3]]
cause_ex$corr_non <- rollapply(train[cor_idx[[3]], !colnames(train)[-1] %in% cause_lists[[3]]],
                               60, mean_cor, by.column = FALSE, align = "right")[60:310]

# Graph causal subset against returns
cause_ex %>%
  ggplot(aes(corr*100, xli*100)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "darkgrey", linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "darkblue") +
  labs(x = "Correlation (%)", y = "Return (%)",
       title = "Return (XLI) vs. correlation (causal subset)")

# Graph non-causal subset
cause_ex %>%
  ggplot(aes(corr_non*100, xli*100)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "darkgrey", linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "darkblue") +
  labs(x = "Correlation (%)", y = "Return (%)",
       title = "Return (XLI) vs. correlation (non-causal subset)")

# Run models
causal_kern <- kern(cause_ex$xli, cause_ex$corr)$R2
causal_lin <- summary(lm(cause_ex$xli ~ cause_ex$corr))$r.squared
non_causal_kern <- kern(cause_ex$xli, cause_ex$corr_non)$R2
non_causal_lin <- summary(lm(cause_ex$xli ~ cause_ex$corr_non))$r.squared

# Show table
data.frame(Models = c("Kernel", "Linear"),
           Causal = c(causal_kern, causal_lin),
           `Non-causal` = c(non_causal_kern, non_causal_lin),
           check.names = FALSE) %>%
  mutate_at(vars('Causal', `Non-causal`), function(x) round(x,3)*100) %>%
  knitr::kable(caption = "Regression R-squareds (%)")

## Linear regression
models <- list()
for(i in 1:length(merge_list)){
  models[[i]] <- lm(xli ~ corr, merge_list[[i]])
}

model_df <- data.frame(model = seq(1, length(models)),
                       rsq = rep(0, length(models)),
                       t_int = rep(0, length(models)),
                       t_coef = rep(0, length(models)),
                       p_int = rep(0, length(models)),
                       p_coef = rep(0, length(models)))

for(i in 1:length(models)){
  model_df[i,2] <- broom::glance(models[[i]])[1]
  model_df[i,3] <- broom::tidy(models[[i]])[1,4]
  model_df[i,4] <- broom::tidy(models[[i]])[2,4]
  model_df[i,5] <- broom::tidy(models[[i]])[1,5]
  model_df[i,6] <- broom::tidy(models[[i]])[2,5]
}

start <- index(train)[seq(250, 2250, 250)] %>% year()
end <- index(train)[seq(500, 2500, 250)] %>% year()
model_dates <- paste(start, end, sep = "-")
model_df <- model_df %>%
  mutate(model_dates = model_dates) %>%
  select(model_dates, everything())

## Kernel regression
kernel_models <- list()
for(i in 1:length(merge_list)){
  kernel_models[[i]] <- kern(merge_list[[i]]$xli, merge_list[[i]]$corr)
}

kern_model_df <- data.frame(model_dates = model_dates,
                            rsq = rep(0, length(kernel_models)),
                            rmse = rep(0, length(kernel_models)),
                            rmse_scaled = rep(0, length(kernel_models)))

for(i in 1:length(kernel_models)){
  kern_model_df[i,2] <- kernel_models[[i]]$R2
  kern_model_df[i,3] <- sqrt(kernel_models[[i]]$MSE)
  kern_model_df[i,4] <- sqrt(kernel_models[[i]]$MSE)/sd(merge_list[[i]]$xli)
}

## Load data
model_df <- readRDS("corr3_lin_model_df.rds")
kern_model_df <- readRDS("corr3_kern_model_df.rds")

## R-squared graph
data.frame(Dates = model_dates,
           Linear = model_df$rsq,
           Kernel = kern_model_df$rsq) %>%
  gather(key, value, -Dates) %>%
  ggplot(aes(Dates, value*100, fill = key)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual("", values = c("blue", "darkgrey")) +
  labs(x = "", y = "R-squared (%)",
       title = "R-squared output for regression results by period and model") +
  theme(legend.position = c(0.06, 0.9),
        legend.background = element_rect(fill = NA))

# Non-causal linear model
non_models <- list()
for(i in 1:length(non_cause_list)){
  non_models[[i]] <- lm(xli ~ corr, non_cause_list[[i]])
}

non_model_df <- data.frame(model = seq(1, length(models)),
                           rsq = rep(0, length(models)),
                           t_int = rep(0, length(models)),
                           t_coef = rep(0, length(models)),
                           p_int = rep(0, length(models)),
                           p_coef = rep(0, length(models)))

for(i in 1:length(non_models)){
  non_model_df[i,2] <- broom::glance(non_models[[i]])[1]
  non_model_df[i,3] <- broom::tidy(non_models[[i]])[1,4]
  non_model_df[i,4] <- broom::tidy(non_models[[i]])[2,4]
  non_model_df[i,5] <- broom::tidy(non_models[[i]])[1,5]
  non_model_df[i,6] <- broom::tidy(non_models[[i]])[2,5]
}

non_model_df <- non_model_df %>%
  mutate(model_dates = model_dates) %>%
  select(model_dates, everything())

# Bar chart of causal and non-causal
data.frame(Dates = model_dates,
           `Linear--causal` = model_df$rsq,
           `Linear--non-causal` = non_model_df$rsq,
           Kernel = kern_model_df$rsq,
           check.names = FALSE) %>%
  gather(key, value, -Dates) %>%
  ggplot(aes(Dates, value*100, fill = key)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual("", values = c("blue", "darkgrey", "darkblue")) +
  labs(x = "", y = "R-squared (%)",
       title = "R-squared output for regression results by period and model") +
  theme(legend.position = c(0.3, 0.9),
        legend.background = element_rect(fill = NA))

## RMSE comparison
lin_rmse <- c()
lin_non_rmse <- c()
kern_rmse <- c()
for(i in 1:length(models)){
  lin_rmse[i] <- sqrt(mean(models[[i]]$residuals^2))
  lin_non_rmse[i] <- sqrt(mean(non_models[[i]]$residuals^2))
  kern_rmse[i] <- sqrt(kernel_models[[i]]$MSE)
}

data.frame(Dates = model_dates,
           `Linear--causal` = lin_rmse,
           `Linear--non-causal` = lin_non_rmse,
           Kernel = kern_rmse,
           check.names = FALSE) %>%
  gather(key, value, -Dates) %>%
  ggplot(aes(Dates, value*100, fill = key)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual("", values = c("blue", "darkgrey", "darkblue")) +
  labs(x = "", y = "RMSE (%)", title = "RMSE results by period and model") +
  theme(legend.position = c(0.08, 0.9),
        legend.background = element_rect(fill = NA))

## RMSE difference graph
data.frame(Dates = model_dates,
           `Kernel - Linear--causal` = lin_rmse - kern_rmse,
           `Kernel - Linear--non-causal` = lin_non_rmse - kern_rmse,
           check.names = FALSE) %>%
  gather(key, value, -Dates) %>%
  ggplot(aes(Dates, value*100, fill = key)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual("", values = c("darkgrey", "darkblue")) +
  labs(x = "", y = "RMSE (%)", title = "RMSE differences by period and model") +
  theme(legend.position = c(0.1, 0.9),
        legend.background = element_rect(fill = NA))

avg_lin <- round(mean(lin_rmse - kern_rmse), 3)*100
avg_lin_non <- round(mean(lin_non_rmse - kern_rmse), 3)*100

## Price graph
prices_xts["2010/2014","xli"] %>%
  ggplot(aes(index(prices_xts["2010/2014"]), xli)) +
  geom_line(color = "blue", size = 1.25) +
  labs(x = "", y = "Price (US$)", title = "XLI price log-scale") +
  scale_y_log10()
Python code:
# Built using Python 3.7.4

## Import packages
import numpy as np
import pandas as pd
import pandas_datareader as dr
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12, 6)
plt.style.use('ggplot')

## Load data
prices = pd.read_pickle('xli_prices.pkl')
xli = pd.read_pickle('xli_etf.pkl')
returns = prices.drop(columns=['OTIS', 'CARR']).pct_change()
returns.head()

## Import cause_lists created using R (see R code above)
cause_lists = pd.read_csv("cause_lists.csv", header=None)
cause_lists = cause_lists.iloc[1:, 1:]

## Define correlation function
def mean_cor(df):
    corr_df = df.corr()
    np.fill_diagonal(corr_df.values, np.nan)
    return np.nanmean(corr_df.values)

## Create data frames and train-test splits
corr_comp = pd.DataFrame(index=returns.index[59:])
corr_comp['corr'] = [mean_cor(returns.iloc[i-59:i+1, :]) for i in range(59, len(returns))]
xli_rets = xli.pct_change(60).shift(-60)
total_60 = pd.merge(corr_comp, xli_rets, how="left", on="Date").dropna()
total_60.columns = ['corr', 'xli']
split = round(len(total_60)*.7)
train_60 = total_60.iloc[:split, :]
test_60 = total_60.iloc[split:, :]

tot_returns = pd.merge(xli, prices.drop(columns=["CARR", "OTIS"]), "left", "Date")
tot_returns = tot_returns.rename(columns={'Adj Close': 'xli'})
tot_returns = tot_returns.pct_change()
tot_split = len(train_60) + 60
train = tot_returns.iloc[:tot_split, :]
test = tot_returns.iloc[tot_split:len(tot_returns), :]
train.head()

## Create period indices to run pairwise correlations and forward returns for regressions
cor_idx = np.array((np.arange(190, 500), np.arange(440, 750), np.arange(690, 1000),
                    np.arange(940, 1250), np.arange(1190, 1500), np.arange(1440, 1750),
                    np.arange(1690, 2000), np.arange(1940, 2250), np.arange(2190, 2500)))

# Add 1 since xli is price while train is ret so begin date is off by 1 biz day
ret_idx = np.array((np.arange(250, 561), np.arange(500, 811), np.arange(750, 1061),
                    np.arange(1000, 1311), np.arange(1250, 1561), np.arange(1500, 1811),
                    np.arange(1750, 2061), np.arange(2000, 2311), np.arange(2250, 2561)))

# Create separate data arrays using cause_lists and indices
# Causal subset
merge_list = [0]*9
for i in range(len(cor_idx)):
    dat = train.reset_index().loc[cor_idx[i], cause_lists.iloc[i, :].dropna()]
    corr = [mean_cor(dat.iloc[j-59:j+1, :]) for j in range(59, len(dat))]
    ret1 = xli.reset_index().iloc[ret_idx[i], 1]
    ret1 = ret1.pct_change(60).shift(-60).values
    ret1 = ret1[~np.isnan(ret1)]
    merge_list[i] = np.c_[corr, ret1]

# Non-causal subset
non_cause_list = [0]*9
for i in range(len(cor_idx)):
    non_c = [x for x in list(train.columns[1:])
             if x not in cause_lists.iloc[i, :].dropna().to_list()]
    dat = train.reset_index().loc[cor_idx[i], non_c]
    corr = [mean_cor(dat.iloc[j-59:j+1, :]) for j in range(59, len(dat))]
    ret1 = xli.reset_index().iloc[ret_idx[i], 1]
    ret1 = ret1.pct_change(60).shift(-60).values
    ret1 = ret1[~np.isnan(ret1)]
    non_cause_list[i] = np.c_[corr, ret1]

# Create single data set for example
cause_ex = np.c_[merge_list[2], non_cause_list[2][:, 0]]

# Run linear regression
from sklearn.linear_model import LinearRegression
X = cause_ex[:, 0].reshape(-1, 1)
y = cause_ex[:, 1]
lin_reg = LinearRegression().fit(X, y)
y_pred = lin_reg.predict(X)

# Graph scatter plot with lowess and linear regression
import seaborn as sns
sns.regplot(cause_ex[:, 0]*100, cause_ex[:, 1]*100, color='blue', lowess=True,
            line_kws={'color': 'darkblue'}, scatter_kws={'alpha': 0.4})
plt.plot(X*100, y_pred*100, color='darkgrey', linestyle='dashed')
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return (XLI) vs. correlation (causal subset)")
plt.show()

# Run linear regression on non-causal component of cause_ex data frame
X_non = cause_ex[:, 2].reshape(-1, 1)
lin_reg_non = LinearRegression().fit(X_non, y)
y_pred_non = lin_reg_non.predict(X_non)

# Graph scatter plot
sns.regplot(cause_ex[:, 2]*100, cause_ex[:, 1]*100, color='blue', lowess=True,
            line_kws={'color': 'darkblue'}, scatter_kws={'alpha': 0.4})
plt.plot(X_non*100, y_pred_non*100, color='darkgrey', linestyle='dashed')
plt.xlabel("Correlation (%)")
plt.ylabel("Return (%)")
plt.title("Return (XLI) vs. correlation (non-causal subset)")
plt.show()

## Run regressions on cause_ex
from sklearn_extensions.kernel_regression import KernelRegression
import statsmodels.api as sm

x = cause_ex[:, 0]
X = sm.add_constant(x)
x_non = cause_ex[:, 2]
X_non = sm.add_constant(x_non)
y = cause_ex[:, 1]

lin_c = sm.OLS(y, X).fit().rsquared*100
lin_nc = sm.OLS(y, X_non).fit().rsquared*100

# Note: KernelRegression() returns different results than kern() from generalCorr
kr = KernelRegression(kernel='rbf', gamma=np.logspace(-5, 5, 10))
kr.fit(X, y)
kr_c = kr.score(X, y)*100
kr.fit(X_non, y)
kr_nc = kr.score(X_non, y)*100

print(f"R-squared for kernel regression causal subset: {kr_c:0.01f}")
print(f"R-squared for kernel regression non-causal subset: {kr_nc:0.01f}")
print(f"R-squared for linear regression causal subset: {lin_c:0.01f}")
print(f"R-squared for linear regression non-causal subset: {lin_nc:0.01f}")

## Run regressions on data lists
# Causal subset linear model
lin_mod = []
for i in range(len(merge_list)):
    x = merge_list[i][:, 0]
    X = sm.add_constant(x)
    y = merge_list[i][:, 1]
    mod_reg = sm.OLS(y, X).fit()
    lin_mod.append(mod_reg.rsquared)

start = train.index[np.arange(249, 2251, 250)].year
end = train.index[np.arange(499, 2500, 250)].year
model_dates = [str(x) + "-" + str(y) for x, y in zip(start, end)]

# Non-causal subset linear model
non_lin_mod = []
for i in range(len(non_cause_list)):
    x = non_cause_list[i][:, 0]
    X = sm.add_constant(x)
    y = non_cause_list[i][:, 1]
    mod_reg = sm.OLS(y, X).fit()
    non_lin_mod.append(mod_reg.rsquared)

# Causal subset kernel regression
kern = []
for i in range(len(merge_list)):
    X = merge_list[i][:, 0].reshape(-1, 1)
    y = merge_list[i][:, 1]
    kr = KernelRegression(kernel='rbf', gamma=np.logspace(-5, 5, 10))
    kr.fit(X, y)
    kern.append(kr.score(X, y))

## Plot R-squared comparisons
# Causal kernel vs. linear
df = pd.DataFrame(np.c_[np.array(kern)*100, np.array(lin_mod)*100],
                  columns=['Kernel', 'Linear'])
df.plot(kind='bar', color=['blue', 'darkgrey'])
plt.xticks(ticks=df.index, labels=model_dates, rotation=0)
plt.legend(loc='upper left')
plt.show()

# Causal kernel vs. causal & non-causal linear
df = pd.DataFrame(np.c_[np.array(kern)*100, np.array(lin_mod)*100, np.array(non_lin_mod)*100],
                  columns=['Kernel', 'Linear--causal', 'Linear--non-causal'])
df.plot(kind='bar', color=['blue', 'darkgrey', 'darkblue'], width=.85)
plt.xticks(ticks=df.index, labels=model_dates, rotation=0)
plt.legend(bbox_to_anchor=(0.3, 0.9), loc='center')
plt.ylabel("R-squared (%)")
plt.title("R-squared output for regression results by period and model")
plt.show()

## Create RMSE lists
lin_rmse = []
for i in range(len(merge_list)):
    x = merge_list[i][:, 0]
    X = sm.add_constant(x)
    y = merge_list[i][:, 1]
    mod_reg = sm.OLS(y, X).fit()
    lin_rmse.append(np.sqrt(mod_reg.mse_resid))

lin_non_rmse = []
for i in range(len(non_cause_list)):
    x = non_cause_list[i][:, 0]
    X = sm.add_constant(x)
    y = non_cause_list[i][:, 1]
    mod_reg = sm.OLS(y, X).fit()
    lin_non_rmse.append(np.sqrt(mod_reg.mse_resid))

kern_rmse = []
for i in range(len(merge_list)):
    X = merge_list[i][:, 0].reshape(-1, 1)
    y = merge_list[i][:, 1]
    kr = KernelRegression(kernel='rbf', gamma=np.logspace(-5, 5, 10))
    kr.fit(X, y)
    rmse = np.sqrt(np.mean((kr.predict(X) - y)**2))
    kern_rmse.append(rmse)

## Graph RMSE comparisons
df = pd.DataFrame(np.c_[np.array(kern_rmse)*100, np.array(lin_rmse)*100, np.array(lin_non_rmse)*100],
                  columns=['Kernel', 'Linear--causal', 'Linear--non-causal'])
df.plot(kind='bar', color=['blue', 'darkgrey', 'darkblue'], width=.85)
plt.xticks(ticks=df.index, labels=model_dates, rotation=0)
plt.legend(loc='upper left')
plt.ylabel("RMSE (%)")
plt.title("RMSE results by period and model")
plt.show()

## Graph RMSE differences
kern_lin = [x - y for x, y in zip(lin_rmse, kern_rmse)]
kern_non = [x - y for x, y in zip(lin_non_rmse, kern_rmse)]
df = pd.DataFrame(np.c_[np.array(kern_lin)*100, np.array(kern_non)*100],
                  columns=['Kernel - Linear--causal', 'Kernel - Linear--non-causal'])
df.plot(kind='bar', color=['darkgrey', 'darkblue'], width=.85)
plt.xticks(ticks=df.index, labels=model_dates, rotation=0)
plt.legend(loc='upper left')
plt.ylabel("RMSE (%)")
plt.title("RMSE differences by period and model")
plt.show()

## Graph XLI
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(xli["2010":"2014"], color='blue')
ax.set_label("")
ax.set_ylabel("Price (US$)")
ax.set_yscale("log")
ax.yaxis.set_major_formatter(ScalarFormatter())
ax.yaxis.set_minor_formatter(ScalarFormatter())
ax.set_title("XLI price log-scale")
plt.show()
We’d like to thank Prof. Vinod for providing us an overview of his package. The implementation of the package is our own: we take all the credit for any errors.
Granger causality from C.W. Granger’s 1969 paper “Investigating Causal Relations by Econometric Models and Cross-spectral Methods”
We used ggplot’s loess method for a non-parametric model simply to make the coding easier. A bit lazy, but we wanted to focus on the other stuff.
The post Round about the kernel first appeared on Python-bloggers.
The post Building a Data-Driven Culture at Bloomberg first appeared on Python-bloggers.
As organizations produce more data and digitize products and processes, a data-driven workforce has never been more critical. This is why learning and development has become central to business strategies—especially initiatives focused on building organization-wide data science capabilities.
On November 4, DataCamp’s Data Science Evangelist, Adel Nehme, was joined by Sheil Naik, Global Data Technical Trainer at Bloomberg, to discuss how Bloomberg is becoming data-driven, how Sheil’s team leverages blended learning to teach data analysis with Python, and how Bloomberg measures behavioral change following their upskilling initiatives.
Never before has it been more valuable to be data-driven. Anaconda’s CEO Peter Wang describes data science as an evidence-based methodology for solving business problems, where data scientists “harness mathematical and computational tools to reason about the business world.” This methodology has enabled a plethora of use cases across industries, from forecasting churn in marketing to automatic fraud detection for financial institutions.
Data science is at the heart of Bloomberg’s data-driven transformation. For starters, their data science team is pushing the boundaries of what’s possible by creating best practices for natural language processing projects, democratizing data tools, and providing intelligent solutions across its products. More importantly, being data-driven also means enabling everyone with the necessary skills to make data-driven decisions, improve processes with data, and produce data-driven news stories.
The Global Data Division at Bloomberg is responsible for maintaining the timeliness and quality of all financial datasets found on the Bloomberg Terminal. As a data technical trainer, Sheil Naik works with business leaders across Bloomberg to identify the skills needed to be successful when working with data at Bloomberg and to design, deliver, and evaluate training programs aimed at building these skills. These skills include using version control tools like Git and GitHub, data analysis with SQL, data analysis with Python, and more.
Bloomberg’s Data Analysis with Python program is a quarterly, blended-learning curriculum incorporating a one-hour introduction explaining how Python is used at Bloomberg, 12 to 20 hours of DataCamp coursework, three live 1.5-hour sessions led by in-house technical experts, and a final project using Bloomberg data.
Bloomberg carefully curates the DataCamp courses to balance learning objectives and time commitment, and contain chapters from courses like Introduction to Python, Intermediate Python, and more. Acting as foundational material, these courses have enabled more than 450 learners—many of whom never coded in their lives before—to learn and apply the concepts needed to complete the three live classroom sessions, pass the final project, and finish the training program.
Learners were able to go from never writing a line of code in their entire life to completing a data-driven news analysis as part of the final project of the program—Sheil Naik, Global Data Technical Trainer at Bloomberg
Combining self-guided learning with live classroom training, this blended learning model allows for consistency and flexibility of learning across geographies, schedules, and business units at Bloomberg. The consistency of the curriculum and experience provided at the foundational level allows for Bloomberg trainers to scale classroom sessions and include global learners. Moreover, using a learning provider like DataCamp at the foundational level offers insights and performance data used to gauge training effectiveness.
A methodology for evaluating direct return on investment for training programs is the Kirkpatrick Model of Evaluation. The Kirkpatrick Model proposes four different evaluation levels: the initial reaction following a training program, learning evaluation, behavioral change, and the business impact of gained skills.
Bloomberg’s implementation of the third layer of the Kirkpatrick model measures the number of producer activities (saves, edits, imports, renames, sends, etc.) on their proprietary BQuant Jupyter Notebook environment.
By leveraging these data points, Sheil was able to apply the techniques taught in Data Analysis with Python to uncover a 561% increase in average producer activities for one of the cohorts graduating from the program. According to Sheil, the ultimate goal of data upskilling at Bloomberg is to combine technology with employees’ subject matter expertise to produce insightful analyses.
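For illustration only — using made-up activity counts, not Bloomberg's data — the kind of percentage-increase calculation behind a figure like that is straightforward in pandas:

```python
import pandas as pd

# Hypothetical weekly producer-activity counts (saves, edits, imports, ...)
# for one cohort, before and after the training program
activity = pd.DataFrame({
    "period": ["before"] * 4 + ["after"] * 4,
    "actions": [3, 5, 2, 4, 20, 25, 18, 29],
})

# Percent increase in average producer activities after the program
means = activity.groupby("period")["actions"].mean()
pct_increase = (means["after"] / means["before"] - 1) * 100
```

With real usage logs, the same groupby-and-compare pattern extends naturally to per-user or per-cohort comparisons.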
If you want to learn more about blended learning, how to operationalize it in your own organization, and the key takeaways Sheil recommends every learning and development professional to follow, make sure to watch the full webinar recording.
The post See Appsilon Presentations on Computer Vision and Scaling Shiny at Why R? 2020 first appeared on Python-bloggers.
Mark your calendars! Join Appsilon at the Why R? Conference on Thursday, November 12th @ 8PM UTC+1 / 2PM EST for two talks by Appsilonians. We will discuss wildlife preservation with computer vision and scaling Shiny apps on a budget. You can find the meetup link here and a direct link to the presentations on YouTube here.
Appsilon Senior Data Scientist Jędrzej Świeżewski, PhD will discuss how Appsilon has managed to assist wildlife conservationists in their efforts in central Africa using computer vision. He will touch on the modeling, real-life implementation, and will also share a treat for R lovers. If you’re interested in the intersection of R and computer vision, see a previous talk by Jędrzej about how to make a computer vision model within an R environment here.
Appsilon Infrastructure Engineer Damian Budelewski will show you how to prepare your Shiny dashboard for a high traffic load. He will also help you understand the costs and requirements of hosting Shiny apps in the cloud. He will illustrate by showing you an example of a Shiny app deployed on Shiny Server Open Source and hosted on AWS. To learn more about alternative approaches to scaling Shiny, go here.
Vote for the People’s Choice Award in Appsilon’s 2020 shiny.semantic PoC Contest! Vote here until Friday, November 10th.
Appsilon is currently hiring for multiple open roles, including Senior External R Shiny Developers, a Frontend Engineer, a Senior Infrastructure Engineer, a Project Manager, and a Community Manager. Appsilon is a fully remote company with team members in Europe, the UK, Africa, and South America. We are global leaders in R Shiny seeking top talent to work with us in a highly collaborative and creative environment.
Article See Appsilon Presentations on Computer Vision and Scaling Shiny at Why R? 2020 comes from Appsilon | End to End Data Science Solutions.