Python-bloggers

Advanced Analysis with Python in Copilot: How to create advanced data visualizations

This article was first published on python - Stringfest Analytics , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

Advanced Analysis with Python in Copilot for Excel can help with many data analysis tasks that would typically be challenging to execute in Excel. From building predictive models using techniques like random forests to uncovering patterns and relationships through clustering and principal component analysis, and even analyzing text data for sentiment, Copilot plus Python enhances Excel’s analytical potential significantly.

Another domain where Copilot with Python significantly excels is in data visualization. This enhancement largely stems from the integration of the Seaborn package, which provides a comprehensive suite of visualization tools tailored specifically for data analytics and data science.

Below, I’ll showcase a few of the many plots that can be created nearly instantly with Python and Copilot—tasks that would be considerably more complex in Excel alone. Keep in mind, because this involves Python code, we can easily tweak the code if you have the necessary skills. This highlights why knowing some Python is so helpful as you begin using this tool.

Equally, if not more important, is developing a deep understanding of your visualization objectives. Knowing exactly what you want to achieve and why Excel’s native plots might not suffice is critical. This strategic insight into the “what” and “why” behind your data presentation will not only enable you to communicate more effectively with Copilot but also empower you to make custom adjustments to your plots.

Familiarity with the specific terminology of these advanced visual techniques can further refine your interactions, making your instructions to Copilot more precise and, ultimately, your data visualizations more impactful. While tweaking Python code is useful, the real power lies in the clarity of your visualization intent and your ability to articulate it clearly.

With that said, let’s take a look at some examples. These plots will not only expand your visualization toolkit but also enhance your ability to customize the right visualization for the right use case

You can follow along with the exercise file below:

 

Stripplot

A stripplot visualizes data points along a categorical axis with jitter added between points to prevent overlapping. It is particularly useful when comparing distributions across a categorical variable, such as fuel mileage by vehicle origin, as in our case.

While Excel can easily create traditional scatterplots that compare two quantitative variables, it lacks a straightforward plot type to add jitter and fine-tune the overlapping of points when comparing the distribution of a quantitative variable across one or more categories. It can be achieved, but it requires additional effort and can be challenging to reproduce, share, and explain how the plot was created.

To that end, let’s fire up Copilot, select “Advanced Analysis,” and begin our prompting:

Create a stripplot of MPG versus car origin. The y-axis should show the MPG, and the x-axis should categorize by the origin of the car (USA, Europe, Asia). Add jitter to avoid point overlap. Include labels and a title.

Copilot stripplot

Here, we can see that, overall, there are more data points for cars from the USA, and they are almost all concentrated below the range where most vehicles from the other two categories fall. Because we are plotting each individual point, we gain a granular view of the dataset, which also makes it easy to spot outliers.

Pairplot

Next, a pairplot shows pairwise relationships between multiple variables through scatterplots, with optional histograms on the diagonal. I like to think of it almost as a visual version of a correlation matrix. It’s particularly helpful for identifying correlations and relationships between numerical variables. Excel can’t create pairwise comparisons as seamlessly, requiring manual creation of scatterplots for each variable pair and histograms for each variable’s distribution, unless you’re using a tool like VBA. This process is time-consuming and lacks the intuitive layout or color-coding that helps distinguish categories. Let’s have Copilot create one:

Create a pairplot to visualize pairwise relationships between MPG, horsepower, weight, acceleration, and origin. Use a different color for each origin to distinguish the categories, and add a title to the plot.

This is one large plot, and I even increased the size in Excel.

Here, we can easily observe, for example, that there is generally a negative relationship between weight and mpg, while the relationship between horsepower and weight is the opposite. I can also see how the distributions of all these quantities compare for the USA versus other variables in a sort of histogram that has been smoothed out, making it easier to interpret.

Kernel Density Estimations (KDEs) such as those used along the diagonal here are often preferred over traditional histograms because they provide a smooth estimate of the underlying distribution, aiding in the clearer identification of patterns and trends beyond the bin-based structure of histograms. Additionally, KDEs are less sensitive to the choice of bins and can adapt better to varied data structures, offering a more accurate representation of data density.

Last but not least, because these visualizations have been colored by origin, we can see in almost every relationship that there does seem to be a clear influence of origin in terms of these statistics.

Small multiples

Small multiples, or facet grids, break up a dataset into small subsets based on categorical variables, displaying individual plots (like scatterplots) for each subset of the data.

This approach is useful when you want to compare the same plot across different categories (e.g., horsepower vs. MPG for different car origins). Excel doesn’t support creating small multiples or “faceting” out of the box, so replicating this would require manually creating multiple charts, which becomes laborious and makes it difficult to maintain consistency across plots.

Let’s prompt this into Copilot:

Create a small multiples plot that shows scatterplots of horsepower vs MPG, split by car origin (USA, Europe, Asia). Each origin should have a separate column with unique colors and a legend.

You may have had trouble fully grasping the relationships in the pairplot, as there was a lot going on in that plot. Even just one of these scatterplots, colored by point, can be challenging to size up all at once. By breaking each category into its own plot, it becomes much easier to compare each group together.

This strategy can be particularly important as the number of data points increases, helping to break things up and simplify the analysis.

Boxplot with stripplot overlay

Now, let’s consider combining an aggregate distribution plot like a boxplot with a strip plot that displays each point individually. This combined approach allows for a comparison of the overall distribution (via the boxplot) and individual data points (via the strip plot), which is excellent for identifying outliers, range, and spread of data while still presenting the raw data.

Excel can generate boxplots, but adding individual data points on top isn’t straightforward. Let’s see what Copilot can do for us here:

Create a boxplot of MPG by car origin, and overlay it with a stripplot. The boxplot should show the MPG distribution by origin, and the stripplot should add individual data points in black with jitter to prevent overlap.

As discussed, this approach allows us to clearly identify the actual outlier values, which the boxplot alone might vaguely indicate. Additionally, we gain visual cues about the quartiles, and we can easily discern the median and quartile data from the axes. Having these statistics readily available is particularly useful for quick analysis.

Jointplot

Last but not least, a jointplot combines a scatterplot with histograms or density plots to illustrate the relationship between two variables along with their marginal, or univariate distributions. It can also produce a heatmap-like binned scatterplot, which helps reveal the most common x-y coordinates—a task that becomes challenging when numerous small points clutter the plot.

This type of visualization is particularly useful for not only examining the relationships between variables but also for understanding how each variable is distributed. In Excel, achieving this would require creating separate scatter and histogram charts, which lacks the seamless integration needed to display marginal distributions in a single view. This separation often impedes a quick, holistic understanding of the data.

Create a jointplot of weight versus MPG, with hexagon-style bins to show density. The x-axis should be weight, and the y-axis should be MPG, and include marginal histograms on the axes.

Conclusion

I hope you appreciate the potential of this tool! These examples illustrate just a few of the many plots you can create simply by using Copilot for Excel with Python.

In conclusion, it’s crucial to recognize that mastering Python’s syntax for creating enhanced visualizations is just one part of the equation. While having robust Python coding skills is essential, developing a strong intuition for crafting effective data visualizations is equally important.

For those interested in delving deeper into Python’s capabilities, I recommend exploring the Seaborn documentation. Please note that Seaborn builds on Matplotlib, which is also available in Python for Excel, offering unique plots and features. For more information on Matplotlib, check out the documentation.

To further enhance your data visualization skills, consider reading Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic, and Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks by Jonathan Schwabish.

Do you have any specific questions about data visualization or Advanced Analysis with Python for Copilot in Excel? I’d love to hear your thoughts or address any queries you might have in the comments below.

The post Advanced Analysis with Python in Copilot: How to create advanced data visualizations first appeared on Stringfest Analytics.

To leave a comment for the author, please follow the link and comment on their blog: python - Stringfest Analytics .

Want to share your content on python-bloggers? click here.
Exit mobile version