How to get started with Advanced Analysis with Python for Copilot in Excel

This article was first published on python - Stringfest Analytics , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

When Copilot for Excel was first introduced, it primarily served as a tool to automate basic Excel tasks like sorting, filtering, conditional formatting, and creating formulas.

However, as the AI has evolved and more resources have been invested in the product, it now supports more advanced functionalities, including sophisticated data analysis using Python. In this post, we’ll explore how this works and how you can use it. To follow along, you can download the exercise file here:

 

We’ll be using the famous Palmer Penguins dataset.

Please note that as of the time of this writing, this feature is only available to Excel users in the Microsoft 365 Insiders program. For more information on accessing Copilot in Excel with Python, refer to the official Microsoft support page.

What to know before using this feature

As the name suggests, this feature allows you to conduct more advanced exploratory and confirmatory analyses, build interesting visualizations, and even delve into predictive analytics and machine learning using Python code.

While Copilot is exceptionally intelligent and can iterate and revise code based on your natural language queries, it’s crucial to have a solid understanding of both Python as a programming language and the underlying concepts and theories that drive statistical analysis and machine learning before using this tool.

Here are some titles that could help enhance your knowledge. Feel free to leave your favorites in the comments.

  • Advancing into Analytics: From Excel to Python and R by George Mount. This book provides a basic overview of statistical analysis with a section dedicated to Python programming.
  • Python for Data Analysis, 3rd Edition by Wes McKinney. Offers in-depth coverage for data analysts and programmers eager to deepen their understanding of data manipulation and analysis using Python and pandas.
  • Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron. A comprehensive guide that provides practical instructions and clear explanations for building intelligent systems, focusing on the application of machine learning using Python and covering everything from simple linear regression to deep learning.

Although you can use AI for coaching and guidance, it’s essential to supplement this with expertly reviewed and original content.

Loading a dataset into Advanced Analysis with Python

Like with other datasets that you’ll load into Copilot, I suggest storing your dataset as a table, although this isn’t strictly necessary anymore. Make sure you load the demo file into a OneDrive account associated with your Copilot account. Once loaded, click anywhere in the dataset, navigate to the ‘Home’ tab on the ribbon, and select Copilot. If you have access to this feature, you should see a new card in the list of options in Copilot titled “Advanced analysis: Get deeper analysis results using Python.”

Go ahead and click on it to get started.

You will now be presented with the basic rules of the road for Advanced Analysis.

First, Copilot will add a new worksheet to your workbook to house the results. Next, it will automatically write and insert all Python formulas into the worksheet grid. This process is iterative; you can chain your prompts together to clarify your objectives, understand Copilot’s reasoning, and continue refining your queries. Remember, Copilot was trained on a vast amount of data—not just on topics like Excel, Python, and statistical analysis. It has access to extensive knowledge, which you can use to your advantage to plan, understand, and question how Copilot is analyzing your data with Python code.

With those rules in mind, go ahead and click “Start advanced analysis.”

Advanced analysis getting started

Keep in mind that as generative AI, these results will be volatile. You will not receive the same output every time, and you might get something very different from what I review here.

Sometimes, Copilot offers a relatively simple starting point and provides suggestions for next steps. Other times, it endeavors to create an end-to-end data analysis project, extending all the way to building a predictive model. The output really depends on many factors, so you need to be very flexible and agile. It’s crucial to understand what Copilot is doing, how the Python code works, and so on.

The data import

Regardless of the session, all Copilot with Python sessions will begin by reading in the dataset into Python code. It typically looks something like this:

penguins_df=xl("penguins[#All]", headers=True)

You’ll notice that Copilot often suffixes many of the dataset names with df. This stands for DataFrame, which is the type of object typically used in Python to handle data consisting of rows, columns, and headers—similar to what we usually work with in Excel.

Data preview Python advanced analysis

You will see that the formula is added to the worksheet grid, and a preview of the underlying data object is displayed beneath it. If you are comfortable working with Python code and the Python in Excel environment, you are welcome to explore and tweak this code. If you’re just getting up to speed with this, it’s a good idea to focus on the Copilot sidebar, observing the outputs it provides there and asking for any next steps directly in that sidebar.

For example, in this instance, Copilot seems to have simply given us a preview of the dataset. It also offered some suggested avenues for further exploration in the Copilot prompt.

I’m going to explore things a bit myself here and inquire about missing values, as I’m still in the data profiling and understanding phase of my analysis. Let’s prompt Copilot to find out how many missing values are in each column of the dataset.

Python in Excel will perform its operations, store the results in cells further down the worksheet, and provide previews both in the worksheet itself and in the Copilot sidebar. Let’s continue with this reasoning and be more specific.

I’m going to ask Copilot to calculate what percentage of values are missing in each column. As we’ve seen, the more precise the prompt, often the better the response. This approach helps ensure that the queries are clear and tailored to retrieve the exact data we need for thorough analysis.

Percent of missing values

In this scenario, the initial results in the worksheet preview weren’t formatted in their true decimal format; if they had been formatted as percentages, the values would have appeared inflated by a factor of 100. To correct this, I’ll chain onto this prompt and ask Copilot to divide the results by 100.

Now, with those adjusted results in the worksheet, you can go ahead and format these numbers in the percentage format like any other data in an Excel worksheet. This is particularly convenient because, while it’s not straightforward to format data in Python, Excel simplifies this process.

Exploring missing values is just one aspect of understanding a dataset; identifying outliers is another crucial step. Consider asking Copilot something like:

“Are there any outliers in the bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g columns?”

Copilot will execute the Python code to analyze this. Interestingly, in my case, Python even encountered an error initially, but Copilot was able to parse the error and rerun the code successfully—how intelligent! Ultimately, Copilot concludes that, using the Interquartile Range (IQR) method, no outliers were detected in these columns. This example showcases the robust capabilities of Copilot to handle errors and provide insightful analysis, leveraging Python within Excel..

When delving into specific methods and techniques for data analysis, it’s wise to understand the pros and cons of those methods, as well as any alternatives. Knowing alternatives can be a really effective way to validate what Python is producing, especially since, as generative AI, the code might sometimes encounter issues.

Therefore, this might be a good time to ask something like, “What are some other methods for outlier detection?” Here, Copilot can leverage its vast repository of large language model (LLM) knowledge.

Other outlier methods

We could certainly continue along this line of reasoning by asking Copilot to explore other outlier detection techniques, but I’ll take one of the suggestions and run with it, which is to visualize the data. I’m specifically going to request visualizations of the distribution of these variables across different species.

We’ll see some boxplots showing these distributions—and if you’re familiar with boxplots, you’ll notice that some outliers are detected when the data is broken down by species, which is interesting. However, these outliers were not apparent at the total dataset level, as we saw earlier. This discrepancy might be another good opportunity to have Copilot help you sanity-check and test your understanding of how things are being analyzed here.

Copilot distribution

We can easily transition from data exploration and profiling to building full-on predictive models, a task that was quite challenging to accomplish in Excel before. For example, I’m going to ask Copilot to build a predictive model for me:

“Can we predict the species of a penguin based on its bill length, bill depth, flipper length, and body mass?”

Classification matrix Copilot

We received a bunch of output here, and it looks like Copilot implemented a random forest algorithm for this model, which is pretty typical and likely a good choice for this type of data. If you’re already familiar with machine learning, you’ll recognize that a random forest is a robust option for classification tasks like this. If you aren’t familiar with random forests or machine learning, it’s definitely a good idea to get up to speed before running such analyses in Copilot.

Again, having solid resources to learn from the ground up is crucial, but don’t hesitate to use Copilot to enhance your understanding. For example, you could ask Copilot about other possible algorithms and their pros and cons. You could ask it to explore and check for the assumptions of random forests or other models. You should also feel comfortable going into the workbook and tweaking the Python code as needed. Don’t just run this stuff passively and expect perfect answers; you need to be an active participant in the process!

What other algorithms?

I hope you’re beginning to see all the possibilities of this incredibly powerful tool. There are so many avenues to explore, and due to the nature of generative AI, it’s impossible to show you all of them step by step. It’s really about being exploratory, but you need a good amount of prior knowledge to make the most of it.

So, brush up on your Python, deepen your understanding of statistics, familiarize yourself with the Python in Excel environment, get comfortable writing code there, and enjoy this truly amazing tool.

What questions do you have about using Advanced Analysis with Python for Copilot in Excel? Let me know in the comments.

The post How to get started with Advanced Analysis with Python for Copilot in Excel first appeared on Stringfest Analytics.

To leave a comment for the author, please follow the link and comment on their blog: python - Stringfest Analytics .

Want to share your content on python-bloggers? click here.