How to generate datasets with Copilot for Microsoft 365

Posted on July 15, 2024 by George Mount in Data science | 0 Comments

This article was first published on python - Stringfest Analytics, and kindly contributed to python-bloggers.

As a trainer and content creator, I regularly generate fake datasets for exercises. This task can be more challenging than it sounds. Creating variables that follow a specific pattern or distribution was difficult with earlier random data generation tools.

Additionally, crafting a dataset that “feels” real for a given industry, domain, or problem set was often a challenge, sometimes requiring me to write my own Python code. Now, with Copilot for Microsoft 365, I can quickly create realistic dummy datasets. These datasets often look so real that I must explicitly state in training sessions that they are fake and not proprietary data!

I understand that not everyone is generating datasets for training purposes, and many of you have “real” jobs to attend to. However, generating synthetic datasets is a valuable skill for any data professional. It allows them to create realistic data for testing, validation, and experimentation without the privacy concerns associated with using real data. When access to genuine data is restricted due to confidentiality agreements or regulatory compliance, synthetic data provides a viable alternative, enabling analysts to continue their work without violating data privacy regulations.

By mastering the creation of synthetic datasets, data analysts can enhance their problem-solving capabilities. This skill ensures that their analytical tools and models perform well under a variety of conditions, ultimately leading to more accurate and reliable results.

Microsoft Copilot offers a fantastic solution for data analysts to generate synthetic datasets, and it often works even better when used in conjunction with Python code. Let’s explore this in a demo. Please note that we are using a work version of Copilot for Microsoft 365.

Before getting started…

Before we get started, I recommend becoming comfortable with a local instance of Python on your computer. I suggest using Jupyter notebooks and Anaconda, which you can download for free at anaconda.com. If you need the basics of working in this environment, check out my book Advancing into Analytics.

Unfortunately, Copilot generates Python code but does not run it or return the results, so you'll need to run it yourself. Another option is to use a different generative AI tool like ChatGPT, but be careful not to share sensitive data there that you would only trust to your corporate Microsoft 365 Copilot tool. I hope to see a future version of Copilot that can both generate and execute Python code, but we're not there yet.

And, as always, keep in mind that when working with generative AI like this, the results can be random. The results you see in my screenshots may differ from what you get. Sometimes they'll be better, and sometimes worse. You may need to iterate and experiment a bit. Here are some basic tips and tricks for working with Copilot for Microsoft 365:

How to understand prompting with Copilot for Microsoft 365

The best approach is to be a good problem solver and to know your subject matter well, as the article reiterates.

Creating a basic dataset

Let’s get started with Copilot by generating some basic categorical variables. For example, suppose we want a fabricated list of 100 names and addresses. I will specify the desired format for the results, in this case an Excel workbook. If you require more specific criteria, such as including only addresses from certain states or excluding apartment numbers, you can refine your prompt accordingly.

Unfortunately, Copilot doesn’t quite hit the mark here. It sets up the basic parameters for the table and starts you off with some records, but then it misses the point by suggesting you use an online generator. Isn’t that essentially what Copilot is—an online generator?

You could try another service like ChatGPT to fulfill this request, but generating 100 rows there, even on a paid plan, might cause a timeout. Those services might not provide a secure work environment either. So, my workaround is this: instead of asking Copilot to generate the dataset directly, ask it to create a Python script that generates your fake dataset. Then you can run the script yourself. Let's give it a try, keeping all other details in the original prompt unchanged:

Python script to generate 100 fake customer names

The complete code provided by Copilot is presented below. Keep in mind that your code results might differ from mine, potentially offering better or worse solutions:

Setting the random seed for reproducibility

Go ahead and run the previous Python script on your computer to generate a dataset. You should see something similar to this, but with different entries entirely:

The process of generating data here is inherently random, much like ChatGPT itself. However, unlike ChatGPT, we can set a random seed in Python to control how these random numbers are generated. This ensures that if I run this script and you run the script, we will obtain identical results. This consistency is extremely useful for sharing or validating results.

To accomplish this, I will modify the previously generated script to set the random seed used in the faker package to 1234. You can set the seed to any integer, but a simple, memorable number is easier to reuse consistently across scripts.

Setting the random seed separately for base Python, faker, numpy and so forth is necessary because each package maintains its own random number generator. To ensure consistent results across different libraries used in your code, you need to set the seed individually for each one.

In this case, I only generated data with faker, but in other scripts, you’ll see the need to set multiple seeds—one for each package. I strongly suggest using the same integer each time; there’s no problem with doing it that way.

Because I already know the script works well, I don't want Copilot to regenerate it and potentially yield a worse result. I only need to add two lines myself. This demonstrates why having existing Python knowledge is so important before diving into AI.

You should now consistently get the following set of addresses. I’m noticing that some postal codes don’t make sense, and perhaps I don’t want to include military bases as part of the random data. I can adjust my prompt accordingly if needed.

Now that we’ve looked at creating some basic categorical data for Excel using Python and Faker, let’s explore more complex scenarios. We can generate simulated data for a variety of domains, accommodating all types of shapes, sizes, and distributions. While it may seem that being specific with our requirements would complicate things for Copilot, adding these clear constraints and instructions will actually produce more realistic datasets.

However, keep in mind that we will only achieve reproducible results if we set the random seed. It is essential to do this for any packages used, including faker, numpy, or base Python.

In each of the following examples, I’ll provide the necessary scripts. You can download the results from the repository below:

Download the example workbooks here

Example 1: Employee performance review

Use Python and the faker package to create a synthetic Excel dataset for an organization’s employee performance review for 500 employees with the following details:

Columns: Employee ID, Department, Performance Score, Salary, Years of Experience
Performance Score: Normally distributed with a mean of 70 and a standard deviation of 10
Salary: Log-normally distributed with a mean of $50,000 and a standard deviation of $15,000
Years of Experience: Exponentially distributed with a lambda of 0.1
Department: Randomly chosen from ‘Sales’, ‘HR’, ‘IT’, ‘Marketing’, ‘Finance’
Random seed: Set to 1234
Faker seed instance: Set to 1234
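Here is a rough sketch of a script that would satisfy this prompt. It is not Copilot's output. It assumes numpy and pandas, uses NumPy's default_rng generator rather than faker for the numeric columns, and reads the salary mean and standard deviation as describing the salaries themselves, which means they must be converted to the parameters of the underlying normal distribution. The Employee ID format and the rounding are illustrative choices.

```python
import numpy as np
import pandas as pd

DEPARTMENTS = ["Sales", "HR", "IT", "Marketing", "Finance"]


def lognormal_params(mean, sd):
    """Convert the desired mean and sd of a log-normal variable into
    the mu and sigma of its underlying normal distribution."""
    sigma_sq = np.log(1 + (sd / mean) ** 2)
    mu = np.log(mean) - sigma_sq / 2
    return mu, np.sqrt(sigma_sq)


def make_performance_reviews(n=500, seed=1234):
    rng = np.random.default_rng(seed)
    mu, sigma = lognormal_params(50_000, 15_000)
    return pd.DataFrame({
        "Employee ID": [f"E{i:04d}" for i in range(1, n + 1)],
        "Department": rng.choice(DEPARTMENTS, size=n),
        "Performance Score": rng.normal(70, 10, size=n).round(1),
        "Salary": rng.lognormal(mu, sigma, size=n).round(2),
        # NumPy takes scale = 1 / lambda, so lambda = 0.1 gives a mean of 10
        "Years of Experience": rng.exponential(1 / 0.1, size=n).round(1),
    })


reviews = make_performance_reviews()
print(reviews.describe())

# To save as an Excel workbook (requires openpyxl):
# reviews.to_excel("employee_performance.xlsx", index=False)
```

Passing 50,000 and 15,000 straight into a log-normal function would treat them as log-scale parameters and produce absurd salaries, so the conversion step is worth checking in whatever script Copilot returns.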

Example 2: Warehouse inventory management

Use Python and the faker package to generate a synthetic Excel dataset for inventory management of a warehouse with the following details:

Number of items: 2,000
Columns: Item ID, Category, Stock Level, Reorder Level, Lead Time
Stock Level: Normally distributed with a mean of 100 and a standard deviation of 30
Reorder Level: Uniformly distributed between 20 and 50 units
Lead Time: Exponentially distributed with a lambda of 0.05
Category: Randomly chosen from ‘Electronics’, ‘Clothing’, ‘Home Goods’, ‘Sports Equipment’, ‘Toys’
Random seed: Set to 1234
Faker seed instance: Set to 1234
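A sketch along the same lines for the inventory prompt, again my own stand-in rather than Copilot's output, assuming numpy and pandas. Clipping negative stock levels to zero, rounding to whole units, and treating lead time as days are assumptions the prompt does not spell out.

```python
import numpy as np
import pandas as pd

CATEGORIES = ["Electronics", "Clothing", "Home Goods", "Sports Equipment", "Toys"]


def make_inventory(n=2000, seed=1234):
    rng = np.random.default_rng(seed)
    # A normal draw can dip below zero, so clip before converting to units
    stock = rng.normal(100, 30, size=n).clip(min=0).round().astype(int)
    return pd.DataFrame({
        "Item ID": [f"ITEM-{i:05d}" for i in range(1, n + 1)],
        "Category": rng.choice(CATEGORIES, size=n),
        "Stock Level": stock,
        # integers() excludes the upper bound, so 51 yields values 20 to 50
        "Reorder Level": rng.integers(20, 51, size=n),
        # scale = 1 / lambda, so lambda = 0.05 gives a mean lead time of 20
        "Lead Time": rng.exponential(1 / 0.05, size=n).round(1),
    })


inventory = make_inventory()
print(inventory.head())

# To save as an Excel workbook (requires openpyxl):
# inventory.to_excel("warehouse_inventory.xlsx", index=False)
```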

Example 3: Customer churn

Use Python and the faker package to generate a synthetic customer churn Excel dataset for a telecom company with the following attributes:

Number of customers: 5,000
Columns: Customer ID, Age, Tenure, Monthly Charges, Total Charges, Churn
Age: Normally distributed with a mean of 35 and a standard deviation of 10
Tenure: Uniformly distributed between 1 and 72 months
Monthly Charges: Normally distributed with a mean of $70 and a standard deviation of $20
Total Charges: Monthly Charges * Tenure
Churn: Binary outcome (0 or 1) with a probability of churn set at 0.2
Random seed: Set to 1234
Faker seed instance: Set to 1234
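And a sketch for the churn prompt, once more a stand-in rather than Copilot's output, assuming numpy and pandas. The minimum age of 18 and the floor of zero on monthly charges are assumptions added to keep the values plausible; the prompt itself does not ask for them.

```python
import numpy as np
import pandas as pd


def make_churn(n=5000, seed=1234):
    rng = np.random.default_rng(seed)
    # Assumed floor of 18: a plain normal draw would include some minors
    age = rng.normal(35, 10, size=n).round().clip(min=18).astype(int)
    tenure = rng.integers(1, 73, size=n)  # months, 1 to 72 inclusive
    monthly = rng.normal(70, 20, size=n).clip(min=0).round(2)
    return pd.DataFrame({
        "Customer ID": [f"CUST-{i:05d}" for i in range(1, n + 1)],
        "Age": age,
        "Tenure": tenure,
        "Monthly Charges": monthly,
        "Total Charges": (monthly * tenure).round(2),
        # One Bernoulli trial per customer with a churn probability of 0.2
        "Churn": rng.binomial(1, 0.2, size=n),
    })


churn = make_churn()
print(churn["Churn"].mean())

# To save as an Excel workbook (requires openpyxl):
# churn.to_excel("customer_churn.xlsx", index=False)
```

Note that churn here is independent of every other column, exactly as the prompt specifies. If you want tenure or charges to influence churn, you would need to say so in the prompt or adjust the script.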

Conclusion

Microsoft 365 Copilot’s ability to craft realistic, yet completely synthetic datasets allows analysts to test and simulate scenarios closely mirroring real-world conditions without risking sensitive data. This represents a significant advancement for data privacy and a considerable leap forward for data analysts who require robust, secure data environments to refine their models. As analysts begin integrating AI tools like Copilot with traditional platforms such as Python, the possibilities of what we can achieve with data continue to broaden.

Do you have questions about synthetic data, the Faker package, or Copilot for Microsoft 365 more broadly? Let me know in the comments.

The post How to generate datasets with Copilot for Microsoft 365 first appeared on Stringfest Analytics.
