I’m sure many readers got excited by the data science possibilities of the title. However, decision trees are more than statistical algorithms — in a broader sense they are any schematics used to help chart actions and consequences, whether probabilities are given or not.
In this post, we’ll use a decision tree as a workflow for the best way to source a given Python package. If you’re not familiar with the concept of packages, check out this blog post.
Python packages feel a little more complex to source than R packages because we generally work with an intermediary to download them: the Anaconda distribution. If you’re not familiar with what an open source distribution like Anaconda is, check out this blog post.
Anaconda carries many, but not all, of the packages hosted on the official package repository, the Python Package Index. Because it’s preferred to install from Anaconda when possible, a little triage work is needed.
Check what you’ve got first
Before downloading a package, it’s worth checking whether it’s already installed. One way to do that is to run
pip freeze. This will list all of the installed packages in your environment.
If you see the package listed … great! You’ll still have to import it before you can use it, for the same reason you need to open a smartphone app you’ve already installed.
If you don’t see it listed … you’ll need to install it, but the question is from where: Anaconda or the Python Package Index. The former is preferred.
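If you’d rather check for a single package from inside Python than scan the whole list, the standard library can look it up for you. This is a minimal sketch, assuming Python 3.8 or later; the helper name installed_version is hypothetical, and the second package name is a made-up placeholder chosen because it should not exist.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "pip" is used as an example only because it ships with most Python installs.
for name in ["pip", "surely-not-a-real-package-name"]:
    version = installed_version(name)
    print(name, "->", version if version else "not installed")
```

Keep in mind that this looks up the name a package is installed under, which is not always the name you import (scikit-learn, for example, is imported as sklearn).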
When in doubt… check Anaconda
I recommend you install the Anaconda distribution of Python for a few reasons. Many popular packages already come installed with it, and many others can be installed from there. You can check what packages are pre-installed or available to install in your Anaconda version’s package list; the following image makes sense of what you’re seeing:
If the package is listed and there’s a checkmark next to the name, that means it’s already been installed. If it’s there and not checked, you can install it from Anaconda. You’ll do this with
conda install packagename.
This command is meant for the command line. If you’re running from a Jupyter notebook, put an exclamation mark in front of it so it gets passed to the shell, and add the -y flag so conda doesn’t stall waiting for a confirmation you can’t type:
!conda install -y packagename.
conda lets you down, pip rushes in
As mentioned earlier, it’s preferred to download directly from Anaconda. Keep in mind that each time you download or update a package you’re manipulating a lot of files on your machine and modifying your Python environment, so it’s easy to get wires crossed. To put it simply, Anaconda helps with the maintenance.
While Anaconda distributes most popular packages, not everything is available there, so you may need to download directly from the Python Package Index, or PyPI. In that case, you’ll still need to use the command line, this time typing
pip install packagename.
As before, add a leading exclamation mark if you’re running from a Jupyter notebook.
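One thing that can cross wires here: if your notebook and your command line point at different Python installations, pip may install the package somewhere your notebook can’t see it. A common workaround is to call pip through the interpreter that is currently running. The sketch below only builds that command and prints it; the helper name pip_install_command is hypothetical.

```python
import sys


def pip_install_command(package_name):
    """Build a pip command tied to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package_name]


# Print the command rather than running it, so nothing gets installed by accident.
print(" ".join(pip_install_command("packagename")))
```

To actually run the install, the same list could be handed to subprocess.run from the standard library.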
Don’t forget to import!
Regardless of where you source the package, make sure you import it before using it. This is done with the statement
import packagename.
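If you’re not sure whether a package made it onto your machine, you can attempt the import and handle the failure instead of letting it stop your script. This is a minimal sketch; the helper name try_import is hypothetical, and the second module name is a made-up placeholder that should not exist.

```python
import importlib


def try_import(module_name):
    """Import a module by name, returning None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


json_module = try_import("json")  # part of the standard library
missing = try_import("surely_not_a_real_module")

print("json imported:", json_module is not None)
print("placeholder imported:", missing is not None)
```

A None result is your cue to go back and install the package from Anaconda or PyPI.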
Sourcing Python packages: the decision tree
I’ve visualized the decision process described here in a PDF download. To get your copy, subscribe below. You’ll also get access to my analytics resource library.
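The same decision process can also be written as a few lines of code. This is only a rough sketch of the logic described in this post: the function name sourcing_decision and its two yes/no inputs are hypothetical, and you still have to answer those two questions yourself by checking your environment and the Anaconda package list.

```python
def sourcing_decision(is_installed, on_anaconda):
    """Walk the decision tree and return the suggested next step."""
    if is_installed:
        return "import packagename"
    if on_anaconda:
        return "conda install packagename"
    return "pip install packagename"


# Each pair is (already installed?, available from Anaconda?).
for answers in [(True, True), (False, True), (False, False)]:
    print(answers, "->", sourcing_decision(*answers))
```

Whichever install branch you land on, the final step is still the import.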
Python + data: There’s a package for that!
Sourcing a package is one thing… knowing which package to use, and how, for a given purpose is another. For that, I suggest my book Advancing into Analytics: From Excel to Python and R.
That book covers the foundations of statistics and analytics… so I guess, in keeping with this post’s title, you will be well suited to dive into statistical decision trees anyway.