A plethora of datasets at your fingertips
Starting with mlsauce’s next release (v0.9.0, for Python and R), you’ll be able to download a plethora of datasets for your statistical/machine learning experiments (this is a work in progress; today, it is done from a GitHub branch). These datasets come from the R-universe, and you’ll be able to use them whether you’re working with Python or R.
In the R-universe (new CRAN in disguise?), among other things, there’s an automated package-building workflow for all the common platforms (Linux, macOS and Windows). There’s also an open data API, whose usage underlies what’s described in this post. Remember to cite datasets’ sources. A good practice in packaging R datasets is to provide their references, but I’m guilty of not having done it every time 😉
Warning, this paragraph may sound a little bit cryptic, but feel free to skip it: in the examples below, you can pass additional (optional) parameters to the download function, which are those used by requests.get and pd.DataFrame. Unfortunately, mlsauce’s documentation is not up-to-date, because keras-autodoc was discontinued, and I need to find a previous version of Sphinx that would work with my fork of keras-autodoc. *Sigh*… I’m eyeing pdoc or mkdocstrings. Anything Markdown, actually.
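To make the cryptic paragraph above a bit more concrete, here is a minimal sketch of how a download function of this kind could forward optional parameters to requests.get and pd.DataFrame. It is not mlsauce’s actual implementation: the names build_url, split_kwargs and download_sketch are hypothetical, the URL pattern (source/pkgname/data/dataset/json) is my assumption about the R-universe data API, and routing keywords by name is just one possible design.

```python
# Hypothetical sketch, NOT mlsauce's actual implementation.

# Keyword names accepted by the pd.DataFrame constructor (besides `data`).
DATAFRAME_KEYS = {"index", "columns", "dtype", "copy"}


def build_url(pkgname="MASS", dataset="Boston",
              source="https://cran.r-universe.dev/"):
    """Build the dataset URL (assumed pattern: <source>/<pkgname>/data/<dataset>/json)."""
    return f"{source.rstrip('/')}/{pkgname}/data/{dataset}/json"


def split_kwargs(kwargs):
    """Separate pd.DataFrame keywords from those meant for requests.get."""
    frame_kwargs = {k: v for k, v in kwargs.items() if k in DATAFRAME_KEYS}
    request_kwargs = {k: v for k, v in kwargs.items() if k not in DATAFRAME_KEYS}
    return request_kwargs, frame_kwargs


def download_sketch(pkgname="MASS", dataset="Boston",
                    source="https://cran.r-universe.dev/", **kwargs):
    """Fetch a dataset as JSON records and return it as a DataFrame."""
    # Imported here so the two helpers above stay usable without these packages.
    import pandas as pd
    import requests

    request_kwargs, frame_kwargs = split_kwargs(kwargs)
    response = requests.get(build_url(pkgname, dataset, source), **request_kwargs)
    response.raise_for_status()  # fail loudly on HTTP errors
    return pd.DataFrame(response.json(), **frame_kwargs)
```

With this sketch, a call such as download_sketch(dataset="Insurance", timeout=10) would send timeout to requests.get, while a keyword like columns would go to pd.DataFrame.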
Download a dataset in Python
Install
!pip install git+https://github.com/Techtonique/mlsauce.git@feature-branch
Import data
import mlsauce as ms

# `ms.download` parameters
# pkgname="MASS"
# dataset="Boston"
# source="https://cran.r-universe.dev/"

# the controversial Boston data set
df1 = ms.download(dataset="Boston")
print(f"===== df1: \n {df1} \n")
print(f"===== df1.dtypes: \n {df1.dtypes}")

print("\n====================================================== \n")

# the Insurance data set (same package, MASS)
df2 = ms.download(dataset="Insurance")
print(f"===== df2: \n {df2} \n")
print(f"===== df2.dtypes: \n {df2.dtypes}")
===== df1:
       crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio   black  lstat  medv
0    0.0063  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  396.90   4.98  24.0
1    0.0273   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  396.90   9.14  21.6
2    0.0273   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  392.83   4.03  34.7
3    0.0324   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  394.63   2.94  33.4
4    0.0690   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  396.90   5.33  36.2
..      ...   ...    ...   ...    ...    ...   ...     ...  ...  ...      ...     ...    ...   ...
501  0.0626   0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273     21.0  391.99   9.67  22.4
502  0.0453   0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273     21.0  396.90   9.08  20.6
503  0.0608   0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273     21.0  396.90   5.64  23.9
504  0.1096   0.0  11.93     0  0.573  6.794  89.3  2.3889    1  273     21.0  393.45   6.48  22.0
505  0.0474   0.0  11.93     0  0.573  6.030  80.8  2.5050    1  273     21.0  396.90   7.88  11.9

[506 rows x 14 columns]

===== df1.dtypes:
crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object

======================================================

===== df2:
    District   Group    Age  Holders  Claims
0          1     <1l    <25      197      38
1          1     <1l  25-29      264      35
2          1     <1l  30-35      246      20
3          1     <1l    >35     1680     156
4          1  1-1.5l    <25      284      63
..       ...     ...    ...      ...     ...
59         4  1.5-2l    >35      344      63
60         4     >2l    <25        3       0
61         4     >2l  25-29       16       6
62         4     >2l  30-35       25       8
63         4     >2l    >35      114      33

[64 rows x 5 columns]

===== df2.dtypes:
District    object
Group       object
Age         object
Holders      int64
Claims       int64
dtype: object
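In the output above, District, Group and Age come back with dtype object, whereas R’s MASS package stores them as factors (Group and Age being ordered). If you need that information on the Python side, one option is to cast the columns yourself. The sketch below rebuilds the first four rows shown above by hand, and assumes the object columns hold strings (I haven’t checked what the API returns exactly); df2_head and age_levels are names made up for the illustration.

```python
import pandas as pd

# First four rows of the Insurance output shown above, rebuilt by hand.
# Assumption: the `object` columns hold strings.
df2_head = pd.DataFrame({
    "District": ["1", "1", "1", "1"],
    "Group": ["<1l", "<1l", "<1l", "<1l"],
    "Age": ["<25", "25-29", "30-35", ">35"],
    "Holders": [197, 264, 246, 1680],
    "Claims": [38, 35, 20, 156],
})

# Age levels in increasing order, as they appear in the output.
age_levels = ["<25", "25-29", "30-35", ">35"]

typed = df2_head.assign(
    # district codes become integers
    District=df2_head["District"].astype(int),
    # age bands become an ordered categorical, mirroring R's ordered factor
    Age=pd.Categorical(df2_head["Age"], categories=age_levels, ordered=True),
)

print(typed.dtypes)
```

The same pattern would apply to Group, with its own ordered levels.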
Download a dataset in R
Install
remotes::install_github("Techtonique/mlsauce_r@dev-branch")
Import data
The controversial Boston dataset.
df <- mlsauce::download(pkgname = "MASS", dataset = "Boston", source = "https://cran.r-universe.dev/")
print(head(df))
    crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
1 0.0063 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.0273  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.0273  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.0324  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.0690  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.0298  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
print(summary(lm(medv ~ ., data = df)))
Call:
lm(formula = medv ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas         2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958230
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
black        9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16