An introduction to H2O.ai

This article was first published on The Jumping Rivers Blog , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.


If you came here looking for an introduction to water, or a synopsis of the 2006 TV series about teenage mermaids, you have sadly come to the wrong place. The H2O that we will talk about is H2O.ai, a company which develops products for easy, scalable machine learning and artificial intelligence.

Introduction

Machine learning and artificial intelligence (or AI for short) are topics which have had a lot of interest over the past 4-5 years. Some of this interest has come from businesses as they begin to utilise the information they collect on a day-to-day basis to streamline/automate processes or gain insight. A lot of companies are now looking to hire data scientists/engineers and in turn this is making a lot more people interested in machine learning and AI.

Logos of various different machine learning and AI tools

Now, as you look at upskilling in machine learning and AI, you might start by reading some books, taking some online courses and, if you are anything like me, going through many, many, many online tutorials and blog posts on different techniques. It’s at this point you will probably start to realise there are a lot of tools out there that you can use for your machine learning or AI problems. Deciding which tool is best for the job at hand can be very difficult. Hopefully, after reading this blog post, you will have a better idea of H2O.ai’s products and whether they are what you have been looking for.

Who are H2O.ai?

H2O.ai describe themselves as the visionary leader in making AI accessible for everyone. Currently, they are the AI partner for over twenty thousand organisations, including over half of the companies listed on the Fortune 500, and their tools are used by over one million data scientists around the world. They also employ twenty of the world’s Kaggle Grandmasters (of which, at the time of writing, there are 262), which gives an idea of the talent they have working there.



Current products

H2O.ai are an open-source company that supply both free and proprietary tools. In line with their stated aim of democratising machine learning and AI, they have a range of tools to help everyone with their machine learning projects, from idea to production, no matter their level of expertise. Below, you can read a short overview of the different tools that they provide.

Open source tools

  • H2O/H2O-3: H2O is a fully open-source, distributed, in-memory machine learning platform which is available in Python, R and various other languages. This is the main free offering from H2O.ai for undertaking machine learning tasks. H2O offers a range of supervised and unsupervised algorithms, as well as some other useful tools such as Word2vec. You can look at a full list of the different algorithms on offer here. H2O also offers a tool called AutoML for automatic machine learning. This allows you to easily try out the different algorithms H2O offers and outputs a leaderboard showing which model has performed best on your data. If you want to learn more about AutoML, look at this blog post.

  • H2O Wave: H2O Wave is an open-source Python framework for designing and deploying applications with interactive user interfaces. It can be used to make simple applications such as to-do lists, or more complex applications in which you deploy machine learning models that have been developed using H2O or H2O Driverless AI. If you have a spare hour and want to see how to get started with H2O Wave, here is a useful tutorial.

Example of a web application made by H2O Wave
Image taken from H2O Wave homepage.

  • Sparkling Water: If you are familiar with Apache Spark (the open-source, distributed cluster-computing framework used for big data), Sparkling Water is a tool which allows you to use advanced machine learning algorithms from H2O within your Spark applications.
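To make the AutoML workflow from the H2O-3 entry above a little more concrete, here is a minimal sketch in Python. It assumes the `h2o` package and a Java runtime are installed, and that you are solving a classification problem. The helper names `predictor_columns` and `run_automl` are made up for this illustration and are not part of H2O.

```python
from typing import List


def predictor_columns(columns: List[str], target: str) -> List[str]:
    """Return every column name except the target, preserving order."""
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    return [name for name in columns if name != target]


def run_automl(csv_path: str, target: str, max_models: int = 10, seed: int = 1):
    """Train several H2O models on a CSV file and return the AutoML object.

    Assumes a classification problem; for regression, drop the
    asfactor() conversion below.
    """
    # Imported here so the helper above can be used without h2o installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # starts a local H2O cluster or connects to a running one
    frame = h2o.import_file(csv_path)
    frame[target] = frame[target].asfactor()  # mark the target as categorical

    # Hold back 20% of the rows to rank the models on the leaderboard.
    train, holdout = frame.split_frame(ratios=[0.8], seed=seed)

    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(
        x=predictor_columns(frame.columns, target),
        y=target,
        training_frame=train,
        leaderboard_frame=holdout,
    )
    return aml
```

With a hypothetical file `data.csv` containing a `label` column, `aml = run_automl("data.csv", "label")` would train up to ten models. After that, `aml.leaderboard` holds the ranked models and `aml.leader` is the best one.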

Proprietary tools

  • H2O AI Cloud: If you are in need of cloud infrastructure, H2O.ai can now provide this with H2O AI Cloud. You can choose between a fully managed cloud infrastructure if you do not want to deal with setting up infrastructure, scaling, or software updates, or a hybrid cloud infrastructure if you want a little more control over your cloud environment.

  • H2O Driverless AI: Like H2O, this tool offers automatic machine learning, but it takes things a few steps further. As well as trying different machine learning algorithms (and ensembles of the available algorithms), it will also perform automatic feature engineering, produce data visualisations and post-training diagnostic plots, and give performance metrics for each model. You can also easily deploy the models that have been created and generate model documentation. The tool is designed for both data scientists and non-data scientists. H2O.ai provide a user-friendly interface for Driverless AI (which you can see below) so non-data scientists can easily load data, visualise it, use the automatic machine learning algorithms to develop a model and evaluate the final result. Depending on a few settings such as interpretability, time, and accuracy, different models will be used. More technical users can control a large variety of parameters, including oversampling techniques, particular parameter values to try in neural networks, the types of models to try, whether to perform early stopping and much, much more. You can use Driverless AI on tabular data, time-series data, text data and image data to perform tasks such as prediction/classification, forecasting, natural language processing and image classification. A full list of the different algorithms currently available can be seen here. If you want to add your own algorithms into the mix, such as custom neural networks, Driverless AI allows you to add your own models (for neural network fans, both TensorFlow and PyTorch models can be added) as ‘custom recipes’ to be used when trialling different algorithms. If you are interested in seeing how this tool is used, here is a quick demonstration.

Example of Driverless AI user interface

  • H2O AutoDoc: AutoDoc allows you to create automatic documentation for models created in either H2O or Driverless AI (this feature is integrated into Driverless AI). You can also use this tool on any model you create using the Python library scikit-learn. The documentation can be personalised to include the output that you think is most important, e.g. a confusion matrix, model performance metrics, etc. The document can be written to either a Microsoft Word file or a Markdown file.

  • H2O MLOps: If you are looking at putting your machine learning models into production, this is where MLOps (machine learning operations) can help. H2O MLOps can be used to deploy models that you have created in both H2O and H2O Driverless AI, and allows you to easily maintain them once they are in production. The tool uses Kubernetes for easy deployment, scaling and management of your production environment, and allows you to run diagnostics and update models without ever needing ‘down-time’.

  • H2O Enterprise Puddle: Enterprise Puddle is designed to help you easily create and manage H2O cloud instances. This tool is aimed at people who work within IT maintaining environments, permissions, data access, etc. rather than data scientists.

If you are interested in trying out any of the above proprietary tools (excluding Enterprise Puddle), H2O.ai are offering a 90-day free trial.

Technical features

H2O.ai products are built around distributed, in-memory machine learning. They achieve this by distributing data across an H2O cluster and storing it in memory in a compressed format, which allows for parallelisation. H2O.ai use Java as their main coding language. REST APIs allow you to access and code against H2O products from languages such as R and Python, so if you know R or Python you can use H2O, H2O Wave, Sparkling Water and Driverless AI without needing to learn another coding language!
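As a small illustration of the REST layer mentioned above, the sketch below asks an H2O-3 instance for its cluster status using only the Python standard library. It assumes an H2O-3 cluster is already running locally on its default port (54321). The function names are invented for this example, and in day-to-day work the `h2o` Python package makes these calls for you.

```python
import json
import urllib.request

# Assumption: a local H2O-3 instance listening on its default port.
DEFAULT_BASE_URL = "http://localhost:54321"


def cluster_status_request(base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request for the cluster status endpoint."""
    url = base_url.rstrip("/") + "/3/Cloud"
    return urllib.request.Request(
        url, headers={"Accept": "application/json"}, method="GET"
    )


def fetch_cluster_status(base_url: str = DEFAULT_BASE_URL, timeout: float = 5.0) -> dict:
    """Send the request and decode the JSON reply; needs a running cluster."""
    request = cluster_status_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)
```

Calling `fetch_cluster_status()` while a cluster is running returns a dictionary describing it. The same pattern is what lets clients written in R, Python or any other language drive the Java back end.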

Another feature of H2O and H2O Driverless AI that you might find useful is that any model created with either tool can be exported for later use. In H2O, a model can be exported as a hierarchical data format (HDF5) file, a MOJO (model object, optimized) or a POJO (plain old Java object). If you want to learn more about these different formats, here is a useful link. In H2O Driverless AI you can export ‘Scoring Pipelines’. These can be used to deploy the models that you have developed within Driverless AI for production. They can be exported as either Python Scoring Pipelines or MOJO Scoring Pipelines. Within the Python Scoring Pipeline, an example Python script is included to show you how to use the pipeline in practice. If you would like to know more about exporting your Scoring Pipelines from Driverless AI, take a look here.
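For the open-source H2O-3 case, exporting a MOJO from Python can look roughly like the sketch below. It assumes `model` is an already trained H2O-3 estimator (for example the `leader` of an AutoML run). The wrapper function `export_mojo` is a hypothetical helper written for this post, while `download_mojo` is the method provided by H2O-3 models.

```python
from pathlib import Path


def export_mojo(model, out_dir: str) -> str:
    """Write a trained H2O model to out_dir as a MOJO and return the file path.

    `model` is expected to be a trained H2O-3 estimator, whose
    download_mojo() method writes a zip archive and returns its location.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    # get_genmodel_jar=True also saves h2o-genmodel.jar, which Java code
    # needs in order to score with the MOJO outside of an H2O cluster.
    return model.download_mojo(path=str(target), get_genmodel_jar=True)
```

A call such as `export_mojo(aml.leader, "exported_models")` would leave a zip file in the hypothetical `exported_models` folder. The MOJO can then be scored from Java without a running cluster, and recent H2O-3 releases can also load it back into Python with `h2o.import_mojo`.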

Conclusion

Now that you have read about H2O.ai and the tools that they provide, I hope that you have a better idea of which H2O.ai tools you could use for your machine learning projects. H2O.ai have created a set of tools which knit together nicely when used with each other. If you want to see how these tools are used in production, H2O.ai have a full section on their website dedicated to showing use cases for their tools.



For updates and revisions to this article, see the original post

To leave a comment for the author, please follow the link and comment on their blog: The Jumping Rivers Blog .
