In this blog I will show how to build machine learning models in RStudio and run the training directly in SAP HANA.
The main benefit of this approach is that it lets you leverage the scalability and performance of SAP HANA from your R scripts.
In a nutshell, there is no movement of training data from SAP HANA to the R server or client. With similar functionality SAP previously released a Python API that gave users the ability to interact with SAP HANA's Predictive Analysis Library from their preferred Python GUI, such as Jupyter or Spyder (see the reference list below for links). With the release of HANA.ML.R it is now possible to use R together with SAP HANA – something I personally have been looking forward to.
From my own experience as an external SAP data science consultant, I see that many users prefer R or Python over SQL (Structured Query Language) or SQL procedures for interacting with SAP HANA and SAP HANA PAL.
The data scientist uses the familiar R command interface to invoke algorithms directly in SAP HANA PAL. The CPU processing required during training of machine learning models is performed on SAP HANA rather than on a limited laptop. This leverages the performance of SAP HANA and greatly reduces the training time of machine learning models.
In this blog, I will show how to build a machine learning model from RStudio, or your preferred R GUI, on SAP HANA without moving the training data to the R client. As mentioned, this allows for faster machine learning training times, since SAP HANA is much more performant than a laptop or similar.
- Machine learning with SAP HANA – all Interaction performed directly from R.
- Realistic business problem solved with machine learning.
- Appendix. Pre-requisites for using the R Client API for machine learning algorithms with SAP HANA
Firstly, I want to showcase how to build a simple machine learning model in SAP HANA – but using the data science language R. SAP HANA comes with a wide set of machine learning algorithms for regression, classification, forecasting etc. This library is called PAL – short for Predictive Analysis Library (see more in the references below). Some of the algorithms in PAL can now be accessed through R. In the following I am using RStudio as my preferred graphical user interface to R.
The data that is used in this blog comes from the CoIL (Computational Intelligence and Learning, see link to references) challenge, which had the following goal:
“Can you predict who would be interested in buying
a caravan insurance policy and give an explanation of why?”
The data file features the actual dataset from an insurance company and contains 5822 customer records, of which 348, about 6%, had caravan policies. Each record consists of 86 attributes containing socio-demographic data and product ownership. The test or validation set contains information on 4000 customers randomly drawn from the same population. For model comparison purposes, each model must be evaluated on which 800 of these 4000 customers would be most likely to buy caravan policies. That is, when scoring a model, the 20% of customers with the highest scores (800 of the 4000 in the test set) must be selected and evaluated against the true outcomes.
Selecting the 800 customers randomly would on average only yield about 48 buyers. The competition had two objectives, the first being to build a model that could identify the most actual buyers among the 800 most probable customers in the testing dataset.
The dataset is, in my view, more realistic than most textbook examples, as it contains imbalanced classes (only 6% are interested in the product). In this machine learning use case I will show how to use SAP HANA's machine learning library – the Predictive Analysis Library – to solve this data science challenge. The data is available, so reproducing the results should be straightforward.
Building and training machine learning models from R – but on SAP HANA.
Below you see the simple R code needed to train a machine learning model on SAP HANA from RStudio. In this example, I am using the random forest algorithm, which in my experience performs very well both in terms of training time and in generalizing to new data without overfitting. For those who know the R language, this will feel very similar to building machine learning models with packages such as caret (short for Classification And REgression Training) or mlbench (Machine Learning Benchmark).
There are as shown just a few lines of code needed to ask SAP HANA to train a machine learning model from R.
Firstly, load the R library “hana.ml.r”. This command also loads additional libraries that are used together with hana.ml.r, such as RODBC, which handles the communication with SAP HANA. You must ensure that these dependencies are installed in your local R environment.
Secondly, connect to SAP HANA using the hanaml.ConnectionContext function. Here you supply the ODBC data source name and your credentials. It is also possible to use the SAP HANA tool hdbuserstore.exe to store your credentials so you won't have to keep them in clear text. Run the following command (placeholders in angle brackets): hdbuserstore.exe set <KEY> <host>:<port> <user> <password>.
Lastly, use the function conn$table to create a link to the SAP HANA table. Note that this does not transfer the data to R! This is one of the major advantages of this approach. I also create the feature list and the label name. These could be set directly in the algorithm function, but for simplicity I chose this method.
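The three steps above can be sketched as follows. This is a minimal sketch: the DSN name, credentials and table name are placeholders, and the exact argument names may differ slightly between hana.ml.r versions.

```r
# Load hana.ml.r (also pulls in dependencies such as RODBC)
library(hana.ml.r)

# Connect via the ODBC data source name; "HANA_DSN" and the credentials
# below are placeholders -- replace them with your own
conn <- hanaml.ConnectionContext(dsn = "HANA_DSN",
                                 user = "MYUSER",
                                 password = "MyPassword")

# Create a link to the training table -- no data is transferred to R
train <- conn$table("CARAVAN_TRAIN")
```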
Understanding your data:
Using this command, R will ask SAP HANA to count the number of rows and only return the result.
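For instance (a sketch; the method name follows the $Count helper mentioned later in this blog):

```r
# Server-side row count: SAP HANA counts the rows and returns only the result
train$Count()
```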
Creating a few helper variables:
featurelist <- list("Customer_Subtype", "Number_of_houses", … )
label <- "Number_of_mobile_home_policies_num"
Now you are ready to ask SAP HANA to train the machine learning model.
In this case I use the random forest algorithm to classify who would be interested in buying the additional insurance product.
model <- hanaml.RandomForestClassifier(conn.context = conn, df = train, features = featurelist, label = label)
Using RStudio you can click tabulator(TAB)-key to see all the possible parameters that the function accepts:
After executing the code line SAP HANA will start the training part and still no data will be transferred to R. All processing will be performed on SAP HANA.
Plotting the number of random forest trees against error (out of bag error):
The chart above illustrates whether the model keeps improving with more trees, using the out-of-bag error metric. As shown, the improvement flattens out after approximately 50 trees.
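A sketch of how such a plot can be produced; the oob.error slot and its column names are assumptions about the model object, not confirmed API:

```r
# Fetch the out-of-bag error per number of trees from the trained model
# (slot and column names are assumptions) and plot it with base R
oob <- model$oob.error$Collect()
plot(oob$TREE_INDEX, oob$ERROR, type = "l",
     xlab = "Number of trees", ylab = "Out-of-bag error")
```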
Evaluating the trained model on new data:
Once the model has been trained it can be applied to new (unknown to the model) data.
# Model evaluation
ACTUAL_CLASS PREDICTED_CLASS COUNT
1 0 0 5474
2 0 1 0
3 1 0 348
4 1 1 0
The training evaluation metric is accuracy, which is reflected in the result. The result might seem terrible; however, as mentioned, the data is highly imbalanced, so we would benefit from evaluation metrics like Cohen's kappa or MCC (Matthews correlation coefficient). I will show how this can be mitigated later in the blog.
For now, I will apply the test data provided with the data science challenge to the very simple model we have built. 1 = the customers that we predicted correctly (out of the possible 238 when choosing the 800 most probable).
Firstly, I create a table link to the SAP HANA table containing the test data. Then I use the predict function to apply the trained model (called “model”) to this data.
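A sketch of these two steps, with a placeholder table name:

```r
# Link to the test table in SAP HANA -- again, no data is moved to R
test <- conn$table("CARAVAN_TEST")

# Score the test set on SAP HANA using the trained model
pred <- predict(model, df = test, features = featurelist)

# Collect only the (small) prediction result back to R for inspection
head(pred$Collect())
```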
With these few pieces of code – where most of the code was getting the data in order – we can answer the data science challenge. In our case, we find 103 of the potential 238 customers willing to buy the additional insurance product. On the leaderboard for the data science challenge this would give us the following place. Choosing randomly, one would expect to predict approximately 48 customers (randomly choosing 800, i.e. 20% of the test set, captures on average 20% of the 238 actual buyers ~ 47.6 customers).
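The random baseline can be verified with a one-line calculation:

```r
# Expected buyers when picking 800 of 4000 customers at random:
# 20% of the 238 actual buyers in the test set
expected_buyers <- (800 / 4000) * 238
expected_buyers  # 47.6
```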
Improving our model:
As mentioned above the data science challenge I am using contains a highly imbalanced dataset. In the following I will show how we can improve our model – still keeping the code rather simple and production ready.
With these parameters, which now take into account stratified sampling and prior probabilities for the imbalanced classes, we achieve 110 correctly classified customers that would be interested in buying the insurance product. Increasing the number of ensemble trees as well as max.depth and max.features could improve the model a bit further. However, in this example I wanted to show the simplicity of R with SAP HANA and not go overboard with code complexity.
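A sketch of what such a call could look like. The parameter names strata and priors (and their list format) are assumptions based on the corresponding PAL random decision trees options; check the function help (TAB completion in RStudio) for the exact names in your version:

```r
# Retrain, taking class imbalance into account
# (strata/priors parameter names and formats are assumptions)
model <- hanaml.RandomForestClassifier(conn.context = conn,
                                       df = train,
                                       features = featurelist,
                                       label = label,
                                       n.estimators = 50,
                                       strata = list("0" = 0.5, "1" = 0.5),
                                       priors = list("0" = 0.94, "1" = 0.06))
```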
Predicting on test data:
Viewing a few records of the prediction table. The field Score is the predicted class (1 = buy, 0 = not interested), i.e. the model's decision, and Confidence is how confident the model is in that decision.
Answering the Challenge – the most likely customers to buy the caravan insurance:
This would then bring us to the top 5 – with minimal coding effort and high transparency. Previously I built an ensemble of 3 machine learning models (random forest, logistic regression and automated analytics) that correctly predicted 121 customers; however, that required a lot of coding and was harder to bring to a potential production stage.
In the data science challenge, there was an additional question besides predicting the most likely customers to buy the new insurance product: answering why these customers would be interested. Below are the few lines of code needed to produce the feature importance in descending order:
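A sketch of those lines; the feature.importances slot name and its column names are assumptions about the model object:

```r
# Fetch the variable importance computed by PAL and sort it in
# descending order (slot and column names are assumptions)
imp <- model$feature.importances$Collect()
imp[order(-imp$IMPORTANCE), ]
```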
The most important variables in our study are Customer_Subtype, contribution to car policies, fire policies and number of cars. This also makes sense from an insurance business perspective: I would expect customers with one or more cars to be more interested in a caravan insurance policy. Without a car, one is less likely to buy a caravan and hence to need caravan insurance…
Improving predictive capabilities even further:
In a typical data science project, it is essential to prepare the data before deciding which algorithm to use when training a supervised learning model. In the use case above I used the dataset just as it came from the data science challenge. To really squeeze the most information out of a dataset, it is important to take data preparation even further.
Using the CRISP-DM implementation approach is normally considered a best practice within the data science community. The process model consists of different phases, as shown in the figure below.
According to the CRISP-DM process model, the top-level knowledge discovery process consists of business understanding, data understanding, data preparation, modeling, evaluation and deployment. As illustrated in the figure above, there is an iterative flow between the different phases.
The structure of this article follows the process and methodology quite rigorously so that it can be re-used in other predictive analysis scenarios. The CRISP-DM phases of business understanding, data understanding, data preparation, modeling and evaluation are covered in this article; deployment is deemed outside the scope, as it is handled case by case in a custom landscape and is more differentiated than the first five phases.
Each phase in the methodology contains a number of second-level generic tasks. These tasks are defined as generic because they can be used in different data mining situations. Furthermore, they are considered complete and stable, meaning that they cover the whole process of data mining and remain valid for unforeseen developments.
The random forest algorithm:
The random forest algorithm is just one of many available algorithms used to solve classification and regression use cases. The random forest algorithm basically optimizes two parameters: firstly, the number of features considered at each split, and secondly, the number of trees used in the ensemble.
How do we find the optimal maximum number of features (the number of variables randomly considered at each split) for our model? This can be evaluated using a loop as shown below:
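A sketch of such a loop; the max.features parameter and the oob.error slot are assumptions about the hana.ml.r API:

```r
# Train one model per candidate max.features value and record the final
# out-of-bag error of each (parameter and slot names are assumptions)
results <- data.frame(max.features = integer(0), oob = numeric(0))
for (m in c(5, 10, 20, 40, 60, 80)) {
  fit <- hanaml.RandomForestClassifier(conn.context = conn, df = train,
                                       features = featurelist, label = label,
                                       max.features = m)
  err <- tail(fit$oob.error$Collect()$ERROR, 1)
  results <- rbind(results, data.frame(max.features = m, oob = err))
}
plot(results$max.features, results$oob, type = "b",
     xlab = "max.features", ylab = "Out-of-bag error")
```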
Plotting the result of max. features (lower values are better):
After 10 randomly chosen variables, the model's error does not change much. This indicates that 10 variables are suitable for the tree split, although the number could be expanded further. The challenge here is to avoid having the model “remember” all points and ultimately overfit, especially in this use case where the number of records is limited to around 6000.
Useful R functions when interacting with SAP HANA:
#Using R to get metadata information about the connected SAP HANA table.
#Counting number of records in table
#Displaying connection parameters
#Displaying connection schema
#Counting number of records (same as $Count – I think)
#Generated select statement to SAP HANA DB
#Columns in connected table
#Generate basic statistic on the connected table. Similar to R’s summary function.
#View data types (integer, nvarchar etc.)
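As a hedged reconstruction of the calls behind the comments above (the original code was shown as screenshots; the method names follow the hana.ml.r DataFrame API as I recall it and may differ slightly in your package version):

```r
train$Count()            # count the number of records server-side
conn$connection.context  # connection parameters (name is an assumption)
conn$schema              # connected schema (name is an assumption)
nrow(train)              # count records (same as $Count -- I think)
train$select.statement   # generated SELECT statement sent to SAP HANA
train$columns            # columns in the connected table
train$Describe()         # basic statistics, similar to R's summary()
train$dtypes             # data types (integer, nvarchar etc.)
```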
Prerequisites – how to setup the connection between SAP HANA and R.
The hana.ml.r package communicates with SAP HANA through ODBC (Open Database Connectivity). SAP HANA 2.0 SPS04 (released April 5th, 2019) is required for the hana.ml.r package to work. I managed to get most of the functionality running on SAP HANA 2.0 SPS03; however, that is not recommended.
Installing the R package hana.ml.r in your own R environment:
Installing the hana.ml.r package is done like any other local R package. Here is an example for Windows. As of writing this blog, the hana.ml.r package is not on CRAN. R >= 3.3 is required.
install.packages("C:/yourdrive/hana.ml.r-1.0.3.tar.gz", repos = NULL, type = "source")
When loading the hana.ml.r package it will also import the following packages:
R6, futile.logger, sets, RODBC, uuid
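Putting the installation together (the tarball path is an example from above; adjust it to where you downloaded the package):

```r
# Install the dependencies from CRAN first...
install.packages(c("R6", "futile.logger", "sets", "RODBC", "uuid"))

# ...then install the local hana.ml.r tarball
install.packages("C:/yourdrive/hana.ml.r-1.0.3.tar.gz",
                 repos = NULL, type = "source")
```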
Setting up the ODBC link to SAP HANA:
Step 1: Create an ODBC link to your HANA Server (shown here from Windows)
Step 2: Add a new ODBC link:
Step 3: Create a new Data Source link with HDBODBC (HANA Database ODBC):
Step 4: Configuring the ODBC link.
Step 5: One important setting is to make sure the ODBC link is created with the Advanced property shown below. Without this setting I have observed strange errors when fetching “lots” of data. The error can be reproduced with a SAP HANA table whose field names accumulate to more than ~2000 characters (very normal in customer tables). I think this applies to other databases as well. Luckily there is a simple fix, as shown below.
Step 6: Before trying to use the ODBC link from R make sure it works. Click the Test connection and enter your credentials.
If you get a “Connect successful”, you are all good to proceed with the next step where all the fun with machine learning starts.
Accessing more information:
From Rstudio, you can access additional information about the capabilities with the hana.ml.r package.
Help description for the algorithm showcased here.
??hana.ml.r::hanaml.RandomForestClassifier (type this in the R command line or use the GUI to navigate to the help section of RStudio)
Random forest model for classification parameter setup while training:
Available list of algorithms (available as of this writing):