How to Start With Machine Learning on IBM i
In this article, IBMer Gan Zhang shares some information about using native machine learning libraries and Jupyter on IBM i.
By Gan Zhang
11/04/2019
This article guides you through setting up an ML environment on IBM i and running some basic ML workloads. The goal is not to demonstrate every ML package, but to get you ready to start working with them. Please note: All demonstrations in this article can be done on IBM i 7.2 and higher.
Set Up The Environment for Machine Learning
Portable Application Solutions Environment (PASE)
PASE is an AIX-like environment that enables you to run UNIX applications on IBM i. All RPM packages run in the PASE environment. Make sure PASE (SS1 option 33) is installed on your IBM i system before continuing.
OpenSSH isn’t a requirement, but it can improve your experience, so I recommend installing the 5733SC1 product, which includes OpenSSH. Use the following CL command to start the OpenSSH server:
STRTCPSVR SERVER(*SSHD)
Try accessing the IBM i system through an ssh client as shown below. Note: All commands in this article are run in the ssh client (unless specified otherwise).
ssh <your ibmi system>
If this is the first time you’re using the ssh client, you may want to set up key-based authentication so that you aren’t prompted for a password every time. The following commands, run from your client system, set it up. I assume your home directory on IBM i is /QOpenSys/home/<yourname>.
ssh-keygen -t rsa
ssh <yourname>@<ibmisystem> mkdir -p .ssh
cat ~/.ssh/id_rsa.pub | ssh <yourname>@<ibmisystem> 'cat >> .ssh/authorized_keys'
ssh <yourname>@<ibmisystem>
You shouldn’t see a password prompt again when you ssh to your IBM i system. If you still get the prompt, try running the following commands on IBM i:
chmod 640 .ssh/authorized_keys
chmod 700 .ssh
chmod 755 $HOME
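The chmod commands matter because sshd, in its default strict mode, rejects key authentication when the home directory, .ssh, or authorized_keys can be written by other users. As a rough way to see which path is the problem, the following Python sketch flags paths that group or others can write to. The helper names are hypothetical, and the real sshd check is more detailed (it also looks at ownership), so treat this as an approximation only.

```python
import os
import stat


def writable_by_others(path):
    """Return True if group or other users have write permission on path."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def check_ssh_paths(home):
    """Map each path that sshd inspects to True when it is too permissive."""
    paths = [
        home,
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
    # Paths that do not exist yet are skipped rather than reported
    return {p: writable_by_others(p) for p in paths if os.path.exists(p)}
```

Calling check_ssh_paths(os.path.expanduser("~")) on IBM i returns a dictionary in which any True value marks a path that needs one of the chmod commands above.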
Set Up the ML Python Environment
By default, python2 is installed, because most yum packages depend on it. For ML on IBM i, I suggest installing python3 as well, since more and more ML packages target this Python level. Refer here for more details. The following command installs python3 together with its development package and the Python package management tool pip3.
yum install python3 python3-devel python3-pip
Set Up the Tool Chain on IBM i
The tool chain is needed because some python3 packages may have to be compiled when they are installed by pip3.
yum install gcc-aix gcc-cpp-aix gcc-cplusplus-aix libstdcplusplus-devel
This installs gcc/g++ 6.3.0 in the IBM i PASE environment.
Install Popular ML Frameworks
Next, try installing some popular ML framework packages, such as NumPy, pandas, SciPy and scikit-learn.
NumPy is the fundamental package for scientific computing with Python. Among other things, it contains:
- A powerful N-dimensional array object
- Sophisticated (broadcasting) functions
- Tools for integrating C/C++ and Fortran code
- Useful linear algebra, Fourier transform, and random number capabilities
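To make these capabilities concrete, here is a minimal sketch that touches each of them except the C/C++ and Fortran integration. All values are made up for illustration, and the snippet assumes NumPy is already installed.

```python
import numpy as np

# N-dimensional array: a 2 x 3 matrix holding 0..5
a = np.arange(6).reshape(2, 3)

# Broadcasting: the 1-D row is stretched across both rows of `a`
row = np.array([10, 20, 30])
b = a + row

# Linear algebra: solve the system m @ x = v
m = np.array([[2.0, 1.0], [1.0, 3.0]])
v = np.array([3.0, 5.0])
x = np.linalg.solve(m, v)

# Fourier transform: magnitude spectrum of one sine cycle sampled at 8 points
signal = np.sin(2 * np.pi * np.arange(8) / 8)
spectrum = np.abs(np.fft.fft(signal))

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.RandomState(42)
samples = rng.normal(size=5)

print(b)
print(x)
print(spectrum.round(3))
print(samples.shape)
```

RandomState is used here instead of the newer default_rng so that the sketch also runs on older NumPy levels.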
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering.
Scikit-learn is a popular ML framework for most workloads. It’s a simple and efficient tool for data mining and data analysis, accessible to everybody and reusable in various contexts. It’s built on NumPy, SciPy and matplotlib.
The following command installs them all. You may notice the “python3-” prefix in the RPM package names, which indicates that they are python3 packages rather than python2 packages.
yum install python3-numpy python3-pandas python3-scipy python3-scikit-learn
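Once the packages are installed, a quick import check can confirm that python3 actually finds them. The sketch below is one way to do that. The helper name check_packages is hypothetical. Note that scikit-learn is imported under the module name sklearn.

```python
import importlib


def check_packages(names):
    """Try to import each module; map its name to a version string or None."""
    result = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            result[name] = None
            continue
        # Not every module defines __version__, so fall back to a placeholder
        result[name] = getattr(module, "__version__", "unknown")
    return result


if __name__ == "__main__":
    found = check_packages(["numpy", "pandas", "scipy", "sklearn"])
    for name, version in found.items():
        print(name, version if version else "NOT INSTALLED")
```

Running this with python3 prints one line per package, so a missing RPM shows up immediately instead of later in the middle of an analysis.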
Set Up the Matplotlib Environment
Matplotlib is a Python 2D plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Install its prerequisites with yum, then install matplotlib itself with pip3:
yum install tcl tk pkg-config python3-tkinter python3-pytz
yum install freetype-devel libfreetype6
yum install libpng-devel libpng16
pip3 install matplotlib
Set Up the Jupyter Environment
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Install it as follows:
yum install libzmq5 libzmq-devel
pip3 install jupyter
Set Up the Python Db2 Connection Environment
ibm_db_dbi, which is included in the python3-ibm_db RPM package, is a Db2 connection package for Python. You can use it to retrieve data from Db2 and, after the analysis, to store the results back into Db2. Use the following command to install it on IBM i.
yum install python3-ibm_db
Play With Machine Learning on IBM i
After the environment is set up, we can start to play with ML. Read on for some examples that illustrate how we can use the ML frameworks.
Let’s first verify that matplotlib works on IBM i. There are two ways to get the figures produced by matplotlib. One is to show them directly in a GUI window on your client side. Here’s sample code, saved as plots.py:
#plots.py
import matplotlib as mpl
mpl.use("TkAgg")  # select the backend before importing pyplot
import matplotlib.pyplot as plt

plt.plot([1,3,2,4])
plt.ylabel('some numbers')
plt.show()
To make this work, sshd must forward the GUI information from IBM i to your ssh client. You can enable this by modifying the sshd_config file under /QOpenSys/QIBM/ProdData/SC1/OpenSSH/etc/. Change the following line from
# X11Forwarding no
to
X11Forwarding yes
After this change, restart your ssh server with STRTCPSVR and reconnect from your client with the following command. Please note that “-Y” is necessary here.
ssh -Y <yourname>@<ibmi system>
Within the ssh client, you can run this program with the following command:
python3 plots.py
After a while (the wait depends on your network), a window showing the plot appears on your client system.
You can also save the figure to a picture file, as the following code in plotf.py does:
#plotf.py
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend that renders to files
import matplotlib.pyplot as plt

plt.plot([1,3,2,4])
plt.ylabel('some numbers')
plt.savefig("hellompl.png")
We can run this code with the following command:
python3 plotf.py
The output of this program is one PNG file: hellompl.png. It contains the same plot that the GUI window shows.
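The two approaches can also be combined in one script that picks the backend at run time. The sketch below assumes that a working X11 forwarding session sets the DISPLAY environment variable on IBM i, which is how ssh -Y normally behaves. The function names choose_backend and plot_numbers are hypothetical.

```python
import os


def choose_backend(environ=None):
    """Pick TkAgg when an X11 display is available, otherwise Agg."""
    if environ is None:
        environ = os.environ
    # ssh -Y sets DISPLAY in the remote session when X11 forwarding works
    return "TkAgg" if environ.get("DISPLAY") else "Agg"


def plot_numbers(values, filename="hellompl.png"):
    """Show the plot in a window if possible; otherwise save it to a file."""
    import matplotlib as mpl
    backend = choose_backend()
    mpl.use(backend)  # select the backend before pyplot is imported
    import matplotlib.pyplot as plt

    plt.plot(values)
    plt.ylabel("some numbers")
    if backend == "TkAgg":
        plt.show()
    else:
        plt.savefig(filename)
    return backend
```

Calling plot_numbers([1, 3, 2, 4]) then opens a window in an ssh -Y session and writes hellompl.png in a plain ssh session.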
Python Db2 Connection
Let’s use the ibm_db_dbi package to access data in Db2. dbi.connect() connects to the local Db2 database; when called without arguments, it uses your current user profile as the identity. conn.cursor() returns a cursor, and its cur.execute() routine runs an SQL statement. The key code for a Db2 connection follows.
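#runsql.py
import argparse
import ibm_db_dbi as dbi

# Read the SQL statement from the -s option (argument handling is an
# assumption based on how the program is invoked in this article)
parser = argparse.ArgumentParser()
parser.add_argument('-s', dest='sqlcmd', required=True)
sqlcmd = parser.parse_args().sqlcmd

try:
    conn = dbi.connect()
    cur = conn.cursor()
    cur.execute(sqlcmd)
    # Only fetch when the statement produced a result set (e.g. a SELECT)
    if cur._result_set_produced:
        rlist = cur.fetchall()
        for onerecord in rlist:
            print(onerecord)
except Exception as err:
    print('ERROR: ' + str(err))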
Try to run this Python program as below:
python3 runsql.py -s "select * from qpfrdata.QAPMJOBL where jbnbr = '020201' and dtetim>'190531'"
The output looks like this:
(Decimal('57'), '190531000000', Decimal('900'), 'QSYSWRK ', 'QSYS ', 'QHTTP ', 'QTMHHTTP ', '020201', '*SYS ', 'B', '', '03', 'RP', b'\x00\x00', 'N', '02', '010', Decimal('0.000'), … … )
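Each fetched row is a tuple in which DECIMAL columns arrive as Decimal objects and fixed-length character columns are padded with trailing blanks. Before analysis it is often convenient to normalize these values. The following sketch shows one possible way. The helper names are hypothetical, and converting Decimal to float gives up exact decimal arithmetic, so only do it where that is acceptable.

```python
from decimal import Decimal


def clean_value(value):
    """Convert one fetched column value to a plain Python type."""
    if isinstance(value, Decimal):
        return float(value)      # DECIMAL/NUMERIC columns
    if isinstance(value, str):
        return value.rstrip()    # fixed-length CHAR columns are blank-padded
    return value                 # leave bytes, None, and other types untouched


def clean_row(row):
    """Apply clean_value to every column of a fetched row tuple."""
    return tuple(clean_value(v) for v in row)


# A few values taken from the output above
sample = (Decimal('57'), '190531000000', 'QSYSWRK ', b'\x00\x00',
          Decimal('0.000'))
print(clean_row(sample))
```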
Another Complex Sample
Let’s try a more complex sample. I’ll use Jupyter, which makes ML analysis easier, with the data from https://www.kaggle.com/c/home-credit-default-risk. The data was imported into Db2 on i. The code here mostly comes from one popular kernel for this exercise.
First, let’s start Jupyter on i with the following command:
jupyter notebook --ip=<your host name> --port=2019
NOTE: Following PTFs are required for Jupyter:
This command starts Jupyter on port 2019 and prints a URL from which you can access it.
We can start the ML journey from here. First, create a new notebook with “New->Python3” in the top right corner. You’ll get an empty notebook.
We can write any Python code here and run it with the “Run” button above. The result is shown directly within the notebook. The first cells retrieve data from Db2, save it into a pandas DataFrame, and show the head of the data. A DataFrame is similar to a table in Db2, which makes the process easier to understand.
First, we create a connection to the local Db2 with dbi.connect(). Then we set the isolation level to NO_COMMIT, which runs the statements without commitment control, so the tables don’t need to be journaled. With this connection, we can use the pd.read_sql_query() routine to run an SQL statement that retrieves the data from Db2. The DataFrame app_train stores the retrieved data. Its “shape” property gives the dimensions of the data frame: we have 307,511 applicants, and each one has 122 features. Or, in Db2 terms, we have 307,511 rows of data, and each row has 122 columns.
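The steps described here could look roughly like the following sketch. It is not the notebook’s original code. The schema name HOMECREDIT and the table name APPLICATION_TRAIN are hypothetical, and the set_option call with the SQL_ATTR_TXN_ISOLATION and SQL_TXN_NO_COMMIT constants reflects my understanding of ibm_db_dbi, so verify it against the level installed on your system.

```python
import pandas as pd


def connect_local_db2():
    """Connect to the local Db2 for i database without commitment control."""
    import ibm_db_dbi as dbi  # only available where python3-ibm_db is installed
    conn = dbi.connect()
    # Assumed option and constant names; check your ibm_db_dbi level
    conn.set_option({dbi.SQL_ATTR_TXN_ISOLATION: dbi.SQL_TXN_NO_COMMIT})
    return conn


def load_frame(conn, query):
    """Run a query on a DB-API connection and return a pandas DataFrame."""
    return pd.read_sql_query(query, conn)


# Usage on IBM i (schema and table names are hypothetical):
# conn = connect_local_db2()
# app_train = load_frame(conn, "select * from HOMECREDIT.APPLICATION_TRAIN")
# print(app_train.shape)
# app_train.head()
```

Because load_frame only needs a DB-API connection, the same function can be tried against another database such as SQLite before pointing it at Db2. Newer pandas levels may print a warning for connections that are not SQLAlchemy objects.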
The next plot shows the distribution of the ages (DAYS_BIRTH/365) of all applicants. It uses matplotlib’s hist diagram.
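A histogram like this could be produced along the following lines. This is a sketch rather than the kernel’s code. It assumes that DAYS_BIRTH holds negative day counts relative to the application date, which is my understanding of the Kaggle data, so it takes the absolute value before dividing by 365. The function names and the sample values are made up.

```python
def days_to_years(days_birth):
    """Convert day counts (possibly negative) to positive ages in years."""
    return [abs(d) / 365 for d in days_birth]


def plot_age_histogram(days_birth, bins=25):
    """Draw a histogram of applicant ages, e.g. inside a notebook cell."""
    import matplotlib.pyplot as plt

    plt.hist(days_to_years(days_birth), bins=bins, edgecolor="k")
    plt.title("Age of Client")
    plt.xlabel("Age (years)")
    plt.ylabel("Count")
    plt.show()


# Made-up DAYS_BIRTH values for illustration only
sample_days = [-7300, -10950, -14600]
print(days_to_years(sample_days))
```

In the notebook, plot_age_histogram(app_train["DAYS_BIRTH"]) would draw the chart for the full data set.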
We also need to install seaborn with pip3, as below:
pip3 install seaborn
After many data pre-processing and feature engineering steps, we can finally use the ML algorithms provided by the scikit-learn package to train a model and make predictions.
The notebook uses the RandomForestClassifier algorithm to train on the “train” data. After training, the model’s predict_proba() routine is applied to the “test” data, which gives the probability of default for each applicant in the test data.
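The training and prediction step can be sketched as follows. Synthetic data from make_classification stands in for the engineered Home Credit features, and the parameter values (n_estimators, random_state, test_size) are illustrative choices, not the kernel’s settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the default flag
X, y = make_classification(n_samples=500, n_features=20, random_state=50)
train, test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=50)

# Train a random forest on the "train" data
model = RandomForestClassifier(n_estimators=100, random_state=50)
model.fit(train, train_labels)

# predict_proba returns one column per class; column 1 is P(class == 1)
default_probability = model.predict_proba(test)[:, 1]
print(default_probability[:5])
```

With the real data, class 1 corresponds to a default, so the second column of predict_proba() is the value of interest for each applicant.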
I can’t go through all the details of this kernel or the functions of the ML packages within one article, but with this you can start your data science work on IBM i.
I also created some sample ML code here in case you are interested in more details on the samples provided in this article.
Gan Zhang is a part of the IBM i development team in China.