
How to Start With Machine Learning on IBM i

In this article, IBMer Gan Zhang shares some information about using native machine learning libraries and Jupyter on IBM i.


IBM i is extending its capabilities in many areas by way of the new RPM mechanism, including system management tools, machine learning frameworks, cloud enablement tools and the GNU tool chain. This article shows the new capabilities in the machine learning (ML) area. ML depends heavily on data to produce a sufficiently smart model. IBM i applications hold valuable data, which offers many opportunities to build more intelligent applications with ML technology.
 
This article guides you through setting up the ML environment on IBM i and running some basic ML workloads. It doesn't try to demonstrate every ML package, but to get you ready to start working with them. Please note: All demonstrations in this article can be done on IBM i 7.2 and higher.
 

Set Up The Environment for Machine Learning

Portable Application Solutions Environment (PASE)

PASE is an AIX-like environment that enables you to run UNIX applications on IBM i. All RPM packages run in the PASE environment. Make sure PASE (SS1 option 33) is installed on your IBM i system before continuing.

OpenSSH

OpenSSH isn’t a requirement, but it can improve your experience, so I recommend installing the 5733SC1 product, which includes OpenSSH. Use the following command to start the OpenSSH server.
 
STRTCPSVR SERVER(*SSHD)
 
Try accessing the IBM i system through the ssh client as shown below. Note: All commands in this article are run in the ssh client (unless specified otherwise).
 
ssh <your ibmi system>
 
If this is the first time you’re using the ssh client, you may want to avoid being prompted for a password on every connection. Run the following commands on your client system to set up key-based login. They assume your home directory on IBM i is /QOpenSys/home/<yourname>.
 
ssh-keygen -t rsa
ssh <yourname>@<ibmisystem> mkdir -p .ssh
cat ~/.ssh/id_rsa.pub | ssh <yourname>@<ibmisystem> 'cat >> .ssh/authorized_keys'
ssh <yourname>@<ibmisystem>
 
You shouldn’t see a password prompt again when you start ssh to your IBM i system. If you still get the prompt, try the following commands in your home directory on IBM i:
 
chmod 640 .ssh/authorized_keys
chmod 700 .ssh
chmod 755 $HOME

Set Up the ML Python Environment

Python 2 is installed by default, because most of the yum packages depend on it. For ML on IBM i, I suggest installing Python 3 as well, because more and more ML packages target that Python level. Refer here for more details. The following command installs python3 together with its development package and the Python package management tool pip3.

yum install python3 python3-devel python3-pip

 

Set Up the Tool Chain on IBM i

The tool chain is needed because some python3 packages may have to be compiled when they are installed by pip3.

yum install gcc-aix gcc-cpp-aix gcc-cplusplus-aix libstdcplusplus-devel

This installs gcc/g++ 6.3.0 in the IBM i PASE environment.
 

Install Popular ML Frameworks

Next, try installing some popular ML framework packages, such as NumPy, pandas, SciPy and scikit-learn.
 
NumPy is the fundamental package for scientific computing with Python. Among other things, it contains:

  • A powerful N-dimensional array object
  • Sophisticated (broadcasting) functions
  • Tools for integrating C/C++ and Fortran code
  • Useful linear algebra, Fourier transform, and random number capabilities
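
As a quick illustration of the features listed above, here is a minimal sketch. The array values are made up for the example, and it only uses basic NumPy calls (np.array, mean, np.linalg.solve, np.random.RandomState).

```python
import numpy as np

# An N-dimensional array object: here a 2-D array with 3 rows and 2 columns
a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(a.shape)

# Broadcasting: the 1-D array of column means is subtracted from every row
centered = a - a.mean(axis=0)
print(centered)

# Linear algebra: solve the 2x2 system m @ x = b
m = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(m, b)
print(x)

# Random numbers from a seeded generator, so repeated runs give the same values
rng = np.random.RandomState(0)
samples = rng.normal(size=3)
print(samples)
```

The seeded generator is only there to make the run repeatable; any seed works.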

 
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. Scikit-learn is a popular ML framework for most workloads. It’s a simple and efficient tool for data mining and data analysis. It’s accessible to everybody and reusable in various contexts. It’s built on NumPy, SciPy and matplotlib.

The following command installs them all. You may notice the “python3-” prefix in the name of each RPM package, which indicates that it is a python3 package rather than a python2 one.

yum install python3-numpy python3-pandas python3-scipy python3-scikit-learn
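
One way to check that the installation worked is to import each package and print its version. This is a small sketch rather than an official verification step. Note that scikit-learn is imported under the name sklearn.

```python
import importlib

# Import names of the packages installed by the yum command above
versions = {}
for name in ("numpy", "pandas", "scipy", "sklearn"):
    module = importlib.import_module(name)  # raises ImportError if missing
    versions[name] = module.__version__
    print(name, versions[name])
```

Save it to a file and run it with python3. An ImportError points at the package that did not install correctly.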

Set Up the Matplotlib Environment

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. The following commands install matplotlib and the packages it depends on:

yum install tcl tk pkg-config python3-tkinter python3-pytz
yum install freetype-devel libfreetype6
yum install libpng-devel libpng16
pip3 install matplotlib

 

Set Up the Jupyter Environment

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. The following commands install it:

yum install libzmq5 libzmq-devel
pip3 install jupyter

 

Set Up the Python Db2 Connection Environment

ibm_db_dbi, which is included in the python3-ibm_db RPM package, is a Db2 connection package for the Python environment. You can use it to retrieve valuable data from Db2 and, after the data analysis, to store the results back into Db2. Use the following command to install it on IBM i.

yum install python3-ibm_db

 

Play With Machine Learning on IBM i

After the environment is set up, we can start to play with ML. Read on for some examples that illustrate how we can use the ML frameworks.

Matplotlib

Let’s first verify that matplotlib works on IBM i. There are two ways to get the figures produced by matplotlib. One is to show them directly in a GUI window on your client side. Here’s a sample, plots.py:

#plots.py
import matplotlib as mpl
# Select the Tk GUI backend before pyplot is imported
mpl.use("TkAgg")
import matplotlib.pyplot as plt

plt.plot([1,3,2,4])
plt.ylabel('some numbers')
# Open the figure in a window on the client
plt.show()

To make it work, your sshd has to forward the GUI information from IBM i to your ssh client. You can enable this by editing the sshd_config file under /QOpenSys/QIBM/ProdData/SC1/OpenSSH/etc/. Change the following line from

# X11Forwarding no

to

X11Forwarding yes

 
After this change, restart your ssh server (ENDTCPSVR SERVER(*SSHD) followed by STRTCPSVR SERVER(*SSHD)) and reconnect your client with the following command. Please note that “-Y” is necessary here.

ssh -Y <yourname>@<ibmi system>

Within the ssh client, you can run this program with the command:

python3 plots.py

After a while (the length of the wait depends on your network), a window like the one below opens on your client system:
 

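The other way is to save the figure into a picture file, as in the following code in plotf.py:

#plotf.py
import matplotlib as mpl
# Select the file-only Agg backend before pyplot is imported
mpl.use("Agg")
import matplotlib.pyplot as plt

plt.plot([1,3,2,4])
plt.ylabel('some numbers')
# Write the figure to a PNG file instead of opening a window
plt.savefig("hellompl.png")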

We can run this code with the command:

python3 plotf.py

The output of this program is one PNG file: hellompl.png. The content of this PNG file is the same as the image above.

Python Db2 Connection

Let’s use the ibm_db_dbi package to access data from Db2. dbi.connect() connects to the local Db2, using your current user profile as the identity. conn.cursor() returns a cursor, and its cur.execute() routine runs a SQL statement. The following is the key code for a Db2 connection.

#runsql.py
import argparse
import ibm_db_dbi as dbi

# Read the SQL statement from the -s option on the command line
parser = argparse.ArgumentParser()
parser.add_argument('-s', dest='sqlcmd', required=True, help='SQL statement to run')
sqlcmd = parser.parse_args().sqlcmd

try:
    # Connect to the local Db2 with the current user profile
    conn = dbi.connect()
    cur = conn.cursor()
    cur.execute(sqlcmd)
    # Fetch and print rows only if the statement produced a result set
    if cur._result_set_produced:
        rlist = cur.fetchall()
        for onerecord in rlist:
            print(onerecord)
except Exception as err:
    print('ERROR:  ' + str(err))

Run this Python program as follows:

python3 runsql.py -s "select * from qpfrdata.QAPMJOBL where jbnbr = '020201' and dtetim>'190531'"

The output looks like this:

(Decimal('57'), '190531000000', Decimal('900'), 'QSYSWRK  ', 'QSYS      ', 'QHTTP           ', 'QTMHHTTP  ', '020201', '*SYS           ', 'B', '', '03', 'RP', b'\x00\x00', 'N', '02', '010', Decimal('0.000'), … … )
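
The cursor pattern in runsql.py comes from the standard Python database API, so it can be tried without a Db2 system. The sketch below uses the standard-library sqlite3 module as a stand-in and passes the filter values as parameters instead of building them into the SQL string. The jobs table and its rows are hypothetical. Whether ibm_db_dbi accepts “?” markers in exactly the same way is an assumption to check against its documentation.

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Run one statement on a DB-API connection and return all rows."""
    cur = conn.cursor()
    cur.execute(sql, params)
    rows = cur.fetchall()
    cur.close()
    return rows

# Stand-in database so the sketch runs anywhere (hypothetical table and data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (jbnbr TEXT, dtetim TEXT, jbname TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("020201", "190531000000", "QHTTP"), ("020202", "190530000000", "QSYSWRK")],
)

# The values are bound to the ? markers rather than pasted into the statement
rows = run_query(
    conn,
    "SELECT jbnbr, dtetim, jbname FROM jobs WHERE jbnbr = ? AND dtetim > ?",
    ("020201", "190531"),
)
for row in rows:
    print(row)
conn.close()
```

Binding values this way avoids quoting problems when the filter values come from user input.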

Another Complex Sample

Let’s try a more complex sample. I’ll use Jupyter, which makes ML analysis easier. The data comes from https://www.kaggle.com/c/home-credit-default-risk and was imported into Db2 on i. The code here mostly comes from one popular kernel for this exercise.

First, let’s start Jupyter on i with the following command:

jupyter notebook --ip=<your host name> --port=2019

NOTE: The following PTFs are required for Jupyter:

MF65730 V7R2M0
MF65731 V7R3M0
MF65746 V7R4M0

This command starts Jupyter on port 2019. It prints a URL from which you can access Jupyter, like the one below:

http://<your_ibmi_system>:2019/?token=332e342ffed39808bd85faa1f649d749e0a191148efd0ddfcd

We can start the ML journey from here. First, let’s create a new notebook from “New->Python3” in the top right corner. You’ll get an empty notebook like the one below.

We can write any Python code here and run it with the “Run” button above. The result is shown directly within the notebook. The image below shows how to retrieve data from Db2, save it into a pandas DataFrame and show the head of the data. A DataFrame is similar to a table in Db2, which makes the process easier to follow. First, we create a connection to the local Db2 with dbi.connect(). Then we set the isolation level to NO_COMMIT, which is the appropriate level when the data is not journaled. With this connection, we can use the pd.read_sql_query() routine to run a SQL statement that retrieves the data from Db2. The DataFrame app_train stores the retrieved data. We can get the shape of the DataFrame from its “shape” property. We can see that we have 307,511 applicants and each one has 122 features. In Db2 terms, we have 307,511 rows of data, and each row has 122 columns.
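
For readers who want to try the pd.read_sql_query() step without the screenshot, here is a minimal sketch. It uses an in-memory sqlite3 database as a stand-in so that it runs anywhere. On IBM i, the conn object would instead be the connection returned by dbi.connect(). The table and the three rows are made up for illustration and are not taken from the Kaggle data.

```python
import sqlite3
import pandas as pd

# Stand-in connection; on IBM i this would come from ibm_db_dbi's connect()
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE application_train "
    "(sk_id_curr INTEGER, target INTEGER, days_birth INTEGER)"
)
# Illustrative rows only
conn.executemany(
    "INSERT INTO application_train VALUES (?, ?, ?)",
    [(1, 0, -12000), (2, 1, -9000), (3, 0, -15000)],
)

# Run the query and load the result set into a DataFrame
app_train = pd.read_sql_query("SELECT * FROM application_train", conn)
print(app_train.shape)   # (number of rows, number of columns)
print(app_train.head())  # first rows, like a SELECT with a row limit
conn.close()
```
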

The image below shows the distribution of the ages (DAYS_BIRTH/365) of all applicants. It uses matplotlib’s histogram (hist) plot.
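
A histogram like this can be sketched as follows. The days_birth values here are randomly generated stand-ins, not the real column. In the Kaggle data set, DAYS_BIRTH is recorded as a negative number of days, so the sketch takes the absolute value before dividing by 365. The output file name ages_hist.png is an arbitrary choice.

```python
import matplotlib as mpl
mpl.use("Agg")  # file-only backend, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the DAYS_BIRTH column (negative day counts)
rng = np.random.RandomState(0)
days_birth = -rng.randint(20 * 365, 70 * 365, size=1000)

# Convert days to years
ages = np.abs(days_birth) / 365

# Draw the histogram and keep the bin counts for inspection
counts, edges, _ = plt.hist(ages, bins=25, edgecolor="k")
plt.title("Age of applicants")
plt.xlabel("Age (years)")
plt.ylabel("Count")
plt.savefig("ages_hist.png")
```

In a notebook, the figure can be shown inline instead of being saved to a file.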

We can also use other packages such as seaborn for advanced data analysis as below:

Of course, we first need to install seaborn with pip3:

pip3 install seaborn

After many data pre-processing and feature engineering steps, we can finally use some of the ML algorithms provided by the scikit-learn package to train a model and make predictions, as in the image below:

It uses the RandomForestClassifier algorithm to train on the “train” data. After training, the model makes predictions on the “test” data with predict_proba(), which gives the probability of default for each applicant in the test data.
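
The training and prediction steps can be sketched as below. This is not the kernel’s code. It uses a synthetic data set from make_classification in place of the engineered features, and the parameter values (500 samples, 100 trees, seed 50) are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features and the 0/1 target column
X, y = make_classification(n_samples=500, n_features=10, random_state=50)
train, test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.2, random_state=50
)

# Train the random forest on the "train" data
model = RandomForestClassifier(n_estimators=100, random_state=50)
model.fit(train, train_labels)

# predict_proba returns one column per class; column 1 is the probability
# of class 1, which stands for "default" in the article's data
default_prob = model.predict_proba(test)[:, 1]
print(default_prob[:5])
```

Each value in default_prob lies between 0 and 1, one per row of the test data.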

I can’t go through all the details of this kernel or the functions of the ML packages within one article, but with this you can start your data science work on IBM i.

I also created some sample ML code here in case you are interested in more details on the samples provided in this article.
