Diabetes

Let’s work with a dataset from Kaggle about people with diabetes. There are 100,000 records with 8 features and 1 class label, and there is severe class imbalance. We will use PySpark-BBN to learn two models: Naive Bayes (NB) and Maximum Weight Spanning Tree (MWST).

Load data

We will load the data using Pandas and do all of the data manipulation there before transferring the result to Spark. Note that the continuous variables have been discretized as follows (a quick illustration of the binning appears after the list).

  • age

    • \([0, 12)\) to young

    • \([12, 18)\) to teen

    • \([18, 65)\) to adult

    • \([65, 100]\) to senior

  • bmi

    • \([0, 18.5)\) to underweight

    • \([18.5, 25)\) to normal

    • \([25, 30)\) to overweight

    • \([30, 35)\) to obesity_i

    • \([35, 40)\) to obesity_ii

    • \([40, 100]\) to obesity_iii

  • blood_glucose

    • \([0, 100)\) to normal

    • \([100, 200)\) to prediabetic

    • \([200, 500]\) to diabetic

  • hba1c

    • \([0, 5.7)\) to normal

    • \([5.7, 6.5)\) to prediabetic

    • \([6.5, 10]\) to diabetic
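As a quick illustration of how the binning works, here is a minimal, hypothetical example using pd.cut with made-up ages (one per bin); the real preprocessing below applies the same pattern to every continuous variable.

import pandas as pd

# hypothetical ages, one per bin
ages = pd.Series([5, 15, 40, 70])
print(pd.cut(ages, [0, 12, 18, 65, 100], include_lowest=True,
             labels=['young', 'teen', 'adult', 'senior']).tolist())
# ['young', 'teen', 'adult', 'senior']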

The following binary variables had 0 and 1 values; these were transformed to no and yes, respectively.

  • hypertension

  • heart_disease

  • diabetes

Since there is class imbalance (there were far fewer people with diabetes than without), we undersampled the majority class to balance the two. Lastly, records whose gender was neither male nor female (unspecified) were removed, since an unspecified gender is effectively missing data.

[1]:
import pandas as pd

def get_data():
    cuts = {
        'age': [0, 12, 18, 65, 100],
        'bmi': [0, 18.5, 25, 30, 35, 40, 100],
        'blood_glucose': [0, 100, 200, 500],
        'hba1c': [0, 5.7, 6.5, 10]
    }
    labs = {
        'age': ['young', 'teen', 'adult', 'senior'],
        'bmi': ['underweight', 'normal', 'overweight', 'obesity_i', 'obesity_ii', 'obesity_iii'],
        'blood_glucose': ['normal', 'prediabetic', 'diabetic'],
        'hba1c': ['normal', 'prediabetic', 'diabetic']
    }

    pdf = pd.read_csv('./data/diabetes_prediction_dataset.csv') \
        .rename(columns={
            'smoking_history': 'smoking',
            'HbA1c_level': 'hba1c',
            'blood_glucose_level': 'blood_glucose'
        }) \
        .assign(
            gender=lambda d: d['gender'].str.lower(),
            age=lambda d: pd.cut(d['age'], cuts['age'], include_lowest=True, labels=labs['age']),
            hypertension=lambda d: d['hypertension'].map({0: 'no', 1: 'yes'}),
            heart_disease=lambda d: d['heart_disease'].map({0: 'no', 1: 'yes'}),
            smoking=lambda d: d['smoking'].apply(lambda s: s.lower().replace(' ', '_').strip()),
            bmi=lambda d: pd.cut(d['bmi'], cuts['bmi'], include_lowest=True, labels=labs['bmi']),
            hba1c=lambda d: pd.cut(d['hba1c'], cuts['hba1c'], include_lowest=True, labels=labs['hba1c']),
            blood_glucose=lambda d: pd.cut(d['blood_glucose'], cuts['blood_glucose'], include_lowest=True, labels=labs['blood_glucose']),
            diabetes=lambda d: d['diabetes'].map({0: 'no', 1: 'yes'})
        ) \
        .query('gender != "other"')

    # undersample the majority (no diabetes) class down to the size of the
    # minority (diabetes) class; the commented lines show the oversampling
    # alternative
    # pos_df = pdf[pdf['diabetes']=='yes'].sample(n=pdf[pdf['diabetes']=='no'].shape[0], replace=True, random_state=37)
    # neg_df = pdf[pdf['diabetes']=='no']
    pos_df = pdf[pdf['diabetes']=='yes']
    neg_df = pdf[pdf['diabetes']=='no'].sample(n=pdf[pdf['diabetes']=='yes'].shape[0], replace=False, random_state=37)

    pdf = pd.concat([pos_df, neg_df]).reset_index(drop=True)

    return pdf

Xy = get_data()
X, y = Xy[Xy.columns.drop(['diabetes'])], Xy['diabetes']

Xy.shape, X.shape, y.shape
[1]:
((17000, 9), (17000, 8), (17000,))

Split data

The data was split into training (tr) and testing (te) folds using stratified shuffle splitting. We take only the training and testing folds from the first split.

[2]:
from sklearn.model_selection import StratifiedShuffleSplit


tr_index, te_index = next(StratifiedShuffleSplit(random_state=37, test_size=0.1).split(X, y))

X_tr, y_tr = X.iloc[tr_index], y.iloc[tr_index]
X_te, y_te = X.iloc[te_index], y.iloc[te_index]

X_tr.shape, y_tr.shape, X_te.shape, y_te.shape
[2]:
((15300, 8), (15300,), (1700, 8), (1700,))
[3]:
Xy_tr = X_tr.assign(diabetes=y_tr)
Xy_te = X_te.assign(diabetes=y_te)

Xy_tr.shape, Xy_te.shape
[3]:
((15300, 9), (1700, 9))

Spark

Since PySpark-BBN operates on Spark, we transfer the training data from Pandas to Spark.

[4]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('diabetes') \
    .master('local[*]') \
    .getOrCreate()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/10 17:41:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/10 17:41:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[5]:
sdf = spark.createDataFrame(Xy_tr).cache()
sdf.count()

[5]:
15300
[6]:
sdf.printSchema()
root
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- hypertension: string (nullable = true)
 |-- heart_disease: string (nullable = true)
 |-- smoking: string (nullable = true)
 |-- bmi: string (nullable = true)
 |-- hba1c: string (nullable = true)
 |-- blood_glucose: string (nullable = true)
 |-- diabetes: string (nullable = true)

[7]:
sdf.show(5)
+------+------+------------+-------------+-------+-----------+-----------+-------------+--------+
|gender|   age|hypertension|heart_disease|smoking|        bmi|      hba1c|blood_glucose|diabetes|
+------+------+------------+-------------+-------+-----------+-----------+-------------+--------+
|  male| adult|          no|           no|  never|obesity_iii|prediabetic|       normal|      no|
|  male|senior|          no|          yes|current|     normal|   diabetic|  prediabetic|     yes|
|female| adult|          no|           no|  never|  obesity_i|prediabetic|  prediabetic|     yes|
|female| adult|          no|           no| former| overweight|   diabetic|  prediabetic|     yes|
|female| adult|          no|           no|current|     normal|     normal|  prediabetic|      no|
+------+------+------------+-------------+-------+-----------+-----------+-------------+--------+
only showing top 5 rows

Naive Bayes (NB)

Now we will learn an NB model. The class variable is diabetes, and this node will have a directed arc to every other node. Obviously, the NB model is structurally or causally wrong if we interpret it as a causal one, since diabetes cannot cause some of the variables (e.g., age and gender). But let’s keep playing along.

NB, structure learning

[8]:
from pysparkbbn.discrete.data import DiscreteData
from pysparkbbn.discrete.scblearn import Naive

clazz = 'diabetes'
data = DiscreteData(sdf)
naive = Naive(data, clazz)
g = naive.get_network()

print('')
print('nodes')
print('-' * 10)
for n in g.nodes():
    print(f'{n}')

print('')
print('edges')
print('-' * 10)
for pa, ch in g.edges():
    print(f'{pa} -> {ch}')
23/09/10 17:41:33 WARN CacheManager: Asked to cache already cached data.

nodes
----------
diabetes
gender
age
hypertension
heart_disease
smoking
bmi
hba1c
blood_glucose

edges
----------
diabetes -> gender
diabetes -> age
diabetes -> hypertension
diabetes -> heart_disease
diabetes -> smoking
diabetes -> bmi
diabetes -> hba1c
diabetes -> blood_glucose

NB, parameter learning

Once we have the NB structure, we can use it to learn the parameters.

[9]:
from pysparkbbn.discrete.plearn import ParamLearner
import json

param_learner = ParamLearner(data, g)
p = param_learner.get_params()
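If you want to eyeball the learned parameters, something like the following should work. This is a sketch; it assumes the object returned by get_params() is a plain, JSON-serializable structure, which is why json was imported above.

# peek at the first part of the learned parameters
# (assumes p is JSON-serializable)
print(json.dumps(p, indent=2)[:500])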

NB, posteriors

The posteriors of the NB model are printed below, in case you are interested.

[10]:
from pysparkbbn.discrete.bbn import get_bbn
from pybbn.pptc.inferencecontroller import InferenceController

bbn = get_bbn(g, p, data.get_profile())
join_tree = InferenceController.apply(bbn)
nb_tree = join_tree
[12]:
import numpy as np
import matplotlib.pyplot as plt

posteriors = join_tree.get_posteriors()

fig, axes = plt.subplots(3, 3, figsize=(10, 7))

for ax, (node, p) in zip(np.ravel(axes), posteriors.items()):
    pd.Series(p).plot(kind='pie', ax=ax)
    ax.set_title(f'{node}')
    ax.set_ylabel('')

fig.tight_layout()
_images/use-case-diabetes_18_0.png

NB, lift

In this type of lift analysis, we want to see which node, at its extreme value, gives the highest lift to diabetes. Keep an eye on the blood_glucose variable: setting it to diabetic makes diabetes absolutely certain (100%).
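Here, lift is the posterior probability of diabetes given the evidence divided by its prior, which is \(0.5\) because we balanced the classes:

\[\text{lift}(e) = \frac{P(\text{diabetes}=\text{yes} \mid e)}{P(\text{diabetes}=\text{yes})} = \frac{P(\text{diabetes}=\text{yes} \mid e)}{0.5}\]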

[13]:
from pybbn.graph.jointree import EvidenceBuilder

def get_sensitivity(name, value):
    # clear any prior observations, then observe a single node at the given value
    join_tree.unobserve_all()

    ev = EvidenceBuilder() \
        .with_node(join_tree.get_bbn_node_by_name(name)) \
        .with_evidence(value, 1.0) \
        .build()
    join_tree.set_observation(ev)

    # return the posterior of diabetes alongside the evidence that produced it
    meta = {'name': name, 'value': value}
    post = join_tree.get_posteriors()['diabetes']

    return {**meta, **post}

n2v = {
    'hypertension': 'yes',
    'heart_disease': 'yes',
    'gender': 'male',
    'hba1c': 'diabetic',
    'blood_glucose': 'diabetic',
    'age': 'senior',
    'smoking': 'current',
    'bmi': 'obesity_iii'
}

nb_sen = pd.DataFrame([get_sensitivity(name, value) for name, value in n2v.items()]) \
    .assign(lift=lambda d: d['yes'] / 0.5) \
    .sort_values(['lift'], ascending=False) \
    .rename(columns={'lift': 'diabetes_lift'})

nb_sen
[13]:
            name        value        no       yes  diabetes_lift
4  blood_glucose     diabetic  0.000000  1.000000       2.000000
3          hba1c     diabetic  0.139837  0.860163       1.720327
1  heart_disease          yes  0.172166  0.827834       1.655667
0   hypertension          yes  0.194220  0.805780       1.611560
7            bmi  obesity_iii  0.212166  0.787834       1.575668
5            age       senior  0.269892  0.730108       1.460217
6        smoking      current  0.447857  0.552143       1.104287
2         gender         male  0.468063  0.531937       1.063873

Tree

Now we will learn an MWST model.

Tree, structure learning

Nothing interesting here, except that diabetes and heart_disease have arcs going into age. We will reverse these arcs so that age points to diabetes and heart_disease instead.

[14]:
from pysparkbbn.discrete.scblearn import Mwst

data = DiscreteData(sdf)
mwst = Mwst(data)
g = mwst.get_network()

print('')
print('nodes')
print('-' * 10)
for n in g.nodes():
    print(f'{n}')

print('')
print('edges')
print('-' * 10)
for pa, ch in g.edges():
    print(f'{pa} -> {ch}')
23/09/10 17:42:41 WARN CacheManager: Asked to cache already cached data.


nodes
----------
diabetes
blood_glucose
hba1c
age
hypertension
bmi
smoking
heart_disease
gender

edges
----------
diabetes -> blood_glucose
diabetes -> hba1c
diabetes -> age
age -> bmi
age -> smoking
hypertension -> diabetes
heart_disease -> age
gender -> smoking

Since the object returned from the algorithm is a networkx graph, it’s easy to do graph surgery (like removing and adding arcs). Just be careful not to introduce cycles. In this case, we were lucky that reversing the arcs did not create any; in general, graph surgery can easily create directed cycles, so it is worth checking (a quick acyclicity check follows the surgery below).

[15]:
g.remove_edge('diabetes', 'age')
g.remove_edge('heart_disease', 'age')
g.add_edge('age', 'diabetes')
g.add_edge('age', 'heart_disease')
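Because parameter learning requires a directed acyclic graph, we can verify that the surgery did not introduce a cycle; networkx makes this a one-liner.

import networkx as nx

# sanity check: the surgically altered structure must still be a DAG
assert nx.is_directed_acyclic_graph(g)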

The learned MWST looks reasonable.

[16]:
import networkx as nx
import matplotlib.pyplot as plt

pos = nx.nx_pydot.graphviz_layout(g, prog='dot')

fig, ax = plt.subplots(figsize=(10, 5))

nx.draw(g, pos, ax=ax, with_labels=True, node_size=100, node_color='#e0e0e0')

fig.tight_layout()
_images/use-case-diabetes_27_0.png

Tree, parameter learning

Nothing interesting here; we are just learning the MWST parameters.

[17]:
param_learner = ParamLearner(data, g)
p = param_learner.get_params()

Tree, posteriors

The posteriors of the MWST model are printed below.

[18]:
# create the BBN and Join Tree

bbn = get_bbn(g, p, data.get_profile())
join_tree = InferenceController.apply(bbn)
mwst_tree = join_tree
[19]:
posteriors = join_tree.get_posteriors()

fig, axes = plt.subplots(3, 3, figsize=(10, 7))

for ax, (node, p) in zip(np.ravel(axes), posteriors.items()):
    pd.Series(p).plot(kind='pie', ax=ax)
    ax.set_title(f'{node}')
    ax.set_ylabel('')

fig.tight_layout()
_images/use-case-diabetes_32_0.png

Tree, lift

Let’s compare the lift analysis of the NB and MWST models. Here are some observations.

  • blood_glucose provides the most lift to diabetes

  • in the MWST model, age ranks pretty high, but in the NB model, age is near the bottom

  • smoking and gender do not provide much lift to diabetes

[20]:
mwst_sen = pd.DataFrame([get_sensitivity(name, value) for name, value in n2v.items()]) \
    .assign(lift=lambda d: d['yes'] / 0.5) \
    .sort_values(['lift'], ascending=False) \
    .rename(columns={'lift': 'diabetes_lift'})
mwst_sen
[20]:
            name        value        no       yes  diabetes_lift
4  blood_glucose     diabetic  0.000000  1.000000       2.000000
3          hba1c     diabetic  0.137192  0.862808       1.725616
0   hypertension          yes  0.229637  0.770363       1.540726
5            age       senior  0.281937  0.718063       1.436127
1  heart_disease          yes  0.372553  0.627447       1.254893
7            bmi  obesity_iii  0.474390  0.525610       1.051220
6        smoking      current  0.487581  0.512419       1.024838
2         gender         male  0.494459  0.505541       1.011082
[21]:
nb_sen
[21]:
            name        value        no       yes  diabetes_lift
4  blood_glucose     diabetic  0.000000  1.000000       2.000000
3          hba1c     diabetic  0.139837  0.860163       1.720327
1  heart_disease          yes  0.172166  0.827834       1.655667
0   hypertension          yes  0.194220  0.805780       1.611560
7            bmi  obesity_iii  0.212166  0.787834       1.575668
5            age       senior  0.269892  0.730108       1.460217
6        smoking      current  0.447857  0.552143       1.104287
2         gender         male  0.468063  0.531937       1.063873

If we do a cross tabulation between diabetes and blood_glucose, we see that people with diabetes never have a normal blood_glucose; they always have a prediabetic or diabetic blood_glucose! Likewise, people without diabetes never have a diabetic blood_glucose. No wonder this problem seems so easy.

[22]:
pd.crosstab(Xy_tr['blood_glucose'], Xy_tr['diabetes'])
[22]:
diabetes         no   yes
blood_glucose
normal         2347     0
prediabetic    5303  4698
diabetic          0  2952
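The same relationship is even clearer as row-conditional probabilities; here is a small sketch using pandas’ normalize option.

# P(diabetes | blood_glucose): note the structural zeros
pd.crosstab(Xy_tr['blood_glucose'], Xy_tr['diabetes'], normalize='index')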

Validation

Now let’s validate the NB and MWST models with the testing fold. We will measure performance using the Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves.

[23]:
def get_evidence(name, value, model):
    # build a hard-evidence object for a single node
    ev = EvidenceBuilder() \
        .with_node(model.get_bbn_node_by_name(name)) \
        .with_evidence(value, 1.0) \
        .build()
    return ev

def get_evidences(data, model):
    # evidence for every feature; the class variable is withheld
    return [get_evidence(name, value, model) for name, value in data.items() if name != 'diabetes']

def infer(data, model):
    # returns (predicted P(diabetes=yes), true label)
    model.unobserve_all()
    evidences = get_evidences(data, model)
    model.update_evidences(evidences)
    post = model.get_posteriors()['diabetes']

    return post['yes'], data['diabetes']
[24]:
nb_pred = [infer(data, nb_tree) for data in Xy_te.to_dict(orient='records')]
[25]:
mwst_pred = [infer(data, mwst_tree) for data in Xy_te.to_dict(orient='records')]

The performance of the NB model is shown below.

[26]:
from sklearn.metrics import roc_auc_score, average_precision_score

_y = pd.DataFrame(nb_pred, columns=['y_pred', 'y_true']) \
    .assign(y_true=lambda d: d['y_true'].map({'yes': 1, 'no': 0}))

pd.Series([
    roc_auc_score(_y['y_true'], _y['y_pred']),
    average_precision_score(_y['y_true'], _y['y_pred'])
], ['AUC_ROC', 'AUC_PR'])

[26]:
AUC_ROC    0.948330
AUC_PR     0.951899
dtype: float64

The performance of the MWST model is shown below.

[27]:
_y = pd.DataFrame(mwst_pred, columns=['y_pred', 'y_true']) \
    .assign(y_true=lambda d: d['y_true'].map({'yes': 1, 'no': 0}))

pd.Series([
    roc_auc_score(_y['y_true'], _y['y_pred']),
    average_precision_score(_y['y_true'], _y['y_pred'])
], ['AUC_ROC', 'AUC_PR'])
[27]:
AUC_ROC    0.941170
AUC_PR     0.939788
dtype: float64

Interestingly, the NB model does marginally better on both AUC-ROC and AUC-PR.
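If you want to see the curves behind these summary numbers, here is a minimal sketch using scikit-learn’s roc_curve, reusing the nb_pred and mwst_pred predictions from above.

from sklearn.metrics import roc_curve

fig, ax = plt.subplots(figsize=(5, 4))

for label, pred in [('NB', nb_pred), ('MWST', mwst_pred)]:
    # convert the (P(yes), true label) pairs into numeric arrays
    _df = pd.DataFrame(pred, columns=['y_pred', 'y_true']) \
        .assign(y_true=lambda d: d['y_true'].map({'yes': 1, 'no': 0}))
    fpr, tpr, _ = roc_curve(_df['y_true'], _df['y_pred'])
    ax.plot(fpr, tpr, label=label)

ax.plot([0, 1], [0, 1], color='gray', linestyle='--', label='chance')
ax.set_xlabel('false positive rate')
ax.set_ylabel('true positive rate')
ax.legend()
fig.tight_layout()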

How does age impact diabetes?

[28]:
def get_node_impact(name, value, model):
    # observe a single node at the given value and read off P(diabetes=yes)
    model.unobserve_all()
    model.update_evidences([get_evidence(name, value, model)])
    p = model.get_posteriors()['diabetes']['yes']
    return {'name': name, 'value': value, 'p': p}

def get_impact(name, model):
    # compute the impact of each observed value of a node on diabetes
    return pd.DataFrame([get_node_impact(name, value, model)
                         for value in X_tr[name].unique()]) \
        .assign(lift=lambda d: d['p'] / 0.5) \
        .sort_values(['lift'], ascending=False)

If we observe a person classified as a senior, then the probability of having diabetes is 73%.

[29]:
get_impact('age', nb_tree)
[29]:
  name   value         p      lift
1  age  senior  0.730108  1.460217
0  age   adult  0.469881  0.939762
3  age    teen  0.092896  0.185792
2  age   young  0.029170  0.058341

How does gender impact diabetes?

Gender does not seem to impact whether one has diabetes or not.

[30]:
get_impact('gender', nb_tree)
[30]:
     name   value         p      lift
0  gender    male  0.531937  1.063873
1  gender  female  0.473953  0.947906

How does BMI impact diabetes?

As a person’s BMI increases, so does the probability of diabetes.

[31]:
get_impact('bmi', nb_tree)
[31]:
  name        value         p      lift
0  bmi  obesity_iii  0.787834  1.575668
4  bmi   obesity_ii  0.733740  1.467480
2  bmi    obesity_i  0.634202  1.268405
3  bmi   overweight  0.459665  0.919330
1  bmi       normal  0.308485  0.616971
5  bmi  underweight  0.068299  0.136598

How does heart disease impact diabetes?

Observing a person with heart disease increases the probability of having diabetes to 83%.

[32]:
get_impact('heart_disease', nb_tree)
[32]:
            name value         p      lift
1  heart_disease   yes  0.827834  1.655667
0  heart_disease    no  0.467136  0.934273

How does smoking status impact diabetes?

It looks like being a former smoker increases the probability of having diabetes. Here’s an attempt to explain this observation: people who quit smoking tend to gain weight because they replace the bad habit of smoking with the bad habits of snacking and overeating; hence, former smokers eat more (and probably less healthily), and as a result, their probability of diabetes increases.

[33]:
get_impact('smoking', nb_tree)
[33]:
      name        value         p      lift
2  smoking       former  0.691033  1.382066
4  smoking  not_current  0.557435  1.114871
3  smoking         ever  0.553741  1.107483
1  smoking      current  0.552143  1.104287
0  smoking        never  0.537944  1.075889
5  smoking      no_info  0.310254  0.620508