Diabetes

Let’s work with a dataset from Kaggle about people with diabetes. There are 100,000 records with 8 features and 1 class label, and there is severe class imbalance. We will use PySpark-BBN to learn two models: Naive Bayes (NB) and Maximum Weight Spanning Tree (MWST).

Load data

We will load the data using Pandas and do all of the data manipulation there before transferring the result to Spark. Note that the continuous variables have been discretized as follows (a quick illustration of the binning appears after the list).

  • age

    • \([0, 12)\) to young

    • \([12, 18)\) to teen

    • \([18, 65)\) to adult

    • \([65, 100]\) to senior

  • bmi

    • \([0, 18.5)\) to underweight

    • \([18.5, 25)\) to normal

    • \([25, 30)\) to overweight

    • \([30, 35)\) to obesity_i

    • \([35, 40)\) to obesity_ii

    • \([40, 100]\) to obesity_iii

  • blood_glucose

    • \([0, 100)\) to normal

    • \([100, 200)\) to prediabetic

    • \([200, 500]\) to diabetic

  • hba1c

    • \([0, 5.7)\) to normal

    • \([5.7, 6.5)\) to prediabetic

    • \([6.5, 10]\) to diabetic
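As a quick illustration of how the binning works, here is a minimal, hypothetical example using pd.cut with made-up ages (one per bin); the real preprocessing below applies the same pattern to every continuous variable.

import pandas as pd

# hypothetical ages, one per bin
ages = pd.Series([5, 15, 40, 70])
print(pd.cut(ages, [0, 12, 18, 65, 100], include_lowest=True,
             labels=['young', 'teen', 'adult', 'senior']).tolist())
# ['young', 'teen', 'adult', 'senior']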

The following binary variables had 0 and 1 values; these were transformed to no and yes, respectively.

  • hypertension

  • heart_disease

  • diabetes

Since there is class imbalance (there were far fewer people with diabetes than without), we undersampled the majority class to balance the two. Lastly, records whose gender was neither male nor female (unspecified) were removed, since an unspecified gender is effectively missing data.

[1]:
import pandas as pd

def get_data():
    cuts = {
        'age': [0, 12, 18, 65, 100],
        'bmi': [0, 18.5, 25, 30, 35, 40, 100],
        'blood_glucose': [0, 100, 200, 500],
        'hba1c': [0, 5.7, 6.5, 10]
    }
    labs = {
        'age': ['young', 'teen', 'adult', 'senior'],
        'bmi': ['underweight', 'normal', 'overweight', 'obesity_i', 'obesity_ii', 'obesity_iii'],
        'blood_glucose': ['normal', 'prediabetic', 'diabetic'],
        'hba1c': ['normal', 'prediabetic', 'diabetic']
    }

    pdf = pd.read_csv('./data/diabetes_prediction_dataset.csv') \
        .rename(columns={
            'smoking_history': 'smoking',
            'HbA1c_level': 'hba1c',
            'blood_glucose_level': 'blood_glucose'
        }) \
        .assign(
            gender=lambda d: d['gender'].str.lower(),
            age=lambda d: pd.cut(d['age'], cuts['age'], include_lowest=True, labels=labs['age']),
            hypertension=lambda d: d['hypertension'].map({0: 'no', 1: 'yes'}),
            heart_disease=lambda d: d['heart_disease'].map({0: 'no', 1: 'yes'}),
            smoking=lambda d: d['smoking'].apply(lambda s: s.lower().replace(' ', '_').strip()),
            bmi=lambda d: pd.cut(d['bmi'], cuts['bmi'], include_lowest=True, labels=labs['bmi']),
            hba1c=lambda d: pd.cut(d['hba1c'], cuts['hba1c'], include_lowest=True, labels=labs['hba1c']),
            blood_glucose=lambda d: pd.cut(d['blood_glucose'], cuts['blood_glucose'], include_lowest=True, labels=labs['blood_glucose']),
            diabetes=lambda d: d['diabetes'].map({0: 'no', 1: 'yes'})
        ) \
        .query('gender != "other"')

    # undersample the majority (no diabetes) class down to the size of the
    # minority (diabetes) class; the commented lines show the oversampling
    # alternative
    # pos_df = pdf[pdf['diabetes']=='yes'].sample(n=pdf[pdf['diabetes']=='no'].shape[0], replace=True, random_state=37)
    # neg_df = pdf[pdf['diabetes']=='no']
    pos_df = pdf[pdf['diabetes']=='yes']
    neg_df = pdf[pdf['diabetes']=='no'].sample(n=pdf[pdf['diabetes']=='yes'].shape[0], replace=False, random_state=37)

    pdf = pd.concat([pos_df, neg_df]).reset_index(drop=True)

    return pdf

Xy = get_data()
X, y = Xy[Xy.columns.drop(['diabetes'])], Xy['diabetes']

Xy.shape, X.shape, y.shape
[1]:
((17000, 9), (17000, 8), (17000,))

Split data

The data was split into training (tr) and testing (te) folds using stratified shuffle splitting. We take only the training and testing folds from the first split.

[2]:
from sklearn.model_selection import StratifiedShuffleSplit


tr_index, te_index = next(StratifiedShuffleSplit(random_state=37, test_size=0.1).split(X, y))

X_tr, y_tr = X.iloc[tr_index], y.iloc[tr_index]
X_te, y_te = X.iloc[te_index], y.iloc[te_index]

X_tr.shape, y_tr.shape, X_te.shape, y_te.shape
[2]:
((15300, 8), (15300,), (1700, 8), (1700,))
[3]:
Xy_tr = X_tr.assign(diabetes=y_tr)
Xy_te = X_te.assign(diabetes=y_te)

Xy_tr.shape, Xy_te.shape
[3]:
((15300, 9), (1700, 9))

Spark

Since PySpark-BBN operates on Spark, we transfer the training data from Pandas to Spark.

[4]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('diabetes') \
    .master('local[*]') \
    .getOrCreate()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/10 17:41:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/09/10 17:41:29 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
[5]:
sdf = spark.createDataFrame(Xy_tr).cache()
sdf.count()

[5]:
15300
[6]:
sdf.printSchema()
root
 |-- gender: string (nullable = true)
 |-- age: string (nullable = true)
 |-- hypertension: string (nullable = true)
 |-- heart_disease: string (nullable = true)
 |-- smoking: string (nullable = true)
 |-- bmi: string (nullable = true)
 |-- hba1c: string (nullable = true)
 |-- blood_glucose: string (nullable = true)
 |-- diabetes: string (nullable = true)

[7]:
sdf.show(5)
+------+------+------------+-------------+-------+-----------+-----------+-------------+--------+
|gender|   age|hypertension|heart_disease|smoking|        bmi|      hba1c|blood_glucose|diabetes|
+------+------+------------+-------------+-------+-----------+-----------+-------------+--------+
|  male| adult|          no|           no|  never|obesity_iii|prediabetic|       normal|      no|
|  male|senior|          no|          yes|current|     normal|   diabetic|  prediabetic|     yes|
|female| adult|          no|           no|  never|  obesity_i|prediabetic|  prediabetic|     yes|
|female| adult|          no|           no| former| overweight|   diabetic|  prediabetic|     yes|
|female| adult|          no|           no|current|     normal|     normal|  prediabetic|      no|
+------+------+------------+-------------+-------+-----------+-----------+-------------+--------+
only showing top 5 rows

Naive Bayes (NB)

Now we will learn an NB model. The class variable is diabetes, and this node will have a directed arc to every other node. Obviously, the NB model is structurally or causally wrong if we interpret it as a causal one, since diabetes cannot cause some of the variables (e.g., age and gender). But let’s keep playing along.

NB, structure learning

[8]:
from pysparkbbn.discrete.data import DiscreteData
from pysparkbbn.discrete.scblearn import Naive

clazz = 'diabetes'
data = DiscreteData(sdf)
naive = Naive(data, clazz)
g = naive.get_network()

print('')
print('nodes')
print('-' * 10)
for n in g.nodes():
    print(f'{n}')

print('')
print('edges')
print('-' * 10)
for pa, ch in g.edges():
    print(f'{pa} -> {ch}')
23/09/10 17:41:33 WARN CacheManager: Asked to cache already cached data.

nodes
----------
diabetes
gender
age
hypertension
heart_disease
smoking
bmi
hba1c
blood_glucose

edges
----------
diabetes -> gender
diabetes -> age
diabetes -> hypertension
diabetes -> heart_disease
diabetes -> smoking
diabetes -> bmi
diabetes -> hba1c
diabetes -> blood_glucose

NB, parameter learning

Once we have the NB structure, we can use it to learn the parameters.

[9]:
from pysparkbbn.discrete.plearn import ParamLearner
import json

param_learner = ParamLearner(data, g)
p = param_learner.get_params()
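If you want to eyeball the learned parameters, something like the following should work. This is a sketch; it assumes the object returned by get_params() is a plain, JSON-serializable structure, which is why json was imported above.

# peek at the first part of the learned parameters
# (assumes p is JSON-serializable)
print(json.dumps(p, indent=2)[:500])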

NB, posteriors

The posteriors of the NB model are printed below, in case you are interested.

[10]:
from pysparkbbn.discrete.bbn import get_bbn
from pybbn.pptc.inferencecontroller import InferenceController

bbn = get_bbn(g, p, data.get_profile())
join_tree = InferenceController.apply(bbn)
nb_tree = join_tree
[12]:
import numpy as np
import matplotlib.pyplot as plt

posteriors = join_tree.get_posteriors()

fig, axes = plt.subplots(3, 3, figsize=(10, 7))

for ax, (node, p) in zip(np.ravel(axes), posteriors.items()):
    pd.Series(p).plot(kind='pie', ax=ax)
    ax.set_title(f'{node}')
    ax.set_ylabel('')

fig.tight_layout()
_images/use-case-diabetes_18_0.png

NB, lift

In this type of lift analysis, we want to see which node, at its extreme value, gives the highest lift to diabetes. Keep an eye on the blood_glucose variable: setting it to diabetic makes diabetes absolutely certain (100%).
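Here, lift is the posterior probability of diabetes given the evidence divided by its prior, which is \(0.5\) because we balanced the classes:

\[\text{lift}(e) = \frac{P(\text{diabetes}=\text{yes} \mid e)}{P(\text{diabetes}=\text{yes})} = \frac{P(\text{diabetes}=\text{yes} \mid e)}{0.5}\]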

[13]:
from pybbn.graph.jointree import EvidenceBuilder

def get_sensitivity(name, value):
    # clear any prior observations, then observe a single node at the given value
    join_tree.unobserve_all()

    ev = EvidenceBuilder() \
        .with_node(join_tree.get_bbn_node_by_name(name)) \
        .with_evidence(value, 1.0) \
        .build()
    join_tree.set_observation(ev)

    # return the posterior of diabetes alongside the evidence that produced it
    meta = {'name': name, 'value': value}
    post = join_tree.get_posteriors()['diabetes']

    return {**meta, **post}

n2v = {
    'hypertension': 'yes',
    'heart_disease': 'yes',
    'gender': 'male',
    'hba1c': 'diabetic',
    'blood_glucose': 'diabetic',
    'age': 'senior',
    'smoking': 'current',
    'bmi': 'obesity_iii'
}

nb_sen = pd.DataFrame([get_sensitivity(name, value) for name, value in n2v.items()]) \
    .assign(lift=lambda d: d['yes'] / 0.5) \
    .sort_values(['lift'], ascending=False) \
    .rename(columns={'lift': 'diabetes_lift'})

nb_sen
[13]:
            name        value        no       yes  diabetes_lift
4  blood_glucose     diabetic  0.000000  1.000000       2.000000
3          hba1c     diabetic  0.139837  0.860163       1.720327
1  heart_disease          yes  0.172166  0.827834       1.655667
0   hypertension          yes  0.194220  0.805780       1.611560
7            bmi  obesity_iii  0.212166  0.787834       1.575668
5            age       senior  0.269892  0.730108       1.460217
6        smoking      current  0.447857  0.552143       1.104287
2         gender         male  0.468063  0.531937       1.063873

Tree

Now we will learn an MWST model.

Tree, structure learning

Nothing interesting here, except that diabetes and heart_disease have arcs going into age. We will reverse these arcs so that age points to diabetes and heart_disease instead.

[14]:
from pysparkbbn.discrete.scblearn import Mwst

data = DiscreteData(sdf)
mwst = Mwst(data)
g = mwst.get_network()

print('')
print('nodes')
print('-' * 10)
for n in g.nodes():
    print(f'{n}')

print('')
print('edges')
print('-' * 10)
for pa, ch in g.edges():
    print(f'{pa} -> {ch}')
23/09/10 17:42:41 WARN CacheManager: Asked to cache already cached data.


nodes
----------
diabetes
blood_glucose
hba1c
age
hypertension
bmi
smoking
heart_disease
gender

edges
----------
diabetes -> blood_glucose
diabetes -> hba1c
diabetes -> age
age -> bmi
age -> smoking
hypertension -> diabetes
heart_disease -> age
gender -> smoking

Since the object returned from the algorithm is a networkx graph, it’s easy to do graph surgery (like removing and adding arcs). Just be careful not to introduce cycles. In this case, we were lucky that reversing the arcs did not create any; in general, graph surgery can easily create directed cycles, so it is worth checking (a quick acyclicity check follows the surgery below).

[15]:
g.remove_edge('diabetes', 'age')
g.remove_edge('heart_disease', 'age')
g.add_edge('age', 'diabetes')
g.add_edge('age', 'heart_disease')
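Because parameter learning requires a directed acyclic graph, we can verify that the surgery did not introduce a cycle; networkx makes this a one-liner.

import networkx as nx

# sanity check: the surgically altered structure must still be a DAG
assert nx.is_directed_acyclic_graph(g)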

The learned MWST looks reasonable.

[16]:
import networkx as nx
import matplotlib.pyplot as plt

pos = nx.nx_pydot.graphviz_layout(g, prog='dot')

fig, ax = plt.subplots(figsize=(10, 5))

nx.draw(g, pos, ax=ax, with_labels=True, node_size=100, node_color='#e0e0e0')

fig.tight_layout()
_images/use-case-diabetes_27_0.png

Tree, parameter learning

Nothing interesting here; we are just learning the MWST parameters.

[17]:
param_learner = ParamLearner(data, g)
p = param_learner.get_params()

Tree, posteriors

The posteriors of the MWST model are printed below.

[18]:
# create the BBN and Join Tree

bbn = get_bbn(g, p, data.get_profile())
join_tree = InferenceController.apply(bbn)
mwst_tree = join_tree
[19]:
posteriors = join_tree.get_posteriors()

fig, axes = plt.subplots(3, 3, figsize=(10, 7))

for ax, (node, p) in zip(np.ravel(axes), posteriors.items()):
    pd.Series(p).plot(kind='pie', ax=ax)
    ax.set_title(f'{node}')
    ax.set_ylabel('')

fig.tight_layout()
_images/use-case-diabetes_32_0.png

Tree, lift

Let’s compare the lift analysis of the NB and MWST models. Here are some observations.

  • blood_glucose provides the most lift to diabetes

  • in the MWST model, age ranks pretty high, but in the NB model, age is near the bottom

  • smoking and gender do not provide much lift to diabetes

[20]:
mwst_sen = pd.DataFrame([get_sensitivity(name, value) for name, value in n2v.items()]) \
    .assign(lift=lambda d: d['yes'] / 0.5) \
    .sort_values(['lift'], ascending=False) \
    .rename(columns={'lift': 'diabetes_lift'})
mwst_sen
[20]:
            name        value        no       yes  diabetes_lift
4  blood_glucose     diabetic  0.000000  1.000000       2.000000
3          hba1c     diabetic  0.137192  0.862808       1.725616
0   hypertension          yes  0.229637  0.770363       1.540726
5            age       senior  0.281937  0.718063       1.436127
1  heart_disease          yes  0.372553  0.627447       1.254893
7            bmi  obesity_iii  0.474390  0.525610       1.051220
6        smoking      current  0.487581  0.512419       1.024838
2         gender         male  0.494459  0.505541       1.011082
[21]:
nb_sen
[21]:
            name        value        no       yes  diabetes_lift
4  blood_glucose     diabetic  0.000000  1.000000       2.000000
3          hba1c     diabetic  0.139837  0.860163       1.720327
1  heart_disease          yes  0.172166  0.827834       1.655667
0   hypertension          yes  0.194220  0.805780       1.611560
7            bmi  obesity_iii  0.212166  0.787834       1.575668
5            age       senior  0.269892  0.730108       1.460217
6        smoking      current  0.447857  0.552143       1.104287
2         gender         male  0.468063  0.531937       1.063873

If we do a cross tabulation between diabetes and blood_glucose, we see that people with diabetes never have a normal blood_glucose; they always have a prediabetic or diabetic blood_glucose! Likewise, people without diabetes never have a diabetic blood_glucose. No wonder this problem seems so easy.

[22]:
pd.crosstab(Xy_tr['blood_glucose'], Xy_tr['diabetes'])
[22]:
diabetes         no   yes
blood_glucose
normal         2347     0
prediabetic    5303  4698
diabetic          0  2952
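The same relationship is even clearer as row-conditional probabilities; here is a small sketch using pandas’ normalize option.

# P(diabetes | blood_glucose): note the structural zeros
pd.crosstab(Xy_tr['blood_glucose'], Xy_tr['diabetes'], normalize='index')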

Validation

Now let’s validate the NB and MWST models with the testing fold. We will measure performance using the Area Under the Curve (AUC) for the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves.

[23]:
def get_evidence(name, value, model):
    # build a hard-evidence object for a single node
    ev = EvidenceBuilder() \
        .with_node(model.get_bbn_node_by_name(name)) \
        .with_evidence(value, 1.0) \
        .build()
    return ev

def get_evidences(data, model):
    # evidence for every feature; the class variable is withheld
    return [get_evidence(name, value, model) for name, value in data.items() if name != 'diabetes']

def infer(data, model):
    # returns (predicted P(diabetes=yes), true label)
    model.unobserve_all()
    evidences = get_evidences(data, model)
    model.update_evidences(evidences)
    post = model.get_posteriors()['diabetes']

    return post['yes'], data['diabetes']
[24]:
nb_pred = [infer(data, nb_tree) for data in Xy_te.to_dict(orient='records')]
[25]:
mwst_pred = [infer(data, mwst_tree) for data in Xy_te.to_dict(orient='records')]

The performance of the NB model is shown below.

[26]:
from sklearn.metrics import roc_auc_score, average_precision_score

_y = pd.DataFrame(nb_pred, columns=['y_pred', 'y_true']) \
    .assign(y_true=lambda d: d['y_true'].map({'yes': 1, 'no': 0}))

pd.Series([
    roc_auc_score(_y['y_true'], _y['y_pred']),
    average_precision_score(_y['y_true'], _y['y_pred'])
], ['AUC_ROC', 'AUC_PR'])

[26]:
AUC_ROC    0.948330
AUC_PR     0.951899
dtype: float64

The performance of the MWST model is shown below.

[27]:
_y = pd.DataFrame(mwst_pred, columns=['y_pred', 'y_true']) \
    .assign(y_true=lambda d: d['y_true'].map({'yes': 1, 'no': 0}))

pd.Series([
    roc_auc_score(_y['y_true'], _y['y_pred']),
    average_precision_score(_y['y_true'], _y['y_pred'])
], ['AUC_ROC', 'AUC_PR'])
[27]:
AUC_ROC    0.941170
AUC_PR     0.939788
dtype: float64

Interestingly, the NB model does marginally better on both AUC-ROC and AUC-PR.
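If you want to see the curves behind these summary numbers, here is a minimal sketch using scikit-learn’s roc_curve, reusing the nb_pred and mwst_pred predictions from above.

from sklearn.metrics import roc_curve

fig, ax = plt.subplots(figsize=(5, 4))

for label, pred in [('NB', nb_pred), ('MWST', mwst_pred)]:
    # convert the (P(yes), true label) pairs into numeric arrays
    _df = pd.DataFrame(pred, columns=['y_pred', 'y_true']) \
        .assign(y_true=lambda d: d['y_true'].map({'yes': 1, 'no': 0}))
    fpr, tpr, _ = roc_curve(_df['y_true'], _df['y_pred'])
    ax.plot(fpr, tpr, label=label)

ax.plot([0, 1], [0, 1], color='gray', linestyle='--', label='chance')
ax.set_xlabel('false positive rate')
ax.set_ylabel('true positive rate')
ax.legend()
fig.tight_layout()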

How does age impact diabetes?

[28]:
def get_node_impact(name, value, model):
    # observe a single node at the given value and read off P(diabetes=yes)
    model.unobserve_all()
    model.update_evidences([get_evidence(name, value, model)])
    p = model.get_posteriors()['diabetes']['yes']
    return {'name': name, 'value': value, 'p': p}

def get_impact(name, model):
    # compute the impact of each observed value of a node on diabetes
    return pd.DataFrame([get_node_impact(name, value, model)
                         for value in X_tr[name].unique()]) \
        .assign(lift=lambda d: d['p'] / 0.5) \
        .sort_values(['lift'], ascending=False)

If we observe a person classified as a senior, then the probability of having diabetes is 73%.

[29]:
get_impact('age', nb_tree)
[29]:
  name   value         p      lift
1  age  senior  0.730108  1.460217
0  age   adult  0.469881  0.939762
3  age    teen  0.092896  0.185792
2  age   young  0.029170  0.058341

How does gender impact diabetes?

Gender does not seem to impact whether one has diabetes or not.

[30]:
get_impact('gender', nb_tree)
[30]:
     name   value         p      lift
0  gender    male  0.531937  1.063873
1  gender  female  0.473953  0.947906

How does BMI impact diabetes?

As a person’s BMI increases, so does the probability of diabetes.

[31]:
get_impact('bmi', nb_tree)
[31]:
  name        value         p      lift
0  bmi  obesity_iii  0.787834  1.575668
4  bmi   obesity_ii  0.733740  1.467480
2  bmi    obesity_i  0.634202  1.268405
3  bmi   overweight  0.459665  0.919330
1  bmi       normal  0.308485  0.616971
5  bmi  underweight  0.068299  0.136598

How does heart disease impact diabetes?

Observing a person with heart disease increases the probability of having diabetes to 83%.

[32]:
get_impact('heart_disease', nb_tree)
[32]:
            name value         p      lift
1  heart_disease   yes  0.827834  1.655667
0  heart_disease    no  0.467136  0.934273

How does smoking status impact diabetes?

It looks like being a former smoker increases the probability of having diabetes. Here’s an attempt to explain this observation: people who quit smoking tend to gain weight because they replace the bad habit of smoking with the bad habits of snacking and overeating; hence, former smokers eat more (and probably less healthily), and as a result, their probability of diabetes increases.

[33]:
get_impact('smoking', nb_tree)
[33]:
      name        value         p      lift
2  smoking       former  0.691033  1.382066
4  smoking  not_current  0.557435  1.114871
3  smoking         ever  0.553741  1.107483
1  smoking      current  0.552143  1.104287
0  smoking        never  0.537944  1.075889
5  smoking      no_info  0.310254  0.620508