Housing Prices

This data is taken from Kaggle and concerns predicting housing prices. The dataset is tagged on Kaggle as a classification problem, but the price field is clearly continuous, so the task is more naturally a regression problem. Since we will be learning discrete Bayesian networks, we have discretized the features as follows.

  • price: [low, medium, high]

  • area: [very_low, low, medium, high, very_high]

  • bedrooms: [low, medium, high]

  • bathrooms: [low, high]

  • stories: [low, medium, high]

  • parking: [low, medium, high]

The price and area features are discretized using univariate k-means clustering, with the cluster labels remapped so that they are ordered by cluster center. The other features are binned by manual inspection (work omitted for brevity; a rough sketch of what that inspection might look like follows).
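As a rough illustration of that inspection (a minimal sketch, not the exact steps used), looking at the raw value counts of the integer-valued columns suggests where to place the cut points used in the next cell:

[ ]:
import pandas as pd

# raw value counts of the integer-valued columns; e.g., bedrooms concentrates
# around 2-4, motivating the [0, 2], (2, 3], (3, 10] bins used below
raw = pd.read_csv('./data/Housing.csv')
for c in ['bedrooms', 'bathrooms', 'stories', 'parking']:
    print(raw[c].value_counts().sort_index(), '\n')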

[1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

def discretize(column, df, n_clusters=3):
    X = df[[column]]

    kmeans = KMeans(n_clusters=n_clusters, random_state=37)
    kmeans.fit(X)

    # relabel clusters in ascending order of their centers, so that 0 is
    # always the lowest-valued cluster, 1 the next lowest, and so on
    centers = np.ravel(kmeans.cluster_centers_)
    c2v = {c: v for v, (c, _) in enumerate(sorted(enumerate(centers), key=lambda tup: tup[1]))}

    y = pd.Series(kmeans.predict(X)).map(c2v)
    return y

df = pd.read_csv('./data/Housing.csv') \
    .assign(
        price=lambda d: discretize('price', d).map({0: 'low', 1: 'medium', 2: 'high'}),
        area=lambda d: discretize('area', d, 5).map({0: 'very_low', 1: 'low', 2: 'medium', 3: 'high', 4: 'very_high'}),
        bedrooms=lambda d: pd.cut(d['bedrooms'], [0, 2, 3, 10], include_lowest=True, labels=['low', 'medium', 'high']),
        bathrooms=lambda d: pd.cut(d['bathrooms'], [0, 1, 5], include_lowest=True, labels=['low', 'high']),
        stories=lambda d: pd.cut(d['stories'], [0, 1, 2, 5], include_lowest=True, labels=['low', 'medium', 'high']),
        parking=lambda d: d['parking'].map({0: 'low', 1: 'medium', 2: 'high', 3: 'high'})
    )
df.shape
[1]:
(545, 13)
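To sanity-check the discretization, a quick look at the label distributions (output omitted here) is an easy way to confirm that every level is populated:

[ ]:
# label distributions of the discretized features (output omitted)
for c in ['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'parking']:
    print(df[c].value_counts(), '\n')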
[2]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    object
 1   area              545 non-null    object
 2   bedrooms          545 non-null    category
 3   bathrooms         545 non-null    category
 4   stories           545 non-null    category
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    object
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: category(3), object(10)
memory usage: 44.7+ KB

Spark

Now let’s load the data into a Spark DataFrame.

[3]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('housing') \
    .master('local[*]') \
    .getOrCreate()

sdf = spark.createDataFrame(df).cache()
sdf.count()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/10 17:42:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[3]:
545
[4]:
sdf.show(5)
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
|price|area|bedrooms|bathrooms|stories|mainroad|guestroom|basement|hotwaterheating|airconditioning|parking|prefarea|furnishingstatus|
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
| high|high|    high|     high|   high|     yes|       no|      no|             no|            yes|   high|     yes|       furnished|
| high|high|    high|     high|   high|     yes|       no|      no|             no|            yes|   high|      no|       furnished|
| high|high|  medium|     high| medium|     yes|       no|     yes|             no|             no|   high|     yes|  semi-furnished|
| high|high|    high|     high| medium|     yes|       no|     yes|             no|            yes|   high|     yes|       furnished|
| high|high|    high|      low| medium|     yes|      yes|     yes|             no|            yes|   high|      no|       furnished|
+-----+----+--------+---------+-------+--------+---------+--------+---------------+---------------+-------+--------+----------------+
only showing top 5 rows
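Before learning, it is worth confirming the schema Spark inferred; all columns, including the pandas category columns, should come through as strings (output omitted):

[ ]:
# confirm that Spark inferred string types for every column (output omitted)
sdf.printSchema()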

Learning

We will learn two model structures, one with Three-Phase Dependency Analysis (TPDA) and another with a Maximum Weight Spanning Tree (MWST). For each structure, we then learn the parameters and build a join tree for inference.

[5]:
from pysparkbbn.discrete.data import DiscreteData
from pysparkbbn.discrete.plearn import ParamLearner
from pysparkbbn.discrete.scblearn import Tpda, Mwst
from pysparkbbn.discrete.bbn import get_bbn
from pybbn.pptc.inferencecontroller import InferenceController

data = DiscreteData(sdf)

# TPDA: learn the structure, estimate the parameters, then build a join tree for inference
g_tpda = Tpda(data, cmi_threshold=0.05).get_network()
p_tpda = ParamLearner(data, g_tpda).get_params()
t_tpda = InferenceController.apply(get_bbn(g_tpda, p_tpda, data.get_profile()))

# MWST: the same pipeline with a maximum weight spanning tree structure
g_mwst = Mwst(data).get_network()
p_mwst = ParamLearner(data, g_mwst).get_params()
t_mwst = InferenceController.apply(get_bbn(g_mwst, p_mwst, data.get_profile()))
23/09/10 17:42:10 WARN CacheManager: Asked to cache already cached data.
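We can also inspect the learned structures directly. Assuming get_network() returns a networkx DiGraph (the plotting code below treats it as one), the edge lists offer a quick textual comparison of the two models (output omitted):

[ ]:
# directed edges of each learned structure (output omitted)
print(sorted(g_tpda.edges()))
print(sorted(g_mwst.edges()))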

Here’s a plot of the structures learned.

[9]:
import networkx as nx
import matplotlib.pyplot as plt

pos_tpda = nx.nx_pydot.graphviz_layout(g_tpda, prog='dot')
pos_mwst = nx.nx_pydot.graphviz_layout(g_mwst, prog='dot')

fig, ax = plt.subplots(2, 1, figsize=(7, 7))
ax = np.ravel(ax)

nx.draw(
    g_tpda,
    pos_tpda,
    ax=ax[0],
    with_labels=True,
    node_size=10,
    node_color='#2eb82e',
    edge_color='#4da6ff',
    arrowsize=10,
    min_target_margin=5,
    # draw only nodes with at least one neighbor (hide isolated nodes)
    nodelist=[n for n in g_tpda.nodes() if len(list(nx.to_undirected(g_tpda).neighbors(n))) > 0]
)

nx.draw(
    g_mwst,
    pos_mwst,
    ax=ax[1],
    with_labels=True,
    node_size=10,
    node_color='#2eb82e',
    edge_color='#4da6ff',
    arrowsize=10,
    min_target_margin=10,
    nodelist=[n for n in g_mwst.nodes() if len(list(nx.to_undirected(g_mwst).neighbors(n))) > 0]
)

ax[0].set_title('TPDA')
ax[1].set_title('MWST')

fig.tight_layout()
[Figure: learned network structures, TPDA (top) and MWST (bottom)]

Lift

Now let’s see how observing each variable at its “highest” value lifts the probability that the housing price is high. Lift here is the ratio of the posterior to the baseline (marginal) probability: lift(X=x) = P(price=high | X=x) / P(price=high).

[7]:
from pybbn.graph.jointree import EvidenceBuilder

def get_sensitivity(name, value, tree):
    # clear any previous evidence so observations do not accumulate
    tree.unobserve_all()

    # observe a single variable at the given value with certainty
    ev = EvidenceBuilder() \
        .with_node(tree.get_bbn_node_by_name(name)) \
        .with_evidence(value, 1.0) \
        .build()
    tree.set_observation(ev)

    # return the posterior distribution of price under this observation
    meta = {'name': name, 'value': value}
    post = tree.get_posteriors()['price']

    return {**meta, **post}

n2v = {
    'bedrooms': 'high',
    'hotwaterheating': 'yes',
    'area': 'very_high',
    'stories': 'high',
    'bathrooms': 'high',
    'airconditioning': 'yes',
    'guestroom': 'yes',
    'basement': 'yes',
    'mainroad': 'yes',
    'parking': 'high',
    'prefarea': 'yes',
    'furnishingstatus': 'furnished'
}

# baseline (marginal) posteriors of price, with no evidence set
t_tpda.unobserve_all()
h = t_tpda.get_posteriors()['price']['high']
m = t_tpda.get_posteriors()['price']['medium']
l = t_tpda.get_posteriors()['price']['low']

lift_df = pd.DataFrame([get_sensitivity(name, value, t_tpda) for name, value in n2v.items()]) \
    [['name', 'value', 'low', 'medium', 'high']] \
    .assign(
        low_lift=lambda d: d['low'] / l,
        medium_lift=lambda d: d['medium'] / m,
        high_lift=lambda d: d['high'] / h
    )
lift_df \
    .sort_values(['high_lift', 'medium_lift', 'low_lift'], ascending=False) \
    .rename(columns={
        'name': 'variable',
        'high_lift': 'price_lift'
    }) \
    [['variable', 'value', 'price_lift']]
[7]:
            variable      value  price_lift
 2              area  very_high    3.945727
 4         bathrooms       high    2.184265
 5   airconditioning        yes    2.095619
 3           stories       high    1.992604
 0          bedrooms       high    1.389292
 6         guestroom        yes    1.219085
 7          basement        yes    1.043853
 8          mainroad        yes    1.000000
 9           parking       high    1.000000
10          prefarea        yes    1.000000
11  furnishingstatus  furnished    1.000000
 1   hotwaterheating        yes    0.567470
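The same analysis can be repeated against the MWST model by swapping in its join tree. A minimal sketch, reusing get_sensitivity() and n2v from above (output omitted):

[ ]:
# baseline posterior of price under the MWST model, with no evidence set
t_mwst.unobserve_all()
mwst_h = t_mwst.get_posteriors()['price']['high']

# lift of price=high for each single observation, under the MWST model
mwst_lift_df = pd.DataFrame([get_sensitivity(name, value, t_mwst) for name, value in n2v.items()]) \
    .assign(price_lift=lambda d: d['high'] / mwst_h) \
    .rename(columns={'name': 'variable'}) \
    [['variable', 'value', 'price_lift']] \
    .sort_values('price_lift', ascending=False)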