Congressional Support

Let’s take a look at congressional data. The data covers 5,000 legislative bills sponsored in Congress, spanning 411 representatives (or the equivalent), 50 states (including some additional US territories), and 759 “subjects” (e.g., abortion, health, education). In this notebook, we will use Bayesian belief networks (BBNs) to estimate the lift of each state on bill sponsorship. We will also estimate the average causal effect (ACE) of states on subject sponsorship (e.g., the ACE of California on sponsorship of abortion-related bills).

Initialize

Let’s load up the data, which has been transformed from its original “wide” form (each row is a legislative bill) to the “long” form below (each row is essentially a district’s sponsorship outcome).
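
The wide-to-long reshaping itself is not shown in this notebook. As a rough sketch of what such a step could look like, the snippet below melts a tiny, made-up wide table with pandas. The column names (`legislation`, `subject`, and one column per district) and the values are assumptions for illustration only, not the layout of the original file.

```python
import pandas as pd

# Hypothetical wide table: one row per bill, one column per district.
# Names and values are invented for illustration.
wide_df = pd.DataFrame({
    'legislation': ['H.R. 1', 'H.R. 2'],
    'subject': ['health', 'taxation'],
    'AK_99': ['no', 'yes'],
    'AL_01': ['yes', 'no'],
})

# Melt the district columns into rows: one row per (bill, district) pair.
long_df = wide_df.melt(
    id_vars=['legislation', 'subject'],
    var_name='district',
    value_name='sponsor',
)

# Assumption: the state is the prefix of the district identifier (AK_99 -> AK).
long_df['state'] = long_df['district'].str.split('_').str[0]
long_df = long_df[['state', 'district', 'subject', 'legislation', 'sponsor']]
print(long_df)
```

A party column, as in the data loaded below, would still have to be joined in from a separate district-to-party lookup.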

[1]:
import pandas as pd
from pyspark.sql import SparkSession

def get_pdf(subject):
    f = (pdf.subject==subject)
    return pdf[f][['state', 'party', 'sponsor']]

def get_sdf(subject):
    pdf = get_pdf(subject)
    sdf = spark.createDataFrame(pdf)
    return sdf

spark = SparkSession \
    .builder \
    .appName('congress') \
    .master('local[*]') \
    .config('spark.executor.memory', '5g') \
    .config('spark.driver.memory', '10g') \
    .config('spark.memory.offHeap.enabled', 'true') \
    .config('spark.memory.offHeap.size', '5g') \
    .getOrCreate()

states = [
    'AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT',
    'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN',
    'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
    'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV',
    'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH',
    'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN',
    'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI',
    'WY'
]

pdf = pd.read_csv('./data/congress-state-party-subject.csv', low_memory=False)
pdf = pdf[pdf['state'].isin(states)]

pdf.shape
[1]:
(5600190, 6)
[2]:
pdf.head(10)
[2]:
state district subject legislation party sponsor
0 AK AK_99 armed_forces_and_national_security H.R. 4140 D no
1 AL AL_01 armed_forces_and_national_security H.R. 4140 R no
2 AL AL_02 armed_forces_and_national_security H.R. 4140 R no
3 AL AL_03 armed_forces_and_national_security H.R. 4140 R no
4 AL AL_04 armed_forces_and_national_security H.R. 4140 R no
5 AL AL_05 armed_forces_and_national_security H.R. 4140 R no
6 AL AL_06 armed_forces_and_national_security H.R. 4140 R no
7 AL AL_07 armed_forces_and_national_security H.R. 4140 D no
8 AR AR_01 armed_forces_and_national_security H.R. 4140 R no
9 AR AR_02 armed_forces_and_national_security H.R. 4140 R no

BBN

The code below is bootstrap code that builds a BBN from the data so that we can estimate lift and ACE. Every structure is a triplet \((X, Y, Z)\) where

  • \(X \rightarrow Y\),

  • \(Z \rightarrow Y\), and

  • \(Z \rightarrow X\).

In these structures,

  • \(X\) is a collection of states (e.g., all 50 states when computing lift) or an individual state (e.g., CA, VA, or MD when estimating ACE),

  • \(Y\) is the sponsorship (yes or no), and

  • \(Z\) is the confounder, set to the party (e.g., Republican R or Democrat D).

Essentially, we want to know the impact of each state on sponsorship of legislative bill subjects.
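
As a reminder of what the three edges imply, the joint distribution over the triplet factorizes as shown below. This follows from the graph alone and is not specific to any library. Because \(Z\) is a parent of both \(X\) and \(Y\), it acts as a common cause, which is why it is treated as the confounder.

```latex
P(X, Y, Z) = P(Z)\, P(X \mid Z)\, P(Y \mid X, Z)
```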

[3]:
import networkx as nx
import matplotlib.pyplot as plt
from pysparkbbn.discrete.data import DiscreteData
from pysparkbbn.discrete.plearn import ParamLearner
from pysparkbbn.discrete.bbn import get_bbn
from pybbn.pptc.inferencecontroller import InferenceController

def get_structure(x='state', y='sponsor', z='party'):
    g = nx.DiGraph()

    g.add_node(x)
    g.add_node(z)
    g.add_node(y)

    g.add_edge(z, x)
    g.add_edge(z, y)
    g.add_edge(x, y)

    return g

def get_parameters(data, g):
    param_learner = ParamLearner(data, g)
    p = param_learner.get_params()
    return p

def get_jt(subject):
    d = DiscreteData(get_sdf(subject))
    g = get_structure()
    p = get_parameters(d, g)
    b = get_bbn(g, p, d.get_profile())
    t = InferenceController.apply(b)

    return d, g, p, b, t

def plot_structure(g):
    pos = nx.nx_pydot.graphviz_layout(g, prog='dot')

    fig, ax = plt.subplots(figsize=(3, 3))

    nx.draw(
        g,
        pos,
        ax=ax,
        with_labels=True,
        node_size=10,
        node_color='#2eb82e',
        edge_color='#4da6ff',
        arrowsize=10,
        min_target_margin=5
    )

    ax.set_title('Congressional Support')

    fig.tight_layout()

Lift

The state with the highest lift is computed below for each of the top 10 subjects (the most frequently occurring subjects).

[4]:
from pybbn.graph.jointree import EvidenceBuilder

def get_posterior(state, t, baseline_p):
    ev = EvidenceBuilder() \
        .with_node(t.get_bbn_node_by_name('state')) \
        .with_evidence(state, 1.0) \
        .build()

    t.unobserve_all()
    t.set_observation(ev)

    posterior = t.get_posteriors()['sponsor']
    lift = {
        'lift_no': posterior['no'] / baseline_p['no'],
        'lift_yes': posterior['yes'] / baseline_p['yes']
    }
    baseline = {
        'base_no': baseline_p['no'],
        'base_yes': baseline_p['yes']
    }
    posterior = {
        'post_no': posterior['no'],
        'post_yes': posterior['yes']
    }


    d = {'state': state}
    d = {**d, **baseline}
    d = {**d, **posterior}
    d = {**d, **lift}

    return d

def get_lift(t, subject):
    t.unobserve_all()
    baseline_p = t.get_posteriors()['sponsor']

    return pd.DataFrame([get_posterior(state, t, baseline_p) for state in pdf.state.unique()]) \
        .assign(subject=subject) \
        .sort_values(['lift_yes'], ascending=False) \
        .head(1)

def f(subject):
    *_, t = get_jt(subject)
    return get_lift(t, subject)

You can see that DE provides the highest lift for legislative bills associated with the “health” subject. The columns below are explained as follows.

  • state: the state (1 of 50) that generates the highest lift for sponsorship (yes),

  • base_no: the marginal probability of no sponsorship,

  • base_yes: the marginal probability of yes sponsorship,

  • post_no: the posterior probability of no sponsorship given the corresponding state,

  • post_yes: the posterior probability of yes sponsorship given the corresponding state,

  • lift_no: the lift of no sponsorship (i.e., post_no / base_no),

  • lift_yes: the lift of yes sponsorship (i.e., post_yes / base_yes),

  • subject: the legislative bill subject
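
To make the definition of lift concrete, here is a minimal sketch that computes it directly from counts with pandas, without a BBN. The records are made up for illustration and the variable names (`toy`, `base`, `post`, `lift`) are not part of the notebook.

```python
import pandas as pd

# Toy long-form records (invented, not taken from the data set).
toy = pd.DataFrame({
    'state':   ['DE', 'DE', 'DE', 'DE', 'CA', 'CA', 'CA', 'CA', 'CA', 'CA'],
    'sponsor': ['yes', 'yes', 'no', 'no', 'yes', 'no', 'no', 'no', 'no', 'no'],
})

# Baseline: marginal P(sponsor).
base = toy['sponsor'].value_counts(normalize=True)

# Posterior: P(sponsor | state), one row per state.
post = pd.crosstab(toy['state'], toy['sponsor'], normalize='index')

# Lift = posterior / baseline, aligned on the 'no' and 'yes' columns.
lift = post / base
print(lift)
```

A lift above 1 means the state sponsors more often than the baseline, and a lift below 1 means it sponsors less often. The same ratio can be checked against the DE row reported for health in this notebook: 0.126294 / 0.037186 is approximately 3.396, the value of lift_yes.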

[5]:
lift_df = pd.concat([f(subject) for subject in pdf.groupby(['subject']).size().sort_values(ascending=False).index[:10]])
lift_df

[5]:
state base_no base_yes post_no post_yes lift_no lift_yes subject
7 DE 0.962814 0.037186 0.873706 0.126294 0.907451 3.396263 health
25 MT 0.969950 0.030050 0.884615 0.115385 0.912021 3.839793 international_affairs
25 MT 0.975434 0.024566 0.870833 0.129167 0.892765 5.257993 government_operations_and_politics
29 NH 0.973497 0.026503 0.935252 0.064748 0.960714 2.443058 armed_forces_and_national_security
10 HI 0.954627 0.045373 0.925182 0.074818 0.969156 1.648946 crime_and_law_enforcement
10 HI 0.963228 0.036772 0.940476 0.059524 0.976380 1.618705 congressional_oversight
48 WV 0.970921 0.029079 0.938247 0.061753 0.966348 2.123622 taxation
25 MT 0.984835 0.015165 0.844538 0.155462 0.857542 10.251592 congress
10 HI 0.965579 0.034421 0.902703 0.097297 0.934882 2.826715 education
10 HI 0.962791 0.037209 0.921512 0.078488 0.957126 2.109375 government_information_and_archives

Average Causal Effect (ACE)

Now we will see the ACE of each state on sponsorship for selected subjects.
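
One common way to define the ACE under the structure used here is backdoor adjustment over the confounder: average the sponsorship probability over party strata for "in the state" and "not in the state", then take the difference. The sketch below does this by hand on made-up records. It is meant to illustrate the idea and is an assumption about the general technique, not a description of how the library computes its result internally. The helper `adjusted_p_yes` and the column names `x`, `z`, and `y` are hypothetical.

```python
import pandas as pd

# Toy records (invented): x = representative is from the state of interest,
# z = party (the confounder), y = sponsorship outcome.
toy = pd.DataFrame({
    'x': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no'],
    'z': ['D', 'D', 'R', 'R', 'D', 'D', 'D', 'D', 'R', 'R', 'R', 'R'],
    'y': ['yes', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'yes'],
})

def adjusted_p_yes(df, x_value):
    """Backdoor adjustment: sum over z of P(y=yes | x, z) * P(z).

    Assumes every (x, z) stratum is non-empty; otherwise the mean is NaN.
    """
    p_z = df['z'].value_counts(normalize=True)
    total = 0.0
    for z_value, weight in p_z.items():
        stratum = df[(df['x'] == x_value) & (df['z'] == z_value)]
        total += (stratum['y'] == 'yes').mean() * weight
    return total

# Difference between the adjusted probabilities for x = yes and x = no.
ace = adjusted_p_yes(toy, 'yes') - adjusted_p_yes(toy, 'no')
print(ace)
```

A positive value means that, after accounting for party, being from the state is associated with more sponsorship, and a negative value means less.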

[6]:
from pybbn.causality.ace import Ace

def get_states(subject):
    return list(pdf.query(f'subject == "{subject}"')['state'].unique())

def get_ace(subject, state):
    d = pdf \
        .query(f'subject == "{subject}"') \
        .assign(**{
            state: lambda d: d['state'].apply(lambda s: 'yes' if s == state else 'no')
        }) \
        [[state, 'party', 'sponsor']]
    d = spark.createDataFrame(d)
    d = DiscreteData(d)

    g = get_structure(x=state, y='sponsor', z='party')
    p = get_parameters(d, g)
    b = get_bbn(g, p, d.get_profile())
    t = InferenceController.apply(b)

    a = Ace(b).get_ace(state, 'sponsor', 'yes')
    y = a['yes']
    n = a['no']
    diff = y - n

    return {
        'subject': subject,
        'state': state,
        'yes': y,
        'no': n,
        'ace': diff
    }

Health

As you can see below, AL has the largest positive ACE on health bills, while UT has the most negative ACE.

[7]:
subject = 'health'
_df_health = pd.DataFrame([get_ace(subject, s) for s in get_states(subject)]) \
    .sort_values(['ace'], ascending=False) \
    .reset_index(drop=True)
[25]:
_ = _df_health.set_index(['state'])['ace'] \
    .plot(kind='bar', figsize=(10, 3), title='ACE - health', ylabel='ACE, %')
_images/use-case-congress-sponsorship_14_0.png

International affairs

TN ranks first with the largest positive ACE on international affairs bills, and AK has the most negative ACE on these bills.

[8]:
subject = 'international_affairs'
_df_ia = pd.DataFrame([get_ace(subject, s) for s in get_states(subject)]) \
    .sort_values(['ace'], ascending=False) \
    .reset_index(drop=True)
[26]:
_ = _df_ia.set_index(['state'])['ace'] \
    .plot(kind='bar', figsize=(10, 3), title='ACE - international affairs', ylabel='ACE, %')
_images/use-case-congress-sponsorship_17_0.png

Government operations and politics

MT has the largest positive ACE on government operations and politics bills, while AK has the most negative ACE.

[9]:
subject = 'government_operations_and_politics'
_df_gov_op = pd.DataFrame([get_ace(subject, s) for s in get_states(subject)]) \
    .sort_values(['ace'], ascending=False) \
    .reset_index(drop=True)
[27]:
_ = _df_gov_op.set_index(['state'])['ace'] \
    .plot(kind='bar', figsize=(10, 3), title='ACE - government operations and politics', ylabel='ACE, %')
_images/use-case-congress-sponsorship_20_0.png

Armed forces and national security

For armed forces and national security, IN has the largest positive ACE and VT has the most negative ACE.

[10]:
subject = 'armed_forces_and_national_security'
_df_afns = pd.DataFrame([get_ace(subject, s) for s in get_states(subject)]) \
    .sort_values(['ace'], ascending=False) \
    .reset_index(drop=True)
[28]:
_ = _df_afns.set_index(['state'])['ace'] \
    .plot(kind='bar', figsize=(10, 3), title='ACE - armed forces and national security', ylabel='ACE, %')
_images/use-case-congress-sponsorship_23_0.png

Crime and law enforcement

For crime and law enforcement, IL has the largest positive ACE and AK has the most negative ACE.

[11]:
subject = 'crime_and_law_enforcement'
_df_cle = pd.DataFrame([get_ace(subject, s) for s in get_states(subject)]) \
    .sort_values(['ace'], ascending=False) \
    .reset_index(drop=True)
[29]:
_ = _df_cle.set_index(['state'])['ace'] \
    .plot(kind='bar', figsize=(10, 3), title='ACE - crime and law enforcement', ylabel='ACE, %')
_images/use-case-congress-sponsorship_26_0.png

Abortion

For bills whose subjects include abortion, IL has the largest positive ACE, while UT, ND, and SD have comparably large negative ACEs. It’s interesting to note that only 12 of 50 states have a positive ACE on abortion bill sponsorship.

[12]:
subject = 'abortion'
_df_abortion = pd.DataFrame([get_ace(subject, s) for s in get_states(subject)]) \
    .sort_values(['ace'], ascending=False) \
    .reset_index(drop=True)
[30]:
_ = _df_abortion.set_index(['state'])['ace'] \
    .plot(kind='bar', figsize=(10, 3), title='ACE - abortion', ylabel='ACE, %')
_images/use-case-congress-sponsorship_29_0.png
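
The summaries quoted in each section (which state has the largest positive ACE, which has the most negative, and how many states are positive) can be read off a sorted ACE frame with a few lines of pandas. The sketch below uses a made-up frame with placeholder state codes that mirrors the `state` and `ace` columns built in this notebook. The values are invented and the names `toy_ace`, `largest_positive`, `most_negative`, and `n_positive` are hypothetical.

```python
import pandas as pd

# Toy ACE table (invented values, placeholder state codes), sorted from
# largest to smallest ACE like the per-subject frames in this notebook.
toy_ace = pd.DataFrame({
    'state': ['AA', 'BB', 'CC', 'DD'],
    'ace': [0.04, 0.01, -0.02, -0.05],
})

# First row holds the largest positive ACE, last row the most negative.
largest_positive = toy_ace.iloc[0]['state']
most_negative = toy_ace.iloc[-1]['state']

# Number of states whose ACE is strictly positive.
n_positive = int((toy_ace['ace'] > 0).sum())

print(largest_positive, most_negative, n_positive)
```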