Learning from Discrete Data

All modules related to learning Bayesian Belief Networks BBNs from fully discrete data.

Data

The data.

class pysparkbbn.discrete.data.DiscreteData(sdf: DataFrame)

Bases: object

Discrete data.

__init__(sdf: DataFrame)

ctor

Parameters

sdf – Spark dataframe.

static compute_bd_score(X: List[str], Y: List[str], XY_vals: List[Tuple[str, ...]], counts: Dict[str, int]) Tuple[List[str], List[str], float]

Computes the Bayesian Dirichlet score.

Parameters
  • X – Child.

  • Y – Parents.

  • XY_vals – XY values.

  • counts – Counts.

Returns

Tuple of X, Y and score.
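
The Bayesian Dirichlet family of scores is built from log-gamma terms over counts. Below is a minimal K2-style sketch in plain Python, purely illustrative: the function name, the `(parent_config, child_value)` count layout, and the uniform prior are assumptions, not pysparkbbn's internals.

```python
from math import lgamma

def k2_score(child_vals, parent_configs, counts):
    """K2-style Bayesian Dirichlet score with uniform Dirichlet priors (alpha=1).

    counts maps (parent_config, child_value) -> N_ijk.
    """
    r = len(child_vals)  # number of child states
    score = 0.0
    for pc in parent_configs:
        n_ij = sum(counts.get((pc, v), 0) for v in child_vals)
        # log Gamma(r) - log Gamma(N_ij + r) == log[(r - 1)! / (N_ij + r - 1)!]
        score += lgamma(r) - lgamma(n_ij + r)
        for v in child_vals:
            # log Gamma(N_ijk + 1) == log(N_ijk!)
            score += lgamma(counts.get((pc, v), 0) + 1)
    return score

# toy data: binary child, single binary parent
counts = {('p0', 'a'): 8, ('p0', 'b'): 2, ('p1', 'a'): 1, ('p1', 'b'): 9}
s = k2_score(['a', 'b'], ['p0', 'p1'], counts)
```

Higher (less negative) scores indicate structures that explain the counts better.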

static compute_cmi(X: List[str], Y: List[str], Z: List[str], XYZ_vals: List[Tuple[str, ...]], counts: Dict[str, int], n: int) Tuple[List[str], List[str], List[str], float]

Computes the conditional mutual information of X and Y given Z.

Parameters
  • X – X variables.

  • Y – Y variables.

  • Z – Z variables.

  • XYZ_vals – XYZ values.

  • counts – Counts.

  • n – Number of data points.

Returns

Tuple of X, Y, Z and conditional mutual information.
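
The quantity computed here is I(X;Y|Z) = Σ p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]. A plain-Python sketch of that formula from a count dictionary follows; the `(x, y, z)` key layout is an assumption for illustration, not the library's internal format.

```python
from math import log

def cond_mutual_info(xyz_counts):
    """I(X;Y|Z) from a dict mapping (x, y, z) -> count."""
    n = sum(xyz_counts.values())
    pz, pxz, pyz = {}, {}, {}
    # accumulate marginal counts for z, (x, z) and (y, z)
    for (x, y, z), c in xyz_counts.items():
        pz[z] = pz.get(z, 0) + c
        pxz[(x, z)] = pxz.get((x, z), 0) + c
        pyz[(y, z)] = pyz.get((y, z), 0) + c
    cmi = 0.0
    for (x, y, z), c in xyz_counts.items():
        p_xyz = c / n
        cmi += p_xyz * log((pz[z] / n) * p_xyz / ((pxz[(x, z)] / n) * (pyz[(y, z)] / n)))
    return cmi

# uniform counts: X and Y are independent given Z, so I(X;Y|Z) = 0
counts = {(x, y, z): 5 for x in 'ab' for y in 'cd' for z in 'ef'}
```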

static compute_mi(X: List[str], Y: List[str], XY_vals: List[Tuple[str, ...]], counts: Dict[str, int], n: int) Tuple[List[str], List[str], float]

Computes the mutual information.

Parameters
  • X – X variables.

  • Y – Y variables.

  • XY_vals – XY values.

  • counts – Counts.

  • n – Total data points.

Returns

Tuple of X, Y and mutual information.
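
The mutual information is I(X;Y) = Σ p(x,y) log[ p(x,y) / (p(x) p(y)) ]. A plain-Python sketch from a count dictionary follows; the `(x, y)` key layout is assumed for illustration and is not necessarily the library's internal format.

```python
from math import log

def mutual_info(xy_counts, n):
    """I(X;Y) from a dict mapping (x, y) -> count, with n total data points."""
    px, py = {}, {}
    # accumulate marginal counts for x and y
    for (x, y), c in xy_counts.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    mi = 0.0
    for (x, y), c in xy_counts.items():
        p_xy = c / n
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# perfectly dependent binary variables: I(X;Y) = H(X) = log(2)
counts = {('a', 'a'): 50, ('b', 'b'): 50}
```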

drop(columns: List[str])

Drops the specified columns.

Parameters

columns – List of columns.

Returns

New dataset without the dropped columns.

static get_bd(scoring_pairs: List[ScoringPair], counts: Dict[str, int]) List[Tuple[List[str], List[str], float]]

Computes the Bayesian Dirichlet score.

Parameters
  • scoring_pairs – Scoring pairs.

  • counts – Counts.

Returns

List of Bayesian Dirichlet scores.

get_bd_par(pairs: List[ScoringPair]) List[Tuple[List[str], List[str], float]]

Computes the Bayesian Dirichlet score for each scoring pair (in parallel).

Parameters

pairs – Scoring pairs.

Returns

List of Bayesian Dirichlet scores.

static get_cmi(triplets: List[Triplet], counts: Dict[str, int], n: int) List[Tuple[List[str], List[str], List[str], float]]

Computes the conditional mutual information for each triplet of variables.

Parameters
  • triplets – List of triplet variables.

  • counts – Count dictionary.

  • n – Number of data points.

Returns

List of conditional mutual information values.

get_cmi_par(triplets: List[Triplet]) List[Tuple[List[str], List[str], List[str], float]]

Computes the conditional mutual information for each triplet of variables (in parallel).

Parameters

triplets – List of triplets.

Returns

List of conditional mutual information values.

get_counts_for_bd(scoring_pairs: List[ScoringPair]) Dict[str, int]

Gets the counts required to compute the Bayesian Dirichlet score.

Parameters

scoring_pairs – List of scoring pairs (child, parents).

Returns

Dictionary of counts.

get_counts_for_cmi(triplets: List[Triplet]) Dict[str, int]

Gets the counts required to compute conditional mutual information.

Parameters

triplets – List of triplets.

Returns

Dictionary of counts.

get_counts_for_mi(pairs: List[Pair]) Dict[str, int]

Gets the counts required to compute pairwise mutual information.

Parameters

pairs – List of pairs.

Returns

Dictionary of counts.

static get_mi(pairs: List[Pair], counts: Dict[str, int], n: int) List[Tuple[List[str], List[str], float]]

Computes the mutual information for each pair of variables.

Parameters
  • pairs – List of pairs of variables.

  • counts – Count dictionary.

  • n – Number of data points.

Returns

List of mutual information values.

get_mi_par(pairs: List[Pair]) List[Tuple[List[str], List[str], float]]

Computes the mutual information for each pair of variables (in parallel).

Parameters

pairs – List of pairs of variables.

Returns

List of mutual information values.

get_profile() Dict[str, Dict[str, int]]

Gets the data profile.

Returns

Dictionary. Keys are variable names. Values are dictionaries mapping each value to its frequency.
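
The shape of the returned profile can be illustrated in plain Python. The rows and variable names below are hypothetical; in the library the profile is computed over the Spark dataframe.

```python
from collections import Counter

# a tiny, hypothetical discrete dataset as a list of row dicts
rows = [
    {'drug': 'yes', 'recovery': 'yes'},
    {'drug': 'yes', 'recovery': 'no'},
    {'drug': 'no', 'recovery': 'no'},
]

# profile: variable -> {value: frequency}, the shape get_profile() returns
profile = {
    col: dict(Counter(r[col] for r in rows))
    for col in rows[0]
}
```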

class pysparkbbn.discrete.data.Pair(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

Bases: object

Pair. Useful for computing pairwise statistics.

__init__(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

ctor

Parameters
  • X – List of X variables.

  • Y – List of Y variables.

  • profile – Profile of variables.

get_entries(r: Row) List[Tuple[str, int]]

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]

Gets the entries used for counting (parallel).
class pysparkbbn.discrete.data.ScoringPair(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

Bases: Pair

Scoring pair X, Y. Useful for Bayesian Dirichlet scoring.

__init__(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

ctor

Parameters
  • X – Child.

  • Y – Parents.

  • profile – Variable profile.

get_entries(r: Row) List[Tuple[str, int]]

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]

Gets the entries used for counting (parallel).
class pysparkbbn.discrete.data.Triplet(X: List[str], Y: List[str], Z: List[str], profile: Dict[str, Dict[str, int]])

Bases: object

Triplet X, Y, Z. Useful for computing conditional statistics.

__init__(X: List[str], Y: List[str], Z: List[str], profile: Dict[str, Dict[str, int]])

ctor

Parameters
  • X – List of X variables.

  • Y – List of Y variables.

  • Z – List of Z variables.

  • profile – Profile of variables.

get_entries(r: Row) List[Tuple[str, int]]

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]

Gets the entries used for counting (parallel).
pysparkbbn.discrete.data.get_pairs(data: DiscreteData) List[Pair]

Gets a list of pairs of variables.

Parameters

data – Data.

Returns

List of pairs.
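
Conceptually, get_pairs enumerates one entry per unordered pair of variables; a sketch of that enumeration (the variable names are hypothetical, and the library wraps each pair in a Pair object rather than a bare tuple):

```python
from itertools import combinations

variables = ['a', 'b', 'c', 'd']

# one ([X], [Y]) entry per unordered pair of variables
pairs = [([x], [y]) for x, y in combinations(variables, 2)]
```

For m variables this yields m * (m - 1) / 2 pairs.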

Structure Learning

There are two broad classes of structure learning: constraint-based (CB) and search-and-scoring (SS). CB structure learning typically uses independence and conditional independence tests to learn the network structure. SS structure learning uses a scoring measure over a search space to find the best scoring structures.

The CB structure learning algorithms are listed below; some of them are appropriate for learning classification models.

  • Naive Bayes: Creates a naive Bayes model (classification).

  • Tree-Augmented Network (TAN): Creates a tree-augmented network (classification).

  • Bayesian network augmented naive Bayes (BAN): Creates a modified BAN model (classification).

  • Chow-Liu (aka Maximum Weight Spanning Tree, MWST): Creates a tree structured BBN.

  • Three-Phase Dependency Analysis (TPDA): Uses TPDA (draft, thicken and thin).

  • PC Algorithm: Uses the PC algorithm.

There is only one search-and-scoring based algorithm implemented, which uses genetic algorithms.

  • Genetic Algorithm (GA): Discovers a highly scoring network structure.

Constraint-Based

Constraint-based structure learning.

class pysparkbbn.discrete.scblearn.Ban(data: DiscreteData, clazz: str, cmi_threshold=0.06, method='pc')

Bases: BaseStructureLearner

Modified Bayesian network augmented naive Bayes (BAN). See Bayesian Network Classifiers.

__init__(data: DiscreteData, clazz: str, cmi_threshold=0.06, method='pc')

ctor

Parameters
  • data – Data.

  • clazz – Class variable.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

  • method – Either pc or tpda (default=pc).

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.BaseStructureLearner(data: DiscreteData)

Bases: object

Base structure learner.

__init__(data: DiscreteData)

ctor

Parameters

data – Data.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Mwst(data: DiscreteData, cmi_threshold=0.06)

Bases: BaseStructureLearner

Maximum weight spanning tree.

__init__(data: DiscreteData, cmi_threshold=0.06)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Naive(data: DiscreteData, clazz: str)

Bases: BaseStructureLearner

Naive Bayesian network. An arc is drawn from the clazz variable/node to every other node.

__init__(data: DiscreteData, clazz: str)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Pc(data: DiscreteData, cmi_threshold=0.06)

Bases: BaseStructureLearner

PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.

__init__(data: DiscreteData, cmi_threshold=0.06)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

learn_undirected_graph() Graph

Learns an undirected graph.

Returns

Undirected graph.

class pysparkbbn.discrete.scblearn.Tan(data: DiscreteData, clazz: str, cmi_threshold=0.06)

Bases: Mwst

Tree-augmented network. See Comparing Bayesian Network Classifiers.

__init__(data: DiscreteData, clazz: str, cmi_threshold=0.06)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Tpda(data: DiscreteData, cmi_threshold=0.06)

Bases: Mwst

Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.

__init__(data: DiscreteData, cmi_threshold=0.06)

ctor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

learn_undirected_graph() Graph

Learns an undirected graph.

Returns

Undirected graph.

Search-and-Scoring

Search-and-scoring structure learning.

class pysparkbbn.discrete.ssslearn.Ga(data: DiscreteData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)

Bases: object

Uses a genetic algorithm to search-and-score candidate networks. The algorithm is a hybrid approach: the ordering of the nodes is first induced by a constraint-based algorithm (MWST, PC or TPDA), and the ordered nodes then constrain the candidate parents of each node; later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental Results.

__init__(data: DiscreteData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)

ctor

Parameters
  • data – Data.

  • sc – Spark context.

  • max_parents – Maximum number of parents (default=4).

  • mutation_rate – Mutation rate (default=0.25).

  • pop_size – Population size (default=100).

  • crossover_prob – Crossover probability (default=0.5).

  • max_iters – Maximum iterations (default=20).

  • convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).

  • ordering – Ordering method: mwst, pc or tpda (default=mwst).

  • ordering_params – Ordering parameters to the ordering method.

  • seed – Seed for random number generation (default=37).

get_network() DiGraph

Gets the network structure.
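
The ordering constraint described above can be sketched in plain Python: given a node ordering, only earlier nodes are eligible parents, and max_parents caps the parent-set size. The ordering, node names, and helper below are hypothetical illustrations, not the library's internals.

```python
from itertools import combinations

order = ['a', 'b', 'c', 'd']  # ordering induced by, e.g., MWST
pos = {v: i for i, v in enumerate(order)}
max_parents = 2

def candidate_parent_sets(node):
    """All parent sets the GA may consider for a node: subsets of the
    nodes that come before it in the ordering, up to max_parents."""
    earlier = order[:pos[node]]
    sets = []
    for k in range(min(len(earlier), max_parents) + 1):
        sets.extend(combinations(earlier, k))
    return sets
```

Restricting the search space this way is what keeps the genetic search tractable relative to searching over all DAGs.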

Parameter Learning

Parameter learning.

class pysparkbbn.discrete.plearn.LargeDataParamLearner(data: DiscreteData, g: DiGraph)

Bases: object

Parameter learner. Useful when the data cannot fit into memory. Parameters are learned sequentially.

__init__(data: DiscreteData, g: DiGraph)

ctor

Parameters
  • data – Data.

  • g – Directed acyclic graph.

get_params() Dict[str, List[Dict[str, Union[str, float]]]]

Gets the parameters.

Returns

Dictionary of parameters.

class pysparkbbn.discrete.plearn.ParamLearner(data: DiscreteData, g: DiGraph)

Bases: object

Parameter learner. All parameters are learned in parallel, but this requires extensive Spark tuning.

__init__(data: DiscreteData, g: DiGraph)

ctor

Parameters
  • data – Data.

  • g – Directed acyclic graph.

get_params() Dict[str, List[Dict[str, Union[str, float]]]]

Gets the parameters.

Returns

Dictionary of parameters.

class pysparkbbn.discrete.plearn.SmallDataParamLearner(data: DiscreteData, g: DiGraph)

Bases: object

Parameter learner. Useful when the data can fit into memory. Parameters are learned in parallel.

__init__(data: DiscreteData, g: DiGraph)

ctor

Parameters
  • data – Data.

  • g – Directed acyclic graph.

get_params() Dict[str, List[Dict[str, Union[str, float]]]]

Gets the parameters.

Returns

Dictionary of parameters.

pysparkbbn.discrete.plearn.decode_row(r: Row, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) Row

Decodes a row from numeric back to categorical values.

Parameters
  • r – Row.

  • encodings – Dictionary of forward and backward encodings.

Returns

Row.

pysparkbbn.discrete.plearn.encode_data(df: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) DataFrame

Encodes a Spark dataframe.

Parameters
  • df – Spark dataframe.

  • encodings – Dictionary of forward and backward encodings.

Returns

Spark dataframe.

pysparkbbn.discrete.plearn.encode_row(r: Row, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) Row

Encodes a row from categorical values to numeric ones.

Parameters
  • r – Row.

  • encodings – Dictionary of forward and backward encodings.

Returns

Row.

pysparkbbn.discrete.plearn.estimate_scikit_cpt(X_cols: List[str], y_col: str, pdf: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) List[Dict[Any, Any]]

Estimates the CPT (Scikit).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • pdf – Pandas dataframe.

  • encodings – Dictionary of forward and backward encodings.

Returns

CPT.

pysparkbbn.discrete.plearn.estimate_spark_cpt(X_cols: List[str], y_col: str, sdf: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]], spark: SparkSession) Dict[str, List[Dict[Any, Any]]]

Estimates the CPT (Spark).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • sdf – Spark dataframe.

  • encodings – Dictionary of forward and backward encodings.

  • spark – Spark session.

Returns

CPT.

pysparkbbn.discrete.plearn.get_encodings(sdf: DataFrame) Dict[str, Dict[str, Dict[Any, Any]]]

Gets encoding of categorical values to numeric (integer) ones.

Parameters

sdf – Spark dataframe.

Returns

Dictionary of forward and backward encodings.

pysparkbbn.discrete.plearn.get_pandas_Xy(X_cols: List[str], y_col: str, pdf: DataFrame) DataFrame

Gets a Pandas dataframe in the Xy form.

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • pdf – Pandas dataframe.

Returns

Xy Pandas dataframe.

pysparkbbn.discrete.plearn.get_parent_child(g: DiGraph) List[Tuple[List[str], str]]

Converts the graph to a list of tuples where the first item in the tuple is a list of parents and the second item in the tuple is the child.

Parameters

g – Directed acyclic graph.

Returns

List of parent/child pairs.
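
The conversion can be sketched without Spark or networkx. The adjacency representation and edge list below are hypothetical stand-ins for a networkx DiGraph; only the output shape, one (parents, child) tuple per node, mirrors get_parent_child.

```python
# hypothetical DAG given as a list of (parent, child) edges
nodes = ['a', 'b', 'c', 'd']
edges = [('a', 'c'), ('b', 'c'), ('c', 'd')]

# invert edges into child -> parent list, then emit (parents, child) tuples
parents = {n: [] for n in nodes}
for u, v in edges:
    parents[v].append(u)

parent_child = [(sorted(ps), ch) for ch, ps in parents.items()]
```

Root nodes appear with an empty parent list, which is what drives their marginal (parent-free) CPTs during parameter learning.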

pysparkbbn.discrete.plearn.get_scikit_cpt(X_cols: List[str], y_col: str, m: Pipeline, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) List[Dict[Any, Any]]

Gets the CPT for the specified y variable (Scikit).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • m – Classification model.

  • encodings – Dictionary of forward and backward encodings.

Returns

CPT.

pysparkbbn.discrete.plearn.get_scikit_model(X: DataFrame, y: Series) Pipeline

Gets a Scikit logistic regression model.

Parameters
  • X – Pandas dataframe.

  • y – Pandas series.

Returns

Scikit pipeline.

pysparkbbn.discrete.plearn.get_spark_Xy(X_cols: List[str], y_col: str, sdf: DataFrame) DataFrame

Gets a Spark dataframe in the Xy form.

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • sdf – Spark dataframe.

Returns

Xy Spark dataframe.

pysparkbbn.discrete.plearn.get_spark_cpt(X_cols: List[str], y_col: str, m: PipelineModel, encodings: Dict[str, Dict[str, Dict[Any, Any]]], spark: SparkSession) Dict[str, List[Dict[Any, Any]]]

Gets the CPT for the specified y variable (Spark).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • m – Classification model.

  • encodings – Dictionary of forward and backward encodings.

  • spark – Spark session.

Returns

CPT.

pysparkbbn.discrete.plearn.get_spark_model(Xy: DataFrame) PipelineModel

Gets a Spark logistic regression model.

Parameters

Xy – Spark dataframe.

Returns

Spark pipeline.

Utilities

Utilities to make life easier.

pysparkbbn.discrete.util.get_triplets(g)

Gets all triplets (x, y, z) where edges x–y and y–z exist but x–z does not.

Parameters

g – Undirected graph.

Returns

List of triplets.
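
These are the open triples used for conditional independence testing. A sketch over a plain adjacency-set representation (the graph below is hypothetical; the library operates on a networkx undirected graph):

```python
from itertools import combinations

# undirected graph a - b - c as adjacency sets
adj = {
    'a': {'b'},
    'b': {'a', 'c'},
    'c': {'b'},
}

# (x, y, z): x-y and y-z edges exist, but x-z does not
triplets = [
    (x, y, z)
    for y in adj
    for x, z in combinations(sorted(adj[y]), 2)
    if z not in adj[x]
]
```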

pysparkbbn.discrete.util.log_gamma(x: int) float

Computes log(gamma(x)), where gamma(x) = (x - 1)!. For example, if x = 5, then gamma(5) = (5 - 1)! = 4! = 4 × 3 × 2 × 1, and log(gamma(5)) = log(4!) = log(4) + log(3) + log(2) + log(1).

Parameters

x – Positive integer.

Returns

Log gamma(x).

pysparkbbn.discrete.util.log_gamma_ratio(numerator: int, denominator: int) float

Computes the ratio of gammas in log-space.

Parameters
  • numerator – Numerator gamma.

  • denominator – Denominator gamma.

Returns

log(gamma(numerator) / gamma(denominator)).
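
The sum-of-logs identity behind these two utilities can be sketched and checked against Python's math.lgamma. This is a plain-Python illustration of what the utilities compute, not the library's code.

```python
from math import lgamma, log

def log_gamma(x: int) -> float:
    """log(gamma(x)) = log((x - 1)!) = sum of log(k) for k = 1..x-1."""
    return sum(log(k) for k in range(1, x))

def log_gamma_ratio(numerator: int, denominator: int) -> float:
    """log(gamma(numerator) / gamma(denominator)), computed stably in
    log-space so large factorials never have to be formed directly."""
    return log_gamma(numerator) - log_gamma(denominator)
```

Working in log-space matters because the factorials in Bayesian Dirichlet scoring overflow floating point almost immediately for realistic counts.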

pysparkbbn.discrete.bbn.get_bbn(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Bbn

Gets a Bayesian Belief Network based on the specified graph and parameters.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

BBN.

pysparkbbn.discrete.bbn.get_darkstar_data(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Dict[str, Any]

Gets JSON data for Darkstar.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

Data.

pysparkbbn.discrete.bbn.get_pybbn_data(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Dict[str, Any]

Gets JSON data for Py-BBN.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

Data.