Learning from Discrete Data

All modules related to learning Bayesian Belief Networks BBNs from fully discrete data.

Data

The data.

class pysparkbbn.discrete.data.DiscreteData(sdf: DataFrame)

Bases: object

Discrete data.

__init__(sdf: DataFrame)

ctor

Parameters

sdf – Spark dataframe.

static compute_bd_score(X: List[str], Y: List[str], XY_vals: List[Tuple[str, ...]], counts: Dict[str, int]) Tuple[List[str], List[str], float]

Computes the Bayesian Dirichlet score.

Parameters
  • X – Child.

  • Y – Parents.

  • XY_vals – XY values.

  • counts – Counts.

Returns

Tuple of X, Y and score.
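
The Bayesian Dirichlet family of scores is built from log-gamma terms over counts. Below is a minimal K2-style sketch in plain Python, purely illustrative: the function name, the `(parent_config, child_value)` count layout, and the uniform prior are assumptions, not pysparkbbn's internals.

```python
from math import lgamma

def k2_score(child_vals, parent_configs, counts):
    """K2-style Bayesian Dirichlet score with uniform Dirichlet priors (alpha=1).

    counts maps (parent_config, child_value) -> N_ijk.
    """
    r = len(child_vals)  # number of child states
    score = 0.0
    for pc in parent_configs:
        n_ij = sum(counts.get((pc, v), 0) for v in child_vals)
        # log Gamma(r) - log Gamma(N_ij + r) == log[(r - 1)! / (N_ij + r - 1)!]
        score += lgamma(r) - lgamma(n_ij + r)
        for v in child_vals:
            # log Gamma(N_ijk + 1) == log(N_ijk!)
            score += lgamma(counts.get((pc, v), 0) + 1)
    return score

# toy data: binary child, single binary parent
counts = {('p0', 'a'): 8, ('p0', 'b'): 2, ('p1', 'a'): 1, ('p1', 'b'): 9}
s = k2_score(['a', 'b'], ['p0', 'p1'], counts)
```

Higher (less negative) scores indicate structures that explain the counts better.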

static compute_cmi(X: List[str], Y: List[str], Z: List[str], XYZ_vals: List[Tuple[str, ...]], counts: Dict[str, int], n: int) Tuple[List[str], List[str], List[str], float]

Computes the conditional mutual information of X and Y given Z.

Parameters
  • X – X variables.

  • Y – Y variables.

  • Z – Z variables.

  • XYZ_vals – XYZ values.

  • counts – Counts.

  • n – Number of data points.

Returns

Tuple of X, Y, Z and conditional mutual information.
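
The quantity computed here is I(X;Y|Z) = Σ p(x,y,z) log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ]. A plain-Python sketch of that formula from a count dictionary follows; the `(x, y, z)` key layout is an assumption for illustration, not the library's internal format.

```python
from math import log

def cond_mutual_info(xyz_counts):
    """I(X;Y|Z) from a dict mapping (x, y, z) -> count."""
    n = sum(xyz_counts.values())
    pz, pxz, pyz = {}, {}, {}
    # accumulate marginal counts for z, (x, z) and (y, z)
    for (x, y, z), c in xyz_counts.items():
        pz[z] = pz.get(z, 0) + c
        pxz[(x, z)] = pxz.get((x, z), 0) + c
        pyz[(y, z)] = pyz.get((y, z), 0) + c
    cmi = 0.0
    for (x, y, z), c in xyz_counts.items():
        p_xyz = c / n
        cmi += p_xyz * log((pz[z] / n) * p_xyz / ((pxz[(x, z)] / n) * (pyz[(y, z)] / n)))
    return cmi

# uniform counts: X and Y are independent given Z, so I(X;Y|Z) = 0
counts = {(x, y, z): 5 for x in 'ab' for y in 'cd' for z in 'ef'}
```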

static compute_mi(X: List[str], Y: List[str], XY_vals: List[Tuple[str, ...]], counts: Dict[str, int], n: int) Tuple[List[str], List[str], float]

Computes the mutual information.

Parameters
  • X – X variables.

  • Y – Y variables.

  • XY_vals – XY values.

  • counts – Counts.

  • n – Total data points.

Returns

Tuple of X, Y and mutual information.
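
The mutual information is I(X;Y) = Σ p(x,y) log[ p(x,y) / (p(x) p(y)) ]. A plain-Python sketch from a count dictionary follows; the `(x, y)` key layout is assumed for illustration and is not necessarily the library's internal format.

```python
from math import log

def mutual_info(xy_counts, n):
    """I(X;Y) from a dict mapping (x, y) -> count, with n total data points."""
    px, py = {}, {}
    # accumulate marginal counts for x and y
    for (x, y), c in xy_counts.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c
    mi = 0.0
    for (x, y), c in xy_counts.items():
        p_xy = c / n
        mi += p_xy * log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# perfectly dependent binary variables: I(X;Y) = H(X) = log(2)
counts = {('a', 'a'): 50, ('b', 'b'): 50}
```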

drop(columns: List[str])

Drops the specified columns.

Parameters

columns – List of columns.

Returns

New dataset without the dropped columns.

static get_bd(scoring_pairs: List[ScoringPair], counts: Dict[str, int]) List[Tuple[List[str], List[str], float]]

Computes the Bayesian Dirichlet score.

Parameters
  • scoring_pairs – Scoring pairs.

  • counts – Counts.

Returns

List of Bayesian Dirichlet scores.

get_bd_par(pairs: List[ScoringPair]) List[Tuple[List[str], List[str], float]]

Computes the Bayesian Dirichlet score for each scoring pair (in parallel).

Parameters

pairs – Scoring pairs.

Returns

List of Bayesian Dirichlet scores.

static get_cmi(triplets: List[Triplet], counts: Dict[str, int], n: int) List[Tuple[List[str], List[str], List[str], float]]

Computes the conditional mutual information for each triplet of variables.

Parameters
  • triplets – List of triplet variables.

  • counts – Count dictionary.

  • n – Number of data points.

Returns

List of conditional mutual information values.

get_cmi_par(triplets: List[Triplet]) List[Tuple[List[str], List[str], List[str], float]]

Computes the conditional mutual information for each triplet of variables (in parallel).

Parameters

triplets – List of triplets.

Returns

List of conditional mutual information values.

get_counts_for_bd(scoring_pairs: List[ScoringPair]) Dict[str, int]

Gets the counts required to compute the Bayesian Dirichlet score.

Parameters

scoring_pairs – List of scoring pairs (child, parents).

Returns

Dictionary of counts.

get_counts_for_cmi(triplets: List[Triplet]) Dict[str, int]

Gets the counts required to compute conditional mutual information.

Parameters

triplets – List of triplets.

Returns

Dictionary of counts.

get_counts_for_mi(pairs: List[Pair]) Dict[str, int]

Gets the counts required to compute pairwise mutual information.

Parameters

pairs – List of pairs.

Returns

Dictionary of counts.

static get_mi(pairs: List[Pair], counts: Dict[str, int], n: int) List[Tuple[List[str], List[str], float]]

Computes the mutual information for each pair of variables.

Parameters
  • pairs – List of pairs of variables.

  • counts – Count dictionary.

  • n – Number of data points.

Returns

List of mutual information values.

get_mi_par(pairs: List[Pair]) List[Tuple[List[str], List[str], float]]

Computes the mutual information for each pair of variables (in parallel).

Parameters

pairs – List of pairs of variables.

Returns

List of mutual information values.

get_profile() Dict[str, Dict[str, int]]

Gets the data profile.

Returns

Dictionary. Keys are variable names. Values are dictionaries mapping each value to its frequency.
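
The shape of the returned profile can be illustrated in plain Python. The rows and variable names below are hypothetical; in the library the profile is computed over the Spark dataframe.

```python
from collections import Counter

# a tiny, hypothetical discrete dataset as a list of row dicts
rows = [
    {'drug': 'yes', 'recovery': 'yes'},
    {'drug': 'yes', 'recovery': 'no'},
    {'drug': 'no', 'recovery': 'no'},
]

# profile: variable -> {value: frequency}, the shape get_profile() returns
profile = {
    col: dict(Counter(r[col] for r in rows))
    for col in rows[0]
}
```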

class pysparkbbn.discrete.data.Pair(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

Bases: object

Pair. Useful for computing pairwise statistics.

__init__(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

ctor

Parameters
  • X – List of X variables.

  • Y – List of Y variables.

  • profile – Profile of variables.

get_entries(r: Row) List[Tuple[str, int]]

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]

Gets the entries used for counting (parallel).
class pysparkbbn.discrete.data.ScoringPair(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

Bases: Pair

Scoring pair X, Y. Useful for Bayesian Dirichlet scoring.

__init__(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])

ctor

Parameters
  • X – Child.

  • Y – Parents.

  • profile – Variable profile.

get_entries(r: Row) List[Tuple[str, int]]

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]

Gets the entries used for counting (parallel).
class pysparkbbn.discrete.data.Triplet(X: List[str], Y: List[str], Z: List[str], profile: Dict[str, Dict[str, int]])

Bases: object

Triplet X, Y, Z. Useful for computing conditional statistics.

__init__(X: List[str], Y: List[str], Z: List[str], profile: Dict[str, Dict[str, int]])

ctor

Parameters
  • X – List of X variables.

  • Y – List of Y variables.

  • Z – List of Z variables.

  • profile – Profile of variables.

get_entries(r: Row) List[Tuple[str, int]]

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]

Gets the entries used for counting (parallel).
pysparkbbn.discrete.data.get_pairs(data: DiscreteData) List[Pair]

Gets a list of pairs of variables.

Parameters

data – Data.

Returns

List of pairs.
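
Conceptually, get_pairs enumerates one entry per unordered pair of variables; a sketch of that enumeration (the variable names are hypothetical, and the library wraps each pair in a Pair object rather than a bare tuple):

```python
from itertools import combinations

variables = ['a', 'b', 'c', 'd']

# one ([X], [Y]) entry per unordered pair of variables
pairs = [([x], [y]) for x, y in combinations(variables, 2)]
```

For m variables this yields m * (m - 1) / 2 pairs.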

Structure Learning

There are two broad classes of structure learning: constraint-based (CB) and search-and-scoring (SS). CB structure learning typically uses independence and conditional independence tests to learn the network structure. SS structure learning uses a scoring measure over a search space to find the best scoring structures.

The CB structure learning algorithms are listed below; some of them are appropriate for learning classification models.

  • Naive Bayes: Creates a naive Bayes model (classification).

  • Tree-Augmented Network (TAN): Creates a tree-augmented network (classification).

  • Bayesian network augmented naive Bayes (BAN): Creates a modified BAN model (classification).

  • Chow-Liu (aka Maximum Weight Spanning Tree, MWST): Creates a tree structured BBN.

  • Three-Phase Dependency Analysis (TPDA): Uses TPDA (draft, thicken and thin).

  • PC Algorithm: Uses the PC algorithm.

There is only one search-and-scoring based algorithm implemented, which uses genetic algorithms.

  • Genetic Algorithm (GA): Discovers a highly scoring network structure.

Constraint-Based

Constraint-based structure learning.

class pysparkbbn.discrete.scblearn.Ban(data: DiscreteData, clazz: str, cmi_threshold=0.06, method='pc')

Bases: BaseStructureLearner

Modified Bayesian network augmented naive Bayes (BAN). See Bayesian Network Classifiers.

__init__(data: DiscreteData, clazz: str, cmi_threshold=0.06, method='pc')

ctor

Parameters
  • data – Data.

  • clazz – Class variable.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

  • method – Either pc or tpda (default=pc).

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.BaseStructureLearner(data: DiscreteData)

Bases: object

Base structure learner.

__init__(data: DiscreteData)

ctor

Parameters

data – Data.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Mwst(data: DiscreteData, cmi_threshold=0.06)

Bases: BaseStructureLearner

Maximum weight spanning tree.

__init__(data: DiscreteData, cmi_threshold=0.06)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Naive(data: DiscreteData, clazz: str)

Bases: BaseStructureLearner

Naive Bayesian network. An arc is drawn from the clazz variable/node to every other node.

__init__(data: DiscreteData, clazz: str)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Pc(data: DiscreteData, cmi_threshold=0.06)

Bases: BaseStructureLearner

PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.

__init__(data: DiscreteData, cmi_threshold=0.06)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

learn_undirected_graph() Graph

Learns an undirected graph.

Returns

Undirected graph.

class pysparkbbn.discrete.scblearn.Tan(data: DiscreteData, clazz: str, cmi_threshold=0.06)

Bases: Mwst

Tree-augmented network. See Comparing Bayesian Network Classifiers.

__init__(data: DiscreteData, clazz: str, cmi_threshold=0.06)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

class pysparkbbn.discrete.scblearn.Tpda(data: DiscreteData, cmi_threshold=0.06)

Bases: Mwst

Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.

__init__(data: DiscreteData, cmi_threshold=0.06)

ctor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold at or above which variables are considered conditionally dependent.

get_network() DiGraph

Gets the network structure.

learn_undirected_graph() Graph

Learns an undirected graph.

Returns

Undirected graph.

Search-and-Scoring

Search-and-scoring structure learning.

class pysparkbbn.discrete.ssslearn.Ga(data: DiscreteData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)

Bases: object

Uses a genetic algorithm to search-and-score candidate networks. The algorithm is a hybrid approach: the ordering of the nodes is first induced by a constraint-based algorithm (MWST, PC or TPDA), and the ordered nodes then constrain the candidate parents of each node; later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental Results.

__init__(data: DiscreteData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)

ctor

Parameters
  • data – Data.

  • sc – Spark context.

  • max_parents – Maximum number of parents (default=4).

  • mutation_rate – Mutation rate (default=0.25).

  • pop_size – Population size (default=100).

  • crossover_prob – Crossover probability (default=0.5).

  • max_iters – Maximum iterations (default=20).

  • convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).

  • ordering – Ordering method: mwst, pc or tpda (default=mwst).

  • ordering_params – Ordering parameters to the ordering method.

  • seed – Seed for random number generation (default=37).

get_network() DiGraph

Gets the network structure.
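
The ordering constraint described above can be sketched in plain Python: given a node ordering, only earlier nodes are eligible parents, and max_parents caps the parent-set size. The ordering, node names, and helper below are hypothetical illustrations, not the library's internals.

```python
from itertools import combinations

order = ['a', 'b', 'c', 'd']  # ordering induced by, e.g., MWST
pos = {v: i for i, v in enumerate(order)}
max_parents = 2

def candidate_parent_sets(node):
    """All parent sets the GA may consider for a node: subsets of the
    nodes that come before it in the ordering, up to max_parents."""
    earlier = order[:pos[node]]
    sets = []
    for k in range(min(len(earlier), max_parents) + 1):
        sets.extend(combinations(earlier, k))
    return sets
```

Restricting the search space this way is what keeps the genetic search tractable relative to searching over all DAGs.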

Parameter Learning

Parameter learning.

class pysparkbbn.discrete.plearn.LargeDataParamLearner(data: DiscreteData, g: DiGraph)

Bases: object

Parameter learner. Useful when the data cannot fit into memory. Parameters are learned sequentially.

__init__(data: DiscreteData, g: DiGraph)

ctor

Parameters
  • data – Data.

  • g – Directed acyclic graph.

get_params() Dict[str, List[Dict[str, Union[str, float]]]]

Gets the parameters.

Returns

Dictionary of parameters.

class pysparkbbn.discrete.plearn.ParamLearner(data: DiscreteData, g: DiGraph)

Bases: object

Parameter learner. All parameters are learned in parallel, but this requires extensive Spark tuning.

__init__(data: DiscreteData, g: DiGraph)

ctor

Parameters
  • data – Data.

  • g – Directed acyclic graph.

get_params() Dict[str, List[Dict[str, Union[str, float]]]]

Gets the parameters.

Returns

Dictionary of parameters.

class pysparkbbn.discrete.plearn.SmallDataParamLearner(data: DiscreteData, g: DiGraph)

Bases: object

Parameter learner. Useful when the data can fit into memory. Parameters are learned in parallel.

__init__(data: DiscreteData, g: DiGraph)

ctor

Parameters
  • data – Data.

  • g – Directed acyclic graph.

get_params() Dict[str, List[Dict[str, Union[str, float]]]]

Gets the parameters.

Returns

Dictionary of parameters.

pysparkbbn.discrete.plearn.decode_row(r: Row, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) Row

Decodes a row from numeric back to categorical values.

Parameters
  • r – Row.

  • encodings – Dictionary of forward and backward encodings.

Returns

Row.

pysparkbbn.discrete.plearn.encode_data(df: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) DataFrame

Encodes a Spark dataframe.

Parameters
  • df – Spark dataframe.

  • encodings – Dictionary of forward and backward encodings.

Returns

Spark dataframe.

pysparkbbn.discrete.plearn.encode_row(r: Row, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) Row

Encodes a row from categorical values to numeric ones.

Parameters
  • r – Row.

  • encodings – Dictionary of forward and backward encodings.

Returns

Row.

pysparkbbn.discrete.plearn.estimate_scikit_cpt(X_cols: List[str], y_col: str, pdf: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) List[Dict[Any, Any]]

Estimates the CPT (Scikit).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • pdf – Pandas dataframe.

  • encodings – Dictionary of forward and backward encodings.

Returns

CPT.

pysparkbbn.discrete.plearn.estimate_spark_cpt(X_cols: List[str], y_col: str, sdf: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]], spark: SparkSession) Dict[str, List[Dict[Any, Any]]]

Estimates the CPT (Spark).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • sdf – Spark dataframe.

  • encodings – Dictionary of forward and backward encodings.

  • spark – Spark session.

Returns

CPT.

pysparkbbn.discrete.plearn.get_encodings(sdf: DataFrame) Dict[str, Dict[str, Dict[Any, Any]]]

Gets encoding of categorical values to numeric (integer) ones.

Parameters

sdf – Spark dataframe.

Returns

Dictionary of forward and backward encodings.

pysparkbbn.discrete.plearn.get_pandas_Xy(X_cols: List[str], y_col: str, pdf: DataFrame) DataFrame

Gets a Pandas dataframe in the Xy form.

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • pdf – Pandas dataframe.

Returns

Xy Pandas dataframe.

pysparkbbn.discrete.plearn.get_parent_child(g: DiGraph) List[Tuple[List[str], str]]

Converts the graph to a list of tuples where the first item in the tuple is a list of parents and the second item in the tuple is the child.

Parameters

g – Directed acyclic graph.

Returns

List of parent/child pairs.
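
The conversion can be sketched without Spark or networkx. The adjacency representation and edge list below are hypothetical stand-ins for a networkx DiGraph; only the output shape, one (parents, child) tuple per node, mirrors get_parent_child.

```python
# hypothetical DAG given as a list of (parent, child) edges
nodes = ['a', 'b', 'c', 'd']
edges = [('a', 'c'), ('b', 'c'), ('c', 'd')]

# invert edges into child -> parent list, then emit (parents, child) tuples
parents = {n: [] for n in nodes}
for u, v in edges:
    parents[v].append(u)

parent_child = [(sorted(ps), ch) for ch, ps in parents.items()]
```

Root nodes appear with an empty parent list, which is what drives their marginal (parent-free) CPTs during parameter learning.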

pysparkbbn.discrete.plearn.get_scikit_cpt(X_cols: List[str], y_col: str, m: Pipeline, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) List[Dict[Any, Any]]

Gets the CPT for the specified y variable (Scikit).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • m – Classification model.

  • encodings – Dictionary of forward and backward encodings.

Returns

CPT.

pysparkbbn.discrete.plearn.get_scikit_model(X: DataFrame, y: Series) Pipeline

Gets a Scikit logistic regression model.

Parameters
  • X – Pandas dataframe.

  • y – Pandas series.

Returns

Scikit pipeline.

pysparkbbn.discrete.plearn.get_spark_Xy(X_cols: List[str], y_col: str, sdf: DataFrame) DataFrame

Gets a Spark dataframe in the Xy form.

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • sdf – Spark dataframe.

Returns

Xy Spark dataframe.

pysparkbbn.discrete.plearn.get_spark_cpt(X_cols: List[str], y_col: str, m: PipelineModel, encodings: Dict[str, Dict[str, Dict[Any, Any]]], spark: SparkSession) Dict[str, List[Dict[Any, Any]]]

Gets the CPT for the specified y variable (Spark).

Parameters
  • X_cols – X columns.

  • y_col – y column.

  • m – Classification model.

  • encodings – Dictionary of forward and backward encodings.

  • spark – Spark session.

Returns

CPT.

pysparkbbn.discrete.plearn.get_spark_model(Xy: DataFrame) PipelineModel

Gets a Spark logistic regression model.

Parameters

Xy – Spark dataframe.

Returns

Spark pipeline.

Utilities

Utilities to make life easier.

pysparkbbn.discrete.util.get_triplets(g)

Gets all triplets (x, y, z) where edges x–y and y–z exist but x–z does not.

Parameters

g – Undirected graph.

Returns

List of triplets.
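
These are the open triples used for conditional independence testing. A sketch over a plain adjacency-set representation (the graph below is hypothetical; the library operates on a networkx undirected graph):

```python
from itertools import combinations

# undirected graph a - b - c as adjacency sets
adj = {
    'a': {'b'},
    'b': {'a', 'c'},
    'c': {'b'},
}

# (x, y, z): x-y and y-z edges exist, but x-z does not
triplets = [
    (x, y, z)
    for y in adj
    for x, z in combinations(sorted(adj[y]), 2)
    if z not in adj[x]
]
```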

pysparkbbn.discrete.util.log_gamma(x: int) float

Computes log(gamma(x)), where gamma(x) = (x - 1)!. For example, if x = 5, then gamma(5) = (5 - 1)! = 4! = 4 × 3 × 2 × 1, and log(gamma(5)) = log(4!) = log(4) + log(3) + log(2) + log(1).

Parameters

x – Positive integer.

Returns

Log gamma(x).

pysparkbbn.discrete.util.log_gamma_ratio(numerator: int, denominator: int) float

Computes the ratio of gammas in log-space.

Parameters
  • numerator – Numerator gamma.

  • denominator – Denominator gamma.

Returns

log(gamma(numerator) / gamma(denominator)).
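
The sum-of-logs identity behind these two utilities can be sketched and checked against Python's math.lgamma. This is a plain-Python illustration of what the utilities compute, not the library's code.

```python
from math import lgamma, log

def log_gamma(x: int) -> float:
    """log(gamma(x)) = log((x - 1)!) = sum of log(k) for k = 1..x-1."""
    return sum(log(k) for k in range(1, x))

def log_gamma_ratio(numerator: int, denominator: int) -> float:
    """log(gamma(numerator) / gamma(denominator)), computed stably in
    log-space so large factorials never have to be formed directly."""
    return log_gamma(numerator) - log_gamma(denominator)
```

Working in log-space matters because the factorials in Bayesian Dirichlet scoring overflow floating point almost immediately for realistic counts.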

pysparkbbn.discrete.bbn.get_bbn(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Bbn

Gets a Bayesian Belief Network based on the specified graph and parameters.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

BBN.

pysparkbbn.discrete.bbn.get_darkstar_data(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Dict[str, Any]

Gets JSON data for Darkstar.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

Data.

pysparkbbn.discrete.bbn.get_pybbn_data(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Dict[str, Any]

Gets JSON data for Py-BBN.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

Data.