Learning from Discrete Data
All modules related to learning Bayesian Belief Networks (BBNs)
from fully discrete data.
Data
The data.
- class pysparkbbn.discrete.data.DiscreteData(sdf: DataFrame)
Bases:
object
Discrete data.
- __init__(sdf: DataFrame)
ctor
- Parameters
sdf – Spark dataframe.
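A minimal construction sketch (the session setup and the file name data.csv are assumptions for illustration; every column is expected to hold discrete/categorical values):

    from pyspark.sql import SparkSession
    from pysparkbbn.discrete.data import DiscreteData

    # hypothetical session and input path
    spark = SparkSession.builder.appName('bbn-demo').getOrCreate()
    sdf = spark.read.option('header', True).csv('data.csv')

    data = DiscreteData(sdf)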
- static compute_bd_score(X: List[str], Y: List[str], XY_vals: List[Tuple[str, ...]], counts: Dict[str, int]) Tuple[List[str], List[str], float]
Computes the Bayesian Dirichlet score.
- Parameters
X – Child.
Y – Parent.
XY_vals – XY values.
counts – Counts.
- Returns
Tuple of X, Y and score.
- static compute_cmi(X: List[str], Y: List[str], Z: List[str], XYZ_vals: List[Tuple[str, ...]], counts: Dict[str, int], n: int) Tuple[List[str], List[str], List[str], float]
Computes the conditional mutual information of X and Y given Z.
- Parameters
X – X variables.
Y – Y variables.
Z – Z variables.
XYZ_vals – XYZ values.
counts – Counts.
n – Number of data points.
- Returns
Tuple of X, Y, Z and conditional mutual information.
- static compute_mi(X: List[str], Y: List[str], XY_vals: List[Tuple[str, ...]], counts: Dict[str, int], n: int) Tuple[List[str], List[str], float]
Computes the mutual information.
- Parameters
X – X variables.
Y – Y variables.
XY_vals – XY values.
counts – Counts.
n – Total data points.
- Returns
Tuple of X, Y and mutual information.
- drop(columns: List[str])
Drops the specified columns.
- Parameters
columns – List of columns.
- Returns
New dataset without the dropped columns.
- static get_bd(scoring_pairs: List[ScoringPair], counts: Dict[str, int]) List[Tuple[List[str], List[str], float]]
Computes the Bayesian Dirichlet score.
- Parameters
scoring_pairs – Scoring pairs.
counts – Counts.
- Returns
List of Bayesian Dirichlet scores.
- get_bd_par(pairs: List[ScoringPair]) List[Tuple[List[str], List[str], float]]
Computes the Bayesian Dirichlet score for each pair of variables (parallel).
- Parameters
pairs – Scoring pairs.
- Returns
List of Bayesian Dirichlet scores.
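A sketch chaining the scoring methods above, assuming the DiscreteData instance data from earlier and two illustrative columns a and b:

    from pysparkbbn.discrete.data import ScoringPair

    profile = data.get_profile()
    pairs = [ScoringPair(['a'], ['b'], profile)]  # score child 'a' with parent 'b'

    # sequential: gather the counts, then score
    counts = data.get_counts_for_bd(pairs)
    scores = DiscreteData.get_bd(pairs, counts)

    # or count and score in one parallel call
    scores = data.get_bd_par(pairs)  # list of (X, Y, score) tuples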
- static get_cmi(triplets: List[Triplet], counts: Dict[str, int], n: int) List[Tuple[List[str], List[str], List[str], float]]
Computes the conditional mutual information for each triplet of variables.
- Parameters
triplets – List of triplet variables.
counts – Count dictionary.
n – Number of data points.
- Returns
List of conditional mutual information values.
- get_cmi_par(triplets: List[Triplet]) List[Tuple[List[str], List[str], List[str], float]]
Computes the conditional mutual information for each triplet of variables (parallel).
- Parameters
triplets – List of triplets.
- Returns
List of conditional mutual information values.
- get_counts_for_bd(scoring_pairs: List[ScoringPair]) Dict[str, int]
Gets the counts required to compute the Bayesian Dirichlet score.
- Parameters
scoring_pairs – List of scoring pairs (child, parents).
- Returns
Dictionary of counts.
- get_counts_for_cmi(triplets: List[Triplet]) Dict[str, int]
Gets the counts required to compute conditional mutual information.
- Parameters
triplets – List of triplets.
- Returns
Dictionary of counts.
- get_counts_for_mi(pairs: List[Pair]) Dict[str, int]
Gets the counts required to compute pairwise mutual information.
- Parameters
pairs – List of pairs.
- Returns
Dictionary of counts.
- static get_mi(pairs: List[Pair], counts: Dict[str, int], n: int) List[Tuple[List[str], List[str], float]]
Computes the mutual information for each pair of variables.
- Parameters
pairs – List of pairs of variables.
counts – Count dictionary.
n – Number of data points.
- Returns
List of mutual information values.
- get_mi_par(pairs: List[Pair]) List[Tuple[List[str], List[str], float]]
Computes the mutual information for each pair of variables (parallel).
- Parameters
pairs – List of pairs of variables.
- Returns
List of mutual information values.
- get_profile() Dict[str, Dict[str, int]]
Gets the data profile.
- Returns
Dictionary. Keys are variable names. Values are dictionary of value-frequency.
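For illustration, the returned dictionary has this shape (variable names and counts are made up):

    profile = data.get_profile()
    # {'a': {'no': 488, 'yes': 512}, 'b': {'high': 700, 'low': 300}}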
- class pysparkbbn.discrete.data.Pair(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])
Bases:
object
Pair. Useful for computing pairwise statistics.
- __init__(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])
ctor
- Parameters
X – List of X variables.
Y – List of Y variables.
profile – Profile of variables.
- get_entries(r: Row) List[Tuple[str, int]]
Gets the entries used for counting.
- Parameters
r – Row.
- Returns
Keys used for counting.
- get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]
Gets the entries used for counting (parallel variant).
- class pysparkbbn.discrete.data.ScoringPair(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])
Bases:
Pair
Scoring pair X, Y. Useful for Bayesian Dirichlet scoring.
- __init__(X: List[str], Y: List[str], profile: Dict[str, Dict[str, int]])
ctor
- Parameters
X – Child.
Y – Parents.
profile – Variable profile.
- get_entries(r: Row) List[Tuple[str, int]]
Gets the entries used for counting.
- Parameters
r – Row.
- Returns
Keys used for counting.
- get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]
Gets the entries used for counting (parallel variant).
- class pysparkbbn.discrete.data.Triplet(X: List[str], Y: List[str], Z: List[str], profile: Dict[str, Dict[str, int]])
Bases:
object
Triplet X, Y, Z. Useful for computing conditional statistics.
- __init__(X: List[str], Y: List[str], Z: List[str], profile: Dict[str, Dict[str, int]])
ctor
- Parameters
X – List of X variables.
Y – List of Y variables.
Z – List of Z variables.
profile – Profile of variables.
- get_entries(r: Row) List[Tuple[str, int]]
Gets the entries used for counting.
- Parameters
r – Row.
- Returns
Keys used for counting.
- get_entries_par(r: Row) List[Tuple[Tuple[str, str], int]]
Gets the entries used for counting (parallel variant).
- pysparkbbn.discrete.data.get_pairs(data: DiscreteData) List[Pair]
Gets a list of pairs of variables.
- Parameters
data – Data.
- Returns
List of pairs.
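Together with the methods on DiscreteData, get_pairs supports a simple pairwise pipeline; a sketch, with data as constructed earlier:

    from pysparkbbn.discrete.data import get_pairs

    pairs = get_pairs(data)      # every pair of variables in the data
    mi = data.get_mi_par(pairs)  # list of (X, Y, mutual information) tuples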
Structure Learning
There are two broad classes of structure learning: constraint-based (CB) and search-and-scoring (SS).
CB structure learning typically uses independence and conditional independence tests to learn
the network structure. SS structure learning uses a scoring measure over a search space to find
the best-scoring structures.
The CB structure learning algorithms are as follows; some are appropriate for learning classification models.
- Naive Bayes: Creates a naive Bayes model (classification).
- Tree-Augmented Network (TAN): Creates a tree-augmented network (classification).
- BN Augmented Naive Bayes (BAN): A modified BAN algorithm (classification).
- Chow-Liu (aka Maximum Weight Spanning Tree, MWST): Creates a tree-structured BBN.
- Three-Phase Dependency Analysis (TPDA): Uses TPDA (draft, thicken and thin).
- PC Algorithm: Uses the PC algorithm.
There is only one search-and-scoring algorithm implemented; it uses a genetic algorithm.
- Genetic Algorithm (GA): Discovers a high-scoring network structure.
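Whichever algorithm is used, the pattern is the same: construct a learner with a DiscreteData instance and call get_network(). A minimal sketch with the PC algorithm (the threshold shown is the documented default):

    from pysparkbbn.discrete.scblearn import Pc

    learner = Pc(data, cmi_threshold=0.06)
    g = learner.get_network()  # the learned structure as a DiGraph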
Constraint-Based
Constraint-based structure learning.
- class pysparkbbn.discrete.scblearn.Ban(data: DiscreteData, clazz: str, cmi_threshold=0.06, method='pc')
Bases:
BaseStructureLearner
Modified Bayesian network augmented naive Bayes (BAN). See Bayesian Network Classifiers.
- __init__(data: DiscreteData, clazz: str, cmi_threshold=0.06, method='pc')
ctor
- Parameters
data – Data.
clazz – Class variable.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent (default=0.06).
method – Either pc or tpda (default=pc).
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.discrete.scblearn.BaseStructureLearner(data: DiscreteData)
Bases:
object
Base structure learner.
- __init__(data: DiscreteData)
ctor
- Parameters
data – Data.
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.discrete.scblearn.Mwst(data: DiscreteData, cmi_threshold=0.06)
Bases:
BaseStructureLearner
Maximum weight spanning tree.
- __init__(data: DiscreteData, cmi_threshold=0.06)
ctor
- Parameters
data – Data.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent (default=0.06).
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.discrete.scblearn.Naive(data: DiscreteData, clazz: str)
Bases:
BaseStructureLearner
Naive Bayesian network. An arc is drawn from the clazz variable/node to every other node.
- __init__(data: DiscreteData, clazz: str)
ctor.
- Parameters
data – Data.
clazz – The clazz node.
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.discrete.scblearn.Pc(data: DiscreteData, cmi_threshold=0.06)
Bases:
BaseStructureLearner
PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.
- __init__(data: DiscreteData, cmi_threshold=0.06)
ctor
- Parameters
data – Data.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent (default=0.06).
- get_network() DiGraph
Gets the network structure.
- learn_undirected_graph() Graph
Learns an undirected graph.
- Returns
Undirected graph.
- class pysparkbbn.discrete.scblearn.Tan(data: DiscreteData, clazz: str, cmi_threshold=0.06)
Bases:
Mwst
Tree-augmented network. See Comparing Bayesian Network Classifiers.
- __init__(data: DiscreteData, clazz: str, cmi_threshold=0.06)
ctor.
- Parameters
data – Data.
clazz – The clazz node.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent (default=0.06).
- get_network() DiGraph
Gets the network structure.
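A classification-oriented sketch, assuming a class column named y (the name is illustrative):

    from pysparkbbn.discrete.scblearn import Tan

    tan = Tan(data, clazz='y', cmi_threshold=0.06)
    g = tan.get_network()  # class node points to every feature; features form a tree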
- class pysparkbbn.discrete.scblearn.Tpda(data: DiscreteData, cmi_threshold=0.06)
Bases:
Mwst
Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.
- __init__(data: DiscreteData, cmi_threshold=0.06)
ctor.
- Parameters
data – Data.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent (default=0.06).
- get_network() DiGraph
Gets the network structure.
- learn_undirected_graph() Graph
Learns an undirected graph.
- Returns
Undirected graph.
Search-and-Scoring
Search-and-scoring structure learning.
- class pysparkbbn.discrete.ssslearn.Ga(data: DiscreteData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)
Bases:
object
Uses a genetic algorithm to search-and-score candidate networks. The algorithm is a hybrid approach: the ordering of the nodes is first induced by a constraint-based algorithm (MWST, PC or TPDA), and the ordered nodes are then used to constrain the candidate parents of each node; later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental Results.
- __init__(data: DiscreteData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)
ctor
- Parameters
data – Data.
sc – Spark context.
max_parents – Maximum number of parents (default=4).
mutation_rate – Mutation rate (default=0.25).
pop_size – Population size (default=100).
crossover_prob – Crossover probability (default=0.5).
max_iters – Maximum iterations (default=20).
convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).
ordering – Ordering method: mwst, pc or tpda (default=mwst).
ordering_params – Ordering parameters to the ordering method.
seed – Seed for random number generation (default=37).
- get_network() DiGraph
Gets the network structure.
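A sketch; only data and the SparkContext are required, and the options shown are the documented defaults:

    from pysparkbbn.discrete.ssslearn import Ga

    # sc is the active SparkContext, e.g. spark.sparkContext
    ga = Ga(data, sc, max_parents=4, ordering='mwst')
    g = ga.get_network()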
Parameter Learning
Parameter learning.
- class pysparkbbn.discrete.plearn.LargeDataParamLearner(data: DiscreteData, g: DiGraph)
Bases:
object
Parameter learner. Useful when the data cannot fit into memory; parameters are learned sequentially.
- __init__(data: DiscreteData, g: DiGraph)
ctor
- Parameters
data – Data.
g – Directed acyclic graph.
- get_params() Dict[str, List[Dict[str, Union[str, float]]]]
Gets the parameters.
- Returns
Dictionary of parameters.
- class pysparkbbn.discrete.plearn.ParamLearner(data: DiscreteData, g: DiGraph)
Bases:
object
Parameter learner. All parameters are learned in parallel, but this may require extensive Spark tuning.
- __init__(data: DiscreteData, g: DiGraph)
ctor
- Parameters
data – Data.
g – Directed acyclic graph.
- get_params() Dict[str, List[Dict[str, Union[str, float]]]]
Gets the parameters.
- Returns
Dictionary of parameters.
- class pysparkbbn.discrete.plearn.SmallDataParamLearner(data: DiscreteData, g: DiGraph)
Bases:
object
Parameter learner. Useful when the data fits into memory; parameters are learned in parallel.
- __init__(data: DiscreteData, g: DiGraph)
ctor
- Parameters
data – Data.
g – Directed acyclic graph.
- get_params() Dict[str, List[Dict[str, Union[str, float]]]]
Gets the parameters.
- Returns
Dictionary of parameters.
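Given a learned structure g, the three learners are interchangeable; the choice depends on data size and how much Spark tuning is acceptable. A sketch:

    from pysparkbbn.discrete.plearn import LargeDataParamLearner, SmallDataParamLearner

    # data fits into memory: learn parameters in parallel
    p = SmallDataParamLearner(data, g).get_params()

    # data does not fit into memory: learn parameters sequentially
    p = LargeDataParamLearner(data, g).get_params()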
- pysparkbbn.discrete.plearn.decode_row(r: Row, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) Row
Decodes a row from numeric back to categorical values.
- Parameters
r – Row.
encodings – Dictionary of forward and backward encodings.
- Returns
Row.
- pysparkbbn.discrete.plearn.encode_data(df: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) DataFrame
Encodes a Spark dataframe.
- Parameters
df – Spark dataframe.
encodings – Dictionary of forward and backward encodings.
- Returns
Spark dataframe.
- pysparkbbn.discrete.plearn.encode_row(r: Row, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) Row
Encodes a row from categorical values to numeric ones.
- Parameters
r – Row.
encodings – Dictionary of forward and backward encodings.
- Returns
Row.
- pysparkbbn.discrete.plearn.estimate_scikit_cpt(X_cols: List[str], y_col: str, pdf: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) List[Dict[Any, Any]]
Estimates the CPT (Scikit).
- Parameters
X_cols – X columns.
y_col – y column.
pdf – Pandas dataframe.
encodings – Dictionary of forward and backward encodings.
- Returns
CPT.
- pysparkbbn.discrete.plearn.estimate_spark_cpt(X_cols: List[str], y_col: str, sdf: DataFrame, encodings: Dict[str, Dict[str, Dict[Any, Any]]], spark: SparkSession) Dict[str, List[Dict[Any, Any]]]
Estimates the CPT (Spark).
- Parameters
X_cols – X columns.
y_col – y column.
sdf – Spark dataframe.
encodings – Dictionary of forward and backward encodings.
spark – Spark session.
- Returns
CPT.
- pysparkbbn.discrete.plearn.get_encodings(sdf: DataFrame) Dict[str, Dict[str, Dict[Any, Any]]]
Gets encodings of categorical values to numeric (integer) ones.
- Parameters
sdf – Spark dataframe.
- Returns
Dictionary of forward and backward encodings.
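A round-trip sketch over the encoding helpers, with sdf a Spark dataframe of categorical columns:

    from pysparkbbn.discrete.plearn import encode_data, get_encodings

    encodings = get_encodings(sdf)         # forward and backward maps per column
    encoded = encode_data(sdf, encodings)  # categorical values replaced by integers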
- pysparkbbn.discrete.plearn.get_pandas_Xy(X_cols: List[str], y_col: str, pdf: DataFrame) DataFrame
Gets a Pandas dataframe in the Xy form.
- Parameters
X_cols – X columns.
y_col – y column.
pdf – Pandas dataframe.
- Returns
Xy Pandas dataframe.
- pysparkbbn.discrete.plearn.get_parent_child(g: DiGraph) List[Tuple[List[str], str]]
Converts the graph to a list of tuples, where the first item in each tuple is the list of parents and the second item is the child.
- Parameters
g – Directed acyclic graph.
- Returns
List of parent/child pairs.
- pysparkbbn.discrete.plearn.get_scikit_cpt(X_cols: List[str], y_col: str, m: Pipeline, encodings: Dict[str, Dict[str, Dict[Any, Any]]]) List[Dict[Any, Any]]
Gets the CPT for the specified y variable (Scikit).
- Parameters
X_cols – X columns.
y_col – y column.
m – Classification model.
encodings – Dictionary of forward and backward encodings.
- Returns
CPT.
- pysparkbbn.discrete.plearn.get_scikit_model(X: DataFrame, y: Series) Pipeline
Gets a Scikit logistic regression model.
- Parameters
X – Pandas dataframe.
y – Pandas series.
- Returns
Scikit pipeline.
- pysparkbbn.discrete.plearn.get_spark_Xy(X_cols: List[str], y_col: str, sdf: DataFrame) DataFrame
Gets a Spark dataframe in the Xy form.
- Parameters
X_cols – X columns.
y_col – y column.
sdf – Spark dataframe.
- Returns
Xy Spark dataframe.
- pysparkbbn.discrete.plearn.get_spark_cpt(X_cols: List[str], y_col: str, m: PipelineModel, encodings: Dict[str, Dict[str, Dict[Any, Any]]], spark: SparkSession) Dict[str, List[Dict[Any, Any]]]
Gets the CPT for the specified y variable (Spark).
- Parameters
X_cols – X columns.
y_col – y column.
m – Classification model.
encodings – Dictionary of forward and backward encodings.
spark – Spark session.
- Returns
CPT.
- pysparkbbn.discrete.plearn.get_spark_model(Xy: DataFrame) PipelineModel
Gets a Spark logistic regression model.
- Parameters
Xy – Spark dataframe.
- Returns
Spark pipeline.
Utilities
Utilities to make life easier.
- pysparkbbn.discrete.util.get_triplets(g)
Gets all triplets (x, y, z) where the edges x–y and y–z exist, but x–z does not.
- Parameters
g – Undirected graph.
- Returns
List of triplets.
- pysparkbbn.discrete.util.log_gamma(x: int) float
Computes log gamma(x), where gamma(x) = (x - 1)!. For example, if x = 5, then gamma(5) = (5 - 1)! = 4! = 4 x 3 x 2 x 1 = 24, and log(gamma(5)) = log(4!) = log(4) + log(3) + log(2) + log(1).
- Parameters
x – Positive integer.
- Returns
Log gamma(x).
- pysparkbbn.discrete.util.log_gamma_ratio(numerator: int, denominator: int) float
Computes the ratio of gammas in log-space.
- Parameters
numerator – Numerator gamma.
denominator – Denominator gamma.
- Returns
log(gamma(numerator) / gamma(denominator)).
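For example, log(gamma(5) / gamma(3)) = log(4! / 2!) = log(4) + log(3) ≈ 2.485. A quick check against the standard library's log-gamma function:

    import math

    # gamma ratio in log-space agrees with the telescoped sum of logs
    assert math.isclose(math.lgamma(5) - math.lgamma(3), math.log(4) + math.log(3))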
- pysparkbbn.discrete.bbn.get_bbn(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Bbn
Gets a Bayesian Belief Network based on the specified graph and parameters.
- Parameters
g – Directed acyclic graph.
p – Parameters.
profile – Variable profiles.
- Returns
BBN.
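Tying the pieces together, a sketch (g from a structure learner, p from a parameter learner, and data as before):

    from pysparkbbn.discrete.bbn import get_bbn

    profile = data.get_profile()
    bbn = get_bbn(g, p, profile)  # Bbn instance, ready for use with Py-BBN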
- pysparkbbn.discrete.bbn.get_darkstar_data(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Dict[str, Any]
Gets JSON data for Darkstar.
- Parameters
g – Directed acyclic graph.
p – Parameters.
profile – Variable profiles.
- Returns
Data.
- pysparkbbn.discrete.bbn.get_pybbn_data(g: DiGraph, p: Dict[str, List[Dict[str, Union[str, float]]]], profile: Dict[str, Dict[str, int]]) Dict[str, Any]
Gets JSON data for Py-BBN.
- Parameters
g – Directed acyclic graph.
p – Parameters.
profile – Variable profiles.
- Returns
Data.