Learning from Continuous Data
All modules related to learning Bayesian Belief Networks (BBNs) from multivariate Gaussian data.
Data
The data.
- class pysparkbbn.continuous.data.CondMvn(index_1: List[int], index_2: List[int], means: ndarray, cov: ndarray, zero=1e-06, cols_1: List[str] = None, cols_2: List[str] = None)
Bases:
object
Conditional multivariate normal.
- __init__(index_1: List[int], index_2: List[int], means: ndarray, cov: ndarray, zero=1e-06, cols_1: List[str] = None, cols_2: List[str] = None)
ctor
- Parameters
index_1 – Index of dependent variables.
index_2 – Index of conditioning variables.
means – Means.
cov – Covariance matrix.
zero – Threshold below which to consider a probability as zero.
cols_1 – Names corresponding to index_1.
cols_2 – Names corresponding to index_2.
- equals(other: Any) bool
Checks if this is equal to other.
- Parameters
other – CondMvn.
- Returns
Boolean.
- log_proba(x: ndarray) ndarray
Estimate the log conditional probability of the specified data point.
- Parameters
x – Data point.
- Returns
Log probability.
- static partition_cov(cov: ndarray, index_1: List[int], index_2: List[int]) Tuple[ndarray, ndarray, ndarray, ndarray]
Partitions the covariance matrix.
- Parameters
cov – Covariance matrix.
index_1 – Index.
index_2 – Index.
- Returns
Partitioned covariance matrix.
- static partition_means(means: ndarray, index_1: List[int], index_2: List[int]) Tuple[ndarray, ndarray]
Partitions the means.
- Parameters
means – Means.
index_1 – Index.
index_2 – Index.
- Returns
Partitioned means.
- static partition_x(x: ndarray, index_1: List[int], index_2: List[int]) Tuple[ndarray, ndarray]
Partitions the data point.
- Parameters
x – Data point.
index_1 – Index.
index_2 – Index.
- Returns
Tuple of the partitioned data point.
- pdf(x: ndarray) float
Estimate the conditional probability of the specified data point.
- Parameters
x – Data point.
- Returns
Probability.
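As a sketch of the math behind CondMvn, the standard Gaussian conditioning formulas can be applied by partitioning the means and covariance, mirroring partition_means, partition_cov, and partition_x. This is an illustrative example with made-up values, not the library's code:

```python
import numpy as np

# Toy joint distribution over 3 variables.
means = np.array([1.0, 2.0, 3.0])
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
index_1, index_2 = [0], [1, 2]  # dependent vs conditioning variables

# Partition the means and covariance (as partition_means / partition_cov do).
m1, m2 = means[index_1], means[index_2]
s11 = cov[np.ix_(index_1, index_1)]
s12 = cov[np.ix_(index_1, index_2)]
s21 = cov[np.ix_(index_2, index_1)]
s22 = cov[np.ix_(index_2, index_2)]

# Partition a data point (as partition_x does).
x = np.array([0.8, 2.1, 2.9])
x1, x2 = x[index_1], x[index_2]

# Standard conditioning formulas: mean and covariance of X1 | X2 = x2.
cond_mean = m1 + s12 @ np.linalg.inv(s22) @ (x2 - m2)
cond_cov = s11 - s12 @ np.linalg.inv(s22) @ s21

# Conditional density at x1 (what pdf estimates).
diff = x1 - cond_mean
p = np.exp(-0.5 * diff @ np.linalg.inv(cond_cov) @ diff) / \
    np.sqrt((2 * np.pi) ** len(index_1) * np.linalg.det(cond_cov))
```

Note that conditioning shrinks the variance: the conditional variance is at most the marginal variance of the dependent variable.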
- class pysparkbbn.continuous.data.GaussianData(sdf: DataFrame, n_samples=6, spark=None)
Bases:
object
Gaussian data.
- __init__(sdf: DataFrame, n_samples=6, spark=None)
ctor
- Parameters
sdf – Spark data frame.
n_samples – Number of samples.
spark – Spark object (required due to a bug).
- drop(columns: List[str])
Drops specified columns.
- Parameters
columns – List of columns.
- Returns
Gaussian data.
- get_cmi_par(triplets: List[Triplet]) List[Tuple[List[str], List[str], List[str], float]]
Computes conditional mutual information between triplets in parallel.
- Parameters
triplets – List of triplets (of variables).
- Returns
List of conditional mutual information.
- get_covariance() ndarray
Gets the covariance matrix.
- Returns
Covariance matrix.
- get_means() ndarray
Get means.
- Returns
List of means.
- get_mi_par(pairs: List[Pair]) List[Tuple[List[str], List[str], float]]
Computes mutual information between the pairs of variables in parallel.
- Parameters
pairs – List of pairs (of variables).
- Returns
List of mutual information.
- get_min_max(columns: List[str]) Dict[str, Dict[str, Union[int, float]]]
Get dictionary of min/max.
- Parameters
columns – Variable names.
- Returns
Dictionary of min/max associated with names.
- get_min_max_for(column: str) Dict[str, Union[int, float]]
Get min/max value for specified variable.
- Parameters
column – Variable name.
- Returns
Dictionary of min/max.
- get_mvn(columns: List[str]) Mvn
Gets a multivariate normal instance.
- Parameters
columns – List of variable names.
- Returns
Multivariate normal.
- get_pair(x: List[str], y: List[str]) Pair
Gets a pair.
- Parameters
x – X variables.
y – Y variables.
- Returns
Pair.
- get_pairs(col_pairs: List[Tuple[List[str], List[str]]] = None) List[Pair]
Gets list of pairs.
- Parameters
col_pairs – List of column pairs.
- Returns
List of pairs.
- get_pairwise_columns() Generator[Tuple[List[str], List[str]], None, None]
Gets pairs of columns.
- Returns
Generator of column pairs; each column is wrapped in a list.
- get_profile() Dict[str, Dict[str, float]]
Gets profile of variables.
- Returns
Dictionary; keys are variable names and values are summary stats.
- get_score_par(cmvns: List[CondMvn]) List[Tuple[List[str], List[str], float]]
Computes the scores.
- Parameters
cmvns – List of conditional multivariate Gaussian distributions.
- Returns
List of scores.
- get_triplet(x: List[str], y: List[str], z: List[str]) Triplet
Gets a triplet.
- Parameters
x – X variables.
y – Y variables.
z – Z variables.
- Returns
Triplet.
- get_triplets(col_triplets: List[Tuple[List[str], List[str], List[str]]])
Gets list of triplets.
- Parameters
col_triplets – List of column triplets.
- Returns
List of triplets.
- slice_covariance(columns: List[str]) ndarray
Slices covariance matrix according to variables.
- Parameters
columns – List of variables.
- Returns
Covariance matrix.
- slice_means(columns: List[str]) ndarray
Slices means vector according to variables.
- Parameters
columns – List of variables.
- Returns
List of means.
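The slice_means and slice_covariance methods select the sub-vector and sub-matrix corresponding to a subset of variables. A minimal sketch of that behavior with plain numpy (illustrative names and values, not the library's code):

```python
import numpy as np

# Toy data: three named variables with their means and covariance.
columns = ['a', 'b', 'c']
means = np.array([0.0, 1.0, 2.0])
cov = np.array([[1.0, 0.2, 0.1],
                [0.2, 1.0, 0.3],
                [0.1, 0.3, 1.0]])

def slice_means(names):
    # Map names to positions, then take the corresponding entries.
    idx = [columns.index(n) for n in names]
    return means[idx]

def slice_covariance(names):
    # np.ix_ selects the rows and columns for the named variables.
    idx = [columns.index(n) for n in names]
    return cov[np.ix_(idx, idx)]

sub_means = slice_means(['a', 'c'])
sub_cov = slice_covariance(['a', 'c'])
```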
- class pysparkbbn.continuous.data.Mvn(columns: List[str], means: ndarray, cov: ndarray, profile: Dict[str, Dict[str, float]], n_samples=10)
Bases:
object
Multivariate normal distribution.
- __init__(columns: List[str], means: ndarray, cov: ndarray, profile: Dict[str, Dict[str, float]], n_samples=10)
ctor
- Parameters
columns – List of variable names.
means – Vector means.
cov – Covariance matrix.
profile – Dictionary of min/max for each variable.
n_samples – Number of samples.
- get_values() Generator[Tuple[float, ...], None, None]
Gets the sampled values.
- Returns
Generator of values.
- pdf(x: ndarray) float
Estimate the probability of the specified data point.
- Parameters
x – Data point.
- Returns
Probability.
- class pysparkbbn.continuous.data.Pair(X: Mvn, Y: Mvn, XY: Mvn)
Bases:
object
Pair of variables.
- __init__(X: Mvn, Y: Mvn, XY: Mvn)
ctor
- Parameters
X – X variables.
Y – Y variables.
XY – XY variables.
- get_mi() float
Computes the mutual information.
- Returns
Mutual information.
- get_partial_mi(dp: ndarray) float
Computes the partial mutual information.
- Parameters
dp – Data point.
- Returns
Partial mutual information.
- get_values() Generator[Tuple[float, ...], None, None]
Gets the XY values.
- Returns
Generator of XY values.
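For multivariate Gaussians, mutual information has a closed form in terms of covariance determinants, I(X; Y) = ½ ln(det(Σ_X) det(Σ_Y) / det(Σ_XY)). The sketch below shows that formula (a standard result; it may differ from how get_mi is implemented internally):

```python
import numpy as np

def gaussian_mi(cov, x_idx, y_idx):
    # I(X; Y) = 0.5 * ln( det(cov_X) * det(cov_Y) / det(cov_XY) )
    idx = x_idx + y_idx
    cxy = cov[np.ix_(idx, idx)]
    cx = cov[np.ix_(x_idx, x_idx)]
    cy = cov[np.ix_(y_idx, y_idx)]
    return 0.5 * np.log(np.linalg.det(cx) * np.linalg.det(cy)
                        / np.linalg.det(cxy))

# Strongly correlated pair: high mutual information.
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
mi = gaussian_mi(cov, [0], [1])

# Independent pair (identity covariance): zero mutual information.
mi0 = gaussian_mi(np.eye(2), [0], [1])
```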
- class pysparkbbn.continuous.data.Triplet(x_cols: List[str], y_cols: List[str], z_cols: List[str], Z: Mvn, XZ: Mvn, YZ: Mvn, XYZ: Mvn)
Bases:
object
Triplet variables.
- __init__(x_cols: List[str], y_cols: List[str], z_cols: List[str], Z: Mvn, XZ: Mvn, YZ: Mvn, XYZ: Mvn)
ctor
- Parameters
x_cols – X columns.
y_cols – Y columns.
z_cols – Z columns.
Z – Z variables.
XZ – XZ variables.
YZ – YZ variables.
XYZ – XYZ variables.
- get_cmi() float
Computes the conditional mutual information.
- Returns
Conditional mutual information.
- get_partial_mi(dp: ndarray) float
Computes the partial mutual information.
- Parameters
dp – Data point.
- Returns
Partial mutual information.
- get_values() Generator[Tuple[float, ...], None, None]
Gets the XYZ values.
- Returns
Generator of XYZ values.
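Conditional mutual information for Gaussians likewise has a determinant-based closed form, I(X; Y | Z) = ½ ln(det(Σ_XZ) det(Σ_YZ) / (det(Σ_Z) det(Σ_XYZ))). A sketch of that formula (a standard result, not necessarily the exact implementation of get_cmi):

```python
import numpy as np

def gaussian_cmi(cov, x_idx, y_idx, z_idx):
    # I(X; Y | Z) = 0.5 * ln( det(S_xz) * det(S_yz) / (det(S_z) * det(S_xyz)) )
    def det_of(idx):
        return np.linalg.det(cov[np.ix_(idx, idx)])
    return 0.5 * np.log(det_of(x_idx + z_idx) * det_of(y_idx + z_idx)
                        / (det_of(z_idx) * det_of(x_idx + y_idx + z_idx)))

# Chain X -> Z -> Y: corr(x, y) = corr(x, z) * corr(z, y), so X and Y
# are conditionally independent given Z and the CMI is zero.
cov_chain = np.array([[1.00, 0.64, 0.80],
                      [0.64, 1.00, 0.80],
                      [0.80, 0.80, 1.00]])  # order: x, y, z
cmi = gaussian_cmi(cov_chain, [0], [1], [2])

# X and Y directly correlated, Z independent of both: CMI stays positive.
cov_direct = np.array([[1.0, 0.8, 0.0],
                       [0.8, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
cmi2 = gaussian_cmi(cov_direct, [0], [1], [2])
```

Constraint-based learners below use thresholds on this quantity to decide conditional (in)dependence.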
Structure Learning
Constraint-Based
Constraint-based structure learning.
- class pysparkbbn.continuous.scblearn.Ban(data: GaussianData, clazz: str, cmi_threshold=0.0001, method='pc')
Bases:
BaseStructureLearner
Modified Bayesian network augmented naive Bayes (BAN). See Bayesian Network Classifiers.
- __init__(data: GaussianData, clazz: str, cmi_threshold=0.0001, method='pc')
ctor
- Parameters
data – Data.
clazz – Class variable.
cmi_threshold – Threshold (at or above which variables are considered conditionally dependent).
method – Either pc or tpda (default=pc).
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.continuous.scblearn.BaseStructureLearner(data: GaussianData)
Bases:
object
Base structure learner.
- __init__(data: GaussianData)
ctor
- Parameters
data – Data.
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.continuous.scblearn.Mwst(data: GaussianData, cmi_threshold=0.01)
Bases:
BaseStructureLearner
Maximum weight spanning tree.
- __init__(data: GaussianData, cmi_threshold=0.01)
ctor
- Parameters
data – Data.
cmi_threshold – Threshold (at or above which variables are considered conditionally dependent).
- get_network() DiGraph
Gets the network structure.
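A maximum weight spanning tree over mutual-information edge weights can be sketched with Kruskal's algorithm: take the heaviest edges first, skipping any that would close a cycle. This is an illustrative pure-Python version with hypothetical weights, not the library's implementation:

```python
def max_spanning_tree(nodes, weighted_edges):
    # weighted_edges: list of (weight, u, v) tuples.
    parent = {n: n for n in nodes}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    tree = []
    # Heaviest edges first; keep an edge only if it joins two components.
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Hypothetical pairwise mutual-information weights.
edges = [(0.9, 'a', 'b'), (0.7, 'b', 'c'), (0.2, 'a', 'c'), (0.6, 'c', 'd')]
tree = max_spanning_tree(['a', 'b', 'c', 'd'], edges)
```

A spanning tree over n nodes always has n - 1 edges; here the weak a-c edge is dropped because it would close a cycle.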
- class pysparkbbn.continuous.scblearn.Naive(data: GaussianData, clazz: str)
Bases:
BaseStructureLearner
Naive Bayesian network. An arc is drawn from the clazz variable/node to every other node.
- __init__(data: GaussianData, clazz: str)
ctor.
- Parameters
data – Data.
clazz – The clazz node.
- get_network() DiGraph
Gets the network structure.
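The naive structure is simple enough to sketch directly: one arc from the class node to each remaining node. The edge-list version below is illustrative (the library returns a networkx DiGraph); the column names are made up:

```python
def naive_structure(columns, clazz):
    # An arc from the class node to every other node.
    return [(clazz, c) for c in columns if c != clazz]

edges = naive_structure(['label', 'height', 'weight'], 'label')
```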
- class pysparkbbn.continuous.scblearn.Pc(data: GaussianData, cmi_threshold=0.0001)
Bases:
BaseStructureLearner
PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.
- __init__(data: GaussianData, cmi_threshold=0.0001)
ctor
- Parameters
data – Data.
cmi_threshold – Threshold (at or above which variables are considered conditionally dependent).
- get_network() DiGraph
Gets the network structure.
- learn_undirected_graph() Graph
Learns an undirected graph.
- Returns
Undirected graph.
- class pysparkbbn.continuous.scblearn.Tan(data: GaussianData, clazz: str, cmi_threshold=0.01)
Bases:
Mwst
Tree-augmented network. See Comparing Bayesian Network Classifiers.
- __init__(data: GaussianData, clazz: str, cmi_threshold=0.01)
ctor.
- Parameters
data – Data.
clazz – The clazz node.
cmi_threshold – Threshold (at or above which variables are considered conditionally dependent).
- get_network() DiGraph
Gets the network structure.
- class pysparkbbn.continuous.scblearn.Tpda(data: GaussianData, cmi_threshold=0.006)
Bases:
Mwst
Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.
- __init__(data: GaussianData, cmi_threshold=0.006)
ctor.
- Parameters
data – Data.
cmi_threshold – Threshold (at or above which variables are considered conditionally dependent).
- get_network() DiGraph
Gets the network structure.
- learn_undirected_graph() Graph
Learns an undirected graph.
- Returns
Undirected graph.
Search-and-Scoring
Search-and-scoring structure learning.
- class pysparkbbn.continuous.ssslearn.Ga(data: GaussianData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)
Bases:
object
Uses a genetic algorithm to search and score candidate networks. The algorithm is a hybrid approach: the ordering of the nodes is first induced by a constraint-based algorithm (MWST, PC, or TPDA), and the ordered nodes then constrain the candidate parents of each node; later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental Results.
- __init__(data: GaussianData, sc: SparkContext, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)
ctor
- Parameters
data – Data.
sc – Spark context.
max_parents – Maximum number of parents (default=4).
mutation_rate – Mutation rate (default=0.25).
pop_size – Population size (default=100).
crossover_prob – Crossover probability (default=0.5).
max_iters – Maximum iterations (default=20).
convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).
ordering – Ordering method: mwst, pc or tpda (default=mwst).
ordering_params – Ordering parameters to the ordering method.
seed – Seed for random number generation (default=37).
- get_network() DiGraph
Gets the network structure.
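The ordering constraint described above can be sketched as follows: each node may draw parents only from nodes earlier in the ordering, up to max_parents of them. This is an illustrative enumeration of the candidate parent sets the search is restricted to, not the library's implementation:

```python
from itertools import combinations

def candidate_parent_sets(ordering, node, max_parents):
    # Only nodes earlier in the ordering may be parents of this node.
    earlier = ordering[:ordering.index(node)]
    sets = []
    for k in range(min(max_parents, len(earlier)) + 1):
        sets.extend(combinations(earlier, k))
    return sets

# With ordering a < b < c, node 'c' may take parents only from {a, b}.
sets_c = candidate_parent_sets(['a', 'b', 'c'], 'c', max_parents=2)
```

The first node in the ordering always has the empty set as its only candidate, which keeps the search space acyclic by construction.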