Documentation for BayesDB alpha 0.2.0

Bayesian Query Language (BQL)

The Bayesian Query Language (BQL) is a SQL-like language that adds support for running inference and executing predictive queries based on a Bayesian model of the data. This page introduces a number of commonly used commands. For complete documentation, see the documentation site.

Loading data

CREATE BTABLE <btable> FROM <filename.csv>

Creates a btable by importing data from the specified CSV file. The file must be in CSV format, and the first line must be a header giving the name of each column.
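For example, assuming a hypothetical file people.csv whose first line is a header row, the following would create a btable named people:

```sql
CREATE BTABLE people FROM people.csv
```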

UPDATE SCHEMA FOR <btable> SET <col1>=<type1>[,<col2>=<type2>...]

Types are multinomial (categorical), continuous (numeric), ignore, and key. Key columns are ignored for inference, but can be used later to uniquely identify rows instead of using ID. Note that datatypes cannot be updated once the models have been analyzed.
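As a sketch, assuming a hypothetical people btable with columns name, age, state, and notes, a schema update might look like:

```sql
UPDATE SCHEMA FOR people SET name=key, age=continuous, state=multinomial, notes=ignore
```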

EXECUTE FILE <filename.bql>

You may write BQL commands in a file and run them all at once by using EXECUTE FILE and passing your .bql file. This is especially handy with UPDATE SCHEMA for tables with many columns, where you may want to keep a long, cumbersome UPDATE SCHEMA query in a separate file to preserve it.

Initializing models and running analysis

You must run INITIALIZE MODELS and ANALYZE before BayesDB can evaluate any query involving predictions. The models that these commands create can be thought of as possible explanations for the underlying structure of the data in a Bayesian probability model. To generate reasonable predictions, initialize a minimum of about 16 models (the default value), and analyze each model for a minimum of about 250 iterations. The more models you create, the higher quality your predictions will be, so if it is computationally feasible to initialize and analyze 128 or 256 models for your dataset, doing so is highly recommended. It is sensible to use 250-500 iterations of ANALYZE for each model, no matter how many models you have.

INITIALIZE [<num_models> MODELS] FOR <btable> [WITH CONFIG <config>]

Initializes num_models models. Advanced users interested in using custom model configurations may pass config here.
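For instance, to initialize 32 models for a hypothetical people btable:

```sql
INITIALIZE 32 MODELS FOR people
```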

ANALYZE <btable> [MODEL[S] <model_index>-<model_index>] FOR (<num_iterations> ITERATIONS | <minutes> MINUTES) [WAIT]

Analyzes the specified models for the specified number of iterations or minutes (by default, all models are analyzed). If WAIT is included, the user cannot make any other queries until ANALYZE has finished; by default, ANALYZE runs in the background. The state of a background ANALYZE may be checked with SHOW ANALYZE FOR <btable>, and ANALYZE can be canceled with CANCEL ANALYZE FOR <btable>.
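Assuming a hypothetical people btable, the first statement below analyzes all models for 250 iterations in the background, and the second analyzes only models 0 through 7 for 10 minutes while blocking further queries:

```sql
ANALYZE people FOR 250 ITERATIONS
ANALYZE people MODELS 0-7 FOR 10 MINUTES WAIT
```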

Examining the state of your btables

There are a few convenience commands available to help you view the internal state of BayesDB.

LIST BTABLES

View the list of all btable names in BayesDB.

SHOW SCHEMA FOR <btable>

View each column name and its datatype.

SHOW MODELS FOR <btable>

Display all models, their ids, and how many iterations of ANALYZE have been performed on each one.

SHOW DIAGNOSTICS FOR <btable>

Advanced feature: show diagnostic information for your btable’s models.

Saving and loading models

SAVE MODELS and LOAD MODELS allow you to export your models from one instance of BayesDB (save) and then import them back into any instance of BayesDB (load), so that you don’t have to re-run the potentially time-consuming ANALYZE step.

LOAD MODELS <filename.pkl.gz> INTO <btable>

SAVE MODELS FROM <btable> TO <filename.pkl.gz>
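For example, with a hypothetical people btable and the file name people_models.pkl.gz:

```sql
SAVE MODELS FROM people TO people_models.pkl.gz
LOAD MODELS people_models.pkl.gz INTO people
```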

Deleting

You may delete an entire btable (including all its associated models - careful!), or just some of the models.

DROP BTABLE <btable>

DROP MODEL[S] [<model_index>-<model_index>] FROM <btable>
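For example, assuming a hypothetical people btable, to drop models 8 through 15, and then the btable itself:

```sql
DROP MODELS 8-15 FROM people
DROP BTABLE people
```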

Querying

BayesDB has five fundamental query statements: SELECT, INFER, SIMULATE, ESTIMATE COLUMNS, and ESTIMATE PAIRWISE. They bear a strong resemblance to SQL queries.

SELECT is just like SQL’s SELECT, except in addition to selecting, filtering (with where), and ordering raw column values, you may also use predictive functions in any of those clauses:

SELECT <columns|functions> FROM <btable> [WHERE <whereclause>] [ORDER BY <columns|functions>] [LIMIT <limit>] [INTO <newbtablename>]

INFER is just like SELECT, except that it also tries to fill in missing values. The user may specify the desired confidence level to use (a number between 0 and 1, where 0 means “fill in every missing value with whatever your best guess is”, and 1 means “only fill in a missing value if you’re sure what it is”). If confidence is not specified, no values are filled in. Confidence may be specified with more granularity by following a column name with CONF <confidence> to allow you to specify different confidence levels for different columns. Optionally, the user may specify the number of samples to use when filling in missing values: the default value is good in general, but if you know what you’re doing and want higher accuracy, you can increase the number of samples used:

INFER <columns|functions> FROM <btable> [WHERE <whereclause>] [WITH CONFIDENCE <confidence>] [WITH <numsamples> SAMPLES] [ORDER BY <columns|functions>] [LIMIT <limit>] [INTO <newbtablename>]

To specify individual confidence levels for each value to be filled in:

INFER col1 CONF 0.9, col2, col3 CONF 0.5 FROM table WHERE col1 > 100 CONF 0.8 AND col2 = 'True' CONF 0.9

To specify the same confidence level for every value of the query:

INFER col1, col2, col3 FROM table WHERE col1 > 100 AND col2 = 'True' WITH CONFIDENCE 0.9   

SIMULATE generates new rows from the underlying probability model a specified number of times:

SIMULATE <columns> FROM <btable> [GIVEN <column>=<value>] TIMES <times> [SAVE TO <file>]
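As a sketch, assuming a hypothetical people btable with columns age, height, and state, the following would generate 10 new rows conditioned on state being 'MA':

```sql
SIMULATE age, height FROM people GIVEN state='MA' TIMES 10
```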

The optional INTO clause at the end of SELECT, INFER, or SIMULATE queries allows you to create a new btable from the query results. The new btable’s schema will be created based on the schema of the original table in the query.

ESTIMATE COLUMNS is like a SELECT statement, but lets you select columns instead of rows:

ESTIMATE COLUMNS FROM <btable> [WHERE <whereclause>] [ORDER BY <functions>] [LIMIT <limit>] [AS <column_list>]

ESTIMATE PAIRWISE takes any function of two columns, i.e. DEPENDENCE PROBABILITY, CORRELATION, or MUTUAL INFORMATION, and generates a matrix showing the value of that function applied to each pair of columns. See the predictive functions section for more information.

In addition, you may also add SAVE CLUSTERS WITH THRESHOLD <threshold> AS <column_list> in order to compute groups of columns, where the value of the pairwise function is at least <threshold> between at least one pair of columns in the group. Then, those groups of columns are saved as column lists with names column_list_<id>, where id is an integer starting with 0:

ESTIMATE PAIRWISE <function> FROM <btable> [FOR <columns>] [SAVE TO <file>] [SAVE CLUSTERS WITH THRESHOLD <threshold> AS <column_list>]
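For example, assuming a hypothetical people btable, the following computes the full dependence probability matrix and saves groups of columns whose pairwise dependence probability reaches 0.9 as column lists:

```sql
ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM people SAVE CLUSTERS WITH THRESHOLD 0.9 AS dependent_cols
```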

You may also compute pairwise functions of rows with ESTIMATE PAIRWISE ROW:

ESTIMATE PAIRWISE ROW SIMILARITY [WITH RESPECT TO <columns|column_lists>] FROM <btable> [FOR <rows>] [SAVE TO <file>] [SAVE CLUSTERS WITH THRESHOLD <threshold> [INTO|AS] <btable>]
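For example, assuming a hypothetical people btable, the following saves clusters of mutually similar rows as row lists:

```sql
ESTIMATE PAIRWISE ROW SIMILARITY FROM people SAVE CLUSTERS WITH THRESHOLD 0.8 AS similar_rows
```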

In the above query specifications, you may be wondering what some of the notation, such as <columns|functions> and <whereclause>, means. <columns|functions> just means a list of comma-separated column names or function specifications:

SELECT name, age, date FROM...
SELECT name, TYPICALITY, age, date FROM...

Additionally, a WITH <num_samples> SAMPLES clause may be specified for any of the above queries to set the number of predictive samples that should be used to evaluate predictive queries, including INFER or any predictive function. If this clause is not present, reasonable defaults are selected:

WITH <num_samples> SAMPLES

Where Clause

For SELECT, INFER, and ESTIMATE COLUMNS, you may include a where clause to filter results much like SQL. Where clauses have the following format:

WHERE <column|function> <operator> <value> [CONF <confidence>] [AND <column|function> <operator> <value> [CONF <confidence>]...]

SELECT and INFER where clauses may include columns and non-aggregate functions such as PREDICTIVE PROBABILITY and TYPICALITY. ESTIMATE COLUMNS where clauses may include aggregate functions of columns such as MUTUAL INFORMATION or PROBABILITY. Only INFER allows CONF to be specified in its where clauses. The operator can be one of =, <, >, <=, >=, and in.

SELECT * FROM table WHERE name = 'Bob' AND age <= 18 AND TYPICALITY > 0.5 ....

Order By

The order by clause changes the order of results by one or more conditions.

ORDER BY <column|function> [CONF <confidence>] [ASC|DESC] [, <column|function> [CONF <confidence>] [ASC|DESC]]

Columns or rows returned are ordered by each condition in the order the conditions were specified. If not specified by ASC or DESC, the order is assumed to be descending. Only ORDER BY clauses within INFER queries may use CONF.
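For example, assuming a hypothetical people btable, the following orders rows first by typicality (descending, the default) and then by age in ascending order:

```sql
SELECT name, age FROM people ORDER BY TYPICALITY, age ASC
```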

Query Summarizers

SUMMARIZE or PLOT may be prepended to any query that returns table-formatted output (almost every query) in order to return a summary of the data table instead of the raw data itself. This is extremely useful as a tool to quickly understand a huge result set: it quickly becomes impossible to see trends in data by eye without the assistance of SUMMARIZE or PLOT.

SUMMARIZE displays summary statistics of each of the output columns: for numerical data, it displays information like the mean, standard deviation, min, and max, and for categorical data it displays the most common values and their probabilities.

SUMMARIZE SELECT * FROM table...

PLOT displays plots of the marginal distributions of every single output column, as well as the joint distributions of every pair of output columns. PLOT displays a heat map for pairs of numerical columns, the exact joint distribution for pairs of categorical columns, and a series of box plots for mixed numerical/categorical data. Many tools, like R and pandas, have functionality similar to PLOT when all the data is the same type, but PLOT is specially designed and implemented from the ground up to behave well with mixed datatypes.

PLOT SELECT * FROM table...

Column Lists

Instead of manually typing in a comma-separated list of columns for queries, you may instead use a ‘column list’ in any query that asks for a list of columns. Column lists are created with ESTIMATE COLUMNS, which allows you to filter the columns you want included with a where clause, order the columns by some function, limit the number of columns, and save the column list by giving it a name with the AS clause:

ESTIMATE COLUMNS FROM <btable> [WHERE <whereclause>] [ORDER BY <functions>] [LIMIT <limit>] [AS <column_list>]

Since it may be hard to imagine what you’d put in the WHERE or ORDER BY clause, take a look at an example, and be sure to read the predictive functions section below:

ESTIMATE COLUMNS FROM table WHERE TYPICALITY > 0.6 ORDER BY DEPENDENCE PROBABILITY WITH name;  

You can print out the names of the stored column lists in your btable with

SHOW COLUMN LISTS FOR <btable>

And you can view the columns in a given column list or table with

SHOW COLUMNS FOR <column_list|btable>

Row Lists

In addition to storing lists of columns, BayesDB also allows you to store lists of rows. Currently, the only way to create row lists is by running ESTIMATE PAIRWISE ROW SIMILARITY with SAVE CLUSTERS. The clusters will be saved as row lists, which you can then view with the following command:

SHOW ROW LISTS FOR <table>

To execute a query only on rows that are in a specific row list, just add the following predicate to any WHERE clause in a SELECT or INFER statement:

WHERE key in <row_list>

Predictive Functions

Functions of rows

Functions that take a row as input may be used in many types of queries, including

SELECT
INFER
ORDER BY (except in ESTIMATE COLUMNS)
WHERE (except in ESTIMATE COLUMNS)

Functions in this category include

SIMILARITY TO <row> [WITH RESPECT TO <column>]

Similarity measures the similarity between two rows. It can be interpreted by imagining dividing the rows up into clusters, and measuring how likely it would be that these two rows would be in the same cluster. By default, similarity considers all columns when deciding how similar two rows are, but you may optionally specify a specific column to compute similarity with respect to.

TYPICALITY

The typicality of a row measures how similar this row is to the other rows. If a row is more similar, on average, to other rows, then it is more typical.

PROBABILITY OF <column>=<value>

The probability of a cell taking on a particular value is the probability that the Bayesian probability model assigns to this particular outcome.

PREDICTIVE PROBABILITY OF <column>

The predictive probability of a value is similar to the PROBABILITY OF <column>=<value> query, but it measures the probability that each cell takes on its observed value, as opposed to a specific value that the user specifies.

Here are some examples:

SELECT SIMILARITY TO 0 WITH RESPECT TO name, TYPICALITY FROM btable WHERE PROBABILITY OF name='Bob' > 0.8 ORDER BY PREDICTIVE PROBABILITY OF name;

Functions of two columns

Functions of two columns may be used in the following queries

ESTIMATE PAIRWISE (omit the 'OF' clause)
SELECT (include the 'OF' clause; they only return one row)  

Here are the three functions:

DEPENDENCE PROBABILITY OF <column1> WITH <column2>

The dependence probability between two columns is a measure of how likely it is that the two columns are dependent (the opposite of independent). Note that this does not measure the strength of the relationship between the two columns; it merely measures the probability that there is any relationship at all.

MUTUAL INFORMATION OF <column1> WITH <column2>

Mutual information between two columns measures how much information a value in one column gives you about the value in the other column. If mutual information is 0, then knowing the first column tells you nothing about the other column (they are independent). Mutual information is always nonnegative, and is measured in bits.

CORRELATION OF <column1> WITH <column2>

This is the standard Pearson correlation coefficient between the two columns. All rows with missing values in either or both of the two columns will be removed before calculating the correlation coefficient.

Here are some examples:

ESTIMATE PAIRWISE DEPENDENCE PROBABILITY FROM table;
SELECT MUTUAL INFORMATION OF name WITH age FROM table...

Functions of one column, for SELECT

Functions in this category take one column as input, and can only be used in

SELECT (but they only return one row)

There is only one function like this:

TYPICALITY OF <column>

The typicality of a column measures how similar to other columns this column is. If a column is more dependent, on average, with other columns, then it becomes more typical.

Here is an example:

SELECT TYPICALITY OF age FROM...

Functions of one column, for ESTIMATE COLUMNS

For each of the functions of one or two columns above (that were usable in SELECT, and sometimes ESTIMATE PAIRWISE), there is a version of the function that is usable in ESTIMATE COLUMNS, in the following clauses:

WHERE (in ESTIMATE COLUMNS only)
ORDER BY (in ESTIMATE COLUMNS only)

Here are the functions:

TYPICALITY

This is the same function as TYPICALITY OF <column> above, but the column argument is implicit.

CORRELATION WITH <column>

This is the same function as CORRELATION OF <column1> WITH <column2> above, but one of the column arguments is implicit.

DEPENDENCE PROBABILITY WITH <column>

This is the same function as DEPENDENCE PROBABILITY OF <column1> WITH <column2> above, but one of the column arguments is implicit.

MUTUAL INFORMATION WITH <column>

This is the same function as MUTUAL INFORMATION OF <column1> WITH <column2> above, but one of the column arguments is implicit.

Here are some examples:

ESTIMATE COLUMNS FROM table WHERE TYPICALITY > 0.6 AND CORRELATION WITH name > 0.5 ORDER BY DEPENDENCE PROBABILITY WITH name;

Summary Statistics

To view summary statistics of query results, SUMMARIZE can be prepended to SELECT, INFER, and SIMULATE statements

SUMMARIZE <SELECT|INFER|SIMULATE> <columns|functions> FROM <btable> [WHERE <whereclause>] [ORDER BY <columns|functions>] [LIMIT <limit>]

The first column of the output from SUMMARIZE will be statistic labels: count, unique, mean, std, min, 25%, 50%, 75%, and max correspond to the output from pandas.Series.describe, which is dependent on whether the column is discrete or continuous.

prob_mode1, prob_mode2, prob_mode3, prob_mode4, and prob_mode5 are the empirical probabilities of the five most common values (number of occurrences / number of observations, including missing values).

Modal values and their empirical probabilities are returned for every column, whether discrete or continuous.

Frequency and Histogram Tables

Prepending a SELECT, INFER, or SIMULATE statement with the keyword FREQ or HIST will return a frequency or histogram table, respectively. If multiple columns are included in the statement, the frequency or histogram table is only returned for the first column.

A frequency table returns the number and percentage of occurrences of each distinct value in the column

FREQ <SELECT|INFER|SIMULATE> <columns|functions> FROM <btable> [WHERE <whereclause>] [LIMIT <limit>]
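For example, assuming a hypothetical people btable with a multinomial state column:

```sql
FREQ SELECT state FROM people
```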

A histogram calculates a number of equal-width bins based on the total number of values selected, using Sturges’ rule, \( k = \text{ceil}(\text{log}_2(n) + 1) \), and returns a table showing each bin interval, and the number and percentage of values within each bin.

HIST <SELECT|INFER|SIMULATE> <columns|functions> FROM <btable> [WHERE <whereclause>] [LIMIT <limit>]

While FREQ works for all data types, HIST will not work for multinomial columns, since there isn’t an intuitive way to calculate numeric intervals for multinomial values. If a multinomial column contains values that could be interpreted as numeric values, use UPDATE SCHEMA to set the column’s data type to continuous, and then use HIST.
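As a sketch, assuming a hypothetical people btable with a year column originally typed as multinomial but containing numeric values:

```sql
UPDATE SCHEMA FOR people SET year=continuous
HIST SELECT year FROM people
```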

Saving and Reviewing Metadata

Metadata (data describing the data) can remind the user about what’s contained in a btable, or what a particular column of data means. For each btable, metadata is stored as pairs consisting of a key and a value, and is saved at two different levels: metadata related to entire btables and metadata related to columns of data (typically referred to as column labels).

For example, a user might set the key original_file_name = data_download_2014_04_17.csv in order to recall which version of the file is saved in the btable, or might set a column label yr = Year of observation.

Metadata for btables

There are no restrictions on metadata keys, but some examples at the btable level might be original_file_name, origin_url, date_retrieved, misc_note, etc.

To add metadata to a btable directly

UPDATE METADATA FOR <btable> SET <metadata-key1 = value1>[, <metadata-key2 = value2>...]

Metadata keys and values should not be quoted unless the quotes are intended to be part of the key or label, and should also not include commas.
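For example, for a hypothetical people btable:

```sql
UPDATE METADATA FOR people SET original_file_name = data_download_2014_04_17.csv, misc_note = alpha test data
```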

Adding a lot of metadata to a btable might become tedious, especially if the process ever needs to be repeated, so it’s also possible to add metadata to a btable from a file

UPDATE METADATA FOR <btable> FROM <filename.csv>

The file in <filename.csv> should be a text CSV file with two columns, with the first value on each line being a metadata key and the second value its intended value. The first line of the file is assumed to be a header and is therefore ignored. As an example, the first three lines of the file might be

key,value
original_file_name,data_download_2014_04_17.csv
sample_note,data in btable is a 20% random sample of the full original file

Metadata for columns of btables

Labeling columns is a common metadata operation, and it has its own statement to assign labels directly

LABEL COLUMNS FOR <btable> SET <column-name-1 = column-label-1> [, <column-name-2 = column-label-2>, ...]

Column labels should be quoted if they’re longer than one word, and should not include commas. Similarly to btable-level metadata, column labels can be added to a btable from a file
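For example, for a hypothetical people btable:

```sql
LABEL COLUMNS FOR people SET age = 'Age in years', state = 'State of residence'
```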

LABEL COLUMNS FOR <btable> FROM <filename.csv>

As with loading btable-level metadata from a file, the file in <filename.csv> should be a text CSV file with two columns, with the first value on each line being a column name and the second value its intended label. The first line of the file is assumed to be a header and is therefore ignored. As an example, the first three lines of the file might be

column,label
age,Observed student's age as of 1 Jan 2014
grade,Student's enrolled grade at the beginning of the 2013-14 school year

Reviewing btable metadata and column labels

To see all metadata stored for a given btable as metadata key and value pairs

SHOW METADATA FOR <btable>

To see only the metadata values associated with specific keys

SHOW METADATA FOR <btable> [<metadata-key1> [, <metadata-key2>...]]

Similarly to the SHOW METADATA statements, column labels can be reviewed either all at once, by not specifying any column names

SHOW LABEL FOR <btable>

Or, if a set of column names is given, the output shows column name and label pairs for those columns

SHOW LABEL FOR <btable> [<column-name-1> [, <column-name-2>...]]