Machines and Minds

The Synthetic-Analytic Distinction

2026-02-20T00:00:00+00:00

The distinction between analytic and synthetic propositions is often associated with logical positivism, a philosophy propounded by A.J. Ayer, among others.

Synthetic propositions are those that are associated with some level of uncertainty. Their referents are data that can be gathered using the senses.

Analytic propositions, on the other hand, are true because of the way that we have defined them. If a synthetic proposition has some probability $p$, where $p < 1$, then an analytic proposition has a probability of 1.

One might object to the notion that anything can ever be fully certain. Those objections are, of course, well-founded; however, to build upward, certain necessary assumptions must provide us with a foundation.

That we defend analytic propositions is not to mire ourselves in dogma but to show that they enable a vast array of cherished propositions.

Whether mathematical operations are uncertain

Suppose you did try to incorporate uncertainty into your math, and you did so using the language of probability.

For every probability calculation, you would multiply each posterior by $1- \epsilon$, where $\epsilon$ is a penalty constant. Here, I want to emphasize that it would be the operation itself that we are skeptical of, not hardware limitations, i.e., errors due to limited precision floating point numbers.

Then, since that multiplication by $1 - \epsilon$ is also a probability calculation, we would need to multiply the resulting posterior by $1 - \epsilon$ again, and again, resulting in an infinite regress.
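A minimal sketch of this regress, assuming a hypothetical penalty `epsilon = 0.01` and a starting posterior of 0.9 (both values are made up for illustration). Each application of the penalty is itself a calculation that would need penalizing, so the loop has no principled stopping point, and the credence simply decays toward zero:

```python
def penalize(posterior: float, epsilon: float, rounds: int) -> float:
    """Apply the (1 - epsilon) penalty `rounds` times in a row."""
    for _ in range(rounds):
        posterior *= 1 - epsilon
    return posterior

# Hypothetical values chosen only for illustration.
start, epsilon = 0.9, 0.01
for rounds in (1, 10, 100, 1000):
    print(rounds, penalize(start, epsilon, rounds))
```

However many rounds we pick, there is always one more multiplication left unpenalized.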

Note that this is not the same as saying that there could have been a misstep in a body of calculations or a proof. In that case, we simply say that the author made a mistake, not that math itself is somehow suspect. The written result either does or does not follow from the written premises, and we either do or do not assume that it follows until we have verified for ourselves that it does not. Such cases are legion, but the fault always lies with the fallible human.

Whether anything in nature cannot be explained by math

The objection to all of this might just be that the uncertainty does not have to be distributed over the mathematical operations. Instead, what we are uncertain about is whether there is anything in nature that could be explained by math.

I literally cannot conceive of what that would be. What thought could anyone have that could not be formulated as mathematics? What property would it have to have that it would not simply be one more relation added to humanity’s knowledge of mathematics, and yet still be capable of being formulated as a synthetic proposition?

Whether all of this is nonsense because we are exalting math as some kind of higher being

From a purely conventionalist standpoint, we have said nothing inconsistent. “Math” is not a term that describes the scribbles in the notebook of a calculus student. It is not the syntax. It is the semantics.

To have a coherent theory of logic and language, we must employ the concept of meaning. How else would math fit into this picture except how we have painted it when we have already assumed that math can imply nothing about the real world?

“Ah, but you should not have assumed that.”

Then let’s claim that math can imply observations and tell a little parable to illustrate the possible consequences.

In ancient Greece, two philosophers decided to set out to find the result of $1 + 1$. One traveled north, and the other went south.

In the north, one philosopher came to stay as a guest in the home of a couple for most of a year. In that time, the couple conceived ($1 + 1$) and the woman birthed a baby ($= 3$).

The other philosopher came to a pond with water so crystal clear that you could see right to the bottom. He spent the better part of a year watching the fish. The fish had large mouths that they could open wide and use to gobble up smaller fish. He saw a fish ($1$) eat a smaller fish ($+ 1$), and the result was only one fish ($= 1$). This was a pattern that he saw repeat itself many times.

The two philosophers journeyed back to their starting point and reconvened.

The northern philosopher said, “I have discovered the answer. One plus one equals three.”

“Why, no,” the southern philosopher said. “One plus one equals one.”

To reconcile this difference, they decided that in the north $1 + 1 = 3$ and in the south $1 + 1 = 1$. In other words, the result depended on the geographic location.

Claims regarding analytic propositions are seen as suspect because of instincts that carry over from examining synthetic propositions.

A synthetic proposition, p, is one for which there is an observation we could encounter for which we would have to adjust our credence regarding p.

We will illustrate two examples in which a proposition might at first be taken to be synthetic but is in fact nonsense.

The dragon in the garage

This hypothetical famously comes to us from Carl Sagan in The Demon-Haunted World. Suppose someone claims that there is a dragon in his garage.

I follow the claimant to his home to examine this alleged dragon. Within, however, I see no dragon. Subsequently, the claimant explains my observation away by saying that this dragon is invisible.

“No problem,” I say. “We will simply spread flour on the floor to capture the footprints.”

Except our claimant says that the dragon constantly floats in the air, never touching the ground.

I go on to propose other tests, such as using an infrared detector and spray paint on the dragon. After each such proposal, I am met by some reason why that experiment will not result in observations other than what we would see in the case where the dragon didn’t exist at all.

There is no observation we could make that would change our minds about whether there is a dragon in his garage.

Exists and only exists

In Language, Truth, and Logic, A.J. Ayer supports the idea that “existence” is not a predicate. This is because any predicate must have the property that it can stand alone. This sets us up for a contradiction in the following way:

Imagine that we have a database of propositions consisting of predicates and atoms. In the proposition p(a), p is a predicate and a is an atom. If the atom a appears in p(a) and q(a), then p and q are properties of the same thing.

Suppose we add to the database existence(b) without the atom b ever appearing in any other proposition. Then this is to be interpreted in English as, “b exists and yet does not have any other properties.”

In other words, it is an invisible, floating, incorporeal dragon. Our senses could not gather any data about this b, but we suppose that it exists anyway.
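The database can be sketched in a few lines of Python. The tuple representation and the names `database` and `bare_existents` are hypothetical choices of mine, not anything taken from Ayer. The function flags every atom whose only recorded property is existence:

```python
# Each proposition is a (predicate, atom) pair, e.g. ("scaly", "a") for scaly(a).
database = {
    ("scaly", "a"),
    ("fearsome", "a"),
    ("existence", "a"),
    ("existence", "b"),  # b appears in no other proposition
}

def bare_existents(props):
    """Return atoms that appear under `existence` and under no other predicate."""
    with_existence = {atom for pred, atom in props if pred == "existence"}
    with_other = {atom for pred, atom in props if pred != "existence"}
    return with_existence - with_other

print(bare_existents(database))
```

Here only b is flagged: nothing else in the database could ever be checked against an observation of it.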

Making “existence” a predicate makes possible a proposition that flies in the face of common sense.

These two examples are both relevant to our argument in the following way.

“An analytic proposition is not a synthetic proposition” accords with our usage of language

An analytic proposition is not a synthetic proposition in that no observation we could make could ever change our mind about an analytic proposition, and this is consistent with the way we use language.

“Dragon” refers to a member of a class of fearsome, scaly creatures. On the other hand, x as a variable is not a symbol that conventionally refers to any member of any class. Neither does any mathematical relation.

“But wait!” you say. “What if I am solving a problem where we suppose that x is a quantity of money in a certain situation and there is enclosing that x a formula relating several other variables and constants?”

Then you merely suppose x refers to that quantity and quality situationally. That does not mean that x has the same characteristics as a symbol like “dragon”, which can call to mind a definite set of sensory experiences even when it is spoken in isolation.

In that situation, your usage of the formula is an empirical proposition. You are asserting that the real-world quantities and qualities can be related in such a way that they can be described by the formula. When an observation is encountered that does not accord with this proposition, you would be well-advised to discard it in favor of another. You do not, however, discard any analytic proposition because no such proposition could ever have implied anything about your observation.

You come to me with a proof beautifully typed in LaTeX. Do I conclude that the conclusion of this proof is 100% certain to follow from the premises? Not necessarily. I could have a severe sinus infection that day, and that skews my judgment.

Is it the analytic proposition referred to by the syntax of the proof that is uncertain, then?

No, what is uncertain is whether the syntax, the physical marks on the page, agrees with my brain about which analytic propositions are referred to therein.

To use an example that hits a bit closer to home for us programmers, suppose that I wrote an app and pushed it to GitHub. My users download it. One of them opens issue #1 in the Issues pane of GitHub. This issue documents that the behavior of the software is other than what the user would expect. In other words, it failed the test t, where t is whatever test the user made of the software.

To break this down precisely, the code represents one analytic proposition p; there’s another empirical proposition, q, in my brain represented by whatever I think the code represents; and there’s another analytic proposition, r, corresponding to a Minimum Viable Product that would pass test t.

Whether the software will subsequently pass test t depends on whether I successfully analyze the code, updating my q to q'; modify the code, updating it to analytic proposition p'; and then update my q' to q'', the empirical proposition about which analytic proposition the code now represents.

If the user closes the issue, then they believe that the software represents r. In other words, the user holds s(r), or the empirical proposition that the program passes t.

Why care about all of this?

Eventually, I aim to create a probabilistic logic programming language with a general-purpose knowledge base. When I do so, some productive assumptions will need to be made, such as

Some propositions are associated with a certain probability.
Some propositions are true because they are true by definition.

This corresponds to the synthetic/analytic distinction.
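As a rough sketch of how those two assumptions might sit side by side in a knowledge base, here is a toy representation. The `Proposition` class and the example statements are hypothetical and say nothing about how the eventual language will be designed; the only idea borrowed from above is that an analytic proposition carries a probability of exactly 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposition:
    statement: str
    probability: float  # 1.0 means true by definition

    @property
    def is_analytic(self) -> bool:
        return self.probability == 1.0

# Made-up entries for illustration.
knowledge_base = [
    Proposition("a bachelor is an unmarried man", 1.0),
    Proposition("it will rain tomorrow", 0.3),
]

for prop in knowledge_base:
    kind = "analytic" if prop.is_analytic else "synthetic"
    print(f"{kind}: {prop.statement} (p = {prop.probability})")
```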

Bibliography

Ayer, Alfred J. Language, Truth and Logic. Reprinted. Penguin Books, 1990.

Sagan, Carl. The Demon-Haunted World: Science As a Candle in the Dark. With Ann Druyan. Random House Publishing Group, 2011.

Imputation Using Random Sampling and K-Nearest Neighbors

2026-02-13T00:00:00+00:00

In machine learning, imputation refers to the creation of synthetic data from existing data for the purpose of filling missing data. Missing data are any NaN or null cells in the dataframe. Missing data is to be avoided as it can be problematic for training machine learning models.

The first method of imputation described in this post is designed for categorical data. If the feature you want to impute is continuous, then you’ll want to use the imputation functions built into scikit-learn, as detailed later in this post.

Because it contains categorical features, we’ll be using the Titanic dataset hosted on OpenML to demonstrate.

from sklearn.datasets import fetch_openml
from sklearn.impute import KNNImputer
import numpy as np
import pandas as pd
import random

random.seed(1)

# Fetch the Titanic dataset
data = fetch_openml(data_id=40945, as_frame=True)
df = data.frame
df.head()

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	boat	body	home.dest
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	2	NaN	St Louis, MO
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	11	NaN	Montreal, PQ / Chesterville, ON
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	NaN	135.0	Montreal, PQ / Chesterville, ON
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	NaN	NaN	Montreal, PQ / Chesterville, ON

We shuffle the DataFrame for the purpose of randomly deleting values.

The call to reset_index is necessary because, otherwise, our later slicing of the DataFrame will take all elements up to the nth index instead of subsetting up to the nth row.
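A small standalone sketch of that pitfall, using a made-up five-row frame rather than the Titanic data. The index labels are deliberately out of order, as they would be after shuffling:

```python
import pandas as pd

# A small frame whose index labels are out of order, as after df.sample(frac=1).
shuffled = pd.DataFrame({"value": list("abcde")}, index=[3, 0, 4, 1, 2])

# .loc slices by label: this runs from the first row through the row labeled 1.
without_reset = shuffled.loc[:1]

# After reset_index, labels match positions, so .loc[:1] is just the first two rows.
with_reset = shuffled.reset_index(drop=True).loc[:1]

print(len(without_reset), len(with_reset))
```

Without the reset, the slice length depends on where the label happens to land after the shuffle.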

# shuffled dataframe
sdf = df.sample(frac=1).reset_index(drop=True)

sdf['sex'].value_counts()

sex
male      843
female    466
Name: count, dtype: int64

Because there are a large number of values in each of the categories of these columns, it will be impossible for the below code to delete all of any category.

NUM_DELETE = 100
sdf.loc[:NUM_DELETE - 1, ['sex']] = None
sdf['sex'].value_counts()

sex
male      791
female    418
Name: count, dtype: int64

In a previous post, we talked about the Mind-Reading Machine, which uses a Markov Chain. A Markov Chain depends only on the previous state.

This method of imputation does not depend on the previous state and therefore cannot be considered a Markov Chain. It is, however, similar to the Mind-Reading Machine in that we will choose at random from an array, so that the probability of choosing each unique element is proportional to how frequently it shows up in the array.

sexes = sdf[~sdf['sex'].isna()]['sex']
choices = random.choices(sexes.array, k=NUM_DELETE)
sdf.loc[:NUM_DELETE - 1, ['sex']] = choices
sdf.head()

	pclass	survived	name	sex	age	ticket	fare	cabin	embarked	boat	body	home.dest
0	3	0	Youseff, Mr. Gerious	male	45.5	2628	7.2250	NaN	C	NaN	312.0	NaN
1	1	1	Candee, Mrs. Edward (Helen Churchill Hungerford)	male	53.0	PC 17606	27.4458	NaN	C	6	NaN	Washington, DC
2	3	1	Olsson, Mr. Oscar Wilhelm	female	32.0	347079	7.7750	NaN	S	A	NaN	NaN
3	3	0	Theobald, Mr. Thomas Leonard	female	34.0	363294	8.0500	NaN	S	NaN	176.0	NaN
4	3	0	Svensson, Mr. Johan	female	74.0	347060	7.7750	NaN	S	NaN	NaN	NaN
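To see why choosing at random from the array of observed values preserves their proportions, here is a small standalone sketch. The toy list `observed` is made up for illustration and is not the Titanic column.

```python
import random
from collections import Counter

random.seed(1)

# Made-up observed values: 64% "male", 36% "female".
observed = ["male"] * 64 + ["female"] * 36

# Each element is equally likely to be drawn, so a category's chance of being
# drawn equals its share of the list.
draws = random.choices(observed, k=10_000)
share_male = Counter(draws)["male"] / len(draws)
print(round(share_male, 2))
```

With enough draws, the share of each category among the imputed values should land close to its share among the observed ones.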

In the original dataset, some of the age values are missing. Fortunately, scikit-learn contains convenient means of imputing data, including numerical data.

First, we encode sex as elements of the set $\{0, 1\}$, because

Although it seems unlikely to me that sex will predict the age, I want to demonstrate encoding.
This feature is currently categorical.
The K-Nearest Neighbors imputer (AKA KNNImputer) requires that the input be numerical.

(Scikit-Learn)

# Encode the categorical labels
sdf['male'] = pd.get_dummies(sdf['sex'])['male']

As mentioned, KNNImputer only wants numeric types. We will therefore provide ourselves with a means of selecting only the numeric columns from the dataframe.

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
selected_columns = sdf.select_dtypes(include=numerics).columns
selected_columns

Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'body'], dtype='object')

A high-level interpretation of the fit_transform method of KNNImputer is as follows:

For each row, for each cell, if that cell is not missing, do nothing. Otherwise, proceed to the next step.
Create a value for that cell using the K-Nearest Neighbor algorithm. This uses the other cells in that row and in the neighbors to predict this one.
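The two steps above can be seen on a toy matrix. This is only a small sketch; the array `X` is made up, and `n_neighbors=2` is chosen to keep the arithmetic easy to follow (the post itself uses the default settings).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing cells (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# With n_neighbors=2, each missing cell becomes the mean of that column over
# the two nearest rows that have a value there. Distance is computed only on
# the columns both rows have observed.
imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(imputed)
```

Cells that were already present come back unchanged; only the two NaN cells are filled in.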

imp = KNNImputer().set_output(transform='pandas')
transformed = imp.fit_transform(sdf[selected_columns])

sdf[selected_columns] = transformed
sdf.head()

	pclass	survived	name	sex	age	ticket	fare	cabin	embarked	boat	body	home.dest	male
0	3.0	0	Youseff, Mr. Gerious	male	45.5	2628	7.2250	NaN	C	NaN	312.0	NaN	True
1	1.0	1	Candee, Mrs. Edward (Helen Churchill Hungerford)	male	53.0	PC 17606	27.4458	NaN	C	6	177.2	Washington, DC	True
2	3.0	1	Olsson, Mr. Oscar Wilhelm	female	32.0	347079	7.7750	NaN	S	A	117.4	NaN	False
3	3.0	0	Theobald, Mr. Thomas Leonard	female	34.0	363294	8.0500	NaN	S	NaN	176.0	NaN	False
4	3.0	0	Svensson, Mr. Johan	female	74.0	347060	7.7750	NaN	S	NaN	167.6	NaN	False

The below tells us that the only columns that have missing values are the ones that we didn’t intend to impute.

sdf.isna().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            0
cabin        1014
embarked        2
boat          823
body            0
home.dest     564
male            0
dtype: int64

There are scenarios in which deleting rows or columns containing missing values is acceptable. Whether and how we do so depends on a number of factors, including the number of missing values and why those values are missing (if we can know the reason).

An extended discussion of those facets is outside the scope of this post. Instead, I’ll leave the reader with the following takeaways.

To Sum Up

We showed how to replace missing values solely by randomly selecting from the distribution of those existing values.
We showed how to impute missing numerical values using a built-in scikit-learn method.

Future Work

In a future post, we’ll explore a method of training a machine learning model without imputing missing data and compare it with imputation.

Bibliography

“7.4. Imputation of Missing Values.” Scikit-Learn, https://scikit-learn.org/stable/modules/impute.html. Accessed 13 Feb. 2026.

OpenML. https://www.openml.org/search?type=data&sort=version&status=any&order=asc&exact_name=Titanic&id=40945. Accessed 13 Feb. 2026.

Gaussian-noise Linear Regression vs Multivariate Normal

2026-02-12T00:00:00+00:00

We will discuss the creation of generative statistical models. For the purposes of demonstration, we’ll use the California housing dataset taken from scikit-learn. “The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars” (8.2. Real world datasets). This dataset has been chosen simply because the target variable is continuous, making it capable of being predicted with linear regression, which is one of the models that we’ll be exploring. The other is the Multivariate Normal distribution.

from scipy.stats import fit, norm
from scipy.stats import multivariate_normal
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

random.seed(1)

data = fetch_california_housing()
attribute_names = [
    'Median Income',
    'House Age',
    'Average Rooms',
    'Average Bedrooms',
    'Population',
    'Average Occupation',
    'Latitude',
    'Longitude'
]

It is possible that not all of the variables belong to Gaussian distributions. Let’s plot their histograms overlaid with their respective Gaussian distributions to verify this hunch.

for i, col in enumerate(data.data.T):
    mu, sigma = np.mean(col), np.std(col)
    x = np.linspace(min(col), max(col), 200)
    plt.hist(col, bins=30, density=True)
    plt.xlabel(attribute_names[i])
    plt.plot(x, norm.pdf(x, mu, sigma))
    plt.show()

The distribution of longitude, for one, clearly does not follow a Gaussian distribution. Though different distributions might better serve us here, let’s keep things simple for now by assuming a multivariate Normal distribution for the X.

Of course, the fact that further refinement is possible (i.e., the employment of a multimodal distribution) has been noted for the sake of future work.

What about the target variable? What does that distribution look like?

mu, sigma = np.mean(data.target), np.std(data.target)
x = np.linspace(min(data.target), max(data.target), 200)
plt.hist(data.target, bins=30, density=True)
plt.plot(x, norm.pdf(x, mu, sigma))
plt.show()

A Gaussian curve is not exactly flattering when worn by this variable. It is close enough, however, for our purposes in this article. In future work, I would like to experiment with other types of distributions.

Linear regression is defined as

\[y = \beta_0 + \beta_1 x + \epsilon\]

We want this to not simply be a linear regression model, but a generative model, i.e., a Gaussian-noise linear regression model. (The noise is the $\epsilon$.) When sampling from this model, we therefore sample the noise from a Gaussian distribution with mean zero and variance $\sigma^2$:

\[\epsilon \sim \mathcal{N}(0, \sigma^2)\]

(Shalizi, 2017).

Depending on our data, we could sample the noise from any distribution where the expected value of $\epsilon$ is zero,

\[\mathbb{E}[\epsilon] = 0\]

(Shalizi, 2017).
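As a minimal one-feature sketch of what sampling from such a model means, assume made-up coefficients `beta_0`, `beta_1` and a noise scale `sigma` (none of these are fitted to the housing data):

```python
import random
import statistics

random.seed(1)

# Hypothetical parameters, chosen only for illustration.
beta_0, beta_1, sigma = 0.5, 0.4, 0.8

def sample_y(x: float) -> float:
    """Draw y = beta_0 + beta_1 * x + epsilon with epsilon ~ N(0, sigma^2)."""
    epsilon = random.gauss(0.0, sigma)
    return beta_0 + beta_1 * x + epsilon

# At a fixed x, the draws scatter around the line's value beta_0 + beta_1 * x.
draws = [sample_y(3.0) for _ in range(10_000)]
print(statistics.mean(draws), statistics.stdev(draws))
```

The mean of the draws should sit near the line's value at x = 3.0, and their spread near `sigma`, because the noise has expected value zero.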

Below, we define our Gaussian-noise linear regression model. The sample method is what makes this model generative. We add noise to the samples because the observed data do not lie exactly on a linear regression line. Rather, they spread outwards in a cloud that centers on the line. With noise introduced, therefore, the samples are more realistic.

# Gaussian noise linear regression model
class GaussianLinearRegression:
    def __init__(self, X, y):
        self.mu, self.cov = multivariate_normal.fit(X)
        self.lin_reg = LinearRegression()
        self.lin_reg.fit(X, y)
        y_pred = self.lin_reg.predict(X)
        self.sd = np.sqrt(mean_squared_error(y, y_pred))
        
    def sample(self, size=1):
        """
        Randomly sample from the probability distribution
        """
        X = multivariate_normal(self.mu, self.cov).rvs(size=size)
        if size == 1:
            X = X.reshape(1, -1)
        loc = self.lin_reg.predict(X)
        sample = norm(loc=loc, scale=self.sd).rvs(size=size)
        return np.append(X, sample)
    
    def log_likelihood(self, X, y):
        log_px = multivariate_normal(self.mu, self.cov).logpdf(X).sum()
        y_pred = self.lin_reg.predict(X)
        log_py_given_x = norm(loc=y_pred, scale=self.sd).logpdf(y).sum()
        return log_px + log_py_given_x
    
    def pdf(self, X, y):
        y_pred = self.lin_reg.predict(X)
        return norm(loc=y_pred, scale=self.sd).pdf(y)

glr = GaussianLinearRegression(data.data, data.target)
glr.sample()

array([ 1.65427924e+00,  1.64418672e+01,  6.12191566e+00,  1.33692717e+00,
        1.38043401e+03,  1.09108617e+01,  3.56291888e+01, -1.18543129e+02,
        8.78771712e-01])

I want to dwell a moment on the above implementation of log_likelihood. Log-likelihood is a measure of the goodness-of-fit (Taboga).

We can use it to compare the goodness-of-fit of one statistical model to the goodness-of-fit of another.

glr.log_likelihood(X=data.data, y=data.target)

-503782.5946747418

For our purposes, a pdf result that is closer to 0 is more 👎. A logarithm tends to negative infinity as its input tends towards 0. Therefore, a lower logpdf can be thought of as a comparatively more unlikely value. To demonstrate:

temp = norm(loc=0, scale=1)
temp.pdf(-1), temp.pdf(0), temp.pdf(1), temp.pdf(100), temp.logpdf(-1), temp.logpdf(0), temp.logpdf(1), temp.logpdf(100)

(0.24197072451914337,
 0.3989422804014327,
 0.24197072451914337,
 0.0,
 -1.4189385332046727,
 -0.9189385332046727,
 -1.4189385332046727,
 -5000.918938533205)

The PDF is the probability density function. Because the probability of a continuous random variable taking on any single exact value is 0, we work with densities instead.

Because logarithms have the property that $\log(a b) = \log(a) + \log(b)$, we summed the log-densities of the individual observations to obtain the log-likelihood.
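The product-to-sum behavior of logarithms can be checked numerically with the standard library alone. The density values below are made up; the second half illustrates a practical reason for working in log space, namely that a raw product of many small densities underflows.

```python
import math

# Made-up density values for three independent observations.
densities = [0.24, 0.40, 0.05]

log_of_product = math.log(math.prod(densities))
sum_of_logs = sum(math.log(d) for d in densities)
print(log_of_product, sum_of_logs)

# With many observations the raw product underflows to 0.0, while the sum of
# logs stays finite.
tiny = [1e-5] * 100
product_of_tiny = math.prod(tiny)
sum_of_tiny_logs = sum(math.log(d) for d in tiny)
print(product_of_tiny, sum_of_tiny_logs)
```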

def create_X_y(data):
    return np.append(data.data.T, [data.target], axis=0).T
X_y = create_X_y(data)
print('Shape of X_y', X_y.shape)
mu, cov = multivariate_normal.fit(X_y)
multivariate_normal(mu, cov).logpdf(X_y).sum()

Shape of X_y (20640, 9)

-503782.5946747417

The Gaussian-noise linear regression model does not fit any better than the Multivariate Normal, simply because the target is not a linear function of the input.

We’ll show that to be the case by visually comparing the distribution of the target variable and the model’s guess for that target variable.

#plt.figure()
plt.hist(glr.pdf(data.data, data.target), bins=50, alpha=0.6, color='blue')

plt.hist(data.target, bins=50, alpha=0.6, color='red')

plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Distribution of PDF Values vs Target Values')
plt.show()

If our ultimate goal were to build a model that accurately describes the data, then we would need to do better than our GaussianLinearRegression. Let’s also randomly sample from the model to compare that to the actual distribution.

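# sample() returns one flat array: all sampled feature values first, then the
# sampled targets. Keep only the targets so that like is compared with like.
n = len(data.target)
sample = glr.sample(n)[-n:]
plt.hist(sample, bins=50, alpha=0.6, color='blue')
plt.hist(data.target, bins=50, alpha=0.6, color='red')
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Distribution of Sampled Values vs Target Values')
plt.show()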

That doesn’t look like the same distribution to me.

The goodness-of-fit of the Gaussian-noise linear regression model is no better than that of the Multivariate Normal model.

In that case, why use it? Sure, you can use the former to predict y | X, but you can also do that with the Multivariate Normal model. Doing so requires more code than what we’ve written, but not that much more.
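For reference, here is a sketch of what that extra code might look like. For a joint Gaussian over (X, y), the conditional mean is $\mathbb{E}[y \mid x] = \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} (x - \mu_x)$. The function name `predict_from_joint` and the synthetic data are my own illustrative choices, not part of the models defined above.

```python
import numpy as np

def predict_from_joint(mu, cov, X):
    """Conditional mean of the last variable given the others under a joint Gaussian."""
    mu_x, mu_y = mu[:-1], mu[-1]
    cov_xx, cov_xy = cov[:-1, :-1], cov[:-1, -1]
    # Solve cov_xx @ w = cov_xy instead of forming an explicit inverse.
    weights = np.linalg.solve(cov_xx, cov_xy)
    return mu_y + (X - mu_x) @ weights

# Synthetic data: y is a noisy linear function of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=500)

# Fit the joint Gaussian with the sample mean and covariance.
joint = np.column_stack([X, y])
mu = joint.mean(axis=0)
cov = np.cov(joint, rowvar=False)
y_hat = predict_from_joint(mu, cov, X)

# Compare against an ordinary least-squares fit with an intercept.
design = np.column_stack([np.ones(len(X)), X])
ols = design @ np.linalg.lstsq(design, y, rcond=None)[0]
print(np.allclose(y_hat, ols))
```

The comparison at the end is there because the conditional mean of a fitted joint Gaussian should coincide with the least-squares prediction, which is consistent with the two log-likelihoods above being equal.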

In my opinion, if you’re trying to create a model that can generate, evaluate the pdf of an observation, and predict, then it’s a toss-up between the two models. In a future post, we’ll be looking at other means of doing the same that also happen to fit the data more faithfully.

Bibliography

“8.2. Real World Datasets.” Scikit-Learn, https://scikit-learn.org/stable/datasets/real_world.html. Accessed 10 Feb. 2026.

Shalizi, Cosma. 36-401 Modern Regression, Fall 2017. 2017, https://www.stat.cmu.edu/~larry/=stat401/lecture-04.pdf.

Taboga, Marco. Model Selection Criteria. https://www.statlect.com/fundamentals-of-statistics/model-selection-criteria. Accessed 11 Feb. 2026.

Implementing Linear Regression from Scratch in Python to Understand How It Works

2026-01-24T00:00:00+00:00

First, let’s use scikit-learn’s built-in LinearRegression. We’ll use the California Housing dataset, as that is listed in the documentation as a “regression” dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = fetch_california_housing(as_frame=True)
X = df.data
y = df.target
np.random.seed(0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.head(), y_train.head()

(   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
    4.2386       6.0  7.723077   1.169231       228.0  3.507692     33.83
    4.3898      52.0  5.326622   1.100671      1485.0  3.322148     37.73
    3.9333      26.0  4.668478   1.046196      1022.0  2.777174     33.83
    1.4653      38.0  3.383495   1.009709       749.0  3.635922     34.01
    3.1765      52.0  4.119792   1.043403      1135.0  1.970486     34.08

    Longitude
      -117.55
      -122.44
      -118.00
      -118.26
      -118.36  ,
    5.00001
    2.70000
    1.96100
    1.18800
    2.25000
 Name: MedHouseVal, dtype: float64)

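What is r2_score? It’s typically written as $R^2$ and pronounced “R-squared”. Also called the coefficient of determination, it measures what proportion of the variance in the target variable is explained by the model. It is at most 1 and is sometimes stated as a percentage. On the training data of a least-squares fit with an intercept it falls between 0 and 1, but on held-out data it can be negative when the model predicts worse than simply guessing the mean.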
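A small sketch of the computation, using the standard definition $R^2 = 1 - SS_{res} / SS_{tot}$. The helper `r_squared` and the toy numbers are mine and are only meant to mirror what `r2_score` reports; the third call shows the score dropping below zero when the predictions are worse than always guessing the mean.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])     # exact predictions
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predicting the mean
reversed_ = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])   # worse than the mean
print(perfect, mean_only, reversed_)
```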

model = LinearRegression()
model.fit(X_train, y_train)
print("TRAIN:")
print(r2_score(y_true=y_train, y_pred=model.predict(X_train)))
print("")
print("TEST:")
r2_score(y_true=y_test, y_pred=model.predict(X_test))

TRAIN:
0.6088968118672871

TEST:

0.5943232652466202

def model(column: str):
    model = LinearRegression()
    X_train_subset = X_train[[column]]
    model.fit(X_train_subset, y_train)
    train_score = r2_score(y_true=y_train, y_pred=model.predict(X_train_subset))
    test_score = r2_score(y_true=y_test, y_pred=model.predict(X_test[[column]]))
    return train_score, test_score

for c in X_train.columns:
    print(model(c))

(0.47991412719941495, 0.4466846804895943)
(0.01133589722637418, 0.010112709993501445)
(0.023847425986299742, 0.019686674517510605)
(0.0019727324864367013, 0.0026742213470939413)
(0.0007318879607208784, -0.00022540672756665714)
(0.0011001698651382785, -0.006489558238010673)
(0.020363987996845134, 0.022215172774302072)
(0.0022351265327293923, 0.0012984715729211782)

No single column has greater predictive power than using all of the columns together. However, we’re going to use the first column because that will make the math slightly easier.

class CustomLinearRegression:
    a_hat: float
    b_hat: float
    
    def train(self, X_train: np.array, y_train: np.array):

        y_summed = y_train.sum()
        X_dot_X = X_train.dot(X_train)
        X_summed = X_train.sum()
        X_dot_y = X_train.dot(y_train)

        n = len(X_train)
        self.a_hat = (y_summed * X_dot_X - X_summed * X_dot_y) / (n * X_dot_X - X_summed ** 2)
        
        self.b_hat = (n * X_dot_y - X_summed * y_summed) / (n * X_dot_X - X_summed ** 2)
    
    def predict(self, X: np.array):
        # a_hat is the intercept and b_hat is the slope, so the fitted line is
        # a_hat + b_hat * X
        return self.a_hat + self.b_hat * X

model = CustomLinearRegression()
X_train_subset = X_train.iloc[:, 0].to_numpy()
y_train_array = y_train.to_numpy()
model.train(X_train=X_train_subset, y_train=y_train_array)

r2_score(y_true=y_train, y_pred=model.predict(X_train_subset))

y_pred = model.predict(X_train_subset)
print(r2_score(y_true=y_train_array, y_pred=y_pred))
print("TRAIN:")
print(r2_score(y_true=y_train_array, y_pred=model.predict(X_train_subset)))
print("")
print("TEST:")
X_test_subset = X_test.iloc[:, 0].to_numpy()
r2_score(y_true=y_test, y_pred=model.predict(X_test_subset))

0.4752542635984037
TRAIN:
0.4752542635984037

TEST:

0.43971655102712115

To end, let’s visualize the performance of the custom linear regression model on the training data, because visualizations give me the warm fuzzies.

plt.figure(figsize=(8, 6))
plt.scatter(X_train_subset, y_train, color='green', alpha=0.5, label='Training Data')
plt.plot(X_train_subset, y_pred, color='red', linewidth=2, label='Regression Line')
plt.title('Visualization for Custom Linear Regression')
plt.xlabel('MedInc')
plt.ylabel('Median House Values')
plt.legend()
plt.show()

In just a few lines of code, we got a score that is similar to what we achieved with the SciKit-Learn implementation.

So what happened? Consider the following formula in which the training data, the estimated coefficients, and the target are terms.

\[\begin{bmatrix} n & \sum_{i=1}^n x_i \\[1ex] \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix} \begin{bmatrix} \widehat{\alpha} \\[1ex] \widehat{\beta} \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^n y_i \\[1ex] \sum_{i=1}^n y_i x_i \end{bmatrix}\]

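Solving this system for the two unknowns gives the formulas below, with the coefficients to be estimated conveniently isolated on the left-hand side. Here $\widehat{\alpha}$ is the intercept and $\widehat{\beta}$ is the slope.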

\[\begin{align} \widehat{\alpha} &= \frac{ \left(\sum_{i=1}^n y_i\right)\left(\sum_{i=1}^n x_i^2\right) - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n x_i y_i\right) }{ n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2 } \\[8pt] \widehat{\beta} &= \frac{ n \sum_{i=1}^n x_i y_i - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right) }{ n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2 }. \end{align}\]

(Wikipedia)
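To make the two formulas concrete, here is a dependency-free sketch that applies them to four made-up points lying exactly on the line y = 1 + 2x. The function name `fit_line` is my own; the arithmetic follows the formulas above term by term.

```python
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_xx - sum_x ** 2
    alpha_hat = (sum_y * sum_xx - sum_x * sum_xy) / denom  # intercept
    beta_hat = (n * sum_xy - sum_x * sum_y) / denom        # slope
    return alpha_hat, beta_hat

# Points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)
```

Because the points are noise-free, the recovered intercept and slope should match the line that generated them.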

Unlike other paradigms (e.g., neural networks), a linear regression model is one that can be derived from the data analytically in a reasonable amount of time.

This model can also be interpreted in an intuitive way. After all, it’s just a line running through the two variables, signifying their correlation, and it comes pre-packaged with a measure of how far the data is from the line on average. We’re going to take advantage of this characteristic in a future post.

Mind-Reading Machine

2026-01-24T00:00:00+00:00

Re-structuring the blog required that I temporarily take down this post. I hope to have it back up soon.

Deep Energy-Based Model Fitted with Noise Contrastive Estimation

2026-01-17T00:00:00+00:00

Deep Energy-Based Model Fitted with Noise Contrastive Estimation (NCE)

This notebook is my attempt to better understand the process of training an Energy-Based Model.

The model is fitted to tabular data (the wine quality data set from the UCI repository). Credit is due to volagold (https://github.com/volagold/nce/), without whose code example I would not have been able to write this.

import math

import torch
from torch import nn, optim
import pandas as pd
import numpy as np

# device selection for GPU/CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# load the data; every column, including the quality rating in the last column, is treated as a feature
df = pd.read_csv("./wine_quality/winequality-red.csv", delimiter=";")
num_features = len(df.columns)
batch_size = df.shape[0]
X = torch.tensor(df.values, dtype=torch.float32).to(device)

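Define a feed-forward neural network f to output the energy score. The module's forward pass returns $-f(x) - \log Z(\theta)$, which the training loop below uses as the model's log-density.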

From the PyTorch docs:

Parameter is a “kind of Tensor that is to be considered a module parameter”.

That is, a Parameter is automatically included in the parameters() iterator.
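A minimal sketch of that behavior, using a hypothetical `Tiny` module written only for this illustration: a tensor wrapped in `nn.Parameter` shows up in the module's parameters, while a plain tensor attribute does not, even if it requires gradients.

```python
import torch
from torch import nn


class Tiny(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Wrapped in nn.Parameter: registered with the module
        self.as_parameter = nn.Parameter(torch.tensor([1.0]))
        # Plain tensor attribute: not registered, so an optimizer built
        # from parameters() would never update it
        self.as_tensor = torch.tensor([1.0], requires_grad=True)


names = [name for name, _ in Tiny().named_parameters()]
print(names)  # ['as_parameter']
```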

In NCE, $\log Z(\theta)$ is treated as a learnable parameter (Song et al., 2021).
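To spell out where that parameter enters, write $E_\theta(x)$ for the network's energy output and $c$ for the learnable scalar. Both symbols are notation introduced here for illustration and do not appear in the code.

\[\begin{align} p_\theta(x) &= \frac{\exp\left(-E_\theta(x)\right)}{Z(\theta)}, \qquad Z(\theta) = \int \exp\left(-E_\theta(x)\right)\,dx \\[8pt] \log p_\theta(x) &= -E_\theta(x) - \log Z(\theta) \approx -E_\theta(x) - c. \end{align}\]

The integral defining $Z(\theta)$ is generally intractable, so $c$ is fitted jointly with the network weights instead of being computed.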

class FeedForwardNN(nn.Module):
    
    def __init__(self, dims=32):
        super(FeedForwardNN, self).__init__()
        self.log_Z_of_theta = nn.Parameter(torch.tensor([1.0], requires_grad=True))
        self.f = nn.Sequential(
            nn.Linear(num_features, dims),
            nn.LeakyReLU(0.2),
            nn.Linear(dims, dims),
            nn.LeakyReLU(0.2),
            nn.Linear(dims, 1)
        )
        
    def forward(self, x):
        return -self.f(x) - self.log_Z_of_theta

From the PyTorch docs, MultivariateNormal creates “a multivariate normal (also called Gaussian) distribution parameterized by a mean vector and a covariance matrix”.

From the PyTorch docs, eye returns “a 2-D tensor with ones on the diagonal and zeros elsewhere”.

The code below creates a Gaussian with num_features dimensions, a mean of 0, and the identity matrix I as its covariance matrix.

model = FeedForwardNN().to(device)
# noise distribution: a standard Gaussian with one dimension per column of X
noise = torch.distributions.MultivariateNormal(
    torch.zeros(num_features, device=device),
    torch.eye(num_features, device=device),
)
optimizer = optim.Adam(model.parameters())
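For a Gaussian with zero mean and identity covariance, the log-density has a simple closed form: $-\tfrac{1}{2}\lVert x \rVert^2 - \tfrac{d}{2}\log(2\pi)$, where $d$ is the number of dimensions. The sketch below evaluates it in plain Python; the helper name `standard_normal_log_prob` and the test points are assumptions for illustration. Up to floating-point error, this should be the quantity that `noise.log_prob` returns for this particular noise distribution.

```python
import math


def standard_normal_log_prob(x):
    """Log-density of N(0, I) at the point x, given as a list of floats."""
    d = len(x)
    squared_norm = sum(v * v for v in x)
    return -0.5 * squared_norm - 0.5 * d * math.log(2 * math.pi)


log_q_origin = standard_normal_log_prob([0.0, 0.0])
log_q_far = standard_normal_log_prob([3.0, 4.0])
# The point at distance 5 from the mean is 12.5 nats less likely than the mean
print(log_q_origin - log_q_far)
```

Points far from the origin receive very negative log-densities, which is worth keeping in mind when the data columns are on scales much larger than 1.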

MAX_EPOCHS = 1000
for i in range(MAX_EPOCHS):
    optimizer.zero_grad()

    # GENERATE NOISE: one noise sample per data row
    gen = noise.sample((len(X),))

    # CALCULATE THE NCE LOSS
    # log-densities of the data and noise samples under the model (p) and the noise distribution (q)
    logp_x = model(X)
    logq_x = noise.log_prob(X).unsqueeze(1)
    logp_gen = model(gen)
    logq_gen = noise.log_prob(gen).unsqueeze(1)

    # log-probability that a data row is classified as data, and a noise row as noise
    value_data = logp_x - torch.logsumexp(torch.cat([logp_x, logq_x], dim=1), dim=1, keepdim=True)
    value_gen = logq_gen - torch.logsumexp(torch.cat([logp_gen, logq_gen], dim=1), dim=1, keepdim=True)

    loss = -(value_data.mean() + value_gen.mean())

    # accuracy of the implied data-vs-noise classifier (monitoring only)
    with torch.no_grad():
        r_x = torch.sigmoid(logp_x - logq_x)
        r_gen = torch.sigmoid(logq_gen - logp_gen)
        acc = ((r_x > 0.5).float().mean() + (r_gen > 0.5).float().mean()) / 2

    loss.backward()
    optimizer.step()

    if i % 100 == 0:
        print("Loss:", loss.item(), "Accuracy:", acc.item())
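The two `logsumexp` expressions in the loop are the log-probabilities of a binary classifier that separates data from noise: `value_data` equals $\log \sigma(\log p - \log q)$ and `value_gen` equals $\log \sigma(\log q - \log p)$, where $\sigma$ is the sigmoid. The sketch below checks that identity for a single pair of scalars in plain Python. The helper names and the two log-density values are hypothetical.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def logsumexp2(a, b):
    """Stable log(exp(a) + exp(b)) for two scalars."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


logp, logq = -1.3, -2.7  # hypothetical model and noise log-densities

# Same form as in the training loop, for one sample
value_data = logp - logsumexp2(logp, logq)
value_gen = logq - logsumexp2(logp, logq)

print(value_data, log_sigmoid(logp - logq))
print(value_gen, log_sigmoid(logq - logp))
```

Each printed pair should agree up to floating-point error, and the two classifier probabilities sum to 1.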

x = torch.randn(num_features, device=device, requires_grad=True)
optimizer = optim.Adam([x])
model(x)

The randomly generated vector above has a higher energy score than the vectors below that result from optimization. That is as expected: the higher the energy score, the more out-of-distribution the input.

STEPS = 100
for _ in range(STEPS):
    optimizer.zero_grad()
    energy = model(x)
    energy.backward()
    optimizer.step()
    
x_star = x.detach()
x_star
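The loop above follows the gradient of the model's output with respect to the input. As a minimal sketch of the same idea on a toy energy with a known minimum, the code below descends a quadratic bowl. The quadratic energy, its center, the starting point, and the learning rate are all assumptions for illustration, and plain gradient descent stands in for Adam.

```python
def energy(x, center):
    """Toy energy: half the squared distance from the center."""
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, center))


def energy_grad(x, center):
    """Gradient of the toy energy with respect to x."""
    return [xi - ci for xi, ci in zip(x, center)]


center = [1.0, -2.0]  # lowest-energy point of the toy model
x = [4.0, 3.0]        # arbitrary starting input
start_energy = energy(x, center)

lr = 0.1
for _ in range(100):
    grad = energy_grad(x, center)
    # Step against the gradient to lower the energy
    x = [xi - lr * gi for xi, gi in zip(x, grad)]

end_energy = energy(x, center)
print(start_energy, end_energy)
```

One thing to keep in mind when adapting this to the network above: its forward pass returns $-f(x) - \log Z(\theta)$, which the training loop treats as a log-density, so lowering the energy $f(x)$ corresponds to raising that output.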

example = X[10].detach().clone()
example[-1] = 9.0
example

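# optimize only the rating (last entry); every other feature stays fixed
w_opt = torch.nn.Parameter(example[-1].detach().clone())
optimizer = optim.Adam([w_opt])

for _ in range(1000):
    optimizer.zero_grad()
    # rebuild the row with the trainable rating in the last position
    ex = example.clone()
    ex[-1] = w_opt
    energy = model(ex)
    energy.backward()
    optimizer.step()

# write the optimized rating back without carrying gradient history
example[-1] = w_opt.detach()
example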

The true rating given to the above data point was 5. The model estimates the rating to be 10.

Clearly, this model is not ready to generate plausible data points. More work is needed.

One last thing: for our amusement, let’s show that the energy score is higher when the rating is lower for this data point.

a = example.clone()
a[-1] = 9.0
(model(a), model(example))