<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="tracymedcalf.com/feed.xml" rel="self" type="application/atom+xml" /><link href="tracymedcalf.com/" rel="alternate" type="text/html" /><updated>2026-06-13T01:31:28+00:00</updated><id>tracymedcalf.com/feed.xml</id><title type="html">Machines and Minds</title><subtitle>The blog of Tracy Medcalf, a data scientist and machine learning engineer.</subtitle><entry><title type="html">The Synthetic-Analytic Distinction</title><link href="tracymedcalf.com/philosophy/logic/2026/02/20/synthetic-analytic.html" rel="alternate" type="text/html" title="The Synthetic-Analytic Distinction" /><published>2026-02-20T00:00:00+00:00</published><updated>2026-02-20T00:00:00+00:00</updated><id>tracymedcalf.com/philosophy/logic/2026/02/20/synthetic-analytic</id><content type="html" xml:base="tracymedcalf.com/philosophy/logic/2026/02/20/synthetic-analytic.html"><![CDATA[<p>The distinction between analytic and synthetic propositions is often associated with logical positivism, a philosophy propounded by A.J. Ayer, among others.</p>

<p>Synthetic propositions are those that are associated with some level of uncertainty. Their referents are data that can be gathered using the senses.</p>

<p>Analytic propositions, on the other hand, are true because of the way that we have defined them. If a synthetic proposition has some probability $p$, where $p &lt; 1$, then an analytic proposition has a probability of 1.</p>

<p>One might object to the notion that anything can ever be fully certain. Those objections are, of course, well-founded; however, to build upward, certain necessary assumptions must provide us with a foundation.</p>

<p>That we defend analytic propositions is not to mire ourselves in dogma but to show that they enable a vast array of cherished propositions.</p>

<h2 id="whether-mathematical-operations-are-uncertain">Whether mathematical operations are uncertain</h2>

<p>Suppose you did try to incorporate uncertainty into your math, and you did so using the language of probability.</p>

<p>For every probability calculation, you would multiply each posterior by $1 - \epsilon$, where $\epsilon$ is a penalty constant. Here, I want to emphasize that it is the operation itself that we would be skeptical of, not hardware limitations, i.e., errors due to limited-precision floating-point numbers.</p>

<p>Then, since that multiplication by $1 - \epsilon$ is also a probability calculation, we would need to multiply the resulting posterior by $1 - \epsilon$ again, and again, resulting in an infinite regress.</p>
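
<p>A minimal sketch of this regress, assuming a hypothetical helper <code>penalized</code> and illustrative values for the posterior and $\epsilon$ (neither comes from any library): each further round of doubt multiplies by $1 - \epsilon$ once more, so after $n$ rounds the posterior is $p(1 - \epsilon)^n$, which shrinks toward 0 as $n$ grows.</p>

```python
def penalized(posterior, epsilon, rounds):
    """Apply the (1 - epsilon) penalty `rounds` times to a posterior."""
    result = posterior
    for _ in range(rounds):
        result *= 1 - epsilon
    return result


# Illustrative values only: a posterior of 0.9 and a penalty of 0.01.
for rounds in (0, 1, 10, 100, 1000):
    print(rounds, penalized(0.9, 0.01, rounds))
```

<p>Since the regress never terminates, no finite number of rounds gives a final answer; the sketch only shows that every additional round erodes the posterior further.</p>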

<p>Note that this is not the same as saying that there could have been a misstep in a body of calculations or a proof. In that case, we simply say that the author made a mistake, not that math itself is somehow suspect. The written result either does or does not follow from the written premises, and we either do or do not assume that it follows until we have verified for ourselves that it does not. Such cases are legion, but the fault always lies with the fallible human, never with the mathematics.</p>

<h2 id="whether-anything-in-nature-cannot-be-explained-by-math">Whether anything in nature cannot be explained by math</h2>

<p>The objection to all of this might just be that the uncertainty does not have to be distributed over the mathematical operations. Instead, what we are uncertain about is whether there is anything in nature that could not be explained by math.</p>

<p>I literally cannot conceive of what that would be. What thought could anyone have that could not be formulated as mathematics? What property would it have to have that it would not simply be one more relation added to humanity’s knowledge of mathematics, and yet still be capable of being formulated as a synthetic proposition?</p>

<h2 id="whether-all-of-this-is-nonsense-because-we-are-exalting-math-as-some-kind-of-higher-being">Whether all of this is nonsense because we are exalting math as some kind of higher being</h2>

<p>From a purely conventionalist standpoint, we have said nothing inconsistent. “Math” is not a term that describes the scribbles in the notebook of a calculus student. It is not the syntax. It is the semantics.</p>

<p>To have a coherent theory of logic and language, we must employ the concept of meaning. How else would math fit into this picture except how we have painted it when we have already assumed that math can imply nothing about the real world?</p>

<p>“Ah, but you should not have assumed that.”</p>

<p>Then let’s claim that math can imply observations and tell a little parable to illustrate the possible consequences.</p>

<hr />

<p>In ancient Greece, two philosophers decided to set out to find the result of $1 + 1$. One traveled north, and the other went south.</p>

<p>In the north, one philosopher came to stay as a guest in the home of a couple for most of a year. In that time, the couple conceived ($1 + 1$) and the woman birthed a baby ($= 3$).</p>

<p>The other philosopher came to a pond with water so crystal clear that you could see right to the bottom. He spent the better part of a year watching the fish. The fish had large mouths that they could open wide and use to gobble up smaller fish. He saw a fish ($1$) eat a smaller fish ($+ 1$), and the result was only one fish ($= 1$). This was a pattern that he saw repeat itself many times.</p>

<p>The two philosophers journeyed back to their starting point and reconvened.</p>

<p>The northern philosopher said, “I have discovered the answer. One plus one equals three.”</p>

<p>“Why, no,” the southern philosopher said. “One plus one equals one.”</p>

<p>To reconcile this difference, they decided that in the north $1 + 1 = 3$ and in the south $1 + 1 = 1$. In other words, the result depended on the geographic location.</p>

<hr />

<p>Claims regarding analytic propositions are seen as suspect because of instincts that carry over from examining synthetic propositions.</p>

<p>A synthetic proposition, <em>p</em>, is one for which there is an observation we could encounter for which we would have to adjust our credence regarding <em>p</em>.</p>

<p>We will illustrate two examples in which a proposition might at first be taken to be synthetic but is in fact nonsense.</p>

<h3 id="the-dragon-in-the-garage">The dragon in the garage</h3>

<p>This hypothetical famously comes to us from Carl Sagan in <em>The Demon-Haunted World</em>. Suppose someone claims that there is a dragon in his garage.</p>

<p>I follow the claimant to his home to examine this alleged dragon. Within, however, I see no dragon. Subsequently, the claimant explains my observation away by saying that this dragon is invisible.</p>

<p>“No problem,” I say. “We will simply spread flour on the floor to capture the footprints.”</p>

<p>Except our claimant says that the dragon constantly floats in the air, never touching the ground.</p>

<p>I go on to propose other tests, such as using an infrared detector and spraying paint on the dragon. After each such proposal, I am met with some reason why the experiment will yield no observations other than those we would make if the dragon didn’t exist at all.</p>

<p>There is no observation we could make that would change our minds about whether there is a dragon in his garage.</p>

<h3 id="exists-and-only-exists">Exists and only exists</h3>

<p>In <em>Language, Truth and Logic</em>, A.J. Ayer supports the idea that “existence” is not a predicate. This is because any predicate must have the property that it can stand alone. Treating “existence” as a predicate sets us up for a contradiction in the following way:</p>

<p>Imagine that we have a database of propositions consisting of predicates and atoms. In the proposition <em>p(a)</em>, <em>p</em> is a predicate and <em>a</em> is an atom. If the atom <em>a</em> appears in <em>p(a)</em> and <em>q(a)</em>, then <em>p</em> and <em>q</em> are properties of the same thing.</p>

<p>Suppose we add to the database <em>existence(b)</em> without the atom <em>b</em> ever appearing in any other proposition. Then this is to be interpreted in English as, “<em>b</em> exists and yet does not have any other properties.”</p>

<p>In other words, it is an invisible, floating, incorporeal dragon. Our senses could not gather any data about this <em>b</em>, but we suppose that it exists anyway.</p>

<p>Making “existence” a predicate makes possible a proposition that flies in the face of common sense.</p>
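
<p>The database described above can be sketched in a few lines. This is only an illustration: the set of pairs, the predicate names, and the helper <code>bare_existents</code> are hypothetical and are not taken from Ayer. The helper finds every atom whose sole predicate is “existence”.</p>

```python
# Hypothetical database of propositions, stored as (predicate, atom) pairs.
database = {
    ("dragon", "a"),
    ("scaly", "a"),
    ("existence", "a"),
    ("existence", "b"),  # b appears in no other proposition
}


def bare_existents(propositions):
    """Return the atoms whose only predicate is 'existence'."""
    predicates_by_atom = {}
    for predicate, atom in propositions:
        predicates_by_atom.setdefault(atom, set()).add(predicate)
    return sorted(
        atom
        for atom, predicates in predicates_by_atom.items()
        if predicates == {"existence"}
    )


print(bare_existents(database))
```

<p>The atom <em>b</em> is the only one returned: nothing in the database says anything about it except that it exists.</p>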

<hr />

<p>These two examples are both relevant to our argument in the following way.</p>

<h3 id="an-analytic-proposition-is-not-a-synthetic-proposition-accords-with-our-usage-of-language">“An analytic proposition is not a synthetic proposition” accords with our usage of language</h3>

<p>An analytic proposition is not a synthetic proposition in that no observation we could make could ever change our mind about an analytic proposition, and <em>this is consistent with the way we use language</em>.</p>

<p>“Dragon” refers to a member of a class of fearsome, scaly creatures. On the other hand, <em>x</em> as a variable is not a symbol that conventionally refers to any member of any class. Neither does any mathematical relation.</p>

<p>“But wait!” you say. “What if I am solving a problem where we suppose that <em>x</em> is a quantity of money in a certain situation and there is enclosing that <em>x</em> a formula relating several other variables and constants?”</p>

<p>Then you merely suppose <em>x</em> refers to that quantity and quality situationally. That does not mean that <em>x</em> has the same characteristics as a symbol like “dragon”, which can call to mind a definite set of sensory experiences even when it is spoken in isolation.</p>

<p>In that situation, your usage of the formula is an <em>empirical proposition</em>. You are asserting that the real-world quantities and qualities can be related in such a way that they can be described by the formula. When an observation is encountered that does not accord with this proposition, you would be well-advised to discard it in favor of another. You <em>do not</em>, however, discard any analytic proposition because no such proposition could ever have implied anything about your observation.</p>

<hr />

<p>You come to me with a proof beautifully typed in LaTeX. Do I conclude that the conclusion of this proof is 100% certain to follow from the premises? Not necessarily. I could have a severe sinus infection that day, and that could skew my judgment.</p>

<p>Is it the analytic proposition referred to by the syntax of the proof that is uncertain, then?</p>

<p>No, what is uncertain is whether the syntax, the physical marks on the page, agrees with my brain about which analytic propositions are referred to therein.</p>

<p>To use an example that hits a bit closer to home for us programmers, suppose that I wrote an app and pushed it to GitHub. My users download it. One of them opens issue #1 in the Issues pane of GitHub. This issue documents that the behavior of the software is other than what the user would expect. In other words, it failed the test <em>t</em>, where <em>t</em> is whatever test the user made of the software.</p>

<p>To break this down precisely: the code represents an analytic proposition, <em>p</em>; there is an empirical proposition, <em>q</em>, in my brain about what I think the code represents; and there is another analytic proposition, <em>r</em>, corresponding to a Minimum Viable Product that would pass test <em>t</em>.</p>

<p>Whether the software will subsequently pass test <em>t</em> depends on whether I successfully analyze the code, updating my <em>q</em> to <em>q’</em>; modify the code, updating it to the analytic proposition <em>p’</em>; and then update my <em>q’</em> to <em>q’’</em>, the empirical proposition about which analytic proposition the code now represents.</p>

<p>If the user closes the issue, then they believe that the software represents <em>r</em>. In other words, the user holds <em>s(r)</em>, or the empirical proposition that the program passes <em>t</em>.</p>

<h2 id="why-care-about-all-of-this">Why care about all of this?</h2>

<p>Eventually, I aim to create a probabilistic logic programming language with a general-purpose knowledge base. When I do so, some productive assumptions will need to be made, such as</p>

<ul>
  <li>Some propositions are associated with a certain probability.</li>
  <li>Some propositions are true because they are true by definition.</li>
</ul>

<p>This corresponds to the synthetic/analytic distinction.</p>
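
<p>As a rough sketch of how those two assumptions might look in a knowledge base (the names <code>Proposition</code>, <code>analytic</code>, and <code>synthetic</code> are hypothetical and do not describe the design of the planned language), each proposition carries a probability, and “true by definition” is modeled as a probability of exactly 1:</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    statement: str
    probability: float  # credence in the statement, between 0 and 1

    @property
    def is_analytic(self):
        # In this sketch, "true by definition" is modeled as probability 1.
        return self.probability == 1.0


def analytic(statement):
    """Build a proposition that is true by definition."""
    return Proposition(statement, 1.0)


def synthetic(statement, probability):
    """Build a proposition that carries some uncertainty."""
    if probability == 1.0:
        raise ValueError("a synthetic proposition cannot be certain")
    return Proposition(statement, probability)


knowledge_base = [
    analytic("1 + 1 = 2"),
    synthetic("there is a dragon in the garage", 0.001),  # illustrative value
]
for proposition in knowledge_base:
    kind = "analytic" if proposition.is_analytic else "synthetic"
    print(kind, proposition.probability, proposition.statement)
```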

<h1 id="bibliography">Bibliography</h1>

<p>Ayer, Alfred J. Language, Truth and Logic. Reprinted. Penguin Books, 1990.</p>

<p>Sagan, Carl. The Demon-Haunted World: Science As a Candle in the Dark. With Ann Druyan. Random House Publishing Group, 2011.</p>]]></content><author><name></name></author><category term="philosophy" /><category term="logic" /><summary type="html"><![CDATA[Discussing the merits of the synthetic-analytic distinction]]></summary></entry><entry><title type="html">Imputation Using Random Sampling and K-Nearest Neighbors</title><link href="tracymedcalf.com/missing%20data/imputation/2026/02/13/imputation.html" rel="alternate" type="text/html" title="Imputation Using Random Sampling and K-Nearest Neighbors" /><published>2026-02-13T00:00:00+00:00</published><updated>2026-02-13T00:00:00+00:00</updated><id>tracymedcalf.com/missing%20data/imputation/2026/02/13/imputation</id><content type="html" xml:base="tracymedcalf.com/missing%20data/imputation/2026/02/13/imputation.html"><![CDATA[<p>In machine learning, imputation refers to the creation of synthetic data from existing data for the purpose of filling missing data. Missing data are any NaN or null cells in the dataframe. Missing data is to be avoided as it can be problematic for training machine learning models.</p>

<p>The first method of imputation described in this post is designed for categorical data. If the feature you want to impute is continuous, then you’ll want to use the imputation functions built into scikit-learn, as detailed later in this post.</p>

<p>Because it contains categorical features, we’ll be using the Titanic dataset hosted on OpenML to demonstrate.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">fetch_openml</span>
<span class="kn">from</span> <span class="nn">sklearn.impute</span> <span class="kn">import</span> <span class="n">KNNImputer</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">random</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># Fetch the Titanic dataset
</span><span class="n">data</span> <span class="o">=</span> <span class="n">fetch_openml</span><span class="p">(</span><span class="n">data_id</span><span class="o">=</span><span class="mi">40945</span><span class="p">,</span> <span class="n">as_frame</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">data</span><span class="p">.</span><span class="n">frame</span>
<span class="n">df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>pclass</th>
      <th>survived</th>
      <th>name</th>
      <th>sex</th>
      <th>age</th>
      <th>sibsp</th>
      <th>parch</th>
      <th>ticket</th>
      <th>fare</th>
      <th>cabin</th>
      <th>embarked</th>
      <th>boat</th>
      <th>body</th>
      <th>home.dest</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>1</td>
      <td>1</td>
      <td>Allen, Miss. Elisabeth Walton</td>
      <td>female</td>
      <td>29.0000</td>
      <td>0</td>
      <td>0</td>
      <td>24160</td>
      <td>211.3375</td>
      <td>B5</td>
      <td>S</td>
      <td>2</td>
      <td>NaN</td>
      <td>St Louis, MO</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>1</td>
      <td>Allison, Master. Hudson Trevor</td>
      <td>male</td>
      <td>0.9167</td>
      <td>1</td>
      <td>2</td>
      <td>113781</td>
      <td>151.5500</td>
      <td>C22 C26</td>
      <td>S</td>
      <td>11</td>
      <td>NaN</td>
      <td>Montreal, PQ / Chesterville, ON</td>
    </tr>
    <tr>
      <th>2</th>
      <td>1</td>
      <td>0</td>
      <td>Allison, Miss. Helen Loraine</td>
      <td>female</td>
      <td>2.0000</td>
      <td>1</td>
      <td>2</td>
      <td>113781</td>
      <td>151.5500</td>
      <td>C22 C26</td>
      <td>S</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>Montreal, PQ / Chesterville, ON</td>
    </tr>
    <tr>
      <th>3</th>
      <td>1</td>
      <td>0</td>
      <td>Allison, Mr. Hudson Joshua Creighton</td>
      <td>male</td>
      <td>30.0000</td>
      <td>1</td>
      <td>2</td>
      <td>113781</td>
      <td>151.5500</td>
      <td>C22 C26</td>
      <td>S</td>
      <td>NaN</td>
      <td>135.0</td>
      <td>Montreal, PQ / Chesterville, ON</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1</td>
      <td>0</td>
      <td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td>
      <td>female</td>
      <td>25.0000</td>
      <td>1</td>
      <td>2</td>
      <td>113781</td>
      <td>151.5500</td>
      <td>C22 C26</td>
      <td>S</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>Montreal, PQ / Chesterville, ON</td>
    </tr>
  </tbody>
</table>
</div>

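<p>We shuffle the DataFrame for the purpose of randomly deleting values. Note that <code class="language-plaintext highlighter-rouge">random.seed</code> seeds only Python’s built-in <code class="language-plaintext highlighter-rouge">random</code> module; to make the shuffle itself reproducible, pass <code class="language-plaintext highlighter-rouge">random_state</code> to <code class="language-plaintext highlighter-rouge">sample</code>.</p>

<p>The call to <code class="language-plaintext highlighter-rouge">reset_index</code> is necessary because <code class="language-plaintext highlighter-rouge">.loc</code> slices by index label rather than by position. Without it, our later slicing of the DataFrame would take every row up to the one labeled <em>n</em>, wherever the shuffle placed it, instead of the first <em>n</em> rows.</p>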

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># shuffled dataframe
</span><span class="n">sdf</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">frac</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
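
<p>To see why the index reset matters, here is a small sketch on a toy frame (the frame, its column name, and the <code>random_state</code> value are made up for illustration). After a shuffle, each row keeps its original label, so a label-based slice with <code>.loc</code> no longer means “the first few rows”.</p>

```python
import pandas as pd

# Hypothetical toy frame; the index labels start out equal to the row positions.
toy = pd.DataFrame({"value": list("abcdef")})

# Shuffle the rows; each row keeps its original index label.
shuffled = toy.sample(frac=1, random_state=0)

# .loc slices by label and includes the end label, so this selects every row
# from the top down to wherever label 2 landed: anywhere from 1 to 6 rows.
by_label = shuffled.loc[:2]

# After reset_index the labels are 0..5 again, so .loc[:2] is the first 3 rows.
reset = shuffled.reset_index(drop=True)
first_three = reset.loc[:2]

print(len(by_label), len(first_three))
```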

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sdf</span><span class="p">[</span><span class="s">'sex'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sex
male      843
female    466
Name: count, dtype: int64
</code></pre></div></div>

<p>Because each category of this column contains a large number of values, the code below cannot delete every instance of either category.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">NUM_DELETE</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">sdf</span><span class="p">.</span><span class="n">loc</span><span class="p">[:</span><span class="n">NUM_DELETE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="p">[</span><span class="s">'sex'</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">sdf</span><span class="p">[</span><span class="s">'sex'</span><span class="p">].</span><span class="n">value_counts</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sex
male      791
female    418
Name: count, dtype: int64
</code></pre></div></div>

<p>In a previous post, we talked about the Mind-Reading Machine, which uses a Markov Chain. In a Markov Chain, the next state depends only on the current state.</p>

<p>This method of imputation does not depend on the previous state and therefore cannot be considered a Markov Chain. It is, however, similar to the Mind-Reading Machine in that we will choose at random from an array, so that the probability of choosing each unique element is proportional to how frequently it shows up in the array.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sexes</span> <span class="o">=</span> <span class="n">sdf</span><span class="p">[</span><span class="o">~</span><span class="n">sdf</span><span class="p">[</span><span class="s">'sex'</span><span class="p">].</span><span class="n">isna</span><span class="p">()][</span><span class="s">'sex'</span><span class="p">]</span>
<span class="n">choices</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">choices</span><span class="p">(</span><span class="n">sexes</span><span class="p">.</span><span class="n">array</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="n">NUM_DELETE</span><span class="p">)</span>
<span class="n">sdf</span><span class="p">.</span><span class="n">loc</span><span class="p">[:</span><span class="n">NUM_DELETE</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="p">[</span><span class="s">'sex'</span><span class="p">]]</span> <span class="o">=</span> <span class="n">choices</span>
<span class="n">sdf</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>pclass</th>
      <th>survived</th>
      <th>name</th>
      <th>sex</th>
      <th>age</th>
      <th>sibsp</th>
      <th>parch</th>
      <th>ticket</th>
      <th>fare</th>
      <th>cabin</th>
      <th>embarked</th>
      <th>boat</th>
      <th>body</th>
      <th>home.dest</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>3</td>
      <td>0</td>
      <td>Youseff, Mr. Gerious</td>
      <td>male</td>
      <td>45.5</td>
      <td>0</td>
      <td>0</td>
      <td>2628</td>
      <td>7.2250</td>
      <td>NaN</td>
      <td>C</td>
      <td>NaN</td>
      <td>312.0</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1</td>
      <td>1</td>
      <td>Candee, Mrs. Edward (Helen Churchill Hungerford)</td>
      <td>male</td>
      <td>53.0</td>
      <td>0</td>
      <td>0</td>
      <td>PC 17606</td>
      <td>27.4458</td>
      <td>NaN</td>
      <td>C</td>
      <td>6</td>
      <td>NaN</td>
      <td>Washington, DC</td>
    </tr>
    <tr>
      <th>2</th>
      <td>3</td>
      <td>1</td>
      <td>Olsson, Mr. Oscar Wilhelm</td>
      <td>female</td>
      <td>32.0</td>
      <td>0</td>
      <td>0</td>
      <td>347079</td>
      <td>7.7750</td>
      <td>NaN</td>
      <td>S</td>
      <td>A</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3</td>
      <td>0</td>
      <td>Theobald, Mr. Thomas Leonard</td>
      <td>female</td>
      <td>34.0</td>
      <td>0</td>
      <td>0</td>
      <td>363294</td>
      <td>8.0500</td>
      <td>NaN</td>
      <td>S</td>
      <td>NaN</td>
      <td>176.0</td>
      <td>NaN</td>
    </tr>
    <tr>
      <th>4</th>
      <td>3</td>
      <td>0</td>
      <td>Svensson, Mr. Johan</td>
      <td>female</td>
      <td>74.0</td>
      <td>0</td>
      <td>0</td>
      <td>347060</td>
      <td>7.7750</td>
      <td>NaN</td>
      <td>S</td>
      <td>NaN</td>
      <td>NaN</td>
      <td>NaN</td>
    </tr>
  </tbody>
</table>
</div>
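
<p>The claim that drawing at random from the observed values reproduces their proportions can be checked with the standard library alone. This sketch rebuilds a list with the same counts as the <code>value_counts</code> output above (791 male, 418 female); the number of draws is an arbitrary choice.</p>

```python
import random
from collections import Counter

random.seed(1)

# A list with the same category counts as the non-missing 'sex' values.
observed = ["male"] * 791 + ["female"] * 418

# Each element is equally likely to be drawn, so each category is drawn in
# proportion to how often it appears in the list.
draws = random.choices(observed, k=10_000)
counts = Counter(draws)

share_male = counts["male"] / len(draws)
expected_share = 791 / len(observed)
print(round(share_male, 3), round(expected_share, 3))
```

<p>The two printed shares should be close but not identical, since the draws are random.</p>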

<p>In the original dataset, some of the age values are missing. Fortunately, scikit-learn contains convenient means of imputing data, including numerical data.</p>

<p>First, we encode sex as elements of the set $\{0, 1\}$, because</p>

<ol>
  <li>
    <p>Although I think it unlikely that sex will predict age, I want to demonstrate encoding.</p>
  </li>
  <li>
    <p>This feature is currently categorical.</p>
  </li>
  <li>
    <p>The K-Nearest Neighbors imputer (AKA <code class="language-plaintext highlighter-rouge">KNNImputer</code>) requires that the input be numerical.</p>
  </li>
</ol>

<p>(Scikit-Learn)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Encode the categorical labels
</span><span class="n">sdf</span><span class="p">[</span><span class="s">'male'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">sdf</span><span class="p">[</span><span class="s">'sex'</span><span class="p">])[</span><span class="s">'male'</span><span class="p">]</span>
</code></pre></div></div>

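<p>As mentioned, <code class="language-plaintext highlighter-rouge">KNNImputer</code> only wants numeric types. We will therefore provide ourselves with a means of selecting only the numeric columns from the dataframe. Note that <code class="language-plaintext highlighter-rouge">get_dummies</code> returned a boolean column here, which the list of integer and float dtypes below does not match, so <code class="language-plaintext highlighter-rouge">male</code> is absent from the selection shown; casting it with <code class="language-plaintext highlighter-rouge">.astype(int)</code> would include it.</p>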

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">numerics</span> <span class="o">=</span> <span class="p">[</span><span class="s">'int16'</span><span class="p">,</span> <span class="s">'int32'</span><span class="p">,</span> <span class="s">'int64'</span><span class="p">,</span> <span class="s">'float16'</span><span class="p">,</span> <span class="s">'float32'</span><span class="p">,</span> <span class="s">'float64'</span><span class="p">]</span>
<span class="n">selected_columns</span> <span class="o">=</span> <span class="n">sdf</span><span class="p">.</span><span class="n">select_dtypes</span><span class="p">(</span><span class="n">include</span><span class="o">=</span><span class="n">numerics</span><span class="p">).</span><span class="n">columns</span>
<span class="n">selected_columns</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Index(['pclass', 'age', 'sibsp', 'parch', 'fare', 'body'], dtype='object')
</code></pre></div></div>
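
<p>If the encoded column should take part in the imputation, one option is to cast it to an integer before selecting columns. The sketch below uses a made-up three-row frame, and <code>include="number"</code> is a swapped-in shorthand for the explicit dtype list above; it is not the code that produced the output in this post.</p>

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({"sex": ["male", "female", "male"], "age": [22.0, None, 35.0]})

# Cast the dummy column to an integer so that it counts as numeric.
toy["male"] = pd.get_dummies(toy["sex"])["male"].astype(int)

# "number" matches integer and float columns but not boolean or object ones.
numeric_columns = toy.select_dtypes(include="number").columns
print(list(numeric_columns))
```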

<p>A high-level interpretation of the <code class="language-plaintext highlighter-rouge">fit_transform</code> method of <code class="language-plaintext highlighter-rouge">KNNImputer</code> is as follows:</p>

<ul>
  <li>For each row, for each cell, if that cell is not missing, do nothing. Otherwise, proceed to the next step.</li>
  <li>Create a value for that cell using the K-Nearest Neighbors algorithm: find the rows closest to this one, measured on the features that both rows have present, and average their values for the missing feature (five neighbors by default).</li>
</ul>
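
<p>The steps above can be tried on a tiny array before applying them to the Titanic data. The array and the choice of <code>n_neighbors=2</code> are illustrative only; the call itself is the same <code>KNNImputer</code> used in this post.</p>

```python
import numpy as np
from sklearn.impute import KNNImputer

nan = np.nan
X = np.array([
    [1.0, 2.0, nan],
    [3.0, 4.0, 3.0],
    [nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Present cells are left as they are; each missing cell becomes the mean of
# that column over the 2 nearest rows that have a value there.
filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(filled)
```

<p>Here the missing cell in the first row becomes 4.0, the mean of the 3 and 5 found in the second and third rows, which are its two nearest neighbors.</p>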

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">imp</span> <span class="o">=</span> <span class="n">KNNImputer</span><span class="p">().</span><span class="n">set_output</span><span class="p">(</span><span class="n">transform</span><span class="o">=</span><span class="s">'pandas'</span><span class="p">)</span>
<span class="n">transformed</span> <span class="o">=</span> <span class="n">imp</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">sdf</span><span class="p">[</span><span class="n">selected_columns</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sdf</span><span class="p">[</span><span class="n">selected_columns</span><span class="p">]</span> <span class="o">=</span> <span class="n">transformed</span>
<span class="n">sdf</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>pclass</th>
      <th>survived</th>
      <th>name</th>
      <th>sex</th>
      <th>age</th>
      <th>sibsp</th>
      <th>parch</th>
      <th>ticket</th>
      <th>fare</th>
      <th>cabin</th>
      <th>embarked</th>
      <th>boat</th>
      <th>body</th>
      <th>home.dest</th>
      <th>male</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>3.0</td>
      <td>0</td>
      <td>Youseff, Mr. Gerious</td>
      <td>male</td>
      <td>45.5</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>2628</td>
      <td>7.2250</td>
      <td>NaN</td>
      <td>C</td>
      <td>NaN</td>
      <td>312.0</td>
      <td>NaN</td>
      <td>True</td>
    </tr>
    <tr>
      <th>1</th>
      <td>1.0</td>
      <td>1</td>
      <td>Candee, Mrs. Edward (Helen Churchill Hungerford)</td>
      <td>male</td>
      <td>53.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>PC 17606</td>
      <td>27.4458</td>
      <td>NaN</td>
      <td>C</td>
      <td>6</td>
      <td>177.2</td>
      <td>Washington, DC</td>
      <td>True</td>
    </tr>
    <tr>
      <th>2</th>
      <td>3.0</td>
      <td>1</td>
      <td>Olsson, Mr. Oscar Wilhelm</td>
      <td>female</td>
      <td>32.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>347079</td>
      <td>7.7750</td>
      <td>NaN</td>
      <td>S</td>
      <td>A</td>
      <td>117.4</td>
      <td>NaN</td>
      <td>False</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3.0</td>
      <td>0</td>
      <td>Theobald, Mr. Thomas Leonard</td>
      <td>female</td>
      <td>34.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>363294</td>
      <td>8.0500</td>
      <td>NaN</td>
      <td>S</td>
      <td>NaN</td>
      <td>176.0</td>
      <td>NaN</td>
      <td>False</td>
    </tr>
    <tr>
      <th>4</th>
      <td>3.0</td>
      <td>0</td>
      <td>Svensson, Mr. Johan</td>
      <td>female</td>
      <td>74.0</td>
      <td>0.0</td>
      <td>0.0</td>
      <td>347060</td>
      <td>7.7750</td>
      <td>NaN</td>
      <td>S</td>
      <td>NaN</td>
      <td>167.6</td>
      <td>NaN</td>
      <td>False</td>
    </tr>
  </tbody>
</table>
</div>

<p>The output below tells us that the only columns that still have missing values are the ones that we didn’t intend to impute.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sdf</span><span class="p">.</span><span class="n">isna</span><span class="p">().</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            0
cabin        1014
embarked        2
boat          823
body            0
home.dest     564
male            0
dtype: int64
</code></pre></div></div>

<p>There are scenarios in which deleting rows or columns containing missing values is acceptable. Whether and how we do so depends on a number of factors, including the number of missing values and why those values are missing (if we can know the reason).</p>
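
<p>As a rough illustration of how the amount of missing data can guide that choice, here is a minimal sketch on a toy DataFrame. The values and the 50% cut-off are assumptions made up for this example, not taken from the Titanic data or recommended as a rule.</p>

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, for illustration only.
df = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, 41.0, 29.0],
    "fare":  [7.25, 71.28, 8.05, np.nan, 13.0],
    "cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Fraction of missing values in each column.
missing_fraction = df.isna().mean()

# Assumed cut-off: drop columns where more than half of the values are missing.
threshold = 0.5
sparse_columns = missing_fraction[missing_fraction > threshold].index

# Drop the sparse columns first, then drop the few rows that still have gaps.
trimmed = df.drop(columns=sparse_columns).dropna()

print(missing_fraction)
print(trimmed)
```

<p>Dropping the sparse column first keeps more rows than calling <code class="language-plaintext highlighter-rouge">dropna</code> on the full frame would.</p>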

<p>An extended discussion of those factors is outside the scope of this post. Instead, I’ll leave the reader with the following takeaways.</p>

<h3 id="to-sum-up">To Sum Up</h3>

<ul>
  <li>
    <p>We showed how to replace missing values solely by randomly sampling from the distribution of the existing values.</p>
  </li>
  <li>
    <p>We showed how to impute missing numerical values using a built-in scikit-learn method.</p>
  </li>
</ul>

<h3 id="future-work">Future Work</h3>

<p>In a future post, we’ll explore a method of training a machine learning model without imputing missing data and compare it with imputation.</p>

<h2 id="bibliography">Bibliography</h2>

<p>“7.4. Imputation of Missing Values.” Scikit-Learn, https://scikit-learn.org/stable/modules/impute.html. Accessed 13 Feb. 2026.</p>

<p>OpenML. https://www.openml.org/search?type=data&amp;sort=version&amp;status=any&amp;order=asc&amp;exact_name=Titanic&amp;id=40945. Accessed 13 Feb. 2026.</p>]]></content><author><name></name></author><category term="missing data" /><category term="imputation" /><summary type="html"><![CDATA[We explore two methods of replacing missing data.]]></summary></entry><entry><title type="html">Gaussian-noise Linear Regression vs Multivariate Normal</title><link href="tracymedcalf.com/gaussian-noise%20linear%20regression/multivariate%20normal/statistical%20modeling/2026/02/12/gaussian-noise-linear-regression.html" rel="alternate" type="text/html" title="Gaussian-noise Linear Regression vs Multivariate Normal" /><published>2026-02-12T00:00:00+00:00</published><updated>2026-02-12T00:00:00+00:00</updated><id>tracymedcalf.com/gaussian-noise%20linear%20regression/multivariate%20normal/statistical%20modeling/2026/02/12/gaussian-noise-linear-regression</id><content type="html" xml:base="tracymedcalf.com/gaussian-noise%20linear%20regression/multivariate%20normal/statistical%20modeling/2026/02/12/gaussian-noise-linear-regression.html"><![CDATA[<p>We will discuss the creation of generative statistical models. For the purposes of demonstration, we’ll use the California housing dataset taken from scikit-learn. “The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars” (8.2. Real world datasets). This dataset has been chosen simply because the target variable is continuous, so it can be predicted with linear regression, which is one of the models that we’ll be exploring. The other is the Multivariate Normal distribution.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">fit</span><span class="p">,</span> <span class="n">norm</span>
<span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">multivariate_normal</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">fetch_california_housing</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">mean_squared_error</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">random</span>
</code></pre></div></div>

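<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># SciPy's rvs draws from NumPy's global random state, so seed NumPy rather than Python's random module
</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>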

<span class="n">data</span> <span class="o">=</span> <span class="n">fetch_california_housing</span><span class="p">()</span>
<span class="n">attribute_names</span> <span class="o">=</span> <span class="p">[</span>
    <span class="s">'Median Income'</span><span class="p">,</span>
    <span class="s">'House Age'</span><span class="p">,</span>
    <span class="s">'Average Rooms'</span><span class="p">,</span>
    <span class="s">'Average Bedrooms'</span><span class="p">,</span>
    <span class="s">'Population'</span><span class="p">,</span>
    <span class="s">'Average Occupation'</span><span class="p">,</span>
    <span class="s">'Latitude'</span><span class="p">,</span>
    <span class="s">'Longitude'</span>
<span class="p">]</span>
</code></pre></div></div>

<p>It is possible that not all of the variables follow Gaussian distributions. Let’s plot their histograms overlaid with their respective fitted Gaussian curves to check this hunch.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">col</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">T</span><span class="p">):</span>
    <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">col</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">col</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">col</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">col</span><span class="p">),</span> <span class="mi">200</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">col</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="n">attribute_names</span><span class="p">[</span><span class="n">i</span><span class="p">])</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">))</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/gaussian-noise-linear-regression/main_4_0.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_1.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_2.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_3.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_4.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_5.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_6.png" alt="png" /></p>

<p><img src="/images/gaussian-noise-linear-regression/main_4_7.png" alt="png" /></p>

<p>Longitude, for one, clearly does not follow a Gaussian distribution. Though different distributions might better serve us here, let’s keep things simple for now by assuming a multivariate Normal distribution for the features X.</p>

<p>Of course, further refinement is possible (e.g., using a multimodal distribution); I note it here for the sake of future work.</p>

<p>What about the target variable? What does that distribution look like?</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">),</span> <span class="nb">max</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">),</span> <span class="mi">200</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span> <span class="n">density</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">norm</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/gaussian-noise-linear-regression/main_6_0.png" alt="png" /></p>

<p>A Gaussian curve is not exactly flattering when worn by this variable. It is close enough, however, for our purposes in this article. In future work, I would like to experiment with other types of distributions.</p>

<p>Linear regression, written here with a single input for brevity, is defined as</p>

\[y = \beta_0 + \beta_1 x + \epsilon\]

<p>We want this to be not simply a linear regression model but a generative model, i.e., a Gaussian-noise linear regression model. (The noise is the $\epsilon$ term.) When sampling from this model, we therefore sample the noise from a Gaussian distribution with mean zero and variance $\sigma^2$:</p>

\[\epsilon \sim \mathcal{N}(0, \sigma^2)\]

<p>(Shalizi, 2017).</p>

<p>Depending on our data, we could sample the noise from any distribution where the expected value of $\epsilon$ is zero,</p>

\[\mathbb{E}[\epsilon] = 0\]

<p>(Shalizi, 2017).</p>
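
<p>To make the role of the noise term concrete, here is a minimal sketch that samples $y$ at a fixed $x$. The parameter values and the helper name <code class="language-plaintext highlighter-rouge">sample_y</code> are hypothetical, chosen only for illustration; they are not fitted to the housing data.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, chosen only for illustration.
beta_0, beta_1, sigma = 1.0, 2.0, 0.5


def sample_y(x, size):
    """Draw `size` values of y = beta_0 + beta_1 * x + epsilon at a fixed x."""
    # epsilon ~ N(0, sigma^2): zero-mean Gaussian noise
    epsilon = rng.normal(loc=0.0, scale=sigma, size=size)
    return beta_0 + beta_1 * x + epsilon


draws = sample_y(x=3.0, size=100_000)
print(draws.mean())  # close to beta_0 + beta_1 * 3 = 7
print(draws.std())   # close to sigma = 0.5
```

<p>Because the noise has expected value zero, the samples scatter around the regression line rather than shifting it.</p>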

<p>Below, we define our Gaussian-noise linear regression model. The <code class="language-plaintext highlighter-rouge">sample</code> method is what makes this model generative. We add noise to the samples because the observed data do not lie exactly on a linear regression line. Rather, they spread out in a cloud centered on the line. With noise introduced, therefore, the samples are more realistic.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Gaussian noise linear regression model
</span><span class="k">class</span> <span class="nc">GaussianLinearRegression</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mu</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">cov</span> <span class="o">=</span> <span class="n">multivariate_normal</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lin_reg</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">lin_reg</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
        <span class="n">y_pred</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lin_reg</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">sd</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">))</span>
        
    <span class="k">def</span> <span class="nf">sample</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="s">"""
        Randomly sample from the probability distribution
        """</span>
        <span class="n">X</span> <span class="o">=</span> <span class="n">multivariate_normal</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">mu</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">cov</span><span class="p">).</span><span class="n">rvs</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">size</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">size</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
            <span class="n">X</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">loc</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lin_reg</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="n">sample</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">loc</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">sd</span><span class="p">).</span><span class="n">rvs</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="n">size</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">sample</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">log_likelihood</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="n">log_px</span> <span class="o">=</span> <span class="n">multivariate_normal</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">mu</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">cov</span><span class="p">).</span><span class="n">logpdf</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
        <span class="n">y_pred</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lin_reg</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="n">log_py_given_x</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">sd</span><span class="p">).</span><span class="n">logpdf</span><span class="p">(</span><span class="n">y</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">log_px</span> <span class="o">+</span> <span class="n">log_py_given_x</span>
    
    <span class="k">def</span> <span class="nf">pdf</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="n">y_pred</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">lin_reg</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">norm</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="n">y_pred</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">sd</span><span class="p">).</span><span class="n">pdf</span><span class="p">(</span><span class="n">y</span><span class="p">)</span>

<span class="n">glr</span> <span class="o">=</span> <span class="n">GaussianLinearRegression</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">)</span>
<span class="n">glr</span><span class="p">.</span><span class="n">sample</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>array([ 1.65427924e+00,  1.64418672e+01,  6.12191566e+00,  1.33692717e+00,
        1.38043401e+03,  1.09108617e+01,  3.56291888e+01, -1.18543129e+02,
        8.78771712e-01])
</code></pre></div></div>

<p>I want to dwell a moment on the above implementation of <code class="language-plaintext highlighter-rouge">log_likelihood</code>. Log-likelihood is a measure of the goodness-of-fit (Taboga).</p>

<p>We can use it to compare the goodness-of-fit of one statistical model to the goodness-of-fit of another.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">glr</span><span class="p">.</span><span class="n">log_likelihood</span><span class="p">(</span><span class="n">X</span><span class="o">=</span><span class="n">data</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-503782.5946747418
</code></pre></div></div>

<p>For our purposes, a <code class="language-plaintext highlighter-rouge">pdf</code> result that is closer to 0 is worse. A logarithm tends to negative infinity as its input tends towards 0. Therefore, a lower <code class="language-plaintext highlighter-rouge">logpdf</code> can be thought of as a comparatively more unlikely value. To demonstrate:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">temp</span> <span class="o">=</span> <span class="n">norm</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">temp</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="mi">100</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">logpdf</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">logpdf</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">logpdf</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">temp</span><span class="p">.</span><span class="n">logpdf</span><span class="p">(</span><span class="mi">100</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(0.24197072451914337,
 0.3989422804014327,
 0.24197072451914337,
 0.0,
 -1.4189385332046727,
 -0.9189385332046727,
 -1.4189385332046727,
 -5000.918938533205)
</code></pre></div></div>

<p>The PDF is the probability density function. Because the probability of a continuous random variable taking on any single exact value is 0, we use the PDF, which gives the density at that value instead.</p>
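
<p>The difference between a density and a probability can be sketched with a standard Normal. This example uses the standard library's <code class="language-plaintext highlighter-rouge">NormalDist</code> as a stand-in for <code class="language-plaintext highlighter-rouge">scipy.stats.norm</code>, and the helper <code class="language-plaintext highlighter-rouge">interval_probability</code> is a hypothetical name introduced here.</p>

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Density at a point: the height of the curve, not a probability.
density_at_zero = standard_normal.pdf(0)


def interval_probability(a, b):
    """Probability of landing between a and b: area under the curve, via the CDF."""
    return standard_normal.cdf(b) - standard_normal.cdf(a)


print(density_at_zero)              # about 0.3989
print(interval_probability(-1, 1))  # about 0.6827
print(interval_probability(0, 0))   # 0.0, since a single point has no width
```

<p>Probabilities only appear once we integrate the density over an interval, which is why a single exact value has probability zero.</p>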

<p>Because logarithms have the property that $\log(a b) = \log(a) + \log(b)$, we summed the individual <code class="language-plaintext highlighter-rouge">logpdf</code> results to obtain the log-likelihood.</p>
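
<p>A small sketch of why the sum of log-densities stands in for the log of the product of densities. The observations are made up, and the standard library's <code class="language-plaintext highlighter-rouge">NormalDist</code> is used here in place of <code class="language-plaintext highlighter-rouge">scipy.stats.norm</code>.</p>

```python
import math
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)
observations = [-1.0, 0.0, 1.0]  # made-up values for illustration

# For a handful of points, both routes give the same number.
densities = [dist.pdf(x) for x in observations]
log_of_product = math.log(math.prod(densities))
sum_of_logs = sum(math.log(d) for d in densities)
print(log_of_product, sum_of_logs)

# With thousands of points the raw product underflows to 0.0,
# while the sum of logs stays finite.
many = [0.5] * 5000
product = math.prod(dist.pdf(x) for x in many)
sum_logs = sum(math.log(dist.pdf(x)) for x in many)
print(product, sum_logs)
```

<p>The underflow in the second half is the practical reason to sum <code class="language-plaintext highlighter-rouge">logpdf</code> values rather than multiply <code class="language-plaintext highlighter-rouge">pdf</code> values on a dataset with tens of thousands of rows.</p>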

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create_X_y</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">data</span><span class="p">.</span><span class="n">T</span><span class="p">,</span> <span class="p">[</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="n">T</span>
<span class="n">X_y</span> <span class="o">=</span> <span class="n">create_X_y</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Shape of X_y'</span><span class="p">,</span> <span class="n">X_y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">mu</span><span class="p">,</span> <span class="n">cov</span> <span class="o">=</span> <span class="n">multivariate_normal</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_y</span><span class="p">)</span>
<span class="n">multivariate_normal</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">cov</span><span class="p">).</span><span class="n">logpdf</span><span class="p">(</span><span class="n">X_y</span><span class="p">).</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Shape of X_y (20640, 9)

-503782.5946747417
</code></pre></div></div>

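<p><strong>The Gaussian-noise linear regression model and the Multivariate Normal give the same log-likelihood.</strong> That is expected: a Multivariate Normal over the features and the target together factors into a Multivariate Normal over the features and a Gaussian for the target whose mean is linear in the features, so the two fitted models describe the same distribution. Neither fits the data especially well, simply because the target is not a linear function of the input.</p>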
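
<p>The two log-likelihoods printed above agree to about a dozen significant digits. The sketch below suggests why, using synthetic data and plain NumPy instead of the housing data and SciPy: fitting one Multivariate Normal to the features and target together gives the same total log-likelihood as fitting a Multivariate Normal to the features plus a least-squares line with Gaussian noise. The data and the helper <code class="language-plaintext highlighter-rouge">mvn_logpdf</code> are assumptions made for this illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): two features and a noisy target.
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)


def mvn_logpdf(Z, mean, cov):
    """Row-wise log-density of a multivariate normal."""
    d = Z.shape[1]
    diff = Z - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


# Model A: one multivariate normal over [X, y], fitted by maximum likelihood.
Z = np.column_stack([X, y])
ll_joint = mvn_logpdf(Z, Z.mean(axis=0), np.cov(Z, rowvar=False, bias=True)).sum()

# Model B: multivariate normal over X, plus least squares of y on X with
# Gaussian noise whose standard deviation is the root mean squared residual.
ll_x = mvn_logpdf(X, X.mean(axis=0), np.cov(X, rowvar=False, bias=True)).sum()
A = np.column_stack([np.ones(n), X])  # design matrix with an intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
sd = np.sqrt(np.mean(resid ** 2))
ll_y_given_x = (-0.5 * np.log(2 * np.pi * sd ** 2) - resid ** 2 / (2 * sd ** 2)).sum()

print(ll_joint, ll_x + ll_y_given_x)
```

<p>The two printed numbers should match up to floating-point error, mirroring what we saw on the housing data.</p>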

<p>We’ll show that to be the case by visually comparing the distribution of the target variable and the model’s guess for that target variable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#plt.figure()
</span><span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">glr</span><span class="p">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">),</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>

<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Distribution of PDF Values vs Target Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/gaussian-noise-linear-regression/main_16_0.png" alt="png" /></p>

<p>If our ultimate goal were to build a model that accurately describes the data, then we would need to do better than our <code class="language-plaintext highlighter-rouge">GaussianLinearRegression</code>. Let’s also randomly sample from the model to compare that to the actual distribution.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># sample() returns one flat array: the feature values first, then the sampled targets last
</span><span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">)</span>
<span class="n">sample</span> <span class="o">=</span> <span class="n">glr</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">n</span><span class="p">)[</span><span class="o">-</span><span class="n">n</span><span class="p">:]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">sample</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">.</span><span class="n">target</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.6</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Value'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Density'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Distribution of Sampled Values vs Target Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/gaussian-noise-linear-regression/main_18_0.png" alt="png" /></p>

<p>That doesn’t look like the same distribution to me.</p>

<p><strong>The goodness-of-fit of the Gaussian-noise linear regression model is no better than that of the Multivariate Normal model.</strong></p>

<p>In that case, why use it? Sure, you can use the former to predict <code class="language-plaintext highlighter-rouge">y | X</code>, but you can also do that with the Multivariate Normal model. Doing so requires more code than what we’ve written, but not that much more.</p>
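
<p>Predicting from a Multivariate Normal amounts to taking the conditional mean of the target given the features, $\mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} (x - \mu_x)$. Here is a minimal sketch of that extra code on synthetic data; the data and the function name <code class="language-plaintext highlighter-rouge">predict_mvn</code> are hypothetical, and the comparison against least squares is there only as a sanity check.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (illustrative only): three features, one target.
n = 1000
X = rng.normal(size=(n, 3))
y = 0.5 + X @ np.array([1.5, -2.0, 0.3]) + rng.normal(scale=0.4, size=n)

# Fit one multivariate normal to the features and target together.
Z = np.column_stack([X, y])
mu = Z.mean(axis=0)
cov = np.cov(Z, rowvar=False, bias=True)

# Split the fitted parameters into feature and target parts.
mu_x, mu_y = mu[:-1], mu[-1]
cov_xx = cov[:-1, :-1]  # covariance among the features
cov_xy = cov[:-1, -1]   # covariance between each feature and the target


def predict_mvn(X_new):
    """Conditional mean E[y | x] = mu_y + cov_yx cov_xx^{-1} (x - mu_x)."""
    weights = np.linalg.solve(cov_xx, cov_xy)
    return mu_y + (X_new - mu_x) @ weights


# For comparison: ordinary least squares with an intercept.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(predict_mvn(X), A @ beta))
```

<p>On this synthetic data the conditional mean reproduces the least-squares predictions, which is consistent with the two models being a toss-up for prediction as well.</p>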

<p>In my opinion, if you’re trying to create a model that can generate samples, evaluate the pdf of an observation, and predict, then it’s a toss-up between the two models. In a future post, we’ll look at other ways of doing the same that also happen to fit the data more faithfully.</p>

<h2 id="bibliography">Bibliography</h2>

<p>“8.2. Real World Datasets.” Scikit-Learn, https://scikit-learn.org/stable/datasets/real_world.html. Accessed 10 Feb. 2026.</p>

<p>Shalizi, Cosma. 36-401 Modern Regression, Fall 2017. 2017, https://www.stat.cmu.edu/~larry/=stat401/lecture-04.pdf.</p>

<p>Taboga, Marco. Model Selection Criteria. https://www.statlect.com/fundamentals-of-statistics/model-selection-criteria. Accessed 11 Feb. 2026.</p>]]></content><author><name></name></author><category term="gaussian-noise linear regression" /><category term="multivariate normal" /><category term="statistical modeling" /><summary type="html"><![CDATA[We create and compare two generative statistical models.]]></summary></entry><entry><title type="html">Implementing Linear Regression from Scratch in Python to Understand How It Works</title><link href="tracymedcalf.com/linear%20regression/python/2026/01/24/linear-regression.html" rel="alternate" type="text/html" title="Implementing Linear Regression from Scratch in Python to Understand How It Works" /><published>2026-01-24T00:00:00+00:00</published><updated>2026-01-24T00:00:00+00:00</updated><id>tracymedcalf.com/linear%20regression/python/2026/01/24/linear-regression</id><content type="html" xml:base="tracymedcalf.com/linear%20regression/python/2026/01/24/linear-regression.html"><![CDATA[<p>First, let’s use the implementation built into Scikit-Learn. We’ll use the California Housing dataset, as it is listed in the documentation as a “regression” dataset.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">fetch_california_housing</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">r2_score</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">df</span> <span class="o">=</span> <span class="n">fetch_california_housing</span><span class="p">(</span><span class="n">as_frame</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">target</span>
<span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">.</span><span class="n">head</span><span class="p">(),</span> <span class="n">y_train</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
 12069  4.2386       6.0  7.723077   1.169231       228.0  3.507692     33.83   
 15925  4.3898      52.0  5.326622   1.100671      1485.0  3.322148     37.73   
 11162  3.9333      26.0  4.668478   1.046196      1022.0  2.777174     33.83   
 4904   1.4653      38.0  3.383495   1.009709       749.0  3.635922     34.01   
 4683   3.1765      52.0  4.119792   1.043403      1135.0  1.970486     34.08   
 
        Longitude  
 12069    -117.55  
 15925    -122.44  
 11162    -118.00  
 4904     -118.26  
 4683     -118.36  ,
 12069    5.00001
 15925    2.70000
 11162    1.96100
 4904     1.18800
 4683     2.25000
 Name: MedHouseVal, dtype: float64)
</code></pre></div></div>

<p>What is <code class="language-plaintext highlighter-rouge">r2_score</code>? It’s typically written as:
\(R ^ 2\)
Pronounced “R-squared”, the coefficient of determination measures what proportion of the variance in the target variable is explained by the model. A perfect fit scores 1, always predicting the mean scores 0, and on held-out data the score can even be negative. It is sometimes stated as a percentage.</p>
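
<p>As a rough sketch of what that score measures, the snippet below computes \(R^2\) by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name <code class="language-plaintext highlighter-rouge">r_squared</code> and the toy numbers are illustrative assumptions rather than part of the notebook, and the code uses only the standard library.</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up targets, for illustration only
perfect = r_squared(y_true, y_true)
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])
worse_than_mean = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])
print(perfect, mean_only, worse_than_mean)
```

<p>Predicting the mean for every row scores 0, and predictions that are worse than the mean score below 0, which is why a test score can come out slightly negative.</p>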

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"TRAIN:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"TEST:"</span><span class="p">)</span>
<span class="n">r2_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TRAIN:
0.6088968118672871

TEST:
0.5943232652466202
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Fit on one column at a time and return that column's (train, test) R^2 scores.
</span><span class="k">def</span> <span class="nf">score_single_column</span><span class="p">(</span><span class="n">column</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="n">single_model</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
    <span class="n">X_train_subset</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">[[</span><span class="n">column</span><span class="p">]]</span>
    <span class="n">single_model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_subset</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="n">train_score</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_train</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">single_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_train_subset</span><span class="p">))</span>
    <span class="n">test_score</span> <span class="o">=</span> <span class="n">r2_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">single_model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">[[</span><span class="n">column</span><span class="p">]]))</span>
    <span class="k">return</span> <span class="n">train_score</span><span class="p">,</span> <span class="n">test_score</span>

<span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">X_train</span><span class="p">.</span><span class="n">columns</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="n">score_single_column</span><span class="p">(</span><span class="n">c</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(0.47991412719941495, 0.4466846804895943)
(0.01133589722637418, 0.010112709993501445)
(0.023847425986299742, 0.019686674517510605)
(0.0019727324864367013, 0.0026742213470939413)
(0.0007318879607208784, -0.00022540672756665714)
(0.0011001698651382785, -0.006489558238010673)
(0.020363987996845134, 0.022215172774302072)
(0.0022351265327293923, 0.0012984715729211782)
</code></pre></div></div>

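<p>No single column predicts as well as all of the columns together, but the first column, <code class="language-plaintext highlighter-rouge">MedInc</code>, comes closest by a wide margin (a training \(R^2\) of about 0.48, while every other column scores below 0.03). We’re going to use that column alone because a single feature makes the math slightly easier.</p>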

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">CustomLinearRegression</span><span class="p">:</span>
    <span class="c1"># a_hat is the intercept (alpha-hat); b_hat is the slope (beta-hat)
</span>    <span class="n">a_hat</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">b_hat</span><span class="p">:</span> <span class="nb">float</span>
    
    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X_train</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">y_train</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">):</span>
        <span class="c1"># Sums used by the closed-form least-squares solution
</span>        <span class="n">y_summed</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
        <span class="n">X_dot_X</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
        <span class="n">X_summed</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
        <span class="n">X_dot_y</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="n">dot</span><span class="p">(</span><span class="n">y_train</span><span class="p">)</span>

        <span class="n">n</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">a_hat</span> <span class="o">=</span> <span class="p">(</span><span class="n">y_summed</span> <span class="o">*</span> <span class="n">X_dot_X</span> <span class="o">-</span> <span class="n">X_summed</span> <span class="o">*</span> <span class="n">X_dot_y</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="n">X_dot_X</span> <span class="o">-</span> <span class="n">X_summed</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">b_hat</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="n">X_dot_y</span> <span class="o">-</span> <span class="n">X_summed</span> <span class="o">*</span> <span class="n">y_summed</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">n</span> <span class="o">*</span> <span class="n">X_dot_X</span> <span class="o">-</span> <span class="n">X_summed</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">X</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">):</span>
        <span class="c1"># intercept + slope * x
</span>        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">a_hat</span> <span class="o">+</span> <span class="bp">self</span><span class="p">.</span><span class="n">b_hat</span> <span class="o">*</span> <span class="n">X</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">CustomLinearRegression</span><span class="p">()</span>
<span class="n">X_train_subset</span> <span class="o">=</span> <span class="n">X_train</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">to_numpy</span><span class="p">()</span>
<span class="n">y_train_array</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">.</span><span class="n">to_numpy</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">(</span><span class="n">X_train</span><span class="o">=</span><span class="n">X_train_subset</span><span class="p">,</span> <span class="n">y_train</span><span class="o">=</span><span class="n">y_train_array</span><span class="p">)</span>

<span class="n">y_pred</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_train_subset</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"TRAIN:"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_train_array</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">y_pred</span><span class="p">),</span> <span class="mi">4</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">""</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"TEST:"</span><span class="p">)</span>
<span class="n">X_test_subset</span> <span class="o">=</span> <span class="n">X_test</span><span class="p">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">].</span><span class="n">to_numpy</span><span class="p">()</span>
<span class="k">print</span><span class="p">(</span><span class="nb">round</span><span class="p">(</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_true</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="o">=</span><span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_subset</span><span class="p">)),</span> <span class="mi">4</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>TRAIN:
0.4799

TEST:
0.4467
</code></pre></div></div>

<p>To end, let’s visualize the performance of the custom linear regression model on the training data, because visualizations give me the warm fuzzies.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">6</span><span class="p">))</span>
<span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_train_subset</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'green'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Training Data'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X_train_subset</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Regression Line'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Visualization for Custom Linear Regression'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'MedInc'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Median House Values'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
<span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>

<p><img src="/images/linear-regression.png" alt="A scatter plot showing a line of best fit" style="max-height: none; width: 100%;" /></p>

<p>In just a few lines of code, we got a score that is close to what the Scikit-Learn implementation achieved on the same single column.</p>

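<p>So what happened? Consider the following system of equations (the normal equations for a one-feature regression), which ties together the training data \(x_i\), the targets \(y_i\), and the estimated coefficients: the intercept \(\widehat{\alpha}\) and the slope \(\widehat{\beta}\).</p>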

\[\begin{bmatrix}
n &amp; \sum_{i=1}^n x_i \\[1ex]
\sum_{i=1}^n x_i &amp; \sum_{i=1}^n x_i^2
\end{bmatrix}
\begin{bmatrix}
\widehat{\alpha} \\[1ex]
\widehat{\beta}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^n y_i \\[1ex]
\sum_{i=1}^n y_i x_i
\end{bmatrix}\]
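
<p>One way to get from this system to explicit estimates, sketched here under the assumption that the \(x_i\) are not all identical (so that the determinant is nonzero), is to multiply both sides by the inverse of the 2-by-2 matrix:</p>

\[\begin{bmatrix}
\widehat{\alpha} \\[1ex]
\widehat{\beta}
\end{bmatrix}
=
\frac{1}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2}
\begin{bmatrix}
\sum_{i=1}^n x_i^2 &amp; -\sum_{i=1}^n x_i \\[1ex]
-\sum_{i=1}^n x_i &amp; n
\end{bmatrix}
\begin{bmatrix}
\sum_{i=1}^n y_i \\[1ex]
\sum_{i=1}^n y_i x_i
\end{bmatrix}\]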

<p>Solving this system for the two coefficients gives the following. Conveniently, each coefficient to be estimated is isolated on the left-hand side.</p>

\[\begin{align}
\widehat{\alpha} &amp;= \frac{
\left(\sum_{i=1}^n y_i\right)\left(\sum_{i=1}^n x_i^2\right)
-
\left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n x_i y_i\right)
}{
n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2
}
\\[8pt]
\widehat{\beta} &amp;= \frac{
n \sum_{i=1}^n x_i y_i - \left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right)
}{
n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2
}.
\end{align}\]

<p><a href="https://en.wikipedia.org/wiki/Simple_linear_regression">(Wikipedia)</a></p>
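
<p>To make the roles of the two estimates concrete, here is a minimal sketch that applies the formulas above to made-up points lying exactly on \(y = 1 + 2x\). The helper name <code class="language-plaintext highlighter-rouge">fit_line</code> and the data are assumptions for illustration, and only the standard library is used.</p>

```python
def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat): the least-squares intercept and slope."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Shared denominator; it is zero only if every x is identical
    denom = n * sum_xx - sum_x ** 2
    alpha_hat = (sum_y * sum_xx - sum_x * sum_xy) / denom
    beta_hat = (n * sum_xy - sum_x * sum_y) / denom
    return alpha_hat, beta_hat


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x for x in xs]  # exact line: intercept 1, slope 2
alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)
```

<p>Here \(\widehat{\alpha}\) recovers the intercept and \(\widehat{\beta}\) the slope, so a prediction for a new \(x\) is \(\widehat{\alpha} + \widehat{\beta} x\).</p>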

<p>Unlike other paradigms (e.g., neural networks), a linear regression model is one that can be derived from the data analytically in a reasonable amount of time.</p>

<p>This model can also be interpreted in an intuitive way. After all, it’s just a line running through the two variables, signifying their correlation, and it comes pre-packaged with a measure of how far the data is from the line on average. We’re going to take advantage of this characteristic in a future post.</p>]]></content><author><name></name></author><category term="linear regression" /><category term="python" /><summary type="html"><![CDATA[We review the code behind an analytic solution to linear regression.]]></summary></entry><entry><title type="html">Mind-Reading Machine</title><link href="tracymedcalf.com/markov%20chain/2026/01/24/mind-reading-machine.html" rel="alternate" type="text/html" title="Mind-Reading Machine" /><published>2026-01-24T00:00:00+00:00</published><updated>2026-01-24T00:00:00+00:00</updated><id>tracymedcalf.com/markov%20chain/2026/01/24/mind-reading-machine</id><content type="html" xml:base="tracymedcalf.com/markov%20chain/2026/01/24/mind-reading-machine.html"><![CDATA[<p>Re-structuring the blog required that I temporarily take down this post. 
I hope to have it back up soon.</p>]]></content><author><name></name></author><category term="markov chain" /><summary type="html"><![CDATA[In which we beat humans in a game of pennies with a simple Markov Chain.]]></summary></entry><entry><title type="html">Deep Energy-Based Model Fitted with Noise Contrastive Estimation</title><link href="tracymedcalf.com/ebm/nce/deep%20learning/python/2026/01/17/nce_ebm.html" rel="alternate" type="text/html" title="Deep Energy-Based Model Fitted with Noise Contrastive Estimation" /><published>2026-01-17T00:00:00+00:00</published><updated>2026-01-17T00:00:00+00:00</updated><id>tracymedcalf.com/ebm/nce/deep%20learning/python/2026/01/17/nce_ebm</id><content type="html" xml:base="tracymedcalf.com/ebm/nce/deep%20learning/python/2026/01/17/nce_ebm.html"><![CDATA[<h1 id="deep-energy-based-model-fitted-with-noise-contrastive-estimation-nce">Deep Energy-Based Model Fitted with Noise Contrastive Estimation (NCE)</h1>

<p>This notebook is my attempt to better understand the process of training an Energy-Based Model.</p>

<p>The model is fitted to tabular data (the wine quality data set from the UCI repository). Credit is due to volagold (https://github.com/volagold/nce/), without whose code example I would not have been able to write this.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">math</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">torch</span> <span class="kn">import</span> <span class="n">nn</span><span class="p">,</span> <span class="n">optim</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># device selection for GPU/CPU
</span><span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda"</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">"cpu"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Using device:"</span><span class="p">,</span> <span class="n">device</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># load the red wine quality data (semicolon-delimited CSV)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"./wine_quality/winequality-red.csv"</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s">";"</span><span class="p">)</span>
<span class="c1"># every column is treated as a numerical feature
</span><span class="n">num_features</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">columns</span><span class="p">)</span>
<span class="c1"># full-batch training: one batch holds every row
</span><span class="n">batch_size</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
</code></pre></div></div>

<p>Define a feed-forward neural network to output the energy score.</p>

<p>From the PyTorch docs:</p>

<p><code class="language-plaintext highlighter-rouge">Parameter</code> is a “kind of Tensor that is to be considered a module parameter”.</p>

<p>That is, a Parameter is automatically included in the <code class="language-plaintext highlighter-rouge">parameters()</code> iterator.</p>

<p>In NCE, $\log Z(\theta)$, the log of the normalizing constant, is treated as a learnable parameter (Song et al., 2021).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FeedForwardNN</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dims</span><span class="o">=</span><span class="mi">32</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">FeedForwardNN</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">log_Z_of_theta</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">],</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">))</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">f</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">num_features</span><span class="p">,</span> <span class="n">dims</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">LeakyReLU</span><span class="p">(</span><span class="mf">0.2</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">,</span> <span class="n">dims</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">LeakyReLU</span><span class="p">(</span><span class="mf">0.2</span><span class="p">),</span>
            <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">dims</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="p">)</span>
        
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="k">return</span> <span class="o">-</span><span class="bp">self</span><span class="p">.</span><span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">log_Z_of_theta</span>
</code></pre></div></div>

<p>From the PyTorch docs, <code class="language-plaintext highlighter-rouge">MultivariateNormal</code> creates “a multivariate normal (also called Gaussian) distribution parameterized by a mean vector and a covariance matrix”.</p>

<p>From the PyTorch docs, <code class="language-plaintext highlighter-rouge">eye</code> returns “a 2-D tensor with ones on the diagonal and zeros elsewhere”.</p>

<p>This code builds the noise distribution: a Gaussian with one dimension per feature (<code class="language-plaintext highlighter-rouge">num_features</code>), a mean at 0, and a covariance matrix of <code class="language-plaintext highlighter-rouge">I</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">FeedForwardNN</span><span class="p">().</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
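<span class="c1"># Despite its name, num_rows is the dimensionality of the noise distribution (one per feature).
</span><span class="n">num_rows</span> <span class="o">=</span> <span class="n">num_features</span>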
<span class="n">noise</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">distributions</span><span class="p">.</span><span class="n">MultivariateNormal</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">num_rows</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">),</span> <span class="n">torch</span><span class="p">.</span><span class="n">eye</span><span class="p">(</span><span class="n">num_rows</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">))</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
<span class="n">criterion</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">MSELoss</span><span class="p">()</span>
</code></pre></div></div>
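
<p>For intuition about the log-density of the noise distribution defined above, the sketch below evaluates the standard formula \(\log q(x) = -\tfrac{1}{2}\left(d \log 2\pi + \lVert x \rVert^2\right)\) for a zero-mean Gaussian with identity covariance. It is a standard-library stand-in, not PyTorch’s implementation, and the function name <code class="language-plaintext highlighter-rouge">standard_normal_log_prob</code> is hypothetical.</p>

```python
import math


def standard_normal_log_prob(x):
    """Log-density of N(0, I) at the point x, given as a list of floats."""
    d = len(x)
    squared_norm = sum(v * v for v in x)
    return -0.5 * (d * math.log(2.0 * math.pi) + squared_norm)


at_origin = standard_normal_log_prob([0.0, 0.0])
farther_out = standard_normal_log_prob([3.0, 4.0])
print(at_origin, farther_out)
```

<p>The log-density falls off with the squared distance from the origin, so a point far from 0 is assigned a much lower value than one near it.</p>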

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MAX_EPOCHS</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">MAX_EPOCHS</span><span class="p">):</span>
     
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    

    <span class="c1"># GENERATE NOISE
</span>    <span class="n">gen</span> <span class="o">=</span> <span class="n">noise</span><span class="p">.</span><span class="n">sample</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">X</span><span class="p">),))</span>
    
    
    <span class="c1"># CALCULATE THE ENERGY LOSS
</span>    <span class="n">logp_x</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
    <span class="n">logq_x</span> <span class="o">=</span> <span class="n">noise</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">X</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">logp_gen</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">gen</span><span class="p">)</span>
    <span class="n">logq_gen</span> <span class="o">=</span> <span class="n">noise</span><span class="p">.</span><span class="n">log_prob</span><span class="p">(</span><span class="n">gen</span><span class="p">).</span><span class="n">unsqueeze</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="n">value_data</span> <span class="o">=</span> <span class="n">logp_x</span> <span class="o">-</span> <span class="n">torch</span><span class="p">.</span><span class="n">logsumexp</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="n">logp_x</span><span class="p">,</span> <span class="n">logq_x</span><span class="p">],</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">value_gen</span> <span class="o">=</span> <span class="n">logq_gen</span> <span class="o">-</span> <span class="n">torch</span><span class="p">.</span><span class="n">logsumexp</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">([</span><span class="n">logp_gen</span><span class="p">,</span> <span class="n">logq_gen</span><span class="p">],</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdim</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    
    <span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="p">(</span><span class="n">value_data</span><span class="p">.</span><span class="n">mean</span><span class="p">()</span> <span class="o">+</span> <span class="n">value_gen</span><span class="p">.</span><span class="n">mean</span><span class="p">())</span>
    
    <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
        <span class="n">r_x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">logp_x</span> <span class="o">-</span> <span class="n">logq_x</span><span class="p">)</span>
        <span class="n">r_gen</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">sigmoid</span><span class="p">(</span><span class="n">logq_gen</span> <span class="o">-</span> <span class="n">logp_gen</span><span class="p">)</span>
        <span class="n">acc</span> <span class="o">=</span> <span class="p">((</span><span class="n">r_x</span> <span class="o">&gt;</span> <span class="mf">0.5</span><span class="p">).</span><span class="nb">float</span><span class="p">().</span><span class="n">mean</span><span class="p">()</span> <span class="o">+</span> <span class="p">(</span><span class="n">r_gen</span> <span class="o">&gt;</span> <span class="mf">0.5</span><span class="p">).</span><span class="nb">float</span><span class="p">().</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="mi">2</span>
    
    
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>

    <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">100</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
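        <span class="k">print</span><span class="p">(</span><span class="s">"Loss:"</span><span class="p">,</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">(),</span> <span class="s">"Accuracy:"</span><span class="p">,</span> <span class="n">acc</span><span class="p">.</span><span class="n">item</span><span class="p">())</span>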
</code></pre></div></div>
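
<p>The two <code>logsumexp</code> terms in the loop above are a numerically stable way of writing a logistic loss: <code>logp - logsumexp([logp, logq])</code> equals <code>log(sigmoid(logp - logq))</code>, the log-probability that a sample came from the data rather than from the noise. The following is a minimal standard-library sketch of that identity, not part of the original notebook; the helper names <code>logsumexp2</code>, <code>log_sigmoid</code>, and <code>nce_terms</code>, and the example values, are assumptions chosen for illustration.</p>

```python
import math

def logsumexp2(a, b):
    """Stable log(exp(a) + exp(b)) for two scalars."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def log_sigmoid(z):
    """Stable log(1 / (1 + exp(-z)))."""
    return -logsumexp2(0.0, -z)

def nce_terms(logp, logq):
    """Per-sample log-posteriors of 'data' and 'noise' (hypothetical helper)."""
    denom = logsumexp2(logp, logq)
    return logp - denom, logq - denom

# Illustrative values only: an unnormalized model log-density and a noise log-density.
logp, logq = -1.3, -2.7
value_data, value_noise = nce_terms(logp, logq)

# The same quantities written as a logistic classifier on the logit logp - logq.
assert math.isclose(value_data, log_sigmoid(logp - logq))
assert math.isclose(value_noise, log_sigmoid(logq - logp))
# The two posteriors sum to one.
assert math.isclose(math.exp(value_data) + math.exp(value_noise), 1.0)
```

<p>This is also why the accuracy diagnostic in the loop can threshold <code>sigmoid(logp_x - logq_x)</code> at 0.5: it is the classifier's probability that a sample is real data.</p>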

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">num_features</span><span class="p">,</span> <span class="n">device</span><span class="o">=</span><span class="n">device</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">([</span><span class="n">x</span><span class="p">])</span>
<span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
</code></pre></div></div>

<p>The randomly generated vector above has a higher energy score than the optimized vector below, which is what we should expect: the higher the energy score, the more out-of-distribution the input is.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">STEPS</span> <span class="o">=</span> <span class="mi">100</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">STEPS</span><span class="p">):</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">energy</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">energy</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    
<span class="n">x_star</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">detach</span><span class="p">()</span>
<span class="n">x_star</span>
</code></pre></div></div>
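
<p>The loop above holds the model's weights fixed and lets the optimizer move the input itself downhill on the model's output. As a toy illustration of the same idea, the sketch below swaps the trained network for a hand-written quadratic energy and uses plain gradient descent instead of Adam; <code>toy_energy</code>, <code>toy_grad</code>, <code>CENTER</code>, and the step size are all assumptions made for this example, not the post's model.</p>

```python
# Hypothetical stand-in for a trained energy function: lowest at CENTER.
CENTER = [1.0, -2.0, 0.5]

def toy_energy(x):
    """Squared distance from CENTER; plays the role of model(x)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, CENTER))

def toy_grad(x):
    """Analytic gradient of toy_energy with respect to the input x."""
    return [2.0 * (xi - ci) for xi, ci in zip(x, CENTER)]

def descend(x, steps=100, lr=0.1):
    """Plain gradient descent on the input (not Adam, for simplicity)."""
    x = list(x)
    for _ in range(steps):
        g = toy_grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

x0 = [4.0, 4.0, 4.0]  # arbitrary starting point
x_star = descend(x0)

# The optimized input has lower energy than the starting point.
assert toy_energy(x_star) < toy_energy(x0)
```

<p>With a real network the gradient comes from <code>backward()</code> rather than a closed form, and the minimum found depends on the starting point.</p>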

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">example</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="mi">10</span><span class="p">].</span><span class="n">detach</span><span class="p">().</span><span class="n">clone</span><span class="p">()</span>
<span class="n">example</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mf">9.0</span>
<span class="n">example</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">w_opt</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">].</span><span class="n">detach</span><span class="p">().</span><span class="n">clone</span><span class="p">())</span>
<span class="n">optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">([</span><span class="n">w_opt</span><span class="p">])</span>


<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">ex</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
    <span class="n">ex</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">w_opt</span>
    <span class="n">energy</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">ex</span><span class="p">)</span>
    <span class="n">energy</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
    
<span class="n">example</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">w_opt</span><span class="p">.</span><span class="n">detach</span><span class="p">()</span>
<span class="n">example</span>
</code></pre></div></div>

<p>The true rating given to the above data point was 5. The model estimates the rating to be 10.</p>

<p>Clearly, this model is not ready to generate plausible data points. More work is needed.</p>

<p>One last thing: for our amusement, let’s show that, for this data point, the energy score is higher at a rating of 9 than at the optimized rating.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">a</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span>
<span class="n">a</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mf">9.0</span>
<span class="p">(</span><span class="n">model</span><span class="p">(</span><span class="n">a</span><span class="p">),</span> <span class="n">model</span><span class="p">(</span><span class="n">example</span><span class="p">))</span>
</code></pre></div></div>]]></content><author><name></name></author><category term="ebm" /><category term="nce" /><category term="deep learning" /><category term="python" /><summary type="html"><![CDATA[...my attempt to better understand the process of training an Energy-Based Model.]]></summary></entry></feed>