## Archive for the 'Good Math' category

To me, the thing that makes probability fun is that the results are frequently surprising. We've got very strong instincts about how we expect numbers to work. But when you do anything that involves a lot of computations with big numbers, our intuition goes out the window - nothing works the way we expect it to. A great example of that is something called the birthday paradox.

Suppose you've got a classroom full of people. What's the probability that there are two people with the same birthday? Intuitively, most people expect that it's pretty unlikely. It seems like it shouldn't be likely - 365 possible birthdays, and 20 or 30 people in a classroom? Very little chance, right?

Let's look at it, and figure out how to compute the probability.

Interesting probability problems are all about finding out how to put things together. You're looking at things where there are huge numbers of possible outcomes, and you want to determine the odds of a specific class of outcomes. Finding the solutions is all about figuring out how to structure the problem.

A great example of this is something called the birthday paradox. This is a problem with a somewhat surprising outcome. It's also a problem where finding the right way to structure the problem is has a dramatic result.

Here's the problem: you've got a group of 30 people. What's the probability that two people out of that group of thirty have the same birthday?

We'll look at it with some simplifying assumptions. We'll ignore leap year - so we've got 365 possible birthdays. We'll assume that all birthdays are equally likely - no variation for weekdays/weekends, no variation for seasons, and no holidays, etc. Just 365 equally probable days.

How big is the space? That is, how many different ways are there to assign birthdays to 30 people? It's 36530 or something in the vicinity of 7.4*1076.

To start off, we'll reverse the problem. It's easier to structure the problem if we try to ask "What's the probability that no two people share a birthday". If P(B) is the probability that no two people share a birthday, then 1-P(B) is the probability that at least two people share a birthday.

So let's look at a couple of easy cases. Suppose we've got two people? What's the odds that they've got the same birthday? 1 in 365: there are 3652 possible pairs of birthdays; there are 365 possible pairs. So there's a probability of 365/3652 that the two people have the same birthday. For just two people, it's pretty easy. In the reverse form, there's a 364/365 chance that the two people have different birthdays.

What about 3 people? It's the probability of the first two having different birthdays, and the probability of the third person having a different birthday that either of those first two. There are 365 possible birthdays for the third person, and 363 possible days that don't overlay with the first two. So for N people, the probability of having distinct birthdays is .

At this point, we've got a nice recursive definition. Let's say that is the probability of people having distinct birthdays. Then:

1. For 2 people, the probability of distinct birthdays is 364/365. ()
2. For N>2 people, the probability of distinct birthdays is
.

Convert that to a closed form, and you get: . For 30 people, that's
. Work it out, and that's
0.29 - so the probability of everyone having distinct
birthdays is 29% - which means that the probability of at least
two people in a group of 30 having the same birthday is 71%!

You can see why our intuitions are so bad? We're talking about something where one factor in the computation is the factorial of 365!

Let's look a bit further: how many people do you need to have, before there's a 50% chance of 2 people sharing a birthday? Use the formulae we wrote up above, and it turns out to be 23. Here's the numbers - remember that this is the reverse probability, the probability of all birthdays being distinct.

 1 1 2 0.99726 3 0.991796 4 0.983644 5 0.972864 6 0.959538 7 0.943764 8 0.925665 9 0.905376 10 0.883052 11 0.858859 12 0.832975 13 0.80559 14 0.776897 15 0.747099 16 0.716396 17 0.684992 18 0.653089 19 0.620881 20 0.588562 21 0.556312 22 0.524305 23 0.492703 24 0.461656 25 0.4313 26 0.401759 27 0.373141 28 0.345539 29 0.319031 30 0.293684 31 0.269545 32 0.246652 33 0.225028 34 0.204683 35 0.185617 36 0.167818 37 0.151266 38 0.135932 39 0.12178 40 0.108768 41 0.0968484 42 0.0859695 43 0.0760771 44 0.0671146 45 0.0590241 46 0.0517472 47 0.0452256 48 0.039402 49 0.0342204 50 0.0296264

With just 23 people, there's a greater than 50% chance that two people will have the same birthday. By the time you get to just 50 people, there's a greater than 97% chance that two people have the same birthday!

As an amusing aside, the first time I saw this problem worked through was in an undergraduate discrete probability theory class, with 37 people in the class, and no duplicate birthdays!

Now - remember at the beginning, I said that the trick to working probability problems is all about how you formulate the problem. There's a much, much better way to formulate this.

Think of the assignment of birthdays as a function from people to birthdays: . The number of ways of assigning birthdays to people is the size of the set of functions from people to birthdays. How many possible functions are there? . is the number of days in the year - 365, and is the number of people in the group.

The set of assignments to unique birthdays is the number of injective functions. (An injective function is a function where .) How many injective functions are there? .

The probability of all birthdays being unique is the size of the set of injective functions divided by the size of the set of all assignments: .

So we've got the exact same result - but it's a whole lot easier in term of the discrete functions!

## The Elegance of Uncertainty

I was recently reading yet another botched explanation of Heisenberg's uncertainty principle, and it ticked me off. It wasn't a particularly interesting one, so I'm not going disassemble it in detail. What it did was the usual crackpot quantum dance: Heisenberg said that quantum means observers affect the universe, therefore our thoughts can control the universe. Blah blah blah.

It's not worth getting into the cranky details. But it inspired me to actually take some time and try to explain what uncertainty really means. Heisenberg's uncertainty principle is fascinating. It's an extremely simple concept, and yet when you realize what it means, it's the most mind-blowingly strange thing that you've ever heard.

One of the beautiful things about it is that you can take the math of uncertainty and reduce it to one simple equation. It says that given any object or particle, the following equation is always true:

Where:

• is a measurement of the amount of uncertainty
about the position of the particle;
• is the uncertainty about the momentum of the particle; and
• is a fundamental constant, called the reduced Plank's constant, which is roughly .

That last constant deserves a bit of extra explanation. Plank's constant describes the fundamental granularity of the universe. We perceive the world as being smooth. When we look at the distance between two objects, we can divide it in half, and in half again, and in half again. It seems like we should be able to do that forever. Mathematically we can, but physically we can't! Eventually, we get to a point where where is no way to subdivide distance anymore. We hit the grain-size of the universe. The same goes for time: we can look at what happens in a second, or a millisecond, or a nanosecond. But eventually, it gets down to a point where you can't divide time anymore! Planck's constant essentially defines that smallest unit of time or space.

Back to that beautiful equation: what uncertainty says is that the product of the uncertainty about the position of a particle and the uncertainty about the momentum of a particle must be at least a certain minimum.

Here's where people go wrong. They take that to mean that our ability to measure the position and momentum of a particle is uncertain - that the problem is in the process of measurement. But no: it's talking about a fundamental uncertainty. This is what makes it an incredibly crazy idea. It's not just talking about our inability to measure something: it's talking about the fundamental true uncertainty of the particle in the universe because of the quantum structure of the universe.

Let's talk about an example. Look out the window. See the sunlight? It's produced by fusion in the sun. But fusion should be impossible. Without uncertainty, the sun could not exist. We could not exist.

Why should it be impossible for fusion to happen in the sun? Because it's nowhere near dense or hot enough.

There are two forces that you need to consider in the process of nuclear fusion. There's the electromagnetic force, and there's the strong nuclear force.

The electromagnetic force, we're all familiar with. Like charges repel, different charges attract. The nucleus of an atom has a positive charge - so nuclei repel each other.

The nuclear force we're less familiar with. The protons in a nucleus repel each other - they've still got like charges! But there's another force - the strong nuclear force - that holds the nucleus together. The strong nuclear force is incredibly strong at extremely short distances, but it diminishes much, much faster than electromagnetism. So if you can get a proton close enough to the nucleus of an atom for the strong force to outweigh the electromagnetic, then that proton will stick to the nucleus, and you've got fusion!

The problem with fusion is that it takes a lot of energy to get two hydrogen nuclei close enough to each other for that strong force to kick in. In fact, it turns out that hydrogen nuclei in the sun are nowhere close to energetic enough to overcome the electromagnetic repulsion - not by multiple orders of magnitude!

But this is where uncertainty comes in to play. The core of the sun is a dense soup of other hydrogen atoms. They can't move around very much without the other atoms around them moving. That means that their momentum is very constrained - is very small, because there's just not much possible variation in how fast it's moving. But the product of and have to be greater than , which means that needs to be pretty large to compensate for the certainty about the momentum.

If is large, that means that the particle's position is not very constrained at all. It's not just that we can't tell exactly where it is, but it's position is fundamentally fuzzy. It doesn't have a precise position!

That uncertainty about the position allows a strange thing to happen. The fuzziness of position of a hydrogen nucleus is large enough that it overlaps with the the nucleus of another atom - and bang, they fuse.

This is an insane idea. A hydrogen nucleus doesn't get pushed into a collision with another hydrogen nucleus. It randomly appears in a collided state, because it's position wasn't really fixed. The two nuclei that fused didn't move: they simply didn't have a precise position!

So where does this uncertainty come from? It's part of the hard-to-comprehend world of quantum physics. Particles aren't really particles. They're waves. But they're not really waves. They're particles. They're both, and they're neither. They're something in between, or they're both at the same time. But they're not the precise things that we think of. They're inherently fuzzy probabilistic things. That's the source uncertainty: at macroscopic scales, they behave as if they're particles. But they aren't really. So the properties that associate with particles just don't work. An electron doesn't have an exact position and velocity. It has a haze of probability space where it could be. The uncertainty equation describes that haze - the inherent uncertainty that's caused by the real particle/wave duality of the things we call particles.

## Basic Data Structures: Hash Tables

I'm in the mood for a couple of basics posts. As long-time readers might know, I love writing about data structures.

One of the most important and fundamental structures is a hashtable. In fact, in a lot of modern programming languages have left hashtables behind, for reasons I'll discuss later. But if you want to understand data structures and algorithmic complexity, hashtables are one of the essentials.

A hashtable a structure for keeping a list of (key, value) pairs, where you can look up a value using the key that's associated with it. This kind of structure is frequently called either a map, an associative array, or a dictionary.

For an example, think of a phonebook. You've got a collection of pairs (name, phone-number) that make up the phonebook. When you use the phonebook, what you do is look for a person's name, and then use it to get their phone number.

A hashtable is one specific kind of structure that does this. I like to describe data structures in terms of some sort of schema: what are the basic operations that the structure supports, and what performance characteristics does it have for those operations.

In those schematic terms, a hashtable is very simple. It's a structure that maintains a mapping from keys to values. A hashtable really only needs two operations: put and get:

1. put(key, value): add a mapping from key to value to the table. If there's already a mapping for the key, then replace it.
2. get(key): get the value associated with the key.

In a hashtable, both of those operations are extremely fast.

Let's think for a moment about the basic idea of a key-value map, and what kind of performance we could get out of a cople of simple naive ways of implementing it.

We've got a list of names and phone numbers. We want to know how long it'll take to find a particular name. How quickly can we do it?

How long does that take, naively? It depends on how many keys and values there are, and what properties the keys have that we can take advantage of.

In the worst case, there's nothing to help us: the only thing we can do is take the key we're looking for, and compare it to every single key. If we have 10 keys, then on average, we'll need to do an average of about 5 steps before we find the key we're looking for. If there are 100 keys, then it'll take, on average, about 50 steps. If there are one million keys, then it'll take an average of half a million steps before we can find the value.

If the keys are ordered - that is, if we can compare one key to another not just for equality, but we can ask which came first using a "less than or equal to" operator, then we can use binary search. With binary search, we can find an entry in a list of 10 elements in 4 steps. We can find an entry in a list of 1000 keys in 10 steps, or one in a list of one million keys in 20 steps.

With a hashtable, if things work right, in a table of 10 keys, it takes one step to find the key. 100 keys? 1 step. 1000 keys? 1 step. 1,000,000,000 keys? Still one step. That's the point of a hashtable. It might be really hard to do something like general a list of all of the keys - but if all you want to do is look things up, it's lightning.

How can it do that? It's a fairly simple trick: the hashtable trades space for time. A hashtable, under normal circumstances, uses a lot more space than most other ways of building a dictionary. It makes itself fast by using extra space in a clever way.

A hashtable creates a bunch of containers for (key, value) pairs called buckets. It creates many more buckets than the number of (key, value) pairs than it expects to store. When you want to insert a value into the table, it uses a special kind of function called a hash function on the key to decide which bucket to put the (key, value) into. When you want to look for the value associated with a key, it again uses the hash function on the key to find out which bucket to look in.

It's easiest to understand by looking at some actual code. Here's a simple, not at all realistic implementation of a hashtable in Python:

  class Hashtable(object):
def __init__(self, hashfun, size):
self._size = size
self._hashfun = hashfun
self._table = [[] for i in range(size)]

def hash(self, key):
return self._hashfun(key) % self._size

def get(self, key, value):
self._table[self.hash(key)].append((key, value))

def get(self, key):
for k,v in self._table[self.hash(key)]:
if k == key:
return v
return None


If you've got a good hash function, and your hashtable is big enough, then each bucket will end up with no more than one value in it. So if you need to insert a value, you find an (empty) bucket using its hashcode, and dump it in: one step. If you need to find a value given its key, find the bucket using its hashcode, and return the value.

There are two big problems with hashtables.

First, everything is dependent on the quality of your hash function. If you hash function maps a lot of values to the same bucket, then your performance is going to suck. In fact, in the worst case, it's no better than just searching a randomly ordered list. Most of the time, you can come up with a hash function that does pretty good - but it's a surprisingly tricky thing to get right.

Second, the table really needs to be big relative to the number of elements that you expect to have in the list. If you set up a hashtable with 40 buckets, and you end up with 80 values stored in it, your performance isn't going to be very good. (In fact, it'll be slightly worse that just using a binary search tree.)

So what makes a good hash function? There are a bunch of things to consider:

1. The hash function must be deterministic: calling the hash on the same key value must always produce the same result. If you're writing a python program like the one I used as an example above, and you use the value of the key objects fields to compute the hash, then changing the key objects fields will change the hashcode!
2. The hash function needs to focus on the parts of the key that distinguish between different keys, not on their similarities. To give a simple example, in some versions of Java, the default hash function for objects is based on the address of the object in memory. All objects are stored in locations whose address is divisible by 4 - so the last two bits are always zero. If you did something simple like just take the address modulo the table size, then all of the buckets whose numbers weren't divisible by four would always be empty. That would be bad.
3. The hash function needs to be uniform. That means that it needs to map roughly the same number of input values to each possible output value. To give you a sense of how important this is: I ran a test using 3125 randomly generated strings, using one really stupid hash function (xoring together the characters), and one really good one (djb2). I set up a small table, with 31 buckets, and inserted all of the value into it. With the xor hash function, there were several empty buckets, and the worst bucket had 625 values in it. With djb2, there were no empty buckets; the smallest bucket had 98 values, and the biggest one had 106.

So what's a good hash function look like? Djb2, which I used in my test above, is based on integer arithmetic using the string values. It's an interesting case, because no one is really entirely sure of exactly why it works better than similar functions, but be that as it may, we know that in practice, it works really well. It was invented by a guy named Dan Bernstein, who used to be a genius poster in comp.lang.c, when that was a big deal. Here's djb2 in Python:

def djb2(key):
hash = 5381
for c in key:
hash = (hash * 33) + ord(c)
return hash


What the heck is it doing? Why 5381? Because it's prime, and it works pretty well. Why 33? No clue.

Towards the beginning of this post, I alluded to the fact that hashtables have, at least to some degree, fallen out of vogue. (For example, in the Go language standard library, the map type is implemented using a red-black tree.) Why?

In practice, it's rarely any faster to really use a hashtable than to use a balanced binary tree like a red-black tree. Balanced trees have better worst-case bounds, and they're not as sensitive to the properties of the hash function. And they make it really easy to iterate over all of the keys in a collection in a predictable order, which makes them great for debugging purposes.

Of course, hash tables still get used, constantly. The most commonly used data structures in Java code include, without a doubt, the HashMap and HashSet, which are both built on hashtables. They're used constantly. You usually don't have to implement them yourself; and usually system libraries provide a good default hash function for strings, so you're usually safe.

There's also a lot of really fascinating research into designing ideal hash functions for various applications. If you know what your data will look like in advance, you can even build something called a perfect hash function, which guarantees no collisions. But that's a subject for another time.

## Combining Non-Disjoint Probabilities

In my previous post on probability, I talked about how you need to be careful about covering cases. To understand what I mean by that, it's good to see some examples.

And we can do that while also introducing an important concept which I haven't discussed yet. I've frequently talked about independence, but equally important is the idea of disjointness.

Two events are independent when they have no ability to influence one another. So two coin flips are independent. Two events are disjoint when they can't possibly occur together. Flipping a coin, the event "rolled a head" and the event "rolled a tail" are disjoint: if you rolled a head, you can't roll a tail, and vice versa.

So let's think about something abstract for a moment. Let's suppose that we've got two events, A and B. We know that the probability of A is 1/3 and the probability of B is also 1/3. What's the probability of A or B?

Naively, we could say that it's P(A) + P(B). But that's not necessarily true. It depends on whether or not the two events are disjoint.

Suppose that it turns out that the probability space we're working in is rolling a six sided die. There are three basic scenarios that we could have:

1. Scenario 1: A is the event "rolled 1 or 2", and B is "rolled 3 or 4". That is, A and B are disjoint.
2. Scenario 2: A is the event "rolled 1 or 2", and B is "rolled 2 or 3". A and B are different, but they overlap.
3. Scenario 3: A is the event "rolled 1 or 2", and B is the event "rolled 1 or 2". A and B are really just different names for the same event.

In scenario one, we've got disjoint events. So P(A or B) is P(A) + P(B). One way of checking that that makes sense is to look at how the probability of events work out. P(A) is 1/3. P(B) is 1/3. The probability of neither A nor B - that is, the probability of rolling either 5 or 6 - is 1/3. The sum is 1, as it should be.

But suppose that we looked at scenario 2. If we made a mistake and added them as if they were disjoint, how would things add up? P(A) is 1/3. P(B) is 1/3. P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total of these three probabilities is 1/3 + 1/3 + 1/2 = 7/6. So just from that addition, we can see that there's a problem, and we did something wrong.

If we know that A and B overlap, then we need to do something a bit more complicated to combine probabilities. The general equation is:

Using that equation, we'd get the right result. P(A) = 1/3; P(B) =
1/3; P(A and B) = 1/6. So the probability of A or B is 1/3 + 1/3 - 1/6 = 1/2. And P(neither A nor B) = P(4 or 5 or 6) = 1/2. The total is 1, as it should be.

From here, we'll finally start moving in to some more interesting stuff. Next post, I'll look at how to use our probability axioms to analyze the probability of winning a game of craps. That will take us through a bunch of applications of the basic rules, as well as an interesting example of working through a limit case.

And then it's on to combinatorics, which is the main tool that we'll use for figuring out how many cases there are, and what they are, which as we've seen is an essential skill for probability.

## Correction, Sigma Algebras, and Mass functions

So, I messed up a bit in the previous post. Let me get that out of the way before we move forward!

In measure theory, you aren't just working with sets. You're working with something called σ-algebras. It's a very important distinction.

The problem is, our intuition of sets doesn't always work. Sets, as defined formally, are really pretty subtle. We expect certain things to be true, because they make sense. But in fact, they are not implied by the definition of sets. A σ-algebra is, essentially, a well-behaved set - a set whose behavior matches our usual expectations.

To be formal, a sigma algebra over a set S is a collection Σ of subsets of S such that:

1. Σ is closed over set complement.
2. Σ is closed over countable union.

The reason why you need to make this restriction is, ultimately, because of the axiom of choice. Using the axiom of choice, you can create sets which are unmeasurable. They're clearly subsets of a measurable set, and supersets of other measurable sets - and yet, they are, themselves, not measurable. This leads to things like the Banach-Tarski paradox: you can take a measurable set, divide it into non-measurable subsets, and then combine those non-measurable subsets back into measurable sets whose size seem to make no sense. You can take a sphere the size of a baseball, slice it into pieces, and then re-assemble those pieces into a sphere the size of the earth, without stretching them!

These non-measurable sets blow away our expectations about how things should behave. The restriction to σ algebras is just a way of saying that we need to be working in a space where all sets are measurable. When we're looking at measure theory (or probability theory, where we're building on measures), we need to exclude non-measurable sets. If we don't, we're seriously up a creek without a paddle. If we allowed non-measurable sets, then the probability theory we're building would be inconsistent, and that's the kiss of death in mathematics.

Ok. So, with that out of the way, how do we actually use Kolmogorov's axioms? It all comes down to the idea of a sample space. You need to start with an experiment that you're going to observe. For that experiment, there are a set of possible outcomes. The set of all possible outcomes is the sample space.

Here's where, sadly, even axiomatized probability theory gets a bit handwavy. Given the sample space, you can define the structure of the sample space with a function, called the probability mass function, f, which maps each possible event in the sample space to a probability. To be a valid mass function for a sample space S, it's got to have the following properties:

1. For each event e in S, f(e) ≥ 0 and f(e) <= 1..
2. The sum of the probabilities in the sample space must be 1:

So we wind up with a sort of circularity: in order to describe the probability of events, we need to start by knowing the probability of events. In fact, this isn't really a problem: we're talking about taking something than we observe in the real world, and mapping it into the abstract space of math. Whenever we do that, we need to take our observations of the real world and create an approximation as a mathematical model.

The point of probability theory isn't to do that primitive mapping. In general, we already understand how rolling a single die works. We know how it should behave, and we know how and why its actual behavior can vary from our expectation. What we want to know is really how many events combine.

We don't need any special theory to figure out what the probability of rolling a 3 on a six-sided die is: that's easy, and it's obvious: it's 1 in 6. But what's the probability of winning a game of craps?

If all days of the year 2001 are equally likely, then we don't need anything fancy to ask what the probability of someone born in 2001's birthday being July 21st. It's easy: 1 in 365. But if I've got a group of 35 people, what's the probability of two of them sharing the same birthday?

Both of those questions start with the assignment of a probability mass function, which is trivial. But they involve combining the probabilities given by those mass functions, and use them with Kolmogorov's axioms to figure out the probabilities of the complicated events.

## Kolmogorov's Axioms of Probability

The way that I've talked about probability so far is mostly informal. That's the way that probability theory was treated for a long time. You defined probability spaces over collections of equal probability sets. You combined probability spaces by combining their events into other kinds of equally probable events.

The problem with that should be obvious: it's circular. You want to define the probability of events; to do that, you need to start with equally probable events, which means that on some level, you already know the probabilities. If you don't know the probabilities, you can't talk about them. The reality is somewhat worse than that, because this way of looking at things completely falls apart when you start trying to think about infinite probability spaces!

So what can you do?

The answer is to reformulate probability. Mathematicians knew about this kind of problem for a very long time, but what they mostly just ignored it: probability wasn't considered a terribly interesting field.

Then, along came Kolmogorov - the same brilliant guy who's theory of computational complexity is so fascinating to me! Kolmogorov created a new formulation of probability theory. Instead of starting with a space of equally probable discrete events, you start with a measure space.

Before we can look at how Kolmogorov reformulated probability (the Kolmogorov axioms), we need to look at just what a measure space is.

A measure space is just a set with a measure function. So let X be a set. A measure μ on X is a function from a subset of X to a real number: with the following properties:

• Measures are non-negative:
• The measure of the empty set is always 0:
• The measure of a finite sequence of unions is the sum of the individual measures

So the idea is pretty simple: a measure space is just a way of defining the size of a subset in a consistent way.

To work with probability, you need a measure space where the measure of the entire set is 1. With that idea in mind, we can put together a proper, formal definition of a probability space that will really allow us to work with, and to combine probabilities in a rigorous way.

Like our original version, a probability space has a set of events, called its event space. We'll use F to represent the set of all possible events, and e to represent an event in that set.

There are three fundamental axioms of probability, which are going to look really similar to the three axioms of a measure space:

1. Basic measure: the probability of any event is a positive real number: .
2. Unit measure: the probability that some event will occur is 1, which we write as ( is called the unit event, and is the union of all possible events.) Alternatively, the probability of no event occurring is 0: .
3. Combination: For any two distinct events or sets of events and , the probability of or is : . This can be extended to any countable sequence of unions.

This is very similar to the informal version we used earlier. But as we'll see later, this simple formulation from measure theory will give us a lot of additional power.

It's worth taking a moment to point out two implications of these axioms. (In fact, I've seen some presentations that treat some of these as additional axioms, but they're provable from the first three.

• Monotonicity: if , then .
• Upper Bound: for any event or set of events , .

The brilliance of Kolmogorov was realizing that these rules were everything you need to work out any probability you want - in both finite and infinite spaces. We'll see that there's a lot of complexity in the combinatorics of probability, but it will all always ultimately come back to these three rules.

## Independence and Combining probabilities.

As I alluded to in my previous post, simple probability is really a matter of observation, to produce a model. If you roll a six sided die over and over again, you'll see that each face comes up pretty much equally often, and so you model it as 1/6 probability for each face. There's not really a huge amount of principle behind the basic models: it's really just whatever works. This is the root of the distinction between interpretations: a frequentist starts with an experiment, and builds a descriptive model based on it, and says that the underlying phenomena being tested has the model as g, a property; a Bayesian does almost the same thing, but says that the model describes the state of their knowledge.

Where probability starts to become interesting in when you combine things. I know the probability of outcomes for rolling one die: how can I use that to see what happens when I roll five dice together? I know the probability of drawing a specific card from a deck: what are the odds of being dealt a particular poker hand?

We'll start with the easiest part: combining independent probabilities. The probability of two events are independent when there's no way for the outcome of one to influence the outcome of the other. For example, if you're flipping a coin several times, the result of one coin flip has no effect on the result of a subsequent flip. On the other hand, dealing 10 cards from a deck is a sequence of dependent events: once you've dealt one card, the next deal must come from the remaining cards: you can't deal the same card twice.

If you know the probability space of your trials, then recognizing an independent situation is easy: if the outcome of one trial doesn't alter the probability space of other trials, then they're independent.

Look back at the coin flip example: we know what the probability space of a coin flip looks like: it's got two, equally probable outcomes. If you've flipped a coin once, and you're going to flip another coin, the result of the first flip can't do anything that alters the probability space of a subsequent flip.

But if you think about dealing cards, that's not true. With a standard deck of cards, the initial probability space has 52 outcomes, each of which is equally likely. So the odds of being dealt the 5 of spades is exactly 1/52.

Now, suppose that you got lucky, and you did get dealt the 5 of spades on the first card. What's the probability of being dealt the 5 of spades's on the second? If they were independent events, it would still be 1/52. But once you've dealt one card, you can't deal it again. The probability of being dealt the 5 of spades as the second card is 0: it's impossible. The probability space only has 51 possible outcomes, and the 5 of spades is not one of them. The space has changed. That's the definition of a dependent event.

When you're faced with dependent probabilities, you need to figure out how the probability space will be changed, and incorporate that into your computation. Once you've incorporated the change in the probability space of the second test, then you've got a new independent probability, and you can combine them. Figuring out how to alter the probability space can be extremely difficult, but that's what makes it interesting.

When you're dealing with independent events, it's easy to combine them. There are two basic ways of combining event probabilities,
and they should be familiar from logic: event1 AND event2, and event1 OR event2.

Suppose you're looking at two test with independent outcomes. I know that the probability of event e is P(e), and the probability of event f is P(f) Then the outcome of e & f - that is, of having e as the outcome of the first trial, and f as the outcome of the second, is P(e)×P(f). The odds of rolling HTTH on a coin is (1/2)*(1/2)*(1/2)*(1/2)=(1/16).

If you're looking at independent alternatives - that is, the probability of e OR F, you combine the probabilities of the event with addition: P(e) + P(f). So, the odds of drawing any heart from a deck: for each card, it's 1/52. There are thirteen different hearts. So the odds of drawing a red are 1/52 + 1/52 + ... = 13/52 = 1/4.

That still doesn't get us to the really interesting stuff. We still can't quite work out something like the odds of being dealt a flush. To get there, we need to learn some combinatorics, which will allow us to formulate the probability spaces that we need for an interesting probability.

## Probability Spaces

Sorry for the slowness of the blog lately. I finally got myself back onto a semi-regular schedule when I posted about the Adria Richards affair, and that really blew up. The amount of vicious, hateful bile that showed up, both in comments (which I moderated) and in my email was truly astonishing. I've written things which pissed people off before, and I've gotten at least my fair share of hatemail. But nothing I've written before came close to preparing me for the kind of unbounded hatred that came in response to that post.

I really needed some time away from the blog after that.

Anyway, I'm back, and it's time to get on with some discrete probability theory!

I've already written a bit about interpretations of probability. But I haven't said anything about what probability means formally. When I say that the probability of rolling a 3 with a pair of fair six-sided dice is 1/18, how do I know that? Where did that 1/6th figure come from?

The answer lies in something called a probability space. I'm going to explain the probability space in frequentist terms, because I think that that's easiest, but there is (of course) an equivalent Bayesian description.) Suppose I'm looking at a particular experiment. In classic mathematical form, a probability space consists of three components (Ω, E, P), where:

1. Ω, called the sample space, is a set containing all possible outcomes of the experiment. For a pair of dice, Ω would be the set of all possible rolls: {(1,1), (1,2), (1,3), (1,4), (1,5), (1, 6), (2,1), ..., (6, 5), (6,6)}.
2. E is an equivalence relation over Ω, which partitions Ω into a set of events. Each event is a set of outcomes that are equivalent. For rolling a pair of dice, an event is a total - each event is the set of outcomes that have the same total. For the event "3" (meaning a roll that totalled three), the set would be {(1, 2), (2, 1)}.
3. P is a probability assignment. For each event e in E, P(e) is a value between 0 and 1, where:

(That is, the sum of the probabilities of all of the possible events in the space is exactly 1.)

The probability of an event e being the outcome of a trial is P(e).

So the probability of any particular event as the result of a trial is a number between 0 and 1. What's it mean? If the probability of event e is p, then if we repeat the trial N times, we expect N*p of those trials to have e as their result. If the probability of e is 1/4, and we repeat the trial 100 times, we'd expect e to be the result 25 times.

But in an important sense, that's a cop-out. We've defined probability in terms of this abstract model, where the third component is the probability. Isn't that circular?

Not really. For a given trial, we create the probability assignment by observation and/or analysis. The important point is that this is really just a bare minimum starting point. What we really care about in probability isn't the change associated with a single, simple, atomic event. What we want to do is take the probability associated with a group of single events, and use our understanding of that to allow us to explore a complex event.

If I give you a well-shuffled deck of cards, it's easy to show that the odds of drawing the 3 of diamonds is 1/52. What we want to do with probability is things like ask: What are the odds of being dealt a flush in a poker hand?

The construction of a probability space gives us a well-defined platform to use for building probabilistic models of more interesting things. Give a probability space of two single dice, we can combine them together to create the probability space of the two dice rolled together. Given the probability space of a pair of dice, we can construct the probability space of a game of craps. And so on.

## Probability and Interpretations

I'm going to do some writing about discrete probability theory. Probability is an extremely important area of math. We encounter aspects of it every day. It's also a very poorly understood area - it's one that we see abused or just fouled up every day.

I'm going to focus on discrete probability theory. What that means is that we're going to look at things where the space containing the things that we're going to look at contains a countable number of elements. The probability of getting a certain sequence of coin flips, or of getting a certain hand of cards are described by discrete probability theory. On the other hand, the odds of a radioactive isotope decaying at a particular time requires continuous probability theory.

Before getting into the details, there's one important thing to mention. When you're talking about probability, there are two fundamental schools of interpretetation. There are frequentist interpretations, and there are Bayesian interpretations.

In a frequentist interpretation, when you say the probability of an event is 0.6, what you mean is that if you were to perform a series of experiments precisely reproducing the event, then on average, if you did 100 experiments, the event would occur 60 times. In the frequentist interpretation, the probability is an intrinsic property of the event. For a frequentist, it makes sense to say that there is a "real" probability associated with an event.

In a Bayesian interpretation, when you say that the probability of an event is 0.6, what you mean is that based on your current state of knowledge about the event, you have a 60% certainty that the event will occur. In a strict Bayesian interpretation, the event doesn't have any kind of intrinsic probability associated with it. The specific event that you're interested in either will occur, or it won't. There's no real probability involved. What probability measures is how certain you are about whether or not it will occur.

For example, think about flipping a fair coin.

A frequentist would say that you can flip a coin many times, and half of the time, it will land on heads. So the probability of a coin flip landing on the head of the coin is 0.5. A Bayesian would say that the coin will land either on heads or on tails. Since you don't know which, and you have no other information to use to be able to make a better prediction, you can have a certainty of 0.5 that it will land on the head of the coin.

In the real world, I think that most people are really somewhere in between.

I think that all but the most fervent Bayesians do rely on an intuitive notion of the "intrinsic" probability of an event. They may describe it in different terms, but when it comes down to it, they're using the basic frequentist notion. And I don't think that you can find a sane frequentist anywhere who won't use Bayes theorem to update their priors in the face of new information - which is the most fundamental notion in the Bayesian interpretation.

One note before I finish this, and get started on the real meaty posts. In the past, when I've talked about probability, people have started stupid flamewars in the comments. People get downright religious about interpretations of probability. There are religious Bayesians, who think that all frequentists are stupid idiots who should be banished from the field of math; likewise, there are religious frequentists who think that Bayesians are all a crop of arrogant know-it-alls who should be sent to Siberia. I am not going to tolerate any of that nonsense. If you feel that you cannot read posts on probability without going into a diatribe about those stupid frequentists/Bayesians and their deliberately stupid ideas, please go away and don't even read these posts. If you do go into such a diatribe, I will delete your comments without any hesitation.

## What the heck is a DNS amplification DoS attack?

A couple of weeks ago, there was a bunch of news about a major DOS attack on Spamhaus. Spamhaus is an online service that maintains a blacklist of mail servers that are known for propagating spam. I've been getting questions about what a DoS attack is, and more specifically what a "DNS amplification attack" (the specific attack at the heart of last week's news) is. This all became a bit more relevant to me last week, because some asshole who was offended by my post about the Adria Richards affair launched a smallish DoS attack against scientopia. (This is why we were interrmitently very slow last week, between tuesday and thursday. Also, to be clear, the DNS amplification attack was used on Spamhaus. Scientopia was hit by a good old fashioned DDoS attack.)

So what is a DoS attack? And what specifically is a DNS amplification attack?

Suppose that you're a nastly person who wants to take down a website like scientopia. How could you do it? You could hack in to the server, and delete everything. That would kill it pretty effectively, right?

It certainly would. But from the viewpoint of an attacker, that's not a particularly easy thing to do. You'd need to get access to a privileged account on our server. Even if we're completely up to date on patches and security fixes, it's probably possible to do that, but it's still probably going to be a lot of work. Even for a dinky site like scientopia, getting that kind of access probably isn't trivial. For a big security-focused site like spamhaus, that's likely to be close to impossible: there are layers of security that you'd need to get through, and there are people constantly watching for attacks. Even if you got through, if the site has reliable backups, it won't be down for long, and once they get back up, they'll patch whatever hole you used to get in, so you'd be back to square one. It's a lot of work, and there are much easier ways to take down a site.

What you, as an attacker, want is a way to take the site down without having any kind of access to the system. You want a way that keeps the site down for as long as you want it down. And you want a way that doesn't leave easily traced connections to you.

That's where the DoS attack comes in. DoS stands for "denial of service". The idea of a DoS attack is to take a site down without really taking it down. You don't actually kill the server; you just make it impossible for legitimate users to access it. If the sites users can't access the site even though the server is technically still up and running, you've still effectively killed it.

How do you do that? You overwhelm the server. You target some finite resource on the server, and force it to use up that resource just dealing with requests or traffic that you sent to the server, leaving it with nothing for its legitimate users.

In terms of the internet, the two resources that people typically target are CPU and network bandwidth.

Every time that you send a request to a webserver, the server has to do some computation to process that request. The server has a finite amount of computational capability. If you can hit it with enough requests that it spends all of its time processing your requests, then the site becomes unusable, and it effectively goes down. This is the simplest kind of DoS attack. It's generally done in a form called a DDoS - distributed denial of server attack, where the attacker users thousands or millions of virus-infected computers to send requests. The server gets hit by a vast storm of requests, and it can't distinguish the legitimate requests from the ones generated by the attacker. This is the kind of attack that hit Scientopia last week. We were getting streams of a couple of thousands malformed requests per second.

This kind of attack can be very effective. It's hard - not impossible, but hard - to fight. You need to identify the common traits of the attackers, and set up some kind of filter to discard those requests. From the attacker's point of view, it's got one problem: price. Most people don't have a personal collection of virus-infected machines that they can use to mount an attack. What they actually do is rent machines! Virus authors run services where they'll use the machines that they've to run an attack for you, for a fee. They typically charge per machine-hour. So to keep a good attack going for a long time is expensive! Another problem with this kind of attack is that the amount of traffic that you can inflict on the server per attacker is also used by the client. The client needs to establish a connection to the server. That consumes CPU, network connections, and bandwidth on the client.

The other main DoS vector is network bandwidth. Every server running a website is connected to the network by a connection with a fixed capacity, called it's bandwidth. A network connection can only carry a certain quantity of information. People love to make fun of the congressman who said that the internet is like a series of tubes, but that's not really a bad analogy. Any given connection is a lot like a pipe. You can only cram so much information through that pipe in a given period of time. If you can send enough traffic to completely fill that pipe, then the computer on the other end is, effectively, off the network. It can't receive any requests.

For a big site like spamhaus, it's very hard to get enough machines attacking to effectively kill the site. The amount of bandwidth, and the number of different network paths connecting spamhaus to the internet is huge! The number of infected machines available for an attack is limited, and the cost of using all of them is prohibitive.

What an attacker would like for killing something like Spamhaus is an attack where the amount of work/cpu/traffic used to generate the attack is much smaller than the amount of work/cpu/traffic used by the server to combat the attack. That's where amplification comes in. You want to find some way of using a small amount of work/traffic on your attacker machines to cause your target to lost a large amount of work/traffic.

In this recent attack on Spamhaus, they used an amplification attack, that was based on a basic internet infrastructure service called the Domain Name Service (DNS). DNS is the service which is used to convert between the name of a server (like scientopia.org), and its numeric internet address (184.106.221.182). DNS has some technical properties that make it idea for this kind of attack:

1. It's not a connection-based service. In most internet services, you establish a connection to a server, and send a request on that connection. The server responds on the same connection. In a connection-based service, that means two things. First, you need to use just as much bandwidth as the target, because if you drop the connection, the server sees the disconnect and stops processing your request. Second, the server knows who it's connected to, and it always sends the results of a request to the client that requested it. But DNS doesn't work that way. In DNS, you send a request without a connection, and in the request, you provide an address that the response should be sent to. So you can fake a DNS request, by putting someone else's address as the "respond-to" address in the request.
2. It's possible to set up DNS to create very large responses to very small requests. There are lots of ways to do this. The important thing is that it's really easy to use DNS in a way that allows you to amplify the amount of data being sent to a server by a factor of 100. In one common form of DNS amplification, you send 60 byte requests, which generate responses larger than 6,000 bytes.

Put these two properties together, and you get a great attack vector: you can send tiny, cheap requests to a server, which don't cause any incoming traffic on your attacker machine, and which send large quantities of data to your target. Doing this is called a DNS amplification attack: it's an amplification attack which uses properties of DNS to generate large quantities of data send to your server, using small quantities of data sent by your attackers.

That's exactly what happened to Spamhaus last week. The attackers used a very common DNS extension, which allowed them to amplify 60 byte requests into 4,000 byte responses, and to send the responses to the spamhaus servers.

There are, of course, more details. (For example, when direct attacks didn't work, they tried an indirect attack that didn't target the spamhaus servers, but instead tried to attack other servers that spamhaus relied on.) But this is the gist.

Older posts »

• Scientopia Blogs