**The likelihood of a specified event or circumstance, usually expressed as the ratio of the number of times that event occurs to the total number of trials in which its occurrence or failure to occur is observed.** Although probability theory derives its notions and terminology from intuition, a vague statement such as “John will probably come” is as remote from it as the statement “John is forceful and energetic” is remote from mechanics. Probability theory constructs abstract models, mostly of a quantitative nature, and only experience can show whether these reasonably describe laws of nature or life. As always in mathematics, only logical relations and implications enter the theory, and the notion of probability is just as undefinable (and as intuitive) as are the notions of point, line, or mass. An actual assignment of numerical probabilities is frequently unnecessary or impossible. For example, the design of telephone exchanges is based on a theoretical comparison of several possible systems; only the optimal ones are built and the others discarded. Thus a huge industry depends on theoretical models of exchanges which will never exist.

An uncomplicated illustration of the nature of probability models is found in an experiment of Lord Rutherford. To measure radioactive intensity, he proceeded as follows. Observers *A*_{1} and *A*_{2} counted scintillations on a screen and observed, respectively, *N*_{1} and *N*_{2} scintillations; of these, *N*_{12} were common to both observers. To estimate the unknown true number *X*, Rutherford assumed that each scintillation has fixed probabilities *p*_{1} and *p*_{2} to be observed by *A*_{1} and *A*_{2}, and furthermore that the observations are independent in the sense that a scintillation observed by *A*_{1} has still probability *p*_{2} to be observed by *A*_{2}. In reality the likelihood of observing a scintillation varies with growing fatigue and the proximity of the preceding scintillation; also, the observers are affected by common causes and are therefore not independent. Equating probabilities with observed frequencies (another approximation), Rutherford set *N*_{1} = *Xp*_{1}, *N*_{2} = *Xp*_{2}, *N*_{12} = *Xp*_{1}*p*_{2}, whence *X* = *N*_{1}*N*_{2}/*N*_{12}. The three equations may be solved for *p*_{1} and *p*_{2}, but these “probabilities” are purely fictitious and, as experience shows, inaccessible to experimental verification. The model is justified by plausibility and success.
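Rutherford's three equations can be solved directly. The sketch below uses invented counts, not Rutherford's data, purely to illustrate the arithmetic:

```python
def rutherford_estimate(n1, n2, n12):
    """Estimate the true count X and the observer efficiencies from
    N1 = X*p1, N2 = X*p2, N12 = X*p1*p2."""
    x = n1 * n2 / n12
    p1 = n12 / n2   # since N12 = X*p1*p2 = N2*p1
    p2 = n12 / n1   # likewise N12 = N1*p2
    return x, p1, p2

# Hypothetical counts: observer 1 sees 80, observer 2 sees 60, 48 in common.
x, p1, p2 = rutherford_estimate(80, 60, 48)   # x = 100, p1 = 0.8, p2 = 0.6
```

Consistency is easy to verify: with these values *N*_{1} = *Xp*_{1} = 100 × 0.8 = 80, as assumed.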

#### Sample space

One speaks of probabilities only in connection with conceptual (not necessarily performable) experiments and must first define the possible outcomes. Thus, by convention, tossing coins results in heads *H* or tails *T*; regardless of experimental or philosophical difficulties the age of a person is taken as an exact number and each positive number is taken as a possible age. Throwing two dice results in one of the 36 combinations (1,1), (1,2), …, (6,6). An outcome such as “sum 4” is a compound event which can be further decomposed by enumeration: Sum 4 occurs if the outcome is (1,3), (2,2), or (3,1). Thus it is necessary to distinguish between elementary (indivisible) and compound outcomes or events. Each elementary outcome is called a sample point; their aggregate is the sample space. The conceptual experiment is defined by its sample space, which must be introduced and established at the outset.
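The decomposition of a compound event into sample points can be sketched by direct enumeration:

```python
from itertools import product

# Sample space for a throw of two dice: 36 elementary outcomes.
sample_space = list(product(range(1, 7), repeat=2))

# The compound event "sum 4" decomposes into its elementary outcomes.
sum_four = [pt for pt in sample_space if sum(pt) == 4]
# sum_four == [(1, 3), (2, 2), (3, 1)]
```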

For example, the experiment “distributing 3 balls in 3 cells” has 27 possible outcomes (sample points) listed in tabulation (1).

Note that “*n* balls in 7 cells” may represent the distribution of *n* hits among 7 targets, or of *n* accidents in 7 weekdays, and so on.

Consider next the experiment of placing 3 indistinguishable balls into 3 cells. Whether actual balls are indistinguishable is irrelevant; they are treated as such and, by convention, there is now a space of only 10 sample points. It is listed in tabulation (2).
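The two sample spaces can be generated mechanically; a sketch (cells numbered 0–2, an arbitrary convention) confirms the counts 27 and 10:

```python
from itertools import product

# Distinguishable balls a, b, c: each ball goes into cell 0, 1, or 2.
distinguishable = list(product(range(3), repeat=3))   # 27 outcomes

# Indistinguishable balls: only the occupancy numbers of the cells matter.
def occupancy(outcome):
    return tuple(outcome.count(cell) for cell in range(3))

indistinguishable = sorted(set(occupancy(o) for o in distinguishable))  # 10 points
```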

In playing roulette, each point on a circle represents a possible outcome and the sample space is the interval 0 ≦ ϑ < 2π. When one observes the motion of a particle under diffusion, every function *x*(*t*) represents a conceivable outcome and the sample space is a complicated function space.

#### Events

In examining a bridge hand, one may ask whether it contains an ace or satisfies some other condition. In principle each such event may be described by specifying the sample points which do satisfy the stipulated condition. Thus every compound event is represented by an aggregate of sample points, and in probability theory these terms are synonymous. The standard notations of set theory are used to describe relations among events. *See also:* **Set theory**

Given an event *A* one may consider the case that *A* does not occur. This is the negation or complement of *A*, denoted by *A*′; it consists of those sample points that do not belong to *A.* Given two events *A* and *B*, the event *C* that either *A* or *B* or both occur is the union of *A* and *B* and denoted by *C* = *A* ∪ *B.* In particular *A* ∪ *A*′ is the whole sample space which therefore represents certainty. The event *D*, both *A* and *B* occur, is the intersection of *A* and *B* and written *D* = *A* ∩ *B.* It consists of the points common to *A* and *B.* If there are no such common points (as in the case of *A* and *A*′), *A* and *B* cannot occur simultaneously and they are called mutually exclusive, written *A* ∩ *B* = 0. The event “*A* but not *B*” is simply *A* ∩ *B*′.
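These set operations can be illustrated on the two-dice sample space, with events represented as Python sets of sample points:

```python
from itertools import product

# Events as sets of sample points on a throw of two dice.
space = set(product(range(1, 7), repeat=2))
A = {pt for pt in space if sum(pt) == 4}    # "sum 4"
B = {pt for pt in space if pt[0] == 1}      # "first die shows 1"

union = A | B               # A or B (or both) occurs
intersection = A & B        # both occur: here only (1, 3)
complement_A = space - A    # A does not occur
A_but_not_B = A - B         # "A but not B"
```

As in the text, *A* ∪ *A*′ recovers the whole sample space (certainty).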

For example, in tabulation (1), the event *A* “one cell multiply occupied” is the aggregate of the points numbered 1–21. The event *B* “first cell not empty” is the aggregate of the points 1, 4–15, and 22–27. Because every point belongs either to *A* or to *B* (or both), *A* ∪ *B* is the whole sample space and hence the certain event. Next, *D* = *A* ∩ *B* consists of the points 1, 4–15. Finally, *A*′ may be described as “no cell empty.”

#### Probabilities in finite spaces

If the sample space contains only *N* points *E*_{1}, …, *E*_{N} their probabilities may be any numbers such that *P*{*E*_{j}} ≧ 0 and *P*{*E*_{1}} + ⋯ + *P*{*E*_{N}} = 1. The probability *P*{*A*} of an event *A* is the sum of the probabilities of all points contained in *A*; thus the probability of the whole sample space equals 1. To find *P*{*A* ∪ *B*} one considers all points belonging to either *A* or *B*, but those belonging to both *A* and *B* are counted only once. Therefore *P*{*A* ∪ *B*} = *P*{*A*} + *P*{*B*} − *P*{*A* ∩ *B*}. In particular, for mutually exclusive events, there is the addition rule *P*{*A* ∪ *B*} = *P*{*A*} + *P*{*B*}.

Frequently considerations of symmetry lead one to consider all *E*_{j} as equally likely; that is, to set *P*{*E*_{j}} = 1/*N*. In this case *P*{*A*} = *n*/*N* where *n* is the number of points in *A*; for a gambler betting on *A*, these represent the “favorable cases.” For example, in throwing a pair of “perfect” dice, one naturally assumes that the 36 possible outcomes are equally likely. This model does not lose its justification or usefulness by the fact that actual dice do not live up to it. The assumption of perfect randomness in games, card shuffling, industrial quality control, or sampling is rarely realized, and the true usefulness of the model stems from the experience that noticeable departures from the ideal scheme lead to the detection of assignable causes and thus to theoretical or experimental improvements.

How the success of probability theory depends on the disregard of preconceived philosophical ideas and on the readiness to adapt models to unexpected circumstances is illustrated by Bose-Einstein statistics. In the example of tabulation (1), the notion of perfect randomness leads to the assignment of probability 1/27 to each point. In the case of indistinguishable balls, tabulation (2), it has been argued that an experiment is unaffected by failure to distinguish between balls; physically there remain 27 possibilities grouped in 10 distinguishable forms. This argument leads to assigning probability 1/27 to each of the points 1–3, probability 1/9 to each of the points 4–9, and 2/9 to point 10. This reasoning (sound in certain situations) has been accepted as evident in statistical mechanics for the distribution of *r* particles in *n* cells (Maxwell-Boltzmann statistics). Surprisingly, it turned out that no physical particles behave this way and it was revolutionary when Bose and Einstein showed that for one type of particle all distinguishable arrangements are equally likely. This model assigns probability 1/10 to each point of tabulation (2). *See also:* **Boltzmann statistics**; **Bose-Einstein statistics**
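The two assignments can be compared by enumeration for 3 balls in 3 cells (a small sketch; cells numbered 0–2 is an arbitrary convention):

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# Maxwell-Boltzmann: all 27 assignments of 3 distinguishable balls equally likely.
assignments = list(product(range(3), repeat=3))

def occupancy(a):
    return tuple(a.count(cell) for cell in range(3))

counts = Counter(occupancy(a) for a in assignments)
mb_prob = {occ: Fraction(k, 27) for occ, k in counts.items()}  # Maxwell-Boltzmann

# Bose-Einstein: each of the 10 distinguishable arrangements equally likely.
be_prob = {occ: Fraction(1, 10) for occ in mb_prob}
```

Under Maxwell-Boltzmann the triple occupancies carry 1/27, the 2-1-0 patterns 1/9, and the 1-1-1 pattern 2/9, exactly as in the text.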

A useful, although vague, intuitive description of probability describes *P*{*A*} as the relative frequency of the event *A* if the experiment is repeated many times under identical circumstances. The laws of large numbers render this more precise, but the description often lacks operational meaning. Experiments in agriculture and human sampling cannot be repeated under remotely similar conditions, and in the case of telephone exchanges, useful probability models refer to situations which will never materialize.

#### Probabilities in infinite spaces

Two examples, unending coin tossing and roulette, illustrate the novel features of this topic.

##### Unending coin tossing

In the study of limit laws, one must consider potentially infinite sequences of coin tossings. The possible outcome of this experiment is an infinite sequence of heads and tails, and every sequence such as *HTHTTH* … represents a sample point. Finitely many tosses are the beginning of an infinite sequence, and the event “first four trials resulted in *HTTH*” is the aggregate of the infinitely many sequences with the prescribed beginning. Such an event is called an interval of length 4. There are 2^{n} intervals of length *n*, and they are mutually exclusive. For reasons of symmetry, one attributes the probability 2^{−n} to each interval of length *n.* Thus the assignment of basic probabilities refers to intervals rather than to points. A point such as *HTHT* … is the limit of an infinite sequence of contracting intervals *H, HT, HTH*, …, and therefore probability zero must be attributed to each individual point.

The probabilities of other events are similarly defined by limiting procedures. For example, consider the event *A* that an infinitely prolonged sequence of trials never produces a run of at least two consecutive heads, or two consecutive tails. It is more convenient to enumerate the points of the complementary event *A*′ that two equal symbols do occur in succession. Clearly *A*′ is the union of the infinitely many mutually exclusive intervals *HH, TT*; *HTT, THH*; *HTHH, THTT*, and so on. Here there are 2 intervals of length *n* ≧ 2, and therefore *P*{*A*′} = 2(2^{−2} + 2^{−3} + 2^{−4} + ⋯) = 1, whence *P*{*A*} = 0. The indicated result of the experiment is thinkable, but probability zero is attributed to it. A similar, although more complicated, limiting procedure leads to the law of large numbers according to which the event “the frequencies of *H* and *T* in the first *n* trials tend to 1/2 as *n* → ∞” has probability one.
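The geometric series in *P*{*A*′} can be checked with exact rational arithmetic; each added pair of intervals halves the remaining gap to 1:

```python
from fractions import Fraction

# Partial sums of P{A'} = 2 * (2^-2 + 2^-3 + ... + 2^-n).
def partial(n):
    return 2 * sum(Fraction(1, 2**k) for k in range(2, n + 1))

gap = [1 - partial(n) for n in (2, 3, 10)]   # 1/2, 1/4, 1/512
```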

##### Roulette

Here the sample space consists of the angles 0 ≦ ϑ < 2π, and the notion of a perfect roulette assumes equal probabilities for intervals of equal length; thus an interval of length *a* carries probability *a*/2π. If the roulette is divided into 32 equal numbered intervals, the event “even number” consists of 16 intervals and has probability 1/2.

The situation encountered here is not peculiar to probability but is common in measure theory. One starts with a collection of basic events, called intervals, and attributes probabilities to them. By simple and natural limiting procedures, probabilities can then be defined for a much wider class F of events which are obtainable by applying the operations of set theory to intervals (in finite or infinite number). F is the Borel field generated by the intervals. Probability is simply a measure on F; that is, to each event *A* in F there corresponds a probability *P*{*A*} ≧ 0 which is completely additive: If *A* is the union of the mutually exclusive events *A*_{1}, *A*_{2}, …, then *P*{*A*} = Σ*P*{*A*_{i}}. The probability of the whole space is, of course, unity.

The extension of the addition rule from finitely to infinitely many summands may be defended by considerations of continuity, but ultimately it is justified by its simplicity and its success.

#### Conditional probability—independence

Suppose that a population of *N* people includes *N*_{A} color-blind persons and *N*_{H} females. To the event *A* “a randomly chosen person is color-blind” can be ascribed probability *P*{*A*} = *N*_{A}/*N*, and similarly for the event *H* that a person be female one has *P*{*H*} = *N*_{H}/*N*. If *N*_{AH} is the number of color-blind females, the ratio *N*_{AH}/*N*_{H} may be interpreted as the probability that a randomly chosen female be color-blind; here the experiment “random choice in the population” is replaced by a selection from the female subpopulation. In the original experiment, *N*_{AH}/*N* is the probability of the simultaneous occurrence of both *A* and *H*, so that *N*_{AH}/*N*_{H} = *P*{*A* ∩ *H*}/*P*{*H*}. Similar situations occur so frequently that it is convenient to introduce Eq. (3)

*P*{*A*|*H*} = *P*{*A* ∩ *H*}/*P*{*H*}  (3)

and to call this the conditional probability of the event *A* relative to *H.* This concept is useful whenever it is desired to restrict the consideration to those cases where the event *H* occurs (or where the hypothesis *H* is fulfilled). Thus, in betting on an event *A* the knowledge that *H* occurred would induce one to replace *P*{*A*} by *P*{*A*|*H*}. If all sample points are equally likely, *P*{*A*|*H*} still represents the ratio of favorable cases to the total of cases possible when it is known that *H* has occurred.

Despite its simplicity the notion of conditional probability is exceedingly important, and frequently the probabilities in sample space are defined only in terms of conditional probabilities. An example of the concept is provided by a bolt factory in which three machines manufacture, respectively, 25, 35, and 40% of the total. Of their output 5, 4, and 2% are defective bolts. Classification of the bolts according to the number of the machine and the quality (*d* for defective, *c* for conforming) gives the six categories *c*_{1}, *c*_{2}, *c*_{3}, *d*_{1}, *d*_{2}, and *d*_{3}. A random choice of a bolt results in one of these six outcomes, but their probabilities are not given directly. Instead, the data relating to the first machine are given by *P*{*c*_{1} ∪ *d*_{1}} = 0.25 and *P*{*d*_{1}|*c*_{1} ∪ *d*_{1}} = 0.05.

It follows that *P*{*d*_{1}} = 0.25 × 0.05 = 0.0125 and similarly for the other points. This example may also serve to illustrate the reasoning following Bayes concerning the probability of causes. Supposing a bolt was found to be defective (hypothesis *H*), what is the probability that it came from the first machine (cause *A*)? Here *P*{*H*} = *P*{*d*_{1}} + *P*{*d*_{2}} + *P*{*d*_{3}} = 0.0125 + 0.0140 + 0.0080 = 0.0345, and thus the required answer is given by *P*{*A*|*H*} = *P*{*d*_{1}}/*P*{*H*} = 0.0125/0.0345 ≈ 0.36.
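The bolt-factory arithmetic, using the machine shares and defect rates stated above, can be checked in a few lines:

```python
# Machine shares and defect rates as given in the text.
shares = [0.25, 0.35, 0.40]
defect_rates = [0.05, 0.04, 0.02]

# P{d_i}: bolt is defective AND comes from machine i.
p_defective_and_machine = [s * d for s, d in zip(shares, defect_rates)]
p_defective = sum(p_defective_and_machine)           # 0.0345

# Bayes: probability of the "cause" machine 1, given a defective bolt.
p_first_given_defective = p_defective_and_machine[0] / p_defective
```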

In tabulation (1) the probability of *H* “ball *a* is in the first cell” equals 1/3, and the probability of *A* “first cell is multiply occupied” equals 7/27. Now given that the ball *a* is in the first cell, the conditional probability that this cell is multiply occupied becomes 5/9. The knowledge that *H* has occurred should increase one's readiness to bet on *A.* By contrast, for the event *B* “ball *b* is in the second cell,” *P*{*B*|*H*} = 1/3 = *P*{*B*}, and so the knowledge that *H* has occurred gives no clue as to *B.* Therefore, *B* is said to be independent of *H* if *P*{*B*|*H*} = *P*{*B*}, that is, if *P*{*B* ∩ *H*} = *P*{*B*}*P*{*H*}.

Clearly, in this case *P*{*H*|*B*} = *P*{*H*} so that *H* is also independent of *B.* Accordingly, two events *B* and *H* are independent of each other if the probability of their simultaneous occurrence follows the multiplication rule *P*{*B* ∩ *H*} = *P*{*B*}*P*{*H*}. This notion carries over to systems of more than two events.
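These conditional probabilities can be verified by enumerating the 27 equally likely placements (cells numbered 0–2 is an arbitrary convention of the sketch):

```python
from itertools import product
from fractions import Fraction

# 3 distinguishable balls a, b, c in 3 cells; 27 equally likely points.
space = list(product(range(3), repeat=3))   # (cell of a, cell of b, cell of c)

def prob(event):
    return Fraction(sum(1 for pt in space if event(pt)), len(space))

def cond(event, given):
    return prob(lambda pt: event(pt) and given(pt)) / prob(given)

H = lambda pt: pt[0] == 0             # ball a in first cell
A = lambda pt: pt.count(0) >= 2       # first cell multiply occupied
B = lambda pt: pt[1] == 1             # ball b in second cell
```

One finds *P*{*A*|*H*} = 5/9 > *P*{*A*} = 7/27, while *P*{*B*|*H*} = *P*{*B*} = 1/3: *B* is independent of *H*.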

#### Independent trials

The intuitive frequency interpretation of probability is based on the concept of experiments repeated under identical conditions; a theoretical model for this concept can be developed.

Consider an experiment described by a sample space Ω; for simplicity of language it can be assumed that Ω consists of finitely many sample points *E*_{1}, … , *E*_{N}. When the same experiment is performed twice in succession, the thinkable outcomes are the *N*^{2} pairs of sample points (*E*_{1}, *E*_{1}), (*E*_{1}, *E*_{2}), … , (*E*_{N}, *E*_{N}), and these now constitute the new sample space. It is called the combinatorial product of Ω by itself and denoted by Ω × Ω; with reference to analytic geometry, one speaks of the first and second coordinates of the point (*E*_{i}, *E*_{j}). These notions apply equally to infinite sample spaces and to products Ω × Ω × Ω of more than two factors. For example, the cartesian plane of points (*x*, *y*) is the product of the real line by itself. In tossing a coin once, the sample space contains only the points *H* and *T*; tossing the coin *n* times leads to the *n*-fold product whose points have *n* coordinates and are of the form (*HT* ⋯ *T*).

Probabilities must be assigned to the events in the product space. The case of dependent trials will be treated in the next section; if the second trial is independent of the first, the probabilities in the product space follow the product rule *P*{(*E*_{i}, *E*_{j})} = *P*{*E*_{i}}*P*{*E*_{j}}.

In the case of *n* tossings of a coin, this rule leads to the probability 2^{−n} for each sample point, in agreement with the requirement of equally likely cases. The present approach is more flexible and more general, as shown by the Bernoulli trials.

##### Bernoulli trials

Suppose each trial results in success *S* or failure *F*, and *P*{*S*} = *p*, *P*{*F*} = *q* where *p* + *q* = 1. (This may be considered as the model of a skew coin.) A succession of *n* independent trials of this kind leads to the sample space of *n*-tuples (*SFFS* ⋯ *FS*), and the probability of such a point is the product (*pqqp* ⋯ *qp*) obtained on replacing each *S* by *p* and each *F* by *q.*
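The per-point product rule is mechanical; a minimal sketch:

```python
# Probability of a particular sequence of Bernoulli trials: replace each
# success S by p and each failure F by q = 1 - p, then multiply.
def sequence_probability(seq, p):
    q = 1 - p
    prob = 1.0
    for trial in seq:
        prob *= p if trial == "S" else q
    return prob

p_fair = sequence_probability("SSS", 0.5)    # fair coin: (1/2)^3 = 0.125
p_skew = sequence_probability("SFFS", 0.3)   # 0.3 * 0.7 * 0.7 * 0.3
```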

This model has obvious applications to repeated observations and to gambling. Independence is an assumption to be verified experimentally. Conceivably a coin could be endowed with memory and avoid runs of more than 17 successive heads. That the sex distribution within families resembles Bernoulli trials is purely a matter of experience. Many gamblers fully accept the independence and yet believe that they can influence fate by using “systems,” for example, by skipping the game after each failure, or waiting for a run of 3 successes, and so on. The theorem on systems shows this to be a fallacy; a gambler not endowed with foresight may use any system or random choice of the times when he plays or skips the game; he remains confronted with Bernoulli trials and is exactly in the same situation as if he played at each trial. *See also:* **Distribution (probability)**

##### Geometric probabilities

In the interval 0 < *x* < 1 a point is chosen at random. This interval is the sample space and the probability of each subinterval equals its length. The product sample space is the unit square of the *x*,*y* plane, and the probability of any figure equals its area. The event “the two successive choices result in a sum < 1” is represented by the triangle below the main diagonal and has probability 1/2. The event “the greater of the two choices is < *t*” is represented by the square 0 < *x* < *t*, 0 < *y* < *t* and has probability *t*^{2}.
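Both areas can be approximated by a midpoint-rule grid over the unit square (the grid size is an arbitrary choice of the sketch):

```python
# Grid approximation of the two geometric probabilities:
# P{x + y < 1} = 1/2 and P{max(x, y) < t} = t^2.
def grid_prob(event, n=400):
    hits = 0
    for i in range(n):
        for j in range(n):
            # midpoint of each small square of an n-by-n grid
            x, y = (i + 0.5) / n, (j + 0.5) / n
            hits += event(x, y)
    return hits / n**2

p_sum = grid_prob(lambda x, y: x + y < 1)          # near 1/2
p_max = grid_prob(lambda x, y: max(x, y) < 0.7)    # near 0.49
```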

#### Dependent trials; Markov chains

Many phenomena can be analyzed in terms of dependent trials. In describing them it is convenient to adopt the picturesque terminology of urn models, which should not detract from the general nature of the schemes presented.

Consider an urn containing *N* balls, of which *r* are red *R* and *b* = *N* − *r* black *B.* Assuming perfect randomness, the probability that a randomly drawn ball be red equals *r*/*N*. If the ball is replaced and the procedure repeated, the result is Bernoulli trials with *p* = *r*/*N*. Without replacement, the sample space corresponding to two drawings contains four points *RR, RB, BR*, and *BB*, to which probabilities are assigned as follows: If the first ball drawn is red (probability *r*/*N*), the conditional probabilities of *R* and *B* at the second trial become (*r* − 1)/(*N* − 1) and *b*/(*N* − 1). By Eq. (3), therefore, *P*{*RR*} = *r*(*r* − 1)/[*N*(*N* − 1)], *P*{*RB*} = *P*{*BR*} = *rb*/[*N*(*N* − 1)], and *P*{*BB*} = *b*(*b* − 1)/[*N*(*N* − 1)].

A more general urn model is obtained by letting the composition of the urn vary from trial to trial. For definiteness consider the following scheme: Each time a ball is drawn, it is replaced, and *c* balls of the color drawn and *d* balls of the opposite color are added to the urn. Here *c* and *d* are fixed numbers which may be negative. This scheme contains interesting special cases such as the following:

1. When *c* = *d* = 0, drawing with replacement occurs, and for *c* = −1, *d* = 0, drawing without replacement occurs. In the latter case, the process terminates after *N* drawings.

2. The Polya model of contagion is the special case when *c* > 0 is fixed and *d* = 0. Here the drawing of either color increases the probability of the same color at subsequent trials, just as in the contagious disease each occurrence increases the probability of further occurrences. This model represents only a crude first approximation to phenomena of contagion, but it leads to comparatively simple formulas and has been applied with astonishing success to a variety of experiences from sickness insurance to baseball scores.

3. The Ehrenfest model for heat exchange considers two containers, I and II, and *N* particles distributed in them. A particle is chosen at random and moved from its container to the other. This scheme differs only linguistically from the urn scheme. If the particles in I are called red and those in II black, then each trial changes the color of one ball and gives the special case *c* = −1 and *d* = 1.

The probabilities of the various possible outcomes in the general scheme are obtained as above. For example, *P*{*RBR*} = *r*(*b* + *d*)(*r* + *c* + *d*)/[*N*(*N* + *c* + *d*)(*N* + 2*c* + 2*d*)], and so on.
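The general scheme reduces to a product of conditional probabilities, one factor per drawing; a sketch with exact rational arithmetic (the urn contents below are invented for illustration):

```python
from fractions import Fraction

# General urn scheme: after each drawing the ball is replaced, and c balls of
# the color drawn plus d balls of the opposite color are added.
def sequence_prob(seq, r, b, c, d):
    """Probability of an exact color sequence such as 'RBR'."""
    r, b = Fraction(r), Fraction(b)
    prob = Fraction(1)
    for color in seq:
        total = r + b
        if color == "R":
            prob *= r / total
            r += c; b += d
        else:
            prob *= b / total
            b += c; r += d
    return prob

# Drawing without replacement (c = -1, d = 0) from r = 3 red, b = 2 black:
p_rr = sequence_prob("RR", 3, 2, -1, 0)    # (3/5)(2/4) = 3/10
# Polya contagion model (c = 2, d = 0):
p_rbr = sequence_prob("RBR", 3, 2, 2, 0)   # matches the closed formula
```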

Markov chains represent another important scheme for dependent trials. Suppose that at each trial the possible outcomes are *E*_{1}, … , *E*_{N} and that whenever *E*_{i} occurs the conditional probability of *E*_{j} at the next trial is *p*_{ij}, independently of what happened at the preceding trials. Here, of course, *p*_{ij} ≧ 0 and *p*_{i1} + *p*_{i2} + ⋯ + *p*_{iN} = 1 for each *i.* The *p*_{ij} are called transition probabilities. The whole process is now determined if the initial probabilities π_{i} at the first trial are known. For example, *P*{*E*_{a}*E*_{b}*E*_{c}} = π_{a}*p*_{ab}*p*_{bc}. The probability of the event “*E*_{c} at the third trial” is obtained by summation over all *a* and *b*, and so on. Markov chains, and their analogs with continuous time, represent the simplest type of stochastic process. The Ehrenfest model considered above may be treated as a Markov chain by letting *E*_{i} represent the event that container I contains *i* particles. Then *p*_{i,i−1} = *i*/*N*, *p*_{i,i+1} = (*N* − *i*)/*N*, and *p*_{ij} = 0 for all other combinations of *i*, *j.* Other examples of Markov chains are the gambler's accumulated fortune, the composition of a deck of cards under random shuffling, and random walks. Important applications are to queueing theory, where one also encounters processes with more complicated aftereffects.
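The summation over intermediate states is just repeated matrix-vector multiplication; a sketch of the Ehrenfest chain for a hypothetical *N* = 3, started with all particles in container I:

```python
from fractions import Fraction

# Ehrenfest chain, N = 3 particles: state i counts particles in container I;
# p[i][i-1] = i/N, p[i][i+1] = (N-i)/N, all other entries zero.
N = 3
P = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
for i in range(N + 1):
    if i > 0:
        P[i][i - 1] = Fraction(i, N)
    if i < N:
        P[i][i + 1] = Fraction(N - i, N)

# Initial probabilities: all particles in container I.
dist = [Fraction(0)] * (N + 1)
dist[N] = Fraction(1)

# P{E_c at third trial} = sum over a, b of pi_a * p_ab * p_bc:
for _ in range(2):
    dist = [sum(dist[a] * P[a][c] for a in range(N + 1)) for c in range(N + 1)]
```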

*See also:* **Queueing theory**; **Stochastic process**

#### Random variables and their distributions

The theory of probability traces its origin to gambling, and the gambler's gain may still serve as the simplest example of a random variable. With every possible outcome (sample point) there is associated a number, namely, the corresponding gain. In other words, the gain is a function on the sample space, and such functions are called random variables. (In infinite spaces the idea is the same, but a somewhat more cautious definition is in order.) With the same experiment, one may associate many random variables. As an example, consider the sample space of tabulation (1) with probability 1/27 for each point. A typical random variable is the number *N* of occupied cells; it assumes the value 1 at the three points numbered 1–3; the value 2 at the eighteen points 4–21; and the value 3 at the six points 22–27. One says, therefore, that the probability distribution of *N* is given by *P*{*N* = 1} = 1/9, *P*{*N* = 2} = 2/3, *P*{*N* = 3} = 2/9. Another variable is the number *X* of balls in the first cell. An inspection of tabulation (1) shows that its probability distribution is given by *P*{*X* = 0} = 8/27, *P*{*X* = 1} = 12/27, *P*{*X* = 2} = 6/27, *P*{*X* = 3} = 1/27. One may also consider the two variables simultaneously and find, for example, that the combination *N* = 1, *X* = 0 occurs at two points, whence *P*{*N* = 1, *X* = 0} = 2/27. The probabilities of all pairs are given by the joint probability distribution of *N* and *X* exhibited in tabulation (4).

Adding the entries in the rows and columns gives the distribution of *N* and *X*, respectively, and they are therefore occasionally called marginal distributions. *See also:* **Combinatorial theory**
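The joint distribution and its marginals can be built by enumeration (a sketch; cells numbered 0–2 is an arbitrary convention):

```python
from itertools import product
from fractions import Fraction
from collections import Counter

space = list(product(range(3), repeat=3))   # 3 balls in 3 cells, 27 points

joint = Counter()
for pt in space:
    n = len(set(pt))    # number of occupied cells
    x = pt.count(0)     # number of balls in the first cell
    joint[(n, x)] += 1
joint_prob = {k: Fraction(v, 27) for k, v in joint.items()}

# Marginal distributions: sum the joint probabilities over rows and columns.
dist_n, dist_x = Counter(), Counter()
for (n, x), p in joint_prob.items():
    dist_n[n] += p
    dist_x[x] += p
```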

The “Geometric probabilities” example given earlier may be used to illustrate the case of continuous random variables. This example considers two consecutive selections of a point in the interval 0 < *x* < 1. Let *S* be the random variable denoting the sum of the two choices, and *L* the larger of the two. One sees that for 0 < *t* < 1 the event *L* ≦ *t* has probability *t*^{2}; thus *P*{*L* ≦ *t*} = *F*(*t*) = *t*^{2} when 0 ≦ *t* ≦ 1; for *t* < 0 and for *t* > 1 one has trivially *F*(*t*) = 0 and *F*(*t*) = 1, respectively. This is the distribution function of *L.* From it can be calculated all probabilities relating to *L.* Similarly the event *S* ≦ *u* is represented by the region in the unit square below the line *x* + *y* = *u*; therefore, the distribution function of *S*, namely, *P*{*S* ≦ *u*} = *G*(*u*), is given by *G*(*u*) = 0 for *u* ≦ 0, *G*(*u*) = (1/2)*u*^{2} for 0 ≦ *u* ≦ 1, *G*(*u*) = 1 − (1/2)(2 − *u*)^{2} for 1 ≦ *u* ≦ 2, and *G*(*u*) = 1 for *u* ≧ 2. In like manner, the joint distribution function *P*{*L* ≦ *t*, *S* ≦ *u*} = *H*(*t*,*u*) of the pair *L*,*S* can be calculated.

Every random variable *X* has a distribution function *F*(*t*) = *P*{*X* ≦ *t*}. If *X* assumes only finitely many values, then *F*(*t*) is a step function. Thus, in tabulation (4), *F*(*t*) assumes the values 0, 8/27, 20/27, 26/27, and 1, respectively, in the intervals *t* < 0, 0 ≦ *t* < 1, 1 ≦ *t* < 2, 2 ≦ *t* < 3, and *t* ≧ 3. In such cases the notion of distribution function is used mainly for uniformity of language. The notion is really convenient when *F*(*t*) is not only continuous but also has a derivative *f*(*t*) = *F*′(*t*); then *f*(*t*) is called the probability density of *X.* In the above example the variable *L* has a density defined by 2*t* for 0 < *t* < 1 and 0 elsewhere; the density of *S* is 0 for *u* < 0 and *u* > 2; it equals *u* for 0 < *u* < 1 and equals 2 − *u* for 1 < *u* < 2.

The notion of independence carries over: Two random variables *X* and *Y* are independent if *P*{*X* ≦ *s*, *Y* ≦ *t*} = *P*{*X* ≦ *s*} · *P*{*Y* ≦ *t*} for all *s* and *t.* It is easily seen that for independent variables with distribution functions *F*(*t*) and *G*(*t*) the distribution function of the sum *S* = *X* + *Y* is given by the convolution shown in Eq. (5),

*H*(*t*) = ∫ *F*(*t* − *s*) *dG*(*s*)  (5)

In terms of densities, Eq. (5) reads as Eq. (6).

*h*(*t*) = ∫ *f*(*t* − *s*)*g*(*s*) *ds*  (6)

In the random choice example, the coordinates of the points chosen are independent variables with the rectangular density *f*(*s*) = *g*(*s*) = 1 for 0 < *s* < 1. The distribution of their sum *S*, which has been calculated above, can also be found using Eq. (6).
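A numerical sketch of this convolution (midpoint rule; the grid size is arbitrary) recovers the triangular density of *S* stated above: *u* on (0, 1) and 2 − *u* on (1, 2):

```python
# Convolution of two rectangular densities on (0, 1).
def f(s):
    return 1.0 if 0 < s < 1 else 0.0

def convolve_at(t, n=4000):
    # h(t) = integral of f(t - s) * f(s) ds, approximated on (0, 1)
    step = 1.0 / n
    return sum(f(t - (k + 0.5) * step) * step for k in range(n))

h_half = convolve_at(0.5)           # exact density: 0.5
h_one = convolve_at(1.0)            # exact density: 1.0 (the peak)
h_three_halves = convolve_at(1.5)   # exact density: 2 - 1.5 = 0.5
```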

#### Expectations

Given a random variable *X* one may interpret its distribution function *F*(*t*) as describing the distribution of a unit mass along the real axis such that the interval *a* < *x* ≦ *b* carries mass *F*(*b*) − *F*(*a*). In the case of a discrete variable assuming the values *x*_{1}, *x*_{2}, … with probabilities *p*_{1}, *p*_{2}, … the entire mass is concentrated at the points *x*_{i}; if *F*′(*x*) = *f*(*x*) exists, it represents the ordinary mass density as defined in mechanics. The center of gravity of this mass distribution is called the expectation of *X*; the usual symbol for it is *E*(*X*), but physicists and engineers use notations such as ⟨*X*⟩, ⟨*X*⟩_{Av}, or *X̄*. In the cases mentioned, *E*(*X*) = Σ *x*_{i}*p*_{i} and *E*(*X*) = ∫ *x f*(*x*) *dx*, respectively.

In all cases *E*(*X*) is given by the Stieltjes integral ∫ *x* *dF*(*x*). (To be precise, one speaks of expectations only when the integral converges absolutely.)
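For the two discrete variables of tabulation (1), the expectations follow directly from the distributions stated earlier:

```python
from fractions import Fraction

# Distributions of N (occupied cells) and X (balls in the first cell).
dist_n = {1: Fraction(1, 9), 2: Fraction(2, 3), 3: Fraction(2, 9)}
dist_x = {0: Fraction(8, 27), 1: Fraction(12, 27),
          2: Fraction(6, 27), 3: Fraction(1, 27)}

def expectation(dist):
    # E(X) = sum of x_i * p_i
    return sum(value * p for value, p in dist.items())

e_n = expectation(dist_n)   # 19/9
e_x = expectation(dist_x)   # 1: on average one of three balls per cell
```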

Before discussing the significance of the new concept, a few frequently used definitions are appropriate. Put *m* = *E*(*X*). Then (*X* − *m*)^{2} is, of course, a random variable. In mechanics, its expectation represents the moment of inertia of the mass distribution. In probability, it is called the variance of *X*; its positive root is the standard deviation. Clearly, Var(*X*) = *E*[(*X* − *m*)^{2}] = *E*(*X*^{2}) − *m*^{2}.

The variance is a measure of spread: It is zero only if the entire mass is concentrated at the point *m*, and it increases as the mass is moved away from *m.* In the case of two variables *X*_{1} and *X*_{2} with expectations *m*_{1} and *m*_{2} it is necessary to consider not only the two variances *s*^{2}_{i} = *E*[(*X*_{i} − *m*_{i})^{2}] but also the covariance Cov(*X*_{1}, *X*_{2}) = *E*[(*X*_{1} − *m*_{1})(*X*_{2} − *m*_{2})] = *E*(*X*_{1}*X*_{2}) − *m*_{1}*m*_{2}. The covariance divided by *s*_{1}*s*_{2} is called the correlation coefficient of *X*_{1} and *X*_{2}. If it vanishes, *X*_{1} and *X*_{2} are called uncorrelated. Every pair of independent variables is uncorrelated, but the converse is not true.

If *X*_{1}, *X*_{2}, … , *X*_{n} are random variables with expectations *m*_{1}, … , *m*_{n} and variances *s*^{2}_{1}, …, *s*^{2}_{n}, the expectation of their sum *S*_{n} = *X*_{1} + ⋯ + *X*_{n} is always given by *E*(*S*_{n}) = *m*_{1} + ⋯ + *m*_{n}; if all the covariances of *X*_{i} and *X*_{j} vanish, then clearly Var(*S*_{n}) = *s*^{2}_{1} + ⋯ + *s*^{2}_{n}.
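Both addition rules can be checked by enumeration for two independent throws of a perfect die:

```python
from itertools import product
from fractions import Fraction

# One throw: expectation 7/2, variance 35/12.
faces = range(1, 7)
m = sum(Fraction(x, 6) for x in faces)
var = sum(Fraction(1, 6) * (x - m) ** 2 for x in faces)

# Sum of two independent throws, by enumeration of the 36 pairs.
pairs = list(product(faces, repeat=2))
m_sum = sum(Fraction(a + b, 36) for a, b in pairs)
var_sum = sum(Fraction(1, 36) * (a + b - m_sum) ** 2 for a, b in pairs)
# expectation adds always; variance adds because the covariance vanishes
```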

When *X* represents a physical quantity, then *X*^{*} = (*X* − *m*)*s*^{−1} represents the same quantity measured from a different origin and in new units. In the physicist's terminology, *X*^{*} is the quantity *X* referred to dimensionless units. In probability, *X*^{*} is called the reduced or standardized variable.

It was once assumed that every reasonable random variable has finite expectation and variance. Modern theory refutes this assumption. Many recurrence times in important physical processes have no finite expectations. Even in the simple coin-tossing game, the number of trials up to the time when the gambler's accumulated gain first reaches a positive level has infinite expectation.

#### Laws of large numbers

To explain the meaning of the expectation and, at the same time, to justify the intuitive frequency interpretation of probability, consider a gambler who at each trial may gain the amounts *x*_{1}, *x*_{2}, …, *x*_{n} with probabilities *p*_{1}, *p*_{2}, …, *p*_{n}. The gains at the first and second trials are independent random variables *X*_{1}, *X*_{2} with the indicated distribution and the common expectation *m* = *p*_{1}*x*_{1} + ⋯ + *p*_{n}*x*_{n}. The event that an individual gain equals *x*_{i} has probability *p*_{i}, and the frequency interpretation of probability leads one to expect that in a large number *n* of trials this event should happen approximately *np*_{i} times. If this is true, the total gain *S*_{n} = *X*_{1} + *X*_{2} + ⋯ + *X*_{n} should be approximately *nm*; that is, the average gain (1/*n*)*S*_{n} should be close to *m.* The law of large numbers in its simplest form asserts this to be true. More precisely, for each ∊ > 0 it assures one that *P*{|(1/*n*)*S*_{n} − *m*| > ∊} → 0 as *n* → ∞.

As a special case, one can obtain a frequency interpretation of probability. In fact, consider an event *A* with *P*{*A*} = *p* and suppose that in a sequence of independent trials a gambler receives a unit amount each time *A* occurs. Then the expectation of the individual gain equals *p*, and *S*_{n} is the number of times the event *A* has occurred in *n* trials. It follows that *P*{|(1/*n*)*S*_{n} − *p*| > ∊} → 0 as *n* → ∞; that is, the relative frequency of the occurrence of *A* is likely to be close to *p.*
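A simulation sketch of this frequency interpretation (the seed and trial counts are arbitrary choices): the relative frequency of heads in *n* fair-coin trials settles near *p* = 1/2 as *n* grows.

```python
import random

random.seed(7)   # arbitrary seed, for reproducibility only

def relative_frequency(n, p=0.5):
    # fraction of n independent trials in which the event occurs
    successes = sum(1 for _ in range(n) if random.random() < p)
    return successes / n

freqs = {n: relative_frequency(n) for n in (100, 10_000, 1_000_000)}
```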

Without this theorem, probability theory would lose its intuitive foundation, but its practical value is minimal because it tells one nothing concerning the manner in which the averages *n*^{−1}*S*_{n} are likely to approach their limit *m.* In the regular case where the *X*_{j} have finite variances, the central limit theorem gives much more precise and more useful information; for example, it tells one that for large *n* the difference *S*_{n} − *nm* is about as likely to be positive as negative and is likely to be of the magnitude *n*^{1/2}. When the *X*_{k} have no finite variances, the central limit theorem fails and the sums *S*_{n} may behave oddly in spite of the law of large numbers. For example, it is possible that *E*(*X*_{k}) = 0 but *P*{*S*_{n} < 0} → 1. In gambling language this game is “fair,” and yet the gambler is practically certain to sustain an ever-increasing loss.

There exist many generalizations of the law of large numbers, and they cover also the case of variables without finite expectation, which play an increasingly important role in modern theory. *See also:* **Game theory**; **Probability (physics)**; **Statistics**