Friday 29 March 2013

Selecting a probability model for your data

A common task is to choose a statistical model for a given set of data. Here are some general ideas on how to start thinking about which statistical model is most appropriate:

Continuous data eg. measurements, time periods, very large numbers of counts

Examples of continuous distributions are exponential, continuous uniform, and Normal distributions.

Measurement data eg. heights of people; weights of people: a Normal distribution might be appropriate. Characteristics of data for which a Normal model may be appropriate are: the distribution has range from -Infinity to Infinity; is symmetric around a single mode that coincides with the mean; and values that are far from the mean are unlikely.

Very large counts of objects observed eg. number of lottery tickets sold per week; number of fish caught each week by European fishermen: a Normal distribution might be appropriate. You can use a Normal probability plot to help you decide whether a Normal model is appropriate.
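As a rough illustration of what a Normal probability plot checks, here is a minimal Python sketch using only the standard library. It pairs the sorted data with standard Normal quantiles and measures how close the pairs are to a straight line. The data are simulated, the names `normal_plot_points` and `correlation` are made up for this example, and the plotting positions (i - 0.5)/n are just one common convention.

```python
import random
from statistics import NormalDist

def normal_plot_points(data):
    """Return (theoretical quantile, observed value) pairs for a Normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n are an assumed convention, not the only choice.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

def correlation(pairs):
    """Pearson correlation of the pairs; a value close to 1 means a near-straight plot."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed so the sketch is repeatable
heights = [rng.gauss(170, 10) for _ in range(500)]     # simulated, roughly Normal
waits = [rng.expovariate(1 / 10) for _ in range(500)]  # simulated, strongly right-skewed

r_heights = correlation(normal_plot_points(heights))
r_waits = correlation(normal_plot_points(waits))
print(f"Normal-like data: r = {r_heights:.3f}")
print(f"Skewed data:      r = {r_waits:.3f}")
```

The points for the Normal-like sample should lie much closer to a straight line than those for the skewed sample. In practice you would usually look at the plot itself rather than rely on a single number.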

Waiting times between successive events eg. time intervals between buses passing a particular bus stop; time intervals between earthquakes: an exponential distribution might be appropriate. An assumption of an exponential model is that events occur at random in time. Characteristics of data for which an exponential distribution may be appropriate are: the distribution has range from 0 to Infinity; is right-skewed (has skewness of 2); has its mode at 0; and the mean and standard deviation of an exponential distribution are equal. You can use an exponential probability plot to help you decide whether an exponential model is appropriate.
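The two checks mentioned above (mean roughly equal to standard deviation, and an exponential probability plot) could be sketched in Python as follows. The waiting times are simulated with a hypothetical mean gap of 12 minutes, and `exponential_plot_points` is a name invented for this example. The quantile formula -ln(1 - p) is the inverse of the standard exponential distribution function.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(2)  # fixed seed so the sketch is repeatable
# Simulated gaps between buses, in minutes (hypothetical mean gap of 12).
gaps = [rng.expovariate(1 / 12) for _ in range(2000)]

gap_mean = mean(gaps)
gap_sd = stdev(gaps)
ratio = gap_sd / gap_mean  # near 1 if an exponential model is plausible

def exponential_plot_points(data):
    """Return (standard exponential quantile, observed value) pairs."""
    ordered = sorted(data)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n are an assumed convention.
    return [(-math.log(1 - (i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

points = exponential_plot_points(gaps)
print(f"mean = {gap_mean:.2f}, sd = {gap_sd:.2f}, sd/mean = {ratio:.2f}")
```

If the exponential model is reasonable, plotting `points` should give roughly a straight line through the origin, with a slope close to the mean waiting time.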

Which particular value occurs, out of all possible values in a particular interval (a, b): a continuous uniform distribution might be appropriate. An assumption of a continuous uniform distribution is that each value in the interval is believed to be equally likely. Characteristics of data for which a continuous uniform distribution may be appropriate are: the distribution has finite range from a to b, and has no mode.

Discrete data eg. small numbers of counts

Examples of discrete distributions are Bernoulli (really a special case of the binomial with n = 1), binomial, discrete uniform, geometric and Poisson distributions.

Counts of objects or events observed in a fixed interval of time/space eg. number of goals in each of 100 different soccer matches; number of fish caught in each of 100 different nets: a Poisson distribution might be appropriate. An assumption of a Poisson(mu) distribution is that events occur at random. Characteristics of data for which a Poisson model may be appropriate are: the distribution has an unbounded range {0,1,2...}; is right-skewed (though nearly symmetric when the mean is large); has one mode (or two adjacent modes when mu is a whole number); and the mean and variance of a Poisson distribution are equal.
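A quick way to check the "mean equals variance" characteristic is to compare the sample mean and sample variance of the counts. The Python sketch below does this for simulated data. The counts are generated by counting events whose gaps are exponential, which is one way of modelling events that occur at random. The rate of 2.5 goals per match and the name `count_events` are assumptions made for the example.

```python
import random
from statistics import mean, variance

def count_events(rate, rng):
    """Count events in one unit of time when gaps between events are exponential."""
    elapsed = rng.expovariate(rate)
    count = 0
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(rate)
    return count

rng = random.Random(3)  # fixed seed so the sketch is repeatable
# Simulated goals per match, with a hypothetical rate of 2.5 goals per match.
goals = [count_events(2.5, rng) for _ in range(2000)]

goals_mean = mean(goals)
goals_var = variance(goals)
dispersion = goals_var / goals_mean  # near 1 if a Poisson model is plausible
print(f"mean = {goals_mean:.2f}, variance = {goals_var:.2f}, ratio = {dispersion:.2f}")
```

A ratio well above 1 would suggest the counts are more spread out than a Poisson model allows, and a ratio well below 1 would suggest they are less spread out.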

Number of successes in n trials, eg. number of sixes that you get in 100 throws of a die; number of faulty light-bulbs out of 10,000 produced by a factory; number of people, out of 1000 suffering from a particular disease, who recover when given a particular drug: a Binomial distribution might be appropriate. Assumptions of a B(n, p) model are that the n trials are independent, and that there is a constant probability p of success at each trial. Characteristics of data for which a Binomial model may be appropriate are: the distribution has a finite range {0...n}; and usually has just one mode (two adjacent modes occur when (n + 1)p is a whole number).
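To make the B(n, p) model concrete, here is a small Python sketch that computes the binomial probabilities for the die example (n = 100 throws, p = 1/6 for a six) and finds the most likely number of sixes. The function name `binomial_pmf` is chosen for this example only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 1 / 6  # 100 throws of a fair die, success = throwing a six
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)  # probabilities over the finite range {0...n} sum to 1
mode = max(range(n + 1), key=lambda k: pmf[k])  # most likely number of sixes
print(f"total probability = {total:.6f}, mode = {mode}")
```

Comparing observed frequencies with probabilities like these is one simple way to judge whether the independence and constant-p assumptions look reasonable.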

Number of trials up to and including the first success, eg. number of times you throw a die until you get the first six (counting the throw that gives the six): a Geometric distribution might be appropriate. Assumptions of a G(p) distribution are that you have a sequence of independent trials, and that there is a constant probability of success p at each trial. Characteristics of data for which a Geometric model may be appropriate are: the distribution has an unbounded range {1,2,3...}; is right skewed; and has one mode at 1.
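The die example can be simulated directly. The Python sketch below counts throws up to and including the first six, repeats this many times, and looks at the smallest value, the most common value, and the average. The function name `throws_until_six` and the number of repetitions are arbitrary choices for this example.

```python
import random
from collections import Counter
from statistics import mean

def throws_until_six(rng):
    """Throw a fair die until a six appears; return the number of throws made."""
    throws = 0
    while True:
        throws += 1
        if rng.randint(1, 6) == 6:
            return throws

rng = random.Random(5)  # fixed seed so the sketch is repeatable
counts = [throws_until_six(rng) for _ in range(20000)]

average = mean(counts)  # a G(p) distribution has mean 1/p, here 6
most_common = Counter(counts).most_common(1)[0][0]  # the mode of the simulated counts
print(f"min = {min(counts)}, mode = {most_common}, mean = {average:.2f}")
```

With a sample this large, the most common value should be 1 and the average should be close to 6, in line with the characteristics listed above.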

Which particular outcome happens out of a set of several equally likely outcomes, eg. getting 1, 2, 3, 4, 5, or 6 when you throw a die: a discrete uniform distribution might be appropriate. Assumptions of a discrete uniform distribution are that there is a finite set of outcomes possible, and that every outcome is believed to be equally likely. Characteristics of data for which a discrete uniform model may be appropriate are: the distribution has a finite range, and no mode.
