## Statistical functions

### Descriptive Statistics

All descriptive statistics functions operate on data stored in either arrays or matrices. If the data is stored in a matrix, the descriptive measure is computed for each column.

The set of implemented functions includes: `mean()`, `median()`, `mode()`, `geometricMean()`, `harmonicMean()`, `rootMeanSquare()`, `quantile()`, `inverseQuantile()`, `range()`, `sum()`, `variance()`, `varianceMLE()`, `standarDeviation()`, `meanDeviation`, `moment()`, `skew()`, `kurtosis()`, `kstatistics()`, `zeroMean()`, `standarize()`, `kurtosisExcess()`, `trimmedMean()`, `covariance()`, `covarianceMLE()`, `correlation()`, `autocorrelation()`, `max()`, `min()`.

Additional functions exist to sample and compute frequencies and probabilities. The following lines show examples of using some of the functions above.

Let's store some data in an array and find the mean of the values:

a = ( 3, 7, 5, 1, 12, 2 )
3  7  5  1  12  2
m = mean(a)
5

The sample variance and standard deviation are:

variance( a )
16.4
standardDeviation( a )
4.0497

The maximum likelihood estimate of the variance is calculated as follows:

varianceMLE( a )
13.6667

Let's create some random data in a matrix using function `randommatrix()`. The first two parameters define the number of rows and columns while the next to define the bounds of the generated values. In this case, we generate integers between 1 and 10.

b = randommatrix(5,3, 1,10)
10  8  1
7  1  1
9  1  9
10 10  8
4  4  5

The kurtosis of each column is computed as follows:

kurtosis( b )
2.145  1.4045  1.2911

All other statistical measures are found in a similar way. If the data is stored in a matrix, then the descriptive measure is computed for each column. See detailed description of each function in the Function Reference.

### Plotting the Data

Depending on the nature of your data, several plotting options are available. Let's assume your data is stored in the array `x` below.

x
12.1482 5.6388 7.4344 12.9942 7.2976 6.0825 8.3326
16.6051 13.8746 4.3337 15.2019 15.878 8.682 12.1221
9.1053 14.2063 26.2596 11.0928 7.851 12.6484 6.8564
9.8487 11.9521 7.8879 11.8009 10.3276 10.9104 6.5925
4.4416 10.8958 2.76 10.3302 6.4403 12.5306 3.8333
10.7918 16.8642 11.1614 8.7245 18.168 5.9511 12.4027
9.5716 8.6207 12.7529 8.0273 15.105 6.5782 6.3219 7.2727
7.3226 14.7494 7.2624 7.8146 11.8298 12.8237 6.6754
10.4441 7.2507 5.1458 5.6291 13.2985 4.5098 6.8594
14.8297 6.0229 5.2568 4.182 12.6368 10.9129 8.2295
14.0113 6.7413 4.21 12.2426 12.288 13.891 8.4721 11.7411
11.3759 6.6798 13.8307 15.8475 10.605 16.8224 8.8597
9.6864 5.4744 6.5236 8.1828 5.1005 12.2552 4.5843
19.0124 6.1869 4.1979 4.8829 12.1826 6.4031 10.4952

We use function `frequencies()` with a class width equal to 2.0. This function computes the number of data points withing each class. Function `barchart()` can be used to display the resulting matrix.

h = frequencies(x,2)
3.0   2
5.0  15
7.0  23
9.0  13
11.0  16
13.0  17
15.0   8
17.0   3
19.0   2
21.0   0
23.0   0
25.0   0
27.0   1
barchart( h )

Figure 1. Displaying frequencies.

A custom plot can be prepared using function `plot()`. Function `plot()` can display the contents of any two columns of a matrix.

plot( h, 1, 2, "noline", "bar:1.8:red" )

Figure 2. Custom plot. Class width equal to 2.0.

Let's select 1.0 as a new reference to create the classes, and compare the new plot.

h2 = probabilities(x,2,1)
plot( h2,1, 2, "noline", "bar:1.8:nofill" )

Figure 3. Custom plot. Reference equal to 1.0.

A box-and-whiskers plot of the data can be constructed as follows. Function `boxwhiskers()` expects the data stored in a matrix. Each row corresponds to a data set. In this case we build a matrix with one row only.

b[1,1] = "x"
b[1,2] = min(x)
b[1,3] = quantile(x,.25); // first quartile
b[1,4] = mean(x)
b[1,5] = quantile(x,.75); // fourth quartile
b[1,6] = max(x)
boxwhiskers( b, "horizontal", "background:white" )

Figure 4. Box-and-whiskers plot of data.

Depending on your data, you may also use functions `pie()` or `pareto()`.

### Working the Data

There are several functions you may use to work your data, sort it, or find an arbitrary proportion.

Function `sort()` can be used to sort your data in ascending order.

a=( 1, 3, -2, 7, 5);
sort(a)
-2  1  3  5  7

Function `apply()` can modify each entry of an array or matrix. As an illustration, let's assume you need to square each entry and divide it by 3. You would start by defining a suitable function and applying it to each entry of the array as follows:

f(x) = x*x/3;
a
-2  1  3  5  7
apply( a, f )
4/3  1/3  3  25/3  49/3

To find a proportion, use function `findall()`. For example, to find all the entries x such that 3<x<5, begin by defining a function and then use `findall()` as shown below. Function `findall()` would return an array with all the entries satisfying the condition.

f(x) = 0 < x && x < 5;
b = findall( a, f );

### Probability Distributions

Calcugator has a practical way of dealing with probability distributions; it treats a probability distribution like an "object" you create and you query for information.

To create a probability distribution use the `distribution()` function. For example, a StudentT distribution with 10 degrees of freedom is created as follows:

d = distribution("StudentT", 10)
Distribution StudentT (n=10.0)

To find the value of the probability density function of this distribution at x=2.2, do:

pdf(d, 2.2)
0.0444

To find the value of the cumulative probability for x=2.2, do:

cdf(d, 2.2)
0.9738

To find the value of x such that the area of the left tail is 0.2, do:

leftTail(d, 0.2)
-0.8791

To find the value of x such that the area of the right tail is 0.05, do:

rightTail(d, 0.05)
1.8125

To create a plot of the distribution pdf, create a function and use the `plot()` function as follows:.

f(x) = pdf(d,x)
plot(f,-5,5, 100)

Figure 5. StudentT distribution (n=10.0).

A distribution object can also be used to generate random variates. For example, to generate 1000 random numbers from this distribution, do:

x = variate(d,1000);

Notice the semicolon after the command. Using a semicolon cancels the display of the command's response. In this case we avoid displaying 1000 numbers.

To see a visual display of the generated numbers, first use the `probabilities()` function. The first parameter is an array and the second is the class width. Using a class width equal to 1.0 we obtain the results below.

h = probabilities(x,1)
-4.5   0.001
-3.5   0.009
-2.5   0.031
-1.5   0.15
-0.5   0.304
0.5   0.324
1.5   0.132
2.5   0.041
3.5   0.008

The first column of matrix h is the center of each class (bin) and the second column is the frequency of points in that class divided by the number of points. A histogram can be created by using the "bar" option in the `plot()` function. Notice that the `plot()` function can plot any two columns of a given matrix. In this case we plot column 1 versus column 2.

plot( h, 1, 2, "noline", "bar:0.85:nofill:red" )

Figure 6. Random variates, StudentT (n=10.0)

The distribution object can also inform about the theoretical mean and variance.

mean(d)
0.0
variance(d)
1.25

Function `pdfinfo()` displays a summary of the probability function used by a distribution.

### Supported Distributions

Here is a list of the supported distributions. Each one is presented with an example.

Poisson distribution

m = 2.2 // mean
d = distribution( "Poisson", m )

Geometric distribution

a = 0.45 // 0<a<1
d = distribution( "Geometric", a )

Logarithmic distribution

p = 0.5 // 0<p<1
d = distribution( "Logarithmic", p )

Binomial distribution

n = 10
p = 0.6 // 0<p<1
d = distribution( "Binomial", n, p )

Negative binomial distribution

n = 10
p = 0.5 // 0<p<1
d = distribution( "NegativeBinomial", n, p )

Hyper geometric distribution

N = 10
s = 3
n = 2
d = distribution("Hypergeometric", N, s, n )

Uniform discrete distribution

m1 = 5
m2 = 10
d = distribution( "UniformDiscrete", m1, m2 )

Chi square distribution

df = 15
d = distribution( "ChiSquare", df )

Non central Chi Square distribution

df = 10
nc = 5.5
d = distribution( "NonCentralChiSquare", df, nc )

Non central Student-t distribution

df = 15
nc = 4.4
d = distribution( "NonCentralStudentT", df, nc )

Gaussian distribution

m = 2.3
s = 0.65
d = distribution( "Gauss", m , s )

Gamma distribution

alpha = 2.1
lambda = 0.8
d = distribution( "Gamma", alpha, lambda )

Uniform distribution

u1 = 3.33
u2 = 8.83
d = distribution( "Uniform", u1, u2 )

Beta distribution

a = 0.3
b = 2.1
d = distribution( "Beta", a, b )

Student-t distribution

df = 10
d = distribution( "StudentT", df )

Laplace distribution

a = 2.2
b = 3.5
d = distribution( "Laplace", a, b )

Logistic distribution

a = 3.2
b = 2.5
d = distribution( "Logistic", a, b )

Exponential distribution

lambda = 2.9
d = distribution( "Exponential", lambda )

F distribution

n = 10
m = 12
d = distribution( "Fisher", n, m )

Non central F distribution

n = 10
m = 12
nc = 3.3
d = distribution( "NonCentralFisher", n, m, nc )

### z-values

Functions `snd()` and `invsnd()` provide the so called z-values.

Standard Normal Distribution. Function `snd(z)` returns the area under the standard normal distribution from `0.` to `z` (z>0). The function replaces the use of the typical tables found in textbooks.

Example:

snd(1.2)
0.3849

Figure 7. Find area given z.

There is also a utility function `invsnd(a)` which returns the value `z` such that the area under the standard normal distribution curve is equal to `a`. Example:

invsnd(0.3849)
1.1998

Figure 8. Find z given area.

### Random Values

The next group of functions are useful to generate sets of random results. Calcugator uses the Mersenne Twister algorithm as its core random engine.

• `random()` returns a random real number between 0. and 1.
• `random(n)` returns an array of size `n` with random numbers between 0. and 1.0.
• `random(n,v)` returns an array of size `n` with random values. If value `v` is real, all values are real between `0.` and `v`. If value `v` is integer, all values are integers between `0` and `v` inclusive.
• `random(n,v1,v2)` returns an array of size `n` with random values. If any of the values `v1` or `v2` is real, all values are real between `v1` and `v2`. If both values `v1` and `v2` are integer, all values integer between `v1` and `v2` inclusive.
• `coin()` returns either "H" (Heads) or "T" (Tails) at random.
• `coin(n)` returns an array of size `n` with random Heads and Tails.
• `dice()` returns a random integer between 1 and 6.
• `dice(n)` returns an array of size `n` with random integers between 1 and 6.
• `die()` returns two random integers between 1 an 6.
• `die(n)` returns an array of size `n` with random pairs of integers between 1 and 6.

See section Random matrices to create random matrices.

### Linear Regression

Use function `linearreg` to find the least squares regression line:

• linearreg( x, y )

`f` is any name you choose and `x` and `y` are the arrays containing the x and and y coordinates respectively.

For example,

x = ( 4, 6, 8, 9, 11, 14 )
4  6  8  9  11  14
y = ( 8, 10, 18, 19, 19, 30 )
8  10  18  19  19  30

a least squares regression line is obtained as follows,

g = linearreg( x, y )
g(x) = (2.1263 * x + -1.0947)

now you may use function `g` like any other user defined function.

Let's plot the data. Let's use blue squares of size 8 pixels, no fills, and let's join the points using gray lines,

plot( x, y, "square:blue:8:nofill", "line", "gray")

Let's plot the linear regression line on the interval [4,14] using 30 points to build the line and using color red:

plot( g, 4, 14, 30, "red")

Figure 9. Linear regression example.

Another statistical measure of the linear regression model is the correlation coefficient. To obtain it, use function `corrcoeff(x,y)`.

• `c=corrcoeff(x,y)` returns the correlation coefficient corresponding to the data in arrays `x` and `y`.

Example:

corrcoef(x,y)
0.9652

### General Linear Models

What follows is an example of finding fitting parameters for a second degree polynomial using the Least Squares method. Assume the data is a set of values (x,y) stored in a matrix. Column 1 stores the x values and column 2 stores the y values.

M = ( 1, 2 ; 2, 3 ; 3, 7 ; 4, 10 ; 4, 20 ; 5, 30 )
1  2
2  3
3  7
4 10
4 20
5 30
plot( M, 1, 2, "noline", "square:10:nofill:red" )

Figure 10. Raw points.

The x values and y values can be grabbed as follows:

x = column( M, 1 );
y = column( M, 2 );

Assuming the polynomial to fit is of the form y = b0 + b1*x + b2*x^2, the corresponding Vandermonde matrix is:

X = vandermonde( x, 2 )
1.0  1.0   1.0
1.0  2.0   4.0
1.0  3.0   9.0
1.0  4.0  16.0
1.0  4.0  16.0
1.0  5.0  25.0

Finally, we solve for the parameters and build a function with them.

b = solve( transpose(X)*X, transpose(X)*y )
7.4
-7.2667
2.3333
f(x) = b[1] + b[2]*x + b[3]*x^2
plot( f, 0.5, 5.5, 100 )

Figure 11. Fitted curve.

### Principal Component Analysis

This section illustrates the use of some of the Linear Algebra functions in connection with statistical analysis. Given an observation matrix M where each row corresponds to an observation of a vector value, the covariance matrix can be found using the `covariance()` function.

M = ( 2, 3, 2 ; 3, 1, 15 ; 7, 8, 2 ; 5, 5, 10 )
2 3  2
3 1 15
7 8  2
5 5 10
S = covariance(M)
4.9167   5.9167   -3.4167
5.9167   8.9167  -12.0833
-3.4167 -12.0833   40.9167

The total variance is the trace of the covariance matrix.

tv = trace(S)
54.75

The eigenvalues of S and the corresponding principal components (eigenvectors) are:

L = eigenvalues(S)
0.0677   0       0
0    9.0634    0
0      0    45.6188
U = eigenmatrix(S)
-0.7251  0.6769  -0.1263
0.6745  0.6612  -0.3285
0.1389  0.3234   0.936

The transformed uncorrelated data can be computed as M*U.

### Miscellaneous Functions

The following functions are implemented:

• `sort(a)` arranges all values stored in array `a` in increasing order.
• `max(...)` returns the maximum value of the arguments passed.
• `min(...)` returns the minimum value of the arguments passed.
• `factorial(n)` returns the factorial of integer `n`.
• `combination(n,m)` returns the number of combinations of `m` elements taken from a set of `n` elements.
• `fibonacci(n)` returns the nth term of the Fibonacci series.
• `sample(A,n)` returns a sample of size n from array `A`.
• `sample(A,n,m)` returns a sample matrix with `n` rows and `m` columns. Each column is a sample of array `A`.
• `probabilities(A,w)` returns a matrix in which the first row are the middle point of classes with width `w`. The second column is the number of points in that class divided by the total number of points.
• `probabilities(A,w,r)`. As above but the classes boundaries start at reference point `r`.
• `frequencies(A,w)`returns a matrix in which the first row are the middle point of classes with width `w`. The second column is the number of points in that class.
• `frequencies(A,w,r)`. As above but the classes boundaries start at reference point `r`.