Statistical functions

  1. Descriptive Statistics
  2. Plotting the Data
  3. Working the Data
  4. Probability Distributions
  5. Supported Distributions
  6. z-values
  7. Random Values
  8. Linear Regression
  9. General Linear Models
  10. Principal Component Analysis
  11. Miscellaneous Functions

Descriptive Statistics

All descriptive statistics functions operate on data stored in either arrays or matrices. If the data is stored in a matrix, the descriptive measure is computed for each column.

The set of implemented functions includes: mean(), median(), mode(), geometricMean(), harmonicMean(), rootMeanSquare(), quantile(), inverseQuantile(), range(), sum(), variance(), varianceMLE(), standarDeviation(), meanDeviation, moment(), skew(), kurtosis(), kstatistics(), zeroMean(), standarize(), kurtosisExcess(), trimmedMean(), covariance(), covarianceMLE(), correlation(), autocorrelation(), max(), min().

Additional functions exist to sample and compute frequencies and probabilities. The following lines show examples of using some of the functions above.

Let's store some data in an array and find the mean of the values:


a = ( 3, 7, 5, 1, 12, 2 )
3  7  5  1  12  2
m = mean(a)
5

The sample variance and standard deviation are:


variance( a )
16.4
standardDeviation( a )
4.0497

The maximum likelihood estimate of the variance is calculated as follows:


varianceMLE( a )
13.6667

Let's create some random data in a matrix using function randommatrix(). The first two parameters define the number of rows and columns while the next to define the bounds of the generated values. In this case, we generate integers between 1 and 10.


b = randommatrix(5,3, 1,10)
10  8  1
 7  1  1
 9  1  9
10 10  8
 4  4  5

The kurtosis of each column is computed as follows:


kurtosis( b )
2.145  1.4045  1.2911

All other statistical measures are found in a similar way. If the data is stored in a matrix, then the descriptive measure is computed for each column. See detailed description of each function in the Function Reference.

Plotting the Data

Depending on the nature of your data, several plotting options are available. Let's assume your data is stored in the array x below.


x
12.1482 5.6388 7.4344 12.9942 7.2976 6.0825 8.3326 
16.6051 13.8746 4.3337 15.2019 15.878 8.682 12.1221 
9.1053 14.2063 26.2596 11.0928 7.851 12.6484 6.8564 
9.8487 11.9521 7.8879 11.8009 10.3276 10.9104 6.5925
4.4416 10.8958 2.76 10.3302 6.4403 12.5306 3.8333 
10.7918 16.8642 11.1614 8.7245 18.168 5.9511 12.4027 
9.5716 8.6207 12.7529 8.0273 15.105 6.5782 6.3219 7.2727 
7.3226 14.7494 7.2624 7.8146 11.8298 12.8237 6.6754 
10.4441 7.2507 5.1458 5.6291 13.2985 4.5098 6.8594 
14.8297 6.0229 5.2568 4.182 12.6368 10.9129 8.2295 
14.0113 6.7413 4.21 12.2426 12.288 13.891 8.4721 11.7411
11.3759 6.6798 13.8307 15.8475 10.605 16.8224 8.8597 
9.6864 5.4744 6.5236 8.1828 5.1005 12.2552 4.5843 
19.0124 6.1869 4.1979 4.8829 12.1826 6.4031 10.4952 

We use function frequencies() with a class width equal to 2.0. This function computes the number of data points withing each class. Function barchart() can be used to display the resulting matrix.


h = frequencies(x,2)
 3.0   2
 5.0  15
 7.0  23
 9.0  13
11.0  16
13.0  17
15.0   8
17.0   3
19.0   2
21.0   0
23.0   0
25.0   0
27.0   1
barchart( h )

Figure 1. Displaying frequencies.

A custom plot can be prepared using function plot(). Function plot() can display the contents of any two columns of a matrix.


plot( h, 1, 2, "noline", "bar:1.8:red" )

Figure 2. Custom plot. Class width equal to 2.0.

Let's select 1.0 as a new reference to create the classes, and compare the new plot.


h2 = probabilities(x,2,1)
plot( h2,1, 2, "noline", "bar:1.8:nofill" )

Figure 3. Custom plot. Reference equal to 1.0.

A box-and-whiskers plot of the data can be constructed as follows. Function boxwhiskers() expects the data stored in a matrix. Each row corresponds to a data set. In this case we build a matrix with one row only.


b[1,1] = "x"
b[1,2] = min(x)
b[1,3] = quantile(x,.25); // first quartile
b[1,4] = mean(x)
b[1,5] = quantile(x,.75); // fourth quartile
b[1,6] = max(x)
boxwhiskers( b, "horizontal", "background:white" )

Figure 4. Box-and-whiskers plot of data.

Depending on your data, you may also use functions pie() or pareto().

Working the Data

There are several functions you may use to work your data, sort it, or find an arbitrary proportion.

Function sort() can be used to sort your data in ascending order.


a=( 1, 3, -2, 7, 5);
sort(a)
-2  1  3  5  7

Function apply() can modify each entry of an array or matrix. As an illustration, let's assume you need to square each entry and divide it by 3. You would start by defining a suitable function and applying it to each entry of the array as follows:


f(x) = x*x/3;
a
-2  1  3  5  7
apply( a, f )
4/3  1/3  3  25/3  49/3

To find a proportion, use function findall(). For example, to find all the entries x such that 3<x<5, begin by defining a function and then use findall() as shown below. Function findall() would return an array with all the entries satisfying the condition.


f(x) = 0 < x && x < 5;
b = findall( a, f );

Probability Distributions

Calcugator has a practical way of dealing with probability distributions; it treats a probability distribution like an "object" you create and you query for information.

To create a probability distribution use the distribution() function. For example, a StudentT distribution with 10 degrees of freedom is created as follows:


d = distribution("StudentT", 10)
Distribution StudentT (n=10.0)

To find the value of the probability density function of this distribution at x=2.2, do:


pdf(d, 2.2)
0.0444

To find the value of the cumulative probability for x=2.2, do:


cdf(d, 2.2)
0.9738

To find the value of x such that the area of the left tail is 0.2, do:


leftTail(d, 0.2)
-0.8791

To find the value of x such that the area of the right tail is 0.05, do:


rightTail(d, 0.05)
1.8125

To create a plot of the distribution pdf, create a function and use the plot() function as follows:.


f(x) = pdf(d,x)
plot(f,-5,5, 100)

Figure 5. StudentT distribution (n=10.0).

A distribution object can also be used to generate random variates. For example, to generate 1000 random numbers from this distribution, do:


x = variate(d,1000);

Notice the semicolon after the command. Using a semicolon cancels the display of the command's response. In this case we avoid displaying 1000 numbers.

To see a visual display of the generated numbers, first use the probabilities() function. The first parameter is an array and the second is the class width. Using a class width equal to 1.0 we obtain the results below.


h = probabilities(x,1)
-4.5   0.001
-3.5   0.009
-2.5   0.031
-1.5   0.15
-0.5   0.304
 0.5   0.324
 1.5   0.132
 2.5   0.041
 3.5   0.008

The first column of matrix h is the center of each class (bin) and the second column is the frequency of points in that class divided by the number of points. A histogram can be created by using the "bar" option in the plot() function. Notice that the plot() function can plot any two columns of a given matrix. In this case we plot column 1 versus column 2.


plot( h, 1, 2, "noline", "bar:0.85:nofill:red" )

Figure 6. Random variates, StudentT (n=10.0)

The distribution object can also inform about the theoretical mean and variance.


mean(d)
0.0
variance(d)
1.25

Function pdfinfo() displays a summary of the probability function used by a distribution.

Supported Distributions

Here is a list of the supported distributions. Each one is presented with an example.

Poisson distribution


m = 2.2 // mean
d = distribution( "Poisson", m )

Geometric distribution


a = 0.45 // 0<a<1
d = distribution( "Geometric", a )

Logarithmic distribution


p = 0.5 // 0<p<1
d = distribution( "Logarithmic", p )

Binomial distribution


n = 10
p = 0.6 // 0<p<1
d = distribution( "Binomial", n, p )

Negative binomial distribution


n = 10
p = 0.5 // 0<p<1
d = distribution( "NegativeBinomial", n, p )

Hyper geometric distribution


N = 10
s = 3
n = 2
d = distribution("Hypergeometric", N, s, n )

Uniform discrete distribution


m1 = 5
m2 = 10
d = distribution( "UniformDiscrete", m1, m2 )

Chi square distribution


df = 15
d = distribution( "ChiSquare", df )

Non central Chi Square distribution


df = 10
nc = 5.5
d = distribution( "NonCentralChiSquare", df, nc )

Non central Student-t distribution


df = 15
nc = 4.4
d = distribution( "NonCentralStudentT", df, nc )

Gaussian distribution


m = 2.3
s = 0.65
d = distribution( "Gauss", m , s )

Gamma distribution


alpha = 2.1
lambda = 0.8
d = distribution( "Gamma", alpha, lambda )

Uniform distribution


u1 = 3.33
u2 = 8.83
d = distribution( "Uniform", u1, u2 )

Beta distribution


a = 0.3
b = 2.1
d = distribution( "Beta", a, b )

Student-t distribution


df = 10
d = distribution( "StudentT", df )

Laplace distribution


a = 2.2
b = 3.5
d = distribution( "Laplace", a, b )

Logistic distribution


a = 3.2
b = 2.5
d = distribution( "Logistic", a, b )

Exponential distribution


lambda = 2.9
d = distribution( "Exponential", lambda )

F distribution


n = 10
m = 12
d = distribution( "Fisher", n, m )

Non central F distribution


n = 10
m = 12
nc = 3.3
d = distribution( "NonCentralFisher", n, m, nc )

z-values

Functions snd() and invsnd() provide the so called z-values.

Standard Normal Distribution. Function snd(z) returns the area under the standard normal distribution from 0. to z (z>0). The function replaces the use of the typical tables found in textbooks.

Example:


snd(1.2)
0.3849

Figure 7. Find area given z.

There is also a utility function invsnd(a) which returns the value z such that the area under the standard normal distribution curve is equal to a. Example:


invsnd(0.3849)
1.1998

Figure 8. Find z given area.

Random Values

The next group of functions are useful to generate sets of random results. Calcugator uses the Mersenne Twister algorithm as its core random engine.

See section Random matrices to create random matrices.

Linear Regression

Use function linearreg to find the least squares regression line:

f is any name you choose and x and y are the arrays containing the x and and y coordinates respectively.

For example,


x = ( 4, 6, 8, 9, 11, 14 )
4  6  8  9  11  14
y = ( 8, 10, 18, 19, 19, 30 )
8  10  18  19  19  30

a least squares regression line is obtained as follows,


g = linearreg( x, y )
g(x) = (2.1263 * x + -1.0947)

now you may use function g like any other user defined function.

Let's plot the data. Let's use blue squares of size 8 pixels, no fills, and let's join the points using gray lines,


plot( x, y, "square:blue:8:nofill", "line", "gray")

Let's plot the linear regression line on the interval [4,14] using 30 points to build the line and using color red:


plot( g, 4, 14, 30, "red")

Figure 9. Linear regression example.

Another statistical measure of the linear regression model is the correlation coefficient. To obtain it, use function corrcoeff(x,y).

Example:


corrcoef(x,y)
 0.9652

General Linear Models

What follows is an example of finding fitting parameters for a second degree polynomial using the Least Squares method. Assume the data is a set of values (x,y) stored in a matrix. Column 1 stores the x values and column 2 stores the y values.


M = ( 1, 2 ; 2, 3 ; 3, 7 ; 4, 10 ; 4, 20 ; 5, 30 )
1  2
2  3
3  7
4 10
4 20
5 30
plot( M, 1, 2, "noline", "square:10:nofill:red" )

Figure 10. Raw points.

The x values and y values can be grabbed as follows:


x = column( M, 1 );
y = column( M, 2 );

Assuming the polynomial to fit is of the form y = b0 + b1*x + b2*x^2, the corresponding Vandermonde matrix is:


X = vandermonde( x, 2 )
1.0  1.0   1.0
1.0  2.0   4.0
1.0  3.0   9.0
1.0  4.0  16.0
1.0  4.0  16.0
1.0  5.0  25.0

Finally, we solve for the parameters and build a function with them.


b = solve( transpose(X)*X, transpose(X)*y )
 7.4
-7.2667
 2.3333
f(x) = b[1] + b[2]*x + b[3]*x^2
plot( f, 0.5, 5.5, 100 )

Figure 11. Fitted curve.

Principal Component Analysis

This section illustrates the use of some of the Linear Algebra functions in connection with statistical analysis. Given an observation matrix M where each row corresponds to an observation of a vector value, the covariance matrix can be found using the covariance() function.


M = ( 2, 3, 2 ; 3, 1, 15 ; 7, 8, 2 ; 5, 5, 10 )
2 3  2
3 1 15
7 8  2
5 5 10
S = covariance(M)
 4.9167   5.9167   -3.4167
 5.9167   8.9167  -12.0833
-3.4167 -12.0833   40.9167 

The total variance is the trace of the covariance matrix.


tv = trace(S)
54.75

The eigenvalues of S and the corresponding principal components (eigenvectors) are:


L = eigenvalues(S)
0.0677   0       0
  0    9.0634    0
  0      0    45.6188
U = eigenmatrix(S)
-0.7251  0.6769  -0.1263 
 0.6745  0.6612  -0.3285
 0.1389  0.3234   0.936

The transformed uncorrelated data can be computed as M*U.

Miscellaneous Functions

The following functions are implemented: