Coefficient of Correlation

WEEK 6

TOPIC: CORRELATION AND REGRESSION

CONTENT:

1. Concept of correlation as a measure of relationship
2. Scatter diagrams
3. Rank correlation
4. Tied ranks

Subtopic 1:

Concept of correlation as a measure of relationship

In our previous discussions we considered measures of location and spread for a single variable. Such data are called univariate data. In this discussion, we look at situations in which two variables are observed on each unit, for example the marks of a student in Physics and in Mathematics, or the height and weight of a person. Such data are called bivariate data.

When we treat two sets of univariate data as one set of bivariate data, we can ask whether the two variables are associated. For example, when a student scores a high mark in Mathematics, we want to see whether he also scores high in English or scores low. When we go further and ask how the marks in Mathematics can be used to predict the marks in English, we are looking at the correlation between them. When two variables x and y are related in this way, we say they are correlated.

The correlation coefficient is a measure of the degree of association between x and y. There are two cases we shall consider:

1. When both variables are measured quantitatively, such as marks, height and age.
2. When the values of the variables are given as ranks, such as 1st, 2nd, 3rd.

Sub-Topic 2: Scatter Diagram

Suppose x and y are random variables whose measured values are students' marks in Mathematics and Chemistry. If the scores are recorded as (x1, y1), (x2, y2), …, (xn, yn), we can plot them on a rectangular coordinate system. The resulting set of points is called a scatter diagram. See some examples of scatter diagrams below.

[Figures (1) to (4): scatter diagrams plotted on X and Y axes. (1) Positive correlation. (2) Negative correlation.]

If a scatter diagram looks like (1) above, then there is a positive correlation between x and y, because an increase in x tends to go with an increase in y.

The second diagram shows a negative correlation between x and y, since an increase in x tends to go with a decrease in y.

Figure 4 above shows no correlation between x and y, since a high value of x is sometimes paired with a high value of y and sometimes with a low one. In that case x and y are said to be uncorrelated. If all the points seem to lie near some curve rather than a straight line, the correlation is said to be non-linear.

Line of best fit

Generally, more than one line will appear to fit a given set of data. If every student in the class is asked to draw a freehand line that best fits the scatter diagram in (1), it is likely that many different lines will emerge.

To avoid such individual differences in constructing a line that best fits the scatter diagram, we need a definite rule. One way is to draw a line that passes through the mean point $(\bar{x}, \bar{y})$ in such a way that the points on either side of it are about equally distant from it. See the example below:

[Figure: scatter diagram on X and Y axes with the line of best fit drawn through the points.]

The concept is based on the gradient-intercept form of the equation of a straight line. Recall y = ax + b, where a is the slope and b is the intercept on the y-axis.

The line of best fit is chosen so that the sum of the squared vertical distances of the scatter points from the line is a minimum. This is why the line is called the least squares line. If $d_1, d_2, \dots, d_n$ are the deviations of the points from the line, then $d_1^2 + d_2^2 + \dots + d_n^2$ is as small as possible (it is zero only when every point lies exactly on the line). This line is also called the regression line.
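To illustrate the least squares idea, here is a minimal Python sketch. The data points and the helper names `sum_squared_deviations` and `least_squares_line` are made up for illustration and are not part of the lesson; the slope and intercept use the standard least squares formulas. The sketch compares the sum of squared deviations for the least squares line with that of slightly shifted lines.

```python
def sum_squared_deviations(xs, ys, a, b):
    """Sum of d_i^2, where d_i = y_i - (a*x_i + b) is the vertical deviation."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return slope a and intercept b of the least squares line y = a*x + b."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - a * sx / n
    return a, b


# Illustrative data (hypothetical, chosen only for this sketch).
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

a, b = least_squares_line(xs, ys)
best = sum_squared_deviations(xs, ys, a, b)

# Shifting the slope or the intercept makes the sum of squares larger.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.5), (0, -0.5)]:
    other = sum_squared_deviations(xs, ys, a + da, b + db)
    print(f"a={a + da:.2f}, b={b + db:.2f}: {other:.3f} (least squares: {best:.3f})")
```

Note that the minimum itself is not zero here, because the points do not all lie on one straight line.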

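Mathematically, the least squares line $y = ax + b$ is given by

$$a = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}, \qquad b = \bar{y} - a\bar{x}$$

also

$$\bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}$$

so that $\left(\sum x\right)^2 = n^2(\bar{x})^2$.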

Example 1. Fit a regression line for the bivariate data

| Test score | 6 | 8 | 4 | 7 | 3 | 5 | 9 | 10 | 6 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Aptitude score | 5 | 6 | 3 | 5 | 3 | 2 | 7 | 8 | 4 | 2 |

Solution:

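| x | y | x² | xy |
|---|---|---|---|
| 6 | 5 | 36 | 30 |
| 8 | 6 | 64 | 48 |
| 4 | 3 | 16 | 12 |
| 7 | 5 | 49 | 35 |
| 3 | 3 | 9 | 9 |
| 5 | 2 | 25 | 10 |
| 9 | 7 | 81 | 63 |
| 10 | 8 | 100 | 80 |
| 6 | 4 | 36 | 24 |
| 1 | 2 | 1 | 2 |
| ∑x = 59 | ∑y = 45 | ∑x² = 417 | ∑xy = 313 |

$$a = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} = \frac{10(313) - (59)(45)}{10(417) - 59^2} = \frac{475}{689} = 0.6894$$

$$\bar{x} = \frac{\sum x}{n} = \frac{59}{10} = 5.9$$

$$\bar{y} = \frac{\sum y}{n} = \frac{45}{10} = 4.5$$

$$b = \bar{y} - a\bar{x} = 4.5 - 0.6894 \times 5.9 = 4.5 - 4.067 = 0.433$$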

Hence, the regression line equation is

y = 0.6894 x + 0.433
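The arithmetic in Example 1 can be checked with a short Python sketch. It uses only the scores listed in the example and the standard least squares formulas for the slope and intercept; the helper `predict` is a hypothetical name added for illustration.

```python
# Data from Example 1: test scores (x) and aptitude scores (y).
x = [6, 8, 4, 7, 3, 5, 9, 10, 6, 1]
y = [5, 6, 3, 5, 3, 2, 7, 8, 4, 2]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(v * v for v in x)
sum_xy = sum(p * q for p, q in zip(x, y))

# Slope and intercept of the regression line of y on x.
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = sum_y / n - a * sum_x / n

print(f"sum x = {sum_x}, sum y = {sum_y}, sum x^2 = {sum_x2}, sum xy = {sum_xy}")
print(f"y = {a:.4f} x + {b:.3f}")


def predict(x_value):
    """Predicted aptitude score for a given test score (hypothetical helper)."""
    return a * x_value + b
```

For instance, `predict(5.9)` returns the mean aptitude score 4.5 (up to rounding), because the regression line passes through the mean point.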

This is the regression line of y on x. It is equally possible to find the regression line of x on y.

In that case y becomes the independent variable while x depends on y. The scatter diagram and the regression line show us whether there is a relationship between x and y and what type of relationship it is. It may be linear or non-linear.

Once we have confirmed that there is a linear association between x and y, we want to measure the strength of that linear relationship quantitatively. This measure is called the coefficient of correlation.

EVALUATION

Draw a scatter diagram for the following bivariate data

| X | 2 | 3 | 5 | 6 | 8 | 9 | 10 | 11 | 14 |
|---|---|---|---|---|---|---|---|---|---|
| Y | 3 | 2 | 5 | 4 | 8 | 7 | 9 | 11 | 12 |

Fit a regression line of y on x as best as you can. From your graph, obtain the regression coefficient of y on x.

Sub-Topic:

Rank Correlation

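1. Coefficient of correlation: the coefficient of correlation is defined as $r = \pm\sqrt{a\,a_1}$, where r is the coefficient of correlation, and a and a1 are the slopes of the regression lines of y on x and of x on y respectively. The sign of r is the same as the sign of a and a1.

r can be considered as the geometric mean of a and a1.

From previous discussions,

$$a = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

and similarly

$$a_1 = \frac{n\sum xy - \sum x \sum y}{n\sum y^2 - \left(\sum y\right)^2}$$

so that

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

This r is usually called Pearson's coefficient of correlation or the product moment correlation coefficient.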
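The link between r and the two regression slopes can be checked numerically. The sketch below is an illustration only: the data and the helper name `slopes_and_r` are made up. It computes a (y on x), a1 (x on y) and r from the product moment formula, then compares r with the geometric mean of the slopes. Since a square root alone loses the sign, the sketch takes the sign from the slope a.

```python
import math


def slopes_and_r(xs, ys):
    """Return (a, a1, r): slope of y on x, slope of x on y, and Pearson's r."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    a = num / (n * sxx - sx ** 2)    # regression of y on x
    a1 = num / (n * syy - sy ** 2)   # regression of x on y
    r = num / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return a, a1, r


# Illustrative data (hypothetical).
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 4, 6, 9]

a, a1, r = slopes_and_r(xs, ys)
# r has the same sign as the slopes; its size is the geometric mean of a and a1.
r_from_slopes = math.copysign(math.sqrt(a * a1), a)
print(f"a = {a:.3f}, a1 = {a1:.3f}, r = {r:.4f}, sqrt(a*a1) = {r_from_slopes:.4f}")
```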

Example 2.

The following data are marks scored by 10 students out of a maximum of 10 marks for each subject.

| Maths | 3 | 6 | 4 | 6 | 4 | 7 | 5 | 5 | 4 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| Physics | 4 | 6 | 5 | 7 | 4 | 7 | 6 | 5 | 5 | 8 |

Draw the scatter diagram and calculate the coefficient of correlation.

Solution

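Let x represent the Maths marks and y the Physics marks.

The correlation coefficient between x and y is given as:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

| x | y | x² | y² | xy |
|---|---|---|---|---|
| 3 | 4 | 9 | 16 | 12 |
| 6 | 6 | 36 | 36 | 36 |
| 4 | 5 | 16 | 25 | 20 |
| 6 | 7 | 36 | 49 | 42 |
| 4 | 4 | 16 | 16 | 16 |
| 7 | 7 | 49 | 49 | 49 |
| 5 | 6 | 25 | 36 | 30 |
| 5 | 5 | 25 | 25 | 25 |
| 4 | 5 | 16 | 25 | 20 |
| 7 | 8 | 49 | 64 | 56 |
| 51 | 57 | 277 | 341 | 306 |

∑x = 51, ∑y = 57, ∑x² = 277, ∑y² = 341, ∑xy = 306

$$r = \frac{10(306) - (51)(57)}{\sqrt{\left[10(277) - 51^2\right]\left[10(341) - 57^2\right]}}$$

$$= \frac{3060 - 2907}{\sqrt{169 \times 161}}$$

$$= \frac{153}{164.95}$$

r = 0.928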
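As a check on the hand arithmetic, the following Python sketch recomputes every sum directly from the marks as listed in Example 2 and then applies the product moment formula. The variable names are illustrative.

```python
import math

# Marks from Example 2: Maths (x) and Physics (y).
x = [3, 6, 4, 6, 4, 7, 5, 5, 4, 7]
y = [4, 6, 5, 7, 4, 7, 6, 5, 5, 8]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(p * q for p, q in zip(x, y))

# Product moment (Pearson) correlation coefficient.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(f"sum x = {sum_x}, sum y = {sum_y}")
print(f"sum x^2 = {sum_x2}, sum y^2 = {sum_y2}, sum xy = {sum_xy}")
print(f"r = {r:.3f}")
```

Recomputing the column totals this way is a quick guard against slips when adding long columns by hand.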

The scatter diagram

[Figure: scatter diagram of the Physics marks (y) against the Maths marks (x), with both axes scaled from 1 to 10.]

Some characteristics of r are

1. The value of r is the same whichever variable is labelled x and whichever is labelled y.
2. The value of r satisfies the inequality -1 ≤ r ≤ 1.
3. If r is close to +1, x and y are highly positively correlated. If r is close to -1, x and y are highly negatively correlated. When r is close to 0, the correlation between x and y is very low, and when r = 0 there is no linear correlation at all.
4. The degree to which r is close to -1 or +1 determines how good a predictor the least squares line is.
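These characteristics can be seen in a small Python sketch. The helper `pearson_r` and all the data sets are hypothetical, chosen so that the points lie exactly on a rising line, exactly on a falling line, or show no linear trend.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    return num / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


xs = [1, 2, 3, 4, 5]
up = [2 * x + 1 for x in xs]      # points exactly on a rising line
down = [10 - 3 * x for x in xs]   # points exactly on a falling line
mixed = [2, 5, 1, 5, 2]           # no linear trend
noisy = [1, 3, 4, 6, 9]           # rising, but not exactly on a line

print("rising line :", pearson_r(xs, up))
print("falling line:", pearson_r(xs, down))
print("no trend    :", pearson_r(xs, mixed))
# Swapping the labels x and y does not change r.
print("symmetric   :", pearson_r(xs, noisy) == pearson_r(noisy, xs))
```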

Spearman’s rank correlation coefficient:

There are occasions when we are given only the positions (ranks) of the items, without any marks being awarded. For example, the results of opinion polls on sensitive issues are often given as positions. This is called ranking. If two corresponding sets of values are ranked in this manner, the coefficient of rank correlation is given by

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where d = difference between the ranks of corresponding values of x and y, and n = number of pairs of values.

This formula is called Spearman’s formula.

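Example 3: Two judges X and Y ranked 10 contestants in a singing competition as follows.

| Contestant | A | B | C | D | E | F | G | H | I | J |
|---|---|---|---|---|---|---|---|---|---|---|
| Rank by X | 8 | 3 | 9 | 2 | 7 | 10 | 4 | 6 | 1 | 5 |
| Rank by Y | 9 | 5 | 10 | 1 | 7 | 7 | 3 | 4 | 2 | 6 |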

Do the judges differ from each other in ranking the contestants?

Solution

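In order to answer, we calculate Spearman's rank correlation coefficient.

| X | Y | d = X - Y | d² |
|---|---|---|---|
| 8 | 9 | -1 | 1 |
| 3 | 5 | -2 | 4 |
| 9 | 10 | -1 | 1 |
| 2 | 1 | 1 | 1 |
| 7 | 7 | 0 | 0 |
| 10 | 7 | 3 | 9 |
| 4 | 3 | 1 | 1 |
| 6 | 4 | 2 | 4 |
| 1 | 2 | -1 | 1 |
| 5 | 6 | -1 | 1 |
| | | | ∑d² = 23 |

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6(23)}{10(10^2 - 1)} = 1 - \frac{138}{990} = 0.861$$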

Since R is high, we say that the judges agree to a reasonable extent in their ranking.
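The calculation in Example 3 can be sketched in Python as follows. The function name `spearman_rank_correlation` is an illustrative choice; it applies Spearman's formula to the ranks exactly as they are listed in the example.

```python
def spearman_rank_correlation(rank_x, rank_y):
    """Spearman's R = 1 - 6*sum(d^2) / (n*(n^2 - 1)) for two lists of ranks."""
    n = len(rank_x)
    sum_d2 = sum((p - q) ** 2 for p, q in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Ranks from Example 3, in contestant order.
judge_x = [8, 3, 9, 2, 7, 10, 4, 6, 1, 5]
judge_y = [9, 5, 10, 1, 7, 7, 3, 4, 2, 6]

d = [p - q for p, q in zip(judge_x, judge_y)]
sum_d2 = sum(v * v for v in d)
R = spearman_rank_correlation(judge_x, judge_y)

print("d =", d)
print("sum of d^2 =", sum_d2)
print(f"R = {R:.3f}")
```

Two identical rankings give R = 1, and two exactly opposite rankings give R = -1, so a value near 1 indicates close agreement between the judges.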

EVALUATION

The following table shows the positions of ten students in Mathematics and Further Maths tests.

| Maths | 10 | 1 | 9 | 5 | 3 | 7 | 4 | 7 | 6 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| F.Maths | 9 | 3 | 8 | 10 | 2 | 7 | 6 | 4 | 5 | 1 |

Calculate the coefficient of rank correlation.

GENERAL EVALUATION

1. Five students scored the following marks in Physics and Chemistry during an end-of-year examination.

| Physics | 90 | 70 | 50 | 47 | 80 |
|---|---|---|---|---|---|
| Chemistry | 70 | 80 | 66 | 62 | 85 |

(a) Draw the scatter diagram. (b) Calculate the coefficient of correlation.

2. The differences in the ranks given by judges x and y are as follows: -0.5, 1.5, 0. Calculate Spearman's correlation coefficient.

3. Calculate the coefficient of correlation to show the association between the two sets of quiz marks using the product moment formula.

| Marks of 1st quiz | 6 | 5 | 8 | 8 | 7 | 6 | 10 | 4 | 9 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| Marks of 2nd quiz | 8 | 7 | 7 | 10 | 5 | 8 | 10 | 6 | 8 | 6 |
