Statistical Analysis CA 3 Data Management & Analytics

Data Management & Analytics

Sinead McConn- 10013026

CA 3 Statistical Analysis

_______________________________________________________________________

Q1: Lift Analysis

Please calculate the following lift values for the table correlating Burger & Chips below:

Lift(Burgers, Chips)

Lift(Burgers, ^Chips)

Lift(^Burgers, Chips)

Lift(^Burgers, ^Chips)

              Chips   ^Chips   Total Row
Burgers       600     400      1000
^Burgers      200     200      400
Total Column  800     600      1400

Solution 1

P(Burgers ∩ Chips) = 600/1400 = 3/7 = 0.43

P(Burgers) = 1000/1400 = 5/7 = 0.71

P(Chips) = 800/1400 = 4/7 = 0.57

LIFT(Burgers, Chips) = (3/7)/((5/7)*(4/7)) = 21/20 = 1.05

*Because the lift is greater than 1, Burgers & Chips are positively correlated.

Solution 2

P(Burgers ∩ ^Chips) = 400/1400 = 2/7 = 0.29

P(Burgers) = 1000/1400 = 5/7 = 0.71

P(^Chips) = 600/1400 = 3/7 = 0.43

LIFT(Burgers, ^Chips) = (2/7)/((5/7)*(3/7)) = 14/15 = 0.93

*Because the lift is less than 1, Burgers & ^Chips are negatively correlated.

Solution 3

P(^Burgers ∩ Chips) = 200/1400 = 1/7 = 0.14

P(^Burgers) = 400/1400 = 2/7 = 0.29

P(Chips) = 800/1400 = 4/7 = 0.57

LIFT(^Burgers, Chips) = (1/7)/((2/7)*(4/7)) = 7/8 = 0.875

*Because the lift is less than 1, ^Burgers & Chips are negatively correlated.

Solution 4

P(^Burgers ∩ ^Chips) = 200/1400 = 1/7 = 0.14

P(^Burgers) = 400/1400 = 2/7 = 0.29

P(^Chips) = 600/1400 = 3/7 = 0.43

LIFT(^Burgers, ^Chips) = (1/7)/((2/7)*(3/7)) = 7/6 = 1.17

*Because the lift is greater than 1, ^Burgers & ^Chips are positively correlated.
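
The four lift values can be checked in R. The sketch below is only an illustration: lift_table is a hypothetical helper name (not a built-in R function), and the counts are the observed values from the Burgers & Chips table.

```r
# Observed counts from the Q1 table
# (rows: Burgers / ^Burgers, columns: Chips / ^Chips).
counts <- matrix(c(600, 400,
                   200, 200),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Burgers", "^Burgers"),
                                 c("Chips", "^Chips")))

# lift_table is a hypothetical helper name, not part of base R.
lift_table <- function(counts) {
  joint <- counts / sum(counts)   # P(row and column) for each cell
  p_row <- rowSums(joint)         # marginal probability of each row
  p_col <- colSums(joint)         # marginal probability of each column
  joint / outer(p_row, p_col)     # lift = P(A and B) / (P(A) * P(B))
}

lift <- lift_table(counts)
print(round(lift, 3))
```

Because no intermediate rounding is applied, the four values come out as 1.05, 0.933, 0.875 and 1.167. A value above 1 suggests positive correlation, below 1 negative correlation, and exactly 1 independence.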

Q2: Lift Analysis

Please calculate the following lift values for the table correlating Ketchup & Shampoo below:

Lift(Ketchup, Shampoo)

Lift(Ketchup, ^Shampoo)

Lift(^Ketchup, Shampoo)

Lift(^Ketchup, ^Shampoo)

              Shampoo   ^Shampoo   Total Row
Ketchup       100       200        300
^Ketchup      200       400        600
Total Column  300       600        900

Solution 1

P(Ketchup ∩ Shampoo) = 100/900 = 1/9 = 0.11

P(Ketchup) = 300/900 = 1/3 = 0.33

P(Shampoo) = 300/900 = 1/3 = 0.33

LIFT(Ketchup, Shampoo) = (1/9)/((1/3)*(1/3)) = 1

*Because the lift is exactly 1, Ketchup & Shampoo are independent.

Solution 2

P(Ketchup ∩ ^Shampoo) = 200/900 = 2/9 = 0.22

P(Ketchup) = 300/900 = 1/3 = 0.33

P(^Shampoo) = 600/900 = 2/3 = 0.67

LIFT(Ketchup, ^Shampoo) = (2/9)/((1/3)*(2/3)) = 1

*Because the lift is exactly 1, Ketchup & ^Shampoo are independent.

Solution 3

P(^Ketchup ∩ Shampoo) = 200/900 = 2/9 = 0.22

P(^Ketchup) = 600/900 = 2/3 = 0.67

P(Shampoo) = 300/900 = 1/3 = 0.33

LIFT(^Ketchup, Shampoo) = (2/9)/((2/3)*(1/3)) = 1

*Because the lift is exactly 1, ^Ketchup & Shampoo are independent.

Solution 4

P(^Ketchup ∩ ^Shampoo) = 400/900 = 4/9 = 0.44

P(^Ketchup) = 600/900 = 2/3 = 0.67

P(^Shampoo) = 600/900 = 2/3 = 0.67

LIFT(^Ketchup, ^Shampoo) = (4/9)/((2/3)*(2/3)) = 1

*Because the lift is exactly 1, ^Ketchup & ^Shampoo are independent.

 

Question 3: Chi Squared Analysis

Please calculate the following chi squared values for the table correlating burger and sausages below (Expected values in brackets).

Burgers & Sausages

Burgers & Not Sausages

Sausages & Not Burgers

Not Burgers and Not Sausages

For the above options, please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

              Chips       ^Chips     Total Row
Burgers       900 (800)   100 (200)  1000
^Burgers      300 (400)   200 (100)  500
Total Column  1200        300        1500

Solution

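X² = (900-800)²/800 + (100-200)²/200 + (300-400)²/400 + (200-100)²/100

= 100²/800 + (-100)²/200 + (-100)²/400 + 100²/100

= 10000/800 + 10000/200 + 10000/400 + 10000/100

= 12.5 + 50 + 25 + 100 = 187.5

Burgers & Chips are correlated because X² = 187.5 is far greater than 0 (for a 2x2 table, with 1 degree of freedom, the critical value at the 5% level is 3.84). The direction of each correlation comes from comparing the observed value with the expected value.

*Burgers & Chips: the expected value is 800 & the observed value is 900. Observed is greater than expected, so Burgers & Chips are positively correlated.

*Burgers & ^Chips: the expected value is 200 & the observed value is 100. Observed is less than expected, so Burgers & ^Chips are negatively correlated.

*^Burgers & Chips: the expected value is 400 & the observed value is 300. Observed is less than expected, so ^Burgers & Chips are negatively correlated.

*^Burgers & ^Chips: the expected value is 100 & the observed value is 200. Observed is greater than expected, so ^Burgers & ^Chips are positively correlated.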
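
The same calculation can be sketched in R. This is only a cross-check under the assumption that the observed counts are those in the table above. The expected counts are rebuilt from the row and column totals, and chisq.test is called with correct = FALSE so that no continuity correction is applied.

```r
# Observed counts from the Q3 table
# (rows: Burgers / ^Burgers, columns: Chips / ^Chips).
observed <- matrix(c(900, 100,
                     300, 200),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Burgers", "^Burgers"),
                                   c("Chips", "^Chips")))

# Expected count per cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared: sum over all cells of (observed - expected)^2 / expected.
chi_sq <- sum((observed - expected)^2 / expected)

# Sign of (observed - expected): +1 suggests positive correlation,
# -1 negative correlation, 0 independence.
direction <- sign(observed - expected)

# Cross-check against the built-in test, without the continuity correction.
builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)

print(expected)
print(chi_sq)
print(direction)
```

Both chi_sq and builtin should equal 187.5, and direction is positive on the Burgers & Chips and ^Burgers & ^Chips cells and negative on the other two.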

 

Q4: Chi Squared Analysis

Please calculate the following Chi Squared values for the table correlating Burger & Sausages below (expected values in brackets)

Burgers & Sausages

Burgers & Not Sausages

Sausages & Not Burgers

Not Burgers and Not Sausages

For the above options, please also indicate whether each of your answers suggests independence, positive correlation, or negative correlation.

              Sausages    ^Sausages   Total Row
Burgers       800 (800)   200 (200)   1000
^Burgers      400 (400)   100 (100)   500
Total Column  1200        300         1500

 

Solution

X² = (800-800)²/800 + (200-200)²/200 + (400-400)²/400 + (100-100)²/100

= 0²/800 + 0²/200 + 0²/400 + 0²/100 = 0

Burgers & Sausages are independent because X² = 0.

*Burgers & Sausages: observed & expected values are the same (800), so they are independent

*Burgers & ^Sausages: observed & expected values are the same (200), so they are independent

*^Burgers & Sausages: observed & expected values are the same (400), so they are independent

*^Burgers & ^Sausages: observed & expected values are the same (100), so they are independent

 

Question 5:LIFT/Chi Squared Analysis

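A: Under what conditions would Lift & Chi Squared analysis prove to be poor algorithms to evaluate correlation/dependency between two events?

Solution

Lift & Chi Squared analysis are poor measures when the data contains a large number of null transactions, i.e. transactions that contain neither of the two items being compared. Both measures depend on the total number of transactions, so a large number of null transactions can distort the result & make two items look correlated when they are not.

B: Please suggest another algorithm that could be used to rectify the flaw in the Lift & Chi Squared Analysis?

Null-invariant measures could be used instead, for example AllConf, Jaccard, MaxConf & Kulczynski. These are calculated only from the transactions that contain at least one of the two items, so null transactions do not affect them.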
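
The effect of null transactions can be illustrated with a small R sketch. The counts are made up for illustration only, and measures is a hypothetical helper name. Each measure is written using its usual textbook definition, where n_ab is the number of transactions containing both items, n_a and n_b are the totals for each item, and n is the total number of transactions.

```r
# measures is a hypothetical helper, not part of base R.
measures <- function(n_ab, n_a, n_b, n) {
  c(lift       = (n_ab / n) / ((n_a / n) * (n_b / n)),
    all_conf   = n_ab / max(n_a, n_b),
    max_conf   = max(n_ab / n_a, n_ab / n_b),
    jaccard    = n_ab / (n_a + n_b - n_ab),
    kulczynski = 0.5 * (n_ab / n_a + n_ab / n_b))
}

# Made-up data: 100 transactions with both items, 1000 with only A,
# 1000 with only B. Only the number of null transactions differs.
few_nulls  <- measures(n_ab = 100, n_a = 1100, n_b = 1100, n = 2200)    # 100 nulls
many_nulls <- measures(n_ab = 100, n_a = 1100, n_b = 1100, n = 102100)  # 100000 nulls

print(round(rbind(few_nulls, many_nulls), 3))
```

In this made-up example the lift moves from below 1 to above 1 when only the null transactions are added, while the other four measures stay the same in both rows.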


What is a Management Information System?

What is MIS?

MIS is short for management information system or management information services.

Management information system, or MIS, broadly refers to a computer-based system that provides managers with the tools to organize, evaluate and efficiently manage departments within an organization. In order to provide past, present and prediction information, a management information system can include software that helps in decision making, data resources such as databases, the hardware resources of a system, decision support systems, people management and project management applications, and any computerized processes that enable the department to run efficiently.

Management Information System Managers

The role of the management information system (MIS) manager is to focus on the organization’s information and technology systems. The MIS manager typically analyzes business problems and then designs and maintains computer applications to solve the organization’s problems.

(http://www.webopedia.com/TERM/M/MIS.html)

Try R CA 2 Data Management & Analytics

 Introduction

According to (http://www.revolutionanalytics.com/what-r) R is hot!!

This assignment aims to back up this statement by examining why the programming tool is so appealing, & by providing some information on R & its functions.

During the last decade, momentum from both academia and industry has made the R programming language one of the most important tools for computational statistics, visualization and data science.

Code School

To begin, we were asked to complete the R Programming course on tryr.codeschool.com; this course provided an excellent introduction to the world of R & some of its commands & functionality.

Code school - R completion

Case Study

To get an understanding of RStudio, I analysed a data file containing data on the nutritional content of food, & used R to run some basic functions on it.

Process steps

  1. I created a folder on my desktop & named it R
  2. I saved a .csv file containing a breakdown of food components to this folder (file sourced from Moodle)
  3. I opened RStudio & set my working directory to the R folder on my desktop
  4. I checked that the working directory was set correctly
  5. The first data command asked R to read the file from the working directory

Reading the data

A: Get the current directory

Command = getwd()

B: Read the csv file & store it in an object called USDA, so that the later commands can refer to it

Command = USDA <- read.csv("USDA.csv")

C: Structure of data

Command = str(USDA)

 Results

Image 1

D: Summarise dataset

Command = summary(USDA)

Screenshot of some of the commands run

Image 2

E: To find which product contains the maximum amount of sodium, I used the next command, which returns the row number of that product

Command = which.max(USDA$Sodium)
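
Since which.max returns a row number, that number can be used to look up the product itself. The sketch below uses a tiny made-up data frame as a stand-in for the USDA data, and the Description column name is an assumption about how the product names are stored.

```r
# Made-up stand-in for the USDA data; the values are illustrative only.
foods <- data.frame(
  Description = c("food A", "food B", "food C"),  # assumed column name
  Sodium      = c(10, 1000, NA)
)

# which.max ignores missing values and returns the row number of the maximum.
idx <- which.max(foods$Sodium)

# Use the row number to look up the matching product name.
top_product <- foods$Description[idx]
print(top_product)
```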

R Graphics

Plots

The article I read on (http://www.revolutionanalytics.com/what-r) emphasises that R is known for creating “beautiful and unique data visualization”. To put that to the test, I went on to run some plotting commands.

F: I wanted to look at the protein & total fat content

Command = plot(USDA$Protein, USDA$TotalFat)

image 3

G: I used the next command to compare the protein to fat content & added in some colour to make the graph more appealing to the eye

Command = plot(USDA$Protein, USDA$TotalFat, xlab = "Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")

Image 4

 

Boxplots

H: The boxplot command was used next to show the sugar content

Command = boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

I: The final command was used to find out how many products have a higher than average sodium & fat content. It cross-tabulates the HighSodium & HighFat columns

Command = table(USDA$HighSodium, USDA$HighFat)

Image 5
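
The HighSodium & HighFat columns have to exist before table() can use them. One possible way to derive them is sketched below. It is an assumption rather than a record of the steps actually run, and it uses a tiny made-up data frame in place of the USDA data.

```r
# Made-up stand-in for the USDA data; the values are illustrative only.
foods <- data.frame(
  Sodium   = c(10, 20, 30, 1000, 40),
  TotalFat = c(80, 30, 0, 0, 40)
)

# Flag products above the mean (1 = above average, 0 = not).
# na.rm = TRUE leaves missing values out of the mean.
foods$HighSodium <- as.numeric(foods$Sodium > mean(foods$Sodium, na.rm = TRUE))
foods$HighFat    <- as.numeric(foods$TotalFat > mean(foods$TotalFat, na.rm = TRUE))

# Cross-tabulate: rows are HighSodium (0/1), columns are HighFat (0/1).
counts <- table(foods$HighSodium, foods$HighFat)
print(counts)
```

The cell in row 1, column 1 of the resulting table is the number of products that are above average for both sodium & fat.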

Conclusion

On completion of the above tasks using RStudio, I conclude that R is a great tool for data visualisation, & that it returns accurate results quickly & efficiently.

 

Google Fusion CA 1 Data Management & Analytics

Question 1:

A) A chart that breaks down population by Gender

Image 1

B) A chart/table that shows population by province

Image 2

C) Use a pivot table to find out what % of the population does each county account for in 2011.

County       Total Persons   % of Population
Carlow       54612           1.190257205
Cavan        73183           1.595008295
Clare        117196          2.554262495
Cork         519032          11.31219471
Donegal      161137          3.511947469
Dublin       1273069         27.74627462
Galway       250653          5.462930109
Kerry        145502          3.17118589
Kildare      210312          4.583706388
Kilkenny     95419           2.0796373
Laois        80559           1.755766684
Leitrim      31798           0.69303081
Limerick     191809          4.180437343
Longford     39000           0.849996905
Louth        122897          2.678514606
Mayo         130638          2.847228095
Meath        184135          4.013184106
Monaghan     60483           1.318214431
Offaly       76687           1.671377248
Roscommon    64065           1.396283378
Sligo        65393           1.425226862
Tipperary    158754          3.460010479
Waterford    113795          2.480138406
Westmeath    86164           1.877926496
Wexford      145320          3.167219237
Wicklow      136640          2.978040439
Grand Total  4588252         100
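
The percentage column is each county's total divided by the grand total, multiplied by 100. As a rough check, the same arithmetic can be sketched in R for a few of the counties, using the figures from the table above.

```r
# Totals copied from the pivot table; only a few counties are shown for brevity.
persons <- c(Carlow = 54612, Cork = 519032, Dublin = 1273069, Leitrim = 31798)

# Grand Total row of the pivot table.
grand_total <- 4588252

# Share of the total population, as a percentage.
percentage <- persons / grand_total * 100
print(round(percentage, 2))
```

Rounded to two decimal places this gives 1.19 for Carlow, 11.31 for Cork, 27.75 for Dublin and 0.69 for Leitrim, which agrees with the pivot table.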

 

 

Question 2:

Image 1

Map 1

Image 2

Map 2

Image 3

Map 3

 

Question 3:

As part of our assignment on the Irish Census Data for 2011, we were asked to use Google’s Fusion Tables to create a heat map. This was done by cleansing the data file provided & extracting the information required for the heat map. I didn’t have a Gmail account, so the first step was to create one. Once it was set up, I logged into Google Drive & selected Fusion Tables. The first file I loaded was the clean map KML file from Moodle, & I repeated the same steps to load the census data I had cleansed. Once both were loaded, I selected File, then Merge, from the dropdown.

I updated the county column in both files to ensure they used the same naming convention, & then used this column to merge the data.

Next a pop up appeared to select the columns to be used on the heat map. I deselected males, females & description & hit view table.

Once I navigated to the map of geometry tab, my first map appeared, as shown above in Image 1.

When feature map was selected, the map appeared as shown in Image 2.

Fusion Tables allows you to present your data in a visual, interactive way. The data was imported into Fusion Tables & merged on the County attribute to produce the total population, male population & female population heat maps as required.

The map reflects the population data for the 26 counties in the Republic of Ireland (ROI).

The benefit of data visualization to a business or decision maker is that the data can be collated & understood more easily. In addition, it helps to spot patterns & trends in business operations.

It is quite apparent from Image 3 that Dublin & Cork stand out on the heat map above the other 24 counties. For a business, this kind of information helps decision makers to identify abnormalities or highlights immediately. In this case Dublin & Cork are the most populous counties, & a visual synopsis such as this can prove very useful for fast moving businesses that need to spot trends & risks as quickly as possible.