October 6

For solutions, purchase a LIVE CHAT plan or contact us

SSCI 3910: Advanced Data Analysis

1. How much variance is explained by a correlation of 0.7? (1 point)
a. 49%
b. 14%
c. 7%
d. None of the above
2. Given your previous success as the Raptors’ statistics consultant, they have decided to hire you
again! Lucky you! This time, the Raptors are interested in the relationship between various
metrics about players’ age, height, position and number of games played, and their performance
during the game (see the table below for the full list of variables). They have sent you a spreadsheet
(file bball.csv on Canvas) containing information about 106 randomly selected players from the
league, and they want you to help find the relationships between the different variables.
As always, show the function you used and the output of that function.
GAMES: # games played in previous season
PPM: average points scored per minute
MPG: average minutes played per game
HGT: height of player (centimetres)
FGP: field-goal percentage (% successful shots from 3-point line)
AGE: age of player (years)
FTP: percentage of successful penalty free throws
POS: Position they play (c = Centre; pf = power forward; sf = small forward; sg = shooting guard; pg =
point guard)
POS_num: numerical assignment for each position.

a. Generate a scatter plot with PPM (points per minute) and AGE. Based on the figure,
describe the relationship between those 2 variables. (2 points)
b. Calculate the correlations between each pair of variables, including PPM (excluding POS and
POS_num). Show me the correlation matrix.
i. Report the strongest and the weakest correlations along with their p values
and degrees of freedom (e.g., The strongest correlation with PPM is xx, r(df)
= .xx, p = .xx, and the weakest is xx, r(df) = .xx, p = .xx). (2 points)
ii. Provide a brief description of the meaning of the strongest and weakest
correlations in terms of explained variance. Can you think of why two of the
metrics would have a large correlation whereas the other two would have
weaker correlations? (4 points)
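
As a rough illustration of the quantities requested in (b), the sketch below computes Pearson's r, its degrees of freedom (n - 2, so 104 for the 106 players), the matching t statistic, and the explained variance r squared. It is written in plain Python rather than the R the assignment expects, the function name is hypothetical, and the numbers are made up for illustration and are not taken from bball.csv. The p value is left out because it needs a t distribution, which R's cor.test reports directly.

```python
import math

def pearson_summary(x, y):
    """Return r, df = n - 2, the t statistic, and r squared for two lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    df = n - 2
    # t is undefined when |r| == 1 (division by zero)
    t = r * math.sqrt(df / (1 - r ** 2))
    return {"r": r, "df": df, "t": t, "r_squared": r ** 2}

# Made-up illustrative values, NOT taken from bball.csv
minutes = [10.0, 20.0, 30.0, 40.0, 50.0]
points = [0.2, 0.3, 0.5, 0.4, 0.6]
summary = pearson_summary(minutes, points)
print(summary)
```

Here r squared is the proportion of variance in one variable that is linearly accounted for by the other, which is the sense of "explained variance" used in (b)(ii).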

c. Compute the correlation between Position (using variable POS_num) and the other
variables. Why does this correlation have to be computed separately? (2 points)
d. Is position correlated with any of the other variables (list the r and p value)?
i. If so, explain what the relationship(s) mean (2 points)
e. What is the relationship between field goal percentage (FGP) and points per minute (PPM)
after controlling for age (AGE)? Hint: you may need to convert the data into a usable
form (e.g., varname = dataframe[, c("var1", "var2", "control1")] or use the select
function).
i. Is the relationship significant (list t, df and p value)? What proportion of the
variance in the relationship between field goal percentage (FGP) and points
per minute (PPM) is accounted for by age? (4 points)
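
For a single control variable, the partial correlation in (e) can be written in terms of the three pairwise correlations. The sketch below applies that first-order formula and the associated t statistic with n - 3 degrees of freedom. It is a Python illustration only (the assignment expects R), the function name is hypothetical, and the input correlations are placeholders rather than results from the data.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz, n):
    """First-order partial correlation of x and y controlling for z.

    Returns (partial r, t statistic, degrees of freedom = n - 3).
    """
    r = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    df = n - 3
    t = r * math.sqrt(df / (1 - r ** 2))
    return r, t, df

# Placeholder correlations, NOT computed from bball.csv
r_partial, t_stat, df = partial_correlation(r_xy=0.5, r_xz=0.5, r_yz=0.5, n=106)
print(round(r_partial, 4), round(t_stat, 3), df)
```

One possible reading of the "proportion of variance accounted for by age" is the gap between the squared zero-order correlation and the squared partial correlation; that interpretation is an assumption, so check it against the course notes.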

f. Is the correlation between PPM and MPG different than the correlation between PPM
and FTP? (3 points)
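
The two correlations in (f) share a variable (PPM) and come from the same players, so they are dependent. One commonly used option for this case is Williams' t test, as presented by Steiger (1980). The sketch below is an illustration of that formula in Python, not necessarily the method taught in the course; the function name is hypothetical and the inputs are placeholders.

```python
import math

def williams_t(r_jk, r_jh, r_kh, n):
    """Williams' t for comparing two dependent correlations r_jk and r_jh.

    j is the shared variable (e.g. PPM); r_kh is the correlation between
    the two non-shared variables. Returns (t, degrees of freedom = n - 3).
    """
    # Determinant of the 3 x 3 correlation matrix
    det = 1 - r_jk ** 2 - r_jh ** 2 - r_kh ** 2 + 2 * r_jk * r_jh * r_kh
    r_bar = (r_jk + r_jh) / 2
    denom = 2 * ((n - 1) / (n - 3)) * det + r_bar ** 2 * (1 - r_kh) ** 3
    t = (r_jk - r_jh) * math.sqrt((n - 1) * (1 + r_kh) / denom)
    return t, n - 3

# Placeholder correlations, NOT computed from bball.csv
t_value, df = williams_t(r_jk=0.6, r_jh=0.3, r_kh=0.2, n=106)
print(round(t_value, 3), df)
```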

=========================================================================

CITS4009
Due: Friday, October 21st, 2022, 11:59pm

2. Data
To demonstrate a full data science life cycle, you are strongly encouraged to use the data and domain
you have explored in Project 1:
• A specified dataset
• Data from public repositories
Specified data
If you do not have any datasets or domain of interest in mind, we suggest you use the US
Accident Injury dataset as described in project 1.
Data from public repositories
Please refer to the links given in project 1. You can choose a different public dataset if you prefer;
however, this will require you to spend extra time to go through the EDA process to understand the
new dataset.
3. Modelling
Classification
Firstly, study your dataset and choose the response (i.e., target) variable suitable for a classification
task. The remaining variables are your feature variables. You may need to discard some character
string and categorical columns. Columns that have a unique value for each row should be discarded,
e.g., customer ID, accident ID, etc. If possible, formulate it as a binary classification problem, as
multi-class classification is very difficult and is not covered in the lectures.
Your next step is to split the data into a training set and a test set. You can use any meaningful split
ratio (90/10, 80/20, etc). You should implement R code for two different classification
techniques (e.g., decision tree classifier versus logistic regression classifier, Naïve Bayes
classifier versus K nearest neighbours classifier, or decision tree classifier versus Naïve Bayes
classifier, etc.) and compare the performance of the two models. Your report should include
discussions about how feature variables and other related parameters (e.g., the threshold value for
the logistic regression classifier) could be best selected to optimize the performance.
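
The split step can be sketched as a seeded shuffle of row indices, so the same partition is reproduced on every run. This is a minimal Python illustration of the idea only (the project itself requires R); the function name, default ratio and seed are assumptions, and the "rows" are stand-in integers rather than real records.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly partition rows into (train, test) using a fixed seed."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(100))  # stand-in for 100 data rows
train, test = train_test_split(rows, test_fraction=0.2, seed=1)
print(len(train), len(test))
```
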
For example, using the US Accident Injury dataset, we can predict the injury level. Note: depending
on what prediction question you’d like to answer, you may need to tidy the data into the right
shape. You can start with single-variate models and then multi-variate models, following the same
process demonstrated in the lecture slides.
A good demonstration of the following investigations is expected:
1. Understanding of what a null model would look like in this context.
2. Aggregating, sub-setting, sampling or reshaping the data for better data preparation if
necessary.
3. Transforming the categorical variables into numerical for single-variate model selection.
4. Using various measures to select a good combination of variables for multi-variate models.
5. Using LIME (see Labweek09) to find the determining feature(s) for the classification of a few
instances from the test set. Discuss your experimental findings in the report.
6. Evaluation of models. This step involves comparing your multi-variate models for the two
implemented machine learning techniques on the training set and the test set using various
measures (such as ROC plots, confusion matrices, deviances, etc).
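
Items 1 and 6 can be illustrated together: a null model that always predicts the majority class gives the baseline any real classifier should beat, and a confusion matrix at a chosen threshold gives the rates that make up one point on a ROC plot. The sketch below is a Python illustration with made-up labels and scores (the project itself requires R); all function names are hypothetical.

```python
from collections import Counter

def null_model_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

def confusion_counts(labels, scores, threshold=0.5):
    """Count TP, FP, TN, FN when predicting 1 for scores >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts

# Made-up labels and predicted probabilities, for illustration only
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.2, 0.3, 0.1, 0.45]

baseline = null_model_accuracy(labels, labels)
cm = confusion_counts(labels, scores, threshold=0.5)
accuracy = (cm["tp"] + cm["tn"]) / len(labels)
tpr = cm["tp"] / (cm["tp"] + cm["fn"])  # true positive rate (recall)
fpr = cm["fp"] / (cm["fp"] + cm["tn"])  # false positive rate
print(baseline, cm, accuracy, tpr, fpr)
```

Repeating the confusion-matrix step over a range of thresholds traces out the ROC curve and shows how the threshold choice trades true positives against false positives.
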
Clustering
Choose or compute a set of feature variables; apply a clustering algorithm to these variables;
visualise the clustering results and explain the rules discovered.
• Explain how and why you choose the distance measures and how the choices affect your
clustering outcome.
• An investigation on the selection of k – the number of clusters.
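
One simple way to investigate k is to run the clustering for several values and compare the total within-cluster sum of squares (WSS), looking for the point where further increases in k stop paying off. The sketch below is a bare-bones Lloyd's k-means in Python using squared Euclidean distance and the first k points as starting centroids. Both choices are assumptions made to keep the example short and deterministic, the project itself requires R, and the data are made up.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, max_iter=100):
    """Run Lloyd's k-means and return the total within-cluster sum of squares.

    Initial centroids are simply the first k points (arbitrary but deterministic).
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))

# Made-up one-dimensional data with two obvious groups
points = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
wss = {k: kmeans_wss(points, k) for k in (1, 2, 3)}
print(wss)
```

On this toy data the WSS drops sharply from k = 1 to k = 2 and only slightly after that, which is the "elbow" pattern that suggests k = 2.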

========================================================================

PSYC3010 Advanced Statistics for Psychology
Friday 21st Oct 11:59pm

General overview: Your task is to design, ‘conduct’, analyse and write a brief report based on the
analyses for a research study, following APA guidelines. You will invent your own theoretical rationale
for the investigation, and test specific hypotheses (based on the contrast coefficients provided below).
Step 1: Design your study
Design a between-subjects study with two levels on one independent variable (IVA) and four levels on
the other (IVB), i.e., a 2 × 4 design. You can be as creative as you wish – it may be set in the distant
past/future, under the sea or on a make-believe planet – but it should be of interest to a psychologist.
For at least one of the independent variables, participants must be randomly allocated to conditions;
that is, only one of the independent variables can be a classification variable (you cannot use gender or
age – you need to be more creative than that). Clearly define your dependent variable, which must be
measured in whole numbers (i.e., no decimal places) and have a range of 0-50.
In designing the study, you must set theoretically meaningful contrasts before collecting the data, which
will be tested by the contrast coefficients below. Your contrasts must adhere to the following:
1. The interaction contrasts should follow from the main-effect contrasts.
2. Do not design trend hypotheses - these are not trend contrasts.
3. You cannot use examples from lectures or tutorials.
Main effect contrast coefficients:
IVA    Group a1   Group a2
ΨA1       1         -1

IVB    Group b1   Group b2   Group b3   Group b4
ΨB1       1         -3          1          1
ΨB2       2          0         -1         -1
ΨB3       0          0          1         -1
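
The coefficients above can be checked mechanically: each contrast should sum to zero, the three IVB contrasts should be mutually orthogonal (their cross-products sum to zero, which is sufficient with equal cell sizes), and each interaction contrast that follows from a pair of main-effect contrasts has cell coefficients equal to the product of the corresponding IVA and IVB coefficients. The Python sketch below is only an illustration of that arithmetic; the assignment itself requires SPSS syntax, and the helper names are hypothetical.

```python
psi_a = {"A1": [1, -1]}
psi_b = {
    "B1": [1, -3, 1, 1],
    "B2": [2, 0, -1, -1],
    "B3": [0, 0, 1, -1],
}

def dot(u, v):
    """Sum of element-wise products of two coefficient lists."""
    return sum(a * b for a, b in zip(u, v))

def interaction_coefficients(a, b):
    """2 x 4 grid of cell coefficients: rows are IVA levels, columns IVB levels."""
    return [[ai * bj for bj in b] for ai in a]

# Every contrast sums to zero
sums = {name: sum(c) for name, c in {**psi_a, **psi_b}.items()}

# Pairwise cross-products of the IVB contrasts
names = list(psi_b)
cross = {
    (m, n): dot(psi_b[m], psi_b[n])
    for i, m in enumerate(names)
    for n in names[i + 1:]
}

a1_by_b1 = interaction_coefficients(psi_a["A1"], psi_b["B1"])
print(sums, cross, a1_by_b1)
```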

Step 2: ‘Collect’ your data
Once you have designed your study, you will “collect” sample data which has been obtained for you from
an appropriate population by an independent participant recruitment company (i.e., you are free to
specify any population of interest relevant to your study). Once recruited, participants were allocated to
one cell in the study design such that there was an equal number of participants in each cell of the study.
The data for your assignment is specific to you. You can collect your data via Canvas:
Modules ≫ ANOVA Assignment - Data File and Submission Inboxes
If you do not use the data allocated to you, you will receive a 20-mark penalty.

Step 3: Statistical analyses
Use the appropriate syntax in SPSS to test all main-effect contrasts and all interaction contrasts that follow
from them. Write up the results for your chosen contrasts (including non-significant ones) in the Results
section of your report, ensuring that all appropriate controls for family-wise error rate are taken.
Step 4: Write the report
Write a concise report containing an Introduction, Method, Results, and Discussion. Write it as if for
publication in an APA journal – it should read like a polished and published piece of research rather than
a statistics assignment. If you wish, you can include references (real or fake) and a reference list, but this
is not necessary and will not be assessed. Reports will be evaluated based on specific criteria on the
following page.

==========================================================================

Econ 324
Economic Data Analysis

I. True/False/Uncertain - Briefly explain. No credit without an explanation (7 marks each).
1. Omitted variable bias (OVB) occurs if the excluded variable is correlated with any of the included
variables.
2. The problem with over-fitting is imprecision.
3. With 10 Xs, in the first step of both the forward and backward automatic search procedure, there are
10 regressions.
4. If the confidence interval (CI) for the beta coefficient is (0.98, 1.17), one should fail to reject the null
hypothesis H0: β̃ = 1.
5. A problem with the Linear Probability Model (LPM) is that β̂ ∉ [0, 1].
II. Problem - Use SAS for your computations. You have to show your work. No credit without
an explanation (8 marks each).
1. We are interested in the factors explaining whether consumers make online purchases or not. A
questionnaire was administered with a sample of 435 people. The data are in the file ps1.sas7bdat
in the 324 data folder. The variables are described below:
• Sex: 0=male, 1=female;
• Age - in years;
• Purchase: 1=online purchase made last year, 0=no online purchase made last year;
Variables X1 to X34 are measured on a scale 1 to 7 where 1=strongly disagree and 7=strongly agree
and are the answers to the question: “Indicate to what extent you agree with the following statements”:

1 I always purchase the types of products I want from the Internet
2 There is a high risk for purchasing online
3 Internet retailers encourage me to make suggestions
4 The Internet retailers' websites provide in-depth information to answer my questions
5 I can buy the products that are not available in retail shops through the Internet
6 The website designs of the Internet retailers are aesthetically attractive
7 Online shopping is not as secure as traditional retail shopping
8 I do not feel secure about providing my bank card details to a payment platform
9 I have regular access to a computer
10 Internet shopping offers a wide variety of products
11 Online shopping offers better value for my money compared to traditional retail
shopping
12 Family/friends encourage me to make purchases through the Internet
13 Online shopping allows me to buy the same, or similar products, at cheaper prices than
traditional retailing stores
14 It is easy to receive a personalized customer service from an Internet retailer
15 Online shopping allows me to save money as I do not need to pay transportation costs
16 It is quick and easy for me to complete a transaction through the website
17 It takes only a little time and effort to make a purchase through the Internet
18 I have knowledge about how to make purchases through the Internet
19 I have regular access to the Internet
20 Internet retailers offer good after sales service
21 Internet retailers understand my needs
22 I am very skilled at using the Internet
23 I am not confident that the information I provide to an Internet retailer is not used for
other purposes
24 Internet retailers honour their product guarantees
25 I am not confident that my personal information is protected by an Internet retailer
26 The products I ordered are delivered to me within the time promised by the Internet
retailers
27 The quantity and quality of the products I receive from Internet retailers are exactly the
same as I order
28 Internet shopping saves me time, so I can do other activities
29 The links within the website allow me to move back and forth easily between pages of
the website
30 It is more convenient to shop through the Internet when compared to traditional retail
shopping
31 Internet retailers' websites are easy to navigate
32 I think the Internet offers lower prices compared to retail stores
33 Marketing communication influenced my decision to make purchases through the
Internet
34 The media influenced my decision to make purchases through the Internet

(a) Provide descriptive statistics for the variables “Sex”, “Age” and “Purchase”.
(b) Present in a table the mean and standard deviation of each of the 34 variables X1-X34 ordered
from the highest to the lowest value of the mean. Which of them seem the most important?
(c) Present the means of the 34 Xs for women and men separately. Are there any differences?
(d) Do a Linear Probability Model (LPM) with the variable “Purchase” as a dependent variable and
36 independent variables - 34 Xs, “Sex” and “Age”. Test them for statistical significance at 1%
and interpret the ones that are.
(e) Do a Logistic regression model (Logit) with the same variables as in (d). Test for statistical
significance at 1% and interpret the coefficients that are. Are they different from the ones with
the LPM?
(f) Now do a Logit with only X1-X17 in addition to “Age” and “Sex”. Which of the two models is
preferable - logit with all or with half the Xs in addition to “Age” and “Sex”?
(g) The data file ps1 g.xlsx contains one line - the profile of a potential respondent. Using the
unrestricted Logit model (with all 36 explanatory variables) estimate the probability and the 95%
confidence interval that that person made a purchase last year.
(h) Do the same with the restricted Logit model (with “Age”, “Sex” and X1-X17) - which of the two
predictions is more precise and why?
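
For (g) and (h), a common way to get a confidence interval for a Logit prediction is to build the interval on the linear-predictor scale and then push both ends through the logistic function, which keeps the limits inside [0, 1]. The sketch below shows that transformation in Python for illustration only (the problem set requires SAS). The linear predictor and its standard error are placeholder values, not estimates from ps1.sas7bdat, and the function names are hypothetical.

```python
import math

def logistic(eta):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def predicted_probability_ci(eta, se_eta, z=1.96):
    """Return (probability, lower, upper) from a linear predictor and its SE.

    The interval eta +/- z * se_eta is formed first, then transformed.
    """
    return logistic(eta), logistic(eta - z * se_eta), logistic(eta + z * se_eta)

# Placeholder linear predictor and standard error, NOT model output
p_hat, lower, upper = predicted_probability_ci(eta=0.0, se_eta=0.5)
print(round(p_hat, 4), round(lower, 4), round(upper, 4))
```

A narrower interval for the same respondent indicates a more precise prediction, which is the comparison (h) asks for.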
