October 4

For solutions, purchase a LIVE CHAT plan or contact us

MATH1309/2142

This study uses data on 400 small molecules retrieved from the DrugBank 3.0 database, a unique
cheminformatics resource, as analysed by Hudson et al. (2014, 2017, 2019, 2020).
• The data set contains 9 physico-chemical variables (MW, PSA, log P, log D, ...) and the molecule’s
mode of delivery (oral versus non-oral). See Table 1 below.
Table 1:

In addition, the data set contains new druggability rules developed by Hudson et al.: score functions that
count, for each molecule, the number of violations across the 9 variables. These rules account for the
molecule’s size, permeability etc., but use new cutpoints for each of the 9 molecular parameters (Table 2),
different from those conventionally used by the Food and Drug Administration (FDA) (Lipinski’s rule of
five, Ro5; Table 2). Based on the 9 molecular variables (ADME variables), Hudson et al. found distinct
clusters of molecules, identified as “poor” versus “good” druggables. The data set contains the 9 ADME
variables (Table 1) and a scoring function, denoted score9_LogD, along with the molecule’s mode of
delivery (oral versus non-oral). Note that score9_LogD is a continuous variable with range 0 to 9. It
comprises the 4 traditional parameters of the rule of five (Ro5) (Lipinski, 2016) (Table 1) plus 4 extra
parameters (PSA, number of rotatable bonds, rings, N and O atoms), with 2 candidate measures of
lipophilicity, log P or log D. Log D, the distribution coefficient, has recently been suggested as a
preferable predictor of permeation to Lipinski’s traditional partition coefficient, log P, which is
often used for that purpose.
5 Questions, Total Marks = 315, Worth = 40% of final course grade

We also dichotomise score9_LogD into 2 groups based on a cutpoint of 4 violations:
• Score <= 4: non-violator molecule
• Score > 4: violator (non-druggable) molecule
This is equivalent to: Score9_LogD_group <= 4 (non-violators) versus Score9_LogD_group > 4 (violators)
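
The dichotomisation above can be sketched in a few lines. This is an illustrative Python sketch, not the SAS code the assignment asks for; the function name `violator_group` and the example scores are made up for illustration, and only the cutpoint of 4 comes from the text.

```python
CUTPOINT = 4  # from the text: more than 4 violations marks a violator


def violator_group(score9_logd):
    """Return the group label for one molecule's score9_LogD value."""
    if not 0 <= score9_logd <= 9:
        raise ValueError("score9_LogD is expected to lie between 0 and 9")
    return "violator" if score9_logd > CUTPOINT else "non-violator"


# Hypothetical scores for five molecules (not taken from the DrugBank data).
example_scores = [0, 3.5, 4, 4.5, 9]
example_groups = [violator_group(s) for s in example_scores]
print(example_groups)
```

Note that a score of exactly 4 falls in the non-violator group, which is the boundary case most easily coded the wrong way round.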

Question 1: PCA analysis [85 marks]
Perform the following in SAS (be sure to include your code, outputs and interpretations):

a) Perform a principal component analysis using SAS on the correlation matrix for the 9 ADME
variables. Show your full SAS code and output. (10 marks)
b) Ensure you obtain the following 5 types of plots related to PROC PCA. (All plots should be
placed in clearly labelled Appendices). (10 marks)

• Scree plot
• Profile plot
• Component Pattern plots
• Score plots
• Loading Plots
c) Report the eigenvalues and the eigenvectors. (5 marks)
d) What percentage of the total sample variation and cumulative variation is accounted for by each of
the PCs? (5 marks)
e) Write out the formulation for the PCs. (10 marks)
f) Interpret the PCs via eigenvalues, your component pattern profiles AND your loading plots
from SAS. (10 marks)
g) Label your score plot for PC2 versus PC1 by violator and non-violator status and summarise any
trends and findings. (5 marks)
h) Label your score plot for PC2 versus PC1 by oral status and summarise any trends and findings. (5
marks)
i) Label your score plot for PC3 versus PC2 by violator and non-violator status and summarise any
trends and findings. (5 marks)
j) Label your score plot for PC3 versus PC2 by oral status and interpret any trends and findings. (5
marks)
k) Using BOTH a formal hypothesis test and relevant plots, can the data be effectively
summarized in fewer than 9 dimensions (k < p)? Report k, justify your answer, and establish the
value of k via the relevant hypothesis test. Show your SAS code and formula. (15 marks)
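
As a rough illustration of what parts (a), (c) and (d) involve, the sketch below runs a PCA on a correlation matrix in plain Python. It is not SAS and not a solution: the data are three synthetic variables generated with a fixed seed, and the helper names (`pearson`, `correlation_matrix`, `jacobi_eigen`, `variance_explained`) are assumptions introduced here. The eigen-decomposition uses cyclic Jacobi rotations, one standard method for symmetric matrices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    p = len(columns)
    return [[1.0 if i == j else pearson(columns[i], columns[j])
             for j in range(p)] for i in range(p)]


def jacobi_eigen(matrix, tol=1e-18, max_sweeps=100):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    eigenvectors[k] is the unit-length vector belonging to eigenvalues[k].
    """
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of A and V
                    a[k][p], a[k][q] = (c * a[k][p] - s * a[k][q],
                                        s * a[k][p] + c * a[k][q])
                    v[k][p], v[k][q] = (c * v[k][p] - s * v[k][q],
                                        s * v[k][p] + c * v[k][q])
                for k in range(n):  # rotate rows p and q of A
                    a[p][k], a[q][k] = (c * a[p][k] - s * a[q][k],
                                        s * a[p][k] + c * a[q][k])
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    eigenvalues = [a[i][i] for i in order]
    eigenvectors = [[v[r][i] for r in range(n)] for i in order]
    return eigenvalues, eigenvectors


def variance_explained(eigenvalues):
    """Proportion and cumulative proportion of total variance per component."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative, running = [], 0.0
    for prop in proportions:
        running += prop
        cumulative.append(running)
    return proportions, cumulative


# Synthetic data: x2 is built to correlate with x1, x3 is independent noise.
random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(50)]
x2 = [a + random.gauss(0, 0.5) for a in x1]
x3 = [random.gauss(0, 1) for _ in range(50)]

R = correlation_matrix([x1, x2, x3])
eigenvalues, eigenvectors = jacobi_eigen(R)
proportions, cumulative = variance_explained(eigenvalues)
for k, (ev, prop, cum) in enumerate(zip(eigenvalues, proportions, cumulative), 1):
    print(f"PC{k}: eigenvalue={ev:.3f} proportion={prop:.3f} cumulative={cum:.3f}")
```

Because the analysis is on the correlation matrix, the eigenvalues sum to the number of variables (3 here, 9 in the assignment), so each proportion is simply the eigenvalue divided by p.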

Question 2: PCA with reduced k < p for plots [40 marks]
Using the reduced dimensionality k determined in Question 1 (k), rerun the PCA on the 9 ADME
variables for the violator and non-violator groups separately (where violatory status is defined by
score9_LogD).
a) Recreate the 5 plots related to PROC PCA for your given k.
(All plots should be placed in clearly labelled Appendices) (10 marks)
b) Interpret the PCs via eigenvalues, your component pattern profiles AND your loading plots from
SAS based on your reduced dimensionality k and the k PCs. (15 marks)
c) Label your score plot for PC2 versus PC1 by oral status and summarise any trends and findings. (5
marks)
d) Label your score plot for PC2 versus PC1 by violatory status, summarise any trends and findings.
(5 marks)
e) Which of the k PCs are skewed? Use matrix plots of the PC scores to answer this. (5 marks)

Question 3: DISCRIMINANT ANALYSIS ON 9 ADME VARIABLES BY 2 GROUPS OF
MOLECULES [55 marks]
Aim: to run PROC DISCRIM to investigate how the 9 ADME variables discriminate the violators
from the non-violators.
a) Generate the means, standard deviations, and variance-covariance matrix of the data for the
violators. (5 marks)

b) Generate the means, standard deviations, and variance-covariance matrix of the data for the non-
violators (5 marks)

c) Produce the correlation matrix with associated p values, and a matrix scatterplot of the inputted
data for the violators. (5 marks)
d) Produce the correlation matrix with associated p values, and a matrix scatterplot of the inputted
data for the non-violators. (5 marks)
e) Run SAS DISCRIM and from your resultant outputs answer the following questions.
HINT: Use priors: "violators"=0.30 "non-violators"=0.70. Ensure your output is clearly labelled in
an Appendix. (10 marks)
f) Is Σ1 = Σ2? Justify your answer based on the appropriate test statistic and output from SAS. (5
marks)
g) How is a molecule with X0^T = (MW, LogP, LogD, Hdonors, Hacceptors, PSA, ROT,
NATOM, NRING) = (445.429, -2.7, -3.28938, 8, 12, 207.27, 9, 55, 3) allocated? (10 marks)
h) Report the LDFs obtained from the output and describe what they mean? (5 marks)
i) Show the resultant confusion matrix and interpret it. (5 marks)
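
To make the role of the priors and the allocation rule concrete, here is a minimal Python sketch of the linear discriminant score d(x) = m' S^-1 x - 0.5 m' S^-1 m + ln(prior), where m is a group mean and S the pooled covariance matrix. It assumes equal covariance matrices (the linear rule); if the test in part (f) rejects Σ1 = Σ2, a quadratic rule would be the appropriate one instead. Everything numeric except the priors 0.30 and 0.70 is hypothetical: two made-up variables, made-up means and an identity pooled covariance, not the 9 ADME variables.

```python
import math


def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def linear_discriminant_score(x, mean, cov_inv, prior):
    """d(x) = mean' S^-1 x - 0.5 * mean' S^-1 mean + ln(prior)."""
    w = [sum(cov_inv[i][j] * mean[j] for j in range(2)) for i in range(2)]
    return (sum(wi * xi for wi, xi in zip(w, x))
            - 0.5 * sum(wi * mi for wi, mi in zip(w, mean))
            + math.log(prior))


def allocate(x, groups, pooled_cov):
    """Allocate x to the group with the largest linear discriminant score."""
    cov_inv = invert_2x2(pooled_cov)
    scores = {name: linear_discriminant_score(x, g["mean"], cov_inv, g["prior"])
              for name, g in groups.items()}
    return max(scores, key=scores.get), scores


# Hypothetical two-variable setup; only the priors come from the hint in (e).
POOLED_COV = [[1.0, 0.0], [0.0, 1.0]]
GROUPS = {
    "non-violators": {"mean": [0.0, 0.0], "prior": 0.70},
    "violators": {"mean": [2.0, 0.0], "prior": 0.30},
}

x0 = [1.2, 0.0]
label, scores = allocate(x0, GROUPS, POOLED_COV)
print(label, scores)
```

In this toy setup the point lies closer to the violator mean, yet the larger prior for non-violators moves the boundary enough that it is allocated to the non-violators. With equal priors the same point would go to the violators, which shows why the choice of priors matters when reading the confusion matrix.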

Question 4: STEPWISE DISCRIM ON 4 GROUPS OF MOLECULES [90 marks]
Now perform a stepwise DISCRIM using the oral by violatory status groups defined below.
a) Create the following variable, i.e., an interaction term between oral status and score9_LogD
violation status, with 4 levels as defined below: (5 marks)

b) Obtain a cross-table, in SAS or otherwise, of oral by violatory status for the whole data set. How
many molecules, and what percentage, fall in each of these 4 levels? Along with the table, create an
appropriate histogram. Interpret your results. (10 marks)
c) Generate the means, standard deviations, variance-covariance matrix and correlation
matrix of the ADME data for each of the 4 levels defined by the oral status by violatory status
variable. Interpret your descriptive profiles in terms of how the variables differ across the 4 levels.
(20 marks)
d) Generate matrix plots of the 9 ADME variables for the 4 levels defined by the oral status by
violatory status variable. Interpret how the variables differ in distribution and correlation across the 4
levels. (15 marks)
e) Run a STEPWISE DISCRIM analysis using the 9 ADME variables (Table 1) as the input and the
above 4-level grouping variable, oral status by violatory status. (25 marks)
f) Which variables best discriminate the 4 oral status by violatory status classes? (5 marks)
g) Give the means, variances and correlations of these best discriminating variables across the 4-level
oral status by violatory status variable and interpret trends. (10 marks)
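
The cross-table asked for in part (b) can be sketched as follows. This is a Python illustration only: the level labels, the function names and the eight (oral flag, score9_LogD) pairs are hypothetical, since the assignment's own coding of the 4 levels is not reproduced here. Only the cutpoint of 4 is taken from the brief.

```python
from collections import Counter


def interaction_level(oral, violator):
    """Combine two yes/no flags into one 4-level label (labels are hypothetical)."""
    delivery = "oral" if oral else "non-oral"
    status = "violator" if violator else "non-violator"
    return f"{delivery} / {status}"


def cross_table(records, cutpoint=4):
    """Count and percentage of molecules in each oral-by-violator level."""
    counts = Counter(interaction_level(oral, score > cutpoint)
                     for oral, score in records)
    total = sum(counts.values())
    return {level: (n, 100.0 * n / total) for level, n in counts.items()}


# Hypothetical (oral flag, score9_LogD) pairs, not taken from the data set.
records = [(True, 1), (True, 2), (True, 6), (False, 5),
           (False, 7), (False, 3), (True, 0), (False, 8)]

table = cross_table(records)
for level, (n, pct) in sorted(table.items()):
    print(f"{level:<28}{n:>4}{pct:>8.1f}%")
```

The same counts feed directly into a bar chart of the 4 levels, which is a natural companion to the table.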

Question 5: [45 marks]
a) Run a STEPWISE DISCRIM analysis using your subset of k PCs from Question 2 as the
input variables and the above 4-level grouping variable, oral status by violatory status.
(20 marks)
b) Which PC variables best discriminate between the 4 oral by violatory groups/classes? (5 marks)
c) Give the mean vector, variance-covariance matrix and correlations of these chosen PC
variables for each of the 4 oral by violatory groups/classes and interpret trends. (10 marks)
d) For the PC variables selected by the stepwise discriminant analysis determine the correlation
between them and the original data (i.e., the 9 ADME variables in Table 1). (10 marks)

==========================================================================

Data Analysis & Research Design Evaluation 2

Question 1 (18 marks; 3 marks for each part): You are a practicing health professional and
you are intrigued with the possibility of a link between mental health issues and excessive
‘screen time’. People are commonly found with eyes glued to their phones and other digital
devices that are commonly available these days and there is a surge in mental health related
issues and problems. You have already searched the current literature but unfortunately most
studies had flaws that you could easily identify. You are considering doing your own research
to explore the potential link. One of your fond memories of EPID1000 is the study designs you
learned and now having the opportunity to put that knowledge into practice, really excites you.
You think about six study designs (listed below) and what each would involve: for example,
which aspect of mental health you will specifically focus on, how you will carry out the study,
what sort of study participants you will need for each design, what information should be
collected to investigate this link, what measure of association/effect can be obtained from each
design, what one specific confounding factor applies to each of the six study designs, and how
or why each confounder will affect the findings of your study for that design. No marks will be
given for repeating the same confounding factor for more than one study design.
You will briefly describe the above aspects as one paragraph for each of the following
designs while following the formatting & line-limit instructions provided below.
You can use the review exercise (tutorial 11) as a general guide but address only those aspects
that are listed above. Please also re-visit the weekly epidemiology ilectures on study designs.
a) Cross-sectional study
b) Case Control study
c) Prospective Cohort study.
d) Retrospective Cohort study.
e) Quasi-experimental study.
f) Randomized Controlled Trial.
(No need to worry about the ethics for parts e and f; this is a hypothetical exercise to
assess your understanding of various study designs.)
Q1 Note 1: Please be specific. No marks will be given for writing general design features
from ilectures. We want you to ‘apply’ your knowledge of study designs in light of the above
given scenario/description. Write in direct language, e.g. I will carry out this study by…
Q1 Note 2: Formatting & Line limit: For each of the six parts, please use Times
New Roman size 12 font, normal page borders and a maximum of 10 lines
(10 lines DOES NOT mean ten sentences or ten statements). Please do not
change the margins of this document. Failure to comply will result in a penalty.
There are no formatting requirements for other questions.

Instructions for the Data Analysis Questions listed below:
a. You must provide SPSS output under every question along with your answer, as
No SPSS output = No mark.
b. You must include your assessment of statistical significance for every
question.
c. You will use your own sample of 450 from the provided dataset (how to video &
instructions provided separately).
d. You will NOT test any ASSUMPTION for any question. Please do not waste your time
testing assumptions and/or changing plans accordingly. You may incur a penalty for doing
so. We want you to choose a suitable test for every question as if your data met all required
assumptions.
e. Copy and paste SPSS output under every question along with your written answer on this
template, otherwise we will not mark your answer/s. Upload the completed template ONLY
(no zip folder, no Pages, no other files).
f. You can choose a different sample of 450 for every question or use the same sample
throughout; it is up to you (more info & instructions are in the Submission guidelines &
FAQs provided separately).
NO MARK is awarded:
• If you do not follow ALL the instructions listed above.
• If you use the entire data file provided to you instead of working with your randomly selected
450 cases.
• If SPSS output is missing.
• If you provide SPSS output without your written answer.
Question 2 (3 marks) Do people feel they are better at managing their money after Covid
compared to before Covid? Choose a suitable test to answer this question and provide a short
description of results from your data analysis.
Question 3 (3 marks) Test the hypothesis that there is no significant difference in the pain
scores before and after the therapy. Choose a suitable test to answer this question and provide a
short description of results from your data analysis.
Question 4 (3 marks) Is there any significant difference in agreement with the Pesticides
statement among people who belong to low, middle and high socioeconomic status? Choose a
suitable test to answer this question and provide a short description of results from your data
analysis.
Question 5 (3 marks) Is there a relationship between Depressive symptoms score and
response to the statement about Natural Environment being peaceful? Choose a suitable test to
answer this question and provide a short description of results from your data analysis.
Question 6 (3 marks)
Step 1: Recode Chinese animal zodiac signs into a new variable with two categories:
Group 1: (Combine first six)
Group 2: (Combine 7-12)
Step 2: Now test the hypothesis that there is no significant difference in Self-rated
Health between the two groups. Choose a suitable test to answer this question and
provide a short description of results from your data analysis.
Question 7 (3 marks)
Do the unemployed feel more under Family Pressure than those who are employed?
Choose a suitable test to answer this question and provide a short description of results from
your data analysis.
Question 8 (3 marks)
Test the hypothesis that pain scores after therapy were not significantly different between all
forms of therapy. Choose a suitable test to answer this question and provide a short description
of results from your data analysis.
Question 9 (3 marks) Choose only those cases who Spend 15 hours or more on Emails and
test the hypothesis that they represent a population where average daily resting energy
expenditure is 1350 calories.
Write null and alternative hypotheses.
Choose a suitable test to answer this question and provide a short description of results
from your data analysis.
Question 10 (1 mark) All participants reported having had one recent argument/conflict or even
altercation with someone they knew, someone they were related to, or a
stranger. They were asked to provide the reason for their recent conflict or altercation. Apart
from some serious and genuine reasons, researchers noted many interesting reasons that were
reported by some respondents. Researchers selected the top four most interesting reasons
and coded them as A, B, C and D (Variable: InterestingReasons). Please add a
description/label to each code and then give us a suitable graph. (You can make up whatever
you think these reasons could be, based on your imagination and/or creativity.)
Please provide a suitable graph with a label for each code. (No mark if you provide a suitable
graph but no labels/description.) You do not need to describe the graph itself.

=========================================================================

Suppose you conduct a study to examine the relationship between HDL and some biomarkers. You collect the following measurements from 80 subjects:
age in years
gender (0=female and 1=male)
pulse
systolic BP (mm Hg)
diastolic BP (mm Hg)
HDL
LDL

Specific tasks
1. Use appropriate summary statistics to describe the following variables: GENDER, BMI, DIASTOLIC, PULSE, LDL. Present your results in tabular format. 10 marks

2. Calculate the correlation between all continuous variables. Interpret your results. 20 marks

3. Group AGE into three brackets: “18-25”, “26-45” and “46 and above”. Test the claim that subjects in these AGE brackets have the same mean LDL. 30 marks

4. Test whether DIASTOLIC blood pressure and PULSE rate varied by GENDER. What are the null and alternative hypotheses? 20 marks

5. Use GENDER, AGE, BMI, DIASTOLIC blood pressure, SYSTOLIC blood pressure, and PULSE rate to predict LDL. Interpret the result and present the regression equation. 20 marks
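
Task 3 combines a recoding step with a comparison of group means. The Python sketch below shows one way to do both: it bins ages into the three stated brackets and computes the one-way ANOVA F statistic by hand. The function names and the nine (age, LDL) pairs are hypothetical, ages are assumed to be whole years of at least 18, and the p-value (which needs the F distribution) is deliberately left to the statistics package.

```python
def age_bracket(age):
    """Map an age in whole years to one of the three brackets in Task 3."""
    if age < 18:
        raise ValueError("ages below 18 fall outside the stated brackets")
    if age <= 25:
        return "18-25"
    if age <= 45:
        return "26-45"
    return "46 and above"


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical (age, LDL) pairs, not taken from the study data.
subjects = [(19, 100), (22, 110), (25, 120),
            (30, 110), (40, 120), (45, 130),
            (50, 140), (60, 150), (70, 160)]

ldl_by_bracket = {}
for age, ldl in subjects:
    ldl_by_bracket.setdefault(age_bracket(age), []).append(ldl)

f_stat, df_b, df_w = one_way_anova_f(list(ldl_by_bracket.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The null hypothesis here is that the three bracket means of LDL are equal; a large F relative to the F(df_between, df_within) distribution is evidence against it.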

===========================================================================

EMBA-S22 – Analytics Group Assignment

The file SaltPath.xls contains 60 quarterly observations for the following variables:
Sales – quarterly sales of Salt Path Homemade Pasties (thousands of packets)
Prom – promotion expenditure in the quarter, adjusted for inflation (thousands of pounds)
Adv – advertising expenditure in the quarter, adjusted for inflation (thousands of pounds)
Index – economic index of general economic conditions
TempDev – deviation of average temperature from historical average for the quarter (degrees Celsius)
Q1, Q2, Q3, Q4 – quarterly dummy variables
The first row of data corresponds to a quarter that includes January, February and March.

Questions
1. Use regression to model the sales data. In your report describe your modelling process (40 marks),
and make sure to interpret your final model (40 marks).
2. In Quarter 61, the company will spend £100,000 on promotion, and the same on advertising.
Economists predict that the index is most likely to be 126, but it could be as low as 125 or as high as
129. Raynor has not been able to get a reliable forecast for the deviation of average temperature from
the historical average. What advice would you give regarding the forecasting of sales for Quarter 61?
(20 marks)
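
Because the index forecast is a range and no temperature forecast is available, one reasonable approach to Question 2 is to predict sales over a grid of scenarios rather than at a single point. The Python sketch below only builds the scenario inputs; it does not fit or assume any regression coefficients. The function names and the TempDev values of -1, 0 and 1 are hypothetical placeholders. Prom and Adv are set to 100 because the variables are recorded in thousands of pounds (this ignores any further inflation adjustment), and Quarter 61 is treated as a Q1 quarter because observation 1 is a January to March quarter.

```python
from itertools import product


def quarter_dummies(quarter_number):
    """Q1..Q4 dummies, assuming observation 1 is a January-March quarter (Q1)."""
    position = (quarter_number - 1) % 4 + 1
    return {f"Q{i}": int(i == position) for i in range(1, 5)}


def scenario_grid(index_values, tempdev_values, prom=100.0, adv=100.0,
                  quarter_number=61):
    """One input row per (Index, TempDev) combination for the given quarter."""
    rows = []
    for index, tempdev in product(index_values, tempdev_values):
        row = {"Prom": prom, "Adv": adv, "Index": index, "TempDev": tempdev}
        row.update(quarter_dummies(quarter_number))
        rows.append(row)
    return rows


INDEX_VALUES = [125, 126, 129]      # low, most likely, high (from the brief)
TEMPDEV_VALUES = [-1.0, 0.0, 1.0]   # hypothetical range; no forecast is given

grid = scenario_grid(INDEX_VALUES, TEMPDEV_VALUES)
for row in grid:
    print(row)
```

Feeding each row through the final fitted model gives a spread of sales forecasts, which makes visible how much of the uncertainty comes from the index and how much from the unknown temperature deviation.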
