# MDM4U – Grade 12 Data Management – Exam


Unit 1: One Variable Analysis

Types of Data

• Numerical Data
• Discrete:  consists of whole numbers
• Ie. Number of trucks.
• Continuous: measured using real numbers
• Ie, Measuring temperature.
• Categorical Data: cannot be quantitatively measured
• Nominal: data with no natural order, so any order of presentation makes sense
• Ie, Eye Colour, Hair Colour.
• Ordinal Data: better if sorted or ordered
• Ie, Date and Time, scalar options
• Collecting Data
• Primary: collected by yourself
• Secondary: collected by someone else
• Organizing Data
• Micro Data: information about an individual
• Aggregate Data: grouped data about a group; summarized data.
• Data collection
• Observational Data: group of people by characteristic, then observe
• Group by adult/children then look at sunlight’s effect on them
• Experimental Data: create groups and impose some treatment on them
• Create experimental groups then apply placebo drug treatments on them.
• Other Terms
• Population: entire group of people being studied
• Sample: the part of the population being studied
• Binary Data: only 2 choices/outcomes
• Non-Binary: more than 2 outcomes

Sampling Techniques

Characteristics of a good sample

-Each person must have an equal chance to be in the sample

-Sample must be large enough to represent the population

• Simple Random: each member has equal chance of being selected
• Ie, picking members randomly from apartments
• Sequential Random: go through population sequentially and select members
• Ie, Selecting every 5th person
• Stratified Sampling: a stratum is a group of people who share common characteristics
• Constrains the proportion of members from each stratum in the sample to match the population
• Ie, Each stratum is represented based on its proportion in the population
• Cluster Sampling: random sample of one representative group
• Ie, picking 1 floor of people and survey them
• Multi-Stage Sampling: several levels of sampling
• Ie, Randomly selecting provinces, random cities, then random people.
• Voluntary Response Samples: invite members of the entire population to participate in the survey
• Ie, Sending the survey to everyone in the hotel
• Convenience Sample: easily accessible members are selected
• Ie, Asking the people at the mall who walk closest to you
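A minimal sketch of three of the techniques above, using only the standard `random` module. The population of 100 numbered people and the adult/child split are made up for illustration, and the function names are hypothetical.

```python
import random

# Hypothetical population: 100 people labelled 0..99, split into two strata.
population = list(range(100))
strata = {"adult": population[:80], "child": population[80:]}

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simple_random(pop, k):
    """Simple random: every member has an equal chance of selection."""
    return rng.sample(pop, k)


def sequential(pop, step):
    """Sequential: random starting point, then every `step`-th member."""
    start = rng.randrange(step)
    return pop[start::step]


def stratified(groups, k):
    """Stratified: sample each stratum in proportion to its population share."""
    total = sum(len(g) for g in groups.values())
    return {name: rng.sample(g, round(k * len(g) / total))
            for name, g in groups.items()}


srs = simple_random(population, 10)
seq = sequential(population, 5)
strat = stratified(strata, 10)
```

With 80 adults and 20 children, a stratified sample of 10 takes 8 adults and 2 children, matching the 80:20 split in the population.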

Types of Bias

• Good survey questions are simple, specific, ethical, free of bias, and respect privacy
• Survey questions should avoid jargon, abbreviations, negatives, leading questions, and insensitivity
• Sampling Bias: occurs when the chosen sample doesn’t reflect the population
• Non-Response Bias: occurs when particular groups are under-represented in a survey because they chose not to participate.
• Ie, when respondents don’t respond, it leads the surveyor to make up their own thoughts
• Measurement Bias: occurs when the data collection method consistently under- or overestimates a characteristic of the population
• Leading questions also cause data over/under estimation
• Ie, police radar gun measuring for average speed of the road
• Response Bias: when participants in a survey give false or misleading answers
• Question quality might  lead to response bias
• Ie, A teacher asks class to raise their hands if they have completed their homework

Unit 2: Two Variable Analysis

• Correlation
• Scatter Plots graph data and are used to determine if there is a relation between the 2 variables
• Linear Correlation: changes in one variable tend to be proportional to changes in the other variable
• The stronger the correlation, the more closely the data points cluster around the line of best fit.
• Correlation Coefficient ( r ): a value between -1 and 1 that provides a measure of how closely data points cluster around the line of best fit.
• -1 –  -0.62: negative, strong correlation
• -0.61 –  -0.33: negative, moderate correlation
• -0.32 –  0: negative, weak correlation
• 0 – 0.32: positive, weak correlation
• 0.33 – 0.61: positive, moderate correlation
• 0.62 – 1: positive, strong correlation
• Regression: finding a relationship that models the 2 variables
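As a rough illustration, r can be computed by hand from the deviations of each variable from its mean. The sketch below assumes the usual Pearson formula (these notes do not state one), uses made-up study-hours data, and labels the result with the cut-offs listed above.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe(r):
    """Label r using the cut-offs listed in these notes."""
    direction = "positive" if r >= 0 else "negative"
    size = abs(r)
    if size >= 0.62:
        strength = "strong"
    elif size >= 0.33:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{direction}, {strength}"


# Made-up data: hours studied vs. test mark
hours = [1, 2, 3, 4, 5]
marks = [52, 60, 61, 70, 78]
r = correlation(hours, marks)
label = describe(r)
```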

• Generating lines of best fit and Outliers
• TI-83 Graphing Calculator:
• Turn diagnostics on (2nd, 0, DiagnosticsOn, Enter)
• Enter Data (STAT, 1:edit)
• Graph Data (2nd, y=, Turn Plot 1 on, zoom, 9:zoomStat)
• Equation of line of best fit (STAT, Calc, 4: LinReg(ax+b), Vars, yvars, 1: functions, 1:y1)
• Microsoft Excel
• Enter data
• Highlight data and construct scatterplot (Insert, Charts, Scatter)
• Equation for line of best fit (Chart Tools, Layout, Trend line)
• Fathom
• Enter data (Copy/Type/Open)
• Construct scatterplot (drag variables to axes)
• Equation for line of best fit (Graph, least squares line)
• Show Squares, residual plot to identify outliers
• Determine value of correlation coefficients
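The same least-squares line that the calculator, Excel, and Fathom produce can be sketched in a few lines of Python. The data are made up, with the last point deliberately placed far from the trend so that its residual stands out. The function names are hypothetical.

```python
def line_of_best_fit(xs, ys):
    """Return slope a and intercept b of the least-squares line y = ax + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b


def residuals(xs, ys, a, b):
    """Observed y minus predicted y for each point."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


# Made-up data; the last point is a deliberate outlier
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 20.0]

a, b = line_of_best_fit(xs, ys)
res = residuals(xs, ys, a, b)

# Index of the point with the largest residual (a candidate outlier)
worst = max(range(len(xs)), key=lambda i: abs(res[i]))
```

A large residual only flags a point for a closer look. Whether it is dropped as an outlier is still a judgment call.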

• Cause and Effect
• Cause and Effect
• A change in X causes a change in Y
• Ie. Time and tree trunk diameter
• Common Cause
• An external factor causes two variables to change in the same way
• Ie. Correlation between ski sales, and video rentals
• Where it’s caused by colder weather
• Reverse Cause and Effect
• The dependent and independent variables are reversed in ascertaining which caused which.
• Ie. A correlation between coffee consumption and anxiety is theorized to mean that drinking coffee causes anxiety, but it is found that anxious people drink more coffee
• Accidental Relationships
• A correlation without any causal relationship between the variables
• Ie. An increase in SUV sales coinciding with an increase in chipmunk population
• Presumed Relationship
• A correlation that does not seem to be accidental even though no cause-and-effect or common cause relationship is apparent
• Ie. A correlation between the person’s level of fitness and the number of action movies they watch.

• When analyzing data, we should ask:
• Source: How reliable/current is the source?
• Sample: Does the sample reflect the opinions in the population?
• Was the sampling technique free from bias?
• Graph: Is the graph accurately portrayed? (Axis starting at zero)
• Correlation: Is the correlation between the variables strong enough to make inferences?
• Is the causation assumed just because there is a correlation?
• Are there extraneous variables impacting the results?

• Number Manipulation
• Percentage Points: the arithmetic difference between two percentages, as opposed to a percent change relative to the original value
• Ie. 3 percentage points up from 75% is 78%, which is a relative increase of 3/75*100 = 4%
• Making Numbers Larger: to make numbers seem more impressive, people sometimes restate them on a smaller time scale
• Ie. 2,000,000 iPads sold in the first 3 months can be restated as “about 1 iPad sold every 4 seconds” to sound larger.
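A quick arithmetic check of both ideas, as a sketch. It assumes a rate moving from 75% to 78% and treats "3 months" as 90 days. Both are illustrative choices.

```python
# Percentage points vs. relative percent change
old_rate = 75.0   # percent
new_rate = 78.0   # percent

point_change = new_rate - old_rate                # change in percentage points
relative_change = point_change / old_rate * 100   # change as a percent of the old value

# Rescaling a total into a per-second rate (90 days assumed for "3 months")
units_sold = 2_000_000
seconds = 90 * 24 * 60 * 60
units_per_second = units_sold / seconds
seconds_per_unit = seconds / units_sold
```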

Unit 3: Permutations

• Multiplicative Principles
• Multiplicative Principle: if one operation can be performed in k1 ways, and for each of these a second operation can be performed in k2 ways, and for each of these a third can be performed in k3 ways, and so on,
• then all of the operations together can be performed in k1 x k2 x k3 x ... ways

• Additive Principle: if one action can occur in k1 ways and a second, mutually exclusive action can occur in k2 ways, then there are k1 + k2 ways in which one or the other can occur.
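Both counting principles can be checked by listing every possibility. The sketch below uses a made-up wardrobe: choosing a full outfit multiplies the counts, while choosing a single item that is either a shirt or a pair of pants adds them.

```python
from itertools import product

shirts = ["red", "blue", "green"]   # k1 = 3
pants = ["jeans", "shorts"]         # k2 = 2
shoes = ["boots", "sandals"]        # k3 = 2

# Multiplicative: one shirt AND one pair of pants AND one pair of shoes
outfits = list(product(shirts, pants, shoes))
multiplicative_count = len(shirts) * len(pants) * len(shoes)

# Additive: pick ONE item that is either a shirt or a pair of pants
single_item_choices = shirts + pants
additive_count = len(shirts) + len(pants)
```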

• Methods
• If a result can be counted directly from a set of operations, it is called the Direct Method
• However, if the count is difficult to determine directly, the Indirect Method may be used: count all possibilities, then subtract the ones that are not wanted

• Factorial Notation
• For the following: r ≤ n, where n and r are natural numbers
• n! = n(n-1)(n-2)(n-3)… (n-r+1)(n-r)!
• n!/(n-r)! = n(n-1)(n-2)(n-3)… (n-r+1), the number of permutations of n objects taken r at a time, P(n,r)
• ie. 6! = 6*5*4*3*2*1 = 720
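A short check of the factorial identities, assuming Python 3.8 or later for `math.perm`. The helper `falling_product` is a hypothetical name for the product n(n-1)…(n-r+1) written out term by term.

```python
import math


def falling_product(n, r):
    """n(n-1)(n-2)...(n-r+1): the first r factors of n!."""
    result = 1
    for k in range(n, n - r, -1):
        result *= k
    return result


six_factorial = math.factorial(6)

# Arranging 2 of 6 objects: 6 * 5, which equals 6!/4!
by_product = falling_product(6, 2)
by_division = math.factorial(6) // math.factorial(6 - 2)
by_library = math.perm(6, 2)
```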

• Permutations with some elements alike
• In general, the number of different arrangements of n objects, with k1 alike of one kind, k2 alike of another kind, and so on, is:

n! / (k1! x k2! x ...)

• ie. the word “COOL” has 4 letters with 2 alike (the two O's), so the number of arrangements is:

4! / 2! = 24 / 2 = 12
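The formula can be compared against a brute-force count. The sketch below lists every ordering of the letters, removes duplicates with a set, and compares the result with n! divided by the factorial of each repeat count. The function name is hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def arrangements(word):
    """n! divided by k! for each letter that appears k times."""
    total = factorial(len(word))
    for k in Counter(word).values():
        total //= factorial(k)
    return total


# Brute force: generate all 4! orderings, keep only the distinct ones
brute_force = len(set(permutations("COOL")))
by_formula = arrangements("COOL")
```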

Unit 4: Combinations

• Venn Diagrams
• Venn Diagrams: a number of overlapping circles, each representing its own property. Overlapping areas show values that share the overlapping properties. The centre, where all circles overlap, shows values that share all properties
• Venn Diagrams placed in a rectangle use “S” to denote the universal set
• Operations on Venn Diagrams
• n(A): number of values with property A
• n(A ∪ B): number of values with property A or B (Union)
• n(A ∩ B): number of values with both A and B (Intersection)
• Principle of Inclusion and Exclusion

n(A ∪ B) = n(A) + n(B) – n(A ∩ B)

n(B ∪ C ∪ P) = n(B) + n(C) + n(P) – n(B ∩ C) – n(B ∩ P) – n(C ∩ P) + n(B ∩ C ∩ P)
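Python sets make the principle easy to verify: `|` is union and `&` is intersection. The class lists below are made up for illustration.

```python
# Hypothetical class lists (names are made up)
B = {"Ana", "Ben", "Cam", "Dee"}
C = {"Cam", "Dee", "Eli"}
P = {"Dee", "Eli", "Fay", "Ana"}

# Two sets: add the sizes, subtract the overlap counted twice
two_sets = len(B) + len(C) - len(B & C)

# Three sets: subtract each pairwise overlap, add back the triple overlap
three_sets = (len(B) + len(C) + len(P)
              - len(B & C) - len(B & P) - len(C & P)
              + len(B & C & P))
```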

• Combinations
• Combination: a combination of n distinct objects taken r at a time is a selection of r of the n objects without regard to order.
• Denoted as: C(n,r) or (n r) or nCr or “n choose r”

C(n,r) = n! / [(n-r)!*r!]

where n, r ∈ W (whole numbers), n ≥ r

• If some elements are alike and if at least one item is to be chosen, then the total number of selections from P alike items, Q alike items, R alike items and so on is:

(P+1)(Q+1)(R+1).. -1

• 1 is added to each of P, Q, and R to count the option of choosing none of that kind
• 1 is subtracted at the end to remove the one case where nothing at all is chosen
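A sketch that checks the alike-items formula by brute force, using a made-up basket of 3 identical apples, 2 identical pears, and 1 plum. A selection is described only by how many of each kind are taken. The function name is hypothetical.

```python
from itertools import product
from math import comb


def selections_at_least_one(*counts):
    """(P+1)(Q+1)(R+1)... - 1 for groups of alike items."""
    total = 1
    for c in counts:
        total *= c + 1   # take 0, 1, ..., c items of this kind
    return total - 1     # remove the case where nothing is chosen


# Brute force: every (apples, pears, plums) count, excluding (0, 0, 0)
brute = sum(1 for pick in product(range(4), range(3), range(2)) if any(pick))
by_formula = selections_at_least_one(3, 2, 1)

# For distinct objects, C(n, r) is available directly
five_choose_two = comb(5, 2)
```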

• Properties of Pascal’s Triangle
• it’s symmetrical
• potentially infinite in size
• each number is the sum of the 2 numbers above it to the left and right
• Combinations in the form C(row number, element number) also form Pascal’s Triangle
• Pascal’s Identity: (n , r) = (n-1 , r-1) + (n-1 , r)

• Row n: nC0, nC1, nC2, nC3, nC4, … nCn
• Sum of nth row: 2^n
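The properties above can be checked by building rows from combinations. This sketch assumes rows are numbered from 0, so row n has entries C(n, 0) through C(n, n).

```python
from math import comb


def pascal_row(n):
    """Row n of Pascal's Triangle, with rows numbered from 0."""
    return [comb(n, r) for r in range(n + 1)]


rows = [pascal_row(n) for n in range(6)]

# Pascal's Identity for n = 5, r = 2
identity_holds = comb(5, 2) == comb(4, 1) + comb(4, 2)

# Each row sums to a power of 2
row_sums = [sum(row) for row in rows]
```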

Unit 5: Probability

• Experimental and Theoretical Probability
• Probability: a value between 0 and 1 that describes the likelihood of an occurrence of a certain event.
• Experimental Probability: making predictions based on a large number of previous results.
• Theoretical Probability: Make predictions based on a mathematical model.
• In general, experimental probability will approach theoretical probability as the number of trials increases.
• Discrete Sample Space: a sample space where you can count the number of outcomes ie. blue balls
• Continuous Sample Space: decimal numbers with infinite possibilities ie. Time.
• Event: is the occurrence of a specific outcome in the sample space.

P(A) = n(A) / n(S)

Probability of A is number of outcomes for A over total possibilities

• P(A’) the probability that event A will not occur.
• P(A’) = 1 – P(A)

• Odds
• Odds: a ratio used to represent a degree of confidence in whether or not an event will occur.
• Odds In favour: P(A) : P(A’)
• = n(A) : n(A’)
• Odds Against: P(A’) : P(A)
• = n(A’) : n(A)
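A small sketch converting a probability into odds with exact fractions. The example event (drawing a king from a standard 52-card deck) and the function names are illustrative.

```python
from fractions import Fraction


def probability(n_event, n_sample_space):
    """P(A) = n(A) / n(S)."""
    return Fraction(n_event, n_sample_space)


def odds_in_favour(p):
    """Return (a, b), meaning odds of a : b in favour, from P(A) : P(A')."""
    ratio = p / (1 - p)
    return ratio.numerator, ratio.denominator


def odds_against(p):
    """Return (a, b), meaning odds of a : b against, from P(A') : P(A)."""
    a, b = odds_in_favour(p)
    return b, a


p_king = probability(4, 52)
p_not_king = 1 - p_king
```

For the king example this gives odds of 1 : 12 in favour, the same as n(A) : n(A') = 4 : 48 in lowest terms.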

• Probability using counting principles
• Instead of listing out all possibilities, counting principles such as combinations and permutations can be used to calculate all the possibilities of outcome and the possibilities of the event occurrence.

• Independent and Dependent Event
• Two events are independent if the occurrence of one event has no effect on the occurrence of another event.
• If two events are independent, then P(A ∩ B) = P(A) x P(B)
• In a tree diagram, the probabilities along the branches of a path are multiplied together
• P(AA) = P(A)*P(A)
• ie. When drawing disks from a bag, if the disks are replaced, the 2nd draw will be an independent event.
• ie. When drawing disks from a bag, but the disks are not replaced, the 2nd draw will be a dependent event.
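The disk example can be worked with exact fractions. The bag contents (3 red, 2 blue) are made up. The only difference between the two cases is whether the second draw sees the original bag or a bag with one red disk missing.

```python
from fractions import Fraction

red, blue = 3, 2   # hypothetical bag: 3 red disks, 2 blue disks
total = red + blue

# With replacement: the second draw is independent of the first
p_red_red_with = Fraction(red, total) * Fraction(red, total)

# Without replacement: the second draw depends on the first,
# since one red disk is already gone
p_red_red_without = Fraction(red, total) * Fraction(red - 1, total - 1)
```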

• Mutually Exclusive Events
• Two events are mutually exclusive if when one event occurs, the other event cannot occur.
• If two events are mutually exclusive, then P(A ∪ B) = P(A) + P(B)
• If two events are not mutually exclusive, then P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
• ie Picking a KING and picking a FOUR are mutually exclusive events.
• ie Picking a KING and picking a RED card are not mutually exclusive events.
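Both card examples can be checked by listing a standard 52-card deck and counting. In this sketch the helper names are hypothetical, and each event is a function that answers whether a given card belongs to it.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))   # 52 (rank, suit) pairs


def prob(event):
    """P(A) = n(A) / n(S) over the 52-card deck."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def king(card):
    return card[0] == "K"


def four(card):
    return card[0] == "4"


def red_card(card):
    return card[1] in ("hearts", "diamonds")


# Mutually exclusive: no card is both a king and a four
p_king_or_four = prob(lambda c: king(c) or four(c))

# Not mutually exclusive: two kings are red, so the overlap is subtracted
p_king_or_red = prob(lambda c: king(c) or red_card(c))
p_king_and_red = prob(lambda c: king(c) and red_card(c))
```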

• Conditional Probability
• The probability that an event will occur given that another compatible event has already occurred.
• P(A | B) = P(A and B) / P(B)
• Probability of A given the occurrence of B is equal to the probability of A and B over the probability that B has occurred.
• ie. Probability of drawing a QUEEN if we know the chosen card is a face card is an example of conditional probability.
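The queen example worked through with the formula, as a sketch. It assumes a standard deck in which the face cards are the jacks, queens, and kings (12 cards in total).

```python
from fractions import Fraction

deck_size = 52
face_cards = 12   # J, Q, K in each of the 4 suits
queens = 4        # every queen is also a face card

p_face = Fraction(face_cards, deck_size)
p_queen_and_face = Fraction(queens, deck_size)

# P(queen given face) = P(queen and face) / P(face)
p_queen_given_face = p_queen_and_face / p_face
```

Knowing the card is a face card shrinks the sample space from 52 cards to 12, which is why the result is larger than the unconditional 4/52.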

Unit 6: Probability Distributions

• Basic Probability Distributions
• Random Variable: a variable X whose value is the number of times something happens in a probability experiment; it lets us describe the probability of each possible count
• Probability Distributions can be created in a Table then graphed into a histogram to analyze the probability of each event happening
• Expected Value: Expectation or expected value, E(X), is the predicted average of all possible outcomes of a probability experiment. In essence, it is a weighted mean of all the outcomes.

E(X) = Σ x · P(x), summed over every possible value x

• X: Random variable value, P: Probability of the random variable
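As an illustration, the expected value of one roll of a fair six-sided die, computed as the weighted mean of its outcomes. The die is an assumed example and the function name is hypothetical.

```python
from fractions import Fraction


def expected_value(distribution):
    """Sum of x * P(x) over a dict mapping each value x to P(x)."""
    return sum(x * p for x, p in distribution.items())


# Probability distribution for one roll of a fair die
die = {face: Fraction(1, 6) for face in range(1, 7)}

total_probability = sum(die.values())
e_die = expected_value(die)
```

The result, 7/2 = 3.5, is not a face on the die. The expected value is a long-run average, not a predicted single outcome.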

• Binomial Distribution
• All trials are independent
• Only 2 possible outcomes (Success or failures)
• Probability of success is the same on every trial
• Usually replaceable items
• Binomial Distribution Formula
• P(x) = (nCx) p^x q^(n-x)
• n: number of trials, p: probability of success, q: 1 - p, x: number of successes
• Shortcut Expected Value Formula
• E(x) = np
• n: number of trials, P: probability
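A sketch of the binomial formula for 10 tosses of a fair coin (an assumed example), which also checks the shortcut E(x) = np against the full weighted sum.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)


n, p = 10, 0.5
distribution = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Expected value the long way (sum of x * P(x)) and by the shortcut
mean_long = sum(x * px for x, px in enumerate(distribution))
mean_shortcut = n * p
```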

• Hypergeometric Distributions
• Hypergeometric Distributions are used for sampling without replacement.
• Expected value of the sample should be proportional to the population
• Outcomes are still 2 possibilities (Success or Failures)
• Probabilities are not the same each time
• Dependent Events
• Not replaceable
• Formula for Hypergeometric Distributions
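The formula itself does not appear in these notes. A commonly used form, in notation assumed here, is P(x) = C(a, x) · C(n − a, r − x) / C(n, r), where n is the population size, a is the number of successes in the population, r is the number of draws, and x is the number of successes drawn. The sketch below implements that form for a made-up example: the number of hearts in a 5-card hand.

```python
from fractions import Fraction
from math import comb


def hypergeometric_pmf(x, population, successes, draws):
    """P(exactly x successes) when drawing without replacement.

    Parameter names are this sketch's own, not from the notes.
    """
    return Fraction(
        comb(successes, x) * comb(population - successes, draws - x),
        comb(population, draws),
    )


# Hypothetical example: 5 cards from 52, X = number of hearts (13 in the deck)
dist = [hypergeometric_pmf(x, 52, 13, 5) for x in range(6)]
mean = sum(x * p for x, p in enumerate(dist))
```

The mean works out to 5 × 13/52, so the expected number of hearts in the hand is in proportion to the share of hearts in the deck.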

Unit 7: Continuous Probability Distributions

• Continuous Probability Distributions
• Continuous random variable: a random variable that can take any value within a range (ie city temperature)
• Probability Density Function: a function that describes how likely the random variable is to fall near a given point.
• Height formula (uniform distribution):  height = 1/(b-a), where b is the top of the given range and a is the bottom.

• The Normal Distribution
• used to solve continuous probabilities
• total area under the curve is 1
• standard deviation is the distance from the mean to the point of inflection
• Any normal distribution can be described by its mean and variance, so we often write N(mean, variance) to describe a distribution
• The distribution chart shows area under the graph from the X value to the left end
• Z-Scores can be calculated for any normal distribution
• Z = (x – mean) / standard deviation
• Subtracting the mean centres the distribution at 0, and dividing by the standard deviation rescales it, so any normal distribution can be read from the same standard table.
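A sketch of the z-score calculation using `statistics.NormalDist` from the Python standard library (3.8 or later). The test marks are made up. Note that `NormalDist` takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist

mean, sd = 70, 10   # hypothetical test marks: mean 70, standard deviation 10
x = 85

z = (x - mean) / sd

# Area to the left of z under the standard normal curve (what a z-table lists)
area_left = NormalDist().cdf(z)

# Same area without standardizing first
area_left_direct = NormalDist(mean, sd).cdf(x)

# Area to the right uses the fact that the total area is 1
area_right = 1 - area_left
```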

• Normal Approximation
• Step 1: Check if a normal approximation is appropriate. Test if np > 5 and nq > 5.
• Step 2: Estimate the mean and standard deviation (mean = np, SD = √(npq) )
• Step 3: Estimate the probability using z-score method from above.
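The three steps can be sketched end to end and compared with the exact binomial answer. The values n = 50 and p = 0.4 are made up. The use of 20.5 instead of 20 is a continuity correction, an assumption added in this sketch and not a step listed in these notes.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.4
q = 1 - p

# Step 1: check that a normal approximation is appropriate
appropriate = n * p > 5 and n * q > 5

# Step 2: mean and standard deviation of the approximating normal curve
mean = n * p
sd = sqrt(n * p * q)

# Step 3: estimate P(X <= 20) from the normal curve
# (20.5 is a continuity correction, an assumption of this sketch)
approx = NormalDist(mean, sd).cdf(20.5)

# Exact binomial probability for comparison
exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(21))
```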