Using R for Introductory Statistics by John Verzani

Unlike in many other languages, the period is used only as punctuation. First, the comment character, #, is used to make comments. Anything after the comment character is ignored by R (though hopefully not by the reader).

More important is the assignment to the first entry in the vector typos. This is done with square brackets []. Keep this in mind: parentheses () are for functions, and square brackets [] are for vectors (and later arrays and lists).

In particular, we have the following values currently in typos. The last example is very important. You can take more than one value at a time by using another vector of index numbers. This is called slicing. By inspection, we can notice that pages 2 and 4 are a problem.

Can we do this with R in a more systematic manner? The underscore is being phased out and the equals sign is being phased in. This tests all the values of typos. The 2nd and 4th answer yes (TRUE), the others no (FALSE).

Think of this as asking R a question: is the value equal to 3? Now the question is: how can we get the indices (pages) corresponding to the TRUE values? You are not out of luck, but you will need to work harder. The basic idea is to create a new vector 1, 2, ... that keeps track of the page numbers, and then to slice out just the entries where the answer is TRUE; for this data the result is [1] 2 4. To create the vector 1, 2, ... we used the colon operator. We could have typed this in, but this is a useful thing to know. A more general R function is seq, which is a bit more typing. To produce the above, try seq(a, b, 1). Extracting elements of a vector using another vector of the same size made up of TRUEs and FALSEs is referred to as extraction by a logical vector.
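A minimal sketch of extraction by a logical vector. The typo counts below are illustrative values chosen so that pages 2 and 4 have 3 typos; the variable names is_three, pages and bad_pages are assumptions made for this example.

```r
# Illustrative typo counts per page (assumed values, not the author's data)
typos <- c(0, 3, 0, 3, 1, 0, 0, 1)

# Ask the question "is the value equal to 3?" for every page
is_three <- typos == 3

# Page numbers 1, 2, ..., n built with the colon operator
pages <- 1:length(typos)

# Extraction by a logical vector: keep the pages where the answer is TRUE
bad_pages <- pages[is_three]
print(bad_pages)   # [1] 2 4

# seq(a, b, 1) builds the same sequence as a:b
same_pages <- seq(1, length(typos), 1)
```

The same result could be obtained in one step with which(typos == 3), but building it by hand shows how the logical vector does the selecting.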

Notice this is different from extracting by page numbers by slicing as we did before. Knowing how to use slicing and logical vectors gives you the ability to easily access your data as you desire. Of course, we could have done all the above at once with this command but why?

This is an important point. To appreciate the use of R you need to understand how one composes the output of one function or operation with the input of another. In mathematics we call this composition. Finally, we might want to know how many typos we have, or how many pages still have typos to fix or what the difference is between drafts?

These can all be answered with mathematical functions. First, let's add the next two weeks' worth of data to x. This was 48, 49, 51, 50, 49, 41, 40, 38, 35, 40. We can add this in several ways. All are useful, so let's explain.
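The three ways of adding data can be sketched as follows. The ten starting values of x are assumed for illustration; the ten appended values are the ones listed above.

```r
# Assumed starting data: ten earlier values (illustrative only)
x <- c(45, 43, 46, 48, 51, 46, 50, 47, 46, 45)

# 1. Append with c(): x is rebuilt from its old values plus five new ones
x <- c(x, 48, 49, 51, 50, 49)

# 2. Assign directly to index 16; x had 15 entries, so it grows by one
x[16] <- 41

# 3. Assign to a slice of indices in one step
x[17:20] <- c(40, 38, 35, 40)

length(x)   # 20
```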

Then we assigned directly to the 16th index. At the time of the assignment, x had only 15 indices, so this automatically created another one. Finally, we assigned to a slice of indices. This latter makes some things very simple to do. These may be preferable to some students. All are easy to use. The main confusion is that the variable x needs to be defined previously. Other data entry methods are discussed in the appendix on entering data. Before we leave this example, let's see how we can do some other functions of the data.

Here are a few examples. The moving average simply means to average over some previous number of days. Suppose we want the 5-day moving average (a longer window is more often used). Here is one way to do so. What is the maximum value of the stock? This is easy to answer with max(x). However, you may be interested in a running maximum, or the largest value to date. This too is easy, if you know that R has a built-in function to handle this.
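A minimal sketch of these computations on assumed prices. The moving average below is one straightforward way to do it, not necessarily the author's; the running maximum uses the built-in cummax function.

```r
# Assumed daily prices (illustrative values)
x <- c(45, 43, 46, 48, 51, 46, 50, 47, 46, 45)
n <- length(x)
day <- 5

# Average each run of 5 consecutive days: days 1-5, 2-6, ..., 6-10
moving_avg <- sapply(day:n, function(i) mean(x[(i - day + 1):i]))
moving_avg

max(x)      # largest value overall: 51
cummax(x)   # largest value to date: 45 45 46 48 51 51 51 51 51 51
```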

The function is called cummax, and it takes the cumulative maximum. For example, suppose the yearly number of whales beached in Texas during the period ... to ... is 74 ... 79. What is the mean, the variance, the standard deviation?

First, one needs to remember the names of the functions. In this case mean is easy to guess, var is kind of obvious but less so, and std is also kind of obvious, but guess what? It is not there. So some other things were tried. First, we remember that the standard deviation is the square root of the variance. Of course, it might be nice to have this available as a built-in function. Finally, if we had thought a little harder we might have found the actual built-in sd command. Here is a summary using both slicing and extraction by a logical vector.
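A short sketch tying these together. The yearly counts are assumed for illustration (only the first and last values appear in the text above); the name whale is also an assumption.

```r
# Assumed yearly counts (illustrative values)
whale <- c(74, 122, 235, 111, 292, 111, 211, 133, 156, 79)

mean(whale)         # 152.4
var(whale)
sqrt(var(whale))    # standard deviation as the square root of the variance
sd(whale)           # the built-in function gives the same value

whale[1:3]          # slicing: the first three years
whale[whale > 200]  # extraction by a logical vector: 235 292 211
```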

At your last 6 fill-ups the mileage was Enter these numbers into R. Use the function diff on the data. What does it give? Use the max to find the maximum number of miles between fill-ups, the mean function to find the average number of miles and the min to get the minimum number of miles. Use the function max to find the longest commute time, the function mean to find the average and the function min to find the minimum. Oops, the 24 was a mistake. It should have been How can you fix this?

Do so, and then find the new average. How many times was your commute 20 minutes or more? What percent of your commutes are less than 17 minutes?

How can you answer this with R? Suppose your year has the following monthly amounts: 46 33 39 37 46 30 48 32 49 35 30 48. Enter this data into a variable called bill. Use the sum command to find the amount you spent this year on the cell phone. What is the smallest amount you spent in a month? What is the largest? What percentage was this? Use R to find the minimum value and the maximum value.

Which price would you like to pay? Remember, the way to access entries in a vector is with []. Note, we use X1 to denote the first element of x (which is 0), etc. Find log10(Xi) for each i. (Use the log function, which by default is base e.) 3. Do it all at once. 4. Find the difference between the largest and smallest values of x. (This is the range. You can use max and min or guess a built-in command.)

Section 3: Univariate Data

There is a distinction between types of data in statistics and R knows about some of these differences.

In particular, initially, data can be of three basic types: categorical, discrete numeric and continuous numeric. Methods for viewing and summarizing the data depend on the type, and so we need to be aware of how each is handled and what we can do with it.

Categorical data is data that records categories. Examples could be a survey that records whether a person is for or against a proposition. Or, a police force might keep track of the race of the individuals they pull over on the highway. The U.S. census asks questions of this kind. Again, there was one on race, which included 15 categories with write-in space for 3 more (for this variable you could mark yourself as multi-racial).

The gender or the history of illnesses might be treated as categories. Continuing the doctor example, the age of a person and their weight are numeric quantities. These numbers are usually reported as integers. If one really needed to know precisely, then they could in theory take on a continuum of values, and we would consider them to be continuous. Why the distinction? In data sets, and some tests, it is important to know if the data can have ties (two or more data points with the same value).

For discrete data it is true, for continuous data it is generally not true, that there can be ties. A simple, intuitive way to keep track of these is to ask: what is the mean (average)?

Categorical data

We often view categorical data with tables, but we may also look at the data graphically with bar graphs or pie charts.

Using tables

The table command allows us to look at tables. Its simplest usage looks like table(x), where x is a categorical variable.

Example: Smoking survey

A survey asks people if they smoke or not.

Factors

Categorical data is often used to classify data into various levels or factors. For example, the smoking data could be part of a broader survey on student health issues. R has a special class for working with factors, which is occasionally important to know, as R will automatically adapt itself when it knows it has a factor.
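A minimal sketch of table and of how R adapts once it knows it has a factor. The survey answers are assumed for illustration.

```r
# Assumed survey answers (illustrative values)
x <- c("Yes", "No", "No", "Yes", "No")

table(x)          # counts per category: No 3, Yes 2

f <- factor(x)    # store the answers as a factor
levels(f)         # the distinct categories: "No" "Yes"
summary(f)        # a factor is summarized by counts, not by a mean
```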

To make a factor is easy with the command factor or as.factor.

Bar charts

A bar chart draws a bar with a height proportional to the count in the table.

The height could be given by the frequency, or the proportion. The graph will look the same, but the scales may be different. Suppose a group of 25 people are surveyed as to their beer-drinking preference. The categories were (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import.

The scan command is very useful for reading data from a file or by typing. You type in the data. It stops adding data when you enter a blank row. We divided by the number of data points, which is 25, or length(beer). The result is then handed off to barplot to make a graph.
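A sketch of the two bar charts. The 25 coded answers are assumed for illustration, and they are entered with c() rather than scan() so that the example runs without typing at the prompt.

```r
# Assumed answers from 25 people, coded 1-4 as in the categories above
beer <- c(3, 4, 1, 1, 3, 4, 3, 3, 1, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 1, 1, 1, 4, 3, 1)

counts <- table(beer)
barplot(counts)                  # bar heights are frequencies

props <- counts / length(beer)   # divide by the number of data points
barplot(props)                   # same shape, bar heights are proportions
```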

Notice it has the same shape as the previous one, but the height axis is now between 0 and 1, as it measures the proportion and not the frequency.

Pie charts

The same data can be studied with pie charts using the pie function. This is done with the names function, which allows us to specify names for the categories. The resulting pie chart shows how the names are used. Finally, we added color to the pie chart.

This is done by setting the pie chart argument col. We set this equal to a vector of color names that was the same length as our beer. The help page (?pie) describes the other options. The ability to pass in named values to a function makes it easy to have fewer functions, as each one can have more functionality.

2 Prior to version 1. They are still frequently found in the media. An interesting editorial comment is made in the help page for piechart.

Numerical data

There are many options for viewing numerical data.

First, we consider the common numerical summaries of center and spread. Numeric measures of center and spread: to describe a distribution we often want to know where it is centered and what the spread is.

These are typically measured with mean and variance (or standard deviation), or the median and, more generally, the five-number summary. The R commands for these are mean, var, sd, median, fivenum and summary. This is before being indicted for cooking the books.

For a numeric variable, summary prints out the five number summary and the mean. For other variables, it adapts itself in an intelligent manner.

Some Extra Insight: The difference between fivenum and the quantiles.

You may have noticed the slight difference between the fivenum and the summary command. In particular, one gives 1. What is the difference? The story is below. The median is the point in the data that splits it into half. That is, half the data is above the median and half is below. For example, if our data in sorted order is 10, 17, 18, 25, 28 then the midway number is clearly 18, as 2 values are less and 2 are more.

Whereas, if the data had an additional point, 10, 17, 18, 25, 28, 28, then the midway point is somewhere between 18 and 25, as 3 are larger and 3 are smaller. For concreteness, we average the two values, giving 21.5. Notice, the point where the data is split in half depends on the number of data points. The idea of a quantile generalizes this median. For example, the .25 quantile is the first quartile, called Q1, and the .75 quantile is the third quartile, called Q3. These values are reported by the R function summary.

More generally, there is a quantile function which will compute any quantile between 0 and 1. The median is defined as above. The lower hinge is then the median of all the data to the left of the median, not counting this particular data point (if it is one). The upper hinge is similarly defined. For example, if your data is again 10, 17, 18, 25, 28, 28, then the median is 21.5, the lower hinge is 17 (the median of 10, 17, 18) and the upper hinge is 28 (the median of 25, 28, 28). These are available in the function fivenum, and later appear in the boxplot function.
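The difference between hinges and quartiles can be checked on the six data points used above. This sketch relies on R's default interpolation rule in quantile, which is why the quartiles differ slightly from the hinges.

```r
x <- c(10, 17, 18, 25, 28, 28)

median(x)                     # 21.5
fivenum(x)                    # min, lower hinge, median, upper hinge, max:
                              # 10 17 21.5 28 28
quantile(x, c(0.25, 0.75))    # quartiles by interpolation: 17.25 and 27.25
summary(x)                    # reports the quartiles, not the hinges
```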

Various measures of center and spread have been developed to handle this. The median is just such a resistant measure. It is oblivious to a few arbitrarily large values. That is, if you make a measurement mistake and get 1,000,000 for the largest value instead of 10, the median will be indifferent. Other resistant measures are available. A common one for the center is the trimmed mean.

This is useful if the data has many outliers like the CEO compensation, although better if the data is symmetric. We trim off a certain percentage of the data from the top and the bottom and then take the average. To do this in R we need to tell the mean how much to trim. Again notice how we used a named argument to the mean function.
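A minimal sketch of the trimmed mean. The compensation figures are assumed for illustration; trim is the named argument of mean that sets the fraction removed from each end.

```r
# Assumed compensation figures in millions, with one extreme value
pay <- c(12, 0.4, 5, 2, 50, 8, 3, 1, 4, 0.25)

mean(pay)                # pulled upward by the value 50
mean(pay, trim = 0.2)    # drop the lowest 20% and highest 20%, then average
median(pay)              # resistant: unaffected by how large the 50 is
```

With 10 values and trim = 0.2, two values are removed from each end, so the trimmed mean is the average of the six middle values.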

The variance and standard deviation are also sensitive to outliers. Resistant measures of spread include the IQR and the mad. The IQR, or interquartile range, is the difference of the 3rd and 1st quartile. The mad, or median absolute deviation, finds the median of the absolute differences from the median and then multiplies by a constant.
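Both resistant measures of spread are built in. The sketch below also recomputes the mad by hand; 1.4826 is the default constant used by R's mad function.

```r
x <- c(10, 17, 18, 25, 28, 28)

IQR(x)    # third quartile minus first quartile: 27.25 - 17.25 = 10
mad(x)    # built-in median absolute deviation

# The same calculation step by step
by_hand <- 1.4826 * median(abs(x - median(x)))
by_hand
```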

To compute the mad by hand, subtract the median from each value, take the absolute value, and then find the median of this new set of data. Finally, multiply by the constant. It is easier to do with R than to describe.

Stem-and-leaf Charts

There is a range of graphical summaries of data. If the data set is relatively small, the stem-and-leaf diagram is very useful for seeing the shape of the distribution and the values.

It takes a little getting used to. The number on the left of the bar is the stem, the number on the right the digit. You put them together to find the observation. The function is stem and not stemleaf. The help command will help us find help on the given function or dataset once we know the name. For example, help(stem) or the abbreviated ?stem. Suppose we wanted to break up the categories into groups of 5.
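A minimal sketch of stem on assumed scores. The scale argument is one way to get finer groups; on data like this, scale = 2 splits each stem of ten into two groups of five.

```r
# Assumed scores (illustrative values)
scores <- c(2, 3, 16, 23, 14, 12, 4, 13, 2, 0, 0, 0, 6, 28, 31, 14, 4, 8, 2, 5)

stem(scores)              # default stems
stem(scores, scale = 2)   # roughly twice as many stems
```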

For example, the salaries could be placed into a few broad categories, the last being over 5 million. To do this using R one uses the cut function and the table function. Suppose the salaries are as before. The output of cut is the interval as a factor. This is why the table command is used to summarize the result of cut. Additionally, the names of the levels were changed as an illustration of how to manipulate these.
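A sketch of cut followed by table. The salaries, the break points 0, 1 and 5, and the level names are all assumptions made for this example.

```r
# Assumed salaries in millions (illustrative values)
pay <- c(12, 0.4, 5, 2, 50, 8, 3, 1, 4, 0.25)

# Break into the intervals (0,1], (1,5] and (5,50]
cats <- cut(pay, breaks = c(0, 1, 5, max(pay)))
table(cats)    # 3, 4 and 3 salaries in the three intervals

# Rename the levels (hypothetical names)
levels(cats) <- c("low", "middle", "high")
table(cats)
```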

The most common is similar to the bar plot and is a histogram. The histogram defines a sequence of breaks and then counts the number of observations in the bins formed by the breaks. (This is identical to the features of the cut function.) It plots these with a bar similar to the bar chart, but the bars are touching. The height can be the frequencies, or the proportions.

In the latter case the areas sum to 1, a property that will sound familiar when you study probability distributions. In either case the area is proportional to probability. Suppose the top 25 ranked movies made the following gross receipts for a week (see footnote 4). The first is the default graph, which makes a histogram of frequencies (total counts). The second does a histogram of proportions, which makes the total area add to 1.

This is preferred as it relates better to the concept of a probability density. Note the only difference is the scale on the y axis. A nice addition to the histogram is to plot the points using the rug command. It was used above in the second graph to give the tick marks just above the x-axis. If your data is discrete and has ties, then the rug(jitter(x)) command will give a little jitter to the x values to eliminate ties. Notice these commands opened up a graph window.

The graph window in R has few options available using the mouse, but many using command line options. The basic histogram has a predefined set of break points for the bins. If you want, you can specify the number of breaks or your own break points (figure 4).
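The histogram options discussed above can be sketched as follows. The 25 receipts and the custom break points are assumed for illustration.

```r
# Assumed weekly gross receipts for 25 movies (illustrative values)
x <- c(29.6, 28.2, 19.6, 13.7, 13.0, 7.8, 3.4, 2.0, 1.9, 1.0, 0.7, 0.4, 0.4,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1)

hist(x)                        # frequencies (total counts)
hist(x, probability = TRUE)    # proportions: total area is 1
rug(jitter(x))                 # tick marks, jittered to break ties

hist(x, breaks = 10)           # suggest roughly 10 bins
hist(x, breaks = c(0, 1, 2, 5, 10, 20, max(x)))   # your own break points
```

Note that breaks = 10 is only a suggestion to R, which picks nearby "pretty" break points.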

To do so, you need to know that the median divides the histogram into two equal area pieces, the mean would be the point where the histogram would balance if you tried to, and the IQR captures exactly the middle half of the data.

Boxplots

The boxplot is based on the 5-number summary. In its simplest usage, the boxplot has a box with lines at the lower hinge (basically Q1), the median, the upper hinge (basically Q3), and whiskers which extend to the min and max. To showcase possible outliers, a convention is adopted to shorten the whiskers to a length of 1.5 times the box length.

Any points beyond that are plotted individually. These may further be marked differently if they lie farther out still.

4 Such data is available from movieweb.

Thus the boxplot allows us to check quickly for symmetry (the shape looks unbalanced) and outliers (lots of data points beyond the whiskers).
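A minimal sketch of the whisker convention on assumed data with one extreme value. boxplot.stats returns the numbers that boxplot draws, including the points flagged as outliers.

```r
# Assumed data with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

fivenum(x)                       # 1 3 5.5 8 50: the hinges are 3 and 8
boxplot(x, horizontal = TRUE)    # 50 is drawn as a separate point

# Box length is 8 - 3 = 5, so the upper whisker may reach at most 8 + 1.5 * 5 = 15.5
boxplot.stats(x)$out             # 50
```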

In figure 5 we see a skewed distribution with a long tail.

Example: Movie sales, reading in a dataset

In this example, we look at data on movie revenues for the 25 biggest movies of a given week. The data set here is from the data sets accompanying these notes. Notice, both distributions are skewed, but the gross sales are less so.

R Basics: Reading in datasets with library and data

In the above example we read in a built-in dataset. Doing so is easy. First we need to load the package, and then ask to load the data.

[Figure 6: Current and gross movie sales]

To list all available packages, use the command library(). To read in a dataset, use data, as in the example data(lynx).

You first need to load the package to access its datasets, as in the command library(ts). To find out information about a dataset, you can use the help command to see if there is documentation on the data set. For example, help("lynx") or equivalently ?lynx.

Example: Seeing both the histogram and boxplot

The function simple. The figure shows some examples on some randomly generated data.

The data would be described as bell shaped (normal), short tailed, skewed and long tailed (figure 7). Rather than draw a rectangle for each bin, put a point at the top of the rectangle and then connect these points with straight lines. This is called the frequency polygon. To generate it, we need to know the bins, and the heights. Here is a way to do so with R, getting the necessary values from the hist command. Notice though that the basic information was available to us with the values labeled breaks and counts.
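A sketch of the frequency polygon built from the values that hist returns. The data are assumed for illustration, and the polygon is anchored at height 0 at the outermost break points, which is one common choice.

```r
# Assumed data (illustrative values)
x <- c(0.314, 0.289, 0.282, 0.279, 0.275, 0.267, 0.266,
       0.265, 0.256, 0.250, 0.249, 0.211, 0.161)

tmp <- hist(x)    # draw the histogram and store what it computed

# Connect the bin midpoints at the bin heights, starting and ending at 0
lines(c(min(tmp$breaks), tmp$mids, max(tmp$breaks)),
      c(0, tmp$counts, 0), type = "l")
```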

Densities

The point of doing the frequency polygon is to tie the histogram in with the probability density of the parent population. More sophisticated density functions are available, and are much less work to use if you are just using a built-in function. The built-in data set faithful (help(faithful)) tracks the time between eruptions of the Old Faithful geyser. The R command density can be used to give more sophisticated attempts to view the data with a curve, as the frequency polygon does.

The density function has means to do automatic selection of bandwidth. See the help page for the full description. If we use the default choice it is easy to add a density plot to a histogram.

We just call the lines function with the result from density or plot if it is the first graph. The details of the averaging can be quite complicated, but the main control for them is something called the bandwidth which you can control if desired.
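A minimal sketch of adding density curves to a histogram. It uses the eruptions column of faithful (an assumption about which column to plot), the default bandwidth, the automatic "SJ" selector, and a fixed bandwidth of 0.1 chosen only for illustration.

```r
data(faithful)
x <- faithful$eruptions

hist(x, breaks = 15, probability = TRUE)   # histogram on the density scale
lines(density(x))                          # default bandwidth
lines(density(x, bw = "SJ"), lty = 2)      # automatic bandwidth selection
lines(density(x, bw = 0.1), lty = 3)       # a fixed bandwidth
```

The histogram must be drawn with probability = TRUE so that its vertical scale matches that of the density curves.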

You can also set this to be a fixed number if desired. In figure 9 are 3 examples with the bandwidth chosen to be 0. Notice, if the bandwidth is too small, the result is too jagged, too big and the result is too smooth. Problems 3. Notice choice of bandwidth is very important. Make a stem and leaf plot. Create two different histograms for two different times of defining x as above. Do you get the same histogram?

Which of these data sets is skewed? Which has outliers, which is symmetric. Try to predict the mean, median and standard deviation. Check your guesses with the appropriate R commands.

Make a table of the possible categories. Try to find the mean. You might need to try mean(x, na.rm=TRUE). Make a histogram. Is it surprising? Can you do it for all 10 digits? Try to use R to produce a similar figure.

Section 4: Bivariate Data

The relationship between 2 variables is often of interest. For example, are height and weight related? Are age and heart rate related?

Are income and taxes paid related? Is a new drug better than an old drug? Does the weather depend on the previous day's weather? Exploring and summarizing such relationships is the current goal.

Handling bivariate categorical data

The table command will summarize bivariate data in a similar manner as it summarized univariate data. Suppose a student survey is done to evaluate if students who smoke study less. The data recorded is

Person  Smokes  Amount of studying
1       Y       less than 5 hours
2       N       5 - 10 hours
3       N       5 - 10 hours
4       Y       more than 10 hours
5       N       more than 10 hours
6       Y       less than 5 hours
7       Y       5 - 10 hours
8       Y       less than 5 hours
9       N       more than 5 hours
10      Y       5 - 10 hours

We can handle this in R by creating two vectors to hold our data, and then using the table command.
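The survey above can be entered as two vectors and cross-tabulated. The coding 1, 2, 3 for the study amounts is an assumption made for this sketch, and person 9 is coded as 3 (more than 10 hours) since "more than 5 hours" is not one of the three categories.

```r
smokes <- c("Y", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")

# Assumed coding: 1 = less than 5 hours, 2 = 5 - 10 hours, 3 = more than 10 hours
amount <- c(1, 2, 2, 3, 3, 1, 2, 1, 3, 2)

tab <- table(smokes, amount)
tab    # rows are N and Y, columns are the three study amounts
```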

For example, what proportion of smokers study 5 hours or less? The command prop.table will compute this for us. It needs to be told the table to work on, and a number to indicate if you want the row proportions (a 1) or the column proportions (a 2); the default is to just find proportions.

For the smoking example, you could plot the amount variable for each of No or Yes, or the No and Yes variable for each level of smoking. In either case, you can use a barplot. We simply call it in the appropriate manner. Essentially, barplot plots each row of data.
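A sketch of the proportions and the two barplots for the smoking survey, using the same assumed coding of the study amounts (1, 2, 3) as in the table above.

```r
smokes <- c("Y", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
amount <- c(1, 2, 2, 3, 3, 1, 2, 1, 3, 2)   # assumed coding of the study amounts
tab <- table(smokes, amount)

prop.table(tab, 1)    # row proportions: each row sums to 1
prop.table(tab, 2)    # column proportions: each column sums to 1
prop.table(tab)       # all entries together sum to 1

barplot(tab)                                         # one bar per study amount
barplot(t(tab), beside = TRUE, legend.text = TRUE)   # one group of bars per smoking answer
```

From the row proportions, half of the smokers (3 of 6) fall in the lowest study category.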

The barplot argument legend.text adds a legend. You can change the names, but the default of legend.text=TRUE is the simplest.

Some Extra Insight: Conditional proportions

You may also want to know about the conditional proportions. For example, among the smokers, what are the proportions? To answer this, we need to divide the second row by 6. One or two rows is easy to do by hand, but how do we automate the work?

The function apply will apply a function to rows or columns of a matrix. In this case, we need a function to find the proportions of a vector.
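A minimal sketch of apply with a small helper function. The name prop and the coding of the study amounts are assumptions made for this example.

```r
smokes <- c("Y", "N", "N", "Y", "N", "Y", "Y", "Y", "N", "Y")
amount <- c(1, 2, 2, 3, 3, 1, 2, 1, 3, 2)   # assumed coding of the study amounts
tab <- table(smokes, amount)

# Proportions of a vector
prop <- function(x) x / sum(x)

apply(tab, 1, prop)       # applied to each row; results come back as columns
t(apply(tab, 1, prop))    # transposed to match the layout of the table
```

The transposed result agrees with prop.table(tab, 1).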

A simple example might be in a drug test, where you have data (in suitable units) for an experimental group and for a control group.

experimental: 5 5 5 13 7 11 11 9 8 9
control: 11 8 4 5 9 5 10 5 4 10

You can summarize the data separately and compare, but how can you view the data together?

A side by side boxplot is a good place to start. More on this syntax will appear in the section on multivariate data.
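A sketch of the side-by-side boxplot for the two groups above, drawn two ways. The names amount and category in the second version are assumptions made for this example.

```r
x <- c(5, 5, 5, 13, 7, 11, 11, 9, 8, 9)     # experimental
y <- c(11, 8, 4, 5, 9, 5, 10, 5, 4, 10)     # control

# Two vectors, one box each
boxplot(x, y, names = c("experimental", "control"))

# The same picture from one numeric vector plus a grouping variable,
# using the model formula syntax
amount <- c(x, y)
category <- rep(c("experimental", "control"), each = 10)
boxplot(amount ~ category)
```

In the formula version the boxes are ordered alphabetically by group name, so control appears first.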

Bivariate data: numerical vs. numerical

If the two variables are thought to be independent samples you might like to compare their distributions in some manner. However, if you expect a relationship between the variables, you might like to look for that by plotting pairs of points.

Comparing two distributions with plots

If we wish to compare two distributions, we can do so with side-by-side boxplots. However, we may wish to compare histograms or some other graphs to see more of the data.

Here are several different ways to do so.

Side by side boxplots with rug

By using the rug command we can see all the data. It works best with smallish data sets (otherwise use the jitter command to break ties). This puts the two data sets on the same scale so they can sensibly be compared. If you make this boxplot, you will see that the two distributions look quite a bit different. The full dataset homedata will show this even more.

Using stripcharts or dotplots

The stripchart (a dotplot) will plot all the data in a way that makes it relatively easy to compare the distributions. This is hard to do with histograms. The function simple. For example, the height of a father compared to his son's height. The plot command will gladly display two variables in a scatterplot.

Example: Home data

The home data example of the previous section shows old assessed value versus new assessed value. There should be some relationship. This should be available as a data set through the command data.

R Basics: What does attaching do?

You may have noticed that when we attached home and homedata we have the same variable names: old and new. What exactly does attaching do? When you ask R to use a value of a variable or a function it needs to find it. By attaching a data frame, you put the names into the second environment searched (the name of the data frame is in the first).

These are masked by any variables which already have the same name. There are consequences to this to be aware of. First, you might be confused about which variable you are using. For example, we create a data frame df below with variables x and y. We see in these examples relationships between the data. Both were linear relationships.

The modeling of such relationships is a common statistical practice. It allows us to make predictions of the y variable based on the value of the x variable. Linear regression.

Linear regression is the name of a procedure that fits a straight line to the data. The idea is that the x value is something the experimenter controls, the y value one the experimenter measures.

The line is used to predict the value of y for a known value of x. The variable x is the predictor variable and y the response variable. Writing the fitted line as y = b0 + b1 x, the method of least squares is used to choose the values of b0 and b1 that minimize the sum of the squares of the residual errors. The abline function draws lines on the current graph window and is generally a useful function.

The line it draws comes from the lm function. This is the function for a linear model. Its argument, written y ~ x, uses the model formula syntax of R, which can be tricky but is fairly straightforward in this situation. As an alternative to the above, the function simple. This can also be done with the simple. Continuing the above example simple. The normal plot will be explained later. The lower left is a histogram of the residuals. For this data, we see a possible outlier that deserves attention.
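The basic fit can be sketched on assumed data as follows. Only base R functions are used (plot, lm, abline, coef, resid); the numbers are illustrative.

```r
# Assumed paired measurements (illustrative values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

plot(x, y)           # scatterplot
fit <- lm(y ~ x)     # model y as a linear function of x
abline(fit)          # add the fitted line to the scatterplot

coef(fit)            # intercept b0 and slope b1
resid(fit)           # residuals: observed y minus fitted y
```

The least squares slope equals cov(x, y) / var(x), and the residuals of a line fitted with an intercept always sum to zero.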

This data set has a few typos in it. To access residuals directly, you can use the command resid on your lm result. The correlation coefficient, R, measures how one variable varies as the other does. Values of R^2 close to 1 indicate a strong linear relationship, values close to 0 a weak one.
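A minimal sketch of the Pearson correlation and of its rank-based Spearman counterpart, on assumed data that are perfectly monotone but not linear. The helper name cor.sp is hypothetical.

```r
# Assumed data: y increases with x, but not along a straight line
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- x^2

cor(x, y)                        # Pearson: below 1, the relation is not linear
cor(x, y)^2                      # R^2
cor(rank(x), rank(y))            # Spearman: Pearson applied to the ranks
cor(x, y, method = "spearman")   # the same value from the built-in option

# A small helper for the rank correlation
cor.sp <- function(x, y) cor(rank(x), rank(y))
cor.sp(x, y)                     # 1, since the ranks agree exactly
```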

There still may be a relationship, just not a linear one. The Spearman rank correlation is the same thing, only applied to the ranks of the data. The rank of a data set is simply another vector giving the relative rank in terms of size. The trend need not be linear. As a reminder, you can make a function to do this calculation for you. Some important ones allow us to identify and locate points on the graph.

Example: Presidential Elections: Florida

Consider this data set from the United States presidential election in the state of Florida.

We wish to investigate the relationship between the number of votes for Bush against the number of votes for Buchanan. How can we identify these points? One way is to search through the data to find these values.

This works fine for smaller data sets; for larger ones, R provides a few useful functions: identify, to find the index of the closest (x, y) coordinates to the mouse click, and locator, to find the (x, y) coordinates of the mouse click. Further discussions of this data, of a more substantial nature, may be found on several web sites. County 50 is not surprisingly Miami-Dade county, the home of the infamous (well, maybe) butterfly ballot that caused great confusion among the voters. One way to answer this is to find the regression line for the data without this data point and then to use the number of Bush votes to predict the number of Buchanan votes.
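A sketch of this approach on assumed data (not the Florida data): fit the line without one unusual point, then use that line to predict the value at the dropped point. Dropping an element uses a negative index.

```r
# Assumed data with one unusual point at index 5
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2, 4, 6, 8, 40, 12, 14, 16)

fit_all  <- lm(y ~ x)            # line through all the data
fit_drop <- lm(y[-5] ~ x[-5])    # x[-5] is all but the 5th element

coef(fit_all)
coef(fit_drop)

# Predict the dropped point from the line fitted without it
predicted <- coef(fit_drop)[1] + coef(fit_drop)[2] * x[5]
y[5] - predicted                 # how far the observed value is from the prediction
```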

Eliminating one point from a data vector can be done with fancy indexing, by using a minus sign: BUSH[50] is the 50th element, BUSH[-50] is all but the 50th element. How much difference does this make? Well the regression line predicts the value for a given x. This difference is much larger than the statewide difference that gave the U. Some Extra Insight: Using simple. Resistant regression This example also illustrates another important point.

That is, like the mean and standard deviation the regression line is very sensitive to outliers. Since we already have the equation for the line without the point, the simplest way to do so is to first draw the line for all the data, and then add in the line without Miami-Dade.

This is done with the abline function. There are various ways to create a resistant regression line. In R there are two in the package MASS that are used in a manner similar to the lm function but not the simple.

The function lqs works with a simple principle by default. Rather than minimize the sum of the squared residuals for all residuals, it does so for just a percentage of them. The rlm function uses something known as an M-estimator.

Both give similar results, but not identical. We will plot both the regular regression line and the resistant regression line. We also illustrate how to change the line type (lty) and how to include a legend with legend. As well, you may plot the resistant regression line for the data with and without the outliers; you will find, as expected, that the lines are the same.
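A sketch comparing the ordinary and resistant fits on assumed data with one wild outlier. It assumes the MASS package is installed (it ships with standard R distributions); the data, the line types and the legend position are choices made for this example.

```r
library(MASS)    # provides lqs and rlm

# Assumed data: roughly y = 2x + 1, with the last value replaced by an outlier
x <- 1:10
y <- 2 * x + 1 + c(0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.3, 0.1, 0)
y[10] <- 100

set.seed(1)                # lqs may use random subsampling on larger data
fit_lm  <- lm(y ~ x)       # ordinary least squares
fit_lqs <- lqs(y ~ x)      # resistant: fits only a portion of the residuals
fit_rlm <- rlm(y ~ x)      # resistant: M-estimator

plot(x, y)
abline(fit_lm,  lty = 1)
abline(fit_lqs, lty = 2)
abline(fit_rlm, lty = 3)
legend("topleft", legend = c("lm", "lqs", "rlm"), lty = 1:3)
```

The ordinary regression line is pulled toward the outlier, while the two resistant lines stay close to the trend of the other nine points.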

R Basics: Plotting graphs using R

In this section, we used the plot command to make a scatterplot and the abline command to add a line to it. There are other ways to manipulate plots using R that are useful to know. It helps to know that R has different functions to create an initial graph and to add to an existing graph.

Creating new plots with plot and curve.

The plot function will plot points as already illustrated. In addition, it can be told to plot the points and connect them with straight lines. These commands will plot a parabola.

Adding to a graph with points, abline, lines and curve.
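A minimal sketch of both steps: creating a plot with plot or curve, and then adding to it with points, lines, abline and curve. The specific points and lines added are arbitrary choices for illustration.

```r
# Creating new plots
x <- seq(0, 4, by = 0.1)
plot(x, x^2, type = "l")    # points connected by straight lines: a parabola
curve(x^2, 0, 4)            # a new plot of the same function

# Adding to the existing plot
points(2, 4)                         # add a single point
lines(c(0, 4), c(0, 16), lty = 2)    # connect the given points with a straight line
abline(h = 4, lty = 3)               # add a horizontal line at height 4
curve(4 * x, add = TRUE, lty = 4)    # add another curve to the same window
```

plot and curve (without add = TRUE) start a fresh graph, while points, lines, abline and curve(..., add = TRUE) draw on the current one.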

We can add to the existing graph window with several different functions. To add points we use the points command which is similar to the plot command. The lines function is used to add more general lines. It plots the points specified and connects them with straight lines. To illustrate, if we have the dataset mileage 0 4 8 12 16 20 24 28 32 tread wear Then the regression line has intercept and slope Suppose the answers to the first 3 questions are given in this table Student Ques.

Make a table of the results of question 1 and question 2 separately. Make a contingency table of questions 1 and 2. Make a stacked barplot of questions 2 and 3. Make a side-by-side barplot of all 3 questions. You can use tables, barplots, scatterplots etc. The relationship between manufacturer and shelf 2. The relationship between fat and vitamins 3. Use the cor to find the Pearson and Spearman correlation coefficients. Are they similar?

Plot the data using the plot command and see if you expect them to be similar. You should be unsatisfied with this plot. Next, plot the logarithm log of each variable and see if that makes a difference. Use old as the predictor variable. Does the data suggest a linear relationship? Are there any outliers? What may have caused these outliers? 4. Buchanan, there is another obvious outlier that indicated Buchanan received fewer votes than expected.

If you remove both the outliers, what is the predicted value for the number of votes Buchanan would get in Miami-Dade county based on the number of Bush votes? Identify the outlier and find the regression lines with this point, and without this point. Find the correlation coefficient both Pearson and Spearman between age and weight.

Repeat for the relationship between height and weight. Make scatter plots of each pair and see if your answer makes sense. Make a scatterplot with regression line using R. Load the data set with data(mtcars) and try to answer the following: 1.

What are the variable names? Try names(mtcars). Which car has this? What are the first 5 cars listed? Make a scatterplot of cylinders cyl vs. Fit a regression line. Is this a good candidate for linear regression? Use R to generate a similar figure.

Section 5: Multivariate Data

Getting comfortable with viewing and manipulating multivariate data forces you to be organized about your data. R uses data frames to help organize big data sets and you should learn how to as well.

Storing multivariate data in data frames

Often in statistics, data is presented in a tabular format similar to a spreadsheet.

The columns are for different variables, and each row is a different measurement or variable for the same person or thing. For example, the dataset home which accompanies these notes contains two columns, the assessed value of a home and the year assessed value for the same home. R uses data frames to store these variables together and R has many shortcuts for using data stored this way.

If you are using a dataset which is built into R or comes from a spreadsheet or other data source, then chances are the data is available already as a data frame. You can make your own data frames, of course, and may need to.

To make data into a data frame you first need a data set that is an appropriate candidate: it will fit into a rectangular array. If so, then the data.frame command will combine the variables into a data frame. Different names are possible if desired. You can give the rows names as well. Suppose the subjects were Mary, Alice, Bob and Judy, then the row.names command will assign these names to the rows.

Accessing data in data frames

The study data frame has three variables.
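A minimal sketch of building such a data frame follows. The variable names (weight, height, gender) and all of the numbers are made up for illustration; only the subject names come from the text.

```r
# Illustrative values only: these numbers are invented for the sketch
weight <- c(150, 135, 210, 140)
height <- c(65, 61, 70, 65)
gender <- c("Fe", "Fe", "M", "Fe")

# Combine equal-length vectors into a rectangular data frame
study <- data.frame(weight, height, gender)

# Give the rows the subjects' names
row.names(study) <- c("Mary", "Alice", "Bob", "Judy")
print(study)
```

Different column names can be supplied in the call itself, for example data.frame(w = weight, h = height, g = gender).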

To access the data it helps to know that data frames can be thought of as lists or as arrays and accessed accordingly.

To access as an array

An array is a way of storing data so that it can be accessed with a row and column. It is like a spreadsheet, except that technically the entries must all be of the same type, and an array can have more dimensions than just rows and columns.

Data frames are arrays as they have columns which are the variables and rows which are for the experimental unit. Thus we can access the data by specifying a row and a column. To access an array we use single brackets [row,column]. In general there is a row and column we can access. By letting one be blank, we get the entire row or column. A list is a set of objects, each of which can be any other object. A data frame is a list, where the objects are the columns as vectors.
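The two styles of access can be sketched as follows. The study data frame is rebuilt here with made-up values so that the example runs on its own; the column names are assumptions for illustration.

```r
# Hypothetical data, invented for this sketch
study <- data.frame(
  weight = c(150, 135, 210, 140),
  height = c(65, 61, 70, 65),
  gender = c("Fe", "Fe", "M", "Fe"),
  row.names = c("Mary", "Alice", "Bob", "Judy")
)

# Array-style access with single brackets: [row, column]
one_entry <- study[1, 2]          # row 1, column 2 (Mary's height)
bob_row   <- study["Bob", ]       # column left blank: the entire row
weights_a <- study[, "weight"]    # row left blank: the entire column

# List-style access: each column is a component of the list
heights   <- study$height
weights_l <- study[["weight"]]

# A logical vector in the row position picks out matching rows
females <- study[study$gender == "Fe", ]
print(females)
```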

To get just the females' information. There are 3 groups: a control and two treatments. For each group, weights are recorded. The data is generated this way, by recording a weight and group for each plant. However, you may want to plot boxplots for the data broken down by their group. How to do this? The unstack function will do this all at once for us. If the data is structured correctly, it will create a data frame with variables corresponding to the levels of the factor. It breaks the weight variable down by values of the group factor and hands this off to the boxplot command.
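The unstack step can be sketched with R's built-in PlantGrowth data set, which has a weight column and a three-level group factor (one control, two treatments). That this is the data set the passage has in mind is an assumption, since the passage does not name it.

```r
# Assumption: PlantGrowth (built into R) stands in for the plant data described
data(PlantGrowth)

# Break weight down by the levels of group: one column per level
wide <- unstack(PlantGrowth, weight ~ group)
print(names(wide))

# Side-by-side boxplots, one per group
boxplot(wide)

# The same picture using the model formula directly
boxplot(weight ~ group, data = PlantGrowth)
```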

That is, break weight down by the values of group. When there are two variables involved things are pretty straightforward. When there are more than two predictor variables things get a little confusing. In particular, the usual mathematical operators do not do what you may think. Here are a few different possibilities that will suffice for these notes. The formula is interpreted differently by the boxplot command than by the lm command. Also notice that the usual mathematical meanings are available, but need to be included inside the I function.
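A short sketch of how operators behave inside a model formula, using simulated data. The variable names x1, x2 and y are made up for the illustration.

```r
set.seed(1)
x1 <- runif(50)
x2 <- runif(50)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(50, sd = 0.1)

# In a formula, + means "also include this predictor", not addition
fit_two <- lm(y ~ x1 + x2)        # intercept plus two slopes

# Wrapping in I() restores the arithmetic meaning
fit_sum  <- lm(y ~ I(x1 + x2))    # intercept plus one slope, on the sum x1 + x2
fit_quad <- lm(y ~ x1 + I(x1^2))  # a quadratic term in x1

print(coef(fit_two))
print(coef(fit_sum))
```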

Ways to view multivariate data

Now that we can store and access multivariate data, it is time to see the large number of ways to visualize the datasets. If w, x, y, z are 4 variables, then the command table(x, y) creates a two-way table, and table(x, y, z) creates two-way tables of x versus y for each value of z. Finally, table(x, y, z, w) will do the same for each combination of values of z and w.

If the variables are stored in a data frame, say df, then the command table(df) will behave as above with each variable corresponding to a column in the given order. See the appendix for more information on these. See the commands xtabs and ftable for more sophisticated usages. First you need to run your data through the table command or something similar.
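The table calls described above can be tried on a small made-up data frame. The data and the column names x, y and z are invented for the sketch.

```r
# Made-up categorical data, for illustration only
df <- data.frame(
  x = c("a", "b", "a", "b", "a", "b"),
  y = c("u", "u", "v", "v", "u", "v"),
  z = c("p", "p", "p", "q", "q", "q")
)

t2  <- table(df$x, df$y)          # one two-way table
t3  <- table(df$x, df$y, df$z)    # an x-by-y table for each value of z
tdf <- table(df)                  # same as t3, with columns taken in order

print(t2)
print(ftable(tdf))                # a flattened, easier-to-read layout
```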

The barplot command plots each column as a variable just like a data frame. The output of table when called with two variables uses the first variable for the row. The command boxplot(x, y, z) will produce the side by side boxplots seen previously.
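A minimal sketch of both commands, using invented data. The names grade and section are hypothetical.

```r
# Made-up categorical data
grade   <- c("A", "B", "A", "C", "B", "A")
section <- c("1", "1", "2", "2", "2", "1")

# The first variable (grade) supplies the rows, the second the columns
tab <- table(grade, section)

barplot(tab)                                      # one stacked bar per column
barplot(tab, beside = TRUE, legend.text = TRUE)   # bars side by side instead

# Side-by-side boxplots from separate numeric vectors
set.seed(2)
x <- rnorm(30)
y <- rnorm(30, mean = 1)
z <- rnorm(30, mean = 2)
bp <- boxplot(x, y, z)
```

To get one bar per grade instead, transpose the table first with t(tab).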

The latter using the model formula notation.

Example: Boxplot of samples of random data

Here is an example, which will print out 10 boxplots of normal data with mean 0 and standard deviation 1. This uses the rnorm function to produce the random data.

The factor f is built by repeating 1 through 10 enough times to make a factor of the same length as the data. When the model notation is used, the boxplot of the y data is done for each level of the factor f.
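The example can be sketched as below. The sample size of 1000 and the repeat count of 100 are assumptions chosen so that each of the 10 levels gets the same number of observations.

```r
set.seed(3)

# Assumption: 1000 standard normal values in total, 100 per level
y <- rnorm(1000)
f <- factor(rep(1:10, 100))   # 1, 2, ..., 10 repeated to match length(y)

# One boxplot of y for each level of f
boxplot(y ~ f, main = "Boxplot of normal random data with model notation")
```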

That is, for each value of y when f is 1 and then 2 etc. It plots the actual data in a manner similar to rug which is used with histograms. For example, as above, we will generate 10 sets of random normal numbers. Only this time each will contain only 10 random numbers. Both use the empirical density found by the density function to illustrate a variable's distribution.
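A plot of the raw data points for each group, in the spirit of the 10 sets of 10 random numbers mentioned above, might look like the sketch below. The use of base R's stripchart here is an assumption; the function's name does not survive in the text above.

```r
set.seed(4)

# 10 groups with 10 standard normal values each
x <- rnorm(100)
f <- factor(rep(1:10, 10))

# Assumption: stripchart is the function being described.
# It draws every observation, one strip per level of f.
stripchart(x ~ f)
```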

A violinplot is very similar to a boxplot, only the box is replaced by a density which is given a mirror image for clarity. A densityplot plots several densities on the same scale. Multiple histograms would look really awful, but multiple densities are manageable. As an illustration, we show all three for the same dataset in the figure. The density plot looks a little crowded, but you can clearly see that there are two different types of distributions being considered here.
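The idea of several densities on one scale can be approximated with base R alone. This is a sketch using density and lines, not the violinplot or densityplot functions the notes refer to, and the three samples are made up.

```r
set.seed(5)

# Made-up samples: two normal, one exponential
groups <- list(a = rnorm(100), b = rnorm(100, mean = 1), c = rexp(100))

# Estimate each empirical density
dens <- lapply(groups, density)

# Common axis limits so every curve fits on the same plot
xr <- range(sapply(dens, function(d) range(d$x)))
yr <- range(sapply(dens, function(d) range(d$y)))

plot(dens[[1]], xlim = xr, ylim = yr, main = "Densities on a common scale")
for (i in seq_along(dens)[-1]) lines(dens[[i]], lty = i)
legend("topright", legend = names(dens), lty = seq_along(dens))
```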

Notice that we use the functions in an identical manner to the boxplot.

Calling abs makes all the values non-negative, and sum reduces the result to a single number, which is then divided by the length.
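The abs, sum and length steps can be sketched as follows. The quantity computed here, the mean absolute deviation from the mean, is an assumption, since the text does not say what is passed to abs; the data are made up.

```r
x <- c(1, 2, 3, 4, 5)   # made-up data

# Assumption: the deviations from the mean are what abs is applied to
deviations <- x - mean(x)
mean_abs_dev <- sum(abs(deviations)) / length(x)
print(mean_abs_dev)
```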

Volvo 2. The ordering of the labels should match the following: sort unique as. B , as the unary! That is, comparisons are done character by character until a tie is broken. The comparison of characters varies due to the locale.

The percentage difference is found by dividing by x[-10] and multiplying by 100. Recall that x[-10] is all but the tenth (10th) number of x. One could improve it by only looking at integer factors less than or equal to the square root of x. A check shows it is We subtracted 0. The median and IQR can be identified on the boxplot giving estimates of 3.
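The percentage-difference calculation can be sketched like this, assuming x holds ten successive values and the change is measured relative to the earlier value in each pair. The numbers are invented.

```r
# Made-up series of ten values
x <- c(100, 110, 99, 99, 120, 150, 75, 75, 90, 180)

# diff(x) has nine entries; x[-10] drops the tenth value so the lengths match
pct <- diff(x) / x[-10] * 100
print(round(pct, 1))
```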

We pad it out using rep, then plot. The histogram is very symmetric. Some year had over 93 feet of snow fall! Then you can access the entries using the respective state abbreviations. The median more accurately reflects the bulk of the data.

If your intention were to make the data seem outrageously large, then a mean might be used. Can you think of a shape for a distribution when this is actually okay? It makes an area look more affordable. For exclusive listings, the mean is often used to make an area seem more expensive. It is much easier to be relatively slow in a marathon, as it requires little talent and little training, just doggedness. The skew of the inter-arrival times is twice as much and to the right.

As such, they should have a coefficient of variation that is nearly 1. If you find the median of the transformed data, you can take its exponential to get the median of the untransformed data.

Not so with the mean. Jittering can smooth this out (try qqnorm(jitter(chest, 3))). It falls fairly close to a straight line. Star brightness is measured on a logarithmic scale: a difference of 5 is a factor of 100 in terms of brightness.
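The earlier remark about the median and mean of log-transformed data can be checked numerically. This sketch uses made-up, right-skewed data with an odd number of values, so that the median is a single observation.

```r
# Made-up skewed data; odd length so the median is one of the data values
x <- c(1, 2, 4, 8, 100)

# The median commutes with the log transform ...
med_back <- exp(median(log(x)))
print(c(med_back, median(x)))

# ... but the mean does not: this gives the geometric mean instead
mean_back <- exp(mean(log(x)))
print(c(mean_back, mean(x)))
```

With an even number of values the median averages the two middle observations, so the match is only approximate.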


