TDM 10100: Project 3 — Fall 2023
Motivation: data.frames are the primary data structure you will work with when using R. It is important to understand how to insert, retrieve, and update data in a data.frame.
Context: In Project 2 we ran our first R code, learned about vectors and indexing, and explored some basic functions in R. In this project, we will continue to reinforce what we’ve already learned and learn more about how dataframes, formally called data.frame, work in R.
Scope: r, data.frames, factors
Dataset(s)
The following questions will use the following dataset(s):
- /anvil/projects/tdm/data/craigslist/vehicles.csv
Setting Up
First, let’s take a look at all of the data available to students. To do this, we are going to use a new function, list.files, which lists all of the files in a folder (here, the craigslist folder).
Let’s run the command below using the seminar-r kernel to view all the files in the folder.
list.files("/anvil/projects/tdm/data/craigslist")
As you can see, we have two different files' worth of information from Craigslist.
For this project, we are interested in looking at the vehicles.csv file.
Before we read in the data, we should check the size of the file to get an idea of how big it is. This is important because if the file is too large, we may need more cores for our session, or else our kernel will 'die'.
We can check the size of our file (in bytes) using the following command.
file.info("/anvil/projects/tdm/data/craigslist/vehicles.csv")$size
The size value returned by file.info is a double giving the file size in bytes.
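A raw byte count can be hard to read, so it may help to convert it to larger units. The following is a minimal sketch, assuming a small temporary file in place of the real dataset (tmp_path is a made-up name) and assuming 1 GB = 1024^3 bytes:

```r
# Write a small temporary file so the example runs anywhere.
# tmp_path is a hypothetical name, not part of the project.
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("id,price", "1,5000", "2,7500"), tmp_path)

# file.info() returns a data frame; its size column is in bytes.
size_bytes <- file.info(tmp_path)$size

# Convert bytes to gigabytes, assuming 1 GB = 1024^3 bytes.
size_gb <- size_bytes / 1024^3

print(size_bytes)
print(size_gb)
```

The same two steps should work on any path you can read, including the vehicles.csv path used above.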
Now that we have made sure our file isn’t too big (1.44 GB), let’s read it into a dataframe in the same way that we have done in the previous two projects.
We recommend using 2 cores for your Jupyter Lab session this week.
Now we can read in the data and get started with our analysis.
myDF <- read.csv("/anvil/projects/tdm/data/craigslist/vehicles.csv")
Questions
Question 1 (1 pt)
- How many rows and columns does our dataframe have?
- What type/s of data are in this dataframe (example: numerical values, and/or text strings, etc.)
- 1-2 sentences giving an overall description of our data.
As we stressed in Project 2, familiarizing yourself with the data you are going to work with is an important first step. For this question, we want to figure out how many rows and columns are in our data, along with the types of data in our data frame. The hint below contains all of the functions that we need to solve this problem. (We also covered these functions in detail in Project 2, so feel free to reference the previous project if you want more information.)
When answering the third sub-question, consider talking about where the data appears to be taken from, what the data contains, and any important details that immediately stand out to you about the data.
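As a rough sketch of what inspecting a data frame can look like, here is a tiny made-up data frame (toyDF and its columns are hypothetical). The choice of dim, str, and sapply with class is our assumption about which functions are useful here, not necessarily the ones named in the hint:

```r
# A tiny made-up data frame standing in for the real data.
toyDF <- data.frame(
  price = c(5000, 7500, 12000),
  year = c(2010L, 2015L, 2020L),
  state = c("in", "oh", "il")
)

dim(toyDF)            # number of rows, then number of columns
str(toyDF)            # each column's type and its first few values
sapply(toyDF, class)  # the class of each column, as a named vector
```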
The |
- The number of rows and columns in our dataframe, in a markdown cell.
- The types of data in our dataframe, in a markdown cell.
- 1-2 sentences summarizing our data.
Question 2 (1 pt)
- Print the number of NA values in the 'year' column of myDF, and the percentage of the total number of rows in myDF that this represents.
- Create a new data frame called goodyearsDF with only the rows of myDF that have a defined year (non NA values). Print the head of this new data frame.
- Create a new data frame called missingyearsDF with only the rows of myDF that are missing data in the year column. Print the head of this new data frame.
Now that we have a better understanding of the general structure and contents of our data, let’s focus on some specific patterns in our data that may make analysis more challenging.
Often, one of these patterns is missing data. This can come in many forms, such as NA, NaN, NULL, or simply a blank space in one of our dataframe's cells. When performing data analysis, it is important to consider missing data and decide how to handle it appropriately.
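These forms do not all behave the same way in R. Here is a small sketch, using made-up vectors (x and s are hypothetical names), of how is.na treats each of them:

```r
# A made-up numeric vector mixing an ordinary value, NA, and NaN.
x <- c(1.5, NA, NaN)
is.na(x)    # TRUE for both NA and NaN
is.nan(x)   # TRUE only for NaN

# An empty string is not NA, so blank text needs its own check.
s <- c("ford", "", NA)
is.na(s)
s == ""     # the NA element stays NA in this comparison

# NULL has length zero, so it cannot occupy a single cell of a column.
length(NULL)
```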
In this question, we will look at filtering out rows with missing data. The R function is.na() returns TRUE if the corresponding data is missing and FALSE if it is not missing. An exclamation mark changes TRUE to FALSE and changes FALSE to TRUE. For this reason, !is.na() indicates which data are not NA values, in other words, which data are not missing. As an example, if we wanted to create a new dataframe with all of the rows that are not missing the latitude values, we could use any of the following equivalent methods:
goodlatitudeDF <- subset(myDF, !is.na(myDF$lat))
goodlatitudeDF <- subset(myDF, !is.na(lat))
goodlatitudeDF <- myDF[!is.na(myDF$lat), ]
In the second method, the subset function knows that we are working with myDF, so we do not need to specify that lat is the latitude column in the myDF data frame, and instead, we can just refer to lat and the subset function knows that we are referring to a column.
In the third method, when we write myDF[ , ] we put things before the comma that are conditions on the rows, and we put things after the comma that are conditions on the columns. So we are saying that we want rows of myDF for which the lat values are not NA, and we want all of the columns of myDF.
If we compare the sizes of the original data frame and this new data frame, we can see that some rows were removed.
dim(myDF)
dim(goodlatitudeDF)
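To see these ideas on a small scale, here is a minimal sketch on a made-up data frame (toyDF and the variable names below are hypothetical). It counts the missing lat values, computes the percentage they represent, and checks that the three filtering methods keep the same rows:

```r
# A made-up data frame with one missing latitude.
toyDF <- data.frame(id = 1:4, lat = c(40.4, NA, 41.9, 39.8))

# TRUE counts as 1 and FALSE as 0, so sum() counts the missing values
# and mean() gives the fraction that are missing.
missing_lat <- is.na(toyDF$lat)
print(sum(missing_lat))
print(100 * mean(missing_lat))

# The three filtering methods, applied to the toy data.
viaSubsetFull  <- subset(toyDF, !is.na(toyDF$lat))
viaSubsetShort <- subset(toyDF, !is.na(lat))
viaBrackets    <- toyDF[!is.na(toyDF$lat), ]

# All three should contain the same rows and columns.
print(isTRUE(all.equal(viaSubsetFull, viaSubsetShort)))
print(isTRUE(all.equal(viaSubsetShort, viaBrackets)))

# One row was removed, and no columns were.
print(dim(toyDF))
print(dim(viaBrackets))
```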
To answer question 2, we want you to work (instead) with the year column, and try the same things that we demonstrated above with the lat column. The lat column examples are only there to show you how to deal with missing data, so that you can apply the same approach to the year column.
- The number of NA values in the year column of myDF and the percentage of the total number of rows in myDF that this represents, in a markdown cell.
- A dataframe called goodyearsDF containing only the rows in myDF that have a defined year (non NA values), and print the head of that data frame.
- A dataframe called missingyearsDF containing only the rows in myDF that are missing the year data, and print the head of that data frame.
Question 3 (2 pts)
Use the |
- Print the mean price of vehicles by year during the last 20 years.
- Find which year of vehicle appears most frequently in our data, and how frequently it occurs.
Using the
|
We want you to (instead) find the mean price for cars by year.
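As a minimal sketch of a grouped mean, here is one possible approach using tapply on a made-up data frame. The grouping column state, the name toyDF, and the choice of tapply are all assumptions made for illustration:

```r
# A made-up data frame; state is the grouping variable in this sketch.
toyDF <- data.frame(
  state = c("in", "in", "oh", "oh", "oh"),
  price = c(4000, 6000, 3000, NA, 9000)
)

# tapply(values, groups, function) applies the function within each group.
# na.rm = TRUE is passed on to mean() so that missing prices are skipped.
mean_by_state <- tapply(toyDF$price, toyDF$state, mean, na.rm = TRUE)
print(mean_by_state)
```

Without na.rm = TRUE, any group that contains a missing price would get a mean of NA.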
Finding the most frequent value in our data can be done using
|
Now we want you to (instead) find the year in which the most cars appear in the data set.
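One possible way to find a most frequent value, sketched here on a made-up vector (colors is a hypothetical name, and combining table with which.max is an assumption about the approach), is:

```r
# A made-up vector of values to count.
colors <- c("red", "blue", "red", "white", "red", "blue")

# table() counts how many times each distinct value appears.
counts <- table(colors)
print(counts)

# which.max() gives the position of the largest count;
# names() turns that position back into the value itself.
most_common <- names(counts)[which.max(counts)]
print(most_common)
print(max(counts))
```

If two values are tied for the largest count, which.max returns only the first of them.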
- The mean price of each year of vehicle for the last 20 years, in a markdown cell.
- The most frequent year in our data, and how frequently it occurred.
Question 4 (2 pts)
- Among the region_url values in the data set, which region_url is most popular?
- What are the three most popular states, in terms of the number of craigslist listings that appear?
Use the table, sort, and tail commands to find the most popular region_url and the most popular three states.
(These two questions are not related to each other. In other words, when you look for the three states that appear most frequently, they have nothing at all to do with the region_url that you found.)
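Here is a short sketch of how table, sort, and tail can be chained, using a made-up vector of labels (the vector and variable names are hypothetical):

```r
# A made-up vector of labels to rank by frequency.
labels <- c("ca", "fl", "ca", "tx", "ny", "ca", "fl", "tx", "ca", "fl")

counts <- table(labels)         # how often each label appears
sorted_counts <- sort(counts)   # ascending, so the largest counts come last
top_two <- tail(sorted_counts, n = 2)

print(sorted_counts)
print(top_two)
```

Because sort is ascending by default, tail picks out the most frequent values, with the single most frequent one in the last position.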
- The most popular region_url.
- The three states that appear most frequently.
Question 5 (2 pts)
- In question 3, we found the average price of vehicles by year. ("Average" and "mean" are two different words for the very same concept.) Choose at least two different plot types in R, and create two plots that show the average vehicle price by year.
- Write 3-5 sentences detailing any patterns present in the data along with your personal observations. (i.e. shape, outliers, etc.)
Remember, all plots should have a title and appropriate axis labels. Axes should also be scaled appropriately. It is also necessary to explain your plot using a few sentences.
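As a sketch of what a title and axis labels look like in base R, here are two plot types drawn from a small made-up named vector. The values, names, and label text are invented for illustration, and barplot and plot are just two of many possible choices:

```r
# Made-up averages, named by group; these are not real results.
avg_price <- c("2018" = 15000, "2019" = 17500, "2020" = 21000)
group <- as.numeric(names(avg_price))

# Bar plot: main sets the title, xlab and ylab set the axis labels.
barplot(avg_price,
        main = "Average price by year (made-up data)",
        xlab = "Year",
        ylab = "Average price (USD)")

# Line plot with points: type = "b" draws both points and lines.
plot(group, avg_price,
     type = "b",
     main = "Average price by year (made-up data)",
     xlab = "Year",
     ylab = "Average price (USD)")
```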
- 2 different plots of average price of vehicle by year.
- A 3-5 sentence explanation of any patterns present in the data along with your personal observations.
Submitting your Work
Nice work, you’ve finished Project 3! Make sure that all of the below files are included in your submission, and feel free to come to seminar, post on Piazza, or visit some office hours if you have any further questions.
- firstname-lastname-project03.ipynb
- firstname-lastname-project03.R
You must double check your You will not receive full credit if your |
Please double check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted is what you actually submitted. In addition, please review our submission guidelines before submitting your project.