STAT 29000: Project 7 — Fall 2021
Bashing out liquor sales data
Motivation: A bash script is a powerful tool to perform repeated tasks. RCAC uses bash scripts to automate a variety of tasks. In fact, we use bash scripts on Scholar to do things like link Python kernels to your account, fix potential issues with Firefox, etc. awk
is a programming language designed for text processing. The combination of these tools can be really powerful and useful for a variety of quick tasks.
Context: This is the second project in a series of projects focused on bash and awk
. Here, we take a deeper dive and create some more complicated awk scripts, as well as utilize the bash skills learned in previous projects.
Scope: bash, awk
, bash scripts, R, Python
Dataset(s)
The following questions will use the following dataset(s):
-
/anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt
Questions
Question 1
You may have noticed that the "Store Location" column (8th column) contains latitude and longitude coordinates. That is some rich data that could be fun and useful.
The data will look something like the following:
1013 MAINKEOKUK 52632(40.39978, -91.387531)
What this means is that you can’t just parse out the latitude and longitude coordinates and call it a day — you need to use awk
functions like gsub
and split
to extract the latitude and longitude coordinates.
Use awk
to print out the latitude and longitude for each line in the original dataset. Output should resemble the following.
lat,lon 1.23,4.56
Make sure to take care of rows that don’t have latitude and longitude coordinates — just skip them. So if your results look like this, you need to add logic to skip the "empty" rows: 40.39978, -91.387531 40.739238, -95.02756 40.624226, -91.373211 , 41.985887, -92.579244 To do this, just go ahead and wrap your print in an if statement similar to:
|
|
If we have a bunch of data formatted like the following: 1013 MAINKEOKUK 52632(40.39978, -91.387531) If we first used
40.39978, -91.387531) Then, you could use
40.39978, -91.387531 At this point I’m sure you can see how to use |
Don’t forget any lingering space after the first comma! We don’t want that. |
-
Code used to solve this problem.
-
Output from running the code.
Question 2
Redo question (4) (and reproduce sales_by_store.csv
) from project (5), but this time add 2 additional columns to the dataset — lat
and lon
.
-
'lat': latitude
-
'lon': longitude
Before you panic (this was a tough question), we’ve provided the solution below as a starting point for you.
%%bash
awk -F';' 'BEGIN{ print "store_name;month_number;year;sold_usd;volume_sold" }
{
gsub(/\$/, "", $22); split($2, dates, "/", seps);
mysales[$4";"dates[1]";"dates[3]] += $22;
myvolumes[$4";"dates[1]";"dates[3]] += $24;
}
END{
for (mytriple in mysales)
{
print mytriple";"mysales[mytriple]";"myvolumes[mytriple]
}
}' /anvil/projects/tdm/data/iowa_liquor_sales/iowa_liquor_sales_cleaner.txt > sales_by_store.csv
It may take a few minutes to run this script. Grab a coffee, tea, or something else to keep you going. |
-
Code used to solve this problem.
-
Output from running the code.
Question 3
Believe it or not, awk
even supports geometric calculations like sin
and cos
. Write a bash script that, given a pair of latitude and pair of longitude, calculates the distance between the two points.
Okay, so how to get started? To calculate this, we can use the Haversine formula. The formula is:
$2*r*arcsin(\sqrt{sin^2(\frac{\phi_2 - \phi_1}{2}) + cos(\phi_1)*cos(\phi_2)*sin^2(\frac{\lambda_2 - \lambda_1}{2})})$
Where:
-
$r$ is the radius of the Earth in kilometers, we can use: 6367.4447 kilometers
-
$\phi_1$ and $\phi_2$ are the latitude coordinates of the two points
-
$\lambda_1$ and $\lambda_2$ are the longitude coordinates of the two points
In awk
, sin
is sin
, cos
is cos
, and sqrt
is sqrt
.
To get the arcsin
use the following awk
function:
function arcsin(x) { return atan2(x, sqrt(1-x*x)) }
To convert from degrees to radians, use the following awk
function:
function dtor(x) { return x*atan2(0, -1)/180 }
The following is how the script should work (with a real example you can test):
./question3.sh 40.39978 -91.387531 40.739238 -95.02756
309.57
To include functions in your
|
We want you to create a bash script called The following is some skeleton code that you can use to get started.
|
You may need to give your script execute permissions like this.
|
-
Code used to solve this problem.
-
Output from running the code.
Question 4
Create a new bash script called question4.sh
that accepts a latitude, longitude, filename, and n.
The latitude and longitude are a point that we want to calculate the distance from.
The filename is sales_by_store.csv
— our resulting dataset from question 3.
Finally, n is the number of stores from our sales_by_store.csv
file that we want to calculate the distance from the provided longitude and latitude.
./question4.sh 40.39978 -91.387531 sales_by_store.csv 3
Distance from (40.39978,-91.387531) store_name,distance The Music Station,253.915 KUM & GO #4 / LAMONI,213.455 KUM & GO #4 / LAMONI,213.447
To get you started, you can use the following "starter" code. Fix the code to work:
#!/bin/bash
lat_from=$1
lon_from=$2
file=$3
n=$4
awk -F';' -v n=$n -v lat_from=$lat_from -v lon_from=$lon_from 'function arcsin(x) { return atan2(x, sqrt(1-x*x)) }function dtor(x) { return x*atan2(0, -1)/180 }function distance(lat1, lon1, lat2, lon2) {
# question 2 code here (1)
return dist;
}BEGIN {
print "Distance from ("lat_from","lon_from")"
print "store_name,distance";
} NR>1 && NR <= n+1 {
lat2 = FIXME; (2)
lon2 = FIXME; (3)
dist = distance(lat_from, lon_from, FIXME, FIXME); (4)
print $1","dist
}' $file
1 | Add your code from question 2 here and make sure your distance is stored in a variable called dist (which we return). |
2 | Which value goes here? |
3 | Which value goes here? |
4 | Which values go here? |
-
Code used to solve this problem.
-
Output from running the code.
Question 5 (optional, 0 pts)
Use your choice of Python or R, with our sales_by_store.csv
to create a beautiful graphic mapping the latitudes and longitudes of the stores. If you want to, get creative and increase the size of the points on the map based on the number of sales. You could create a graphic for each month to see how sales change month-to-month. The options are limitless, get creative!
-
Code used to solve this problem.
-
Output from running the code.
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. |