Apache Spark Practices
Use Apache Spark to answer the following questions.
- Find the number of flights each airline has made from 1987 to the most recent year in the data.
- Find the mean (average) departure delay per origin airport.
- On which day are the delays the worst?
- On which day of the week are the most flights cancelled?
- On which day of the month are the most flights cancelled?
- Find the on-time (ArrTime - CRSArrTime <= 0) performance for each unique carrier.
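As an illustrative sketch only (not the required Spark solution), the per-carrier flight count and the on-time test from the last question can be expressed as plain Python functions of the kind you would pass to `rdd.map` / `reduceByKey`. The three sample rows and the `(UniqueCarrier, ArrTime, CRSArrTime)` layout are assumptions for the example, not real data.

```python
from collections import Counter

# Hypothetical sample rows: (UniqueCarrier, ArrTime, CRSArrTime).
# Column names follow the airline on-time dataset; values are made up.
rows = [
    ("AA", 905, 910),
    ("AA", 935, 910),
    ("DL", 1200, 1200),
]

def is_on_time(arr_time, crs_arr_time):
    """On-time as defined in the assignment: ArrTime - CRSArrTime <= 0."""
    return arr_time - crs_arr_time <= 0

# Per-carrier flight counts (what map + reduceByKey would compute in Spark).
flight_counts = Counter(carrier for carrier, _, _ in rows)

# Per-carrier on-time fraction.
on_time = {
    carrier: sum(is_on_time(a, c) for cr, a, c in rows if cr == carrier)
    / flight_counts[carrier]
    for carrier in flight_counts
}
print(flight_counts)  # Counter({'AA': 2, 'DL': 1})
print(on_time)        # {'AA': 0.5, 'DL': 1.0}
```

In Spark the same logic would look like `rdd.map(lambda r: (r[0], 1)).reduceByKey(lambda a, b: a + b)` for the counts.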
If using Google Colab, attach Google Drive to Colab:

```python
from google.colab import drive
drive.mount('/content/drive')
```

Read all files in a single call:

```python
sc = spark.sparkContext
rdd = sc.textFile('/content/drive/path/to/files/*.csv.bz2')
```

Use `.take()` or `.first()` instead of `.collect()`:

```python
rdd.take(2)
```

The dataset is the same as in assignment 2; use it from the Google Drive folder. See spu-bigdataanalytics-211/assignment-2 for more on the data dictionary.
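One practical detail when reading these CSV files as text: `textFile` keeps the header row of every file, so it must be filtered out before parsing. The header fields shown below are a partial, assumed layout taken from the dataset's data dictionary; the helpers are plain Python of the kind you would pass to `rdd.filter` / `rdd.map`.

```python
# Hypothetical header and one data line, mimicking what sc.textFile yields
# for the airline CSV files (field order assumed from the data dictionary).
header = "Year,Month,DayofMonth,DayOfWeek,DepTime"
line = "1987,10,14,3,741"

def is_header(line):
    # Header rows start with the literal column name "Year", which
    # never appears at the start of a data row.
    return line.startswith("Year,")

def parse(line):
    # Split a CSV row into its raw string fields.
    return line.split(",")

# With Spark this would be:
#   data = rdd.filter(lambda l: not is_header(l)).map(parse)
print(is_header(header), is_header(line))  # True False
print(parse(line))
```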
You can find more information about this dataset on the Statistical Computing website, and more on Airline On-Time Performance data from the Bureau of Transportation Statistics (BTS).
- Download this repository with `git clone https://github.com/spu-bigdataanalytics-211/assignment-3.git`.
- Create a virtual environment and activate it every time you need to use it.
- Install the requirements with `pip install -r requirements.txt`.
- Create a notebook.
The repository is self-descriptive and should guide you through the assignment. Let me know if you have any questions.