This coursework examines the student’s ability to design a relational database, manipulate and analyse large datasets, and to interpret analysis results.
Part 1: Designing a Relational Database
Little Panda is a takeaway who wants to open up its business to accept online orders. To do this, it needs a database to store its food menu, customer data and orders.
Your task is to design a relational database that runs on MySQL. Here are some requirements:
Customers must register before they can make orders. They must provide enough details for home delivery.
Menu item prices may change. Customers are charged prices at the time of order.
Little Panda needs to know the status of an order so that they can follow up. e.g. either it is “waiting to be cooked”, “cooked and to be delivered”, or delivered, etc. You can assume all orders are paid before they enter the system.
Order details must be stored for accounting purpose, even after they are completed.
Other than these it is a usual takeaway scenario. You can apply your common sense to make reasonable assumptions.
Part 2: Analysing the “UN City Population” Dataset
You are given the “UN city population” dataset. Perform the following analysis using Pig:
Question 1: Find the number of countries in the dataset.
Question 2: List the countries together with the number of cities in each country1 . Question 3: List countries in ascending order of female-to-male ratio, throughout the years2 .
Question 4: List the top 10 most populated cities according to the most recent data in the dataset3 .
Question 5: List the top 10 cities which have the highest population change per year in percentage since the start of the survey4 .
You must use Pig.
Annotate your program code properly so that the marker can understand how it works. The annotation also contributes to the grade.
State any assumption that you made.
If you cannot complete a task, an incomplete solution may also bring you partial credit.
Part 3: Analysing Datasets of Your Choice
In this part of the coursework you need to:
1. Find a dataset, or multiple datasets.
Dataset(s) must be public domain and of a considerable size.
▪ A dataset cannot be too small. e.g. Just a few lines.
▪ There is no need to go for a GB or TB-sized dataset unless the dataset is very interesting.
◦ DO NOT choose a dataset similar to the one in part 2.
2. Propose 3 analysis tasks that you will perform on the dataset(s).
◦ Your proposed analyses should be insightful. e.g. give useful information for decision making.
◦ You may combine multiple datasets for some interesting analyses.
◦ DO NOT propose tasks similar to those in part 2.
3. Implement the 3 proposed analyses using Pig.
4. For each analysis, interpret the result.
Here are some starting places to look for datasets:
Show an ER diagram of your conceptual model.
State all assumptions you have made.
State all limitations (if any) of your design.
Show the process of normalisation up to the 3rd Normal Form.
• DO NOT just claim your schema is in 3NF. You must show evidence. e.g functional dependencies among attributes and keys.
• Show how the database schema changes as normalisation progresses.
Show your final database schema. i.e. All tables and their structures
Part 2 & 3:
Your submission should include the followings:
▪ All Pig scripts for the tasks.
• DO NOT combine all Pig scripts into one file.
• These scripts should be provided as separate text files which are ready to be executed.
▪ A report, preferably in PDF format.
• For part 2:
◦ State the result of each analysis task.
◦ For short results, you can copy-and-paste them into the report. For long results, include them as external text files.
• For part 3:
◦ Describe your chosen dataset. e.g. What is it about?
▪ State clearly the source of the dataset. e.g. What is its URL?
▪ Describe the format of the dataset. e.g. What does each field mean?
▪ Explain your motivation in selecting this dataset. For example:
• What makes you interested in this dataset?
• What useful information can you get from this dataset?
◦ For the 3 proposed analyses.
▪ Explain the motivation of each analysis.
▪ State the result of each analysis.