
Titanic Data Analysis | Sample Assignment

Updated: May 10, 2022



Question 1


This is a scenario-based assignment that uses the Titanic dataset and consists of three questions. Read and understand the requirements, then answer the questions carefully.


Dataset: Titanic disaster.


Data Dictionary:

Variable | Definition | Key

  • survival | Survival | 0 = No, 1 = Yes

  • pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd

  • sex | Sex | M or F

  • Age | Age in years

  • sibsp | # of siblings / spouses aboard the Titanic

  • parch | # of parents / children aboard the Titanic

  • ticket | Ticket number

  • fare | Passenger fare

  • cabin | Cabin number

  • embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes:

  • pclass: A proxy for socio-economic status (SES)

      • 1st = Upper

      • 2nd = Middle

      • 3rd = Lower

  • age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.

  • sibsp: The dataset defines family relations in this way:

      • Sibling = brother, sister, stepbrother, stepsister

      • Spouse = husband, wife (mistresses and fiancés were ignored)

  • parch: The dataset defines family relations in this way:

      • Parent = mother, father

      • Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Dataset Path:

The dataset Titanic_train.csv is present at the location

The dataset Titanic_test.csv is present at the location

Problem Statement:

You are provided with datasets about the people aboard the Titanic. Use them to answer the following questions:

Q1: Find the relation of each of the following columns (which hold discrete values) with the “Survived” column, then answer the questions below:

  • Pclass

  • Sex

  • Embarked


1. Find the total number of survivors from the 3rd Pclass (Titanic_train.csv).

2. Find the total number of males who died in the accident (Titanic_train.csv).

3. Find the total number of survivors who embarked from "Southampton" (Titanic_train.csv).

Hint:

***Note: Write the code only in solution() function and do not pass any arguments to the function. For predefined stub refer stub.py***
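The three counts in Question 1 can be sketched with boolean masks in pandas. This is only an illustration under stated assumptions: the small `sample` frame is made up and stands in for Titanic_train.csv, the helper name `q1_counts` is hypothetical, and the Sex column is assumed to hold the strings "male" and "female" (adjust the comparison if the file uses M and F as the data dictionary suggests). In the real stub, solution() takes no arguments and would load the CSV itself.

```python
import pandas as pd

# Made-up sample standing in for Titanic_train.csv (illustrative values only).
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 1],
    "Sex":      ["male", "female", "female", "male", "male", "male"],
    "Embarked": ["S", "C", "S", "Q", "S", "S"],
})

def q1_counts(df):
    """Return (3rd-class survivors, males who died, Southampton survivors)."""
    survived = df["Survived"] == 1
    third_class_survivors = int((survived & (df["Pclass"] == 3)).sum())
    # Assumption: Sex is stored as "male"/"female".
    males_died = int((~survived & (df["Sex"] == "male")).sum())
    southampton_survivors = int((survived & (df["Embarked"] == "S")).sum())
    return third_class_survivors, males_died, southampton_survivors

print(q1_counts(sample))  # (1, 3, 2) for the made-up sample

# A cross-tabulation shows the full relation between a column and Survived.
print(pd.crosstab(sample["Pclass"], sample["Survived"]))
```

The same `pd.crosstab` call with "Sex" or "Embarked" in place of "Pclass" gives the relation for the other two columns.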


Question 2


Dataset: Titanic disaster

Q: Some of the values in the "Age" column are missing. Use a linear regression model to fill the missing values in the dataset.

(Hint: Treat Age as the dependent variable and predict the missing values.)

1. Print the total number of cells having missing values in the Age column.

Example:

If Total number of cells with missing value is: 100

Output: 100

2. Print the sum of the index number of all the cells with missing values.

Example:

If the Index Number of cells with missing value is: (4,6,20,40)

Output: 70


3. Print the mean of all the new values filled using linear regression. [To do this, first split the training dataset into two parts: the first contains only the rows with missing values in the 'Age' column (call this dataframe df1), and the second contains the rows with valid numbers in the 'Age' column (call this dataframe df2). Train the model on df2 and predict the ages for df1. The answer is the mean of the ages predicted for df1.]


***NOTE: Use the features ['Pclass', 'Survived', 'GenderLabel'] to predict Age.***


Example:

If the new filled values are: (25.0,30.0, 30.0,35.0)

Output: 30.0
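Parts 1 and 2 (the count of missing Age cells and the sum of their index numbers) can be sketched as follows. The `ages` frame is made-up illustrative data and `missing_age_summary` is a hypothetical helper name; the sketch assumes the default integer index that pandas assigns when reading a CSV.

```python
import numpy as np
import pandas as pd

# Made-up data: Age is missing at index 1, 3 and 5.
ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 35.0, np.nan]})

def missing_age_summary(df):
    """Return (number of missing Age cells, sum of their index numbers)."""
    missing = df["Age"].isnull()
    count = int(missing.sum())
    # Collect the index labels of the missing rows in a list, then sum them.
    missing_index = df[missing].index.tolist()
    return count, sum(missing_index)

print(missing_age_summary(ages))  # (3, 9) for the made-up data
```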


Steps to be followed:

Step 1: Load the Titanic_train.csv file.

Step 2: Count the missing values in the Age column. [Hint: You can use isnull() with sum().]

Step 3: Calculate the sum of the index numbers where values are missing. [Hint: You can use isnull(), pass the matching index values to a list, and then sum the list.]

Step 4: Separate the rows that have missing Age values (say, dataframe A) from the rows that have valid Age values (say, dataframe B).

Step 5: Encode the string columns. Here, encode the Sex column into a “GenderLabel” column.

Step 6: Fit a linear regression model on dataframe B from step 4, because only its rows have known ages. [Hint: Use ‘Pclass’, ‘GenderLabel’ and ‘Survived’ as independent features.]

Step 7: Use the linear regression model from step 6 to predict the age for the rows in dataframe A.

Step 8: Fill the ‘Age’ column of dataframe A with the predicted ages from step 7.

Step 9: Calculate the mean of the predicted ages in dataframe A and write the integer part of the mean. This is the answer to part 3.


***Note: Do not split the data into train_test split***
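The regression steps above can be sketched with scikit-learn. This is a minimal sketch under stated assumptions, not the reference solution: the `train` frame is made-up data standing in for Titanic_train.csv, and `fill_age_with_regression` is a hypothetical helper name. The model is fitted on the rows with known ages and used to predict the rows with missing ages, with no train/test split.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

# Made-up sample standing in for Titanic_train.csv (illustrative values only).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 2],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, 35.0, np.nan, 30.0, 54.0, np.nan],
})

FEATURES = ["Pclass", "Survived", "GenderLabel"]

def fill_age_with_regression(df):
    """Fill missing Age values; return (filled frame, mean of predictions)."""
    df = df.copy()
    # Encode the string Sex column as integers.
    df["GenderLabel"] = LabelEncoder().fit_transform(df["Sex"])
    missing = df["Age"].isnull()
    known, unknown = df[~missing], df[missing]
    # Fit on rows with valid ages, predict rows with missing ages.
    model = LinearRegression().fit(known[FEATURES], known["Age"])
    predicted = model.predict(unknown[FEATURES])
    df.loc[missing, "Age"] = predicted
    return df, float(predicted.mean())

filled, mean_predicted = fill_age_with_regression(train)
print(int(mean_predicted))  # integer part of the mean of the predicted ages
```

Returning the mean of the predictions alongside the filled frame keeps the value needed for part 3 separate from the mean of the whole Age column, which would mix known and predicted ages.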


Input Dataset path:


***Note: Write the code only in solution() function and do not pass any arguments to the function. For predefined stub refer stub.py***


Question 3


Dataset: Titanic disaster.


After performing the analysis from the previous question, derive a new column called “AdultOrChild” with the categorical values “Adult” or “Child”, derived from the Age column.


Hint: A person having Age >=18 is an “Adult” and the one having Age < 18 is a “Child”.
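The rule in the hint can be sketched with `numpy.where`. The `people` frame is made-up data and `add_adult_or_child` is a hypothetical helper name. The sketch assumes the missing ages have already been filled; a NaN age fails the `>= 18` comparison and would therefore be labelled "Child".

```python
import numpy as np
import pandas as pd

# Made-up data (illustrative values only).
people = pd.DataFrame({
    "Age":      [4.0, 17.9, 18.0, 40.0, 0.75],
    "Survived": [1, 0, 1, 0, 1],
})

def add_adult_or_child(df):
    """Return a copy with an AdultOrChild column derived from Age."""
    df = df.copy()
    # Age >= 18 is "Adult"; everything else (including NaN) becomes "Child".
    df["AdultOrChild"] = np.where(df["Age"] >= 18, "Adult", "Child")
    return df

labelled = add_adult_or_child(people)

# Survivors per category, and the total across both categories.
survivors_by_group = (
    labelled[labelled["Survived"] == 1].groupby("AdultOrChild").size()
)
print(survivors_by_group)
print(int(survivors_by_group.sum()))  # 3 for the made-up data
```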

1. Find its relation with the “Survived” Column and print the total number of survivors.


Example:

If Total survived children: 100, Total survived adults: 200


Output: 300

2. Use the following features to build a classification model and predict the Survived category:

  • Pclass

  • Age

  • Sex (Encode values using LabelEncoder)

For this prediction, create a confusion matrix for the model you built and print the sum of all the elements of the matrix.

***NOTE: 1. You should create the confusion matrix for the test data, not the training data.

2. Write the solution only in solution() function and do not pass any arguments to the function. For predefined stub refer stub.py***

Training Data: 'res/Titanic_train.csv'

Testing Data: 'res/Titanic_test.csv'

Example: If the confusion matrix is

[2 2
 2 2]

then the sum is 2+2+2+2.

Output: 8


Hint: Use Logistic Regression as the classification model

3. Use confusion matrix to print the accuracy of the model

Example: (2+2)/8*100


Output: 50


***NOTE: You should check the accuracy for the test data not the training data.
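Parts 2 and 3 can be sketched with scikit-learn as follows. This is a sketch under stated assumptions: the `train` and `test` frames are made-up stand-ins for the two CSV files, `evaluate` is a hypothetical helper name, and the test data is assumed to include a Survived column so that predictions can be compared with true labels. Accuracy is computed from the confusion matrix as the diagonal (correct predictions) divided by the sum of all elements.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Made-up stand-ins for the training and testing files (illustrative only).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass":   [3, 1, 3, 1, 3, 3, 1, 2],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 30.0, 54.0, 14.0],
    "Sex":      ["male", "female", "female", "female",
                 "male", "male", "male", "female"],
})
test = pd.DataFrame({
    "Survived": [0, 1, 0, 1],
    "Pclass":   [3, 1, 3, 2],
    "Age":      [34.5, 47.0, 62.0, 27.0],
    "Sex":      ["male", "female", "male", "female"],
})

FEATURES = ["Pclass", "Age", "Sex"]

def evaluate(train_df, test_df):
    """Return (confusion matrix, sum of its elements, accuracy in percent)."""
    train_df, test_df = train_df.copy(), test_df.copy()
    # Fit the encoder on the training data and reuse it for the test data.
    encoder = LabelEncoder().fit(train_df["Sex"])
    train_df["Sex"] = encoder.transform(train_df["Sex"])
    test_df["Sex"] = encoder.transform(test_df["Sex"])
    model = LogisticRegression(max_iter=1000)
    model.fit(train_df[FEATURES], train_df["Survived"])
    predictions = model.predict(test_df[FEATURES])
    # The matrix is built on the test data, not the training data.
    matrix = confusion_matrix(test_df["Survived"], predictions, labels=[0, 1])
    total = int(matrix.sum())
    accuracy = matrix.trace() / total * 100
    return matrix, total, accuracy

matrix, total, accuracy = evaluate(train, test)
print(total)          # equals the number of test rows
print(int(accuracy))  # integer part of the accuracy in percent
```

The sum of all elements of a confusion matrix always equals the number of rows that were evaluated, which gives a quick sanity check on the part 2 answer.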

Steps to be followed:

Step 1: Read the CSV file using pandas.

Step 2: Print the total number of cells having missing values in the Age column. Hint: Use .isnull().sum()

Step 3: Find the sum of all the index numbers of the missing values.

Step 4: Derive a new column called “AdultOrChild” having categorical values as “Adult” or “Child” derived from the Age column. Hint: A person having Age >=18 is an “Adult” and the one having Age < 18 is a “Child”.

Step 5: Find its relation with the “Survived” column and print the total number of survivors. Obtain the complete dataset by combining it with the target attribute.

Step 6: Use the listed features to build a classification model and predict the Survived category. For this prediction, create a confusion matrix for the model you built and print the sum of all the elements of the matrix. Hint: Use confusion_matrix(Y_test, Y_pred), where Y_test holds the true labels of the test data.

Step 7: Apply the logistic regression model to the Titanic_test.csv data and calculate the accuracy score using: round(accuracy_score(Y_test, Y_pred)*100, 2)

Step 8: Finally, create a dataframe of the final output and write it to output.csv, which is present at output/output.csv


NOTE: Here, 100, 200 and 300 are the answers to the 1st, 2nd and 3rd questions respectively.


Output Format:

  • Perform the above operations and write (written above as print) your output to a file named output.csv, which should be present at the location output/output.csv

  • output.csv should contain the answer to each question on consecutive rows.


***NOTE: For all the questions the numerical values saved in output.csv file should be in integer format with no decimals.
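Writing the answers in the required format can be sketched as below. The helper name `write_answers` is hypothetical, and the sketch assumes the file needs no header row, which the brief does not state explicitly. Each value is truncated to its integer part with `int()` so that no decimals reach the file. The usage example writes to a temporary directory rather than the real output/output.csv.

```python
import os
import tempfile
import pandas as pd

def write_answers(answers, path):
    """Write one integer answer per row to a CSV file at `path`."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # int() keeps only the integer part, e.g. 29.7 becomes 29.
    frame = pd.DataFrame([int(a) for a in answers])
    # Assumption: no header row and no index column in the output file.
    frame.to_csv(path, index=False, header=False)

# Usage with made-up answers, written to a temporary output/output.csv.
out_path = os.path.join(tempfile.mkdtemp(), "output", "output.csv")
write_answers([100, 29.7, 78.0], out_path)
with open(out_path) as fh:
    lines = fh.read().split()
print(lines)  # ['100', '29', '78']
```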


Note: This question will be evaluated based on the number of test cases that your code passes.








If you need a solution for this assignment or have a similar project or assignment, you can email us directly at contact@codersarts.com.
