Ethnicity & Survival: The Titanic Dataset

How I accounted for ethnic/racial bias with feature engineering, which improved my models and boosted my rank in the Kaggle competition to the top 4%.

Andrew Ritchie
6 min read · Apr 27, 2021
[Photo by Kristopher Roller on Unsplash: a submerged person holding a sparkler]

Alright… To all the data scientists out there reading this, I know this dataset has been covered ad nauseam. I promise this is something a little different and worth your time.

I’m going to show you how I boosted my submission for the Kaggle Titanic competition from an OK ranking (top 40%) to a great ranking (top 4%) with nothing but feature engineering. Specifically, I’m going to show you how to extract probabilistic passenger ethnicity from the names in the dataset so that you can train stronger machine learning models. I did additional feature engineering too (all the stuff you’ve seen before, such as extracting titles and binning ages), but that ground has been covered extensively and I don’t want to needlessly lengthen this article repeating it.

If you’re interested in a number, my submission scored 0.80143.

Let’s Get Down to Business

In order to get passenger ethnicity from names, we’re going to be using the Python ethnicolr library.

Note: ethnicolr downgrades many key packages which were needed later in the Kaggle version of the notebook. To simplify things and work around this, I did the feature engineering with ethnicolr in a Jupyter environment and then uploaded the resulting dataset to Kaggle. If you want to skip the process, I’ll link to my dataset at the end of the article.

Step 1: Install ethnicolr
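In a notebook, this is a one-liner (pip shown here; adjust for your environment):

```
!pip install ethnicolr
```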

…You may need to restart your kernel after this step

Step 2: Import pandas and load the datasets
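A minimal version, assuming the standard competition files are in the working directory:

```python
import pandas as pd

# Load the Kaggle Titanic competition files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
```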

Step 3: Split the ‘Name’ string

We are going to use the pred_wiki_name function from ethnicolr. The function takes the last name and first name as separate inputs, so we will need to split the string containing both into the desired pieces.
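Titanic names follow the pattern “Braund, Mr. Owen Harris”: last name before the comma, then the title, then the first name. Here’s one way to do the split (the FirstName/LastName column names are my own choice):

```python
# Last name is everything before the comma
train['LastName'] = train['Name'].str.split(',').str[0].str.strip()

# First name follows the title: "Braund, Mr. Owen Harris" -> "Owen"
train['FirstName'] = (
    train['Name']
    .str.split(',').str[1]   # " Mr. Owen Harris"
    .str.split('.').str[1]   # " Owen Harris"
    .str.strip()
    .str.split(' ').str[0]   # "Owen"
)
```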

Remember to repeat this for the test set as well
[Image: the resulting DataFrame]

Step 4: Using Ethnicolr

How it works, from the project documentation:

We exploit the US census data, the Florida voting registration data, and the Wikipedia data collected by Skiena and colleagues, to predict race and ethnicity based on first and last name or just the last name. The granularity at which we predict the race depends on the dataset. For instance, Skiena et al.’s Wikipedia data is at the ethnic group level, while the census data we use in the model (the raw data has additional categories of Native Americans and Bi-racial) merely categorizes between Non-Hispanic Whites, Non-Hispanic Blacks, Asians, and Hispanics.

Let’s try it out, using the pred_wiki_name function:
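A sketch of the call, assuming the split columns from Step 3:

```python
from ethnicolr import pred_wiki_name

# Appends one probability column per ethnic category, plus a 'race'
# column holding the most likely category for each passenger
train = pred_wiki_name(train, lname_col='LastName', fname_col='FirstName')
test = pred_wiki_name(test, lname_col='LastName', fname_col='FirstName')
```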

Here’s the resulting DataFrame…

Ethnicolr assigns probabilities that the name belongs to each of the racial/ethnic categories, as well as a predicted race based on those probabilities.

Voila! We now have the new features for training our models!

Exploring the Effects of Race/Ethnicity on Survival

The purpose of this article is to introduce you to a novel approach to feature engineering for the Titanic dataset. Ethnicolr provides a probabilistic assessment of ethnicity; as such, statistical analysis of ethnicity and survival using this dataset should be undertaken cautiously. With that warning out of the way, let’s do a gentle exploration of ethnicity and survival rates.
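As a starting point, something like this compares survival rates across ethnicolr’s predicted categories (a sketch; ‘race’ is the prediction column ethnicolr appends):

```python
import matplotlib.pyplot as plt

# Mean of the 0/1 'Survived' flag per predicted ethnic category
survival_by_group = train.groupby('race')['Survived'].mean().sort_values()
survival_by_group.plot.barh()
plt.xlabel('Survival rate')
plt.show()
```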

Looking at the above, we do see variation in survival rates between the different ethnic classes in the sample. Let’s take a closer look at survival rates, the means of other socioeconomic factors, and sample sizes:
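A summary along those lines might look like this (a sketch, under the same column-name assumptions as above):

```python
# Survival rate, socioeconomic means, and sample size per predicted category
summary = train.groupby('race').agg(
    SurvivalRate=('Survived', 'mean'),
    MeanAge=('Age', 'mean'),
    MeanFare=('Fare', 'mean'),
    PctFemale=('Sex', lambda s: (s == 'female').mean()),
    N=('PassengerId', 'count'),
).sort_values('N', ascending=False)
print(summary)
```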

  • We see there are several ethnic classes with very small sample sizes, making analysis on them more difficult.
  • British ethnicity is the most common amongst the passengers, by quite a stretch. This makes intuitive sense, as it was a British ship and set sail from Britain.
  • There is variation in the means of Age and Fare, and in the breakdown by Sex, across groups. To draw statistically significant conclusions about the effect of race/ethnicity on survival odds, these implied socioeconomic factors would need to be held constant (ceteris paribus).

To avoid the issue of small sample sizes, let’s focus on the groups with a sample size of 25 or more:
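Filtering the summary from above:

```python
# Keep only predicted categories with at least 25 passengers
big_groups = summary[summary['N'] >= 25]
print(big_groups.sort_values('SurvivalRate', ascending=False))
```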

Looking at the survival rates for these passengers, the divides seem to run along ethnic lines more than racial ones. Ethnically European passengers would typically all fall under the categorization of Caucasian (White), yet survival rates vary greatly between Caucasian passengers.

A Potential Hypothesis

The Titanic sank in 1912, a couple of years before WWI. It seems reasonable to assume there were intra-European tensions in the years leading up to the war, and that these tensions would have been present between European passengers on the ship. There may be evidence of that in the data: as mentioned earlier, the Titanic was a British ship, and survival rates appear to favour Britain and her soon-to-be allies (France, Italy) at the expense of their soon-to-be enemy in the war (Germany).

  • To simplify, I am excluding Eastern Europe from the hypothesis, as the allegiances of nations within this broad category were split during the war.
  • It should also be clarified that Italy was allied with Germany at the outbreak of the war in 1914; however, the country remained neutral until 1915, when it entered the war on the side of Britain and France.

Modeling and results

After a little additional feature engineering, and feature selection using Boruta (sketched below), here are the results:
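For reference, a minimal Boruta run looks something like this (a sketch, not my exact setup; the feature list is illustrative):

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# Illustrative, NaN-free features; in practice, use your engineered
# features, encoded and imputed (Boruta needs a numeric matrix)
feature_cols = ['Pclass', 'SibSp', 'Parch']
X = train[feature_cols].values
y = train['Survived'].values

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators='auto', random_state=42)
selector.fit(X, y)

selected = [c for c, keep in zip(feature_cols, selector.support_) if keep]
print('Selected features:', selected)
```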

[Table: results of 5-fold cross-validation on the full training set]
[Table: model performance for a 66/33 train/test split]

Conclusion

I’d be lying if I told you I haven’t tried to grid-search the model hyperparameters for this dataset before. Doing so yielded a negligible improvement in model performance. It also yielded a respectable, but not particularly brag-worthy, top-40% ranking.

By giving up on spending countless hours of compute time trying to brute-force my way to a better result, and exploring something novel instead, I achieved a top-4% ranking among ~34,000 competitors purely with feature engineering, and with zero hyperparameter tuning for the models.

I hope you find this useful and it can help you improve your own performance on the dataset. Feel free to hit me up on LinkedIn or GitHub!

Link to my dataset here.

🛸💨🌵
