Beyond the Iceberg: Addressing Hidden Fare Inflation in Titanic Data

Authors

  • Swee Chuan Tan, Singapore University of Social Sciences

DOI:

https://doi.org/10.56042/jsir.v84i7.16992

Keywords:

Data preprocessing, Kaggle dataset, Machine learning, Passenger survival analysis, Titanic fare

Abstract

The Titanic passenger data was originally compiled by the British Board of Trade as part of its investigation into the sinking of the Royal Mail Ship Titanic. For many years, research on the Titanic disaster remained largely the domain of historians and enthusiasts. Its popularity in the machine learning community surged after Kaggle released a curated version of the dataset for a data analysis competition. Since then, the dataset has been widely adopted in data science education and research, including for teaching data preprocessing and analysis and for benchmarking machine learning algorithms. However, the dataset contains a previously overlooked flaw: this paper shows that the average fares per passenger class computed from the Kaggle dataset differ substantially from those published by NBC Los Angeles News in June 2023. In particular, group fares have been recorded as individual passenger fares, which systematically inflates fare values and may have led to misinterpretations over the years. A methodological correction for the Fare attribute is proposed, whereby each group fare is divided equally among all passengers within the same travel group. This adjustment yields a 15.6% improvement in Spearman’s correlation between fare and passenger class. Additionally, experimental results demonstrate that the fare correction improves the prediction performance of classification and regression trees. It is hoped that this correction will enhance the dataset’s utility for future education and research.
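The correction described in the abstract can be illustrated with a minimal sketch. It assumes that passengers sharing a ticket identifier form one travel group (the abstract does not state how groups are identified) and uses hypothetical toy records rather than values from the Kaggle file. Spearman's correlation is computed as Pearson's correlation on tie-averaged ranks.

```python
from collections import Counter

# Hypothetical toy records; "ticket" is ASSUMED to identify a travel group,
# and "fare" holds the recorded fare (the group total for group bookings).
passengers = [
    {"pclass": 1, "ticket": "T100", "fare": 100.0},
    {"pclass": 1, "ticket": "T100", "fare": 100.0},
    {"pclass": 2, "ticket": "T200", "fare": 13.0},
    {"pclass": 3, "ticket": "T300", "fare": 32.0},
    {"pclass": 3, "ticket": "T300", "fare": 32.0},
    {"pclass": 3, "ticket": "T300", "fare": 32.0},
    {"pclass": 3, "ticket": "T300", "fare": 32.0},
    {"pclass": 3, "ticket": "T400", "fare": 8.0},
]


def per_person_fares(records):
    """Divide each recorded fare equally among passengers sharing a ticket."""
    group_sizes = Counter(r["ticket"] for r in records)
    return [r["fare"] / group_sizes[r["ticket"]] for r in records]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


pclass = [r["pclass"] for r in passengers]
raw = [r["fare"] for r in passengers]
corrected = per_person_fares(passengers)

rho_raw = spearman(raw, pclass)
rho_corrected = spearman(corrected, pclass)
print(f"Spearman (raw fare vs. class):       {rho_raw:.3f}")
print(f"Spearman (corrected fare vs. class): {rho_corrected:.3f}")
```

In this toy data the third-class group fare exceeds the solo second-class fare until it is split per person, so the corrected fare tracks passenger class more closely than the raw fare. The figures it prints are illustrative only and are unrelated to the 15.6% result reported in the abstract.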

Published

25-07-2025

Section

Computer Sciences, Communication and Information Technology