1388 lines
226 KiB
Plaintext
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "395d61ec",
|
||
"metadata": {},
|
||
"source": [
"# Lab 1 of Module 4: Working with Data\n",
"\n",
"In this lab, we explore the Titanic dataset in more detail. Objectives:\n",
"* Analyze statistics to describe the data.\n",
"* Produce visualizations that help in understanding the data.\n",
"* Clean the dataset.\n",
"* Prepare the data so that it is ready to be fed to a learning algorithm."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "5117092f",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:44.308342Z",
|
||
"start_time": "2025-09-16T10:06:44.305868Z"
|
||
}
|
||
},
|
||
"source": [
"# Add the required library imports here\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"import sklearn"
|
||
],
|
||
"outputs": [],
|
||
"execution_count": 4
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fde8da96",
|
||
"metadata": {},
|
||
"source": [
"Start by reloading the dataset from a CSV file into a pandas DataFrame. As a reminder, it is available at: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "33fba6ca",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:44.431049Z",
|
||
"start_time": "2025-09-16T10:06:44.315956Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')"
|
||
],
|
||
"outputs": [],
|
||
"execution_count": 5
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "205f765d",
|
||
"metadata": {},
|
||
"source": [
"## Exploring the dataset\n",
"\n",
"Start by answering the following questions. Take the time to analyze your answers carefully so that you become more familiar with the contents of the dataset.\n",
"\n",
"1. How many records does the Titanic dataset contain?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "4ee3884e",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:44.443097Z",
|
||
"start_time": "2025-09-16T10:06:44.440519Z"
|
||
}
|
||
},
|
||
"source": [
"print(\"Number of records:\", len(titanic))"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of records: 891\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 6
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "74f32328",
|
||
"metadata": {},
|
||
"source": [
"2. How many attributes does the dataset have?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "80eeccc3",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:44.467871Z",
|
||
"start_time": "2025-09-16T10:06:44.465137Z"
|
||
}
|
||
},
|
||
"source": [
"print(\"Number of attributes:\", len(titanic.columns))"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of attributes: 12\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 7
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "819573a7",
|
||
"metadata": {},
|
||
"source": [
"3. Identify which columns contain discrete data and which contain continuous data."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "87aa38a1",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:44.497752Z",
|
||
"start_time": "2025-09-16T10:06:44.488952Z"
|
||
}
|
||
},
|
||
"source": [
"titanic.info()"
],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 PassengerId 891 non-null int64 \n",
" 1 Survived 891 non-null int64 \n",
" 2 Pclass 891 non-null int64 \n",
" 3 Name 891 non-null object \n",
" 4 Sex 891 non-null object \n",
" 5 Age 714 non-null float64\n",
" 6 SibSp 891 non-null int64 \n",
" 7 Parch 891 non-null int64 \n",
" 8 Ticket 891 non-null object \n",
" 9 Fare 891 non-null float64\n",
" 10 Cabin 204 non-null object \n",
" 11 Embarked 889 non-null object \n",
"dtypes: float64(2), int64(5), object(5)\n",
"memory usage: 83.7+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 8
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "3daaaf5c",
|
||
"metadata": {},
|
||
"source": [
"**Answer:**\n",
"\n",
"* Discrete data: PassengerId, Survived, Pclass, Name, Sex, SibSp, Parch, Ticket, Cabin, Embarked\n",
"* Continuous data: Age, Fare"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "59733c69",
|
||
"metadata": {},
|
||
"source": [
"4. In the same way, identify which columns hold qualitative data and which hold quantitative data.\n",
"\n",
"**Answer:**\n",
"\n",
"* Qualitative data: Survived, Name, Sex, Ticket, Cabin, Embarked\n",
"* Quantitative data: PassengerId, Pclass, Age, SibSp, Parch, Fare"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "69c4bee2",
|
||
"metadata": {},
|
||
"source": [
"5. Display the basic statistics for the quantitative columns of the dataset.\n",
"What information can you draw from them? For each attribute, look for at least one relevant piece of information that you can deduce from your observations."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "82ebfbb6",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:44.540637Z",
|
||
"start_time": "2025-09-16T10:06:44.518122Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic.describe()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
" PassengerId Survived Pclass Age SibSp \\\n",
|
||
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
|
||
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
|
||
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
|
||
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
|
||
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
|
||
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
|
||
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
|
||
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
|
||
"\n",
|
||
" Parch Fare \n",
|
||
"count 891.000000 891.000000 \n",
|
||
"mean 0.381594 32.204208 \n",
|
||
"std 0.806057 49.693429 \n",
|
||
"min 0.000000 0.000000 \n",
|
||
"25% 0.000000 7.910400 \n",
|
||
"50% 0.000000 14.454200 \n",
|
||
"75% 0.000000 31.000000 \n",
|
||
"max 6.000000 512.329200 "
|
||
],
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>PassengerId</th>\n",
|
||
" <th>Survived</th>\n",
|
||
" <th>Pclass</th>\n",
|
||
" <th>Age</th>\n",
|
||
" <th>SibSp</th>\n",
|
||
" <th>Parch</th>\n",
|
||
" <th>Fare</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>count</th>\n",
|
||
" <td>891.000000</td>\n",
|
||
" <td>891.000000</td>\n",
|
||
" <td>891.000000</td>\n",
|
||
" <td>714.000000</td>\n",
|
||
" <td>891.000000</td>\n",
|
||
" <td>891.000000</td>\n",
|
||
" <td>891.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>mean</th>\n",
|
||
" <td>446.000000</td>\n",
|
||
" <td>0.383838</td>\n",
|
||
" <td>2.308642</td>\n",
|
||
" <td>29.699118</td>\n",
|
||
" <td>0.523008</td>\n",
|
||
" <td>0.381594</td>\n",
|
||
" <td>32.204208</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>std</th>\n",
|
||
" <td>257.353842</td>\n",
|
||
" <td>0.486592</td>\n",
|
||
" <td>0.836071</td>\n",
|
||
" <td>14.526497</td>\n",
|
||
" <td>1.102743</td>\n",
|
||
" <td>0.806057</td>\n",
|
||
" <td>49.693429</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>min</th>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>0.420000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>25%</th>\n",
|
||
" <td>223.500000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>2.000000</td>\n",
|
||
" <td>20.125000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>7.910400</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>50%</th>\n",
|
||
" <td>446.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>3.000000</td>\n",
|
||
" <td>28.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>14.454200</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>75%</th>\n",
|
||
" <td>668.500000</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>3.000000</td>\n",
|
||
" <td>38.000000</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>0.000000</td>\n",
|
||
" <td>31.000000</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>max</th>\n",
|
||
" <td>891.000000</td>\n",
|
||
" <td>1.000000</td>\n",
|
||
" <td>3.000000</td>\n",
|
||
" <td>80.000000</td>\n",
|
||
" <td>8.000000</td>\n",
|
||
" <td>6.000000</td>\n",
|
||
" <td>512.329200</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
]
|
||
},
|
||
"execution_count": 9,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"execution_count": 9
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a5ad3de1",
|
||
"metadata": {},
|
||
"source": [
"**Observations on the statistics:**\n",
"* Sadly, more people died than survived: only about 38% of passengers survived (mean of Survived = 0.38).\n",
"* More than half of the passengers travelled in 3rd class (the median of Pclass is 3).\n",
"* Ages vary widely (from 0.42 to 80 years) and are centered around 30 (mean 29.7, median 28).\n",
"* Most passengers travelled without a sibling or spouse on board (the median of SibSp is 0).\n",
"* Likewise, few passengers travelled with parents or children (the 75th percentile of Parch is 0).\n",
"* Ticket prices vary widely: most fares are low (median 14.45) and far from the maximum (512.33). This is consistent with most passengers travelling in 3rd class."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "46384afa",
|
||
"metadata": {},
|
||
"source": [
"6. On a figure with 6 subplots, draw a histogram showing the distribution of values for each of the following attributes: Survived, Pclass, Sex, Embarked, Age, Fare. What observation(s) can you make for each subplot?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "94ddbac5",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:45.173695Z",
|
||
"start_time": "2025-09-16T10:06:44.560697Z"
|
||
}
|
||
},
|
||
"source": [
"f, axes = plt.subplots(2, 3, figsize=(20, 10))\n",
"\n",
"# Histogram of survival\n",
"sub1 = axes[0, 0]\n",
"sns.histplot(titanic['Survived'], color=\"skyblue\", ax=sub1, bins=2)\n",
"sub1.set_xticks([0, 1])\n",
"\n",
"# Histogram of passenger class\n",
"sub1 = axes[0, 1]\n",
"sns.histplot(titanic['Pclass'], color=\"skyblue\", ax=sub1, bins=3)\n",
"sub1.set_xticks([1, 2, 3])\n",
"\n",
"# Histogram of passenger sex\n",
"sub1 = axes[0, 2]\n",
"sns.histplot(titanic['Sex'], color=\"skyblue\", ax=sub1, bins=2)\n",
"\n",
"# Histogram of the port of embarkation\n",
"sub1 = axes[1, 0]\n",
"sns.histplot(titanic['Embarked'], color=\"skyblue\", ax=sub1)\n",
"\n",
"# Histogram of passenger age, grouped in 10-year bins\n",
"sub1 = axes[1, 1]\n",
"sns.histplot(titanic['Age'], color=\"skyblue\", ax=sub1, binwidth=10)\n",
"\n",
"# Histogram of the ticket fare\n",
"sub1 = axes[1, 2]\n",
"sns.histplot(titanic['Fare'], color=\"skyblue\", ax=sub1)"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Axes: xlabel='Fare', ylabel='Count'>"
|
||
]
|
||
},
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Figure size 2000x1000 with 6 Axes>"
|
||
],
|
||
"image/png": ""
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data",
|
||
"jetTransient": {
|
||
"display_id": null
|
||
}
|
||
}
|
||
],
|
||
"execution_count": 10
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "fd730025",
|
||
"metadata": {},
|
||
"source": [
"**Observations:**\n",
"* More deaths than survivors\n",
"* Most passengers travelled in 3rd class\n",
"* More men than women\n",
"* A large majority of passengers embarked at Southampton\n",
"* Ages are widely spread, with a peak around 30 (consistent with the statistics)\n",
"* Many low-priced tickets, with fares spread over a very wide range."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "620fbbba",
|
||
"metadata": {},
|
||
"source": [
"7. On a single plot, show the number of survivors for each sex. What can you deduce from it? Does a passenger's sex seem relevant for a learning model to predict whether that passenger survived?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "f58cb499",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:45.449518Z",
|
||
"start_time": "2025-09-16T10:06:45.221430Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"sns.catplot(x =\"Sex\", hue =\"Survived\",\n",
|
||
"kind =\"count\", data = titanic)"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<seaborn.axisgrid.FacetGrid at 0x203836716a0>"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Figure size 565.361x500 with 1 Axes>"
|
||
],
|
||
"image/png": ""
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data",
|
||
"jetTransient": {
|
||
"display_id": null
|
||
}
|
||
}
|
||
],
|
||
"execution_count": 11
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "4f61683e",
|
||
"metadata": {},
|
||
"source": [
"**Observation:** There were more men than women on board, but more women survived. While most men died, the trend is reversed for women. Sex is therefore a highly relevant attribute for predicting survival, because it is very discriminating. It will not be sufficient on its own, but it can improve the prediction when combined with other attributes."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "412e2e95",
|
||
"metadata": {},
|
||
"source": [
"8. Using a visualization, do you observe a correlation between some attributes? What can you deduce from this for a future learning model?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "48d8ee62",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:45.719113Z",
|
||
"start_time": "2025-09-16T10:06:45.488383Z"
|
||
}
|
||
},
|
||
"source": [
"# Keep only the numeric columns, because titanic.corr() fails on non-numeric ones.\n",
"# Add annotations (optional but useful) to make the correlations readable.\n",
"# Briefly comment on the heatmap afterwards, to answer \"What can you deduce from this?\".\n",
"\n",
"# 1. Select the numeric columns only\n",
"numeric_df = titanic.select_dtypes(include='number')\n",
"\n",
"# 2. Compute the correlations\n",
"correlation_matrix = numeric_df.corr()\n",
"\n",
"# 3. Display the heatmap\n",
"sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')\n",
"plt.title(\"Correlation matrix of the Titanic numeric attributes\")\n",
"plt.show()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 2 Axes>"
|
||
],
|
||
"image/png": ""
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data",
|
||
"jetTransient": {
|
||
"display_id": null
|
||
}
|
||
}
|
||
],
|
||
"execution_count": 12
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "5f9853b3",
|
||
"metadata": {},
|
||
"source": [
"**Observations:**\n",
"* Correlation between SibSp and Parch, which makes sense since both relate to family\n",
"* Negative correlation between survival and passenger class (1st class is coded 1): very useful for learning\n",
"* Negative correlation between age and passenger class"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ae39b80c",
|
||
"metadata": {},
|
||
"source": [
"9. Based on your observations from the previous visualization, confirm your impressions with two visualizations. For example, if you observed a correlation between an attribute A and an attribute B, highlight the fact that the same values of A are often found together with the same values of B."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "987a4b5f",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:45.859230Z",
|
||
"start_time": "2025-09-16T10:06:45.727921Z"
|
||
}
|
||
},
|
||
"source": [
"# Proposal 1: survival and passenger class\n",
"sns.countplot(data=titanic, x='Pclass', hue='Survived')"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Axes: xlabel='Pclass', ylabel='count'>"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
],
|
||
"image/png": ""
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data",
|
||
"jetTransient": {
|
||
"display_id": null
|
||
}
|
||
}
|
||
],
|
||
"execution_count": 13
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "ae68615d",
|
||
"metadata": {},
|
||
"source": [
"**Observation:** passengers in a higher class had a better chance of survival."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "9d66606a",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.041959Z",
|
||
"start_time": "2025-09-16T10:06:45.865680Z"
|
||
}
|
||
},
|
||
"source": [
"# Proposal 2: age and passenger class\n",
"# Bins start at 0 so that infants under one year old (minimum age 0.42) are included.\n",
"sns.histplot(\n",
"    data=titanic, x='Age', hue='Pclass', multiple='dodge',\n",
"    bins=range(0, 91, 10))"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Axes: xlabel='Age', ylabel='Count'>"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
},
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<Figure size 640x480 with 1 Axes>"
|
||
],
|
||
"image/png": ""
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data",
|
||
"jetTransient": {
|
||
"display_id": null
|
||
}
|
||
}
|
||
],
|
||
"execution_count": 14
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "57cd6075",
|
||
"metadata": {},
|
||
"source": [
"**Observation:** older passengers are more often found in the higher classes."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "67adc89f",
|
||
"metadata": {},
|
||
"source": [
"10. Now put yourself in the role of a data analyst facing a new dataset: which other visualizations seem necessary to you? Propose at least three visualizations that you find relevant. Keep in mind that the goal with this dataset will be to predict whether or not a passenger survived."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "02d7f820",
|
||
"metadata": {},
|
||
"source": [
"**Suggested visualizations:**\n",
"* Survival by age group\n",
"* Survival and number of family members\n",
"* Age and ticket price"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "081aafe5",
|
||
"metadata": {},
|
||
"source": [
"## Cleaning the data\n",
"1. For each column, count the number of missing values."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "ba42c62a",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.062200Z",
|
||
"start_time": "2025-09-16T10:06:46.057233Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic.isna().sum()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"PassengerId 0\n",
|
||
"Survived 0\n",
|
||
"Pclass 0\n",
|
||
"Name 0\n",
|
||
"Sex 0\n",
|
||
"Age 177\n",
|
||
"SibSp 0\n",
|
||
"Parch 0\n",
|
||
"Ticket 0\n",
|
||
"Fare 0\n",
|
||
"Cabin 687\n",
|
||
"Embarked 2\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 15,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"execution_count": 15
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e5bc166d",
|
||
"metadata": {},
|
||
"source": [
"### Handling the port of embarkation\n",
"\n",
"2. Very few values are missing for the port of embarkation. Moreover, since this is a discrete attribute, we can treat a missing value as one additional possible value. Start by displaying the rows for which the port of embarkation is missing."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "5219c39a",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.092543Z",
|
||
"start_time": "2025-09-16T10:06:46.083739Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic[titanic['Embarked'].isna()]"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
" PassengerId Survived Pclass Name \\\n",
|
||
"61 62 1 1 Icard, Miss. Amelie \n",
|
||
"829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) \n",
|
||
"\n",
|
||
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
||
"61 female 38.0 0 0 113572 80.0 B28 NaN \n",
|
||
"829 female 62.0 0 0 113572 80.0 B28 NaN "
|
||
],
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>PassengerId</th>\n",
|
||
" <th>Survived</th>\n",
|
||
" <th>Pclass</th>\n",
|
||
" <th>Name</th>\n",
|
||
" <th>Sex</th>\n",
|
||
" <th>Age</th>\n",
|
||
" <th>SibSp</th>\n",
|
||
" <th>Parch</th>\n",
|
||
" <th>Ticket</th>\n",
|
||
" <th>Fare</th>\n",
|
||
" <th>Cabin</th>\n",
|
||
" <th>Embarked</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>61</th>\n",
|
||
" <td>62</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Icard, Miss. Amelie</td>\n",
|
||
" <td>female</td>\n",
|
||
" <td>38.0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>113572</td>\n",
|
||
" <td>80.0</td>\n",
|
||
" <td>B28</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>829</th>\n",
|
||
" <td>830</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Stone, Mrs. George Nelson (Martha Evelyn)</td>\n",
|
||
" <td>female</td>\n",
|
||
" <td>62.0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>113572</td>\n",
|
||
" <td>80.0</td>\n",
|
||
" <td>B28</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"execution_count": 16
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "363787e8",
|
||
"metadata": {},
|
||
"source": [
"3. Replace these missing values with the value 'U' (for unknown). Check your result by displaying the rows obtained above again:"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "2be8a958",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.297992Z",
|
||
"start_time": "2025-09-16T10:06:46.287441Z"
|
||
}
|
||
},
|
||
"source": [
"titanic['Embarked'] = titanic['Embarked'].fillna('U')\n",
"titanic[titanic['PassengerId'].isin([62, 830])]"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
" PassengerId Survived Pclass Name \\\n",
|
||
"61 62 1 1 Icard, Miss. Amelie \n",
|
||
"829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) \n",
|
||
"\n",
|
||
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
||
"61 female 38.0 0 0 113572 80.0 B28 U \n",
|
||
"829 female 62.0 0 0 113572 80.0 B28 U "
|
||
],
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>PassengerId</th>\n",
|
||
" <th>Survived</th>\n",
|
||
" <th>Pclass</th>\n",
|
||
" <th>Name</th>\n",
|
||
" <th>Sex</th>\n",
|
||
" <th>Age</th>\n",
|
||
" <th>SibSp</th>\n",
|
||
" <th>Parch</th>\n",
|
||
" <th>Ticket</th>\n",
|
||
" <th>Fare</th>\n",
|
||
" <th>Cabin</th>\n",
|
||
" <th>Embarked</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>61</th>\n",
|
||
" <td>62</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Icard, Miss. Amelie</td>\n",
|
||
" <td>female</td>\n",
|
||
" <td>38.0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>113572</td>\n",
|
||
" <td>80.0</td>\n",
|
||
" <td>B28</td>\n",
|
||
" <td>U</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>829</th>\n",
|
||
" <td>830</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>1</td>\n",
|
||
" <td>Stone, Mrs. George Nelson (Martha Evelyn)</td>\n",
|
||
" <td>female</td>\n",
|
||
" <td>62.0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>0</td>\n",
|
||
" <td>113572</td>\n",
|
||
" <td>80.0</td>\n",
|
||
" <td>B28</td>\n",
|
||
" <td>U</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
]
|
||
},
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"execution_count": 17
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a5e22413",
|
||
"metadata": {},
|
||
"source": [
"### Handling the cabin number\n",
"\n",
"4. The cabin number is the attribute with the most missing values (687 out of 891). It has no obvious link with passenger survival. Drop this column from your DataFrame."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "aef1705c",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.473560Z",
|
||
"start_time": "2025-09-16T10:06:46.465828Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic = titanic.drop('Cabin', axis=1)\n",
|
||
"titanic.info()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 891 entries, 0 to 890\n",
|
||
"Data columns (total 11 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 PassengerId 891 non-null int64 \n",
|
||
" 1 Survived 891 non-null int64 \n",
|
||
" 2 Pclass 891 non-null int64 \n",
|
||
" 3 Name 891 non-null object \n",
|
||
" 4 Sex 891 non-null object \n",
|
||
" 5 Age 714 non-null float64\n",
|
||
" 6 SibSp 891 non-null int64 \n",
|
||
" 7 Parch 891 non-null int64 \n",
|
||
" 8 Ticket 891 non-null object \n",
|
||
" 9 Fare 891 non-null float64\n",
|
||
" 10 Embarked 891 non-null object \n",
|
||
"dtypes: float64(2), int64(5), object(4)\n",
|
||
"memory usage: 76.7+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 18
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "112f1631",
|
||
"metadata": {},
|
||
"source": [
"### Handling age\n",
"5. Age is a trickier attribute to handle: it has a substantial number of missing values (177 out of 891), yet it is relevant for predicting a passenger's survival, since the two are related. There are many strategies for replacing these missing values:\n",
"* Use a random value (drawn between the minimum and the maximum)\n",
"* Replace with the mean\n",
"* Replace with a value determined from the other attributes (class, sex, etc.)\n",
"\n",
"Start by computing the mean age for each sex and each class (6 values in total)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "2b2a06ff",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.635521Z",
|
||
"start_time": "2025-09-16T10:06:46.627947Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"age_avg_pclass_sex = titanic.groupby(['Pclass', 'Sex'], as_index=False)['Age'].mean()\n",
|
||
"print(age_avg_pclass_sex)"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" Pclass Sex Age\n",
|
||
"0 1 female 34.611765\n",
|
||
"1 1 male 41.281386\n",
|
||
"2 2 female 28.722973\n",
|
||
"3 2 male 30.740707\n",
|
||
"4 3 female 21.750000\n",
|
||
"5 3 male 26.507589\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 19
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "e42013d7",
|
||
"metadata": {},
|
||
"source": [
"6. For each row of the dataset, if the age is missing, replace it with one of the values computed above. Use the passenger's sex and class to choose the right value."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "3daa4cc6",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.716203Z",
|
||
"start_time": "2025-09-16T10:06:46.710056Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"#La méthode transform() est conçue pour conserver l'index d'origine, ce qui permet d’assigner directement le résultat à une colonne du DataFrame.\n",
|
||
"#apply() modifie l’index → pas idéal pour affecter une colonne.\n",
|
||
"#transform() conserve l’index → parfait pour des remplacements ligne à ligne.\n",
|
||
"\n",
|
||
"titanic['Age'] = titanic['Age'].groupby([titanic['Pclass'], titanic['Sex']]).transform(lambda x: x.fillna(x.mean()))\n",
|
||
"\n",
|
||
"#groupby(['Pclass', 'Sex'])['Age'] : groupe les âges selon la classe et le sexe.\n",
|
||
"#.transform('mean') : remplace chaque ligne du groupe par la moyenne du groupe, en gardant le même index.\n",
|
||
"# x.fillna(x.mean()): fills only the NaN values with the mean of the corresponding group.\n",
|
||
"\n",
|
||
"# Check that the missing values have been replaced\n",
|
||
"print(titanic['Age'].isnull().sum())  # Number of remaining NaN values (should be 0)"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"0\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 20
|
||
},
|
||
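The group-wise imputation used in the cell above can be checked on a small hand-made table. This is only a sketch: the `toy` DataFrame and its values are invented for illustration and are not taken from the Titanic data. It compares the `lambda` form used in the notebook with an equivalent form based on `transform("mean")` followed by `fillna`, and shows that both keep the original index.

```python
import pandas as pd

# Hypothetical toy data (not from the Titanic CSV): two groups, one missing age in each.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, None, 50.0, 20.0, 30.0, None],
})

# Form used in the notebook: fill NaN inside each group with that group's mean.
by_lambda = toy.groupby(["Pclass", "Sex"])["Age"].transform(lambda x: x.fillna(x.mean()))

# Equivalent form: compute the group mean for every row first, then fill only the gaps.
group_mean = toy.groupby(["Pclass", "Sex"])["Age"].transform("mean")
by_fillna = toy["Age"].fillna(group_mean)

print(by_fillna.tolist())
```

Both variants leave the non-missing ages untouched; the second one avoids a Python-level lambda, which may matter on larger tables.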
{
|
||
"cell_type": "markdown",
|
||
"id": "cf24aa8b",
|
||
"metadata": {},
|
||
"source": [
|
||
"## Preparing the data\n",
|
||
"\n",
|
||
"We are entering the last phase of data processing: we will finish shaping the data so that it is ready to be used in a learning process."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "f2ab2605",
|
||
"metadata": {},
|
||
"source": [
|
||
"# Encoding categorical data\n",
|
||
"\n",
|
||
"1. Display the dataset info again. You should have 11 columns, each filled with 891 non-null values."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "3ac1186f",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.803405Z",
|
||
"start_time": "2025-09-16T10:06:46.796857Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic.info()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 891 entries, 0 to 890\n",
|
||
"Data columns (total 11 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 PassengerId 891 non-null int64 \n",
|
||
" 1 Survived 891 non-null int64 \n",
|
||
" 2 Pclass 891 non-null int64 \n",
|
||
" 3 Name 891 non-null object \n",
|
||
" 4 Sex 891 non-null object \n",
|
||
" 5 Age 891 non-null float64\n",
|
||
" 6 SibSp 891 non-null int64 \n",
|
||
" 7 Parch 891 non-null int64 \n",
|
||
" 8 Ticket 891 non-null object \n",
|
||
" 9 Fare 891 non-null float64\n",
|
||
" 10 Embarked 891 non-null object \n",
|
||
"dtypes: float64(2), int64(5), object(4)\n",
|
||
"memory usage: 76.7+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 21
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a62f568e",
|
||
"metadata": {},
|
||
"source": [
|
||
"2. Three columns relate to the unique identification of a passenger and are not relevant for predicting survival. Drop these three columns from your dataset."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "a4b7d99c",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.865523Z",
|
||
"start_time": "2025-09-16T10:06:46.858478Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic = titanic.drop(['PassengerId','Name', 'Ticket'], axis=1)\n",
|
||
"titanic.info()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 891 entries, 0 to 890\n",
|
||
"Data columns (total 8 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 Survived 891 non-null int64 \n",
|
||
" 1 Pclass 891 non-null int64 \n",
|
||
" 2 Sex 891 non-null object \n",
|
||
" 3 Age 891 non-null float64\n",
|
||
" 4 SibSp 891 non-null int64 \n",
|
||
" 5 Parch 891 non-null int64 \n",
|
||
" 6 Fare 891 non-null float64\n",
|
||
" 7 Embarked 891 non-null object \n",
|
||
"dtypes: float64(2), int64(4), object(2)\n",
|
||
"memory usage: 55.8+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 22
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "9126e8f9",
|
||
"metadata": {},
|
||
"source": [
|
||
"3. Repeat a manipulation seen in module 3: combine the two columns describing passengers' families into a single one. Remember to drop the two old columns."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "ad70a5d5",
|
||
"metadata": {
|
||
"scrolled": true,
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.923527Z",
|
||
"start_time": "2025-09-16T10:06:46.915421Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic['Famille'] = titanic['SibSp'] + titanic['Parch']\n",
|
||
"titanic = titanic.drop(['SibSp','Parch'], axis=1)\n",
|
||
"titanic.info()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 891 entries, 0 to 890\n",
|
||
"Data columns (total 7 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 Survived 891 non-null int64 \n",
|
||
" 1 Pclass 891 non-null int64 \n",
|
||
" 2 Sex 891 non-null object \n",
|
||
" 3 Age 891 non-null float64\n",
|
||
" 4 Fare 891 non-null float64\n",
|
||
" 5 Embarked 891 non-null object \n",
|
||
" 6 Famille 891 non-null int64 \n",
|
||
"dtypes: float64(2), int64(3), object(2)\n",
|
||
"memory usage: 48.9+ KB\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 23
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "a96edd21",
|
||
"metadata": {},
|
||
"source": [
|
||
"4. Among the remaining columns, which ones seem suited to one-hot encoding? Using [the scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), apply this encoding to the columns you identified. Finally, remember to drop the original columns.\n",
|
||
"\n",
|
||
"Note: for sex, there are only two possible values in this dataset. With the encoder's `drop='if_binary'` option, you can generate a single column (the other one follows immediately by deduction)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "1316c770",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.969766Z",
|
||
"start_time": "2025-09-16T10:06:46.953407Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"enc = OneHotEncoder(drop='if_binary')\n",
|
||
"\n",
|
||
"one_hot = enc.fit_transform(titanic[['Sex', 'Embarked']]).toarray()\n",
|
||
"one_hot_label = enc.get_feature_names_out(['Sex', 'Embarked'])\n",
|
||
"\n",
|
||
"df = pd.DataFrame(data=one_hot, columns=one_hot_label, index=titanic.index)\n",
|
||
"titanic = titanic.join(df)\n",
|
||
"titanic = titanic.drop(['Sex', 'Embarked'], axis=1)\n",
|
||
"\n",
|
||
"\n",
|
||
"titanic.info()"
|
||
],
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"RangeIndex: 891 entries, 0 to 890\n",
|
||
"Data columns (total 10 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 Survived 891 non-null int64 \n",
|
||
" 1 Pclass 891 non-null int64 \n",
|
||
" 2 Age 891 non-null float64\n",
|
||
" 3 Fare 891 non-null float64\n",
|
||
" 4 Famille 891 non-null int64 \n",
|
||
" 5 Sex_male 891 non-null float64\n",
|
||
" 6 Embarked_C 891 non-null float64\n",
|
||
" 7 Embarked_Q 891 non-null float64\n",
|
||
" 8 Embarked_S 891 non-null float64\n",
|
||
" 9 Embarked_U 891 non-null float64\n",
|
||
"dtypes: float64(7), int64(3)\n",
|
||
"memory usage: 69.7 KB\n"
|
||
]
|
||
}
|
||
],
|
||
"execution_count": 24
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "439cb51a",
|
||
"metadata": {},
|
||
"source": [
|
||
"5. You should now have a dataset with 10 attributes, all numeric (`int64` or `float64`), with no null values. Save this dataset in CSV format so that it can be reused later. There is no need to save the index of the dataframe."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"id": "363a15e3",
|
||
"metadata": {
|
||
"ExecuteTime": {
|
||
"end_time": "2025-09-16T10:06:46.988251Z",
|
||
"start_time": "2025-09-16T10:06:46.978468Z"
|
||
}
|
||
},
|
||
"source": [
|
||
"titanic.to_csv('Titanic.csv', index=False)"
|
||
],
|
||
"outputs": [],
|
||
"execution_count": 25
|
||
},
|
||
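A quick way to confirm that a saved file is usable later is to read it back and check its columns, null count and dtypes. The sketch below assumes a small stand-in frame (`prepared`) and a temporary file name, both hypothetical; it does not touch the `Titanic.csv` file written above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the prepared dataset.
prepared = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0], "Sex_male": [1.0, 0.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_prepared.csv")  # hypothetical file name
    prepared.to_csv(path, index=False)  # index=False: no extra index column is written
    reloaded = pd.read_csv(path)

# The reloaded frame should have the same columns, no missing values, and numeric dtypes only.
same_columns = list(reloaded.columns) == list(prepared.columns)
no_nulls = int(reloaded.isnull().sum().sum()) == 0
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in reloaded.dtypes)
print(same_columns, no_nulls, all_numeric)
```

Without `index=False`, the reloaded frame would gain an extra unnamed column holding the old index.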
{
|
||
"cell_type": "markdown",
|
||
"id": "abd641a1-6329-4e31-9625-4428b2d4f6d7",
|
||
"metadata": {},
|
||
"source": [
|
||
"### The file is saved in the same folder as the notebook."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"id": "bfacd97a-51a4-4faf-887d-561b1cbfacf7",
|
||
"metadata": {},
|
||
"source": [
|
||
"# End of the lab!"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.12.7"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 5
|
||
}
|