.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/03_datetime_encoder.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_03_datetime_encoder.py>`
        to download the full example code, or to run this example in your
        browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_03_datetime_encoder.py:

.. _example_datetime_encoder :

===================================================
Handling datetime features with the DatetimeEncoder
===================================================

In this example, we illustrate how to better integrate datetime features in
machine learning models with the |DatetimeEncoder|.

This encoder breaks down passed datetime features into relevant numerical
features, such as the month, the day of the week, the hour of the day, etc.

It is used by default in the |TableVectorizer|.

.. |DatetimeEncoder| replace:: :class:`~skrub.DatetimeEncoder`

.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |TimeSeriesSplit| replace:: :class:`~sklearn.model_selection.TimeSeriesSplit`

.. |ColumnTransformer| replace:: :class:`~sklearn.compose.ColumnTransformer`

.. |make_column_transformer| replace:: :func:`~sklearn.compose.make_column_transformer`

.. |HGBR| replace:: :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. |ToDatetime| replace:: :class:`~skrub.ToDatetime`

.. GENERATED FROM PYTHON SOURCE LINES 43-49

A problem with relevant datetime features
-----------------------------------------

We will use a dataset of bike sharing demand in 2011 and 2012.
In this setting, we want to predict the number of bike rentals, based
on the date, time and weather conditions.

.. GENERATED FROM PYTHON SOURCE LINES 49-64

.. code-block:: Python

    from pprint import pprint

    import pandas as pd

    from skrub import datasets

    data = datasets.fetch_bike_sharing().bike_sharing
    # Extract our input data (X) and the target column (y)
    y = data["cnt"]
    X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

    X

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading 'bike_sharing' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/bike_sharing.zip (attempt 1/3)

.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;"> <th></th> <th>date</th> <th>holiday</th> <th>temp</th> <th>hum</th> <th>windspeed</th> <th>weathersit</th> </tr>
      </thead>
      <tbody>
        <tr> <th>0</th> <td>2011-01-01 00:00:00</td> <td>0</td> <td>0.24</td> <td>0.81</td> <td>0.0000</td> <td>1</td> </tr>
        <tr> <th>1</th> <td>2011-01-01 01:00:00</td> <td>0</td> <td>0.22</td> <td>0.80</td> <td>0.0000</td> <td>1</td> </tr>
        <tr> <th>2</th> <td>2011-01-01 02:00:00</td> <td>0</td> <td>0.22</td> <td>0.80</td> <td>0.0000</td> <td>1</td> </tr>
        <tr> <th>3</th> <td>2011-01-01 03:00:00</td> <td>0</td> <td>0.24</td> <td>0.75</td> <td>0.0000</td> <td>1</td> </tr>
        <tr> <th>4</th> <td>2011-01-01 04:00:00</td> <td>0</td> <td>0.24</td> <td>0.75</td> <td>0.0000</td> <td>1</td> </tr>
        <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr>
        <tr> <th>17374</th> <td>2012-12-31 19:00:00</td> <td>0</td> <td>0.26</td> <td>0.60</td> <td>0.1642</td> <td>2</td> </tr>
        <tr> <th>17375</th> <td>2012-12-31 20:00:00</td> <td>0</td> <td>0.26</td> <td>0.60</td> <td>0.1642</td> <td>2</td> </tr>
        <tr> <th>17376</th> <td>2012-12-31 21:00:00</td> <td>0</td> <td>0.26</td> <td>0.60</td> <td>0.1642</td> <td>1</td> </tr>
        <tr> <th>17377</th> <td>2012-12-31 22:00:00</td> <td>0</td> <td>0.26</td> <td>0.56</td> <td>0.1343</td> <td>1</td> </tr>
        <tr> <th>17378</th> <td>2012-12-31 23:00:00</td> <td>0</td> <td>0.26</td> <td>0.65</td> <td>0.1343</td> <td>1</td> </tr>
      </tbody>
    </table>
    <p>17379 rows × 6 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 65-67

.. code-block:: Python

    y

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0         16
    1         40
    2         32
    3         13
    4          1
            ...
    17374    119
    17375     89
    17376     90
    17377     61
    17378     49
    Name: cnt, Length: 17379, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 68-69

We convert the dataframe's ``"date"`` column using |ToDatetime|.

.. GENERATED FROM PYTHON SOURCE LINES 69-76

.. code-block:: Python

    from skrub import ToDatetime

    date = ToDatetime().fit_transform(X["date"])

    print("original dtype:", X["date"].dtypes, "\n\nconverted dtype:", date.dtypes)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    original dtype: object

    converted dtype: datetime64[ns]

.. GENERATED FROM PYTHON SOURCE LINES 77-86

Encoding the features
.....................

We now encode this column with a |DatetimeEncoder|, instantiated here with
its default parameters. With these defaults, the encoder extracts the year,
month, day and hour, plus the total number of seconds since the Unix epoch.
Nothing finer than the hour appears in the output, which suits this dataset
because rentals are recorded hourly; minutes, seconds and smaller units would
carry no information. The day of the week is not extracted by default; we add
it later in this example with ``add_weekday=True``.

.. GENERATED FROM PYTHON SOURCE LINES 86-93

.. code-block:: Python

    from skrub import DatetimeEncoder

    date_enc = DatetimeEncoder().fit_transform(date)

    print(date, "\n\nHas been encoded as:\n\n", date_enc)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0       2011-01-01 00:00:00
    1       2011-01-01 01:00:00
    2       2011-01-01 02:00:00
    3       2011-01-01 03:00:00
    4       2011-01-01 04:00:00
                    ...
    17374   2012-12-31 19:00:00
    17375   2012-12-31 20:00:00
    17376   2012-12-31 21:00:00
    17377   2012-12-31 22:00:00
    17378   2012-12-31 23:00:00
    Name: date, Length: 17379, dtype: datetime64[ns]

    Has been encoded as:

            date_year  date_month  date_day  date_hour  date_total_seconds
    0         2011.0         1.0       1.0        0.0        1.293840e+09
    1         2011.0         1.0       1.0        1.0        1.293844e+09
    2         2011.0         1.0       1.0        2.0        1.293847e+09
    3         2011.0         1.0       1.0        3.0        1.293851e+09
    4         2011.0         1.0       1.0        4.0        1.293854e+09
    ...          ...         ...       ...        ...                 ...
    17374     2012.0        12.0      31.0       19.0        1.356980e+09
    17375     2012.0        12.0      31.0       20.0        1.356984e+09
    17376     2012.0        12.0      31.0       21.0        1.356988e+09
    17377     2012.0        12.0      31.0       22.0        1.356991e+09
    17378     2012.0        12.0      31.0       23.0        1.356995e+09

    [17379 rows x 5 columns]

.. GENERATED FROM PYTHON SOURCE LINES 94-97

We see that the encoder is working as expected: the column has been replaced
by features holding the year, month, day, hour and total number of seconds
since the Unix epoch.

.. GENERATED FROM PYTHON SOURCE LINES 99-106

One-liner with the TableVectorizer
..................................

As mentioned earlier, the |TableVectorizer| makes use of the
|DatetimeEncoder| by default. Note that ``X["date"]`` is still
a string, but will be automatically transformed into a datetime in the
|TableVectorizer|.

.. GENERATED FROM PYTHON SOURCE LINES 106-112

.. code-block:: Python

    from skrub import TableVectorizer

    table_vec = TableVectorizer().fit(X)
    pprint(table_vec.get_feature_names_out())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array(['date_year', 'date_month', 'date_day', 'date_hour',
           'date_total_seconds', 'holiday', 'temp', 'hum', 'windspeed',
           'weathersit'], dtype='<U18')

.. GENERATED FROM PYTHON SOURCE LINES 113-117

If we want to customize the |DatetimeEncoder| inside the |TableVectorizer|,
we can replace its default parameter with a new, custom instance.

Here, for example, we want it to extract the day of the week:

.. GENERATED FROM PYTHON SOURCE LINES 117-122

.. code-block:: Python

    # use the ``datetime`` argument to customize how datetimes are handled
    table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
    pprint(table_vec_weekday.get_feature_names_out())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array(['date_year', 'date_month', 'date_day', 'date_hour',
           'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
           'windspeed', 'weathersit'], dtype='<U18')

.. GENERATED FROM PYTHON SOURCE LINES 123-129

.. note::
    For more information on how to customize the |TableVectorizer|, see
    :ref:`sphx_glr_auto_examples_01_dirty_categories.py`.

Inspecting the |TableVectorizer| further, we can check that the
|DatetimeEncoder| is used on the correct column(s).

.. GENERATED FROM PYTHON SOURCE LINES 129-131

.. code-block:: Python

    pprint(table_vec_weekday.transformers_)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'date': DatetimeEncoder(add_weekday=True),
     'holiday': PassThrough(),
     'hum': PassThrough(),
     'temp': PassThrough(),
     'weathersit': PassThrough(),
     'windspeed': PassThrough()}

.. GENERATED FROM PYTHON SOURCE LINES 132-140

Prediction with datetime features
---------------------------------

For prediction tasks, we recommend using the |TableVectorizer| inside a
pipeline, combined with a model that can use the features extracted by the
|DatetimeEncoder|.
Here we'll use a |HGBR| as our learner.

.. GENERATED FROM PYTHON SOURCE LINES 140-146

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(table_vec, HistGradientBoostingRegressor())
    pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())

.. GENERATED FROM PYTHON SOURCE LINES 147-156

Evaluating the model
....................

When using date and time features, we often care about predicting the future.
In this case, we have to be careful when evaluating our model, because
the default cross-validation settings do not respect time ordering.

Instead, we can use the |TimeSeriesSplit|,
which ensures that the test set is always in the future.

.. GENERATED FROM PYTHON SOURCE LINES 156-166

.. code-block:: Python

    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    cross_val_score(
        pipeline,
        X,
        y,
        scoring="neg_mean_squared_error",
        cv=TimeSeriesSplit(n_splits=5),
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([ -6664.06809097,  -6520.55512687, -21139.48809623, -12732.66488604,
           -13952.10808106])

.. GENERATED FROM PYTHON SOURCE LINES 167-174

Plotting the prediction
.......................

The mean squared error is not easy to interpret, so we visually
compare the predictions of our model with the actual values.
To do so, we divide our dataset into a train and a test set:
we use 2011 data to predict what happened in 2012.

.. GENERATED FROM PYTHON SOURCE LINES 174-215

.. code-block:: Python

    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    mask_train = X["date"] < "2012-01-01"
    X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
    y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    pipeline_weekday.fit(X_train, y_train)
    y_pred_weekday = pipeline_weekday.predict(X_test)

    fig, ax = plt.subplots(figsize=(12, 3))
    fig.suptitle("Predictions with tree models")
    ax.plot(
        X.tail(96)["date"],
        y.tail(96).values,
        "x-",
        alpha=0.2,
        label="Actual demand",
        color="black",
    )
    ax.plot(
        X_test.tail(96)["date"],
        y_pred[-96:],
        "x-",
        label="DatetimeEncoder() + HGBR prediction",
    )
    ax.plot(
        X_test.tail(96)["date"],
        y_pred_weekday[-96:],
        "x-",
        label="DatetimeEncoder(add_weekday=True) + HGBR prediction",
    )

    ax.tick_params(axis="x", labelsize=7, labelrotation=75)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    _ = ax.legend()
    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_03_datetime_encoder_001.png
   :alt: Predictions with tree models
   :srcset: /auto_examples/images/sphx_glr_03_datetime_encoder_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 216-217

As we can see, adding the weekday yields better predictions on our test set.

.. GENERATED FROM PYTHON SOURCE LINES 220-228

Feature importances
-------------------

Using the |DatetimeEncoder| allows us to better understand how the date
impacts the bike sharing demand. To this end, we can compute the
importance of the features created by the |DatetimeEncoder|, using the
:func:`~sklearn.inspection.permutation_importance` function, which
shuffles the values of one feature at a time and measures how much the
model's score degrades.

.. GENERATED FROM PYTHON SOURCE LINES 230-258

.. code-block:: Python

    from sklearn.inspection import permutation_importance

    # In this case, we don't use the whole pipeline, because we want to compute the
    # importance of the features created by the DatetimeEncoder
    X_test_transform = pipeline[:-1].transform(X_test)

    result = permutation_importance(
        pipeline[-1], X_test_transform, y_test, n_repeats=10, random_state=0
    )

    result = pd.DataFrame(
        dict(
            feature_names=X_test_transform.columns,
            std=result.importances_std,
            importances=result.importances_mean,
        )
    ).sort_values("importances", ascending=True)

    result.plot.barh(
        y="importances",
        x="feature_names",
        title="Feature Importances",
        xerr="std",
        figsize=(12, 9),
    )
    plt.tight_layout()
    plt.show()

.. image-sg:: /auto_examples/images/sphx_glr_03_datetime_encoder_002.png
   :alt: Feature Importances
   :srcset: /auto_examples/images/sphx_glr_03_datetime_encoder_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 259-269

We can see that the hour of the day, the temperature and the humidity
are the most important features, which seems reasonable.

Conclusion
----------

In this example, we saw how to use the |DatetimeEncoder| to create
features from a datetime column.
Also check out the |TableVectorizer|, which automatically recognizes
and transforms datetime columns by default.

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 7.382 seconds)

.. _sphx_glr_download_auto_examples_03_datetime_encoder.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/03_datetime_encoder.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/03_datetime_encoder.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 03_datetime_encoder.ipynb <03_datetime_encoder.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 03_datetime_encoder.py <03_datetime_encoder.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 03_datetime_encoder.zip <03_datetime_encoder.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
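
As a complement to the example above, the following sketch shows one way to
build columns similar to those produced by the |DatetimeEncoder| using plain
pandas. It is only an illustration of the idea, not the encoder's actual
implementation: the two timestamps are hypothetical toy values, and the
weekday numbering is the pandas convention, which may differ from the one
skrub uses.

.. code-block:: Python

    import pandas as pd

    # Hypothetical toy timestamps (not loaded from the bike sharing dataset).
    dates = pd.Series(
        pd.to_datetime(["2011-01-01 00:00:00", "2012-12-31 23:00:00"]), name="date"
    )

    # Build columns analogous to those shown in the encoder output.
    features = pd.DataFrame(
        {
            "date_year": dates.dt.year,
            "date_month": dates.dt.month,
            "date_day": dates.dt.day,
            "date_hour": dates.dt.hour,
            # Seconds elapsed since the Unix epoch (1970-01-01 00:00:00).
            "date_total_seconds": (
                dates - pd.Timestamp("1970-01-01")
            ).dt.total_seconds(),
            # Assumption: pandas numbering (Monday=0 ... Sunday=6); the
            # numbering used by the DatetimeEncoder may differ.
            "date_weekday": dates.dt.dayofweek,
        }
    )

    print(features)

The ``date_total_seconds`` value obtained for ``2011-01-01 00:00:00`` can be
compared with the first row of the encoder output shown earlier.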