.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/03_datetime_encoder.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_03_datetime_encoder.py>`
        to download the full example code, or to run this example in your browser via JupyterLite or Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_03_datetime_encoder.py:


.. _example_datetime_encoder :

===================================================
Handling datetime features with the DatetimeEncoder
===================================================

In this example, we illustrate how to better integrate datetime features
in machine learning models with the |DatetimeEncoder|.

This encoder breaks down datetime features into numerical features,
such as the month, the day of the week, the hour of the day, etc.

It is used by default in the |TableVectorizer|.


.. |DatetimeEncoder| replace::
    :class:`~skrub.DatetimeEncoder`

.. |TableVectorizer| replace::
    :class:`~skrub.TableVectorizer`

.. |OneHotEncoder| replace::
    :class:`~sklearn.preprocessing.OneHotEncoder`

.. |TimeSeriesSplit| replace::
    :class:`~sklearn.model_selection.TimeSeriesSplit`

.. |ColumnTransformer| replace::
    :class:`~sklearn.compose.ColumnTransformer`

.. |make_column_transformer| replace::
    :func:`~sklearn.compose.make_column_transformer`

.. |HGBR| replace::
    :class:`~sklearn.ensemble.HistGradientBoostingRegressor`

.. |ToDatetime| replace::
    :class:`~skrub.ToDatetime`

.. GENERATED FROM PYTHON SOURCE LINES 43-49

A problem with relevant datetime features
-----------------------------------------

We will use a dataset of bike sharing demand in 2011 and 2012.
In this setting, we want to predict the number of bike rentals, based
on the date, time and weather conditions.

.. GENERATED FROM PYTHON SOURCE LINES 49-64

.. code-block:: Python


    from pprint import pprint

    import pandas as pd

    from skrub import datasets

    data = datasets.fetch_bike_sharing().bike_sharing

    # Extract our input data (X) and the target column (y)
    y = data["cnt"]
    X = data[["date", "holiday", "temp", "hum", "windspeed", "weathersit"]]

    X





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Downloading 'bike_sharing' from https://github.com/skrub-data/skrub-data-files/raw/refs/heads/main/bike_sharing.zip (attempt 1/3)


.. raw:: html

    <div class="output_subarea output_html rendered_html output_result">
    <div>
    <style scoped>
        .dataframe tbody tr th:only-of-type {
            vertical-align: middle;
        }

        .dataframe tbody tr th {
            vertical-align: top;
        }

        .dataframe thead th {
            text-align: right;
        }
    </style>
    <table border="1" class="dataframe">
      <thead>
        <tr style="text-align: right;">
          <th></th>
          <th>date</th>
          <th>holiday</th>
          <th>temp</th>
          <th>hum</th>
          <th>windspeed</th>
          <th>weathersit</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>0</th>
          <td>2011-01-01 00:00:00</td>
          <td>0</td>
          <td>0.24</td>
          <td>0.81</td>
          <td>0.0000</td>
          <td>1</td>
        </tr>
        <tr>
          <th>1</th>
          <td>2011-01-01 01:00:00</td>
          <td>0</td>
          <td>0.22</td>
          <td>0.80</td>
          <td>0.0000</td>
          <td>1</td>
        </tr>
        <tr>
          <th>2</th>
          <td>2011-01-01 02:00:00</td>
          <td>0</td>
          <td>0.22</td>
          <td>0.80</td>
          <td>0.0000</td>
          <td>1</td>
        </tr>
        <tr>
          <th>3</th>
          <td>2011-01-01 03:00:00</td>
          <td>0</td>
          <td>0.24</td>
          <td>0.75</td>
          <td>0.0000</td>
          <td>1</td>
        </tr>
        <tr>
          <th>4</th>
          <td>2011-01-01 04:00:00</td>
          <td>0</td>
          <td>0.24</td>
          <td>0.75</td>
          <td>0.0000</td>
          <td>1</td>
        </tr>
        <tr>
          <th>...</th>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
          <td>...</td>
        </tr>
        <tr>
          <th>17374</th>
          <td>2012-12-31 19:00:00</td>
          <td>0</td>
          <td>0.26</td>
          <td>0.60</td>
          <td>0.1642</td>
          <td>2</td>
        </tr>
        <tr>
          <th>17375</th>
          <td>2012-12-31 20:00:00</td>
          <td>0</td>
          <td>0.26</td>
          <td>0.60</td>
          <td>0.1642</td>
          <td>2</td>
        </tr>
        <tr>
          <th>17376</th>
          <td>2012-12-31 21:00:00</td>
          <td>0</td>
          <td>0.26</td>
          <td>0.60</td>
          <td>0.1642</td>
          <td>1</td>
        </tr>
        <tr>
          <th>17377</th>
          <td>2012-12-31 22:00:00</td>
          <td>0</td>
          <td>0.26</td>
          <td>0.56</td>
          <td>0.1343</td>
          <td>1</td>
        </tr>
        <tr>
          <th>17378</th>
          <td>2012-12-31 23:00:00</td>
          <td>0</td>
          <td>0.26</td>
          <td>0.65</td>
          <td>0.1343</td>
          <td>1</td>
        </tr>
      </tbody>
    </table>
    <p>17379 rows × 6 columns</p>
    </div>
    </div>
    <br />
    <br />

.. GENERATED FROM PYTHON SOURCE LINES 65-67

.. code-block:: Python

    y





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    0         16
    1         40
    2         32
    3         13
    4          1
            ... 
    17374    119
    17375     89
    17376     90
    17377     61
    17378     49
    Name: cnt, Length: 17379, dtype: int64



.. GENERATED FROM PYTHON SOURCE LINES 68-69

We convert the dataframe's ``"date"`` column using |ToDatetime|.

.. GENERATED FROM PYTHON SOURCE LINES 69-76

.. code-block:: Python


    from skrub import ToDatetime

    date = ToDatetime().fit_transform(X["date"])

    print("original dtype:", X["date"].dtypes, "\n\nconverted dtype:", date.dtypes)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    original dtype: object 

    converted dtype: datetime64[ns]
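
For readers more familiar with pandas, the snippet below sketches a comparable
conversion with ``pandas.to_datetime``. It is only an illustration on a
hypothetical toy column. It is not how |ToDatetime| is implemented, and the two
may differ in how they handle ambiguous or invalid values.

.. code-block:: Python

    import pandas as pd

    # Hypothetical toy column of timestamps stored as strings
    raw = pd.Series(["2011-01-01 00:00:00", "2011-01-01 01:00:00"], name="date")

    # Parse the strings into a datetime64 column
    converted = pd.to_datetime(raw)

    print(raw.dtype, "->", converted.dtype)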




.. GENERATED FROM PYTHON SOURCE LINES 77-86

Encoding the features
.....................

We now encode this column with a |DatetimeEncoder|.

We instantiate the |DatetimeEncoder| with its default parameters. As the
output below shows, it then extracts the year, month, day and hour, plus the
total number of seconds since the Unix epoch. Nothing finer than the hour is
extracted, which suits this dataset: it is sampled hourly, so minutes, seconds
and smaller units carry no information.

.. GENERATED FROM PYTHON SOURCE LINES 86-93

.. code-block:: Python


    from skrub import DatetimeEncoder

    date_enc = DatetimeEncoder().fit_transform(date)

    print(date, "\n\nHas been encoded as:\n\n", date_enc)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    0       2011-01-01 00:00:00
    1       2011-01-01 01:00:00
    2       2011-01-01 02:00:00
    3       2011-01-01 03:00:00
    4       2011-01-01 04:00:00
                    ...        
    17374   2012-12-31 19:00:00
    17375   2012-12-31 20:00:00
    17376   2012-12-31 21:00:00
    17377   2012-12-31 22:00:00
    17378   2012-12-31 23:00:00
    Name: date, Length: 17379, dtype: datetime64[ns] 

    Has been encoded as:

            date_year  date_month  date_day  date_hour  date_total_seconds
    0         2011.0         1.0       1.0        0.0        1.293840e+09
    1         2011.0         1.0       1.0        1.0        1.293844e+09
    2         2011.0         1.0       1.0        2.0        1.293847e+09
    3         2011.0         1.0       1.0        3.0        1.293851e+09
    4         2011.0         1.0       1.0        4.0        1.293854e+09
    ...          ...         ...       ...        ...                 ...
    17374     2012.0        12.0      31.0       19.0        1.356980e+09
    17375     2012.0        12.0      31.0       20.0        1.356984e+09
    17376     2012.0        12.0      31.0       21.0        1.356988e+09
    17377     2012.0        12.0      31.0       22.0        1.356991e+09
    17378     2012.0        12.0      31.0       23.0        1.356995e+09

    [17379 rows x 5 columns]




.. GENERATED FROM PYTHON SOURCE LINES 94-97

We see that the encoder is working as expected: the column has
been replaced by features holding the year, month, day, hour and
total seconds since the Unix epoch. The day of the week is not extracted
by default.
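
To make the meaning of these columns concrete, the following sketch rebuilds
similar features by hand with the pandas ``.dt`` accessor, using the first and
last timestamps of the dataset. The column names mirror the output above, but
the code is an assumption-laden illustration and not the encoder's actual
implementation.

.. code-block:: Python

    import pandas as pd

    # First and last timestamps of the dataset, as shown in the output above
    dates = pd.Series(
        pd.to_datetime(["2011-01-01 00:00:00", "2012-12-31 23:00:00"]), name="date"
    )

    manual = pd.DataFrame(
        {
            "date_year": dates.dt.year,
            "date_month": dates.dt.month,
            "date_day": dates.dt.day,
            "date_hour": dates.dt.hour,
            # Seconds elapsed since the Unix epoch (1970-01-01 00:00:00)
            "date_total_seconds": (dates - pd.Timestamp("1970-01-01")).dt.total_seconds(),
        }
    )
    print(manual)

The calendar columns capture periodic patterns (such as the hour of the day),
while the total seconds column is a monotonically increasing value that lets a
model pick up long-term trends.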

.. GENERATED FROM PYTHON SOURCE LINES 99-106

One-liner with the TableVectorizer
..................................

As mentioned earlier, the |TableVectorizer| makes use of the
|DatetimeEncoder| by default. Note that ``X["date"]`` is still
a string, but will be automatically transformed into a datetime in the
|TableVectorizer|.

.. GENERATED FROM PYTHON SOURCE LINES 106-112

.. code-block:: Python


    from skrub import TableVectorizer

    table_vec = TableVectorizer().fit(X)
    pprint(table_vec.get_feature_names_out())





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array(['date_year', 'date_month', 'date_day', 'date_hour',
           'date_total_seconds', 'holiday', 'temp', 'hum', 'windspeed',
           'weathersit'], dtype='<U18')




.. GENERATED FROM PYTHON SOURCE LINES 113-117

If we want to customize the |DatetimeEncoder| inside the |TableVectorizer|,
we can pass a custom instance through its ``datetime`` parameter.

Here, for example, we want it to extract the day of the week:

.. GENERATED FROM PYTHON SOURCE LINES 117-122

.. code-block:: Python


    # use the ``datetime`` argument to customize how datetimes are handled
    table_vec_weekday = TableVectorizer(datetime=DatetimeEncoder(add_weekday=True)).fit(X)
    pprint(table_vec_weekday.get_feature_names_out())





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array(['date_year', 'date_month', 'date_day', 'date_hour',
           'date_total_seconds', 'date_weekday', 'holiday', 'temp', 'hum',
           'windspeed', 'weathersit'], dtype='<U18')
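
As a rough illustration of what a weekday feature encodes, the sketch below
computes the day of the week with pandas on three hypothetical dates. Note that
pandas numbers days from Monday=0 to Sunday=6; the numbering used by the
|DatetimeEncoder| may differ, so treat the exact values as an assumption.

.. code-block:: Python

    import pandas as pd

    # A Saturday, the following Monday, and the last day of the dataset (a Monday)
    dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-03", "2012-12-31"]))

    # pandas convention: Monday=0, ..., Sunday=6
    day_of_week = dates.dt.dayofweek
    is_weekend = day_of_week >= 5

    print(pd.DataFrame({"date": dates, "day_of_week": day_of_week, "weekend": is_weekend}))

Such a feature lets a model distinguish working days from weekends, a
distinction that the day of the month alone does not expose.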




.. GENERATED FROM PYTHON SOURCE LINES 123-129

.. note::
    For more information on how to customize the |TableVectorizer|, see
    :ref:`sphx_glr_auto_examples_01_dirty_categories.py`.

Inspecting the |TableVectorizer| further, we can check that the
|DatetimeEncoder| is used on the correct column(s).

.. GENERATED FROM PYTHON SOURCE LINES 129-131

.. code-block:: Python

    pprint(table_vec_weekday.transformers_)





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'date': DatetimeEncoder(add_weekday=True),
     'holiday': PassThrough(),
     'hum': PassThrough(),
     'temp': PassThrough(),
     'weathersit': PassThrough(),
     'windspeed': PassThrough()}




.. GENERATED FROM PYTHON SOURCE LINES 132-140

Prediction with datetime features
---------------------------------

For prediction tasks, we recommend using the |TableVectorizer| inside a
pipeline, combined with a model that can use the features extracted by the
|DatetimeEncoder|.
Here we'll use a |HGBR| as our learner.


.. GENERATED FROM PYTHON SOURCE LINES 140-146

.. code-block:: Python

    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.pipeline import make_pipeline

    pipeline = make_pipeline(table_vec, HistGradientBoostingRegressor())
    pipeline_weekday = make_pipeline(table_vec_weekday, HistGradientBoostingRegressor())








.. GENERATED FROM PYTHON SOURCE LINES 147-156

Evaluating the model
....................

When using date and time features, we often care about predicting the future.
In this case, we have to be careful when evaluating our model, because
scikit-learn's default cross-validation splitters do not respect time
ordering: a model could be trained on data that is later than its test set.

Instead, we can use the |TimeSeriesSplit|,
which ensures that the test set is always later than the training set.
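
The splitting behavior can be checked on a small example. The sketch below
uses eight hypothetical samples, assumed to be sorted by time as the rows of
our dataset are, and prints the train and test indices of each split.

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    # Hypothetical toy data: 8 samples, assumed to be sorted by time
    samples = np.arange(8)

    splits = list(TimeSeriesSplit(n_splits=3).split(samples))
    for train_idx, test_idx in splits:
        # Every test index is larger than every train index
        print("train:", train_idx, "test:", test_idx)

Because the splitter works on row positions, it only respects time ordering
if the rows are already in chronological order.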

.. GENERATED FROM PYTHON SOURCE LINES 156-166

.. code-block:: Python

    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    cross_val_score(
        pipeline,
        X,
        y,
        scoring="neg_mean_squared_error",
        cv=TimeSeriesSplit(n_splits=5),
    )





.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    array([ -6664.06809097,  -6520.55512687, -21139.48809623, -12732.66488604,
           -13952.10808106])
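
These scores are negated mean squared errors, so they are expressed in squared
rentals. As a small sketch, taking the square root of their opposite gives a
root mean squared error in the same unit as the target. The values below are
copied from the output above and would differ in another run.

.. code-block:: Python

    import numpy as np

    # Scores copied from the cross-validation output above
    neg_mse = np.array(
        [-6664.06809097, -6520.55512687, -21139.48809623, -12732.66488604, -13952.10808106]
    )

    # Flip the sign, then take the square root to get an error in number of rentals
    rmse = np.sqrt(-neg_mse)
    print(rmse.round(1))

Alternatively, scikit-learn accepts ``scoring="neg_root_mean_squared_error"``
to obtain the negated root mean squared error directly.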



.. GENERATED FROM PYTHON SOURCE LINES 167-174

Plotting the prediction
.......................

The mean squared error is hard to interpret on its own, so we visually
compare the predictions of our model with the actual values.
To do so, we divide our dataset into a train and a test set:
we use 2011 data to predict what happened in 2012.

.. GENERATED FROM PYTHON SOURCE LINES 174-215

.. code-block:: Python

    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    mask_train = X["date"] < "2012-01-01"
    X_train, X_test = X.loc[mask_train], X.loc[~mask_train]
    y_train, y_test = y.loc[mask_train], y.loc[~mask_train]

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    pipeline_weekday.fit(X_train, y_train)
    y_pred_weekday = pipeline_weekday.predict(X_test)

    fig, ax = plt.subplots(figsize=(12, 3))
    fig.suptitle("Predictions with tree models")
    ax.plot(
        X.tail(96)["date"],
        y.tail(96).values,
        "x-",
        alpha=0.2,
        label="Actual demand",
        color="black",
    )
    ax.plot(
        X_test.tail(96)["date"],
        y_pred[-96:],
        "x-",
        label="DatetimeEncoder() + HGBR prediction",
    )
    ax.plot(
        X_test.tail(96)["date"],
        y_pred_weekday[-96:],
        "x-",
        label="DatetimeEncoder(add_weekday=True) + HGBR prediction",
    )

    ax.tick_params(axis="x", labelsize=7, labelrotation=75)
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
    _ = ax.legend()
    plt.tight_layout()
    plt.show()



.. image-sg:: /auto_examples/images/sphx_glr_03_datetime_encoder_001.png
   :alt: Predictions with tree models
   :srcset: /auto_examples/images/sphx_glr_03_datetime_encoder_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 216-217

As we can see, adding the weekday yields better predictions on our test set.
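
The training mask used above compares the ``"date"`` strings directly with
``"2012-01-01"``. This works because timestamps written in the
``YYYY-MM-DD HH:MM:SS`` format sort lexicographically in chronological order.
The sketch below checks this on three hypothetical timestamps by comparing the
string-based mask with one computed on parsed datetimes.

.. code-block:: Python

    import pandas as pd

    # Hypothetical timestamps around the train/test boundary
    stamps = pd.Series(
        ["2011-12-31 23:00:00", "2012-01-01 00:00:00", "2012-06-15 12:00:00"]
    )

    # Mask computed on raw strings (lexicographic comparison)
    mask_str = stamps < "2012-01-01"
    # Mask computed on parsed datetimes (chronological comparison)
    mask_dt = pd.to_datetime(stamps) < pd.Timestamp("2012-01-01")

    print(mask_str.tolist(), mask_dt.tolist())

With other date formats, such as ``DD/MM/YYYY``, the string comparison would
not be chronological, and converting to datetimes first would be the safer
choice.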

.. GENERATED FROM PYTHON SOURCE LINES 220-228

Feature importances
-------------------

Using the |DatetimeEncoder| allows us to better understand how the date
impacts the bike sharing demand. To this end, we can compute the
importance of the features created by the |DatetimeEncoder|, using the
:func:`~sklearn.inspection.permutation_importance` function, which
shuffles the values of one feature at a time and measures how much the
model's score degrades.

.. GENERATED FROM PYTHON SOURCE LINES 230-258

.. code-block:: Python

    from sklearn.inspection import permutation_importance

    # In this case, we don't use the whole pipeline, because we want to compute the
    # importance of the features created by the DatetimeEncoder
    X_test_transform = pipeline[:-1].transform(X_test)

    result = permutation_importance(
        pipeline[-1], X_test_transform, y_test, n_repeats=10, random_state=0
    )

    result = pd.DataFrame(
        dict(
            feature_names=X_test_transform.columns,
            std=result.importances_std,
            importances=result.importances_mean,
        )
    ).sort_values("importances", ascending=True)

    result.plot.barh(
        y="importances",
        x="feature_names",
        title="Feature Importances",
        xerr="std",
        figsize=(12, 9),
    )
    plt.tight_layout()
    plt.show()




.. image-sg:: /auto_examples/images/sphx_glr_03_datetime_encoder_002.png
   :alt: Feature Importances
   :srcset: /auto_examples/images/sphx_glr_03_datetime_encoder_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 259-269

We can see that the hour of the day, the temperature and the humidity
are the most important features, which seems reasonable.
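
The idea behind permutation importance can be sketched by hand. The example
below uses hypothetical synthetic data and a linear model, chosen only for
brevity: it shuffles one column at a time and records the drop in the model's
score. It is a simplified illustration, not a replacement for
:func:`~sklearn.inspection.permutation_importance`, which also repeats the
shuffling several times.

.. code-block:: Python

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Hypothetical data: the target depends on column 0 only
    X_toy = rng.normal(size=(500, 2))
    y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=500)

    model = LinearRegression().fit(X_toy, y_toy)
    baseline = model.score(X_toy, y_toy)  # R^2 on the unshuffled data


    def importance_by_shuffling(col):
        """Return the drop in R^2 when column ``col`` is shuffled."""
        X_shuffled = X_toy.copy()
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        return baseline - model.score(X_shuffled, y_toy)


    drops = [importance_by_shuffling(col) for col in range(2)]
    print(drops)

Shuffling the informative column breaks its link with the target, so the
score drops sharply, whereas shuffling the uninformative column leaves the
score almost unchanged.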

Conclusion
----------

In this example, we saw how to use the |DatetimeEncoder| to create
features from a datetime column.
Also check out the |TableVectorizer|, which automatically recognizes
and transforms datetime columns by default.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 7.382 seconds)


.. _sphx_glr_download_auto_examples_03_datetime_encoder.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/skrub-data/skrub/main?urlpath=lab/tree/notebooks/auto_examples/03_datetime_encoder.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../lite/lab/index.html?path=auto_examples/03_datetime_encoder.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: 03_datetime_encoder.ipynb <03_datetime_encoder.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: 03_datetime_encoder.py <03_datetime_encoder.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: 03_datetime_encoder.zip <03_datetime_encoder.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_