# ETL


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Our plan

We want to try to build up some functionality to create etl pipelines
for pandas operations. We want to define classic extract, transform and
load functions which we can string together in a pipeline. We would also
like to add some functionality to inspect the intermediate dataframes
and some key properties.

## Our background

We work as Data Science Consultants mainly with SME’s in manufacturing.
We often have to build up etl pipelines from various sources to then use
for analytics or ML workloads. The pipelines can get pretty complex with
custom business logic, that why we want to have a way to quickly show
all the steps we take.

``` python
# Sample manufacturing data
data = {
    'order_id': [101, 102, 103, 104, 105],
    'product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'],
    'quantity': [50, 30, 75, 20, 45],
    'defects': [2, 1, 3, 0, 2],
    'production_time': [120, 95, 150, 80, 110]  # in minutes
}

df = pd.DataFrame(data)
df
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
</tbody>
</table>

</div>

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L18"
target="_blank" style="float:right; font-size:smaller">source</a>

### get_demo_data

``` python

def get_demo_data(
    
):

```

*Call self as a function.*

``` python
df[df["product"] != "Widget C"].groupby(["product"])[["quantity", "defects" ,"production_time"]].sum()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
<tr>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th"></th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">Widget A</td>
<td>125</td>
<td>5</td>
<td>270</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">Widget B</td>
<td>75</td>
<td>3</td>
<td>205</td>
</tr>
</tbody>
</table>

</div>

# Solution space exploration

# Our preferred approach

For a the exploration and discussion with AI on the approaches see the
notebooks in the `nbs` folder.

So we would write a decorator that adds a logging keyword to each
function, then we can determine for each function in the pipeline what
to log/how much to log. We want to keep it as simple as possible. The
pipeline should just take a df and a list of functions as steps of the
pipeline.

``` python
def pipeline(df, steps, vrbs_default=True):
    for func, func_kwargs in steps:
        vrbs =  func_kwargs.get("vrbs", vrbs_default)
        func_kwargs.update({"vrbs": vrbs})
        df = func(df, **func_kwargs)
    return df
```

``` python
def track(func):
    def wrapper(in_df, vrbs=False, *args, **kwargs):
        if vrbs:
            print(f"\n{'*'*10} Step: {func.__name__} {'*'*10}")
            print(f"\nInput DataFrame shape: {in_df.shape}")
            print(in_df)
        out_df = func(in_df, *args, **kwargs)
        if vrbs:
            print(f"\nOutput DataFrame shape: {out_df.shape}")
            print(out_df)
        return out_df
    return wrapper
```

``` python
@track
def filter_products(df):
    return df[df["product"] != "Widget C"]

@track
def aggregate_by_product(df):
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()
```

We include a way to handle pipeline overwrites of the `vrbs` argument
for specific transformation steps. In this example, no information for
`aggregate_by_product` is printed:

``` python
steps = [
    (filter_products, {}),
    (aggregate_by_product, {"vrbs": False}),
]
```

``` python
pipeline(df, steps)
```


    ********** Step: filter_products **********

    Input DataFrame shape: (5, 5)
       order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    3       104  Widget C        20        0               80
    4       105  Widget B        45        2              110

    Output DataFrame shape: (4, 5)
       order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
<tr>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th"></th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">Widget A</td>
<td>125</td>
<td>5</td>
<td>270</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">Widget B</td>
<td>75</td>
<td>3</td>
<td>205</td>
</tr>
</tbody>
</table>

</div>

We want to add basic profiling of time and memory.

``` python
start_time = datetime.now()
sleep(3.14)
end_time = datetime.now()
total_time = end_time - start_time
total_time
```

    datetime.timedelta(seconds=3, microseconds=140237)

``` python
start_time
```

    datetime.datetime(2026, 1, 21, 15, 36, 55, 877745)

``` python
def track(func):
    def wrapper(in_df, vrbs=False, *args, **kwargs):
        start_time = datetime.now()
        if vrbs:
            
            print(f"\n{'*'*10} Step: {func.__name__} {'*'*10}")
            print(f"\nInput DataFrame shape: {in_df.shape}")
            print(f"Start time: {start_time}")
            print(in_df)
        out_df = func(in_df, *args, **kwargs)
        

        end_time = datetime.now()
        total_time = end_time - start_time
        if vrbs:
            print(f"\nOutput DataFrame shape: {out_df.shape}")
            print(f"End time: {end_time}")
            print(f"Total time: {total_time}")
            print(out_df)
        return out_df
    return wrapper
```

``` python
@track
def filter_products(df):
    return df[df["product"] != "Widget C"]

@track
def aggregate_by_product(df):
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()
```

``` python
steps = [
    (filter_products, {}),
    (aggregate_by_product, {}),
]
```

``` python
_df = pipeline(df, steps)
```


    ********** Step: filter_products **********

    Input DataFrame shape: (5, 5)
    Start time: 2026-01-21 15:36:59.046934
       order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    3       104  Widget C        20        0               80
    4       105  Widget B        45        2              110

    Output DataFrame shape: (4, 5)
    End time: 2026-01-21 15:36:59.048467
    Total time: 0:00:00.001533
       order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110

    ********** Step: aggregate_by_product **********

    Input DataFrame shape: (4, 5)
    Start time: 2026-01-21 15:36:59.049173
       order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110

    Output DataFrame shape: (2, 3)
    End time: 2026-01-21 15:36:59.050753
    Total time: 0:00:00.001580
              quantity  defects  production_time
    product                                     
    Widget A       125        5              270
    Widget B        75        3              205

# first reflection and next steps

- right now we are still printing the whole dataframe. We want to change
  that to just show the first 3, 5 random and the last 3 rows, maybe
  with some visual clue to show that the rows are truncated.

- we would also like show the dataframes to the right of the aggregate
  infos (like shape, etc.)

- include information like df.info()

- include information like df.describe()

- include information like diff columns (num cols changed, cols just in
  before, cols just in after)

- include information like diff rows (maybe just the number)

- source code of functions

- description of step, based on docstring

- nicer representation maybe in html or display or something like that

- new wrapper to include assertions and/or sanity checks (propertiy
  based, for example same number of mat_ids)

- function generate pipeline report, in excel (or something else)

Okay our first idea after picking the problem back up is to store the
information in a dictionary and then create different functions to
display the information.

``` python
def track(func):
    @wraps(func)
    def wrapper(in_df, vrbs=False, *args, **kwargs):
        meta_dict = {
            'step_name':func.__name__, # name of the pipeline step
            'in_time':datetime.now(), # time when the pipeline step starts
            'in_df_shape':in_df.shape, # shape of the input dataframe
            'in_df_head':in_df.head(3), # head of the input dataframe
            'in_df_sample':in_df.sample(min(in_df.shape[0], 5)), # sample of the input dataframe
            'in_df_tail':in_df.tail(3), # tail of the input dataframe
        }

        out_df = func(in_df, *args, **kwargs)

        out_time = datetime.now()
        total_time = out_time - meta_dict['in_time']
            
        meta_dict.update({
            'out_time':datetime.now(), # time when the pipeline step stops
            'out_df_shape':out_df.shape, # shape of the output dataframe
            'out_df_head':out_df.head(3), # head of the output dataframe
            'out_df_sample':out_df.sample(min(out_df.shape[0], 5)), # sample of the output dataframe
            'out_df_tail':out_df.tail(3), # tail of the output dataframe
            'total_time':total_time, # difference between in_time and out_time
        })

        if vrbs:
            print('Here we use a fuction to display the information')
            print(meta_dict)
        
        return out_df
    return wrapper
```

``` python
@track
def filter_products(df):
    return df[df["product"] != "Widget C"]

@track
def aggregate_by_product(df):
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()
```

``` python
steps = [
    (filter_products, {}),
    (aggregate_by_product, {}),
]
```

``` python
_df = pipeline(df, steps)
```

    Here we use a fuction to display the information
    {'step_name': 'filter_products', 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 72286), 'in_df_shape': (5, 5), 'in_df_head':    order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, 'in_df_sample':    order_id   product  quantity  defects  production_time
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    0       101  Widget A        50        2              120
    4       105  Widget B        45        2              110
    3       104  Widget C        20        0               80, 'in_df_tail':    order_id   product  quantity  defects  production_time
    2       103  Widget A        75        3              150
    3       104  Widget C        20        0               80
    4       105  Widget B        45        2              110, 'out_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 73085), 'out_df_shape': (4, 5), 'out_df_head':    order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, 'out_df_sample':    order_id   product  quantity  defects  production_time
    4       105  Widget B        45        2              110
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, 'out_df_tail':    order_id   product  quantity  defects  production_time
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110, 'total_time': datetime.timedelta(microseconds=797)}
    Here we use a fuction to display the information
    {'step_name': 'aggregate_by_product', 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 77323), 'in_df_shape': (4, 5), 'in_df_head':    order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, 'in_df_sample':    order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110
    1       102  Widget B        30        1               95, 'in_df_tail':    order_id   product  quantity  defects  production_time
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110, 'out_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 78316), 'out_df_shape': (2, 3), 'out_df_head':           quantity  defects  production_time
    product                                     
    Widget A       125        5              270
    Widget B        75        3              205, 'out_df_sample':           quantity  defects  production_time
    product                                     
    Widget B        75        3              205
    Widget A       125        5              270, 'out_df_tail':           quantity  defects  production_time
    product                                     
    Widget A       125        5              270
    Widget B        75        3              205, 'total_time': datetime.timedelta(microseconds=990)}

# display functions

Here we created a sample meta_dict to play around and create different
functions to display the information

``` python
meta_dict = {
    'step_name': 'filter_products',
    'step_description': 'Exclude products which are Widget C',
    'in_time': datetime.now(),
    'in_df_shape': (5, 5),
    'in_df_head': df.head(3),
    'in_df_sample': df.sample(min(df.shape[0], 5)),
    'in_df_tail': df.tail(3),
    'out_time': datetime.now(),
    'out_df_shape': (4, 5),
    'out_df_head': df[df["product"] != "Widget C"].head(3),
    'out_df_sample': df[df["product"] != "Widget C"].sample(min(4, 5)),
    'out_df_tail': df[df["product"] != "Widget C"].tail(3),
    'total_time': timedelta(microseconds=1287)
}

meta_dict
```

    {'step_name': 'filter_products',
     'step_description': 'Exclude products which are Widget C',
     'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 86925),
     'in_df_shape': (5, 5),
     'in_df_head':    order_id   product  quantity  defects  production_time
     0       101  Widget A        50        2              120
     1       102  Widget B        30        1               95
     2       103  Widget A        75        3              150,
     'in_df_sample':    order_id   product  quantity  defects  production_time
     2       103  Widget A        75        3              150
     1       102  Widget B        30        1               95
     3       104  Widget C        20        0               80
     4       105  Widget B        45        2              110
     0       101  Widget A        50        2              120,
     'in_df_tail':    order_id   product  quantity  defects  production_time
     2       103  Widget A        75        3              150
     3       104  Widget C        20        0               80
     4       105  Widget B        45        2              110,
     'out_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 87152),
     'out_df_shape': (4, 5),
     'out_df_head':    order_id   product  quantity  defects  production_time
     0       101  Widget A        50        2              120
     1       102  Widget B        30        1               95
     2       103  Widget A        75        3              150,
     'out_df_sample':    order_id   product  quantity  defects  production_time
     4       105  Widget B        45        2              110
     0       101  Widget A        50        2              120
     2       103  Widget A        75        3              150
     1       102  Widget B        30        1               95,
     'out_df_tail':    order_id   product  quantity  defects  production_time
     1       102  Widget B        30        1               95
     2       103  Widget A        75        3              150
     4       105  Widget B        45        2              110,
     'total_time': datetime.timedelta(microseconds=1287)}

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L37"
target="_blank" style="float:right; font-size:smaller">source</a>

### StepMeta

``` python

def StepMeta(
    step_name:str, step_description:str, in_time:datetime, in_df_shape:tuple, in_df_head:DataFrame,
    in_df_sample:DataFrame, in_df_tail:DataFrame, out_time:datetime, out_df_shape:tuple, out_df_head:DataFrame,
    out_df_sample:DataFrame, out_df_tail:DataFrame, total_time:timedelta
)->None:

```

*Metadata collected for a single ETL pipeline step.*

``` python
step_meta = StepMeta(
    step_name='filter_products',
    step_description='Exclude products which are Widget C',
    in_time=datetime.now(),
    in_df_shape=df.shape,
    in_df_head=df.head(3),
    in_df_sample=df.sample(min(df.shape[0], 5)),
    in_df_tail=df.tail(3),
    out_time=datetime.now(),
    out_df_shape=(4, 5),
    out_df_head=df[df["product"] != "Widget C"].head(3),
    out_df_sample=df[df["product"] != "Widget C"].sample(4),
    out_df_tail=df[df["product"] != "Widget C"].tail(3),
    total_time=timedelta(microseconds=1287)
)
step_meta
```

    StepMeta(step_name='filter_products', step_description='Exclude products which are Widget C', in_time=datetime.datetime(2026, 1, 21, 15, 36, 59, 108163), in_df_shape=(5, 5), in_df_head=   order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, in_df_sample=   order_id   product  quantity  defects  production_time
    3       104  Widget C        20        0               80
    4       105  Widget B        45        2              110
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, in_df_tail=   order_id   product  quantity  defects  production_time
    2       103  Widget A        75        3              150
    3       104  Widget C        20        0               80
    4       105  Widget B        45        2              110, out_time=datetime.datetime(2026, 1, 21, 15, 36, 59, 108396), out_df_shape=(4, 5), out_df_head=   order_id   product  quantity  defects  production_time
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150, out_df_sample=   order_id   product  quantity  defects  production_time
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110
    0       101  Widget A        50        2              120
    1       102  Widget B        30        1               95, out_df_tail=   order_id   product  quantity  defects  production_time
    1       102  Widget B        30        1               95
    2       103  Widget A        75        3              150
    4       105  Widget B        45        2              110, total_time=datetime.timedelta(microseconds=1287))

## print name and time

``` python
print(15*'*' + ' ' + step_meta.step_name + ' ' + 15*'*')
```

    *************** filter_products ***************

``` python
print(f'Total Time: {step_meta.total_time}')
print(f'Start: {step_meta.in_time}')
print(f'End: {step_meta.out_time}')
```

    Total Time: 0:00:00.001287
    Start: 2026-01-21 15:36:59.108163
    End: 2026-01-21 15:36:59.108396

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L80"
target="_blank" style="float:right; font-size:smaller">source</a>

### format_timedelta

``` python

def format_timedelta(
    td
):

```

*Call self as a function.*

``` python
print(f'Total Time: {format_timedelta(step_meta.total_time)}')
```

    Total Time: 1.29 ms

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L89"
target="_blank" style="float:right; font-size:smaller">source</a>

### print_step_name

``` python

def print_step_name(
    step_meta:StepMeta
):

```

*Call self as a function.*

``` python
print_step_name(step_meta)
```

    *************** filter_products ***************

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L92"
target="_blank" style="float:right; font-size:smaller">source</a>

### print_time

``` python

def print_time(
    step_meta:StepMeta
):

```

*Call self as a function.*

``` python
print_time(step_meta)
```

    Total Time: 1.29 ms

    Start: 2026-01-21 15:36:59.108163
      End: 2026-01-21 15:36:59.108396

## print sample df

``` python
pd.concat([step_meta.in_df_head, step_meta.in_df_sample, step_meta.in_df_tail])
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
</tbody>
</table>

</div>

``` python
pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns)
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
</tr>
</tbody>
</table>

</div>

``` python
pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns).fillna('...')
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</tbody>
</table>

</div>

``` python
pd.DataFrame(np.nan, index=range(3), columns=meta_dict['in_df_head'].columns).fillna('.')
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
<td>.</td>
</tr>
</tbody>
</table>

</div>

``` python
pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns).fillna(':')
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
</tbody>
</table>

</div>

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L99"
target="_blank" style="float:right; font-size:smaller">source</a>

### fill_between_df_parts

``` python

def fill_between_df_parts(
    df
):

```

*Call self as a function.*

``` python
fill_between_df_parts(meta_dict['in_df_head'])
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th"></td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
</tbody>
</table>

</div>

``` python
pd.concat(
    [step_meta.in_df_head, 
    fill_between_df_parts(step_meta.in_df_head),
    step_meta.in_df_sample, 
    fill_between_df_parts(step_meta.in_df_head),
    step_meta.in_df_tail]
    )
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th"></td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th"></td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
</tbody>
</table>

</div>

We want to build to functions: 1. a display function for within our
wrapper which only uses the meta_dict 2. a display function that just
takes in one df and gets the head, sample, tail from there

In the future we might want to turn this into one function but for now
we don’t want to store the whole df in meta_dict

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L103"
target="_blank" style="float:right; font-size:smaller">source</a>

### print_sample_from_meta_dict

``` python

def print_sample_from_meta_dict(
    step_meta:StepMeta, mode:str='in'
):

```

*Call self as a function.*

``` python
print_sample_from_meta_dict(step_meta)
```

      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L113"
target="_blank" style="float:right; font-size:smaller">source</a>

### display_sample_from_df

``` python

def display_sample_from_df(
    df
):

```

*Call self as a function.*

``` python
display_sample_from_df(df)
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
&#10;    .dataframe tbody tr th {
        vertical-align: top;
    }
&#10;    .dataframe thead th {
        text-align: right;
    }
</style>

<table class="dataframe" data-quarto-postprocess="true" data-border="1">
<thead>
<tr style="text-align: right;">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">order_id</th>
<th data-quarto-table-cell-role="th">product</th>
<th data-quarto-table-cell-role="th">quantity</th>
<th data-quarto-table-cell-role="th">defects</th>
<th data-quarto-table-cell-role="th">production_time</th>
</tr>
</thead>
<tbody>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th"></td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">0</td>
<td>101</td>
<td>Widget A</td>
<td>50</td>
<td>2</td>
<td>120</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">1</td>
<td>102</td>
<td>Widget B</td>
<td>30</td>
<td>1</td>
<td>95</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th"></td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
<td>:</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">2</td>
<td>103</td>
<td>Widget A</td>
<td>75</td>
<td>3</td>
<td>150</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">3</td>
<td>104</td>
<td>Widget C</td>
<td>20</td>
<td>0</td>
<td>80</td>
</tr>
<tr>
<td data-quarto-table-cell-role="th">4</td>
<td>105</td>
<td>Widget B</td>
<td>45</td>
<td>2</td>
<td>110</td>
</tr>
</tbody>
</table>

</div>

We will have to also consider what happen if we have to many columns.

Our idea is to truncate them as well and just add a dummy columns with
‘…’ as sign that its truncated

## print shape and shape change

``` python
print('(rows, columns) =', step_meta.in_df_shape)
print('        |          ')
print('        V          ')
print('(rows, columns) =', step_meta.out_df_shape)
```

    (rows, columns) = (5, 5)
            |          
            V          
    (rows, columns) = (4, 5)

``` python
print('(rows, columns) =', step_meta.in_df_shape)
print(' ↓ '*8)
print('(rows, columns) =', step_meta.out_df_shape)
```

    (rows, columns) = (5, 5)
     ↓  ↓  ↓  ↓  ↓  ↓  ↓  ↓ 
    (rows, columns) = (4, 5)

``` python
in_str = f"{step_meta.in_df_shape[0]} rows, {step_meta.in_df_shape[1]} columns"
print(in_str)
print('↓ ↓ ↓'.center(len(in_str), ' '))
print(f"{step_meta.out_df_shape[0]} rows, {step_meta.out_df_shape[1]} columns")
```

    5 rows, 5 columns
          ↓ ↓ ↓      
    4 rows, 5 columns

is there a convenietn way to space out a number of strings over a
specific length?

``` python
s1, s2, s3 = "5 rows", "4 columns", "time: 1.2ms"
total_width = 50
print(f"{s1:<{total_width//3}}{s2:<{total_width//3}}{s3}")
```

    5 rows          4 columns       time: 1.2ms

``` python
width = 20
print(s1.ljust(width) + s2.center(width) + s3.rjust(width))
```

    5 rows                   4 columns               time: 1.2ms

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L123"
target="_blank" style="float:right; font-size:smaller">source</a>

### space_strings

``` python

def space_strings(
    strings, total_width
):

```

*Call self as a function.*

``` python
space_strings(['|', '|', '|'], 20)
```

    '|        |        |'

``` python
len(space_strings(['|', '|', '|'], 20))
```

    19

``` python
in_str = f"{step_meta.in_df_shape[0]} rows, {step_meta.in_df_shape[1]} columns"
print(in_str)
print(space_strings(['↓','↓','↓'], len(in_str)))
print(f"{step_meta.out_df_shape[0]} rows, {step_meta.out_df_shape[1]} columns")
```

    5 rows, 5 columns
    ↓       ↓       ↓
    4 rows, 5 columns

``` python
in_rows, in_cols = step_meta.in_df_shape
out_rows, out_cols = step_meta.out_df_shape

in_str = f"{in_rows} rows, {in_cols} columns"
print(in_str)
print(space_strings(['↓','↓','↓'], len(in_str)))

print(f'{str(out_rows-in_rows).center(len(in_str)//2)}', f'{str(out_cols-in_cols).center(len(in_str)//2)}')

print(space_strings(['↓','↓','↓'], len(in_str)))
print(f"{out_rows} rows, {out_cols} columns")
```

    5 rows, 5 columns
    ↓       ↓       ↓
       -1       0    
    ↓       ↓       ↓
    4 rows, 5 columns

``` python
in_rows, in_cols = step_meta.in_df_shape
out_rows, out_cols = step_meta.out_df_shape
diff_rows, diff_cols = out_rows - in_rows, out_cols - in_cols

diff_rows_str = f"{diff_rows:+d}" if diff_rows != 0 else "0"
diff_cols_str = f"{diff_cols:+d}" if diff_cols != 0 else "0"

# Right-align each column
row_width = max(len(str(in_rows)), len(str(out_rows)), len(diff_rows_str))
col_width = max(len(str(in_cols)), len(str(out_cols)), len(diff_cols_str))

print(f"Input:  {in_rows:>{row_width}} rows, {in_cols:>{col_width}} cols")
print(f"        {' '*row_width}   ↓   {' '*col_width}   ↓")
print(f"Diff:   {diff_rows_str:>{row_width}} rows, {diff_cols_str:>{col_width}} cols")
print(f"        {' '*row_width}   ↓   {' '*col_width}   ↓")
print(f"Output: {out_rows:>{row_width}} rows, {out_cols:>{col_width}} cols")
```

    Input:   5 rows, 5 cols
                 ↓       ↓
    Diff:   -1 rows, 0 cols
                 ↓       ↓
    Output:  4 rows, 5 cols

``` python
print(f"""
Input:  {in_rows:>{row_width}} rows, {in_cols:>{col_width}} cols
        {' '*row_width}   ↓   {' '*col_width}   ↓
Diff:   {diff_rows_str:>{row_width}} rows, {diff_cols_str:>{col_width}} cols
        {' '*row_width}   ↓   {' '*col_width}   ↓
Output: {out_rows:>{row_width}} rows, {out_cols:>{col_width}} cols
""")
```


    Input:   5 rows, 5 cols
                 ↓       ↓
    Diff:   -1 rows, 0 cols
                 ↓       ↓
    Output:  4 rows, 5 cols

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L128"
target="_blank" style="float:right; font-size:smaller">source</a>

### print_shape_change

``` python

def print_shape_change(
    step_meta:StepMeta
):

```

*Call self as a function.*

``` python
print_shape_change(step_meta)
```


            Input:   5 rows, 5 cols
                         ↓       ↓
            Diff:   -1 rows, 0 cols
                         ↓       ↓
            Output:  4 rows, 5 cols
            

## print docstring

``` python
@track
def my_func():
    """This is the docstring."""
    pass

print(my_func.__doc__)
```

    This is the docstring.

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L150"
target="_blank" style="float:right; font-size:smaller">source</a>

### print_step_description

``` python

def print_step_description(
    step_meta:StepMeta
):

```

*Call self as a function.*

``` python
print_step_description(step_meta)
```

    '''Exclude products which are Widget C'''

## print all function

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L153"
target="_blank" style="float:right; font-size:smaller">source</a>

### print_step_info

``` python

def print_step_info(
    step_meta:StepMeta
):

```

*Call self as a function.*

``` python
print_step_info(step_meta)
```

    *************** filter_products ***************
    '''Exclude products which are Widget C'''


    Total Time: 1.29 ms

    Start: 2026-01-21 15:36:59.108163
      End: 2026-01-21 15:36:59.108396

    Input DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110

            Input:   5 rows, 5 cols
                         ↓       ↓
            Diff:   -1 rows, 0 cols
                         ↓       ↓
            Output:  4 rows, 5 cols
            
    Output DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110

# Testing the pipeline

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L166"
target="_blank" style="float:right; font-size:smaller">source</a>

### pipeline

``` python

def pipeline(
    df, steps, vrbs_default:bool=True
):

```

*Call self as a function.*

------------------------------------------------------------------------

<a
href="https://github.com/MIS-Analytics/mis_analytics/blob/main/mis_analytics/etl.py#L175"
target="_blank" style="float:right; font-size:smaller">source</a>

### track

``` python

def track(
    func
):

```

*Call self as a function.*

``` python
@track
def filter_products(df):
    '''Exclude products which are Widget C.'''
    return df[df["product"] != "Widget C"]
```

``` python
@track
def aggregate_by_product(df):
    '''aggregate by product and sum columns "quantity", "defects", "production_time"'''
    return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum()
```

``` python
steps = [
    (filter_products, {'vrbs':True}),
    (aggregate_by_product, {}),
]
```

``` python
_df = pipeline(df, steps, vrbs_default=False)
```

    *************** filter_products ***************
    '''Exclude products which are Widget C.'''


    Total Time: 574 µs

    Start: 2026-01-21 15:36:59.369533
      End: 2026-01-21 15:36:59.370107

    Input DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    2      103  Widget A       75       3             150
    1      102  Widget B       30       1              95
    0      101  Widget A       50       2             120
    4      105  Widget B       45       2             110
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110

            Input:   5 rows, 5 cols
                         ↓       ↓
            Diff:   -1 rows, 0 cols
                         ↓       ↓
            Output:  4 rows, 5 cols
            
    Output DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    2      103  Widget A       75       3             150
    1      102  Widget B       30       1              95
    0      101  Widget A       50       2             120
    4      105  Widget B       45       2             110
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110

``` python
_df = pipeline(df, steps, vrbs_default=True)
```

    *************** filter_products ***************
    '''Exclude products which are Widget C.'''


    Total Time: 401 µs

    Start: 2026-01-21 15:36:59.380798
      End: 2026-01-21 15:36:59.381199

    Input DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    1      102  Widget B       30       1              95
    4      105  Widget B       45       2             110
    2      103  Widget A       75       3             150
    0      101  Widget A       50       2             120
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110

            Input:   5 rows, 5 cols
                         ↓       ↓
            Diff:   -1 rows, 0 cols
                         ↓       ↓
            Output:  4 rows, 5 cols
            
    Output DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    3      104  Widget C       20       0              80
    1      102  Widget B       30       1              95
    4      105  Widget B       45       2             110
    2      103  Widget A       75       3             150
    0      101  Widget A       50       2             120
             :         :        :       :               :
    2      103  Widget A       75       3             150
    3      104  Widget C       20       0              80
    4      105  Widget B       45       2             110


    *************** aggregate_by_product ***************
    '''aggregate by product and sum columns "quantity", "defects", "production_time"'''


    Total Time: 800 µs

    Start: 2026-01-21 15:36:59.386272
      End: 2026-01-21 15:36:59.387072

    Input DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    2      103  Widget A       75       3             150
    4      105  Widget B       45       2             110
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
             :         :        :       :               :
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
    4      105  Widget B       45       2             110

            Input:   4 rows,  5 cols
                         ↓        ↓
            Diff:   -2 rows, -2 cols
                         ↓        ↓
            Output:  2 rows,  3 cols
            
    Output DataFrame:
      order_id   product quantity defects production_time
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
             :         :        :       :               :
    2      103  Widget A       75       3             150
    4      105  Widget B       45       2             110
    0      101  Widget A       50       2             120
    1      102  Widget B       30       1              95
             :         :        :       :               :
    1      102  Widget B       30       1              95
    2      103  Widget A       75       3             150
    4      105  Widget B       45       2             110

Summarize what we did so far

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠

Here’s a summary of what you’ve built so far:

**Goal:** A lightweight ETL pipeline framework for pandas that lets you
chain transformation steps and inspect intermediate results — aimed at
your work as Data Science Consultants in manufacturing.

**What’s been built:**

1.  **`pipeline(df, steps, vrbs_default)`** — Runs a list of
    `(function, kwargs)` steps sequentially on a DataFrame, with
    per-step control over verbosity via the `vrbs` flag.

2.  **`@track` decorator** — Wraps each step function to optionally
    collect and display metadata (timing, shapes, data samples) using
    `@wraps` to preserve docstrings.

3.  **[`StepMeta`](https://MIS-Analytics.github.io/mis_analytics/etl.html#stepmeta)
    dataclass** — A structured container for all the metadata captured
    per step: name, description, timing, input/output shapes,
    head/sample/tail of DataFrames.

4.  **Display helper functions:**

    - [`print_step_name`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_step_name)
      — banner with step name
    - [`print_step_description`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_step_description)
      — docstring display
    - [`print_time`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_time)
      /
      [`format_timedelta`](https://MIS-Analytics.github.io/mis_analytics/etl.html#format_timedelta)
      — human-readable timing (µs, ms, s, min, h)
    - [`print_shape_change`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_shape_change)
      — nice visual showing row/col diffs with arrows
    - [`print_sample_from_meta_dict`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_sample_from_meta_dict)
      — head + sample + tail with `:` separator rows
    - [`display_sample_from_df`](https://MIS-Analytics.github.io/mis_analytics/etl.html#display_sample_from_df)
      — same but directly from a DataFrame
    - [`print_step_info`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_step_info)
      — combines all the above into one output
    - [`fill_between_df_parts`](https://MIS-Analytics.github.io/mis_analytics/etl.html#fill_between_df_parts)
      /
      [`space_strings`](https://MIS-Analytics.github.io/mis_analytics/etl.html#space_strings)
      — formatting utilities

**Still on the to-do list** (from your reflection note): - Column
truncation for wide DataFrames - Side-by-side layout (metadata next to
DataFrames) - `df.info()` / `df.describe()` integration - Column diff
(added/removed columns) - Source code display of step functions -
HTML/rich display - Assertion/sanity-check wrapper - Pipeline report
generation (e.g. Excel)
