# ETL ## Our plan We want to try to build up some functionality to create etl pipelines for pandas operations. We want to define classic extract, transform and load functions which we can string together in a pipeline. We would also like to add some functionality to inspect the intermediate dataframes and some key properties. ## Our background We work as Data Science Consultants mainly with SME’s in manufacturing. We often have to build up etl pipelines from various sources to then use for analytics or ML workloads. The pipelines can get pretty complex with custom business logic, that why we want to have a way to quickly show all the steps we take. ``` python # Sample manufacturing data data = { 'order_id': [101, 102, 103, 104, 105], 'product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B'], 'quantity': [50, 30, 75, 20, 45], 'defects': [2, 1, 3, 0, 2], 'production_time': [120, 95, 150, 80, 110] # in minutes } df = pd.DataFrame(data) df ```

	order_id	product	quantity	defects	production_time
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
2	103	Widget A	75	3	150
3	104	Widget C	20	0	80
4	105	Widget B	45	2	110

------------------------------------------------------------------------ source ### get_demo_data ``` python def get_demo_data( ): ``` *Call self as a function.* ``` python df[df["product"] != "Widget C"].groupby(["product"])[["quantity", "defects" ,"production_time"]].sum() ```

	quantity	defects	production_time
product
Widget A	125	5	270
Widget B	75	3	205

# Solution space exploration # Our preferred approach For a the exploration and discussion with AI on the approaches see the notebooks in the `nbs` folder. So we would write a decorator that adds a logging keyword to each function, then we can determine for each function in the pipeline what to log/how much to log. We want to keep it as simple as possible. The pipeline should just take a df and a list of functions as steps of the pipeline. ``` python def pipeline(df, steps, vrbs_default=True): for func, func_kwargs in steps: vrbs = func_kwargs.get("vrbs", vrbs_default) func_kwargs.update({"vrbs": vrbs}) df = func(df, **func_kwargs) return df ``` ``` python def track(func): def wrapper(in_df, vrbs=False, *args, **kwargs): if vrbs: print(f"\n{'*'*10} Step: {func.__name__} {'*'*10}") print(f"\nInput DataFrame shape: {in_df.shape}") print(in_df) out_df = func(in_df, *args, **kwargs) if vrbs: print(f"\nOutput DataFrame shape: {out_df.shape}") print(out_df) return out_df return wrapper ``` ``` python @track def filter_products(df): return df[df["product"] != "Widget C"] @track def aggregate_by_product(df): return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum() ``` We include a way to handle pipeline overwrites of the `vrbs` argument for specific transformation steps. In this example, no information for `aggregate_by_product` is printed: ``` python steps = [ (filter_products, {}), (aggregate_by_product, {"vrbs": False}), ] ``` ``` python pipeline(df, steps) ``` ********** Step: filter_products ********** Input DataFrame shape: (5, 5) order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 Output DataFrame shape: (4, 5) order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110

	quantity	defects	production_time
product
Widget A	125	5	270
Widget B	75	3	205

We want to add basic profiling of time and memory. ``` python start_time = datetime.now() sleep(3.14) end_time = datetime.now() total_time = end_time - start_time total_time ``` datetime.timedelta(seconds=3, microseconds=140237) ``` python start_time ``` datetime.datetime(2026, 1, 21, 15, 36, 55, 877745) ``` python def track(func): def wrapper(in_df, vrbs=False, *args, **kwargs): start_time = datetime.now() if vrbs: print(f"\n{'*'*10} Step: {func.__name__} {'*'*10}") print(f"\nInput DataFrame shape: {in_df.shape}") print(f"Start time: {start_time}") print(in_df) out_df = func(in_df, *args, **kwargs) end_time = datetime.now() total_time = end_time - start_time if vrbs: print(f"\nOutput DataFrame shape: {out_df.shape}") print(f"End time: {end_time}") print(f"Total time: {total_time}") print(out_df) return out_df return wrapper ``` ``` python @track def filter_products(df): return df[df["product"] != "Widget C"] @track def aggregate_by_product(df): return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum() ``` ``` python steps = [ (filter_products, {}), (aggregate_by_product, {}), ] ``` ``` python _df = pipeline(df, steps) ``` ********** Step: filter_products ********** Input DataFrame shape: (5, 5) Start time: 2026-01-21 15:36:59.046934 order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 Output DataFrame shape: (4, 5) End time: 2026-01-21 15:36:59.048467 Total time: 0:00:00.001533 order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 ********** Step: aggregate_by_product ********** Input DataFrame shape: (4, 5) Start time: 2026-01-21 15:36:59.049173 order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 Output DataFrame shape: (2, 3) End time: 2026-01-21 15:36:59.050753 Total time: 0:00:00.001580 quantity defects production_time product Widget A 125 5 270 Widget B 75 3 205 # first reflection and next steps - right now we are still printing the whole dataframe. We want to change that to just show the first 3, 5 random and the last 3 rows, maybe with some visual clue to show that the rows are truncated. - we would also like show the dataframes to the right of the aggregate infos (like shape, etc.) - include information like df.info() - include information like df.describe() - include information like diff columns (num cols changed, cols just in before, cols just in after) - include information like diff rows (maybe just the number) - source code of functions - description of step, based on docstring - nicer representation maybe in html or display or something like that - new wrapper to include assertions and/or sanity checks (propertiy based, for example same number of mat_ids) - function generate pipeline report, in excel (or something else) Okay our first idea after picking the problem back up is to store the information in a dictionary and then create different functions to display the information. ``` python def track(func): @wraps(func) def wrapper(in_df, vrbs=False, *args, **kwargs): meta_dict = { 'step_name':func.__name__, # name of the pipeline step 'in_time':datetime.now(), # time when the pipeline step starts 'in_df_shape':in_df.shape, # shape of the input dataframe 'in_df_head':in_df.head(3), # head of the input dataframe 'in_df_sample':in_df.sample(min(in_df.shape[0], 5)), # sample of the input dataframe 'in_df_tail':in_df.tail(3), # tail of the input dataframe } out_df = func(in_df, *args, **kwargs) out_time = datetime.now() total_time = out_time - meta_dict['in_time'] meta_dict.update({ 'out_time':datetime.now(), # time when the pipeline step stops 'out_df_shape':out_df.shape, # shape of the output dataframe 'out_df_head':out_df.head(3), # head of the output dataframe 'out_df_sample':out_df.sample(min(out_df.shape[0], 5)), # sample of the output dataframe 'out_df_tail':out_df.tail(3), # tail of the output dataframe 'total_time':total_time, # difference between in_time and out_time }) if vrbs: print('Here we use a fuction to display the information') print(meta_dict) return out_df return wrapper ``` ``` python @track def filter_products(df): return df[df["product"] != "Widget C"] @track def aggregate_by_product(df): return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum() ``` ``` python steps = [ (filter_products, {}), (aggregate_by_product, {}), ] ``` ``` python _df = pipeline(df, steps) ``` Here we use a fuction to display the information {'step_name': 'filter_products', 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 72286), 'in_df_shape': (5, 5), 'in_df_head': order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, 'in_df_sample': order_id product quantity defects production_time 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 0 101 Widget A 50 2 120 4 105 Widget B 45 2 110 3 104 Widget C 20 0 80, 'in_df_tail': order_id product quantity defects production_time 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110, 'out_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 73085), 'out_df_shape': (4, 5), 'out_df_head': order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, 'out_df_sample': order_id product quantity defects production_time 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, 'out_df_tail': order_id product quantity defects production_time 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110, 'total_time': datetime.timedelta(microseconds=797)} Here we use a fuction to display the information {'step_name': 'aggregate_by_product', 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 77323), 'in_df_shape': (4, 5), 'in_df_head': order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, 'in_df_sample': order_id product quantity defects production_time 0 101 Widget A 50 2 120 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 1 102 Widget B 30 1 95, 'in_df_tail': order_id product quantity defects production_time 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110, 'out_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 78316), 'out_df_shape': (2, 3), 'out_df_head': quantity defects production_time product Widget A 125 5 270 Widget B 75 3 205, 'out_df_sample': quantity defects production_time product Widget B 75 3 205 Widget A 125 5 270, 'out_df_tail': quantity defects production_time product Widget A 125 5 270 Widget B 75 3 205, 'total_time': datetime.timedelta(microseconds=990)} # display functions Here we created a sample meta_dict to play around and create different functions to display the information ``` python meta_dict = { 'step_name': 'filter_products', 'step_description': 'Exclude products which are Widget C', 'in_time': datetime.now(), 'in_df_shape': (5, 5), 'in_df_head': df.head(3), 'in_df_sample': df.sample(min(df.shape[0], 5)), 'in_df_tail': df.tail(3), 'out_time': datetime.now(), 'out_df_shape': (4, 5), 'out_df_head': df[df["product"] != "Widget C"].head(3), 'out_df_sample': df[df["product"] != "Widget C"].sample(min(4, 5)), 'out_df_tail': df[df["product"] != "Widget C"].tail(3), 'total_time': timedelta(microseconds=1287) } meta_dict ``` {'step_name': 'filter_products', 'step_description': 'Exclude products which are Widget C', 'in_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 86925), 'in_df_shape': (5, 5), 'in_df_head': order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, 'in_df_sample': order_id product quantity defects production_time 2 103 Widget A 75 3 150 1 102 Widget B 30 1 95 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120, 'in_df_tail': order_id product quantity defects production_time 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110, 'out_time': datetime.datetime(2026, 1, 21, 15, 36, 59, 87152), 'out_df_shape': (4, 5), 'out_df_head': order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, 'out_df_sample': order_id product quantity defects production_time 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 2 103 Widget A 75 3 150 1 102 Widget B 30 1 95, 'out_df_tail': order_id product quantity defects production_time 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110, 'total_time': datetime.timedelta(microseconds=1287)} ------------------------------------------------------------------------ source ### StepMeta ``` python def StepMeta( step_name:str, step_description:str, in_time:datetime, in_df_shape:tuple, in_df_head:DataFrame, in_df_sample:DataFrame, in_df_tail:DataFrame, out_time:datetime, out_df_shape:tuple, out_df_head:DataFrame, out_df_sample:DataFrame, out_df_tail:DataFrame, total_time:timedelta )->None: ``` *Metadata collected for a single ETL pipeline step.* ``` python step_meta = StepMeta( step_name='filter_products', step_description='Exclude products which are Widget C', in_time=datetime.now(), in_df_shape=df.shape, in_df_head=df.head(3), in_df_sample=df.sample(min(df.shape[0], 5)), in_df_tail=df.tail(3), out_time=datetime.now(), out_df_shape=(4, 5), out_df_head=df[df["product"] != "Widget C"].head(3), out_df_sample=df[df["product"] != "Widget C"].sample(4), out_df_tail=df[df["product"] != "Widget C"].tail(3), total_time=timedelta(microseconds=1287) ) step_meta ``` StepMeta(step_name='filter_products', step_description='Exclude products which are Widget C', in_time=datetime.datetime(2026, 1, 21, 15, 36, 59, 108163), in_df_shape=(5, 5), in_df_head= order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, in_df_sample= order_id product quantity defects production_time 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, in_df_tail= order_id product quantity defects production_time 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110, out_time=datetime.datetime(2026, 1, 21, 15, 36, 59, 108396), out_df_shape=(4, 5), out_df_head= order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150, out_df_sample= order_id product quantity defects production_time 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95, out_df_tail= order_id product quantity defects production_time 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110, total_time=datetime.timedelta(microseconds=1287)) ## print name and time ``` python print(15*'*' + ' ' + step_meta.step_name + ' ' + 15*'*') ``` *************** filter_products *************** ``` python print(f'Total Time: {step_meta.total_time}') print(f'Start: {step_meta.in_time}') print(f'End: {step_meta.out_time}') ``` Total Time: 0:00:00.001287 Start: 2026-01-21 15:36:59.108163 End: 2026-01-21 15:36:59.108396 ------------------------------------------------------------------------ source ### format_timedelta ``` python def format_timedelta( td ): ``` *Call self as a function.* ``` python print(f'Total Time: {format_timedelta(step_meta.total_time)}') ``` Total Time: 1.29 ms ------------------------------------------------------------------------ source ### print_step_name ``` python def print_step_name( step_meta:StepMeta ): ``` *Call self as a function.* ``` python print_step_name(step_meta) ``` *************** filter_products *************** ------------------------------------------------------------------------ source ### print_time ``` python def print_time( step_meta:StepMeta ): ``` *Call self as a function.* ``` python print_time(step_meta) ``` Total Time: 1.29 ms Start: 2026-01-21 15:36:59.108163 End: 2026-01-21 15:36:59.108396 ## print sample df ``` python pd.concat([step_meta.in_df_head, step_meta.in_df_sample, step_meta.in_df_tail]) ```

	order_id	product	quantity	defects	production_time
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
2	103	Widget A	75	3	150
3	104	Widget C	20	0	80
4	105	Widget B	45	2	110
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
2	103	Widget A	75	3	150
2	103	Widget A	75	3	150
3	104	Widget C	20	0	80
4	105	Widget B	45	2	110

``` python pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns) ```

	order_id	product	quantity	defects	production_time
0	NaN	NaN	NaN	NaN	NaN

``` python pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns).fillna('...') ```

	order_id	product	quantity	defects	production_time
0	...	...	...	...	...

``` python pd.DataFrame(np.nan, index=range(3), columns=meta_dict['in_df_head'].columns).fillna('.') ```

	order_id	product	quantity	defects	production_time
0	.	.	.	.	.
1	.	.	.	.	.
2	.	.	.	.	.

``` python pd.DataFrame(np.nan, index=range(1), columns=step_meta.in_df_head.columns).fillna(':') ```

	order_id	product	quantity	defects	production_time
0	:	:	:	:	:

------------------------------------------------------------------------ source ### fill_between_df_parts ``` python def fill_between_df_parts( df ): ``` *Call self as a function.* ``` python fill_between_df_parts(meta_dict['in_df_head']) ```

	order_id	product	quantity	defects	production_time
	:	:	:	:	:

``` python pd.concat( [step_meta.in_df_head, fill_between_df_parts(step_meta.in_df_head), step_meta.in_df_sample, fill_between_df_parts(step_meta.in_df_head), step_meta.in_df_tail] ) ```

	order_id	product	quantity	defects	production_time
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
2	103	Widget A	75	3	150
	:	:	:	:	:
3	104	Widget C	20	0	80
4	105	Widget B	45	2	110
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
2	103	Widget A	75	3	150
	:	:	:	:	:
2	103	Widget A	75	3	150
3	104	Widget C	20	0	80
4	105	Widget B	45	2	110

We want to build to functions: 1. a display function for within our wrapper which only uses the meta_dict 2. a display function that just takes in one df and gets the head, sample, tail from there In the future we might want to turn this into one function but for now we don’t want to store the whole df in meta_dict ------------------------------------------------------------------------ source ### print_sample_from_meta_dict ``` python def print_sample_from_meta_dict( step_meta:StepMeta, mode:str='in' ): ``` *Call self as a function.* ``` python print_sample_from_meta_dict(step_meta) ``` order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 ------------------------------------------------------------------------ source ### display_sample_from_df ``` python def display_sample_from_df( df ): ``` *Call self as a function.* ``` python display_sample_from_df(df) ```

	order_id	product	quantity	defects	production_time
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
2	103	Widget A	75	3	150
	:	:	:	:	:
2	103	Widget A	75	3	150
4	105	Widget B	45	2	110
0	101	Widget A	50	2	120
1	102	Widget B	30	1	95
3	104	Widget C	20	0	80
	:	:	:	:	:
2	103	Widget A	75	3	150
3	104	Widget C	20	0	80
4	105	Widget B	45	2	110

We will have to also consider what happen if we have to many columns. Our idea is to truncate them as well and just add a dummy columns with ‘…’ as sign that its truncated ## print shape and shape change ``` python print('(rows, columns) =', step_meta.in_df_shape) print(' | ') print(' V ') print('(rows, columns) =', step_meta.out_df_shape) ``` (rows, columns) = (5, 5) | V (rows, columns) = (4, 5) ``` python print('(rows, columns) =', step_meta.in_df_shape) print(' ↓ '*8) print('(rows, columns) =', step_meta.out_df_shape) ``` (rows, columns) = (5, 5) ↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓ (rows, columns) = (4, 5) ``` python in_str = f"{step_meta.in_df_shape[0]} rows, {step_meta.in_df_shape[1]} columns" print(in_str) print('↓ ↓ ↓'.center(len(in_str), ' ')) print(f"{step_meta.out_df_shape[0]} rows, {step_meta.out_df_shape[1]} columns") ``` 5 rows, 5 columns ↓ ↓ ↓ 4 rows, 5 columns is there a convenietn way to space out a number of strings over a specific length? ``` python s1, s2, s3 = "5 rows", "4 columns", "time: 1.2ms" total_width = 50 print(f"{s1:<{total_width//3}}{s2:<{total_width//3}}{s3}") ``` 5 rows 4 columns time: 1.2ms ``` python width = 20 print(s1.ljust(width) + s2.center(width) + s3.rjust(width)) ``` 5 rows 4 columns time: 1.2ms ------------------------------------------------------------------------ source ### space_strings ``` python def space_strings( strings, total_width ): ``` *Call self as a function.* ``` python space_strings(['|', '|', '|'], 20) ``` '| | |' ``` python len(space_strings(['|', '|', '|'], 20)) ``` 19 ``` python in_str = f"{step_meta.in_df_shape[0]} rows, {step_meta.in_df_shape[1]} columns" print(in_str) print(space_strings(['↓','↓','↓'], len(in_str))) print(f"{step_meta.out_df_shape[0]} rows, {step_meta.out_df_shape[1]} columns") ``` 5 rows, 5 columns ↓ ↓ ↓ 4 rows, 5 columns ``` python in_rows, in_cols = step_meta.in_df_shape out_rows, out_cols = step_meta.out_df_shape in_str = f"{in_rows} rows, {in_cols} columns" print(in_str) print(space_strings(['↓','↓','↓'], len(in_str))) print(f'{str(out_rows-in_rows).center(len(in_str)//2)}', f'{str(out_cols-in_cols).center(len(in_str)//2)}') print(space_strings(['↓','↓','↓'], len(in_str))) print(f"{out_rows} rows, {out_cols} columns") ``` 5 rows, 5 columns ↓ ↓ ↓ -1 0 ↓ ↓ ↓ 4 rows, 5 columns ``` python in_rows, in_cols = step_meta.in_df_shape out_rows, out_cols = step_meta.out_df_shape diff_rows, diff_cols = out_rows - in_rows, out_cols - in_cols diff_rows_str = f"{diff_rows:+d}" if diff_rows != 0 else "0" diff_cols_str = f"{diff_cols:+d}" if diff_cols != 0 else "0" # Right-align each column row_width = max(len(str(in_rows)), len(str(out_rows)), len(diff_rows_str)) col_width = max(len(str(in_cols)), len(str(out_cols)), len(diff_cols_str)) print(f"Input: {in_rows:>{row_width}} rows, {in_cols:>{col_width}} cols") print(f" {' '*row_width} ↓ {' '*col_width} ↓") print(f"Diff: {diff_rows_str:>{row_width}} rows, {diff_cols_str:>{col_width}} cols") print(f" {' '*row_width} ↓ {' '*col_width} ↓") print(f"Output: {out_rows:>{row_width}} rows, {out_cols:>{col_width}} cols") ``` Input: 5 rows, 5 cols ↓ ↓ Diff: -1 rows, 0 cols ↓ ↓ Output: 4 rows, 5 cols ``` python print(f""" Input: {in_rows:>{row_width}} rows, {in_cols:>{col_width}} cols {' '*row_width} ↓ {' '*col_width} ↓ Diff: {diff_rows_str:>{row_width}} rows, {diff_cols_str:>{col_width}} cols {' '*row_width} ↓ {' '*col_width} ↓ Output: {out_rows:>{row_width}} rows, {out_cols:>{col_width}} cols """) ``` Input: 5 rows, 5 cols ↓ ↓ Diff: -1 rows, 0 cols ↓ ↓ Output: 4 rows, 5 cols ------------------------------------------------------------------------ source ### print_shape_change ``` python def print_shape_change( step_meta:StepMeta ): ``` *Call self as a function.* ``` python print_shape_change(step_meta) ``` Input: 5 rows, 5 cols ↓ ↓ Diff: -1 rows, 0 cols ↓ ↓ Output: 4 rows, 5 cols ## print docstring ``` python @track def my_func(): """This is the docstring.""" pass print(my_func.__doc__) ``` This is the docstring. ------------------------------------------------------------------------ source ### print_step_description ``` python def print_step_description( step_meta:StepMeta ): ``` *Call self as a function.* ``` python print_step_description(step_meta) ``` '''Exclude products which are Widget C''' ## print all function ------------------------------------------------------------------------ source ### print_step_info ``` python def print_step_info( step_meta:StepMeta ): ``` *Call self as a function.* ``` python print_step_info(step_meta) ``` *************** filter_products *************** '''Exclude products which are Widget C''' Total Time: 1.29 ms Start: 2026-01-21 15:36:59.108163 End: 2026-01-21 15:36:59.108396 Input DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 Input: 5 rows, 5 cols ↓ ↓ Diff: -1 rows, 0 cols ↓ ↓ Output: 4 rows, 5 cols Output DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 # Testing the pipeline ------------------------------------------------------------------------ source ### pipeline ``` python def pipeline( df, steps, vrbs_default:bool=True ): ``` *Call self as a function.* ------------------------------------------------------------------------ source ### track ``` python def track( func ): ``` *Call self as a function.* ``` python @track def filter_products(df): '''Exclude products which are Widget C.''' return df[df["product"] != "Widget C"] ``` ``` python @track def aggregate_by_product(df): '''aggregate by product and sum columns "quantity", "defects", "production_time"''' return df.groupby(["product"])[["quantity", "defects", "production_time"]].sum() ``` ``` python steps = [ (filter_products, {'vrbs':True}), (aggregate_by_product, {}), ] ``` ``` python _df = pipeline(df, steps, vrbs_default=False) ``` *************** filter_products *************** '''Exclude products which are Widget C.''' Total Time: 574 µs Start: 2026-01-21 15:36:59.369533 End: 2026-01-21 15:36:59.370107 Input DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 2 103 Widget A 75 3 150 1 102 Widget B 30 1 95 0 101 Widget A 50 2 120 4 105 Widget B 45 2 110 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 Input: 5 rows, 5 cols ↓ ↓ Diff: -1 rows, 0 cols ↓ ↓ Output: 4 rows, 5 cols Output DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 2 103 Widget A 75 3 150 1 102 Widget B 30 1 95 0 101 Widget A 50 2 120 4 105 Widget B 45 2 110 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 ``` python _df = pipeline(df, steps, vrbs_default=True) ``` *************** filter_products *************** '''Exclude products which are Widget C.''' Total Time: 401 µs Start: 2026-01-21 15:36:59.380798 End: 2026-01-21 15:36:59.381199 Input DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 1 102 Widget B 30 1 95 4 105 Widget B 45 2 110 2 103 Widget A 75 3 150 0 101 Widget A 50 2 120 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 Input: 5 rows, 5 cols ↓ ↓ Diff: -1 rows, 0 cols ↓ ↓ Output: 4 rows, 5 cols Output DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 3 104 Widget C 20 0 80 1 102 Widget B 30 1 95 4 105 Widget B 45 2 110 2 103 Widget A 75 3 150 0 101 Widget A 50 2 120 : : : : : 2 103 Widget A 75 3 150 3 104 Widget C 20 0 80 4 105 Widget B 45 2 110 *************** aggregate_by_product *************** '''aggregate by product and sum columns "quantity", "defects", "production_time"''' Total Time: 800 µs Start: 2026-01-21 15:36:59.386272 End: 2026-01-21 15:36:59.387072 Input DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 : : : : : 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 Input: 4 rows, 5 cols ↓ ↓ Diff: -2 rows, -2 cols ↓ ↓ Output: 2 rows, 3 cols Output DataFrame: order_id product quantity defects production_time 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 : : : : : 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 0 101 Widget A 50 2 120 1 102 Widget B 30 1 95 : : : : : 1 102 Widget B 30 1 95 2 103 Widget A 75 3 150 4 105 Widget B 45 2 110 Summarize what we did so far ##### 🤖Reply🤖 🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠🧠 Here’s a summary of what you’ve built so far: **Goal:** A lightweight ETL pipeline framework for pandas that lets you chain transformation steps and inspect intermediate results — aimed at your work as Data Science Consultants in manufacturing. **What’s been built:** 1. **`pipeline(df, steps, vrbs_default)`** — Runs a list of `(function, kwargs)` steps sequentially on a DataFrame, with per-step control over verbosity via the `vrbs` flag. 2. **`@track` decorator** — Wraps each step function to optionally collect and display metadata (timing, shapes, data samples) using `@wraps` to preserve docstrings. 3. **[`StepMeta`](https://MIS-Analytics.github.io/mis_analytics/etl.html#stepmeta) dataclass** — A structured container for all the metadata captured per step: name, description, timing, input/output shapes, head/sample/tail of DataFrames. 4. **Display helper functions:** - [`print_step_name`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_step_name) — banner with step name - [`print_step_description`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_step_description) — docstring display - [`print_time`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_time) / [`format_timedelta`](https://MIS-Analytics.github.io/mis_analytics/etl.html#format_timedelta) — human-readable timing (µs, ms, s, min, h) - [`print_shape_change`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_shape_change) — nice visual showing row/col diffs with arrows - [`print_sample_from_meta_dict`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_sample_from_meta_dict) — head + sample + tail with `:` separator rows - [`display_sample_from_df`](https://MIS-Analytics.github.io/mis_analytics/etl.html#display_sample_from_df) — same but directly from a DataFrame - [`print_step_info`](https://MIS-Analytics.github.io/mis_analytics/etl.html#print_step_info) — combines all the above into one output - [`fill_between_df_parts`](https://MIS-Analytics.github.io/mis_analytics/etl.html#fill_between_df_parts) / [`space_strings`](https://MIS-Analytics.github.io/mis_analytics/etl.html#space_strings) — formatting utilities **Still on the to-do list** (from your reflection note): - Column truncation for wide DataFrames - Side-by-side layout (metadata next to DataFrames) - `df.info()` / `df.describe()` integration - Column diff (added/removed columns) - Source code display of step functions - HTML/rich display - Assertion/sanity-check wrapper - Pipeline report generation (e.g. Excel)