DeSAPher

url = 'https://www.sapdatasheet.org/abap/tabl/mara.html'
The Problem
Working with SAP systems often means dealing with hundreds of tables with cryptic names like MARA, VBAK, or MAKT. Understanding what these tables contain and how they relate to each other traditionally requires extensive documentation lookup or SAP expertise.
Our Solution
We created a tool that automatically fetches and structures SAP table definitions from online documentation. This gives us instant access to:
- Field names and their meanings
- Data types and lengths
- Relationships between tables (via check tables)
- Comprehensive descriptions of each field
- Table names and descriptions
Benefits
- Time Saving: No more manual documentation lookups
- Better Understanding: Clear visibility of table structures and meanings
- Easier Data Analysis: Quick reference for field names and their purposes
- Knowledge Sharing: Makes SAP data structures more accessible to team members
Future Directions
1. Interactive Interface
- Build a searchable interface for quick table/field lookups
- Implement full-text search across descriptions
- Add semantic search using LLMs to find relevant tables by describing needs in plain English
2. Visual Data Model
- Create interactive graph visualizations showing table relationships
- Highlight primary/foreign key connections
- Enable visual exploration of the SAP data model
3. AI-Powered Data Assistant
- Use LLMs with our structured documentation as context
- Generate SQL queries from natural language questions
- Suggest relevant tables for specific business questions
- Provide data model explanations in plain language
Our plan
As data scientists working with ERP systems like SAP, we often encounter large datasets with hundreds of cryptically named columns. To better understand these data sources, we aim to web scrape SAP’s data dictionary documentation to create programmatic access to table definitions, column descriptions, and data types.
Steps
- Scrape column description for one table (for example, MARA)
- Bring it into an easily searchable format
- Apply process to all tables
Scrape column description for one table (for example, MARA)
response = httpx.get(url)
response

(In this run the request raised httpx.ConnectTimeout: "_ssl.c:993: The handshake operation timed out". sapdatasheet.org responds slowly at times, so we keep a saved HTML snippet below to develop the parsing against.)
sample_text = '''
<table class="table table-sm">
<caption class="text-right sapds-alv">
<a href="/download/abap-tabl-component.php?format=csv&tabname=MARA" title="Download components as CSV file.
The downloaded file contains more columns than displayed here." target="_blank">
<img src="/abap/icon/s_wdvtxe.gif"></a>
<a href="/download/abap-tabl-component.php?format=xls&tabname=MARA" title="Download components as Excel 97-2003 Worksheet (.xls) file.
The downloaded file contains more columns than displayed here." target="_blank">
<img src="/abap/icon/s_x__xls.gif"></a>
<a href="/download/abap-tabl-component.php?format=xlsx&tabname=MARA" title="Download components as Excel Open XML Format Spreadsheet (.xlsx) file.
The downloaded file contains more columns than displayed here." target="_blank">
<img src="/abap/icon/s_lisvie.gif"></a>
</caption>
<thead>
<tr>
<th class="sapds-alv"> <img src="/abap/icon/s_b_pvre.gif"> </th>
<th class="sapds-alv"> Field </th>
<th class="sapds-alv"> Key </th>
<th class="sapds-alv"> Data Element</th>
<th class="sapds-alv"> Domain</th>
<th class="sapds-alv"> Data<br>Type</th>
<th class="sapds-alv"> Length</th>
<th class="sapds-alv"> Decimal<br>Places</th>
<th class="sapds-alv"> Short Description</th>
<th class="sapds-alv"> Check<br>table</th>
</tr>
</thead>
<tbody>
<tr>
<td class="sapds-alv"> <a id="FIELD_MANDT"></a> 1 </td>
<td class="sapds-alv"> <img src="/abap/icon/s_struct.gif"> <a href="/abap/tabl/mara-mandt.html" title="MANDT" target="_blank">MANDT</a> </td>
<td class="sapds-alv text-center"> <input type="checkbox" name="field_MANDT" disabled="disabled" checked="checked"> </td>
<td class="sapds-alv"> <a href="/abap/dtel/mandt.html" title="Client" target="_blank">MANDT</a> </td>
<td class="sapds-alv"> <a href="/abap/doma/mandt.html" title="Client (key field in client-specific tables)" target="_blank">MANDT</a> </td>
<td class="sapds-alv"> <a href="/abap/doma/datatype.html#values" title="Dictionary Data Type" target="_blank">CLNT</a> </td>
<td class="sapds-alv text-right"> 3 </td>
<td class="sapds-alv text-right"> 0 </td>
<td class="sapds-alv"> Client </td>
<td class="sapds-alv"> <a href="/abap/tabl/t000.html" title="Clients" target="_blank">T000</a> </td>
</tr>
<tr>
<td class="sapds-alv"> <a id="FIELD_FASHGRD"></a> 239 </td>
<td class="sapds-alv"> <img src="/abap/icon/s_struct.gif"> <a href="/abap/tabl/mara-fashgrd.html" title="FASHGRD" target="_blank">FASHGRD</a> </td>
<td class="sapds-alv text-center"> <input type="checkbox" name="field_FASHGRD" disabled="disabled"> </td>
<td class="sapds-alv"> <a href="/abap/dtel/fashgrd.html" title="Fashion Grade" target="_blank">FASHGRD</a> </td>
<td class="sapds-alv"> <a href="/abap/doma/fashgrd.html" title="Fashion Grade" target="_blank">FASHGRD</a> </td>
<td class="sapds-alv"> <a href="/abap/doma/datatype.html#values" title="Dictionary Data Type" target="_blank">CHAR</a> </td>
<td class="sapds-alv text-right"> 4 </td>
<td class="sapds-alv text-right"> 0 </td>
<td class="sapds-alv"> Fashion Grade </td>
<td class="sapds-alv"> <a href="/abap/tabl/t6wfg.html" title="Degree of Fashion" target="_blank">T6WFG</a> </td>
</tr>
</tbody>
</table>'''

from bs4 import BeautifulSoup

# Parse the saved snippet, since the live request timed out above
soup = BeautifulSoup(sample_text, 'lxml')
table = soup.find('table', class_='table table-sm')
Bring it into an easily searchable format
headers = [th.text.strip() for th in table.find('thead').find_all('th')]
headers
trs = table.find('tbody').find_all('tr')
trs[0]
trs[1]
[td.text.strip() for td in trs[1].find_all('td')]
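The same parse also gives us the table relationships mentioned under Benefits: the last cell of each row links to the field's check table. A minimal sketch against a tiny snippet shaped like the sapdatasheet.org rows above (the snippet itself is simplified, not the real page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one sapdatasheet.org component row
html = '''<table><tbody><tr>
<td class="sapds-alv"> 1 </td>
<td class="sapds-alv"><a href="/abap/tabl/mara-mandt.html">MANDT</a></td>
<td class="sapds-alv"><a href="/abap/tabl/t000.html">T000</a></td>
</tr></tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')
# Pair each field name (second cell) with its check table (last cell)
relations = [(tr.find_all('td')[1].text.strip(), tr.find_all('td')[-1].text.strip())
             for tr in soup.find('tbody').find_all('tr')]
```

For MARA's MANDT row this yields `('MANDT', 'T000')` — exactly the foreign-key-style edge a later graph visualization would need.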
rows = []
for tr in table.find('tbody').find_all('tr'):
    row = [td.text.strip() for td in tr.find_all('td')]
    rows.append(row)
rows[-1]
import pandas as pd

df = pd.DataFrame(rows, columns=headers)
df.head()
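With the structure in a DataFrame, field lookups become one-liners. A sketch using the two sample rows from the snippet above (header names simplified from the scraped ones):

```python
import pandas as pd

# The two component rows from the MARA snippet above, simplified headers
headers = ['#', 'Field', 'Key', 'Data Element', 'Domain', 'Data Type',
           'Length', 'Decimal Places', 'Short Description', 'Check table']
rows = [
    ['1', 'MANDT', 'X', 'MANDT', 'MANDT', 'CLNT', '3', '0', 'Client', 'T000'],
    ['239', 'FASHGRD', '', 'FASHGRD', 'FASHGRD', 'CHAR', '4', '0', 'Fashion Grade', 'T6WFG'],
]
df = pd.DataFrame(rows, columns=headers)

# Case-insensitive search over the field descriptions
matches = df[df['Short Description'].str.contains('fashion', case=False)]
```

This is the "easily searchable format" the steps call for: `str.contains` over the description column already answers "which field holds X?" without touching SAP documentation.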
def get_sap_table_structure(url):
    '''Scrapes SAP table structure from sapdatasheet.org and returns a pandas DataFrame'''
    import httpx
    from bs4 import BeautifulSoup
    import pandas as pd

    response = httpx.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table', class_='table table-sm')
    headers = [th.text.strip() for th in table.find('thead').find_all('th')]
    rows = [[td.text.strip() for td in tr.find_all('td')] for tr in table.find('tbody').find_all('tr')]
    return pd.DataFrame(rows, columns=headers)
df = get_sap_table_structure(url)
df.head()

(This run raised httpx.ReadTimeout: "The read operation timed out" — the site was unresponsive at the time.)
url_2 = 'https://www.sapdatasheet.org/abap/tabl/makt.html'
df_2 = get_sap_table_structure(url_2)
df_2

(Again httpx.ReadTimeout: "The read operation timed out".)
Apply the process to all tables
get_sap_table_url

def get_sap_table_url(table_name):
    """Constructs the sapdatasheet.org URL from an SAP table name."""

get_sap_table_url('VBAK')
'https://www.sapdatasheet.org/abap/tabl/vbak.html'
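The body of `get_sap_table_url` isn't shown above; a minimal sketch consistent with the `VBAK` example (the site lowercases table names in its `/abap/tabl/` paths) could look like:

```python
def get_sap_table_url(table_name):
    """Build the sapdatasheet.org documentation URL for an SAP table.

    Table pages live under /abap/tabl/ with the name lowercased,
    e.g. VBAK -> https://www.sapdatasheet.org/abap/tabl/vbak.html
    """
    return f"https://www.sapdatasheet.org/abap/tabl/{table_name.strip().lower()}.html"
```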
get_sap_tables_structure

def get_sap_tables_structure(tables):
    """Gets the structure for multiple SAP tables and combines them into one
    DataFrame with a column indicating the source table."""
get_sap_table_structure

def get_sap_table_structure(url):
    """Scrapes an SAP table structure from sapdatasheet.org and returns a pandas
    DataFrame. Returns None if the table is not found or another error occurs."""
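The scraping body isn't shown either. One dependency-free way to pull table rows out of the fetched HTML is a small `html.parser` pass; which of the page's several tables holds the field list is an assumption the real function would need to resolve (e.g. by checking for a `Field` header):

```python
from html.parser import HTMLParser

class _TableRows(HTMLParser):
    """Collects the rows of <table> elements as lists of cell strings."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None
    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._cell = []
    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)
    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._row is not None and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def parse_structure_rows(html):
    """Return header + data rows, ready for pd.DataFrame(rows[1:], columns=rows[0])."""
    parser = _TableRows()
    parser.feed(html)
    return parser.rows
```

`pandas.read_html` would do the same job in one call, at the cost of an extra parser dependency (lxml or html5lib).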
tables = ['MARC', 'MARD', 'MARM', 'MBEW']
df = get_sap_tables_structure(tables)
df.sample(10)
Error processing https://www.sapdatasheet.org/abap/tabl/marc.html: The read operation timed out
Error processing https://www.sapdatasheet.org/abap/tabl/mard.html: The read operation timed out
Error processing https://www.sapdatasheet.org/abap/tabl/marm.html: The read operation timed out
Error processing https://www.sapdatasheet.org/abap/tabl/mbew.html: The read operation timed out
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[32], line 2
      1 df = get_sap_tables_structure(tables)
----> 2 df.sample(10)

AttributeError: 'NoneType' object has no attribute 'sample'
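The timeouts above are transient network failures, so a small retry wrapper around the fetch can help; a sketch (the `with_retries` helper and its parameters are not from the notebook):

```python
import time

def with_retries(fetch, retries=3, backoff=2.0):
    """Call fetch() up to `retries` times, sleeping backoff * attempt seconds
    between failed attempts; re-raise the last error if all attempts fail."""
    last_exc = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            last_exc = exc
            time.sleep(backoff * attempt)
    raise last_exc
```

Inside `get_sap_table_structure` this could wrap the HTTP call, e.g. `with_retries(lambda: httpx.get(url, timeout=30.0))`.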
Let’s also grab the table description for each table.
url
'https://www.sapdatasheet.org/abap/tabl/mara.html'
get_sap_table_description

def get_sap_table_description(url):
    """Scrapes the SAP table description from sapdatasheet.org.
    Returns None if not found or an error occurs."""
get_sap_table_description(url)
Error getting description from https://www.sapdatasheet.org/abap/tabl/mara.html: Server error '500 Internal Server Error' for url 'https://www.sapdatasheet.org/abap/tabl/mara.html'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
from pathlib import Path

CACHE_PATH = Path('sap_sheet.parquet')

def get_sap_tables_structure(tables, cache_path=CACHE_PATH, force_refresh=False):
    """
    Gets structure for multiple SAP tables, caching results in a single parquet file.
    Only scrapes tables not already in the cache. force_refresh=True re-scrapes everything.
    """
    if force_refresh and cache_path.exists():
        cache_path.unlink()
    # Load existing cache
    if cache_path.exists():
        cached = pd.read_parquet(cache_path)
        already_cached = set(cached['Table'].unique())
    else:
        cached = None
        already_cached = set()
    # Scrape only missing tables
    missing = [t for t in tables if t.upper() not in already_cached]
    new_dfs = []
    for table in missing:
        url = get_sap_table_url(table)
        df = get_sap_table_structure(url)
        if df is not None:
            df['Table'] = table.upper()
            df['Table Description'] = get_sap_table_description(url)
            new_dfs.append(df)
    # Combine and save
    all_dfs = [d for d in [cached] + new_dfs if d is not None]
    if not all_dfs:
        return None
    result = pd.concat(all_dfs, ignore_index=True)
    if missing:  # only write if something new was added
        result.to_parquet(cache_path, index=False)
    # Return only requested tables
    return result[result['Table'].isin([t.upper() for t in tables])]

sap_sheet = get_sap_tables_structure(tables)
sap_sheet.sample(10)
Error processing https://www.sapdatasheet.org/abap/tabl/marc.html: The read operation timed out
Error processing https://www.sapdatasheet.org/abap/tabl/mard.html: The read operation timed out
Error processing https://www.sapdatasheet.org/abap/tabl/marm.html: _ssl.c:993: The handshake operation timed out
Error processing https://www.sapdatasheet.org/abap/tabl/mbew.html: _ssl.c:993: The handshake operation timed out
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[39], line 2
      1 sap_sheet = get_sap_tables_structure(tables)
----> 2 sap_sheet.sample(10)

AttributeError: 'NoneType' object has no attribute 'sample'
sap_sheet['Table Description'].unique()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[40], line 1
----> 1 sap_sheet['Table Description'].unique()

TypeError: 'NoneType' object is not subscriptable
data = {
'MANDT': [100, 100, 100],
'MATNR': ['MAT001', 'MAT002', 'MAT003'],
'WERKS': ['1000', '1000', '2000'],
'PSTAT': ['KVEB', 'KVEB', 'KVEB'],
'LVORM': [None, None, None],
'BWTTY': [None, None, None],
'MMSTA': ['1', '1', '9'],
'MMSTD': [20200101, 20200115, 20210201],
'MAABC': ['A', 'B', None],
'KZKRI': [None, None, None],
'EKGRP': ['100', '100', '200'],
'DISPO': ['001', '001', '002'],
'BESKZ': ['E', 'E', 'X'],
'SOBSL': [None, None, None],
'EISBE': [10.0, 20.0, 15.0],
'MABST': [100.0, 200.0, 150.0],
'ALTSL': [None, None, None],
'KZAUS': [None, None, None],
'AUSDT': [0, 0, 0],
'NFMAT': [None, None, None],
'KZBED': [None, 'T', None],
'RGEKZ': [None, None, None],
'FEVOR': ['G01', None, 'G02'],
'BASMG': [1.0, 1.0, 1.0],
'STAWN': ['84141025', '84141025', '84148073'],
'HERKL': ['DE', 'DE', 'US'],
'HERKR': ['05', '05', '16'],
'EXPME': ['ST', 'ST', 'ST'],
'MTVER': ['1', '1', '1'],
'PRCTR': ['PC100', 'PC100', 'PC200'],
'VERKZ': [None, 'X', None],
'STLAL': [None, None, None],
'STLAN': [None, None, None],
'PLNNR': [None, None, None],
'APLAL': [None, None, None],
'FRTME': [None, None, None],
'LGPRO': ['1001', '1001', '2001'],
'DISGR': ['2000', '2000', '6000'],
'SERNP': [None, None, None],
'PREFE': [None, None, None],
'PRENE': [None, None, None],
'SCHGT': [None, None, None],
'MCRUE': ['X', 'X', 'X'],
'LFGJA': [2024, 2024, 2025],
'EISLO': [0.0, 0.0, 0.0],
'TARGET_STOCK': [50.0, 100.0, 75.0],
'SCM_SCOST': [0.0, 0.0, 0.0],
'SCM_LSUOM': [None, None, None],
'SCM_STRA1': [None, None, None],
}
df_marc = pd.DataFrame(data)
df_marc

|   | MANDT | MATNR | WERKS | PSTAT | LVORM | BWTTY | MMSTA | MMSTD | MAABC | KZKRI | ... | PREFE | PRENE | SCHGT | MCRUE | LFGJA | EISLO | TARGET_STOCK | SCM_SCOST | SCM_LSUOM | SCM_STRA1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | MAT001 | 1000 | KVEB | None | None | 1 | 20200101 | A | None | ... | None | None | None | X | 2024 | 0.0 | 50.0 | 0.0 | None | None |
| 1 | 100 | MAT002 | 1000 | KVEB | None | None | 1 | 20200115 | B | None | ... | None | None | None | X | 2024 | 0.0 | 100.0 | 0.0 | None | None |
| 2 | 100 | MAT003 | 2000 | KVEB | None | None | 9 | 20210201 | None | None | ... | None | None | None | X | 2025 | 0.0 | 75.0 | 0.0 | None | None |
3 rows × 49 columns
convert_sap_types

def convert_sap_types(df: DataFrame, sap_sheet: DataFrame) -> DataFrame:
    """Converts the columns of the 'df' DataFrame to the correct data types
    based on 'sap_sheet'.

    :param df: Pandas DataFrame containing SAP table data (e.g., "df").
    :param sap_sheet: Pandas DataFrame containing SAP metadata with column data types.
    :return: Converted Pandas DataFrame.
    """
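The traceback further down shows that the function builds a `Field` → `DataType` mapping from `sap_sheet`; a sketch along those lines (the SAP-to-pandas type table is an assumption, covering only common dictionary types):

```python
import pandas as pd

# Rough mapping from SAP dictionary types to pandas dtypes. Which type names
# actually appear in the scraped sheet is an assumption -- extend as needed.
SAP_TO_PANDAS = {
    'CHAR': 'string', 'NUMC': 'string', 'CUKY': 'string', 'UNIT': 'string',
    'LANG': 'string', 'CLNT': 'string',
    'INT4': 'Int64',
    'DEC': 'float64', 'QUAN': 'float64', 'CURR': 'float64', 'FLTP': 'float64',
    'DATS': 'string',  # dates arrive as YYYYMMDD; parse with to_datetime separately
}

def convert_sap_types(df, sap_sheet):
    """Cast df's columns according to the Field -> DataType mapping in sap_sheet."""
    sap_sheet = sap_sheet.rename(columns=lambda x: x.strip())
    column_type_mapping = sap_sheet.set_index('Field')['DataType'].to_dict()
    df = df.copy()
    for column, sap_type in column_type_mapping.items():
        if column in df.columns and sap_type in SAP_TO_PANDAS:
            df[column] = df[column].astype(SAP_TO_PANDAS[sap_type])
    return df
```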
df_marc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 49 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MANDT 3 non-null int64
1 MATNR 3 non-null object
2 WERKS 3 non-null object
3 PSTAT 3 non-null object
4 LVORM 0 non-null object
5 BWTTY 0 non-null object
6 MMSTA 3 non-null object
7 MMSTD 3 non-null int64
8 MAABC 2 non-null object
9 KZKRI 0 non-null object
10 EKGRP 3 non-null object
11 DISPO 3 non-null object
12 BESKZ 3 non-null object
13 SOBSL 0 non-null object
14 EISBE 3 non-null float64
15 MABST 3 non-null float64
16 ALTSL 0 non-null object
17 KZAUS 0 non-null object
18 AUSDT 3 non-null int64
19 NFMAT 0 non-null object
20 KZBED 1 non-null object
21 RGEKZ 0 non-null object
22 FEVOR 2 non-null object
23 BASMG 3 non-null float64
24 STAWN 3 non-null object
25 HERKL 3 non-null object
26 HERKR 3 non-null object
27 EXPME 3 non-null object
28 MTVER 3 non-null object
29 PRCTR 3 non-null object
30 VERKZ 1 non-null object
31 STLAL 0 non-null object
32 STLAN 0 non-null object
33 PLNNR 0 non-null object
34 APLAL 0 non-null object
35 FRTME 0 non-null object
36 LGPRO 3 non-null object
37 DISGR 3 non-null object
38 SERNP 0 non-null object
39 PREFE 0 non-null object
40 PRENE 0 non-null object
41 SCHGT 0 non-null object
42 MCRUE 3 non-null object
43 LFGJA 3 non-null int64
44 EISLO 3 non-null float64
45 TARGET_STOCK 3 non-null float64
46 SCM_SCOST 3 non-null float64
47 SCM_LSUOM 0 non-null object
48 SCM_STRA1 0 non-null object
dtypes: float64(6), int64(4), object(39)
memory usage: 1.3+ KB
df_marc_converted = convert_sap_types(df_marc, sap_sheet)
df_marc_converted.head()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[45], line 1
----> 1 df_marc_converted = convert_sap_types(df_marc, sap_sheet)
      2 df_marc_converted.head()

Cell In[43], line 9, in convert_sap_types(df, sap_sheet)
      8 """
----> 9 sap_sheet = sap_sheet.rename(columns=lambda x: x.strip())
     10 column_type_mapping = sap_sheet.set_index("Field")["DataType"].to_dict()

AttributeError: 'NoneType' object has no attribute 'rename'
df_marc_converted.info()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[46], line 1
----> 1 df_marc_converted.info()

NameError: name 'df_marc_converted' is not defined
rename_sap_columns

def rename_sap_columns(df: DataFrame, sap_sheet: DataFrame) -> DataFrame:
    """Renames the columns in the 'df' DataFrame using the 'Short Description'
    from 'sap_sheet'. If a column is not found in 'sap_sheet', it remains
    unchanged.

    :param df: Pandas DataFrame containing SAP table data (e.g., "df").
    :param sap_sheet: Pandas DataFrame containing SAP metadata with column names and short descriptions.
    :return: DataFrame with renamed columns.
    """
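A sketch matching that description, reusing the `Field` / `Short Description` columns of the scraped sheet (`pandas.rename` already leaves unmapped columns untouched, which gives the fallback behavior for free):

```python
import pandas as pd

def rename_sap_columns(df, sap_sheet):
    """Rename df's columns via the Field -> Short Description mapping in
    sap_sheet; columns not found in sap_sheet keep their original name."""
    sap_sheet = sap_sheet.rename(columns=lambda x: x.strip())
    mapping = sap_sheet.set_index('Field')['Short Description'].to_dict()
    return df.rename(columns=mapping)
```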
df_marc_converted = rename_sap_columns(df_marc_converted, sap_sheet)
df_marc_converted.info()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[48], line 1
----> 1 df_marc_converted = rename_sap_columns(df_marc_converted, sap_sheet)
      2 df_marc_converted.info()

NameError: name 'df_marc_converted' is not defined
df_marc_converted = clean_col_names(df_marc_converted)
df_marc_converted.info()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[49], line 1
----> 1 df_marc_converted = clean_col_names(df_marc_converted)
      2 df_marc_converted.info()

NameError: name 'df_marc_converted' is not defined
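`clean_col_names` is used here but never shown; a guess at typical behavior (strip, lowercase, underscores) — purely an assumption:

```python
import re
import pandas as pd

def clean_col_names(df):
    """Normalize column names: strip, lowercase, collapse runs of
    non-alphanumerics into single underscores.
    (Pure guess at what the notebook's helper does.)"""
    df = df.copy()
    df.columns = [re.sub(r'[^0-9a-z]+', '_', str(c).strip().lower()).strip('_')
                  for c in df.columns]
    return df
```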
Next steps
We should write some kind of setup functionality to scrape the sap_sheet once and then keep it as a lookup reference.
We have duplicates in the `Field` column, because the same column appears in multiple tables. We also have duplicates in the `Short Description` column, sometimes even when the `Field` value differs.

We could combine columns, for example `Field`, `Table`, and `Short Description`, which would make a unique key. But currently we only match on the `Field` column, so we cannot decide which description to take. We either have to set a sensible default, or we need a way to know which table a column originated from. The second option only seems viable if we process columns from a single SAP table. The sensible default could be based on the number of times a table is used in other tables (the more usages, the better).
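The "sensible default" idea could be sketched like this, using how many fields a table contributes to the sheet as a stand-in for "how often the table is used" (a true usage count across tables would need extra cross-reference data):

```python
import pandas as pd

def dedupe_sap_sheet(sap_sheet):
    """Keep one row per Field, preferring the table that contributes the most
    fields overall -- a proxy for the 'most used table wins' default."""
    counts = sap_sheet['Table'].value_counts()
    ranked = sap_sheet.assign(_rank=sap_sheet['Table'].map(counts))
    # Stable sort so ties keep their original order
    ranked = ranked.sort_values('_rank', ascending=False, kind='mergesort')
    return ranked.drop_duplicates(subset='Field', keep='first').drop(columns='_rank')
```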
OK, what's the best way to cache the data so we don't re-scrape it on every run?
🤖Reply🤖
A few good options, from simplest to most robust:
- Save to Parquet/CSV — scrape once, save to file, load from file on subsequent runs. Parquet is nice because it preserves dtypes.
- Pickle the DataFrame — similar but keeps all Python types exactly.
- SQLite — good if you want to query the data without loading it all into memory.
For your use case, I’d recommend Parquet — it’s compact, fast, preserves types well, and you can easily inspect it with pandas. The pattern would be: check if the file exists → if yes, load it; if not, scrape and save.
Something like:
from pathlib import Path

CACHE_PATH = Path('sap_sheet.parquet')

def get_or_scrape_sap_tables(tables, cache_path=CACHE_PATH):
    if cache_path.exists():
        return pd.read_parquet(cache_path)
    df = get_sap_tables_structure(tables)
    df.to_parquet(cache_path, index=False)
    return df

You could also add a force_refresh=False parameter to re-scrape when needed. Want to go with this approach, or would you prefer something different?
what if it is not the same tables every time?