Feat: Handle get_lot when RECAPITULATIF is nan

fix: raise a warning when a page is not recognized
Feat: publish tag on Matrix
5 changed files with 33 additions and 13 deletions
--- a/.drone.yml
+++ b/.drone.yml
@@ -12,7 +12,7 @@ steps:
    image: python:3.11
    commands:
      - echo ${DRONE_TAG}
-      - sed -i "s/VERSION_PLACEHOLDER/${DRONE_TAG}/g" pyproject.toml
+      - sed -i 's/version = "[^"]*"/version = "${DRONE_TAG}"/g' pyproject.toml
      - curl -sSL https://install.python-poetry.org | python3 -
      - export PATH="/root/.local/bin:$PATH"
      - poetry --version
@@ -22,6 +22,19 @@ steps:
      PYPI_TOKEN:
        from_secret: pypi_token
+  - name: Notify on matrix
+    image: plugins/matrix
+    environment:
+      MATRIX_ROOMID:
+        from_secret: MATRIX_ROOMID
+      MATRIX_ACCESSTOKEN:
+        from_secret: MATRIX_ACCESSTOKEN
+      MATRIX_USERID:
+        from_secret: MATRIX_USERID
+    settings:
+      homeserver: https://matrix.poneyworld.net
+      template: "Une nouvelle version (${DRONE_TAG}) de pdf-oralia est publiée!"
    when:
      event:
        include:
--- a/pdf_oralia/extract.py
+++ b/pdf_oralia/extract.py
@@ -45,7 +45,7 @@ def from_pdf(pdf):
    charge_tables = []
    patrimoie_tables = []
-    for page in pdf.pages:
+    for page_number, page in enumerate(pdf.pages):
        page_text = page.extract_text()
        date = extract_date(page_text)
        additionnal_fields = {
@@ -76,7 +76,7 @@ def from_pdf(pdf):
            pass
        else:
-            raise ValueError("Page non reconnu")
+            logging.warning(f"Page {page_number+1} non reconnu. Page ignorée.")
    df_charge = charge.table2df(recapitulatif_tables + charge_tables)
    df_loc = locataire.table2df(loc_tables)
--- a/pdf_oralia/pages/charge.py
+++ b/pdf_oralia/pages/charge.py
@@ -17,6 +17,7 @@ DF_TYPES = {
    "annee": str,
    "lot": str,
 }
+DEFAULT_FOURNISSEUR = "ROSIER MODICA MOTTEROZ SA"
 def is_it(page_text):
@@ -31,7 +32,10 @@ def is_it(page_text):
 def get_lot(txt):
    """Return lot number from "RECAPITULATIF DES OPERATIONS" """
    regex = r"[BSM](\d+)(?=\s*-)"
    try:
        result = re.findall(regex, txt)
    except TypeError:
        return "*"
    if result:
        return "{:02d}".format(int(result[0]))
    return "*"
@@ -62,8 +66,8 @@ def extract(table, additionnal_fields: dict = {}):
            for k, v in additionnal_fields.items():
                r[k] = v
-            if "honoraire" in row[RECAPITULATIF_DES_OPERATIONS]:
+            if "honoraire" in row[RECAPITULATIF_DES_OPERATIONS].lower():
-                r["Fournisseur"] = "IMI GERANCE"
+                r["Fournisseur"] = DEFAULT_FOURNISSEUR
            extracted.append(r)
@@ -83,6 +87,5 @@ def table2df(tables):
    df = pd.concat(dfs)
    df["immeuble"] = df["immeuble"].apply(lambda x: x[0].capitalize())
    print(df.columns)
    df["lot"] = df["RECAPITULATIF DES OPERATIONS"].apply(get_lot)
-    return df.astype(DF_TYPES, errors="ignore")
+    return df.astype(DF_TYPES)
--- a/pdf_oralia/pages/locataire.py
+++ b/pdf_oralia/pages/locataire.py
@@ -1,3 +1,4 @@
+import numpy as np
 import pandas as pd
 DF_TYPES = {
@@ -33,7 +34,7 @@ def is_drop(row):
 def extract(table, additionnal_fields: dict = {}):
-    """Turn table to dictionary with additionnal fields"""
+    """Turn table to dictionary with additional fields"""
    extracted = []
    header = table[0]
    for row in table[1:]:
@@ -138,8 +139,6 @@ def join_row(table):
                    }
                )
                joined.append(row)
            else:
                pass
    return joined
@@ -159,4 +158,9 @@ def table2df(tables):
    df["immeuble"] = df["immeuble"].apply(lambda x: x[0].capitalize())
    df["Type"] = df["Type"].apply(clean_type)
-    return df.astype(DF_TYPES, errors="ignore")
+    numeric_cols = [k for k, v in DF_TYPES.items() if v == float]
+    df[numeric_cols] = df[numeric_cols].replace("", np.nan)
+    df = df.drop(df[(df["Locataires"] == "") & (df["Période"] == "")].index)
+    return df.astype(DF_TYPES)
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "pdf-oralia"
-version = "1.dev"
+version = "dev"
 description = ""
 authors = ["Bertrand Benjamin <benjamin.bertrand@opytex.org>"]
 readme = "README.md"
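The change to from_pdf in pdf_oralia/extract.py replaces the ValueError with a logged warning, so a single unrecognized page no longer aborts the whole extraction. A minimal sketch of that pattern follows. The classify_pages helper and the recognizers mapping are hypothetical names used only for illustration (they are not in the repository), and the warning text is an English rendering of the French message in the diff.

```python
import logging


def classify_pages(page_texts, recognizers):
    """Return (page_number, kind) for recognized pages; log and skip the rest.

    `recognizers` is a hypothetical mapping from a page kind to a predicate
    taking the page text, standing in for the per-module is_it() functions.
    """
    recognized = []
    for page_number, page_text in enumerate(page_texts):
        for kind, is_it in recognizers.items():
            if is_it(page_text):
                # Report 1-based page numbers, as the warning below does.
                recognized.append((page_number + 1, kind))
                break
        else:
            # No recognizer matched: warn and continue instead of raising.
            logging.warning(f"Page {page_number + 1} not recognized. Page skipped.")
    return recognized
```

With this shape, the caller still gets every page that could be parsed, and the log shows which page numbers were skipped.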
Author	SHA1	Message	Date
Bertrand Benjamin	0040dccd9a	Feat: Handle get_lot when RECAPITULATIF is nan	2023-09-20 09:28:57 +02:00
Bertrand Benjamin	b0333cddd8	fix: raise a warning when a page is not recognized	2023-09-20 09:27:40 +02:00
Bertrand Benjamin	406b89fea1	Feat: publish tag on Matrix	2023-07-08 09:08:09 +02:00
Bertrand Benjamin	812d392720	feat: publish to matrix	2023-07-08 09:06:25 +02:00
Bertrand Benjamin	6b77980e6c	Fix 7: change the default FOURNISSEUR	2023-07-07 21:26:00 +02:00
Bertrand Benjamin	90c2d3689b	Fix I4: drop row with "" on locataire ans Période	2023-07-05 18:13:41 +02:00
Bertrand Benjamin	f9be31c090	Fix #3 : replace empty string with np.nan	2023-07-05 17:49:25 +02:00
Bertrand Benjamin	2761c3ed7b	Feat: improve version name for drone	2023-06-30 13:51:04 +02:00
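
The get_lot change in pdf_oralia/pages/charge.py relies on re.findall raising TypeError when it receives a non-string, which is the case when the RECAPITULATIF cell is NaN (a float). Below is a self-contained sketch of the function as it reads after this change. The sample strings in the comments are invented for illustration and do not come from a real statement.

```python
import re


def get_lot(txt):
    """Return lot number from "RECAPITULATIF DES OPERATIONS" """
    regex = r"[BSM](\d+)(?=\s*-)"
    try:
        result = re.findall(regex, txt)
    except TypeError:
        # txt is not a string, e.g. float("nan") from an empty cell.
        return "*"
    if result:
        # Zero-pad the first match to two digits.
        return "{:02d}".format(int(result[0]))
    return "*"


# Hypothetical inputs:
# get_lot("B3 - HONORAIRES")  -> "03"
# get_lot(float("nan"))       -> "*"
# get_lot("no lot here")      -> "*"
```

Returning "*" in both the NaN case and the no-match case keeps the "lot" column a plain string, which matches the str entry for "lot" in DF_TYPES.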