This commit is contained in:
Jun-te Kim 2026-02-27 12:50:16 +00:00
parent f9922113a5
commit 7eef0aca44

View file

@ -0,0 +1,325 @@
# HubSpot Client Scripts - Onboarding Guide
## Overview
The scripts in this directory form a **3-stage ETL pipeline** for syncing HubSpot data (companies and deals) into the local database:
1. **Stage 0 (Seed)**: `hubspot_company.py` — Load company master records
2. **Stage 1 (Bulk Load)**: `hubspot_gather_all_deals.py` — Initial load of all deals for a company
3. **Stage 2 (Sync/Update)**: `hubspot_update_script.py` — Ongoing synchronization (runs on a schedule)
These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity.
---
## Onboarding a New Client
Follow these steps in order when adding a new company/client:
### Step 1: Add the Company to the `Companies` Enum
Edit `../hubspotClient.py` and add your new company to the `Companies` enum class:
```python
class Companies(Enum):
ABRI = "237615001799"
SOUTHERN_HOUSING_GROUP = "109343619305"
LIVEWEST = "86205872354"
SURESERVE = "301745289413"
HOMEGROUP = "94946071794"
APPLE = "184769046716"
THE_GUINESS_PARTNERSHIP = "86970043613"
YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID" # ← Add here
```
**How to find your HubSpot Company ID:**
- Log into HubSpot
- Navigate to **Contacts** → **Companies**
- Click on the company name
- The URL will be: `https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID` — copy that ID
### Step 2: Update Each Script to Include Your Company
After adding the enum, update the company lists in all three scripts:
#### `hubspot_company.py` (line ~6)
```python
companies = [
Companies.THE_GUINESS_PARTNERSHIP,
Companies.YOUR_NEW_COMPANY # ← Add here
]
```
#### `hubspot_gather_all_deals.py` (line ~7)
```python
valuable_companies = [
Companies.THE_GUINESS_PARTNERSHIP.value,
Companies.YOUR_NEW_COMPANY.value # ← Add here
]
```
#### `hubspot_update_script.py` (line ~12)
```python
companies = [
Companies.THE_GUINESS_PARTNERSHIP,
Companies.YOUR_NEW_COMPANY # ← Add here
]
```
### Step 3: Run `hubspot_company.py` (One-time setup)
This script seeds the company record into the `hubspot_company_data` table. Run it once:
```bash
python etl/hubSpotClient/scripts/hubspot_company.py
```
**What it does:**
- Connects to HubSpot and fetches company information (name, ID)
- Inserts the company record into the local database
**Output:** You'll see the company added to `hubspot_company_data` table.
### Step 3.5: Update Group ID in Database (Manual)
After the company record is created, you need to manually update the **group ID** for the new company. This is done via DBeaver or pgAdmin:
**Steps:**
1. Open DBeaver or pgAdmin and connect to the database
2. Navigate to the `hubspot_company_data` table
3. Find the row with your new company (search by `company_name` or `company_id`)
4. Edit the **`group_id`** column to the portfolio/group ID you want to track for this company
5. Save the changes
**Example Query** (if you prefer SQL):
```sql
UPDATE hubspot_company_data
SET group_id = 'YOUR_GROUP_ID'
WHERE company_id = 'YOUR_COMPANY_ID';
```
**What is Group ID?**
- The group ID identifies which portfolio/group in your system this company belongs to
- Each company can be associated with one group ID for tracking and organization
- This field is used for tracking and reporting across your survey data
### Step 4: Run `hubspot_gather_all_deals.py` (One-time bulk load)
This script performs the initial load of all deals for your company, filtered by the `OPERATIONS_SOCIAL_HOUSING` pipeline. Run it once per company:
```bash
python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py
```
**What it does:**
- Fetches all deal IDs associated with your company from HubSpot
- For each deal, retrieves detailed properties:
- `dealname`, `dealstage`, `pipeline`, `outcome`, `outcome_notes`, `project_code`
- `major_condition_issue_description`, `major_condition_issue_photos`
- `coordination_status__stage_1_`, `retrofit_design_status`
- Filters to only deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline
- Fetches the associated listing (UPRN, property IDs)
- Inserts each deal into the `hubspot_data` table
- **Downloads photo evidence files** and uploads them to S3 (bucket: `retrofit-data-dev`)
**⚠️ Note:** This script can take a long time if your company has many deals. It processes deals serially with progress reporting via `tqdm`.
**Output:** Deals appear in `hubspot_data` table; photos appear in S3 at `s3://retrofit-data-dev/hubspot/awaabs_law_evidence/`.
### Step 5: `hubspot_update_script.py` (Automatic scheduling)
After the initial setup, **no manual action is needed**. This script runs automatically every 15 minutes during working hours as a scheduled job.
**What it does:**
- Queries the local database for all stored deals for each company
- Compares each deal's stored fields against the live HubSpot data (13 fields checked)
- Updates the database if any values have changed in HubSpot
- **Uploads newly available photos** to S3 (with SHA-256 integrity verification)
- Prints a summary report of changes, updates, and any failures
---
## Script Reference
### `hubspot_company.py`
**Stage:** Seed (one-time setup)
**Frequency:** Run once per new company
**Speed:** Fast
**Purpose:** Load company master data into the database.
**Database Output:**
- Table: `hubspot_company_data`
- Fields: `company_id`, `company_name`
**Code Flow:**
```
For each company in config:
1. Call HubSpot API: get_company_information(company_id)
2. Insert record into hubspot_company_data table
```
---
### `hubspot_gather_all_deals.py`
**Stage:** Bulk Load (one-time per company)
**Frequency:** Run once per company (manually triggered)
**Speed:** Slow (serial processing of all deals)
**Purpose:** Perform initial load of all deals for target companies.
**Database Output:**
- Table: `hubspot_data`
- Fields: `deal_id`, `deal_name`, `company_id`, `stage`, `outcome`, `photos_s3_url`, and others
**S3 Output:**
- Bucket: `retrofit-data-dev`
- Path: `hubspot/awaabs_law_evidence/{filename}`
**Code Flow:**
```
For each company in config:
1. Fetch all deal IDs from HubSpot
2. For each deal:
a. Get deal properties from HubSpot
b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline
c. Fetch associated listing data (UPRN, property IDs)
d. Insert deal into hubspot_data table
e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB
f. Print progress: "Uploaded deal_id {id} to db"
```
**Error Handling:** None — script will abort on first error. Re-run to retry.
---
### `hubspot_update_script.py`
**Stage:** Sync/Update (ongoing maintenance)
**Frequency:** Every 15 minutes during working hours (automated schedule)
**Speed:** Fast (only processes stored deals, compares, updates deltas)
**Purpose:** Keep database synchronized with live HubSpot data; handle new/updated photos.
**Database Operations:**
- Reads: All deals from `hubspot_data` for each company
- Writes: Updates only when fields differ from HubSpot
- S3 Uploads: New or previously missing photos
**Summary Report:**
After completion, prints a table of per-company statistics:
```
Company | Checked | Updated | Up-to-date | Failed
```
Plus detailed error messages for any failed updates.
**Code Flow:**
```
1. Initialize HubSpot client (warm-up: get_deal_stages)
2. For each company:
a. Query DB for all deals with company_id
b. For each deal:
- Fetch live deal data from HubSpot
- Compare 13 fields: deal_id, company_id, landlord_property_id, outcome,
dealstage, dealname, project_code, uprn,
outcome_notes, major_condition_issue_description,
major_condition_issue_photos, coordination_status,
design_status
- If any field differs: call upsert_hubspot_deal() to update DB
- If photos exist in HubSpot but not yet in S3:
* Download file from HubSpot URL
* Upload to S3
* Verify SHA-256 hash integrity
* Save S3 URL back to DB
- Collect success/failure counts
c. Print per-company summary
3. Print all failures (if any) with error messages
```
**Error Handling:** Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal.
---
## Common Tasks
### I added a new company but deals aren't showing up
**Checklist:**
- [ ] Company added to `Companies` enum in `hubspotClient.py`
- [ ] Company added to the `companies` list in **all three scripts**
- [ ] Ran `hubspot_company.py` successfully
- [ ] Ran `hubspot_gather_all_deals.py` and watched for "Uploaded deal_id" messages
- [ ] Check database: `SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID'`
- [ ] Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline?
### Deals exist in HubSpot but aren't syncing
The `hubspot_gather_all_deals.py` script only loads deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot.
### Photos aren't uploading
- First run of `hubspot_gather_all_deals.py` should upload photos at import time
- Subsequent runs of `hubspot_update_script.py` will upload newly available photos
- Check S3 bucket `retrofit-data-dev` under `hubspot/awaabs_law_evidence/`
- Check DB field `major_condition_issue_photos` (photo S3 URL is stored here)
### I need to re-sync everything for a company
1. Clear the deals from the database:
```sql
DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID';
```
2. Clear the company:
```sql
DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID';
```
3. Re-run from **Step 3** above (run `hubspot_company.py`, then `hubspot_gather_all_deals.py`)
---
## Dependencies
All scripts depend on:
- `HubSpotClient` from `../hubspotClient.py` — Handles HubSpot API calls
- `HubspotTodb` from `../../db/hubSpotLoad.py` — Handles database operations (insert/upsert/query)
- `tqdm` — Progress bars
- Python `requests` — HTTP downloads for photo files
Environment Requirements:
- Valid HubSpot API token (configured in `HubSpotClient.__init__()`)
- Database connection (configured in `HubspotTodb`)
- S3 credentials (for photo uploads)
- Network access to HubSpot API and S3
---
## Notes & Tips
1. **Idempotency:** `hubspot_gather_all_deals.py` and `hubspot_update_script.py` use upsert logic, so they can be run multiple times without creating duplicates.
2. **Large Portfolios:** If a company has thousands of deals, `hubspot_gather_all_deals.py` will take a while. Use `tqdm` progress indicators to monitor.
3. **Error Handling:** `hubspot_update_script.py` has error handling per deal. `hubspot_company.py` and `hubspot_gather_all_deals.py` do not — any failure aborts the script. If interrupted, simply re-run.
4. **Schedule:** `hubspot_update_script.py` is scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler).
5. **Photo Integrity:** The `hubspot_update_script.py` verifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database.
6. **Unused Fields:** The scripts populate `deals_to_add` and `deal_to_companies` dicts in `hubspot_gather_all_deals.py` but don't use them downstream. This is harmless but could be cleaned up in future refactors.
---
## Troubleshooting
| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| "Company not found" error | Company enum not added or typo in name | Double-check `Companies` enum in `hubspotClient.py` |
| Deal count mismatch | Company wasn't added to the script's companies list | Ensure company is in `valuable_companies` / `companies` in all 3 scripts |
| Slow script execution | Large portfolio or network latency | Normal for first run; `hubspot_update_script.py` is faster on subsequent runs |
| Photos not uploading | Deal doesn't have `major_condition_issue_photos` property | Photos only upload if HubSpot deal has photos attached |
| S3 upload fails | Credentials or bucket issues | Check IAM permissions and bucket name (`retrofit-data-dev`) |
| Update script reports failures | Stale data or missing DB fields | Check error messages in summary report; may need to re-sync company |