mirror of
https://github.com/Hestia-Homes/survey-extraction.git
synced 2026-06-08 11:17:29 +00:00
save
This commit is contained in:
parent
f9922113a5
commit
7eef0aca44
1 changed files with 325 additions and 0 deletions
325
etl/hubSpotClient/scripts/README.md
Normal file
325
etl/hubSpotClient/scripts/README.md
Normal file
|
|
@ -0,0 +1,325 @@
|
|||
# HubSpot Client Scripts - Onboarding Guide
|
||||
|
||||
## Overview
|
||||
|
||||
The scripts in this directory form a **3-stage ETL pipeline** for syncing HubSpot data (companies and deals) into the local database:
|
||||
|
||||
1. **Stage 0 (Seed)**: `hubspot_company.py` — Load company master records
|
||||
2. **Stage 1 (Bulk Load)**: `hubspot_gather_all_deals.py` — Initial load of all deals for a company
|
||||
3. **Stage 2 (Sync/Update)**: `hubspot_update_script.py` — Ongoing synchronization (runs on a schedule)
|
||||
|
||||
These scripts work together to keep your local database in sync with HubSpot while handling photo uploads to S3 and maintaining data integrity.
|
||||
|
||||
---
|
||||
|
||||
## Onboarding a New Client
|
||||
|
||||
Follow these steps in order when adding a new company/client:
|
||||
|
||||
### Step 1: Add the Company to the `Companies` Enum
|
||||
|
||||
Edit `../hubspotClient.py` and add your new company to the `Companies` enum class:
|
||||
|
||||
```python
|
||||
class Companies(Enum):
|
||||
ABRI = "237615001799"
|
||||
SOUTHERN_HOUSING_GROUP = "109343619305"
|
||||
LIVEWEST = "86205872354"
|
||||
SURESERVE = "301745289413"
|
||||
HOMEGROUP = "94946071794"
|
||||
APPLE = "184769046716"
|
||||
THE_GUINESS_PARTNERSHIP = "86970043613"
|
||||
YOUR_NEW_COMPANY = "YOUR_HUBSPOT_COMPANY_ID" # ← Add here
|
||||
```
|
||||
|
||||
**How to find your HubSpot Company ID:**
|
||||
- Log into HubSpot
|
||||
- Navigate to **Contacts** → **Companies**
|
||||
- Click on the company name
|
||||
- The URL will be: `https://app.hubspot.com/crm/xxx/objects/companies/COMPANY_ID` — copy that ID
|
||||
|
||||
### Step 2: Update Each Script to Include Your Company
|
||||
|
||||
After adding the enum, update the company lists in all three scripts:
|
||||
|
||||
#### `hubspot_company.py` (line ~6)
|
||||
```python
|
||||
companies = [
|
||||
Companies.THE_GUINESS_PARTNERSHIP,
|
||||
Companies.YOUR_NEW_COMPANY # ← Add here
|
||||
]
|
||||
```
|
||||
|
||||
#### `hubspot_gather_all_deals.py` (line ~7)
|
||||
```python
|
||||
valuable_companies = [
|
||||
Companies.THE_GUINESS_PARTNERSHIP.value,
|
||||
Companies.YOUR_NEW_COMPANY.value # ← Add here
|
||||
]
|
||||
```
|
||||
|
||||
#### `hubspot_update_script.py` (line ~12)
|
||||
```python
|
||||
companies = [
|
||||
Companies.THE_GUINESS_PARTNERSHIP,
|
||||
Companies.YOUR_NEW_COMPANY # ← Add here
|
||||
]
|
||||
```
|
||||
|
||||
### Step 3: Run `hubspot_company.py` (One-time setup)
|
||||
|
||||
This script seeds the company record into the `hubspot_company_data` table. Run it once:
|
||||
|
||||
```bash
|
||||
python etl/hubSpotClient/scripts/hubspot_company.py
|
||||
```
|
||||
|
||||
**What it does:**
|
||||
- Connects to HubSpot and fetches company information (name, ID)
|
||||
- Inserts the company record into the local database
|
||||
|
||||
**Output:** You'll see the company added to `hubspot_company_data` table.
|
||||
|
||||
### Step 3.5: Update Group ID in Database (Manual)
|
||||
|
||||
After the company record is created, you need to manually update the **group ID** for the new company. This is done via DBeaver or pgAdmin:
|
||||
|
||||
**Steps:**
|
||||
1. Open DBeaver or pgAdmin and connect to the database
|
||||
2. Navigate to the `hubspot_company_data` table
|
||||
3. Find the row with your new company (search by `company_name` or `company_id`)
|
||||
4. Edit the **`group_id`** column to the portfolio/group ID you want to track for this company
|
||||
5. Save the changes
|
||||
|
||||
**Example Query** (if you prefer SQL):
|
||||
```sql
|
||||
UPDATE hubspot_company_data
|
||||
SET group_id = 'YOUR_GROUP_ID'
|
||||
WHERE company_id = 'YOUR_COMPANY_ID';
|
||||
```
|
||||
|
||||
**What is Group ID?**
|
||||
- The group ID identifies which portfolio/group in your system this company belongs to
|
||||
- Each company can be associated with one group ID for tracking and organization
|
||||
- This field is used for tracking and reporting across your survey data
|
||||
|
||||
### Step 4: Run `hubspot_gather_all_deals.py` (One-time bulk load)
|
||||
|
||||
This script performs the initial load of all deals for your company, filtered by the `OPERATIONS_SOCIAL_HOUSING` pipeline. Run it once per company:
|
||||
|
||||
```bash
|
||||
python etl/hubSpotClient/scripts/hubspot_gather_all_deals.py
|
||||
```
|
||||
|
||||
**What it does:**
|
||||
- Fetches all deal IDs associated with your company from HubSpot
|
||||
- For each deal, retrieves detailed properties:
|
||||
- `dealname`, `dealstage`, `pipeline`, `outcome`, `outcome_notes`, `project_code`
|
||||
- `major_condition_issue_description`, `major_condition_issue_photos`
|
||||
- `coordination_status__stage_1_`, `retrofit_design_status`
|
||||
- Filters to only deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline
|
||||
- Fetches the associated listing (UPRN, property IDs)
|
||||
- Inserts each deal into the `hubspot_data` table
|
||||
- **Downloads photo evidence files** and uploads them to S3 (bucket: `retrofit-data-dev`)
|
||||
|
||||
**⚠️ Note:** This script can take a long time if your company has many deals. It processes deals serially with progress reporting via `tqdm`.
|
||||
|
||||
**Output:** Deals appear in `hubspot_data` table; photos appear in S3 at `s3://retrofit-data-dev/hubspot/awaabs_law_evidence/`.
|
||||
|
||||
### Step 5: `hubspot_update_script.py` (Automatic scheduling)
|
||||
|
||||
After the initial setup, **no manual action is needed**. This script runs automatically every 15 minutes during working hours as a scheduled job.
|
||||
|
||||
**What it does:**
|
||||
- Queries the local database for all stored deals for each company
|
||||
- Compares each deal's stored fields against the live HubSpot data (13 fields checked)
|
||||
- Updates the database if any values have changed in HubSpot
|
||||
- **Uploads newly available photos** to S3 (with SHA-256 integrity verification)
|
||||
- Prints a summary report of changes, updates, and any failures
|
||||
|
||||
---
|
||||
|
||||
## Script Reference
|
||||
|
||||
### `hubspot_company.py`
|
||||
|
||||
**Stage:** Seed (one-time setup)
|
||||
**Frequency:** Run once per new company
|
||||
**Speed:** Fast
|
||||
|
||||
**Purpose:** Load company master data into the database.
|
||||
|
||||
**Database Output:**
|
||||
- Table: `hubspot_company_data`
|
||||
- Fields: `company_id`, `company_name`
|
||||
|
||||
**Code Flow:**
|
||||
```
|
||||
For each company in config:
|
||||
1. Call HubSpot API: get_company_information(company_id)
|
||||
2. Insert record into hubspot_company_data table
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### `hubspot_gather_all_deals.py`
|
||||
|
||||
**Stage:** Bulk Load (one-time per company)
|
||||
**Frequency:** Run once per company (manually triggered)
|
||||
**Speed:** Slow (serial processing of all deals)
|
||||
|
||||
**Purpose:** Perform initial load of all deals for target companies.
|
||||
|
||||
**Database Output:**
|
||||
- Table: `hubspot_data`
|
||||
- Fields: `deal_id`, `deal_name`, `company_id`, `stage`, `outcome`, `photos_s3_url`, and others
|
||||
|
||||
**S3 Output:**
|
||||
- Bucket: `retrofit-data-dev`
|
||||
- Path: `hubspot/awaabs_law_evidence/{filename}`
|
||||
|
||||
**Code Flow:**
|
||||
```
|
||||
For each company in config:
|
||||
1. Fetch all deal IDs from HubSpot
|
||||
2. For each deal:
|
||||
a. Get deal properties from HubSpot
|
||||
b. Filter by OPERATIONS_SOCIAL_HOUSING pipeline
|
||||
c. Fetch associated listing data (UPRN, property IDs)
|
||||
d. Insert deal into hubspot_data table
|
||||
e. If photos exist: download from HubSpot URL, upload to S3, save S3 URL to DB
|
||||
f. Print progress: "Uploaded deal_id {id} to db"
|
||||
```
|
||||
|
||||
**Error Handling:** None — script will abort on first error. Re-run to retry.
|
||||
|
||||
---
|
||||
|
||||
### `hubspot_update_script.py`
|
||||
|
||||
**Stage:** Sync/Update (ongoing maintenance)
|
||||
**Frequency:** Every 15 minutes during working hours (automated schedule)
|
||||
**Speed:** Fast (only processes stored deals, compares, updates deltas)
|
||||
|
||||
**Purpose:** Keep database synchronized with live HubSpot data; handle new/updated photos.
|
||||
|
||||
**Database Operations:**
|
||||
- Reads: All deals from `hubspot_data` for each company
|
||||
- Writes: Updates only when fields differ from HubSpot
|
||||
- S3 Uploads: New or previously missing photos
|
||||
|
||||
**Summary Report:**
|
||||
After completion, prints a table of per-company statistics:
|
||||
```
|
||||
Company | Checked | Updated | Up-to-date | Failed
|
||||
```
|
||||
|
||||
Plus detailed error messages for any failed updates.
|
||||
|
||||
**Code Flow:**
|
||||
```
|
||||
1. Initialize HubSpot client (warm-up: get_deal_stages)
|
||||
2. For each company:
|
||||
a. Query DB for all deals with company_id
|
||||
b. For each deal:
|
||||
- Fetch live deal data from HubSpot
|
||||
- Compare 13 fields: deal_id, company_id, landlord_property_id, outcome,
|
||||
dealstage, dealname, project_code, uprn,
|
||||
outcome_notes, major_condition_issue_description,
|
||||
major_condition_issue_photos, coordination_status,
|
||||
design_status
|
||||
- If any field differs: call upsert_hubspot_deal() to update DB
|
||||
- If photos exist in HubSpot but not yet in S3:
|
||||
* Download file from HubSpot URL
|
||||
* Upload to S3
|
||||
* Verify SHA-256 hash integrity
|
||||
* Save S3 URL back to DB
|
||||
- Collect success/failure counts
|
||||
c. Print per-company summary
|
||||
3. Print all failures (if any) with error messages
|
||||
```
|
||||
|
||||
**Error Handling:** Wrapped in try/except per deal. Failures are logged, and the script continues to the next deal.
|
||||
|
||||
---
|
||||
|
||||
## Common Tasks
|
||||
|
||||
### I added a new company but deals aren't showing up
|
||||
|
||||
**Checklist:**
|
||||
- [ ] Company added to `Companies` enum in `hubspotClient.py`
|
||||
- [ ] Company added to the `companies` list in **all three scripts**
|
||||
- [ ] Ran `hubspot_company.py` successfully
|
||||
- [ ] Ran `hubspot_gather_all_deals.py` and watched for "Uploaded deal_id" messages
|
||||
- [ ] Check database: `SELECT COUNT(*) FROM hubspot_data WHERE company_id = 'YOUR_ID'`
|
||||
- [ ] Check HubSpot: Does the company have any deals in the OPERATIONS_SOCIAL_HOUSING pipeline?
|
||||
|
||||
### Deals exist in HubSpot but aren't syncing
|
||||
|
||||
The `hubspot_gather_all_deals.py` script only loads deals in the `OPERATIONS_SOCIAL_HOUSING` pipeline. If deals are in a different pipeline, they won't be loaded. Check the deal's pipeline in HubSpot.
|
||||
|
||||
### Photos aren't uploading
|
||||
|
||||
- First run of `hubspot_gather_all_deals.py` should upload photos at import time
|
||||
- Subsequent runs of `hubspot_update_script.py` will upload newly available photos
|
||||
- Check S3 bucket `retrofit-data-dev` under `hubspot/awaabs_law_evidence/`
|
||||
- Check DB field `major_condition_issue_photos` (photo S3 URL is stored here)
|
||||
|
||||
### I need to re-sync everything for a company
|
||||
|
||||
1. Clear the deals from the database:
|
||||
```sql
|
||||
DELETE FROM hubspot_data WHERE company_id = 'YOUR_COMPANY_ID';
|
||||
```
|
||||
2. Clear the company:
|
||||
```sql
|
||||
DELETE FROM hubspot_company_data WHERE company_id = 'YOUR_COMPANY_ID';
|
||||
```
|
||||
3. Re-run from **Step 3** above (run `hubspot_company.py`, then `hubspot_gather_all_deals.py`)
|
||||
|
||||
---
|
||||
|
||||
## Dependencies
|
||||
|
||||
All scripts depend on:
|
||||
- `HubSpotClient` from `../hubspotClient.py` — Handles HubSpot API calls
|
||||
- `HubspotTodb` from `../../db/hubSpotLoad.py` — Handles database operations (insert/upsert/query)
|
||||
- `tqdm` — Progress bars
|
||||
- Python `requests` — HTTP downloads for photo files
|
||||
|
||||
Environment Requirements:
|
||||
- Valid HubSpot API token (configured in `HubSpotClient.__init__()`)
|
||||
- Database connection (configured in `HubspotTodb`)
|
||||
- S3 credentials (for photo uploads)
|
||||
- Network access to HubSpot API and S3
|
||||
|
||||
---
|
||||
|
||||
## Notes & Tips
|
||||
|
||||
1. **Idempotency:** `hubspot_gather_all_deals.py` and `hubspot_update_script.py` use upsert logic, so they can be run multiple times without creating duplicates.
|
||||
|
||||
2. **Large Portfolios:** If a company has thousands of deals, `hubspot_gather_all_deals.py` will take a while. Use `tqdm` progress indicators to monitor.
|
||||
|
||||
3. **Error Handling:** `hubspot_update_script.py` has error handling per deal. `hubspot_company.py` and `hubspot_gather_all_deals.py` do not — any failure aborts the script. If interrupted, simply re-run.
|
||||
|
||||
4. **Schedule:** `hubspot_update_script.py` is scheduled to run every 15 minutes during business hours (typically configured as a cron job or similar scheduler).
|
||||
|
||||
5. **Photo Integrity:** The `hubspot_update_script.py` verifies downloaded photos using SHA-256 hashing before committing the S3 URL to the database.
|
||||
|
||||
6. **Unused Fields:** The scripts populate `deals_to_add` and `deal_to_companies` dicts in `hubspot_gather_all_deals.py` but don't use them downstream. This is harmless but could be cleaned up in future refactors.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Issue | Likely Cause | Solution |
|
||||
|-------|--------------|----------|
|
||||
| "Company not found" error | Company enum not added or typo in name | Double-check `Companies` enum in `hubspotClient.py` |
|
||||
| Deal count mismatch | Company wasn't added to the script's companies list | Ensure company is in `valuable_companies` / `companies` in all 3 scripts |
|
||||
| Slow script execution | Large portfolio or network latency | Normal for first run; `hubspot_update_script.py` is faster on subsequent runs |
|
||||
| Photos not uploading | Deal doesn't have `major_condition_issue_photos` property | Photos only upload if HubSpot deal has photos attached |
|
||||
| S3 upload fails | Credentials or bucket issues | Check IAM permissions and bucket name (`retrofit-data-dev`) |
|
||||
| Update script reports failures | Stale data or missing DB fields | Check error messages in summary report; may need to re-sync company |
|
||||
|
||||
Loading…
Add table
Reference in a new issue