Compare commits

...

244 commits

Author SHA1 Message Date
quandanrepo
b8dcf626b2
Merge pull request #117 from Hestia-Homes/sap-dev
Sap dev
2024-05-30 20:18:25 +01:00
Github-Bot
d09c534e0d Update Registry 2024-05-30 11:47:46 +00:00
Github-Bot
9925b54af2 Update Registry 2024-05-30 11:47:04 +00:00
KhalimCK
d307d9e093
Merge pull request #116 from Hestia-Homes/sap-dev-assumed
Sap dev assumed
2024-05-30 12:46:28 +01:00
Michael Duong
1944ea1cf1 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-assumed 2024-05-28 19:59:07 +01:00
Michael Duong
8399092e20 formatting 2024-05-28 19:58:46 +01:00
Github-Bot
dc260fddd0 Update Registry 2024-05-28 15:58:31 +00:00
Github-Bot
6f00d6afb8 Update Registry 2024-05-28 15:57:55 +00:00
quandanrepo
1f0414a905
Merge pull request #115 from Hestia-Homes/sap-dev-assumed
Sap dev assumed
2024-05-28 16:57:22 +01:00
Michael Duong
5e0118ca0b change deployment - pineed serverless pajkage 2024-05-28 16:55:47 +01:00
Michael Duong
7e3a6f7700 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-assumed 2024-05-26 10:46:38 +01:00
Github-Bot
396a5ffb08 Update Registry 2024-05-26 09:08:23 +00:00
Github-Bot
a78c5a50b0 Update Registry 2024-05-26 09:07:46 +00:00
quandanrepo
dc70b84626
Merge pull request #113 from Hestia-Homes/sap-dev-gto
Sap dev gto
2024-05-26 10:07:07 +01:00
Michael Duong
e0954b52bc Upgrade dvc packages to fix pygit2 error 2024-05-26 09:56:05 +01:00
Michael Duong
9e23eae27a add testing script 2024-05-26 09:54:22 +01:00
Michael Duong
0768ace947 add testing script 2024-05-26 09:50:39 +01:00
Michael Duong
4ff7cfb271 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-gto 2024-05-26 09:47:23 +01:00
Michael Duong
a4dffe527a add testing script 2024-05-26 09:47:08 +01:00
quandanrepo
8adfa72036
Merge pull request #111 from Hestia-Homes/sap-dev-package
Sap dev package
2024-05-26 09:31:46 +01:00
Michael Duong
29b350e33b Merge branch 'master' of github.com:Hestia-Homes/ML into sap-dev-assumed 2024-05-26 09:28:16 +01:00
Michael Duong
b985bbf753 new model with is_as_built_ending and is assumed columns 2024-05-26 09:28:00 +01:00
Michael Duong
f43d077479 use previous model with new downstream processes 2024-04-22 19:10:40 +01:00
Michael Duong
52f33239f4 Merge branch 'sap-dev-package' of github.com:Hestia-Homes/ML into sap-dev-package 2024-04-22 19:02:13 +01:00
Michael Duong
874b1db5f3 add ignored file to dockerignore 2024-04-22 19:01:56 +01:00
Michael Duong
7a3477c0e1 change to all packages 2024-04-22 13:30:58 +01:00
Michael Duong
87e3cc391e push files to s3 2024-04-19 17:48:15 +01:00
Michael Duong
380bd6b595 correct the dockerignore files and test model with just tabular 2024-04-19 17:34:10 +01:00
Michael Duong
50a3e2d5be correct the dockerignore files and test model with just tabular 2024-04-19 16:25:26 +01:00
Michael Duong
620c1d10a1 correct the dockerignore files and test model with just tabular 2024-04-19 16:22:06 +01:00
Michael Duong
179c334b6e add switch to turn off scenario data (for carbon and heat for now) 2024-04-19 14:38:57 +01:00
quandanrepo
502621e434
Merge pull request #110 from Hestia-Homes/sap-dev
Sap dev
2024-04-19 14:36:45 +01:00
Github-Bot
e97c01c366 Update Registry 2024-03-28 15:23:18 +00:00
Github-Bot
94a6aaa38f Update Registry 2024-03-28 15:22:33 +00:00
quandanrepo
98254555a1
Merge pull request #108 from Hestia-Homes/sap-dev-model
add c++ to docker, fixed dynaconf
2024-03-28 15:21:31 +00:00
Michael Duong
7aeaa9a5f6 add c++ to docker, fixed dynaconf 2024-03-28 15:13:20 +00:00
Github-Bot
a7bb61433a Update Registry 2024-03-28 09:31:07 +00:00
Github-Bot
64a5c93833 Update Registry 2024-03-28 09:30:30 +00:00
KhalimCK
e746352977
Merge pull request #104 from Hestia-Homes/sap-dev-model
Sap dev model
2024-03-28 09:29:53 +00:00
Michael Duong
1bb1f8d61f add metrics for scenarios 2024-03-27 12:30:31 +00:00
Michael Duong
c3985e2104 add metrics for scenarios 2024-03-27 12:22:58 +00:00
Michael Duong
9b6aeae0da medium model with scenario and upgraded autogluon 2024-03-26 22:32:44 +00:00
Michael Duong
96f5b37001 medium model with scenario and upgraded autogluon 2024-03-26 22:32:14 +00:00
Michael Duong
8a9b5877b5 medium model with scenario and upgraded autogluon 2024-03-26 22:30:50 +00:00
Michael Duong
ad2c4d6019 upgrade autogluon 2024-03-21 14:41:58 +00:00
Michael Duong
d5f40a8eb2 only ending 2024-02-17 21:17:34 +00:00
Michael Duong
cec3cc60e7 test less features 2024-02-17 16:26:49 +00:00
Michael Duong
81e7c2a4bd test this version 2024-02-16 16:57:37 +00:00
Michael Duong
fe430c4326 test this version 2024-02-16 16:54:18 +00:00
Michael Duong
49e66411ce test this version 2024-02-16 16:51:43 +00:00
Michael Duong
fdbf339d63 try the scenario cml 2024-02-16 16:44:43 +00:00
Michael Duong
2221283de4 try the scenario cml 2024-02-16 16:43:23 +00:00
Github-Bot
7f2f80af22 Update Registry 2024-02-16 16:36:38 +00:00
Github-Bot
99e883584b Update Registry 2024-02-16 16:35:54 +00:00
KhalimCK
3ee352b719
Merge pull request #103 from Hestia-Homes/sap-dev-revert
revert change on sap-dev-model
2024-02-16 16:35:18 +00:00
Michael Duong
0e2bff9d64 revert changes 2024-02-16 16:30:13 +00:00
Michael Duong
e060aeb4c0 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-revert 2024-02-16 16:25:57 +00:00
Michael Duong
a9b50c8a2d revert change on sap-dev-model 2024-02-16 16:23:37 +00:00
Github-Bot
6e76716fbc Update Registry 2024-02-16 14:52:15 +00:00
Github-Bot
86352ce0ce Update Registry 2024-02-16 14:51:31 +00:00
KhalimCK
33d0f6b323
Merge pull request #102 from Hestia-Homes/sap-dev-model
Sap dev model
2024-02-16 14:50:51 +00:00
Michael Duong
8363d5f0de Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-model 2024-02-15 15:11:08 +00:00
Michael Duong
603dfe2eab new model with starting and ending rooms 2024-02-15 15:10:49 +00:00
Github-Bot
babbc155e9 Update Registry 2024-02-12 18:34:09 +00:00
Github-Bot
d21fd1c4e8 Update Registry 2024-02-12 18:33:28 +00:00
KhalimCK
6815cfcc66
Merge pull request #101 from Hestia-Homes/sap-dev-model
Sap dev model
2024-02-12 18:32:38 +00:00
Michael Duong
fedcd1ed92 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-model 2024-02-10 12:30:52 +00:00
Michael Duong
eeb653c041 new model 2024-02-10 11:03:38 +00:00
Github-Bot
8a1e2958b4 Update Registry 2024-02-09 18:54:16 +00:00
Github-Bot
bc44376e07 Update Registry 2024-02-09 18:53:22 +00:00
KhalimCK
89604645d5
Merge pull request #99 from Hestia-Homes/sap-dev-model
Sap dev model
2024-02-09 18:52:45 +00:00
Michael Duong
1e36d6e4f6 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-model 2024-02-09 18:46:33 +00:00
Michael Duong
778bff37fb 4000 model 2024-02-09 18:46:19 +00:00
Github-Bot
f17119382b Update Registry 2024-02-09 16:27:45 +00:00
Github-Bot
a98fc9d93a Update Registry 2024-02-09 16:27:01 +00:00
KhalimCK
051921ff3f
Merge pull request #97 from Hestia-Homes/sap-dev-model
Sap dev model
2024-02-09 16:26:24 +00:00
Michael Duong
18ea4a2177 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-model 2024-02-09 16:20:02 +00:00
Michael Duong
f92c97f6cf drop days_starting and days_ending 2024-02-09 16:19:47 +00:00
Github-Bot
96eb3904e2 Update Registry 2024-01-29 12:38:33 +00:00
Github-Bot
7f59305e20 Update Registry 2024-01-29 12:37:45 +00:00
KhalimCK
23dbfcc467
Merge pull request #94 from Hestia-Homes/sap-dev-model
test model with 1 percent o change records
2024-01-29 12:37:02 +00:00
Michael Duong
353b62bc77 test model with all data, using interal cross validation, all dataset with permuation and 0, test data is just a random 10 percent sample of the training data 2024-01-29 09:03:36 +00:00
Michael Duong
d356fbfed0 test model with all permutation and zero records 2024-01-24 10:29:56 +00:00
Michael Duong
ca2a3d3623 longer run model 2024-01-23 21:46:24 +00:00
Michael Duong
efb84723bb test model with 1 percent o change records 2024-01-23 19:27:53 +00:00
Github-Bot
6d6b824006 Update Registry 2024-01-18 10:37:52 +00:00
Github-Bot
47f8447223 Update Registry 2024-01-18 10:36:52 +00:00
KhalimCK
d9cbc1e190
Merge pull request #91 from Hestia-Homes/sap-dev-model
run sap model with new data
2024-01-18 10:36:00 +00:00
Michael Duong
0e31d67970 run sap model with new data 2024-01-17 23:07:22 +00:00
Github-Bot
77888bb839 Update Registry 2024-01-16 17:38:50 +00:00
Github-Bot
f472d3c5fa Update Registry 2024-01-16 17:38:07 +00:00
KhalimCK
03364036db
Merge pull request #90 from Hestia-Homes/sap-dev-model
Sap dev model
2024-01-16 17:37:12 +00:00
Michael Duong
50c369720e corrected model 2023-12-22 11:16:45 +00:00
Michael Duong
717a1a64fe update version control packages 2023-12-22 10:47:35 +00:00
Michael Duong
daa4c28be6 remove unneeded dvc gto files 2023-12-22 10:44:23 +00:00
Michael Duong
c576657805 comment out old dataset 2023-12-22 10:35:17 +00:00
Michael Duong
acdac3d8dc test new data 2023-12-22 10:28:56 +00:00
Michael Duong
598c1118f3 fix merge conflict 2023-12-22 09:54:35 +00:00
Michael Duong
639ba9dd11 add infernce limit 2023-11-27 21:50:08 +00:00
KhalimCK
ed4c1aebf6
Merge pull request #84 from Hestia-Homes/new-model-workflows
Added additional workflows for new models
2023-11-27 19:12:09 +00:00
Khalim Conn-Kowlessar
e9417ca73d Added additional workflows for new models 2023-11-27 15:17:01 +00:00
Github-Bot
7d26ec4219 Update Registry 2023-10-22 21:07:17 +00:00
Github-Bot
6d3407ba0e Update Registry 2023-10-22 21:06:37 +00:00
quandanrepo
91741c527b
Merge pull request #83 from Hestia-Homes/sap-dev-model
Sap dev model
2023-10-22 17:05:52 -04:00
Michael Duong
0f96bc55f1 add time to inference to model 2023-10-22 21:05:07 +00:00
Michael Duong
499458b699 add time to inference to model 2023-10-22 21:02:32 +00:00
Michael Duong
8689d4391e Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-model 2023-10-22 03:25:23 +00:00
Michael Duong
cbd46489fe Remove propgate 2023-10-22 03:25:07 +00:00
Github-Bot
a15bdd5ee0 Update Registry 2023-10-21 03:03:21 +00:00
Github-Bot
72cf709601 Update Registry 2023-10-21 03:02:38 +00:00
quandanrepo
6e35e8cdfe
Merge pull request #82 from Hestia-Homes/sap-dev-dockerignore
final removal of dash from handler
2023-10-20 23:01:59 -04:00
Michael Duong
ca37e4ee18 final removal of dash from handler 2023-10-21 04:00:13 +01:00
Github-Bot
3145b5d331 Update Registry 2023-10-20 22:41:26 +00:00
Github-Bot
960425e709 Update Registry 2023-10-20 22:40:39 +00:00
quandanrepo
46bb25012a
Merge pull request #81 from Hestia-Homes/sap-dev-dockerignore
Sap dev dockerignore
2023-10-20 18:39:53 -04:00
Michael Duong
811d47b78a remove more lines 2023-10-20 23:30:31 +01:00
Michael Duong
59113859b1 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-dockerignore 2023-10-20 15:45:25 +01:00
Michael Duong
867f4e0bf0 change logging style 2023-10-20 15:45:04 +01:00
Github-Bot
c605d6b549 Update Registry 2023-10-20 02:16:05 +00:00
Github-Bot
72d4dbae3f Update Registry 2023-10-20 02:15:23 +00:00
quandanrepo
7a2347a937
Merge pull request #80 from Hestia-Homes/sap-dev-dockerignore
Sap dev dockerignore
2023-10-19 22:14:40 -04:00
Michael Duong
56b7139b41 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-dockerignore 2023-10-20 03:13:36 +01:00
Michael Duong
dadcbbab3a revert back for now 2023-10-20 03:13:24 +01:00
Github-Bot
c5a9b548ab Update Registry 2023-10-20 02:12:04 +00:00
Github-Bot
652bdd3467 Update Registry 2023-10-20 02:11:10 +00:00
quandanrepo
ca4edb5068
Merge pull request #79 from Hestia-Homes/sap-dev-dockerignore
Sap dev dockerignore
2023-10-19 22:10:10 -04:00
Michael Duong
b9ea396f86 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-dockerignore 2023-10-20 03:08:26 +01:00
Michael Duong
0c87f21673 test just a single dependency 2023-10-20 03:08:13 +01:00
Github-Bot
b50e0ef1ba Update Registry 2023-10-20 01:59:48 +00:00
Github-Bot
ad98ec4f1a Update Registry 2023-10-20 01:58:57 +00:00
quandanrepo
9de74ce453
Merge pull request #78 from Hestia-Homes/sap-dev-dockerignore
add dockerignore file for prediction lamda
2023-10-19 21:58:10 -04:00
Michael Duong
ddf3ad3b40 add dependency for workflow files 2023-10-20 02:56:58 +01:00
Michael Duong
a44fe33998 add the test data back to get it to run 2023-10-20 02:48:17 +01:00
Michael Duong
fbd235addf add dockerignore for verify step 2023-10-20 02:39:09 +01:00
Michael Duong
e1cf3a48a9 add dockerignore file for prediction lamda 2023-10-20 02:27:26 +01:00
Github-Bot
7efb910103 Update Registry 2023-10-19 01:20:19 +00:00
Github-Bot
b2e5fd9419 Update Registry 2023-10-19 01:19:29 +00:00
quandanrepo
e921d0f90b
Merge pull request #73 from Hestia-Homes/sap-dev-model
Sap dev model
2023-10-18 21:18:42 -04:00
Michael Duong
790c3a9456 use test dataset 2023-10-18 13:27:25 +00:00
Michael Duong
a60a3bd285 Merge branch 'master' of github.com:Hestia-Homes/ML into sap-dev-model 2023-10-17 23:54:06 +00:00
Michael Duong
17fad3cf0a Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-model 2023-10-17 23:53:43 +00:00
quandanrepo
96153f8248
Update Makefile 2023-10-17 03:08:01 +01:00
quandanrepo
7589977cda
Update Makefile 2023-10-12 10:19:22 +01:00
quandanrepo
b570829b5a
Merge pull request #70 from Hestia-Homes/sap-dev
Sap dev
2023-10-11 09:36:44 +01:00
Github-Bot
4597c12795 Update Registry 2023-10-10 23:00:04 +00:00
Github-Bot
c668e4227c Update Registry 2023-10-10 22:59:21 +00:00
quandanrepo
b04a0a4a90
Merge pull request #69 from Hestia-Homes/sap-dev-fix
Sap dev fix
2023-10-10 23:58:39 +01:00
Michael Duong
bd80c3d69d final fix for workflow on post merge 2023-10-10 23:58:07 +01:00
Michael Duong
da8cf5c1c4 Merge branch 'sap-dev' of github.com:Hestia-Homes/ML into sap-dev-fix 2023-10-10 23:56:46 +01:00
Michael Duong
8bdedf25a2 final fix for workflow on post merge 2023-10-10 23:56:35 +01:00
Github-Bot
7a113f790e Update Registry 2023-10-10 22:43:36 +00:00
Github-Bot
755d00e0e4 Update Registry 2023-10-10 22:42:45 +00:00
quandanrepo
6e71a59cc5
Merge pull request #68 from Hestia-Homes/sap-dev-fix
add smape
2023-10-10 23:41:56 +01:00
Michael Duong
fe34356822 Merge branch 'master' of github.com:Hestia-Homes/ML into sap-dev-fix 2023-10-10 23:41:17 +01:00
Michael Duong
6552e97555 fix the register increments 2023-10-10 23:41:06 +01:00
Michael Duong
8dd784255a add smape 2023-10-10 23:28:30 +01:00
quandanrepo
051f07df77
Update README.md 2023-10-10 14:02:54 +01:00
Github-Bot
7a1b9aed5f Update Registry 2023-10-10 11:49:02 +00:00
Github-Bot
69c5c77a88 Update Registry 2023-10-10 11:48:13 +00:00
quandanrepo
ae474fedb4
Merge pull request #66 from Hestia-Homes/sap-dev-fix
Sap dev fix
2023-10-10 12:47:29 +01:00
Michael Duong
57934d0ae3 fixed buffer bug and add id 2023-10-10 12:35:34 +01:00
quandanrepo
70b3008dc5
Update README.md 2023-10-10 11:56:56 +01:00
quandanrepo
391cc66435
Update README.md 2023-10-10 11:53:52 +01:00
quandanrepo
d3b1bb4bb9
Update README.md 2023-10-10 11:49:37 +01:00
quandanrepo
dda9065a88
Update README.md 2023-10-10 11:45:50 +01:00
Michael Duong
f9b0b6112c add some processing ocde 2023-10-09 15:44:37 +00:00
quandanrepo
ba4d1bcc8b
Merge pull request #65 from Hestia-Homes/sap-dev
Sap dev
2023-10-07 09:56:42 +01:00
Github-Bot
8105706ea7 Update Registry 2023-10-04 17:01:20 +00:00
Github-Bot
4d909c3996 Update Registry 2023-10-04 17:00:31 +00:00
quandanrepo
88aa4048bb
Merge pull request #64 from Hestia-Homes/sap-dev-model
change sapmodel stack anme to be more general - remove change_ sed co…
2023-10-04 17:59:46 +01:00
Michael Duong
a15befe381 change sapmodel stack anme to be more general - remove change_ sed command 2023-10-04 16:58:18 +00:00
Github-Bot
50f72f91e3 Update Registry 2023-10-04 16:41:12 +00:00
Github-Bot
325153a725 Update Registry 2023-10-04 16:40:28 +00:00
quandanrepo
445b46507b
Merge pull request #63 from Hestia-Homes/sap-dev-model
change sapmodel stack anme to be more general
2023-10-04 17:39:50 +01:00
Michael Duong
e7222e0c44 change sapmodel stack anme to be more general 2023-10-04 16:38:59 +00:00
Github-Bot
6529c93cff Update Registry 2023-10-04 16:23:27 +00:00
Github-Bot
d125a2a8a1 Update Registry 2023-10-04 16:22:36 +00:00
quandanrepo
9377051a88
Merge pull request #62 from Hestia-Homes/sap-dev-model
test workflow
2023-10-04 17:21:53 +01:00
Michael Duong
2e5b354356 test workflow 2023-10-04 16:14:01 +00:00
Michael Duong
9ed3e2a3b3 just use sap 2023-10-04 16:10:35 +00:00
Michael Duong
d0399d29c6 change branch names to sapmodel 2023-10-04 15:53:47 +00:00
Michael Duong
c5fa850f71 change workflow to only work on the sapmodel- branchs 2023-10-04 15:49:19 +00:00
Michael Duong
bd6dd213b4 change deployment to use this new branch name 2023-10-04 15:44:08 +00:00
Michael Duong
62321b8d00 change to sapmodel branch name 2023-10-04 15:42:40 +00:00
Github-Bot
f1c8656cfb Update Registry 2023-10-04 15:31:52 +00:00
Github-Bot
bdc0136bfb Update Registry 2023-10-04 15:30:55 +00:00
quandanrepo
1f683b28d0
Merge pull request #61 from Hestia-Homes/sap_change-model
change workflow
2023-10-04 16:30:10 +01:00
Michael Duong
63b7d7127a change workflow 2023-10-04 15:29:21 +00:00
quandanrepo
b0cfc2d184
Merge pull request #60 from Hestia-Homes/sap_change-model
Sap change model
2023-10-04 11:24:39 +01:00
Michael Duong
5589282485 use target branch as diff location 2023-10-04 10:22:37 +00:00
Michael Duong
129c0f8c2f change branch name 2023-10-04 10:14:12 +00:00
Michael Duong
b91a5f26ec install pyoopenssl 2023-10-04 10:08:40 +00:00
Michael Duong
4710dac788 use newer model 2023-10-04 10:03:38 +00:00
Michael Duong
7573d885af add the branchs to workflwo 2023-10-04 09:55:50 +00:00
Michael Duong
f12514aca9 make sure best model in master 2023-10-04 09:47:01 +00:00
Michael Duong
51b7049720 add optimised model 2023-10-03 23:46:37 +00:00
Michael Duong
bcd2383d8d add eda script bits 2023-10-03 23:01:17 +00:00
Michael Duong
961773f58a add identifier column to datasets 2023-10-03 22:29:55 +00:00
Michael Duong
6b7171adc0 Merge branch 'master' of github.com:Hestia-Homes/ML into model-test 2023-10-03 22:04:02 +00:00
Michael Duong
0386346c67 add eda code for nowA 2023-10-03 22:03:50 +00:00
KhalimCK
4320a5bd89
Merge pull request #58 from Hestia-Homes/master
updated save filetype to parquet
2023-10-03 18:24:28 +01:00
Khalim Conn-Kowlessar
5e62b2d43e updated save filetype to parquet 2023-10-03 18:23:51 +01:00
KhalimCK
06a56fdb54
Merge pull request #57 from Hestia-Homes/master
corrected reference to s3 bucekts
2023-10-03 18:05:49 +01:00
Khalim Conn-Kowlessar
e4352bda1e corrected reference to s3 bucekts 2023-10-03 18:05:21 +01:00
KhalimCK
4b870143fd
Merge pull request #56 from Hestia-Homes/master
remove model bucket from serverless
2023-10-03 17:46:24 +01:00
Khalim Conn-Kowlessar
b21a221f3b remove model bucket from serverless 2023-10-03 17:46:04 +01:00
KhalimCK
8859d7b321
Merge pull request #55 from Hestia-Homes/master
removed redundant bucket and fixed storage filepath
2023-10-03 17:39:21 +01:00
Khalim Conn-Kowlessar
57ed666ea7 removed redundant bucket and fixed storage filepath 2023-10-03 17:38:56 +01:00
KhalimCK
f8409ac63b
Merge pull request #54 from Hestia-Homes/master
Got deployment working
2023-10-03 17:09:36 +01:00
Khalim Conn-Kowlessar
fd07605502 Got deployment working 2023-10-03 17:08:48 +01:00
KhalimCK
97c1469451
Merge pull request #53 from Hestia-Homes/master
install vc requirements
2023-10-03 16:35:37 +01:00
Khalim Conn-Kowlessar
1400e6843c install vc requirements 2023-10-03 16:35:05 +01:00
KhalimCK
03dc72799a
Merge pull request #52 from Hestia-Homes/master
move dvc pull to github actions
2023-10-03 16:32:21 +01:00
Khalim Conn-Kowlessar
5960ebbf22 remove install of version control requirements 2023-10-03 16:31:40 +01:00
Khalim Conn-Kowlessar
9501130419 Trying dvc pull in github actions and copying into docker 2023-10-03 16:30:44 +01:00
KhalimCK
383177b282
Merge pull request #51 from Hestia-Homes/master
changing to the deployment directory for sls deploy
2023-10-03 14:05:02 +01:00
Khalim Conn-Kowlessar
fd11114674 changing to the deployment directory for sls deploy 2023-10-03 14:03:50 +01:00
KhalimCK
885f7ba977
Merge pull request #50 from Hestia-Homes/master
fixed docker file and added instructions
2023-10-03 12:48:24 +01:00
Khalim Conn-Kowlessar
749e824a9d fixed docker file and added instructions 2023-10-03 12:48:00 +01:00
KhalimCK
e9317eda6f
Merge pull request #49 from Hestia-Homes/master
getting docker context to the root
2023-10-03 12:13:07 +01:00
Khalim Conn-Kowlessar
c4d1d074b5 getting docker context to the root 2023-10-03 12:12:36 +01:00
KhalimCK
7026198bd8
Merge pull request #48 from Hestia-Homes/master
Setting aws credentials
2023-10-03 12:03:49 +01:00
Khalim Conn-Kowlessar
21645968ad Setting aws credentials 2023-10-03 12:03:07 +01:00
Github-Bot
a3bd1967b6 Update Registry 2023-10-03 10:54:48 +00:00
KhalimCK
f320b9e0e9
Merge pull request #47 from Hestia-Homes/migrate-deployment
Migrate deployment
2023-10-03 11:54:07 +01:00
Khalim Conn-Kowlessar
a6f9954125 added environment variables to docker 2023-10-03 11:52:24 +01:00
Khalim Conn-Kowlessar
6b96c084c2 added build arguments to github actions 2023-10-03 11:49:08 +01:00
Khalim Conn-Kowlessar
e2ce04aa0d formatted 2023-10-03 11:45:51 +01:00
Michael Duong
c0d73d8b9e optimise to squared error to penalise large errors 2023-09-29 15:53:12 +00:00
Github-Bot
f67b138406 Update Registry 2023-09-29 13:33:21 +00:00
Github-Bot
d448c714ea Update Registry 2023-09-29 13:32:36 +00:00
quandanrepo
363325ccab
Merge pull request #46 from Hestia-Homes/model-test
Model test
2023-09-29 14:31:54 +01:00
Michael Duong
c67edce7cd Merge branch 'master' of github.com:Hestia-Homes/ML into model-test 2023-09-29 13:30:57 +00:00
Michael Duong
ddab96236c add json flag to avoid issues 2023-09-29 13:30:49 +00:00
quandanrepo
f69f42266f
Merge pull request #45 from Hestia-Homes/model-test
add 1000 second model
2023-09-29 14:17:16 +01:00
Michael Duong
c6ef7b514e add 1000 second model 2023-09-29 13:09:07 +00:00
Github-Bot
f46fa2cd66 Update Registry 2023-09-29 12:38:17 +00:00
Github-Bot
ad2c6f0f3f Update Registry 2023-09-29 12:37:30 +00:00
quandanrepo
14750b6bec
Merge pull request #44 from Hestia-Homes/model-test
Model test
2023-09-29 13:36:46 +01:00
Michael Duong
206b6e9d06 add dynaconf to prediction requirements 2023-09-29 11:46:01 +00:00
Michael Duong
39e31d8080 Merge branch 'master' of github.com:Hestia-Homes/ML into model-test 2023-09-29 11:37:48 +00:00
Michael Duong
ba592b36b7 use dynaconf to simplify configs 2023-09-29 11:37:36 +00:00
Github-Bot
c9fdc0d1d6 Update Registry 2023-09-29 10:13:16 +00:00
Github-Bot
1c229f857e Update Registry 2023-09-29 10:12:30 +00:00
61 changed files with 1306 additions and 538 deletions

9
.dockerignore Normal file

@@ -0,0 +1,9 @@
modules/ml-pipeline/src/pipeline/data/predictions
modules/ml-pipeline/src/pipeline/data/fit_predictions
modules/ml-pipeline/src/pipeline/data/prepared_data
modules/ml-pipeline/src/pipeline/data/model/allmodels
modules/ml-pipeline/src/pipeline/metrics
modules/ml-pipeline/src/pipeline/__pycache__
modules/ml-pipeline/src/pipeline/.dvc
modules/ml-pipeline/src/pipeline/analysis
modules/ml-pipeline/src/pipeline/metrics

127
.github/workflows/Deploy.yml vendored Normal file

@@ -0,0 +1,127 @@
name: Sap Change Model Deploy
on:
  push:
    branches: [ sap-dev, sap-prod, heat-dev, heat-prod, carbon-dev, carbon-prod ]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.10.12
      - name: Install Serverless and plugins
        run: |
          npm install -g serverless@^3.38.0
          npm install -g serverless-domain-manager@^7.3.8
      - name: Install DVC
        run: |
          pip install --upgrade pip
          pip install -r modules/ml-pipeline/src/pipeline/requirements/version_control/requirements.txt
      # Set up all of the secrets required for the deployment
      - name: set secret prefix which is used across multiple steps
        id: secret_prefix
        run: |
          # Convert branch name to uppercase and replace hyphens with underscores
          echo "::set-output name=secret_prefix::$(echo "${{ github.ref_name }}" | tr 'a-z-' 'A-Z_')"
      - name: Set domain name
        id: set_domain
        run: echo "::set-output name=domain::${{ secrets[format('{0}_DOMAIN_NAME', steps.secret_prefix.outputs.secret_prefix)] }}"
      - name: Set ECR credentials
        id: set_ecr_credentials
        run: |
          # Fetch the secret using the secret prefix
          echo "::set-output name=ecr_uri::${{ secrets[format('{0}_ECR_URI', steps.secret_prefix.outputs.secret_prefix)] }}"
      - name: Set S3 buckets
        id: set_s3_buckets
        run: |
          # Fetch the secret using the secret prefix
          echo "::set-output name=data_bucket::${{ secrets[format('{0}_DATA_BUCKET', steps.secret_prefix.outputs.secret_prefix)] }}"
          echo "::set-output name=predictions_bucket::${{ secrets[format('{0}_PREDICTIONS_BUCKET', steps.secret_prefix.outputs.secret_prefix)] }}"
      - name: Set stack_name
        id: set_stack_name
        run: |
          # Take branch prefix and add "model" for stack name
          stack_name=$( echo ${{ github.ref_name }} | awk -F"-" '{print $1}' | sed 's/$/model/g')
          if [ -z "${stack_name}" ]; then
            echo "::set-output name=stack_name::"
          else
            echo "::set-output name=stack_name::${stack_name}"
          fi
      - name: Set runtime_environment
        id: set_runtime_environment
        run: |
          # Extract the suffix after the hyphen from the branch name
          runtime_environment=$(echo "${{ github.ref_name }}" | awk -F'-' '{print $NF}')
          echo "::set-output name=runtime_environment::$runtime_environment"
      - name: AWS credentials for dev
        # The comparison must sit inside the expression: `${{ ... }} == 'dev'`
        # is a non-empty string, so the step would always run.
        if: steps.set_runtime_environment.outputs.runtime_environment == 'dev'
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.DEV_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.DEV_AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-2
      - name: AWS credentials for prod
        # Same fix as the dev step above
        if: steps.set_runtime_environment.outputs.runtime_environment == 'prod'
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.PROD_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.PROD_AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-west-2
      - name: DVC Pull
        run: |
          cd modules/ml-pipeline/src/pipeline
          dvc pull -r ${{ steps.set_runtime_environment.outputs.runtime_environment }}
      - name: Setup Docker
        uses: docker/setup-buildx-action@v1
      - name: Login to ECR
        run: |
          aws ecr get-login-password --region eu-west-2 | docker login --username AWS --password-stdin ${{ steps.set_ecr_credentials.outputs.ecr_uri }}
      # Building and pushing Docker image with caching
      - name: Build and push Docker image
        uses: docker/build-push-action@v3
        with:
          context: .
          file: ./deployment/Dockerfile.prediction.lambda
          push: true
          tags: ${{ steps.set_ecr_credentials.outputs.ecr_uri }}:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          platforms: linux/amd64
          provenance: false
          build-args: |
            RUNTIME_ENVIRONMENT=${{ steps.set_runtime_environment.outputs.runtime_environment }}
      - name: Deploy to AWS Lambda via Serverless
        env:
          RUNTIME_ENVIRONMENT: ${{ steps.set_runtime_environment.outputs.runtime_environment }}
          PREDICTIONS_BUCKET: ${{ steps.set_s3_buckets.outputs.predictions_bucket }}
          DATA_BUCKET: ${{ steps.set_s3_buckets.outputs.data_bucket }}
          DOMAIN_NAME: ${{ steps.set_domain.outputs.domain }}
          ECR_URI: ${{ steps.set_ecr_credentials.outputs.ecr_uri }}
          GITHUB_SHA: ${{ github.sha }}
          STACK_NAME: ${{ steps.set_stack_name.outputs.stack_name }}
        run: |
          # Deploy to AWS Lambda via Serverless
          cd deployment
          sls deploy --config serverless.yml --stage ${{ steps.set_runtime_environment.outputs.runtime_environment }} --verbose
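
The deploy workflow derives three values from the branch name (a secret prefix, a stack name and a runtime environment) and publishes them with the `::set-output` workflow command, which GitHub has deprecated in favour of appending `key=value` lines to the file named by `$GITHUB_OUTPUT`. The following is a minimal sketch of the same derivations using that newer mechanism. The function names (`secret_prefix`, `stack_name`, `runtime_environment`, `set_output`) and the fallback to a temporary file are assumptions made for illustration; they are not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch only: every helper name below is hypothetical.

# sap-dev -> SAP_DEV (uppercase, hyphens become underscores)
secret_prefix() {
  printf '%s\n' "$1" | tr 'a-z-' 'A-Z_'
}

# sap-dev -> sapmodel (text before the first hyphen, plus "model")
stack_name() {
  printf '%s\n' "$1" | awk -F'-' '{print $1 "model"}'
}

# carbon-prod -> prod (text after the last hyphen)
runtime_environment() {
  printf '%s\n' "$1" | awk -F'-' '{print $NF}'
}

# Append key=value to the file named by GITHUB_OUTPUT,
# the replacement for `echo "::set-output name=key::value"`.
set_output() {
  printf '%s=%s\n' "$1" "$2" >> "${GITHUB_OUTPUT:?GITHUB_OUTPUT is not set}"
}

# Outside GitHub Actions there is no GITHUB_OUTPUT, so fall back to a temp file.
export GITHUB_OUTPUT="${GITHUB_OUTPUT:-$(mktemp)}"

branch="sap-dev"   # in a workflow this would come from github.ref_name
set_output secret_prefix "$(secret_prefix "$branch")"
set_output stack_name "$(stack_name "$branch")"
set_output runtime_environment "$(runtime_environment "$branch")"

cat "$GITHUB_OUTPUT"
```

Later steps would read these values through `steps.<id>.outputs.<name>` exactly as the workflow already does, so only the `run:` bodies would need to change.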


@@ -10,7 +10,9 @@ on:
  types:
  - closed
  branches:
- - "master"
+ - "sap-dev"
+ - "heat-dev"
+ - "carbon-dev"
  permissions: write-all
@@ -40,7 +42,14 @@ jobs:
  if [ -z "${latest_version}" ]; then
  increment_version="1.0.0"
  else
- increment_version=$(echo ${latest_version} | awk -F'.' '{OFS="."; $1+=1; print}')
+ increment_version=$(echo ${latest_version} | awk 'BEGIN {
+ FS="\\." # Set the field separator to a period
+ OFS="." # Set the output field separator to a period
+ }
+ {
+ major = $1 + 1 # Increment the major version
+ print major, "0", "0" # Print the new version
+ }')
  fi
  new_tag=${REGISTER_MODEL_NAME}@v${increment_version}
@@ -48,7 +57,7 @@ jobs:
  git tag -a ${new_tag} -m "Registering new Major Version"
  git push origin ${new_tag}
- gto show > MODEL_REGISTRY.md
+ gto show --json > MODEL_REGISTRY.md
  git add .
  git commit -m "Update Registry"
  git push
@@ -78,7 +87,14 @@ jobs:
  if [ -z "${latest_version}" ]; then
  increment_version="0.1.0"
  else
- increment_version=$(echo ${latest_version} | awk 'BEGIN{FS=OFS="."} {$2++; print}')
+ increment_version=$(echo ${latest_version} | awk 'BEGIN {
+ FS="\\." # Set the field separator to a period
+ OFS="." # Set the output field separator to a period
+ }
+ {
+ minor = $2 + 1 # Increment the minor version
+ print $1, minor, "0" # Print the new version
+ }')
  fi
  new_tag=${REGISTER_MODEL_NAME}@v${increment_version}
@@ -86,7 +102,7 @@ jobs:
  git tag -a ${new_tag} -m "Registering new Minor Version"
  git push origin ${new_tag}
- gto show > MODEL_REGISTRY.md
+ gto show --json > MODEL_REGISTRY.md
  git add .
  git commit -m "Update Registry"
  git push
@@ -116,7 +132,14 @@ jobs:
  if [ -z "${latest_version}" ]; then
  increment_version="0.0.1"
  else
- increment_version=$(echo ${latest_version} | awk 'BEGIN{FS=OFS="."} {$3++; print}')
+ increment_version=$(echo ${latest_version} | awk 'BEGIN {
+ FS="\\." # Set the field separator to a period
+ OFS="." # Set the output field separator to a period
+ }
+ {
+ patch = $3 + 1 # Increment the patch version
+ print $1, $2, patch # Print the new version
+ }')
  fi
  new_tag=${REGISTER_MODEL_NAME}@v${increment_version}
@@ -124,7 +147,7 @@ jobs:
  git tag -a ${new_tag} -m "Registering new Patch Version"
  git push origin ${new_tag}
- gto show > MODEL_REGISTRY.md
+ gto show --json > MODEL_REGISTRY.md
  git add .
  git commit -m "Update Registry"
  git push
@@ -176,6 +199,8 @@ jobs:
  pip install -r modules/ml-pipeline/src/pipeline/requirements/version_control/requirements.txt
  - name: Register Model
+ env:
+ TARGET_BRANCH: ${{ github.base_ref }}
  run: |
  REGISTER_MODEL_NAME=$(echo ${{ github.event.pull_request.head.ref }} | awk -F"-" '{print $1}')
@@ -184,7 +209,7 @@ jobs:
  git config user.name "Github-Bot"
  git config user.email "Github-Bot@no-reply.com"
- latest_dev_version=$(gto history ${REGISTER_MODEL_NAME} --asc --plain | awk '{print $NF}' | awk '/dev/')
+ latest_dev_version=$(gto history ${REGISTER_MODEL_NAME} --asc --plain | awk '{print $NF}' | awk '/dev/' | awk 'END {print}')
  if [ -z "${latest_dev_version}" ]; then
  increment_version="1"
  else
@@ -192,7 +217,7 @@ jobs:
  fi
  new_tag=${REGISTER_MODEL_NAME}#dev#${increment_version}
- latest_version=$(gto show model@latest --ref | awk -F"@" '{print $2}')
+ latest_version=$(gto show ${REGISTER_MODEL_NAME}@latest --ref | awk -F"@" '{print $2}')
  echo ${new_tag}
@@ -203,11 +228,11 @@ jobs:
  git tag -a ${new_tag} -m "Assigning stage dev to artifact ${REGISTER_MODEL_NAME} version ${latest_version}"
  git push origin ${new_tag}
- git checkout master
+ git checkout ${TARGET_BRANCH}
  git fetch --all
  git pull
- gto show > MODEL_REGISTRY.md
+ gto show --json > MODEL_REGISTRY.md
  git add .
  git commit -m "Update Registry"
- git push origin master
+ git push origin ${TARGET_BRANCH}

View file

@ -5,7 +5,7 @@ on:
# branches: # branches:
# - "model-**" # - "model-**"
pull_request: pull_request:
branches: [ "master" ] branches: ["sap-dev", "heat-dev", "carbon-dev"]
label: label:
types: ["created", "edited"] types: ["created", "edited"]
@ -89,13 +89,24 @@ jobs:
AWS_ACCESS_KEY_ID: ${{ secrets.ROBOT_AWS_ACCESS_KEY_ID }} AWS_ACCESS_KEY_ID: ${{ secrets.ROBOT_AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.ROBOT_AWS_SECRET_ACCESS_KEY }} AWS_SECRET_ACCESS_KEY: ${{ secrets.ROBOT_AWS_SECRET_ACCESS_KEY }}
REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
TARGET_BRANCH: ${{ github.base_ref }}
run: | run: |
cd modules/ml-pipeline/src/pipeline cd modules/ml-pipeline/src/pipeline
echo "## Model metrics" > report.md echo "## Model metrics" > report.md
# Compare metrics to master # Compare metrics to master
git fetch --depth=1 origin master:master git fetch --depth=1 origin ${TARGET_BRANCH}:${TARGET_BRANCH}
dvc metrics diff --md --all master >> report.md dvc metrics diff --md --all ${TARGET_BRANCH} >> report.md
echo "## Scenario comparison" >> report.md
cat metrics/scenario_table.md >> report.md
echo "" >> report.md
echo "## Scenario metrics" >> report.md
cat metrics/scenario_metrics.md >> report.md
cml comment create report.md cml comment create report.md

View file

@ -1,5 +1,34 @@
╒════════╤══════════╤═════════╕ {
│ name │ latest │ #dev "model": {
╞════════╪══════════╪═════════╡ "version": "v12.10.12",
│ model │ v10.9.1 │ v10.9.1 │ "stage": {
╘════════╧══════════╧═════════╛ "dev": "v11.10.12"
},
"registered": true,
"active": true
},
"sap": {
"version": "v0.14.0",
"stage": {
"dev": "v0.14.0"
},
"registered": true,
"active": true
},
"heat": {
"version": "v0.5.0",
"stage": {
"dev": "v0.5.0"
},
"registered": true,
"active": true
},
"carbon": {
"version": "v0.5.0",
"stage": {
"dev": "v0.5.0"
},
"registered": true,
"active": true
}
}

View file

@ -10,14 +10,76 @@ tracking and a model registry
- A bolt-on service that can implement model monitoring - A bolt-on service that can implement model monitoring
There are multiple protected branches which adapt the generic pipeline to produce different models: There are multiple protected branches which adapt the generic pipeline to produce different models:
- sap_change-** - sap-{dev/staging/prod}-**
- heat_change-** - heat-{dev/staging/prod}-**
- carbon_change-** - carbon-{dev/staging/prod}-**
These branches will differ by the configuration files that define the data used and the outputs of the ML-pipeline These branches will differ by the configuration files that define the data used and the outputs of the ML-pipeline
- There can be different additional logic for each branch but the pipeline will be the same. - There can be different additional logic for each branch but the pipeline will be the same.
# Deployment # Deployment
TBD Scripts associated to deployment can be found in the deployment/ folder.
Deployment is automated via GitHub Actions: a deployment is triggered by a push to one of the
protected branches with dev or prod as the suffix, which identifies the target environment.
The GitHub Actions workflow builds and pushes a Docker image to ECR and then deploys a Lambda
which produces predictions for the relevant model.
For this to work, some key environment variables need to be added to GitHub
secrets. Each model and protected branch has its own set of secrets, which allows for flexibility
between different pipelines.
For example, for the branch sap-dev the prefix is SAP_DEV, and the following secrets are required:
- {prefix}_ECR_URI, the URI of the ECR repository to push to. For example, for the
sap change model this is the lambda-sap-prediction-dev repository.
- {prefix}_DOMAIN_NAME, the custom domain name. This is likely to be the same across the different
models, but is still included in the secrets for flexibility.
- {prefix}_DATA_BUCKET, the name of the S3 bucket where data to be scored by the model is stored
- {prefix}_MODEL_BUCKET, the name of the S3 bucket where the model is stored
- {prefix}_PREDICTIONS_BUCKET, the name of the S3 bucket where the predictions are stored
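The naming convention above can be sketched in a few lines of Python. This is only an illustration: the helper `secret_names` and the constant `SECRET_SUFFIXES` are hypothetical and not part of the repository, and the sketch assumes the prefix is always the branch name upper-cased with hyphens replaced by underscores, as in the sap-dev example.

```python
# Hypothetical helper (not part of the repository): lists the GitHub secret
# names a protected branch is expected to have, following the README convention.
SECRET_SUFFIXES = (
    "ECR_URI",
    "DOMAIN_NAME",
    "DATA_BUCKET",
    "MODEL_BUCKET",
    "PREDICTIONS_BUCKET",
)


def secret_names(branch: str) -> list[str]:
    """Return the secret names expected for a branch such as 'sap-dev'."""
    # Assumption: 'sap-dev' maps to the prefix 'SAP_DEV'.
    prefix = branch.upper().replace("-", "_")
    return [f"{prefix}_{suffix}" for suffix in SECRET_SUFFIXES]


print(secret_names("sap-dev"))
```

A check like this could be run before a first deployment to confirm that every expected secret has been created for a new branch.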
# Building and Testing the Prediction Lambda Function Locally
TODO: Generalise these instructions for the various different pipelines
This guide outlines the steps to build and test the Lambda function locally using Docker. These instructions assume you're working with a machine that has Docker installed.
### Prerequisites
Docker: Make sure Docker is installed and running on your machine.
AWS Credentials: Ensure you have AWS credentials set up on your local machine, typically stored
in ~/.aws/credentials.
Root Directory: All commands should be run from the root directory of the repository.
Step-by-Step Guide
1. Building the Docker Image
First, navigate to the root directory of the repository. Open a terminal and execute the following
command to build the Docker image:
```bash
docker build -t sap -f deployment/Dockerfile.prediction.lambda .
```
This will build a Docker image tagged as sap using the Dockerfile.prediction.lambda located
in the deployment directory.
2. Running the Docker Image
Once the image is built, you can run it using the following command:
```bash
docker run -p 9000:8080 -v ~/.aws/credentials:/root/.aws/credentials:ro -e RUNTIME_ENVIRONMENT=dev -e PREDICTIONS_BUCKET=retrofit-sap-predictions-dev sap
```
This command does the following:
Maps port 9000 on your local machine to port 8080 on the Docker container.
Mounts your AWS credentials into the Docker container in read-only mode.
Sets the RUNTIME_ENVIRONMENT variable to dev.
Sets the PREDICTIONS_BUCKET variable to retrofit-sap-predictions-dev.
3. Testing the Lambda Function
To test the Lambda function, use the following curl command:
```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{"body": "{\"file_location\": \"s3://retrofit-data-dev/sap_change_model/one_sample_test_dataset.parquet\", \"property_id\": 1, \"portfolio_id\": 4, \"created_at\": \"now\"}"}'
```
This will send a POST request to the running Lambda function and pass in the required data as JSON.
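The same request can also be built from Python, which avoids hand-escaping the nested JSON. The sketch below is a minimal illustration rather than part of the repository: `build_event` and `invoke` are hypothetical helper names, and the URL is the one used in the curl command, so it assumes the container from step 2 is running locally.

```python
import json
import urllib.request

# Assumption: the local endpoint exposed by the container started in step 2.
LOCAL_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_event(
    file_location: str, property_id: int, portfolio_id: int, created_at: str
) -> bytes:
    """Build the invocation payload with the body encoded as a JSON string."""
    body = {
        "file_location": file_location,
        "property_id": property_id,
        "portfolio_id": portfolio_id,
        "created_at": created_at,
    }
    # The body is serialised to a string first, mirroring the curl payload.
    return json.dumps({"body": json.dumps(body)}).encode("utf-8")


def invoke(event: bytes, url: str = LOCAL_URL) -> dict:
    """POST the event to the locally running Lambda and decode the response."""
    request = urllib.request.Request(url, data=event, method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the container running, `invoke(build_event("s3://retrofit-data-dev/sap_change_model/one_sample_test_dataset.parquet", 1, 4, "now"))` sends the same request as the curl command.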

9
deployment/.dockerignore Normal file
View file

@ -0,0 +1,9 @@
modules/ml-pipeline/src/pipeline/data/predictions
modules/ml-pipeline/src/pipeline/data/fit_predictions
modules/ml-pipeline/src/pipeline/data/prepared_data
modules/ml-pipeline/src/pipeline/data/model/allmodels
modules/ml-pipeline/src/pipeline/metrics
modules/ml-pipeline/src/__pycache__
modules/ml-pipeline/src/.dvc
modules/ml-pipeline/src/analysis
modules/ml-pipeline/src/metrics

View file

@ -0,0 +1,25 @@
FROM public.ecr.aws/lambda/python:3.10
# Set the working directory
WORKDIR ${LAMBDA_TASK_ROOT}
ENV PYTHONPATH "${PYTHONPATH}:${LAMBDA_TASK_ROOT}"
# Environment variables
ARG RUNTIME_ENVIRONMENT
ENV RUNTIME_ENVIRONMENT=${RUNTIME_ENVIRONMENT}
# Install necessary build tools - required to test locally
RUN yum install -y gcc python3-devel gcc-c++
# Install python packages
COPY modules/ml-pipeline/src/pipeline/requirements/predictions/requirements.txt ./requirements.txt
RUN pip install --no-cache-dir -r ./requirements.txt
# Copy the project code
COPY modules/ml-pipeline/src/pipeline ./pipeline
# Copy the handler
COPY deployment/handlers/prediction_app.py ./pipeline/prediction_app.py
WORKDIR ${LAMBDA_TASK_ROOT}/pipeline
CMD [ "prediction_app.handler" ]

View file

@ -0,0 +1,123 @@
"""
This script is the handler for the lambda prediction function, responsible
for producing predictions for a model
"""
import boto3
from botocore.exceptions import NoCredentialsError
import json
from io import StringIO
import os
import logging
from generate_predictions import generate_predictions
from core.MLModels import model_factory
from config import settings
from core.DataClient import dataclient_factory
logger = logging.getLogger()
logger.setLevel(logging.INFO)
PREDICTIONS_BUCKET = os.getenv("PREDICTIONS_BUCKET", None)
def upload_dataframe_to_s3(df, bucket, s3_file_name):
"""
Upload a pandas DataFrame to an S3 bucket as CSV
:param df: DataFrame to upload
:param bucket: Bucket to upload to
:param s3_file_name: S3 object name
:return: True if file was uploaded, else False
"""
# Initialize the S3 client
s3 = boto3.client("s3")
csv_buffer = StringIO()
# Write the DataFrame to the buffer as CSV
df.to_csv(csv_buffer, index=False)
try:
# Upload the CSV from the buffer to S3
s3.put_object(Bucket=bucket, Key=s3_file_name, Body=csv_buffer.getvalue())
print(f"Successfully uploaded DataFrame to {bucket}/{s3_file_name}")
return True
except NoCredentialsError:
print("Credentials not available")
return False
def handler(event, context):
"""
Take in event and trigger the prediction pipeline
"""
logger.info("received event: " + str(event))
try:
body = (
json.loads(event["body"])
if not isinstance(event["body"], dict)
else event["body"]
)
property_id = body["property_id"]
portfolio_id = body["portfolio_id"]
created_at = body["created_at"]
# TODO: Implement the loading of the model and prediction
storage_filepath = f"s3://{PREDICTIONS_BUCKET}/{portfolio_id}/{property_id}/{created_at}.parquet"
logger.info(f"--- Initiate MLModel ---")
build_model_params = settings.build_model
client_params = settings.client
feature_process_params = settings.feature_processor
generate_predictions_params = settings.generate_predictions
model = model_factory(build_model_params["model_type"])
logger.info(f"--- Initiate Input DataClient ---")
input_dataclient = dataclient_factory(
dataclient_type="aws-s3",
dataclient_config=client_params["aws-s3"],
)
logger.info(f"--- Initiate Output DataClient ---")
output_dataclient = dataclient_factory(
dataclient_type="aws-s3",
dataclient_config=client_params["aws-s3"],
)
generate_predictions(
input_dataclient=input_dataclient,
output_dataclient=output_dataclient,
model=model,
target=feature_process_params["feature_processor_config"]["target"],
model_filepath=build_model_params["model_save_filepath"],
test_data_filepath=body["file_location"],
predictions_output_filepath=storage_filepath,
predictions_column_name=generate_predictions_params[
"predictions_column_name"
],
identifier_column=generate_predictions_params["identifier_column"],
)
return {
"statusCode": 200,
"body": json.dumps(
{
"message": "Successfully processed input",
"storage_filepath": storage_filepath,
}
),
}
except Exception as e:  # Exception already covers KeyError and ValueError
logger.info("Prediction failed")
logger.info(e)
return {
"statusCode": 500,
"body": json.dumps({"message": "Prediction failed", "error": str(e)}),
}

53
deployment/serverless.yml Normal file
View file

@ -0,0 +1,53 @@
service: ${env:STACK_NAME}
provider:
name: aws
region: eu-west-2
architecture: x86_64
environment:
RUNTIME_ENVIRONMENT: ${env:RUNTIME_ENVIRONMENT}
PREDICTIONS_BUCKET: ${env:PREDICTIONS_BUCKET}
DATA_BUCKET: ${env:DATA_BUCKET}
DOMAIN_NAME: ${env:DOMAIN_NAME}
ECR_URI: ${env:ECR_URI}
GITHUB_SHA: ${env:GITHUB_SHA}
iam:
role:
name: ${env:STACK_NAME}_s3_access
statements:
# Allow reading from the DATA_BUCKET
- Effect: Allow
Action:
- s3:*
Resource:
- arn:aws:s3:::${env:DATA_BUCKET}
- arn:aws:s3:::${env:DATA_BUCKET}/*
# Allow reading and writing to PREDICTIONS_BUCKET
- Effect: Allow
Action:
- s3:*
Resource:
- arn:aws:s3:::${env:PREDICTIONS_BUCKET}
- arn:aws:s3:::${env:PREDICTIONS_BUCKET}/*
plugins:
- serverless-domain-manager
custom:
customDomain:
domainName: api.${self:provider.environment.DOMAIN_NAME}
basePath: ${env:STACK_NAME}
createRoute53Record: true
certificateArn: ${ssm:/ssl_certificate_arn}
functions:
sap_prediction_lambda:
image:
uri: ${env:ECR_URI}:${env:GITHUB_SHA}
events:
- http:
path: /predict
method: POST
timeout: 120 # Set max run time to 2 minutes - we shouldn't need this much time so this can be reviewed

View file

@ -1,3 +0,0 @@
/config.local
/tmp
/cache

View file

@ -1,2 +0,0 @@
['remote "myremote"']
url = /tmp/dvcstore

View file

@ -1,3 +0,0 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore

View file

@ -3,3 +3,4 @@
__pycache__/ __pycache__/
.DS_Store .DS_Store
.vscode/ .vscode/
data/

View file

@ -1,2 +0,0 @@
# .gto config file
stages: [dev, stage, prod] # list of allowed Stages

View file

@ -9,16 +9,16 @@ init: dev-conda
.PHONY: dev-conda .PHONY: dev-conda
dev-conda: dev-conda:
# conda deactivate || echo "Not in conda environment" # conda deactivate || echo "Not in conda environment"
# conda remove --name $CONDA_ENV --all -y || echo "No environment created previously" # conda remove --name ${CONDA_ENV} --all -y || echo "No environment created previously"
conda create --name $CONDA_ENV python=$(PYTHON_VERSION) -y conda create --name ${CONDA_ENV} python=$(PYTHON_VERSION) -y
conda init bash conda init bash
conda run -vvvv -n $CONDA_ENV pip install --upgrade pip conda run -v -n ${CONDA_ENV} pip install --upgrade pip
conda run -vvvv -n $CONDA_ENV pip install -r src/pipeline/requirements/training/requirements-dev.txt conda run -v -n ${CONDA_ENV} pip install -r src/pipeline/requirements/training/requirements-dev.txt
conda run -vvvv -n $CONDA_ENV pip install -r src/pipeline/requirements/version_control/requirements.txt conda run -v -n ${CONDA_ENV} pip install -r src/pipeline/requirements/version_control/requirements.txt
conda run -vvvv -n $CONDA_ENV pre-commit install conda run -v -n ${CONDA_ENV} pre-commit install
conda run -vvvv -n $CONDA_ENV pip install ipykernel conda run -v -n ${CONDA_ENV} pip install ipykernel
echo "TO ACTIVATE ENVIRONMENT, USE THE FOLLOWING COMMAND" echo "TO ACTIVATE ENVIRONMENT, USE THE FOLLOWING COMMAND"
echo "conda activate $CONDA_ENV" echo "conda activate ${CONDA_ENV}"
.PHONY: dev-pyenv .PHONY: dev-pyenv

View file

@ -0,0 +1,8 @@
pipeline/data/predictions
pipeline/data/fit_predictions
pipeline/data/prepared_data/train.parquet
pipeline/data/fit_predictions
pipeline/data/model/allmodels
pipeline/metrics
pipeline/.dvc
pipeline/analysis

View file

@ -1,7 +1,7 @@
# Dockerfile that can be used to test loading a model to generate a prediction (part of CI/CD flow) # Dockerfile that can be used to test loading a model to generate a prediction (part of CI/CD flow)
FROM python:3.10.12-slim FROM python:3.10.12-slim
RUN apt-get update && apt-get install -y libgomp1 RUN apt-get update && apt-get install -y libgomp1 gcc python3-dev
COPY pipeline/requirements/predictions/requirements.txt requirements.txt COPY pipeline/requirements/predictions/requirements.txt requirements.txt

View file

@ -1,3 +1,3 @@
# The generic reproducible ML-pipeline # The generic reproducible ML-pipeline
Pipeline required to build a model to produce an output Pipeline required to build a model to produce an output, that gets hashed via DVC

View file

@ -0,0 +1,3 @@
# Ignore dynaconf secret files
.secrets.*

View file

@ -6,9 +6,9 @@ import shutil
import yaml import yaml
from pathlib import Path from pathlib import Path
from core.Logger import logger from core.Logger import logger
from config import settings
startup_cleanup_path = Path(__file__).parent / "configs" / "startup_cleanup.yaml" startup_cleanup_params = settings.startup_cleanup
startup_cleanup_params = yaml.safe_load(open(startup_cleanup_path))
def run_cleanup(artefacts_directory: str, metrics_directory: str) -> None: def run_cleanup(artefacts_directory: str, metrics_directory: str) -> None:
@ -16,13 +16,9 @@ def run_cleanup(artefacts_directory: str, metrics_directory: str) -> None:
Remove the directory where artefacts are stored Remove the directory where artefacts are stored
""" """
logger.info("---------------------")
logger.info(f"--- Run Clean up ---") logger.info(f"--- Run Clean up ---")
logger.info("---------------------")
logger.info("-------------------------")
logger.info(f"--- Delete artefacts ---") logger.info(f"--- Delete artefacts ---")
logger.info("-------------------------")
artefact_directory_path = Path(artefacts_directory) artefact_directory_path = Path(artefacts_directory)
@ -31,9 +27,7 @@ def run_cleanup(artefacts_directory: str, metrics_directory: str) -> None:
logger.info(f"Removing the directory: {artefacts_directory}") logger.info(f"Removing the directory: {artefacts_directory}")
shutil.rmtree(artefact_directory_path) shutil.rmtree(artefact_directory_path)
logger.info("-----------------------")
logger.info(f"--- Delete metrics ---") logger.info(f"--- Delete metrics ---")
logger.info("-----------------------")
metrics_directory_path = Path(metrics_directory) metrics_directory_path = Path(metrics_directory)
@ -45,15 +39,11 @@ def run_cleanup(artefacts_directory: str, metrics_directory: str) -> None:
if __name__ == "__main__": if __name__ == "__main__":
logger.info("----------------------------")
logger.info(f"--- {__file__} - Start! ---") logger.info(f"--- {__file__} - Start! ---")
logger.info("----------------------------")
run_cleanup( run_cleanup(
artefacts_directory=startup_cleanup_params["artefacts"], artefacts_directory=startup_cleanup_params["artefacts"],
metrics_directory=startup_cleanup_params["metrics"], metrics_directory=startup_cleanup_params["metrics"],
) )
logger.info("-------------------------------")
logger.info(f"--- {__file__} - Complete! ---") logger.info(f"--- {__file__} - Complete! ---")
logger.info("-------------------------------")

View file

@ -15,21 +15,15 @@ from configs.feature_processor_logic import business_logic, new_feature_funcs
from core.Logger import logger from core.Logger import logger
from core.DataClient import dataclient_factory from core.DataClient import dataclient_factory
from core.FeatureProcessor import feature_processor_factory from core.FeatureProcessor import feature_processor_factory
from config import settings
logger.info("----------------------------")
logger.info(f"--- Initiate Parameters ---") logger.info(f"--- Initiate Parameters ---")
logger.info("----------------------------")
RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local") RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local")
client_path = Path(__file__).parent / "configs" / "client.yaml" client_params = settings.client
client_params = yaml.safe_load(open(client_path)) prepare_data_params = settings.prepare_data
feature_process_params = settings.feature_processor
prepare_data_path = Path(__file__).parent / "configs" / "prepare_data.yaml"
prepare_data_params = yaml.safe_load(open(prepare_data_path))
feature_process_path = Path(__file__).parent / "configs" / "feature_processor.yaml"
feature_process_params = yaml.safe_load(open(feature_process_path))
data_filepath = prepare_data_params["data_filepath"] data_filepath = prepare_data_params["data_filepath"]
train_proportion = prepare_data_params["train_proportion"] train_proportion = prepare_data_params["train_proportion"]
@ -37,9 +31,7 @@ output_train_filepath = prepare_data_params["output_train_filepath"]
output_test_filepath = prepare_data_params["output_test_filepath"] output_test_filepath = prepare_data_params["output_test_filepath"]
feature_processor_config = feature_process_params["feature_processor_config"] feature_processor_config = feature_process_params["feature_processor_config"]
logger.info("----------------------------")
logger.info(f"--- Initiate DataClient ---") logger.info(f"--- Initiate DataClient ---")
logger.info("----------------------------")
input_dataclient_type = prepare_data_params["input_dataclient_type"] input_dataclient_type = prepare_data_params["input_dataclient_type"]
output_dataclient_type = prepare_data_params["output_dataclient_type"] output_dataclient_type = prepare_data_params["output_dataclient_type"]
@ -53,9 +45,7 @@ output_dataclient = dataclient_factory(
dataclient_config=client_params[output_dataclient_type], dataclient_config=client_params[output_dataclient_type],
) )
logger.info("----------------------------------")
logger.info(f"--- Initiate FeatureProcessor ---") logger.info(f"--- Initiate FeatureProcessor ---")
logger.info("----------------------------------")
feature_processor = feature_processor_factory( feature_processor = feature_processor_factory(
feature_process_params["feature_processor_type"] feature_process_params["feature_processor_type"]
@ -80,15 +70,11 @@ def prepare_data(
:param pipeline_mode: bool, Default False, this caches out the file for experimentation, objects returned in pipeline mode :param pipeline_mode: bool, Default False, this caches out the file for experimentation, objects returned in pipeline mode
""" """
logger.info("--------------------")
logger.info("--- Loading data ---") logger.info("--- Loading data ---")
logger.info("--------------------")
data = input_dataclient.load_data(location=data_filepath, load_config={}) data = input_dataclient.load_data(location=data_filepath, load_config={})
logger.info("--------------------------")
logger.info("--- Feature Processing ---") logger.info("--- Feature Processing ---")
logger.info("--------------------------")
data = feature_processor.feature_process( data = feature_processor.feature_process(
data, data,
@ -97,13 +83,12 @@ def prepare_data(
new_feature_funcs=new_feature_funcs, new_feature_funcs=new_feature_funcs,
) )
logger.info("----------------------")
logger.info("--- Splitting data ---") logger.info("--- Splitting data ---")
logger.info("----------------------")
if train_proportion == 1: if train_proportion == 1:
train = data train = data
test = None # Sample 10% of the data for testing
test = data.sample(round(len(data) * 0.1))
else: else:
train, test = train_test_split( train, test = train_test_split(
data, train_size=train_proportion, test_size=(1 - train_proportion) data, train_size=train_proportion, test_size=(1 - train_proportion)
@ -112,9 +97,7 @@ def prepare_data(
train = train.reset_index(drop=True) train = train.reset_index(drop=True)
logger.info("-----------------------")
logger.info("--- Outputting data ---") logger.info("--- Outputting data ---")
logger.info("-----------------------")
output_dataclient.save_data( output_dataclient.save_data(
obj=train, location=output_train_filepath, save_config=None obj=train, location=output_train_filepath, save_config=None
@ -130,13 +113,9 @@ def prepare_data(
if __name__ == "__main__": if __name__ == "__main__":
logger.info("----------------------------")
logger.info(f"--- {__file__} - Start! ---") logger.info(f"--- {__file__} - Start! ---")
logger.info("----------------------------")
logger.info("---------------------------")
logger.info(f"--- Prepare Data Stage ---") logger.info(f"--- Prepare Data Stage ---")
logger.info("---------------------------")
prepare_data( prepare_data(
input_dataclient=input_dataclient, input_dataclient=input_dataclient,
@ -151,6 +130,4 @@ if __name__ == "__main__":
new_feature_funcs=new_feature_funcs, new_feature_funcs=new_feature_funcs,
) )
logger.info("-------------------------------")
logger.info(f"--- {__file__} - Complete! ---") logger.info(f"--- {__file__} - Complete! ---")
logger.info("-------------------------------")

View file

@ -6,7 +6,7 @@ Once we have the features, we build a model
import os import os
import yaml import yaml
import pandas as pd import pandas as pd
from typing import Union from typing import Union, List
from pathlib import Path from pathlib import Path
from core.Logger import logger from core.Logger import logger
from core.interface.InterfaceMetrics import MLMetrics from core.interface.InterfaceMetrics import MLMetrics
@ -16,49 +16,41 @@ from core.DataClient import dataclient_factory
from core.MLModels import model_factory from core.MLModels import model_factory
from core.MLMetrics import metrics_factory from core.MLMetrics import metrics_factory
from configs.post_prediction_logic import post_prediction_logic from configs.post_prediction_logic import post_prediction_logic
from config import settings
logger.info("----------------------------")
logger.info(f"--- Initiate Parameters ---") logger.info(f"--- Initiate Parameters ---")
logger.info("----------------------------")
RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local") RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local")
prepare_data_path = Path(__file__).parent / "configs" / "prepare_data.yaml" prepare_data_params = settings.prepare_data
prepare_data_params = yaml.safe_load(open(prepare_data_path)) build_model_params = settings.build_model
feature_process_params = settings.feature_processor
build_model_path = Path(__file__).parent / "configs" / "build_model.yaml" generate_metrics_params = settings.generate_metrics
build_model_params = yaml.safe_load(open(build_model_path)) generate_predictions_params = settings.generate_predictions
feature_process_path = Path(__file__).parent / "configs" / "feature_processor.yaml"
feature_process_params = yaml.safe_load(open(feature_process_path))
generate_metrics_path = Path(__file__).parent / "configs" / "generate_metrics.yaml"
generate_metrics_params = yaml.safe_load(open(generate_metrics_path))
model_type = build_model_params["model_type"] model_type = build_model_params["model_type"]
target = feature_process_params["feature_processor_config"]["target"] target = feature_process_params["feature_processor_config"]["target"]
fit_predictions_filepath = build_model_params["fit_predictions_filepath"]
predictions_column_name = generate_predictions_params["predictions_column_name"]
identifier_columns = feature_process_params["feature_processor_config"][
"identifier_columns"
]
model_save_location = build_model_params["model_save_filepath"] model_save_location = build_model_params["model_save_filepath"]
model_hyperparameters = build_model_params[model_type] model_hyperparameters = build_model_params[model_type]
train_filepath = prepare_data_params["output_train_filepath"] train_filepath = prepare_data_params["output_train_filepath"]
test_filepath = prepare_data_params["output_test_filepath"] test_filepath = prepare_data_params["output_test_filepath"]
fit_metrics_filepath = build_model_params["fit_metrics_filepath"] fit_metrics_filepath = build_model_params["fit_metrics_filepath"]
logger.info("----------------------------")
logger.info(f"--- Initiate DataClient ---") logger.info(f"--- Initiate DataClient ---")
logger.info("----------------------------")
# Output of previous prepare data step, will be where the data is # Output of previous prepare data step, will be where the data is
dataclient = dataclient_factory(prepare_data_params["output_dataclient_type"]) dataclient = dataclient_factory(prepare_data_params["output_dataclient_type"])
logger.info("-------------------------")
logger.info(f"--- Initiate MLModel ---") logger.info(f"--- Initiate MLModel ---")
logger.info("-------------------------")
model = model_factory(model_type) model = model_factory(model_type)
logger.info("-------------------------")
logger.info(f"--- Initiate Metrics ---") logger.info(f"--- Initiate Metrics ---")
logger.info("-------------------------")
metrics = metrics_factory(generate_metrics_params["metrics_type"]) metrics = metrics_factory(generate_metrics_params["metrics_type"])
@ -68,8 +60,11 @@ def build_model(
model: MLModel, model: MLModel,
metrics: MLMetrics, metrics: MLMetrics,
target: str, target: str,
identifier_columns: List[str],
model_save_location: str, model_save_location: str,
model_hyperparameters: dict, model_hyperparameters: dict,
fit_predictions_filepath: str,
predictions_column_name: str,
fit_metrics_filepath: str, fit_metrics_filepath: str,
train_filepath: Union[str, None] = None, train_filepath: Union[str, None] = None,
test_filepath: Union[str, None] = None, test_filepath: Union[str, None] = None,
@ -77,9 +72,7 @@ def build_model(
test_data: Union[pd.DataFrame, None] = None, test_data: Union[pd.DataFrame, None] = None,
pipeline_mode: bool = False, pipeline_mode: bool = False,
): ):
logger.info("--------------------------------------")
logger.info("--- Loading Data for build process ---") logger.info("--- Loading Data for build process ---")
logger.info("--------------------------------------")
if train_data is None: if train_data is None:
if train_filepath is None: if train_filepath is None:
@ -91,42 +84,41 @@ def build_model(
raise ValueError(f"Need {test_filepath} if no data supplied") raise ValueError(f"Need {test_filepath} if no data supplied")
test_data = dataclient.load_data(location=test_filepath, load_config=None) test_data = dataclient.load_data(location=test_filepath, load_config=None)
logger.info("----------------------")
logger.info("--- Training model ---") logger.info("--- Training model ---")
logger.info("----------------------")
model.train_model( model.train_model(
data=train_data, target=target, model_hyperparameters=model_hyperparameters data=train_data.drop(columns=identifier_columns),
target=target,
model_hyperparameters=model_hyperparameters,
) )
logger.info("----------------------------------")
logger.info("--- Generating fit predictions ---") logger.info("--- Generating fit predictions ---")
logger.info("----------------------------------")
prediction_data = train_data.drop(columns=target)
fit_predictions = model.predict( fit_predictions = model.predict(
data=prediction_data, post_prediction_logic=post_prediction_logic data=train_data, post_prediction_logic=post_prediction_logic
)
logger.info("--- Saving fit predictions ---")
predictions_df = pd.DataFrame(fit_predictions)
predictions_df.columns = [predictions_column_name]
dataclient.save_data(
obj=predictions_df, location=fit_predictions_filepath, save_config=None
) )
logger.info("------------------------------")
logger.info("--- Generating fit metrics ---") logger.info("--- Generating fit metrics ---")
logger.info("------------------------------")
metrics_output = metrics.generate_metrics( metrics_output = metrics.generate_metrics(
target=train_data[target], target=train_data[target],
predictions=pd.Series(fit_predictions), predictions=pd.Series(fit_predictions),
) )
logger.info("--------------------")
logger.info("--- Saving model ---") logger.info("--- Saving model ---")
logger.info("--------------------")
model.save_model(path=Path(model_save_location)) model.save_model(path=Path(model_save_location))
logger.info("--------------------------")
logger.info("--- Saving fit metrics ---") logger.info("--- Saving fit metrics ---")
logger.info("--------------------------")
dataclient.save_data( dataclient.save_data(
obj=metrics_output, location=fit_metrics_filepath, save_config=None obj=metrics_output, location=fit_metrics_filepath, save_config=None
@ -135,26 +127,23 @@ def build_model(
if __name__ == "__main__": if __name__ == "__main__":
logger.info("----------------------------")
logger.info(f"--- {__file__} - Start! ---") logger.info(f"--- {__file__} - Start! ---")
logger.info("----------------------------")
logger.info("--------------------------")
logger.info(f"--- Build Model Stage ---") logger.info(f"--- Build Model Stage ---")
logger.info("--------------------------")
build_model( build_model(
dataclient=dataclient, dataclient=dataclient,
model=model, model=model,
metrics=metrics, metrics=metrics,
target=target, target=target,
identifier_columns=identifier_columns,
model_save_location=model_save_location, model_save_location=model_save_location,
model_hyperparameters=build_model_params[model_type], model_hyperparameters=model_hyperparameters,
train_filepath=model_hyperparameters, train_filepath=train_filepath,
test_filepath=test_filepath, test_filepath=test_filepath,
fit_metrics_filepath=fit_metrics_filepath, fit_metrics_filepath=fit_metrics_filepath,
fit_predictions_filepath=fit_predictions_filepath,
predictions_column_name=predictions_column_name,
) )
logger.info("-------------------------------")
logger.info(f"--- {__file__} - Complete! ---") logger.info(f"--- {__file__} - Complete! ---")
logger.info("-------------------------------")


@ -4,133 +4,58 @@ After the model is built, we can evaluate its performance
""" """
import os import os
import yaml
import pandas as pd
from pathlib import Path
from core.interface.InterfaceModels import MLModel
from core.interface.InterfaceDataClient import DataClient
from core.DataClient import dataclient_factory from core.DataClient import dataclient_factory
from core.MLModels import model_factory from core.MLModels import model_factory
from core.Logger import logger from core.Logger import logger
from configs.post_prediction_logic import post_prediction_logic from config import settings
from generate_predictions import generate_predictions
logger.info("----------------------------")
logger.info(f"--- Initiate Parameters ---") logger.info(f"--- Initiate Parameters ---")
logger.info("----------------------------")
RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local") RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local")
client_path = Path(__file__).parent / "configs" / "client.yaml" client_params = settings.client
client_params = yaml.safe_load(open(client_path)) prepare_data_params = settings.prepare_data
build_model_params = settings.build_model
generate_predictions_params = settings.generate_predictions
feature_process_params = settings.feature_processor
prepare_data_path = Path(__file__).parent / "configs" / "prepare_data.yaml" input_dataclient_type = generate_predictions_params["input_dataclient_type"]
prepare_data_params = yaml.safe_load(open(prepare_data_path)) output_dataclient_type = generate_predictions_params["output_dataclient_type"]
build_model_path = Path(__file__).parent / "configs" / "build_model.yaml" test_data_filepath = generate_predictions_params["test_data_filepath"]
build_model_params = yaml.safe_load(open(build_model_path)) test_data_filepath = os.environ.get("PREDICTION_FILE", test_data_filepath)
generate_predictions_path = (
Path(__file__).parent / "configs" / "generate_predictions.yaml"
)
generate_predictions_params = yaml.safe_load(open(generate_predictions_path))
feature_process_path = Path(__file__).parent / "configs" / "feature_processor.yaml"
feature_process_params = yaml.safe_load(open(feature_process_path))
target = feature_process_params["feature_processor_config"]["target"] target = feature_process_params["feature_processor_config"]["target"]
model_filepath = build_model_params["model_save_filepath"] model_filepath = build_model_params["model_save_filepath"]
test_data_filepath = generate_predictions_params["test_data_filepath"]
predictions_output_filepath = generate_predictions_params["predictions_output_filepath"] predictions_output_filepath = generate_predictions_params["predictions_output_filepath"]
predictions_column_name = generate_predictions_params["predictions_column_name"] predictions_column_name = generate_predictions_params["predictions_column_name"]
logger.info("-------------------------")
logger.info(f"--- Initiate MLModel ---") logger.info(f"--- Initiate MLModel ---")
logger.info("-------------------------")
model = model_factory(build_model_params["model_type"]) model = model_factory(build_model_params["model_type"])
logger.info("----------------------------")
logger.info(f"--- Initiate DataClient ---") logger.info(f"--- Initiate DataClient ---")
logger.info("----------------------------")
# We may have different locations of loading hence why we use one specified in generate_predictions.yaml # We may have different locations of loading hence why we use one specified in generate_predictions.yaml
# I.e. for metric runs, this will be a local data client # I.e. for metric runs, this will be a local data client
# For predictions, we will want a cloud data client # For predictions, we will want a cloud data client
input_dataclient_type = generate_predictions_params["input_dataclient_type"]
input_dataclient = dataclient_factory( input_dataclient = dataclient_factory(
dataclient_type=input_dataclient_type, dataclient_type=input_dataclient_type,
dataclient_config=client_params[input_dataclient_type], dataclient_config=client_params[input_dataclient_type],
) )
output_dataclient_type = generate_predictions_params["output_dataclient_type"]
output_dataclient = dataclient_factory( output_dataclient = dataclient_factory(
dataclient_type=output_dataclient_type, dataclient_type=output_dataclient_type,
dataclient_config=client_params[output_dataclient_type], dataclient_config=client_params[output_dataclient_type],
) )
def generate_predictions(
input_dataclient: DataClient,
output_dataclient: DataClient,
model: MLModel,
target: str,
model_filepath: str,
test_data_filepath: str,
predictions_output_filepath: str,
predictions_column_name: str,
):
"""
For a given model, we generate prediction and evaluate this against the true target
"""
logger.info("-------------------------")
logger.info("--- Loading test data ---")
logger.info("-------------------------")
test_data = input_dataclient.load_data(
location=test_data_filepath, load_config=None
)
logger.info("---------------------")
logger.info("--- Loading model ---")
logger.info("---------------------")
model.load_model(model_filepath)
logger.info("------------------------------")
logger.info("--- Generating predictions ---")
logger.info("------------------------------")
prediction_data = (
test_data.drop(columns=target) if target in test_data.columns else test_data
)
predictions = model.predict(
data=prediction_data, post_prediction_logic=post_prediction_logic
)
logger.info("--------------------------")
logger.info("--- Saving predictions ---")
logger.info("--------------------------")
predictions_df = pd.DataFrame(predictions)
predictions_df.columns = [predictions_column_name]
output_dataclient.save_data(
obj=predictions_df, location=predictions_output_filepath, save_config=None
)
if __name__ == "__main__": if __name__ == "__main__":
logger.info("----------------------------")
logger.info(f"--- {__file__} - Start! ---") logger.info(f"--- {__file__} - Start! ---")
logger.info("----------------------------")
logger.info("----------------------------------")
logger.info(f"--- Generate Predictions Stage---") logger.info(f"--- Generate Predictions Stage---")
logger.info("----------------------------------")
generate_predictions( generate_predictions(
input_dataclient=input_dataclient, input_dataclient=input_dataclient,
@ -143,6 +68,4 @@ if __name__ == "__main__":
predictions_column_name=predictions_column_name, predictions_column_name=predictions_column_name,
) )
logger.info("-------------------------------")
logger.info(f"--- {__file__} - Complete! ---") logger.info(f"--- {__file__} - Complete! ---")
logger.info("-------------------------------")


@ -14,33 +14,18 @@ from core.DataClient import dataclient_factory
from core.MLModels import model_factory from core.MLModels import model_factory
from core.MLMetrics import metrics_factory from core.MLMetrics import metrics_factory
from core.Logger import logger from core.Logger import logger
from config import settings
logger.info("----------------------------")
logger.info(f"--- Initiate Parameters ---") logger.info(f"--- Initiate Parameters ---")
logger.info("----------------------------")
RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local") RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local")
client_path = Path(__file__).parent / "configs" / "client.yaml" client_params = settings.client
client_params = yaml.safe_load(open(client_path)) prepare_data_params = settings.prepare_data
build_model_params = settings.build_model
prepare_data_path = Path(__file__).parent / "configs" / "prepare_data.yaml" generate_predictions_params = settings.generate_predictions
prepare_data_params = yaml.safe_load(open(prepare_data_path)) generate_metrics_params = settings.generate_metrics
feature_process_params = settings.feature_processor
build_model_path = Path(__file__).parent / "configs" / "build_model.yaml"
build_model_params = yaml.safe_load(open(build_model_path))
generate_predictions_path = (
Path(__file__).parent / "configs" / "generate_predictions.yaml"
)
generate_predictions_params = yaml.safe_load(open(generate_predictions_path))
generate_metrics_path = Path(__file__).parent / "configs" / "generate_metrics.yaml"
generate_metrics_params = yaml.safe_load(open(generate_metrics_path))
feature_process_path = Path(__file__).parent / "configs" / "feature_processor.yaml"
feature_process_params = yaml.safe_load(open(feature_process_path))
target = feature_process_params["feature_processor_config"]["target"] target = feature_process_params["feature_processor_config"]["target"]
test_data_filepath = generate_predictions_params["test_data_filepath"] test_data_filepath = generate_predictions_params["test_data_filepath"]
@ -48,16 +33,11 @@ predictions_output_filepath = generate_predictions_params["predictions_output_fi
predictions_column_name = generate_predictions_params["predictions_column_name"] predictions_column_name = generate_predictions_params["predictions_column_name"]
metrics_output_filepath = generate_metrics_params["metrics_output_filepath"] metrics_output_filepath = generate_metrics_params["metrics_output_filepath"]
logger.info("-------------------------")
logger.info(f"--- Initiate MLModel ---") logger.info(f"--- Initiate MLModel ---")
logger.info("-------------------------")
model = model_factory(build_model_params["model_type"]) model = model_factory(build_model_params["model_type"])
logger.info("----------------------------")
logger.info(f"--- Initiate DataClient ---") logger.info(f"--- Initiate DataClient ---")
logger.info("----------------------------")
# Use data client for input and output, as we use dvc to cache later to the cloud # Use data client for input and output, as we use dvc to cache later to the cloud
dataclient_type = generate_metrics_params["dataclient_type"] dataclient_type = generate_metrics_params["dataclient_type"]
@ -66,9 +46,7 @@ dataclient = dataclient_factory(
dataclient_config=client_params[dataclient_type], dataclient_config=client_params[dataclient_type],
) )
logger.info("---------------------------")
logger.info(f"--- Initiate MLMetrics ---") logger.info(f"--- Initiate MLMetrics ---")
logger.info("---------------------------")
metrics = metrics_factory(generate_metrics_params["metrics_type"]) metrics = metrics_factory(generate_metrics_params["metrics_type"])
@ -88,34 +66,26 @@ def generate_metrics(
For a given model, we generate prediction and evaluate this against the true target For a given model, we generate prediction and evaluate this against the true target
""" """
logger.info("-------------------------")
logger.info("--- Loading test data ---") logger.info("--- Loading test data ---")
logger.info("-------------------------")
test_data = input_dataclient.load_data( test_data = input_dataclient.load_data(
location=test_data_filepath, load_config=None location=test_data_filepath, load_config=None
) )
logger.info("---------------------------")
logger.info("--- Loading predictions ---") logger.info("--- Loading predictions ---")
logger.info("---------------------------")
predictions = input_dataclient.load_data( predictions = input_dataclient.load_data(
location=predictions_output_filepath, load_config=None location=predictions_output_filepath, load_config=None
) )
logger.info("--------------------------")
logger.info("--- Generating metrics ---") logger.info("--- Generating metrics ---")
logger.info("--------------------------")
metrics_output = metrics.generate_metrics( metrics_output = metrics.generate_metrics(
target=test_data[target], target=test_data[target],
predictions=pd.Series(predictions[predictions_column_name]), predictions=pd.Series(predictions[predictions_column_name]),
) )
logger.info("----------------------")
logger.info("--- Saving metrics ---") logger.info("--- Saving metrics ---")
logger.info("----------------------")
output_dataclient.save_data( output_dataclient.save_data(
obj=metrics_output, location=metrics_output_filepath, save_config=None obj=metrics_output, location=metrics_output_filepath, save_config=None
@ -124,13 +94,9 @@ def generate_metrics(
if __name__ == "__main__": if __name__ == "__main__":
logger.info("----------------------------")
logger.info(f"--- {__file__} - Start! ---") logger.info(f"--- {__file__} - Start! ---")
logger.info("----------------------------")
logger.info("------------------------------")
logger.info(f"--- Generate Metrics Stage---") logger.info(f"--- Generate Metrics Stage---")
logger.info("------------------------------")
generate_metrics( generate_metrics(
input_dataclient=dataclient, input_dataclient=dataclient,
@ -144,6 +110,4 @@ if __name__ == "__main__":
metrics_output_filepath=metrics_output_filepath, metrics_output_filepath=metrics_output_filepath,
) )
logger.info("-------------------------------")
logger.info(f"--- {__file__} - Complete! ---") logger.info(f"--- {__file__} - Complete! ---")
logger.info("-------------------------------")


@ -0,0 +1,162 @@
"""
Fourth part of the pipeline:
After the model is built and metrics are generated,
we want to test this model against known scenarios
"""
import os
import pandas as pd
from core.interface.InterfaceModels import MLModel
from core.interface.InterfaceDataClient import DataClient
from core.interface.InterfaceMetrics import MLMetrics
from configs.post_prediction_logic import post_prediction_logic
from core.DataClient import dataclient_factory
from core.MLModels import model_factory
from core.MLMetrics import metrics_factory
from core.Logger import logger
from config import settings
logger.info(f"--- Initiate Parameters ---")
RUNTIME_ENVIRONMENT = os.environ.get("RUNTIME_ENVIRONMENT", "local")
client_params = settings.client
prepare_data_params = settings.prepare_data
build_model_params = settings.build_model
generate_predictions_params = settings.generate_predictions
generate_metrics_params = settings.generate_metrics
feature_process_params = settings.feature_processor
scenarios_params = settings.scenarios
model_filepath = build_model_params["model_save_filepath"]
target = feature_process_params["feature_processor_config"]["target"]
scenario_data_filepaths = scenarios_params["scenario_data_filepaths"]
predictions_column_name = generate_predictions_params["predictions_column_name"]
comparison_output_filepath = scenarios_params["comparison_output_filepath"]
metrics_output_filepath = scenarios_params["metrics_output_filepath"]
logger.info(f"--- Initiate MLModel ---")
model = model_factory(build_model_params["model_type"])
logger.info(f"--- Initiate DataClient ---")
# Use data client for input and output, as we use dvc to cache later to the cloud
input_dataclient_type = scenarios_params["input_dataclient_type"]
input_dataclient = dataclient_factory(
dataclient_type=input_dataclient_type,
dataclient_config=client_params[input_dataclient_type],
)
output_dataclient_type = scenarios_params["output_dataclient_type"]
output_dataclient = dataclient_factory(
dataclient_type=output_dataclient_type,
dataclient_config=client_params[output_dataclient_type],
)
logger.info(f"--- Initiate MLMetrics ---")
metrics = metrics_factory(generate_metrics_params["metrics_type"])
def generate_scenario_predictions(
input_dataclient: DataClient,
output_dataclient: DataClient,
model: MLModel,
metrics: MLMetrics,
model_filepath: str,
scenario_data_filepaths: list,
predictions_column_name: str,
comparison_output_filepath: str,
metrics_output_filepath: str,
):
"""
Given the new model, we generate predictions for the expected scenarios
"""
logger.info("--- Loading Scenario Data ---")
scenario_data = pd.DataFrame()
# If we have no scenario data, we can save empty dataframes
if scenario_data_filepaths is None:
logger.info("No scenario data filepaths provided")
output_dataclient.save_data(
obj=scenario_data, location=comparison_output_filepath, save_config=None
)
output_dataclient.save_data(
obj=scenario_data, location=metrics_output_filepath, save_config=None
)
return
# Can have multiple scenario data files
for scenario_data_filepath in scenario_data_filepaths:
scenario_data = pd.concat(
[
scenario_data,
input_dataclient.load_data(scenario_data_filepath, load_config=None),
]
)
logger.info("--- Loading Model ---")
model.load_model(model_filepath)
logger.info("--- Generating Predictions ---")
predictions = model.predict(
data=scenario_data, post_prediction_logic=post_prediction_logic
)
logger.info("--- Generate Scenario Predicted Impact ---")
predictions_df = pd.DataFrame(predictions)
predictions_df.columns = [predictions_column_name]
scenario_data = pd.concat([scenario_data, predictions_df], axis=1)
scenario_data["predicted_impact"] = abs(
scenario_data[predictions_column_name] - scenario_data["sap_starting"]
)
logger.info("--- Generate Metrics ---")
metrics_dict = metrics.generate_metrics(
scenario_data["impact"], scenario_data["predicted_impact"]
)
metrics_df = pd.DataFrame(metrics_dict, index=[0]).T.reset_index()
metrics_df.columns = ["metric", "value"]
logger.info("--- Save prediction into metrics ---")
output_df = scenario_data[["uprn", "id", "impact", "predicted_impact"]]
output_dataclient.save_data(
obj=output_df, location=comparison_output_filepath, save_config=None
)
output_dataclient.save_data(
obj=metrics_df, location=metrics_output_filepath, save_config=None
)
if __name__ == "__main__":
logger.info(f"--- {__file__} - Start! ---")
logger.info(f"--- Generate Scenario Predictions ---")
generate_scenario_predictions(
input_dataclient=input_dataclient,
output_dataclient=output_dataclient,
model=model,
metrics=metrics,
model_filepath=model_filepath,
scenario_data_filepaths=scenario_data_filepaths,
predictions_column_name=predictions_column_name,
comparison_output_filepath=comparison_output_filepath,
metrics_output_filepath=metrics_output_filepath,
)
logger.info(f"--- {__file__} - Complete! ---")
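
Note on the scenario stage above: `scenario_data` is assembled with repeated `pd.concat` calls and then joined to the predictions column-wise, so the result depends on the row indexes of both frames lining up. The following is a minimal sketch of the same predicted-impact calculation, assuming the model returns one prediction per row in input order. The helper name `add_predicted_impact` and the sample values are hypothetical and not part of the repository.

```python
import pandas as pd


def add_predicted_impact(frames, predictions, predictions_column_name="predictions"):
    """Stack scenario frames, attach predictions, and compute the absolute change."""
    # ignore_index=True gives the stacked frame a fresh 0..n-1 index,
    # so it lines up row-for-row with the predictions frame built below.
    scenario_data = pd.concat(frames, ignore_index=True)
    predictions_df = pd.DataFrame({predictions_column_name: list(predictions)})
    combined = pd.concat([scenario_data, predictions_df], axis=1)
    combined["predicted_impact"] = (
        combined[predictions_column_name] - combined["sap_starting"]
    ).abs()
    return combined


# Two hypothetical scenario files with made-up values.
frames = [
    pd.DataFrame({"id": [1, 2], "sap_starting": [50.0, 60.0]}),
    pd.DataFrame({"id": [3], "sap_starting": [40.0]}),
]
result = add_predicted_impact(frames, [55.0, 58.0, 47.5])
```

Resetting the index while stacking is one way to keep the column-wise join unambiguous when several scenario files are loaded; whether the original code needs it depends on what index `model.predict` returns.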


@ -37,3 +37,4 @@ Workflow:
- This experiment will have the corresponding .dvc files for the hashed model and data - This experiment will have the corresponding .dvc files for the hashed model and data
- Use version control as normal - Use version control as normal
- git add, git commit etc - git add, git commit etc
- To revert a change, use `git checkout {COMMIT_HASH}`, followed by `git switch -c {NEW_BRANCH_NAME}`


@ -0,0 +1,15 @@
from dynaconf import Dynaconf
settings = Dynaconf(
environments=True,
envvar_prefix="DYNACONF",
settings_files=[
"./configs/settings.yaml",
"./configs/build_model.yaml",
"./configs/analysis.yaml",
"./configs/scenarios.yaml",
],
)
# `envvar_prefix` = export envvars with `export DYNACONF_FOO=bar`.
# `settings_files` = Load these files in the listed order.


@ -0,0 +1,16 @@
default:
model_analysis:
dataclient_type: local
feature_importance_filepath: ./analysis/feature_importance.parquet
permutation_subsample_amount: 1000
loss_fns: "mean_absolute_percentage_error"
feature_importance_column: importance
n_repeats: 5
figwidth: 7
figheight: 6
prediction_analysis:
dataclient_type: local
nshap_samples: 100 # how many samples to use to approximate each Shapley value, larger values will be slower
n_val: 30 # how many datapoints from validation data should we interpret predictions for, larger values will be slower
row_index: [20695, 50243, 7653] # index of an example datapoint


@ -1,16 +1,22 @@
model_type: AutogluonAutoML default:
model_save_filepath: ./data/model/autogluonmodel/ build_model:
fit_metrics_filepath: ./metrics/fit_metrics.json model_type: AutogluonAutoML
model_save_filepath: ./data/model/optimised/
fit_metrics_filepath: ./metrics/fit_metrics.json
fit_predictions_filepath: ./data/fit_predictions/predictions.parquet
SKLearnLinearRegression: null SKLearnLinearRegression: null
SKLearnSVMRegression: SKLearnSVMRegression:
kernel: "linear" kernel: "linear"
AutogluonAutoML: AutogluonAutoML:
output_filepath: ./data/model/autogluonmodel/ output_filepath: ./data/model/allmodels/
problem_type: regression problem_type: regression
eval_metric: mean_absolute_error eval_metric: mean_squared_error #mean_absolute_error
time_limit: 800 time_limit: 1800
presets: medium_quality presets: medium_quality
excluded_model_types: ['KNN'] excluded_model_types: ['RF', 'CAT', 'NN_TORCH', 'KNN', 'XT']
infer_limit: 0.05
infer_limit_batch_size: 10000
ag_args_ensemble: {'num_folds_parallel': 2}


@ -1,10 +0,0 @@
aws-s3:
AWS_ACCESS_KEY_ID: null
AWS_SECRET_ACCESS_KEY: null
ENDPOINT_URL: null
aws-s3-mock:
AWS_ACCESS_KEY_ID: minio
AWS_SECRET_ACCESS_KEY: minio123
ENDPOINT_URL: http://localhost:9000
local:
null


@ -1,3 +0,0 @@
"""
Stitch all yaml configuration files together, override some settings (such as bucket location) based off environment variables
"""


@ -1,61 +0,0 @@
feature_processor_type: dataframe
feature_processor_config:
subsample_amount: null
subsample_seed: 0
target: SAP_ENDING
drop_columns: ["UPRN", "HEAT_DEMAND_CHANGE", "CARBON_CHANGE", "RDSAP_CHANGE", "HEAT_DEMAND_ENDING", "CARBON_ENDING"]
# retain_features: ["TOTAL_FLOOR_AREA_STARTING", "SAP_STARTING", "HEAT_DEMAND_STARTING", "CARBON_STARTING", "NUMBER_HABITABLE_ROOMS", "NUMBER_HEATED_ROOMS", "FIXED_LIGHTING_OUTLETS_COUNT", "PHOTO_SUPPLY_STARTING", "MULTI_GLAZE_PROPORTION_STARTING", "LOW_ENERGY_LIGHTING_STARTING", "NUMBER_OPEN_FIREPLACES_STARTING", "EXTENSION_COUNT_STARTING", "FLOOR_HEIGHT_STARTING", "PHOTO_SUPPLY_ENDING", "MULTI_GLAZE_PROPORTION_ENDING", "LOW_ENERGY_LIGHTING_ENDING", "NUMBER_OPEN_FIREPLACES_ENDING", "EXTENSION_COUNT_ENDING", "TOTAL_FLOOR_AREA_ENDING", "FLOOR_HEIGHT_ENDING", "DAYS_TO_STARTING", "DAYS_TO_ENDING"]
# retain_features: null
# retain_features: ["SAP_STARTING", 'PROPERTY_TYPE', 'BUILT_FORM', 'CONSTITUENCY', 'NUMBER_HABITABLE_ROOMS',
# 'NUMBER_HEATED_ROOMS',
# 'FIXED_LIGHTING_OUTLETS_COUNT',
# 'CONSTRUCTION_AGE_BAND',
# 'TRANSACTION_TYPE_STARTING',
# 'LIGHTING_DESCRIPTION_STARTING',
# 'MAINHEAT_DESCRIPTION_STARTING',
# 'HOTWATER_DESCRIPTION_STARTING',
# 'MAIN_FUEL_STARTING',
# 'MECHANICAL_VENTILATION_STARTING',
# 'SECONDHEAT_DESCRIPTION_STARTING',
# 'ENERGY_TARIFF_STARTING',
# 'SOLAR_WATER_HEATING_FLAG_STARTING',
# 'PHOTO_SUPPLY_STARTING',
# 'WINDOWS_DESCRIPTION_STARTING',
# 'GLAZED_TYPE_STARTING',
# 'MULTI_GLAZE_PROPORTION_STARTING',
# 'LOW_ENERGY_LIGHTING_STARTING',
# 'NUMBER_OPEN_FIREPLACES_STARTING',
# 'MAINHEATCONT_DESCRIPTION_STARTING',
# 'EXTENSION_COUNT_STARTING',
# 'TOTAL_FLOOR_AREA_STARTING',
# 'FLOOR_HEIGHT_STARTING',
# 'DAYS_TO_STARTING',
# 'WALLS_DESCRIPTION_STARTING',
# 'FLOOR_DESCRIPTION_STARTING']
# retain_features: ["SAP_STARTING", 'PROPERTY_TYPE', 'BUILT_FORM', 'CONSTITUENCY', 'NUMBER_HABITABLE_ROOMS',
# 'NUMBER_HEATED_ROOMS',
# 'FIXED_LIGHTING_OUTLETS_COUNT',
# 'CONSTRUCTION_AGE_BAND',
# 'TRANSACTION_TYPE_ENDING',
# 'LIGHTING_DESCRIPTION_ENDING',
# 'MAINHEAT_DESCRIPTION_ENDING',
# 'HOTWATER_DESCRIPTION_ENDING',
# 'MAIN_FUEL_ENDING',
# 'MECHANICAL_VENTILATION_ENDING',
# 'SECONDHEAT_DESCRIPTION_ENDING',
# 'ENERGY_TARIFF_ENDING',
# 'SOLAR_WATER_HEATING_FLAG_ENDING',
# 'PHOTO_SUPPLY_ENDING',
# 'WINDOWS_DESCRIPTION_ENDING',
# 'GLAZED_TYPE_ENDING',
# 'MULTI_GLAZE_PROPORTION_ENDING',
# 'LOW_ENERGY_LIGHTING_ENDING',
# 'NUMBER_OPEN_FIREPLACES_ENDING',
# 'MAINHEATCONT_DESCRIPTION_ENDING',
# 'EXTENSION_COUNT_ENDING',
# 'TOTAL_FLOOR_AREA_ENDING',
# 'FLOOR_HEIGHT_ENDING',
# 'DAYS_TO_ENDING',
# 'WALLS_DESCRIPTION_ENDING',
# 'FLOOR_DESCRIPTION_ENDING']
retain_features: null


@ -9,15 +9,42 @@ Business Logic dict + functions
def remove_starting_columns(df): def remove_starting_columns(df):
keep_column_index = [ keep_column_index = [
False if col_name.endswith("_STARTING") else True False if col_name.endswith("_starting") else True
for col_name in list(df.columns) for col_name in list(df.columns)
] ]
keep_columns = df.columns[keep_column_index].to_list() keep_columns = df.columns[keep_column_index].to_list()
keep_columns.append("SAP_STARTING") keep_columns.append("sap_starting")
df = df[keep_columns] df = df[keep_columns]
return df return df
def remove_floor_height_ending(df):
# df.describe(percentiles=[0.005,0.99])['FLOOR_HEIGHT_ENDING']
# shows bottom 0.5 percentile is 1.665
# So keep anything above this
df = df[df["floor_height_ending"] > 1.665].reset_index(drop=True)
print("we in here")
return df
def remove_minimum_habitable_room_size(df):
# Need a minimum of 6.5 m² of floor area per habitable room
df = df[
df["total_floor_area_ending"] / df["number_habitable_rooms"] > 6.5
].reset_index(drop=True)
return df
def keep_flats(df):
df = df[df["property_type"] == "Flat"]
return df
def keep_non_zero_rdsap(df):
df = df[df["rdsap_change"] != 0]
return df
# def keep_ending_columns(df): # def keep_ending_columns(df):
# ending_column_index = [ col_name.endswith("_ENDING") for col_name in list(df.columns)] # ending_column_index = [ col_name.endswith("_ENDING") for col_name in list(df.columns)]
# keep_columns = df.columns[ending_column_index].to_list() # keep_columns = df.columns[ending_column_index].to_list()
@ -27,7 +54,11 @@ def remove_starting_columns(df):
# return df # return df
business_logic = { business_logic = {
"remove_starting_columns": remove_starting_columns # "keep_non_zero_rdsap": keep_non_zero_rdsap,
# "keep_flats": keep_flats,
# "remove_minimum_habitable_room_size": remove_minimum_habitable_room_size,
# "remove_floor_height_ending": remove_floor_height_ending
# "remove_starting_columns": remove_starting_columns
# "keep_ENDING_COLUMNS": keep_ending_columns # "keep_ENDING_COLUMNS": keep_ending_columns
} }
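
Note on the `business_logic` dict above: with every entry commented out, the dict is empty, so no filter runs. The sketch below shows one way such a name-to-function mapping could be applied in order. The `apply_business_logic` helper and the sample rows are hypothetical, since the code that consumes the dict is not part of this diff; the two filters are simplified versions of the ones shown above.

```python
import pandas as pd


def remove_minimum_habitable_room_size(df, minimum_area=6.5):
    # Keep rows whose floor area per habitable room exceeds the threshold.
    keep = df["total_floor_area_ending"] / df["number_habitable_rooms"] > minimum_area
    return df[keep].reset_index(drop=True)


def keep_flats(df):
    # Resetting the index keeps later positional joins predictable.
    return df[df["property_type"] == "Flat"].reset_index(drop=True)


def apply_business_logic(df, business_logic):
    """Apply each rule in insertion order (hypothetical consumer of the dict)."""
    for rule in business_logic.values():
        df = rule(df)
    return df


sample = pd.DataFrame(
    {
        "property_type": ["Flat", "House", "Flat"],
        "total_floor_area_ending": [60.0, 90.0, 10.0],
        "number_habitable_rooms": [3, 4, 2],
    }
)
rules = {
    "keep_flats": keep_flats,
    "remove_minimum_habitable_room_size": remove_minimum_habitable_room_size,
}
filtered = apply_business_logic(sample, rules)
unfiltered = apply_business_logic(sample, {})  # empty dict: nothing is removed
```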


@ -1,3 +0,0 @@
dataclient_type: local
metrics_type: Regression
metrics_output_filepath: ./metrics/metrics.json


@ -1,5 +0,0 @@
input_dataclient_type: local
output_dataclient_type: local
test_data_filepath: ./data/prepared_data/test.parquet
predictions_output_filepath: ./data/predictions/predictions.parquet
predictions_column_name: predictions


@ -1,8 +0,0 @@
dataclient_type: local
feature_importance_filepath: ./analysis/feature_importance.parquet
permutation_subsample_amount: 1000
loss_fns: "mean_absolute_percentage_error"
feature_importance_column: importance
n_repeats: 5
figwidth: 7
figheight: 6


@ -5,15 +5,18 @@ import pandas as pd
def clip_predictions_to_minimum_value( def clip_predictions_to_minimum_value(
data: pd.DataFrame, predictions: pd.Series, minimum_value: int = 1 data: pd.DataFrame, predictions: pd.Series, minimum_value: int = 0
) -> pd.Series: ) -> pd.Series:
series_name = predictions.name series_name = predictions.name
predictions.name = "predictions" predictions.name = "predictions"
predictions_df = pd.concat([data, predictions], axis=1) predictions_df = pd.concat([data, predictions], axis=1)
replace_index = predictions_df["SAP_STARTING"] + 1 > predictions_df["predictions"] # We expect all prediction to be atleast one point improvement
replace_index = (
predictions_df["sap_starting"] + minimum_value > predictions_df["predictions"]
)
predictions_df.loc[replace_index, "predictions"] = ( predictions_df.loc[replace_index, "predictions"] = (
predictions_df.loc[replace_index, "SAP_STARTING"] + minimum_value predictions_df.loc[replace_index, "sap_starting"] + minimum_value
) )
predictions_new = predictions_df["predictions"] predictions_new = predictions_df["predictions"]
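
Note on the clipping rule above: any prediction that falls below `sap_starting + minimum_value` is raised to that floor, and the new default of `0` means a prediction may equal the starting score but never drop below it. A compact sketch of the same rule follows, assuming `data` and `predictions` share an index. The function name `clip_to_starting_score` and the sample values are hypothetical; unlike the original, this version does not rename the input series in place.

```python
import pandas as pd


def clip_to_starting_score(data, predictions, minimum_value=0):
    """Raise predictions that fall below sap_starting + minimum_value."""
    floor = data["sap_starting"] + minimum_value
    # Keep predictions at or above the floor; replace the rest with the floor.
    clipped = predictions.where(predictions >= floor, floor)
    clipped.name = predictions.name
    return clipped


data = pd.DataFrame({"sap_starting": [50.0, 60.0, 70.0]})
predictions = pd.Series([55.0, 58.0, 70.0], name="sap_ending")

no_drop = clip_to_starting_score(data, predictions)
one_point = clip_to_starting_score(data, predictions, minimum_value=1)
```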


@ -1,4 +0,0 @@
dataclient_type: local
nshap_samples: 100 # how many samples to use to approximate each Shapley value, larger values will be slower
n_val: 30 # how many datapoints from validation data should we interpret predictions for, larger values will be slower
row_index: [0, 10, 20] # index of an example datapoint


@ -1,9 +0,0 @@
input_dataclient_type: aws-s3
output_dataclient_type: local
# data_filepath: s3://retrofit-data-dev/sap_change_model/dataset.parquet
data_filepath: s3://retrofit-data-dev/sap_change_model/dataset_without_differencing.parquet
train_proportion: 0.9
output_train_filepath: ./data/prepared_data/train.parquet
output_test_filepath: ./data/prepared_data/test.parquet
# cache_o


@ -0,0 +1,13 @@
default:
scenarios:
input_dataclient_type: aws-s3
output_dataclient_type: local
scenario_data_filepaths:
# - s3://retrofit-data-dev/scenario_data/22-03-2024-19-20-09/recommendations_scoring_data.parquet
# - s3://retrofit-data-dev/scenario_data/24-03-2024-20-23-25/recommendations_scoring_data.parquet
# - s3://retrofit-data-dev/scenario_data/27-03-2024-11-38-15/recommendations_scoring_data.parquet
# - s3://retrofit-data-dev/scenario_data/26-05-2024-08-47-45/recommendations_scoring_data.parquet
# - s3://retrofit-data-dev/scenario_data/26-05-2024-10-44-53/recommendations_scoring_data.parquet
- s3://retrofit-data-dev/scenario_data/28-05-2024-19-22-41/recommendations_scoring_data.parquet
comparison_output_filepath: ./metrics/scenario_table.md
metrics_output_filepath: ./metrics/scenario_metrics.md


@ -0,0 +1,81 @@
default:
startup_cleanup:
artefacts: ./data
metrics: ./metrics
client:
aws-s3:
AWS_ACCESS_KEY_ID: null # Use local credentials
AWS_SECRET_ACCESS_KEY: null # Use local credentials
ENDPOINT_URL: null # Use local credentials
aws-s3-mock:
AWS_ACCESS_KEY_ID: minio
AWS_SECRET_ACCESS_KEY: minio123
ENDPOINT_URL: http://localhost:9000
local:
null
prepare_data:
input_dataclient_type: aws-s3
output_dataclient_type: local
# data_filepath: s3://retrofit-data-dev/sap_change_model/2024-03-22-18-56-53/dataset_rooms.parquet
# data_filepath: s3://retrofit-data-dev/sap_change_model/2024-05-25-08-36-36/dataset_rooms.parquet
# data_filepath: s3://retrofit-data-dev/sap_change_model/2024-05-26-10-31-39/dataset_rooms.parquet
data_filepath: s3://retrofit-data-dev/sap_change_model/2024-05-28-19-08-25/dataset_rooms.parquet
train_proportion: 0.9
output_train_filepath: ./data/prepared_data/train.parquet
output_test_filepath: ./data/prepared_data/test.parquet
feature_processor:
feature_processor_type: dataframe
feature_processor_config:
subsample_amount: null
subsample_seed: 0
target: sap_ending
identifier_columns: ["uprn"]
# drop_columns: ["heat_demand_change", "carbon_change", "rdsap_change", "heat_demand_ending", "carbon_ending", "days_to_starting", "days_to_ending"]
drop_columns: [
"heat_demand_change", "carbon_change", "rdsap_change", "heat_demand_ending", "carbon_ending", "days_to_starting", "days_to_ending",
'number_habitable_rooms_starting', 'number_habitable_rooms_ending', 'number_heated_rooms_starting', 'number_heated_rooms_ending',
'number_habitable_rooms', 'number_heated_rooms']
retain_features: null
# retain_features: ['uprn', 'sap_starting', 'hot_water_energy_eff_ending',
# 'mainheat_energy_eff_ending', 'constituency', 'roof_energy_eff_ending',
# 'walls_energy_eff_ending', 'secondheat_description_ending',
# 'property_type', 'mainheatc_energy_eff_ending', 'built_form',
# 'walls_insulation_thickness_ending', 'potential_energy_efficiency',
# 'transaction_type_ending',
# 'floor_thermal_transmittance_ending',
# 'low_energy_lighting_ending', 'heat_demand_starting',
# 'photo_supply_ending', 'carbon_starting',
# 'walls_thermal_transmittance_ending',
# 'roof_insulation_thickness_ending',
# 'total_floor_area_ending', 'number_open_fireplaces_ending',
# 'windows_energy_eff_ending',
# 'floor_height_ending',
# 'extension_count_ending',
# 'has_air_source_heat_pump_ending',
# 'charging_system_ending', 'construction_age_band', 'glazed_type_ending',
# 'roof_thermal_transmittance_ending',
# 'floor_insulation_thickness_ending', 'has_mains_gas_ending',
# 'estimated_perimeter_starting', 'energy_consumption_potential',
# 'environment_impact_potential', 'heater_type_ending',
# 'multi_glaze_proportion_ending',
# 'lighting_energy_eff_ending', 'fixed_lighting_outlets_count']
generate_predictions:
input_dataclient_type: local
output_dataclient_type: local
test_data_filepath: ./data/prepared_data/test.parquet
predictions_output_filepath: ./data/predictions/predictions.parquet
predictions_column_name: predictions
identifier_column: id
generate_metrics:
dataclient_type: local
metrics_type: Regression
metrics_output_filepath: ./metrics/metrics.json
dev:
generate_predictions:
input_dataclient_type: aws-s3
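
Note on the `dev` section above: it overrides a single nested key of `generate_predictions`. As far as I know, Dynaconf replaces a top-level key from `default` with the one from the active environment unless merging is enabled (for example with `merge_enabled=True` or a `dynaconf_merge` marker), so this is worth checking against the Dynaconf documentation for the pinned version. The sketch below uses plain dictionaries to contrast the two behaviours. It is an illustration only, not Dynaconf's implementation, and the function names are hypothetical.

```python
from copy import deepcopy


def replace_layer(base, override):
    """Top-level keys in override replace the base value wholesale."""
    result = deepcopy(base)
    result.update(deepcopy(override))
    return result


def deep_merge(base, override):
    """Nested dictionaries are merged key by key; other values are replaced."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


default = {
    "generate_predictions": {
        "input_dataclient_type": "local",
        "output_dataclient_type": "local",
        "predictions_column_name": "predictions",
    }
}
dev = {"generate_predictions": {"input_dataclient_type": "aws-s3"}}

replaced = replace_layer(default, dev)
merged = deep_merge(default, dev)
```

Under replacement, `replaced["generate_predictions"]` keeps only `input_dataclient_type`, so a later lookup of `predictions_column_name` would fail; under deep merging, the other keys from `default` survive.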

@ -1,2 +0,0 @@
-artefacts: ./data
-metrics: ./metrics

@ -142,9 +142,15 @@ class AWSS3Client:
         buffer = BytesIO()
         obj.to_parquet(buffer, index=False)
+        # Reset the buffer position to the beginning
+        buffer.seek(0)
         bucket, key = location.strip("s3://").split("/", 1)
         self.client.upload_fileobj(buffer, bucket, key)
+        # Close the buffer
+        buffer.close()
     def _load_parquet(self, location: str, load_config: dict) -> pd.DataFrame:
         """
         Load a parquet file
@ -239,7 +245,8 @@ class LocalClient:
         save_methods = {
             ".parquet": self._save_parquet,
-            ".json": self._save_json
+            ".json": self._save_json,
+            ".md": self._save_md,
             # "": _save_directory(**save_config),
             # ADD MORE save_methods HERE
         }
@ -288,3 +295,10 @@ class LocalClient:
         # Write the contents of the buffer to the local file
         with open(location, "wb") as f:
             f.write(buffer.getvalue())
+    def _save_md(self, obj: pd.DataFrame, location: str, save_config: dict):
+        """
+        Save object as markdown
+        """
+        obj.to_markdown(location, **save_config)
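
Two details of the `AWSS3Client` hunk can be checked in isolation. The sketch below is not part of the change set. It shows why a buffer has to be rewound before it is read back, and that `str.strip("s3://")` treats its argument as a set of characters rather than as a prefix. The helper `split_s3_location` and the sample location are hypothetical and exist only for illustration.

```python
from io import BytesIO


def split_s3_location(location: str) -> tuple:
    """Hypothetical helper: split 's3://bucket/key' without using str.strip."""
    prefix = "s3://"
    if location.startswith(prefix):
        location = location[len(prefix):]
    bucket, key = location.split("/", 1)
    return bucket, key


# Writing leaves the cursor at the end of the buffer, so a read returns nothing
buffer = BytesIO()
buffer.write(b"parquet bytes")
unrewound = buffer.read()

# After seek(0) the same read returns the full payload
buffer.seek(0)
rewound = buffer.read()
buffer.close()

# str.strip removes any leading or trailing characters found in {s, 3, :, /}
sample = "s3://s3-logs/run/metrics.parquet"
stripped = sample.strip("s3://")

# The prefix-based helper keeps the bucket name intact
bucket, key = split_s3_location(sample)
```

For a location such as the `s3://retrofit-data-dev/...` paths that appear in this diff, the `strip` call happens to give the expected result, because the text after the scheme neither starts nor ends with one of those four characters.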

@ -21,6 +21,7 @@ def setup_logger():
     # Add the stream handler to the logger
     logger.addHandler(stream_handler)
+    logger.propagate = False
     return logger
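
The effect of `logger.propagate = False` can be seen with the standard library alone. The following is a minimal sketch, assuming a named logger with its own handler plus a handler on the root logger. `ListHandler` and the logger name `pipeline_demo` are made up for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that stores messages so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# A handler on the root logger, as a framework or notebook might install
root_handler = ListHandler()
logging.getLogger().addHandler(root_handler)

# A named logger with its own handler, similar in spirit to setup_logger()
logger = logging.getLogger("pipeline_demo")
logger.setLevel(logging.INFO)
own_handler = ListHandler()
logger.addHandler(own_handler)

logger.info("first")      # handled locally and passed up to the root handler
logger.propagate = False
logger.info("second")     # handled locally only

logging.getLogger().removeHandler(root_handler)
```

With propagation switched off, each record is emitted once by the logger's own handler, which avoids duplicated lines when another library has configured the root logger.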

@ -4,6 +4,7 @@ Implementation of MLMetrics, all of which will have two methods:
 - Generate Plot Suite
 """
+import numpy as np
 import pandas as pd
 from typing import Union
 from sklearn.metrics import (
@ -14,6 +15,18 @@ from sklearn.metrics import (
 )
 from core.interface.InterfaceMetrics import MLMetrics
+# Define the function to return the SMAPE value
+def symmetric_mape(actual, predicted) -> float:
+    # Convert actual and predicted to numpy
+    # array data type if not already
+    if not all([isinstance(actual, np.ndarray), isinstance(predicted, np.ndarray)]):
+        actual, predicted = np.array(actual), np.array(predicted)
+    return np.mean(
+        np.abs(predicted - actual) / ((np.abs(predicted) + np.abs(actual)) / 2)
+    )
 def metrics_factory(metrics_type: str) -> MLMetrics:
     metrics = {
@ -34,7 +47,7 @@ class RegressionMetrics:
         median_absolute_error,
         mean_squared_error,
         mean_absolute_percentage_error,
-        # max_error
+        symmetric_mape,
     ]
     def generate_metrics(
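
As a worked check of the SMAPE expression added above, here is a plain-Python restatement of the same formula, mean of |predicted - actual| / ((|predicted| + |actual|) / 2). The name `symmetric_mape_plain` is hypothetical. One deliberate difference is labelled in the code: a pair of zeros contributes 0 in this sketch, whereas the NumPy expression in the diff divides by zero for such a pair.

```python
def symmetric_mape_plain(actual, predicted) -> float:
    """Plain-Python restatement of the SMAPE formula, for illustration only."""
    terms = []
    for a, p in zip(actual, predicted):
        denominator = (abs(p) + abs(a)) / 2
        # Assumption made for this sketch: a (0, 0) pair counts as a perfect match
        terms.append(abs(p - a) / denominator if denominator else 0.0)
    return sum(terms) / len(terms)


# 10 / 105 and 20 / 190, averaged
score = symmetric_mape_plain([100, 200], [110, 180])

# Identical inputs give 0; opposite signs give the upper bound of 2
perfect = symmetric_mape_plain([50, 0], [50, 0])
worst = symmetric_mape_plain([1], [-1])
```

The value is a fraction between 0 and 2 rather than a percentage, so it sits on the same scale as `mean_absolute_percentage_error` from scikit-learn, which also returns a fraction.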

@ -25,7 +25,7 @@ def model_factory(model_type: str) -> MLModel:
     models = {
         "SKLearnLinearRegression": SKLearnLinearRegression(),
         "SKLearnSVMRegression": SKLearnSVMRegression(),
-        "AutogluonAutoML": AutogluonAutoML()
+        "AutogluonAutoML": AutogluonAutoML(),
         # ADD OTHER MODELS HERE
     }
@ -149,6 +149,9 @@ class AutogluonAutoML:
             "time_limit",
             "presets",
             "excluded_model_types",
+            "infer_limit",
+            "infer_limit_batch_size",
+            "ag_args_ensemble",
         ]
     def load_model(self, path: Union[Path, str]) -> None:
@ -165,8 +168,12 @@ class AutogluonAutoML:
         if self.model is None:
             raise KeyError("No model trained/ loaded - unable to save")
-        logger.info("In local development mode - no need for s3 client")
-        logger.info("Using AutoGluon Model - Model saving already occured")
+        logger.info(
+            "Using AutoGluon Model - Model saving is using optimised deployment mode"
+        )
+        logger.info("Saving optimised model")
+        self.model.clone_for_deployment(str(path))
         return str(path)
@ -199,6 +206,9 @@ class AutogluonAutoML:
             time_limit=model_hyperparameters["time_limit"],
             presets=model_hyperparameters["presets"],
             excluded_model_types=model_hyperparameters["excluded_model_types"],
+            infer_limit=model_hyperparameters["infer_limit"],
+            infer_limit_batch_size=model_hyperparameters["infer_limit_batch_size"],
+            ag_args_ensemble=model_hyperparameters["ag_args_ensemble"],
         )
     def predict(
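
The last hunk reads each new hyperparameter with `model_hyperparameters[...]`, so a config that omits one of them raises `KeyError`. The sketch below shows one possible alternative that forwards only the keys present in the config. It is not the repository's code: `ALLOWED_FIT_KEYS` and `select_fit_kwargs` are hypothetical names, and the key list is limited to the entries visible in this diff. The sample values are those recorded in the lock file of this change set.

```python
# Hypothetical allow-list, limited to the keys visible in the hunks above
ALLOWED_FIT_KEYS = [
    "time_limit",
    "presets",
    "excluded_model_types",
    "infer_limit",
    "infer_limit_batch_size",
    "ag_args_ensemble",
]


def select_fit_kwargs(model_hyperparameters: dict) -> dict:
    """Keep only allow-listed keys that are actually set in the config."""
    return {
        key: model_hyperparameters[key]
        for key in ALLOWED_FIT_KEYS
        if key in model_hyperparameters
    }


hyperparameters = {
    "problem_type": "regression",       # not a fit argument in this sketch
    "eval_metric": "mean_squared_error",
    "time_limit": 1800,
    "presets": "medium_quality",
    "excluded_model_types": ["RF", "CAT", "NN_TORCH", "KNN", "XT"],
    "infer_limit": 0.05,
    "infer_limit_batch_size": 10000,
    "ag_args_ensemble": {"num_folds_parallel": 2},
}

fit_kwargs = select_fit_kwargs(hyperparameters)

# An older config without the inference limits still yields valid kwargs
legacy_kwargs = select_fit_kwargs({"time_limit": 800, "presets": "medium_quality"})
```

The resulting dictionary could then be unpacked into the fit call with `**fit_kwargs`, leaving the library defaults in place for anything that is not configured.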

@ -1,3 +0,0 @@
-/prepared_data
-/model
-/predictions

@ -1,126 +1,190 @@
 schema: '2.0'
 stages:
+  startup_cleanup:
+    cmd: python 0_startup_cleanup.py
+    deps:
+    - path: 0_startup_cleanup.py
+      hash: md5
+      md5: b1b12f6b6393fbf8b83d23684df0a3d4
+      size: 1220
+    params:
+      configs/settings.yaml:
+        default.startup_cleanup.artefacts: ./data
+        default.startup_cleanup.metrics: ./metrics
   prepare_data:
     cmd: python 1_prepare_data.py
     deps:
     - path: 1_prepare_data.py
       hash: md5
-      md5: 2648d7d407dca857a1d20a11a88d3d98
-      size: 5116
+      md5: 11a3b8bfdfe199ab7ecc39ccc5652649
+      size: 4298
     params:
-      configs/prepare_data.yaml:
-        output_test_filepath: ./data/prepared_data/test.parquet
-        output_train_filepath: ./data/prepared_data/train.parquet
-        train_proportion: 0.9
+      configs/settings.yaml:
+        default.feature_processor.feature_processor_config.drop_columns:
+        - heat_demand_change
+        - carbon_change
+        - rdsap_change
+        - heat_demand_ending
+        - carbon_ending
+        - days_to_starting
+        - days_to_ending
+        - number_habitable_rooms_starting
+        - number_habitable_rooms_ending
+        - number_heated_rooms_starting
+        - number_heated_rooms_ending
+        - number_habitable_rooms
+        - number_heated_rooms
+        default.feature_processor.feature_processor_config.retain_features:
+        default.feature_processor.feature_processor_config.subsample_amount:
+        default.feature_processor.feature_processor_config.subsample_seed: 0
+        default.feature_processor.feature_processor_config.target: sap_ending
+        default.feature_processor.feature_processor_type: dataframe
+        default.prepare_data.data_filepath:
+          s3://retrofit-data-dev/sap_change_model/2024-05-28-19-08-25/dataset_rooms.parquet
+        default.prepare_data.input_dataclient_type: aws-s3
+        default.prepare_data.output_dataclient_type: local
+        default.prepare_data.output_test_filepath: ./data/prepared_data/test.parquet
+        default.prepare_data.output_train_filepath: ./data/prepared_data/train.parquet
+        default.prepare_data.train_proportion: 0.9
     outs:
     - path: data/prepared_data/
       hash: md5
-      md5: 7bcbf81a82015276e25749d1bc249a57.dir
-      size: 21076961
+      md5: 80c9e138146a1d96b9d16091c207e2e8.dir
+      size: 45056059
       nfiles: 2
   build_model:
     cmd: python 2_build_model.py
     deps:
     - path: 2_build_model.py
       hash: md5
-      md5: 3eb1a5110df6e25a23d8e8a92bb27823
-      size: 5257
+      md5: 7231450b78920b0c5e7c6bada496b24a
+      size: 4820
     - path: data/prepared_data
       hash: md5
-      md5: 7bcbf81a82015276e25749d1bc249a57.dir
-      size: 21076961
+      md5: 80c9e138146a1d96b9d16091c207e2e8.dir
+      size: 45056059
       nfiles: 2
     params:
       configs/build_model.yaml:
-        AutogluonAutoML:
-          output_filepath: ./data/model/autogluonmodel/
-          problem_type: regression
-          eval_metric: mean_absolute_error
-          time_limit: 800
-          presets: medium_quality
-          excluded_model_types:
-          - KNN
-        SKLearnLinearRegression:
-        SKLearnSVMRegression:
-          kernel: linear
-        fit_metrics_filepath: ./metrics/fit_metrics.json
-        model_save_filepath: ./data/model/autogluonmodel/
-        model_type: AutogluonAutoML
+        default:
+          build_model:
+            model_type: AutogluonAutoML
+            model_save_filepath: ./data/model/optimised/
+            fit_metrics_filepath: ./metrics/fit_metrics.json
+            fit_predictions_filepath: ./data/fit_predictions/predictions.parquet
+            SKLearnLinearRegression:
+            SKLearnSVMRegression:
+              kernel: linear
+            AutogluonAutoML:
+              output_filepath: ./data/model/allmodels/
+              problem_type: regression
+              eval_metric: mean_squared_error
+              time_limit: 1800
+              presets: medium_quality
+              excluded_model_types:
+              - RF
+              - CAT
+              - NN_TORCH
+              - KNN
+              - XT
+              infer_limit: 0.05
+              infer_limit_batch_size: 10000
+              ag_args_ensemble:
+                num_folds_parallel: 2
     outs:
+    - path: data/fit_predictions/
+      hash: md5
+      md5: d9c9afc05e8780db47c0548b19bf7d19.dir
+      size: 3349989
+      nfiles: 1
     - path: data/model/
       hash: md5
-      md5: 397c46c062b51034b6f8f3f229345de3.dir
-      size: 334481421
-      nfiles: 18
+      md5: 13c3100e1486c27a83a8a47491077842.dir
+      size: 773523079
+      nfiles: 36
     - path: metrics/fit_metrics.json
       hash: md5
-      md5: f6e7e21d4229d4a229ea0a11f3023637
-      size: 184
+      md5: 2ff70a2a45813e1bcdf2ea3aa8e07d4a
+      size: 224
   generate_predictions:
     cmd: python 3_generate_predictions.py
     deps:
-    - path: data/model
-      hash: md5
-      md5: 397c46c062b51034b6f8f3f229345de3.dir
-      size: 334481421
-      nfiles: 18
-    - path: data/prepared_data
-      hash: md5
-      md5: 7bcbf81a82015276e25749d1bc249a57.dir
-      size: 21076961
-      nfiles: 2
     - path: 3_generate_predictions.py
       hash: md5
-      md5: 874da2443ef0d92731e4c127f3ce4acb
-      size: 4434
+      md5: 0a70ad4dfe99414a75d1261c75a177b9
+      size: 2464
+    - path: data/model
+      hash: md5
+      md5: 13c3100e1486c27a83a8a47491077842.dir
+      size: 773523079
+      nfiles: 36
+    - path: data/prepared_data
+      hash: md5
+      md5: 80c9e138146a1d96b9d16091c207e2e8.dir
+      size: 45056059
+      nfiles: 2
     params:
-      configs/generate_predictions.yaml:
-        input_dataclient_type: local
-        output_dataclient_type: local
-        predictions_column_name: predictions
-        predictions_output_filepath: ./data/predictions/predictions.parquet
-        test_data_filepath: ./data/prepared_data/test.parquet
+      configs/settings.yaml:
+        default.generate_predictions.input_dataclient_type: local
+        default.generate_predictions.output_dataclient_type: local
+        default.generate_predictions.predictions_column_name: predictions
+        default.generate_predictions.predictions_output_filepath: ./data/predictions/predictions.parquet
+        default.generate_predictions.test_data_filepath: ./data/prepared_data/test.parquet
     outs:
     - path: data/predictions/
       hash: md5
-      md5: 9c18005e722f0e428f4b83c3f974f206.dir
-      size: 381870
+      md5: 5d07bcebf3160a72bb18dfd79106e85c.dir
+      size: 463197
       nfiles: 1
   generate_metrics:
     cmd: python 4_generate_metrics.py
     deps:
+    - path: 4_generate_metrics.py
+      hash: md5
+      md5: 4fedb86d89d528f0a6597934ba3890a0
+      size: 3484
     - path: data/predictions
       hash: md5
-      md5: 9c18005e722f0e428f4b83c3f974f206.dir
-      size: 381870
+      md5: 5d07bcebf3160a72bb18dfd79106e85c.dir
+      size: 463197
       nfiles: 1
     - path: data/prepared_data
       hash: md5
-      md5: 7bcbf81a82015276e25749d1bc249a57.dir
-      size: 21076961
+      md5: 80c9e138146a1d96b9d16091c207e2e8.dir
+      size: 45056059
       nfiles: 2
-    - path: 4_generate_metrics.py
-      hash: md5
-      md5: 8ce0b6b55e1688fca816985e0cf37f28
-      size: 4220
     params:
-      configs/generate_metrics.yaml:
-        dataclient_type: local
-        metrics_output_filepath: ./metrics/metrics.json
-        metrics_type: Regression
+      configs/settings.yaml:
+        default.generate_metrics.dataclient_type: local
+        default.generate_metrics.metrics_output_filepath: ./metrics/metrics.json
+        default.generate_metrics.metrics_type: Regression
     outs:
     - path: metrics/metrics.json
       hash: md5
-      md5: 93d9b69d6cd951ae2c14b29ba92a2a38
-      size: 186
+      md5: 3e08df02fd5c5d094bcf936e1338d596
+      size: 223
-  startup_cleanup:
-    cmd: python 0_startup_cleanup.py
+  generate_scenerio_metrics:
+    cmd: python 5_generate_scenarios.py
     deps:
-    - path: 0_startup_cleanup.py
+    - path: 5_generate_scenarios.py
       hash: md5
-      md5: 2e51fbcac960d0f960bf32a8ec7486a0
-      size: 1748
+      md5: 40506749fefd926d47c60ff5b16db307
+      size: 5337
     params:
-      configs/startup_cleanup.yaml:
-        artefacts: ./data
-        metrics: ./metrics
+      configs/scenarios.yaml:
+        default.scenarios:
+          input_dataclient_type: aws-s3
+          output_dataclient_type: local
+          scenario_data_filepaths:
+          - s3://retrofit-data-dev/scenario_data/28-05-2024-19-22-41/recommendations_scoring_data.parquet
+          comparison_output_filepath: ./metrics/scenario_table.md
+          metrics_output_filepath: ./metrics/scenario_metrics.md
+    outs:
+    - path: metrics/scenario_metrics.md
+      hash: md5
+      md5: fa4d6d7bbd7818613800da5f8f37ea96
+      size: 363
+    - path: metrics/scenario_table.md
+      hash: md5
+      md5: d6baf100a1623cc2467c2f8221d314c9
+      size: 2133

@ -4,19 +4,28 @@ stages:
     deps:
       - 0_startup_cleanup.py
     params:
-      - configs/startup_cleanup.yaml:
-          - artefacts
-          - metrics
+      - configs/settings.yaml:
+          - default.startup_cleanup.artefacts
+          - default.startup_cleanup.metrics
     always_changed: true
   prepare_data:
     cmd: python 1_prepare_data.py
     deps:
       - 1_prepare_data.py
     params:
-      - configs/prepare_data.yaml:
-          - output_test_filepath
-          - output_train_filepath
-          - train_proportion
+      - configs/settings.yaml:
+          - default.prepare_data.input_dataclient_type
+          - default.prepare_data.output_dataclient_type
+          - default.prepare_data.data_filepath
+          - default.prepare_data.train_proportion
+          - default.prepare_data.output_train_filepath
+          - default.prepare_data.output_test_filepath
+          - default.feature_processor.feature_processor_type
+          - default.feature_processor.feature_processor_config.subsample_amount
+          - default.feature_processor.feature_processor_config.subsample_seed
+          - default.feature_processor.feature_processor_config.target
+          - default.feature_processor.feature_processor_config.drop_columns
+          - default.feature_processor.feature_processor_config.retain_features
     outs:
       - data/prepared_data/
     always_changed: true
@ -29,6 +38,7 @@ stages:
       - configs/build_model.yaml:
     outs:
       - data/model/
+      - data/fit_predictions/
       - metrics/fit_metrics.json
     always_changed: true
   generate_predictions:
@ -38,7 +48,12 @@ stages:
       - data/prepared_data
       - data/model
     params:
-      - configs/generate_predictions.yaml:
+      - configs/settings.yaml:
+          - default.generate_predictions.input_dataclient_type
+          - default.generate_predictions.output_dataclient_type
+          - default.generate_predictions.test_data_filepath
+          - default.generate_predictions.predictions_output_filepath
+          - default.generate_predictions.predictions_column_name
     outs:
       - data/predictions/
     always_changed: true
@ -49,10 +64,24 @@ stages:
       - data/prepared_data
       - data/predictions
     params:
-      - configs/generate_metrics.yaml:
+      - configs/settings.yaml:
+          - default.generate_metrics.dataclient_type
+          - default.generate_metrics.metrics_type
+          - default.generate_metrics.metrics_output_filepath
     outs:
       - metrics/metrics.json
     always_changed: true
+  generate_scenerio_metrics:
+    cmd: python 5_generate_scenarios.py
+    deps:
+      - 5_generate_scenarios.py
+    params:
+      - configs/scenarios.yaml:
+          - default.scenarios
+    outs:
+      - metrics/scenario_table.md
+      - metrics/scenario_metrics.md
+    always_changed: true
 metrics:
   - metrics/metrics.json
   - metrics/fit_metrics.json
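
The `params` entries above use dot-separated paths such as `default.generate_metrics.metrics_type` to address nested keys in `configs/settings.yaml`. A minimal sketch of that lookup on an already-parsed config follows. `resolve_param` is a hypothetical helper and the dictionary is a hand-written excerpt of the values shown in this diff, not a loaded file.

```python
from functools import reduce


def resolve_param(config: dict, dotted_key: str):
    """Walk a nested dict one path segment at a time."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), config)


# Hand-written excerpt mirroring the settings visible in this change set
settings = {
    "default": {
        "generate_metrics": {
            "dataclient_type": "local",
            "metrics_type": "Regression",
            "metrics_output_filepath": "./metrics/metrics.json",
        },
        "startup_cleanup": {"artefacts": "./data", "metrics": "./metrics"},
    },
}

metrics_type = resolve_param(settings, "default.generate_metrics.metrics_type")
artefacts = resolve_param(settings, "default.startup_cleanup.artefacts")

# A path that stops at a mapping returns the whole subtree
cleanup_block = resolve_param(settings, "default.startup_cleanup")
```

Listing individual leaf keys, as most stages do here, limits a stage's tracked parameters to those values. Listing a subtree such as `default.scenarios` tracks everything beneath it.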

@ -175,3 +175,74 @@ plot_permutation_importance(exp, fig_kw={"figwidth": 7, "figheight": 6})
 # Use shap package to explain why 9158 has a 35 prediction when its sap ending is 96
 #
 #
+from core.MLModels import model_factory
+from core.DataClient import dataclient_factory
+import pandas as pd
+from config import settings
+client_params = settings.client
+prepare_data_params = settings.prepare_data
+feature_process_params = settings.feature_processor
+build_model_params = settings.build_model
+generate_predictions_params = settings.generate_predictions
+prediction_analysis_params = settings.prediction_analysis
+model = model_factory(build_model_params["model_type"])
+model.load_model(build_model_params["model_save_filepath"])
+dataclient_type = prediction_analysis_params["dataclient_type"]
+# dataclient_type = 'aws-s3'
+# dataclient = dataclient_factory(
+#     dataclient_type=dataclient_type,
+#     dataclient_config=client_params[dataclient_type],
+# )
+# data = dataclient.load_data("s3://retrofit-data-dev/sap_change_model/dataset.parquet")
+target = feature_process_params["feature_processor_config"]["target"]
+predictions_column_name = generate_predictions_params["predictions_column_name"]
+output_test_filepath = prepare_data_params["output_test_filepath"]
+predictions_output_filepath = generate_predictions_params["predictions_output_filepath"]
+# score_data = dataclient.load_data("s3://retrofit-data-dev/carbon_change_predictions/51/2023-11-28T21:01:21.869339.parquet")
+local_dataclient = dataclient_factory(
+    dataclient_type="local",
+    dataclient_config=client_params["local"],
+)
+test_df = local_dataclient.load_data(output_test_filepath)
+predictions = local_dataclient.load_data(predictions_output_filepath)
+mix_df = pd.concat([test_df.copy(), predictions], axis=1)
+mix_df["residual"] = abs(mix_df[predictions_column_name] - mix_df[target])
+mix_df = mix_df.sort_values("residual", ascending=False)
+cosine_similarity_df = mix_df[mix_df.columns.difference(["predictions", "residual"])]
+from sklearn.metrics.pairwise import cosine_similarity
+row_index = 0
+from sklearn.preprocessing import LabelEncoder
+object_columns = cosine_similarity_df.select_dtypes(["object"])
+cosine_similarity_df[object_columns.columns] = cosine_similarity_df[
+    object_columns.columns
+].apply(LabelEncoder().fit_transform)
+feature_vector = cosine_similarity_df.loc[[row_index]]
+cosine_similarity_df["cosine"] = cosine_similarity(cosine_similarity_df, feature_vector)
+similar_index = (
+    cosine_similarity_df.sort_values("cosine", ascending=False).head(15).index
+)
+check_df = mix_df.loc[similar_index]
+columns_to_check = [
+    "LOW_ENERGY_LIGHTING_ENDING",
+    "walls_thermal_transmittance_ENDING",
+    "floor_thermal_transmittance_ENDING",
+    "roof_thermal_transmittance_ENDING",
+    "roof_insulation_thickness_ENDING",
+]
+cosine_similarity_df = mix_df[columns_to_check]
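
The added analysis ranks rows by cosine similarity to a single reference row. The toy sketch below uses plain Python and made-up two-feature rows to show one property worth keeping in mind when reading such a ranking: cosine similarity compares direction only, so a row whose features are a scaled copy of the reference ties with the reference itself. The `cosine` helper and the `rows` data are hypothetical.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Made-up feature rows keyed by a row label
rows = {
    "a": [1.0, 0.0],   # reference row
    "b": [2.0, 0.0],   # same direction, twice the magnitude
    "c": [0.0, 3.0],   # orthogonal to the reference
    "d": [1.0, 1.0],   # partly aligned with the reference
}

reference = rows["a"]
scores = {label: cosine(vector, reference) for label, vector in rows.items()}

# Most similar first; sorted() keeps the original order for tied scores
ranked = sorted(rows, key=lambda label: scores[label], reverse=True)
```

Because large-valued columns dominate the dot product, unscaled numeric features and integer label codes can weigh heavily in a ranking like this.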

@ -0,0 +1,56 @@
+import pandas as pd
+from configs.post_prediction_logic import post_prediction_logic
+from core.interface.InterfaceModels import MLModel
+from core.interface.InterfaceDataClient import DataClient
+from core.Logger import logger
+def generate_predictions(
+    input_dataclient: DataClient,
+    output_dataclient: DataClient,
+    model: MLModel,
+    target: str,
+    model_filepath: str,
+    test_data_filepath: str,
+    predictions_output_filepath: str,
+    predictions_column_name: str,
+    identifier_column: str = "id",
+):
+    """
+    For a given model, we generate prediction and evaluate this against the true target
+    """
+    logger.info("--- Loading test data ---")
+    test_data = input_dataclient.load_data(
+        location=test_data_filepath, load_config=None
+    )
+    logger.info("--- Loading model ---")
+    model.load_model(model_filepath)
+    logger.info("--- Generating predictions ---")
+    prediction_data = (
+        test_data.drop(columns=target) if target in test_data.columns else test_data
+    )
+    predictions = model.predict(
+        data=prediction_data, post_prediction_logic=post_prediction_logic
+    )
+    logger.info("--- Saving predictions ---")
+    predictions_df = pd.DataFrame(predictions)
+    predictions_df.columns = [predictions_column_name]
+    output_df = (
+        pd.concat([test_data[identifier_column], predictions_df], axis=1)
+        if identifier_column in test_data.columns
+        else predictions_df
+    )
+    output_dataclient.save_data(
+        obj=output_df, location=predictions_output_filepath, save_config=None
+    )
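
`pd.concat(..., axis=1)` joins on the index, not on row position. The sketch below uses made-up frames and assumes the predictions come back as a plain sequence, which is not confirmed by the diff. It shows what happens if the test data carries a non-default index, and how resetting the index restores a row-by-row pairing.

```python
import pandas as pd

# Made-up test data whose index does not start at 0
test_data = pd.DataFrame({"id": ["a", "b", "c"]}, index=[10, 11, 12])

# Predictions built from a plain list get a fresh 0..n-1 index
predictions_df = pd.DataFrame([1.0, 2.0, 3.0])
predictions_df.columns = ["predictions"]

# Index-based join: no labels overlap, so every row is half empty
misaligned = pd.concat([test_data["id"], predictions_df], axis=1)

# Positional pairing after dropping the original index
aligned = pd.concat(
    [test_data["id"].reset_index(drop=True), predictions_df], axis=1
)
```

If the test parquet is written without its index, as the S3 client does with `index=False`, the loaded frame already has a default index and the concatenation in `generate_predictions` pairs rows as intended.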

@ -1,2 +1,4 @@
 /fit_metrics.json
 /metrics.json
+/scenario_table.md
+/scenario_metrics.md

@ -3,8 +3,6 @@ Post Model generation step:
 We want to look at feature analysis of the model
 """
-import yaml
-from pathlib import Path
 from core.interface.InterfaceModels import MLModel
 from core.interface.InterfaceDataClient import DataClient
 from core.Logger import logger
@ -13,27 +11,16 @@ from core.DataClient import dataclient_factory
 from alibi.explainers import PermutationImportance, plot_permutation_importance
 import numpy as np
 import pandas as pd
+from config import settings
-client_path = Path(__file__).parent / "configs" / "client.yaml"
-client_params = yaml.safe_load(open(client_path))
-prepare_data_path = Path(__file__).parent / "configs" / "prepare_data.yaml"
-prepare_data_params = yaml.safe_load(open(prepare_data_path))
-feature_process_path = Path(__file__).parent / "configs" / "feature_processor.yaml"
-feature_process_params = yaml.safe_load(open(feature_process_path))
-build_model_path = Path(__file__).parent / "configs" / "build_model.yaml"
-build_model_params = yaml.safe_load(open(build_model_path))
-model_analysis_path = Path(__file__).parent / "configs" / "model_analysis.yaml"
-model_analysis_params = yaml.safe_load(open(model_analysis_path))
-generate_predictions_path = (
-    Path(__file__).parent / "configs" / "generate_predictions.yaml"
-)
-generate_predictions_params = yaml.safe_load(open(generate_predictions_path))
+client_params = settings.client
+prepare_data_params = settings.prepare_data
+feature_process_params = settings.feature_processor
+build_model_params = settings.build_model
+generate_predictions_params = settings.generate_predictions
+model_analysis_params = settings.model_analysis
 model = model_factory(build_model_params["model_type"])
 model.load_model(build_model_params["model_save_filepath"])

@ -12,40 +12,21 @@ import shap
 shap.initjs()
-import yaml
 from typing import List
-from pathlib import Path
 from core.interface.InterfaceModels import MLModel
 from core.interface.InterfaceDataClient import DataClient
 from core.Logger import logger
 from core.MLModels import model_factory
 from core.DataClient import dataclient_factory
-import numpy as np
 import pandas as pd
+from config import settings
-client_path = Path(__file__).parent / "configs" / "client.yaml"
-client_params = yaml.safe_load(open(client_path))
-prepare_data_path = Path(__file__).parent / "configs" / "prepare_data.yaml"
-prepare_data_params = yaml.safe_load(open(prepare_data_path))
-feature_process_path = Path(__file__).parent / "configs" / "feature_processor.yaml"
-feature_process_params = yaml.safe_load(open(feature_process_path))
-build_model_path = Path(__file__).parent / "configs" / "build_model.yaml"
-build_model_params = yaml.safe_load(open(build_model_path))
-generate_predictions_path = (
-    Path(__file__).parent / "configs" / "generate_predictions.yaml"
-)
-generate_predictions_params = yaml.safe_load(open(generate_predictions_path))
-prediction_analysis_path = (
-    Path(__file__).parent / "configs" / "prediction_analysis.yaml"
-)
-prediction_analysis_params = yaml.safe_load(open(prediction_analysis_path))
+client_params = settings.client
+prepare_data_params = settings.prepare_data
+feature_process_params = settings.feature_processor
+build_model_params = settings.build_model
+generate_predictions_params = settings.generate_predictions
+prediction_analysis_params = settings.prediction_analysis
 model = model_factory(build_model_params["model_type"])
 model.load_model(build_model_params["model_save_filepath"])

@ -1,8 +1,7 @@
 joblib==1.3.2
 boto3==1.28.17
-pandas==1.5.3
-autogluon==0.8.2
+pandas==2.1.4
+autogluon.tabular[all]==1.0.0
+dynaconf==3.2.1
 pyarrow==13.0.0
 pre-commit==3.3.3
-sphinx==7.2.5
-sphinx_rtd_theme==1.3.0

@ -1,6 +1,7 @@
 joblib==1.3.2
 boto3==1.28.17
-pandas==1.5.3
-autogluon==0.8.2
+pandas==2.1.4
+autogluon.tabular[all]==1.0.0
+dynaconf==3.2.1
 pyarrow==13.0.0
 PyYAML==6.0.1

@ -1,9 +1,10 @@
 joblib==1.3.2
 boto3==1.28.17
-pandas==1.5.3
-autogluon==0.8.2
-dynaconf==3.2.0
-alibi==0.9.4
+pandas==2.1.4
+autogluon.tabular[all]==1.0.0
+ray==2.6.3
+dynaconf==3.2.1
+alibi==0.9.5
 shap==0.42.1
 pyarrow==13.0.0
 pre-commit==3.3.3

@ -1,4 +1,4 @@
 boto3==1.28.41
-pandas==1.5.3
-autogluon==0.8.2
-dynaconf==3.2.0
+pandas==2.1.4
+autogluon.tabular[all]==1.0.0
+dynaconf==3.2.1

@ -1,3 +1,4 @@
-dvc==3.18.0
-dvc-s3==2.23.0
-gto==1.0.4
+dvc==3.51.0
+dvc-s3==3.2.0
+gto==1.7.1
+pyOpenSSL==23.3.0