Kaila Gilbert — Data Scientist & Storyteller

02 —

Selected work

Interactive county-level voter risk map for Ohio — color-coded by Recruit Immediately, Recruit, or Monitor action

Featured in Press, Civic Tech, Voting Rights, CMU · DSSG

Protecting Voting Access During the Pandemic (Team)

📰 Featured in CMU Heinz College News: "Carnegie Mellon Data Scientists Join National Effort to Protect Voting Access During the Pandemic"

How do you protect public health and voter rights at the same time? Our team partnered with the nonprofit Voter Protection Corps and CMU professor Rayid Ghani to examine poll closures, voting behavior, and election resource gaps across the U.S. during COVID-19. We gathered and analyzed data from over 35,000 polling places alongside public census and survey sources.

Outcome: Together, we created a Voter Resource Prioritization Toolkit identifying counties with critical gaps between in-person voting demand and available resources — factoring in polling machine access, poll worker shortages, and historically marginalized voter populations. Our findings directly informed Voter Protection Corps' national action plan to preserve in-person voting ahead of the 2020 general election.

Data: 35,000+ polling places · ACS Census · Survey data · Tools: Python · Tableau · Geospatial Analysis

↗ Toolkit ↗ Report ↗ Heinz College News ↗ CMU News

Map of hazardous waste facilities across New York State

Featured, Random Forest, Public Policy, Python, CMU · DSSG

Predicting Environmental Waste Violations for the EPA (Team)

New York holds 22% of all large quantity hazardous waste generators in the U.S. — yet regulators can only inspect 3% each year. Working with the EPA and NYS Dept. of Environmental Conservation data, our team built a machine learning pipeline to predict which facilities are most likely to violate compliance rules, helping allocate scarce inspection resources more equitably and effectively.

Methods: We ran a grid of 282 models (Random Forest, KNN, Logistic Regression, AdaBoost) across 193 features drawn from RCRA, NYSDEC waste reports, and Census ACS data. Our best model achieved ~92% precision at top 3% — far outperforming the current baseline. We also integrated a full fairness audit on disparities across high-poverty zip codes.
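
To make the "precision at top 3%" metric concrete, here is a minimal Python sketch. It is not the team's pipeline: the function name `precision_at_top_fraction` and the toy scores and labels are invented for illustration. The idea is to rank facilities by predicted risk, keep only the share that inspectors can actually visit, and report what fraction of those turned out to be violators.

```python
def precision_at_top_fraction(scores, labels, fraction=0.03):
    """Share of true violators among the top `fraction` of facilities by score."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal length")
    # Number of facilities that can be inspected (always at least one).
    k = max(1, round(len(scores) * fraction))
    # Rank facilities from highest to lowest predicted risk.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


# Toy data (invented): 1 = violation found, 0 = compliant.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(precision_at_top_fraction(toy_scores, toy_labels, fraction=0.2))
```

If all 9,172 facilities were ranked at once (an assumption; the evaluation split is not described here), the top 3% would be roughly 275 facilities.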

Models: 282 · Features: 193 · Best Precision@3%: ~92% · Dataset: 9,172 LQGs · NY State · 2009–2017

↗ GitHub ↗ Final Report ↗ Slides

Feature importance chart for food inspection outcome prediction

Rare Events, R, SMOTE, Random Forest

Predicting Food Inspection Outcomes in Allegheny County (Team)

Which restaurants and shops are most likely to fail food inspection audits? This project tackled a classic applied ML challenge, limited data and severe class imbalance, to predict food safety inspection failures across Allegheny County, PA. We evaluated how standard classifiers degrade under rare-event conditions and applied SMOTE (Synthetic Minority Over-sampling Technique) to rebalance the training data.
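
The core idea behind SMOTE can be sketched in a few lines. The project itself used R, so the Python below is only an illustration under stated assumptions: `smote_sketch` is a hypothetical helper, the feature values are made up, and only the interpolation step is shown. Each synthetic point is placed at a random position on the line segment between a minority-class example and one of its nearest minority-class neighbors.

```python
import random


def smote_sketch(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority points by interpolating toward near neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority example as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Sort the other minority points by squared Euclidean distance to it.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Choose one of the k nearest neighbors and interpolate toward it.
        neighbor = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Toy minority class (invented): two numeric features per failed inspection.
toy_failures = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(toy_failures, n_synthetic=6)
print(len(new_points))
```

Because every synthetic point lies between two real minority examples, oversampling this way should be applied to the training split only, so that no synthetic data reaches the evaluation set.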

Key Finding: Our analyses revealed that location (city), business expenses, and inspection description were the strongest predictors — outweighing prior violation history.

Focus: Class imbalance · Rare event modeling · Tools: R · SMOTE · Random Forest

↗ GitHub ↗ Report

Complainants by age and race — Philadelphia police misconduct EDA

Public Policy, R · ArcGIS, Clustering, Group Project

Predicting Police Misconduct in Philadelphia, PA (Team)

Police misconduct exacerbates civic distrust between citizens and law enforcement. Our team designed a full data science pipeline, from EDA to modeling to a decision-support dashboard, to help the Philadelphia Mayor's Office identify which police districts could be targeted for intervention and training. We analyzed the 1,300–1,800 complaints filed each year with the Philadelphia Police Department between 2015 and 2019.

Methods: We used k-means clustering, logistic regression, and random forest models to identify high-risk officers and districts. We also conducted spatial hotspot analysis in ArcGIS, which identified Districts 39 and 25 as statistically significant misconduct hotspots at 99% confidence. Our deliverables also included recommendations.

Data: OpenDataPhilly · ACS Census · TIGER Shapefiles · Tools: R · ArcGIS · Tableau · scikit-learn

↗ GitHub ↗ Slides ↗ Proposal

ArcGIS dashboard showing transit accessibility to opportunity employment zones in Allegheny County

Featured, ArcGIS Dashboard, Transportation Equity, GIS, Public Policy

Measuring Transit Accessibility to Opportunity Employment in Allegheny County (Team)

What if the people who most need certain jobs cannot reach them by transit? Our team investigated whether the Port Authority of Allegheny County (PAAC) bus system adequately serves "high needs" populations: those who rely on public transit to access opportunity employment. We first analyzed demographic data to identify origins of interest, using ACS data filtered by transit dependency, age, and industry. We then pulled raw industry and job vacancy data to locate job opportunity sites. Finally, we built an origin-destination composite accessibility index that scores each transit route.

Outcome: Composite scores ranged from 9.8 to 87 — penalizing long travel times, excessive transfers, poor regional walkability, and limited off-peak service. Our findings revealed stark spatial mismatches between where transit-dependent workers live and where opportunity jobs are located. Our final outputs were delivered as an interactive ArcGIS dashboard and an Excel accessibility index used by Allegheny County DHS.

Geography: Allegheny County, PA · Block group level · Tools: ArcGIS · Python · Excel · ACS Census · Output: Live dashboard + Accessibility Index

↗ GitHub ↗ Report ↗ Slides ↗ Sample Output

ProPublica Machine Bias — COMPAS racial bias investigation

Algorithmic Fairness, R, Criminal Justice, Group Project

Evaluating Racial Bias in the COMPAS Recidivism Risk System (Team)

What happens when an algorithm used by government officials consistently reveals discrimination in its recommendations? Our team reproduced and extended ProPublica's landmark investigation into COMPAS — a risk assessment tool used across the U.S. criminal justice system to score individuals' likelihood of reoffending. Using the same Broward County dataset, we built competing models (logistic regression, LDA, KNN, decision trees) to predict both general and violent recidivism.

Key Findings: The strongest predictors of recidivism were age, sex, and prior criminal history, not race — yet COMPAS scores diverged sharply by race. Our models achieved similar sensitivity to COMPAS with meaningfully better specificity — and with substantially more equitable outcomes across race. Our pruned random forest model performed best for general recidivism, while our LDA model led on violent recidivism.
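
A comparison of sensitivity and specificity across groups can be sketched as follows. This is an illustrative Python example, not the team's R analysis: the function name `rates_by_group` and the tiny labeled dataset are invented, and the groups are deliberately labeled "A" and "B" rather than drawn from the Broward County data.

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups):
    """Return sensitivity and specificity for each group label."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            counts[group]["tp" if pred == 1 else "fn"] += 1
        else:
            counts[group]["tn" if pred == 0 else "fp"] += 1
    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        result[group] = {
            # Sensitivity: share of actual reoffenders flagged as high risk.
            "sensitivity": c["tp"] / positives if positives else None,
            # Specificity: share of non-reoffenders correctly left unflagged.
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return result


# Toy data (invented): 1 = reoffended / flagged high risk, 0 = otherwise.
toy_truth = [1, 1, 0, 0, 1, 0, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0, 0, 0]
toy_group = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(rates_by_group(toy_truth, toy_pred, toy_group))
```

A gap in specificity between groups means one group's non-reoffenders are flagged as high risk more often, which is the kind of disparity an audit like this looks for.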

Dataset: 6,172 individuals · Broward County, FL · Models: LR, LDA, KNN, Decision Tree, Random Forest

↗ GitHub ↗ Full Report

COVID-19 virus illustration — Points of Dispense modeling

Optimization, Public Health, Python · Gurobi, Group Project

Identifying Points of Dispense for the COVID-19 Epidemic (Team)

During a pandemic, how can the public sector ensure efficient distribution of resources to protect public health? Our team designed and solved a mixed-integer optimization model to strategically site COVID-19 testing Points of Dispense (PODs) across Allegheny County, PA, balancing the cost of opening facilities against the societal cost of travel time and unmet demand. We implemented two formulations: an average-weighted objective and a minimax objective that minimizes the worst-case travel distance for any zip code.
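
The difference between the two objectives shows up even on a tiny example. The Python sketch below is a simplification and not the Gurobi model: it ignores opening costs and unmet demand, fixes the number of open sites, and brute-forces every combination, which only works for toy sizes. The function name `best_sites`, the distances, and the demand figures are all invented.

```python
from itertools import combinations


def best_sites(dist, demand, n_open, objective):
    """Pick `n_open` sites minimizing the chosen travel-distance objective.

    dist[z][s] is the distance from zip code z to candidate site s;
    demand[z] is the number of people in zip code z.
    """
    def cost(open_sites):
        # Each zip code travels to its nearest open site.
        nearest = [min(row[s] for s in open_sites) for row in dist]
        if objective == "average":
            # Demand-weighted average travel distance.
            return sum(d * w for d, w in zip(nearest, demand)) / sum(demand)
        if objective == "minimax":
            # Worst-case travel distance for any zip code.
            return max(nearest)
        raise ValueError("objective must be 'average' or 'minimax'")

    return min(combinations(range(len(dist[0])), n_open), key=cost)


# Toy instance (invented): 3 zip codes, 2 candidate sites, open exactly 1.
toy_dist = [[1, 4], [1, 4], [10, 5]]
toy_demand = [50, 50, 10]
print(best_sites(toy_dist, toy_demand, n_open=1, objective="average"))
print(best_sites(toy_dist, toy_demand, n_open=1, objective="minimax"))
```

In this toy instance the average objective favors the site close to the two high-demand zip codes, while the minimax objective favors the site that keeps the remote, low-demand zip code from traveling too far. That trade-off between efficiency and equitable coverage is why comparing both formulations is useful.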

Methods: To anticipate unexpected events in disease development, we used a SIR model to simulate outbreak demand scenarios across 47 candidate sites, incorporating GIS-derived travel distances, ACS census data, and PA health budget constraints. We ultimately recommended 18 PODs at a cost of ~$1.5M — with 90% weight to societal cost — ensuring equitable geographic coverage with zero unmet demand across all scenarios.
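
For readers unfamiliar with SIR models, here is a minimal discrete-time sketch in Python. It is not the team's simulation: the population size, transmission rate `beta`, recovery rate `gamma`, and the one-day time step are all placeholder assumptions. The model moves people from Susceptible to Infected to Recovered, and the infected curve is one plausible input for generating testing-demand scenarios.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Simulate a basic SIR outbreak with one-day Euler steps.

    Returns a list of (susceptible, infected, recovered) tuples, one per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        # New infections scale with contacts between S and I.
        new_infections = beta * s * i / population
        # A fixed share of infected people recover each day.
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


# Placeholder parameters (assumptions, not fitted to any real data).
toy_history = simulate_sir(population=1000, initial_infected=1,
                           beta=0.3, gamma=0.1, days=120)
peak_day = max(range(len(toy_history)), key=lambda d: toy_history[d][1])
print(peak_day, round(toy_history[peak_day][1]))
```

Running the simulation under several `beta` and `gamma` settings gives a range of demand curves, which is one way to stress-test a siting plan against faster or slower outbreaks.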

Sites evaluated: 47 candidate PODs · Allegheny County · Tools: Python · Gurobi · R · GIS · Outcome: 18 PODs · $0 shortage

↗ GitHub ↗ Report

Topic frequency trends across DOJ press releases 2009-2019

NLP, Topic Modeling, LDA, t-SNE

Discovering Agendas in DOJ Press Releases (Team)

How do DOJ press releases reveal the ultimate priorities and agendas of an administration? Our team used LDA topic modeling and t-SNE clustering on 13,000 Department of Justice press releases (2009–present) to surface the "expressed agendas" of each Attorney General.

Key Findings: Our analysis revealed clear partisan shifts, notably sharp declines in voting rights and disability cases under the last two AGs. We delivered policy recommendations to the House Judiciary Committee.

Corpus: 13,000 press releases · 57MB · Topics: 10 discovered

↗ GitHub (coming soon) ↗ Slides

All Projects on GitHub →

03 —

Education & experience

Sep 2020 – Mar 2025

Data Scientist, Senior Consultant, & AI Ethics Champion

IBM (Federal) · Washington, D.C.

Served Federal Civilian and Health agencies across eleven engagements. At the U.S. Postal Service, identified $40M+/year in cost savings by building freight optimization and data mining workstreams in Python on Red Hat OpenShift. Translated geospatial network data into visual decision tools to pinpoint logistics inefficiencies. At the FDA (CDER), led Python script development for KickStart — an automation platform generating statistical insights from drug applications — and performed 80+ data quality checks to brief medical reviewers on analytical risk. Held Secret Security Clearance (2023). Also served as AI Ethics Champion and Neurodiversity Advocate.

Apr 2020 – Sep 2020

Data Scientist — Voter Protection Corps

Data Science for Social Good · Pittsburgh, PA

Built an interactive Tableau and Python dashboard to identify COVID-19 voter resource gaps nationwide — visualizing state-level hotspots and potential demographic biases in access to voting resources. Coordinated data demos with national political campaigns to communicate findings and support targeted resource allocation.

Aug 2019 – Aug 2020

M.S. Data Analytics · Research Advisor & Teaching Assistant

Carnegie Mellon University — Heinz College · Pittsburgh, PA

Heinz Industry Partners Fellow · Honors Distinction · GPA: 3.84. Quantified millions in ROI for Manchester Bidwell's workforce development program by integrating state earnings data, social services records, and self-reported data using ML and spatial/survival analysis. Synthesized findings into policy narratives and visual reports for agency leadership. Served as TA for Negotiation (MBA, Policy, CS students). Key coursework: Data Mining · AI/ML · Unstructured Data · GIS · Optimization (I–IV) · Econometrics · Business Process Modeling.

Mar 2018 – Jul 2018

Research Associate

BluWave, LP · Brentwood, TN

Sixth employee at a fast-growing PE advisory firm. Built and curated a network of 120+ value creation and due diligence resources featured in M&A Magazine. Operationalized Salesforce analytics data and developed onboarding materials and database training for new hires.

Jul 2015 – Oct 2017

Account Executive

Dell EMC Technologies · Nashville, TN

Averaged 117% of annual quota in B2B enterprise and public sector sales, generating $3M+ in revenue over two years as part of Dell's pilot SMB program. Designed hardware and data center solutions for businesses in the greater NY area, collaborating with channel partners to close complex deals.

Aug 2015

B.A. — Magna Cum Laude + Departmental Honors

Vanderbilt University — Peabody College · Nashville, TN

GPA: 3.88 · Study Abroad: Rabat, Morocco & Beijing, China. Key coursework: Small Group Behavior · Advanced Creative Writing · Systematic Inquiry · Neuroscience · Sociology · Education Policy.
