# Regression Analysis

Throughout this story, I’ve treated the nine most frequented entry-stations [Exhibit 1] as one, and from there, grouped riders by destination. While this served as a useful theoretical simplification, it was not sufficient in testing for riders’ responsiveness to varying percentages of off-peak fare discounts.

For this, I grouped riders by specific origin/destination station pairings to get the exact composite miles for each trip. This precision was needed because composite miles directly determine peak and off-peak fares, from which percent discount is calculated, and because percent discount varies greatly over small mileage ranges [Exhibit 4b].

The following analysis still limits entries to the nine stations displayed in Exhibit 1, but groups riders by unique origin/destination pairings (9 possible entries × 91 possible exits) rather than aggregating entries.

## Dependent Variable:

I utilized the same t-test from Exhibits 2a & 2b to create a variable ranking each origin/destination pairing by its likelihood of being a route for ‘delayed travel’ riders.

The variable is continuous from 0 to 1 and equals 1 − *p*, where *p* is the *p*-value from the t-test, so pairings with stronger evidence of delayed travel score closer to 1.
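As a minimal sketch of how such a score could be built and used for ranking, assume the t-test's *p*-values have already been computed for each pairing. The station names and *p*-values below are hypothetical placeholders, not results from this analysis, and `delay_score` is a name of my own choosing:

```python
# Hypothetical p-values per (origin, destination) pairing; illustrative only.
p_values = {
    ("Station A", "Station X"): 0.03,
    ("Station A", "Station Y"): 0.48,
    ("Station B", "Station X"): 0.11,
}


def delay_score(p_value: float) -> float:
    """Return 1 - p, so stronger t-test evidence maps closer to 1."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    return 1.0 - p_value


# Score every pairing, then rank from most to least likely 'delayed travel' route.
scores = {pair: delay_score(p) for pair, p in p_values.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for pair in ranking:
    print(pair, round(scores[pair], 2))
```

Because the score is a simple transformation of the *p*-value, the ranking it produces is just the *p*-values sorted in ascending order.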

## Independent Variables:

The biggest challenge was to disentangle the effect of percent discount from that of trip length (composite miles). While percent discount is calculated directly from composite miles and generally increases with trip length, there is systematic variation over distinct mileage ranges due to fare caps.

**Composite Miles**

With the help of *interaction terms*, we can create a model that keeps the coefficient on composite miles from picking up the effect of percent discount.

As seen below [Exhibit 6], the *flat* regions help isolate the effect of trip length. While composite miles increase over these regions, percent discount is held constant. Alternatively, percent discount increases sharply with composite miles through *steep* regions.

β_{1} signifies the relationship between composite miles and riders’ tendency to wait within flat regions. The dummy variables medium and steep take on the value 1 within their respective mileage ranges, meaning we can add either β_{2} or β_{3} to β_{1} to get the relationship between composite miles and the tendency to wait through these regions.
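To make that arithmetic concrete, here is a small sketch of how the region-specific slope on composite miles follows from β_{1}, β_{2} and β_{3}. It assumes the interactions enter the model as composite miles multiplied by each dummy. The coefficient values are hypothetical placeholders chosen only for illustration (they are not the fitted estimates), and the function name `miles_slope` is my own:

```python
# Hypothetical coefficients for illustration only; NOT the fitted estimates.
BETA_1 = -0.010  # slope on composite miles within flat regions (baseline)
BETA_2 = 0.004   # coefficient on the composite miles x medium interaction
BETA_3 = 0.007   # coefficient on the composite miles x steep interaction


def miles_slope(medium: int, steep: int) -> float:
    """Slope on composite miles for a region, given its 0/1 dummies."""
    if medium not in (0, 1) or steep not in (0, 1):
        raise ValueError("dummies must be 0 or 1")
    if medium and steep:
        raise ValueError("a region cannot be both medium and steep")
    return BETA_1 + BETA_2 * medium + BETA_3 * steep


regions = {"flat": (0, 0), "medium": (1, 0), "steep": (0, 1)}
for name, (medium, steep) in regions.items():
    print(f"{name}: {miles_slope(medium, steep):+.3f}")
```

With both dummies set to 0 the function returns β_{1} alone, which is the flat-region case; switching on one dummy adds the matching interaction coefficient.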

β_{4} through β_{10} are dummy-on-dummy interaction terms and simply allow the intercept to vary within each flat, medium or steep region. Conceptually, think of flat regions a and b. While the effect (β_{1}) of incremental composite miles should be the same over these regions, their starting point on a ‘tendency to delay travel’ scale would differ if percent discount were playing a role (considering that the percent discount facing riders at flat b is much greater than at flat a).

**Interestingly, the coefficient on composite miles (through all ranges) is negative, though not significant, suggesting that, all else equal, riders traveling longer distances are slightly less inclined to wait for off-peak fares.** (Possibly they just want to get home to see their family, or settle into the couch.)

**Exhibit 6**

**Income**

After controlling for length of trip and percent discount, it turns out income is an important predictor of riders’ tendencies to wait for off-peak fares. As we’d expect, the relationship is negative; higher income riders are less inclined to delay travel for lower fares.

**Percent Discount**

As it turns out, percent discount is the most significant predictor of delayed travel: its coefficient has a *p*-value of 0.024. In other words, if percent discount had no effect on delayed travel, a coefficient at least this extreme would arise by chance only about 2.4% of the time, so the effect is statistically significant at the 5% level.