Behavioural interview questions for software engineers, done properly
How to build a structured behavioural interview that actually predicts software-engineer performance: competencies, scorecards, STAR probes.
Why the whiteboard alone keeps under-hiring
Most engineering interviews still test one slice of the job. The candidate solves a problem on a shared editor, walks through a system design under time pressure, perhaps takes home a small project. The panel forms a view, mostly on the coding rounds, and the offer goes out. Six months later, half the time, the new starter is fine; the other half, something quieter has gone wrong, and the team puts it down to "fit". The whiteboard worked. The whiteboard was never the whole instrument.
Behavioural interview questions exist to test the rest of the role - judgement under ambiguity, behaviour in disagreement, what the candidate does when something they shipped breaks at three in the morning. None of that surfaces in a coding round, and engineering managers know this. The usual fix is more rounds, longer take-homes, harder system-design prompts. None of that helps if the missing instrument is a different one.
The research on the alternative - an unstructured behavioural chat - is grim. Dana, Dawes and Peterson (2013) showed that interviewers do not just fail to extract signal from a free-form conversation; they construct a coherent story out of nearly anything the candidate says, and that confabulated story actively dilutes whatever signal they had from the CV. There are engineers who can talk a problem through without ever shipping working code; there are also engineers who ship excellent code and freeze in any interview that resembles a viva. The format chooses who clears the bar, not the work.
The fix is a behavioural interview that is genuinely structured. Same questions, same scoring, same anchors, every candidate. That is what the rest of this article is about.
What a behavioural interview actually is
A behavioural interview is not a softer technical screen, and it is not a chat about the candidate's career. It is a series of questions about specific past events, scored against a fixed competency rubric. "Tell me about a code change you regret shipping." "Tell me about a technical decision you lost an argument on." "Tell me about a time you had to push back on a product manager." Each answer is then rated against the competency the question targets, on the same scale, by every interviewer.
The format has academic roots. McClelland (1998) described the Behavioural-Event Interview as a way to identify competencies by coding what people had actually done, rather than asking them what they thought they were good at. He reported that the ratings were reliable and predicted executive success, and that bias largely disappeared when interviewer and coder were blind to performance. Salgado and Moscoso (2002) sharpened the picture in their construct-validity meta-analysis: behaviour interviews and conventional interviews measure different things. Conventional interviews mostly tap general mental ability and personality; behaviour interviews tap job knowledge, situational judgement, and social skills. They are different instruments, and you need both.
It is worth distinguishing the behavioural question from its close cousin, the situational one. A behavioural question asks about a specific past event ("tell me about a time you..."). A situational question asks about a hypothetical ("what would you do if..."). Both belong in a structured interview; neither replaces the other. The STAR method is the scaffold candidates use to answer behavioural questions specifically - situation, task, action, result - and the panel uses it as a diagnostic for whether the candidate is recounting something real or assembling a plausible story.
Done properly, the structured interview built on these questions sits alongside the technical screen, not against it. The whiteboard tests one thing. This tests the rest.

The evidence the format actually predicts
The case for structured behavioural interviewing is not an opinion. It has been settled in the meta-analytic literature for thirty years, and the numbers are not subtle.
The canonical reference is McDaniel, Whetzel, Schmidt and Maurer (1994). They pulled 245 validity coefficients from 86,311 individuals, and found that interview validity tracked two things: the content of the questions (situational and job-related questions out-predicted psychological ones) and the structure of the conduct. Structured interviews predicted job performance substantially better than unstructured ones at the same content level. Wiesner and Cronshaw (1988), the earlier meta-analysis the literature still cites, put the difference more starkly: structured interviews showed roughly twice the predictive validity of unstructured ones at the same panel format.
Campion, Palmer and Campion (1997) summarised the spread of corrected validity coefficients across the structured-interview literature. Unstructured interviews land between .14 and .33. Structured interviews land between .35 and .62. That is the difference between a screening tool that is barely better than a coin flip and one that genuinely sorts candidates by likely on-the-job performance. They identified fifteen separate components of structure that drive the lift, split between question-side discipline (same questions for all, behavioural or situational, longer interviews) and evaluation-side discipline (rating each answer, anchored scales, taking notes, multiple interviewers, training).
Nearly two decades after McDaniel and colleagues, Levashina et al. (2013) revisited the whole picture in the most comprehensive review of structured-interview research to date. The headline did not change. Structure raises validity. Structure reduces bias. Anchored rating scales and disciplined probing both move the needle independently. None of this is in dispute among the people who study it for a living.
Structured interviews were found to have higher validity than unstructured interviews. Interviews showed similar validity for job performance and training criteria. - McDaniel, Whetzel, Schmidt and Maurer (1994)
The takeaway for an engineering team is the inversion of how this is usually pitched. The format is the predictor. The questions are the instrument the format runs on. A great question bank inside a free-form chat is a worse interview than a mediocre question bank inside a properly structured one.
The competencies a software engineering panel should score
The first job in building a behavioural interview is choosing what you are scoring for. Not what is interesting; what the role actually needs from someone who is doing it well at six months. Five to seven competencies is plenty. Any more and the categories blur into one another, and the rubric stops being a rubric.
Across the practitioner literature, the same five or six categories recur for software engineers: ownership and accountability, conflict and disagreement, ambiguity and prioritisation, technical judgement, mistakes and learning, and collaboration across functions. Senior roles add mentoring or technical leadership. The specifics change with seniority - "ownership" looks different at L4 and L7 - but the categories hold. Karat, who run engineering interviews as a service, put the principle bluntly: anything not on the rubric is noise. Tech Interview Handbook arrives at the same shortlist in documenting the categories that recur across Amazon, Google and Meta scorecards.
Each competency wants two or three scripted question stems, so the interviewer is not improvising on the day. For "mistakes and learning", that might be "tell me about a code change you regret shipping". For "conflict and disagreement", "tell me about a technical decision you lost an argument on". For "ambiguity", "tell me about a time you had to start work on a problem that was not yet well-defined". The questions sound conversational. The fact that they are the same questions every candidate gets is the point.
The scorecard is the other half. Each competency gets an anchored 1-5 rating scale, with a sentence describing what each level looks like in answers. "Walks through a specific incident, names what they personally did, names what changed because of it" might be a 4. "Describes a team's work in the abstract; cannot identify their own contribution under probing" might be a 2. Anchored scales sound like bureaucracy. They are the thing that turns three interviewers' opinions into three comparable numbers. Huffcutt and Woehr (1999) showed in their re-analysis that interviewer training, multiple interviewers, and standardised note-taking each independently lift validity. The rubric is half the work; the discipline of using it is the other half.
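One way to keep the stems and anchors from drifting is to hold them as data rather than prose. The sketch below is a minimal illustration, not a prescribed format: the `Competency` class and its field names are assumptions, the second question stem is hypothetical, and the anchors for levels 1, 3 and 5 are placeholders a team would write for itself. Only the level 2 and level 4 anchors and the first stem come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Competency:
    """One scored competency: scripted stems plus an anchor for each level."""
    name: str
    question_stems: tuple[str, ...]
    anchors: dict[int, str]  # level (1-5) -> what an answer at that level looks like

    def __post_init__(self) -> None:
        # The article suggests two or three scripted stems per competency.
        if not 2 <= len(self.question_stems) <= 3:
            raise ValueError(f"{self.name}: expected two or three question stems")
        # An anchored scale needs a description at every level, not just the ends.
        if set(self.anchors) != {1, 2, 3, 4, 5}:
            raise ValueError(f"{self.name}: every level from 1 to 5 needs an anchor")


mistakes_and_learning = Competency(
    name="Mistakes and learning",
    question_stems=(
        "Tell me about a code change you regret shipping.",
        # Hypothetical second stem, for illustration only.
        "Tell me about an incident you caused and what you changed afterwards.",
    ),
    anchors={
        1: "(placeholder: write your own anchor)",
        2: "Describes a team's work in the abstract; cannot identify "
           "their own contribution under probing.",
        3: "(placeholder: write your own anchor)",
        4: "Walks through a specific incident, names what they personally "
           "did, names what changed because of it.",
        5: "(placeholder: write your own anchor)",
    },
)
```

The validation in `__post_init__` is the useful part: a competency with a missing anchor or a single improvised stem fails loudly when the rubric is written, not quietly in the panel meeting.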

STAR as the interviewer's tool, not just the candidate's
Every candidate prepping for an engineering role has read about the STAR method: situation, task, action, result. They will use the scaffold whether the interviewer cues for it or not. The mistake is treating it as a candidate's preparation hack rather than the interviewer's diagnostic.
The diagnostic looks like this. Most candidates lead with situation - the project, the team, what was at stake. Engineers in particular drift from there into architecture detail: the language, the framework, the deployment topology. That is comfortable territory and easy to fill several minutes with. The action - what the candidate personally did - often gets a single sentence, and the result frequently gets none. The behavioural answer that scores well names a specific behaviour and a specific change that resulted from it. The behavioural answer that scores low describes a system, with the candidate hovering somewhere near it.
The interviewer's job is to keep the answer landing on action and result, without leading the candidate to a particular response. Scripted probes are useful because the panel uses them consistently. "What did you specifically do, as opposed to the rest of the team?" "What changed because of your contribution?" "What did you learn that changed how you would handle it next time?" Levashina et al. (2013) treat probing and follow-up as part of structure, not a separate activity from it. Every interviewer using the same probes is part of why structured interviews predict.
The candidate has read every STAR primer on the internet. The defence is not to spring a different format on them. The defence is to score the answer they give against the anchored rubric you wrote in advance, and to keep the probes consistent across the panel so two candidates with similar answers get similar scores.
From question bank to installed practice
A question bank in a Google Doc and a scorecard in a spreadsheet are the easy parts of this. Most engineering teams already have something close to both, drafted by a head of engineering on a quiet Sunday, circulated to a few hiring managers, and used inconsistently within a quarter. The hard part is everyone in the panel running the same structured interview two months from now, with hiring managers who change between rounds and a headcount target that has just doubled. That is where the rubric quietly stops being a rubric, and the panel goes back to "did we like them?".
HireSchool exists to install the practice, not to write the bank. It is a self-guided digital programme called the Structured Hiring Method, delivered as video content plus a learning management system. The customer's team buys access and rolls it out themselves. The LMS records who has been trained, what the standard is, and how every recent panel scored, so the structure does not depend on the head of engineering remembering to chase people.
For an engineering hiring loop, the components that matter most are Leadership Values (the small set of behaviours the company has decided to hire against, which is what the competency rubric actually scores), behavioural interviewing training (the panel learns the technique - how to probe, how to score, how to avoid sensemaking - rather than just inheriting a question list), codified scorecards with anchored ratings, and decision management - the codified process for reaching a well-considered, evidence-based, unbiased call after the panel has scored. Underneath all of it sits First Past the Post: the standard for the role is set in advance, and the team hires the first candidate who meets it. No moving goalposts. No waiting indefinitely while strong candidates take other offers.
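The "standard set in advance, first candidate to meet it" rule can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how the programme implements it: the function names are hypothetical, and the rule that a candidate "meets" the standard when the mean panel score reaches a pre-agreed bar on every competency is an assumption made for the example.

```python
from statistics import mean
from typing import Optional


def meets_standard(panel_scores: dict[str, list[int]],
                   standard: dict[str, float]) -> bool:
    """True if the mean panel score reaches the bar on every competency.

    Assumption: averaging is the aggregation rule. A real panel might
    instead require every interviewer to reach the bar individually.
    """
    for competency, bar in standard.items():
        scores = panel_scores.get(competency, [])
        if not scores or mean(scores) < bar:
            return False
    return True


def first_past_the_post(candidates: list[tuple[str, dict[str, list[int]]]],
                        standard: dict[str, float]) -> Optional[str]:
    """Return the first candidate, in interview order, who meets the standard."""
    for name, panel_scores in candidates:
        if meets_standard(panel_scores, standard):
            return name
    return None


# The bar is fixed before anyone is interviewed (values are illustrative).
standard = {"Ownership": 4.0, "Technical judgement": 3.5}

candidates = [
    ("candidate_1", {"Ownership": [3, 4, 3], "Technical judgement": [4, 4, 5]}),
    ("candidate_2", {"Ownership": [4, 5, 4], "Technical judgement": [4, 3, 4]}),
    ("candidate_3", {"Ownership": [5, 5, 5], "Technical judgement": [5, 5, 5]}),
]

print(first_past_the_post(candidates, standard))  # candidate_2
```

Note that `candidate_3` scores higher but is never compared: the point of the rule is that the bar is fixed in advance, so the decision does not wait on candidates who have not yet been seen.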
The thing this fixes is the slow drift. A head of engineering writes a great rubric, three managers run great panels, and three months later the hiring loop is back to instinct. The LMS is the mechanism that keeps the standard intact across hiring managers and across time. It is not a tool that runs interviews for you; it is the kit your team uses to run the same interview every time.
HireSchool is not a recruiter, not an applicant tracking system, and not a consultancy. There is no embedded HireSchool team sitting in the customer's interview loop. The whole point is that the customer's hiring managers and engineers do the interviewing, to a standard the customer has decided on, supported by a programme that codifies the method.
If this is the next thing the team needs - the structure to make the question bank actually do its job - explore the Structured Hiring Method programme. It is the cheapest way to install consultancy-grade interviewing without paying for consultancy.

Common failure modes and what they sound like
Even with a decent rubric, the same handful of failure modes show up across engineering panels. They are not character defects in the interviewer; they are the predictable drift you get when a structured process meets a real conversation.
The "we" answer is the most common. The candidate describes a team's work in detail and never names what they personally contributed. Often this is modesty; sometimes it is cover. Probe once: "what did you do, as opposed to the rest of the team?" If the candidate cannot or will not say, score the competency low and move on. The same goes for the answer that drifts into architecture detail and never names a single thing the candidate did; the rubric exists precisely so the panel does not score these on charisma.
The hypothetical answer is next. The interviewer asks for a specific past event and the candidate slides into "well, what I would do is...". This is a STAR-prep instinct misfiring under pressure. Reset gently: "I am after a specific time this happened, if there is one." If a real example does not surface, accept the lower score; do not invent a different question to rescue them.
Bad-mouthing the previous team or manager is a different category. It is rarely a calibration issue and usually a values one. A candidate who narrates a conflict by listing the colleague's failings without acknowledging their own contribution to the situation is signalling something useful. Weight it appropriately and note it clearly in the scorecard.
Calibration drift is the panel's own failure mode. Two interviewers score the same answer at 3 and 5 because the anchors on the rating scale were vague. Fix that in the rubric, not in the panel meeting. Anchored scales need to describe the answer at each level concretely enough that two interviewers reading the same transcript would land within one point.
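The "within one point" test is easy to run after each panel. The sketch below is a minimal example with hypothetical names (`calibration_gaps`, the interviewer labels, the sample scores); it assumes scores are collected per competency and per interviewer on the 1-5 scale.

```python
def calibration_gaps(scores: dict[str, dict[str, int]],
                     max_spread: int = 1) -> dict[str, int]:
    """Return competencies whose interviewer scores differ by more than max_spread.

    scores maps competency -> {interviewer: score on the 1-5 scale}.
    The returned dict maps each flagged competency to its score spread.
    """
    gaps = {}
    for competency, by_interviewer in scores.items():
        values = list(by_interviewer.values())
        if len(values) < 2:
            continue  # a single score cannot disagree with itself
        spread = max(values) - min(values)
        if spread > max_spread:
            gaps[competency] = spread
    return gaps


panel_scores = {
    "Ownership and accountability": {"interviewer_a": 4, "interviewer_b": 4},
    "Conflict and disagreement": {"interviewer_a": 3, "interviewer_b": 5},
}

flagged = calibration_gaps(panel_scores)
print(flagged)  # {'Conflict and disagreement': 2}
```

A competency that is flagged repeatedly across candidates points at the anchors for that competency, not at the interviewers, which is where the fix belongs.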
And finally, the interviewer who insists they "got a great read" from a meandering, unstructured answer. Dana, Dawes and Peterson (2013) have a name for this. It is sensemaking, and it is the single biggest reason structured interviews predict and unstructured ones do not.

The interview engineers actually need
The whiteboard tests one part of the role. Behavioural interview questions test the rest, and the evidence on which version of those questions actually predicts has been settled for thirty years. Structure is the predictor. The questions are the instrument structure runs on. A great question bank inside an unstructured chat scores worse than an ordinary one inside a properly anchored rubric.
The shape of a structured behavioural interview for software engineers is not exotic. Five to seven competencies that match the role. Two or three scripted question stems per competency. An anchored 1-5 scale with concrete descriptions at each level. The same probes, used by every interviewer, to keep answers landing on action and result. A scorecard that turns three opinions into three comparable numbers. None of it is glamorous, and that is more or less the point.
What changes when a small engineering team installs this is small at first and large later. Panels start agreeing more often than not. Decisions hold up under review six months on. The new starter at month six looks roughly like the engineer the panel thought they were hiring, and when they do not, the scorecard tells you which competency the panel mis-scored - which is information you can use the next time. No promises about percentages. Better hiring is its own argument.