How planning poker works
Planning poker is a consensus-based estimation technique in which every team member votes at the same time, using cards numbered with a Fibonacci sequence (1, 2, 3, 5, 8, 13, 21). The numbers aren't hours; they're story points: abstract units that combine effort, complexity, and risk.
The process: someone presents a task, the team asks questions, everyone privately picks a card, and all cards are revealed at once. If there's consensus, that's the score. If there's disparity (e.g., a 3 and a 13), the people who played the lowest and highest cards explain their reasoning and the team votes again. The key is the conversation: one dev might see technical complexity others don't, or QA might identify edge cases that increase effort.
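One round of the process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name evaluate_round, the member names, and the rule that "consensus" means every card is identical are all hypothetical choices, not part of any standard tool.

```python
DECK = (1, 2, 3, 5, 8, 13, 21)  # card values named in the text


def evaluate_round(votes):
    """Summarize one simultaneous reveal.

    votes maps a (hypothetical) member name to the card they played.
    Assumption: consensus means every card shows the same value.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    invalid = {name: card for name, card in votes.items() if card not in DECK}
    if invalid:
        raise ValueError(f"cards not in the deck: {invalid}")

    low, high = min(votes.values()), max(votes.values())
    if low == high:
        return {"consensus": True, "score": low, "explain": []}

    # No consensus: whoever played the lowest or highest card explains first,
    # then the team votes again.
    explain = sorted(name for name, card in votes.items() if card in (low, high))
    return {"consensus": False, "score": None, "explain": explain}


print(evaluate_round({"ana": 5, "ben": 5, "cy": 5}))
print(evaluate_round({"ana": 3, "ben": 5, "cy": 13}))
```

In the second call nobody agrees, so the result lists "ana" and "cy" as the people to explain their reasoning before the re-vote.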
Why Fibonacci? The growing gaps reflect growing uncertainty: distinguishing between 1 and 2 points is easy, but distinguishing between 13 and 15 makes no sense because that precision is impossible at that scale. If something seems bigger than 13, you should probably split it into smaller tasks.
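The widening gaps are easy to see by subtracting neighbouring cards. A small sketch, where card_gaps, should_split, and the threshold constant are hypothetical names chosen for illustration, and the threshold of 13 is the rule of thumb from the paragraph above:

```python
DECK = (1, 2, 3, 5, 8, 13, 21)
SPLIT_THRESHOLD = 13  # rule of thumb from the text, not a fixed standard


def card_gaps(deck=DECK):
    """Return the distance between each pair of neighbouring cards."""
    return [b - a for a, b in zip(deck, deck[1:])]


def should_split(points, threshold=SPLIT_THRESHOLD):
    """Flag estimates above the threshold as candidates for splitting."""
    return points > threshold


print(card_gaps())       # [1, 1, 2, 3, 5, 8]
print(should_split(21))  # True
print(should_split(8))   # False
```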
From story points to real hours
Story points aren't hours, but eventually you need to translate them into calendar time. Each team has its own velocity: the points it completes per sprint. If 4 devs complete 40 points in a 2-week sprint (10 working days), the team's velocity is 40 points/sprint, which works out to ~10 points/dev/sprint, or ~1 point/dev/day.
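The velocity arithmetic can be checked with a short sketch. The function name velocity and its parameters are hypothetical, and the number of working days is an explicit input because the per-day figure depends on it: 10 working days in a two-week sprint gives 1 point/dev/day, while counting only 8 effective days (for example, after holidays or ceremonies) gives 1.25.

```python
def velocity(points_completed, devs, working_days):
    """Break a sprint's completed points down per dev and per working day."""
    if devs <= 0 or working_days <= 0:
        raise ValueError("devs and working_days must be positive")
    per_dev_per_sprint = points_completed / devs
    return {
        "team_per_sprint": points_completed,
        "per_dev_per_sprint": per_dev_per_sprint,
        "per_dev_per_day": per_dev_per_sprint / working_days,
    }


print(velocity(40, devs=4, working_days=10))
# {'team_per_sprint': 40, 'per_dev_per_sprint': 10.0, 'per_dev_per_day': 1.0}
print(velocity(40, devs=4, working_days=8)["per_dev_per_day"])  # 1.25
```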
To convert points to hours, use your history. If 1 point usually takes 2-4 hours, 5 points are 10-20 hours (roughly 1.25-2.5 eight-hour days). But beware: this is an average. A 5 in backend might be 12 hours; a 5 in frontend might be 8. Context matters: is there legacy code? Is the team new? Are there external dependencies? Add a buffer accordingly.
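A range-based conversion keeps the uncertainty visible instead of collapsing it into one number. The sketch below is illustrative only: points_to_hours is a hypothetical helper, the 2-4 hours per point default is the example figure from the paragraph above (replace it with your own history), and the 8-hour day and the buffer parameter are assumptions.

```python
def points_to_hours(points, hours_per_point=(2.0, 4.0), hours_per_day=8.0, buffer=0.0):
    """Convert story points into a (low, high) range of hours and working days.

    hours_per_point: historical (low, high) hours one point has taken.
    buffer: extra fraction for context such as legacy code or external
            dependencies, e.g. 0.2 adds 20%.
    """
    low, high = hours_per_point
    factor = 1.0 + buffer
    low_hours = points * low * factor
    high_hours = points * high * factor
    return {
        "hours": (low_hours, high_hours),
        "days": (low_hours / hours_per_day, high_hours / hours_per_day),
    }


print(points_to_hours(5))
# {'hours': (10.0, 20.0), 'days': (1.25, 2.5)}
```

Passing a non-zero buffer widens both ends of the range, which is one simple way to account for legacy code or a new team.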
Common mistake: mechanically converting points to hours ('1 point = 1 hour') ignores that points include testing, code review, refactoring, and meetings. A 3-point task is rarely 3 hours of pure coding; it's 3 points of total effort, which might be 6-10 hours on the calendar once you account for interruptions and overhead.
Factors affecting estimation
Technical complexity is obvious: integrating an external API with poor documentation is more work than copying an existing component. But there are less visible factors: team context (has anyone done something similar?), design quality (clear mockups vs 'make it nice'), requirements clarity (concrete acceptance criteria vs 'something like X').
Risk multiplies effort. A task that depends on an unstable external service might be technically simple but risky: it deserves more points for potential debugging time. The same goes for changes to legacy code without tests: coding is fast, but validating that you didn't break anything takes time.
Team state also counts. Are you in 'normal sprint' mode or 'everyone in calls with stakeholders'? Are there planned vacations? Onboarding someone new who'll do pair programming? A 5-point task can expand to 8 if you know you'll be teaching while doing it. Honesty about these factors improves estimates.
When to review and re-estimate
Estimates aren't immutable contracts. If you start a 5-point task and 2 hours in discover that the API you were going to use is deprecated, stop and re-estimate. It's better to say early 'this will be 8, not 5' than to surprise everyone in review with 'it took double'.
In the retrospective, compare estimates with actual time. If you consistently underestimate, your scale is miscalibrated: what you consider 3 points is actually 5 in your context. Adjust in future sprints. If you always overestimate, you're being too conservative (or you've improved a lot and haven't updated your baseline).
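One way to make that comparison concrete is to divide the hours each task actually took by the hours its estimate implied, then look at the typical ratio. This is a sketch under assumptions: the function names, the baseline of hours per point, the use of the median, and the 20% tolerance are all illustrative choices, and the sample history is made up.

```python
from statistics import median


def calibration_ratio(history, baseline_hours_per_point):
    """Median of actual hours divided by the hours the estimate implied.

    history: list of (estimated_points, actual_hours) tuples.
    A ratio above 1 means tasks took longer than their estimates implied.
    """
    ratios = [
        actual_hours / (points * baseline_hours_per_point)
        for points, actual_hours in history
    ]
    return median(ratios)


def diagnose(ratio, tolerance=0.2):
    """Classify the ratio; the tolerance band is an arbitrary assumption."""
    if ratio > 1 + tolerance:
        return "underestimating"
    if ratio < 1 - tolerance:
        return "overestimating"
    return "calibrated"


# Made-up sprint history: (estimated points, actual hours)
history = [(3, 15.0), (5, 24.0), (2, 9.0)]
ratio = calibration_ratio(history, baseline_hours_per_point=3.0)
print(round(ratio, 2), diagnose(ratio))  # 1.6 underestimating
```

The median is used here so that a single runaway task doesn't dominate the picture; a mean would also work if you prefer outliers to count.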
Tasks that consistently explode (estimated at 5, end up being 13) signal a problem: unclear requirements, hidden technical debt, or scope creep. It's not an estimation problem, it's a definition problem. Use that data in refinement to request more clarity or dedicate a sprint to cleaning the codebase before continuing.
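To bring that data to refinement, it helps to see where the exploding tasks cluster. A minimal sketch, assuming each finished task is recorded as a hypothetical (area, estimated points, final points) tuple and that "exploded" means the final size is at least double the estimate; both the record shape and the factor are assumptions, and the sample tasks are invented.

```python
from collections import Counter


def exploding_areas(tasks, factor=2.0):
    """Count exploded tasks per area, most affected area first.

    tasks: iterable of (area, estimated_points, final_points).
    A task counts as exploded when final >= factor * estimated.
    """
    counts = Counter(
        area
        for area, estimated, final in tasks
        if final >= factor * estimated
    )
    return counts.most_common()


# Invented sample data for illustration
tasks = [
    ("billing", 5, 13),
    ("billing", 3, 8),
    ("search", 5, 5),
    ("auth", 2, 3),
]
print(exploding_areas(tasks))  # [('billing', 2)]
```

An area that keeps appearing at the top of this list is a candidate for clearer requirements or a cleanup sprint, rather than for bigger estimates.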