This page details how the fifth Smash tier list will be created. It is still being edited: items that are subject to change appear in bold.
In this post, I’ll describe how the fifth community-recognized Super Smash Brothers for the Nintendo 64 (NTSC version) character tier list will be built. Because the methodology is posted in advance, the community can be confident that I did not make ad hoc decisions to influence the outcome of the tier list.
A data analysis plan like this one is consistent with efforts in science which push for reproducibility and rigor (see a brief description of reproducibility and rigor here and Samsa and Samsa (2019)).
I was sought out by @Dogs_Johnson. I was a part of the DFW Smash 64 scene, where I was ranked “9” (tag: Accru Fenix), did commentary on occasion, and designed the schedule for Hitstun 6, a 30-person round robin tournament (which is a post for another day). If you’re clicking here from Smash 64 Discord/Smash 64 FB/Smash 64 Twitter/etc, you can read more about my academic research, my Ph.D. in Statistical Science and other credentials, and find my public social media accounts at ryanmcshane.com. In fact, the Smash 64 community is intended to be the primary audience for this post.
The four prior tier lists are collected at ssbwiki. The methodology behind the 13th Melee tier list was suggested as a blueprint. I will detail how this tier list will be created and how my approach differs from that of the 13th Melee tier list.
All four of the preceding tier lists were based on a community vote/survey. While characters are programmed into the game and are static in their capacities, players develop an understanding of the characters and their relationship to other characters in head-to-head matches. As players improve, learn to counter other characters, learn to counter those counters, etc., each character’s place in the hierarchy changes over time (i.e., the “meta” changes). These changes are most profound at the highest levels of the game with the highest level of players*.
Therefore, the best players’ input is most important in understanding the character hierarchy. Furthermore, the best players have a greater understanding of these hierarchies and should therefore have greater input on the tier list than the players rated below them. I discuss my model-based solution to this problem shortly, although other valid approaches exist as well. (See, for example, the disagreement-downweighting method of rating aggregation detailed in Cao and Stokes (2012).)
Players’ input will be delivered in a survey I constructed in Qualtrics, which was approved by Dogs Johnson. I will detail this shortly.
The Smash 64 2021-2022 rankings committee (@SSB64UPR) aggregated results from American competitions they deemed “major” in a Braacket database, and data has continued to be collected since then, including SSC 2023. Braacket has a feature to produce Trueskill rankings (Herbrich, Minka, and Graepel (2007)) given match results.
The Trueskill system is Microsoft’s system for aggregating paired comparisons data in team-based games to estimate individual player ratings. These ratings are then used for matchmaking purposes, and Microsoft seeks to balance teams such that both teams have an approximate 50% chance of winning in an effort to keep matchmaking fun. However, the system also works for ordinary paired comparisons data (e.g., data in which there are one versus one games as in major Smash 64 tournaments). In this regard, it functions similarly to Elo’s method (see Elo (1978)), which is a time-dependent Thurstone-Mosteller model. However, Trueskill is more explicitly Bayesian, like Glicko (see Glickman (2001)). See McShane (2019) for a more detailed introduction to Thurstone-Mosteller, Elo, Glicko, Trueskill, and other paired comparisons methods.
As the Trueskill ratings were a decision element the committee used in creating the player rankings (although not the sole element), I used them as an objective basis for the player weights.
The following players were qualified to respond to the tier list survey: the 2021-2022 ranked players (denoted “A” in the table), the top 80 players on the 2021-2023 list of the aforementioned Trueskill ratings (denoted “B”), anyone who was ranked in 2019-2020, including in the “expansion pack” (denoted “C”), and anyone ranked in the top 30 of the 2020 online Smash 64 rankings (denoted “D”). This led to the following list of 109 qualified respondents, each of whom may have qualified in one to four ways. Shortly, you will see the list of players, including their tags, how they qualified, and their weights.
The weights are simply the Trueskill rating divided by 4000 and rounded to the third decimal place, which gives, for example, the following weights: Kurabba (2.096, the maximum), Isai (2.017), Dogs_Johnson (1.779), K Ruel (1.450), NaCl (1.377), Jimmy Joe (1.073), and Darkhorse (0.907, the minimum). That is, Isai’s response would count nearly twice as much as Jimmy Joe’s, and Jimmy Joe’s would count an infinite number of times more than that of players who do not have a vote, myself included. Players absent from the 2021-2023 Braacket data had their Trueskill rating, and thus their weight, imputed (denoted with an asterisk) by proximity in the 2019-2020 ranking to those who continued competitive play. The average weight is approximately 1.517.
The weight distribution is shown below. This should reflect our general understanding of the player abilities among the top players, where a few (5 to 10) players stand head and shoulders above the rest in the smaller region on the right hand side, and the remainder are more competitive with each other in the larger peak. Meanwhile, there are a few players that are viewed as stronger than their measured Trueskill suggests and are qualified for the survey by inclusion in rankings.
Note: for comparison, three alternative weighting schemes (monotonic transforms of the weighting scheme described above) are included in the table as well. The weights are about to change with the updated Trueskill ratings.
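As a minimal sketch of the weight arithmetic described above: the rating-to-weight rule and the three transforms (square root, 2/3 power, 3/4 power) come from this post, while the function names and the rating value 8384 are hypothetical and used only for illustration.

```python
def weight_from_rating(rating, divisor=4000):
    """Weight = Trueskill rating / 4000, rounded to three decimals."""
    return round(rating / divisor, 3)


def alternative_weights(weight):
    """The three monotonic transforms shown as extra columns in the table."""
    return tuple(round(weight ** power, 3) for power in (1 / 2, 2 / 3, 3 / 4))


# Hypothetical rating, chosen only so that it maps to the maximum weight.
example_weight = weight_from_rating(8384)
print(example_weight, alternative_weights(example_weight))

# How much more Isai's ballot counts than Jimmy Joe's under the base weights.
print(round(2.017 / 1.073, 2))
```

Each transform compresses the spread between the top and bottom weights while preserving the ordering of players.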
Player | Qualification | Weight | \(\sqrt{\text{Weight}}\) | \(\text{Weight}^\frac{2}{3}\) | \(\text{Weight}^\frac{3}{4}\) |
---|---|---|---|---|---|
Kurabba | AB | 2.096 | 1.448 | 1.638 | 1.742 |
kysk | C | 2.019* | 1.421* | 1.597* | 1.694* |
SuPeRbOoMfAn | C | 2.019* | 1.421* | 1.597* | 1.694* |
Isai | ABC | 2.017 | 1.420 | 1.596 | 1.693 |
Nax | AB | 1.943 | 1.394 | 1.557 | 1.646 |
Prince | ABC | 1.889 | 1.374 | 1.528 | 1.611 |
wario | AB | 1.888 | 1.374 | 1.528 | 1.611 |
Alvin | ABC | 1.875 | 1.369 | 1.521 | 1.602 |
KD3 | ABC | 1.873 | 1.369 | 1.519 | 1.601 |
JaimeHR | ABC | 1.870 | 1.367 | 1.518 | 1.599 |
Shihman | ABC | 1.864 | 1.365 | 1.515 | 1.595 |
Hero Pie | ABC | 1.857 | 1.363 | 1.511 | 1.591 |
KeroKeroppi | ABC | 1.842 | 1.357 | 1.503 | 1.581 |
Lowww | ABC | 1.817 | 1.348 | 1.489 | 1.565 |
Wizzrobe | ABC | 1.793 | 1.339 | 1.476 | 1.549 |
Jam | C | 1.792* | 1.339* | 1.475* | 1.549* |
Josh Brody | ABC | 1.792 | 1.339 | 1.475 | 1.549 |
Zero | AB | 1.781 | 1.335 | 1.469 | 1.542 |
Dogs_Johnson | ABC | 1.779 | 1.334 | 1.468 | 1.540 |
Joshi | AB | 1.736 | 1.318 | 1.444 | 1.512 |
Mercy | ABC | 1.732 | 1.316 | 1.442 | 1.510 |
Sleepy Fox | ABC | 1.657 | 1.287 | 1.400 | 1.460 |
Robert | ABC | 1.650 | 1.285 | 1.396 | 1.456 |
Livin La Fetus Loca | ABC | 1.645 | 1.283 | 1.394 | 1.453 |
baby caweb | AB | 1.642 | 1.281 | 1.392 | 1.451 |
Finio | ABC | 1.636 | 1.279 | 1.388 | 1.447 |
Take | AB | 1.633 | 1.278 | 1.387 | 1.445 |
FranK | ABC | 1.620 | 1.273 | 1.379 | 1.436 |
tacos | ABC | 1.613 | 1.270 | 1.375 | 1.431 |
Revan | ABC | 1.597 | 1.264 | 1.366 | 1.421 |
Fray | AB | 1.594 | 1.263 | 1.365 | 1.419 |
Hotline | ABC | 1.589 | 1.261 | 1.362 | 1.415 |
SheerMadness | ABC | 1.564 | 1.251 | 1.347 | 1.399 |
B33F | ABC | 1.561 | 1.249 | 1.346 | 1.397 |
Freean | AB | 1.550 | 1.245 | 1.339 | 1.389 |
CTG | ABC | 1.543 | 1.242 | 1.335 | 1.384 |
Wolf | ABC | 1.542 | 1.242 | 1.335 | 1.384 |
Quincy | AB | 1.532 | 1.238 | 1.329 | 1.377 |
Janitor | C | 1.530* | 1.237* | 1.328* | 1.376* |
SKG | C | 1.530* | 1.237* | 1.328* | 1.376* |
Killer | ABC | 1.530 | 1.237 | 1.328 | 1.376 |
dboss | C | 1.528* | 1.236* | 1.327* | 1.374* |
Spongy | ABC | 1.527 | 1.236 | 1.326 | 1.374 |
Raychu | ABC | 1.526 | 1.235 | 1.325 | 1.373 |
Bo | ABC | 1.525 | 1.235 | 1.325 | 1.372 |
KrisKringle | AB | 1.518 | 1.232 | 1.321 | 1.368 |
Shalaka | C | 1.515* | 1.231* | 1.319* | 1.366* |
JPX | AB | 1.515 | 1.231 | 1.319 | 1.366 |
OJ | AB | 1.515 | 1.231 | 1.319 | 1.366 |
emptyW | ABC | 1.514 | 1.230 | 1.319 | 1.365 |
Janco | AB | 1.508 | 1.228 | 1.315 | 1.361 |
Paco | ABC | 1.508 | 1.228 | 1.315 | 1.361 |
epad10 | AB | 1.506 | 1.227 | 1.314 | 1.359 |
Crovy | AB | 1.503 | 1.226 | 1.312 | 1.357 |
Stevie G | AB | 1.491 | 1.221 | 1.305 | 1.349 |
lord narwhal | AB | 1.488 | 1.220 | 1.303 | 1.347 |
Box | ABC | 1.483 | 1.218 | 1.300 | 1.344 |
Wookiee | AB | 1.482 | 1.217 | 1.300 | 1.343 |
EG | ABC | 1.477 | 1.215 | 1.297 | 1.340 |
SOMBRERO | C | 1.475* | 1.214* | 1.296* | 1.338* |
Weedwack | C | 1.475* | 1.214* | 1.296* | 1.338* |
Loto | ABC | 1.475 | 1.214 | 1.296 | 1.338 |
Combo Blaze | ABC | 1.471 | 1.213 | 1.293 | 1.336 |
cobr | C | 1.470* | 1.212* | 1.293* | 1.335* |
Madrush | ABC | 1.470 | 1.212 | 1.293 | 1.335 |
SSBAfro | AB | 1.464 | 1.210 | 1.289 | 1.331 |
Blondekid | ABC | 1.462 | 1.209 | 1.288 | 1.330 |
Fireblaster | BC | 1.460 | 1.208 | 1.287 | 1.328 |
waxy:joe | ABC | 1.458 | 1.207 | 1.286 | 1.327 |
Andykins | AB | 1.452 | 1.205 | 1.282 | 1.323 |
K Ruel | AB | 1.450 | 1.204 | 1.281 | 1.321 |
Ranryoku | AB | 1.438 | 1.199 | 1.274 | 1.313 |
antwon420 | AB | 1.432 | 1.197 | 1.270 | 1.309 |
Huntsman | AB | 1.423 | 1.193 | 1.265 | 1.303 |
SOTO | ABC | 1.420 | 1.192 | 1.263 | 1.301 |
Da_Bear | AB | 1.418 | 1.191 | 1.262 | 1.299 |
Jay-R | C | 1.413* | 1.189* | 1.259* | 1.296* |
lordtoko | C | 1.413* | 1.189* | 1.259* | 1.296* |
BARD | BC | 1.412 | 1.188 | 1.259 | 1.295 |
MasterHandJob | AB | 1.412 | 1.188 | 1.259 | 1.295 |
Dr. Grin | B | 1.403 | 1.184 | 1.253 | 1.289 |
Sonjo | C | 1.401* | 1.184* | 1.252* | 1.288* |
Marbles | AB | 1.396 | 1.182 | 1.249 | 1.284 |
Miniohh! | AB | 1.394 | 1.181 | 1.248 | 1.283 |
Shears | ABC | 1.386 | 1.177 | 1.243 | 1.277 |
Nackle | ABC | 1.377 | 1.173 | 1.238 | 1.271 |
Schmerka Berl | B | 1.377 | 1.173 | 1.238 | 1.271 |
NewbTube | AB | 1.354 | 1.164 | 1.224 | 1.255 |
YBOMBB | C | 1.335* | 1.155* | 1.212* | 1.242* |
Traiman | B | 1.333 | 1.155 | 1.211 | 1.241 |
Papa louie | B | 1.332 | 1.154 | 1.211 | 1.240 |
Isildur1 | AB | 1.325 | 1.151 | 1.206 | 1.235 |
bloogo | B | 1.321 | 1.149 | 1.204 | 1.232 |
Czar | ABC | 1.316 | 1.147 | 1.201 | 1.229 |
Big Red | C | 1.315* | 1.147* | 1.200* | 1.228* |
D35 | C | 1.315* | 1.147* | 1.200* | 1.228* |
Gravyfingers | C | 1.315* | 1.147* | 1.200* | 1.228* |
The Yid | AB | 1.312 | 1.145 | 1.198 | 1.226 |
Yobolight | BC | 1.277 | 1.130 | 1.177 | 1.201 |
Dr. Sauce | AB | 1.273 | 1.128 | 1.175 | 1.198 |
Qapples | BC | 1.270 | 1.127 | 1.173 | 1.196 |
Mando | BC | 1.242 | 1.114 | 1.155 | 1.176 |
Razz | BC | 1.215 | 1.102 | 1.139 | 1.157 |
Dankey Kang | ABC | 1.114 | 1.055 | 1.075 | 1.084 |
LETSGO | BC | 1.098 | 1.048 | 1.064 | 1.073 |
Jimmy Joe | BC | 1.073 | 1.036 | 1.048 | 1.054 |
Roman | BC | 1.028 | 1.014 | 1.019 | 1.021 |
Dishier Wand | BC | 0.986 | 0.993 | 0.991 | 0.989 |
Darkhorse | BC | 0.907 | 0.952 | 0.937 | 0.929 |
Not all Smash 64 players view Smash 64 Remix favorably, and those who do not may not be inclined to provide a useful response to a Smash Remix character ranking. In an effort to include only players who understand the Smash Remix cast, only some survey participants will be presented with a Smash Remix ballot. Players who were qualified to respond to the Smash 64 character ranking (and thus invited to the survey) will only be qualified to respond to the Smash 64 Remix survey if they participated in a Smash Remix tournament listed in the Smash Remix database.
NOTE: the Remix database, as of this writing, goes to January 2022. Remix tournaments have taken place since then. The Remix weights are still under discussion.
The survey itself was implemented in Qualtrics, which is production-quality survey software to which my institution (The University of Chicago) subscribes. Notably, it includes a drag-and-drop question type for creating rankings visually, which allows for quick self-checking by the voter.
The survey is broken into three sections: an introduction and self-identification section, a Smash 64 rankings section, and a Smash 64 Remix rankings section.
The first section includes measures to ensure that each survey-taker is uniquely identified, that a survey can only be taken by that individual, and that each survey participant may only submit the form once.
The second section asks participants to rank the Smash 64 characters from 1 to 12 (with no ties allowed). The ordering the participants are presented with is randomly generated, and will in all likelihood not resemble an ordering they agree with. This is to prevent bias from being introduced into the question. Specifically, we are concerned that if all participants are presented with the previous tier list’s ordering, they may be more likely to simply agree with it and move forward. While the most vocal members of the Smash 64 community are relatively entrenched in the previous tier list, the purpose of this exercise is to create a new tier list, which may or may not agree with the previous tier list.
Participants are then given the opportunity to provide verbal feedback about their rankings as well as rate their confidence in their rankings. Neither of these questions will impact the way their input is weighted or considered when calculating the average rankings and tier list, but may be used to inform the committee about changes to the procedure should there be a sixth tier list.
The third and final section asks participants to rank the Smash 64 Remix characters from 1 to 17 (with no ties allowed), again with a randomly generated initial ordering. Participants are afforded the same opportunity to provide feedback about their response as in the previous section. They are also permitted to ask that their response be excluded entirely. The confidence question may be used to up-weight or down-weight their Remix response, in addition to the amount of agreement their Smash 64 response had with the resulting tier list.
Participants will not be asked to compare Smash Remix and Smash 64 characters simultaneously.
The survey responses will be available upon request from Dogs_Johnson after the tier list has been completed. Results will be de-identified and weights will be removed (so that no respondent is identifiable by look-up). Thus, the aggregated ratings and tier list will not be entirely replicable.
Again, the calculation procedure was modeled after that of the 13th Melee tier list. However, unlike in that survey, no ties are allowed from the respondents.
What is the weighted trimmed mean? It is a weighted mean computed after discarding a fixed fraction of the total weight from each end of the sorted votes. For example, suppose we had 10 equally weighted votes whose weights summed to 10. For the 5% trimmed mean, the lowest and highest votes would each have their weight cut in half. For the 10% trimmed mean, we would drop the highest and lowest votes entirely.
Consider the following example of three rank votes for a single character, where we have differing weights.
Vote | Voter_Weight |
---|---|
3 | 1.524 |
4 | 2.345 |
6 | 0.935 |
The weighted mean would be \[\frac{3 \cdot 1.524 + 4 \cdot 2.345 + 6 \cdot 0.935}{1.524 + 2.345 + 0.935} = \frac{19.562}{4.804} \approx 4.072.\] Note here that the sum of the weights is 4.804. The 5% weighted trimmed mean would be \[\frac{3 \cdot (1.524 - 4.804 \cdot 0.05) + 4 \cdot 2.345 + 6 \cdot (0.935 - 4.804 \cdot 0.05)}{1.524 + 2.345 + 0.935 - 4.804 \cdot 0.10} \approx 4.024.\] The result is not substantially different, but it down-weights the “extreme” votes, and is more resistant to outliers.
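The three-vote example above can be sketched in code. This is a minimal illustration, not the production calculation: the function name and its signature are my own choices, and it assumes that trimming removes the stated fraction of the total weight from each end of the sorted votes, partially reducing a vote's weight where the cut falls inside it.

```python
def weighted_trimmed_mean(votes, weights, trim=0.05):
    """Weighted mean after removing `trim` of the total weight from each end."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    pairs = sorted(zip(votes, weights))      # order the ballots by vote value
    values = [v for v, _ in pairs]
    kept = [float(w) for _, w in pairs]      # weight each vote still carries
    cut = trim * sum(kept)                   # weight to remove from each end

    def shave(indices):
        remaining = cut
        for i in indices:
            if remaining <= 0:
                break
            removed = min(kept[i], remaining)
            kept[i] -= removed
            remaining -= removed

    shave(range(len(kept)))                  # trim starting at the lowest vote
    shave(range(len(kept) - 1, -1, -1))      # trim starting at the highest vote
    return sum(v * w for v, w in zip(values, kept)) / sum(kept)


votes = [3, 4, 6]
weights = [1.524, 2.345, 0.935]
print(round(weighted_trimmed_mean(votes, weights, trim=0.0), 3))   # plain weighted mean
print(round(weighted_trimmed_mean(votes, weights, trim=0.05), 3))  # 5% trimmed
```

With ten equally weighted votes, `trim=0.05` halves the weight of the lowest and highest votes and `trim=0.10` drops them entirely, matching the description above.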
The resulting values for each character (1 for top, 12 for bottom, etc.) will be averaged using a weighted 5% trimmed mean. In this case, our weights (if everyone responds) sum to 165.314. Then, for each character, the lowest and highest 165.314 \(\cdot\) 0.05 = 8.2657 units of vote weight will be removed; the two votes whose weights straddle the 5th and 95th percentiles will have their weights reduced so that exactly 8.2657 is removed from the bottom and from the top.
The 5% weighted trimmed mean of voter-supplied ranks is calculated for every character. Then, the characters are ordered from low (best rank) to high (worst rank). As in the previous two tier lists, the average rating will be reported for each character.
Please read my introduction to \(k\)-means clustering here (with Smash-relevant examples!). \(k\)-means clustering will be performed on the overall ratings, of which there will be 12. As in the 13th Melee tier list, the number of tiers that indicates “reasonable” spacing between tiers will be used. However, instead of doing this “by eye,” an objective measure, BIC, will be minimized to select the best \(k\), the number of clusters. The following study and simulation demonstrate how and why that is the best approach.
Consider the last tier list:
If we use \(k\)-means clustering with \(k = 2\) through \(k = 7\) clusters, the resulting tierings are in the following columns. Note that the official tier list uses four tiers (\(k = 4\) column below). We intuit that \(k = 2\) is too few clusters (tiers), as Mario and Jigglypuff are in the same tier as Pikachu. Meanwhile, we can intuit that \(k = 7\) is too many tiers, as Captain Falcon, Fox, Yoshi, and Luigi each have their own tier.
rank | character | rating | \(2\) | \(3\) | \(4\) | \(5\) | \(6\) | \(7\) |
---|---|---|---|---|---|---|---|---|
1 | Pikachu | 1.10 | S | S | S | S | S | S |
2 | Kirby | 2.18 | S | S | S | S | S | S |
3 | Captain Falcon | 3.42 | S | S | A | S | S | A |
4 | Fox | 3.75 | S | S | A | S | S | B |
5 | Yoshi | 4.85 | S | A | A | A | A | C |
6 | Jigglypuff | 6.46 | S | A | B | A | A | D |
7 | Mario | 6.49 | S | A | B | A | A | D |
8 | Samus | 9.28 | A | B | C | B | B | E |
9 | Donkey Kong | 9.49 | A | B | C | B | C | E |
10 | Ness | 10.01 | A | B | C | C | D | E |
11 | Link | 10.33 | A | B | C | C | D | E |
12 | Luigi | 11.67 | A | B | C | D | E | F |
So, how do we match our intuition with an objective measure? We can use criteria for model selection. In particular, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are the most popular, and both are easily calculable in R for \(k\)-means clustering. Let’s compare AIC and BIC for the ratings above.
We can see that both AIC and BIC are minimized when \(k = 4\), suggesting this is the optimal number of clusters for the fourth tier list, which is indeed what the intuitive approach suggested. Since BIC is more punitive for too many clusters, this is the model selection criterion I will use. Note: AIC should only be compared to itself and BIC should only be compared to itself.
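As a rough check of this comparison, the sketch below scores the tierings tabulated above. It rests on an assumption the post does not state: that AIC and BIC take the form commonly used with \(k\)-means output, \(D + 2mk\) and \(D + \ln(n)\,mk\), where \(D\) is the total within-cluster sum of squares, \(m = 1\) is the dimension, and \(n = 12\). The tier assignments are copied from the table rather than re-fitted, since \(k\)-means results can depend on the starting configuration.

```python
import math

# Ratings from the fourth tier list, in rank order.
ratings = [1.10, 2.18, 3.42, 3.75, 4.85, 6.46, 6.49, 9.28, 9.49, 10.01, 10.33, 11.67]

# Tier labels per character, copied column by column from the table above.
tiers = {
    2: "SSSSSSSAAAAA",
    3: "SSSSAAABBBBB",
    4: "SSAAABBCCCCC",
    5: "SSSSAAABBCCD",
    6: "SSSSAAABCDDE",
    7: "SSABCDDEEEEF",
}


def within_ss(values, labels):
    """Total within-cluster sum of squares for a given labelling."""
    total = 0.0
    for tier in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == tier]
        center = sum(members) / len(members)
        total += sum((v - center) ** 2 for v in members)
    return total


def information_criteria(values, labels):
    """Return (AIC, BIC) under the assumed D + 2mk and D + ln(n)mk forms, m = 1."""
    n, k = len(values), len(set(labels))
    d = within_ss(values, labels)
    return d + 2 * k, d + math.log(n) * k


scores = {k: information_criteria(ratings, labels) for k, labels in tiers.items()}
best_aic = min(scores, key=lambda k: scores[k][0])
best_bic = min(scores, key=lambda k: scores[k][1])
for k, (aic, bic) in scores.items():
    print(k, round(aic, 2), round(bic, 2))
print("AIC picks k =", best_aic, "and BIC picks k =", best_bic)
```

Under these assumptions both criteria are smallest at \(k = 4\) for the tabulated tierings, in line with the conclusion above.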
So, what if we simulate survey results from the prior tier list? We could simulate each rating from a \(N(\mu = \bar x_i, \sigma = 3\hat \sigma)\) distribution, where \(\bar x_i\) is the observed rating for each character \(i\), and \(\hat \sigma\) is the median absolute deviation of \((\bar x_i - i)\), which is approximately 0.55. That is, each voter disagrees somewhat with the previous result, but not by much. We then extract ranks from each voter’s underlying simulated character ratings. We could produce \(N = 109\) surveys in this way (the number we hope to have), find the character rating means, and run the above BIC-versus-\(k\) study on them, repeating the whole procedure \(M = 100\) times. Doing so, we get the following result. We see that \(k = 4\) is a reasonable choice in most cases where surveys have the same general result as in the previous tier list. This gives further support to having used \(k = 4\) in the previous tier list.
And what if we repeat this process but draw every initial character rating from \(N(\mu = i, \sigma = 3\hat \sigma)\)? That is, there is a perfect true ranking which is contaminated slightly by noise. In doing so, we get the following result, which says that \(k = 4\) would generally be the best choice in that situation.
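One round of the simulation described above might look like the following sketch. Several details are assumptions on my part rather than statements from the post: the median absolute deviation is scaled by the usual normal-consistency constant 1.4826 (which gives a value close to the quoted 0.55), the seed is arbitrary, and the simulated ballots are averaged without weights.

```python
import random
import statistics

# Ratings from the fourth tier list, in rank order (character i has rank i).
prior = [1.10, 2.18, 3.42, 3.75, 4.85, 6.46, 6.49, 9.28, 9.49, 10.01, 10.33, 11.67]


def scaled_mad(xs, constant=1.4826):
    """Median absolute deviation, scaled to estimate a normal standard deviation."""
    center = statistics.median(xs)
    return constant * statistics.median([abs(x - center) for x in xs])


sigma_hat = scaled_mad([x - i for i, x in enumerate(prior, start=1)])


def simulate_ballot(means, sd, rng):
    """Draw one voter's latent ratings and convert them to ranks 1..n."""
    latent = [rng.gauss(m, sd) for m in means]
    order = sorted(range(len(latent)), key=lambda j: latent[j])
    ranks = [0] * len(latent)
    for rank, j in enumerate(order, start=1):
        ranks[j] = rank
    return ranks


rng = random.Random(64)  # arbitrary seed, for repeatability only
ballots = [simulate_ballot(prior, 3 * sigma_hat, rng) for _ in range(109)]
mean_ranks = [sum(b[j] for b in ballots) / len(ballots) for j in range(len(prior))]
print(round(sigma_hat, 3))
print([round(r, 2) for r in mean_ranks])
```

The resulting `mean_ranks` would then be clustered for each candidate \(k\) and scored, and the whole round repeated \(M = 100\) times.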
While the truth and the future survey results are unknown, BIC will be a reasonable approach to automatically selecting the number of tiers for the tier list.
The Smash 64 Remix rankings will be produced in a similar way to the Smash 64 tier list, with the exception that weights derived exclusively from Smash 64 rankings or ratings will not be used, as these are not necessarily representative of Smash 64 Remix player ability.
The following players that are qualified for the Smash 64 character rankings are not currently qualified for Smash Remix character rankings: Alvin, BARD, Big Red, Bo, Chars, cobr, D35, Da_Bear, Dankey Kang, dboss, Dino, Dishier Wand, Dr. Grin, Dr. Sauce, eL maN, Elias_YFGM, Fireblaster, Gravyfingers, Hero Pie, Huntsman, Jam, Janitor, Jay-R, Jimmy Joe, Joshi, K Ruel, K.O.Ken, Killer, Kimimaru, kix, Kurabba, kysk, LETSGO, Lord Narwhal, lordtoko, Madrush, maha, Miniohh!, mrsir, Nax, Prince, Qapples, Quincy, rainshifter, Ranryoku, Raychu, ReefyBeefy, Robert, Shalaka, SOMBRERO, Sonjo, SOTO, Spongy, SSBAfro, SuPeRbOoMfAn, Take, wario, Wizzrobe, YBOMBB, Yobolight, Zero, Zuber. Please let me know if they are in the Remix database, but under a different tag.
Results will be presented in a separate post. If there are deviations from this data analysis plan, they will be described.
Field | 3 | 4 | 5 |
---|---|---|---|
resp | cTSU | Qqzv | Rkat |
weights | 1.567 | 2.132 | 1.134 |
S1 | Pikachu | Pikachu | Pikachu |
S2 | Kirby | Captain Falcon | Captain Falcon |
S3 | Captain Falcon | Kirby | Fox |
S4 | Yoshi | Fox | Yoshi |
S5 | Fox | Yoshi | Kirby |
S6 | Jigglypuff | Jigglypuff | Jigglypuff |
S7 | Mario | Mario | Mario |
S8 | Donkey Kong | Samus | Donkey Kong |
S9 | Samus | Donkey Kong | Samus |
S10 | Ness | Ness | Ness |
S11 | Link | Link | Link |
S12 | Luigi | Luigi | Luigi |
char | rating |
---|---|
Pikachu | 1.00 |
Captain Falcon | 2.30 |
Kirby | 3.11 |
Fox | 4.10 |
Yoshi | 4.43 |
Jigglypuff | 6.00 |
Mario | 7.00 |
Donkey Kong | 8.43 |
Samus | 8.57 |
Ness | 10.00 |
Link | 11.00 |
Luigi | 12.00 |
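The two tables above read as a small worked example: three anonymized ballots with their weights, followed by per-character ratings. As a sketch, assuming each rating is the 5% weighted trimmed mean of that character's three ranks, the following turns the ballots into ratings and an ordering. The helper name and data layout are illustrative choices, not the code used for the actual tier list.

```python
# (weight, ballot from rank 1 to rank 12) for each sample respondent.
ballots = {
    "cTSU": (1.567, ["Pikachu", "Kirby", "Captain Falcon", "Yoshi", "Fox", "Jigglypuff",
                     "Mario", "Donkey Kong", "Samus", "Ness", "Link", "Luigi"]),
    "Qqzv": (2.132, ["Pikachu", "Captain Falcon", "Kirby", "Fox", "Yoshi", "Jigglypuff",
                     "Mario", "Samus", "Donkey Kong", "Ness", "Link", "Luigi"]),
    "Rkat": (1.134, ["Pikachu", "Captain Falcon", "Fox", "Yoshi", "Kirby", "Jigglypuff",
                     "Mario", "Donkey Kong", "Samus", "Ness", "Link", "Luigi"]),
}


def trimmed_mean(votes, weights, trim=0.05):
    """Weighted mean after removing `trim` of the total weight from each end."""
    pairs = sorted(zip(votes, weights))
    kept = [w for _, w in pairs]
    cut = trim * sum(kept)
    for order in (range(len(kept)), reversed(range(len(kept)))):
        remaining = cut
        for i in order:
            removed = min(kept[i], remaining)
            kept[i] -= removed
            remaining -= removed
    return sum(v * w for (v, _), w in zip(pairs, kept)) / sum(kept)


weights = [w for w, _ in ballots.values()]
characters = ballots["cTSU"][1]
ratings = {}
for name in characters:
    # A character's vote from each respondent is its 1-based position on the ballot.
    ranks = [ballot.index(name) + 1 for _, ballot in ballots.values()]
    ratings[name] = round(trimmed_mean(ranks, weights), 2)

ordering = sorted(ratings, key=ratings.get)  # low rating = better rank
for name in ordering:
    print(name, ratings[name])
```

Sorting by rating gives the order in which the characters are listed in the ratings table.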