Smash 64 Tier List Data Analysis Plan

Smash 64 k-means clustering survey consulting

These are the details on how the fifth Smash 64 tier list is to be created. This page is in the process of being edited: items that may still change are in bold.

Ryan McShane https://ryanmcshane.com
2023-08-19

Background

In this post, I’ll be describing how the fifth community-recognized Super Smash Brothers for the Nintendo 64 (NTSC version) character tier list will be built. By posting the methodology in advance, the community will know that I did not make ad hoc decisions influencing the outcome of the tier list.

A data analysis plan like this one is consistent with efforts in science which push for reproducibility and rigor (see a brief description of reproducibility and rigor here and Samsa and Samsa (2019)).

I was sought out by @Dogs_Johnson to consult on this tier list. I was a part of the DFW Smash 64 scene, where I was ranked 9th (tag: Accru Fenix), did commentary on occasion, and designed the schedule for Hitstun 6, a 30-person round-robin tournament (a post for another day). If you’re clicking here from the Smash 64 Discord/Facebook/Twitter/etc., you can read more about my academic research, my Ph.D. in Statistical Science and other credentials, and find my public social media accounts at ryanmcshane.com. In fact, the Smash 64 community is intended to be the primary audience for this post.

The four prior tier lists are collected at ssbwiki. The methodology behind the 13th Melee tier list was suggested as a blueprint. I will detail how this tier list will be created and how my approach differs from that of the 13th Melee tier list.

Data Collection

All four of the preceding tier lists were based on a community vote/survey. While characters are programmed into the game and are static in their capacities, players develop an understanding of the characters and their relationships to other characters in head-to-head matches. As players improve, learn to counter other characters, learn to counter those counters, etc., each character’s place in the hierarchy changes over time (i.e., the “meta” changes). These changes are most profound at the highest levels of the game with the highest-level players.

Therefore, the best players’ input is most important in understanding the character hierarchy. Furthermore, the best players have a greater understanding of these hierarchies and should therefore have greater input on the tier list than the players rated below them. I discuss my model-based solution to this problem shortly, although other valid approaches exist as well. (See, for example, the disagreement-downweighting method of rating aggregation detailed in Cao and Stokes (2012)).

Players’ input will be delivered in a survey I constructed in Qualtrics, which was approved by Dogs Johnson. I will detail this shortly.

Survey Participation

The Smash 64 2021-2022 rankings committee (@SSB64UPR) aggregated results from American competitions they deemed “major” in a Braacket database, and the data has continued to be collected, including results from SSC 2023. Braacket has a feature to produce Trueskill rankings (Herbrich, Minka, and Graepel (2007)) given match results.

Aside on Trueskill

The Trueskill system is Microsoft’s system for aggregating paired comparisons data in team-based games to estimate individual player ratings. These ratings are then used for matchmaking purposes, and Microsoft seeks to balance teams such that both teams have an approximate 50% chance of winning in an effort to keep matchmaking fun. However, the system also works for ordinary paired comparisons data (e.g., one-versus-one matches, as in major Smash 64 tournaments). In this regard, it functions similarly to Elo’s method (see Elo (1978)), which is a time-dependent Thurstone-Mosteller model. However, Trueskill is more explicitly Bayesian, like Glicko (see Glickman (2001)). See McShane (2019) for a more detailed introduction to Thurstone-Mosteller, Elo, Glicko, Trueskill, and other paired comparisons methods.
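To make the “paired comparisons” idea concrete, here is a minimal Elo-style rating update in R. The 400-point scale and \(K = 32\) are conventional Elo choices used purely for illustration; they are not parameters of Braacket’s Trueskill implementation.

```r
# Minimal Elo-style update: the winner takes rating points from the loser,
# with the transfer sized by how surprising the result was.
elo_update <- function(r_winner, r_loser, K = 32) {
  expected_win <- 1 / (1 + 10^((r_loser - r_winner) / 400))
  delta <- K * (1 - expected_win)
  c(winner = r_winner + delta, loser = r_loser - delta)
}

elo_update(1600, 1500)  # the favorite wins: small rating change
elo_update(1500, 1600)  # the underdog wins: larger rating change
```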

Survey Participation (resumed)

As the Trueskill ratings were a decision element the committee used in creating the player rankings (although not the sole element), I used them as an objective measure from which to derive the player weights.

The 2021-2022 ranked players (denoted “A” in the table), the top 80 players on this 2021-2023 list of the aforementioned input Trueskill ratings (denoted “B” in the table), anyone ranked in 2019-2020, including in the “expansion pack” (denoted “C” in the table), and anyone ranked in the top 30 of the 2020 online Smash 64 rankings (denoted “D” in the table) were qualified to respond to the tier list survey. This led to the following list of 109 qualified respondents, each of whom may have qualified in one to four ways. The table below lists the players, including their tags, how they qualified, and their weights.

The weights are simply the Trueskill rating divided by 4000 and rounded to the third decimal place, which gives, for example, the following weights: Kurabba (2.096 – maximum), Isai (2.017), Dogs_Johnson (1.779), K Ruel (1.450), NaCl (1.377), Jimmy Joe (1.073), Darkhorse (0.907 – minimum). That is, Isai’s response would count nearly twice as much as Jimmy Joe’s, and Jimmy Joe’s would count infinitely more than that of players who do not have a vote, myself included. Players absent from the 2021-2023 Braacket data had their Trueskill, and thus their weight, imputed (denoted with an asterisk) by proximity in the 2019-2020 ranking to those who continued competitive play. The average weight is approximately 1.517.
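As a concrete sketch, with Trueskill values backed out from the example weights above (so they are illustrative, not the actual Braacket ratings), the weights and the alternative transforms shown in the table can be computed as:

```r
# Illustrative Trueskill ratings, backed out from the example weights above
trueskill <- c(Kurabba = 8384, Isai = 8068, Dogs_Johnson = 7116,
               `Jimmy Joe` = 4292, Darkhorse = 3628)

weight <- round(trueskill / 4000, 3)  # weight = Trueskill / 4000, to 3 decimal places

# Alternative (monotonic-transform) weighting schemes shown in the table below
round(cbind(weight,
            sqrt_weight = sqrt(weight),
            weight_2_3  = weight^(2/3),
            weight_3_4  = weight^(3/4)), 3)
```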

The weight distribution is shown below. This should reflect our general understanding of player abilities among the top players: a few (5 to 10) players stand head and shoulders above the rest in the smaller region on the right-hand side, while the remainder are more competitive with each other in the larger peak. Meanwhile, there are a few players who are viewed as stronger than their measured Trueskill suggests and who qualified for the survey by inclusion in the rankings.

Note: for comparison, three alternative weighting schemes (monotonic transforms of the weighting scheme described above) are included in the table as well. The weights are subject to change with the updated Trueskill ratings.

Player Qualification Weight \(\sqrt{\text{Weight}}\) \(\text{Weight}^\frac{2}{3}\) \(\text{Weight}^\frac{3}{4}\)
Kurabba AB 2.096 1.448 1.638 1.742
kysk C 2.019* 1.421* 1.597* 1.694*
SuPeRbOoMfAn C 2.019* 1.421* 1.597* 1.694*
Isai ABC 2.017 1.420 1.596 1.693
Nax AB 1.943 1.394 1.557 1.646
Prince ABC 1.889 1.374 1.528 1.611
wario AB 1.888 1.374 1.528 1.611
Alvin ABC 1.875 1.369 1.521 1.602
KD3 ABC 1.873 1.369 1.519 1.601
JaimeHR ABC 1.870 1.367 1.518 1.599
Shihman ABC 1.864 1.365 1.515 1.595
Hero Pie ABC 1.857 1.363 1.511 1.591
KeroKeroppi ABC 1.842 1.357 1.503 1.581
Lowww ABC 1.817 1.348 1.489 1.565
Wizzrobe ABC 1.793 1.339 1.476 1.549
Jam C 1.792* 1.339* 1.475* 1.549*
Josh Brody ABC 1.792 1.339 1.475 1.549
Zero AB 1.781 1.335 1.469 1.542
Dogs_Johnson ABC 1.779 1.334 1.468 1.540
Joshi AB 1.736 1.318 1.444 1.512
Mercy ABC 1.732 1.316 1.442 1.510
Sleepy Fox ABC 1.657 1.287 1.400 1.460
Robert ABC 1.650 1.285 1.396 1.456
Livin La Fetus Loca ABC 1.645 1.283 1.394 1.453
baby caweb AB 1.642 1.281 1.392 1.451
Finio ABC 1.636 1.279 1.388 1.447
Take AB 1.633 1.278 1.387 1.445
FranK ABC 1.620 1.273 1.379 1.436
tacos ABC 1.613 1.270 1.375 1.431
Revan ABC 1.597 1.264 1.366 1.421
Fray AB 1.594 1.263 1.365 1.419
Hotline ABC 1.589 1.261 1.362 1.415
SheerMadness ABC 1.564 1.251 1.347 1.399
B33F ABC 1.561 1.249 1.346 1.397
Freean AB 1.550 1.245 1.339 1.389
CTG ABC 1.543 1.242 1.335 1.384
Wolf ABC 1.542 1.242 1.335 1.384
Quincy AB 1.532 1.238 1.329 1.377
Janitor C 1.530* 1.237* 1.328* 1.376*
SKG C 1.530* 1.237* 1.328* 1.376*
Killer ABC 1.530 1.237 1.328 1.376
dboss C 1.528* 1.236* 1.327* 1.374*
Spongy ABC 1.527 1.236 1.326 1.374
Raychu ABC 1.526 1.235 1.325 1.373
Bo ABC 1.525 1.235 1.325 1.372
KrisKringle AB 1.518 1.232 1.321 1.368
Shalaka C 1.515* 1.231* 1.319* 1.366*
JPX AB 1.515 1.231 1.319 1.366
OJ AB 1.515 1.231 1.319 1.366
emptyW ABC 1.514 1.230 1.319 1.365
Janco AB 1.508 1.228 1.315 1.361
Paco ABC 1.508 1.228 1.315 1.361
epad10 AB 1.506 1.227 1.314 1.359
Crovy AB 1.503 1.226 1.312 1.357
Stevie G AB 1.491 1.221 1.305 1.349
lord narwhal AB 1.488 1.220 1.303 1.347
Box ABC 1.483 1.218 1.300 1.344
Wookiee AB 1.482 1.217 1.300 1.343
EG ABC 1.477 1.215 1.297 1.340
SOMBRERO C 1.475* 1.214* 1.296* 1.338*
Weedwack C 1.475* 1.214* 1.296* 1.338*
Loto ABC 1.475 1.214 1.296 1.338
Combo Blaze ABC 1.471 1.213 1.293 1.336
cobr C 1.470* 1.212* 1.293* 1.335*
Madrush ABC 1.470 1.212 1.293 1.335
SSBAfro AB 1.464 1.210 1.289 1.331
Blondekid ABC 1.462 1.209 1.288 1.330
Fireblaster BC 1.460 1.208 1.287 1.328
waxy:joe ABC 1.458 1.207 1.286 1.327
Andykins AB 1.452 1.205 1.282 1.323
K Ruel AB 1.450 1.204 1.281 1.321
Ranryoku AB 1.438 1.199 1.274 1.313
antwon420 AB 1.432 1.197 1.270 1.309
Huntsman AB 1.423 1.193 1.265 1.303
SOTO ABC 1.420 1.192 1.263 1.301
Da_Bear AB 1.418 1.191 1.262 1.299
Jay-R C 1.413* 1.189* 1.259* 1.296*
lordtoko C 1.413* 1.189* 1.259* 1.296*
BARD BC 1.412 1.188 1.259 1.295
MasterHandJob AB 1.412 1.188 1.259 1.295
Dr. Grin B 1.403 1.184 1.253 1.289
Sonjo C 1.401* 1.184* 1.252* 1.288*
Marbles AB 1.396 1.182 1.249 1.284
Miniohh! AB 1.394 1.181 1.248 1.283
Shears ABC 1.386 1.177 1.243 1.277
Nackle ABC 1.377 1.173 1.238 1.271
Schmerka Berl B 1.377 1.173 1.238 1.271
NewbTube AB 1.354 1.164 1.224 1.255
YBOMBB C 1.335* 1.155* 1.212* 1.242*
Traiman B 1.333 1.155 1.211 1.241
Papa louie B 1.332 1.154 1.211 1.240
Isildur1 AB 1.325 1.151 1.206 1.235
bloogo B 1.321 1.149 1.204 1.232
Czar ABC 1.316 1.147 1.201 1.229
Big Red C 1.315* 1.147* 1.200* 1.228*
D35 C 1.315* 1.147* 1.200* 1.228*
Gravyfingers C 1.315* 1.147* 1.200* 1.228*
The Yid AB 1.312 1.145 1.198 1.226
Yobolight BC 1.277 1.130 1.177 1.201
Dr. Sauce AB 1.273 1.128 1.175 1.198
Qapples BC 1.270 1.127 1.173 1.196
Mando BC 1.242 1.114 1.155 1.176
Razz BC 1.215 1.102 1.139 1.157
Dankey Kang ABC 1.114 1.055 1.075 1.084
LETSGO BC 1.098 1.048 1.064 1.073
Jimmy Joe BC 1.073 1.036 1.048 1.054
Roman BC 1.028 1.014 1.019 1.021
Dishier Wand BC 0.986 0.993 0.991 0.989
Darkhorse BC 0.907 0.952 0.937 0.929

Smash Remix Qualification

Not all Smash 64 players view Smash Remix favorably, and some may not be inclined to provide a useful response to a Smash Remix character ranking. In an effort to include only players who understand the Smash Remix cast, only some survey participants will be presented with a Smash Remix ballot. Players who qualified to respond to the Smash 64 character ranking (and were thus invited to the survey) will only be qualified to respond to the Smash Remix ranking if they participated in a Smash Remix tournament listed in the Smash Remix database.

NOTE: the Remix database, as of this writing, only goes through January 2022; Remix tournaments have taken place since then. The Remix weights are still under discussion.

Survey

The survey itself was implemented in Qualtrics, production-quality survey software to which my institution (The University of Chicago) subscribes. Notably, it includes a drag-and-drop question type for visually creating rankings, which allows for quick self-checking by the voter.

The survey is broken into three sections: an introduction and self-identification section, a Smash 64 rankings section, and a Smash 64 Remix rankings section.

Identification and Security

The first section includes measures to ensure that each survey-taker is uniquely identified, that a survey can only be taken by that individual, and that each participant may only submit the form once.

Smash 64 Character Ranking

The second section asks participants to rank the Smash 64 characters from 1 to 12 (with no ties allowed). The ordering the participants are presented with is randomly generated, and will in all likelihood not resemble an ordering they agree with. This is to prevent bias from being introduced into the question. Specifically, we are concerned that if all participants are presented with the previous tier list’s ordering, they may be more likely to simply agree with it and move forward. While the most vocal members of the Smash 64 community are relatively entrenched in the previous tier list, the purpose of this exercise is to create a new tier list, which may or may not agree with the previous tier list.
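As a trivial illustration of the idea (Qualtrics randomizes the presented order internally; this is just a sketch of what that randomization amounts to):

```r
characters <- c("Luigi", "Mario", "Donkey Kong", "Link", "Samus", "Captain Falcon",
                "Ness", "Yoshi", "Kirby", "Fox", "Pikachu", "Jigglypuff")

# Each participant sees the 12 characters in an independent random order
sample(characters)
```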

Participants are then given the opportunity to provide verbal feedback about their rankings as well as rate their confidence in their rankings. Neither of these questions will impact the way their input is weighted or considered when calculating the average rankings and tier list, but may be used to inform the committee about changes to the procedure should there be a sixth tier list.

Smash 64 Remix Character Ranking

The third and final section asks participants to rank the Smash 64 Remix characters from 1 to 17 (with no ties allowed), again with a randomly generated initial ordering. Participants are afforded the same opportunity to provide feedback about their response as in the previous section. They are also permitted to ask that their response be excluded entirely. The confidence question may be used to up-weight or down-weight their Remix response, as may the degree of agreement between their Smash 64 response and the resulting tier list.

Participants will not be asked to compare Smash Remix and Smash 64 characters simultaneously.

Survey Results

These will be available upon request from Dogs_Johnson after the tier list has been completed. Results will be de-identified and weights will be removed (so that no respondent is identifiable by look-up). Thus, the aggregated ratings and tier list will not be entirely replicable.

Smash 64 Tier List Calculations

Again, the calculation procedure was modeled after the 13th Melee tier list. However, unlike in that survey, respondents are not allowed to submit ties.

Weighted Trimmed Mean Definition

What is the weighted trimmed mean? For example, suppose we had 10 equally weighted votes whose weights summed to 10. If we wanted the 5% trimmed mean, the lowest and highest votes would each have their weight cut in half (removing 5% of the total weight from each tail). If we wanted the 10% trimmed mean, we would drop the lowest and highest values entirely.

Consider the following example of three rank votes for a single character, where we have differing weights.

Vote Voter_Weight
3 1.524
4 2.345
6 0.935

The weighted mean would be \[\frac{3 \cdot 1.524 + 4 \cdot 2.345 + 6 \cdot 0.935}{1.524 + 2.345 + 0.935} = \frac{19.562}{4.804} \approx 4.072.\] Note here that the sum of the weights is 4.804. The 5% weighted trimmed mean would be \[\frac{3 \cdot (1.524 - 4.804 \cdot 0.05) + 4 \cdot 2.345 + 6 \cdot (0.935 - 4.804 \cdot 0.05)}{1.524 + 2.345 + 0.935 - 4.804 \cdot 0.10} \approx 4.024.\] The result is not substantially different, but it down-weights the “extreme” votes and is more resistant to outliers.
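For concreteness, here is the same arithmetic in R (this shortcut works here because the amount trimmed is smaller than each endpoint vote’s weight):

```r
votes   <- c(3, 4, 6)
weights <- c(1.524, 2.345, 0.935)

# Ordinary weighted mean
weighted.mean(votes, weights)                       # ~4.072

# 5% weighted trimmed mean: remove 5% of the total weight from each tail
cut <- 0.05 * sum(weights)                          # 0.2402
w_trimmed <- weights
w_trimmed[which.min(votes)] <- w_trimmed[which.min(votes)] - cut
w_trimmed[which.max(votes)] <- w_trimmed[which.max(votes)] - cut
weighted.mean(votes, w_trimmed)                     # ~4.024
```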

Weighted Trimmed Mean

The resulting values for each character (1 for top, 12 for bottom) will be averaged using a weighted 5% trimmed mean. In this case, our weights (if everyone responds) sum to 165.314. Then the lowest and highest 165.314 \(\cdot\) 0.05 = 8.2657 units of vote weight will be removed, where the votes that straddle the 5th and 95th weighted percentiles have their weights only partially reduced, so that exactly 8.2657 units of weight are removed from the bottom and from the top.
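A general implementation has to allow the trimmed weight to span more than one vote at each end, partially reducing the boundary votes exactly as described above. Here is a minimal sketch (the helper name `wtrim_mean()` is mine, not the committee’s):

```r
# Weighted trimmed mean: remove `trim` * sum(w) worth of weight from each tail,
# partially reducing the boundary votes so exactly that much weight is removed.
wtrim_mean <- function(x, w, trim = 0.05) {
  ord <- order(x)
  x <- x[ord]; w <- w[ord]
  cut <- trim * sum(w)

  remaining <- cut                       # trim from the low end
  for (i in seq_along(w)) {
    take <- min(w[i], remaining)
    w[i] <- w[i] - take
    remaining <- remaining - take
    if (remaining <= 0) break
  }

  remaining <- cut                       # trim from the high end
  for (i in rev(seq_along(w))) {
    take <- min(w[i], remaining)
    w[i] <- w[i] - take
    remaining <- remaining - take
    if (remaining <= 0) break
  }

  sum(x * w) / sum(w)
}

wtrim_mean(c(3, 4, 6), c(1.524, 2.345, 0.935))   # ~4.024, matching the example above
```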

Overall Ranks

The 5% weighted trimmed mean of voter-supplied ranks is calculated for every character. Then, the characters are ordered from low (best rank) to high (worst rank). As in the previous two tier lists, the average rating will be reported for each character.

\(k\)-means for Tier List

Please read my introduction to \(k\)-means clustering here (with Smash-relevant examples!). \(k\)-means clustering will be performed on the overall ratings, of which there will be 12. As in the 13th Melee tier list, the number of tiers that indicates “reasonable” spacing between tiers will be used. However, instead of doing this “by eye,” an objective measure, BIC, will be minimized to select the best \(k\), the number of clusters. The following study and simulation demonstrate how and why this is a reasonable approach.

Study

Consider the last tier list:

If we use \(k\)-means clustering with \(k = 2\) through \(k = 7\) clusters, the resulting tierings are in the following columns. Note that the official tier list uses four tiers (\(k = 4\) column below). We intuit that \(k = 2\) is too few clusters (tiers), as Mario and Jigglypuff are in the same tier as Pikachu. Meanwhile, we can intuit that \(k = 7\) is too many tiers, as Captain Falcon, Fox, Yoshi, and Luigi each have their own tier.

rank character rating \(k=2\) \(k=3\) \(k=4\) \(k=5\) \(k=6\) \(k=7\)
1 Pikachu 1.10 S S S S S S
2 Kirby 2.18 S S S S S S
3 Captain Falcon 3.42 S S A S S A
4 Fox 3.75 S S A S S B
5 Yoshi 4.85 S A A A A C
6 Jigglypuff 6.46 S A B A A D
7 Mario 6.49 S A B A A D
8 Samus 9.28 A B C B B E
9 Donkey Kong 9.49 A B C B C E
10 Ness 10.01 A B C C D E
11 Link 10.33 A B C C D E
12 Luigi 11.67 A B C D E F

So, how do we match our intuition with an objective measure? We can use criteria for model selection. In particular, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are the most popular, and both are easily calculable in R for \(k\)-means clustering. Let’s compare AIC and BIC for the ratings above.
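For those who want to try it, here is a minimal sketch. The AIC/BIC formulas below are one common heuristic for \(k\)-means (the total within-cluster sum of squares as the deviance plus a penalty on the number of cluster means); they are not necessarily the exact formulation behind the figures, and \(k\)-means can land in different local optima depending on its random starting centers.

```r
# Ratings from the fourth tier list (the table above)
ratings <- c(1.10, 2.18, 3.42, 3.75, 4.85, 6.46, 6.49,
             9.28, 9.49, 10.01, 10.33, 11.67)

# One common AIC/BIC heuristic for a k-means fit: use the total within-cluster
# sum of squares as the deviance and k * m mean parameters (m = 1 dimension here)
kmeans_ic <- function(fit) {
  k <- nrow(fit$centers); m <- ncol(fit$centers); n <- length(fit$cluster)
  c(AIC = fit$tot.withinss + 2 * m * k,
    BIC = fit$tot.withinss + log(n) * m * k)
}

set.seed(1)  # k-means starts from random centers; nstart would add restarts
ic <- sapply(2:7, function(k) kmeans_ic(kmeans(ratings, centers = k)))
colnames(ic) <- paste0("k=", 2:7)
ic
```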

We can see that both AIC and BIC are minimized when \(k = 4\), suggesting this is the optimal number of clusters for the fourth tier list, which is indeed what the intuitive approach suggested. Since BIC penalizes additional clusters more heavily, it is the model selection criterion I will use. Note: AIC values should only be compared with other AIC values, and likewise for BIC.

Simulation

So, what if we simulate survey results from the prior tier list? We could simulate each rating from a \(N(\mu = \bar x_i, \sigma = 3\hat \sigma)\) distribution, where \(\bar x_i\) is the observed rating for each character \(i\), and \(\hat \sigma\) is the median absolute deviation of \((\bar x_i - i)\), which is approximately 0.55. That is, each voter disagrees somewhat with the previous result, but not by much. We then extract ranks from the underlying simulated character ratings. We could produce \(N = 109\) surveys in this way (the number we hope to have), find the character rating means, and repeat the above BIC-versus-\(k\) study \(M = 100\) times. Doing so, we get the following result. We see that \(k = 4\) is a reasonable choice in most cases where the surveys have the same general result as the previous tier list. This gives further support to having used \(k = 4\) in the previous tier list.

And what if we repeat this process, but let every initial character rating be \(N(\mu = i, \sigma = 3\hat \sigma)\)? That is, there is a perfect true ranking which is contaminated slightly by noise. In doing so, we get the following result, which says that \(k = 4\) would generally be the best choice in that situation.
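A minimal sketch of both simulation variants follows. Assumptions not pinned down by the text above: the same BIC heuristic as in the Study sketch, an unweighted mean of ranks across voters, ten \(k\)-means restarts per fit, and an arbitrary seed.

```r
set.seed(64)   # arbitrary seed so the sketch is reproducible

ratings   <- c(1.10, 2.18, 3.42, 3.75, 4.85, 6.46, 6.49,
               9.28, 9.49, 10.01, 10.33, 11.67)
sigma_hat <- mad(ratings - 1:12)   # ~0.55

# Same BIC heuristic as in the Study sketch
kmeans_bic <- function(fit)
  fit$tot.withinss + log(length(fit$cluster)) * nrow(fit$centers) * ncol(fit$centers)

simulate_best_k <- function(mu, n_voters = 109, ks = 2:7) {
  # Each simulated voter perturbs the underlying ratings; we keep only their ranks
  votes <- replicate(n_voters, rank(rnorm(length(mu), mean = mu, sd = 3 * sigma_hat)))
  avg   <- rowMeans(votes)          # mean rank per character across the voters
  bic   <- sapply(ks, function(k) kmeans_bic(kmeans(avg, centers = k, nstart = 10)))
  ks[which.min(bic)]                # the k that minimizes BIC for this simulated survey
}

# Variant 1: voters centered on the observed fourth-tier-list ratings
table(replicate(100, simulate_best_k(ratings)))
# Variant 2: voters centered on a "perfect" 1-through-12 ranking
table(replicate(100, simulate_best_k(1:12)))
```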

While the truth and the future survey results are unknown, BIC will be a reasonable approach to automatically selecting the number of tiers for the tier list.

Smash 64 Remix Tier List Calculations

These will be produced in a similar way to the Smash 64 tier list, with the exception that weights derived exclusively from Smash 64 rankings or ratings will not be used as these are not necessarily representative of Smash 64 Remix player ability.

The following players that are qualified for the Smash 64 character rankings are not currently qualified for Smash Remix character rankings: Alvin, BARD, Big Red, Bo, Chars, cobr, D35, Da_Bear, Dankey Kang, dboss, Dino, Dishier Wand, Dr. Grin, Dr. Sauce, eL maN, Elias_YFGM, Fireblaster, Gravyfingers, Hero Pie, Huntsman, Jam, Janitor, Jay-R, Jimmy Joe, Joshi, K Ruel, K.O.Ken, Killer, Kimimaru, kix, Kurabba, kysk, LETSGO, Lord Narwhal, lordtoko, Madrush, maha, Miniohh!, mrsir, Nax, Prince, Qapples, Quincy, rainshifter, Ranryoku, Raychu, ReefyBeefy, Robert, Shalaka, SOMBRERO, Sonjo, SOTO, Spongy, SSBAfro, SuPeRbOoMfAn, Take, wario, Wizzrobe, YBOMBB, Yobolight, Zero, Zuber. Please let me know if they are in the Remix database, but under a different tag.

Results and Presentation

Results will be presented in a separate post. If there are deviations from this data analysis plan, they will be described.

Toy Example (work in progress)

Table 1: Processed Raw Survey Data
resp cTSU Qqzv Rkat
weights 1.567 2.132 1.134
S1 Pikachu Pikachu Pikachu
S2 Kirby Captain Falcon Captain Falcon
S3 Captain Falcon Kirby Fox
S4 Yoshi Fox Yoshi
S5 Fox Yoshi Kirby
S6 Jigglypuff Jigglypuff Jigglypuff
S7 Mario Mario Mario
S8 Donkey Kong Samus Donkey Kong
S9 Samus Donkey Kong Samus
S10 Ness Ness Ness
S11 Link Link Link
S12 Luigi Luigi Luigi
Table 2: Example Aggregation
char rating
Pikachu 1.00
Captain Falcon 2.30
Kirby 3.11
Fox 4.10
Yoshi 4.43
Jigglypuff 6.00
Mario 7.00
Donkey Kong 8.43
Samus 8.57
Ness 10.00
Link 11.00
Luigi 12.00
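
Finally, a sketch of how Table 2 can be reproduced from Table 1, reusing the `wtrim_mean()` helper sketched in the Weighted Trimmed Mean section (the respondent tags are the hypothetical ones from Table 1):

```r
chars <- c("Pikachu", "Kirby", "Captain Falcon", "Yoshi", "Fox", "Jigglypuff",
           "Mario", "Donkey Kong", "Samus", "Ness", "Link", "Luigi")

# Each column holds one respondent's rank (1 = best) for each character,
# read off the ballots in Table 1
ranks <- cbind(
  cTSU = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  Qqzv = c(1, 3, 2, 5, 4, 6, 7, 9, 8, 10, 11, 12),
  Rkat = c(1, 5, 2, 4, 3, 6, 7, 8, 9, 10, 11, 12)
)
rownames(ranks) <- chars
w <- c(1.567, 2.132, 1.134)

# wtrim_mean() as sketched in the Weighted Trimmed Mean section above
rating <- apply(ranks, 1, wtrim_mean, w = w, trim = 0.05)
sort(round(rating, 2))   # matches the ratings in Table 2
```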