Background

In this post, I’ll be describing how the fifth community-recognized Super Smash Brothers for the Nintendo 64 (NTSC version) character tier list will be built. By posting the methodology in advance, I want the community to know that I did not make ad hoc decisions that influenced the outcome of the tier list.

A data analysis plan like this one is consistent with efforts in science which push for reproducibility and rigor (see a brief description of reproducibility and rigor here and Samsa and Samsa (2019)).

I was sought out by @Dogs_Johnson. I was a part of the DFW Smash 64 scene, where I was ranked “9” (tag: Accru Fenix), did commentary on occasion, and designed the schedule for Hitstun 6, a 30-person round robin tournament (which is a post for another day). If you’re clicking here from Smash 64 Discord/Smash 64 FB/Smash 64 Twitter/etc, you can read more about my academic research, my Ph.D. in Statistical Science and other credentials, and find my public social media accounts at ryanmcshane.com. In fact, the Smash 64 community is intended to be the primary audience for this post.

The four prior tier lists are collected at ssbwiki. The methodology behind the 13th melee tier list was suggested as a blueprint. I will detail how this tier list will be created and differences in my approach to that of the 13th melee tier list.

Data Collection

All four of the preceding tier lists were based on a community vote/survey. While characters are programmed into the game and are static in their capacities, players develop an understanding of the characters and their relationship to characters in head-to-head matches. As players improve, learn to counter other characters, learn to counter those counters, etc., each character’s place in the hierarchy changes over time (i.e., the “meta” changes). These changes are most profound at the highest levels of the game with the highest level of players*.

Therefore, the best players’ input is most important in understanding the character hierarchy. Furthermore, the best players have a greater understanding of these hierarchies and should therefore have greater input on the tier list than the players rated below them. I discuss my model-based solution to this problem shortly, although other valid approaches exist as well (see, for example, the disagreement-downweighting method of rating aggregation detailed in Cao and Stokes (2012)).

Players’ input will be delivered in a survey I constructed in Qualtrics, which was approved by Dogs Johnson. I will detail this shortly.

Survey Participation

The Smash 64 2021-2022 rankings committee (@SSB64UPR) aggregated results from American competitions they deemed “major” in a braacket database, and the data has continued to be collected, including SSC 2023. Braacket has a feature to produce Trueskill rankings (Herbrich, Minka, Graepel (2007)) given match results.

Aside on Trueskill

The Trueskill system is Microsoft’s system for aggregating paired comparisons data in team-based games to estimate individual player ratings. These ratings are then used for matchmaking purposes, and Microsoft seeks to balance teams such that both teams have an approximate 50% chance of winning in an effort to keep matchmaking fun. However, the system also works for ordinary paired comparisons data (e.g., data in which there are one versus one games as in major Smash 64 tournaments). In this regard, it functions similarly to Elo’s method (see Elo (1978)), which is a time-dependent Thurstone-Mosteller model. However, Trueskill is more explicitly Bayesian, like Glicko (see Glickman (2001)). See McShane (2019) for a more detailed introduction to Thurstone-Mosteller, Elo, Glicko, Trueskill, and other paired comparisons methods.

Survey Participation (resumed)

As the Trueskill ratings were a decision element the committee used in creating the player rankings (although not the sole element), I used them as an objective starting point for deriving the player weights.

The 2021-2022 ranked players (denoted “A” in table), the top 80 players on this 2021-2023 list of aforementioned input Trueskill ratings (denoted “B” in table), anyone who was ranked in 2019-2020, including in the “expansion pack” (denoted “C” in table), and anyone ranked in the top 30 of the 2020 online Smash 64 rankings (denoted “D” in table) were qualified to respond to the tier list survey. This led to the following list of 109 qualified respondents, each of whom may have qualified in one to four ways. Shortly, you will see the list of players, including their tags, the way they were qualified, and their weight.

The weights are simply the Trueskill rating divided by 4000 and rounded to the third decimal place, which gives, for example, the following weights: Kurabba (2.096 – maximum), Isai (2.017), Dogs_Johnson (1.779), K Ruel (1.450), NaCl (1.377), Jimmy Joe (1.073), Darkhorse (0.907 – minimum). That is, Isai’s response would count nearly twice as much as Jimmy Joe’s, and Jimmy Joe’s would count an infinite number of times more than that of players who do not have a vote, myself included. Players absent from the 2021-2023 Braacket data had their Trueskill and thus weight imputed (denoted with an asterisk) by proximity in the 2019-2020 ranking to those that continued competitive play. The average weight is approximately 1.517.

The weight distribution is shown below. This should reflect our general understanding of the player abilities among the top players, where a few (5 to 10) players stand head and shoulders above the rest in the smaller region on the right-hand side, and the remainder are more competitive with each other in the larger peak. Meanwhile, there are a few players who are viewed as stronger than their measured Trueskill suggests and are qualified for the survey by inclusion in rankings.

Note: for comparison, three alternative weighting schemes (monotonic transforms of the weighting scheme described above) are included in the table as well. The weights are about to change with the updated Trueskill ratings.
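As a quick illustration of the weighting arithmetic, here is a minimal Python sketch that turns a rating into a weight and applies the three alternative transforms listed in the table. The function names are mine, and the rating of 8384 is a hypothetical value, back-calculated only because it reproduces Kurabba's published weight of 2.096.

```python
def weight_from_rating(rating: float, divisor: float = 4000.0) -> float:
    """Trueskill rating divided by 4000, rounded to three decimals."""
    return round(rating / divisor, 3)


def alternative_weights(weight: float) -> dict:
    """The three monotonic transforms shown next to the base weight."""
    return {
        "sqrt": round(weight ** 0.5, 3),
        "two_thirds": round(weight ** (2 / 3), 3),
        "three_quarters": round(weight ** 0.75, 3),
    }


# Hypothetical rating, chosen so the result matches the published weight.
kurabba = weight_from_rating(8384)
print(kurabba, alternative_weights(kurabba))

# Ratio of two published weights (Isai vs. Jimmy Joe): a bit under 2.
print(round(2.017 / 1.073, 2))
```

Because every transform is monotonic, the ordering of the players never changes; the transforms only compress the gap between the highest and lowest weights.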


Player	Qualification	Weight	\(\sqrt{\text{Weight}}\)	\(\text{Weight}^\frac{2}{3}\)	\(\text{Weight}^\frac{3}{4}\)
Kurabba	AB	2.096	1.448	1.638	1.742
kysk	C	2.019*	1.421*	1.597*	1.694*
SuPeRbOoMfAn	C	2.019*	1.421*	1.597*	1.694*
Isai	ABC	2.017	1.420	1.596	1.693
Nax	AB	1.943	1.394	1.557	1.646
Prince	ABC	1.889	1.374	1.528	1.611
wario	AB	1.888	1.374	1.528	1.611
Alvin	ABC	1.875	1.369	1.521	1.602
KD3	ABC	1.873	1.369	1.519	1.601
JaimeHR	ABC	1.870	1.367	1.518	1.599
Shihman	ABC	1.864	1.365	1.515	1.595
Hero Pie	ABC	1.857	1.363	1.511	1.591
KeroKeroppi	ABC	1.842	1.357	1.503	1.581
Lowww	ABC	1.817	1.348	1.489	1.565
Wizzrobe	ABC	1.793	1.339	1.476	1.549
Jam	C	1.792*	1.339*	1.475*	1.549*
Josh Brody	ABC	1.792	1.339	1.475	1.549
Zero	AB	1.781	1.335	1.469	1.542
Dogs_Johnson	ABC	1.779	1.334	1.468	1.540
Joshi	AB	1.736	1.318	1.444	1.512
Mercy	ABC	1.732	1.316	1.442	1.510
Sleepy Fox	ABC	1.657	1.287	1.400	1.460
Robert	ABC	1.650	1.285	1.396	1.456
Livin La Fetus Loca	ABC	1.645	1.283	1.394	1.453
baby caweb	AB	1.642	1.281	1.392	1.451
Finio	ABC	1.636	1.279	1.388	1.447
Take	AB	1.633	1.278	1.387	1.445
FranK	ABC	1.620	1.273	1.379	1.436
tacos	ABC	1.613	1.270	1.375	1.431
Revan	ABC	1.597	1.264	1.366	1.421
Fray	AB	1.594	1.263	1.365	1.419
Hotline	ABC	1.589	1.261	1.362	1.415
SheerMadness	ABC	1.564	1.251	1.347	1.399
B33F	ABC	1.561	1.249	1.346	1.397
Freean	AB	1.550	1.245	1.339	1.389
CTG	ABC	1.543	1.242	1.335	1.384
Wolf	ABC	1.542	1.242	1.335	1.384
Quincy	AB	1.532	1.238	1.329	1.377
Janitor	C	1.530*	1.237*	1.328*	1.376*
SKG	C	1.530*	1.237*	1.328*	1.376*
Killer	ABC	1.530	1.237	1.328	1.376
dboss	C	1.528*	1.236*	1.327*	1.374*
Spongy	ABC	1.527	1.236	1.326	1.374
Raychu	ABC	1.526	1.235	1.325	1.373
Bo	ABC	1.525	1.235	1.325	1.372
KrisKringle	AB	1.518	1.232	1.321	1.368
Shalaka	C	1.515*	1.231*	1.319*	1.366*
JPX	AB	1.515	1.231	1.319	1.366
OJ	AB	1.515	1.231	1.319	1.366
emptyW	ABC	1.514	1.230	1.319	1.365
Janco	AB	1.508	1.228	1.315	1.361
Paco	ABC	1.508	1.228	1.315	1.361
epad10	AB	1.506	1.227	1.314	1.359
Crovy	AB	1.503	1.226	1.312	1.357
Stevie G	AB	1.491	1.221	1.305	1.349
lord narwhal	AB	1.488	1.220	1.303	1.347
Box	ABC	1.483	1.218	1.300	1.344
Wookiee	AB	1.482	1.217	1.300	1.343
EG	ABC	1.477	1.215	1.297	1.340
SOMBRERO	C	1.475*	1.214*	1.296*	1.338*
Weedwack	C	1.475*	1.214*	1.296*	1.338*
Loto	ABC	1.475	1.214	1.296	1.338
Combo Blaze	ABC	1.471	1.213	1.293	1.336
cobr	C	1.470*	1.212*	1.293*	1.335*
Madrush	ABC	1.470	1.212	1.293	1.335
SSBAfro	AB	1.464	1.210	1.289	1.331
Blondekid	ABC	1.462	1.209	1.288	1.330
Fireblaster	BC	1.460	1.208	1.287	1.328
waxy:joe	ABC	1.458	1.207	1.286	1.327
Andykins	AB	1.452	1.205	1.282	1.323
K Ruel	AB	1.450	1.204	1.281	1.321
Ranryoku	AB	1.438	1.199	1.274	1.313
antwon420	AB	1.432	1.197	1.270	1.309
Huntsman	AB	1.423	1.193	1.265	1.303
SOTO	ABC	1.420	1.192	1.263	1.301
Da_Bear	AB	1.418	1.191	1.262	1.299
Jay-R	C	1.413*	1.189*	1.259*	1.296*
lordtoko	C	1.413*	1.189*	1.259*	1.296*
BARD	BC	1.412	1.188	1.259	1.295
MasterHandJob	AB	1.412	1.188	1.259	1.295
Dr. Grin	B	1.403	1.184	1.253	1.289
Sonjo	C	1.401*	1.184*	1.252*	1.288*
Marbles	AB	1.396	1.182	1.249	1.284
Miniohh!	AB	1.394	1.181	1.248	1.283
Shears	ABC	1.386	1.177	1.243	1.277
Nackle	ABC	1.377	1.173	1.238	1.271
Schmerka Berl	B	1.377	1.173	1.238	1.271
NewbTube	AB	1.354	1.164	1.224	1.255
YBOMBB	C	1.335*	1.155*	1.212*	1.242*
Traiman	B	1.333	1.155	1.211	1.241
Papa louie	B	1.332	1.154	1.211	1.240
Isildur1	AB	1.325	1.151	1.206	1.235
bloogo	B	1.321	1.149	1.204	1.232
Czar	ABC	1.316	1.147	1.201	1.229
Big Red	C	1.315*	1.147*	1.200*	1.228*
D35	C	1.315*	1.147*	1.200*	1.228*
Gravyfingers	C	1.315*	1.147*	1.200*	1.228*
The Yid	AB	1.312	1.145	1.198	1.226
Yobolight	BC	1.277	1.130	1.177	1.201
Dr. Sauce	AB	1.273	1.128	1.175	1.198
Qapples	BC	1.270	1.127	1.173	1.196
Mando	BC	1.242	1.114	1.155	1.176
Razz	BC	1.215	1.102	1.139	1.157
Dankey Kang	ABC	1.114	1.055	1.075	1.084
LETSGO	BC	1.098	1.048	1.064	1.073
Jimmy Joe	BC	1.073	1.036	1.048	1.054
Roman	BC	1.028	1.014	1.019	1.021
Dishier Wand	BC	0.986	0.993	0.991	0.989
Darkhorse	BC	0.907	0.952	0.937	0.929

Smash Remix Qualification

Not all Smash 64 players view Smash 64 Remix favorably, and those who do not may not be inclined to provide a useful response to a Smash Remix character ranking. In an effort to include only players that understand the Smash Remix cast, only some survey participants will be presented with a Smash Remix ballot. Players that were qualified to respond to the Smash 64 character ranking (and thus invited to the survey) will only be qualified to respond to the Smash 64 Remix survey if they participated in a Smash Remix tournament listed in the Smash Remix database.

NOTE: the Remix database, as of this writing, goes to January 2022. Remix tournaments have taken place since then. The Remix weights are still under discussion.

Survey

The survey itself was implemented in Qualtrics, which is production-quality survey software to which my institution (The University of Chicago) subscribes. Notably, it includes a drag-and-drop question type to visually create rankings, which allows for quick self-checking by the voter.

The survey is broken into three sections: an introduction and self-identification section, a Smash 64 rankings section, and a Smash 64 Remix rankings section.

Identification and Security

The first section includes measures to ensure that each survey-taker is uniquely identified, that a survey can only be taken by that individual, and that each survey participant may submit the form only once.

Smash 64 Character Ranking

The second section asks participants to rank the Smash 64 characters from 1 to 12 (with no ties allowed). The ordering the participants are presented with is randomly generated, and will in all likelihood not resemble an ordering they agree with. This is to prevent bias from being introduced into the question. Specifically, we are concerned that if all participants are presented with the previous tier list’s ordering, they may be more likely to simply agree with it and move forward. While the most vocal members of the Smash 64 community are relatively entrenched in the previous tier list, the purpose of this exercise is to create a new tier list, which may or may not agree with the previous tier list.

Participants are then given the opportunity to provide verbal feedback about their rankings as well as rate their confidence in their rankings. Neither of these questions will impact the way their input is weighted or considered when calculating the average rankings and tier list, but may be used to inform the committee about changes to the procedure should there be a sixth tier list.

Smash 64 Remix Character Ranking

The third and final section asks participants to rank the Smash 64 Remix characters from 1 to 17 (with no ties allowed), again with a randomly generated initial ordering. Participants are afforded the same opportunity to provide feedback about their response as in the previous section. They are also permitted to ask that their response be excluded entirely. The confidence question may be used to up-weight or down-weight their remix response, in addition to the amount of agreement their Smash 64 response had with the resulting tier list.

Participants will not be asked to compare Smash Remix and Smash 64 characters simultaneously.

Survey Results

These will be available upon request from Dogs_Johnson after the tier list has been completed. Results will be de-identified and weights will be removed (so that no respondent is identifiable by look-up). Thus, the aggregated ratings and tier list will not be entirely replicable.

Smash 64 Tier List Calculations

Again, the calculation procedure was modeled after that of the 13th melee tier list. However, unlike in that survey, no ties are allowed from the respondents.

Weighted Trimmed Mean Definition

What is the weighted trimmed mean? For example, suppose we had 10 equally weighted votes whose weights summed to 10. If we wanted the 5% trimmed mean, the lowest and highest votes would have their weights cut in half. If we wanted the 10% trimmed mean, we would drop the highest and lowest values.

Consider the following example of three rank votes for a single character, where we have differing weights.

Vote	Voter_Weight
3	1.524
4	2.345
6	0.935

The weighted mean would be \[\frac{3 \cdot 1.524 + 4 \cdot 2.345 + 6 \cdot 0.935}{1.524 + 2.345 + 0.935} = \frac{19.562}{4.804} \approx 4.072.\] Note here that the sum of the weights is 4.804. The 5% weighted trimmed mean would be \[\frac{3 \cdot (1.524 - 4.804 \cdot 0.05) + 4 \cdot 2.345 + 6 \cdot (0.935 - 4.804 \cdot 0.05)}{1.524 + 2.345 + 0.935 - 4.804 \cdot 0.10} \approx 4.024.\] The result is not substantially different, but it down-weights the “extreme” votes, and is more resistant to outliers.
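The arithmetic above can be checked with a short Python sketch. The function name and the interval-overlap approach are my own framing of the procedure: votes are laid end to end in sorted order, each occupying a length equal to its weight, and only the part between the 5% and 95% marks is kept.

```python
def weighted_trimmed_mean(votes, weights, trim=0.05):
    """Weighted mean after removing `trim` of the total weight from each tail."""
    pairs = sorted(zip(votes, weights))          # order by vote, lowest first
    total = sum(w for _, w in pairs)
    lower, upper = trim * total, (1 - trim) * total

    numerator = 0.0
    kept = 0.0
    start = 0.0                                  # cumulative weight before this vote
    for vote, weight in pairs:
        end = start + weight
        # Portion of this vote's weight that lies inside [lower, upper].
        overlap = max(0.0, min(end, upper) - max(start, lower))
        numerator += vote * overlap
        kept += overlap
        start = end
    return numerator / kept


votes = [3, 4, 6]
weights = [1.524, 2.345, 0.935]
print(round(weighted_trimmed_mean(votes, weights, trim=0.0), 3))   # 4.072
print(round(weighted_trimmed_mean(votes, weights, trim=0.05), 3))  # 4.024
```

With ten equally weighted votes, `trim=0.05` halves the weight of the lowest and highest votes and `trim=0.10` drops them entirely, matching the description above.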

Weighted Trimmed Mean

The resulting values for each character (1 for top, 12 for bottom, etc.) will be averaged using a weighted 5% trimmed mean. In this case, our weights (if everyone responds) sum to 165.314. Then the lowest and highest 165.314 \(\cdot\) 0.05 = 8.2657 votes (measured by weight) will be removed; the two votes that straddle the 5th and 95th weight percentiles will have their weights reduced so that exactly 8.2657 is removed from each of the top and bottom.
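For readers who prefer notation, one way to write this procedure is sketched below. The symbols are my own shorthand rather than anything taken from the melee methodology: \(v_{(1)} \le \dots \le v_{(n)}\) are one character's votes in sorted order, \(w_{(j)}\) are the matching voter weights, and \(\alpha\) is the trimming fraction.

\[W = \sum_{j=1}^{n} w_{(j)}, \qquad C_0 = 0, \qquad C_j = \sum_{l=1}^{j} w_{(l)},\]

\[w^{*}_{(j)} = \max\bigl\{0,\ \min\bigl(C_j,\ (1-\alpha)W\bigr) - \max\bigl(C_{j-1},\ \alpha W\bigr)\bigr\}, \qquad \bar v_{\alpha} = \frac{\sum_{j=1}^{n} w^{*}_{(j)}\, v_{(j)}}{(1 - 2\alpha)\, W}.\]

With \(\alpha = 0.05\) and \(W = 165.314\), the amount removed from each tail is \(\alpha W = 8.2657\) and the denominator is \(0.9 \cdot 165.314 = 148.7826\).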

Overall Ranks

The 5% weighted trimmed mean of voter-supplied ranks is calculated for every character. Then, the characters are ordered from low (best rank) to high (worst rank). As in the previous two tier lists, the average rating will be reported for each character.

\(k\)-means for Tier List

Please read my introduction to \(k\)-means clustering here (with Smash-relevant examples!). \(k\)-means clustering will be performed on the overall ratings, of which there will be 12. As in the 13th melee tier list, the number of tiers that indicates “reasonable” spacing between tiers will be used. However, instead of doing this “by eye,” an objective measure, BIC, will be minimized to select the best \(k\), the number of clusters. The following study and simulation demonstrate how and why that is the best approach.

Study

Consider the last tier list:

If we use \(k\)-means clustering with \(k = 2\) through \(k = 7\) clusters, the resulting tierings are in the following columns. Note that the official tier list uses four tiers (\(k = 4\) column below). We intuit that \(k = 2\) is too few clusters (tiers), as Mario and Jigglypuff are in the same tier as Pikachu. Meanwhile, we can intuit that \(k = 7\) is too many tiers, as Captain Falcon, Fox, Yoshi, and Luigi each have their own tier.

			clustering for \(k =\)
rank	character	rating	\(2\)	\(3\)	\(4\)	\(5\)	\(6\)	\(7\)
1	Pikachu	1.10	S	S	S	S	S	S
2	Kirby	2.18	S	S	S	S	S	S
3	Captain Falcon	3.42	S	S	A	S	S	A
4	Fox	3.75	S	S	A	S	S	B
5	Yoshi	4.85	S	A	A	A	A	C
6	Jigglypuff	6.46	S	A	B	A	A	D
7	Mario	6.49	S	A	B	A	A	D
8	Samus	9.28	A	B	C	B	B	E
9	Donkey Kong	9.49	A	B	C	B	C	E
10	Ness	10.01	A	B	C	C	D	E
11	Link	10.33	A	B	C	C	D	E
12	Luigi	11.67	A	B	C	D	E	F

So, how do we match our intuition with an objective measure? We can use criteria for model selection. In particular, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) are the most popular, and both are easily calculable in R for \(k\)-means clustering. Let’s compare AIC and BIC for the ratings above.

We can see that both AIC and BIC are minimized when \(k = 4\), suggesting this is the optimal number of clusters for the fourth tier list, which is indeed what the intuitive approach suggested. Since BIC is more punitive for too many clusters, this is the model selection criterion I will use. Note: AIC should only be compared to itself and BIC should only be compared to itself.
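The post does not spell out which AIC and BIC formulas are used, so the following Python sketch rests on two loudly stated assumptions. First, it uses one common formulation for \(k\)-means, AIC = WSS + \(2mk\) and BIC = WSS + \(\ln(n)\,mk\), where WSS is the total within-cluster sum of squares, \(m = 1\) is the dimension, and \(n = 12\). Second, it scores the tier assignments exactly as tabulated above rather than re-running \(k\)-means, whose output can depend on the starting centers.

```python
import math

ratings = [1.10, 2.18, 3.42, 3.75, 4.85, 6.46, 6.49, 9.28, 9.49, 10.01, 10.33, 11.67]

# Tier labels copied column by column from the table above (k = 2 ... 7).
tiers_by_k = {
    2: "SSSSSSSAAAAA",
    3: "SSSSAAABBBBB",
    4: "SSAAABBCCCCC",
    5: "SSSSAAABBCCD",
    6: "SSSSAAABCDDE",
    7: "SSABCDDEEEEF",
}


def within_ss(values, labels):
    """Total within-cluster sum of squares for a given labelling."""
    total = 0.0
    for label in set(labels):
        members = [v for v, l in zip(values, labels) if l == label]
        centre = sum(members) / len(members)
        total += sum((v - centre) ** 2 for v in members)
    return total


def aic_bic(values, labels, dims=1):
    """Assumed criteria: WSS + 2*m*k (AIC) and WSS + ln(n)*m*k (BIC)."""
    k = len(set(labels))
    wss = within_ss(values, labels)
    return wss + 2 * dims * k, wss + math.log(len(values)) * dims * k


scores = {k: aic_bic(ratings, labels) for k, labels in tiers_by_k.items()}
for k, (aic, bic) in scores.items():
    print(k, round(aic, 2), round(bic, 2))

best_aic = min(scores, key=lambda k: scores[k][0])
best_bic = min(scores, key=lambda k: scores[k][1])
print(best_aic, best_bic)
```

Under these assumptions both criteria are smallest for the \(k = 4\) column, in line with the comparison described above. A different formula or a different set of cluster assignments could change the exact values.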

Simulation

So, what if we simulate survey results from the prior tier list? We could simulate each rating from a \(N(\mu = \bar x_i, \sigma = 3\hat \sigma)\) distribution, where \(\bar x_i\) is the observed rating for each character \(i\), and \(\hat \sigma\) is the median absolute deviation of \((\bar x_i - i)\), which is approximately 0.55. That is, each voter disagrees somewhat with the previous result, but not by much. We then extract ranks from each voter’s underlying simulated character ratings. We could produce \(N = 109\) surveys in this way (the number we hope to have), find the character rating means, and compute BIC versus \(k\) as in the study above, repeating the whole process \(M = 100\) times. Doing so, we get the following result. We see that \(k = 4\) is a reasonable choice in most cases where surveys have the same general result as in the previous tier list. This gives further support to having used \(k = 4\) in the previous tier list.
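A single replicate of this simulation might look like the Python sketch below. Several details are assumptions on my part: the median absolute deviation is scaled by 1.4826 (the default in R's `mad`), which is what reproduces a value near 0.55; the character means are plain unweighted means; and the seed is arbitrary. The sketch stops before the clustering step and does not repeat the process \(M = 100\) times.

```python
import random
import statistics

ratings = {
    "Pikachu": 1.10, "Kirby": 2.18, "Captain Falcon": 3.42, "Fox": 3.75,
    "Yoshi": 4.85, "Jigglypuff": 6.46, "Mario": 6.49, "Samus": 9.28,
    "Donkey Kong": 9.49, "Ness": 10.01, "Link": 10.33, "Luigi": 11.67,
}


def scaled_mad(values, constant=1.4826):
    """Median absolute deviation with the (assumed) 1.4826 scaling constant."""
    centre = statistics.median(values)
    return constant * statistics.median([abs(v - centre) for v in values])


# Gap between each observed rating and its rank position i = 1..12.
gaps = [rating - rank for rank, rating in enumerate(ratings.values(), start=1)]
sigma_hat = scaled_mad(gaps)


def simulate_ballot(rng):
    """One voter: noisy latent ratings, converted to ranks 1..12."""
    latent = {c: rng.gauss(mu, 3 * sigma_hat) for c, mu in ratings.items()}
    ordered = sorted(latent, key=latent.get)     # lowest latent rating = best
    return {c: rank for rank, c in enumerate(ordered, start=1)}


rng = random.Random(64)                          # arbitrary seed for repeatability
ballots = [simulate_ballot(rng) for _ in range(109)]
mean_rank = {c: statistics.mean([b[c] for b in ballots]) for c in ratings}

print(round(sigma_hat, 3))                       # about 0.556
print(sorted(mean_rank, key=mean_rank.get))
```

Each simulated ballot is a complete ranking with no ties, so it has the same shape as a real survey response and can be fed into the same aggregation and clustering steps.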

And what if we repeat this process but let every initial character rating be \(N(\mu = i, \sigma = 3\hat \sigma)\)? That is, there is a perfect true ranking which is contaminated slightly by noise. In doing so, we get the following result, which says that \(k = 4\) would generally be the best choice in that situation.

While the truth and the future survey results are unknown, BIC will be a reasonable approach to automatically selecting the number of tiers for the tier list.

Smash 64 Remix Tier List Calculations

These will be produced in a similar way to the Smash 64 tier list, with the exception that weights derived exclusively from Smash 64 rankings or ratings will not be used as these are not necessarily representative of Smash 64 Remix player ability.

The following players that are qualified for the Smash 64 character rankings are not currently qualified for Smash Remix character rankings: Alvin, BARD, Big Red, Bo, Chars, cobr, D35, Da_Bear, Dankey Kang, dboss, Dino, Dishier Wand, Dr. Grin, Dr. Sauce, eL maN, Elias_YFGM, Fireblaster, Gravyfingers, Hero Pie, Huntsman, Jam, Janitor, Jay-R, Jimmy Joe, Joshi, K Ruel, K.O.Ken, Killer, Kimimaru, kix, Kurabba, kysk, LETSGO, Lord Narwhal, lordtoko, Madrush, maha, Miniohh!, mrsir, Nax, Prince, Qapples, Quincy, rainshifter, Ranryoku, Raychu, ReefyBeefy, Robert, Shalaka, SOMBRERO, Sonjo, SOTO, Spongy, SSBAfro, SuPeRbOoMfAn, Take, wario, Wizzrobe, YBOMBB, Yobolight, Zero, Zuber. Please let me know if they are in the Remix database, but under a different tag.

Results and Presentation

Results will be presented in a separate post. If there are deviations from this data analysis plan, they will be described.

Toy Example (work in progress)

Table 2: Processed Raw Survey Data
	3	4	5
resp	cTSU	Qqzv	Rkat
weights	1.567	2.132	1.134
S1	Pikachu	Pikachu	Pikachu
S2	Kirby	Captain Falcon	Captain Falcon
S3	Captain Falcon	Kirby	Fox
S4	Yoshi	Fox	Yoshi
S5	Fox	Yoshi	Kirby
S6	Jigglypuff	Jigglypuff	Jigglypuff
S7	Mario	Mario	Mario
S8	Donkey Kong	Samus	Donkey Kong
S9	Samus	Donkey Kong	Samus
S10	Ness	Ness	Ness
S11	Link	Link	Link
S12	Luigi	Luigi	Luigi

Table 1: Example Aggregation
char	rating
Pikachu	1.00
Captain Falcon	2.30
Kirby	3.11
Fox	4.10
Yoshi	4.43
Jigglypuff	6.00
Mario	7.00
Donkey Kong	8.43
Samus	8.57
Ness	10.00
Link	11.00
Luigi	12.00
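The example aggregation appears consistent with applying a 5% weighted trimmed mean to the three toy ballots. The Python sketch below, with function and variable names of my own choosing, recomputes the ratings from the processed survey data under that assumption and sorts the characters from best to worst.

```python
def weighted_trimmed_mean(votes, weights, trim=0.05):
    """Weighted mean after removing `trim` of the total weight from each tail."""
    pairs = sorted(zip(votes, weights))
    total = sum(w for _, w in pairs)
    lower, upper = trim * total, (1 - trim) * total
    numerator = kept = start = 0.0
    for vote, weight in pairs:
        end = start + weight
        overlap = max(0.0, min(end, upper) - max(start, lower))
        numerator += vote * overlap
        kept += overlap
        start = end
    return numerator / kept


# Ballots listed from rank 1 (S1) to rank 12 (S12), as in the toy survey table.
ballots = {
    "cTSU": ["Pikachu", "Kirby", "Captain Falcon", "Yoshi", "Fox", "Jigglypuff",
             "Mario", "Donkey Kong", "Samus", "Ness", "Link", "Luigi"],
    "Qqzv": ["Pikachu", "Captain Falcon", "Kirby", "Fox", "Yoshi", "Jigglypuff",
             "Mario", "Samus", "Donkey Kong", "Ness", "Link", "Luigi"],
    "Rkat": ["Pikachu", "Captain Falcon", "Fox", "Yoshi", "Kirby", "Jigglypuff",
             "Mario", "Donkey Kong", "Samus", "Ness", "Link", "Luigi"],
}
weights = {"cTSU": 1.567, "Qqzv": 2.132, "Rkat": 1.134}

respondents = list(ballots)
toy_ratings = {}
for character in ballots["cTSU"]:
    votes = [ballots[r].index(character) + 1 for r in respondents]
    toy_ratings[character] = weighted_trimmed_mean(
        votes, [weights[r] for r in respondents]
    )

# Lower rating = better; print in tier-list order.
for character in sorted(toy_ratings, key=toy_ratings.get):
    print(f"{character}\t{toy_ratings[character]:.2f}")
```

Characters ranked identically by all three respondents (Pikachu, Jigglypuff, Mario, Ness, Link, Luigi) keep their exact rank as a rating, while the others land between the ranks they were given.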