When Elon Musk bought Twitter, one of the early commitments he made was to open-source the algorithms that drive many of the key experiences on the platform. Today, Twitter Engineering released ‘The Algorithm’ on GitHub for the world to see.
Musk was on a Twitter Space (recording here) shortly after the release to discuss the code with the community. While the hosts were looking for technical questions about the code, many questions centred on business decisions and end-user features rather than the code itself. That is likely a symptom of the timeframe: many people simply hadn’t had time to explore it.
Personally, I’d like to see a follow-up Space a few days from now, after more people have had time to digest the thousands of lines of code released today.
During the Space, Musk reiterated that there will be embarrassing things in the code. It reflects how things stand today and is not representative of the team’s best efforts (under Twitter 2.0). By open-sourcing the code, Musk’s main objective is to make changes that reduce the amount of regretted hours on the service, a fairly obvious reference to platforms like TikTok, where users get a dopamine hit but often regret how long they spent on the service.
Musk detailed the benefits of being open source and developing in public by referencing Linux, an open-source operating system used by both of his other companies, SpaceX and Tesla.
Twitter algorithm features Author Types ‘Republican’ and ‘Democrat’.
During the Twitter Space, one of the questions came from a developer who had interrogated the newly available source code and discovered references to Republican and Democrat categorisations.
After the Space, I downloaded the code and ran this search myself. It turns out this reference is found primarily in the HomeTweetTypePredicates.scala file, which can attribute one of four properties:
- author_is_elon
- author_is_power_user
- author_is_democrat
- author_is_republican
The first is strange, as it explicitly calls out the new CEO of the company. The second is understandable, but it’s the last two that are of concern.
When asked about it in the Twitter Space, it was clear that this was the first time Elon was learning about this, and he immediately indicated that it was weird and would be removed. A Twitter engineer also responded and indicated that this is not used, as some may fear, to show a different experience to one group or the other.
In RequestQueryFeatureHydrator.scala, we find a class called RequestQueryFeatureHydrator where an override value checks for DDGStatsDemocratsFeature and DDGStatsRepublicansFeature. As indicated by the engineer, these attributions were around statistics, rather than specific end-user features.
With attributions of political affiliation in the code, it is naturally going to raise concerns, as users would not actually know how they have been classified. It’s great to hear this will be removed, as I can’t think of a practical reason why it’d be legitimate to slice users into those two groups. It is likely a legacy implementation, much like what we learnt about through the Twitter Files.
The same file features the following comment on line 86:
/**
* These author ID lists are used purely for metrics collection. We track how often we are
* serving Tweets from these authors and how often their tweets are being impressed by users.
* This helps us validate in our A/B experimentation platform that we do not ship changes
* that negatively impacts one group over others.
*/
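To make the metrics-only use described in that comment concrete, here is a minimal Python sketch of counting impressions per author group within each experiment bucket. It is not Twitter's code: the author IDs, the `AUTHOR_GROUPS` mapping, the bucket names and all function names are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical author ID lists (made-up values). In the released code these
# would correspond to the lists behind the Democrat/Republican features.
AUTHOR_GROUPS = {
    "democrat": {101, 102},
    "republican": {201, 202},
}


def group_of(author_id):
    """Return the group label for an author, or None if they are in no list."""
    for name, ids in AUTHOR_GROUPS.items():
        if author_id in ids:
            return name
    return None


def impressions_by_group(impressions):
    """Count impressions per (experiment bucket, author group).

    `impressions` is an iterable of (bucket, author_id) pairs, where bucket
    is assumed to be "control" or "treatment".
    """
    counts = Counter()
    for bucket, author_id in impressions:
        group = group_of(author_id)
        if group is not None:
            counts[(bucket, group)] += 1
    return counts


def treatment_ratio(counts, group):
    """Treatment impressions divided by control impressions for one group."""
    control = counts[("control", group)]
    if control == 0:
        return None  # no baseline to compare against
    return counts[("treatment", group)] / control
```

Under these assumptions, an experiment whose `treatment_ratio` moves sharply for one group but not the other would be a candidate for review before shipping. Nothing in this sketch changes what any user sees; it only aggregates counts.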
Regarding the Elon value mentioned above, Musk has since posted that it will be gone by tomorrow.
Security Bug Bounty
Twitter offers a bug bounty for security-related bugs, as do many other large companies. The hope is that any financially motivated attacker will responsibly disclose the bug to Twitter, so it can be fixed before it’s exploited, and be remunerated for the discovery, with the reward sized according to the category and severity of the bug.
This program is run through HackerOne and includes rewards for finding items like:
- Remote code execution (e.g. command injection) – $20,160
- Administrative functionality (e.g. access to internal Twitter apps) – $12,460
- Account takeover (e.g. OAuth vulnerabilities) – $7,700
- Recommendation Algorithm Manipulation (e.g. bypassing filtering, rankings, recommendations etc.) – $6,942.00
These figures feel incredibly low compared to industry averages and the potential harm that could result from one of these being found and exploited in the wild. Basically, you could earn much more on the black market, so it’d be worth Twitter skipping the jokes ($6,942.00: ‘69’, ‘420’) and getting serious about incentivising white hats.
The first embarrassing item has been found in the code, something raised directly to Elon in the Twitter space.
After the Space, Elon Musk tweeted that Twitter will be updating its algorithm every 24 to 48 hours based on user suggestions.
The speed at which improvements are made is important. Not only does it ensure that those who attempt to game the system can’t simply set and forget, it also gives the contributing community confidence that they can effect change to make the platform better.
Google’s Search algorithm also famously changes to reduce manipulation or gaming of the search results; however, its changes are typically rolled out more slowly than this.
What we got today in terms of a source code release is listed below. There is more on the way, with a member of the Twitter Engineering team confirming they are also moving to release their search algorithm, something Google and Facebook both hold very close to their chest.
Ranking content and assigning credibility to its authors is difficult, and it is a challenge faced by the broader social media industry. I hope that today’s release by Twitter leads the industry to have conversations about how best to solve these challenges (e.g. NSFW and illegal content). In the best case, we could even see contributions to Twitter’s code drawing on what other social networks have learnt.
Changes will happen on Twitter’s side on a regular basis: developers can submit pull requests, and a team at Twitter will review each suggested change, decide whether it is good or bad, and approve or decline it accordingly.
These are the main components of the Recommendation Algorithm included in this repository:
Type | Component | Description |
---|---|---|
Feature | simclusters-ann | Community detection and sparse embeddings into those communities. |
Feature | TwHIN | Dense knowledge graph embeddings for Users and Tweets. |
Feature | trust-and-safety-models | Models for detecting NSFW or abusive content. |
Feature | real-graph | Model to predict likelihood of a Twitter User interacting with another User. |
Feature | tweepcred | Page-Rank algorithm for calculating Twitter User reputation. |
Feature | recos-injector | Streaming event processor for building input streams for GraphJet based services. |
Feature | graph-feature-service | Serves graph features for a directed pair of Users (e.g. how many of User A’s following liked Tweets from User B). |
Candidate Source | search-index | Find and rank In-Network Tweets. ~50% of Tweets come from this candidate source. |
Candidate Source | cr-mixer | Coordination layer for fetching Out-of-Network tweet candidates from underlying compute services. |
Candidate Source | user-tweet-entity-graph (UTEG) | Maintains an in memory User to Tweet interaction graph, and finds candidates based on traversals of this graph. This is built on the GraphJet framework. Several other GraphJet based features and candidate sources are located here. |
Candidate Source | follow-recommendation-service (FRS) | Provides Users with recommendations for accounts to follow, and Tweets from those accounts. |
Ranking | light-ranker | Light ranker model used by search index (Earlybird) to rank Tweets. |
Ranking | heavy-ranker | Neural network for ranking candidate tweets. One of the main signals used to select timeline Tweets post candidate sourcing. |
Tweet mixing & filtering | home-mixer | Main service used to construct and serve the Home Timeline. Built on product-mixer. |
Tweet mixing & filtering | visibility-filters | Responsible for filtering Twitter content to support legal compliance, improve product quality, increase user trust, protect revenue through the use of hard-filtering, visible product treatments, and coarse-grained downranking. |
Tweet mixing & filtering | timelineranker | Legacy service which provides relevance-scored tweets from the Earlybird Search Index and UTEG service. |
Software framework | navi | High performance, machine learning model serving written in Rust. |
Software framework | product-mixer | Software framework for building feeds of content. |
Software framework | twml | Legacy machine learning framework built on TensorFlow v1. |
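
The table describes tweepcred as a Page-Rank algorithm for user reputation. For readers unfamiliar with the idea, here is a minimal sketch of textbook PageRank in Python. It is not Twitter's implementation: the graph shape (a dict mapping each user to the users they point to), the damping factor of 0.85 and the fixed iteration count are all assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `graph` maps each user to a list of users they point to (for example,
    accounts they follow). Returns a dict of user -> score summing to 1.
    """
    # Collect every user that appears as a source or a target.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    if n == 0:
        return {}

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every user starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        dangling = 0.0
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this user's score evenly across outgoing edges.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Users with no outgoing edges spread their score evenly.
                dangling += damping * rank[node]
        for node in nodes:
            new_rank[node] += dangling / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-user graph: a and b both point to c, c points to a.
    scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
    for user, score in sorted(scores.items(), key=lambda item: -item[1]):
        print(user, round(score, 3))
```

In this toy graph, the user that receives the most incoming edges ends up with the highest score, which is the intuition behind using link structure as a proxy for reputation. A production system would likely add further signals and work at a very different scale.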
More information at Twitter.