The use of the em dash (—) now raises suspicions that a text might have been AI-generated. Inspired by a suggestion from dang [1], I created a leaderboard of HN users according to how many of their posts before November 30, 2022—that is, before the release of ChatGPT—contained em dashes. Dang himself comes in number 2—by a very slim margin.
Credit to Claude Code for showing me how to search the HN database through Google BigQuery and for writing the HTML for the leaderboard.
[1] https://news.ycombinator.com/item?id=45053933
Whether this is interesting or not, well…
[1] >>45046883
I remap a key to the right of Space to Compose, and add various custom sequences. Before long, I was completely comfortably and casually typing dashes and curly quotes and more, and in fact it takes conscious effort for me to limit myself to ASCII when typing prose. (Writing code, writing *, /, -, ' and " is easy. But writing prose, I genuinely will write ×, ÷ if it feels the right one in that place, −, ‘/’ and “/”.)
On one previous laptop keyboard I mapped Menu, on my current one RAlt is more suitable.
When on Windows, I use WinCompose. On Linux, I used to just use it bare, which had advantages and disadvantages—apps implement a Compose key inconsistently, some messing things up related to includes and some handling overlapping sequences differently. More recently I wanted to be able to type Telugu and installed fcitx5 which is no longer mostly broken under Wayland like it was last time I tried, so now fcitx5 is handling the Compose sequences across the entire system, and working more consistently. Also I can use Ctrl+Alt+Shift+U and get a popup where I can search Unicode by code or description. Now if only that pesky popup would handle Shift+Space and Ctrl+Backspace itself rather than letting them fall through to the parent…
In my ~/.config/sway/config:
input * {
xkb_options "caps:backspace,compose:ralt"
}
(caps:backspace isn’t entirely relevant here, but it’s on the same line and I choose to mention it. When people are remapping Caps Lock, I’ve never understood why so many seem to choose to make it Escape. Just extend the left hand and slap the corner of the keyboard with the ring finger, it’s not a huge movement and is easy to reach and return. Backspace, however, tends to be needed at least as often (and yes, I say that despite using Vim), and is much harder to hit. In my mind, a far better candidate for shifting to that prime real estate.)
For my ~/.XCompose, I start with the defaults and one good set of additions, https://raw.githubusercontent.com/kragen/xcompose/master/dot...:
include "/usr/share/X11/locale/en_US.UTF-8/Compose"
include "/home/chris/.XCompose-kragen"
Then I add all kinds of additions. Lots of fine typography stuff like zero-width space and non-joiner, narrow no-break space, thin space… a few more hyphen/dash mappings… and lots of other things like nice emoji sequences, music notation stuff, Greek letters matching Vim digraphs, superscript ordinals (ˢᵗ, ⁿᵈ, ʳᵈ, ᵗʰ), the keyboard shortcut symbols macOS uses (⌘⌃⌥⇧⌫ and another dozen less common ones), control pictures like ␆, and a handful of other things.
When all’s said and done:
• Compose - - - gets me — EM DASH (stock)
• Compose - - . gets me – EN DASH (stock)
• Compose - - = gets me − MINUS SIGN (custom)
• Compose - - w gets me ⸺ TWO EM DASH (custom; w for wide)
• Compose - - W gets me ⸻ THREE EM DASH (custom; W for Wider)
The last two I use occasionally, the other three I use very frequently. I went through a phase of using HYPHEN and SOFT HYPHEN, now I seldom use them.
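For anyone wanting to recreate the three custom dash mappings above, here is a minimal Python sketch that prints .XCompose rules in the same layout as the quotation-mark rules further down. The helper name compose_line and the small KEYSYMS table are made up for illustration, and the output is only a guess at what such rules could look like, not a copy of the actual file.

```python
import unicodedata

# Keysym names for the ASCII keys used below (a small hand-picked subset).
KEYSYMS = {"-": "minus", "=": "equal", ".": "period"}

def compose_line(keys, char):
    """Format one .XCompose rule: Compose followed by `keys` yields `char`."""
    events = " ".join(f"<{KEYSYMS.get(k, k)}>" for k in keys)
    name = unicodedata.name(char)  # official Unicode character name
    return f'<Multi_key> {events} : "{char}" U{ord(char):04X} # {name}'

# Hypothetical reconstruction of the three custom dash rules.
for keys, char in [("--=", "\u2212"), ("--w", "\u2e3a"), ("--W", "\u2e3b")]:
    print(compose_line(keys, char))
```

Letters are their own keysym names, which is why only the punctuation keys need a lookup table.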
I also like to write &c. (italic where supported) for et cetera.
For quotation marks, I also use custom mappings:
<Multi_key> <semicolon> <semicolon> : "‘" U2018 # LEFT SINGLE QUOTATION MARK
<Multi_key> <apostrophe> <apostrophe> : "’" U2019 # RIGHT SINGLE QUOTATION MARK
<Multi_key> <colon> <colon> : "“" U201c # LEFT DOUBLE QUOTATION MARK
<Multi_key> <quotedbl> <quotedbl> : "”" U201d # RIGHT DOUBLE QUOTATION MARK
Think about how you physically type them, and I reckon these mappings make a lot of sense, very easy to type. Much better than the stock bindings (<' >' <" >") or kragen ones (`Space 'Space `` ''; or 6' 9' 6" 9").
—⁂—
(Oh yeah, that one’s <Multi_key> <h> <r> : "—⁂—".)
Now, I have one question I’d like answered. Overlapping sequences. If you have -> → and <- ← you’re fine, but when you add <-> ↔, I can’t find any way of using the <- sequence any more. Before fcitx5, some apps would ignore one or the other (in ways difficult to explain which I think involved the fact that some definitions came from includes), and some would let you terminate the sequence early and match the shorter one (e.g. Compose < - Enter). Is there some proper solution I’ve missed?
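To make the overlap concrete, here is a small Python sketch that lists every pair of sequences where one is a strict prefix of another. It does not answer the question and says nothing about how libX11 or fcitx5 actually resolve such cases; the function name prefix_conflicts is made up for illustration.

```python
def prefix_conflicts(sequences):
    """Return (shorter, longer) pairs where one sequence is a strict prefix of another."""
    seqs = sorted(set(sequences))
    return [(a, b) for a in seqs for b in seqs if a != b and b.startswith(a)]

# Each key is the run of characters typed after Compose.
rules = {"->": "→", "<-": "←", "<->": "↔"}
print(prefix_conflicts(rules))  # [('<-', '<->')]
```

After Compose < - the matcher cannot tell whether to emit ← at once or wait for a possible >, which is why some explicit terminator or timeout is needed to reach the shorter sequence.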
I have plans for an article on my keyboard arrangements, including sharing a full .XCompose, but I’m going to finish my next major revision to my website first. Because then I’ll be able to draw things instead of just writing.
—⁂—
On mobile, I think I use FUTO keyboard at present, which lets me access most of these things, but not elegantly. I want to make my own keyboard layout that lets me access the good stuff more easily, but I haven’t got to it yet.
Also: anyone want to join me in advocating for completion dictionaries and libraries to replace their ' apostrophes with ’, or at least to support both approaches equally? I’m fed up with not having this stuff, Vim is the only place where it was straightforward to get it about right, and mobile is just a mess.
https://www.gally.net/miscellaneous/hn-em-dash-user-leaderbo...
This second version was vibe-coded with Codex CLI. I also tried Gemini CLI, but it didn’t work very well. The SQL scripts I ran at BigQuery were by Claude.
I am not a programmer or web designer, so I will leave these pages as they are, warts and all. It was a fun project, though. I never would have attempted something like this pre-vibe-coding.
Compose ` e produces è
" a produces ä
v s produces š
v S produces Š
a e produces æ
C = produces €
l - produces £
- > produces →
( 1 ) produces ①
^ 1 produces ¹
_ 1 produces ₁
1 8 produces ⅛
- - - produces —
- - . produces –
. . produces …
. - produces ·
| - produces †
| = produces ‡
" < produces “
x x produces ×
m u produces µ
> = produces ≥
See /usr/share/X11/locale/en_US.UTF-8/Compose for the list and https://en.wikipedia.org/wiki/Compose_key
I have also configured Shift+Compose to send the code 'dead_greek' using ~/.Xmodmap:
keycode 135 = Multi_key dead_greek Multi_key Multi_key
Then I can type α, β, γ, Δ, Ε, Ζ easily, although I hardly ever need this nowadays.
https://daringfireball.net/2018/02/ios_messages_smart_punctu...
SELECT
EXTRACT(YEAR FROM timestamp) AS year,
SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) AS withDash,
COUNT(*) AS total,
SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'comment'
GROUP BY year
ORDER BY year;
year with— total frac
2006 0 12 0.000
2007 13 70858 0.000
2008 461 247922 0.001
2009 1497 491034 0.003
2010 3835 842438 0.005
2011 4719 1044913 0.005
2012 5648 1246782 0.005
2013 7881 1665185 0.005
2014 8400 1510814 0.006
2015 9967 1642912 0.006
2016 12081 2093612 0.006
2017 14530 2361709 0.006
2018 19246 2384086 0.008
2019 23662 2755063 0.009
2020 27316 3243173 0.008
2021 32863 3765921 0.009
2022 34657 4062159 0.009
2023 36611 4221940 0.009
2024 32543 3339861 0.010
2025 30608 2231919 0.014
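The rounded fractions hide how much the last few years moved, so here is a minimal Python sketch that recomputes them at higher precision from the counts in the table. Only the four most recent rows are copied in. Bear in mind that 2025 is a partial year and that 2022 includes about a month after the ChatGPT release, so the ratio is indicative only.

```python
# (year, comments containing an em dash, total comments), copied from the table above.
rows = [
    (2022, 34657, 4062159),
    (2023, 36611, 4221940),
    (2024, 32543, 3339861),
    (2025, 30608, 2231919),
]

fraction = {year: with_dash / total for year, with_dash, total in rows}
for year, value in fraction.items():
    print(year, f"{value:.4f}")

# Relative change between 2022 and 2025 (roughly a factor of 1.6).
growth = fraction[2025] / fraction[2022]
print(f"2025 vs 2022: x{growth:.2f}")
```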
So there's definitely been an increase.
Querying for the users who use "—" most as a proportion of all their comments:
SELECT
`by`,
SUM(CASE WHEN text LIKE '%—%' THEN 1 ELSE 0 END) / COUNT(*) AS fraction,
COUNT(*) AS total,
MIN(timestamp) AS minTime,
MAX(timestamp) AS maxTime
FROM `bigquery-public-data.hacker_news.full`
WHERE
type = 'comment' AND
timestamp < '2022-11-30'
GROUP BY `by`
HAVING COUNT(*) > 100
ORDER BY fraction DESC
LIMIT 250;
zmgsabst uses them the most [1], westoncb [2] is an older account that uses them fourth-most.
[0] https://console.cloud.google.com/marketplace/product/y-combi...
Of course this doesn't mean they're using ChatGPT either, they could've switched devices or started using them because they felt like it.
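For readers without BigQuery access, the same per-user aggregation can be sketched in plain Python over any list of (author, text) pairs. This is a rough equivalent of the query's GROUP BY and HAVING logic, not a replacement for it; the function name leaderboard and the input shape are assumptions made for illustration.

```python
from collections import defaultdict

EM_DASH = "\u2014"

def leaderboard(comments, min_total=100):
    """comments: iterable of (author, text) pairs.

    Returns (author, fraction, total) tuples, highest fraction first,
    keeping only authors with more than min_total comments.
    """
    totals = defaultdict(int)
    with_dash = defaultdict(int)
    for author, text in comments:
        totals[author] += 1
        if EM_DASH in text:
            with_dash[author] += 1
    rows = [(a, with_dash[a] / n, n) for a, n in totals.items() if n > min_total]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The minimum comment count matters here just as in the SQL: without it, an account with a single comment containing an em dash would top the list at 100 %.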
# user before_chatgpt after_chatgpt
1 fao_ 9/1777 (1 %) 36/225 (16 %)
2 tlogan 1/962 (0 %) 59/399 (15 %)
3 whynotminot 1/250 (0 %) 36/356 (10 %)
4 unclebucknasty 13/2566 (1 %) 38/378 (10 %)
5 iLemming 0/793 (0 %) 61/628 (10 %)
6 nostrebored 10/1045 (1 %) 32/331 (10 %)
7 freeone3000 0/2128 (0 %) 74/791 (9 %)
8 pdabbadabba 6/932 (1 %) 20/225 (9 %)
9 thebooktocome 4/632 (1 %) 18/208 (9 %)
10 tnecniv 0/671 (0 %) 34/446 (8 %)
11 dkersten 39/5092 (1 %) 24/318 (8 %)
12 stared 8/1565 (1 %) 29/392 (7 %)
13 ETH_start 3/385 (1 %) 75/1029 (7 %)
14 tcbawo 2/792 (0 %) 15/218 (7 %)
15 jbm 2/406 (0 %) 22/350 (6 %)
Query [2]:
WITH by_user AS (
SELECT
`by` AS user,
COUNTIF(text LIKE '%—%') AS match_count,
COUNT(*) AS total_count,
(timestamp >= '2022-11-30') AS after_chatgpt
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'comment'
GROUP BY user, after_chatgpt
),
combined AS (
SELECT
user,
MAX(IF(NOT after_chatgpt, match_count, 0)) AS match_before_chatgpt,
MAX(IF(NOT after_chatgpt, total_count, 0)) AS total_before_chatgpt,
MAX(IF(after_chatgpt, match_count, 0)) AS match_after_chatgpt,
MAX(IF(after_chatgpt, total_count, 0)) AS total_after_chatgpt,
FROM by_user
GROUP BY user
HAVING total_before_chatgpt >= 200 AND total_after_chatgpt >= 200
),
with_fractions AS (
SELECT
*,
SAFE_DIVIDE(match_before_chatgpt, total_before_chatgpt) AS fraction_before_chatgpt,
SAFE_DIVIDE(match_after_chatgpt, total_after_chatgpt) AS fraction_after_chatgpt
FROM combined
)
SELECT
user,
FORMAT('%d/%d (%.0f %%)', match_before_chatgpt, total_before_chatgpt, ROUND(fraction_before_chatgpt*100)) AS before_chatgpt,
FORMAT('%d/%d (%.0f %%)', match_after_chatgpt, total_after_chatgpt, ROUND(fraction_after_chatgpt*100)) AS after_chatgpt
FROM with_fractions
WHERE fraction_before_chatgpt < 0.01
ORDER BY fraction_after_chatgpt DESC
LIMIT 15
[1] >>45072937
[2] https://console.cloud.google.com/marketplace/product/y-combi...
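The before/after logic of that query can also be sketched in plain Python, which may help when checking the thresholds on a small sample. The function name before_after and the (author, date, text) input shape are assumptions for illustration; min_each must be at least 1 so that the divisions are safe.

```python
from collections import defaultdict
from datetime import date

EM_DASH = "\u2014"
CUTOFF = date(2022, 11, 30)  # ChatGPT release date used in the query

def before_after(comments, min_each=200, max_before=0.01):
    """comments: iterable of (author, day, text) with day a datetime.date.

    Returns (author, fraction_before, fraction_after) for authors who have
    at least min_each comments on both sides of the cutoff and a pre-cutoff
    em dash fraction below max_before, highest post-cutoff fraction first.
    """
    # stats[author] = [match_before, total_before, match_after, total_after]
    stats = defaultdict(lambda: [0, 0, 0, 0])
    for author, day, text in comments:
        base = 2 if day >= CUTOFF else 0
        stats[author][base] += EM_DASH in text
        stats[author][base + 1] += 1
    rows = []
    for author, (mb, tb, ma, ta) in stats.items():
        if tb >= min_each and ta >= min_each and mb / tb < max_before:
            rows.append((author, mb / tb, ma / ta))
    return sorted(rows, key=lambda row: row[2], reverse=True)
```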
curl -s "https://hn.algolia.com/api/v1/search?tags=comment,author_sjs382&hitsPerPage=10000" \
| jq -r '.hits[].comment_text' \
| grep -o "—" \
| wc -l
https://console.cloud.google.com/bigquery?p=bigquery-public-...
Click on the `+` (white over blue background) in the tab bar at the top that says "SQL query" on popup, and type the following (I use the GoogleSQL pipe syntax (https://cloud.google.com/bigquery/docs/reference/standard-sq... / >>41347188 ) below, but you can also use standard SQL if you prefer):
FROM `bigquery-public-data.hacker_news.full`
|> WHERE type = 'comment' AND timestamp < '2022-11-30'
|> AGGREGATE COUNT(*) AS total, COUNTIF(text LIKE '%—%') AS with_em GROUP BY `by`
|> EXTEND with_em / total AS fraction_with_em
|> ORDER BY fraction_with_em DESC
|> WHERE total > 100 AND fraction_with_em > 0.1
(I'm in place 47 of the 516 results, with 0.29 of my comments (258 of 875) having an em dash in them.)
Edit: As you also asked about timestamps:
FROM `bigquery-public-data.hacker_news.full`
|> WHERE type = 'comment' AND timestamp < '2022-11-30'
|> EXTEND text LIKE '%—%' AS has_em
|> AGGREGATE
COUNT(*) AS total,
COUNTIF(has_em) AS with_em,
MIN(timestamp) AS first_comment_timestamp,
MIN(IF(has_em, timestamp, NULL)) AS first_em_timestamp,
TIMESTAMP_SECONDS(CAST(AVG(time) AS INT64)) AS avg_comment_timestamp,
TIMESTAMP_SECONDS(CAST(AVG(IF(has_em, time, NULL)) AS INT64)) AS avg_em_timestamp,
GROUP BY `by`
|> EXTEND with_em / total AS fraction_with_em
|> ORDER BY fraction_with_em DESC
|> WHERE total > 100 AND fraction_with_em > 0.1
For most people the average timestamp is just the midpoint of when they started posting (with em dashes) and the cutoff date of 2022-11-30, and the top-place user zmgsabst stands out for having started only in late January 2022.
Also see the relevant XKCD:
So it was no surprise to me that ChatGPT used em dashes (I assume a US bias to its training data) and I immediately told it to stop using them (along with Title Case titles). (Source: professional writer for 30 years.)
We've replaced it now with v2, for more complex analytical em dash explorations :) - see >>45075379 and >>45072635 .
>>33950747 (Dec 2022)
>>24189762 (Aug 2020)
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
Whether to add it to the formal guidelines (https://news.ycombinator.com/newsguidelines.html) is a different question, of course. I'm reluctant to do that, partly because it arguably follows from what's there, partly because this is still a pretty fuzzy area that is rapidly evolving, and partly because the community is currently handling this issue pretty well. This may change of course.
One important thing to know: plenty of things not allowed in HN don't show up explicitly in the site guidelines. They are in no way a comprehensive list!
Someone recently created some long list of my reddit comments using them as a farcical claim of having used ChatGPT to author many dozens of 2010 comments.
https://github.com/andrewaylett/aylett.co.uk/blob/d338d35a3d...
1903 edition of The Wizard of Oz — https://archive.org/details/newwizardofoz00baum/page/2/mode/...
A page from Life magazine, 1894 — https://archive.org/details/sim_life_1894-08-23_24_608/page/...
The Illustrated London News, 1843 — https://archive.org/details/illustrated-london-news-v002-184...
The em dash pretty much just joins the two glyphs together. It's supposed to look that way.