Show HN: I made a privacy-first minimalist Google Analytics

submitted by Adriaa+(OP) on 2018-09-19 14:12:56 | 968 points 263 comments

NOTE: showing posts with links only
1. Adriaa+8[view] [source] 2018-09-19 14:13:28
>>Adriaa+(OP)
Creator here. As a developer, I install analytics for clients, but I never feel comfortable installing Google Analytics because Google creates profiles for their visitors, and uses their information for apps (like AdWords). As we all know, big corporations unnecessarily track users without their consent. I want to change that.

So I built Simple Analytics. To ensure that it's fast, secure, and stable, I built it entirely using languages that I'm very familiar with. The backend is plain Node.js without any framework, the database is PostgreSQL, and the frontend is written in plain JavaScript.

I learned a lot while coding, like sending requests as JSON requires an extra (pre-flight) request, so in my script I use the "text/plain" content type, which does not require an extra request. The script is publicly available (https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl...). It works out of the box with modern frontend frameworks by overwriting the "history.pushState"-function.
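
As a rough illustration of the two points above (this is not the published script, just a minimal sketch), the idea could look like the following. The function names `needsPreflight`, `buildPageviewRequest`, and `wrapPushState` are hypothetical, and the preflight check is simplified to the Content-Type rule only; custom headers and other methods can also trigger a preflight.

```javascript
// Content types a cross-origin POST may carry without triggering a CORS preflight.
const SIMPLE_CONTENT_TYPES = [
  'application/x-www-form-urlencoded',
  'multipart/form-data',
  'text/plain',
];

// Simplified check: looks only at the Content-Type header value.
function needsPreflight(contentType) {
  const base = contentType.split(';')[0].trim().toLowerCase();
  return !SIMPLE_CONTENT_TYPES.includes(base);
}

// Hypothetical request shape: the JSON is sent as a text/plain body.
function buildPageviewRequest(data) {
  return {
    method: 'POST',
    headers: { 'Content-Type': 'text/plain' },
    body: JSON.stringify(data),
  };
}

// Wrap pushState on a history-like object so client-side navigations
// are reported too. In a browser you would pass window.history.
function wrapPushState(historyLike, onNavigate) {
  const original = historyLike.pushState;
  historyLike.pushState = function (...args) {
    const result = original.apply(this, args);
    onNavigate(args[2]); // third pushState argument is the new URL
    return result;
  };
}
```

In a browser, `wrapPushState(window.history, url => send(url))` would be the assumed usage, where `send` is whatever function posts the request built above.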

I am transparent about what I collect (https://simpleanalytics.io/what-we-collect) so please let me know if you have any questions. My analytics tool is just the start for what I want to achieve in the non-tracking movement.

We can be more valuable without exploiting user data.

2. teddyh+c1[view] [source] 2018-09-19 14:21:58
>>Adriaa+(OP)
Please give a comparison to Matomo¹ (formerly Piwik), the current obvious choice for doing this.

1. https://matomo.org/

◧◩
3. cutety+v1[view] [source] [discussion] 2018-09-19 14:25:26
>>Adriaa+8
Just a heads up, HN comments only use a (I think) small subset of Markdown for formatting, but your link will work as is without having to wrap it in [] and adding the ().

https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl...

https://simpleanalytics.io/what-we-collect

Anyway, cool project! I've always felt the same about using GA given I actually like to pretend I have some sort of privacy these days, and always have an adblocker on, so I hated setting it up for people. Definitely will be keeping an eye on this the next time someone asks me to set up GA.

◧◩
11. lucide+l2[view] [source] [discussion] 2018-09-19 14:31:58
>>Adriaa+8
This looks great; I have bookmarked it for future projects.

I would, however, be a little more skeptical of tools claiming to be privacy-first than I would be of GA (who I presume are not privacy-first). On that note, some quick questions:

- Any plans to open source? I've used Piwik/Matomo in the past, and while I'm not a massive fan of the code-quality of that project, it's at least auditable (and editable).

- You say you're transparent about what you collect—IPs aren't mentioned on that page[0]. Are IPs stored in full or how are they handled? I assume you log IPs?

- How do you discern unique page-views? You seem to be dogfooding and I see no cookies or localStorage keys set.

[0] https://simpleanalytics.io/what-we-collect

15. sondr3+L2[view] [source] 2018-09-19 14:34:46
>>Adriaa+(OP)
I've moved away from using any kind of script embedded in my webpages for tracking and instead just use GoAccess (https://goaccess.io/) to analyze my logs. There are obvious caveats with this: you need to install it, configure the server logging to match it, and so on. But personally I find the benefits outweigh the cons: it all runs on the server, you are the sole owner of all the data, and this tracking doesn't require any kind of JS on the webpage.
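
To give a feel for what log-based analytics involves, here is a minimal sketch that counts successful GET requests per path. It is not how GoAccess works internally; it assumes the common/combined access log format, and `parseLogLine` and `countPageviews` are hypothetical names.

```javascript
// Matches the start of a common/combined log format line:
// ip ident user [time] "METHOD path protocol" status bytes
const LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)/;

function parseLogLine(line) {
  const m = LOG_LINE.exec(line);
  if (!m) return null; // not a recognizable log line
  return { ip: m[1], time: m[2], method: m[3], path: m[4], status: Number(m[5]) };
}

// Count successful GET requests per path; no JS on the page is needed.
function countPageviews(lines) {
  const counts = {};
  for (const line of lines) {
    const entry = parseLogLine(line);
    if (!entry || entry.method !== 'GET') continue;
    if (entry.status < 200 || entry.status >= 300) continue;
    counts[entry.path] = (counts[entry.path] || 0) + 1;
  }
  return counts;
}
```

A real setup would also need to filter out bots and static assets, which this sketch ignores.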
◧◩◪
23. harian+M3[view] [source] [discussion] 2018-09-19 14:44:27
>>consto+C2
In practice, you can copy the code to your server. You could subscribe to repo updates on https://github.com/simpleanalytics/cdn.simpleanalytics.io and update your code if the changes make sense to you.
◧◩◪
24. harian+V3[view] [source] [discussion] 2018-09-19 14:46:10
>>nhooyr+y3
Explained here: https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS (search for preflight)
33. Cpmly+e5[view] [source] 2018-09-19 14:58:24
>>Adriaa+(OP)
It doesn't even come close to offering the features Google Analytics offers, and it costs $12/month, the same as a service such as Netflix. The idea is nice, but look at the actual product here: https://simpleanalytics.io/simpleanalytics.io

It disappoints in every way; you can't even check yesterday's stats.

58. whylo+P9[view] [source] 2018-09-19 15:29:55
>>Adriaa+(OP)
This is a great idea and I love the design.

It looks like anyone can see the stats for any domain using the service without any authentication. I added the tracking code to my domain and was able to hit https://simpleanalytics.io/[mydomain.co.uk] without signing up or logging in. I was also able to see the stats for your personal site.

Is that intentional? If it is, it seems like an odd choice for a privacy-first service. If not, it seems like quite a worrying oversight in a paid-for product.

62. MrQuin+ma[view] [source] 2018-09-19 15:33:02
>>Adriaa+(OP)
Well done brother. :-) The more privacy-aware tools, the better!

Something that would interest me, is a little explanation of https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl....

You already have very brief comments at strategic points. If you would explain these one by one, I would learn a lot about optimizing for number of requests, skipping stuff to load, etc. Maybe a technical blog post at a later time when the dust settles?

◧◩
73. aemble+xd[view] [source] [discussion] 2018-09-19 15:55:37
>>whylo+P9
I see what you mean https://simpleanalytics.io/adriaan.io
75. ciex+Zd[view] [source] 2018-09-19 15:59:31
>>Adriaa+(OP)
I am using fathom [1] for this. They allow hosting the backend yourself and your analytics are not publicly accessible. Biggest con is that each installation can only track one domain as of now.

[1]: https://usefathom.com

◧◩◪
77. donalt+de[view] [source] [discussion] 2018-09-19 16:01:01
>>aemble+xd
Some people are taking advantage of this to leave messages for us: https://simpleanalytics.io/simpleanalytics.io

Edit: It seems to have been filtered now, but people were using spoofed referer headers to leave offensive messages for HN users.

◧◩
83. pdkl95+Yf[view] [source] [discussion] 2018-09-19 16:13:14
>>Adriaa+8
> unnecessarily track users without their consent

Regardless of your intentions, you are collecting enough data to track users.

> I am transparent about what I collect ([URL])

That page doesn't mention that you are also collecting (and make no claim about storing) the globally-visible IP address (and any other data in the IP and TCP headers). This can be uniquely identifying; even when it isn't unique you usually only need a few bits of additional entropy to reconstruct[1] a unique tracking ID.

In my case, you're collecting and storing more than enough additional entropy to make a decent fingerprint because [window.innerWidth, window.innerHeight] == [847, 836]. Even if I resized the window, you could follow those changes simply by watching analytics events from the same IP that are temporally nearby (you are collecting and storing timestamps).

[1] An older comment where I discussed how this could be done (and why GA's supposed "anonymization" feature (aip=1) is a blatant lie): https://news.ycombinator.com/item?id=17170468

◧◩◪◨⬒⬓⬔
86. pdkl95+zg[view] [source] [discussion] 2018-09-19 16:18:48
>>thesim+67
> Removing the last octet of IPv4 addresses before storing them should provide better privacy.

That doesn't provide any practical amount of privacy. For a longer discussion of why this is at best a placebo, see: https://news.ycombinator.com/item?id=17170468

87. eli+ph[view] [source] 2018-09-19 16:23:59
>>Adriaa+(OP)
I think there are a lot of misconceptions about how Google Analytics tracking works. I'm pretty sure a vanilla GA setup does not, in fact, create profiles that track you across the web. For one thing, all the cookies it creates are first-party (on your domain).

I still get objecting to Google products on principle, but their privacy policy for GA seems pretty reasonable to me: https://support.google.com/analytics/answer/6004245

◧◩
91. dylz+Oi[view] [source] [discussion] 2018-09-19 16:34:18
>>Adriaa+8
How can I run this myself?

It absolutely isn't privacy-first if it requires running on someone else's machine and giving your users' data to them. Another issue is that while your server is in the EU, the hosting company is subject to US law and all the stuff that comes with it (for example, https://en.wikipedia.org/wiki/CLOUD_Act).

◧◩◪
96. harian+Nj[view] [source] [discussion] 2018-09-19 16:41:36
>>pdkl95+Yf
Good comment! I only store the window.innerWidth metric. I updated the what we collect page (https://simpleanalytics.io/what-we-collect) to reflect the IP handling. We don't store them. Fingerprinting would definitely be tracking; not on my watch!
102. xwvvvv+Uk[view] [source] 2018-09-19 16:49:48
>>Adriaa+(OP)
In case some people are unaware, after GDPR Google released an add-on that allows you to opt out of Google Analytics tracking across the web:

https://tools.google.com/dlpage/gaoptout/

106. mdasen+Ll[view] [source] 2018-09-19 16:55:29
>>Adriaa+(OP)
First: really slick site. I'm not so into the video which takes a while to get to the point, but the site makes it really easy to understand the point of your product (and that's something a lot of sites lack).

I do have some questions/comments and I apologize if they seem a bit rapid-fire.

* When I look at the "Top Pages", there are links. When I click the link, it brings me to that page on your site not a chart of hits for that page. Is that how it's meant to work?

* If I sign up for your service, do my stats become public? https://simpleanalytics.io/apple.com just says "This domain does not have any data yet" (presumably because Apple doesn't have your script installed). But that kinda indicates that any domain with your script installed would show up there. It might just be an error in the messaging, but probably something to fix.

* What's your backend like? I'm mostly curious because analytics at scale isn't an easy problem. Do you write to a log-structured system with high availability (like Kafka) and then process asynchronously? How do you handle making the chart of visitors? Do you roll up the stats periodically?

* Speaking of scale, if I started sending thousands or tens of thousands of requests per second at you, would that be bad? Is this more targeted at small sites?

* What do you do about bots? Bot traffic can be a large source of traffic that throws off numbers.

* How long before numbers are available? It's September 19th, but the last stats on the live demo are September 18th. Is it lagged by a day?

* Do you not want to track user-agents for privacy reasons as well? Seems like a UA doesn't really identify anyone, but it can be useful for determining if you want to support a browser.

* You're not counting anyone that has the "Do Not Track" header. To me, DNT is more about tracking than counting (which is different). Even if you counted my hit, it wouldn't be tracking me if you didn't record information like IP address and there were no cookies.
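
On the Do Not Track point in the last bullet, the behaviour described (skipping hits that carry the header) could be sketched as below. This is an assumption about the logic, not the service's actual code; `shouldCount` and `countHits` are hypothetical names.

```javascript
// Node's http module lower-cases incoming header names, hence 'dnt'.
function shouldCount(headers) {
  return headers['dnt'] !== '1';
}

// Count only the requests whose visitors have not sent DNT: 1.
function countHits(requests) {
  return requests.filter((req) => shouldCount(req.headers)).length;
}
```

Counting such hits anyway, as suggested above, would simply mean dropping the `shouldCount` filter while still not recording IP addresses or cookies.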

Kudos for launching something. I think my biggest suggestions would be fixing the live-demo page so it doesn't look like it's leaking other sites' data and providing some guidance about limits. It's easy to think that you don't want to put limits on people, but any architecture is made with a certain scale in mind. There's no shame in that. Sometimes what you want is a "let us know if you need more than X" message. At the very least, it lets you prepare. People sometimes use products in ways you wouldn't imagine and ways you didn't intend which the system doesn't handle gracefully.

Good luck with your product!

◧◩◪◨⬒⬓⬔⧯
107. dividu+hm[view] [source] [discussion] 2018-09-19 16:59:24
>>pdkl95+zg
I solved this in my SaaS by internally logging all the requests and then using the Measurement Protocol (https://developers.google.com/analytics/devguides/collection...) to send them from the server-side. While doing that I also set the last digit to 0 and unify user agents and other data that's not important for me.
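
A minimal sketch of that server-side approach, assuming the Universal Analytics Measurement Protocol (v1) parameters `v`, `tid`, `cid`, `t`, `dp`, `uip`, and `aip`, and IPv4 addresses only. The function name `buildPageviewPayload` and all sample values are hypothetical.

```javascript
// Build a Measurement Protocol pageview payload with the last IPv4 octet zeroed.
function buildPageviewPayload({ trackingId, clientId, path, ip }) {
  const octets = ip.split('.');
  if (octets.length !== 4) throw new Error('expected an IPv4 address: ' + ip);
  octets[3] = '0'; // drop the last octet before anything leaves the server

  const params = new URLSearchParams({
    v: '1',                // protocol version
    tid: trackingId,       // property ID, e.g. 'UA-XXX-XX'
    cid: clientId,         // anonymous client ID chosen by the server
    t: 'pageview',         // hit type
    dp: path,              // document path
    uip: octets.join('.'), // IP override, already truncated
    aip: '1',              // also ask GA to anonymize the IP
  });
  return params.toString();
}
```

The resulting string would then be sent as the body of a POST to the collection endpoint described in the linked documentation.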
◧◩◪◨⬒
108. dvko+Pm[view] [source] [discussion] 2018-09-19 17:02:47
>>lucide+i5
If you need an open-source solution that truly cares about privacy yet can still keep track of unique pageviews, there's always Fathom Analytics (https://github.com/usefathom/fathom).
112. tzury+so[view] [source] 2018-09-19 17:14:49
>>Adriaa+(OP)
So this is open to everyone?

I mean, can I just see stats of a site that uses the service?

e.g.

https://simpleanalytics.io/simpleanalytics.io

114. fiatja+Np[view] [source] 2018-09-19 17:23:25
>>Adriaa+(OP)
This feels like a rant, but I've posted my https://trackingco.de/ here multiple times, which has a very similar proposal (and is cheaper) but never got a single line of feedback.
119. phprec+rr[view] [source] 2018-09-19 17:33:21
>>Adriaa+(OP)
At my work (The New York Public Library), we created a “Google Analytics Proxy” that receives requests and then proxies them to Google’s Measurement Protocol so you still get the benefit of using Google Analytics but can control exactly what’s sent/saved in real-time.

It’s intended as a mostly drop-in replacement for the GA analytics.js API and to be used as an AWS Lambda.

You can check it out here: https://github.com/NYPL/google-analytics-proxy

◧◩
122. mr337+Fs[view] [source] [discussion] 2018-09-19 17:41:24
>>fiatja+Np
The live demo that is linked on the main page doesn't work.

https://trackingco.de/public/9ykvs7rk

◧◩
125. ianwal+Uu[view] [source] [discussion] 2018-09-19 17:55:43
>>fiatja+Np
Here is some feedback:

The example (https://trackingco.de/public/9ykvs7rk) does not work for me. Also, the first time I visited the site I saw Lightning Bitcoin and then left. You lost me as soon as I read that because I'm not interested in that. I was just trying to find a simple (but useful) analytics service that's easy to use.

◧◩◪
127. joshyi+my[view] [source] [discussion] 2018-09-19 18:19:33
>>southe+Jb
Looks like goaccess supports --anonymize-ip=true which sets the last octet of IPv4 user IP addresses and the last 80 bits of IPv6 addresses to zeros.

source: https://github.com/allinurl/goaccess/blob/master/config/goac...
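
The zeroing described above (last octet of IPv4, last 80 bits of IPv6) can be sketched as follows. This only illustrates the mechanism, not GoAccess's implementation; `anonymizeIPv4`, `expandIPv6`, and `anonymizeIPv6` are hypothetical names, and IPv4-mapped IPv6 addresses are not handled.

```javascript
// Zero the last octet of a dotted-quad IPv4 address.
function anonymizeIPv4(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4) throw new Error('not an IPv4 address: ' + ip);
  parts[3] = '0';
  return parts.join('.');
}

// Expand a possibly "::"-compressed IPv6 address into its 8 groups.
function expandIPv6(ip) {
  const [head, tail] = ip.split('::');
  const headGroups = head ? head.split(':') : [];
  if (tail === undefined) return headGroups; // no "::" present
  const tailGroups = tail ? tail.split(':') : [];
  const missing = 8 - headGroups.length - tailGroups.length;
  return [...headGroups, ...Array(missing).fill('0'), ...tailGroups];
}

// Keep the first 48 bits (3 groups) and zero the remaining 80 bits (5 groups).
function anonymizeIPv6(ip) {
  const groups = expandIPv6(ip);
  if (groups.length !== 8) throw new Error('not an IPv6 address: ' + ip);
  return groups.slice(0, 3).join(':') + '::';
}
```

Whether this truncation gives any practical privacy is disputed elsewhere in this thread.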

◧◩
135. wumms+3E[view] [source] [discussion] 2018-09-19 18:58:14
>>Adriaa+8
"NoScript detected a potential Cross-Site Scripting attack from https://simpleanalytics.io to https://js.stripe.com."
◧◩◪
152. aabbcc+vJ[view] [source] [discussion] 2018-09-19 19:40:29
>>pdkl95+Yf
If you're concerned about having your user data in the hands of a third party, maybe it's better to do the analytics yourself.

Something like this https://stackoverflow.com/questions/34031251/javascript-libr...

◧◩
153. harris+yJ[view] [source] [discussion] 2018-09-19 19:40:49
>>fiatja+Np
I feel your pain. You should fix your live demo link (https://www.dropbox.com/s/z3qceg8mg86n5m1/Screenshot%202018-...).

Another bit of feedback is to draft on as many related stories as you can here in HN (like you are doing now).

158. ksec+eM[view] [source] 2018-09-19 20:05:20
>>Adriaa+(OP)
Does anyone here use Clicky?

https://clicky.com

161. chpmrc+RN[view] [source] 2018-09-19 20:18:56
>>Adriaa+(OP)
Did Google just install your tool? https://simpleanalytics.io/google.com :)
◧◩
164. paulja+UQ[view] [source] [discussion] 2018-09-19 20:46:30
>>Adriaa+8
Fathom thought about data and privacy policy too:

https://usefathom.com/data/

◧◩◪
167. highac+fR[view] [source] [discussion] 2018-09-19 20:48:02
>>wongar+AN
There's always one.

https://news.ycombinator.com/item?id=9224

◧◩◪◨
177. pdkl95+oV[view] [source] [discussion] 2018-09-19 21:22:39
>>harian+Nj
> We don't collect and store IPs.

First, "IPs" might be confusing; "IP addresses" would be more accurate.

More importantly, you have to collect IP addresses (or any other value in the packet headers[1][2]), even if you don't store them, if you want to receive any packets from the rest of the internet. Storage of those values is a separate issue entirely, and it's good to hear that you are intending to NOT store IP addresses (and updating the documentation)!

Also, I strongly recommend using Drdrdrq's suggestion to lower the precision of the collected window dimensions, which should be done on the client i.e. "Math.floor(window.innerWidth/50)*50". This kind of bit-reduction makes fingerprinting a lot harder.
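
A minimal sketch of that client-side bucketing, using the expression quoted above. The helper name `bucketDimension` and the default step of 50 are illustrative choices, not anything the service is known to use.

```javascript
// Round a pixel dimension down to the nearest multiple of `step`
// so that many different window sizes collapse into the same value.
function bucketDimension(pixels, step = 50) {
  return Math.floor(pixels / step) * step;
}

// In a browser this would be applied before sending, for example:
//   const width = bucketDimension(window.innerWidth);
```

With the window size mentioned earlier in this thread, both 847 and 836 fall into the 800 bucket, so the reported value no longer distinguishes them.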

[1] https://en.wikipedia.org/wiki/IPv4#Header

[2] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...

◧◩
180. amicha+WV[view] [source] [discussion] 2018-09-19 21:29:23
>>eli+ph
Also:

> When a customer of Analytics requests IP address anonymization, Analytics anonymizes the address as soon as technically feasible at the earliest possible stage of the collection network. The IP anonymization feature in Analytics sets the last octet of IPv4 user IP addresses and the last 80 bits of IPv6 addresses to zeros in memory shortly after being sent to the Analytics Collection Network. The full IP address is never written to disk in this case.

https://support.google.com/analytics/answer/2763052?hl=en

185. gator-+6Y[view] [source] 2018-09-19 21:57:20
>>Adriaa+(OP)
Data collection for legitimate purposes came up in our GDPR compliance review.

This product (https://truestats.com) collects the I.P. address and user agent for the purpose of detecting fraud (not selling data or profiling users). It is used for frequency checking and other patterns that would indicate fraud. We are still going through the legal analysis of how to deal with this, even though we have no idea who the visitors are.

I think considering the I.P. address as PII is a little much if you are not using it in a way that would violate privacy or selling the data.

186. dna_po+8Y[view] [source] 2018-09-19 21:57:58
>>Adriaa+(OP)
Just a quick reminder, that Fathom started its Pro offering only a few days ago: https://usefathom.com/

It's also Open Source so you can see for yourself what is going on, or even self-host.

◧◩
196. vanler+p11[view] [source] [discussion] 2018-09-19 22:36:02
>>mdasen+Ll
Looks like it's public

https://simpleanalytics.io/simpleanalytics.io

◧◩
197. Spone+C11[view] [source] [discussion] 2018-09-19 22:39:11
>>mossel+T5
Ahoy is a great tool to do just that, if you use Ruby on Rails: https://github.com/ankane/ahoy
◧◩◪◨⬒⬓⬔⧯
198. samirm+R11[view] [source] [discussion] 2018-09-19 22:42:35
>>kelnag+VQ
No, I'm not assuming that, because regardless of how the user browses your site, you're still going to prioritize the sizes important to you.

It wouldn't make sense to prioritize optimizing site design for the few people who are using a non-standard size.

http://gs.statcounter.com/screen-resolution-stats

◧◩
201. gregab+q31[view] [source] [discussion] 2018-09-19 23:01:13
>>eli+ph
Agreed, the first-party cookie is pretty self-evidently not a web-wide tracker.

There are lots of config options. Here's what I like to use:

  // Google Analytics Code.
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
  window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};

  // https://developers.google.com/analytics/devguides/collection/analyticsjs/field-reference
  ga('create', 'UA-XXX-XX', 'auto', {
      // The default cookie expiration is 2 years. We don't want our cookies
      // around that long. We only want just long enough to see analytics on
      // repeat visits. Instead, limit to 31 days. Field is in seconds:
      // 31 * 24 * 60 * 60 = 2678400
      'cookieExpires': 2678400,
      // We don't need a cookie to track campaign information, so remove that.
      'storeGac': false,
      // Anonymize the ip address of the user.
      'anonymizeIp': true,
      // Always send all data over SSL. Unnecessary, since the site only loads on
      // SSL, but defense in depth.
      'forceSSL': true});
  // Now, record 1 pageview event.
  ga('send', 'pageview');
◧◩
205. harian+Z41[view] [source] [discussion] 2018-09-19 23:17:32
>>eli+ph
Also read this: https://simpleanalytics.io/no-tracking, they state something about it in their policy.
◧◩◪
215. epicmu+Jr1[view] [source] [discussion] 2018-09-20 04:42:46
>>curun1+jq
How do you know the postgres implementation is naive? I've worked on several analytics platforms...including offshoots of google analytics within Google itself, and this problem domain is ridiculously easy to shard on natural partitions. And after sharding, you can start to do roll-ups, which Google Analytics does internally.

By 2014 when I left, we had a few petabytes of analytics data for a very small but high traffic set of customers. Could we query all of that at once within a reasonable online SLA? No. We partitioned and sharded the data easily and only queried the partitions we needed.
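
To make the partition-and-roll-up idea concrete, here is a small in-memory sketch. It is only an illustration under assumed names (`rollUpDaily`, `queryDay`) and an assumed event shape; a real system would do this in the database, partitioned by site and day.

```javascript
// events: [{ site, timestamp (ISO 8601), path }, ...]
// Returns Map("site|YYYY-MM-DD" -> Map(path -> count)): one partition per site and UTC day.
function rollUpDaily(events) {
  const partitions = new Map();
  for (const { site, timestamp, path } of events) {
    const day = new Date(timestamp).toISOString().slice(0, 10);
    const key = `${site}|${day}`;
    if (!partitions.has(key)) partitions.set(key, new Map());
    const counts = partitions.get(key);
    counts.set(path, (counts.get(path) || 0) + 1);
  }
  return partitions;
}

// A query touches only the one partition it needs.
function queryDay(partitions, site, day) {
  const counts = partitions.get(`${site}|${day}`);
  return counts ? Object.fromEntries(counts) : {};
}
```

Because each site's data lives in its own partitions, a heavy customer does not slow down queries for the long tail of small ones.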

If I were to do this now and didn't need near real-time (what is real-time?) I'd use sqlite. Otherwise I'd use trickle-n-flip on postgres or mysql. There are literally 10+ year-old books[1] on this wrt RDBMS.

And yes, even with 2000 clients reaching billions of requests per day, only the top few stressed the system. The rest is long tail.

1. https://www.amazon.com/Data-Warehousing-Handbook-Rob-Mattiso...

◧◩◪◨⬒⬓
219. rapnie+Nz1[view] [source] [discussion] 2018-09-20 07:07:48
>>dvko+Pm
And also Matomo (https://github.com/matomo-org/matomo)
221. marich+jB1[view] [source] 2018-09-20 07:33:05
>>Adriaa+(OP)
This is not GDPR friendly.

Executing third party JS on your website is an access to the page content, so unless the customer never had any user data or sensitive data on the page, they'll have to categorise simpleanalytics as a data processor.

Referers are often private data on their own. For example, https://www.linkedin.com/in/markalanrichards/edit identifies not just that you looked at this user, but that you are this user, as it is the profile editing page, unique to this account.

The distinction between whether simpleanalytics receives data or stores it might remove a GDPR issue for them, but it certainly remains one for customers. Having access to the IP addresses is sufficient for privacy to be invaded at any point, whether by accident (wrong logging parameter added by the next new dev), malice (how can we illegally use this and lie to customers) or compromise (hackers take control of the analytics system), and therefore puts users at risk of full tracking at any point. As mentioned earlier, GDPR is also about access. It is definitely about storage, but the part in between, being given data (not just having access to take it, and not yet putting it on disk), is definitely included too.

In summary, simpleanalytics need to stop lying and redo their privacy impact assessments. Meanwhile don't use third party analytics (I have no idea how you maintain security control on third party JS) and if you're silly enough to, then it definitely is a GDPR consideration that needs to be assessed, added to audit, added to privacy policies, etc.

◧◩◪
227. whylo+DF1[view] [source] [discussion] 2018-09-20 08:38:44
>>harian+nZ
I was able to add the tracking code to my site without signing up and could see the stats without any authentication (see my other comment: https://news.ycombinator.com/item?id=18024886). Is that by design?
◧◩◪◨⬒⬓
238. sharce+MY1[view] [source] [discussion] 2018-09-20 12:44:01
>>Lyndsy+qz
Probably. Wow, you used the word "probably". I guess you aren't aware of the many cases where, when a Chrome extension gets popular, indie developers are contacted by some company, and many have sold their extension or let the company collect data. This data also gets sold to third parties; many such cases with small-to-medium websites have occurred. Remember Unroll.me.

Also, Google knows how to make profiles, and it knows the importance of that data and of keeping it safe. It is also somewhat answerable to consumer groups, users, shareholders, and regulatory bodies. An indie dev doesn't know how to make a good profile and is more likely to sell the data to make revenue. I'm not ridiculing indie devs, just your assumption that a solo dev is an angel.

https://www.labnol.org/internet/sold-chrome-extension/28377/

https://m.slashdot.org/story/328731

◧◩
244. Findus+Ih2[view] [source] [discussion] 2018-09-20 15:16:47
>>jackgo+591
Just FYI: With Piwik/Matomo you can replay your access.log and therefore never miss any data even if the instance goes down completely: https://matomo.org/faq/log-analytics-tool/faq_19221/
◧◩◪
245. tomask+jt2[view] [source] [discussion] 2018-09-20 16:43:54
>>ucario+hr
Here's a GDPR-compliant system that answers complex questions. Hint: if your content is worthy, some readers will agree to reasonable analytics, and you can extrapolate from this.

https://www.baekdal.com/thoughts/inside-story-what-i-did-to-...
