So I built Simple Analytics. To ensure that it's fast, secure, and stable, I built it entirely using languages that I'm very familiar with. The backend is plain Node.js without any framework, the database is PostgreSQL, and the frontend is written in plain JavaScript.
I learned a lot while coding. For example, sending requests as JSON requires an extra (pre-flight) request, so in my script I use the "text/plain" content type, which does not. The script is publicly available (https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl...). It works out of the box with modern frontend frameworks by overwriting the "history.pushState" function.
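A minimal sketch of both tricks, not the actual Simple Analytics script: `sendPageview` and `onPushState` are hypothetical names, and the `/collect` endpoint in the usage note is an assumption. A POST labelled "text/plain" counts as a simple request under CORS, so the browser skips the OPTIONS preflight.

```javascript
// Hypothetical sketch; not the real Simple Analytics script.

// Send the payload as text/plain. That content type is CORS-safelisted,
// so a cross-origin POST goes out without an OPTIONS preflight request.
// `fetchImpl` is passed in so the function also runs outside a browser.
function sendPageview(fetchImpl, endpoint, payload) {
  return fetchImpl(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'text/plain' },
    body: JSON.stringify(payload), // still JSON, just labelled as text
  });
}

// Wrap history.pushState so client-side route changes trigger a callback.
function onPushState(historyObj, callback) {
  const original = historyObj.pushState;
  historyObj.pushState = function (...args) {
    const result = original.apply(this, args);
    callback(args[2]); // third argument is the new URL, if one was given
    return result;
  };
}
```

In a browser this could be wired up as `onPushState(window.history, url => sendPageview(fetch, 'https://example.com/collect', { url }))`. The trade-off is that the server has to parse the body as JSON itself, since the content type no longer says so.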
I am transparent about what I collect (https://simpleanalytics.io/what-we-collect) so please let me know if you have any questions. My analytics tool is just the start for what I want to achieve in the non-tracking movement.
We can be more valuable without exploiting user data.
Anyway, cool project! I've always felt the same about using GA, given that I actually like to pretend I have some sort of privacy these days and always have an adblocker on, so I hated setting it up for people. I will definitely keep an eye on this the next time someone asks me to set up GA.
I would, however, be a little more skeptical of tools claiming to be privacy-first than I would be of GA (who I presume are not privacy-first). On that note, some quick questions:
- Any plans to open source? I've used Piwik/Matomo in the past, and while I'm not a massive fan of the code-quality of that project, it's at least auditable (and editable).
- You say you're transparent about what you collect—IPs aren't mentioned on that page[0]. Are IPs stored in full or how are they handled? I assume you log IPs?
- How do you discern unique page-views? You seem to be dogfooding and I see no cookies or localStorage keys set.
It disappoints in every way; you can't even check yesterday's stats.
It looks like anyone can see the stats for any domain using the service without any authentication. I added the tracking code to my domain and was able to hit https://simpleanalytics.io/[mydomain.co.uk] without signing up or logging in. I was also able to see the stats for your personal site.
Is that intentional? If it is, it seems like an odd choice for a privacy-first service. If not, it seems like quite a worrying oversight in a paid-for product.
Something that would interest me is a short explanation of https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl....
You already have very brief comments at strategic points. If you would explain these one by one, I would learn a lot about optimizing for number of requests, skipping stuff to load, etc. Maybe a technical blog post at a later time when the dust settles?
Edit: It seems to have been filtered now, but people were using spoofed referer headers to leave offensive messages for HN users.
Regardless of your intentions, you are collecting enough data to track users.
> I am transparent about what I collect ([URL])
That page doesn't mention that you are also collecting (and make no claim about storing) the globally-visible IP address (and any other data in the IP and TCP headers). This can be uniquely identifying; even when it isn't unique you usually only need a few bits of additional entropy to reconstruct[1] a unique tracking ID.
In my case, you're collecting and storing more than enough additional entropy to make a decent fingerprint because [window.innerWidth, window.innerHeight] == [847, 836]. Even if I resized the window, you could follow those changes simply by watching analytics events from the same IP that are temporally nearby (you are collecting and storing timestamps).
[1] An older comment where I discussed how this could be done (and why GA's supposed "anonymization" feature (aip=1) is a blatant lie): https://news.ycombinator.com/item?id=17170468
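To make the "few bits of additional entropy" point concrete, here is a back-of-the-envelope sketch. The function names, the figure of 16 users behind one address, and the 1024-pixel range are illustrative assumptions, not measurements.

```javascript
// Bits needed to single out one visitor among `candidates` who share an
// IP address: ceil(log2(candidates)).
function bitsToIdentify(candidates) {
  return Math.ceil(Math.log2(candidates));
}

// Upper bound on the bits a value can contribute if it takes
// `distinctValues` equally likely values.
function maxBits(distinctValues) {
  return Math.log2(distinctValues);
}

// Assumed example: 16 people behind one NAT'd address.
const needed = bitsToIdentify(16);

// Assumed example: viewport width and height each fall anywhere in a
// 1024-pixel range, so together they carry at most 10 + 10 bits.
const available = maxBits(1024) + maxBits(1024);

console.log({ needed, available });
```

Real viewport sizes are far from uniformly distributed, so the true figure is lower than this upper bound, but an unusual size such as [847, 836] sits in the rare tail where most of the identifying information is.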
That doesn't provide any practical amount of privacy. For a longer discussion of why this is at best a placebo, see: https://news.ycombinator.com/item?id=17170468
I still get objecting to Google products on principle, but their privacy policy for GA seems pretty reasonable to me: https://support.google.com/analytics/answer/6004245
It absolutely isn't privacy-first if it requires running on someone else's machine and giving your users' data to them. Another issue is that while your server is in the EU, the hosting company is subject to US law and everything that comes with it (for example, https://en.wikipedia.org/wiki/CLOUD_Act).
I do have some questions/comments and I apologize if they seem a bit rapid-fire.
* When I look at the "Top Pages", there are links. When I click the link, it brings me to that page on your site not a chart of hits for that page. Is that how it's meant to work?
* If I sign up for your service, do my stats become public? https://simpleanalytics.io/apple.com just says "This domain does not have any data yet" (presumably because Apple doesn't have your script installed). But that kinda indicates that any domain with your script installed would show up there. It might just be an error in the messaging, but probably something to fix.
* What's your backend like? I'm mostly curious because analytics at scale isn't an easy problem. Do you write to a log-structured system with high availability (like Kafka) and then process asynchronously? How do you handle making the chart of visitors? Do you roll up the stats periodically?
* Speaking of scale, if I started sending thousands or tens of thousands of requests per second at you, would that be bad? Is this more targeted at small sites?
* What do you do about bots? Bot traffic can be a large source of traffic that throws off numbers.
* How long before numbers are available? It's September 19th, but the last stats on the live demo are September 18th. Is it lagged by a day?
* Do you not want to track user-agents for privacy reasons as well? Seems like a UA doesn't really identify anyone, but it can be useful for determining if you want to support a browser.
* You're not counting anyone that has the "Do Not Track" header. To me, DNT is more about tracking than counting (which is different). Even if you counted my hit, it wouldn't be tracking me if you didn't record information like IP address and there were no cookies.
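For reference, the client-side check behind the "Do Not Track" point can be as small as the sketch below. `shouldCount` is a hypothetical name, and this is an assumption about how such a script might work, not a copy of the Simple Analytics code.

```javascript
// Hypothetical sketch: skip counting when the visitor has enabled
// Do Not Track. `nav` stands in for the browser's `navigator` object.
function shouldCount(nav) {
  // Browsers that expose the setting report the string '1' when DNT is on.
  return nav.doNotTrack !== '1';
}
```

In a page script this would be called as `shouldCount(navigator)` before any request is sent. Counting DNT visitors without storing anything identifying, as suggested above, would mean dropping this check and trimming the payload instead.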
Kudos for launching something. I think my biggest suggestions would be fixing the live-demo page so it doesn't look like it's leaking other sites' data and providing some guidance about limits. It's easy to think that you don't want to put limits on people, but any architecture is made with a certain scale in mind. There's no shame in that. Sometimes what you want is a "let us know if you need more than X" message. At the very least, it lets you prepare. People sometimes use products in ways you wouldn't imagine and didn't intend, which the system doesn't handle gracefully.
Good luck with your product!
I mean, can I just see stats of a site that uses the service?
e.g.
It’s intended as a mostly drop-in replacement for the GA analytics.js API and to be used as an AWS Lambda.
You can check it out here: https://github.com/NYPL/google-analytics-proxy
The example (https://trackingco.de/public/9ykvs7rk) does not work for me. Also, the first time I visited the site I saw Lightning Bitcoin and then left. You lost me as soon as I read that because I'm not interested in that. I was just trying to find a simple (but useful) analytics service that's easy to use.
source: https://github.com/allinurl/goaccess/blob/master/config/goac...
Something like this https://stackoverflow.com/questions/34031251/javascript-libr...
Another bit of feedback is to draft on as many related stories as you can here in HN (like you are doing now).
First, "IPs" might be confusing; "IP addresses" would be more accurate.
More importantly, you have to collect IP addresses (or any other value in the packet headers[1][2]), even if you don't store them, if you want to receive any packets from the rest of the internet. Storage of those values is a separate issue entirely, and it's good to hear that you intend NOT to store IP addresses (and are updating the documentation)!
Also, I strongly recommend using Drdrdrq's suggestion to lower the precision of the collected window dimensions, which should be done on the client i.e. "Math.floor(window.innerWidth/50)*50". This kind of bit-reduction makes fingerprinting a lot harder.
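As a small sketch of that suggestion (the helper name `bucket` is made up for illustration; the step of 50 comes from the expression above):

```javascript
// Round a dimension down to the nearest multiple of `step`, the same
// arithmetic as Math.floor(window.innerWidth / 50) * 50.
function bucket(value, step = 50) {
  return Math.floor(value / step) * step;
}

// The viewport mentioned earlier in the thread, [847, 836], collapses
// to [800, 800], the same pair as every other viewport in that 50x50 cell.
console.log([bucket(847), bucket(836)]);
```

In the page script the call would be `bucket(window.innerWidth)` and `bucket(window.innerHeight)`, so the precise values never leave the browser.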
[1] https://en.wikipedia.org/wiki/IPv4#Header
[2] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...
> When a customer of Analytics requests IP address anonymization, Analytics anonymizes the address as soon as technically feasible at the earliest possible stage of the collection network. The IP anonymization feature in Analytics sets the last octet of IPv4 user IP addresses and the last 80 bits of IPv6 addresses to zeros in memory shortly after being sent to the Analytics Collection Network. The full IP address is never written to disk in this case.
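For concreteness, the masking described in that quote amounts to something like the sketch below. The function names are hypothetical and this is not Google's code; the IPv6 helper assumes a fully expanded address with all eight groups present.

```javascript
// Zero the last octet of a dotted-quad IPv4 address.
function anonymizeIPv4(ip) {
  const octets = ip.split('.');
  octets[3] = '0';
  return octets.join('.');
}

// Zero the last 80 bits (five 16-bit groups) of an IPv6 address.
// Assumption: the address is fully expanded, with all eight groups
// present and no "::" shorthand.
function anonymizeIPv6(ip) {
  const groups = ip.split(':');
  return groups.slice(0, 3).concat(['0', '0', '0', '0', '0']).join(':');
}

console.log(anonymizeIPv4('203.0.113.77'));
console.log(anonymizeIPv6('2001:db8:85a3:0:0:8a2e:370:7334'));
```

Note how much survives: 24 of the 32 IPv4 bits are untouched, so a masked address still narrows a visitor down to at most 256 candidates, which is the basis of the "placebo" objection raised elsewhere in this thread.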
This product (https://truestats.com) collects the I.P. address and user agent for the purpose of detecting fraud (not selling data or profiling users). It is used for frequency checking and other patterns that would indicate fraud. We are still going through the legal analysis of how to deal with this, even though we have no idea who the visitors are.
I think considering the I.P. address as PII is a little much if you are not using it in a way that would violate privacy or selling the data.
It's also Open Source so you can see for yourself what is going on, or even self-host.
It wouldn't make sense to prioritize optimizing site design for the few people who are using a non-standard size.
There are lots of config options. Here's what I like to use:
// Google Analytics Code.
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};
// https://developers.google.com/analytics/devguides/collection/analyticsjs/field-reference
ga('create', 'UA-XXX-XX', 'auto', {
  // The default cookie expiration is 2 years. We don't want our cookies
  // around that long. We only want just long enough to see analytics on
  // repeat visits. Instead, limit to 31 days. Field is in seconds:
  // 31 * 24 * 60 * 60 = 2678400
  'cookieExpires': 2678400,
  // We don't need a cookie to track campaign information, so remove that.
  'storeGac': false,
  // Anonymize the ip address of the user.
  'anonymizeIp': true,
  // Always send all data over SSL. Unnecessary, since the site only loads on
  // SSL, but defense in depth.
  'forceSSL': true
});
// Now, record 1 pageview event.
ga('send', 'pageview');
By 2014, when I left, we had a few petabytes of analytics data for a very small but high-traffic set of customers. Could we query all of that at once within a reasonable online SLA? No. We partitioned and sharded the data easily and only queried the partitions we needed.
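The idea of querying only the partitions you need can be illustrated with a toy, in-memory stand-in. This is purely a sketch: the per-day partition key, the function names, and the sample data are assumptions, and it says nothing about how the system described here was actually built.

```javascript
// Illustrative sketch: keep one counter map per day (a "partition") and
// touch only the partitions a query asks for.
const partitions = new Map(); // 'YYYY-MM-DD' -> Map(path -> hits)

function record(day, path) {
  if (!partitions.has(day)) partitions.set(day, new Map());
  const counts = partitions.get(day);
  counts.set(path, (counts.get(path) || 0) + 1);
}

function hits(days, path) {
  // Only the listed partitions are read; the rest are never scanned.
  return days.reduce((sum, day) => {
    const counts = partitions.get(day);
    return sum + ((counts && counts.get(path)) || 0);
  }, 0);
}

// Made-up sample data.
record('2018-09-18', '/');
record('2018-09-18', '/');
record('2018-09-19', '/');

console.log(hits(['2018-09-18'], '/'));
```

The cost of a query then depends on how many days it spans rather than on the total volume stored, which is what keeps a petabyte-scale store answerable for narrow questions.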
If I were to do this now and didn't need near real-time (what is real-time?), I'd use sqlite. Otherwise I'd use trickle-n-flip on postgres or mysql. There are literally 10+ year-old books[1] on this wrt RDBMS.
And yes, even with 2000 clients reaching billions of requests per day, only the top few stressed the system. The rest is long tail.
1. https://www.amazon.com/Data-Warehousing-Handbook-Rob-Mattiso...
Executing third party JS on your website is an access to the page content, so unless the customer never had any user data or sensitive data on the page, they'll have to categorise simpleanalytics as a data processor.
Referrers are often private data in their own right. For example, https://www.linkedin.com/in/markalanrichards/edit reveals not just that you looked at this user, but that you are this user, as it is the profile editing page, unique to this account.
The distinction between simpleanalytics receiving data and storing it might remove a GDPR issue for them, but it certainly remains one for their customers. Having access to the IP addresses is enough for privacy to be invaded at any point, whether by accident (a wrong logging parameter added by the next new dev), malice (how can we illegally use this and lie to customers) or compromise (hackers take control of the analytics system), and it therefore puts users at risk of full tracking at any time. As mentioned earlier, GDPR is also about access. It is definitely about storage, but the part in between, being given data (not just having access to take it, and not writing it to disk), is definitely covered too.
In summary, simpleanalytics need to stop lying and redo their privacy impact assessments. Meanwhile don't use third party analytics (I have no idea how you maintain security control on third party JS) and if you're silly enough to, then it definitely is a GDPR consideration that needs to be assessed, added to audit, added to privacy policies, etc.
Also, Google knows how to build profiles, and it knows the importance of that data and of keeping it safe. It is also somewhat answerable to consumer groups, users, shareholders, and regulatory bodies. An indie dev doesn't know how to build a good profile and is more likely to sell the data to make revenue. I'm not ridiculing indie devs, just your assumption that a solo dev is an angel.
https://www.labnol.org/internet/sold-chrome-extension/28377/
https://www.baekdal.com/thoughts/inside-story-what-i-did-to-...