So I built Simple Analytics. To ensure that it's fast, secure, and stable, I built it entirely using languages that I'm very familiar with. The backend is plain Node.js without any framework, the database is PostgreSQL, and the frontend is written in plain JavaScript.
I learned a lot while coding. For example, sending requests as JSON requires an extra (preflight) request, so my script uses the "text/plain" content type, which does not trigger one. The script is publicly available (https://github.com/simpleanalytics/cdn.simpleanalytics.io/bl...). It works out of the box with modern frontend frameworks by wrapping the "history.pushState" function.
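For anyone curious, the trick looks roughly like this. This is just a sketch, not the actual script; the endpoint URL and function names are made up for illustration:

```javascript
// Build the analytics request. "text/plain" keeps this a CORS "simple
// request", so the browser skips the extra OPTIONS preflight that
// "application/json" would trigger.
function buildTrackingRequest(payload) {
  return {
    method: "POST",
    headers: { "Content-Type": "text/plain" },
    body: JSON.stringify(payload),
  };
}

// SPA support: wrap history.pushState so client-side navigations also fire
// a pageview (guarded so this sketch can be loaded outside a browser too).
if (typeof history !== "undefined" && typeof fetch !== "undefined") {
  const original = history.pushState;
  history.pushState = function (...args) {
    original.apply(this, args);
    fetch("https://example.com/collect",
          buildTrackingRequest({ path: location.pathname }));
  };
}
```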
I am transparent about what I collect (https://simpleanalytics.io/what-we-collect) so please let me know if you have any questions. My analytics tool is just the start for what I want to achieve in the non-tracking movement.
We can be more valuable without exploiting user data.
Anyway, cool project! I've always felt the same about GA: I like to pretend I have some sort of privacy these days and always have an adblocker on, so I hated setting it up for people. I'll definitely keep an eye on this the next time someone asks me to set up GA.
Anyway, I wish you the best of luck with your endeavor. By the way, you might want to fix the links above.
I would, however, be a little more skeptical of tools claiming to be privacy-first than I would be of GA (which I presume is not privacy-first). On that note, some quick questions:
- Any plans to open source? I've used Piwik/Matomo in the past, and while I'm not a massive fan of the code-quality of that project, it's at least auditable (and editable).
- You say you're transparent about what you collect—IPs aren't mentioned on that page[0]. Are IPs stored in full or how are they handled? I assume you log IPs?
- How do you discern unique page-views? You seem to be dogfooding and I see no cookies or localStorage keys set.
I just have a quick question: what subset of the JavaScript implementation does the tracking pixel provide? If all that's missing is screen size, I might choose it to avoid running third-party code. For performance, I combine, minify, and embed all scripts and styles into each page, which lets me achieve perfect scores in the Chrome Auditor.
How are you storing all the information that analytics users want to know (what devices, what languages, what geolocations, what queries, what page navigations and clicks, etc.)?
After reading what you collect, I'm assuming you do a lot of JS sniffing of browser properties to gather this information, along with IP address analysis. Is that correct? Or what are your plans for these features if you don't have them now?
Overall though I'd say great design + sales pitch. I think if the product delivers on enough features you will have something here. Great job!
What if you had one-day retention of IP addresses for per-day unique views? Uniques seem like too important a metric to eliminate completely, and one-day retention seems like a decent trade-off, at the expense of being able to do unique analysis over longer time periods.
One can use hashes with regularly changing salts that are destroyed after a while, making older hashes unusable for some purposes.
But to my eyes, expiring salts isn't much different from deleting IP addresses after one day; it's just more machinery. People have to trust that you're doing either, so why bother, beyond being able to use the word "hashing" in marketing language?
Apart from the unfortunate non-open-source answer, this sounds great!
I get others' concerns about wanting unique pageviews, but that metric is always a bit of a sketchy either-or for extremely privacy-conscious people. It's both an incredibly valuable metric, and also one that's difficult to square with complete privacy (basically it's always going to be pseudonymous at best).
What are the security implications of this?
Nice. You might want to add an explicit copyright/license though. Make it less (or more) dangerous for other devs to read it...
I think it could actually be quite useful to "standardize" on a simple (open/libre) front end for analytics (with an implied back-end standard).
Regardless of your intentions, you are collecting enough data to track users.
> I am transparent about what I collect ([URL])
That page doesn't mention that you are also collecting (and make no claim about storing) the globally-visible IP address (and any other data in the IP and TCP headers). This can be uniquely identifying; even when it isn't unique you usually only need a few bits of additional entropy to reconstruct[1] a unique tracking ID.
In my case, you're collecting and storing more than enough additional entropy to make a decent fingerprint because [window.innerWidth, window.innerHeight] == [847, 836]. Even if I resized the window, you could follow those changes simply by watching analytics events from the same IP that are temporally nearby (you are collecting and storing timestamps).
[1] An older comment where I discussed how this could be done (and why GA's supposed "anonymization" feature (aip=1) is a blatant lie): https://news.ycombinator.com/item?id=17170468
That doesn't provide any practical amount of privacy. For a longer discussion of why this is at best a placebo, see: https://news.ycombinator.com/item?id=17170468
It absolutely isn't privacy-first if it requires running on someone else's machine and handing your users' data to them. Another issue: while your server is in the EU, the hosting company is subject to US law and everything that comes with it (e.g. https://en.wikipedia.org/wiki/CLOUD_Act).
Shared-source proprietary software goes back at least as far as the Burroughs B5000 mainframe, whose customers got the source and could send in fixes and updates. Microsoft has a Shared Source program. Quite a few suppliers in embedded do it. There's also a company that sells UI software and gives the source to customers who buy the higher-priced version.
I will warn that people might still rip off and use your code. Given it's JavaScript, I think they can do that anyway with reverse engineering. It also sounds like they could build it themselves anyway. Like most software bootstrappers or startups, you're already in a race with other players that might copy you with clean slate implementations. So, I don't know if the risk is that big a deal or not. I figured I should mention it for fairness.
Given the choice between a lot of data about me given to a small provider and somewhat less data about me given to Google, I'd generally choose the former.
I’m not the OP, but where is there evidence that they’re storing the IP? Sure it’s in the headers that they process but that doesn’t mean they’re storing it.
However, I am a bit confused as to who would want this product. The sort of questions this product answers seem quite limited:
1. What URLs are getting lots of hits?
2. What referrers are generating lots of hits?
3. What screen sizes are those hits coming from?
What decisions can be drawn from those questions? This seems useful only to perhaps some blog, where they're wondering what sort of content is successful, where to advertise more, and whether to bother making a mobile website.
Without the ability to track user sessions -- even purely in localStorage -- you can't correlate pageview events. For instance, how would I answer a question like:
- How many high-interest users do I have? By "high interest", I mean someone who visited at least three pages on my website.
- Is a mobile website really worthwhile? How much of an effect does being on mobile have on whether someone will be "high-interest"?
I should think some anonymized user ID system -- even if it rotates anonymous IDs -- should be able to answer these questions without compromising privacy.
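For illustration, a rotating anonymous ID could be as simple as the sketch below. The names, the daily rotation period, and the localStorage-backed store are all my assumptions, not how any existing tool works:

```javascript
// An anonymous ID that lives only in the visitor's browser and rotates
// daily, so pageviews can be correlated within a day but not across days.
// `store` stands in for window.localStorage; `now` is injectable for tests.
function rotatingAnonId(store, now = Date.now()) {
  const day = Math.floor(now / 86400000); // days since the Unix epoch
  let saved = store.anonId ? JSON.parse(store.anonId) : null;
  if (!saved || saved.day !== day) {
    saved = { day, id: Math.random().toString(36).slice(2) };
    store.anonId = JSON.stringify(saved);
  }
  return saved.id;
}
```

Server-side, "high-interest users" then falls out of grouping pageview events by (day, ID) and counting IDs with three or more distinct pages.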
Also, I'll leave it to others to point out it's unlikely this product is exempt from GDPR.
The idea of privacy is much easier to sell if the data never leaves your own server, instead of using some analytics provider that might be run by the CIA or the Russian mafia for all we can prove.
Security matters if your concern is the data leaking to a potential malicious actor. The concern that I'm speaking to is the intended use of the data. Google is definitely going to use it for ad targeting and building a "shadow profile", but a small developer probably won't. This one says they won't, but even if they do they're likely to be much less effective than Google would be.
I'd imagine it's difficult to do in-depth analytics without tracking users...
Something like this: https://stackoverflow.com/questions/34031251/javascript-libr...
This is true. The legal department for the healthcare web sites I maintain doesn't let me store or track IP addresses, even for analytics.
I'm only allowed to tally most popular pages, display language chosen, and date/time. There might be one or two other things, but it's all super basic.
Having a random developer create a shadow profile isn't the same. Google's scale is vastly different, and its profiles can be used to track you from site to site.
There is 'justice' in a blog creator using analytics data to improve the experience of blog visitors: a user's data will, theoretically and in aggregate, create a better experience for that user in the future. The class of 'users who browse this page' gets a benefit in exchange for the cost of providing data.
Selling browsing information to advertisers is sort of 'anti-justice': using blog visitor data to track those visitors elsewhere on the internet and more effectively manipulate them into paying people money. The blog visitor's external online experience is made worse by browsing that blog.
First, "IPs" might be confusing; "IP addresses" would be more accurate.
More importantly, you have to collect IP addresses (or any other value in the packet headers[1][2]), even if you don't store them, if you want to receive any packets from the rest of the internet. Storage of those values is a separate issue entirely, and it's good to hear that you intend NOT to store IP addresses (and are updating the documentation)!
Also, I strongly recommend following Drdrdrq's suggestion to lower the precision of the collected window dimensions, which should be done on the client, e.g. "Math.floor(window.innerWidth/50)*50". This kind of bit reduction makes fingerprinting a lot harder.
[1] https://en.wikipedia.org/wiki/IPv4#Header
[2] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#...
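That precision reduction is a one-liner; here it is as a tiny helper (illustrative only):

```javascript
// Round a pixel dimension down to a 50px bucket before sending, so the
// reported window size carries far fewer bits of fingerprinting entropy.
function bucket(px, step = 50) {
  return Math.floor(px / step) * step;
}

// In the browser you would send:
//   [bucket(window.innerWidth), bucket(window.innerHeight)]
```

With this, the [847, 836] window from the earlier comment reports as [800, 800], the same as many other visitors.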
This cannot be stressed enough. At my day job I write reasonably secure software on a team for big clients, then at home I write reasonably secure software independently for small clients.
Come new security issue, the big clients at day job get first priority. Not because they are big and not because they are paying more, but rather because as a team we can reallocate resources and work on issues in parallel. At home, there is only one Dotan to work on each independent client in series.
It wouldn't make sense to prioritize optimizing site design for the few people who are using a non-standard size.
Main question: How are you handling Safari Intelligent Tracking Protection 2.0?
Really, the central point that should be clear is that this is a question for lawyers. The GDPR is incredibly far-reaching.
If it means your website has to show a message like 'We transmit your info, but save nothing', it becomes a bit weird.
I can’t say I love having Google track me, but I don’t feel any better about someone else doing it either.
I might be able to help, because I wrote an analytics tool a while back that tracks these three properties and some other stuff:
1. Knowing which URLs are being visited allows me to see if a particular campaign or blog site is popular
2. The referrer tells me where a user came from. This is helpful for knowing if I'm being linked from Reddit and should allocate more CPU cores from my host to the VMs responsible for a particular service.
3. The screen size allows me to know what aspect ratios and sizes I should optimize for. My general rule is that any screen shape that can fit a 640x480 VGA screen without clipping should allow my website to be fully readable and usable.
4. I also track a trimmed-down user agent: "Firefox", "Chrome", "IE", "Edge", "Safari", and other. All include "(recent)" or "(old)" to indicate the version, and "other" includes the full user agent. This lets me track which browsers people use and whether they're outdated ("(old)" usually means over a year out of date; I adjust the cutoff regularly to keep the interval short).
5. Page load speed and connection. This is a number in 10 ms steps plus a string that's either "Mobile" or "Wired", set by a quick-and-dirty heuristic based on whether the connection appears throttled or slow, plus a few other factors. Mobile means people use my website with devices that can't or shouldn't draw much bandwidth; Wired means I can go nuts. This lets me adjust the size of my webpage to fit my userbase.
6. GeoIP: This is either "NAm", "SAm", "Eur", "Asi", "Chin", "OcA", "NAf", "SAf", "Ant" or "Other". I don't need to know more than the continent my users live on, it's good enough data. I track Chinese visitors separately since it interests me.
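To make point 4 concrete, here's a rough sketch of how such trimming could work. The function and the version cutoffs are my guesses, not the actual tool, and real Safari versions live in the "Version/" token, which this sketch glosses over:

```javascript
// Reduce a full user agent string to a browser family plus a coarse
// freshness flag, keeping the full UA only for the unrecognized long tail.
// The cutoff versions below are placeholders to be refreshed regularly.
function trimUserAgent(ua, cutoffs = { Firefox: 60, Chrome: 67, Edge: 17 }) {
  // Edge must be checked before Chrome/Safari, because Edge UAs also
  // contain "Chrome/" and "Safari/" tokens.
  for (const family of ["Edge", "Firefox", "Chrome", "Safari", "MSIE"]) {
    const m = ua.match(new RegExp(family + "[\\/ ](\\d+)"));
    if (!m) continue;
    const label = family === "MSIE" ? "IE" : family;
    const fresh = Number(m[1]) >= (cutoffs[family] || Infinity)
      ? "(recent)" : "(old)";
    return label + " " + fresh;
  }
  return "other: " + ua;
}
```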
Overall the tool is fairly accurate and high performance + low bandwidth (a full analytics run takes 4KB of bandwidth including the script and POST request to the server). It doesn't collect any personal data and doesn't allow accurate tracking of any individual.
If I want to track high-interest users, I collate some attributes together (e.g. screen size, user agent, continent), which gets me a rough enough picture of high-interest activity for what I care about. You don't need to track specific user sessions; that stuff is covered under the GDPR and isn't necessary.
Before anyone asks if they could have this tool: nope, it's proprietary and mine. The code isn't hard; it's very minimal and fast. I wrote it all over a weekend, and I use Influx + Grafana for the output. You can do that too.
Both mine and the product of the HN post are likely not in the scope of the GDPR since no data is collected that can specifically identify a user.
You say that you do not store IP addresses, but why should anybody believe it?
Modern security is based on proof, not on trust.
Why is Google's security better than anyone else's? Monopolies often have more resources, but they lack motive, precisely because they are a monopoly. Without transparency we have no idea how secure Google's systems are, but we do know Google has been hacked before.
So I think you would have to notify the user that you are sending their IP address to the processor under legitimate interest and have a way for them to "object" to that use (i.e. turn off analytics). For legitimate interest, the objection can be after the fact, so having a configuration screen that stores a cookie that allows them to turn off analytics when they are on the site would probably do it.
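The client-side check for such an opt-out could be as simple as this sketch (the cookie name is made up):

```javascript
// Before sending any analytics event, honor an opt-out cookie that the
// visitor can set from a settings screen on the site.
function analyticsDisabled(cookieHeader) {
  return cookieHeader
    .split(";")
    .map((c) => c.trim())
    .includes("analytics_optout=1");
}

// In the browser: if (analyticsDisabled(document.cookie)) return;
```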
Whether on purpose, by accident, or simply through mental bias, they seriously misrepresent the number of people for whom JavaScript is blocked, not loading, disabled by default for unknown websites (me), or unavailable for any other reason.
Website owners and creators should at least have that information as a reliable metric to base their development choices on.
Could I ask what tech you're using for the graph data? I'm working on a similar SaaS (not analytics) which requires graphs. I'm a DevOps engineer for an ISP, and I do a lot of work with things like Graphite/Carbon, Prometheus and so on - but I can't seem to settle on what to use for personal projects. Do you use a TSDB at all? Or are you just storing it in SQL for now?
I can show the code, and I will probably do so in my next blog post, but that does not guarantee anything.
> Modern security is based on proof, not on trust.
Is it? So if there is a hosted version of an open source tool, are you sure they run the same code on the hosted version as in the open source tool? It's still based on trust.
Also, Google knows how to build profiles, and it knows the importance of that data and of keeping it safe. It is also somewhat answerable to consumer groups, users, shareholders, and regulatory bodies. An indie dev doesn't know how to build a good profile and is more likely to sell the data to make revenue. I'm not ridiculing indie devs, just your assumption that a solo dev is an angel.
https://www.labnol.org/internet/sold-chrome-extension/28377/
https://www.baekdal.com/thoughts/inside-story-what-i-did-to-...