Data sharing has changed a lot even over the years I have worked in IT. When I started, data was primarily held within an organization, and sharing that data on a larger scale was almost unheard of. Open datasets and the like that we have today were utterly unknown. At the start of my career, data storage was also expensive. As such, storing data in racks of servers in a data centre was a cost that many businesses were very acutely aware of. These days, with the advent of object storage and a myriad of cloud providers offering object storage capacity at an almost unlimited scale, it's not something that businesses are as worried about when it comes to the expenditure in an IT budget.
How has data sharing changed? Inside a business, self-service business intelligence, reporting, and analytics tools are far more common than they were a couple of decades ago. Gone are the static reports and dashboards that we were used to in those days. Self-service or AI-generated analytics now answers business questions on a near real-time basis.
Of course, it's not just internal to businesses that data is now shared. You can use many open datasets to do your own research and augment your existing datasets with additional information. Some are free and easily downloadable, while others are commercially available. Companies spend a lot of money yearly buying third-party data to make better business decisions.
The other big change we've seen is the speed of connectivity. Data sets back in the day were limited by the amount of data that you could ship over an Internet connection or on physical media. These days, with superfast connections around the globe, the locality of that data is less of a concern (except when it comes to the legality of using that data and the jurisdiction in which that data was created).
These days, with modern data protection frameworks like the GDPR and the user's right to be forgotten, companies have to pay more care and attention to the data they collect, process, and use to ensure that they don't violate the GDPR or similar data processing restrictions in other parts of the globe.
The Privacy Paradox: Balancing Transparency and Data Security
So, just what are the GDPR and the CCPA? The GDPR was brought in by the European Union, and it provides a number of protections for Internet users, including the right to be forgotten and control over who their data is shared with and how it is shared.
The General Data Protection Regulation (GDPR) represents one of the most significant data protection and privacy shifts in modern history, certainly in the European Union. Its roots lie in the early days of data regulation within the EU, beginning with the Data Protection Directive of 1995.
The Data Protection Directive was part of the European Union's early efforts to regulate how personal data was handled and shared, establishing a baseline for data privacy across member states. However, technology's rapid advancements and the growth of the digital economy quickly outpaced these regulations. Does that sound familiar?
The GDPR was first proposed in 2012 as a regulation that would enforce uniform data protection across all EU member states. After four years of negotiations and discussions between stakeholders, legislators, and lobbyists, the GDPR was finally adopted in 2016, and it became enforceable in May 2018.
The regulation was designed to empower European Union citizens with control over their personal data, mandate transparency in data handling, and establish hefty fines for non-compliance. We've seen a few of those fines already: British Airways was fined £20m in 2020 for a data breach.
So here are five essential points businesses need to be aware of to ensure GDPR compliance:
Lawful Basis for Data Processing: Businesses must establish a legal basis for collecting and processing personal data, such as consent, contractual necessity, or legitimate interest. Without a valid legal basis, data processing activities are non-compliant.
Data Subject Rights: GDPR grants individuals specific rights, including the right to access, correct, delete, and restrict the processing of their data. Businesses must have mechanisms to address and respond to these requests promptly.
Data Protection Impact Assessments (DPIAs): Businesses are required to conduct DPIAs for high-risk data processing activities to identify and mitigate privacy risks. DPIAs are crucial for evaluating the impact on individuals' privacy before launching new data-driven projects.
Breach Notification Requirements: In the event of a data breach, businesses must notify the relevant supervisory authority within 72 hours of becoming aware of it, unless the breach is unlikely to result in a risk to individuals' rights and freedoms. Notifications to affected individuals may also be necessary under certain conditions.
Accountability and Documentation: GDPR mandates that businesses maintain comprehensive records of their data processing activities. This documentation must include details about data types, processing purposes, retention periods, and data-sharing practices and be available for auditing.
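To make the 72-hour breach notification window concrete, here is a minimal Python sketch. It assumes the clock starts when the business becomes aware of the breach; the function names and the example timestamps are hypothetical and only there for illustration, and none of this is legal advice.

```python
from datetime import datetime, timedelta, timezone

# The 72-hour window from the breach notification point above.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(became_aware_at: datetime) -> datetime:
    """Return the latest time to notify the supervisory authority.

    Assumption: the window starts when the business becomes aware
    of the breach. Timestamps must be timezone-aware so the deadline
    is unambiguous across regions.
    """
    if became_aware_at.tzinfo is None:
        raise ValueError("became_aware_at must be timezone-aware")
    return became_aware_at + NOTIFICATION_WINDOW


def hours_remaining(became_aware_at: datetime, now: datetime) -> float:
    """Hours left before the deadline (negative once it has passed)."""
    return (notification_deadline(became_aware_at) - now) / timedelta(hours=1)


# Hypothetical example: breach discovered on 1 March 2024 at 09:00 UTC.
aware = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
deadline = notification_deadline(aware)
left_after_one_day = hours_remaining(aware, aware + timedelta(hours=24))
```

Keeping the timestamps timezone-aware matters here because incident teams and supervisory authorities are often in different time zones.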
The GDPR isn't the only legislation dealing with data privacy. There is also the CCPA (California Consumer Privacy Act).
The CCPA is slightly different, with some critical distinctions in geographic scope and applicability. The CCPA applies specifically to businesses operating in California (as the name suggests) or doing business with California residents. It also has criteria based on annual revenue, data processing volume, and revenue derived from data sales, making it more focused on larger businesses. Unlike the GDPR, the CCPA doesn't require a legal basis for data processing in the same way. Instead, it focuses on giving consumers control over the sale of their personal data, allowing them to opt out of data sales rather than requiring them to consent explicitly to data collection or processing.
Regarding individual rights, the CCPA provides rights for California residents, such as the right to know what personal data is collected, the right to delete personal data, and the right to opt out of data sales. However, the CCPA doesn't offer a set of rights as extensive as the GDPR, especially regarding data portability and the objection to processing.
When it comes to fines, they are worlds apart. The GDPR allows for penalties of up to 4% of the company's global annual revenue or €20 million, whichever is higher. Penalties on that scale make it obvious how seriously the GDPR takes data protection.
The CCPA, in comparison, has much lower fines and penalties, which are capped at $2,500 per violation or $7,500 per intentional violation.
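The difference in scale is easier to see with some arithmetic. The Python sketch below encodes only the caps quoted above; the function names and the example revenue and violation counts are made up for illustration, and real penalties depend on the regulator and the circumstances.

```python
def gdpr_max_fine_eur(global_annual_revenue_eur: float) -> float:
    """Upper bound quoted above: 4% of global annual revenue
    or EUR 20 million, whichever is higher."""
    four_percent = global_annual_revenue_eur * 4 / 100
    return max(four_percent, 20_000_000)


def ccpa_max_penalty_usd(violations: int, intentional_violations: int = 0) -> int:
    """Upper bound quoted above: $2,500 per violation and
    $7,500 per intentional violation (counted separately here)."""
    return violations * 2_500 + intentional_violations * 7_500


# Hypothetical company with EUR 1 billion in global annual revenue:
# 4% is EUR 40 million, which exceeds the EUR 20 million floor.
large_company_cap = gdpr_max_fine_eur(1_000_000_000)

# Hypothetical company with EUR 100 million in revenue:
# 4% is only EUR 4 million, so the EUR 20 million figure applies.
small_company_cap = gdpr_max_fine_eur(100_000_000)

# Hypothetical CCPA case: 1,000 violations, 100 of them intentional.
ccpa_cap = ccpa_max_penalty_usd(violations=900, intentional_violations=100)
```

Note that the CCPA figure scales with the number of violations, so it can still add up quickly when many consumers are affected.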
A distinctive feature of the CCPA is its focus on data sales. It allows consumers to opt out of the sale of their personal information, and it requires businesses to display a "Do Not Sell My Personal Information" link on their websites, if applicable. The GDPR, in comparison, does not specifically address the sale of data but instead regulates the processing of personal data more broadly, including storage, collection, and sharing, based on lawful processing requirements.
Interoperability and Standards: Breaking Down Data Silos
Another aspect of data sharing in the modern era that has made a massive difference is open data standards. Historically, these would have been CSV files or other text-based files for processing and transferring data. Of course, these days, there's much more emphasis on the ability to move data around in a binary fashion while still maintaining interoperability between systems.
Apache Parquet files are a primary example of this. Parquet is a binary, columnar format that is readable by many different database systems and is becoming the de facto standard for data processing at volume in the modern era.
Of course, many other examples exist, such as JSON with a valid JSON schema, XML with an XML schema, and other text-based representations of data that must conform to a schema. But there are also other...
Several formats allow different programming languages, operating systems, and APIs to produce and consume data when transferring data over the wire. For example, Google's Protobuf (Protocol Buffers) format has been around for several years. It allows data serialized in one programming language to be read in another. There's also more modern technology for data-specific processing, like Apache Arrow, which provides a format for in-memory analytics as well as for the serialization of data for processing and transfer over the wire, again across platforms and operating systems.
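To illustrate the underlying idea, here is a small Python sketch using only the standard library. It is not Protobuf, Arrow, or Parquet; it simply shows how a binary layout agreed in advance by producer and consumer lets any language decode the same bytes, and how compact that can be compared with a text encoding. The record fields and the layout are hypothetical.

```python
import json
import struct

# Hypothetical schema agreed between producer and consumer:
# little-endian, unsigned 32-bit sensor id, 64-bit float reading,
# unsigned 8-bit validity flag. "<" fixes byte order and removes padding,
# so every platform sees exactly the same 13 bytes.
RECORD_LAYOUT = struct.Struct("<IdB")


def encode_record(sensor_id: int, reading: float, is_valid: bool) -> bytes:
    """Serialize one record into the agreed binary layout."""
    return RECORD_LAYOUT.pack(sensor_id, reading, int(is_valid))


def decode_record(payload: bytes) -> tuple:
    """Deserialize one record; any language that knows the layout can do the same."""
    sensor_id, reading, flag = RECORD_LAYOUT.unpack(payload)
    return sensor_id, reading, bool(flag)


binary_payload = encode_record(42, 21.5, True)
round_tripped = decode_record(binary_payload)

# The same record as JSON text, for a size comparison.
text_payload = json.dumps(
    {"sensor_id": 42, "reading": 21.5, "is_valid": True}
).encode("utf-8")

print(len(binary_payload), "bytes as binary,", len(text_payload), "bytes as JSON")
```

Formats like Protobuf and Arrow add what this sketch lacks: a shared schema definition, schema evolution, and generated or library-provided readers for many languages.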
Of course, one of the things that drove this was the advent of business use cases for open-source technology. Twenty years ago, access to open-source technology was in its infancy, and so people used commercial vendors and competing file formats and storage formats. These days, access to open-source technology via the Apache Software Foundation or just getting it off GitHub makes it far more viable for businesses to leverage open formats.
The Role of Data Sharing in Driving Innovation and Policy
Open datasets and open data can really drive forward innovation and policy making. One of the projects I'm currently working on is collecting social media data to help organizations and governments make better policy decisions regarding social media platforms.
Leveraging open datasets to augment the existing data that businesses, organizations, and governments already hold allows for a better understanding of the wider environment and the information environment we live in today. It allows for better decision-making when it comes to targeting customers, government policy, or how your local authority decides which day to collect your bin!
Of course, open standards also play a part in this. Different organizations that would traditionally have competed at a technological level now have to drive innovation to attract customers, because the open standards they build upon create a level playing field in the features and functionality that the underlying data formats offer.
Future Trends in Data Sharing: AI, Decentralization, and Ethical Frameworks
As we look ahead to how trends are shaping the future of data sharing, there is the obvious elephant in the room: AI. AI is currently driving an awful lot of innovation in the data sector. Still, it's not the only technological advancement in play. AI-driven insights, though, will become a necessity, so the data storage formats and the ways data is accessed to allow LLMs to learn and offer insight will be key.
Blockchain may have passed us by in terms of crypto in some respects, but the underlying technology is definitely unique. In the not-too-distant future, it will eventually reach the plateau of productivity, where we can leverage an immutable chain to store, process, and transfer data in an open fashion.
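For readers who haven't looked under the hood, the "immutable chain" idea boils down to each entry carrying a hash of the one before it, so that any later edit is detectable. The Python sketch below shows only that hash-linking step; it leaves out everything else a real blockchain needs (distribution, consensus, signatures), and the function names and log entries are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, data: str, previous_hash: str) -> str:
    """SHA-256 over the block contents, including the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "previous_hash": previous_hash},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_block(chain: list, data: str) -> None:
    """Add a block that is linked to the current tail of the chain."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "previous_hash": previous_hash,
        "hash": block_hash(index, data, previous_hash),
    })


def is_valid(chain: list) -> bool:
    """Recompute every hash and check each link; any edit breaks the chain."""
    expected_previous = GENESIS_HASH
    for index, block in enumerate(chain):
        if block["index"] != index or block["previous_hash"] != expected_previous:
            return False
        if block["hash"] != block_hash(index, block["data"], block["previous_hash"]):
            return False
        expected_previous = block["hash"]
    return True


# Hypothetical data-sharing log.
chain = []
append_block(chain, "dataset v1 published")
append_block(chain, "dataset v1 shared with partner")

# Tampering with an earlier entry is detected on validation.
tampered = [dict(block) for block in chain]
tampered[0]["data"] = "dataset v1 never published"
```

Strictly speaking this makes the log tamper-evident rather than impossible to change; the immutability comes from many independent parties holding and verifying copies.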
Finally, as we collect more data and understand more about the environment around us, the ethical frameworks and the way that people prioritize responsible data use will shift. They have shifted and will continue to shift as we move further into a data-driven society where so much of what we do is driven by the day-to-day collection, understanding, and dissemination of data across businesses and organizations around the globe.