Beyond the Comfort Zone: The Dynamics of Technological AcceptanceNov 28, 2023
In recent weeks, when introducing our new AI-based platform for qualitative data analysis Qeludra AI, a few people expressed some uneasiness with the new technology pointing out issues around ethics and data privacy. There was also a lively discussion around it at the recent Symposium on AI in Qualitative Research organized by the CAQDAS Networking Project and the Social Research Associations.
In this article, I am primarily concentrating on issues related to data privacy and security, aiming to address the fears and uncertainties that may stem from a limited understanding of how AI processes, analyzes and stores data.
How New Technologies Have Historically Faced Skepticism
Throughout history, there have been several instances where new technologies were met with fear, skepticism, or resistance. These examples highlight the recurring pattern of apprehension toward innovation:
Printing Press (15th Century): When Johannes Gutenberg invented the printing press in the 15th century, it revolutionized the way information was disseminated. However, it also provoked fear among the clergy and some rulers who were concerned that it would lead to the spread of heretical ideas and reduce their control over the flow of information.
Railways (19th Century): People were afraid that the high speeds would be harmful to human health and the landscape. There were also concerns about the societal and economic impacts of such rapid transportation.
Automobiles (Late 19th - Early 20th Century): When cars first appeared, they were seen as noisy, dangerous, and a threat to traditional ways of life. There was resistance from those invested in horse-drawn transport, and early cars were subject to restrictive regulations like the United Kingdom's "Red Flag Act," which required a person to walk ahead of automobiles waving a red flag as a warning.
Telephones (Late 19th Century): People were initially uncomfortable with the idea of a voice traveling across long distances through wires, and there were fears about how this technology might invade personal privacy.
The Internet (Late 20th Century): One of the earliest and most persistent fears was about the loss of privacy and potential security breaches. People were concerned about how personal information might be used, misused, or stolen online. The idea of sharing personal details on a network accessible by millions was daunting.
Generative AI (Early 21st Century): When discussing the application of AI in qualitative data analysis, we see similar reactions. A widespread fear is about data trustworthiness: Can AI be trusted with sensitive data?
Additionally, it is essential for us, as researchers, to thoroughly comprehend the implications of these new technologies ourselves. This understanding is crucial before we seek informed consent from our participants, ensuring that we can convey the information in an accessible and transparent manner.
ChatGPT and its underlying LLM
I believe the first misunderstanding is related to discussing applications that utilize Large Language Models (LLMs) like GPT-3 or GPT-4, which is the technology behind ChatGPT.
ChatGPT is an AI chatbot designed for engaging in natural language conversations. I assume that most of you reading this article have likely had the chance to interact with it (?)
Integrating GPT-3 or GPT-4 into an application, however, means using the underlying natural language processing (NLP) technology to enable the app to understand and interpret user input. This allows developers to build all kinds of new capabilities into their apps, like language translation, text summarization, and answering user questions. The main difference is that these models are not standalone products, like ChatGPT - they are a tool for developers to use in building their own products.
If a company wants to integrate the GPT-3/4 models in their application, they are making a contract with OpenAI. This is also the case when integrating other models like Jurassic-2 by AI21laps, Claude 2 by Anthropic, LaMda by Google Bard, PalM by Google Generative, or Llama2 by Meta. There will be many more in the future. GPT-4, however, at the moment leads the pack as it outperforms other models in most benchmarks.
Therefore, it comes as no surprise that most applications currently integrate GPT models. Developers can access them via a so-called API (Application Programming Interface).
What is an API?
An API is a set of protocols and tools that allow different software applications to communicate with each other. It's like a translator that allows different software systems to "speak" the same language, so they can exchange data and functionality.
A key part of this process, especially concerning data security, is the use of an API key.
To use an API, a developer first needs to obtain an API key from the provider of the API. This key is a unique identifier that is used to authenticate the developer's requests to the API. It's like a password that grants access to the API's functionalities.
The developer then integrates the API into their application by writing code that makes requests to the API. These requests include the API key for authentication. This is how the application communicates with the API, sending data to and receiving data from it.
When the application makes a request to the API (such as to retrieve data or trigger a specific action), the API key is included in the request. The API provider uses this key to verify that the request is coming from an authorized source.
Data Traveling through Virtual Space
You can also think of it like data traveling through a virtual space. This is however different from throwing a ball from person A to person B where a passerby could intercept and steal the ball. This is how it actually works:
First, there is a Door, and You need a Key to Open this Door
When data is sent from one point to another (like from your application to an API), it's like going out of a secure door. The API key is like a key that unlocks this door. It verifies that the person sending the data is allowed to do so. Without this key, the door remains locked, and the data can't be sent.
Traveling Through a Tunnel (Data Encryption in Transit)
Once the data leaves the door, it travels through a virtual tunnel to its destination. This tunnel is a secure connection (often HTTPS). While the data is traveling, it's encrypted, meaning it's scrambled into a code that's extremely hard to decipher. It's like turning a readable message into a jumbled mess of characters that make no sense to anyone who intercepts it. This ensures that if someone were to somehow get hold of the data while it's traveling, they wouldn't be able to understand it.
Arrival and Another Door (Decryption)
When the data arrives at its destination (like the server hosting the LLM), it's like coming to another secure door. Here, the receiving system has a key (decryption key) that turns the scrambled message back into readable data. This key is different from the API key and is used to decode the encrypted data.
Processing the Data
Once decrypted, the data can be processed as needed. If it's a request for the LLM, the LLM will understand and process this data. Standard practices in the industry involve ensuring the integrity and confidentiality of data throughout its lifecycle, including the processing phase.
Sending Data Back
If a response is needed (like a reply from the LLM), the process happens in reverse. The response is encrypted at the server, travels through the secure tunnel back to the original sender, and is then decrypted by the sender's system to be read and understood. This process ensures that sensitive information remains confidential and secure from unauthorized access.
Data at Rest
When data is not being sent or used, it's stored or 'at rest'. When you read that your data is encrypted at rest, this means it's kept in that scrambled, unreadable state. This way, even if someone gains unauthorized access to the storage, they can't read the data without the correct decryption key.
I hope that the analogy helped to clarify the process of sending data to a Large Language Model (LLM) for processing. Now, let's delve deeper and explore precisely how an LLM utilizes your data.
What Happens to Data when they are "read" by an LLM?
When data is processed by a Large Language Model (LLM) like ChatGPT, the process is quite different from how humans read and understand text. Here's a simplified explanation of what happens:
First, the input text is broken down into smaller units called tokens. These can be words, parts of words, or even punctuation. Each token is then converted into a numerical format, known as an embedding. These embeddings capture semantic and syntactic information about the words, essentially translating them into a language the model can understand.
The numerical data (embeddings) then passes through multiple layers of the neural network. Each layer consists of numerous neurons that perform complex mathematical operations on the data. These layers are where the actual 'learning' of the model occurs. They can recognize patterns, make connections, and interpret the contextual meaning of the tokens.
Unlike humans who read sentence by sentence, LLMs process all tokens simultaneously. This allows them to understand context in a way that's different from linear reading. The model can relate words to each other within a sentence and across sentences, forming an understanding based on the entire input.
After processing the data, the model generates an output. In the case of a conversational AI like ChatGPT, this output is typically in the form of text that's coherent and contextually appropriate to the input it received.
To summarize: LLMs like ChatGPT process data through a complex network of mathematical operations that allow them to understand and generate human-like text. However, this process is fundamentally different from human reading and comprehension, as it's based on patterns in data rather than conscious understanding. So, the LLMs are just processing data and returning information you have been asking about. They are not reading the stories your research participants have been telling you.
My best guess is that the root of the concerns are related to the potential to data breaches.
What if Someone Hacks the Server?
This is a question unrelated to the use of AI for qualitative research. We have been storing our data digitally for over 30 years now. Despite security measures, no system is entirely immune to data breaches. The concern about the risk of unauthorized access or hacking needs to be extended to all data that are stored on servers, including university or company servers. This is simply a subjective issue about trust. You might trust your university or organizational server more than servers from a third-party tool.
Fact is that SAAS (Software As A Service) providers often specialize in data security and are likely to have more advanced security protocols and expertise compared to a typical university server setup. They usually invest heavily in security measures, including advanced encryption, regular security audits, and compliance with international data protection standards. SAAS providers generally have more resources to dedicate to securing their servers and data. This includes physical security, cybersecurity measures, and dedicated staff focusing on security and data protection. Universities, while often having robust security measures, may not match the level of investment and specialization of a dedicated SAAS provider.
Both university servers and SAAS providers are vulnerable to data breaches, but the nature of the risk can differ. SAAS providers, being larger targets due to the volume of data they handle, may face more frequent cyber threats. However, their advanced security measures might better mitigate these risks compared to some university servers.
Nonetheless, I've found that explaining the robust security measures that are in place can sometimes have the opposite effect. For instance, when a developer described the auditing process for SOC 2 compliance, it unexpectedly heightened concerns. One of the researchers reacted by saying, "What? More companies are involved? Does that mean even more people have access to our data?" This response highlights a paradox where increased security measures can inadvertently lead to more apprehension about data privacy.
Let’s take a look at what’s behind these various data security indicators:
Data Security Indicators
GDPR compliance refers to adhering to the requirements of the General Data Protection Regulation (GDPR), a comprehensive data protection law that came into effect in the European Union (EU) on May 25, 2018. The GDPR imposes strict rules on how organizations collect, store, process, and manage the personal data of individuals within the EU.
CCPA (California Consumer Privacy Act) is a law that aims to protect the personal information of California residents. It gives consumers the right to know what personal information businesses are collecting about them, the right to request that businesses delete their personal information, and the right to opt out of the sale of their personal information.
SOC 2 is a set of auditing standards that are designed to assess the security, availability, processing integrity, confidentiality, and privacy of a service organization's systems. It's often used by organizations that handle sensitive data, like financial or healthcare information.
SOC 3 is similar to SOC 2, but it's a less rigorous assessment and doesn't require an audit. It's typically used by organizations that want to demonstrate their security and privacy practices, but don't need to meet the more rigorous requirements of SOC 2.
The idea is that organizations that take steps to be GDPR, CCPA, SOC 2, and SOC 3 compliant demonstrate a commitment to protecting consumer data and adhering to high standards for data privacy and security. However, as observed these efforts are not universally interpreted as indicators of such commitment.
One reason for this could be attributed to a lack of awareness and understanding leading to an underappreciation of the efforts, organizations put into achieving compliance. To address this, I have included a detailed explanation of the four compliance standards mentioned above in the appendix.
Why Not Create Your Own LLM?
One might question why companies opt not to develop their own Large Language Models (LLMs) for greater control. This approach would allow them to know the data's origins, understand the specifics of its collection, and even host the models on their own servers.
Developing an LLM requires massive amounts of data, computing power, and expertise in natural language processing (NLP). Not many companies have these resources. Further, the cost of training and maintaining an LLM can be prohibitively expensive.
Developing and training a state-of-the-art LLM like GPT-3 can cost tens of millions of dollars. The German AI company Aleph Alpha has recently raised $500m (€460m) to bolster its research capacity and accelerate its development and commercialization of generative AI for applications in healthcare, finance, law, government, and security. Mistral AI, a French AI startup, raised $113 million to develop its model. It is an open-source model for anyone to download. However, it is by far not as powerful as the GPT models.
What about Open-Source Models?
At present, proprietary models tend to outperform open-source alternatives in terms of power and quality. Additionally, integrating an open-source model typically entails substantial initial roll-out expenses, and ongoing maintenance costs can be higher.
Consequently, startups or businesses prioritizing a swift market entry often opt for OpenAI models. However, given the rapid advancements in the field, transitioning from proprietary to open-source models is likely to become a more feasible and attractive option in the future.
Which Models are Integrated in Qeludra AI?
In our early access version, we integrate GPT-3, GPT-4, depending on the task. GPT-4, as it stands, is the most advanced among them. Our goal, as development progresses, is to provide users with the flexibility to choose the model that best fits their specific needs, including open-source models like Llama or Mistral.
At this juncture, opting for OpenAI's models seems the logical choice for reasons discussed above. Another reason is that the qualitative research community is in the middle of navigating a major technological shift and needs to figure out how to use AI. I believe that the most effective way to explore AI's potential is by using the highest quality model that is currently available.
The past year has been a whirlwind of advancements, and I anticipate this rapid pace of development to continue. Hence, employing OpenAI and Llama models is just our starting point. As the technology evolves, so too will our tool, growing and adapting in tandem.
Should this approach initially deter some researchers from using our tool, that is an outcome we are prepared to accept. Our commitment is not solely to develop software but also to push the boundaries of qualitative research methods.
As we continue to embrace innovative AI-based technologies, it becomes increasingly important to educate ourselves and the wider research community about the nuances of data privacy and security. This article, I hope, has contributed to that ongoing conversation, empowering researchers to make informed decisions in a rapidly evolving digital landscape.
Disclaimer: Despite my best efforts to accurately describe the technical aspects, if there are any inaccuracies, please don’t hesitate to inform me. I am open to corrections and value accurate information.
Key Aspects of GDPR Compliance
Lawful Basis for Processing: Organizations must have a lawful basis to process personal data. This can include the individual's consent, necessity for a contract, legal obligations, vital interests, public task, or legitimate interests.
Consent: Where consent is the basis for processing, it must be freely given, specific, informed, and unambiguous. Consent requests should be clear and separate from other terms and conditions.
Data Subject Rights: GDPR strengthens and expands the rights of individuals regarding their personal data, including:
- The right to be informed about how their data is used.
- The right to access their data.
- The right to rectification if their data is incorrect.
- The right to erasure (‘right to be forgotten’).
- The right to restrict processing.
- The right to data portability.
- The right to object to processing.
- Rights in relation to automated decision-making and profiling.
Data Protection Officers (DPO): Certain organizations are required to appoint a Data Protection Officer to oversee GDPR compliance and act as a point of contact for data subjects and supervisory authorities.
Data Breach Notification: GDPR imposes strict data breach notification requirements. Organizations must report certain types of data breaches to the relevant supervisory authority within 72 hours of becoming aware of the breach, and in some cases, to the individuals affected.
Privacy by Design: Organizations must integrate data protection principles into their processing activities and business practices, from the design stage of any product, service, or process.
Data Protection Impact Assessments (DPIAs): For processes that pose a high risk to individuals’ data rights and freedoms, GDPR mandates conducting DPIAs to identify and mitigate these risks.
Cross-Border Data Transfers: Transfers of personal data outside the EU are subject to strict conditions. GDPR ensures that data is transferred to non-EU countries that provide an adequate level of data protection or under safeguards like Standard Contractual Clauses.
Record Keeping: Organizations must keep detailed records of their data processing activities.
Accountability: GDPR emphasizes the principle of accountability. Organizations must not only comply with the GDPR but also be able to demonstrate compliance through documentation, policies, training, audits, and more.
Compliance with GDPR is not just about avoiding hefty fines; it's about respecting and protecting the privacy and rights of individuals. It applies to any organization, regardless of location, that processes personal data of individuals in the EU, making its impact global.
Set of requirements for CCPA (California Consumer Privacy Act):
Focus: Privacy protection.
Geographical Scope: Applies to businesses operating in California, USA.
- Gives California residents the right to know what personal data is being collected about them.
- Allows residents to request the deletion of their personal data.
- Permits residents to opt out of the sale of their personal data.
- Requires businesses to provide certain disclosures about their data collection and selling practices.
Goal: To enhance privacy rights and consumer protection for residents of California.
Set of requirements for SOC 2 (Service Organization Control 2):
Focus: Security, availability, processing integrity, confidentiality, and privacy of a system.
Applicability: Relevant to service providers storing customer data in the cloud.
- Based on five "Trust Service Criteria" set by the American Institute of CPAs (AICPA).
- Requires organizations to establish and follow strict information security policies and procedures.
Goal: To ensure that systems are set up so they assure security, availability, processing integrity, confidentiality, and privacy of customer data.
Becoming SOC 2 (Service Organization Control 2) certified involves a multi-step process that requires an organization to establish and maintain stringent data security measures. SOC 2 is not a one-time certification but an ongoing compliance process that demonstrates an organization's commitment to data security and privacy. Here's an overview of the steps involved:
Understand the SOC 2 Requirements:
Familiarize yourself with the SOC 2 framework, which is based on the Trust Service Criteria: Security, Availability, Processing Integrity, Confidentiality, and Privacy. Determine which of these criteria are relevant to your organization's services and operations.
Conduct a Readiness Assessment:
Perform an internal audit to assess your current practices and policies against the SOC 2 requirements. Identify gaps and areas that need improvement.
Develop and Implement Policies and Procedures:
Create or update your organization's policies and procedures to address the identified gaps. This often involves developing a comprehensive set of controls around data security, incident response, access controls, and more.
Choose a Third-Party Auditor:
Select a certified public accounting (CPA) firm that is authorized to conduct SOC 2 audits. It's important to choose a reputable and experienced auditor.
Undergo the Type I Audit:
The SOC 2 Type I report evaluates the design of your controls at a specific point in time. The auditor will assess whether your organization's systems and controls are designed appropriately to meet the relevant Trust Service Criteria.
Undergo the Type II Audit (if applicable):
The SOC 2 Type II report evaluates the operational effectiveness of those controls over a period of time, typically 6 to 12 months. The auditor will review the actual functioning of the controls over this period to ensure they are operating as intended.
Address Audit Findings:
If the auditor identifies any issues, work to resolve them promptly. Implement changes or improvements as recommended by the auditor.
Receive and Use the SOC 2 Report:
Once you pass the audit, you'll receive a SOC 2 report. This report can be shared with clients and stakeholders to demonstrate your commitment to data security.
Maintain Ongoing Compliance:
SOC 2 compliance is not a one-time event. Continue to monitor, update, and improve your controls. Regularly review and adapt to changes in your business or technology environment.
Conduct annual audits to maintain your SOC 2 compliance. This demonstrates your ongoing commitment to maintaining high standards of data security and privacy.
Set of requirements for SOC 3 (Service Organization Control 3):
Focus: Similar to SOC 2 but designed for a broader audience.
Applicability: Also relevant to service providers handling customer data in the cloud.
Key Differences from SOC 2: SOC 3 report is a public-facing document that provides a high-level overview of the organization's controls.
Less detailed than SOC 2 and does not include the full description of the tests and results.
Goal: To provide assurance on controls related to security, availability, processing integrity, confidentiality, and privacy, but in a format that is easier for the general public to understand.