ETQ-AI Data Collection App
Classroom observation is expensive, subjective, and logistically difficult to scale. In Germany, high-quality teaching feedback is constrained not only by cost, but also by strict data protection law and institutional trust requirements.
In the ETQ-AI (Enhancing Teaching Quality with Artificial Intelligence) project, our goal is to automate parts of teaching quality assessment: record classroom discourse, transcribe it, and use LLMs to score dimensions such as classroom management and student cognitive engagement. The long-term vision is simple but ambitious: scalable, objective feedback on teaching quality without a trained observer in every classroom.
What sounds like an ML problem quickly became a systems, privacy, and deployment problem. As the technical lead and product owner for the ETQ-AI data collection app, I owned the full product lifecycle, from system architecture and GDPR compliance to managing the development team. This post details how I bridged the gap between academic ML prototypes and a robust, privacy-first production mobile app used in real German classrooms. Coming from an academic background, these are the lessons I learned in this applied project:
- Deployment metrics differ sharply from benchmark scores.
- Privacy requires system-level handling.
- Model failures can, and often must, be handled architecturally.
- Infrastructure decisions significantly influence timelines.
- Ownership means maintaining coherence.
Let’s dive deeper into my project!
1. The challenge: beyond the model
The core research question was automated teaching quality assessment: can automatic speech recognition (ASR) plus LLMs replace human raters?
[Pipeline diagram: audio signal → speech recognition → speaker diarization (who spoke when) → language model (LLM) → quality scores]
However, to answer that, we first needed a reliable data collection pipeline. We needed recordings from real classrooms with parental consent, strict GDPR compliance, and a format researchers could actually use. We also needed a user-friendly tool that teachers would actually want to use.
I acted as the bridge between research requirements and engineering reality. My role involved:
- Product management: Defining the scope of data collection, roadmap, and user requirements.
- Technical architecture: Designing the end-to-end data flow and making trade-offs between privacy, latency, and model accuracy.
- Team leadership: Managing two student developers (Usman Amjad and Nitin Jain) who implemented the Flutter frontend, and a PhD student (Puja Maharjan) who built the ML pipeline.
- Stakeholder management: Coordinating between project members and teachers, and translating legal requirements into technical specifications.
2. Starting simple: the recording interface
The app’s core function is straightforward: record audio, transcribe it, and let the teacher review and upload the result.
Login supports email and Google authentication. We deliberately don’t collect any personal data at this stage. The main screen is minimal: a record button and a list of past recordings. This simplicity was intentional — our users are teachers in the middle of a workday, not power users.
3. GDPR shaped the architecture
Working with classroom audio in Germany means working under GDPR, and this constraint influenced almost every design decision.
We, as researchers, need data on a server to run experiments, but teachers (and students’ parents) need to trust that their data is handled properly. We landed on a design where personal data stays on the teacher’s device by default, and upload to our research servers is an explicit, separate action. The teacher can review the transcription before uploading, and can delete data from the device, from the server, or from both — independently. We also added a full account deletion option: one tap removes all associated data everywhere. This sounds simple, but implementing proper cascading deletion across device storage, cloud storage, and authentication records required careful coordination.
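To make that coordination concrete, here is a minimal sketch of a cascading account deletion. It is written in Python purely for illustration (the app itself does this in Dart), and the cloud and auth deletion callables are placeholders rather than our actual backend API:

```python
import os
from typing import Callable, Iterable


def delete_account(
    user_id: str,
    local_paths: Iterable[str],
    delete_cloud_objects: Callable[[str], None],  # placeholder: wipe server-side data
    delete_auth_record: Callable[[str], None],    # placeholder: remove the account itself
) -> None:
    """Remove every trace of a user: server data, auth record, local files.

    One reasonable ordering: server data first, then the account, then local
    copies, so a partial failure never strands recordings on our servers that
    the user can no longer reach.
    """
    delete_cloud_objects(user_id)   # uploaded recordings and transcripts
    delete_auth_record(user_id)     # entry in the authentication backend
    for path in local_paths:        # finally, copies on the device
        if os.path.exists(path):
            os.remove(path)
```

The individual calls are trivial; the work is in making sure the three stores never drift out of sync when one of them fails halfway.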
Infrastructure strategy: We store uploaded data on Amazon S3 servers in Frankfurt (eu-central-1) to ensure that the data stays within the EU. However, for transcription, we use high-performance computing (HPC) servers in Göttingen. Separating storage (Frankfurt) from compute (Göttingen) was a strategic decision I made after mapping the data governance capabilities of each site. GWDG (the Göttingen compute center) offered the necessary GPU power but was only authorized for transient processing, while AWS provided the compliant long-term storage we needed. Getting these two sites to work together was a key non-technical challenge I had to solve.
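On the storage side, the GDPR-relevant detail boils down to pinning every upload to the Frankfurt region. A minimal sketch with boto3 (bucket and key names are invented; the app uploads from Dart, but the region constraint is the part that matters):

```python
import boto3

# All uploaded research data lives in eu-central-1 (Frankfurt),
# so it never leaves the EU.
s3 = boto3.client("s3", region_name="eu-central-1")

s3.upload_file(
    Filename="lesson_2024-05-13.m4a",        # recording exported from the device
    Bucket="etq-ai-recordings",              # hypothetical bucket name
    Key="teacher-123/lesson_2024-05-13.m4a",
)
```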
4. Real-world deployment challenges
This project taught me quite a bit about the differences between ideal academic research and real-world deployment. Here are some examples:
4.1. On-device transformers didn’t survive contact with reality
Our initial plan was quite fun: running Whisper on-device for transcription so that classroom audio never needs to leave the teacher’s phone at all. Maximum privacy, minimum infrastructure.
To make this work, we built on whisper.cpp, a C/C++ port of Whisper optimized for local, CPU-based inference on resource-constrained devices. By stripping away Python/PyTorch dependencies and relying on low-level GGML tensor operations, it can run substantially smaller model variants directly on a smartphone without GPU acceleration. To integrate this into our Flutter app, we used flutter_rust_bridge, which auto-generates the FFI (Foreign Function Interface) bindings between Dart and Rust. On the Rust side, whisper-rs wraps the raw C API of whisper.cpp, giving us a safe interface. The resulting call chain became Flutter UI → Dart → Rust → whisper.cpp (C/C++).
We tested several Whisper variants. On a typical smartphone, only tiny, base, and small were feasible in terms of memory and latency. But because their transcription quality isn’t great, we turned to server-side transcription via API calls to HPC servers, where we could run large-v3 and large-v3-turbo. The accuracy difference was substantial. This meant the audio does travel to a server for transcription, which we handle by requiring explicit user consent and ensuring the audio is processed transiently (not stored on the transcription server).
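For a sense of what the server-side step looks like, here is a sketch using the openai-whisper package (assuming a release that ships large-v3). Our HPC deployment wraps an equivalent call behind an API and, as noted above, discards the audio once the transcript is returned:

```python
import whisper

# large-v3 needs a GPU node on the HPC cluster; it is far too heavy for a phone.
model = whisper.load_model("large-v3")

result = model.transcribe("lesson_segment.wav", language="de")
print(result["text"])
```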
Lesson learned: on-device ML is great for demos but is often impractical for production because of the limitations of mobile hardware.
4.2. Academic vs real-world benchmarking
We benchmarked a range of ASR models on both standard datasets and our own classroom recordings. The gap was stark. Even the best-performing models — Voxtral-Mini-3B-2507 and Whisper-large-v3-turbo — achieved word error rates (WER) around 0.64–0.66 on student speech and 0.28 on teacher speech. On standard LibriSpeech benchmarks, those same models score 0.02–0.07. That is roughly a 10× degradation in accuracy when moving from clean, studio-recorded speech to a real classroom.
Student speech was consistently harder than teacher speech, largely because of microphone placement: a phone sitting on the teacher’s desk captures the teacher clearly and students at a distance, often through background noise. Beyond microphone distance, classrooms introduce overlapping speech, scraping chairs, background noise, and code-switching between German and technical vocabulary, conditions far removed from the clean audio that most ASR training sets are built on. The training data for most of these models isn’t public, but the distribution shift is evident in the numbers. One notable case is NVIDIA’s Canary-1b-v2, which achieves an impressive 0.021 WER on LibriSpeech English, yet collapses to a WER above 1.0 on student classroom speech. High benchmark scores are simply not predictive of classroom performance.
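As a reminder, word error rate is the word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the reference. The toy example below uses the jiwer package, one common Python implementation; it is not necessarily what our evaluation scripts use, but the metric is the same:

```python
import jiwer

reference  = "heute besprechen wir quadratische gleichungen"  # what was actually said
hypothesis = "heute besprechen wir die gleichungen"           # what the ASR produced

# One substitution out of five reference words -> WER = 0.2
print(jiwer.wer(reference, hypothesis))
```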
4.3. Output repetition issue
We also discovered a surprising failure mode: models sometimes repeat an incorrect transcription dozens of times! We first noticed this when transcribing a random audio clip from YouTube.
It’s impossible to pin this down to a single factor, but our manual inspections show that such hallucinations occur when the words are not clear, and grow significantly with audio duration — especially with noisy or multilingual input. The fix was to chunk the audio into ~30-second segments before sending each to the transcription API, then stitch the results together. We discovered that WER rises steeply with segment length across all models, roughly doubling as segments grow from 30 to 300 seconds. The 30-second sweet spot is short enough to avoid hallucinations while long enough to preserve context across sentence boundaries.
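The chunking itself is conceptually simple. Here is a minimal Python sketch of the idea, assuming soundfile for audio I/O and taking the transcription call as a parameter (in the app this logic runs in Dart against our HPC endpoint):

```python
import soundfile as sf


def transcribe_in_chunks(path, transcribe_fn, chunk_seconds=30):
    """Split a recording into ~30 s segments, transcribe each, stitch the text.

    `transcribe_fn(samples, sample_rate) -> str` is whatever backend is
    available, e.g. a local Whisper model or an HTTP call to the HPC service.
    """
    audio, sr = sf.read(path, dtype="float32")
    if audio.ndim > 1:                   # downmix stereo to mono
        audio = audio.mean(axis=1)
    chunk_len = chunk_seconds * sr
    pieces = []
    for start in range(0, len(audio), chunk_len):
        segment = audio[start:start + chunk_len]
        pieces.append(transcribe_fn(segment, sr).strip())
    return " ".join(pieces)
```

A more careful version would cut at pauses instead of fixed offsets so that words are not split across segment boundaries.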
But chunking introduced its own engineering challenge: the segmentation and sequential API calls need to happen in the background while the teacher continues using the app. In Flutter, managing background and foreground processes properly is non-trivial, especially on iOS where background execution is tightly restricted. We spent more time on getting background transcription reliable than on any ML-related problem.
4.4. Audio recording: deceptively hard on mobile
Another “should be simple but isn’t” problem is continuous audio recording on a mobile device. We discovered that recording stops abruptly when the user switches to another app, when the screen times out and goes black, or when the phone enters power-saving mode. For a teacher recording a 45-minute lesson, this is not acceptable.
The solution requires properly implementing background audio services — registering the app as a background audio provider, handling lifecycle events, managing the audio session correctly. On Android this is manageable; on iOS (where Flutter’s background execution support is more limited), it required significant workarounds. This is the kind of platform-specific systems engineering that you never think about when you’re designing a pipeline in a Jupyter notebook.
5. What I would do differently
- Start with deployment constraints: We initially assumed transcription would be “good enough.” It wasn’t. Downstream components must be designed around noisy, imperfect inputs.
- Budget 3x more time for platform-specific issues: Mobile OS constraints consumed more engineering time than all ML components combined.
- Design for the user’s worst moment: If recording fails once during a 45-minute lesson, the tool loses trust. Reliability outweighs sophistication.
6. Transferable lessons for applied ML
This project changed how I think about applied machine learning:
- Benchmarks are not deployment metrics: Leaderboard performance does not predict out-of-domain robustness.
- Privacy is a systems problem: GDPR influenced storage location, compute separation, consent flows, and deletion logic more than model choice did.
- Model failures can be handled architecturally: Repetition hallucinations were mitigated through segmentation and orchestration — not model fine-tuning.
- Infrastructure decisions dominate timelines: Background execution, mobile OS constraints, and data governance consumed more effort than training experiments.
- Ownership means maintaining coherence: Keeping architecture, user needs, compliance, and research goals aligned required explicit technical leadership.
7. The broader takeaway
Building the ETQ-AI data collection app taught me that the interesting engineering problems in applied ML are rarely about the models. They’re about the system around the models: how data flows, where it’s stored, who controls it, what happens when the model fails, and how to make it all invisible to the user. This is the kind of work that doesn’t produce papers but determines whether a research prototype becomes a usable tool. It’s also the lens I now bring to research problems more broadly: not just “what’s the best model?” but “what does it take to make this actually work?”
ETQ-AI is a joint project between the Hector Research Institute of Education Sciences and Psychology and the Cluster of Excellence “Machine Learning for Science” at the University of Tübingen. The app was implemented by Usman Amjad and Nitin Jain.