I Finally Tried Apple VisionKit: Building a Simple Document Scanner in SwiftUI

I have been writing quite a bit about iOS development recently.

A while back, I wrote about SwiftUI state management in my article “SwiftUI State in iOS: A Practical Guide.” After that, I wrote another article on MVVM in SwiftUI, where I talked about keeping an iOS app organized as it grows.

Both of those articles came from a very practical place. I was not trying to explain iOS development from a perfect textbook point of view. I was trying to explain the things that actually help when you are building real apps.

Recently, I heard about Apple’s VisionKit again and thought, “Why have I not properly tried this yet?”

I had used Apple’s Vision framework before in smaller ways, but VisionKit felt different. It is one of those frameworks that sounds advanced at first, but once you understand what it is trying to do, it becomes very practical.

You can use VisionKit to scan documents, detect text, scan QR codes, read barcodes, and work with camera-based data. In simple words, VisionKit helps your app understand useful information from the real world using the camera.

That immediately gave me an idea.

What if I built a simple SwiftUI app that lets the user scan a document, preview the scanned pages, and maybe later extract text from them?

That is what this article is about.

We are going to build a simple VisionKit-powered document scanner using SwiftUI.

What Is VisionKit?

VisionKit is an Apple framework that helps developers add camera-based scanning features to iOS apps.

Instead of building a full camera pipeline yourself, handling edge detection, document boundaries, perspective correction, and scanning UI, VisionKit gives you ready-made tools for common scanning use cases.

You can use VisionKit for things like:

Scanning paper documents
Capturing receipts
Scanning handwritten notes
Detecting live text from the camera
Scanning QR codes and barcodes
Building lightweight scanner-style apps

This is useful because many apps eventually need to deal with real-world information.

Maybe you are building an expense app and want to scan receipts.

Maybe you are building a notes app and want to scan handwritten notes.

Maybe you are building an inventory app and want to scan barcodes.

Or maybe you are like me and just curious enough to try the framework and see what it can do.

VisionKit vs Vision

Before writing code, it helps to understand the difference between VisionKit and Vision.

The Vision framework is more focused on image analysis. It gives you APIs for things like text recognition, face detection, barcode detection, object tracking, and image classification.

VisionKit is more user-facing. It gives you ready-made UI components for scanning documents and data using the camera.

A simple way to think about it:

Vision is the engine.

VisionKit is the scanner experience.

For example, if you want to recognize text from an image manually, you may use Vision with VNRecognizeTextRequest.

But if you want to present a document scanner UI to the user, capture pages, and get scanned images back, VisionKit makes that much easier.

In this article, we will start with VisionKit document scanning.

What We Are Building

We will build a small SwiftUI app with the following features:

A button to open the document scanner
A VisionKit wrapper for VNDocumentCameraViewController
A SwiftUI view to display scanned pages
A simple model to store scanned images
Optional text recognition using Vision

This is not a full production scanner app yet. But it gives us the foundation.

Once this works, you can extend it into:

A receipt scanner
A handwritten notes scanner
A PDF scanner
A business card scanner
A document archive app

That is what I like about frameworks like this. You start with one simple demo, and suddenly you can see many product ideas around it.

Step 1: Create a New SwiftUI Project

Create a new iOS app in Xcode.

Use:

Interface: SwiftUI
Language: Swift

Then import VisionKit where needed:

import VisionKit

One important thing to remember is that VNDocumentCameraViewController is a UIKit controller. Since we are working in SwiftUI, we need to wrap it using UIViewControllerRepresentable.

This is a common pattern in SwiftUI when Apple gives us a UIKit API but we want to use it inside a SwiftUI app.

Step 2: Add Camera Permission

Because we are using the camera, we need to add a camera usage description to Info.plist.

Add this key:

<key>NSCameraUsageDescription</key>
<string>This app needs camera access to scan documents.</string>

Without this, your app will crash when trying to access the camera.

This is one of those small things that is easy to forget, especially when you are excited to test the scanner.
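Adding the key only lets the system show the permission prompt. The user can still say no. Here is a minimal sketch of checking camera authorization up front with AVFoundation. The helper name ensureCameraAccess is my own, not something VisionKit provides, and as far as I can tell the document camera will normally trigger the prompt by itself, so treat this as optional.

```swift
import AVFoundation

/// Returns true when the app may use the camera, asking the user if needed.
/// `ensureCameraAccess` is a hypothetical helper name used for illustration.
func ensureCameraAccess() async -> Bool {
    switch AVCaptureDevice.authorizationStatus(for: .video) {
    case .authorized:
        return true
    case .notDetermined:
        // Shows the system prompt, which displays NSCameraUsageDescription.
        return await AVCaptureDevice.requestAccess(for: .video)
    case .denied, .restricted:
        // The user has to change this in Settings; the app cannot ask again.
        return false
    @unknown default:
        return false
    }
}
```

A check like this gives you a place to show a friendly message instead of an empty scanner when access has been denied.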

Step 3: Create a Scanned Document Model

Let us create a simple model to hold the scanned pages.

import SwiftUI

struct ScannedDocument: Identifiable {
    let id = UUID()
    var title: String
    var pages: [UIImage]
    var createdAt: Date
}

For now, we are keeping things simple.

Each scanned document has:

A title
A list of scanned page images
A created date

Later, you could add more fields like:

Folder name
PDF URL
OCR text
Tags
Last updated date

But for this article, images are enough.
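If you do get to those extra fields, one option is to keep them in a separate value that can be saved as JSON, since UIImage itself is not Codable. This is only a sketch: ScannedDocumentMetadata and all of its property names are hypothetical and simply mirror the list above.

```swift
import Foundation

/// Hypothetical metadata that could sit alongside the scanned images later.
/// None of these names come from VisionKit.
struct ScannedDocumentMetadata: Codable, Identifiable {
    var id = UUID()
    var folderName: String?
    var pdfURL: URL?
    var ocrText: String?
    var tags: [String] = []
    var updatedAt: Date?
}

// Because the struct is Codable, it can round-trip through JSON.
let metadata = ScannedDocumentMetadata(folderName: "Receipts", tags: ["receipts", "travel"])
let encoded = try JSONEncoder().encode(metadata)
let decoded = try JSONDecoder().decode(ScannedDocumentMetadata.self, from: encoded)
print(decoded.tags)
```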

Step 4: Create the VisionKit Scanner Wrapper

Now we need to wrap VNDocumentCameraViewController so it can be used in SwiftUI.

Create a new file called DocumentScannerView.swift.

import SwiftUI
import VisionKit

struct DocumentScannerView: UIViewControllerRepresentable {
    var onScanComplete: ([UIImage]) -> Void
    var onCancel: () -> Void
    func makeUIViewController(context: Context) -> VNDocumentCameraViewController {
        let scannerViewController = VNDocumentCameraViewController()
        scannerViewController.delegate = context.coordinator
        return scannerViewController
    }
    func updateUIViewController(
        _ uiViewController: VNDocumentCameraViewController,
        context: Context
    ) {
        // No update needed for this example
    }
    func makeCoordinator() -> Coordinator {
        Coordinator(
            onScanComplete: onScanComplete,
            onCancel: onCancel
        )
    }
    final class Coordinator: NSObject, VNDocumentCameraViewControllerDelegate {
        private let onScanComplete: ([UIImage]) -> Void
        private let onCancel: () -> Void
        init(
            onScanComplete: @escaping ([UIImage]) -> Void,
            onCancel: @escaping () -> Void
        ) {
            self.onScanComplete = onScanComplete
            self.onCancel = onCancel
        }
        func documentCameraViewController(
            _ controller: VNDocumentCameraViewController,
            didFinishWith scan: VNDocumentCameraScan
        ) {
            var scannedPages: [UIImage] = []
            for pageIndex in 0..<scan.pageCount {
                let image = scan.imageOfPage(at: pageIndex)
                scannedPages.append(image)
            }
            controller.dismiss(animated: true) {
                self.onScanComplete(scannedPages)
            }
        }
        func documentCameraViewControllerDidCancel(
            _ controller: VNDocumentCameraViewController
        ) {
            controller.dismiss(animated: true) {
                self.onCancel()
            }
        }
        func documentCameraViewController(
            _ controller: VNDocumentCameraViewController,
            didFailWithError error: Error
        ) {
            print("Document scanner failed: \(error.localizedDescription)")
            controller.dismiss(animated: true) {
                self.onCancel()
            }
        }
    }
}

There is quite a bit happening here, so let us break it down.

DocumentScannerView conforms to UIViewControllerRepresentable. This is what allows us to use a UIKit view controller inside SwiftUI.

Inside makeUIViewController, we create a VNDocumentCameraViewController.

Then we assign its delegate to the coordinator.

The coordinator receives scanner events.

When scanning finishes, this method is called:

func documentCameraViewController(
    _ controller: VNDocumentCameraViewController,
    didFinishWith scan: VNDocumentCameraScan
)

The scan object contains all the scanned pages. We loop through the page indices and ask the scan for each page, which it returns as a UIImage.

for pageIndex in 0..<scan.pageCount {
    let image = scan.imageOfPage(at: pageIndex)
    scannedPages.append(image)
}

Then we send those images back to SwiftUI using the onScanComplete closure.

This is one of my favorite parts of mixing UIKit and SwiftUI. UIKit still handles the scanner experience, but SwiftUI controls the app state.

Step 5: Build the Main SwiftUI Screen

Now let us create the main view.

import SwiftUI

struct ContentView: View {
    @State private var documents: [ScannedDocument] = []
    @State private var isShowingScanner = false
    var body: some View {
        NavigationStack {
            VStack {
                if documents.isEmpty {
                    ContentUnavailableView(
                        "No Scans Yet",
                        systemImage: "doc.viewfinder",
                        description: Text("Tap Scan to create your first scan.")
                    )
                } else {
                    List {
                        ForEach(documents) { document in
                            NavigationLink {
                                ScanDetailView(document: document)
                            } label: {
                                VStack(alignment: .leading, spacing: 6) {
                                    Text(document.title)
                                        .font(.headline)
                                    Text("\(document.pages.count) page(s)")
                                        .font(.subheadline)
                                        .foregroundStyle(.secondary)
                                    Text(document.createdAt.formatted())
                                        .font(.caption)
                                        .foregroundStyle(.secondary)
                                }
                                .padding(.vertical, 4)
                            }
                        }
                    }
                }
            }
            .navigationTitle("VisionKit Scanner")
            .toolbar {
                ToolbarItem(placement: .topBarTrailing) {
                    Button {
                        isShowingScanner = true
                    } label: {
                        Label("Scan", systemImage: "camera.viewfinder")
                    }
                }
            }
            .sheet(isPresented: $isShowingScanner) {
                DocumentScannerView(
                    onScanComplete: { images in
                        let newDocument = ScannedDocument(
                            title: "Scan \(documents.count + 1)",
                            pages: images,
                            createdAt: Date()
                        )
                        documents.append(newDocument)
                        isShowingScanner = false
                    },
                    onCancel: {
                        isShowingScanner = false
                    }
                )
            }
        }
    }
}

This gives us a simple scanner home screen.

If there are no scanned documents, we show an empty state using ContentUnavailableView, which requires iOS 17 or later.

If documents exist, we show them in a list.

When the user taps the scan button, we present the VisionKit scanner as a sheet.

.sheet(isPresented: $isShowingScanner) {
    DocumentScannerView(
        onScanComplete: { images in
            let newDocument = ScannedDocument(
                title: "Scan \(documents.count + 1)",
                pages: images,
                createdAt: Date()
            )
            documents.append(newDocument)
            isShowingScanner = false
        },
        onCancel: {
            isShowingScanner = false
        }
    )
}

This connects nicely with the SwiftUI state concepts I wrote about before.

The scanner produces images.

SwiftUI stores those images in @State.

The UI updates automatically.

This is exactly why understanding state is so important in SwiftUI. Frameworks like VisionKit give you the feature, but SwiftUI state decides how your app reacts to it.

Step 6: Create a Detail View for Scanned Pages

Now let us create a view to preview the scanned pages.

import SwiftUI

struct ScanDetailView: View {
    let document: ScannedDocument
    var body: some View {
        ScrollView {
            LazyVStack(spacing: 20) {
                ForEach(Array(document.pages.enumerated()), id: \.offset) { index, image in
                    VStack(alignment: .leading, spacing: 8) {
                        Text("Page \(index + 1)")
                            .font(.headline)
                            .padding(.horizontal)
                        Image(uiImage: image)
                            .resizable()
                            .scaledToFit()
                            .clipShape(RoundedRectangle(cornerRadius: 12))
                            .shadow(radius: 4)
                            .padding(.horizontal)
                    }
                }
            }
            .padding(.vertical)
        }
        .navigationTitle(document.title)
        .navigationBarTitleDisplayMode(.inline)
    }
}

Now when the user scans a document, they can tap it and see all scanned pages.

This is already a useful little app.

It does not save anything permanently yet. It does not generate a PDF yet. It does not run OCR yet.

But the core scanning flow works.

That is usually how I like to build features.

First, make the core path work.

Then improve it.

Step 7: Add OCR Using Vision

VisionKit helps us scan the document, but what if we want to extract text from the scanned page?

For that, we can use Apple’s Vision framework.

Create a new file called TextRecognizer.swift.

import UIKit
import Vision

final class TextRecognizer {
    func recognizeText(from image: UIImage) async throws -> String {
        guard let cgImage = image.cgImage else {
            return ""
        }
        return try await withCheckedThrowingContinuation { continuation in
            // Guard against resuming twice in case a failure is reported both
            // to the completion handler and as a thrown error from perform(_:).
            var didResume = false
            let request = VNRecognizeTextRequest { request, error in
                guard !didResume else { return }
                didResume = true
                if let error {
                    continuation.resume(throwing: error)
                    return
                }
                guard let observations = request.results as? [VNRecognizedTextObservation] else {
                    continuation.resume(returning: "")
                    return
                }
                // Keep the best candidate string for each detected piece of text.
                let recognizedStrings = observations.compactMap { observation in
                    observation.topCandidates(1).first?.string
                }
                continuation.resume(returning: recognizedStrings.joined(separator: "\n"))
            }
            request.recognitionLevel = .accurate
            request.usesLanguageCorrection = true
            let handler = VNImageRequestHandler(cgImage: cgImage)
            do {
                try handler.perform([request])
            } catch {
                guard !didResume else { return }
                didResume = true
                continuation.resume(throwing: error)
            }
        }
    }
}

This class takes a UIImage and returns recognized text.

The important part is this request:

let request = VNRecognizeTextRequest { request, error in
    // Handle OCR result
}

Then we configure it:

request.recognitionLevel = .accurate
request.usesLanguageCorrection = true

For many document scanning use cases, .accurate is a good starting point. If you need faster recognition, the .fast level trades some accuracy for speed.
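To make that trade-off concrete, here is a small sketch of a helper that builds the request for either case. The function name makeTextRequest and its preferSpeed parameter are my own, and the English language setting is an assumption about the documents being scanned.

```swift
import Vision

/// Builds a text request tuned for either speed or accuracy.
/// `makeTextRequest` and `preferSpeed` are hypothetical names.
func makeTextRequest(preferSpeed: Bool) -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = preferSpeed ? .fast : .accurate
    // Language correction adds work, so skip it when speed matters more.
    request.usesLanguageCorrection = !preferSpeed
    // Assumption: the scanned documents are in English.
    request.recognitionLanguages = ["en-US"]
    return request
}
```

After performing a request built this way, the recognized text is available through its results property, so the same candidate-picking logic from TextRecognizer still applies.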

Step 8: Show Recognized Text in the Detail View

Now let us update our detail view so it can run OCR on the first page.

import SwiftUI

struct ScanDetailView: View {
    let document: ScannedDocument
    @State private var recognizedText = ""
    @State private var isRecognizingText = false
    @State private var errorMessage: String?
    private let textRecognizer = TextRecognizer()
    var body: some View {
        ScrollView {
            LazyVStack(spacing: 20) {
                ForEach(Array(document.pages.enumerated()), id: \.offset) { index, image in
                    VStack(alignment: .leading, spacing: 8) {
                        Text("Page \(index + 1)")
                            .font(.headline)
                            .padding(.horizontal)
                        Image(uiImage: image)
                            .resizable()
                            .scaledToFit()
                            .clipShape(RoundedRectangle(cornerRadius: 12))
                            .shadow(radius: 4)
                            .padding(.horizontal)
                    }
                }
                Divider()
                    .padding(.horizontal)
                VStack(alignment: .leading, spacing: 12) {
                    Button {
                        Task {
                            await recognizeTextFromFirstPage()
                        }
                    } label: {
                        if isRecognizingText {
                            ProgressView()
                        } else {
                            Label("Recognize Text", systemImage: "text.viewfinder")
                        }
                    }
                    .buttonStyle(.borderedProminent)
                    .disabled(isRecognizingText || document.pages.isEmpty)
                    if let errorMessage {
                        Text(errorMessage)
                            .foregroundStyle(.red)
                            .font(.footnote)
                    }
                    if !recognizedText.isEmpty {
                        Text("Recognized Text")
                            .font(.headline)
                        Text(recognizedText)
                            .font(.body)
                            .textSelection(.enabled)
                            .padding()
                            .frame(maxWidth: .infinity, alignment: .leading)
                            .background(Color(.secondarySystemBackground))
                            .clipShape(RoundedRectangle(cornerRadius: 12))
                    }
                }
                .padding(.horizontal)
            }
            .padding(.vertical)
        }
        .navigationTitle(document.title)
        .navigationBarTitleDisplayMode(.inline)
    }
    private func recognizeTextFromFirstPage() async {
        guard let firstPage = document.pages.first else {
            return
        }
        isRecognizingText = true
        errorMessage = nil
        do {
            recognizedText = try await textRecognizer.recognizeText(from: firstPage)
        } catch {
            errorMessage = "Failed to recognize text: \(error.localizedDescription)"
        }
        isRecognizingText = false
    }
}

Now the app can scan a document and recognize text from the first page.

Again, this is not a production OCR pipeline yet, but it is enough to understand the flow.

The user scans a document.

VisionKit gives us images.

Vision extracts text from those images.

SwiftUI displays the result.

That is a pretty powerful combination.
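The same flow extends to every page, not just the first. Here is a rough sketch of a helper that runs a recognizer over all pages in order. The name recognizeAllPages is hypothetical, and the recognize closure stands in for something like TextRecognizer's recognizeText(from:). I made it generic so it does not depend on UIKit.

```swift
/// Runs a recognizer over every page in order and collects one string per page.
/// `recognizeAllPages` is a hypothetical helper, not part of Vision or VisionKit.
func recognizeAllPages<Page>(
    _ pages: [Page],
    using recognize: (Page) async throws -> String
) async -> [String] {
    var results: [String] = []
    for page in pages {
        // A page that fails contributes an empty string instead of aborting the run.
        let text = (try? await recognize(page)) ?? ""
        results.append(text)
    }
    return results
}
```

In ScanDetailView this might be called as await recognizeAllPages(document.pages) { try await textRecognizer.recognizeText(from: $0) }. Swallowing per-page errors is a design choice for a demo. A real app would probably want to surface which pages failed.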

Step 9: Move Logic Toward MVVM

In my MVVM article, I talked about why moving logic away from views helps as the app grows.

This scanner example is a perfect case where MVVM starts to make sense.

Right now, ContentView owns the scanned documents directly:

@State private var documents: [ScannedDocument] = []

That is fine for a small demo.

But if this app grows, we may want a view model.

Create a file called ScannerViewModel.swift.

import SwiftUI

@MainActor
final class ScannerViewModel: ObservableObject {
    @Published private(set) var documents: [ScannedDocument] = []
    @Published var isShowingScanner = false
    func addScannedImages(_ images: [UIImage]) {
        let document = ScannedDocument(
            title: "Scan \(documents.count + 1)",
            pages: images,
            createdAt: Date()
        )
        documents.append(document)
    }
    func showScanner() {
        isShowingScanner = true
    }
    func hideScanner() {
        isShowingScanner = false
    }
}

Then update ContentView:

import SwiftUI

struct ContentView: View {
    @StateObject private var viewModel = ScannerViewModel()
    var body: some View {
        NavigationStack {
            VStack {
                if viewModel.documents.isEmpty {
                    ContentUnavailableView(
                        "No Scans Yet",
                        systemImage: "doc.viewfinder",
                        description: Text("Tap Scan to create your first scan.")
                    )
                } else {
                    List {
                        ForEach(viewModel.documents) { document in
                            NavigationLink {
                                ScanDetailView(document: document)
                            } label: {
                                VStack(alignment: .leading, spacing: 6) {
                                    Text(document.title)
                                        .font(.headline)
                                    Text("\(document.pages.count) page(s)")
                                        .font(.subheadline)
                                        .foregroundStyle(.secondary)
                                    Text(document.createdAt.formatted())
                                        .font(.caption)
                                        .foregroundStyle(.secondary)
                                }
                                .padding(.vertical, 4)
                            }
                        }
                    }
                }
            }
            .navigationTitle("VisionKit Scanner")
            .toolbar {
                ToolbarItem(placement: .topBarTrailing) {
                    Button {
                        viewModel.showScanner()
                    } label: {
                        Label("Scan", systemImage: "camera.viewfinder")
                    }
                }
            }
            .sheet(isPresented: $viewModel.isShowingScanner) {
                DocumentScannerView(
                    onScanComplete: { images in
                        viewModel.addScannedImages(images)
                        viewModel.hideScanner()
                    },
                    onCancel: {
                        viewModel.hideScanner()
                    }
                )
            }
        }
    }
}

This version feels cleaner.

The view is mostly responsible for UI.

The view model handles scanner-related state.

This is where the concepts from SwiftUI state and MVVM start connecting with real frameworks.

It is one thing to learn @State, @StateObject, and ObservableObject in isolation. It is another thing to use them while integrating a real Apple framework like VisionKit.

Step 10: What About Live Text and QR Scanning?

So far, we used VisionKit for document scanning.

But VisionKit also has DataScannerViewController, available on iOS 16 and later, which can scan live data from the camera.

This is useful when you want the user to point the camera at text, QR codes, barcodes, phone numbers, links, or other supported content.

A simple example looks like this:

import SwiftUI
import VisionKit

struct LiveDataScannerView: UIViewControllerRepresentable {
    var onRecognizedText: (String) -> Void
    func makeUIViewController(context: Context) -> DataScannerViewController {
        let scanner = DataScannerViewController(
            recognizedDataTypes: [
                .text(languages: ["en-US"]),
                .barcode()
            ],
            qualityLevel: .balanced,
            recognizesMultipleItems: true,
            isHighFrameRateTrackingEnabled: true,
            isHighlightingEnabled: true
        )
        scanner.delegate = context.coordinator
        do {
            try scanner.startScanning()
        } catch {
            print("Failed to start scanning: \(error.localizedDescription)")
        }
        return scanner
    }
    func updateUIViewController(
        _ uiViewController: DataScannerViewController,
        context: Context
    ) {
        // No update needed for this example
    }
    func makeCoordinator() -> Coordinator {
        Coordinator(onRecognizedText: onRecognizedText)
    }
    final class Coordinator: NSObject, DataScannerViewControllerDelegate {
        private let onRecognizedText: (String) -> Void
        init(onRecognizedText: @escaping (String) -> Void) {
            self.onRecognizedText = onRecognizedText
        }
        func dataScanner(
            _ dataScanner: DataScannerViewController,
            didTapOn item: RecognizedItem
        ) {
            switch item {
            case .text(let text):
                onRecognizedText(text.transcript)
            case .barcode(let barcode):
                if let payload = barcode.payloadStringValue {
                    onRecognizedText(payload)
                }
            @unknown default:
                break
            }
        }
    }
}

This is a different experience from document scanning.

With document scanning, the user captures pages.

With live data scanning, the camera is continuously looking for useful information.

You could use this for:

QR code scanning
Barcode scanning
Extracting a phone number from a business card
Detecting URLs
Reading labels
Scanning short pieces of text

If you are building real-world iOS tools, this opens up a lot of possibilities.
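One thing worth doing before presenting the live scanner is checking whether it can run at all. DataScannerViewController exposes two static properties for this. The sketch below wraps them in a helper whose name, canStartLiveScanning, is my own.

```swift
import VisionKit

/// True when live data scanning can start right now.
/// `canStartLiveScanning` is a hypothetical helper name.
@available(iOS 16.0, *)
@MainActor
func canStartLiveScanning() -> Bool {
    // isSupported: the device is capable of data scanning.
    // isAvailable: camera access is granted and not restricted.
    DataScannerViewController.isSupported && DataScannerViewController.isAvailable
}
```

If this returns false, it is usually better to hide or disable the entry point than to present a scanner that never starts.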

A Few Practical Notes

There are a few things I would keep in mind before using VisionKit in a production app.

First, always test on a real device. The simulator has no camera, so the scanning flows in this article cannot really be exercised there.

Second, design for failure. The user may cancel scanning. Camera permission may be denied. Text recognition may fail. The scanned image may be blurry.

Third, keep the first version simple. It is very tempting to add PDF export, OCR, folders, cloud sync, search, tags, and sharing immediately. But the better approach is to first make scanning feel good.

Fourth, think about privacy. If you are scanning receipts, IDs, documents, or notes, users care about where that data goes. If everything stays local, say that clearly. If you upload files to a server, explain why and how.
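As a small example of designing for failure, the document camera reports whether the current device supports scanning through VNDocumentCameraViewController.isSupported. Here is a sketch of a scan button that disables itself when that is false. ScanButton and its action parameter are hypothetical names, not part of the sample project.

```swift
import SwiftUI
import VisionKit

/// A scan button that disables itself where document scanning is unsupported.
/// `ScanButton` and `action` are hypothetical names used for illustration.
struct ScanButton: View {
    var action: () -> Void

    var body: some View {
        Button(action: action) {
            Label("Scan", systemImage: "camera.viewfinder")
        }
        // isSupported reports whether the current device supports document scanning.
        .disabled(!VNDocumentCameraViewController.isSupported)
    }
}
```

It could replace the toolbar button in ContentView, with the action setting isShowingScanner to true.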

Where This Can Go Next

This little demo can become much more useful.

The next features I would add are:

Save scanned documents locally
Export scanned pages as PDF
Add OCR for all pages
Add search across recognized text
Add folders
Add rename and delete
Add share sheet support
Add basic image cleanup filters
Add iCloud sync later if needed
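To give a feel for one of these, here is a minimal sketch of PDF export using UIKit's UIGraphicsPDFRenderer. The function name makePDFData is hypothetical, and I am assuming each PDF page should simply match the size of its scanned image.

```swift
import UIKit

/// Renders each scanned page onto its own PDF page, sized to the image.
/// `makePDFData` is a hypothetical helper name, not a VisionKit API.
func makePDFData(from pages: [UIImage]) -> Data {
    // Default bounds (US Letter in points); each page overrides them below.
    let defaultBounds = CGRect(x: 0, y: 0, width: 612, height: 792)
    let renderer = UIGraphicsPDFRenderer(bounds: defaultBounds)
    return renderer.pdfData { context in
        for page in pages {
            let pageBounds = CGRect(origin: .zero, size: page.size)
            context.beginPage(withBounds: pageBounds, pageInfo: [:])
            page.draw(in: pageBounds)
        }
    }
}
```

The returned Data can be written to disk with write(to:) or handed to a share sheet. A real app would likely want to normalize page sizes and guard against an empty page list.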

This is how I usually like learning a new framework.

Not by reading every API first.

Not by trying to build a perfect app immediately.

But by building one small useful flow, then improving it piece by piece.

VisionKit is a good framework for that style of learning because you can get something working quickly, but there is still enough depth to build a real app from it.

I have also uploaded the complete sample project to GitHub here:
https://github.com/sanjaynela/visionkit-scanner-ios

Feel free to clone it, try it on a real iPhone, and modify it for your own scanner-style app. If you build something interesting with VisionKit, I would love to hear about it.

Final Thoughts

Trying VisionKit reminded me why I enjoy iOS development.

Apple gives us these powerful native frameworks, but the real learning happens when we connect them to actual app ideas.

A document scanner sounds simple at first. But once you build it, you start thinking about OCR, PDFs, local storage, search, folders, privacy, and user experience.

That is when a small demo becomes a real product idea.

If you already understand SwiftUI state and MVVM, VisionKit is a nice next framework to explore because it gives you something practical to build. You can use SwiftUI for the app structure, MVVM for organization, VisionKit for scanning, and Vision for text recognition.

That combination is powerful.

And honestly, this is the kind of framework I wish I had tried earlier.

Sometimes the best way to learn a new Apple framework is not to wait until you need it at work. It is to build a small personal app, break a few things, fix them, and slowly understand where the framework fits.

That is exactly what I wanted to do with VisionKit.

And now I can already see a few app ideas coming from it.

I Finally Tried Apple VisionKit: Building a Simple Document Scanner in SwiftUI was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.
