Making Our School News Scraper 90% Faster and Simpler

Overview

In December 2025, I fixed a big problem with my project: the SSU Announcement Scraper. This tool checks school news and sends alerts to students. For a long time, the old version was slow, expensive, and broke very easily.

To solve this, I performed a major refactoring task. I deleted 700 lines of old code and made the system much faster and more reliable.
Here is a detailed look at how I transformed a fragile prototype into a strong service.

Problems

The old version used a “Headless Chrome” browser inside AWS Lambda to visit the school portal.
While this worked at first, it was a bad way to build a long-term service because…

1. It was too heavy

Running a whole browser just to read a few lines of text uses a massive amount of memory. It is like using a big heavy truck to deliver a single small letter. Every time the scraper ran, the system had to start the engine (Chrome), which wasted resources.

2. It was slow and expensive

Because it had to load images, scripts, and styles, it took 30 to 60 seconds to finish. In the cloud world (AWS), time is money.
Long execution times meant higher monthly bills.

3. It broke easily (Maintenance problem)

If the school changed just one small thing on their website—like a button color or a table name—the scraper would stop working.
I had to spend hours every week fixing the code manually just to keep it running.

4. No tests at all

I didn’t have any automated tests to check if the code was working.
This meant I only found out about bugs after the service stopped sending news to students.


Solutions

The biggest change was stopping the “web scraping” entirely.
Instead of looking at the website like a human, I found a way to talk to a server directly using an API called SSUFID.

1. Cleaning and Organizing the Code

I deleted the messy old code and wrote a new, unified system.

  • Old code: 700+ lines of messy logic and complicated HTML rules.
  • New code: Only 200 lines of clean, simple code.

Result: A 72% reduction in code! It is now much easier for me (or anyone else) to read and understand how the system works.

2. Using AWS SSM for Smart Settings

In the past, if I wanted to add a new school department to the scraper, I had to rewrite the code and upload it again. Now, I use AWS Systems Manager (SSM). I can just change a setting in the AWS dashboard, and the scraper updates itself automatically. I don’t need to touch the code at all.

3. Technical Details

Furthermore, I used three main engineering modifications to make the service efficiently

Go errgroup

Checking websites one by one is very boring and slow. If you have 10 websites and each takes 3 seconds, that is 30 seconds total. I used Go’s errgroup to check all APIs at the same time (in parallel).

// Checking many websites at once
g, gCtx := errgroup.WithContext(ctx)
for _, url := range cfg.Urls {
    g.Go(func() error {
        // Fetch news from the API concurrently
        apiResponse, err := ssufid_request.SSUFIDRequest(gCtx, url)
        return nil
    })
}
if err := g.Wait(); err != nil {
    log.Printf("Something went wrong during the task: %v", err)
}

Because of this “parallel” work, the entire task now takes only approximately 3 seconds total (5 API requests), no matter how many APIs we add.

2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site1
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site2
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site3
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site4
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site5
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:50 [DEBUG] Successfully deserialized API response body

Fast search with DynamoDB GSI

We never want to send the same news alert to students twice.
To prevent this, I added a “Fast Index” called a GSI (Global Secondary Index) to my database.

Instead of looking through thousands of old news items one by one (which is very slow and expensive), the code uses the index to quickly ask: “Give me the 100 newest items for this department.” This “indexed search” is incredibly fast and keeps our database costs very low.

// Fast search using an Index instead of a slow scan
input := &dynamodb.QueryInput{
    TableName:              &cfg.DatabaseName,
    IndexName:              aws.String("TypeSubTypeCreatedAtIndex"),
    KeyConditionExpression: expr.KeyCondition(),
    ScanIndexForward:       aws.Bool(false), // Get newest items first
    Limit:                  aws.Int32(100),
}

Reference: https://docs.aws.amazon.com/ko_kr/code-library/latest/ug/go_2_dynamodb_code_examples.html

Testing: Unit and Integration Tests

To make sure the code never breaks, I implemented two levels of testing

1. Unit tests

I created “fake” versions of the database.
This lets me check the logic of my code on my laptop without needing the internet.

//go:build unit
// MockDynamoDBClient is a mock implementation of config.DynamoDBClient interface
type MockDynamoDBClient struct {
	BatchWriteItemFunc func(ctx context.Context, params *dynamodb.BatchWriteItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchWriteItemOutput, error)
	BatchGetItemFunc   func(ctx context.Context, params *dynamodb.BatchGetItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchGetItemOutput, error)
	QueryFunc          func(ctx context.Context, params *dynamodb.QueryInput, optFns ...func(*dynamodb.Options)) (*dynamodb.QueryOutput, error)
}

// BatchWriteItem implements the DynamoDBClient interface
func (m *MockDynamoDBClient) BatchWriteItem(ctx context.Context, params *dynamodb.BatchWriteItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchWriteItemOutput, error) {
	// Mock Logic
}

// BatchGetItem implements the DynamoDBClient interface
func (m *MockDynamoDBClient) BatchGetItem(ctx context.Context, params *dynamodb.BatchGetItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchGetItemOutput, error) {
	// Mock Logic
}

// Query implements the DynamoDBClient interface
func (m *MockDynamoDBClient) Query(ctx context.Context, params *dynamodb.QueryInput, optFns ...func(*dynamodb.Options)) (*dynamodb.QueryOutput, error) {
	// Mock Logic
}

func TestBatchSaveAnnouncements(t *testing.T) {
	t.Parallel()
	testCases := []struct {
		name                string
		mockDynamoClient    *MockDynamoDBClient
		announcements       []dto.ScrapeItem
		announcementSubType string
		expectedError       bool
		validator           func(*testing.T, error)
	}{
		// Test case 1
		{
			name: "Successfully add announcement batch into database",
			mockDynamoClient: &MockDynamoDBClient{
				BatchWriteItemFunc: func(ctx context.Context, params *dynamodb.BatchWriteItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchWriteItemOutput, error) {
					return &dynamodb.BatchWriteItemOutput{}, nil
				},
			},
			announcements: createMockAnnouncements(2),
			expectedError: false,
			validator: func(t *testing.T, err error) {
				if err != nil {
					t.Errorf("Unexpected error: %v", err)
				}
			},
		},

		// Test case 2
		{
			...
		},

		// Test case 3
		{
			...
		},

		// Test case 4
		{
			...
		},
	}

	for _, testCase := range testCases {
		t.Run(testCase.name, func(t *testing.T) {
			// Run and validate test cases
		})
	}
}

// createMockAnnouncements is a helper function that generates mock announcements
func createMockAnnouncements(count int) []dto.ScrapeItem {
	items := make([]dto.ScrapeItem, count)
	...
	return items
}

2. Integration Tests

I used LocalStack and Testcontainers to run a “Fake AWS” inside a Docker container. This allows me to test if the code can actually talk to AWS SSM and DynamoDB correctly. For example, my test starts a real container, sets up fake parameters, and checks if the code can read them

func TestLoadConfig_Integration(t *testing.T) {
    ctx := context.Background()
    // Start a real LocalStack container to act like AWS
    container, err := utils.GetTestContainer(ctx)
    if err != nil {
       t.Fatalf("Failed to get test container: %v", err)
    }
    defer container.Terminate(ctx)

    // ... set up fake SSM parameters and check if our code loads them correctly
    appConfig := LoadConfig(ctx)

    if appConfig.DatabaseName != "test-database" {
       t.Errorf("Expected 'test-database', but got '%s'", appConfig.DatabaseName)
    }
}

Results

MetricOld Way (Scraping)New Way (API)Improvement
Time to Finish44 seconds3 seconds93% Faster
Lines of Code735 lines208 lines71% Simpler
Test CoverageNo TestsImplemented unit/integration tests51.8%

Scraper execution time

Original

image

New

image

Test coverage

It still needs more test cases to improve the current test coverage!

image

← all posts