
Making Our School News Scraper 90% Faster and Simpler

Overview

In December 2025, I fixed a big problem with my project: the SSU Announcement Scraper. This tool checks school news and sends alerts to students. For a long time, the old version was slow, expensive, and broke very easily.

To solve this, I performed a major refactoring. I deleted more than 700 lines of old code and made the system much faster and more reliable.
Here is a detailed look at how I transformed a fragile prototype into a robust service.

Problems

The old version used a headless Chrome browser inside AWS Lambda to visit the school portal.
While this worked at first, it was a poor foundation for a long-term service, for several reasons.

1. It was too heavy

Running a whole browser just to read a few lines of text uses a massive amount of memory. It is like using a big heavy truck to deliver a single small letter. Every time the scraper ran, the system had to start the engine (Chrome), which wasted resources.

2. It was slow and expensive

Because it had to load images, scripts, and styles, it took 30 to 60 seconds to finish. In the cloud world (AWS), time is money.
Long execution times meant higher monthly bills.

3. It broke easily (Maintenance problem)

If the school changed just one small thing on their website—like a button color or a table name—the scraper would stop working.
I had to spend hours every week fixing the code manually just to keep it running.

4. No tests at all

I didn’t have any automated tests to check if the code was working.
This meant I only found out about bugs after the service stopped sending news to students.


Solutions

The biggest change was dropping web scraping entirely.
Instead of loading the website the way a human would, the new version talks to the server directly through an API called SSUFID.
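
To give a sense of what "talking to the server directly" looks like, here is a minimal sketch in Go: one plain HTTP request and a JSON decode replace an entire headless-browser session. The endpoint URL and the response fields below are made-up placeholders, not the real SSUFID schema.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

// Announcement is a hypothetical shape for one news item; the real SSUFID
// response almost certainly uses different fields.
type Announcement struct {
    Title string `json:"title"`
    URL   string `json:"url"`
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}

    // One plain HTTP request instead of starting a whole browser.
    resp, err := client.Get("https://example.com/ssufid/announcements.json") // placeholder URL
    if err != nil {
        log.Fatalf("request failed: %v", err)
    }
    defer resp.Body.Close()

    var items []Announcement
    if err := json.NewDecoder(resp.Body).Decode(&items); err != nil {
        log.Fatalf("failed to decode response: %v", err)
    }

    for _, item := range items {
        fmt.Println(item.Title, item.URL)
    }
}

No Chrome, no image loading, no brittle HTML selectors; the data arrives as structured JSON that the code can read directly.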

1. Cleaning and Organizing the Code

I deleted the messy old code and wrote a new, unified system.

  • Old code: 700+ lines of messy logic and complicated HTML rules.
  • New code: Only 200 lines of clean, simple code.

Result: A 72% reduction in code! It is now much easier for me (or anyone else) to read and understand how the system works.

2. Using AWS SSM for Smart Settings

In the past, if I wanted to add a new school department to the scraper, I had to rewrite the code and upload it again. Now, I use AWS Systems Manager (SSM). I can just change a setting in the AWS dashboard, and the scraper updates itself automatically. I don’t need to touch the code at all.
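
As an illustration, the sketch below shows roughly how a Go program can read such a setting from SSM Parameter Store with aws-sdk-go-v2 at startup. The parameter name "/scraper/urls" and the comma-separated format are hypothetical examples, not the project's real configuration keys.

package main

import (
    "context"
    "fmt"
    "log"
    "strings"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/ssm"
)

func main() {
    ctx := context.Background()

    // Load AWS credentials and region from the environment.
    awsCfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatalf("failed to load AWS config: %v", err)
    }

    client := ssm.NewFromConfig(awsCfg)

    // Fetch the parameter; changing it in the AWS console changes the
    // scraper's targets without redeploying any code.
    out, err := client.GetParameter(ctx, &ssm.GetParameterInput{
        Name:           aws.String("/scraper/urls"), // hypothetical parameter name
        WithDecryption: aws.Bool(false),
    })
    if err != nil {
        log.Fatalf("failed to read parameter: %v", err)
    }

    urls := strings.Split(aws.ToString(out.Parameter.Value), ",")
    fmt.Println("target URLs:", urls)
}

Because the list of departments lives in Parameter Store, adding a new one is a dashboard change instead of a code deployment.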

3. Technical Details

On top of that, I made three main engineering changes to keep the service efficient.

Parallel requests with Go errgroup

Checking sources one by one is slow: with 10 of them at 3 seconds each, that is 30 seconds total. I used Go's errgroup to call all the APIs at the same time (in parallel).

// Checking many websites at once
g, gCtx := errgroup.WithContext(ctx)
for _, url := range cfg.Urls {
    g.Go(func() error {
        // Fetch news from the API concurrently
        apiResponse, err := ssufid_request.SSUFIDRequest(gCtx, url)
        if err != nil {
            return err
        }
        // ... process apiResponse here ...
        _ = apiResponse
        return nil
    })
}
if err := g.Wait(); err != nil {
    log.Printf("Something went wrong during the task: %v", err)
}

Because the requests run in parallel, the entire task now takes roughly 3 seconds in total (for 5 API requests), and adding more APIs barely changes that.

2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site1
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site2
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site3
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site4
2025/12/30 12:53:48 [DEBUG] Fetch announcement data from: https://site5
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:48 [DEBUG] Deserialize API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:49 [DEBUG] Successfully deserialized API response body
2025/12/30 12:53:50 [DEBUG] Successfully deserialized API response body

Fast search with DynamoDB GSI

We never want to send the same news alert to students twice.
To prevent this, I added a “Fast Index” called a GSI (Global Secondary Index) to my database.

Instead of looking through thousands of old news items one by one (which is very slow and expensive), the code uses the index to quickly ask: “Give me the 100 newest items for this department.” This “indexed search” is incredibly fast and keeps our database costs very low.

// Fast search using an Index instead of a slow scan
input := &dynamodb.QueryInput{
    TableName:              &cfg.DatabaseName,
    IndexName:              aws.String("TypeSubTypeCreatedAtIndex"),
    KeyConditionExpression: expr.KeyCondition(),
    ScanIndexForward:       aws.Bool(false), // Get newest items first
    Limit:                  aws.Int32(100),
}

Reference: https://docs.aws.amazon.com/ko_kr/code-library/latest/ug/go_2_dynamodb_code_examples.html
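
The snippet above uses expr without showing where it comes from. As a rough sketch (the attribute name "SubType" and the helper buildQueryInput are assumptions, not the real schema), the key condition could be built with the aws-sdk-go-v2 expression package like this:

package repository

import (
    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/feature/dynamodb/expression"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
)

// buildQueryInput is a hypothetical helper that builds the indexed query shown above.
func buildQueryInput(tableName, subType string) (*dynamodb.QueryInput, error) {
    // Key condition: only items whose SubType (department) matches.
    keyCond := expression.Key("SubType").Equal(expression.Value(subType))

    expr, err := expression.NewBuilder().WithKeyCondition(keyCond).Build()
    if err != nil {
        return nil, err
    }

    return &dynamodb.QueryInput{
        TableName:                 aws.String(tableName),
        IndexName:                 aws.String("TypeSubTypeCreatedAtIndex"),
        KeyConditionExpression:    expr.KeyCondition(),
        ExpressionAttributeNames:  expr.Names(),
        ExpressionAttributeValues: expr.Values(),
        ScanIndexForward:          aws.Bool(false), // newest first
        Limit:                     aws.Int32(100),
    }, nil
}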

Testing: Unit and Integration Tests

To keep the code from breaking silently again, I implemented two levels of testing:

1. Unit tests

I created “fake” versions of the database.
This lets me check the logic of my code on my laptop without needing the internet.

//go:build unit

// MockDynamoDBClient is a mock implementation of config.DynamoDBClient interface
type MockDynamoDBClient struct {
    BatchWriteItemFunc func(ctx context.Context, params *dynamodb.BatchWriteItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchWriteItemOutput, error)
    BatchGetItemFunc   func(ctx context.Context, params *dynamodb.BatchGetItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchGetItemOutput, error)
    QueryFunc          func(ctx context.Context, params *dynamodb.QueryInput, optFns ...func(*dynamodb.Options)) (*dynamodb.QueryOutput, error)
}

// BatchWriteItem implements the DynamoDBClient interface
func (m *MockDynamoDBClient) BatchWriteItem(ctx context.Context, params *dynamodb.BatchWriteItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchWriteItemOutput, error) {
    // Mock Logic
}

// BatchGetItem implements the DynamoDBClient interface
func (m *MockDynamoDBClient) BatchGetItem(ctx context.Context, params *dynamodb.BatchGetItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchGetItemOutput, error) {
    // Mock Logic
}

// Query implements the DynamoDBClient interface
func (m *MockDynamoDBClient) Query(ctx context.Context, params *dynamodb.QueryInput, optFns ...func(*dynamodb.Options)) (*dynamodb.QueryOutput, error) {
    // Mock Logic
}

func TestBatchSaveAnnouncements(t *testing.T) {
    t.Parallel()

    testCases := []struct {
        name                string
        mockDynamoClient    *MockDynamoDBClient
        announcements       []dto.ScrapeItem
        announcementSubType string
        expectedError       bool
        validator           func(*testing.T, error)
    }{
        // Test case 1
        {
            name: "Successfully add announcement batch into database",
            mockDynamoClient: &MockDynamoDBClient{
                BatchWriteItemFunc: func(ctx context.Context, params *dynamodb.BatchWriteItemInput, optFns ...func(*dynamodb.Options)) (*dynamodb.BatchWriteItemOutput, error) {
                    return &dynamodb.BatchWriteItemOutput{}, nil
                },
            },
            announcements: createMockAnnouncements(2),
            expectedError: false,
            validator: func(t *testing.T, err error) {
                if err != nil {
                    t.Errorf("Unexpected error: %v", err)
                }
            },
        },
        // Test case 2
        { ... },
        // Test case 3
        { ... },
        // Test case 4
        { ... },
    }

    for _, testCase := range testCases {
        t.Run(testCase.name, func(t *testing.T) {
            // Run and validate test cases
        })
    }
}

// createMockAnnouncements is a helper function that generates mock announcements
func createMockAnnouncements(count int) []dto.ScrapeItem {
    items := make([]dto.ScrapeItem, count)
    ...
    return items
}

2. Integration Tests

I used LocalStack and Testcontainers to run a "fake AWS" inside a Docker container. This lets me test whether the code can actually talk to AWS SSM and DynamoDB correctly. For example, my test starts a real container, sets up fake parameters, and checks whether the code can read them:

func TestLoadConfig_Integration(t *testing.T) {
    ctx := context.Background()

    // Start a real LocalStack container to act like AWS
    container, err := utils.GetTestContainer(ctx)
    if err != nil {
        t.Fatalf("Failed to get test container: %v", err)
    }
    defer container.Terminate(ctx)

    // ... set up fake SSM parameters and check if our code loads them correctly
    appConfig := LoadConfig(ctx)
    if appConfig.DatabaseName != "test-database" {
        t.Errorf("Expected 'test-database', but got '%s'", appConfig.DatabaseName)
    }
}
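
For context, here is a hedged sketch of what a helper like utils.GetTestContainer might look like using Testcontainers' generic container API. The image tag, the SERVICES list, and the wait strategy are assumptions; the project's real helper may differ.

package utils

import (
    "context"
    "fmt"

    "github.com/testcontainers/testcontainers-go"
    "github.com/testcontainers/testcontainers-go/wait"
)

// GetTestContainer starts a LocalStack container and waits until its health
// endpoint responds, so tests can use it as a stand-in for AWS.
func GetTestContainer(ctx context.Context) (testcontainers.Container, error) {
    req := testcontainers.ContainerRequest{
        Image:        "localstack/localstack:3.0",           // assumed image tag
        ExposedPorts: []string{"4566/tcp"},                   // LocalStack edge port
        Env:          map[string]string{"SERVICES": "ssm,dynamodb"},
        WaitingFor:   wait.ForHTTP("/_localstack/health").WithPort("4566/tcp"),
    }

    container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
        ContainerRequest: req,
        Started:          true,
    })
    if err != nil {
        return nil, fmt.Errorf("failed to start LocalStack: %w", err)
    }
    return container, nil
}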

Results

Metric         | Old Way (Scraping) | New Way (API)                | Improvement
Time to Finish | 44 seconds         | 3 seconds                    | 93% faster
Lines of Code  | 735 lines          | 208 lines                    | 71% simpler
Test Coverage  | No tests           | Unit and integration tests   | 51.8% coverage

Scraper execution time

Original

[Screenshot: scraper execution time before the refactor]

New

[Screenshot: scraper execution time after the refactor]

Test coverage

It still needs more test cases to improve the current test coverage!

[Screenshot: test coverage report]

This post is licensed under CC BY 4.0 by the author.
