I have been building and maintaining a data lake in AWS for the past year or so, and it has been a learning experience, to say the least. Recently I had an issue where an AWS Glue crawler stopped updating a table in the catalog that represented raw syslog data being imported into the lake.

The error being shown was:

INFO : Multiple tables are found under location [S3 bucket and path]. Table [table name] is skipped.
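
If you're hunting for where that message shows up, the crawler writes its run logs to CloudWatch, normally under the /aws-glue/crawlers log group. Here's a rough sketch of searching for it from the CLI; adjust the log group if your crawlers log somewhere else:

# Search the Glue crawler logs for the skip message
aws logs filter-log-events \
    --log-group-name /aws-glue/crawlers \
    --filter-pattern '"Multiple tables are found"'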

So the table itself stayed in the catalog, but any new data being added wasn't making it into the table, and neither were the partitions I was using. I was pretty confused about what was going on. Eventually I noticed that the crawler had started failing around a day when I knew other users had been doing some work in the account.

When I looked in the S3 bucket, I found that someone had left a different file there. I believe it was happenstance: they synced over a file set that included an extra log file at the base level and didn't notice it.

Anyhow, that was enough to confuse the crawler. I had it configured to treat everything under that path as a single table, so it didn't like finding a file that wasn't in the same syslog format as everything else.
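
That "single table" behavior comes from the crawler's grouping configuration. If you want to double-check how a crawler is set up, here's a rough sketch using the CLI; the crawler name my-syslog-crawler is just a placeholder, and the option that I believe maps to "create a single schema for each S3 path" is TableGroupingPolicy set to CombineCompatibleSchemas:

# Print the crawler's Configuration JSON string and look for the grouping policy
aws glue get-crawler --name my-syslog-crawler --query 'Crawler.Configuration' --output text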

Solution: Delete that file. Run the crawler again.
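
For the record, the whole fix in CLI terms looked something like the following; the file and crawler names are placeholders:

# Remove the oddball file, then kick off the crawler again
aws s3 rm s3://bucketname/path/oddball.log
aws glue start-crawler --name my-syslog-crawler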

After thinking on it a bit, I remembered that the AWS CLI for S3 is a quick way to look for files like this. If you aren't familiar with the command line interface (CLI) for AWS, it's worth Googling a quick intro.

I used the S3 ls command to recursively list everything under the folder and piped the results to the sort command; since each line of the listing starts with the last-modified timestamp, that puts the output in modified-date order. Then I scrolled through the list to around the time the failures started and found the newly added file that was the oddball.

aws s3 ls s3://bucketname/path/ --recursive | sort
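
If the listing is long, you can also pipe it through grep to jump to the approximate date the failures started; the date prefix here is just an example:

aws s3 ls s3://bucketname/path/ --recursive | sort | grep "2019-03"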